I do believe I have been bitten by Python

Lately I have found myself writing short texts in Swedish, destined to end up on a friend’s computer. A Windows-using friend, with all the UTF-8 / ISO-8859-1 hassles this entails. For the first file, I simply copied it onto a memory stick, rebooted into the Windows partition, and search/replaced all the offending characters (å, ä, ö and the odd é). Then I rebooted again (since I don’t have my email set up in Windows) and fired off the mail.

I simply figured that this file would be kind of a one-shot deal and nothing more. About two weeks later, I wrote a second file, and redid the entire reboot procedure. I found myself writing a third file yesterday… I can’t for the life of me remember the saying, or where I read it, but it was something along the lines of “if you do the same thing more than twice, automate the shit out of it.”

An audience with the great oracle led me to this blog post, and after trying it out manually (which required me to reboot one more time just to verify that the converted file had in fact been converted) I was all set to write a little shell script. I got as far as writing the first lines of error handling (making sure that the script had received a filename) before I realized that I really didn’t want to write a shell script. Not when I could piece together a Python script in half that time, with better error checking. And yes, that time estimate included researching how to have Python run an external command. (subprocess.call() is what I settled on, as per advice from StackOverflow. It took me a minute or so of reading the manual to figure out how to redirect the output from that command (the full text, in ISO-8859-1 encoding) to a new file: get a file pointer to the new file, and pass it as stdout to subprocess.call().)

Something along these lines:

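import subprocess

# Send iconv's output (the text in ISO-8859-1) to a new file, closed when done
with open('myfile.iso.txt', 'wb') as fp:
    args = ['iconv', '--from-code=UTF-8', '--to-code=ISO-8859-1', 'myfile.txt']
    subprocess.call(args, stdout=fp)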
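
The snippet above has the filenames hard-coded. A rough sketch of how the whole script might look, with the "did I get a filename?" check mentioned earlier, could be something like the following. The function names (convert, output_name, main), the ".iso" naming scheme and the exit codes are my own assumptions for illustration, not the actual script; only the iconv invocation and the stdout redirection come from the post.

```python
import os
import subprocess
import sys


def output_name(src):
    """Hypothetical naming scheme: myfile.txt -> myfile.iso.txt."""
    root, ext = os.path.splitext(src)
    return root + ".iso" + ext


def convert(src, dst):
    """Run iconv on src, write its output to dst, return iconv's exit code."""
    args = ["iconv", "--from-code=UTF-8", "--to-code=ISO-8859-1", src]
    # Binary mode: the child process writes raw ISO-8859-1 bytes to the file.
    with open(dst, "wb") as fp:
        return subprocess.call(args, stdout=fp)


def main(argv):
    # Error handling: make sure the script received exactly one filename.
    if len(argv) != 2:
        sys.stderr.write("usage: %s FILE\n" % argv[0])
        return 2
    src = argv[1]
    if not os.path.isfile(src):
        sys.stderr.write("no such file: %s\n" % src)
        return 1
    return convert(src, output_name(src))

# To use this as a script, end the file with: sys.exit(main(sys.argv))
```

A nonzero return value from convert() means iconv itself complained, for example about a character that has no ISO-8859-1 equivalent, so it is worth checking before mailing the file off.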

No more silly rebooting to convert plaintext files for me :D


2 Responses to “I do believe I have been bitten by Python”

  1. Also, have a look at a tool called “enca” (detects file encoding) and a tool called “recode” (which changes file encoding).

    The latter is used by issuing something like “recode WINDOWS-1251..UTF-8 filename”.

    iconv is also a good candidate for the purpose: “iconv -f WINDOWS-1251 -t UTF-8 filename > new_filename” (which you use in your script anyway).

  2. Patrik says:

    First of all, welcome :D
    enca seemed very promising, but unfortunately didn’t support Swedish, at least not on my setup (which is a pity, this software would have been awesome). Will have to look into this a bit more before writing it off, since I would like to have that functionality.
    recode, according to the man page, does have some pretty interesting features (especially the flag --diacritics)
    All in all very solid tips, thank you :D