I really don’t want to know the encoding. I only want the data. In other words, I don’t want to think. I don’t want to open Notepad++ and convert between encodings.
My old standby doesn’t work on files whose encoding isn’t ANSI (ASCII, cp1252, whatever):
    f = open("poo.txt", "r")
    lines = f.readlines()
    f.close()
    for line in lines:
        dosomething(line)
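For what it's worth, here is a minimal sketch of why that breaks. In Python 3, text mode decodes the file with the platform's default encoding unless told otherwise, so bytes written in another encoding either raise an error or come back as garbage. The sample string is my own assumption, chosen only because it has one non-ASCII character.

```python
text = "café\n"
utf16_bytes = text.encode("utf-16")  # BOM, then two bytes per character here

# A strict codec such as UTF-8 refuses these bytes outright.
try:
    utf16_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# A permissive single-byte codec never fails; it just returns mojibake.
garbled = utf16_bytes.decode("latin-1")

print(utf8_failed)      # True
print(garbled == text)  # False
```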
I have had enough. (I am also venturing into Python 3 as I have been on Python 2 forever but that is a different story.)
The following code reads a file that may be in any of several encodings and splits its contents into lines:
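    def DecodeBytes(byteArray, codecs=('utf-8', 'utf-16')):
        # Try each codec in order and return the first successful decode.
        for codec in codecs:
            try:
                return byteArray.decode(codec)
            except UnicodeDecodeError:
                pass
        return None  # none of the codecs could decode the bytes

    def ReadLinesFromFile(filename):
        # Read raw bytes so no encoding is assumed up front.
        with open(filename, "rb") as file:
            rawbytes = file.read()
        content = DecodeBytes(rawbytes)
        if content is None:
            raise ValueError("Could not decode " + filename)
        # splitlines() handles \n, \r\n and \r, whatever the platform.
        return content.splitlines()

    lines = ReadLinesFromFile("poo.txt")
    for line in lines:
        dosomething(line)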
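If you need to add encodings, simply add them to the default value of the codecs parameter (or make it more elegant as you see fit). Order matters: the first codec that decodes without an error wins, so put the strictest codecs first.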
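One way the "more elegant" version might look is sketched below. Treat it as a rough sketch, not a definitive implementation: the name `decode_bytes`, the `fallbacks` parameter, and the particular codec order are all my own assumptions. Two things are worth knowing. First, the `utf-16` codec will happily "succeed" on many even-length byte strings that are not UTF-16 at all, so this sketch only tries it when a byte-order mark is present. Second, `latin-1` accepts any byte sequence, so it has to go last or nothing after it is ever tried. `utf-8-sig` is plain UTF-8 that also drops a leading byte-order mark.

```python
import codecs

# Byte-order marks that signal UTF-16 (little- and big-endian).
UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def decode_bytes(raw, fallbacks=("utf-8-sig", "cp1252", "latin-1")):
    """Return raw decoded with the first codec that works, or None."""
    candidates = list(fallbacks)
    # Only try UTF-16 when the data starts with a BOM (an assumption of
    # this sketch: BOM-less UTF-16 files will not be detected).
    if raw.startswith(UTF16_BOMS):
        candidates.insert(0, "utf-16")
    for codec in candidates:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    return None  # only reachable if the caller removes latin-1


for encoding in ("utf-8-sig", "utf-16", "cp1252"):
    raw = "café\n".encode(encoding)
    print(encoding, repr(decode_bytes(raw)))
```

All three sample byte strings should come back as the same text. What this cannot do is tell one single-byte encoding from another; for that you still have to know (or guess) where the file came from.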