Remove BOM from string read from UTF-8 file
Published: Wed, 26 Dec 2007 23:19:00 GMT; 4 comments
Hi,
I read some text from a utf-8 encoded text file like this:
text = codecs.open('example.txt','r','utf8').read()
If I pass this text to a COM object, I can see that there is still the BOM
in the file, which marks the file as utf-8. Simply removing the first
character in the string is not ok, because the BOM is optional. So I tried
something like this:
if text.startswith(codecs.BOM_UTF8):
    print "found BOM"
but then I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
ordinal not in range(128)
What's the right way to remove the BOM from the string?
regards,
Achim
http://python.itags.org/q_python_77170.html

- >>>>> "Achim Domma" <domma.python.itags.org.procoders.net> (AD) wrote:
AD> Hi,
AD> I read some text from a utf-8 encoded text file like this:
AD> text = codecs.open('example.txt','r','utf8').read()
AD> If I pass this text to a COM object, I can see that there is still the BOM
AD> in the file, which marks the file as utf-8. Simply removing the first
AD> character in the string is not ok, because the BOM is optional. So I tried
AD> something like this:
The BOM is in the file, but not in the string 'text'.
text is a Unicode string, which consists of Unicode characters, and the BOM
is not a Unicode character.
Check text[0] and len(text) to verify.
Moreover, BOM_UTF8 is a (non-ASCII) byte string, not a Unicode string; that
is the reason for the complaint.
--
Piet van Oostrum <piet.python.itags.org.cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum.python.itags.org.hccnet.nl
#1; Wed, 26 Dec 2007 23:20:00 GMT

- "Piet van Oostrum" <piet.python.itags.org.cs.uu.nl> wrote in message
news:wzoerkinig.fsf.python.itags.org.Ordesa.local...
> Check text[0] and len(text) to verify.
That's what I did. The file contains 24 Chinese characters and len(text) is
25. And 0xef is the hex code for the BOM if I'm not completely wrong.
Achim
#2; Wed, 26 Dec 2007 23:21:00 GMT

- I often found myself needing to read text files that might be UTF-8, Unicode
or ANSI, without knowing beforehand which, so I wrote a single function to
do it. I don't know if this is the correct way to handle this situation,
but I couldn't find any function that would simply open a file with the
appropriate codec automatically, so I use this (it doesn't handle all cases,
just the ones I've needed so far):
import os, codecs
#------------------------
# OpenTextFile()
#
# Opens a file correctly whether it is unicode or ansi. If the file
# doesn't exist or has no BOM, the encoding argument is used unchanged.
#
# Python documentation of the codecs module is pretty weak; for instance
# there are all these:
#   BOM
#   BOM_BE
#   BOM_LE
#   BOM_UTF8
#   BOM_UTF16
#   BOM_UTF16_BE
#   BOM_UTF16_LE
#   BOM_UTF32
#   BOM_UTF32_BE
#   BOM_UTF32_LE
# but no explanation of how they map to the encodings like 'utf-16'. Some
# can be inferred, but some are not so clear.
#------------------------
def OpenTextFile(filename, mode='r', encoding=None):
    if os.path.isfile(filename):
        f = open(filename, 'rb')
        header = f.read(4)  # Read just the first four bytes.
        f.close()
        # Don't change this to a map, because it is ordered!!!
        encodings = [(codecs.BOM_UTF32, 'utf-32'),
                     (codecs.BOM_UTF16, 'utf-16'),
                     (codecs.BOM_UTF8, 'utf-8')]
        for h, e in encodings:
            # A BOM only counts when it is at the very start of the file.
            if header.startswith(h):
                encoding = e
                break
    return codecs.open(filename, mode, encoding)
#3; Wed, 26 Dec 2007 23:22:00 GMT
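
On the question of how the BOM constants map to codec names: codecs.BOM is an alias for codecs.BOM_UTF16, and codecs.BOM_UTF16 and codecs.BOM_UTF32 are themselves aliases for whichever of the _LE/_BE variants matches the machine's native byte order. A lookup built only on the unsuffixed names therefore misses files written in the other byte order. The following is a rough sketch of a variant that tests both byte orders explicitly. It is written in Python 3 syntax, which is an assumption (the thread itself uses Python 2), and the names sniff_encoding and read_text are hypothetical helpers, not part of the codecs module. It also maps the UTF-8 BOM to 'utf-8-sig' rather than 'utf-8', so the decoded text does not start with U+FEFF.

```python
import codecs

# Ordered: the UTF-32 LE BOM begins with the same two bytes as the
# UTF-16 LE BOM, so the UTF-32 entries have to be tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(header, default=None):
    """Return a codec name for the BOM at the start of `header` (bytes).

    Hypothetical helper: returns `default` when no known BOM is found.
    """
    for bom, name in _BOMS:
        if header.startswith(bom):
            return name
    return default


def read_text(filename, default="utf-8"):
    """Read a whole file, choosing the codec from its BOM when there is one.

    Hypothetical helper: `default` is an assumption for BOM-less files.
    """
    with open(filename, "rb") as f:
        data = f.read()
    return data.decode(sniff_encoding(data[:4], default))
```

The generic 'utf-16' and 'utf-32' codecs read the BOM themselves to pick the byte order and drop it from the result, which is why both byte orders can map to the same codec name here. A file without any BOM is still a guess; the default argument only makes that guess explicit.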

- >>>>> "Achim Domma" <domma.python.itags.org.procoders.net> (AD) wrote:
AD> "Piet van Oostrum" <piet.python.itags.org.cs.uu.nl> wrote in message
AD> news:wzoerkinig.fsf.python.itags.org.Ordesa.local...
>> Check text[0] and len(text) to verify.
AD> That's what I did. The file contains 24 Chinese characters and len(text) is
AD> 25. And 0xef is the hex code for the BOM if I'm not completely wrong.
Sorry, I was wrong.
You have to check for text.startswith(u'\ufeff')
--
Piet van Oostrum <piet.python.itags.org.cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum.python.itags.org.hccnet.nl
#4; Wed, 26 Dec 2007 23:23:00 GMT
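
To tie the thread together: codecs.BOM_UTF8 is the three-byte encoded form (0xEF 0xBB 0xBF), while the decoded string holds the single character U+FEFF, which is why the extra character shows up in len(text) and why the comparison has to be made against u'\ufeff'. Below is a minimal sketch of both ways to get rid of it. It assumes Python 3 syntax (in the Python 2 of this thread the literals would be written u'...'), strip_bom is a hypothetical helper name, and the sample bytes stand in for the contents of example.txt. The 'utf-8-sig' codec has been available since Python 2.5.

```python
import codecs

BOM_CHAR = "\ufeff"  # what codecs.BOM_UTF8 turns into when decoded as 'utf-8'


def strip_bom(text):
    """Drop one leading U+FEFF if present; otherwise return text unchanged."""
    return text[1:] if text.startswith(BOM_CHAR) else text


# Stand-in for the raw bytes of a UTF-8 file that starts with a BOM.
raw = codecs.BOM_UTF8 + "24 characters".encode("utf-8")

decoded = raw.decode("utf-8")      # the BOM survives as the first character
cleaned = strip_bom(decoded)       # explicit check against U+FEFF
via_sig = raw.decode("utf-8-sig")  # the codec removes the BOM while decoding
```

Because strip_bom only acts when the string really starts with U+FEFF, it is safe for files where the BOM is absent. Passing 'utf-8-sig' as the encoding when opening the file gives the same result without a separate step.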