SAXParseException: not well-formed (invalid token)
Published: Mon, 28 Apr 2008 01:59:00 GMT; 8 comments
Dear Colleagues,
I am getting the following error with an XML page:
>   File "/home/prey/RAL-CESGA/bin/voms2users/voms2users.py", line 69, in getItems
>     d = minidom.parseString(xml.read())
>   File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 967, in parseString
>     return _doparse(pulldom.parseString, args, kwargs)
>   File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 954, in _doparse
>     toktype, rootNode = events.getEvent()
>   File "/usr/lib/python2.2/site-packages/_xmlplus/dom/pulldom.py", line 265, in getEvent
>     self.parser.feed(buf)
>   File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 208, in feed
>     self._err_handler.fatalError(exc)
>   File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
>     raise exception
> xml.sax._exceptions.SAXParseException: <unknown>:553:48: not well-formed (invalid token)
> def getItems(page):
>     opener = urllib.URLopener(key_file=HOSTKEY, cert_file=HOSTCERT)
>     try:
>         xml = opener.open(page)
>     except:
>         return []
>     d = minidom.parseString(xml.read())
>     items = d.getElementsByTagName('item')
>     data = []
>     for i in items:
>         data.append(getText(i.childNodes))
>     return data
The page is
https://lcg-voms.cern.ch:8443/voms/...mapUsers
and the line with the invalid character is (the invalid character is the
final é of Université):
<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
Louvain/CN=Roberfroid</item>
I have tried several options but I am not able to avoid this problem.
Any ideas?
I am just starting to work with Python, so I am sorry if this problem is trivial.
Thanks for your time.
Pablo Rey

- On Thu, 30 Aug 2007 13:46:47 +0200, Pablo Rey wrote:
> The page is
> https://lcg-voms.cern.ch:8443/voms/...mapUsers
> and the line with the invalid character is (the invalid character is the
> final é of Université):
The URL doesn't work for me in a browser. (Could not connect…)
Maybe you can download that XML file and use `xmllint` to check whether it is
well-formed XML?
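If `xmllint` is not at hand, a similar check can be sketched in Python itself. This is only an illustration, written for Python 3 rather than the Python 2.2 used in this thread; `check_well_formed` and the two sample documents are made-up names and data, not part of voms2users.py.

```python
from xml.parsers import expat


def check_well_formed(data: bytes):
    """Return None if data parses, else (line, column, message)."""
    parser = expat.ParserCreate()
    try:
        # True marks this chunk as the final one, so the whole document is checked.
        parser.Parse(data, True)
    except expat.ExpatError as exc:
        return exc.lineno, exc.offset, expat.ErrorString(exc.code)
    return None


# Hypothetical sample data: the second document declares UTF-8
# but contains a lone latin-1 byte (0xE9).
good = b'<?xml version="1.0" encoding="UTF-8"?><items><item>ok</item></items>'
bad = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
       b'<items><item>Universit\xe9 Catholique</item></items>')

print(check_well_formed(good))
print(check_well_formed(bad))
```

The line and column in the returned tuple play the same role as the `553:48` in the traceback: they point at the place where the parser gave up.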
Ciao,
Marc 'BlackJack' Rintsch
#1; Mon, 28 Apr 2008 02:01:00 GMT

- Pablo Rey wrote:
> I am getting the following error with an XML page:
>
>
> The page is
> https://lcg-voms.cern.ch:8443/voms/...mapUsers
> and the line with the invalid character is (the invalid character is the
> final é of Université):
> <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
> Louvain/CN=Roberfroid</item>
>
> I have tried several options but I am not able to avoid this
> problem. Any ideas?
Looks like the page is not well-formed XML (i.e. not XML at all). If it
doesn't specify an encoding (<?xml encoding="..."?>), you can try recoding the
input, possibly decoding it from latin-1 and re-encoding it as UTF-8 before
passing it to the SAX parser.
Alternatively, tell the page authors to fix their page.
Stefan
#2; Mon, 28 Apr 2008 02:02:00 GMT

- Hi Stefan,
The XML does specify an encoding (<?xml version="1.0" encoding="UTF-8" ?>).
As for your suggestion to recode the input, could you let me know how to
do it? I am sorry, I am just starting with Python and I don't know how.
Thanks for your help.
Pablo
On 30/08/2007 14:37, Stefan Behnel wrote:
> Pablo Rey wrote:
> Looks like the page is not well-formed XML (i.e. not XML at all). If it
> doesn't specify an encoding (<?xml encoding="..."?>), you can try recoding the
> input, possibly decoding it from latin-1 and re-encoding it as UTF-8 before
> passing it to the SAX parser.
> Alternatively, tell the page authors to fix their page.
> Stefan
#3; Mon, 28 Apr 2008 02:03:00 GMT

- On 30/08/2007 14:35, Marc 'BlackJack' Rintsch wrote:
> On Thu, 30 Aug 2007 13:46:47 +0200, Pablo Rey wrote:
>
> The URL doesn't work for me in a browser. (Could not connect…)
Hi Marc,
To access the page you need an X.509 certificate signed by a CA
recognised by the project.
I have stored the XML file and you can find it attached.
> Maybe you can download that XML file and use `xmllint` to check if it is
> well formed XML!?
This is the output of the xmllint command:
[prey@www3 voms2users]$ xmllint cms.xml
cms.xml:553: error: Input is not proper UTF-8, indicate encoding !
<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi
                                            ^
cms.xml:553: error: Bytes: 0xE9 0x20 0x43 0x61
<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi
Thanks for your time.
Pablo
> Ciao,
> Marc 'BlackJack' Rintsch
#4; Mon, 28 Apr 2008 02:04:00 GMT

- On Thu, 30 Aug 2007 15:31:58 +0200, Pablo Rey wrote:
> On 30/08/2007 14:35, Marc 'BlackJack' Rintsch wrote:
>
> This is the output of the xmllint command:
> [prey@www3 voms2users]$ xmllint cms.xml
> cms.xml:553: error: Input is not proper UTF-8, indicate encoding !
> <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi
>                                             ^
> cms.xml:553: error: Bytes: 0xE9 0x20 0x43 0x61
> <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi
> […]
> <?xml version="1.0" encoding="UTF-8" ?>
So the XML says it is encoded in UTF-8 but it contains at least one
character that seems to be encoded in ISO-8859-1.
Tell the authors/creators of that document their XML is broken.
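When reporting this kind of problem to the document's authors, it can help to say exactly where the bad bytes are. The following is a rough Python 3 sketch, not something from the thread; `find_invalid_utf8` and the sample bytes are invented for illustration. It relies on the `start` attribute that `UnicodeDecodeError` carries.

```python
def find_invalid_utf8(data: bytes):
    """Locate the first byte sequence that is not valid UTF-8.

    Returns (line_number, byte_offset, offending_bytes) or None
    if the whole input decodes cleanly.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Lines are counted from 1, like xmllint and the SAX parser do.
        line = data.count(b"\n", 0, exc.start) + 1
        return line, exc.start, data[exc.start:exc.start + 4]
    return None


# Hypothetical two-line sample with a latin-1 encoded e-acute on line 2.
sample = b"<item>ok</item>\n<item>Univesit\xe9 Catholique</item>\n"
print(find_invalid_utf8(sample))  # (2, 30, b'\xe9 Ca')
```

The four bytes returned for the sample are the same ones xmllint printed for the real file: 0xE9 0x20 0x43 0x61.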
Ciao,
Marc 'BlackJack' Rintsch
#5; Mon, 28 Apr 2008 02:05:00 GMT

- On Thu, 2007-08-30 at 15:20 +0200, Pablo Rey wrote:
> Hi Stefan,
> The xml has specified an encoding (<?xml version="1.0" encoding="UTF-8"
> ?> ).
It's possible that the encoding specification is incorrect. The accented e is
stored as the first byte string below in latin-1 and as the second in UTF-8:
'\xe9'
'\xc3\xa9'
If your input string contains the byte 0xe9 where your accented e is,
the file is actually latin-1 encoded. If it contains the byte sequence
0xc3,0xa9 it is UTF-8 encoded.
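The difference between the two encodings can be tried out directly. This is a small Python 3 sketch (the thread itself uses Python 2, where plain strings are byte strings); the variable names are only illustrative.

```python
e_acute = "\u00e9"  # the character é

latin1_bytes = e_acute.encode("latin-1")
utf8_bytes = e_acute.encode("utf-8")
print(latin1_bytes)  # b'\xe9'
print(utf8_bytes)    # b'\xc3\xa9'

# A lone 0xE9 byte is not valid UTF-8, which is what the parser trips over.
try:
    latin1_bytes.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False
print(lone_byte_is_valid_utf8)  # False

# The opposite mistake never raises: UTF-8 bytes read as latin-1
# silently turn into two wrong characters.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)  # Ã©
```

Because decoding as latin-1 never fails, it is worth checking which encoding the bytes are really in before transcoding anything.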
If the string is encoded in latin-1, you can transcode it to utf-8 like
this:
contents = contents.decode("latin-1").encode("utf-8")
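Put together with the parsing step, the transcoding might look roughly like the sketch below. It is written for Python 3 and rests on an assumption: that any input which fails to decode as UTF-8 is latin-1 throughout. `parse_with_latin1_fallback` and the sample bytes are hypothetical, not taken from voms2users.py.

```python
from xml.dom import minidom


def parse_with_latin1_fallback(raw: bytes):
    """Parse XML bytes, transcoding from latin-1 only if they are not valid UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is latin-1.
        raw = raw.decode("latin-1").encode("utf-8")
    return minidom.parseString(raw)


# Hypothetical sample: declares UTF-8 but carries a latin-1 byte (0xE9).
raw = (b'<?xml version="1.0" encoding="UTF-8" ?>\n'
       b'<items><item>/OU=Univesit\xe9 Catholique de Louvain</item></items>')

doc = parse_with_latin1_fallback(raw)
items = [node.firstChild.data for node in doc.getElementsByTagName("item")]
print(items)
```

Checking for valid UTF-8 first means a document that is already correct is passed through untouched instead of being decoded as latin-1 and garbled.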
HTH,
Carsten Haese
http://informixdb.sourceforge.net
#6; Mon, 28 Apr 2008 02:06:00 GMT

- On Thu, 2007-08-30 at 15:20 +0200, Pablo Rey wrote:
> About the possibility that you mention to recoding the input, could you
> let me know how to do it?. I am sorry I am starting with Python and I
> don't know how to do it.
While I answered this question in my previous reply, I wanted to add
that you might find the following How-To helpful in demystifying
Unicode:
http://www.amk.ca/python/howto/unicode
Carsten Haese
http://informixdb.sourceforge.net
#7; Mon, 28 Apr 2008 02:07:00 GMT

- In message <mailman.137.1188481649.28954.python-list@python.org>, Carsten
Haese wrote:
> If your input string contains the byte 0xe9 where your accented e is,
> the file is actually latin-1 encoded. If it contains the byte sequence
> 0xc3,0xa9 it is UTF-8 encoded.
It is dismaying how often I come across Web pages that claim to be
UTF-8-encoded, but are actually Latin-1 or Dimdows-1252.
#8; Mon, 28 Apr 2008 02:08:00 GMT