SAXParseException: not well-formed (invalid token)

Published: Mon, 28 Apr 2008 01:59:00 GMT; 8 comments

Dear Colleagues,

I am getting the following error with an XML page:

> File "/home/prey/RAL-CESGA/bin/voms2users/voms2users.py", line 69, in getItems
>   d = minidom.parseString(xml.read())
> File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 967, in parseString
>   return _doparse(pulldom.parseString, args, kwargs)
> File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 954, in _doparse
>   toktype, rootNode = events.getEvent()
> File "/usr/lib/python2.2/site-packages/_xmlplus/dom/pulldom.py", line 265, in getEvent
>   self.parser.feed(buf)
> File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 208, in feed
>   self._err_handler.fatalError(exc)
> File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
>   raise exception
> xml.sax._exceptions.SAXParseException: <unknown>:553:48: not well-formed (invalid token)

> def getItems(page):
>     opener = urllib.URLopener(key_file=HOSTKEY, cert_file=HOSTCERT)
>     try:
>         xml = opener.open(page)
>     except:
>         return []
>     d = minidom.parseString(xml.read())
>     items = d.getElementsByTagName('item')
>     data = []
>     for i in items:
>         data.append(getText(i.childNodes))
>     return data

The page is

https://lcg-voms.cern.ch:8443/voms/...apUsers

and the line with the invalid character is (the invalid character is the final é of Université):

<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroid</item>

I have tried several options but have not been able to avoid this problem. Any ideas?

I am just starting to work with Python, so I am sorry if this problem is trivial.

Thanks for your time.

Pablo Rey

All Comments

  • 8 Comments
    • On Thu, 30 Aug 2007 13:46:47 +0200, Pablo Rey wrote:

      > The page is

      > https://lcg-voms.cern.ch:8443/voms/...mapUsers

      > and the line with the invalid character is (the invalid character is the

      > final é of Université):

      The URL doesn't work for me in a browser. (Could not connect…)

      Maybe you can download that XML file and use `xmllint` to check whether it is

      well-formed XML?

      Ciao,

      Marc 'BlackJack' Rintsch

      #1; Mon, 28 Apr 2008 02:01:00 GMT
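
A similar well-formedness check can be run from Python itself when `xmllint` is not at hand. The following is a minimal sketch in Python 3 syntax (the thread used Python 2.2), built on the standard `xml.parsers.expat` module. The function name `check_well_formed` and the two sample documents are made up for illustration; they are not the poster's data.

```python
import xml.parsers.expat


def check_well_formed(data: bytes):
    """Return None if data parses, else a (line, column, message) tuple."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as exc:
        # exc.lineno is 1-based, exc.offset is the 0-based column.
        return exc.lineno, exc.offset, str(exc)
    return None


# Hypothetical samples: both declare UTF-8, but the first one
# carries a lone Latin-1 byte (0xE9) instead of the UTF-8 pair 0xC3 0xA9.
bad = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
       b'<list><item>Universit\xe9</item></list>')
good = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
        b'<list><item>Universit\xc3\xa9</item></list>')

print(check_well_formed(bad))
print(check_well_formed(good))
```

The tuple returned for the bad sample points at the same kind of line and column position that the `SAXParseException` message reports.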
    • Pablo Rey wrote:

      > I am getting the following error with a XML page:

      >

      >

      > The page is

      > https://lcg-voms.cern.ch:8443/voms/...mapUsers

      > and the line with the invalid character is (the invalid character is the

      > final é of Université):

      > <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de

      > Louvain/CN=Roberfroid</item>

      >

      > I have tried several options but I am not able to avoid this

      > problem. Any idea?.

      Looks like the page is not well-formed XML (i.e. not XML at all). If it

      doesn't specify an encoding (<?xml encoding="..."?>), you can try recoding the

      input, possibly decoding it from latin-1 and re-encoding it as UTF-8 before

      passing it to the SAX parser.

      Alternatively, tell the page authors to fix their page.

      Stefan

      #2; Mon, 28 Apr 2008 02:02:00 GMT
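
Whether recoding is appropriate depends on what the document declares; when no encoding is declared, an XML parser assumes UTF-8 (or UTF-16 if a byte order mark is present). Below is a rough sketch, in Python 3 syntax, of how the declared encoding could be read from the raw bytes. The helper `declared_encoding` is hypothetical, handles only ASCII-compatible encodings, and is not a full implementation of the XML encoding-detection rules.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the
# start of the data. Simplified: it does not handle UTF-16/UTF-32 input.
_XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None."""
    match = _XML_DECL.match(data[:200])
    return match.group(1).decode("ascii") if match else None


print(declared_encoding(b'<?xml version="1.0" encoding="UTF-8" ?><list/>'))
print(declared_encoding(b'<list/>'))
```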
    • Hi Stefan,

      The XML does specify an encoding (<?xml version="1.0" encoding="UTF-8" ?>).

      About the possibility you mention of recoding the input, could you

      let me know how to do it? I am sorry, I am just starting with Python and I

      don't know how to do it.

      Thanks for your help.

      Pablo

      On 30/08/2007 14:37, Stefan Behnel wrote:

      > Pablo Rey wrote:

      > Looks like the page is not well-formed XML (i.e. not XML at all). If it

      > doesn't specify an encoding (<?xml encoding="..."?>), you can try recoding the

      > input, possibly decoding it from latin-1 and re-encoding it as UTF-8 before

      > passing it to the SAX parser.

      > Alternatively, tell the page authors to fix their page.

      > Stefan

      #3; Mon, 28 Apr 2008 02:03:00 GMT
    • On 30/08/2007 14:35, Marc 'BlackJack' Rintsch wrote:

      > On Thu, 30 Aug 2007 13:46:47 +0200, Pablo Rey wrote:

      >

      > The URL doesn't work for me in a browser. (Could not connect…)

      Hi Marc,

      To access the page you need an X.509 certificate signed by a CA

      recognised by the project.

      I have stored the XML file and you can find it attached.

      > Maybe you can download that XML file and use `xmllint` to check if it is

      > well formed XML!?

      This is the output of the xmllint command:

      [prey@www3 voms2users]$ xmllint cms.xml
      cms.xml:553: error: Input is not proper UTF-8, indicate encoding !
      <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi
      ^
      cms.xml:553: error: Bytes: 0xE9 0x20 0x43 0x61
      <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi

      Thanks for your time.

      Pablo

      > Ciao,

      > Marc 'BlackJack' Rintsch

      #4; Mon, 28 Apr 2008 02:04:00 GMT
    • On Thu, 30 Aug 2007 15:31:58 +0200, Pablo Rey wrote:

      > On 30/08/2007 14:35, Marc 'BlackJack' Rintsch wrote:

      >

      > This is the output of the xmllint command:

      > [prey@www3 voms2users]$ xmllint cms.xml
      > cms.xml:553: error: Input is not proper UTF-8, indicate encoding !
      > <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi
      > ^
      > cms.xml:553: error: Bytes: 0xE9 0x20 0x43 0x61
      > <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi

      > […]

      > <?xml version="1.0" encoding="UTF-8" ?>

      So the XML says it is encoded in UTF-8 but it contains at least one

      character that seems to be encoded in ISO-8859-1.

      Tell the authors/creators of that document their XML is broken.

      Ciao,

      Marc 'BlackJack' Rintsch

      #5; Mon, 28 Apr 2008 02:05:00 GMT
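
The mismatch described above (declared UTF-8, actual Latin-1 bytes) can be located without an XML parser by attempting a strict UTF-8 decode. A minimal sketch in Python 3 syntax follows; the function name `first_invalid_utf8` and the shortened sample document are illustrative assumptions, not the poster's actual file.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, line, next_four_bytes) for the first byte that is
    not valid UTF-8, or None if the whole input decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset where decoding failed.
        line = data.count(b"\n", 0, exc.start) + 1
        return exc.start, line, data[exc.start:exc.start + 4]
    return None


# Hypothetical two-line sample with a Latin-1 encoded e-acute (0xE9).
sample = (b'<?xml version="1.0" encoding="UTF-8" ?>\n'
          b'<item>Univesit\xe9 Catholique</item>\n')

offset, line, context = first_invalid_utf8(sample)
print(line, [hex(b) for b in context])
```

For this sample the four bytes reported are 0xE9 0x20 0x43 0x61, the same sequence that `xmllint` printed for the real file.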
    • On Thu, 2007-08-30 at 15:20 +0200, Pablo Rey wrote:

      > Hi Stefan,

      > The xml has specified an encoding (<?xml version="1.0" encoding="UTF-8" ?>).

      It's possible that the encoding specification is incorrect:

      '\xe9'

      '\xc3\xa9'

      If your input string contains the byte 0xe9 where your accented e is,

      the file is actually latin-1 encoded. If it contains the byte sequence

      0xc3,0xa9 it is UTF-8 encoded.

      If the string is encoded in latin-1, you can transcode it to utf-8 like

      this:

      contents = contents.decode("latin-1").encode("utf-8")

      HTH,

      Carsten Haese

      http://informixdb.sourceforge.net

      #6; Mon, 28 Apr 2008 02:06:00 GMT
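
Putting the transcoding step together with the parse might look like the sketch below. It is written in Python 3 syntax, where the same `decode`/`encode` chain applies to `bytes`. The function name `parse_mislabelled` and the sample document are hypothetical, and the fallback assumes that any input which is not valid UTF-8 is Latin-1 throughout; a file mixing both encodings would come out garbled.

```python
from xml.dom import minidom


def parse_mislabelled(data: bytes):
    """Parse XML that claims to be UTF-8, transcoding from Latin-1
    first if the bytes turn out not to be valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: invalid UTF-8 means the whole input is Latin-1.
        data = data.decode("latin-1").encode("utf-8")
    return minidom.parseString(data)


# Hypothetical sample shaped like the failing line in the thread.
sample = (b'<?xml version="1.0" encoding="UTF-8" ?>\n'
          b'<list><item>/OU=Univesit\xe9 Catholique de Louvain'
          b'/CN=Roberfroid</item></list>')

doc = parse_mislabelled(sample)
items = [node.firstChild.data for node in doc.getElementsByTagName("item")]
print(items)
```

Transcoding only on failure leaves documents that really are UTF-8 untouched, which matters if the server is fixed later.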
    • On Thu, 2007-08-30 at 15:20 +0200, Pablo Rey wrote:

      > About the possibility that you mention to recoding the input, could you

      > let me know how to do it?. I am sorry I am starting with Python and I

      > don't know how to do it.

      While I answered this question in my previous reply, I wanted to add

      that you might find the following How-To helpful in demystifying

      Unicode:

      http://www.amk.ca/python/howto/unicode

      Carsten Haese

      http://informixdb.sourceforge.net

      #7; Mon, 28 Apr 2008 02:07:00 GMT
      In message <mailman.137.1188481649.28954.python-list@python.org>, Carsten

      Haese wrote:

      > If your input string contains the byte 0xe9 where your accented e is,

      > the file is actually latin-1 encoded. If it contains the byte sequence

      > 0xc3,0xa9 it is UTF-8 encoded.

      It is dismaying how often I come across Web pages that claim to be

      UTF-8-encoded, but are actually Latin-1 or Dimdows-1252.

      #8; Mon, 28 Apr 2008 02:08:00 GMT