If you have an XML document like the one below and parse it with the XML parser in pyquery 1.1.1, there is no quick way to select elements by namespace:
<?xml version="1.0" encoding="UTF-8" ?>
<foo xmlns:bar="http://example.com/bar" xmlns:bar2="http://example.com/bar2">
  <bar:blah>What</bar:blah>
  <bar2:blah>Woot</bar2:blah>
  <idiot>123</idiot>
</foo>
The simplest workaround is to load it with the HTML parser, but then you get two indistinguishable <blah/> elements. A more proper way is to build your own CSS selector:
from pyquery import PyQuery as pq
from lxml.cssselect import CSSSelector

# xml holds the document above as a string
d = pq(xml, parser='xml')

# Select <bar:blah>; the dict maps the selector's prefix to the namespace URI
sel = CSSSelector('bar|blah', namespaces={'bar': 'http://example.com/bar'})
print sel(d[0])[0].text, pq(sel(d[0])).text()

# Select <bar2:blah> the same way
sel = CSSSelector('bar2|blah', namespaces={'bar2': 'http://example.com/bar2'})
print sel(d[0])[0].text, pq(sel(d[0])).text()
In lxml, a namespaced CSS selector uses | to separate the namespace prefix from the tag name. Besides the correct syntax, you also need to supply the namespaces as a dict mapping each prefix to its URI, as shown above. The result, a list of elements, can be fed into pyquery, which you can then use for further selection.
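To see what the prefix-to-URI dict actually does, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml or pyquery (a swapped-in tool, chosen only so the snippet runs without extra packages). The prefixes 'a' and 'b' are made up for illustration: what has to match the document is the URI, not the prefix.

```python
from __future__ import print_function  # keeps print() consistent on Python 2 and 3

import xml.etree.ElementTree as ET

XML = b"""<?xml version="1.0" encoding="UTF-8" ?>
<foo xmlns:bar="http://example.com/bar" xmlns:bar2="http://example.com/bar2">
  <bar:blah>What</bar:blah>
  <bar2:blah>Woot</bar2:blah>
  <idiot>123</idiot>
</foo>"""

root = ET.fromstring(XML)

# Hypothetical prefixes 'a' and 'b': they are local to the query.
# Only the URIs need to match the ones declared in the document.
namespaces = {
    'a': 'http://example.com/bar',
    'b': 'http://example.com/bar2',
}

first = root.find('a:blah', namespaces)
second = root.find('b:blah', namespaces)

print(first.text, second.text)  # What Woot
print(first.tag)                # {http://example.com/bar}blah
```

Internally, each element is identified by its namespace URI plus its local name, which is why the two <blah/> elements stay distinct under an XML parser even though they look the same once the namespaces are ignored.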
I made a patch for pyquery; things will be easier if it gets pulled. With the patch, the code would look like:
namespaces = {'bar': 'http://example.com/bar',
              'bar2': 'http://example.com/bar2'}
print pq('bar|blah', xml, parser='xml', namespaces=namespaces)
print d('bar2|blah', namespaces=namespaces).text()