I want to extract the information in the infobox from specific Wikipedia pages, mainly those of countries. I specifically want to do this without scraping the HTML using Python + BeautifulSoup4 (or any other language + library), and instead use the official API, because I noticed that the CSS tags differ between Wikipedia sites in different languages.
The answer to mediawiki api: how to get infobox from a wikipedia article states that the following method works, which is indeed true for the title given there (Scary Monsters and Nice Sprites), but unfortunately it doesn't work on the pages I tried (see further below).
http://ift.tt/1DKeJl2
However, I suppose Wikimedia changed their infobox template, because when I run the above query, all I get is the content without the infobox parameters. E.g. running the query on Europäische_Union (European_Union) yields, among other things, the following snippet:
{{Infobox Europäische Union}}
<!--{{Infobox Staat}} <- Vorlagen-Parameter liegen in [[Spezial:Permanenter Link/108232313]] -->
The German HTML comment says that the template parameters live in [[Spezial:Permanenter Link/108232313]]. For the English version of Wikipedia, the query works fine, though.
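For reference, the query from the linked answer boils down to the parameters below. A minimal sketch of how the request URL is built (Python 3 syntax here; I'm assuming de.wikipedia.org as the endpoint, since the shortened link above hides the real URL):

```python
from urllib.parse import urlencode

# Assumed endpoint: the German Wikipedia API (the ift.tt link obscures it).
API = "https://de.wikipedia.org/w/api.php"

params = {
    "format": "xml",
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "rvsection": 0,  # section 0 = the lead section, where the infobox sits
    "titles": "Europäische_Union",
}

# urlencode percent-encodes the non-ASCII title as UTF-8.
url = API + "?" + urlencode(params)
print(url)
```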
So the page I want to extract the infobox from would be: http://ift.tt/HkcYMs
And this is the code I'm using:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")  # Python 2 hack so encoding the non-ASCII title works
import lxml.etree
import urllib

title = "Europäische_Union"

# Request the raw wikitext of the lead section (rvsection=0) of the latest revision
params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"content", "rvsection":0 }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v) for k, v in params.items())
url = "http://ift.tt/1DKeLJG" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')  # one <rev> element per requested title
print revs[-1].text
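In case it helps to diagnose: the wikitext I get back only transcludes the infobox, so the only machine-readable thing in it is the template name. Here is a sketch (the function name and regex are my own, not from any library) of pulling that name out, e.g. to issue a follow-up query for the template page:

```python
import re

# The wikitext returned for Europäische_Union (the snippet shown above)
wikitext = u"""{{Infobox Europäische Union}}
<!--{{Infobox Staat}} <- Vorlagen-Parameter liegen in [[Spezial:Permanenter Link/108232313]] -->"""

def infobox_templates(text):
    """Return the names of all {{Infobox ...}} transclusions in `text`,
    ignoring anything inside HTML comments."""
    # Drop commented-out markup first, so {{Infobox Staat}} is not matched
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    return re.findall(r"\{\{\s*(Infobox[^|}]*?)\s*\}\}", text)

print(infobox_templates(wikitext))  # only the live transclusion survives
```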
Am I missing something very substantial?
Extracting infobox from (German) Wikipedia using Wikimedia API