Thursday, February 5, 2015

Extracting infobox from (German) Wikipedia using Wikimedia API






I want to extract the information in the infobox from specific Wikipedia pages, mainly those of countries. Specifically, I want to do this through the official API rather than by scraping the page with Python + BeautifulSoup4 (or any other language + library), because I noticed that the CSS classes differ between Wikipedia sites in different languages.


The question mediawiki api: how to get infobox from a wikipedia article states that the following method works, which is indeed true for the title given there (Scary Monsters and Nice Sprites), but unfortunately it doesn't work on the pages I tried (see further below).



http://ift.tt/1DKeJl2
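
If I read that answer correctly, the query behind the link above has this general form (my reconstruction, since the link got shortened; shown with the title from that question):

http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&rvprop=content&rvsection=0&titles=Scary%20Monsters%20and%20Nice%20Sprites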


However, I suppose Wikimedia changed their infobox template, because when I run the above query, all I get is the article content but not the infobox. E.g., running the query on Europäische_Union (European_Union) yields, among other things, the following snippet:



{{Infobox Europäische Union}}
<!--{{Infobox Staat}} <- Vorlagen-Parameter liegen in [[Spezial:Permanenter Link/108232313]] -->


(The German comment says that the template parameters are kept at [[Spezial:Permanenter Link/108232313]].) The same query works fine on the English Wikipedia, though.
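
For instance, the same kind of request for the English article does return the full infobox wikitext:

http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&rvprop=content&rvsection=0&titles=European_Union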


So the page I want to extract the infobox from would be: http://ift.tt/HkcYMs


And this is the code I'm using:



#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")  # needed so the UTF-8 byte-string title can be re-encoded below

import lxml.etree
import urllib

title = "Europäische_Union"

# Same revisions query as above: wikitext of section 0 for the given titles.
params = {"format": "xml", "action": "query", "prop": "revisions", "rvprop": "content", "rvsection": 0}
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v) for k, v in params.items())
url = "http://de.wikipedia.org/w/api.php?%s" % qs  # API endpoint of the German Wikipedia
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')  # one <rev> element per returned page

print revs[-1].text  # wikitext of the target article's section 0
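

For comparison, here is a minimal variant of the same script against the English Wikipedia, which does print the infobox parameters (reusing the imports and params from above; only the host and title are changed):

title = "European_Union"
params["titles"] = "API|%s" % urllib.quote(title)
qs = "&".join("%s=%s" % (k, v) for k, v in params.items())
url = "http://en.wikipedia.org/w/api.php?%s" % qs  # English Wikipedia endpoint
tree = lxml.etree.parse(urllib.urlopen(url))
print tree.xpath('//rev')[-1].text  # section 0 here contains the full {{Infobox ...}} wikitext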


Am I missing something very substantial?










