Monday, March 31, 2014

How to avoid country-based redirects with urlopen or urllib2 in Python





I am using Python 2.7.


I want to open a website's URL and extract information from it. The information I am looking for is on the US version of the website (http://ift.tt/1ms3x4f). Since I am based in Canada, I am automatically redirected to the Canadian version of the website (http://ift.tt/1gVtZ6P). I am looking for a way to avoid this redirect.


If I open http://ift.tt/1ms3x4f in any browser (IE, Firefox, Chrome, ...), I get redirected. The website offers a menu where visitors can pick the country version of the website they want to view. Once I select United States, I am no longer redirected to the Canadian version. This holds for any new tab within the same browsing session, so I suspect the choice is stored in a cookie.
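If the country choice really is kept in a cookie, one thing to try is an opener that carries cookies between requests, the way a browser session does. The following is only a sketch under that assumption: the helper names make_cookie and make_cookie_opener are made up for illustration, and the cookie name, value and domain ('country', 'US', 'example.com') are placeholders, since the real ones would have to be read from the browser after picking United States in the menu. The imports fall back to the Python 3 module names so the same code runs on both versions.

```python
try:  # Python 2.7, as used in the question
    import urllib2 as request_lib
    import cookielib as cookie_lib
except ImportError:  # Python 3 names for the same modules
    import urllib.request as request_lib
    import http.cookiejar as cookie_lib


def make_cookie(name, value, domain):
    """Build a session cookie for `domain` (all three arguments are placeholders)."""
    return cookie_lib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={})


def make_cookie_opener(preset_cookies=()):
    """Return (opener, jar); the opener stores cookies and resends them."""
    jar = cookie_lib.CookieJar()
    for cookie in preset_cookies:
        jar.set_cookie(cookie)
    opener = request_lib.build_opener(request_lib.HTTPCookieProcessor(jar))
    return opener, jar


# ASSUMPTION: 'country', 'US' and 'example.com' are made-up values, not the
# site's real cookie; replace them with what the browser actually stores.
opener, jar = make_cookie_opener([make_cookie('country', 'US', 'example.com')])
# webpage = opener.open('http://ift.tt/1ms3x4f').read()
```

The request itself is left commented out because whether the site honours such a cookie is exactly what remains to be checked.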


I tried to use the following code to prevent the redirect:



import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Instead of following the redirect, hand the 3xx response back to the caller.
    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result

    # Treat the other redirect codes the same way.
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib2.build_opener(RedirectHandler())
webpage = opener.open('http://ift.tt/1ms3x4f')


but it did not seem to work: the only markup I can extract from the response afterwards is:



<html><head></head><body>‹</body></html>
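A redirect response often carries little or no body, so the status code and headers may say more than the markup above. The sketch below refuses to follow redirects and reports the status, the Location target and any Set-Cookie header of the response. NoRedirectHandler and describe_redirect are names invented for this illustration; it is meant as a diagnostic aid, not as a confirmed fix for the redirect.

```python
try:  # Python 2.7, as used in the question
    import urllib2 as request_lib
except ImportError:  # Python 3 name for the same module
    import urllib.request as request_lib


class NoRedirectHandler(request_lib.HTTPRedirectHandler):
    """Refuse to follow redirects so the 3xx response can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not redirect"; the opener then raises
        # HTTPError for the 3xx response.
        return None


def describe_redirect(url):
    """Return the status code, Location and Set-Cookie headers for `url`."""
    opener = request_lib.build_opener(NoRedirectHandler())
    try:
        response = opener.open(url)
    except request_lib.HTTPError as err:
        # HTTPError exposes the status code and headers of the response.
        return {'status': err.code,
                'location': err.headers.get('Location'),
                'set_cookie': err.headers.get('Set-Cookie')}
    return {'status': response.getcode(),
            'location': None,
            'set_cookie': response.headers.get('Set-Cookie')}


# print(describe_redirect('http://ift.tt/1ms3x4f'))
```

If the Set-Cookie value differs between the Canadian redirect and the page served after choosing United States in a browser, that would support the cookie theory.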


One solution would be to scrape the website through a proxy, but I was wondering whether there is any way to prevent this kind of redirect using only Python or Python packages.
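For reference, the proxy fallback can itself be done with the standard library alone. This is a minimal sketch assuming a US-based HTTP proxy is available; the address 'http://127.0.0.1:8080' is a placeholder and make_proxy_opener is a name made up here.

```python
try:  # Python 2.7, as used in the question
    import urllib2 as request_lib
except ImportError:  # Python 3 name for the same module
    import urllib.request as request_lib


def make_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = request_lib.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return request_lib.build_opener(handler)


# ASSUMPTION: placeholder address; substitute a real proxy located in the US.
proxy_opener = make_proxy_opener('http://127.0.0.1:8080')
# webpage = proxy_opener.open('http://ift.tt/1ms3x4f').read()
```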








