Wednesday, February 4, 2015

How can I switch messy urllib2 and BeautifulSoup code over to requests?


Vote count:

-1




This code is extremely messy and slow. I have tried to work out how to replace the urllib2 and BeautifulSoup parts with simple requests code, but with no luck.



import re
import urllib2  # Python 2 only; Python 3 renamed this to urllib.request

from bs4 import BeautifulSoup

# regular expression matching everything inside < > tags, plus a literal backslash followed by n
pattern = r'(<.*?>|\\n)'

# get the index page
f = urllib2.urlopen("http://hi.com")
s = str(f.read())
f.close()

# create (or empty) the output file 'refined.txt'
ff = open('refined.txt', 'w')
ff.close()

# collect the href of every link on the index page (anchors without an href are skipped)
soup = BeautifulSoup(s, 'html.parser')
links = []
for link in soup.find_all('a', href=True):
    links.append(link.get('href'))

for element in links:
    # get the linked page
    f = urllib2.urlopen("http://hi.com/" + element)
    s = str(f.read())
    f.close()

    # replace every match of the pattern with a newline and write the result to 'refined.txt'
    # (mode 'w' overwrites the file on each iteration)
    with open('refined.txt', 'w') as ff:
        ff.write(re.sub(pattern, '\n', s))

    # read the file back line by line
    with open('refined.txt') as of:
        d = of.readlines()
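
For reference, this is roughly the shape I imagine the requests version would take. It is only a sketch under a few assumptions: it targets Python 3, it still uses BeautifulSoup to find the links (requests only replaces the urllib2 fetching), and it writes every page into a single refined.txt instead of overwriting the file per page. The names extract_links, strip_tags and scrape are made up for illustration, and http://hi.com/ is just the placeholder URL from the code above.

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://hi.com/"  # placeholder URL taken from the question
# Same pattern as in the question's code: any <...> tag, or a literal backslash followed by n.
TAG_OR_ESCAPED_NEWLINE = re.compile(r"(<.*?>|\\n)")


def extract_links(html):
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def strip_tags(html):
    """Replace every match of the pattern with a newline."""
    return TAG_OR_ESCAPED_NEWLINE.sub("\n", html)


def scrape(base_url=BASE_URL, out_path="refined.txt", session=None):
    """Fetch the index page, then each linked page, writing the stripped text to out_path."""
    # A single Session reuses the connection for repeated requests to the same host.
    session = session or requests.Session()
    index = session.get(base_url, timeout=10)
    index.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    with open(out_path, "w", encoding="utf-8") as out:
        for href in extract_links(index.text):
            # urljoin resolves relative links against the base URL
            page = session.get(urljoin(base_url, href), timeout=10)
            page.raise_for_status()
            out.write(strip_tags(page.text))
```

Calling scrape() with no arguments would fetch the placeholder site and write refined.txt into the current directory. If the slowness comes from opening a new connection for every page, reusing one Session is the part most likely to help, though I have not measured it.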


asked 49 secs ago






