Vote count: -1
This code is extremely messy and slow. I have tried to work out how to replace the urllib and BeautifulSoup parts with simple requests code, but without success.
import re
import urllib2
from bs4 import BeautifulSoup

#get the index page
f = urllib2.urlopen("http://hi.com")
s = str(f.read())
f.close()
#regular expression pattern matching everything inside < > tags and any literal backslash-n sequence
pattern = r'(<.*?>|\\n)'
#create (or empty) the output file 'refined.txt'
ff = open('refined.txt', 'w')
ff.close()
#collect the href of every link on the index page
soup = BeautifulSoup(s, "html.parser")
links=[]
for link in soup.find_all('a'):
    links.append(link.get('href'))
i=0
for element in links:
    #get the linked page
    f = urllib2.urlopen("http://hi.com/"+element)
    s = str(f.read())
    f.close()
    #replaces all instances of the pattern with a newline, then writes the result into the file 'refined.txt'
    with open('refined.txt', 'w') as ff:
        ff.write(re.sub(pattern, '\n', s))
    #reads the file back line by line
    with open('refined.txt') as of:
        d=of.readlines()
asked 49 secs ago
How can I switch messy urllib and BeautifulSoup code to requests?
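
For illustration, here is a rough sketch of how the same flow might look with requests. It is not a drop-in replacement and makes several assumptions that are not in the question: Python 3 instead of the Python 2 urllib2 code, hypothetical helper names (extract_links, strip_tags, scrape), a 10-second timeout, and all pages written to one file rather than overwriting it on each pass. Note that requests only takes over the downloading; it does not parse HTML, so this sketch swaps BeautifulSoup for the standard-library HTMLParser to collect the links. Reusing one Session for all requests may help with the slowness, since connections to the same host can be reused.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Same pattern as in the question: any < > tag or a literal backslash-n.
TAG_OR_ESCAPED_NEWLINE = re.compile(r'(<.*?>|\\n)')


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (stand-in for soup.find_all('a'))."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip anchors that have no href attribute
                self.links.append(href)


def extract_links(html):
    """Return the href values of all links in the given HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def strip_tags(html):
    """Replace every tag and literal backslash-n with a newline."""
    return TAG_OR_ESCAPED_NEWLINE.sub('\n', html)


def scrape(base_url, out_path='refined.txt'):
    """Download base_url, follow each link, and write the stripped text."""
    # Third-party dependency; imported here so the helpers above
    # can be used and tested without it.
    import requests

    with requests.Session() as session:
        index = session.get(base_url, timeout=10)
        index.raise_for_status()
        # Assumption: collect all pages in one file instead of
        # overwriting the file for each link as the original loop does.
        with open(out_path, 'w') as out:
            for href in extract_links(index.text):
                page = session.get(urljoin(base_url, href), timeout=10)
                page.raise_for_status()
                out.write(strip_tags(page.text))
```

Calling scrape("http://hi.com") would then run the whole flow. urljoin is used instead of string concatenation so that both relative links such as a.html and absolute links resolve against the base URL.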