Monday, 1 September 2014

Scrape websites and export only the visible text to a text document (Python 3, BS, Pandas)






Problem: I am trying to scrape multiple websites using BeautifulSoup, keeping only the visible text, and then export all of the data to a single text file.


This file will be used as a corpus for finding collocations using NLTK. I'm working with something like this so far but any help would be much appreciated!



    import requests
    from bs4 import BeautifulSoup
    from collections import Counter

    urls = ["http://ift.tt/17ElK9c", "http://ift.tt/1g8owra"]
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content)
        text = [''.join(soup.findAll(text=True)) for s in soup.findAll('p')]
    with open('thisisanew.txt', 'w') as file:
        for item in text:
            print(file, item)


Unfortunately, there are two issues with this: (1) my code times out before it finishes scraping (I think I may be doing something incorrectly), and (2) when I try to export the text to a .txt file, it doesn't work.


Any ideas?
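
One possible direction, as a minimal sketch rather than a definitive fix. It assumes the slowdown comes from the list comprehension, which joins the text of the whole page once for every `<p>` element instead of using `s`, and that the export fails because `print(file, item)` prints the file object to the console instead of writing to it. The helper names `paragraph_text` and `scrape_to_file`, the 10-second timeout, and the choice of the built-in `html.parser` are assumptions made for illustration, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup


def paragraph_text(html):
    """Return the text of every non-empty <p> element, one paragraph per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are never rendered as page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Read the text of each paragraph itself, not of the whole page.
    paragraphs = (p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return "\n".join(p for p in paragraphs if p)


def scrape_to_file(urls, path, timeout=10):
    """Fetch each URL and append its paragraph text to a single output file."""
    # Open the file once, before the loop, so every page ends up in it.
    with open(path, "w", encoding="utf-8") as out:
        for url in urls:
            try:
                # Without a timeout, requests can wait on a slow server indefinitely.
                response = requests.get(url, timeout=timeout)
                response.raise_for_status()
            except requests.RequestException as exc:
                print(f"Skipping {url}: {exc}")
                continue
            out.write(paragraph_text(response.content) + "\n")
```

Calling `scrape_to_file(urls, 'thisisanew.txt')` with the `urls` list from the question should then produce one corpus file. If `print` is preferred over `out.write`, the file has to be passed by keyword, as in `print(item, file=out)`.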









