Complex Python3 Csv Scraper
I've got the code below working great when pulling data from a single column, in my case row[0]. I'm wondering how to tweak it to pull data from multiple columns? Also, I would love to be able
Solution 1:
To add a per column find parameter, you could create a dictionary mapping the index number into the required find parameters as follows:
from bs4 import BeautifulSoup
import requests
import csv

class_1 = {"class": "productsPicture"}
class_2 = {"class": "product_content"}
class_3 = {"class": "id-fix"}

# map a column number to the required find parameters
class_to_find = {
    0 : class_3,    # Not defined in question
    1 : class_1,
    2 : class_1,
    3 : class_3,    # Not defined in question
    4 : class_2,
    5 : class_2}

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile)
    writer = csv.writer(results)

    for row in reader:
        # get the url
        output_row = []

        for index, url in enumerate(row):
            url = url.strip()

            # Skip any empty URLs
            if len(url):
                #print('col: {}\nurl: {}\nclass: {}\n\n'.format(index, url, class_to_find[index]))

                # fetch content from server
                try:
                    html = requests.get(url).content
                except requests.exceptions.ConnectionError as e:
                    output_row.extend([url, '', 'bad url'])
                    continue
                except requests.exceptions.MissingSchema as e:
                    output_row.extend([url, '', 'missing http...'])
                    continue

                # soup fetched content
                soup = BeautifulSoup(html, 'html.parser')

                divTag = soup.find("div", class_to_find[index])

                if divTag:
                    # Return all 'a' tags that contain an href
                    for a in divTag.find_all("a", href=True):
                        url_sub = a['href']

                        # Test that link is valid
                        try:
                            r = requests.get(url_sub)
                            output_row.extend([url, url_sub, 'ok'])
                        except requests.exceptions.ConnectionError as e:
                            output_row.extend([url, url_sub, 'bad link'])
                else:
                    output_row.extend([url, '', 'no results'])

        writer.writerow(output_row)
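One caveat worth noting: class_to_find only maps columns 0 to 5, so a row with more columns would raise a KeyError. A minimal sketch (the fallback to class_1 is an assumption, pick whatever default suits your data) using dict.get to avoid that:

```python
class_1 = {"class": "productsPicture"}

# Only two columns mapped here, for brevity
class_to_find = {0: class_1, 1: class_1}

for index in range(4):
    # .get() falls back to a default (class_1 here, an assumption)
    # instead of raising KeyError for unmapped columns
    params = class_to_find.get(index, class_1)
    print(index, params)
```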
The enumerate() function is used to return a counter whilst iterating over a list. So index will be 0 for the first URL, and 1 for the next. This can then be used with the class_to_find dictionary to get the required find parameters to search with.
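A minimal illustration of how enumerate() pairs each URL in a row with its column index (the example URLs are placeholders):

```python
row = ['http://a.example', 'http://b.example', 'http://c.example']

# enumerate() yields (index, item) pairs: (0, 'http://a.example'), (1, ...), ...
for index, url in enumerate(row):
    print(index, url)
```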
Each URL results in three columns being written: the URL itself, the sub-URL if one was found, and the result status. Any of these can be removed if not needed.
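For example, assuming each URL contributed exactly one group of three cells, the status cells sit at every third index starting at 2 and can be sliced out (or dropped) like this:

```python
# Hypothetical output row for two URLs: three cells per URL
output_row = ['http://a.example', 'http://a.example/sub', 'ok',
              'http://b.example', '', 'bad url']

# Every third cell starting at index 2 is a status cell
statuses = output_row[2::3]
print(statuses)  # ['ok', 'bad url']
```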