Download csvs to desktop from csv links - python-3.x

Problem:
I don't know if my google-fu is failing me again, but I am unable to download csvs from a list of urls. I have used requests and bs4 to gather the urls (the final list is correct) - see the process below for more info.
I then followed one of the answers given here, using urllib to download: Trying to download data from URL with CSV File, as well as a number of other stackoverflow python answers for downloading csvs.
Currently I am stuck with an HTTP Error 404: Not Found (the stack trace below is from the last attempt, where a User-Agent was passed):
----> 9 f = urllib.request.urlopen(req)
10 print(f.read().decode('utf-8'))
#other lines
--> 650 raise HTTPError(req.full_url, code, msg, hdrs, fp)
651
652 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 404: Not Found
I tried the solution of adding a User-Agent given here: Web Scraping using Python giving HTTP Error 404: Not Found, though I would have expected a 403 rather than a 404 error code - but it seems to have worked for a number of OPs.
This still failed with the same error. I am pretty sure I can solve this by simply using selenium and passing the csv urls to .get, but I want to know if I can solve this with requests alone.
Outline:
I visit this page:
https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice
I grab all the monthly version links, e.g. Patients Registered at a GP Practice May 2019, then visit each of those pages and grab all the csv links within.
I loop over the final dictionary of filename:download_url pairs, attempting to download the files.
Question:
Can anyone see what I am doing wrong or how to fix this so I can download the files without resorting to selenium? I'm also unsure of the most efficient way to accomplish this - perhaps urllib is not actually required at all and just requests will suffice?
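For reference, once a csv url is correct the download step itself needs nothing beyond requests; a minimal sketch, with a placeholder url and output path:
import requests

# placeholder values - substitute a real csv url and a writable output path
csv_url = 'https://files.digital.nhs.uk/example/example.csv'
out_path = r'C:\Users\User\Desktop\example.csv'

r = requests.get(csv_url)
r.raise_for_status()  # raises on 404/403 instead of silently saving an error page
with open(out_path, 'wb') as f:
    f.write(r.content)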
Python:
Without user-agent:
import requests
from bs4 import BeautifulSoup as bs
import urllib.request

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]: base + item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links)  # ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()}  # flatten list of dicts to single dict
path = r'C:\Users\User\Desktop'

for k, v in all_files.items():
    #print(k, v)
    print(v)
    response = urllib.request.urlopen(v)
    html = response.read()
    with open(path + '\\' + k + '.csv', 'wb') as f:
        f.write(html)
    break  # as only need one test case
Test with adding User-Agent:
req = urllib.request.Request(
    v,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))

Looking at the values, it's showing me this for your links:
https://digital.nhs.uk/https://files.digital.nhs.uk/publicationimport/pub13xxx/pub13932/gp-reg-patients-04-2014-lsoa.csv
I think you want to drop the base +, so use this:
file_links = {item.text.strip().split('\n')[0]:item['href'] for item in soup.select('[href$=".csv"]')}
instead of:
file_links = {item.text.strip().split('\n')[0]:base + item['href'] for item in soup.select('[href$=".csv"]')}
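If a page ever mixes relative and absolute hrefs, urllib.parse.urljoin handles both, leaving absolute urls untouched and resolving relative ones against the base; a small sketch of that variant:
from urllib.parse import urljoin

base = 'https://digital.nhs.uk/'
# absolute hrefs (https://files.digital.nhs.uk/...) pass through unchanged,
# relative hrefs are resolved against base
file_links = {item.text.strip().split('\n')[0]: urljoin(base, item['href'])
              for item in soup.select('[href$=".csv"]')}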
Edit: Full Code:
import requests
from bs4 import BeautifulSoup as bs

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]: item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links)  # ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()}  # flatten list of dicts to single dict
path = 'C:/Users/User/Desktop/'

for k, v in all_files.items():
    #print(k, v)
    print(v)
    response = requests.get(v)
    html = response.content
    k = k.replace(':', ' -')
    file = path + k + '.csv'
    with open(file, 'wb') as f:
        f.write(html)
    break  # as only need one test case
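One optional hardening step on top of this: check the response status before writing and replace every character Windows rejects in a filename, not just the colon. A short sketch, reusing the same k and v from the loop above:
import re

response = requests.get(v)
response.raise_for_status()  # fail loudly on 404/403 instead of saving an error page
safe_name = re.sub(r'[<>:"/\\|?*]', ' -', k)  # replace all characters Windows disallows in filenames
with open(path + safe_name + '.csv', 'wb') as f:
    f.write(response.content)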

Related

How could I extract a specific numerical value from a list item

So I'm trying to scrape data from a website and get specific values that I will use later in calculations, but I am having trouble pulling just the values I want from the data I scrape. Currently, this is what I have:
import requests
from bs4 import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
}
url = 'https://cars.usnews.com/cars-trucks/ram/1500/2021/specs/1500-tradesman-4x2-quad-cab-6-4-box-414114'
page = requests.get(url, headers=header)  # change headers or get blocked
soup = BeautifulSoup(page.content, 'html.parser')
specs = soup.find_all('div', class_="trim-specs columns small-12")

spec_values = []
for spec in specs:
    spec_values.extend(spec.find_all('li'))

towing = [x for x in spec_values if 'Maximum Trailering Capacity (lbs.)' in x.string]
print(towing)
From here I get this output:
[<li>Maximum Trailering Capacity (lbs.): 7730</li>]
How could I just pull the value of 7730 from here?
This is one way I found of doing this, but it won't work for values that are not integers:
towing_num = [int(i) for i in str(towing) if i.isdigit()]
towing_cap = int(''.join(map(str, towing_num)))
print(towing_cap)
This gives me 7730 as an output but this method does not work for any number with a decimal. Is there a more straightforward way of obtaining this value?
Thanks in advance
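One straightforward option, working directly from the towing list above and assuming the li text always ends with the number, is a regular expression, which copes with decimals as well as integers; a sketch:
import re

# towing[0] is the bs4 <li> element found above
text = towing[0].get_text()  # "Maximum Trailering Capacity (lbs.): 7730"
number = re.search(r'(\d+(?:\.\d+)?)\s*$', text).group(1)
value = float(number) if '.' in number else int(number)
print(value)  # 7730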
Looking at the page, you can split the spec on : and then the second element is your number. You can then apply int() or float() to it:
import requests
from bs4 import BeautifulSoup

header = {"User-Agent": "Mozilla/5.0 (X11; Linux i686 on x86_64)"}
url = "https://cars.usnews.com/cars-trucks/ram/1500/2021/specs/1500-tradesman-4x2-quad-cab-6-4-box-414114"

page = requests.get(url, headers=header)  # change headers or get blocked
soup = BeautifulSoup(page.content, "html.parser")

# load all specs into `specs` list
specs = []
for li in soup.select(".trim-specs li:not(.sub-header)"):
    specs.append([w.strip() for w in li.text.split(":")])

# find "Maximum Trailering Capacity (lbs.)" in specs:
for s in specs:
    if "Maximum Trailering Capacity (lbs.)" in s:
        print("{} is {}".format(s[0], int(s[1])))
        break
Prints:
Maximum Trailering Capacity (lbs.) is 7730
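If a spec value can contain a decimal, the same loop works with a conditional conversion; a small variation on the code above:
for s in specs:
    if "Maximum Trailering Capacity (lbs.)" in s:
        # fall back to float when the value has a decimal point
        value = float(s[1]) if "." in s[1] else int(s[1])
        print("{} is {}".format(s[0], value))
        break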

Getting incorrect link on parsing web page in BeautifulSoup

I'm trying to get the download link from the button on this page, but when I open the download link that I get from my code, I get an error message.
I noticed that if I manually click the button and open the link in a new page, the csrfKey part of the link is always the same, whereas when I run the code I get a different key every time. Here's my code:
from bs4 import BeautifulSoup
import requests
import re

def GetPage(link):
    source_new = requests.get(link).text
    soup_new = BeautifulSoup(source_new, 'lxml')
    container_new = soup_new.find_all(class_='ipsButton')
    for data_new in container_new:
        #print(data_new)
        headline = data_new  # Display text
        match = re.findall('download', str(data_new), re.IGNORECASE)
        if match:
            print(f'{headline["href"]}\n')

if __name__ == '__main__':
    link = 'https://eci.gov.in/files/file/10985-5-number-and-types-of-constituencies/'
    GetPage(link)
Before you get to the actual download links of the files, you need to agree to the Terms and Conditions. So, you need to fake this with requests and then parse the next page you get.
Here's how:
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    link = 'https://eci.gov.in/files/file/10985-5-number-and-types-of-constituencies/'
    with requests.Session() as connection:
        r = connection.get("https://eci.gov.in/")
        confirmation_url = BeautifulSoup(
            connection.get(link).text, 'lxml'
        ).select_one(".ipsApp .ipsButton_fullWidth")["href"]
        fake_agree_to_continue = connection.get(
            confirmation_url.replace("?do=download", "?do=download&confirm=1")
        ).text
        download_links = [
            a["href"] for a in
            BeautifulSoup(
                fake_agree_to_continue, "lxml"
            ).select(".ipsApp .ipsButton_small")[1:]
        ]
        for download_link in download_links:
            response = connection.get(download_link)
            file_name = (
                response
                .headers["Content-Disposition"]
                .replace('"', "")
                .split(" - ")[-1]
            )
            print(f"Downloading: {file_name}")
            with open(file_name, "wb") as f:
                f.write(response.content)
This should output:
Downloading: Number And Types Of Constituencies.pdf
Downloading: Number And Types Of Constituencies.xls
And it saves two files: a .pdf and a .xls.
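As for why the csrfKey kept changing in the original attempt: each bare requests.get starts a fresh session, and the key is most likely tied to the session cookies, which is exactly what requests.Session preserves between calls; a small illustration:
import requests

with requests.Session() as s:
    s.get("https://eci.gov.in/")   # the first request sets the session cookies
    print(s.cookies.get_dict())    # the same cookie jar is re-sent on every later s.get(...)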

Loop pagination using a dynamic integer

So far I can scrape the initial page and save it. What I'm trying to do is use the page count on the site to determine the number of loops.
The page count is found in the code with 'count =', which in this case is 18. How can I loop my code to scrape and save each page?
Secondly, my code scrapes each url 3 times. Is there a way to avoid the duplicates?
Lastly, I'm using 'strip' to get the dynamic integer for the loop. The element returns the text: Viewing page 1 of 18. Using 'strip' returns the correct number if the last number is a single digit. In this case, since there are two digits (18), it only returns the 8. Can't figure that one out for the life of me.
Appreciate the help.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import re
import csv

chrome_driver = "C:/chromedriver.exe"
Chrome_options = Options()
Chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9015")
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(chrome_driver, options=Chrome_options)
source = driver.page_source
soup = BeautifulSoup(source, "html.parser")

### set zipcode and search length ###
zipcode = "84105"
search = "1yr"  # search option: 1mo 3mo 6mo 1yr 2yr 3yr All
url = 'https://www.redfin.com/zipcode/' + zipcode + '/filter/include=sold-' + search
https = "https://www.redfin.com"
driver.get(url)
#####################################

### get page count ###
count = soup.find('span', class_='pageText').get_text()  # grabs total pages to grab
pages = count.strip('Viewing page 1 of')  # gives a number of pages to paginate
print("This search has " + pages + " pages" + ": " + zipcode)
print(url)
########################

data = []
for url in soup.find_all('a', attrs={'href': re.compile("^/UT/")}):
    print(https + url['href'])
    data.append(https + url['href'])

with open("links.csv", 'a') as csvfile:
    write = csv.writer(csvfile, delimiter=' ')
    write.writerows(data)
Just noticed that you want to loop without duplicates:
import requests
from bs4 import BeautifulSoup
import csv

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}

def main(url):
    with requests.Session() as req:
        print("Extracting Page# 1")
        r = req.get(url.format("1"), headers=headers)
        soup = BeautifulSoup(r.content, 'html.parser')
        total = int(soup.select_one("span.pageText").text.split(" ")[-1]) + 1
        urls = [f'{url[:22]}{a.get("href")}' for a in soup.select("a.slider-item")]
        for page in range(2, total):
            print(f"Extracting Page# {page}")
            r = req.get(url.format(page), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            links = [f'{url[:22]}{a.get("href")}' for a in soup.select("a.slider-item")]
            urls.extend(links)
        mylist = list(dict.fromkeys(urls))  # drop duplicates while keeping order
        with open("links.csv", 'w', newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["Links"])
            writer.writerows(zip(mylist))

main("https://www.redfin.com/zipcode/84105/filter/include=sold-1yr/page-{}")

How to crawl a list of urls without a for loop?

I have a batch list of urls, and I want to crawl some information from these urls:
daa = ['https://old.reddit.com/r/Games/comments/a2p1ew/', 'https://old.reddit.com/r/Games/comments/9zzo0e/', 'https://old.reddit.com/r/Games/comments/a31a6q/']
for y in daa:
    uClient = requests.get(y, headers={'User-agent': 'your bot 0.1'})
    page_soup = soup(uClient.content, "html.parser")
    time = page_soup.findAll("p", {"class": "tagline"})[0].time.get('datetime').replace('-', '')
And it works well to get all the times I want. But I need to do it without a for loop; or rather, I need to open and write a file in a later step, and if I do that in the same loop, the output is weird.
How do I get time without a for loop?
You could, as stated above, use open(file, 'a'). Or, what I like to do is append everything into a table and then write the whole thing out as a file.
import requests
import bs4
import pandas as pd

results = pd.DataFrame()
daa = ['https://old.reddit.com/r/Games/comments/a2p1ew/', 'https://old.reddit.com/r/Games/comments/9zzo0e/', 'https://old.reddit.com/r/Games/comments/a31a6q/']
for y in daa:
    uClient = requests.get(y, headers={'User-agent': 'your bot 0.1'})
    page_soup = bs4.BeautifulSoup(uClient.content, "html.parser")
    time = page_soup.findAll("p", {"class": "tagline"})[0].time.get('datetime').replace('-', '')
    temp_df = pd.DataFrame([[y, time]], columns=['url', 'time'])
    results = results.append(temp_df).reset_index(drop=True)

results.to_csv('path/to_file.csv', index=False)
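One caveat: DataFrame.append was deprecated in later pandas releases and removed in pandas 2.0, so on a current install it is safer to collect the rows in a plain list and build the frame once; a sketch of the same idea:
import requests
import bs4
import pandas as pd

daa = ['https://old.reddit.com/r/Games/comments/a2p1ew/',
       'https://old.reddit.com/r/Games/comments/9zzo0e/',
       'https://old.reddit.com/r/Games/comments/a31a6q/']

rows = []
for y in daa:
    uClient = requests.get(y, headers={'User-agent': 'your bot 0.1'})
    page_soup = bs4.BeautifulSoup(uClient.content, "html.parser")
    time = page_soup.findAll("p", {"class": "tagline"})[0].time.get('datetime').replace('-', '')
    rows.append({'url': y, 'time': time})

results = pd.DataFrame(rows, columns=['url', 'time'])
results.to_csv('path/to_file.csv', index=False)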

Not able to parse webpage contents using beautiful soup

I have been using Beautiful Soup for parsing webpages for some data extraction. It has worked perfectly well for me so far for other webpages. However, I'm trying to count the number of <a> tags on this page:
from bs4 import BeautifulSoup
import requests

catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89
url = url_base + catsection + "?page=" + str(i)
print(url)
# This is the page I'm trying to parse and also the one in the hyperlink
# I get the correct url I'm looking for at this stage
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
j = 0
for num in soup.find_all('a'):
    j = j + 1
print(j)
I'm getting the output as 0. This makes me think that the 2 lines after r = requests.get(url) are probably not working (there's obviously no chance that there are zero <a> tags on the page), and I'm not sure what alternative solution I can use here. Does anybody have a solution, or has anyone faced a similar kind of problem before?
Thanks in advance.
You need to pass some information along with the request to the server.
The following code should work. You can play around with other parameters as well:
from bs4 import BeautifulSoup
import requests

catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89
url = url_base + catsection + "?page=" + str(i)
print(url)
headers = {
    'User-agent': 'Mozilla/5.0'
}
# This is the page I'm trying to parse and also the one in the hyperlink
# I get the correct url I'm looking for at this stage
r = requests.get(url, headers=headers)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
j = 0
for num in soup.find_all('a'):
    j = j + 1
print(j)
Put any url in the parser and check the number of "a" tags available on that page:
from bs4 import BeautifulSoup
import requests
url_base = "http://www.dnaindia.com/cricket?page=1"
res = requests.get(url_base, headers={'User-agent': 'Existed'})
soup = BeautifulSoup(res.text, 'html.parser')
a_tag = soup.select('a')
print(len(a_tag))
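A quick way to confirm that the missing header, rather than the parsing, is the problem is to compare the raw responses with and without it; a small diagnostic sketch:
import requests

url = "http://www.dnaindia.com/cricket?page=1"
plain = requests.get(url)
with_ua = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
# a blocked request typically comes back with a 4xx status or a much smaller body
print(plain.status_code, len(plain.text))
print(with_ua.status_code, len(with_ua.text))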
