I'm following the same web scraping pattern that I just learned from a previous post; however, I'm unable to scrape the site with the script below. I keep getting an empty return, and I know the tags are there. I want to find_all "mubox" and then pull the values for O/U and goalie information. This is so weird, what am I missing?
from bs4 import BeautifulSoup
import requests
import pandas as pd
page_link = 'https://www.thespread.com/nhl-scores-matchups'
page_response = requests.get(page_link, timeout=10)
# here, we fetch the content from the url, using the requests library
page_content = BeautifulSoup(page_response.content, "html.parser")
# Take out the <div> of name and get its value
tables = page_content.find_all("div", class_="mubox")
print (tables)
# Iterate through rows
rows = []
This site loads its data from an internal API before rendering the page, which is why the HTML you download contains no "mubox" divs. The API returns an XML file (the URL used below) that contains all the match information, and you can parse it with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
page_link = 'https://www.thespread.com/matchups/NHL/matchup-list_20181030.xml'
page_response = requests.get(page_link, timeout=10)
body = BeautifulSoup(page_response.content, "lxml")
data = [
    (
        t.find("road").text,
        t.find("roadgoalie").text,
        t.find("home").text,
        t.find("homegoalie").text,
        float(t.find("ot").text),
        float(t.find("otmoney").text),
        float(t.find("ft").text),
        float(t.find("ftmoney").text)
    )
    for t in body.find_all('event')
]
print(data)
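Since your original script already imports pandas, here is a minimal sketch of loading those tuples into a DataFrame; the column names are my own assumption, chosen to match the order of the tags parsed above:
import pandas as pd

# Hypothetical column names matching the tuple order built above
columns = ["road", "road_goalie", "home", "home_goalie",
           "ot", "ot_money", "ft", "ft_money"]
df = pd.DataFrame(data, columns=columns)
print(df.head())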
I am trying to write a script that retrieves the album title and band name from a music store newsletter. The band name and album title are inside h3 and h4 elements. When executing the script I get blank output in the CSV file.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'a' elements with the class 'row'
albums = soup.find_all('a', attrs={'class': 'row'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('td', attrs={'td_class': 'h3 class'})
    band_name_element = album.find('td', attrs={'td_class': 'h4 class'})
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')
I think the error is in the attrs part, but I'm not sure how to fix it properly. Thanks in advance!
Looking at your code I agree that the error lies in the attrs part. The problem you are facing is that the site you are trying to scrape does not contain 'a' elements with the 'row' class. Thus find_all returns an empty list. There are plenty of 'div' elements with the 'row' class, maybe you meant to look for those?
You had the right idea by looking for 'td' elements and extracting their 'h3' and 'h4' elements, but since albums is an empty list, there are no elements to find.
I changed your code slightly to look for 'td' elements directly and extract their 'h3' and 'h4' elements. With these small changes your code found 29 albums.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'} )
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('h3')
    band_name_element = album.find('h4')
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv', index=False)
I also took the liberty of adding index=False to the last line of your code. This keeps each row from starting with the DataFrame index followed by a comma.
Hope this helps.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('h3', attrs={'class': 'header'})
    band_name_element = album.find('h4', attrs={'class': 'header'})
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')
Thanks to the anonymous hero for helping out!
I recently started getting into web scraping and have managed OK so far, but now I'm stuck and can't find the answer or figure it out.
Here is my code for scraping and exporting info from a single page
import requests
page = requests.get("https://www.example.com/page.aspx?sign=1")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
#finds the right heading to grab
box = soup.find('h1').text
heading = box.split()[0]
#finds the right paragraph to grab
reading = soup.find_all('p')[0].text
print (heading, reading)
import csv
from datetime import datetime
# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([heading, reading, datetime.now()])
The problem occurs when I try to scrape multiple pages at the same time.
They are all the same, just the pagination changes, e.g.
https://www.example.com/page.aspx?sign=1
https://www.example.com/page.aspx?sign=2
https://www.example.com/page.aspx?sign=3
https://www.example.com/page.aspx?sign=4 etc
Instead of writing the same code 20 times, how do I put all the data in a tuple or an array and export it to CSV?
Many thanks in advance.
Just use a loop and keep going until no page is available (the request is not OK). That should be easy to do.
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime
results = []
page_number = 1
while True:
    response = requests.get(f"https://www.example.com/page.aspx?sign={page_number}")
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.content, 'html.parser')
    # finds the right heading to grab
    box = soup.find('h1').text
    heading = box.split()[0]
    # finds the right paragraph to grab
    reading = soup.find_all('p')[0].text
    # write a list
    # results.append([heading, reading, datetime.now()])
    # or a tuple.. your call
    results.append((heading, reading, datetime.now()))
    page_number = page_number + 1
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    for result in results:
        writer.writerow(result)
I have tried to get the highlighted area (in the screenshot below) of the website using BeautifulSoup4, but I cannot get what I want. Maybe you have a recommendation for doing it another way.
Screenshot of the website I need to get data from
from bs4 import BeautifulSoup
import requests
import pprint
import re
import pyperclip
import urllib
import csv
import html5lib
urls = ['https://e-mehkeme.gov.az/Public/Cases?page=1',
'https://e-mehkeme.gov.az/Public/Cases?page=2'
]
# scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    content = soup.findAll("input", class_="casedetail filled")
    print(content)
My expected output is like this:
Ətraflı məlumat:
İşə baxan hakim və ya tərkib
Xəyalə Cəmilova - sədrlik edən hakim
İlham Kərimli - tərkib üzvü
İsmayıl Xəlilov - tərkib üzvü
Tərəflər
Cavabdeh: MAHMUDOV MAQSUD SOLTAN OĞLU
Cavabdeh: MAHMUDOV MAHMUD SOLTAN OĞLU
İddiaçı: QƏHRƏMANOVA AYNA NUĞAY QIZI
İşin mahiyyəti
Mənzil mübahisələri - Mənzildən çıxarılma
Using the base URL, first get all the case IDs, then pass each case ID to the target URL and get the value of the first td tag.
import requests
from bs4 import BeautifulSoup
urls = ['https://e-mehkeme.gov.az/Public/Cases?page=1',
'https://e-mehkeme.gov.az/Public/Cases?page=2'
]
target_url="https://e-mehkeme.gov.az/Public/CaseDetail?caseId={}"
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    for caseid in soup.select('input.casedetail'):
        # print(caseid['value'])
        soup1 = BeautifulSoup(requests.get(target_url.format(caseid['value'])).content, 'html.parser')
        print(soup1.select_one("td").text)
I would write it this way, extracting the ID that needs to be put in the GET request for the detailed info.
import requests
from bs4 import BeautifulSoup as bs
urls = ['https://e-mehkeme.gov.az/Public/Cases?page=1','https://e-mehkeme.gov.az/Public/Cases?page=2']
def get_soup(url):
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    return soup

with requests.Session() as s:
    for url in urls:
        soup = get_soup(url)
        detail_urls = [f'https://e-mehkeme.gov.az/Public/CaseDetail?caseId={i["value"]}' for i in soup.select('.caseId')]
        for next_url in detail_urls:
            soup = get_soup(next_url)
            data = [string for string in soup.select_one('[colspan="4"]').stripped_strings]
            print(data)
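If you prefer the output line by line, closer to the layout shown in the question, you could replace the final print(data) call with something like:
print('\n'.join(data))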
I was able to successfully scrape some text from a website and I'm now trying to load the text into a list so I can later convert it to a Pandas DataFrame.
The site supplies the data in an SCSV (semicolon-separated values) format, so it was quick to grab.
The following is my code:
import requests
from bs4 import BeautifulSoup
#Specify the url:url
url = "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2017&game=dk&scsv=1"
# Packages the request, send the request and catch the response: r
r = requests.get(url)
#Extract the response:html_doc
html_doc = r.text
soup = BeautifulSoup(html_doc,"html.parser")
#Find the tags associated with the data you need, in this case
# it's the "pre" tags
for data in soup.find_all("pre"):
    print(data.text)
Sample Output
Week;Year;GID;Name;Pos;Team;h/a;Oppt;DK points;DK salary
1;2017;1254;Smith, Alex;QB;kan;a;nwe;34.02;5400
1;2017;1344;Bradford, Sam;QB;min;h;nor;28.54;5900
Use the open function to write a CSV file:
import requests
from bs4 import BeautifulSoup
url = "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2017&game=dk&scsv=1"
r = requests.get(url)
html_doc = r.content
soup = BeautifulSoup(html_doc,"html.parser")
file = open("data.csv", "w")
for data in soup.find("pre").text.split('\n'):
    file.write(data.replace(';', ',') + '\n')
file.close()
Here's one thing you can do, although it's possible that someone who knows pandas better than I can suggest something better.
You have r.text. Put that into a convenient text file; let me call it temp.csv. Now you can use the pandas read_csv method to get the data into a DataFrame.
>>> df = pandas.read_csv('temp.csv', sep=';')
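For example, here is a minimal sketch of that approach, assuming the semicolon-separated block you want is the one inside the pre tag from the question (the file name temp.csv is just a placeholder):
import pandas
import requests
from bs4 import BeautifulSoup

url = "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2017&game=dk&scsv=1"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

# Write the scraped semicolon-separated text to a temporary file
with open('temp.csv', 'w') as f:
    f.write(soup.find("pre").text)

# Read it back with pandas, using ';' as the separator
df = pandas.read_csv('temp.csv', sep=';')
print(df.head())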
Addendum:
Suppose results were like this.
>>> results = [['a', 'b', 'c'], [1,2,3], [4,5,6]]
Then you could put them in a dataframe in this way.
>>> df = pandas.DataFrame(results[1:], columns=results[0])
>>> df
a b c
0 1 2 3
1 4 5 6
If you want to convert your existing output into a list, the split method might do the job, and then you can use pandas to convert it into a DataFrame.
import requests
from bs4 import BeautifulSoup
#Specify the url:url
url = "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2017&game=dk&scsv=1"
# Packages the request, send the request and catch the response: r
r = requests.get(url)
#Extract the response:html_doc
html_doc = r.text
soup = BeautifulSoup(html_doc,"html.parser")
#Find the tags associated with the data you need, in this case
# it's the "pre" tags
for data in soup.find_all("pre"):
    print(data.text.split(";"))
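From there, a minimal sketch of turning the split rows into a DataFrame, continuing from the soup object above; treating the first row as the header is my assumption based on the sample output in the question:
import pandas as pd

pre_text = soup.find("pre").text.strip()
rows = [line.split(";") for line in pre_text.split("\n")]
# First row holds the column names, the remaining rows hold the data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())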
This is the code I am using. It returns an empty list. I couldn't figure out what I am doing wrong!
from urllib.request import urlopen
import re
url = 'http://pubs.acs.org/doi/full/10.1021/jacs.6b10998'# example of a web page
html = urlopen(url).read().decode('utf-8')# decoding
cite_year='<span class="citation_year">(.+?)</span>'# extract citation year
pattern = re.compile(cite_year) #compile
citation_year = re.findall(pattern, html) #store data into a variable
print(citation_year)# and print
Add a header to the request. I use the requests and bs4 libraries:
import requests
import bs4
headers = {'User-Agent':'Mozilla/5.0'}
url = 'http://pubs.acs.org/doi/full/10.1021/jacs.6b10998'# example of a web page
html = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(html.text, 'lxml')
year = soup.find(class_="citation_year").text
print(year)