How do I put my scraped data into a Python dictionary (CSV)? - python-3.x

I am a beginner at this, so I'm trying to learn from YouTube and Stack Overflow, and I'm currently stuck.
I am scraping a website with a Python scraper.
Now I want to put the scraper's results into a dictionary. I chose .csv so I can easily build some kind of search function on my site: people search the .csv and the website shows them their results. I already have something that creates the .csv, but it is empty when I run it. Any help is appreciated.
import requests
from bs4 import BeautifulSoup
import pandas as pd

scraped_data = []
details = {}

page = requests.get('https://www.swisssense.nl/bedden')
soup = BeautifulSoup(page.content, 'html.parser')

products = soup.find_all("a", class_="product-item-link")
prices = soup.find_all("span", class_="price")
images = soup.find_all("img", class_="product-image-photo")
bed_data = soup.find_all('li', attrs={'class', 'item product product-item'})

for bed in bed_data:
    swisssense_details = {}
    bed_naam = bed.find("a", class_="product-item-link").getText()
    bed_price = bed.find("span", class_="price").getText()  # print(bed_naam.text, bed_price.text)
    print(bed_naam, bed_price)
    scraped_data.append(swisssense_details)

dataFrame = pd.DataFrame.from_dict(scraped_data)
dataFrame.to_csv('swisssense_details.csv', index=False)

You never add anything to the scraped_data list (you append an empty dictionary):
import requests
from bs4 import BeautifulSoup
import pandas as pd

scraped_data = []
details = {}

page = requests.get("https://www.swisssense.nl/bedden")
soup = BeautifulSoup(page.content, "html.parser")

products = soup.find_all("a", class_="product-item-link")
prices = soup.find_all("span", class_="price")
images = soup.find_all("img", class_="product-image-photo")
bed_data = soup.find_all("li", attrs={"class", "item product product-item"})

for bed in bed_data:
    bed_naam = bed.find("a", class_="product-item-link").getText()
    bed_price = bed.find("span", class_="price").getText()  # print(bed_naam.text, bed_price.text)
    scraped_data.append(
        {"bed_naam": bed_naam.strip(), "bed_price": bed_price.strip()}
    )

dataFrame = pd.DataFrame.from_dict(scraped_data)
dataFrame.to_csv("swisssense_details.csv", index=False)
Created this dataframe:
bed_naam bed_price
0 Bedframe Balance Pure 1.080,-
1 Bedframe Balance Focus 990,-
2 Gestoffeerd Bedframe Dream Moon 949,-
3 Bedframe Balance Raw 1.090,-
4 Bedframe Balance Tender 1.290,-
5 Bedframe Balance Gentle 1.240,-
6 Gestoffeerd Bedframe Web-Only Dream Cosmos 279,-
7 Bedframe Balance Timeless 1.080,-
8 Gestoffeerd Bedframe Dream Star 899,-
9 Gestoffeerd Bedframe Web-Only Dream Galaxy 299,-
10 Gestoffeerd Bedframe Dream Lunar 949,-
11 Gestoffeerd Bedframe Web-Only Dream Comet 299,-
12 Gestoffeerd Bedframe Dream Stellar 949,-
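Since the stated goal in the question is a simple search over the CSV, here is a minimal lookup sketch (assuming the swisssense_details.csv written above; the query string and the price parsing are hypothetical additions):

import pandas as pd

df = pd.read_csv("swisssense_details.csv")

# optional: turn "1.080,-" style prices into plain numbers (assumes the Dutch thousands separator shown above)
df["price_num"] = (df["bed_price"].str.replace(".", "", regex=False)
                                  .str.replace(",-", "", regex=False)
                                  .astype(int))

query = "Dream"  # hypothetical search term
matches = df[df["bed_naam"].str.contains(query, case=False, na=False)]
print(matches)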

Related

How to scrape a table from a website when the BS4 selector won't find it?

I'm using the code below to scrape the table element from this URL (www.sfda.gov.sa/en/cosmetics-list), but it comes back empty.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.sfda.gov.sa/en/cosmetics-list"
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

table = soup.find('table', attrs={'class': 'table table-striped display'})
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text.strip() for tr in td if tr.text.strip()]
    if row:
        res.append(row)

df = pd.DataFrame(res, columns=["ProductName", "Category", "Country", "Company"])
print(df)
Running the above code, I get no data.
The data is loaded via XHR, so you should use that endpoint to get your information:
url = 'https://www.sfda.gov.sa/GetCosmetics.php?page=1'
pd.DataFrame(requests.get(url).json()['results'])
Example
Loop over the number of pages in range() and collect all the data.
import requests
import pandas as pd

data = []
for i in range(1, 5):
    url = f'https://www.sfda.gov.sa/GetCosmetics.php?page={i}'
    data.extend(requests.get(url).json()['results'])

pd.DataFrame(data)
Output
A wide dataframe (truncated here) with columns such as id, cosmatics_Id, productNotificationsId, productNumber, status, productArName, productEnName, brandName, catArabic, catEnglish, counrtyAr, counrtyEn, manufactureType, packageVolume, unitAr, unitEn, barcode, manufacturearabicname, manufactureenglishname, listedNameAr, listedNameEn, imageUrl, batchNumber, country_of_manufacturing_English, country_of_manufacturing_Arabic, productCreationDate, productexpireddate, subCategory1, subCategoryAR, storageCircumstances, protectionInstructions, usageInstructions, notes, mainCommercialRecordNumber and manufacturingLicenseNumber. Sample rows include "Litsea cubeba oil" (MOKSHA LIFE STYLE, skin products, India) and "Judy protein & Silk hair therapy system" (Judy, hair and scalp products, United States).
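The page range above is hard-coded; the endpoint's JSON also appears to expose a pageCount field (it is what the class-based answer below relies on), so a hedged variant could drive the loop from it:

import requests
import pandas as pd

base = 'https://www.sfda.gov.sa/GetCosmetics.php'

# assumes the JSON response carries a pageCount field, as used in the answer below
page_count = int(requests.get(f'{base}?page=1').json()['pageCount'])

data = []
for i in range(1, page_count + 1):
    data.extend(requests.get(f'{base}?page={i}').json()['results'])

df = pd.DataFrame(data)
print(df.shape)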
You can use concurrent.futures to scrape the pages concurrently and, when all pages are complete, concat the results into a single dataframe:
import concurrent.futures
import json
import os

import pandas as pd
import requests


class Scrape:
    def __init__(self):
        self.root_url = "https://www.sfda.gov.sa/GetCosmetics.php?"
        self.pages = self.get_page_count()
        self.processors = os.cpu_count()

    def get_page_count(self) -> int:
        return self.get_data(url=self.root_url).get("pageCount")

    @staticmethod
    def get_data(url: str) -> dict:
        with requests.Session() as request:
            response = request.get(url, timeout=30)
            if response.status_code != 200:
                print(response.raise_for_status())
            return json.loads(response.text)

    def process_pages(self) -> pd.DataFrame:
        page_range = list(range(1, self.pages + 1))
        with concurrent.futures.ProcessPoolExecutor(max_workers=self.processors) as executor:
            return pd.concat(executor.map(self.parse_data, page_range)).reset_index(drop=True)

    def parse_data(self, page: int) -> pd.DataFrame:
        url = f"{self.root_url}page={page}"
        data = self.get_data(url=url)
        return (pd
                .json_normalize(data=data, record_path="results")
                )[["productEnName", "catEnglish", "counrtyEn", "brandName"]].rename(
            columns={"productEnName": "ProductName", "catEnglish": "Category",
                     "counrtyEn": "Country", "brandName": "Company"}
        )


if __name__ == "__main__":
    final_df = Scrape().process_pages()
    print(final_df)
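Because the work here is network-bound rather than CPU-bound, a thread pool is usually a reasonable alternative to the process pool; a sketch reusing the Scrape class above (the worker count is an arbitrary choice):

import concurrent.futures
import pandas as pd

def scrape_threaded(max_workers: int = 8) -> pd.DataFrame:
    scraper = Scrape()
    pages = range(1, scraper.pages + 1)
    # threads avoid the pickling requirements of a process pool and are fine for I/O-bound requests
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return pd.concat(executor.map(scraper.parse_data, pages)).reset_index(drop=True)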

Creating a CSV File from a Wikipedia table using Beautiful Soup

I am trying to use Beautiful Soup to scrape the first 3 columns from a table on this Wikipedia page.
I implemented the solution found here.
import requests
import lxml
import pandas as pd
from bs4 import BeautifulSoup
#requesting the page
url = 'https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize'
page = requests.get(url).text
#parsing the page
soup = BeautifulSoup(page, "lxml")
#selecting the table that matches the given class
table = soup.find('table',class_="sortable wikitable")
df = pd.read_html(str(table))
df = pd.concat(df)
print(df)
df.to_csv("booker.csv", index = False)
It worked like a charm. Gave me exactly the output I was looking for:
Expected Output 1
However, the solution above uses pandas.
I want to create the same output without using pandas.
I referred to the solution here but the output I am getting looks like this:
Output 2
Here is the code that generates "Output 2":
import csv  # needed for csv.writer below; the original snippet omitted this import
import requests
import lxml
from bs4 import BeautifulSoup

# requesting the page
url = 'https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize'
page = requests.get(url).text

# parsing the page
soup = BeautifulSoup(page, "lxml")

# selecting the table that matches the given class
table = soup.find('table', class_="sortable wikitable")

with open('output.csv', 'w', newline="") as file:
    writer = csv.writer(file)
    writer.writerow(['Year', 'Author', 'Title'])
    for tr in table.find_all('tr'):
        try:
            td_1 = tr.find_all('td')[0].get_text(strip=True)
        except IndexError:
            td_1 = ""
        try:
            td_2 = tr.find_all('td')[1].get_text(strip=True)
        except IndexError:
            td_2 = ""
        try:
            td_3 = tr.find_all('td')[3].get_text(strip=True)
        except IndexError:
            td_3 = ""
        writer.writerow([td_1, td_2, td_3])
So my question is: How do I get the expected output without using Pandas?
P.S: I've tried to parse the rows in the table like this:
import requests
import lxml
from bs4 import BeautifulSoup

# requesting the page
url = 'https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize'
page = requests.get(url).text

# parsing the page
soup = BeautifulSoup(page, "lxml")

# selecting the table that matches the given class
table = soup.find('table', class_="sortable wikitable")

rows = table.find_all('tr')
for row in rows:
    cell = row.td
    if cell is not None:
        print(cell.get_text())
        print(cell.next_sibling.next_sibling.get_text())
    else:
        print("heehee")
But the output I get looks like this:
heehee
1969
Barry England
Nicholas Mosley
Iris Murdoch
Muriel Spark
Gordon Williams
1970
A. L. Barker
Elizabeth Bowen
Iris Murdoch
William Trevor
Terence Wheeler
1970 Awarded in 2010 as the Lost Man Booker Prize[a]
Nina Bawden
Shirley Hazzard
Mary Renault
Muriel Spark
Patrick White
1971
Thomas Kilroy
Doris Lessing
Mordecai Richler
Derek Robinson
Elizabeth Taylor
1972
Susan Hill
Thomas Keneally
Try the following to get your desired results. Make sure your bs4 version is up to date, or at least 4.7.0, so it supports the CSS pseudo-selectors used within the script.
import csv
import lxml
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize'
page = requests.get(url)
soup = BeautifulSoup(page.text, "lxml")

with open('output.csv', 'w', newline="") as file:
    writer = csv.writer(file)
    writer.writerow(['Year', 'Author', 'Title'])
    for row in soup.select('table.wikitable > tbody > tr')[1:]:
        try:
            year = row.select_one("td[rowspan]").get_text(strip=True)
        except AttributeError:
            year = ""
        try:
            author = row.select_one("td:not([rowspan]) > a[title]").get_text(strip=True)
        except AttributeError:
            author = ""
        try:
            title = row.select_one("td > i > a[title], td > i").get_text(strip=True)
        except AttributeError:
            title = ""
        writer.writerow([year, author, title])
        print(year, author, title)
The easiest way is to use pandas directly:
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize"
df = pd.read_html(url)[0][["Year", "Author", "Title"]]
print(df)
Prints:
Year Author Title
0 1969 P. H. Newby Something to Answer For
1 1969 Barry England Figures in a Landscape
2 1969 Nicholas Mosley The Impossible Object
3 1969 Iris Murdoch The Nice and the Good
4 1969 Muriel Spark The Public Image
5 1969 Gordon Williams From Scenes Like These
6 1970 Bernice Rubens The Elected Member
7 1970 A. L. Barker John Brown's Body
8 1970 Elizabeth Bowen Eva Trout
9 1970 Iris Murdoch Bruno's Dream
10 1970 William Trevor Mrs Eckdorf in O'Neill's Hotel
11 1970 Terence Wheeler The Conjunction
12 1970 Awarded in 2010 as the Lost Man Booker Pr... J. G. Farrell Troubles
13 1970 Awarded in 2010 as the Lost Man Booker Pr... Nina Bawden The Birds on the Trees
14 1970 Awarded in 2010 as the Lost Man Booker Pr... Shirley Hazzard The Bay of Noon
15 1970 Awarded in 2010 as the Lost Man Booker Pr... Mary Renault Fire From Heaven
16 1970 Awarded in 2010 as the Lost Man Booker Pr... Muriel Spark The Driver's Seat
17 1970 Awarded in 2010 as the Lost Man Booker Pr... Patrick White The Vivisector
...
To CSV:
df.to_csv("data.csv", index=None)
This creates data.csv.

Unable to scrape the proper data from Money Control board meeting information

I am able to scrape the table from this website,
but I am unable to split the records the way I want. Here is my code:
import requests
from bs4 import BeautifulSoup
import re

r = requests.get('https://www.moneycontrol.com/stocks/marketinfo/meetings.php?opttopic=brdmeeting')
print(r.status_code)

soup = BeautifulSoup(r.text, 'lxml')
# print(soup)

Calendar = soup.find('table', class_='b_12 dvdtbl tbldata14').text
print(Calendar.strip())

for Company_Name in Calendar.find_all('tr'):
    rows = Company_Name.find_all('td', class_='dvd_brdb')
    print(rows)
    for row in rows:
        pl_calender = row.find_all('b')
        print(pl_calender)
Result
Company Name
Date
Agenda
Aplab
Add to Watchlist
Add to Portfolio
14-Sep-2020
Quarterly Results
I am looking for output in the format below:
Date,Company Name,event
2020-09-14,Divi's Laboratories Ltd,AGM 14/09/2020
2020-09-14,Grasim Industries Ltd.,AGM 14/09/2020
Thanks in advance
Jana
stay safe and live healthy
Here is what I tried:
r = requests.get('https://www.moneycontrol.com/stocks/marketinfo/meetings.php?opttopic=brdmeeting')
soup = BeautifulSoup(r.text, 'lxml')

mytable = soup.find('table', class_='b_12 dvdtbl tbldata14')

companyname = mytable.find_all('b')
date = mytable.find_all('td', attrs={'class': 'dvd_brdb', 'align': 'center'})
agenda = mytable.find_all('td', attrs={'class': 'dvd_brdb', 'style': 'text-align:left;'})

companyname_list = []
date_list = []
agenda_list = []

for cn in companyname:
    companyname_list.append(cn.get_text())
for dt in date:
    date_list.append(dt.get_text())
for ag in agenda:
    agenda_list.append(ag.get_text())

del companyname_list[0:2]
del date_list[0:2]

fmt = '{:<8}{:<20}{:<20}{:<20}'
print(fmt.format('', 'Date', 'Company Name', 'Agenda'))
for i, (datess, companynamess, agendass) in enumerate(zip(date_list, companyname_list, agenda_list)):
    print(fmt.format(i, datess, companynamess, agendass))
Result :
Date Company Name Agenda
0 15-Sep-2020 7NR Retail Quarterly Results
1 15-Sep-2020 Scan Projects Quarterly Results
2 15-Sep-2020 Avonmore Cap Quarterly Results
3 15-Sep-2020 Elixir Cap Quarterly Results
4 15-Sep-2020 Aarvee Denim Quarterly Results
5 15-Sep-2020 Vipul Quarterly Results
...
....
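To get the comma-separated layout asked for (Date, Company Name, event), the three lists built above can be written to a CSV instead of printed; a minimal sketch assuming those lists (the output filename is hypothetical):

import pandas as pd

# the three lists can end up with different lengths, so trim to the shortest
n = min(len(date_list), len(companyname_list), len(agenda_list))

df = pd.DataFrame({
    'Date': date_list[:n],
    'Company Name': companyname_list[:n],
    'event': agenda_list[:n],
})
df.to_csv('board_meetings.csv', index=False)  # hypothetical filename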

How can I find the title of the image from the table using beautifulsoup4?

I need help getting the teams column from the table at https://www.hltv.org/stats.
The code below gives me all the values from the table, but not the teams, because they appear as images (hyperlinks). I want to get the titles of the teams.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs  # imports added; the original snippet omitted them

r = requests.get("https://www.hltv.org/stats/players")

# Create a pandas dataframe with the pulled data
root = bs(r.content, "html.parser")
root.prettify()

# Pull the player data out of the table and put it into our dataframe
table = str(root.find("table"))
players = pd.read_html(table, header=0)[0]
I need to get all the teams as a pandas column with a "Teams" header.
Please help.
Since the team name is contained in the alt attribute of the team images, you can simply replace the <td> content with the values from the alt attributes:
table = root.find("table")
for td in table('td', class_='teamCol'):
    teams = [img['alt'] for img in td('img')]
    td.string = ', '.join(teams)

players = pd.read_html(str(table), header=0)[0]
Gives
Player Teams Maps K-D Diff K/D Rating1.0
0 ZywOo Vitality, aAa 612 3853 1.39 1.29
1 s1mple Natus Vincere, FlipSid3, HellRaisers 1153 6153 1.31 1.24
2 sh1ro Gambit Youngsters 317 1848 1.39 1.21
3 Kaze ViCi, Flash, MVP.karnal 613 3026 1.31 1.20
[...]
You can do something like this using requests, pandas and BeautifulSoup:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
req = requests.get("https://www.hltv.org/stats/players")
root = bs(req.text, "html.parser")
# Find the first table in the page
table = root.find('table', {'class': 'stats-table player-ratings-table'})
# Find all td with class "teamCol"
teams = table.find_all('td', {'class': 'teamCol'})
# Get img source & title from all img tags in teams
imgs = [(elm.get('src'), elm.get('title')) for team in teams for elm in team.find_all('img')]
# Create your DataFrame
df = pd.DataFrame(imgs, columns=['source', 'title'])
print(df)
Output:
source title
0 https://static.hltv.org/images/team/logo/9565 Vitality
1 https://static.hltv.org/images/team/logo/5639 aAa
2 https://static.hltv.org/images/team/logo/4608 Natus Vincere
3 https://static.hltv.org/images/team/logo/5988 FlipSid3
4 https://static.hltv.org/images/team/logo/5310 HellRaisers
... ... ...
1753 https://static.hltv.org/images/team/logo/4602 Tricked
1754 https://static.hltv.org/images/team/logo/4501 ALTERNATE aTTaX
1755 https://static.hltv.org/images/team/logo/7217 subtLe
1756 https://static.hltv.org/images/team/logo/5454 SKDC
1757 https://static.hltv.org/images/team/logo/6301 Splyce
[1758 rows x 2 columns]
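If you want one row per player rather than one row per logo, the image titles can be joined per teamCol cell and attached to the table parsed by read_html; a sketch assuming the table and teams variables above (and that there is exactly one teamCol cell per player row):

players = pd.read_html(str(table), header=0)[0]

# join the team titles found inside each player's teamCol cell
players['Teams'] = [', '.join(img.get('title', '') for img in td.find_all('img')) for td in teams]

print(players[['Player', 'Teams']].head())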

Unable to webscrape HTML table with BeautifulSoup and load it into a Pandas dataframe with Python

My objective is to access the table on the following webpage https://www.countries-ofthe-world.com/world-currencies.html and turn it into a Pandas dataframe that has columns "Country or territory", "Currency", and "ISO-4217".
I am able to access the columns correctly, but I am having a hard time figuring out how to append each row to a dataframe. Do you all have any suggestions on how I can do this? For example, on the webpage, the first row in the table is the letter "A". However, I need the first row in the dataframe to be Afghanistan, Afghan afghani, and AFN.
Here is what I have so far:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.countries-ofthe-world.com/world-currencies.html"
req = Request(url, headers={"User-Agent":"Mozilla/5.0"})
webpage=urlopen(req).read()
soup = BeautifulSoup(webpage, "html.parser")
table = soup.find("table", {"class":"codes"})
rows = table.find_all('tr')
columns = [v.text for v in rows[0].find_all('th')]
print(columns) # ['Country or territory', 'Currency', 'ISO-4217']
Thank you all for your time.
Tony
Starting from your request code, the page is something that can be pretty easily parsed by pd.read_html:
url = "https://www.countries-ofthe-world.com/world-currencies.html"
req = Request(url, headers={"User-Agent":"Mozilla/5.0"})
webpage = urlopen(req).read()
df = pd.read_html(webpage)[0]
print(df.head())
Country or territory Currency ISO-4217
0 A A A
1 Afghanistan Afghan afghani AFN
2 Akrotiri and Dhekelia (UK) European euro EUR
3 Aland Islands (Finland) European euro EUR
4 Albania Albanian lek ALL
It has those alphabet headers, but you can get rid of those with something like df = df[df['Currency'] != df['ISO-4217']]
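If you prefer to keep the row-by-row BeautifulSoup approach from the question, a minimal sketch (assuming the rows and columns variables built above, and that the single-letter divider rows have fewer cells than the header) would be:

data = []
for row in rows[1:]:                      # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if len(cells) == len(columns):        # skip the single-letter divider rows
        data.append(dict(zip(columns, cells)))

df = pd.DataFrame(data)
print(df.head())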
