import requests
from bs4 import BeautifulSoup
URL = 'https://www.mohfw.gov.in/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table')
table_body = table.find_all('tbody')
print(table_body)
I want the tbody that is outside the comment. Every layer of tr and td has a span section, and there are many layers of these.
Some of the tbody content you want to grab from that page is generated dynamically, but if you look in the dev tools you can find a link that serves the same data as JSON. All the data should be there.
Try this:
import requests
URL = 'https://www.mohfw.gov.in/data/datanew.json'
page = requests.get(URL, headers={"x-requested-with": "XMLHttpRequest"})
for item in page.json():
    sno = item['sno']
    state_name = item['state_name']
    active = item['active']
    positive = item['positive']
    cured = item['cured']
    death = item['death']
    new_active = item['new_active']
    new_positive = item['new_positive']
    new_cured = item['new_cured']
    new_death = item['new_death']
    state_code = item['state_code']
    print(sno, state_name, active, positive, cured, death, new_active, new_positive, new_cured, new_death, state_code)
The output looks like:
2 Andaman and Nicobar Islands 677 2945 2231 37 635 2985 2309 41 35
1 Andhra Pradesh 89932 371639 278247 3460 92208 382469 286720 3541 28
3 Arunachal Pradesh 899 3412 2508 5 987 3555 2563 5 12
4 Assam 19518 94592 74814 260 19535 96771 76962 274 18
5 Bihar 19716 124536 104301 519 19823 126714 106361 530 10
6 Chandigarh 1456 3209 1713 40 1539 3376 1796 41 04
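If you would rather work with the data as a table, a minimal sketch (assuming pandas is installed) is to load the JSON straight into a DataFrame:
import requests
import pandas as pd

URL = 'https://www.mohfw.gov.in/data/datanew.json'
data = requests.get(URL, headers={"x-requested-with": "XMLHttpRequest"}).json()
# One row per state; the columns take their names from the JSON keys shown above
df = pd.DataFrame(data)
print(df[['sno', 'state_name', 'active', 'positive', 'cured', 'death']])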
I'm trying to scrape some data for training, but I'm stuck.
I would like to scrape the date, not just the year, but I couldn't quite figure out how to do it so far.
Here's the segment I would like to scrape:
[screenshot of the HTML segment]
And here's my script so far:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
url = "https://www.senscritique.com/films/tops/top111"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
titles = []
years = []
notes = []
synopsys = []
infos = []
dates = []
movie_div = soup.find_all('div', class_='elto-flexible-column')
for container in movie_div:
    title = container.h2.a.text
    titles.append(title)
    year = container.h2.find('span', class_='elco-date').text
    year = year.replace('(', '')
    year = year.replace(')', '')
    years.append(year)
    sy = container.find('p', class_='elco-description').text
    synopsys.append(sy)
    note = float(container.div.a.text)
    notes.append(note)
    info = container.find('p', class_='elco-baseline elco-options').text
    #type = re.sub(r'[a-z]+', '', type)
    infos.append(info)
    soup = container.find('p', class_='elco-baseline elco-options')
    for i in soup:
        i = soup.find('time')
        dates.append(i)
print(dates[0])
And here are the results:
[screenshot of the result]
I would like to just get the "1957-04-10" or the "10 avril 1957", either one! But I cannot figure it out! I've tried many things, and this is the best I've managed so far.
Thanks :)
You can use the .text property of the <time> tag to get the date text:
import requests
from bs4 import BeautifulSoup
url = 'https://www.senscritique.com/films/tops/top111'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for movie in soup.select('.elto-item'):
    title = movie.select_one('[id^="product-title"]').text
    time = movie.select_one('time')
    time = time.text if time else '-'
    print('{:<40} {}'.format(title, time))
Prints:
12 hommes en colère 10 avril 1957
Harakiri 16 septembre 1962
Barberousse 3 avril 1965
Le Bon, la Brute et le Truand 23 décembre 1966
Les Sept Samouraïs 26 avril 1954
Il était une fois dans l'Ouest 21 décembre 1968
Il était une fois en Amérique 23 mai 1984
Le Parrain 24 mars 1972
Le Trou 18 mars 1960
Dersou Ouzala 2 août 1975
Point limite 7 octobre 1964
Entre le ciel et l'enfer 1 mars 1963
...and so on.
I think something like this would do it for you, just returning the date.
tags = soup('time')
date_formatted = list()
for tag in tags:
    date_formatted.append(tag.contents[0])
print(date_formatted[0])
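If you specifically want the ISO form such as 1957-04-10, the <time> tags often carry it in a datetime attribute (an assumption about this page's markup rather than something verified here); a minimal sketch:
import requests
from bs4 import BeautifulSoup

url = 'https://www.senscritique.com/films/tops/top111'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for tag in soup('time'):
    # .get() returns None when the attribute is missing, so fall back to the visible text
    print(tag.get('datetime') or tag.text)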
I need help getting the Teams column from the table at https://www.hltv.org/stats
This code gives me all the values from the table, but I don't get the team values because they are in the form of images (hyperlinks). I want to get the titles of the teams.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

r = requests.get("https://www.hltv.org/stats/players")
# Parse the pulled page
root = bs(r.content, "html.parser")
# Pull the player data out of the table and put it into a pandas DataFrame
table = str(root.find("table"))
players = pd.read_html(table, header=0)[0]
I need to get all the teams as a pandas column with "Teams" as the header.
Please help.
Since the team name is contained in the alt attribute of the team images, you can simply replace the <td> content with the values from the alt attributes:
table = root.find("table")
for td in table('td', class_='teamCol'):
    teams = [img['alt'] for img in td('img')]
    td.string = ', '.join(teams)
players = pd.read_html(str(table), header=0)[0]
Gives
Player Teams Maps K-D Diff K/D Rating1.0
0 ZywOo Vitality, aAa 612 3853 1.39 1.29
1 s1mple Natus Vincere, FlipSid3, HellRaisers 1153 6153 1.31 1.24
2 sh1ro Gambit Youngsters 317 1848 1.39 1.21
3 Kaze ViCi, Flash, MVP.karnal 613 3026 1.31 1.20
[...]
You can do something like this using requests, pandas and BeautifulSoup:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
req = requests.get("https://www.hltv.org/stats/players")
root = bs(req.text, "html.parser")
# Find the first table in the page
table = root.find('table', {'class': 'stats-table player-ratings-table'})
# Find all td with class "teamCol"
teams = table.find_all('td', {'class': 'teamCol'})
# Get img source & title from all img tags in teams
imgs = [(elm.get('src'), elm.get('title')) for team in teams for elm in team.find_all('img')]
# Create your DataFrame
df = pd.DataFrame(imgs, columns=['source', 'title'])
print(df)
Output:
source title
0 https://static.hltv.org/images/team/logo/9565 Vitality
1 https://static.hltv.org/images/team/logo/5639 aAa
2 https://static.hltv.org/images/team/logo/4608 Natus Vincere
3 https://static.hltv.org/images/team/logo/5988 FlipSid3
4 https://static.hltv.org/images/team/logo/5310 HellRaisers
... ... ...
1753 https://static.hltv.org/images/team/logo/4602 Tricked
1754 https://static.hltv.org/images/team/logo/4501 ALTERNATE aTTaX
1755 https://static.hltv.org/images/team/logo/7217 subtLe
1756 https://static.hltv.org/images/team/logo/5454 SKDC
1757 https://static.hltv.org/images/team/logo/6301 Splyce
[1758 rows x 2 columns]
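If you want a single Teams column that lines up with one row per player (as the question asks), a small follow-on sketch that reuses the teams list of table cells from above is to join the image titles per cell:
# One comma-separated string of team names per player row
teams_per_player = [', '.join(img.get('title', '') for img in td.find_all('img')) for td in teams]
df_teams = pd.DataFrame({'Teams': teams_per_player})
print(df_teams)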
I'm trying to scrape data into a CSV file from a website that lists contact information for people in my industry. My code works well until I get to a page where one of the entries doesn't have a specific item.
So for example:
I'm trying to collect
Name, Phone, Profile URL
If there isn't a phone number listed, there isn't even a tag for that field on the page, and my code errors out with
"IndexError: list index out of range"
I'm pretty new to this, but what I've managed to cobble together so far from various youtube tutorials/this site has really saved me a ton of time completing some tasks that would take me days otherwise. I'd appreciate any help that anyone is willing to offer.
I've tried various if/else statements along the lines of "if the variable is null, set it to 'Empty'", but without success.
Edit:
I updated the code. I switched to CSS selectors for more specificity and readability. I also added a try/except to at least bypass the index error, but it doesn't solve the problem of incorrect data being stored because each field has an uneven number of results. The site I'm trying to scrape is now in the code as well.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Firefox()
MAX_PAGE_NUM = 5
MAX_PAGE_DIG = 2
with open('results.csv', 'w') as f:
    f.write("Name, Number, URL \n")

# Run through pages
for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
    website = "https://www.realtor.com/realestateagents/lansing_mi/pg-" + page_num
    driver.get(website)
    Name = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-name.text-bold > a')
    Number = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-phone.hidden-xs.hidden-xxs')
    URL = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-name.text-bold > a')
    # Collect data from each page
    num_page_items = len(Name)
    with open('results.csv', 'a') as f:
        for i in range(num_page_items):
            try:
                f.write(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
                print(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
            except IndexError:
                f.write("Skip, Skip, Skip \n")
                print("Number Missing")
                continue
driver.close()
If any of the fields I'm trying to collect don't exist on individual listings, I just want the empty field to be filled in as "Empty" on the spreadsheet.
You could use try/except to take care of that. I also opted to use Pandas and BeautifulSoup as I'm more familiar with those.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
MAX_PAGE_NUM = 5
MAX_PAGE_DIG = 2
results = pd.DataFrame()

# Run through pages
for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
    website = "https://www.realtor.com/realestateagents/lansing_mi/pg-" + page_num
    driver.get(website)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    agent_cards = soup.find_all('div', {'class': 'agent-list-card clearfix'})
    for agent in agent_cards:
        try:
            Name = agent.find('div', {'itemprop': 'name'}).text.strip().split('\n')[0]
        except:
            Name = None
        try:
            Number = agent.find('div', {'itemprop': 'telephone'}).text.strip()
        except:
            Number = None
        try:
            URL = 'https://www.realtor.com/' + agent.find('a', href=True)['href']
        except:
            URL = None
        temp_df = pd.DataFrame([[Name, Number, URL]], columns=['Name', 'Number', 'URL'])
        results = results.append(temp_df, sort=True).reset_index(drop=True)
    print('Processed page: %s' % i)
driver.close()
results.to_csv('results.csv', index=False)
Output:
print (results)
Name ... URL
0 Nicole Enz ... https://www.realtor.com//realestateagents/nico...
1 Jennifer Worthington ... https://www.realtor.com//realestateagents/jenn...
2 Katherine Keener ... https://www.realtor.com//realestateagents/kath...
3 Erica Cook ... https://www.realtor.com//realestateagents/eric...
4 Jeff Thornton, Broker, Assoc Broker ... https://www.realtor.com//realestateagents/jeff...
5 Neal Sanford, Agent ... https://www.realtor.com//realestateagents/neal...
6 Sherree Zea ... https://www.realtor.com//realestateagents/sher...
7 Jennifer Cooper ... https://www.realtor.com//realestateagents/jenn...
8 Charlyn Cosgrove ... https://www.realtor.com//realestateagents/char...
9 Kathy Birchen & Chad Dutcher ... https://www.realtor.com//realestateagents/kath...
10 Nancy Petroff ... https://www.realtor.com//realestateagents/nanc...
11 The Angela Averill Team ... https://www.realtor.com//realestateagents/the-...
12 Christina Tamburino ... https://www.realtor.com//realestateagents/chri...
13 Rayce O'Connell ... https://www.realtor.com//realestateagents/rayc...
14 Stephanie Morey ... https://www.realtor.com//realestateagents/step...
15 Sean Gardner ... https://www.realtor.com//realestateagents/sean...
16 John Burg ... https://www.realtor.com//realestateagents/john...
17 Linda Ellsworth-Moore ... https://www.realtor.com//realestateagents/lind...
18 David Bueche ... https://www.realtor.com//realestateagents/davi...
19 David Ledebuhr ... https://www.realtor.com//realestateagents/davi...
20 Aaron Fox ... https://www.realtor.com//realestateagents/aaro...
21 Kristy Seibold ... https://www.realtor.com//realestateagents/kris...
22 Genia Beckman ... https://www.realtor.com//realestateagents/geni...
23 Angela Bolan ... https://www.realtor.com//realestateagents/ange...
24 Constance Benca ... https://www.realtor.com//realestateagents/cons...
25 Lisa Fata ... https://www.realtor.com//realestateagents/lisa...
26 Mike Dedman ... https://www.realtor.com//realestateagents/mike...
27 Jamie Masarik ... https://www.realtor.com//realestateagents/jami...
28 Amy Yaroch ... https://www.realtor.com//realestateagents/amy-...
29 Debbie McCarthy ... https://www.realtor.com//realestateagents/debb...
.. ... ... ...
70 Vickie Blattner ... https://www.realtor.com//realestateagents/vick...
71 Faith F Steller ... https://www.realtor.com//realestateagents/fait...
72 A. Jason Titus ... https://www.realtor.com//realestateagents/a.--...
73 Matt Bunn ... https://www.realtor.com//realestateagents/matt...
74 Joe Vitale ... https://www.realtor.com//realestateagents/joe-...
75 Reozom Real Estate ... https://www.realtor.com//realestateagents/reoz...
76 Shane Broyles ... https://www.realtor.com//realestateagents/shan...
77 Megan Doyle-Busque ... https://www.realtor.com//realestateagents/mega...
78 Linda Holmes ... https://www.realtor.com//realestateagents/lind...
79 Jeff Burke ... https://www.realtor.com//realestateagents/jeff...
80 Jim Convissor ... https://www.realtor.com//realestateagents/jim-...
81 Concetta D'Agostino ... https://www.realtor.com//realestateagents/conc...
82 Melanie McNamara ... https://www.realtor.com//realestateagents/mela...
83 Julie Adams ... https://www.realtor.com//realestateagents/juli...
84 Liz Horford ... https://www.realtor.com//realestateagents/liz-...
85 Miriam Olsen ... https://www.realtor.com//realestateagents/miri...
86 Wanda Williams ... https://www.realtor.com//realestateagents/wand...
87 Troy Seyfert ... https://www.realtor.com//realestateagents/troy...
88 Maggie Gerich ... https://www.realtor.com//realestateagents/magg...
89 Laura Farhat Bramson ... https://www.realtor.com//realestateagents/laur...
90 Peter MacIntyre ... https://www.realtor.com//realestateagents/pete...
91 Mark Jacobsen ... https://www.realtor.com//realestateagents/mark...
92 Deb Good ... https://www.realtor.com//realestateagents/deb-...
93 Mary Jane Vanderstow ... https://www.realtor.com//realestateagents/mary...
94 Ben Magsig ... https://www.realtor.com//realestateagents/ben-...
95 Brenna Chamberlain ... https://www.realtor.com//realestateagents/bren...
96 Deborah Cooper, CNS ... https://www.realtor.com//realestateagents/debo...
97 Huggler, Bashore & Brooks ... https://www.realtor.com//realestateagents/hugg...
98 Jodey Shepardson Custack ... https://www.realtor.com//realestateagents/jode...
99 Madaline Alspaugh-Young ... https://www.realtor.com//realestateagents/mada...
[100 rows x 3 columns]
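To get the literal "Empty" placeholder the question asks for instead of blank cells, you could fill in the missing values just before saving, for example:
results.fillna('Empty').to_csv('results.csv', index=False)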
I'm using Beautiful Soup and I want to scrape the data (transfer fees and player names) from this site: www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000
But you'll notice that the page only displays the first 25 names. You have to click 'next' to view the next 25 names, and so on for ten pages. However, the URL doesn't change.
I'm using this code from fcpython.com:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent':
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
page1 = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2018&land_id=157&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&w_s=s"
page2 = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2018&land_id=157&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&w_s=s"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})
# My code for printing all 25 names and fees
#for i in range(0, 25):
#    print(Players[i].text, Values[i].text)
PlayersList = []
ValuesList = []
for i in range(0, 25):
    PlayersList.append(Players[i].text)
    ValuesList.append(Values[i].text)
df = pd.DataFrame({"Players": PlayersList, "Values": ValuesList})
print(df.head(25))
What am I doing wrong? What can I do to get all the results in one go? Or get them at all (since I can't go past 25)?
The following code achieves your goal. You have to use a webdriver to click the next button.
from selenium import webdriver
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000")
PlayersList = []
ValuesList = []
for loop in range(0, 10):
    # Re-parse the current page source on every iteration, otherwise only the first page gets scraped
    pageSoup = BeautifulSoup(driver.page_source, 'html.parser')
    Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
    Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})
    for pl, val in zip(Players, Values):
        PlayersList.append(pl.text)
        ValuesList.append(val.text)
    if loop == 9:
        break
    else:
        driver.find_element_by_css_selector("li.naechste-seite").click()
        time.sleep(2)  # give the next page a moment to load
df = pd.DataFrame({"Players":PlayersList,"Values":ValuesList})
print(df.head(250))
Output:
Players Values
0 Luís Figo £54.00m
1 Hernán Crespo £51.13m
2 Marc Overmars £36.00m
3 Gabriel Batistuta £32.54m
4 Nicolas Anelka £31.05m
5 Rio Ferdinand £23.40m
6 Flávio Conceicao £22.50m
7 Savo Milosevic £22.50m
8 David Trézéguet £20.92m
9 Claudio López £20.70m
10 Jimmy Floyd Hasselbaink £20.25m
11 Gerard López £19.44m
12 Lucas £19.17m
13 Pablo Aimar £19.13m
14 Wálter Samuel £18.72m
15 Shabani Nonda £18.00m
16 Robbie Keane £17.55m
17 José Mari £17.10m
18 Jonathan Zebina £16.56m
19 Émerson £16.20m
20 Tore André Flo £16.20m
21 Serhii Rebrov £16.20m
22 Angelo Peruzzi £16.11m
23 Diego Tristán £15.98m
24 Sylvain Wiltord £15.75m
25 Luís Figo £54.00m
26 Hernán Crespo £51.13m
27 Marc Overmars £36.00m
28 Gabriel Batistuta £32.54m
29 Nicolas Anelka £31.05m
.. ... ...
220 Tore André Flo £16.20m
221 Serhii Rebrov £16.20m
222 Angelo Peruzzi £16.11m
223 Diego Tristán £15.98m
224 Sylvain Wiltord £15.75m
225 Luís Figo £54.00m
226 Hernán Crespo £51.13m
227 Marc Overmars £36.00m
228 Gabriel Batistuta £32.54m
229 Nicolas Anelka £31.05m
230 Rio Ferdinand £23.40m
231 Flávio Conceicao £22.50m
232 Savo Milosevic £22.50m
233 David Trézéguet £20.92m
234 Claudio López £20.70m
235 Jimmy Floyd Hasselbaink £20.25m
236 Gerard López £19.44m
237 Lucas £19.17m
238 Pablo Aimar £19.13m
239 Wálter Samuel £18.72m
240 Shabani Nonda £18.00m
241 Robbie Keane £17.55m
242 José Mari £17.10m
243 Jonathan Zebina £16.56m
244 Émerson £16.20m
245 Tore André Flo £16.20m
246 Serhii Rebrov £16.20m
247 Angelo Peruzzi £16.11m
248 Diego Tristán £15.98m
249 Sylvain Wiltord £15.75m
You can use requests.Session and the AJAX request made by the website, which you can find in your browser's dev tools, as suggested by #NineBerry in the comments.
This will add all the players and values to the lists:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
PlayersList = []
ValuesList = []
page_num = 2
session = requests.Session()
while True:
    pageTree = session.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
    Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})
    for player, value in zip(Players, Values):
        PlayersList.append(player.text)
        ValuesList.append(value.text)
    if pageSoup.find("li", {"title": "Go to next page"}):
        # Subsequent pages are served by the same AJAX endpoint the site itself calls
        page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/ajax/yw2/saison_id/2000/plus/0/galerie/0/page/{}?ajax=yw2".format(page_num)
        page_num += 1
    else:
        break
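To end up with the same kind of DataFrame as in the question, you can then combine the two lists (assuming pandas is available):
import pandas as pd

df = pd.DataFrame({"Players": PlayersList, "Values": ValuesList})
print(df)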
I'm faced with the following challenge: I want to get all the financial data about companies. I wrote code that does it, and let's say the result looks like the below:
Unnamed: 0 I Q 2017 II Q 2017 \
0 Przychody netto ze sprzedaży (tys. zł) 137 134
1 Zysk (strata) z działal. oper. (tys. zł) -423 -358
2 Zysk (strata) brutto (tys. zł) -501 -280
3 Zysk (strata) netto (tys. zł)* -399 -263
4 Amortyzacja (tys. zł) 134 110
5 EBITDA (tys. zł) -289 -248
6 Aktywa (tys. zł) 27 845 26 530
7 Kapitał własny (tys. zł)* 22 852 22 589
8 Liczba akcji (tys. szt.) 13 921,975 13 921,975
9 Zysk na akcję (zł) -0029 -0019
10 Wartość księgowa na akcję (zł) 1641 1623
11 Raport zbadany przez audytora N N
but repeated 464 times.
Unfortunately, when I want to save all 464 results in one CSV file, I can only save the last result, not all 464 results, just one... Could you help me save them all? Below is my code.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.bankier.pl/gielda/notowania/akcje'
page = requests.get(url)
soup = BeautifulSoup(page.content,'lxml')
# Find the first table on the page
t = soup.find_all('table')[0]
# Read the table into a Pandas DataFrame
df = pd.read_html(str(t))[0]
# Get the company names
names_of_company = df["Walor AD"].values
links_to_financial_date = []
# Build all links from the company names
links = []
for i in range(len(names_of_company)):
    new_string = 'https://www.bankier.pl/gielda/notowania/akcje/' + names_of_company[i] + '/wyniki-finansowe'
    links.append(new_string)
############################################################################
for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
    page2 = requests.get(url2)
    soup = BeautifulSoup(page2.content, 'lxml')
    # Find the first table on the page
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    df2.to_csv('output.csv', index=False, header=None)
You've almost got it. You're just overwriting your CSV each time. Replace
df2.to_csv('output.csv', index=False, header=None)
with
with open('output.csv', 'a') as f:
    df2.to_csv(f, header=False)
in order to append to the CSV instead of overwriting it.
Also, your example doesn't work because this:
for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
should be:
for i in links:
    url2 = i
When the website has no data, skip and move on to the next one:
try:
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    with open('output.csv', 'a') as f:
        df2.to_csv(f, header=False)
except:
    # No financial table on this page, so skip this company
    pass
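As an alternative sketch (assuming the links list and imports from the question), you can collect every table into a list and write the CSV once at the end, which avoids reopening the file on each iteration:
all_frames = []
for link in links:
    page2 = requests.get(link)
    soup2 = BeautifulSoup(page2.content, 'lxml')
    tables = soup2.find_all('table')
    if not tables:
        # No financial data for this company, so skip it
        continue
    all_frames.append(pd.read_html(str(tables[0]))[0])

pd.concat(all_frames).to_csv('output.csv', index=False)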