I am trying to scrape article data from Trading Economics with requests and bs4. Below is the script I wrote:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://tradingeconomics.com/stream?c=country+list'
r = requests.get(url).text
parser = BeautifulSoup(r, 'html.parser')
li = parser.find_all('li', class_='list-group-item')
print(li)
I can see from inspecting the webpage that each article is in an li element with class='list-group-item'. Strangely, bs4 cannot find any li items and returns an empty list. What could be the cause?
The data you see on the page is loaded dynamically via JavaScript, so BeautifulSoup doesn't see it. You can simulate the JavaScript request, as in this example:
import json
import requests
url = "https://tradingeconomics.com/ws/stream.ashx"
params = {"start": 0, "size": 20, "c": "country list"}
data = requests.get(url, params=params).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for item in data:
    print(item["title"])
    print(item["description"])
    print(item["url"])
    print("-" * 80)
Prints:
Week Ahead
Central banks in the US, Brazil, Japan, the UK and Turkey will be deciding on monetary policy in the coming week, while key data to watch for include US and China industrial output and retail sales; Japan and Canada inflation data; UK consumer morale; Australia employment figures and retail trade; New Zealand Q4 GDP; and India wholesale prices.
/calendar?article=29075&g=top&importance=2&startdate=2021-03-12
--------------------------------------------------------------------------------
Week Ahead
The US Senate will be debating and voting on the President Biden's $1.9 trillion coronavirus aid bill during the weekend, and send it back to the House to be signed into law as soon as next week. Elsewhere, monetary policy meetings in the Eurozone and Canada will be keenly watched, as well as updated GDP figures for Japan, the Eurozone, the UK and South Africa. Other important releases include US, China and India inflation data; US and Australia consumer sentiment; UK and China foreign trade; Eurozone and India industrial output; and Japan current account.
/calendar?article=29072&g=top&importance=2&startdate=2021-03-05
--------------------------------------------------------------------------------
Week Ahead
After President Biden's $1.9 trillion pandemic aid bill narrowly passed the House in the early hours of Saturday, all attention would move now to vote in Senate. Meanwhile, investors will also turn the eyes to US jobs report due Friday; GDP data for Australia, Brazil, Canada and Turkey as well as worldwide manufacturing and services PMI surveys and monetary policy action by the Reserve Bank of Australia. Other releases include trade figures for the US and Canada, factory orders for the US and Germany, unemployment rate for the Euro Area and Japan, and industrial output and retail sales for South Korea.
/calendar?article=29071&g=top&importance=2&startdate=2021-02-27
--------------------------------------------------------------------------------
...and so on.
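The `url` values printed above are relative paths. A minimal sketch of turning them into full links and of building the parameters for later pages follows. It assumes that `start` is an item offset and that the site root is the right base for the links; neither is confirmed by the answer, and the names `BASE_URL`, `absolute_url` and `page_params` are made up for illustration.

```python
from urllib.parse import urljoin

# Assumption: relative article links resolve against the site root.
BASE_URL = "https://tradingeconomics.com"


def absolute_url(relative_url, base=BASE_URL):
    """Turn a relative article path into a full URL."""
    return urljoin(base, relative_url)


def page_params(page, size=20, country="country list"):
    """Build the query parameters for one page of results.

    Assumption: `start` is an item offset, so page 1 starts at `size`.
    """
    return {"start": page * size, "size": size, "c": country}


print(absolute_url("/calendar?article=29075&g=top&importance=2&startdate=2021-03-12"))
print(page_params(1))
```

Each dictionary from `page_params` could be passed as `params=` to the same `requests.get` call used above.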
I have a table showing order projections at the pincode level.
How I'm creating the chart:
Create a Data Studio data source and add the relevant Google Sheet containing the data below.
Set the type of the pincode field to Postal code.
Create a report > create a Bubble Map and plot the pincodes with the projected load.
Output Data Chart
Sample Chart for Projected Load
https://datastudio.google.com/reporting/74fcbbe7-de10-4162-ac5d-6e3fc3a23339
Chart 2 for Current Capacity - Showing capacity in Green Color
https://datastudio.google.com/reporting/93053f41-832a-4870-a823-cd010d97c469
Expected output: merge the two charts into one chart showing the projected load and the current capacity per postal code.
I want to add the current capacity in a different color to help decide where to build new capacity. #Logistics
Can this be done using Data Studio? If not, please suggest a different tool.
https://docs.google.com/spreadsheets/d/1N3OPettlFWYiNuIBG3SWKC007iYTAQmHwXzV2E87RgQ/edit#gid=0
https://docs.google.com/spreadsheets/d/1-nZY3pQ0Nz_QyQe7FWs5L-fk3X6q1x07umUKv51afq8/edit#gid=0
pincode | city | Country | Orders Projection
110081 | Delhi | India | 1962
110032 | Delhi | India | 1696
110043 | Delhi | India | 1386
110093 | Delhi | India | 828
110083 | Delhi | India | 816
110077 | Delhi | India | 579
110091 | Delhi | India | 537
110087 | Delhi | India | 412
110095 | Delhi | India | 345
Table 2 - Existing Capacity
pincode | city | Country | Current Capacity
110032 | Delhi | India | 250
201301 | Noida | India | 300
I am trying to scrape the table of research studies at the link below:
https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist=
The table's content is created dynamically using JavaScript.
I tried using Selenium, but I intermittently get a StaleElementReferenceException.
Please help me with this.
I want to retrieve all rows in the table and store them in a local database.
Here is what I have tried in Selenium:
import selenium.webdriver as webdriver
url = 'https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist='
driver = webdriver.Firefox()
#driver.implicitly_wait(30)
driver.get(url)
data = []
for tr in driver.find_elements_by_xpath('//table[@id="theDataTable"]//tbody//tr'):
    tds = tr.find_elements_by_tag_name('td')
    if tds:
        for td in tds:
            print(td.text)
            if td.text not in data:
                data.append(td.text)
driver.quit()
print('*********************************************************************')
print(data)
Later I will store the contents of the 'data' variable in the DB.
I am new to Selenium and web scraping. Further, I want to click on each link in the 'Study Title' column and extract data from that page for each study.
I would like suggestions on how to avoid or handle the stale element exception, or an alternative to Selenium WebDriver.
Thanks in advance!
I tried the following code of mine and all the data are stored properly. Can you try it?
CODE
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get("https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist=")
time.sleep(5)
array = []
flag = True
next_counter = 0
time.sleep(4)
# Show 100 rows per page instead of the default 10
select = Select(driver.find_element_by_name('theDataTable_length'))
select.select_by_value('100')
time.sleep(5)
while flag:
    if next_counter == 13:
        print("Stopped")
        flag = False  # all 13 pages have been read, so leave the loop
    else:
        # Look the table body up again on every page so the references are fresh
        item = driver.find_elements_by_tag_name("tbody")[1]
        rows = item.find_elements_by_tag_name('tr')
        for x in range(len(rows)):
            for i in range(7):
                text = rows[x].find_elements_by_tag_name('td')[i].text
                array.append(text)  # keep the cells in reading order
                print(text)
        time.sleep(5)
        next_button = driver.find_element_by_id('theDataTable_next')
        next_button.click()
        next_counter = next_counter + 1
        time.sleep(7)
OUTPUT
1
Not yet recruiting
NEW
Indirect Endovenous Systemic Ozone for New Coronavirus Disease (COVID19) in Non-intubated Patients
COVID
Other: Systemic indirect endovenous ozone therapy
SEOT
Valencia, Spain
2
Recruiting
NEW
Prediction of Clinical Course in COVID19 Patients
COVID 19
Other: CT-Scan
Chu Saint-Etienne
Saint-Étienne, France
3
Not yet recruiting
NEW
Risks of COVID19 in the Pregnant Population
COVID19
Other: Biospecimen collection
Mayo Clinic in Rochester
Rochester, Minnesota, United States
4
Recruiting
NEW
Saved From COVID-19
COVID
Drug: Chloroquine
Drug: Placebo oral tablet
Columbia University Irving Medical Center/NYP
New York, New York, United States
5
Recruiting
NEW
Efficacy of Convalescent Plasma Therapy in Severely Sick COVID-19 Patients
COVID
Drug: Convalescent Plasma Transfusion
Other: Supportive Care
Drug: Random Donor Plasma
Maulana Azad medical College
New Delhi, Delhi, India
Institute of Liver and Biliary Sciences
New Delhi, Delhi, India
6
Not yet recruiting
NEW
A Real-life Experience on Treatment of Patients With COVID 19
COVID
Drug: Chloroquine
Drug: Favipiravir
Drug: Nitazoxanide
(and 3 more...)
Tanta university hospital
Tanta, Egypt
7
Recruiting
International COVID19 Clinical Evaluation Registry,
COVID 19
Combination Product: Observational (registry)
Hospital Lclinico San Carlos
Madrid, Spain
8
Completed
NEW
AiM COVID for Covid 19 Tracking and Prediction
COVID 19
Other: No Intervention
Aarogyam (UK)
Leicester, United Kingdom
9
Recruiting
NEW
Establishing a COVID-19 Prospective Cohort for Identification of Secondary HLH
COVID
Department of nephrology, Klinikum rechts der Isar
München, Bavaria, Germany
10
Recruiting
NEW
Max Ivermectin- COVID 19 Study Versus Standard of Care Treatment for COVID 19 Cases. A Pilot Study
COVID
Drug: Ivermectin
Max Super Speciality hospital, Saket (A unit of Devki Devi Foundation)
New Delhi, Delhi, India
My code follows these steps:
First, I select the option to show 100 results per page instead of 10, to save time when retrieving the data.
Second, I read all the results on the page (100), and when I finish I click the next-page button. Then a sleep call waits a few seconds for the new page to load. (You can do this in a better way, but I did it like this to give you something fast: you should replace the sleeps with an explicit wait that waits until the element is visible.)
After clicking the next-page button, I save the next results (100) again.
This runs until flag becomes False, which happens when next_counter reaches 13, the number of pages. 13 is the 1300 results divided by 100 (the maximum number of results per page): 1300 / 100 = 13 pages.
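The fixed sleeps in the steps above can be swapped for a wait that polls a condition until it holds. Selenium ships its own explicit wait for this (`WebDriverWait` in `selenium.webdriver.support.ui`). The sketch below only illustrates the underlying idea in plain Python; `wait_until` is a hypothetical helper, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError if it never appears.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in condition that becomes true on the third check.
checks = {"count": 0}


def table_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(table_loaded, timeout=2.0, poll_interval=0.01)
print(checks["count"])  # prints 3
```

In the scraper, the condition would be a function that checks whether the table rows of the new page are present, so the loop waits only as long as it has to.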
Editing and transferring the data is something you can manage yourself; it needs no Selenium or web automation knowledge. It's 100% plain Python.
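As a rough sketch of that last step, the flat list of cell texts can be cut into rows of 7 and written to a local SQLite database with the standard library. This assumes the cells were collected in reading order, 7 per table row. The table name `studies` and the column names `c1` to `c7` are placeholders, not the site's real column names.

```python
import sqlite3

COLUMNS_PER_ROW = 7  # the scraping loop reads 7 cells per table row


def chunk_rows(cells, width=COLUMNS_PER_ROW):
    """Split a flat list of cell texts into tuples of `width` cells.

    A trailing incomplete row, if any, is dropped.
    """
    return [tuple(cells[i:i + width])
            for i in range(0, len(cells) - width + 1, width)]


def store_rows(rows, db_path=":memory:"):
    """Insert the rows into a `studies` table and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS studies "
        "(c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT, c6 TEXT, c7 TEXT)"
    )
    conn.executemany("INSERT INTO studies VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Stand-in data: 14 cell texts, i.e. two table rows.
cells = [str(i) for i in range(14)]
rows = chunk_rows(cells)
conn = store_rows(rows)
print(conn.execute("SELECT COUNT(*) FROM studies").fetchone()[0])  # prints 2
```

Passing a file path as `db_path` instead of `":memory:"` keeps the data on disk.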
I am extracting text from an HTML file in Python using BeautifulSoup. I want to extract all the text and discard the tables. Is there a way to replace each table in the HTML with a piece of text (e.g. " #### There was a table here #### ")?
I was able to read the HTML file with BeautifulSoup and remove the tables using strip_tables, but I am not sure how to remove a table and replace it with text saying that a table was there.
def strip_tables(soup):
    """Removes all tables from the soup object."""
    for script in soup(["table"]):
        script.extract()
    return soup
sample_html_file = "/Path/file.html"
html = read_from_file(sample_html_file)
# This function reads the file and returns a file handle for beautifulsoup
soup = BeautifulSoup(html, "lxml")
my_text = strip_tables( soup ).text
This is the HTML file with a table:
By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine President and Chief Executive OfficerSunnyvale, California October 4, 2018
Table of Contents TABLE OF CONTENTS Page QUESTIONS AND ANSWERS REGARDING THIS SOLICITATION AND VOTING AT THE ANNUAL MEETING 1 PROPOSAL ONEELECTION OF DIRECTORS 7 Classes of our Board 7 Director NomineesClass III Directors 7 Continuing DirectorsClass I and Class II Directors 8 Board of Directors Recommendation 11 PROPOSAL TWOTO APPROVE AN AMENDMENT TO OUR 2016 EQUITY INCENTIVE PLAN TO INCREASE THE NUMBER OF SHARES OF COMMON STOCK AUTHORIZED FOR ISSUANCE UNDER SUCH PLAN 12 Summary of the Amended 2016 Plan 13 Summary of U.S. Federal Income Tax Consequences 20 New Plan Benefits 22 Existing Plan Benefits to Employees and Directors 23 Board of Directors Recommendation 23 PROPOSAL THREETO APPROVE AN AMENDMENT TO OUR 2007 EMPLOYEE STOCK PURCHASE PLAN TO INCREASE THE NUMBER OF SHARES OF COMMON STOCK AUTHORIZED FOR ISSUANCE UNDER SUCH PLAN A-1 APPENDIX B AMENDED AND RESTATED 2007 EMPLOYEE STOCK PURCHASE PLAN B-1 ii Table of Contents PROXY STATEMENT FOR ACCURAY INCORPORATED 2018 ANNUAL MEETING OF STOCKHOLDERS TO BE HELD ON NOVEMBER 16, 2018
This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)
This is the data after strip_tables:
By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine President and Chief Executive OfficerSunnyvale, California October 4, 2018
This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)
Expected outcome
By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine President and Chief Executive OfficerSunnyvale, California October 4, 2018
" #### There was a table here #### "
This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)
Try using replace_with() (also available under its older name replaceWith()) instead of extract() in the strip_tables function. Hope it helps.
def strip_tables(soup):
    """Replaces every table in the soup object with a placeholder string."""
    for table in soup(["table"]):
        table.replace_with(" #### There was a table here #### ")
    return soup
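For comparison, the same idea can be sketched without BeautifulSoup, using only the standard library's `html.parser`. This is a different technique from the answer above, shown as a dependency-free illustration. `TableSkippingExtractor` and `text_without_tables` are names made up for this example.

```python
from html.parser import HTMLParser

PLACEHOLDER = " #### There was a table here #### "


class TableSkippingExtractor(HTMLParser):
    """Collect page text, emitting one placeholder per outermost table."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.table_depth = 0  # greater than 0 while inside a <table>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.table_depth == 0:
                self.parts.append(PLACEHOLDER)
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        # Text inside a table is skipped; everything else is kept.
        if self.table_depth == 0:
            self.parts.append(data)


def text_without_tables(html):
    """Return the text of `html` with each table replaced by PLACEHOLDER."""
    parser = TableSkippingExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = "<p>Before</p><table><tr><td>x</td></tr></table><p>After</p>"
print(text_without_tables(sample))
# prints: Before #### There was a table here #### After
```

Tracking the nesting depth means a table nested inside another table produces a single placeholder, not two.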
I have a pandas table with a location column that contains a mix of cities, regions, and countries. The table looks like this:
Real Table | After conversion
Edinburgh, Scotland | UK
Nairobi, Kenya | Kenya
Manchester | UK
uk | UK
Sirajganj | Bangladesh
How can I do that in Python?
Or how can I convert coordinates to a country?
+05.0738+047.3288 to Somalia
+60.45148+022.26869 to Finland
+51.50853-000.12574 to United Kingdom
+33.24428-086.81638 to USA
+47.55839+007.57327 to Switzerland
This can be solved with the Yandex provider of the geocoder Python package, using a reverse lookup:
import geocoder
g = geocoder.yandex([55.95, 37.96], method='reverse')
print(g.json)
print(g.json['country'])
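The coordinates in the question are written as one string with a signed latitude followed by a signed longitude, while the reverse lookup above takes a pair of numbers. A minimal sketch of the conversion follows, assuming every string has the shape shown in the question. `parse_lat_lng` is a hypothetical helper name.

```python
import re

# A signed decimal latitude immediately followed by a signed decimal longitude.
COORD_PATTERN = re.compile(r"^([+-]\d+(?:\.\d+)?)([+-]\d+(?:\.\d+)?)$")


def parse_lat_lng(text):
    """Split a string like '+05.0738+047.3288' into [latitude, longitude]."""
    match = COORD_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a signed lat/lng pair: {text!r}")
    return [float(match.group(1)), float(match.group(2))]


print(parse_lat_lng("+05.0738+047.3288"))    # prints [5.0738, 47.3288]
print(parse_lat_lng("+51.50853-000.12574"))  # prints [51.50853, -0.12574]
```

The resulting list has the same shape as the `[55.95, 37.96]` argument in the answer above.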
I have some data that looks like this. (These are real locations, but there is no sensitive data.) It is formatted exactly as I posted:
Calgary, Alberta T2P 0B3 Canada
Thunder Bay, Ontario P7G 0A1 Canada
Toronto, Ontario M3A 1E5 Canada
Calgary, Alberta T2G 1C6 Canada
Calgary, Alberta T2N 1C2 Canada
Edmonton, Alberta T5J 2W8 Canada
New Westminster, British Columbia V3L 5T4 Canada
London, Ontario N6A 5P6 Canada
Guelph, Ontario N1G 2W1 Canada
Whitby, Ontario L1N 4Z1 Canada
Red Deer, Alberta T4R 0L4 Canada
Edmonton, Alberta T5H 0S2 Canada
West Kelowna, British Columbia V4T 3E3 Canada
I am trying to get the Canadian zip code (N6A 5P6, N1G 2W1, etc.) into one cell, so one cell per line with the zip code.
So it will look like:
Original data, zip code
Original data, zip code
I figured out how to do it for American zip codes, but the space in the Canadian ones is messing me up. Thank you.
I think this should work, assuming the text always ends with a seven-character zip code, a space, and Canada:
=MID(A1,SEARCH("Canada",A1)-8,7)
Alternatively, you could split the content on spaces and drop the final element of the split (which will always be Canada), keeping the elements at indexes -3 and -2 (the two halves of the zip code).
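The split approach can be sketched in Python as follows, assuming every line ends with the two halves of the code followed by the country name. `extract_postal_code` is a name chosen for this example.

```python
def extract_postal_code(address, country="Canada"):
    """Return the two tokens just before the trailing country name."""
    parts = address.split()
    if len(parts) < 3 or parts[-1] != country:
        raise ValueError(f"address does not end with a code and {country}: {address!r}")
    # parts[-1] is the country; parts[-3] and parts[-2] are the code halves.
    return " ".join(parts[-3:-1])


print(extract_postal_code("Calgary, Alberta T2P 0B3 Canada"))  # prints T2P 0B3
print(extract_postal_code("New Westminster, British Columbia V3L 5T4 Canada"))  # prints V3L 5T4
```

Because the tokens are counted from the end, multi-word city and province names do not affect the result.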