Web scraping a table created by a JavaScript function - python-3.x

I am trying to scrape the table of research studies at the link below:
https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist=
The table's content is created dynamically with JavaScript. I tried Selenium, but I intermittently get a StaleElementException. I want to retrieve all rows in the table and store them in a local database.
Here is what I have tried with Selenium:
import selenium.webdriver as webdriver

url = 'https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist='
driver = webdriver.Firefox()
# driver.implicitly_wait(30)
driver.get(url)

data = []
# Note: the XPath predicate must use @id, not #id.
for tr in driver.find_elements_by_xpath('//table[@id="theDataTable"]//tbody//tr'):
    tds = tr.find_elements_by_tag_name('td')
    if tds:
        for td in tds:
            print(td.text)
            if td.text not in data:
                data.append(td.text)
driver.quit()
print('*********************************************************************')
print(data)
I will then store the contents of the data variable in a database. I am new to Selenium and web scraping; beyond this, I also want to click each link in the 'Study Title' column and extract data from each study's page. I am looking for suggestions to avoid or handle the stale element exception, or for an alternative to Selenium WebDriver.
Thanks in advance!

I tried the following code and all the data were stored properly. Can you try it?
CODE
driver.get("https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist=")
time.sleep(5)
array = []
flag = True
next_counter = 0
time.sleep(4)
select = Select(driver.find_element_by_name('theDataTable_length'))
select.select_by_value('100')
time.sleep(5)
while flag == True:
if next_counter == 13:
print("Stoped")
else:
item = driver.find_elements_by_tag_name("tbody")[1]
rows = item.find_elements_by_tag_name('tr')
for x in range(len(rows)):
for i in range(7):
array.insert(x, rows[x].find_elements_by_tag_name('td')[i].text)
print(rows[x].find_elements_by_tag_name('td')[i].text)
time.sleep(5)
next = driver.find_element_by_id('theDataTable_next')
next.click()
next_counter = next_counter + 1
time.sleep(7)
OUTPUT
1
Not yet recruiting
NEW
Indirect Endovenous Systemic Ozone for New Coronavirus Disease (COVID19) in Non-intubated Patients
COVID
Other: Systemic indirect endovenous ozone therapy
SEOT
Valencia, Spain
2
Recruiting
NEW
Prediction of Clinical Course in COVID19 Patients
COVID 19
Other: CT-Scan
Chu Saint-Etienne
Saint-Étienne, France
3
Not yet recruiting
NEW
Risks of COVID19 in the Pregnant Population
COVID19
Other: Biospecimen collection
Mayo Clinic in Rochester
Rochester, Minnesota, United States
4
Recruiting
NEW
Saved From COVID-19
COVID
Drug: Chloroquine
Drug: Placebo oral tablet
Columbia University Irving Medical Center/NYP
New York, New York, United States
5
Recruiting
NEW
Efficacy of Convalescent Plasma Therapy in Severely Sick COVID-19 Patients
COVID
Drug: Convalescent Plasma Transfusion
Other: Supportive Care
Drug: Random Donor Plasma
Maulana Azad medical College
New Delhi, Delhi, India
Institute of Liver and Biliary Sciences
New Delhi, Delhi, India
6
Not yet recruiting
NEW
A Real-life Experience on Treatment of Patients With COVID 19
COVID
Drug: Chloroquine
Drug: Favipiravir
Drug: Nitazoxanide
(and 3 more...)
Tanta university hospital
Tanta, Egypt
7
Recruiting
International COVID19 Clinical Evaluation Registry,
COVID 19
Combination Product: Observational (registry)
Hospital Lclinico San Carlos
Madrid, Spain
8
Completed
NEW
AiM COVID for Covid 19 Tracking and Prediction
COVID 19
Other: No Intervention
Aarogyam (UK)
Leicester, United Kingdom
9
Recruiting
NEW
Establishing a COVID-19 Prospective Cohort for Identification of Secondary HLH
COVID
Department of nephrology, Klinikum rechts der Isar
München, Bavaria, Germany
10
Recruiting
NEW
Max Ivermectin- COVID 19 Study Versus Standard of Care Treatment for COVID 19 Cases. A Pilot Study
COVID
Drug: Ivermectin
Max Super Speciality hospital, Saket (A unit of Devki Devi Foundation)
New Delhi, Delhi, India
My code follows this logic:
First, I select the option to show 100 results instead of 10, to save time while retrieving the data.
Second, I read all 100 results on the page and then click the next-page button. The sleep commands wait for the page to load (you can do this in a better way, but I did it like this to give you something fast; ideally you would replace the sleeps with explicit waits that block until an element is visible - see the sketch below).
After clicking the next-page button, I again save the 100 results, and so on.
This loop runs until the flag becomes False, which happens once next_counter reaches 13, the maximum. 13 is simply 1300 results divided by 100 results per page, so there are 13 pages.
Editing and transferring the data is something you can manage without any Selenium or web-automation knowledge; it is a pure Python concern.
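On the asker's StaleElementException itself: the robust fix is to replace fixed sleeps with explicit waits and to re-locate elements after every page change, never reusing references captured before a click. A minimal sketch of that idea follows (the element ids come from the page above; treating a "disabled" class on the pager as the end-of-pages signal is an assumption that the table follows the usual DataTables convention):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

ROWS_XPATH = '//table[@id="theDataTable"]//tbody/tr'

driver = webdriver.Firefox()
driver.get('https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist=')
wait = WebDriverWait(driver, 30)

data = []
while True:
    # Wait for rows to be present, then re-locate them fresh on every page
    # instead of reusing element references from before a click.
    wait.until(EC.presence_of_all_elements_located((By.XPATH, ROWS_XPATH)))
    for tr in driver.find_elements(By.XPATH, ROWS_XPATH):
        data.append([td.text for td in tr.find_elements(By.TAG_NAME, 'td')])
    next_button = driver.find_element(By.ID, 'theDataTable_next')
    # Assumption: the pager marks the next button "disabled" on the last page.
    if 'disabled' in (next_button.get_attribute('class') or ''):
        break
    first_row = driver.find_element(By.XPATH, ROWS_XPATH)
    next_button.click()
    # The old first row going stale is the signal that the redraw happened.
    wait.until(EC.staleness_of(first_row))
driver.quit()

Here staleness is used as a signal rather than suffered as an exception: once the pre-click row reference goes stale, the next page has been rendered and can be read safely.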

Related

How to group similar news texts / articles in a pandas dataframe

I have a pandas dataframe of news articles. Suppose:

| id | news title | keywords | publication date | content |
| --- | --- | --- | --- | --- |
| 1 | Congress Wants to Beef Up Army Effort to Develop Counter-Drone Weapons | USA,Congress,Drone,Army | 2020-12-10 | SOME NEWS CONTENT |
| 2 | Israel conflict: The range and scale of Hamas' weapons ... | Israel,Hamas,Conflict | 2020-12-10 | NEWS CONTENT |
| 3 | US Air Force progresses testing of anti-drone laser weapons | USA,Air Force,Weapon,Dron | 2020-10-10 | NEWS CONTENT |
| 4 | Hamas fighters display weapons in Gaza after truce with Israel | Hamas,Gaza,Israel,Weapon,Truce | 2020-11-10 | NEWS CONTENT |

Now, how do I group similar rows based on the news content and sort each group by publication date? (Note: the content may be a summary of the news.)
So that it displays as:
Group 1

| id | news title | keywords | publication date | content |
| --- | --- | --- | --- | --- |
| 3 | US Air Force progresses testing of anti-drone laser weapons | USA,Air Force,Weapon,Dron | 2020-10-10 | NEWS CONTENT |
| 1 | Congress Wants to Beef Up Army Effort to Develop Counter-Drone Weapons | USA,Congress,Drone,Army | 2020-12-10 | SOME NEWS CONTENT |

Group 2

| id | news title | keywords | publication date | content |
| --- | --- | --- | --- | --- |
| 4 | Hamas fighters display weapons in Gaza after truce with Israel | Hamas,Gaza,Israel,Weapon,Truce | 2020-11-10 | NEWS CONTENT |
| 2 | Israel conflict: The range and scale of Hamas' weapons ... | Israel,Hamas,Conflict | 2020-12-10 | NEWS CONTENT |
It's a little bit complicated. I chose the easy route for the similarity measure, but you can change the function as you wish:
You can also use https://pypi.org/project/pyjarowinkler/ for the is_similar function instead of the set intersection I used; see the sketch after the code. The function can be much more sophisticated than mine.
I call apply twice: the first pass fits the grps dictionary. It would work without the first pass, but the assignments are more accurate the second time around.
You can also widen range(3,-1,-1) to a higher starting number for more accuracy.
def is_similar(txt1, txt2, level=0):
    # Similar if the two keyword sets share more than `level` words.
    return len(set(txt1) & set(txt2)) > level

grps = {}

def get_grp_id(row):
    row_words = row['keywords'].split(',')
    if len(grps.keys()) == 0:
        # First row starts group 1.
        grps[1] = set(row_words)
        return 1
    else:
        # Try the strictest match first (level 3), then relax down to level 0.
        for level in range(3, -1, -1):
            for grp in grps:
                if is_similar(grps[grp], row_words, level):
                    grps[grp] = grps[grp] | set(row_words)
                    return grp
        # No group matched: open a new group after the last group id.
        grp += 1
        grps[grp] = set(row_words)
        return grp

# The first pass only fits the grps dictionary; the second pass assigns
# more accurate group ids now that the groups are fully built.
df.apply(get_grp_id, axis=1)
df['grp'] = df.apply(get_grp_id, axis=1)
df = df.sort_values(['grp', 'publication date'])
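As mentioned in the notes above, is_similar can be swapped for a fuzzier comparison. A sketch using pyjarowinkler's distance.get_jaro_distance (the 0.8 cutoff is an arbitrary assumption you would tune):

from pyjarowinkler import distance

def is_similar(txt1, txt2, level=0):
    # Count keyword pairs scoring above 0.8 on Jaro-Winkler similarity;
    # the 0.8 threshold is an arbitrary choice, not a library default.
    matches = sum(1 for w1 in txt1 for w2 in txt2
                  if distance.get_jaro_distance(w1, w2, winkler=True) > 0.8)
    return matches > level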
Sorting by grp and then by publication date gives the grouped output requested above.
If you want to split it into separate DataFrames, a sketch follows below.
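A minimal way to split the result into one DataFrame per group, using the grp column added above:

groups = {grp: frame for grp, frame in df.groupby('grp')}
# groups[1], groups[2], ... are the per-group DataFrames, already sorted.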

BS4 does not find li items?

I am trying to scrape article data from Trading Economics with requests and bs4. Below is the script I wrote:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://tradingeconomics.com/stream?c=country+list'
r = requests.get(url).text
parser = BeautifulSoup(r, 'html.parser')  # explicit parser avoids a bs4 warning
li = parser.find_all('li', class_='list-group-item')
print(li)
I can see from inspecting the webpage that each article is under a li item with class='list-group-item'. Strangely, bs4 cannot find any li items and returns an empty list. What could be the cause?
The data you see on the page is loaded dynamically via JavaScript, so BeautifulSoup doesn't see it. You can simulate the JavaScript request with this example:
import json
import requests

url = "https://tradingeconomics.com/ws/stream.ashx"
params = {"start": 0, "size": 20, "c": "country list"}
data = requests.get(url, params=params).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for item in data:
    print(item["title"])
    print(item["description"])
    print(item["url"])
    print("-" * 80)
Prints:
Week Ahead
Central banks in the US, Brazil, Japan, the UK and Turkey will be deciding on monetary policy in the coming week, while key data to watch for include US and China industrial output and retail sales; Japan and Canada inflation data; UK consumer morale; Australia employment figures and retail trade; New Zealand Q4 GDP; and India wholesale prices.
/calendar?article=29075&g=top&importance=2&startdate=2021-03-12
--------------------------------------------------------------------------------
Week Ahead
The US Senate will be debating and voting on the President Biden's $1.9 trillion coronavirus aid bill during the weekend, and send it back to the House to be signed into law as soon as next week. Elsewhere, monetary policy meetings in the Eurozone and Canada will be keenly watched, as well as updated GDP figures for Japan, the Eurozone, the UK and South Africa. Other important releases include US, China and India inflation data; US and Australia consumer sentiment; UK and China foreign trade; Eurozone and India industrial output; and Japan current account.
/calendar?article=29072&g=top&importance=2&startdate=2021-03-05
--------------------------------------------------------------------------------
Week Ahead
After President Biden's $1.9 trillion pandemic aid bill narrowly passed the House in the early hours of Saturday, all attention would move now to vote in Senate. Meanwhile, investors will also turn the eyes to US jobs report due Friday; GDP data for Australia, Brazil, Canada and Turkey as well as worldwide manufacturing and services PMI surveys and monetary policy action by the Reserve Bank of Australia. Other releases include trade figures for the US and Canada, factory orders for the US and Germany, unemployment rate for the Euro Area and Japan, and industrial output and retail sales for South Korea.
/calendar?article=29071&g=top&importance=2&startdate=2021-02-27
--------------------------------------------------------------------------------
...and so on.
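To fetch more than the first batch, start looks like a plain offset into the stream (an assumption inferred from the request above, not documented behaviour), so paging would be:

import requests

url = "https://tradingeconomics.com/ws/stream.ashx"
items = []
for start in range(0, 100, 20):  # first five pages of 20, if `start` is an offset
    data = requests.get(url, params={"start": start, "size": 20, "c": "country list"}).json()
    if not data:
        break  # stop once the endpoint returns an empty batch
    items.extend(data)
print(len(items))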

Replace table in HTML file with text (e.g. "#### There was a table here ####")

I am extracting text from an HTML file in Python using BeautifulSoup. I want to extract all text data and discard tables, but I also want to replace each table in the HTML with a marker text (e.g. " #### There was a table here #### ").
I was able to read the HTML file with BeautifulSoup and remove tables using strip_tables(html), but I am not sure how to remove a table and replace it with text noting that a table was there.
def strip_tables(soup):
    """Removes all tables from the soup object."""
    for table in soup(["table"]):
        table.extract()
    return soup

sample_html_file = "/Path/file.html"
# read_from_file reads the file and returns its contents for BeautifulSoup.
html = read_from_file(sample_html_file)
soup = BeautifulSoup(html, "lxml")
my_text = strip_tables(soup).text
This is the text of the HTML file with the table included:
By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine President and Chief Executive OfficerSunnyvale, California October 4, 2018
Table of Contents TABLE OF CONTENTS Page QUESTIONS AND ANSWERS REGARDING THIS SOLICITATION AND VOTING AT THE ANNUAL MEETING 1 PROPOSAL ONEELECTION OF DIRECTORS 7 Classes of our Board 7 Director NomineesClass III Directors 7 Continuing DirectorsClass I and Class II Directors 8 Board of Directors Recommendation 11 PROPOSAL TWOTO APPROVE AN AMENDMENT TO OUR 2016 EQUITY INCENTIVE PLAN TO INCREASE THE NUMBER OF SHARES OF COMMON STOCK AUTHORIZED FOR ISSUANCE UNDER SUCH PLAN 12 Summary of the Amended 2016 Plan 13 Summary of U.S. Federal Income Tax Consequences 20 New Plan Benefits 22 Existing Plan Benefits to Employees and Directors 23 Board of Directors Recommendation 23 PROPOSAL THREETO APPROVE AN AMENDMENT TO OUR 2007 EMPLOYEE STOCK PURCHASE PLAN TO INCREASE THE NUMBER OF SHARES OF COMMON STOCK AUTHORIZED FOR ISSUANCE UNDER SUCH PLAN A-1 APPENDIX B AMENDED AND RESTATED 2007 EMPLOYEE STOCK PURCHASE PLAN B-1 ii Table of Contents PROXY STATEMENT FOR ACCURAY INCORPORATED 2018 ANNUAL MEETING OF STOCKHOLDERS TO BE HELD ON NOVEMBER 16, 2018
This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)
This is the data after strip_tables:
By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine President and Chief Executive OfficerSunnyvale, California October 4, 2018
This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)
Expected outcome
By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine President and Chief Executive OfficerSunnyvale, California October 4, 2018
" #### There was a table here #### "
This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)
Try using replace_with() instead of extract() in the strip_tables function (replaceWith is the older bs4 name for the same method). Hope it helps:
def strip_tables(soup):
    """Replaces all tables in the soup object with a marker string."""
    for table in soup(["table"]):
        table.replace_with(" #### There was a table here #### ")
    return soup
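A quick self-contained check of that version, using the strip_tables just defined (the sample HTML is made up for illustration):

from bs4 import BeautifulSoup

html = "<p>Before the table.</p><table><tr><td>cell</td></tr></table><p>After the table.</p>"
soup = BeautifulSoup(html, "lxml")
print(strip_tables(soup).get_text(" ", strip=True))
# Before the table. #### There was a table here #### After the table.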

Python Web Scraping: Split Quantities from Unstructured Data

I am relatively new to web scraping and Python. I am trying to scrape data from a supermarket/online grocery store, and I am having trouble cleaning the scraped data.
Sample scraped data:
Tata Salt Lite, Low Sodium, 1kg
Fortune Kachi Ghani Pure Mustard Oil, 1L (Pet Bottle)
Bourbon Bliss, 150g (Buy 3 Get 1 Free) Amazon Brand
Vedaka Popular Toor/Arhar Dal, 1 kg
Eno Bottle 100 g (Regular) Pro
Nature 100% Organic Masoor Black Whole, 500g
Surf Excel Liquid Detergent 1.05 L
Given the sample above, I would like to separate the quantities from the product names.
Required Format
Name - Tata Salt Lite, Low Sodium
Quantity - 1kg
Name - Fortune Kachi Ghani Pure Mustard Oil
Quantity - 1L
and so on...
I have tried to separate them with a regex,
re.split("[,/._-]+", i)
but with only partial success. Could anyone please help me handle this dataset? Thanks in advance.
You can try applying the solution below to each string:

import re

text_content = "Tata Salt Lite, Low Sodium, 1kg"
# Match an integer or decimal quantity, an optional space, and a unit.
quantity = re.search(r"\d+(?:\.\d+)?\s?(?:kg|g|L)", text_content).group()
name = text_content.rsplit(quantity)[0].strip().rstrip(',')
description = "Name - {}, Quantity - {}".format(name, quantity)

Attribute extraction from Metadata using Python3

Here is the input:
Magic Bullet Magic Bullet MBR-1701 17-Piece Express Mixing Set 4.1 out of 5 stars 3,670 customer reviews | 300 answered questions List Price: $59.99 Price: $47.02 & FREE Shipping You Save: $12.97 (22%) Only 5 left in stock - order soon. Ships from and sold by shopincaldwell. This fits your . Enter your model number to make sure this fits. 17-piece high-speed mixing system chops, whips, blends, and more Includes power base, 2 blades, 2 cups, 4 mugs, 4 colored comfort lip rings, 2 sealed lids, 2 vented lids, and recipe book Durable see-through construction; press down for results in 10 seconds or less Microwave- and freezer-safe cups and mugs; dishwasher-safe parts Product Built to North American Electrical Standards 23 new from $43.95 4 used from $35.00 There is a newer model of this item: Magic Bullet Blender, Small, Silver, 11 Piece Set $34.00 (2,645) In Stock.
Expected Output:
Title: "Magic Bullet MBR-1701 17-Piece Express Mixing Set"
Customer Rating: "4.1"
Customer Reviews: "3670"
List Price: "$59.99"
Offer Price: "$47.02"
Shipping: "FREE Shipping"
Can anyone please help me?
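No answer was posted here, but a regex sketch along the lines below would pull out the listed fields. The patterns are assumptions tied to this exact sample text and would need hardening for other product pages:

import re

text = ("Magic Bullet Magic Bullet MBR-1701 17-Piece Express Mixing Set "
        "4.1 out of 5 stars 3,670 customer reviews | 300 answered questions "
        "List Price: $59.99 Price: $47.02 & FREE Shipping")

rating = re.search(r"(\d\.\d) out of 5 stars", text)
reviews = re.search(r"([\d,]+) customer reviews", text)
list_price = re.search(r"List Price:\s*(\$[\d.]+)", text)
# Lookbehind skips "List Price:" so only the offer price matches here.
offer_price = re.search(r"(?<!List )Price:\s*(\$[\d.]+)", text)
shipping = re.search(r"(FREE Shipping)", text)

print("Customer Rating:", rating.group(1) if rating else None)
print("Customer Reviews:", reviews.group(1).replace(",", "") if reviews else None)
print("List Price:", list_price.group(1) if list_price else None)
print("Offer Price:", offer_price.group(1) if offer_price else None)
print("Shipping:", shipping.group(1) if shipping else None)
# The title is harder: the brand is duplicated in the flat text, so it is
# more reliable to take it from the page's title element while scraping.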
