How to group similar news text / article in pandas dataframe - python-3.x

I have a pandas DataFrame of news articles, for example:
id | news title | keywords | publcation date | content
1 | Congress Wants to Beef Up Army Effort to Develop Counter-Drone Weapons | USA,Congress,Drone,Army | 2020-12-10 | SOME NEWS CONTENT
2 | Israel conflict: The range and scale of Hamas' weapons ... | Israel,Hamas,Conflict | 2020-12-10 | NEWS CONTENT
3 | US Air Force progresses testing of anti-drone laser weapons | USA,Air Force,Weapon,Dron | 2020-10-10 | NEWS CONTENT
4 | Hamas fighters display weapons in Gaza after truce with Israel | Hamas,Gaza,Israel,Weapon,Truce | 2020-11-10 | NEWS CONTENT
How can I group similar rows based on the news content and sort each group by publication date?
Note: the content may be a summary of the news.
The result should display as:
Group1
id | news title | keywords | publcation date | content
3 | US Air Force progresses testing of anti-drone laser weapons | USA,Air Force,Weapon,Dron | 2020-10-10 | NEWS CONTENT
1 | Congress Wants to Beef Up Army Effort to Develop Counter-Drone Weapons | USA,Congress,Drone,Army | 2020-12-10 | SOME NEWS CONTENT
Group2
id | news title | keywords | publcation date | content
4 | Hamas fighters display weapons in Gaza after truce with Israel | Hamas,Gaza,Israel,Weapon,Truce | 2020-11-10 | NEWS CONTENT
2 | Israel conflict: The range and scale of Hamas' weapons ... | Israel,Hamas,Conflict | 2020-12-10 | NEWS CONTENT

It's a little complicated. I chose an easy similarity measure, but you can change the function as you wish.
You can also use https://pypi.org/project/pyjarowinkler/ for the is_similar function instead of the set intersection I used. The function can be much more sophisticated than mine.
I call apply twice: the first call only populates "grps". It works without the first call, but the second pass is more accurate because the groups are already filled in.
You can also start range(3,-1,-1) at a higher number, so that more shared keywords are required before the threshold is relaxed.
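def is_similar(txt1, txt2, level=0):
    # True when the two keyword collections share more than `level` items
    return len(set(txt1) & set(txt2)) > level

grps = {}  # group id -> set of all keywords seen in that group

def get_grp_id(row):
    row_words = row['keywords'].split(',')
    if len(grps) == 0:
        grps[1] = set(row_words)
        return 1
    # try the strictest overlap first, then relax it step by step
    for level in range(3, -1, -1):
        for grp in grps:
            if is_similar(grps[grp], row_words, level):
                grps[grp] = grps[grp] | set(row_words)
                return grp
    # no existing group shares a keyword: open a new one
    grp = max(grps) + 1
    grps[grp] = set(row_words)
    return grp

df.apply(get_grp_id, axis=1)  # first pass only populates grps
df['grp'] = df.apply(get_grp_id, axis=1)
df = df.sort_values(['grp', 'publcation date'])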
This gives the grouped output, sorted by date within each group.
If you want to split it into separate DataFrames, let me know.
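Since the answer invites swapping in a different similarity function, here is a minimal sketch of one possible alternative: Jaccard similarity on the keyword sets. The names jaccard and is_similar_jaccard and the 0.25 threshold are assumptions chosen for illustration, not part of the original answer, and the function is not a drop-in replacement for is_similar because it takes a ratio threshold instead of a count.

```python
def jaccard(keywords_a, keywords_b):
    """Share of keywords the two collections have in common (0.0 to 1.0)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_similar_jaccard(txt1, txt2, threshold=0.25):
    # threshold is an illustrative value, not a tuned one
    return jaccard(txt1, txt2) >= threshold


# Keyword strings taken from the question's sample rows
drone_1 = "USA,Congress,Drone,Army".split(",")
drone_3 = "USA,Air Force,Weapon,Dron".split(",")
hamas_2 = "Israel,Hamas,Conflict".split(",")
hamas_4 = "Hamas,Gaza,Israel,Weapon,Truce".split(",")

print(is_similar_jaccard(hamas_2, hamas_4))  # 2 shared of 6 distinct keywords
print(is_similar_jaccard(drone_3, hamas_4))  # only "Weapon" is shared
print(jaccard(drone_1, drone_3))             # only "USA" is shared ("Dron" != "Drone")
```

On this sample the two drone articles share only "USA", partly because of the "Dron" spelling, so keyword overlap alone is a weak signal; comparing the content text would likely group them more reliably.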

Related

Web scrape a table created by a Javascript function

I am trying to scrape the table of research studies at the link below:
https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist=
The table content is created dynamically with JavaScript.
I tried using Selenium, but I intermittently get a StaleElementReferenceException.
Please help me with this.
I want to retrieve all rows in the table and store them in a local database.
Here is what I have tried in Selenium:
import selenium.webdriver as webdriver

url = 'https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist='
driver = webdriver.Firefox()
#driver.implicitly_wait(30)
driver.get(url)
data = []
# '@id' selects the table by its id attribute
for tr in driver.find_elements_by_xpath('//table[@id="theDataTable"]//tbody//tr'):
    tds = tr.find_elements_by_tag_name('td')
    if tds:
        for td in tds:
            print(td.text)
            if td.text not in data:
                data.append(td.text)
driver.quit()
print('*********************************************************************')
print(data)
I will then store the contents of the 'data' variable in a database.
I am new to Selenium and web scraping. I also want to click each link in the 'Study Title' column and extract data from the page for each study.
I would like suggestions for avoiding or handling the stale element exception, or an alternative to Selenium WebDriver.
Thanks in advance!
I tried the following code of mine and all the data are stored properly. Can you try it?
CODE
import time
import selenium.webdriver as webdriver
from selenium.webdriver.support.ui import Select

# Selenium 3 style locators; Selenium 4 uses driver.find_element(By.NAME, ...) etc.
driver = webdriver.Firefox()
driver.get("https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist=")
time.sleep(5)
array = []
flag = True
next_counter = 0
time.sleep(4)
# show 100 rows per page instead of 10
select = Select(driver.find_element_by_name('theDataTable_length'))
select.select_by_value('100')
time.sleep(5)
while flag:
    if next_counter == 13:
        print("Stopped")
        flag = False  # leave the loop after the last page
    else:
        item = driver.find_elements_by_tag_name("tbody")[1]
        rows = item.find_elements_by_tag_name('tr')
        for x in range(len(rows)):
            for i in range(7):
                text = rows[x].find_elements_by_tag_name('td')[i].text
                array.append(text)  # keep the cells in reading order
                print(text)
        time.sleep(5)
        next_button = driver.find_element_by_id('theDataTable_next')
        next_button.click()
        next_counter = next_counter + 1
        time.sleep(7)
OUTPUT
1
Not yet recruiting
NEW
Indirect Endovenous Systemic Ozone for New Coronavirus Disease (COVID19) in Non-intubated Patients
COVID
Other: Systemic indirect endovenous ozone therapy
SEOT
Valencia, Spain
2
Recruiting
NEW
Prediction of Clinical Course in COVID19 Patients
COVID 19
Other: CT-Scan
Chu Saint-Etienne
Saint-Étienne, France
3
Not yet recruiting
NEW
Risks of COVID19 in the Pregnant Population
COVID19
Other: Biospecimen collection
Mayo Clinic in Rochester
Rochester, Minnesota, United States
4
Recruiting
NEW
Saved From COVID-19
COVID
Drug: Chloroquine
Drug: Placebo oral tablet
Columbia University Irving Medical Center/NYP
New York, New York, United States
5
Recruiting
NEW
Efficacy of Convalescent Plasma Therapy in Severely Sick COVID-19 Patients
COVID
Drug: Convalescent Plasma Transfusion
Other: Supportive Care
Drug: Random Donor Plasma
Maulana Azad medical College
New Delhi, Delhi, India
Institute of Liver and Biliary Sciences
New Delhi, Delhi, India
6
Not yet recruiting
NEW
A Real-life Experience on Treatment of Patients With COVID 19
COVID
Drug: Chloroquine
Drug: Favipiravir
Drug: Nitazoxanide
(and 3 more...)
Tanta university hospital
Tanta, Egypt
7
Recruiting
International COVID19 Clinical Evaluation Registry,
COVID 19
Combination Product: Observational (registry)
Hospital Lclinico San Carlos
Madrid, Spain
8
Completed
NEW
AiM COVID for Covid 19 Tracking and Prediction
COVID 19
Other: No Intervention
Aarogyam (UK)
Leicester, United Kingdom
9
Recruiting
NEW
Establishing a COVID-19 Prospective Cohort for Identification of Secondary HLH
COVID
Department of nephrology, Klinikum rechts der Isar
München, Bavaria, Germany
10
Recruiting
NEW
Max Ivermectin- COVID 19 Study Versus Standard of Care Treatment for COVID 19 Cases. A Pilot Study
COVID
Drug: Ivermectin
Max Super Speciality hospital, Saket (A unit of Devki Devi Foundation)
New Delhi, Delhi, India
My code follows these steps:
First, I select the option to show 100 results per page instead of 10, to save time when retrieving the data.
Second, I read all 100 results on the page, and when I finish I click the next-page button. A sleep command then waits a few seconds for the new page to load (there are better ways, but I did it like this to give you something fast; you should use an explicit wait-until-element-is-visible approach instead).
After clicking the next-page button, I again save the 100 results.
This repeats until flag becomes False, which is meant to happen once next_counter reaches 13. The number 13 is 1300 (results) divided by 100 (maximum results per page), so 1300/100 = 13 pages.
Editing and transferring the data is something you can manage without any Selenium or web automation knowledge. It's plain Python.
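One common way to cope with an intermittent stale element error is to locate the element again and retry the read. The sketch below shows only that retry pattern, in plain Python so it runs without a browser. The names retry, read_cell and FakeStaleError are hypothetical; with Selenium, the action would re-find the row and read its text, and the exception passed in would be selenium.common.exceptions.StaleElementReferenceException.

```python
import time


def retry(action, exceptions, attempts=3, delay=0.5):
    """Call action() and try again when one of `exceptions` is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)


class FakeStaleError(Exception):
    """Stand-in for StaleElementReferenceException (assumption for the demo)."""


calls = {"count": 0}


def read_cell():
    # Simulates a lookup that fails twice before the page settles
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeStaleError("element went stale")
    return "Recruiting"


result = retry(read_cell, FakeStaleError, attempts=5, delay=0)
print(result)
```

The important detail is that the action looks the element up again on every attempt; retrying with a reference obtained before the page changed would fail the same way.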

Matching list of strings in my variables to my pandas table

I have this list of values in table['title']:
title
0 Intern, Ops System
1 Regional Business Analyst, Fleet
2 Analyst, Performance Monitoring & Planning (PMP)
3 Designer (Contract)
4 Fashion Category Intern
5 Expert Recruiter (Contract)
6 Category Executive - Groceries
7 Guided Sales - Various Categories (Contract)
8 Category Executive - Muslimahwear (Contract)
9 Category Executive - Mother & Baby (Contract)
In my program, I'm trying to give the users a dropdown list of options which has the titles.
So users are able to choose title(s) which would create a new table showcasing what they have chosen.
Below is my code:
values = ["Intern, Ops System","Analyst, Performance Monitoring & Planning (PMP)","Fashion Category Intern","Guided Sales - Various Categories (Contract)"]
variables = "|".join(values)
filtered_table = table[table['title'].str.contains(variables, regex = True)]
I tried this method, but it only shows titles that do not have parentheses in their names:
title
0 Intern, Ops System
4 Fashion Category Intern
Is there any way where I could get it all to show up in my filtered_table?
Because ( and ) are special regex characters, apply re.escape to each value in a generator expression so that all special regex characters are escaped:
import re
variables = "|".join(re.escape(x) for x in values)
filtered_table = table[table['title'].str.contains(variables, regex = True)]
print (filtered_table)
title
0 Intern, Ops System
2 Analyst, Performance Monitoring & Planning (PMP)
4 Fashion Category Intern
7 Guided Sales - Various Categories (Contract)
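To see why the unescaped pattern misses titles with parentheses, here is a small sketch using only the standard re module on one title from the question. The variable names are chosen for illustration.

```python
import re

title = "Designer (Contract)"

# Unescaped, "(Contract)" is a capture group, so the pattern looks for
# the text "Designer Contract" and does not match the real title.
unescaped = re.search(title, title)

# re.escape makes "(" and ")" match themselves literally.
escaped = re.search(re.escape(title), title)

print(unescaped)            # None
print(escaped is not None)  # True
```

If the selected values must match whole titles exactly, table['title'].isin(values) is another option that avoids regex altogether.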

Python Web Scraping : Split quantities from Unstructured Data

I am relatively new to web scraping as well as Python. I am trying to scrape data from supermarket/online grocery stores.
I am facing an issue in cleaning the scraped data.
Data Sample Scraped
Tata Salt Lite, Low Sodium, 1kg
Fortune Kachi Ghani Pure Mustard Oil, 1L (Pet Bottle)
Bourbon Bliss, 150g (Buy 3 Get 1 Free) Amazon Brand
Vedaka Popular Toor/Arhar Dal, 1 kg
Eno Bottle 100 g (Regular) Pro
Nature 100% Organic Masoor Black Whole, 500g
Surf Excel Liquid Detergent 1.05 L
Considering the above data sample I would like to separate the quantities from the product names.
Required Format
Name -Tata Salt Lite, Low Sodium,
Quantity -1kg
Name - Fortune Kachi Ghani Pure Mustard Oil
Quantity - 1L and so on...
I have tried to separate the same with a regex
re.split("[,/._-]+", i)
but with partial success.
Could anyone please help me on how to handle the dataset. Thanks in advance.
You can try applying the solution below to each string:
import re
text_content = "Tata Salt Lite, Low Sodium, 1kg"
# a number (optionally with decimals), an optional space, then a unit
quantity = re.search(r"\d+(?:\.\d+)?\s?(?:kg|g|L)\b", text_content).group()
name = text_content.rsplit(quantity)[0].strip().rstrip(',')
description = "Name - {}, Quantity - {}".format(name, quantity)
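Applied to several of the sample strings, the same idea can be wrapped in a function. This is a minimal sketch: split_quantity and QTY are hypothetical names, the unit list (kg, g, L) covers only the units seen in the sample, and the pattern is extended to accept decimals such as 1.05 L.

```python
import re

# number (optional decimals), optional space, then a known unit
QTY = re.compile(r"\d+(?:\.\d+)?\s?(?:kg|g|L)\b")


def split_quantity(text):
    """Return (name, quantity); quantity is None when no unit is found."""
    match = QTY.search(text)
    if match is None:
        return text.strip(), None
    name = text[:match.start()].strip().rstrip(",")
    return name, match.group()


samples = [
    "Tata Salt Lite, Low Sodium, 1kg",
    "Fortune Kachi Ghani Pure Mustard Oil, 1L (Pet Bottle)",
    "Surf Excel Liquid Detergent 1.05 L",
]
results = [split_quantity(s) for s in samples]
for name, quantity in results:
    print("Name - {}, Quantity - {}".format(name, quantity))
```

Returning None instead of raising keeps the loop going when a product line has no recognizable quantity.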

Load item cost from an inventory table

I have an Inventory Sheet that contains a bunch of data about products I have for sale. I have a sheet for each month where I load in my individual sales. In order to calculate my cost of sales, I enter my product cost for each sale manually. I would like a formula to load the cost automatically, using the product name as a search term.
Inventory Item | Cost      Sold Item | Sale Price | Cost
Product 1      | 2.99      Product 3 | 16.99      | X
Product 2      | 4.99      Product 3 | 14.57      | X
Product 3      | 6.99      Product 1 | 7.99       | X
So basically I am looking to "solve for X".
In addition to this, the product name on the two tables are actually different lengths. For example, one item on my Inventory Table may be "This is a very, very long product name that goes on and on for up to 120 characters", and on my products sold table it will be truncated at the first 40 characters of the product name. So in the above formula, it should only search for the first 40 characters of the product name.
Due to the complicated nature of this, I haven't been able to search for a sufficient solution, since I don't really know exactly where to start to quickly explain it.
UPDATE:
The product names of my Inventory List, and the product names of my items sold aren't matching. I thought I could just search for the left-most 40 characters, but this is not the case.
Here is a sample of products I have in my Inventory List:
Ford Focus 2000 thru 2007 (Haynes Repair Manual) by Haynes, Max
Franklin Brass D2408PC Futura, Bath Hardware Accessory, Tissue Paper Holder, ...
Fuji HQ T-120 Recordable VHS Cassette Tapes ( 12 pack ) (Discontinued by Manu...
Fundamentals of Three Dimensional Descriptive Geometry [Paperback] by Slaby, ...
GE Lighting 40184 Energy Smart 55-Watt 2-D Compact Fluorescent Bulb, 250-Watt...
Get Set for School: Readiness & Writing Pre-K Teacher's Guide (Handwriting Wi...
Get the Edge: A 7-Day Program To Transform Your Life [Audiobook] [Box set] by...
Gift Basket Wrap Bag - 2 Pack - 22" x 30" - Clear [Kitchen]
GOLDEN GATE EDITION 100 PIECE PUZZLE [Toy]
Granite Ware 0722 Stainless Steel Deluxe Food Mill, 2-Quart [Kitchen]
Guess Who's Coming, Jesse Bear [Paperback] by Carlstrom, Nancy White; Degen, ...
Guide to Culturally Competent Health Care (Purnell, Guide to Culturally Compe...
Guinness World Records 2002 [Illustrated] by Antonia Cunningham; Jackie Fresh...
Hawaii [Audio CD] High Llamas
And then here is a sample of the product names in my Sold list:
My Interactive Point-and-Play with Disne...
GE Lighting 40184 Energy Smart 55-Watt 2...
Crayola® Sidewalk Chalk Caddy with Chalk...
Crayola® Sidewalk Chalk Caddy with Chalk...
First Look and Find: Merry Christmas!
Sesame Street Point-and-Play and 10-Book...
Disney Mickey Mouse Board Game - Duck Du...
Nordic Ware Microwave Vegetable and Seaf...
SmartGames BACK 2 BACK
I have played around with searching for the left-most characters, minus 3. This did not work correctly. I have also switched the [range lookup] between TRUE and FALSE, but this has also not worked in a predictable way.
Use the VLOOKUP function, wrapping the lookup_value argument in the LEFT function.
In the answer's example (the screenshot is not preserved), LEFT(E2, 9) truncates the Sold Item value before it is looked up in the Inventory Item column.
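The matching logic behind a truncated lookup can also be sketched outside Excel. The Python below is only an illustration of prefix matching, not the answer's formula: find_cost is a hypothetical helper, the costs are made-up values, and it assumes a truncated sold name ends in "..." and is otherwise an exact prefix of the inventory name.

```python
def find_cost(sold_name, inventory):
    """Return the cost of the single inventory item starting with the sold name."""
    # Drop the trailing "..." that marks a truncated name
    prefix = sold_name[:-3] if sold_name.endswith("...") else sold_name
    matches = [cost for name, cost in inventory.items() if name.startswith(prefix)]
    # Ambiguous or missing matches are left for manual review
    return matches[0] if len(matches) == 1 else None


# Names from the question's Inventory List; the costs are invented for the demo
inventory = {
    "GE Lighting 40184 Energy Smart 55-Watt 2-D Compact Fluorescent Bulb, 250-Watt...": 4.99,
    "Hawaii [Audio CD] High Llamas": 2.99,
}

print(find_cost("GE Lighting 40184 Energy Smart 55-Watt 2...", inventory))
print(find_cost("SmartGames BACK 2 BACK", inventory))
```

Returning None for zero or multiple matches makes it visible when a sold name has no counterpart in the inventory sample, as is the case for most of the sold names listed in the question.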

Excel Graph - Category and Subcategory grouping

I seldom if ever use excel and have no deep understanding of graphs and graphing-related functions. Having said that...
I have dozens of rows of data, made up of 4 columns:
column 1 = amount/price (in numbers)
column 2 = description (the what, in text)
column 3 = category (in text)
column 4 = subcategory (in text)
I want to make a bar graph of my rows of data so that, the end result looks like this:
X axis - categories
Y axis - amount/price
The trick here is for categories NOT to repeat themselves. For example, if our data is something like...
100 | boat purchase | boats | 3 engine boat
200 | boat purchase | boats | 2 engine boat
500 | plane purchase | planes | 4 engine plane
900 | car purchase | cars | 1 engine car
Then there should only be ONE instance of boats, planes and cars in my graph, under which all associated data would be summed up.
Last but not least, I have seen graphs where, these unique-not-repeated categories, instead of just being one single 'bar' so to speak, are composed of smaller bars. In this case, I want these smaller bars to be the sub categories, so that the end result would look like this:
In that sample image, I first present a 'basic, classic' graph where blue, yellow and red each represent a unique, different category. Right below it is what I want, a 'breakdown' of each category by subcategory where blue/yellow/red each represent an imaginary 3 different subcategories per category.
This means subcategories will repeat themselves for each category, but categories themselves will not.
For clarification, I currently only have 3 main categories and 6 or so sub-categories, but this could change in the future, hence the desire to have this in an automatic/dynamic fashion
Kind regards
G.Campos
EDIT: new image:
Here is my take on it. Unfortunately I can't post the screenshots as I don't have enough posts.
One solution is to use a pivot chart: put Amount in "Values", Category in "Row Labels", and SubCategory in "Column Labels".
I uploaded relevant images on a free image upload service.
This is our source data:
Amount | Description | Category | SubCategory
100 | boat purchase | boats | 3 engine boat
200 | boat purchase | boats | 2 engine boat
500 | plane purchase | planes | 4 engine plane
900 | car purchase | cars | 1 engine car
450 | boat purchase | boats | 2 engine boat
110 | plane purchase | planes | 4 engine plane
550 | car purchase | cars | 1 engine car
230 | car purchase | cars | 2 engine car
450 | car purchase | cars | 5 engine car
This is the desired graph (Edit: This has ghost bars):
http://imageshack.us/photo/my-images/849/pivot.gif/
I just read the comment about no ghost graphs. This might be what you are looking for:
http://imageshack.us/photo/my-images/266/pivotnoghost.gif/
Just googled and found something very similar here:
peltiertech.com/WordPress/using-pivot-table-data-for-a-chart-with-a-dual-category-axis/
You need to add http:// ( I can't have more than two hyperlinks due to low number of posts)
I am not sure this will get you exactly where you want, but in general I find it easiest in Excel to summarize the graph data on a separate tab.
For sample data like the question's on a Data tab, you would create a second tab (Summary) that lists each category and subcategory with its total.
The totals are calculated with the SUMIF formula
=SUMIF(Data!C:C,Summary!A2,Data!A:A)
for the category totals, and
=SUMIF(Data!D:D,Summary!E2,Data!A:A)
for the subcategory totals (assuming subcategories are mutually exclusive). Now that the data is summarized, highlight the cells and insert a column chart for each summary.
Adding new categories and/or subcategories will require you to add lines to the summary data, and then add series to the charts. You could use a VBA macro to automate that task, but I suspect that is overkill since your dataset is "dozens" rather than "thousands".
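The summaries that both answers build (category totals like the SUMIF tab, and a category by subcategory breakdown like the pivot chart) can be checked with a short Python sketch. This is only an illustration of the aggregation on the source data shown above, not an Excel feature; the variable names are chosen for the example.

```python
from collections import defaultdict

# Source data from the first answer: amount, description, category, subcategory
rows = [
    (100, "boat purchase", "boats", "3 engine boat"),
    (200, "boat purchase", "boats", "2 engine boat"),
    (500, "plane purchase", "planes", "4 engine plane"),
    (900, "car purchase", "cars", "1 engine car"),
    (450, "boat purchase", "boats", "2 engine boat"),
    (110, "plane purchase", "planes", "4 engine plane"),
    (550, "car purchase", "cars", "1 engine car"),
    (230, "car purchase", "cars", "2 engine car"),
    (450, "car purchase", "cars", "5 engine car"),
]

category_totals = defaultdict(int)            # like SUMIF on the category column
pivot = defaultdict(lambda: defaultdict(int))  # category -> subcategory -> total

for amount, _description, category, subcategory in rows:
    category_totals[category] += amount
    pivot[category][subcategory] += amount

for category, total in category_totals.items():
    print(category, total, dict(pivot[category]))
```

Each category appears once with its summed amount, and the inner dictionary holds the smaller bars that would be drawn for that category.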

Resources