Attribute extraction from Metadata using Python3

Here is the input:
Magic Bullet Magic Bullet MBR-1701 17-Piece Express Mixing Set 4.1 out of 5 stars 3,670 customer reviews | 300 answered questions List Price: $59.99 Price: $47.02 & FREE Shipping You Save: $12.97 (22%) Only 5 left in stock - order soon. Ships from and sold by shopincaldwell. This fits your . Enter your model number to make sure this fits. 17-piece high-speed mixing system chops, whips, blends, and more Includes power base, 2 blades, 2 cups, 4 mugs, 4 colored comfort lip rings, 2 sealed lids, 2 vented lids, and recipe book Durable see-through construction; press down for results in 10 seconds or less Microwave- and freezer-safe cups and mugs; dishwasher-safe parts Product Built to North American Electrical Standards 23 new from $43.95 4 used from $35.00 There is a newer model of this item: Magic Bullet Blender, Small, Silver, 11 Piece Set $34.00 (2,645) In Stock.
Expected Output:
Title: "Magic Bullet MBR-1701 17-Piece Express Mixing Set"
Customer Rating: "4.1"
Customer Reviews: "3670"
List Price: "$59.99"
Offer Price: "$47.02"
Shipping: "FREE Shipping"
Can anyone please help me?
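A minimal sketch of one way to pull these fields out with regular expressions, assuming the text always follows this Amazon-style listing layout (the patterns below are illustrations, not a general parser):
import re

text = "Magic Bullet Magic Bullet MBR-1701 17-Piece Express Mixing Set 4.1 out of 5 stars 3,670 customer reviews | 300 answered questions List Price: $59.99 Price: $47.02 & FREE Shipping You Save: $12.97 (22%)"

# Rating and review count, e.g. "4.1 out of 5 stars 3,670 customer reviews"
rating = re.search(r"(\d\.\d) out of 5 stars", text)
reviews = re.search(r"([\d,]+) customer reviews", text)

# Prices, e.g. "List Price: $59.99 Price: $47.02"
list_price = re.search(r"List Price:\s*(\$[\d.]+)", text)
offer_price = re.search(r"(?<!List )Price:\s*(\$[\d.]+)", text)
shipping = re.search(r"FREE Shipping", text)

# Title: everything before the rating (the duplicated brand name may still need cleanup)
title = text[:rating.start()].strip() if rating else None

print("Title:", title)
print("Customer Rating:", rating.group(1) if rating else None)
print("Customer Reviews:", reviews.group(1).replace(",", "") if reviews else None)
print("List Price:", list_price.group(1) if list_price else None)
print("Offer Price:", offer_price.group(1) if offer_price else None)
print("Shipping:", "FREE Shipping" if shipping else None)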

Related

Web scrape a table created by a Javascript function

I am trying to scrape the table of research studies in the below link
https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist=
The table content is dynamically created using JavaScript.
I tried using Selenium, but I am intermittently getting a StaleElementException.
Please help me with the same.
I want to retrieve all rows in the table and store them in a local database.
Here is what I have tried in selenium
import selenium.webdriver as webdriver

url = 'https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist='
driver = webdriver.Firefox()
#driver.implicitly_wait(30)
driver.get(url)
data = []
for tr in driver.find_elements_by_xpath('//table[@id="theDataTable"]//tbody//tr'):
    tds = tr.find_elements_by_tag_name('td')
    if tds:
        for td in tds:
            print(td.text)
            if td.text not in data:
                data.append(td.text)
driver.quit()
print('*********************************************************************')
print(data)
I will then store the data from the 'data' variable in a database.
I am new to Selenium and web scraping. Further, I want to click on each link in the 'Study Title' column and extract data from that page for each study.
I would like suggestions on how to avoid or handle the stale element exception, or an alternative to the Selenium webdriver.
Thanks in advance!
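For the storage step mentioned above, here is a minimal sqlite3 sketch (standard library); the table name and column names are assumptions, not taken from clinicaltrials.gov:
import sqlite3

conn = sqlite3.connect('studies.db')
conn.execute("""CREATE TABLE IF NOT EXISTS studies (
    row_no TEXT, status TEXT, title TEXT, conditions TEXT,
    interventions TEXT, locations TEXT)""")

# 'rows' would be built from the scraped td texts, one tuple per table row
rows = [("1", "Not yet recruiting", "Some study title", "COVID",
         "Other: Some intervention", "Valencia, Spain")]
conn.executemany("INSERT INTO studies VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()
conn.close()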
I tried the following code of mine and all the data are stored properly. Can you try it?
CODE
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get("https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist=")
time.sleep(5)
array = []
flag = True
next_counter = 0
time.sleep(4)
# Switch the results table to show 100 rows per page instead of 10
select = Select(driver.find_element_by_name('theDataTable_length'))
select.select_by_value('100')
time.sleep(5)
while flag == True:
    if next_counter == 13:
        print("Stopped")
        flag = False
    else:
        item = driver.find_elements_by_tag_name("tbody")[1]
        rows = item.find_elements_by_tag_name('tr')
        for x in range(len(rows)):
            for i in range(7):
                array.insert(x, rows[x].find_elements_by_tag_name('td')[i].text)
                print(rows[x].find_elements_by_tag_name('td')[i].text)
        time.sleep(5)
        next = driver.find_element_by_id('theDataTable_next')
        next.click()
        next_counter = next_counter + 1
        time.sleep(7)
OUTPUT
1
Not yet recruiting
NEW
Indirect Endovenous Systemic Ozone for New Coronavirus Disease (COVID19) in Non-intubated Patients
COVID
Other: Systemic indirect endovenous ozone therapy
SEOT
Valencia, Spain
2
Recruiting
NEW
Prediction of Clinical Course in COVID19 Patients
COVID 19
Other: CT-Scan
Chu Saint-Etienne
Saint-Étienne, France
3
Not yet recruiting
NEW
Risks of COVID19 in the Pregnant Population
COVID19
Other: Biospecimen collection
Mayo Clinic in Rochester
Rochester, Minnesota, United States
4
Recruiting
NEW
Saved From COVID-19
COVID
Drug: Chloroquine
Drug: Placebo oral tablet
Columbia University Irving Medical Center/NYP
New York, New York, United States
5
Recruiting
NEW
Efficacy of Convalescent Plasma Therapy in Severely Sick COVID-19 Patients
COVID
Drug: Convalescent Plasma Transfusion
Other: Supportive Care
Drug: Random Donor Plasma
Maulana Azad medical College
New Delhi, Delhi, India
Institute of Liver and Biliary Sciences
New Delhi, Delhi, India
6
Not yet recruiting
NEW
A Real-life Experience on Treatment of Patients With COVID 19
COVID
Drug: Chloroquine
Drug: Favipiravir
Drug: Nitazoxanide
(and 3 more...)
Tanta university hospital
Tanta, Egypt
7
Recruiting
International COVID19 Clinical Evaluation Registry,
COVID 19
Combination Product: Observational (registry)
Hospital Lclinico San Carlos
Madrid, Spain
8
Completed
NEW
AiM COVID for Covid 19 Tracking and Prediction
COVID 19
Other: No Intervention
Aarogyam (UK)
Leicester, United Kingdom
9
Recruiting
NEW
Establishing a COVID-19 Prospective Cohort for Identification of Secondary HLH
COVID
Department of nephrology, Klinikum rechts der Isar
München, Bavaria, Germany
10
Recruiting
NEW
Max Ivermectin- COVID 19 Study Versus Standard of Care Treatment for COVID 19 Cases. A Pilot Study
COVID
Drug: Ivermectin
Max Super Speciality hospital, Saket (A unit of Devki Devi Foundation)
New Delhi, Delhi, India
My code follows these logic steps:
First, I select the option to show 100 results per page instead of 10, to save some time when retrieving the data.
Secondly, I read all 100 results on the page and, when finished, click the next-page button. I then sleep for a few seconds to let the next page load (you can do this in a better way, but I did it like this to give you something fast - you should really use an explicit wait such as waitUntilElementIsVisible; see the sketch after these steps).
After clicking the next-page button, I save the next 100 results in the same way.
This runs until the flag becomes False, which happens once next_counter reaches 13. The number 13 comes from roughly 1300 results divided by 100 results per page (1300/100 = 13), so there are 13 pages.
Editing and transferring the data is something you can manage without any Selenium or web-automation knowledge; it is a pure Python task.
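As a rough sketch of that explicit-wait idea (WebDriverWait with expected_conditions is Selenium's standard mechanism for this; the locator reuses the table id from the question):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist=')
wait = WebDriverWait(driver, 30)

# Wait until the rows are actually present instead of sleeping a fixed time,
# and re-find them after every page change so no stale references are reused.
rows = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, '//table[@id="theDataTable"]//tbody//tr')))
for row in rows:
    print([td.text for td in row.find_elements_by_tag_name('td')])
driver.quit()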

Matching list of strings in my variables to my pandas table

So I have this list of titles in my table['title']:
title
0 Intern, Ops System
1 Regional Business Analyst, Fleet
2 Analyst, Performance Monitoring & Planning (PMP)
3 Designer (Contract)
4 Fashion Category Intern
5 Expert Recruiter (Contract)
6 Category Executive - Groceries
7 Guided Sales - Various Categories (Contract)
8 Category Executive - Muslimahwear (Contract)
9 Category Executive - Mother & Baby (Contract)
In my program, I'm trying to give users a dropdown list of options containing these titles.
Users can then choose one or more titles, which would create a new table showing what they have chosen.
Below is my code:
values = ["Intern, Ops System","Analyst, Performance Monitoring & Planning (PMP)","Fashion Category Intern","Guided Sales - Various Categories (Contract)"]
variables = "|".join(values)
filtered_table = table[table['title'].str.contains(variables, regex = True)]
I tried using this method to get my solution, but somehow it only manages to show titles that do not have parentheses in their names:
title
0 Intern, Ops System
4 Fashion Category Intern
Is there any way where I could get it all to show up in my filtered_table?
Because ( and ) are special regex characters, add re.escape in a generator comprehension to escape all special regex characters:
import re
variables = "|".join(re.escape(x) for x in values)
filtered_table = table[table['title'].str.contains(variables, regex = True)]
print (filtered_table)
title
0 Intern, Ops System
2 Analyst, Performance Monitoring & Planning (PMP)
4 Fashion Category Intern
7 Guided Sales - Various Categories (Contract)

Python Web Scraping : Split quantities from Unstructured Data

I am relatively new to the field of web scraping as well as Python. I am trying to scrape data from a supermarket/online grocery store.
I am facing an issue in cleaning the scraped data.
Scraped data sample:
Tata Salt Lite, Low Sodium, 1kg
Fortune Kachi Ghani Pure Mustard Oil, 1L (Pet Bottle)
Bourbon Bliss, 150g (Buy 3 Get 1 Free) Amazon Brand
Vedaka Popular Toor/Arhar Dal, 1 kg
Eno Bottle 100 g (Regular) Pro
Nature 100% Organic Masoor Black Whole, 500g
Surf Excel Liquid Detergent 1.05 L
Considering the above data sample I would like to separate the quantities from the product names.
Required Format
Name - Tata Salt Lite, Low Sodium
Quantity - 1kg
Name - Fortune Kachi Ghani Pure Mustard Oil
Quantity - 1L
and so on...
I have tried to separate them with a regex:
re.split("[,/._-]+", i)
but with only partial success.
Could anyone please help me with how to handle this dataset? Thanks in advance.
You can try to apply the solution below to each string:
import re

text_content = "Tata Salt Lite, Low Sodium, 1kg"
# Capture quantities such as "1kg", "100 g", "1L" or "1.05 L"
quantity = re.search(r"(\d+(?:\.\d+)?\s?(?:kg|g|L))", text_content).group()
name = text_content.rsplit(quantity)[0].strip().rstrip(',')
description = "Name - {}, Quantity - {}".format(name, quantity)
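Applied to the whole scraped list, a quick sketch with a fallback for strings where the pattern finds no quantity (the sample list here is just the data shown above):
import re

samples = [
    "Tata Salt Lite, Low Sodium, 1kg",
    "Fortune Kachi Ghani Pure Mustard Oil, 1L (Pet Bottle)",
    "Surf Excel Liquid Detergent 1.05 L",
]

for text_content in samples:
    match = re.search(r"(\d+(?:\.\d+)?\s?(?:kg|g|L))", text_content)
    if match:
        quantity = match.group()
        name = text_content.rsplit(quantity)[0].strip().rstrip(',')
    else:
        # No recognisable quantity in this string
        name, quantity = text_content, None
    print("Name - {}, Quantity - {}".format(name, quantity))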

Load item cost from an inventory table

I have an Inventory Sheet that contains a bunch of data about products I have for sale. I have a sheet for each month where I load in my individual sales. In order to calculate my cost of sales, I enter my product cost for each sale manually. I would like a formula to load the cost automatically, using the product name as a search term.
Inventory table:
Inventory Item | Cost
Product 1 | 2.99
Product 2 | 4.99
Product 3 | 6.99
Sales table:
Sold Item | Sale Price | Cost
Product 3 | 16.99 | X
Product 3 | 14.57 | X
Product 1 | 7.99 | X
So basically I am looking to "solve for X".
In addition to this, the product names in the two tables are actually different lengths. For example, one item on my Inventory Table may be "This is a very, very long product name that goes on and on for up to 120 characters", while on my products sold table it will be truncated at the first 40 characters of the product name. So in the above formula, it should only search for the first 40 characters of the product name.
Due to the complicated nature of this, I haven't been able to search for a sufficient solution, since I don't really know exactly where to start to quickly explain it.
UPDATE:
The product names in my Inventory List and the product names of my items sold don't match exactly. I thought I could just search for the left-most 40 characters, but this is not the case.
Here is a sample of products I have in my Inventory List:
Ford Focus 2000 thru 2007 (Haynes Repair Manual) by Haynes, Max
Franklin Brass D2408PC Futura, Bath Hardware Accessory, Tissue Paper Holder, ...
Fuji HQ T-120 Recordable VHS Cassette Tapes ( 12 pack ) (Discontinued by Manu...
Fundamentals of Three Dimensional Descriptive Geometry [Paperback] by Slaby, ...
GE Lighting 40184 Energy Smart 55-Watt 2-D Compact Fluorescent Bulb, 250-Watt...
Get Set for School: Readiness & Writing Pre-K Teacher's Guide (Handwriting Wi...
Get the Edge: A 7-Day Program To Transform Your Life [Audiobook] [Box set] by...
Gift Basket Wrap Bag - 2 Pack - 22" x 30" - Clear [Kitchen]
GOLDEN GATE EDITION 100 PIECE PUZZLE [Toy]
Granite Ware 0722 Stainless Steel Deluxe Food Mill, 2-Quart [Kitchen]
Guess Who's Coming, Jesse Bear [Paperback] by Carlstrom, Nancy White; Degen, ...
Guide to Culturally Competent Health Care (Purnell, Guide to Culturally Compe...
Guinness World Records 2002 [Illustrated] by Antonia Cunningham; Jackie Fresh...
Hawaii [Audio CD] High Llamas
And then here is a sample of the product names in my Sold list:
My Interactive Point-and-Play with Disne...
GE Lighting 40184 Energy Smart 55-Watt 2...
Crayola® Sidewalk Chalk Caddy with Chalk...
Crayola® Sidewalk Chalk Caddy with Chalk...
First Look and Find: Merry Christmas!
Sesame Street Point-and-Play and 10-Book...
Disney Mickey Mouse Board Game - Duck Du...
Nordic Ware Microwave Vegetable and Seaf...
SmartGames BACK 2 BACK
I have played around with searching for the left-most characters, minus 3. This did not work correctly. I have also switched the [range lookup] between TRUE and FALSE, but this has also not worked in a predictable way.
Use the VLOOKUP function, and augment the lookup_value parameter with the LEFT function.
For example, LEFT(E2, 9) could be used to truncate the Sold Item lookup value before matching it against Inventory Item.
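A sketch of what that formula can look like, assuming the Inventory Item names are in column A with their costs in column B, the truncated Sold Item name is in E2, and the truncated names end in a literal "..." (all of these layout details are assumptions, not from the original answer):
=VLOOKUP(SUBSTITUTE(E2, "...", "") & "*", A:B, 2, FALSE)
The "*" wildcard lets the shortened sold name match the longer inventory name, and FALSE keeps the lookup exact apart from that wildcard.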

SOLR Combined and weighted results

I have the following task: Query SOLR and return a weighted list based on multiple conditions.
Example:
I have documents with the following fields, they mostly represent movies:
name, genre, actors, director
I want to return 20 documents sorted on the following condition
The document shares 1 actor and is from the same director (5 points)
The document shares 2 or more actors (3 points)
The document shares the director (3 points)
The document is of the same genre and shares an actor (2 points)
The document is of the same genre (1 point)
Then take these 4 movies:
Id: 1
Name: Harry Potter and the Philosopher's Stone
Genre: Adventure
Director: Chris Columbus
Actors: Daniel Radcliffe, Rupert Grint, Emma Watson
Id: 2
Name: My Week with Marilyn
Genre: Drama
Director: Simon Curtis
Actors: Michelle Williams, Eddie Redmayne, Emma Watson
Id: 3
Name: Percy Jackson & the Olympians: The Lightning Thief
Genre: Adventure
Director: Chris Columbus
Actors: Logan Lerman, Brandon T. Jackson, Alexandra Daddario
Id: 4
Name: Harry Potter and the Chamber of Secrets
Genre: Adventure
Director: Chris Columbus
Actors: Daniel Radcliffe, Rupert Grint, Emma Watson
I want to query the SOLR as such: Return me a list of relevant movies based on movie id==4
The returned result should be:
Id: 1, points: 14 (matches all 5 conditions)
Id: 3, points: 4 (matches condition 3 and 5)
Id: 2, points: 0 (matches 0 conditions)
Is there any way to do this directly within SOLR?
As always thanks in advance :)
You can return weighted results with the DisMax Query Parser; this is called boosting. You can give varying weights to the fields in your document by using query fields (qf), as in the following example. You'll have to modify it to come up with your own formula, but you should be able to get close: start by tweaking the boost numbers, though you might end up needing some more advanced Function Queries.
From your example where you want to find documents that match #4
?q=Genre:'Adventure' Director:'Chris Columbus' Actors:('Daniel Radcliffe' 'Rupert Grint' 'Emma Watson')&qf=Director^2.0+Actors^1.5+Genre^1.0&fl=*,score
//Get everything that matches #4
?q=Genre:'Adventure' Director:'Chris Columbus' Actors:('Daniel Radcliffe' 'Rupert Grint' 'Emma Watson')
//use dismax
&defType=dismax
//boost some fields with query fields (qf)
//this will make a match on director worth the most
//each actor will be worth a little bit less, but 2+ actors will be more
//all matches will be added together to create a score similar to your example
&qf=Director^2.0+Actors^1.5+Genre^1.0
//Make sure you can see the score for debugging
&fl=*,score
I don't think there is a way to do this with Solr out of the box. You could check out http://solr-ra.tgels.com/ to see if this might be something better suited to your needs or maybe show you how to make your own ranking algorithm.
