I've been working off this tutorial here: https://medium.com/swlh/tutorial-web-scraping-instagrams-most-precious-resource-corgis-235bf0389b0c
When I try to create a simpler version of the function "insta_details", which would get the likes and comments of an Instagram photo post, I can't tell what's gone wrong with the code. I think I'm using the XPaths incorrectly (first time using them), but the error message complains about "NoSuchElementException".
from selenium.webdriver import Chrome

def insta_details(urls):
    browser = Chrome()
    post_details = []
    for link in urls:
        browser.get(link)
        likes = browser.find_element_by_partial_link_text('likes').text
        age = browser.find_element_by_css_selector('a time').text
        xpath_comment = '//*[#id="react-root"]/section/main/div/div/article/div[2]/div[1]/ul/li[1]/div/div/div'
        comment = browser.find_element_by_xpath(xpath_comment).text
        insta_link = link.replace('https://www.instagram.com/p', '')
        post_details.append({'link': insta_link, 'likes/views': likes, 'age': age, 'comment': comment})
    return post_details

urls = ['https://www.instagram.com/p/CFdNu1lnCmm/', 'https://www.instagram.com/p/CFYR2OtHDbD/']
insta_details(urls)
Error Message:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"partial link text","selector":"likes"}
Copying and pasting the code from the tutorial hasn't worked for me yet. Am I calling the function wrongly or is there something else in the code?
Looking at the tutorial it seems like your code is incomplete.
Here, try this:
import time
import re
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import Chrome


def find_mentions_or_hashtags(comment, pattern):
    mentions = re.findall(pattern, comment)
    if len(mentions) > 1:
        return mentions
    elif len(mentions) == 1:
        return mentions[0]
    else:
        return ""


def insta_link_details(url):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    browser = Chrome(options=chrome_options)
    browser.get(url)
    try:
        # This captures the standard like count on photo posts.
        likes = browser.find_element_by_xpath(
            """/html/body/div[1]/section/main/div/div/article/
            div[3]/section[2]/div/div/button/span""").text.split()[0]
        post_type = 'photo'
    except:
        # This captures the view/like count for videos, which is stored in a different element.
        likes = browser.find_element_by_xpath(
            """/html/body/div[1]/section/main/div/div/article/
            div[3]/section[2]/div/span/span""").text.split()[0]
        post_type = 'video'
    age = browser.find_element_by_css_selector('a time').text
    comment = browser.find_element_by_xpath(
        """/html/body/div[1]/section/main/div/div[1]/article/
        div[3]/div[1]/ul/div/li/div/div/div[2]/span""").text
    hashtags = find_mentions_or_hashtags(comment, '#[A-Za-z]+')
    mentions = find_mentions_or_hashtags(comment, '@[A-Za-z]+')
    post_details = {'link': url, 'type': post_type, 'likes/views': likes,
                    'age': age, 'comment': comment, 'hashtags': hashtags,
                    'mentions': mentions}
    time.sleep(10)
    return post_details


for url in ['https://www.instagram.com/p/CFdNu1lnCmm/', 'https://www.instagram.com/p/CFYR2OtHDbD/']:
    print(insta_link_details(url))
Output:
{'link': 'https://www.instagram.com/p/CFdNu1lnCmm/', 'type': 'photo', 'likes/views': '4', 'age': '6h', 'comment': 'Natural ingredients for natural skincare is the best way to go, with:\n\n🌿The Body Shop #thebodyshopaust\n☘️The Beauty Chef #thebeautychef\n\nWalk your body to a happier, healthier you with The Body Shop’s fair trade, high quality products. Be a powerhouse of digestive health with The Beauty Chef’s ingenious food supplements. 💪 Even at our busiest, there’s always a way to take care of our health. 💙\n\n5% rebate on all online purchases with #sosure. T&Cs apply. All rates for limited time only.', 'hashtags': '#sosure', 'mentions': ['#thebodyshopaust', '#thebeautychef']}
{'link': 'https://www.instagram.com/p/CFYR2OtHDbD/', 'type': 'photo', 'likes/views': '10', 'age': '2 DAYS AGO', 'comment': 'The weather can dry out your skin and hair this season, and there’s no reason to suffer through more when there’s so much going on! 😘 Look better, feel better and brush better with these great offers for haircare, skin rejuvenation and beauty 💋 Find 5% rewards for purchases at:\n\n💙 Shaver Shop\n💙 Fresh Fragrances\n💙 Happy Hair Brush\n💕 & many more online at our website bio 👆!\n\nSoSure T&Cs apply. All rates for limited time only.\n.\n.\n.\n#sosure #sosureapp #haircare #skincare #perfume #beauty #healthylifestyle #shavershop #freshfragrances #happyhairbrush #onlineshopping #deals #melbournelifestyle #australia #onlinedeals', 'hashtags': ['#sosure', '#sosureapp', '#haircare', '#skincare', '#perfume', '#beauty', '#healthylifestyle', '#shavershop', '#freshfragrances', '#happyhairbrush', '#onlineshopping', '#deals', '#melbournelifestyle', '#australia', '#onlinedeals'], 'mentions': ''}
I've got a basic Google web scraper that returns URLs from the first Google search results page. I want it to include URLs from further pages. What's the best way to paginate this code so that it grabs URLs from pages 2, 3, 4, 5, 6, 7, etc.?
I don't want to go off into space with how many pages I scrape, but I definitely want more than the first page!
import requests
import urllib
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession

def get_source(url):
    try:
        session = HTMLSession()
        response = session.get(url)
        return response
    except requests.exceptions.RequestException as e:
        print(e)

def scrape_google(query):
    query = urllib.parse.quote_plus(query)
    response = get_source("https://www.google.co.uk/search?q=" + query)

    links = list(response.html.absolute_links)
    google_domains = ('https://www.google.',
                      'https://google.',
                      'https://webcache.googleusercontent.',
                      'http://webcache.googleusercontent.',
                      'https://policies.google.',
                      'https://support.google.',
                      'https://maps.google.')

    for url in links[:]:
        if url.startswith(google_domains):
            links.remove(url)

    return links

print(scrape_google('https://www.google.com/search?q=letting agent'))
You can iterate over a specific range() and set the start parameter by multiplying the iteration number by 10 - save your results to a list and use set() to remove duplicates:
data = []

for i in range(3):
    data.extend(scrape_google('letting agent', i*10))

set(data)
Example
import requests

def scrape_google(query, start):
    response = get_source(f"https://www.google.co.uk/search?q={query}&start={start}")

    links = list(response.html.absolute_links)
    google_domains = ('https://www.google.',
                      'https://google.',
                      'https://webcache.googleusercontent.',
                      'http://webcache.googleusercontent.',
                      'https://policies.google.',
                      'https://support.google.',
                      'https://maps.google.')

    for url in links[:]:
        if url.startswith(google_domains):
            links.remove(url)

    return links

data = []

for i in range(3):
    data.extend(scrape_google('letting agent', i*10))

print(set(data))
Output
{'https://www.lettingagenttoday.co.uk/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://howsy.com/&prev=search&pto=aue', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.propertymark.co.uk/professional-standards/consumer-guides/landlords/what-does-a-letting-agent-do.html&prev=search&pto=aue', 'https://www.citizensadvice.org.uk/housing/renting-privately/during-your-tenancy/complaining-about-your-letting-agent/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.allagents.co.uk/find-agent/&prev=search&pto=aue', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.theonlinelettingagents.co.uk/&prev=search&pto=aue', 'https://www.which.co.uk/money/mortgages-and-property/buy-to-let/using-a-letting-agent-a16lu1w364rv', 'https://www.gov.uk/government/publications/non-resident-landord-guidance-notes-for-letting-agents-and-tenants-non-resident-landlords-scheme-guidance-notes', 'https://lettingagentregistration.gov.scot/renew', 'https://en.wikipedia.org/wiki/Letting_agent#Services_and_fees', 'https://patriciashepherd.co.uk/', 'https://dict.leo.org/englisch-deutsch/letting%20agent', 'https://www.diamonds-salesandlettings.co.uk/', 'https://www.lettingagentproperties.com/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.ukala.org.uk/&prev=search&pto=aue', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://register.lettingagentregistration.gov.scot/search&prev=search&pto=aue', 'https://context.reverso.net/%C3%BCbersetzung/englisch-deutsch/letting+agent', 'https://www.cubittandwest.co.uk/landlord-guides/what-is-a-letting-agent/', 'https://en.wikipedia.org/wiki/Letting_agent', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://safeagents.co.uk/&prev=search&pto=aue', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://charlesroseproperties.co.uk/news/letting-agent-vs-estate-agent-the-differences/&prev=search&pto=aue', 'https://www.tenantshop.co.uk/letting-agents/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://lettingagentregistration.gov.scot/renew&prev=search&pto=aue', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.winkworth.co.uk/&prev=search&pto=aue', 'https://objego.de/lp-immobilienverwaltung/', 'https://www.facebook.com/agestateagents/videos/looking-to-instruct-a-letting-agent-not-sure-what-you-should-be-looking-for-or-w/688390845096579/', 'https://www.ukala.org.uk/', 'https://en.wikipedia.org/wiki/Letting_agent#Regulation', 'https://www.foxtons.co.uk/', 'https://ibizaprestige.com/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.which.co.uk/money/mortgages-and-property/buy-to-let/using-a-letting-agent-a16lu1w364rv&prev=search&pto=aue', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.tenantshop.co.uk/letting-agents/&prev=search&pto=aue', 'https://www.dict.cc/?s=letting+agent', 'https://www.landlordaccreditationscotland.com/letting-agent-training/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.gov.uk/government/publications/non-resident-landord-guidance-notes-for-letting-agents-and-tenants-non-resident-landlords-scheme-guidance-notes&prev=search&pto=aue', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.propertyinvestmentsuk.co.uk/what-is-a-letting-agent/&prev=search&pto=aue', 'https://www.propertyinvestmentsuk.co.uk/what-is-a-letting-agent/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.leaders.co.uk/&prev=search&pto=aue', 
'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://en.wikipedia.org/wiki/Letting_agent&prev=search&pto=aue', 'https://www.allagents.co.uk/find-agent/', 'https://www.leaders.co.uk/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.foxtons.co.uk/&prev=search&pto=aue', 'https://howsy.com/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://patriciashepherd.co.uk/&prev=search&pto=aue', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.lettingagenttoday.co.uk/&prev=search&pto=aue', 'https://register.lettingagentregistration.gov.scot/search', 'https://www.linguee.de/englisch-deutsch/uebersetzung/letting+agent.html', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.diamonds-salesandlettings.co.uk/&prev=search&pto=aue', 'https://www.theonlinelettingagents.co.uk/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.lettingagentproperties.com/&prev=search&pto=aue', 'http://www.paul-partner.com/', 'https://www.homeday.de/de/homeday-makler/rhein-main-gebiet-sued/?utm_medium=seo&utm_source=gmb&utm_campaign=rhein_main_gebiet_sued', 'https://www.propertymark.co.uk/professional-standards/consumer-guides/landlords/what-does-a-letting-agent-do.html', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.citizensadvice.org.uk/housing/renting-privately/during-your-tenancy/complaining-about-your-letting-agent/&prev=search&pto=aue', 'https://safeagents.co.uk/', 'https://charlesroseproperties.co.uk/news/letting-agent-vs-estate-agent-the-differences/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.landlordaccreditationscotland.com/letting-agent-training/&prev=search&pto=aue', 'https://move.uk.net/', 'https://www.winkworth.co.uk/', 'https://translate.google.co.uk/translate?hl=de&sl=en&u=https://www.cubittandwest.co.uk/landlord-guides/what-is-a-letting-agent/&prev=search&pto=aue'}
You can scrape Google search results with the BeautifulSoup web scraping library, without needing requests-html.
To extract the results from all available pages dynamically, we need a while loop with a specific condition to exit it. It will go through every page, however many there are, so we don't hardcode a range of page numbers.
In this case, pagination continues as long as the "next" button exists (determined by the presence of its selector on the page, in our case the CSS selector .d6cvqb a[id=pnnext]). If it is present, you increase the value of params["start"] by 10 to access the next page (non-token pagination); otherwise, we exit the while loop:
if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
Google, like other sites, may block your request as a suspected bot if you use requests, since the default User-Agent in the requests library is python-requests.
To avoid this, one step could be to rotate the User-Agent, for example switching between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge and so on. The most reliable approach is to use rotating proxies, rotating User-Agents, and a CAPTCHA solver.
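For example, a minimal sketch of rotating the User-Agent per request (the pool of strings below is just illustrative; extend it with whatever desktop, mobile, and tablet strings you want to cycle through):

import random
import requests

# Illustrative pool of User-Agent strings (example values, not an authoritative list).
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36",
]

def get_with_random_ua(url, **kwargs):
    # Pick a different User-Agent for each request.
    headers = {"User-Agent": random.choice(user_agents)}
    return requests.get(url, headers=headers, timeout=30, **kwargs)

html = get_with_random_ua("https://www.google.co.uk/search", params={"q": "letting agent", "hl": "en", "gl": "uk"})

The same idea extends to proxies: pass a proxies= dict chosen per request in the same way.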
Full code:
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "letting agent",  # query
    "hl": "en",            # language
    "gl": "uk",            # country of the search, UK -> United Kingdom
    "start": 0,            # page offset; the first page starts at 0
    # "num": 100           # parameter defines the maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_num = 0
website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.co.uk/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        website_link = result.select_one(".yuRUbf a")["href"]

        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:
            snippet = None

        website_data.append({
            "title": title,
            "snippet": snippet,
            "website_link": website_link
        })

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "Letting agents in York Anderton McClements. Luxury Lets in ...",
    "snippet": "Anderton McClements are the Letting Agents in York. We offer the best possible service in property letting in York. Contact us today.",
    "website_link": "https://andertonmcclements.co.uk/"
  },
  {
    "title": "Letting Agents near Swansea | Reviews - Yell",
    "snippet": "Search for Letting Agents near you, or submit your own review. ... an experienced letting agent can help you discover your next property to let.",
    "website_link": "https://www.yell.com/s/letting+agents-swansea.html"
  },
  other results...
]
As an alternative, you can use Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it will bypass blocks (including CAPTCHA) from Google, so there's no need to create the parser and maintain it yourself.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key from https://serpapi.com/manage-api-key
    "engine": "google",               # serpapi parser engine
    "q": "letting agent",             # search query
    "gl": "uk",                       # country of the search, UK -> United Kingdom
    "num": "100"                      # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
  {
    "title": "Appeal to private landlords to offer tenancy to those in need",
    "snippet": "“If you are unsure if your property will be suitable, please call us to discuss and if you are a landlord who uses a letting agent and would ...",
    "link": "https://newsroom.shropshire.gov.uk/2022/12/appeal-to-private-landlords-to-offer-tenancy-to-those-in-need/"
  },
  other results...
]
I am having a bit of an issue running my code. I keep getting NameError: name 'size_cost' is not defined.
A little background about my code and what I am trying to do: I am trying to create a program, run from the terminal in VS Code, that has the user enter the pizza size they want and then returns the price that size is associated with in the dictionary called 'size_cost'.
I believe my issue is with the location of the dictionary (size_cost) or with the class/functions I am trying to create.
Here is the code I am running:
class PizzaOrderingSys:

    size_cost = {
        'small': 9.75,
        'large': 12.23,
        'extra large': 13.80,
        'party size': 26.50
    }

    pizza_size_order = []
    available_toppings = ['Anchovies', 'Artichoke Hearts', 'Bacon', 'Basil (Fresh)', 'Bell Peppers', 'Black Olives', 'Chicken', 'Extra Cheese',
                          'Green Chiles', 'Green Olives', 'Pepperoni', 'Ground Beef', 'Jalapenos', 'Mushrooms']
    customer_requested_toppings = []
    number_of_toppings = 0

    def __init__(self, size, toppings):
        self.size = size
        self.toppings = toppings

def shop_title():
    print("Hello and thank you for choosing The Pizza Pie Place! \nBegin your order by telling us what size of pizza you would like.")
    print("After you have chosen your pizza size you will pick your toppings")
    return None

def size_order():
    print('\n12 inch - Small ($9.75 + Tax), 14 inch- Large ($12.30 + Tax), 16 inch- Extra Large ($13.80 + Tax), 24 inch Party Pizza($26.50 + Tax)')
    print('\tWhat size pizza do you want?')
    user_size = input('')
    print(f'Your {user_size} pizza will cost ${size_cost[user_size]}')
    list_pizza = size_cost[user_size]
    pizza_size_order.append(list_pizza)

order_1 = shop_title()
order_1 = size_order()
So what I am asking is: why do I keep getting this error message? Is it because of where my dictionary is located, or am I having issues with my class/functions, and if so, what is wrong with them?
I am fairly new to the coding world, so I thought I would start working with some fundamental elements of Python.
ANY advice would be greatly appreciated! Thank you!
As ForceBru pointed out, size_cost does not exist in the function because it is a class variable. However, what I fail to understand is how you are initialising your class. Your second function (is it inside the class? It's hard to tell, as there is no indentation) asks the user for a size, but a size is also passed in when you initialise the class. To answer your question: yes, you can put the dictionary inside the function. However, if other class methods need to be able to access it, I think you'd be better off having it as a class variable, which is accessible through a class instance.
As a minimal example:
class PizzaOrderingSys:

    size_cost = {
        'small': 9.75,
        'large': 12.23,
        'extra large': 13.80,
        'party size': 26.50
    }

    pizza_size_order = []
    available_toppings = ['Anchovies', 'Artichoke Hearts', 'Bacon', 'Basil (Fresh)', 'Bell Peppers', 'Black Olives', 'Chicken', 'Extra Cheese',
                          'Green Chiles', 'Green Olives', 'Pepperoni', 'Ground Beef', 'Jalapenos', 'Mushrooms']
    customer_requested_toppings = []
    number_of_toppings = 0

    def __init__(self):
        pass

    def my_function(self, string):
        # The class variable is reachable through the instance via self.
        self.cost = self.size_cost[string]


if __name__ == '__main__':
    order_1 = PizzaOrderingSys()
    order_1.my_function('small')
    print(order_1.cost)
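And, applied to your own size_order, a rough sketch (keeping your prompt text, and assuming the method lives inside the class) would look the price up through self:

class PizzaOrderingSys:

    size_cost = {
        'small': 9.75,
        'large': 12.23,
        'extra large': 13.80,
        'party size': 26.50
    }
    pizza_size_order = []

    def size_order(self):
        print('\tWhat size pizza do you want?')
        user_size = input('')
        # The class variable is reached through the instance, so the name resolves.
        print(f'Your {user_size} pizza will cost ${self.size_cost[user_size]}')
        self.pizza_size_order.append(self.size_cost[user_size])


order_1 = PizzaOrderingSys()
order_1.size_order()

You could equally write PizzaOrderingSys.size_cost[user_size]; the point is that the lookup has to go through the class or an instance rather than a bare name.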
I am trying to scrape data from a MouthShut.com user review. Looking at the review in DevTools, the required text of the review is inside the following tag (div class "more reviewdata"):
<div class="more reviewdata"> Ipohone 11 Pro X : Looks alike a minion having Three Eyes. yes its Seems as An Alien, But Technically Iphone is Copying features and Function of Androids and Having Custom Os Phones.Triple Camera is Great! for Wide Angle Photography.But The looks of Iphone 11 pro X isn't Good.If ...<a style="cursor:pointer" onclick="bindreviewcontent('2958778',this,false,'I found this review of Apple iPhone 11 Pro Max 512GB pretty useful',925993570,'.png','I found this review of Apple iPhone 11 Pro Max 512GB pretty useful %23WriteShareWin','https://www.mouthshut.com/review/Apple-iPhone-11-Pro-Max-512GB-review-omnstsstqun','Apple iPhone 11 Pro Max 512GB',' 1/5','omnstsstqun');">Read More</a></div>
I want to extract only the text content of the review. Can anybody help with how to extract it, since there is no unique separator for doing so?
I have written the following code:
from requests import get
from bs4 import BeautifulSoup

url = 'https://www.mouthshut.com/mobile-phones/Apple-iPhone-11-Pro-Max-reviews-925993567'
response = get(url)
print(response.text[:100])

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

reviews = html_soup.find_all('div', class_='more reviewdata')
print(type(reviews))
print(len(reviews))

first_review = reviews[2]
first_review.div
To scrape all reviews from the page, you can use this example. The longer, truncated reviews are fetched separately with a POST request:
import re
import requests
from textwrap import wrap
from bs4 import BeautifulSoup

base_url = 'https://www.mouthshut.com/mobile-phones/Apple-iPhone-11-Pro-Max-reviews-925993567'

data = {
    'type': 'review',
    'reviewid': -1,
    'corp': 'false',
    'catname': ''
}
more_url = 'https://www.mouthshut.com/review/CorporateResponse.ashx'

output = []
with requests.session() as s:
    soup = BeautifulSoup(s.get(base_url).text, 'html.parser')

    for review in soup.select('.reviewdata'):
        a = review.select_one('a[onclick^="bindreviewcontent"]')
        if a:
            # Truncated review: fetch the full text with a POST request using the review id.
            data['reviewid'] = re.findall(r"bindreviewcontent\('(\d+)", a['onclick'])[0]
            comment = BeautifulSoup(s.post(more_url, data=data).text, 'html.parser')
            comment.div.extract()
            comment.ul.extract()
            output.append(comment.get_text(separator=' ', strip=True))
        else:
            review.div.extract()
            output.append(review.get_text(separator=' ', strip=True))

for i, review in enumerate(output, 1):
    print('--- Review no.{} ---'.format(i))
    print(*wrap(review), sep='\n')
    print()
Prints:
--- Review no.1 ---
As you all know Apple products are too expensive this one is damn one
but who needs to sell his kidney to buy its look is not that much ease
than expected. For me it's 2 star phone
--- Review no.2 ---
Very disappointing product.nothing has changed in operating system,
only camera look has changed which is very odd looking.Device weight
is not light and dont fit in one hand.
--- Review no.3 ---
Ipohone 11 Pro X : Looks alike a minion having Three Eyes. yes its
Seems as An Alien, But Technically Iphone is Copying features and
Function of Androids and Having Custom Os Phones. Triple Camera is
Great! for Wide Angle Photography. But The looks of Iphone 11 pro X
isn't Good. If You Have 3 Kidneys, Then You Can Waste one of them to
... and so on.
I am new to web scraping. I am trying to extract data with Python from https://www.clinicaltrialsregister.eu, using the keywords "acute myeloid leukemia", "chronic myeloid leukemia", and "acute lymphoblastic leukemia", to collect the following information: EudraCT Number, Trial Status, Full title of the trial, Name of Sponsor, Country, Medical condition(s) being investigated, and Investigator Networks to be involved in the Trial.
I am trying to collect the URL from each link and then go to each page and extract the information, but I am not getting proper links.
I want URLs like "https://www.clinicaltrialsregister.eu/ctr-search/trial/2014-000526-37/DE" but am getting:
'/ctr-search/trial/2014-000526-37/DE',
'/ctr-search/trial/2006-001777-19/NL',
'/ctr-search/trial/2006-001777-19/BE',
'/ctr-search/trial/2007-000273-35/IT',
'/ctr-search/trial/2011-005934-20/FR',
'/ctr-search/trial/2006-004950-25/GB',
'/ctr-search/trial/2009-017347-33/DE',
'/ctr-search/trial/2012-000334-19/IT',
'/ctr-search/trial/2012-001594-93/FR',
'/ctr-search/trial/2012-001594-93/results',
'/ctr-search/trial/2007-003103-12/DE',
'/ctr-search/trial/2006-004517-17/FR',
'/ctr-search/trial/2013-003421-28/DE',
'/ctr-search/trial/2008-002986-30/FR',
'/ctr-search/trial/2008-002986-30/results',
'/ctr-search/trial/2013-000238-37/NL',
'/ctr-search/trial/2010-018418-53/FR',
'/ctr-search/trial/2010-018418-53/NL',
'/ctr-search/trial/2010-018418-53/HU',
'/ctr-search/trial/2010-018418-53/DE',
'/ctr-search/trial/2010-018418-53/results',
'/ctr-search/trial/2006-006852-37/DE',
'/ctr-search/trial/2006-006852-37/ES',
'/ctr-search/trial/2006-006852-37/AT',
'/ctr-search/trial/2006-006852-37/CZ',
'/ctr-search/trial/2006-006852-37/NL',
'/ctr-search/trial/2006-006852-37/SK',
'/ctr-search/trial/2006-006852-37/HU',
'/ctr-search/trial/2006-006852-37/BE',
'/ctr-search/trial/2006-006852-37/IT',
'/ctr-search/trial/2006-006852-37/FR',
'/ctr-search/trial/2006-006852-37/GB',
'/ctr-search/trial/2008-000664-16/IT',
'/ctr-search/trial/2005-005321-63/IT',
'/ctr-search/trial/2005-005321-63/results',
'/ctr-search/trial/2011-005023-40/GB',
'/ctr-search/trial/2010-022446-24/DE',
'/ctr-search/trial/2010-019710-24/IT',
Attempted Code -
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.clinicaltrialsregister.eu/ctr-search/search?query=acute+myeloid+leukemia&page=1')
soup = BeautifulSoup(page.text, 'html.parser')

#links = [a['href'] for a in soup.find_all('a', href=True) if a.text]

#links_with_text = []
#for a in soup.find_all('a', href=True):
#    if a.text:
#        links_with_text.append(a['href'])

links = [a['href'] for a in soup.find_all('a', href=True)]
Output:
'/help.html',
'/ctr-search/search',
'/joiningtrial.html',
'/contacts.html',
'/about.html',
'/about.html',
'/whatsNew.html',
'/dataquality.html',
'/doc/Sponsor_Contact_Information_EUCTR.pdf',
'/natauthorities.html',
'/links.html',
'/about.html',
'/doc/How_to_Search_EU_CTR.pdf#zoom=100,0,0',
'javascript:void(0)',
'javascript:void(0)',
'javascript:void(0)',
'javascript:void();',
'#tabs-1',
'#tabs-2',
'&page=2',
'&page=3',
'&page=4',
'&page=5',
'&page=6',
'&page=7',
'&page=8',
'&page=9',
'&page=2',
'&page=19',
'/ctr-search/trial/2014-000526-37/DE',
'/ctr-search/trial/2006-001777-19/NL',
'/ctr-search/trial/2006-001777-19/BE',
'/ctr-search/trial/2007-000273-35/IT',
'/ctr-search/trial/2011-005934-20/FR',
'/ctr-search/trial/2006-004950-25/GB',
'/ctr-search/trial/2009-017347-33/DE',
'/ctr-search/trial/2012-000334-19/IT',
'/ctr-search/trial/2012-001594-93/FR',
'/ctr-search/trial/2012-001594-93/results',
'/ctr-search/trial/2007-003103-12/DE',
'/ctr-search/trial/2006-004517-17/FR',
'/ctr-search/trial/2013-003421-28/DE',
'/ctr-search/trial/2008-002986-30/FR',
'/ctr-search/trial/2008-002986-30/results',
'/ctr-search/trial/2013-000238-37/NL',
'/ctr-search/trial/2010-018418-53/FR',
'/ctr-search/trial/2010-018418-53/NL',
'/ctr-search/trial/2010-018418-53/HU',
'/ctr-search/trial/2010-018418-53/DE',
'/ctr-search/trial/2010-018418-53/results',
'/ctr-search/trial/2006-006852-37/DE',
'/ctr-search/trial/2006-006852-37/ES',
'/ctr-search/trial/2006-006852-37/AT',
'/ctr-search/trial/2006-006852-37/CZ',
'/ctr-search/trial/2006-006852-37/NL',
'/ctr-search/trial/2006-006852-37/SK',
'/ctr-search/trial/2006-006852-37/HU',
'/ctr-search/trial/2006-006852-37/BE',
'/ctr-search/trial/2006-006852-37/IT',
'/ctr-search/trial/2006-006852-37/FR',
'/ctr-search/trial/2006-006852-37/GB',
'/ctr-search/trial/2008-000664-16/IT',
'/ctr-search/trial/2005-005321-63/IT',
'/ctr-search/trial/2005-005321-63/results',
'/ctr-search/trial/2011-005023-40/GB',
'/ctr-search/trial/2010-022446-24/DE',
'/ctr-search/trial/2010-019710-24/IT',
'javascript:void(0)',
'&page=2',
'&page=3',
'&page=4',
'&page=5',
'&page=6',
'&page=7',
'&page=8',
'&page=9',
'&page=2',
'&page=19',
'https://servicedesk.ema.europa.eu',
'/disclaimer.html',
'http://www.ema.europa.eu',
'http://www.hma.eu'
As I said, you can achieve this by concatenating the required part of the URL onto every result.
Try this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.clinicaltrialsregister.eu/ctr-search/search?query=acute+myeloid+leukemia&page=1')
soup = BeautifulSoup(page.text, 'html.parser')
links = ["https://www.clinicaltrialsregister.eu" + a['href'] for a in soup.find_all('a', href=True)]
This script will traverse all pages of the search results and try to find the relevant information.
As noted, you need the full URL, i.e. https://www.clinicaltrialsregister.eu prefixed to each relative path, not the relative path on its own.
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.clinicaltrialsregister.eu/ctr-search/search?query=acute+myeloid+leukemia'
url = base_url + '&page=1'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

page = 1
while True:
    print('Page no.{}'.format(page))
    print('-' * 160)
    print()

    for table in soup.select('table.result'):
        print('EudraCT Number: ', end='')
        for span in table.select('td:contains("EudraCT Number:")'):
            print(span.get_text(strip=True).split(':')[1])

        print('Full Title: ', end='')
        for td in table.select('td:contains("Full Title:")'):
            print(td.get_text(strip=True).split(':')[1])

        print('Sponsor Name: ', end='')
        for td in table.select('td:contains("Sponsor Name:")'):
            print(td.get_text(strip=True).split(':')[1])

        print('Trial protocol: ', end='')
        for a in table.select('td:contains("Trial protocol:") a'):
            print(a.get_text(strip=True), end=' ')
        print()

        print('Medical condition: ', end='')
        for td in table.select('td:contains("Medical condition:")'):
            print(td.get_text(strip=True).split(':')[1])

        print('-' * 160)

    next_page = soup.select_one('a:contains("Next»")')
    if next_page:
        soup = BeautifulSoup(requests.get(base_url + next_page['href']).text, 'lxml')
        page += 1
    else:
        break
Prints:
Page no.1
----------------------------------------------------------------------------------------------------------------------------------------------------------------
EudraCT Number: 2014-000526-37
Full Title: An Investigator-Initiated Study To Evaluate Ara-C and Idarubicin in Combination with the Selective Inhibitor Of Nuclear Export (SINE)
Selinexor (KPT-330) in Patients with Relapsed Or Refractory A...
Sponsor Name: GSO Global Clinical Research B.V.
Trial protocol: DE
Medical condition: Patients with relapsed/refractory Acute Myeloid Leukemia (AML)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
EudraCT Number: 2006-001777-19
Full Title: A Phase II multicenter study to assess the tolerability and efficacy of the addition of Bevacizumab to standard induction therapy in AML and
high risk MDS above 60 years.
Sponsor Name: HOVON foundation
Trial protocol: NL BE
Medical condition: Acute myeloid leukaemia (AML), AML FAB M0-M2 or M4-M7;
diagnosis with refractory anemia with excess of blasts (RAEB) or refractory anemia with excess of blasts in transformation (RAEB-T) with an IP...
----------------------------------------------------------------------------------------------------------------------------------------------------------------
EudraCT Number: 2007-000273-35
Full Title: A Phase II, Open-Label, Multi-centre, 2-part study to assess the Safety, Tolerability, and Efficacy of Tipifarnib Plus Bortezomib in the Treatment of Newly Diagnosed Acute Myeloid Leukemia AML ...
Sponsor Name: AZIENDA OSPEDALIERA DI BOLOGNA POLICLINICO S. ORSOLA M. MALPIGHI
Trial protocol: IT
Medical condition: Acute Myeloid Leukemia
----------------------------------------------------------------------------------------------------------------------------------------------------------------
...and so on.
I am trying to output a job's salary, but it says I need to log in to view it. I can successfully output the other job details, like the job title, company, location, etc. I have tried logging in with my account and logging out, but it still says "Login to view salary". My question is: how do I show the salary that requires a login to view? I need someone to help me.
import requests
from bs4 import BeautifulSoup
from mechanize import Browser
import http.cookiejar as cookielib

#creates browser
br = Browser()

#browser options
br.set_handle_robots(False)   #ignore robots
br.set_handle_refresh(False)  #can sometimes hang without this
br.addheaders = [('User-Agent', 'Firefox')]

login_url = "https://myjobstreet.jobstreet.com.my/home/login.php"
cj = cookielib.CookieJar()
br.set_cookiejar(cj)
response = br.open('https://myjobstreet.jobstreet.com.my/home/login.php')

#view available forms
for f in br.forms():
    print(f)

br.select_form('login')

br.set_all_readonly(False)  #allows everything to be written to
br.form['login_id'] = 'my_id'
br.form['password'] = 'my_password'

#submit current form
br.submit()

r = requests.get(url, headers=headers, auth=('user', 'pass'))
soup = BeautifulSoup(r.text, 'lxml')

jobs = soup.find_all("div", {"class": "rRow"})

for job in jobs:
    try:
        salary = job.find_all("div", {"class": "rRowLoc"})
        job_salary = salary[0].text.strip()
    except IndexError:
        pass
    print("Salary: ", job_salary)
This is the output:
Job: Sales Executive
Company: Company
Location: Earth
Salary: Login to view salary
Expected output:
Job: Sales Executive
Company: Company
Location: Earth
Salary: 1000
Your code is not working as posted, but your goal is to scrape the company name, position, location and salary from the page.
You can do the login process using requests.
The salary detail is not present in the HTML because it is loaded through an Ajax request, so any time you look for the salary in the HTML it will be blank.
import requests
import bs4 as bs

headers = {
    'Host': 'myjobstreet.jobstreet.com.my',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31',
}

login_url = 'https://myjobstreet.jobstreet.com.my/home/login.php?site=&language_code=3'
post_data_for_login = {
    "referer_url": "",
    "mobile_referer": "",
    "login_id": "**YOUR EMAIL ID**",
    "password": "**YOUR PASSWORD**",
    "remember": "on",
    "btn_login": "",
    "login": "1"
}

# Create session.
session = requests.session()

# Login request to get cookies.
response = session.post(login_url, data=post_data_for_login, headers=headers)
print('login_response:', response.status_code)

job_page_url = 'https://www.jobstreet.com.my/en/job/fb-service-team-4126557'
job_page_json_url = job_page_url + '/panels'

# Update Host in headers.
headers['Host'] = 'www.jobstreet.com.my'

# Get job details.
response = session.get(job_page_url, headers=headers)

# Fetch company name, position and location details from the HTML.
soup = bs.BeautifulSoup(response.text, 'lxml')
company_name = soup.find("div", {"id": "company_name"}).text.strip()
position_title = soup.find("h1", {"id": "position_title"}).text.strip()
work_location = soup.find("span", {"id": "single_work_location"}).text.strip()
print('Company:', company_name)
print('Position:', position_title)
print('Location:', work_location)

# Get salary data from the JSON endpoint.
response = session.get(job_page_json_url, headers=headers)

# Fetch salary details from the JSON.
if response.status_code == 200:
    json_data = response.json()
    salary_tag = json_data['job_salary']
    soup = bs.BeautifulSoup(salary_tag, 'lxml')
    salary_range = soup.find("span", {"id": "salary_range"}).text
    print('Salary:', salary_range)
Output:
login_response: 200
Company: Copper Bar and Restaurant (88 Armenian Sdn Bhd)
Position: F&B Service Team
Location: Malaysia - Penang
Salary: MYR 2,000 - MYR 2,500
That code is not runnable as posted. There are multiple issues I can see: you don't use login_url, and the variables url and headers are not defined. You instantiate a browser br and use it to log in with br.open, but then you stop using the browser. You should keep using the browser instead of requests.get; the goal is to get the cookies after login and keep using them for the next page. I'm not familiar with mechanize, but this is how you would get the HTML from an open:
response = br.open(url)
print(response.read()) # the text of the page
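For example, a rough sketch of continuing with the same logged-in browser (the listing URL below is a placeholder, and the rRow class comes from your own snippet):

from bs4 import BeautifulSoup

# Keep using the same mechanize browser so the login cookies are sent automatically.
listing_url = 'https://www.jobstreet.com.my/en/job-search/job-vacancy.php'  # placeholder URL
response = br.open(listing_url)
soup = BeautifulSoup(response.read(), 'lxml')

for job in soup.find_all('div', {'class': 'rRow'}):
    print(job.get_text(' ', strip=True))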
A better option might be to open developer tools, look at the network request, right-click it and click "copy as cURL", which will show you how to repeat the request at the command line with cookies and all. See a better explanation plus a gif at https://developers.google.com/web/updates/2015/05/replay-a-network-request-in-curl