Extract URL link(s) from Web page - python-3.x

I am new to web scraping. I am trying to use Python to search https://www.clinicaltrialsregister.eu with the keywords "acute myeloid leukemia", "chronic myeloid leukemia", and "acute lymphoblastic leukemia", and to extract the following information for each trial: EudraCT Number, Trial Status, Full title of the trial, Name of Sponsor, Country, Medical condition(s) being investigated, and Investigator Networks to be involved in the Trial.
My plan is to collect the URL from each search result, then go to each page and extract the information, but I am not getting full links.
I want URLs like "https://www.clinicaltrialsregister.eu/ctr-search/trial/2014-000526-37/DE" but am getting:
'/ctr-search/trial/2014-000526-37/DE',
'/ctr-search/trial/2006-001777-19/NL',
'/ctr-search/trial/2006-001777-19/BE',
'/ctr-search/trial/2007-000273-35/IT',
'/ctr-search/trial/2011-005934-20/FR',
'/ctr-search/trial/2006-004950-25/GB',
'/ctr-search/trial/2009-017347-33/DE',
'/ctr-search/trial/2012-000334-19/IT',
'/ctr-search/trial/2012-001594-93/FR',
'/ctr-search/trial/2012-001594-93/results',
'/ctr-search/trial/2007-003103-12/DE',
'/ctr-search/trial/2006-004517-17/FR',
'/ctr-search/trial/2013-003421-28/DE',
'/ctr-search/trial/2008-002986-30/FR',
'/ctr-search/trial/2008-002986-30/results',
'/ctr-search/trial/2013-000238-37/NL',
'/ctr-search/trial/2010-018418-53/FR',
'/ctr-search/trial/2010-018418-53/NL',
'/ctr-search/trial/2010-018418-53/HU',
'/ctr-search/trial/2010-018418-53/DE',
'/ctr-search/trial/2010-018418-53/results',
'/ctr-search/trial/2006-006852-37/DE',
'/ctr-search/trial/2006-006852-37/ES',
'/ctr-search/trial/2006-006852-37/AT',
'/ctr-search/trial/2006-006852-37/CZ',
'/ctr-search/trial/2006-006852-37/NL',
'/ctr-search/trial/2006-006852-37/SK',
'/ctr-search/trial/2006-006852-37/HU',
'/ctr-search/trial/2006-006852-37/BE',
'/ctr-search/trial/2006-006852-37/IT',
'/ctr-search/trial/2006-006852-37/FR',
'/ctr-search/trial/2006-006852-37/GB',
'/ctr-search/trial/2008-000664-16/IT',
'/ctr-search/trial/2005-005321-63/IT',
'/ctr-search/trial/2005-005321-63/results',
'/ctr-search/trial/2011-005023-40/GB',
'/ctr-search/trial/2010-022446-24/DE',
'/ctr-search/trial/2010-019710-24/IT',
Attempted Code -
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.clinicaltrialsregister.eu/ctr-search/search?query=acute+myeloid+leukemia&page=1')
soup = BeautifulSoup(page.text, 'html.parser')
# links = [a['href'] for a in soup.find_all('a', href=True) if a.text]
# links_with_text = []
# for a in soup.find_all('a', href=True):
#     if a.text:
#         links_with_text.append(a['href'])
links = [a['href'] for a in soup.find_all('a', href=True)]
Output:
'/help.html',
'/ctr-search/search',
'/joiningtrial.html',
'/contacts.html',
'/about.html',
'/about.html',
'/whatsNew.html',
'/dataquality.html',
'/doc/Sponsor_Contact_Information_EUCTR.pdf',
'/natauthorities.html',
'/links.html',
'/about.html',
'/doc/How_to_Search_EU_CTR.pdf#zoom=100,0,0',
'javascript:void(0)',
'javascript:void(0)',
'javascript:void(0)',
'javascript:void();',
'#tabs-1',
'#tabs-2',
'&page=2',
'&page=3',
'&page=4',
'&page=5',
'&page=6',
'&page=7',
'&page=8',
'&page=9',
'&page=2',
'&page=19',
'/ctr-search/trial/2014-000526-37/DE',
'/ctr-search/trial/2006-001777-19/NL',
'/ctr-search/trial/2006-001777-19/BE',
'/ctr-search/trial/2007-000273-35/IT',
'/ctr-search/trial/2011-005934-20/FR',
'/ctr-search/trial/2006-004950-25/GB',
'/ctr-search/trial/2009-017347-33/DE',
'/ctr-search/trial/2012-000334-19/IT',
'/ctr-search/trial/2012-001594-93/FR',
'/ctr-search/trial/2012-001594-93/results',
'/ctr-search/trial/2007-003103-12/DE',
'/ctr-search/trial/2006-004517-17/FR',
'/ctr-search/trial/2013-003421-28/DE',
'/ctr-search/trial/2008-002986-30/FR',
'/ctr-search/trial/2008-002986-30/results',
'/ctr-search/trial/2013-000238-37/NL',
'/ctr-search/trial/2010-018418-53/FR',
'/ctr-search/trial/2010-018418-53/NL',
'/ctr-search/trial/2010-018418-53/HU',
'/ctr-search/trial/2010-018418-53/DE',
'/ctr-search/trial/2010-018418-53/results',
'/ctr-search/trial/2006-006852-37/DE',
'/ctr-search/trial/2006-006852-37/ES',
'/ctr-search/trial/2006-006852-37/AT',
'/ctr-search/trial/2006-006852-37/CZ',
'/ctr-search/trial/2006-006852-37/NL',
'/ctr-search/trial/2006-006852-37/SK',
'/ctr-search/trial/2006-006852-37/HU',
'/ctr-search/trial/2006-006852-37/BE',
'/ctr-search/trial/2006-006852-37/IT',
'/ctr-search/trial/2006-006852-37/FR',
'/ctr-search/trial/2006-006852-37/GB',
'/ctr-search/trial/2008-000664-16/IT',
'/ctr-search/trial/2005-005321-63/IT',
'/ctr-search/trial/2005-005321-63/results',
'/ctr-search/trial/2011-005023-40/GB',
'/ctr-search/trial/2010-022446-24/DE',
'/ctr-search/trial/2010-019710-24/IT',
'javascript:void(0)',
'&page=2',
'&page=3',
'&page=4',
'&page=5',
'&page=6',
'&page=7',
'&page=8',
'&page=9',
'&page=2',
'&page=19',
'https://servicedesk.ema.europa.eu',
'/disclaimer.html',
'http://www.ema.europa.eu',
'http://www.hma.eu'

As I said, you can achieve this by concatenating the required part of the URL to every result.
Try this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.clinicaltrialsregister.eu/ctr-search/search?query=acute+myeloid+leukemia&page=1')
soup = BeautifulSoup(page.text, 'html.parser')
links = ["https://www.clinicaltrialsregister.eu" + a['href'] for a in soup.find_all('a', href=True)]

This script will traverse all pages of the search results and extract the relevant information.
Note that when following the Next» link you have to prepend the full search URL (base_url), not just https://www.clinicaltrialsregister.eu, because the link's href is only the '&page=N' part.
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.clinicaltrialsregister.eu/ctr-search/search?query=acute+myeloid+leukemia'
url = base_url + '&page=1'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

page = 1
while True:
    print('Page no.{}'.format(page))
    print('-' * 160)
    print()
    for table in soup.select('table.result'):
        print('EudraCT Number: ', end='')
        for span in table.select('td:contains("EudraCT Number:")'):
            print(span.get_text(strip=True).split(':')[1])
        print('Full Title: ', end='')
        for td in table.select('td:contains("Full Title:")'):
            print(td.get_text(strip=True).split(':')[1])
        print('Sponsor Name: ', end='')
        for td in table.select('td:contains("Sponsor Name:")'):
            print(td.get_text(strip=True).split(':')[1])
        print('Trial protocol: ', end='')
        for a in table.select('td:contains("Trial protocol:") a'):
            print(a.get_text(strip=True), end=' ')
        print()
        print('Medical condition: ', end='')
        for td in table.select('td:contains("Medical condition:")'):
            print(td.get_text(strip=True).split(':')[1])
        print('-' * 160)
    next_page = soup.select_one('a:contains("Next»")')
    if next_page:
        soup = BeautifulSoup(requests.get(base_url + next_page['href']).text, 'lxml')
        page += 1
    else:
        break
Prints:
Page no.1
----------------------------------------------------------------------------------------------------------------------------------------------------------------
EudraCT Number: 2014-000526-37
Full Title: An Investigator-Initiated Study To Evaluate Ara-C and Idarubicin in Combination with the Selective Inhibitor Of Nuclear Export (SINE)
Selinexor (KPT-330) in Patients with Relapsed Or Refractory A...
Sponsor Name: GSO Global Clinical Research B.V.
Trial protocol: DE
Medical condition: Patients with relapsed/refractory Acute Myeloid Leukemia (AML)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
EudraCT Number: 2006-001777-19
Full Title: A Phase II multicenter study to assess the tolerability and efficacy of the addition of Bevacizumab to standard induction therapy in AML and
high risk MDS above 60 years.
Sponsor Name: HOVON foundation
Trial protocol: NL BE
Medical condition: Acute myeloid leukaemia (AML), AML FAB M0-M2 or M4-M7;
diagnosis with refractory anemia with excess of blasts (RAEB) or refractory anemia with excess of blasts in transformation (RAEB-T) with an IP...
----------------------------------------------------------------------------------------------------------------------------------------------------------------
EudraCT Number: 2007-000273-35
Full Title: A Phase II, Open-Label, Multi-centre, 2-part study to assess the Safety, Tolerability, and Efficacy of Tipifarnib Plus Bortezomib in the Treatment of Newly Diagnosed Acute Myeloid Leukemia AML ...
Sponsor Name: AZIENDA OSPEDALIERA DI BOLOGNA POLICLINICO S. ORSOLA M. MALPIGHI
Trial protocol: IT
Medical condition: Acute Myeloid Leukemia
----------------------------------------------------------------------------------------------------------------------------------------------------------------
...and so on.
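If you also want the full per-trial URLs the question asks for (so you can visit each trial page afterwards), the country links in each result table's "Trial protocol" row carry the '/ctr-search/trial/...' hrefs shown in the question. A minimal sketch based on that assumption:
import requests
from bs4 import BeautifulSoup

url = 'https://www.clinicaltrialsregister.eu/ctr-search/search?query=acute+myeloid+leukemia&page=1'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

trial_links = []
for table in soup.select('table.result'):
    # each country link in the "Trial protocol" row points at a per-country trial page
    for a in table.select('td:contains("Trial protocol:") a'):
        trial_links.append('https://www.clinicaltrialsregister.eu' + a['href'])
print(trial_links)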

Related

Followed IG scraping Tutorial and stuck on XPath/other issue

I've been working off this tutorial: https://medium.com/swlh/tutorial-web-scraping-instagrams-most-precious-resource-corgis-235bf0389b0c
When I try to create a simpler version of the function "insta_details", which would get the likes and comments of an Instagram photo post, I can't tell what's gone wrong with the code. I think I'm using the XPaths wrongly (first time), but the error I get is "NoSuchElementException".
from selenium.webdriver import Chrome

def insta_details(urls):
    browser = Chrome()
    post_details = []
    for link in urls:
        browser.get(link)
        likes = browser.find_element_by_partial_link_text('likes').text
        age = browser.find_element_by_css_selector('a time').text
        xpath_comment = '//*[@id="react-root"]/section/main/div/div/article/div[2]/div[1]/ul/li[1]/div/div/div'
        comment = browser.find_element_by_xpath(xpath_comment).text
        insta_link = link.replace('https://www.instagram.com/p', '')
        post_details.append({'link': insta_link, 'likes/views': likes, 'age': age, 'comment': comment})
    return post_details

urls = ['https://www.instagram.com/p/CFdNu1lnCmm/', 'https://www.instagram.com/p/CFYR2OtHDbD/']
insta_details(urls)
Error Message:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"partial link text","selector":"likes"}
Copying and pasting the code from the tutorial hasn't worked for me yet. Am I calling the function wrongly or is there something else in the code?
Looking at the tutorial it seems like your code is incomplete.
Here, try this:
import time
import re
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import Chrome

def find_mentions_or_hashtags(comment, pattern):
    mentions = re.findall(pattern, comment)
    if (len(mentions) > 1) & (len(mentions) != 1):
        return mentions
    elif len(mentions) == 1:
        return mentions[0]
    else:
        return ""

def insta_link_details(url):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    browser = Chrome(options=chrome_options)
    browser.get(url)
    try:
        # This captures the standard like count.
        likes = browser.find_element_by_xpath(
            """/html/body/div[1]/section/main/div/div/article/
            div[3]/section[2]/div/div/button/span""").text.split()[0]
        post_type = 'photo'
    except:
        # This captures the like count for videos, which is stored in a different element.
        likes = browser.find_element_by_xpath(
            """/html/body/div[1]/section/main/div/div/article/
            div[3]/section[2]/div/span/span""").text.split()[0]
        post_type = 'video'
    age = browser.find_element_by_css_selector('a time').text
    comment = browser.find_element_by_xpath(
        """/html/body/div[1]/section/main/div/div[1]/article/
        div[3]/div[1]/ul/div/li/div/div/div[2]/span""").text
    hashtags = find_mentions_or_hashtags(comment, '#[A-Za-z]+')
    mentions = find_mentions_or_hashtags(comment, '@[A-Za-z]+')
    post_details = {'link': url, 'type': post_type, 'likes/views': likes,
                    'age': age, 'comment': comment, 'hashtags': hashtags,
                    'mentions': mentions}
    time.sleep(10)
    return post_details

for url in ['https://www.instagram.com/p/CFdNu1lnCmm/', 'https://www.instagram.com/p/CFYR2OtHDbD/']:
    print(insta_link_details(url))
Output:
{'link': 'https://www.instagram.com/p/CFdNu1lnCmm/', 'type': 'photo', 'likes/views': '4', 'age': '6h', 'comment': 'Natural ingredients for natural skincare is the best way to go, with:\n\n🌿The Body Shop @thebodyshopaust\n☘️The Beauty Chef @thebeautychef\n\nWalk your body to a happier, healthier you with The Body Shop’s fair trade, high quality products. Be a powerhouse of digestive health with The Beauty Chef’s ingenious food supplements. 💪 Even at our busiest, there’s always a way to take care of our health. 💙\n\n5% rebate on all online purchases with #sosure. T&Cs apply. All rates for limited time only.', 'hashtags': '#sosure', 'mentions': ['@thebodyshopaust', '@thebeautychef']}
{'link': 'https://www.instagram.com/p/CFYR2OtHDbD/', 'type': 'photo', 'likes/views': '10', 'age': '2 DAYS AGO', 'comment': 'The weather can dry out your skin and hair this season, and there’s no reason to suffer through more when there’s so much going on! 😘 Look better, feel better and brush better with these great offers for haircare, skin rejuvenation and beauty 💋 Find 5% rewards for purchases at:\n\n💙 Shaver Shop\n💙 Fresh Fragrances\n💙 Happy Hair Brush\n💕 & many more online at our website bio 👆!\n\nSoSure T&Cs apply. All rates for limited time only.\n.\n.\n.\n#sosure #sosureapp #haircare #skincare #perfume #beauty #healthylifestyle #shavershop #freshfragrances #happyhairbrush #onlineshopping #deals #melbournelifestyle #australia #onlinedeals', 'hashtags': ['#sosure', '#sosureapp', '#haircare', '#skincare', '#perfume', '#beauty', '#healthylifestyle', '#shavershop', '#freshfragrances', '#happyhairbrush', '#onlineshopping', '#deals', '#melbournelifestyle', '#australia', '#onlinedeals'], 'mentions': ''}
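A side note of mine, not from the answer above: the hard-coded absolute XPaths and the fixed time.sleep(10) tend to break or waste time when Instagram changes its markup or loads slowly. An explicit wait retries until an element appears; a minimal sketch, assuming the same 'a time' selector used above:
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = Chrome()
browser.get('https://www.instagram.com/p/CFdNu1lnCmm/')
# wait up to 15 seconds for the timestamp element instead of sleeping a fixed time
age = WebDriverWait(browser, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'a time'))
).text
print(age)
browser.quit()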

How can I add all the elements which have the same class to a variable using Selenium in Python

content = driver.find_element_by_class_name('topics-sec-block')
container = content.find_elements_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]')
The code is below:
for i in range(0, 40):
    title = []
    url = []
    heading = container[i].find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a/h2').text
    link = container[i].find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a')
    title.append(heading)
    url.append(link.get_attribute('href'))
    print(title)
    print(url)
It gives me 40 lines, but all lines have the same title and URL (some of them are shown below):
['Stuck in Mexico: Central American asylum seekers in limbo']
['https://www.aljazeera.com/news/2020/03/stuck-mexico-central-american-asylum-seekers-limbo-200305103910955.html']
['Stuck in Mexico: Central American asylum seekers in limbo']
['https://www.aljazeera.com/news/2020/03/stuck-mexico-central-american-asylum-seekers-limbo-200305103910955.html']
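A hedged note: an XPath that starts with // searches the whole document even when it is called on an element, so every iteration returns the same first match; scoping the expression to the current element with a leading "." is the usual fix. A minimal sketch, reusing the question's driver and selectors:
content = driver.find_element_by_class_name('topics-sec-block')
containers = content.find_elements_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]')

titles = []
urls = []
for item in containers:
    heading = item.find_element_by_xpath('./a/h2').text   # "./" = relative to this container
    link = item.find_element_by_xpath('./a')
    titles.append(heading)
    urls.append(link.get_attribute('href'))
print(titles)
print(urls)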

I am not able to scrape the data in Python for the following HTML

I am trying to scrape the user reviews from MouthShut.com. Looking at a review in DevTools, the required text of the review is inside the following tag ("more reviewdata"):
<div class="more reviewdata"> Ipohone 11 Pro X : Looks alike a minion having Three Eyes. yes its Seems as An Alien, But Technically Iphone is Copying features and Function of Androids and Having Custom Os Phones.Triple Camera is Great! for Wide Angle Photography.But The looks of Iphone 11 pro X isn't Good.If ...<a style="cursor:pointer" onclick="bindreviewcontent('2958778',this,false,'I found this review of Apple iPhone 11 Pro Max 512GB pretty useful',925993570,'.png','I found this review of Apple iPhone 11 Pro Max 512GB pretty useful %23WriteShareWin','https://www.mouthshut.com/review/Apple-iPhone-11-Pro-Max-512GB-review-omnstsstqun','Apple iPhone 11 Pro Max 512GB',' 1/5','omnstsstqun');">Read More</a></div>
I want to extract only the text content of the review. Can anybody help with how to extract it, as there is no unique separator to do so?
I have written the following code:
from requests import get
base_url = 'https://www.mouthshut.com/mobile-phones/Apple-iPhone-11-Pro-Max-reviews-925993567'
response = get(base_url)
print(response.text[:100])
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
reviews = html_soup.find_all('div', class_ = 'more reviewdata')
print(type(reviews))
print(len(reviews))
first_review = reviews[2]
first_review.div
To scrape all reviews from the page, you can use this example. Some larger reviews are fetched separately with a POST request:
import re
import requests
from textwrap import wrap
from bs4 import BeautifulSoup

base_url = 'https://www.mouthshut.com/mobile-phones/Apple-iPhone-11-Pro-Max-reviews-925993567'
data = {
    'type': 'review',
    'reviewid': -1,
    'corp': 'false',
    'catname': ''
}
more_url = 'https://www.mouthshut.com/review/CorporateResponse.ashx'

output = []
with requests.session() as s:
    soup = BeautifulSoup(s.get(base_url).text, 'html.parser')
    for review in soup.select('.reviewdata'):
        a = review.select_one('a[onclick^="bindreviewcontent"]')
        if a:
            data['reviewid'] = re.findall(r"bindreviewcontent\('(\d+)", a['onclick'])[0]
            comment = BeautifulSoup(s.post(more_url, data=data).text, 'html.parser')
            comment.div.extract()
            comment.ul.extract()
            output.append(comment.get_text(separator=' ', strip=True))
        else:
            review.div.extract()
            output.append(review.get_text(separator=' ', strip=True))

for i, review in enumerate(output, 1):
    print('--- Review no.{} ---'.format(i))
    print(*wrap(review), sep='\n')
    print()
Prints:
--- Review no.1 ---
As you all know Apple products are too expensive this one is damn one
but who needs to sell his kidney to buy its look is not that much ease
than expected. For me it's 2 star phone
--- Review no.2 ---
Very disappointing product.nothing has changed in operating system,
only camera look has changed which is very odd looking.Device weight
is not light and dont fit in one hand.
--- Review no.3 ---
Ipohone 11 Pro X : Looks alike a minion having Three Eyes. yes its
Seems as An Alien, But Technically Iphone is Copying features and
Function of Androids and Having Custom Os Phones. Triple Camera is
Great! for Wide Angle Photography. But The looks of Iphone 11 pro X
isn't Good. If You Have 3 Kidneys, Then You Can Waste one of them to
... and so on.
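If you want to keep the scraped reviews instead of just printing them, a small sketch of mine for writing the output list built by the code above to a CSV file (the filename is just an example):
import csv

# 'output' is the list of review strings collected by the scraper above
with open('mouthshut_reviews.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['review_no', 'review_text'])
    for i, review in enumerate(output, 1):
        writer.writerow([i, review])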

Text in a class - href statement

How could I get all the categories mentioned on each listing page of the website "https://www.sfma.org.sg/member/category"? For example, when I choose the Alcoholic Beverage category on the above-mentioned page, the listings on that page have category information like this:
Catergory: Alcoholic Beverage, Bottled Beverage, Spirit / Liquor / Hard Liquor, Wine, Distributor, Exporter, Importer, Supplier
How can I extract the categories mentioned here within the same variable?
The code I have written for this is:
category = soup_2.find_all('a', attrs ={'class' :'clink'})
links = [links['href'] for links in category]
cat_name = [cat_name.text.strip() for cat_name in links]
But it is producing the output below, which is all the links on the page and not the text within the href:
['http://www.sfma.org.sg/about/singapore-food-manufacturers-association',
'http://www.sfma.org.sg/about/council-members',
'http://www.sfma.org.sg/about/history-and-milestones',
'http://www.sfma.org.sg/membership/',
'http://www.sfma.org.sg/member/',
'http://www.sfma.org.sg/member/alphabet/',
'http://www.sfma.org.sg/member/category/',
'http://www.sfma.org.sg/resources/sme-portal',
'http://www.sfma.org.sg/resources/setting-up-food-establishments-in-singapore',
'http://www.sfma.org.sg/resources/import-export-requirements-and-procedures',
'http://www.sfma.org.sg/resources/labelling-guidelines',
'http://www.sfma.org.sg/resources/wsq-continuing-education-modular-programmes',
'http://www.sfma.org.sg/resources/holistic-industry-productivity-scorecard',
'http://www.sfma.org.sg/resources/p-max',
'http://www.sfma.org.sg/event/',
.....]
What I need is the data below for all the listings of all the categories on the base URL, which is "https://www.sfma.org.sg/member/category/":
['Ang Leong Huat Pte Ltd',
'16 Tagore Lane
Singapore (787476)',
'Tel: +65 6749 9988',
'Fax: +65 6749 4321',
'Email: sales@alh.com.sg',
'Website: http://www.alh.com.sg/',
'Catergory: Alcoholic Beverage, Bottled Beverage, Spirit / Liquor / Hard Liquor, Wine, Distributor, Exporter, Importer, Supplier'
Please excuse me if the question seems novice; I am just very new to Python.
Thanks!!!
The following targets the two JavaScript objects housing mapping info about company names, categories, and the shown tags (e.g. bakery product). For more detailed info on the use of regex and splitting item['category'], see my SO answer here.
It handles the unquoted keys with the hjson library.
You end up with a dict whose keys are the company names (I use the permalink version of the name, over name, as this should definitely be unique), and whose values are a tuple with 2 items. The first item is the company page link; the second is a list of the given tags (e.g. bakery product, alcoholic beverage). The logic is there for you to re-organise as desired.
import re
import requests
from bs4 import BeautifulSoup as bs
import hjson

base = 'https://www.sfma.org.sg/member/info/'
p = re.compile(r'var tmObject = (.*?);')
p1 = re.compile(r'var ddObject = (.*?);')

r = requests.get('https://www.sfma.org.sg/member/category/manufacturer')
data = hjson.loads(p.findall(r.text)[0])
lookup_data = hjson.loads(p1.findall(r.text)[0])
name_dict = {item['id']: item['name'] for item in lookup_data['category']}

companies = {}
for item in data['tmember']:
    companies[item['permalink']] = (base + item['permalink'], [name_dict[i] for i in item['category'].split(',')])
print(companies)
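To illustrate the regex + hjson step in isolation, here is a toy fragment of my own (the object below is made up, not the real page data; the real keys are the ones used in the code above):
import re
import hjson

# a made-up inline-script fragment mimicking the page's "var tmObject = {...};"
html = '<script>var tmObject = {tmember: [{permalink: "acme-foods", category: "1,2"}]};</script>'
p = re.compile(r'var tmObject = (.*?);')
data = hjson.loads(p.findall(html)[0])   # hjson tolerates the unquoted keys
print(data['tmember'][0]['permalink'])   # -> acme-foods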
Updating for your additional request at the end (address info etc.):
I then loop over the companies dict, visiting each company URL (item 1 of the tuple stored as the value for the current key), extract the required info into a dict, add the category info to it, and then update the current key's value with the dictionary just created.
import re
import requests
from bs4 import BeautifulSoup as bs
import hjson

base = 'https://www.sfma.org.sg/member/info/'
p = re.compile(r'var tmObject = (.*?);')
p1 = re.compile(r'var ddObject = (.*?);')

r = requests.get('https://www.sfma.org.sg/member/category/manufacturer')
data = hjson.loads(p.findall(r.text)[0])
lookup_data = hjson.loads(p1.findall(r.text)[0])
name_dict = {item['id']: item['name'] for item in lookup_data['category']}

companies = {}
for item in data['tmember']:
    companies[item['permalink']] = (base + item['permalink'], [name_dict[i] for i in item['category'].split(',')])

with requests.Session() as s:
    for k, v in companies.items():
        r = s.get(v[0])
        soup = bs(r.content, 'lxml')
        tel = soup.select_one('.w3-text-sfma ~ p:contains(Tel)')
        fax = soup.select_one('.w3-text-sfma ~ p:contains(Fax)')
        email = soup.select_one('.w3-text-sfma ~ p:contains(Email)')
        website = soup.select_one('.w3-text-sfma ~ p:contains(Website)')
        if tel is None:
            tel = 'N/A'
        else:
            tel = tel.text.replace('Tel: ', '')
        if fax is None:
            fax = 'N/A'
        else:
            fax = fax.text.replace('Fax: ', '')
        if email is None:
            email = 'N/A'
        else:
            email = email.text.replace('Email: ', '')
        if website is None:
            website = 'N/A'
        else:
            website = website.text.replace('Website: ', '')
        info = {
            # 'Address': ' '.join([i.text for i in soup.select('.w3-text-sfma ~ p:not(p:nth-child(n+4) ~ p)')])
            'Address': ' '.join([i.text for i in soup.select('.w3-text-sfma ~ p:nth-child(-n+4)')]),
            'Tel': tel,
            'Fax': fax,
            'Email': email,
            'Website': website,
            'Categories': v[1]
        }
        companies[k] = info
Example entry in companies dict:
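For the listing quoted in the question, the entry would look roughly like this (the permalink key is a guess; the field values are taken from the question's expected data):
{'ang-leong-huat-pte-ltd': {
    'Address': '16 Tagore Lane Singapore (787476)',
    'Tel': '+65 6749 9988',
    'Fax': '+65 6749 4321',
    'Email': 'sales@alh.com.sg',
    'Website': 'http://www.alh.com.sg/',
    'Categories': ['Alcoholic Beverage', 'Bottled Beverage', 'Spirit / Liquor / Hard Liquor',
                   'Wine', 'Distributor', 'Exporter', 'Importer', 'Supplier']
}}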

Web scraping with Python that requires login to view output

I am trying to output the job's salary, but it says I need to log in to view it. I can successfully output the other job details like the job title, company, location, etc. I have tried logging in with my account and logging out, but it still says "Login to view salary". My question is, how do I show the salary, which requires a login to view? I need someone to help me.
import requests
from bs4 import BeautifulSoup
from mechanize import Browser
import http.cookiejar as cookielib

# creates browser
br = Browser()

# browser options
br.set_handle_robots(False)    # ignore robots
br.set_handle_refresh(False)   # can sometimes hang without this
br.addheaders = [('User-Agent', 'Firefox')]

login_url = "https://myjobstreet.jobstreet.com.my/home/login.php"
cj = cookielib.CookieJar()
br.set_cookiejar(cj)
response = br.open('https://myjobstreet.jobstreet.com.my/home/login.php')

# view available forms
for f in br.forms():
    print(f)

br.select_form('login')
br.set_all_readonly(False)    # allows everything to be written to
br.form['login_id'] = 'my_id'
br.form['password'] = 'my_password'

# submit current form
br.submit()

r = requests.get(url, headers=headers, auth=('user', 'pass'))
soup = BeautifulSoup(r.text, 'lxml')

jobs = soup.find_all("div", {"class": "rRow"})
for job in jobs:
    try:
        salary = job.find_all("div", {"class": "rRowLoc"})
        job_salary = salary[0].text.strip()
    except IndexError:
        pass
    print("Salary: ", job_salary)
This is the output:
Job: Sales Executive
Company: Company
Location: Earth
Salary: Login to view salary
Expected output:
Job: Sales Executive
Company: Company
Location: Earth
Salary: 1000
Your code is not working, but your goal is to scrape Company Name, Position, Location and Salary from the page.
You can do your login process using requests.
The salary detail is not available in the HTML because it comes through an Ajax request, so whenever you look for the salary in the HTML it will be blank.
import requests
import bs4 as bs

headers = {
    'Host': 'myjobstreet.jobstreet.com.my',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31',
}

login_url = 'https://myjobstreet.jobstreet.com.my/home/login.php?site=&language_code=3'
post_data_for_login = {
    "referer_url": "",
    "mobile_referer": "",
    "login_id": "**YOUR EMAIL ID**",
    "password": "**YOUR PASSWORD**",
    "remember": "on",
    "btn_login": "",
    "login": "1"
}

# Create Session.
session = requests.session()
# Login request to get cookies.
response = session.post(login_url, data=post_data_for_login, headers=headers)
print('login_response:', response.status_code)

job_page_url = 'https://www.jobstreet.com.my/en/job/fb-service-team-4126557'
job_page_json_url = job_page_url + '/panels'

# Update Host in headers.
headers['Host'] = 'www.jobstreet.com.my'

# Get Job details.
response = session.get(job_page_url, headers=headers)

# Fetch Company Name, Position and Location details from HTML.
soup = bs.BeautifulSoup(response.text, 'lxml')
company_name = soup.find("div", {"id": "company_name"}).text.strip()
position_title = soup.find("h1", {"id": "position_title"}).text.strip()
work_location = soup.find("span", {"id": "single_work_location"}).text.strip()
print('Company:', company_name)
print('Position:', position_title)
print('Location:', work_location)

# Get Salary data from JSON.
response = session.get(job_page_json_url, headers=headers)

# Fetch Salary details from JSON.
if response.status_code == 200:
    json_data = response.json()
    salary_tag = json_data['job_salary']
    soup = bs.BeautifulSoup(salary_tag, 'lxml')
    salary_range = soup.find("span", {"id": "salary_range"}).text
    print('Salary:', salary_range)
Output:
login_response: 200
Company: Copper Bar and Restaurant (88 Armenian Sdn Bhd)
Position: F&B Service Team
Location: Malaysia - Penang
Salary: MYR 2,000 - MYR 2,500
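Because the login cookies live on the session, you can reuse it for several job pages with the same '/panels' trick (a sketch of mine; the list below just repeats the job URL from the answer as a placeholder):
job_urls = [
    'https://www.jobstreet.com.my/en/job/fb-service-team-4126557',
    # ...add more job URLs here
]
for job_url in job_urls:
    # 'session' and 'headers' are the ones created in the answer above
    r = session.get(job_url + '/panels', headers=headers)
    if r.status_code == 200:
        salary_html = bs.BeautifulSoup(r.json()['job_salary'], 'lxml')
        print(job_url, '->', salary_html.find("span", {"id": "salary_range"}).text)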
That code is not runnable. There are multiple issues I can see: you don't use login_url, and the variables url and headers are not defined. You instantiate a browser br and use it to log in with br.open, but then you stop using the browser. You should keep using the browser instead of requests.get. Your goal should be to get the cookies after login and keep using those cookies for the next page. I'm not familiar with mechanize, but this is how you would get the HTML from an open:
response = br.open(url)
print(response.read()) # the text of the page
A better option might be to open developer tools, look at the network request, right-click it and click "Copy as cURL", which will show you how to repeat the request at the command line with cookies and all. See a better explanation plus a GIF at https://developers.google.com/web/updates/2015/05/replay-a-network-request-in-curl
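If you do want to hand off to requests after logging in with mechanize, the cookies in the question's cookie jar cj can be copied into a requests session (a hedged sketch of mine, not something I have tested against JobStreet):
import requests

# cj is the http.cookiejar.CookieJar attached to the mechanize browser in the question
s = requests.Session()
for cookie in cj:
    s.cookies.set_cookie(cookie)   # carry the login cookies over

r = s.get('https://myjobstreet.jobstreet.com.my/home/login.php')  # any page that needs the login
print(r.status_code)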
