Python - requests, lxml and xpath not working - python-3.x

I am trying to write some Python to scrape the web for firmware/driver updates, but different web pages are responding differently.
I've used the requests and lxml packages to find the information based on XPath. The XPath was found by opening the URL in Chrome, right-clicking the data and choosing Inspect, then right-clicking the highlighted code and selecting Copy XPath.
WORKING EXAMPLE
Intel NUC at https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK.
As of 2019-12-25 the value it correctly picks up is "24.3".
import requests
from lxml import html

url = "https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK"
page = requests.get(url)
tree = html.fromstring(page.content)  # parse the response so xpath() can be run against it
XpathToFWtype = '//*[@id="search-results"]/tbody/tr[1]/td[4]/text()'
print(tree.xpath(XpathToFWtype))  # first entry is the firmware version, "24.3" as of 2019-12-25
FAILING EXAMPLE
Similar logic fails for the ASUS website, where it should scrape the firmware text "Version 1.1.2.3_790":
https://www.asus.com/lk/Networking/DSL-AC56U/HelpDesk_BIOS/
The failing XPath, as copied from the Inspect pane, is:
//*[@id="Manual-Download"]/div[2]/div[2]/div/div/section/div[1]/div[1]/span[1]
Everything I try fails, whether I add "/text()" or any other variation. The pages differ in that "view source" shows the text for the Intel URL but not for the ASUS one, so the ASUS content is being generated dynamically somewhere. After days of trying everything, I am unsure what to do next.
import requests
from lxml import html

url = "https://www.asus.com/lk/Networking/DSL-AC56U/HelpDesk_BIOS/"
page = requests.get(url)
tree = html.fromstring(page.content)
XpathToFWtype = '//*[@id="Manual-Download"]/div[2]/div[2]/div/div/section/div[1]/div[1]/span[1]/text()'
tree.xpath(XpathToFWtype)
# returns [] or raises lxml errors for every variation I try :-(
Thanks for any suggestion or direction, it's really appreciated.

For the Intel website you can do the following:
import requests
from bs4 import BeautifulSoup

r = requests.get(
    "https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK")
soup = BeautifulSoup(r.text, 'html.parser')

for item in soup.findAll("td", {'class': 'dc-version collapsible-col collapsible1'}):
    item = item.text
    # the version number runs straight into a label starting with "L"
    # (e.g. "Latest"), so cut the string at the first "L"
    print(item[0:item.find("L")])
Output:
24.3
0054
1.0.0
6.1.9
15.40.41.5058
1.01
1
6.0.1.7982
11.0.6.1194
15.36.28.4332
15.40.13.4331
15.36.26.4294
14.5.0.1081
2.4.2013.711
10.1.1.8
10.0.27
2.4.2013.711
2.4.2013.711
For the ASUS website, it's actually using JavaScript to render its content, so you would normally need Selenium or PhantomJS. But I've been able to locate the XHR call to the JSON API in the browser's Network tab and called it directly with requests :).
import requests

r = requests.get(
    "https://www.asus.com/support/api/product.asmx/GetPDBIOS?website=lk&pdhashedid=RtHWWdjImSzhdG92&model=DSL-AC56U&cpu=").json()

for item in r['Result']['Obj']:
    for data in item['Files']:
        print(data['Version'])
Output:
1.1.2.3_790
1.1.2.3_743
1.1.2.3_674
1.1.2.3_617
1.1.2.3_552
1.1.2.3_502
1.1.2.3_473
You can parse whatever you need from here :) https://www.asus.com/support/api/product.asmx/GetPDBIOS?website=lk&pdhashedid=RtHWWdjImSzhdG92&model=DSL-AC56U&cpu=
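For example, to pull out just the newest firmware version (assuming, as the output above suggests, that the API lists the newest release first):

import requests

r = requests.get(
    "https://www.asus.com/support/api/product.asmx/GetPDBIOS?website=lk&pdhashedid=RtHWWdjImSzhdG92&model=DSL-AC56U&cpu=").json()

# assumption: the first entry in the JSON is the latest release, as in the output above
latest = r['Result']['Obj'][0]['Files'][0]['Version']
print(latest)  # 1.1.2.3_790 at the time of writing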

Related

Why do I get gibberish when I try to web scrape Google search results?

I am trying to make a web scraper using Python 3 that pulls the URLs from the Google search page.
My code is as follows:
from bs4 import BeautifulSoup
from requests import get

e = get("https://www.google.com/search?q=Keyword", verify=True)
soup = BeautifulSoup(e.content, 'html.parser')
print(soup.cite)
Now when I try to get the URLs (which are marked with the 'cite' tag), I don't get any.
Everything seems ok, but I don't get any results. What's going on?
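One common cause is that Google serves different, stripped-down markup to clients that don't send a browser-like User-Agent header, so the usual first thing to try is adding one. A minimal sketch of that workaround (the header string is just an example, and Google's markup changes often, so this is not guaranteed to fix it):

from bs4 import BeautifulSoup
import requests

# example desktop User-Agent string -- any realistic browser string should do
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
e = requests.get("https://www.google.com/search?q=Keyword", headers=headers)
soup = BeautifulSoup(e.content, 'html.parser')
print(soup.cite)  # may still be None if Google returns a consent or JS-only page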

How to know which tags to use in scraping?

Is there any logic which tags should be used in scraping?
Right now I'm just doing trial and error on different tag variations to see which works. It takes a lot of time and is really frustrating. I can't understand the logic as to why some tags work and some don't. For example, the code below works fine:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://finance.yahoo.com/quote/IWDA.AS?p=IWDA.AS&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, 'html.parser')
test1 = soup.find_all('div', attrs={'id':'app'})
print(test1)
However, just a slight change to the code and the result is "None":
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://finance.yahoo.com/quote/IWDA.AS?p=IWDA.AS&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, 'html.parser')
test2 = soup.find_all('div', attrs={'id':'YDC-Lead-Stack-Composite'})
print(test2)
Is there any logical explanation why the first example (test1) returns values and why the second example (test2) doesn't return any value?
Is there an efficient way to know which tags will work?
Looks to me like you're trying to scrape a React web app, which will be impossible via the usual web-scraping methods.
If you view the raw source (before the scripts are loaded), you'll find that the app's content is not there, since it runs in JavaScript and fetches the data at runtime; a quick programmatic version of this check is sketched after the two options below.
There are two options here:
Find out if there is an API you can query (instead of scraping)
Load the page in a browser and use selenium to scrape (see https://selenium-python.readthedocs.io/getting-started.html)
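As a sketch of that check, using the asker's URL: test whether the id you want appears in the raw response at all; if it doesn't, only an API call or a browser-driven approach like Selenium will reach it.

import requests

page = requests.get('https://finance.yahoo.com/quote/IWDA.AS?p=IWDA.AS&.tsrc=fin-srch')

# an id present in the raw response can be scraped with BeautifulSoup;
# an id injected later by JavaScript will be missing here
for element_id in ('app', 'YDC-Lead-Stack-Composite'):
    print(element_id, element_id in page.text)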

Scraping from javascript in HTML tags using beautifulsoup

I am trying to scrape the names under all the alphabet tabs (A to Z, and also 0-9) of this website: http://www.smfederation.org.sg/membership/members-directory
But the names seem to be hidden behind href="javascript:void(0)".
Below is my code
import requests
from bs4 import BeautifulSoup

url = "http://www.smfederation.org.sg/membership/members-directory"
detail = requests.get(url)
soup = BeautifulSoup(detail.content, 'html.parser')
I have no idea how to approach JavaScript within HTML.
What should I add to the above code to fetch the names of all listings?
You are scraping the wrong URL. Open your browser's inspector, go to the Network tab, and you will see that the names come from http://smfederation.org.sg/account/getaccounts
It's in JSON format, so it becomes a Python dictionary when you load it using the .json() method of the response object returned by requests:
>>> import requests
>>> accounts = requests.get("http://www.smfederation.org.sg/account/getaccounts").json()
>>> accounts["data"][0]["accountname"]
'OPTO-PRECISION PTE LTD'
You can also print every account name using a for loop, such as:
for account in accounts["data"]:
    print(account["accountname"])

Web scraping issue searching for contents in Youtube trending page with BeautifulSoup

I am trying to build an app that writes the top 10 YouTube trending videos to an Excel file, but I ran into an issue right at the beginning. For some reason, whenever I try to use soup.find on any of the ids on this YouTube page, it returns None.
I have made sure that my spelling is correct, but it still won't work. I have tried the same code on other sites and get the same error.
#What I did for Youtube which resulted in output being "None"
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')
videos = soup.find(id="contents")
print(videos)
I expect it to provide me with the HTML code that has the id I specified, but it keeps saying "None".
The page is using heavy JavaScript to modify the classes and attributes of tags. What you see in Developer Tools isn't always what requests gives you. I recommend calling print(soup.prettify()) to see what markup you're actually working with.
You can use this script to get the first 10 trending videos:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')

for i, a in enumerate(soup.select('h3.yt-lockup-title a[title]')[:10], 1):
    print('{: <4}{}'.format(str(i)+'.', a['title']))
Prints (in my case in Estonia):
1. Jaanus Saks - Su säravad silmad
2. Егор Крид - Сердцеедка (Премьера клипа, 2019)
3. Comment Out #11/ Ольга Бузова х Фёдор Смолов
4. 5MIINUST x NUBLU - (ei ole) aluspükse
5. Артур Пирожков - Алкоголичка (Премьера клипа 2019)
6. Slav school of driving - driving instructor Boris
7. ЧТО ЕДЯТ В АРМИИ США VS РОССИИ?
8. RC Airplane Battle | Dude Perfect
9. ЧЕЙ КОРАБЛИК ОСТАНЕТСЯ ПОСЛЕДНИЙ, ПОЛУЧИТ 1000$ !
10. Khloé Kardashian's New Mom Beauty Routine | Beauty Secrets | Vogue
Since YouTube relies heavily on JavaScript to render and modify its pages, it's better to load the page in a real browser and then feed its page source to BeautifulSoup. We use Selenium for this purpose. Once the soup object is obtained, you can do whatever you want with it.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox(executable_path="/home/rishabh/Documents/pythonProjects/webScarapping/geckodriver")
driver.get('https://www.youtube.com/feed/trending')
content = driver.page_source
driver.close()

soup = BeautifulSoup(content, 'html.parser')
# Do whatever you want with it
Configure Selenium https://selenium-python.readthedocs.io/installation.html
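If you don't want a browser window to open while the script runs, Firefox can also be started headless; a small variation on the script above (this sketch assumes geckodriver is on your PATH, so no executable_path is given):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # run Firefox without opening a window
driver = webdriver.Firefox(options=options)
driver.get('https://www.youtube.com/feed/trending')
content = driver.page_source
driver.quit()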

How to webscrape flights using Python

I am scraping a website for flight tickets. My problem: I am using Chrome Developer Tools to identify the class of the HTML element I want to scrape, but my code does not find it. It looks like I am not downloading the HTML I can see in Chrome Developer Tools (Inspect Element...).
import requests
from BeautifulSoup import BeautifulSoup
url = 'http://www.momondo.de/flightsearch/?Search=true&TripType=2&SegNo=2&SO0=BOS&SD0=LON&SDP0=07-09-2016&SO1=LON&SD1=BOS&SDP1=12-09-2016&AD=1&TK=ECO&DO=false&NA=false'
req = requests.get(url)
soup = BeautifulSoup(req.content)
x = soup.findAll("span", {"class": "value"})
Please try the following:
from bs4 import BeautifulSoup
import urllib.request
source = urllib.request.urlopen('http://www.momon...e&NA=false').read()
soup = BeautifulSoup(source,'html5lib')
for item in soup.find_all("span", class_="value"):
    print(item.text)
With this you can scrape all the spans on the page with the class "value". If you want to see the whole HTML element and its attributes instead of just the content, remove .text from print(item.text).
You will probably need to install html5lib with pip; if you have trouble doing this, try running CMD as admin (assuming you are using Windows).
You can also try this, using the x from the question's snippet:
for values_in_x in x:
    print(values_in_x.text)
