I'm trying to scrape farfetch.com (https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282) with BeautifulSoup4, but I can't find the same components (tags, or text in general) in the parsed text (dumped to soup.html) that I see in the browser's dev tools view (when searching for matching strings with CTRL + F).
There is nothing wrong with my code, but regardless of that, here it is:
#!/usr/bin/python
# imports
import bs4
import requests
from bs4 import BeautifulSoup as soup

# parse website
url = 'https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282'
response = requests.get(url)
page_html = response.text
page_soup = soup(page_html, "html.parser")

# write parsed soup to file
with open("soup.html", "a") as dumpfile:
    dumpfile.write(str(page_soup))
When I drag the soup.html file into the browser, all the content loads as it should (like the real URL). I assume this is some kind of protection against scraping? I tried to put in a connection header which tells the webserver on the other side that I am requesting this from a real browser, but it didn't work either.
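Something along these lines (a rough sketch; the exact User-Agent string here is just an example, not necessarily the one I used):

import requests

url = 'https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282'
# pretend to be a regular browser by sending a User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get(url, headers=headers)
print(response.status_code)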
Has anyone encountered something similar before?
Is there a way to get the REAL html as shown in the browser?
When I search for the wanted content in the browser, it (obviously) shows up...
Here is the parsed HTML saved as "soup.html". The content I am looking for cannot be found, regardless of whether I search by hand (CTRL+F) or with the bs4 functions find_all() or find().
Based on your comment, here is an example of how you could extract some information from the products that are on discount:
import requests
from bs4 import BeautifulSoup

url = "https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# iterate over product cards that contain a discount percentage
for product in soup.select('[data-test="productCard"]:has([data-test="discountPercentage"])'):
    link = 'https://www.farfetch.com' + product.select_one('a[itemprop="itemListElement"][href]')['href']
    brand = product.select_one('[data-test="productDesignerName"]').get_text(strip=True)
    desc = product.select_one('[data-test="productDescription"]').get_text(strip=True)
    init_price = product.select_one('[data-test="initialPrice"]').get_text(strip=True)
    price = product.select_one('[data-test="price"]').get_text(strip=True)
    images = [i['content'] for i in product.select('meta[itemprop="image"]')]

    print('Link :', link)
    print('Brand :', brand)
    print('Description :', desc)
    print('Initial price :', init_price)
    print('Price :', price)
    print('Images :', images)
    print('-' * 80)
Prints:
Link : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-printed-button-up-shirt-item-14100332.aspx?storeid=9359
Brand : Dashiel Brahmann
Description : printed button up shirt
Initial price : CHF 438
Price : CHF 219
Images : ['https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273147_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273157_300.jpg']
--------------------------------------------------------------------------------
Link : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-corduroy-t-shirt-item-14100309.aspx?storeid=9359
Brand : Dashiel Brahmann
Description : corduroy T-Shirt
Initial price : CHF 259
Price : CHF 156
Images : ['https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985600_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985606_300.jpg']
--------------------------------------------------------------------------------
... and so on.
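If you need more than the first page, a sketch of walking the page query parameter could look like this (an assumption on my part: that the same selectors keep working on later pages):

import requests
from bs4 import BeautifulSoup

# minimal pagination sketch, reusing the selectors from above
base_url = 'https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page={}&view=180&scale=282'

for page_number in range(1, 4):  # first three pages as an example
    page_soup = BeautifulSoup(requests.get(base_url.format(page_number)).text, 'html.parser')
    cards = page_soup.select('[data-test="productCard"]')
    print('Page {}: {} product cards'.format(page_number, len(cards)))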
The following helped me:
instead of the following code
page_soup = soup(page_html, "html.parser")
use
page_soup = soup(page_html, "html")
I am trying to download reports from a company's website, https://www.investorab.com/investors-media/reports-presentations/. In the end, I would like to download all the available reports.
I have next to no experience in web scraping, so I have some trouble defining the correct search pattern. Previously I only needed to pull out links containing PDFs, i.e. I could use soup.select('div[id="id-name"] a[data-type="PDF"]'). But for this website, no data type is listed for the links. How do I select all links under "Report and presentations"? Here is what I have tried, but it returns an empty list:
from bs4 import BeautifulSoup
import requests
url = "https://www.investorab.com/investors-media/reports-presentations/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Select all reports, publication_dates
reports = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] a[href]')
pub_dates = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] div[class="field-content"]')
I would also like to select all publication dates, but that also ends up as an empty list. Any help in the right direction is appreciated.
What you'll need to do is iterate through the pages, or, as I did, just iterate through the year parameter. Once you get the list for a year, get the link of each report; then, within each link, find the PDF link. You then use that PDF link to write to a file:
from bs4 import BeautifulSoup
import requests
import os

# Gather all the report links, year by year
linkList = []
url = 'https://vp053.alertir.com/v3/en/events-archive?'
for year in range(1917, 2021):
    query = 'type%5B%5D=report&type%5B%5D=annual_report&type%5B%5D=cmd&type%5B%5D=misc&year%5Bvalue%5D%5Byear%5D=' + str(year)
    response = requests.get(url + query)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', href=True)
    linkList += [link['href'] for link in links if 'v3' in link['href']]
    print('Gathered links for year %s.' % year)

# Go to each link and get the PDFs within them
print('Downloading PDFs...')
for link in linkList:
    url = 'https://vp053.alertir.com' + link
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for pdflink in soup.select("a[href$='.pdf']"):
        folder_location = 'C:/test/pdfDownloads/'
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        try:
            filename = os.path.join(folder_location, pdflink['href'].split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get('https://vp053.alertir.com' + pdflink['href']).content)
            print('Saved: %s' % pdflink['href'].split('/')[-1])
        except Exception as ex:
            print('%s not saved. %s' % (pdflink['href'], ex))
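A small possible simplification (my own suggestion, not part of the original answer): create the download folder once before the loop with os.makedirs, which has an exist_ok flag, instead of checking os.path.exists for every PDF:

import os

# create the target folder once up front; exist_ok avoids an error if it already exists
folder_location = 'C:/test/pdfDownloads/'
os.makedirs(folder_location, exist_ok=True)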
I tried to read a wiki page using urllib and Beautiful Soup as follows.
I tried it according to this.
import urllib.parse as parse, urllib.request as request
from bs4 import BeautifulSoup
name = "メインページ"
root = 'https://ja.wikipedia.org/wiki/'
url = root + parse.quote_plus(name)
response = request.urlopen(url)
html = response.read()
print (html)
soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
print (soup)
The code runs without error, but it could not read the Japanese characters.
Your approach seems correct and is working for me.
Try printing the parsed data using the following code and check the output.
soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
some_japanese = soup.find('div', {'id': 'mw-content-text'}).text.strip()
print(some_japanese)
In my case, I am getting the following (this is part of the output):
ウィリアム・バトラー・イェイツ(1865年6月13日 - 1939年1月28日)は、アイルランドの詩人・劇作家。幼少のころから親しんだアイルランドの妖精譚などを題材とする抒情詩で注目されたのち、民族演劇運動を通じてアイルランド文芸復興の担い手となった。……
If this is not working for you, then try saving the HTML content to a file and checking the page in a browser to see whether the Japanese text is fetched properly or not. (Again, it's working fine for me.)
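A minimal sketch of that check, assuming the html bytes from the code above (the filename is arbitrary):

# save the fetched HTML so it can be opened in a browser; utf-8 keeps the Japanese characters intact
with open('wiki_page.html', 'w', encoding='utf-8') as f:
    f.write(html.decode('utf-8'))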
I am using Python 3.7.1, and with it I want to scrape iPhone user comments (customer reviews) from the Amazon website (link below).
Link (to be scraped):
https://www.amazon.in/Apple-iPhone-Silver-64GB-Storage/dp/B0711T2L8K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1548335262&sr=1-1&keywords=iphone+X
When I try the code below, it gives me the error shown further down.
CODE:
# -*- coding: utf-8 -*-
#import the library used to query a website
import urllib.request
from bs4 import BeautifulSoup
#specify the url
scrap_link = "https://www.amazon.in/Apple-iPhone-Silver-64GB-Storage/dp/B0711T2L8K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1548335262&sr=1-1&keywords=iphone+X"
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
#Query the website and return the html to the variable 'page'
page = urllib.request.urlopen(scrap_link)
#page = urllib.request.urlopen(wiki)
print(page)
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page)
print(soup.prettify())
ERROR:
File "C:\Users\bsrivastava\AppData\Local\Continuum\anaconda3\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Service Unavailable
NOTE: When I try to scrape the wiki link (shown in the code), it works fine.
So why am I getting this error with the Amazon link, and how can I overcome it?
Also, once I get this customer review data, I need to store it in a structured format as shown below. How can I do that? (I am totally new to NLP, so I need some guidance here.)
Structure:
a. Reviewer’s Name
b. Date of review
c. Color
d. Size
e. Verified Purchase (True or False)
f. Rating
g. Review Title
h. Review Description
NLP? Are you sure?
import requests
from bs4 import BeautifulSoup

scrap_link = "https://www.amazon.in/Apple-iPhone-Silver-64GB-Storage/dp/B0711T2L8K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1548335262&sr=1-1&keywords=iphone+X"
req = requests.get(scrap_link)
soup = BeautifulSoup(req.content, 'html.parser')

# each review sits in a div with these classes
container = soup.findAll('div', attrs={'class': 'a-section review aok-relative'})

data = []
for x in container:
    ReviewersName = x.find('span', attrs={'class': 'a-profile-name'}).text
    data.append({'ReviewersName': ReviewersName})

print(data)
# later save the dictionary to csv
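A minimal sketch of that last step, saving the data list from the snippet above to CSV with the standard csv module (the filename and field list are just examples):

import csv

with open('reviews.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['ReviewersName'])
    writer.writeheader()
    writer.writerows(data)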
scrape.py
# code to scrape the links from the html
from bs4 import BeautifulSoup
import urllib.request

# read the saved HTML file
data = open('scrapeFile', 'r')
html = data.read()
data.close()
soup = BeautifulSoup(html, features="html.parser")

# code to extract links
links = []
for div in soup.find_all('div', {'class': 'main-bar z-depth-1'}):
    # print(div.a.get('href'))
    links.append('https://godamwale.com' + str(div.a.get('href')))
print(links)

file = open("links.txt", "w")
for link in links:
    file.write(link + '\n')
    print(link)
file.close()
I have successfully got the list of links by using this code. But when I want to scrape the data from those links, their HTML pages don't contain the source code that holds the data, which makes extracting it tough. I have used the Selenium driver, but it didn't work well for me.
I want to scrape the data from the link below, which contains data in HTML sections covering customer details, licence and automation, commercial details, floor-wise details, and operational details. I want to extract this data with name, location, contact number and type.
https://godamwale.com/list/result/591359c0d6b269eecc1d8933
Here is the link. If someone finds a solution, please share it with me.
Using the developer tools in your browser, you'll notice that whenever you visit that link there is a request to https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933 which returns a JSON response, probably containing the data you're looking for.
Python 2.x:
import urllib2, json
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents
Python 3.x:
import urllib.request, json
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
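If the requests library is available, an equivalent sketch looks like this:

import requests

# same endpoint as above; .json() parses the response body directly
contents = requests.get("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").json()
print(contents)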
Here you go. The main problem with the site seems to be that it takes time to load, which is why it was returning an incomplete page source. You have to wait until the page loads completely; notice the time.sleep(8) line in the code below:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import time

CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"

wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")
time.sleep(8)  # wait until the page loads completely
soup = BeautifulSoup(wd.page_source, 'lxml')

props_list = []
propvalues_list = []

div = soup.find_all('div', {'class': 'row'})
for childtags in div[6].findChildren('div', {'class': 'col s12 m4 info-col'}):
    props = childtags.find("span").contents
    props_list.append(props)
    propvalue = childtags.find("p", recursive=True).contents
    propvalues_list.append(propvalue)

print(props_list)
print(propvalues_list)
Note: the code will return the construction details in two separate lists.
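As a possible refinement (my own suggestion, not part of the original answer), an explicit wait on the wd driver from the code above is usually more robust than a fixed sleep:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 20 seconds for an element with the 'info-col' class used above
# to appear, instead of always sleeping for 8 seconds
WebDriverWait(wd, 20).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'info-col'))
)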
I need to parse multiple HTML pages fetched through requests.get(). I just need to keep the content of each page and get rid of the embedded JavaScript and CSS. I saw the following posts, but no solution works for me.
http://stackoverflow.com/questions/14344476/how-to-strip-entire-html-css-and-js-code-or-tags-from-html-page-in-python, http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text, and http://stackoverflow.com/questions/2081586/web-scraping-with-python
I have code that runs, but it doesn't strip the JS or the CSS. Here is my code:
count = 1
for link in clean_urls[:2]:
    page = requests.get(link, timeout=5)
    try:
        page = BeautifulSoup(page.content, 'html.parser').text
        webpage_out = open(my_params['q'] + '_' + str(count) + '.txt', 'w')
        webpage_out.write(page)
        count += 1
    except:
        pass
    webpage_out.close()
I tried to include the solutions from the links mentioned above, but none of that code works for me.
What line of code can get rid of the embedded JS and embedded CSS?
Question Update 4 OCT 2016
The file that I read in as CSV is something like this...
trump,clinton
data science, operating system
windows,linux
diabetes,cancer
I hit gigablast.com with those terms, searching one row at a time. One search would be trump clinton. The result is a list of URLs. I requests.get(url) each of them and process those URLs, getting rid of timeouts and status_code = 400s, and building a clean list, clean_urls = []. After that I fire the following code...
count = 1
for link in clean_urls[:2]:
    page = requests.get(link, timeout=5)
    try:
        page = BeautifulSoup(page.content, 'html.parser').text
        webpage_out = open(my_params['q'] + '_' + str(count) + '.txt', 'w')
        webpage_out.write(page)
        count += 1
    except:
        pass
    webpage_out.close()
On this line of code, page = BeautifulSoup(page.content, 'html.parser').text, I have the text of the entire web page, including styles and scripts if they were embedded. I can't target them with BeautifulSoup because the tags are no longer there. I did try page = BeautifulSoup(page.content, 'html.parser') together with find_all('<script>') and tried to get rid of the scripts, but I ended up erasing the entire file. The desired outcome is all the text of the HTML without any...
body {
font: something;
}
or any JavaScript...
$(document).ready(function(){
    $some code
});
The final file should have no code whatsoever, just the content of the document.
I used this code to get rid of the JavaScript and CSS code while scraping an HTML page:
import requests
from bs4 import BeautifulSoup

url = 'https://corporate.walmart.com/our-story/our-business'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')
title = soup.title.string

# remove all <script> and <style> elements from the tree
for script in soup(["script", "style"]):
    script.decompose()

with open('output_file.txt', "a") as text_file:
    text_file.write("\nURL : " + url)
    text_file.write("\nTitle : " + title)
    for p_tag_data in soup.find_all('p'):
        text_file.write("\n" + p_tag_data.text)
    for li_tag_data in soup.find_all('li'):
        text_file.write("\n" + li_tag_data.text)
    for div_tag_data in soup.find_all('div'):
        text_file.write("\n" + div_tag_data.text)
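As an alternative to walking the p/li/div tags one by one, a sketch using get_text() on the same soup after the decompose step pulls all the remaining visible text in one call:

# after the <script>/<style> elements are decomposed, get_text() returns only
# the visible text; separator and strip are optional tidying arguments
clean_text = soup.get_text(separator='\n', strip=True)
print(clean_text[:500])  # preview the first 500 characters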