I need to parse multiple HTML pages fetched with requests.get(). I just need to keep the content of each page and get rid of the embedded JavaScript and CSS. I saw the following posts, but none of the solutions works for me:
http://stackoverflow.com/questions/14344476/how-to-strip-entire-html-css-and-js-code-or-tags-from-html-page-in-python, http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text, and http://stackoverflow.com/questions/2081586/web-scraping-with-python
I have code that works, but it doesn't strip the JS or the CSS. Here is my code:
count = 1
for link in clean_urls[:2]:
    page = requests.get(link, timeout=5)
    try:
        page = BeautifulSoup(page.content, 'html.parser').text
        webpage_out = open(my_params['q'] + '_' + str(count) + '.txt', 'w')
        webpage_out.write(page)
        count += 1
    except:
        pass
    webpage_out.close()
I tried to include the solutions from the links mentioned above, but none of the code works for me.
What line of code can get rid of the embedded JS and embedded CSS?
Question Update 4 OCT 2016
The file that I read in (read.csv) looks something like this:
trump,clinton
data science, operating system
windows,linux
diabetes,cancer
I hit gigablast.com with those terms, searching one row at a time, so one search would be trump clinton. The result is a list of URLs. I requests.get() each URL and process them, getting rid of timeouts and status_code 400s and building a clean list, clean_urls = [] (a rough sketch of that filtering step is shown right after the loop below). After that I fire the following code:
count = 1
for link in clean_urls[:2]:
    page = requests.get(link, timeout=5)
    try:
        page = BeautifulSoup(page.content, 'html.parser').text
        webpage_out = open(my_params['q'] + '_' + str(count) + '.txt', 'w')
        webpage_out.write(page)
        count += 1
    except:
        pass
    webpage_out.close()
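For reference, the URL-filtering step described above might look roughly like this. It is only a sketch; the variable name candidate_urls and the exact status-code check are my assumptions, not code from the question.
import requests

clean_urls = []
for url in candidate_urls:          # candidate_urls: the raw URL list returned by the search (assumed name)
    try:
        resp = requests.get(url, timeout=5)
    except requests.exceptions.RequestException:
        continue                    # skip timeouts and connection errors
    if resp.status_code >= 400:
        continue                    # skip 400s (and other error statuses)
    clean_urls.append(url)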
On the line page = BeautifulSoup(page.content, 'html.parser').text I get the text of the entire web page, including the styles and scripts if they were embedded. At that point I can't target them with BeautifulSoup, because the tags are no longer there. I did try keeping the soup object (page = BeautifulSoup(page.content, 'html.parser')), calling find_all('<script>') and trying to get rid of the scripts, but I ended up erasing the entire file (a corrected sketch follows the examples below). The desired outcome is all the text of the HTML without any...
body {
font: something;
}
or any javascript...
$(document).ready(function(){
    $some code
});
The final file should have no code whatsoever, just the content of the document.
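In other words, the script and style tags have to be removed from the parse tree before the text is extracted, which is what the answer below does. A minimal sketch, assuming page is the requests response from the loop above (get_text() is my choice of extraction here, not the asker's):
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')
for tag in soup(['script', 'style']):    # soup(...) is shorthand for soup.find_all(...)
    tag.decompose()                      # remove the tag and its contents from the tree
clean_page = soup.get_text(separator='\n', strip=True)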
I used this code to get rid of JavaScript and CSS code while scraping an HTML page:
import requests
from bs4 import BeautifulSoup

url = 'https://corporate.walmart.com/our-story/our-business'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')
title = soup.title.string

# Remove every <script> and <style> tag (and its contents) from the tree
for script in soup(["script", "style"]):
    script.decompose()

with open('output_file.txt', "a") as text_file:
    text_file.write("\nURL : " + url)
    text_file.write("\nTitle : " + title)
    # Write the visible text of the <p>, <li> and <div> tags
    for p_tag_data in soup.find_all('p'):
        text_file.write("\n" + p_tag_data.text)
    for li_tag_data in soup.find_all('li'):
        text_file.write("\n" + li_tag_data.text)
    for div_tag_data in soup.find_all('div'):
        text_file.write("\n" + div_tag_data.text)
Related
I am trying to download reports from a company's website, https://www.investorab.com/investors-media/reports-presentations/. In the end, I would like to download all of the available reports.
I have next to no experience in web scraping, so I have some trouble defining the correct search pattern. Previously I have needed to pick out all links pointing to PDFs, i.e. I could use soup.select('div[id="id-name"] a[data-type="PDF"]'). But for this website no data type is listed for the links. How do I select all links under "Report and presentations"? Here is what I have tried, but it returns an empty list:
from bs4 import BeautifulSoup
import requests
url = "https://www.investorab.com/investors-media/reports-presentations/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Select all reports, publication_dates
reports = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] a[href]')
pub_dates = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] div[class="field-content"]')
I would also like to select all publication dates, but that also ends up as an empty list. Any help in the right direction is appreciated.
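As a side note, when links carry no data-type attribute, a common fallback is to match on the href itself. This is only a sketch, and it works on the individual report pages rather than on the listing page (the answer below explains why), reusing the soup object parsed above:
pdf_links = [a['href'] for a in soup.select("a[href$='.pdf']")]   # anchors whose href ends in .pdf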
What you'll need to do is iterate through the pages; what I did was simply iterate through the year parameter. Once you have the list for a year, get the link of each report, then within each report page find the PDF link. You then use that PDF link to write the file:
from bs4 import BeautifulSoup
import requests
import os

# Gather all the report links, year by year
linkList = []
url = 'https://vp053.alertir.com/v3/en/events-archive?'
for year in range(1917, 2021):
    query = 'type%5B%5D=report&type%5B%5D=annual_report&type%5B%5D=cmd&type%5B%5D=misc&year%5Bvalue%5D%5Byear%5D=' + str(year)
    response = requests.get(url + query)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', href=True)
    linkList += [link['href'] for link in links if 'v3' in link['href']]
    print('Gathered links for year %s.' % year)

# Go to each link and get the PDFs within them
print('Downloading PDFs...')
for link in linkList:
    url = 'https://vp053.alertir.com' + link
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for pdflink in soup.select("a[href$='.pdf']"):
        folder_location = 'C:/test/pdfDownloads/'
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        try:
            filename = os.path.join(folder_location, pdflink['href'].split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get('https://vp053.alertir.com' + pdflink['href']).content)
            print('Saved: %s' % pdflink['href'].split('/')[-1])
        except Exception as ex:
            print('%s not saved. %s' % (pdflink['href'], ex))
I am trying to extract comments from a website, and whenever a comment is a reply, the quoted previous post gets included in the extracted comment. I am trying to ignore those quotes while extracting:
url = "https://www.f150forum.com/f118/do-all-2018-f150-trucks-come-adaptive-cruise-control-369065/index2/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
comments_lst= soup.findAll('div',attrs={"class":"ism-true"})
comments =[]
for item in comments_lst:
result = [item.get_text(strip=True, separator=" ")]
comments.append(result)
quotes = []
for item in soup.findAll('div',attrs={"class":"panel alt2"}):
result = [item.get_text(strip=True, separator=" ")]
quotes.append(result)
For the final result I do not want data from the quotes list to be included in my comments. I tried using an if check, but it gives an incorrect result.
For example, comments[6] gives the result below:
'Quote: Originally Posted by jeff_the_pilot What the difference between adaptive cruise control on 2018 versus 2017? I believe mine brakes if I encroach another vehicle. It will work in stop and go traffic!'
my expected result
It will work in stop and go traffic!
You need to add some logic to take out the text contained in divs with class panel alt2:
comments = []
for item in comments_lst:
    result = [item.get_text(strip=True, separator=" ")]
    # If this comment contains a quoted post, cut off everything up to the end of the quote
    if div := item.find('div', class_="panel alt2"):
        result[0] = ' '.join(result[0].split(div.text.split()[-1])[1:])
    comments.append(result)
>>> comments[6]
[' It will work in stop and go traffic!']
This will get all messages without Quotes:
import requests
from bs4 import BeautifulSoup

url = "https://www.f150forum.com/f118/do-all-2018-f150-trucks-come-adaptive-cruise-control-369065/index2/"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

msgs = []
for msg in soup.select('[id^="post_message_"]'):
    # Remove the quoted-post block (the div that wraps the "Quote:" label) before extracting text
    for div in msg.select('div:has(> div > label:contains("Quote:"))'):
        div.extract()
    msgs.append(msg.get_text(strip=True, separator='\n'))

#print(msgs)  # <-- uncomment to see all messages without quoted messages
print(msgs[6])
Prints:
It will work in stop and go traffic!
I'm trying to scrape farfetch.com (https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282) with BeautifulSoup 4, and I am not able to find the same components (tags, or text in general) in the parsed text (dumped to soup.html) that I see in the browser's dev tools view (when searching for matching strings with CTRL + F).
There is nothing wrong with my code, but regardless, here it is:
#!/usr/bin/python

# imports
import bs4
import requests
from bs4 import BeautifulSoup as soup

# parse website
url = 'https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282'
response = requests.get(url)
page_html = response.text
page_soup = soup(page_html, "html.parser")

# write parsed soup to file
with open("soup.html", "a") as dumpfile:
    dumpfile.write(str(page_soup))
When I drag the soup.html file into the browser, all the content loads as it should (like the real URL). I assume there is some kind of protection against parsing? I tried to put in a request header that tells the web server on the other side that I am requesting this from a real browser, but it didn't work either.
Has anyone encountered something similar before?
Is there a way to get the REAL html as shown in the browser?
When I search the wanted content in the browser it (obviously) shows up...
Here is the parsed HTML saved as "soup.html". The content I am looking for cannot be found in it, regardless of how I search: CTRL+F, the bs4 functions find_all() or find(), or whatever else.
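For reference, sending browser-like request headers (the attempt mentioned above) usually looks something like this; the User-Agent value here is just an arbitrary example:
import requests

url = 'https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}   # any browser-like UA string
response = requests.get(url, headers=headers, timeout=10)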
Based on your comment, here is an example of how you could extract some information from products that are on discount:
import requests
from bs4 import BeautifulSoup

url = "https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# A product card that contains a discount percentage is a discounted product
for product in soup.select('[data-test="productCard"]:has([data-test="discountPercentage"])'):
    link = 'https://www.farfetch.com' + product.select_one('a[itemprop="itemListElement"][href]')['href']
    brand = product.select_one('[data-test="productDesignerName"]').get_text(strip=True)
    desc = product.select_one('[data-test="productDescription"]').get_text(strip=True)
    init_price = product.select_one('[data-test="initialPrice"]').get_text(strip=True)
    price = product.select_one('[data-test="price"]').get_text(strip=True)
    images = [i['content'] for i in product.select('meta[itemprop="image"]')]

    print('Link :', link)
    print('Brand :', brand)
    print('Description :', desc)
    print('Initial price :', init_price)
    print('Price :', price)
    print('Images :', images)
    print('-' * 80)
Prints:
Link : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-printed-button-up-shirt-item-14100332.aspx?storeid=9359
Brand : Dashiel Brahmann
Description : printed button up shirt
Initial price : CHF 438
Price : CHF 219
Images : ['https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273147_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273157_300.jpg']
--------------------------------------------------------------------------------
Link : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-corduroy-t-shirt-item-14100309.aspx?storeid=9359
Brand : Dashiel Brahmann
Description : corduroy T-Shirt
Initial price : CHF 259
Price : CHF 156
Images : ['https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985600_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985606_300.jpg']
--------------------------------------------------------------------------------
... and so on.
The following helped me:
instead of the following code
page_soup = soup(page_html, "html.parser")
use
page_soup = soup(page_html, "html")
There are some questions like this online, but I looked at them and none of them have helped me. I am currently working on a script that pulls item names from http://www.supremenewyork.com/shop/all/accessories.
I want it to pull this information from Supreme UK, but I'm having trouble with the proxy stuff; right now I'm struggling with this script, and every time I run it I get the error listed in the title.
Here is my script:
import requests
from bs4 import BeautifulSoup

URL = ('http://www.supremenewyork.com/shop/all/accessories')
proxy_script = requests.get(URL).text
soup = BeautifulSoup(proxy_script, 'lxml')

for item in soup.find_all('div', class_='inner-article'):
    name = soup.find('h1', itemprop='name').text
    print(name)
I always get this error, and when I run the script without the .text at the end of the itemprop='name' line I just get a bunch of Nones,
like this:
None
None
None etc......
Exactly as many Nones as there are items available to print.
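For what it's worth, find() returns None whenever nothing matches, and calling .text on None is what raises the error. A guarded sketch that also searches within each item rather than the whole soup (whether the h1 on that page really carries itemprop='name' is an assumption; the answer below restructures this further):
for item in soup.find_all('div', class_='inner-article'):
    name_tag = item.find('h1', itemprop='name')   # None if this item has no such tag
    print(name_tag.text if name_tag else None)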
Here we go, I've commented the code that I've used below.
And the reason we use class_='something' is that the word class is reserved for defining classes in Python.
import requests
from bs4 import BeautifulSoup

URL = ('http://www.supremenewyork.com/shop/all/accessories')

# UK_Proxy1 = '178.62.13.163:8080'
# proxies = {
#     'http': 'http://' + UK_Proxy1,
#     'https': 'https://' + UK_Proxy1,
# }
# proxy_script = requests.get(URL, proxies=proxies).text

proxy_script = requests.get(URL).text
soup = BeautifulSoup(proxy_script, 'lxml')

thetable = soup.find('div', class_='turbolink_scroller')
items = thetable.find_all('div', class_='inner-article')
for item in items:
    only_text = item.h1.a.text
    # by doing .<tag> we extract information just from that tag
    # example: bsobject = <html><body><b>ey</b></body></html>
    # if we print bsobject.body.b it will return `<b>ey</b>`
    color = item.p.a.text
    print(only_text, color)
I am practicing web crawling with Python 3.
I am crawling this site:
http://www.keri.org/web/www/research_0201?p_p_id=EXT_BBS&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fview&_EXT_BBS_sCategory=&_EXT_BBS_sKeyType=&_EXT_BBS_sKeyword=&_EXT_BBS_curPage=1&_EXT_BBS_optKeyType1=&_EXT_BBS_optKeyType2=&_EXT_BBS_optKeyword1=&_EXT_BBS_optKeyword2=&_EXT_BBS_sLayoutId=0
I want to get the address of the PDF file from the HTML code.
For example, in the HTML the PDF download URL is:
http://www.keri.org/web/www/research_0201?p_p_id=EXT_BBS&p_p_lifecycle=1&p_p_state=exclusive&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fget_file&_EXT_BBS_extFileId=5326
But my crawler returns:
http://www.keri.org/web/www/research_0201;jsessionid=3875698676A3025D8877C4EEBA67D6DF?p_p_id=EXT_BBS&p_p_lifecycle=1&p_p_state=exclusive&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fget_file&_EXT_BBS_extFileId=5306
I cannot even download the file from this address.
Where did jsessionid come from?
I can just erase it, but I wonder why.
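If you do want to erase it, a one-line sketch (the regex is mine, and link stands for the URL your crawler produced):
import re

clean_link = re.sub(r';jsessionid=[^?]*', '', link)   # drop the ';jsessionid=...' segment, keep the query string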
Why is the URL so long? lol
The ;jsessionid=... part is session tracking that Java servlet based sites append to the URL when the session cookie isn't being sent back, and it usually changes nothing about the response. I tested in my code that the jsessionid does not affect the file download:
import requests, bs4

# Browser-like request headers; the User-Agent value here is an arbitrary example
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get('http://www.keri.org/web/www/research_0201?p_p_id=EXT_BBS&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fview&_EXT_BBS_sCategory=&_EXT_BBS_sKeyType=&_EXT_BBS_sKeyword=&_EXT_BBS_curPage=1&_EXT_BBS_optKeyType1=&_EXT_BBS_optKeyType2=&_EXT_BBS_optKeyword1=&_EXT_BBS_optKeyword2=&_EXT_BBS_sLayoutId=0')
soup = bs4.BeautifulSoup(r.text, 'lxml')

# Pair each download link with the title of the entry it belongs to
down_links = [(a.get('href'), a.find_previous('a').text) for a in soup('a', class_="download")]

for link, title in down_links:
    filename = title + '.pdf'
    r = requests.get(link, stream=True, headers=headers)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)