Jsessionid interferes with crawling - python-3.x

I am practicing web crawling with Python 3.
I am crawling this site:
http://www.keri.org/web/www/research_0201?p_p_id=EXT_BBS&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fview&_EXT_BBS_sCategory=&_EXT_BBS_sKeyType=&_EXT_BBS_sKeyword=&_EXT_BBS_curPage=1&_EXT_BBS_optKeyType1=&_EXT_BBS_optKeyType2=&_EXT_BBS_optKeyword1=&_EXT_BBS_optKeyword2=&_EXT_BBS_sLayoutId=0
I want to extract the PDF download addresses from the HTML.
For example, in the HTML a PDF download URL looks like this:
http://www.keri.org/web/www/research_0201?p_p_id=EXT_BBS&p_p_lifecycle=1&p_p_state=exclusive&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fget_file&_EXT_BBS_extFileId=5326
But my crawler produces this:
http://www.keri.org/web/www/research_0201;jsessionid=3875698676A3025D8877C4EEBA67D6DF?p_p_id=EXT_BBS&p_p_lifecycle=1&p_p_state=exclusive&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fget_file&_EXT_BBS_extFileId=5306
I cannot even download the file from that address.
Where did the jsessionid come from?
I could just erase it, but I want to know why it appears.
Why is the URL so long? lol

The jsessionid is added by the server: Java servlet containers rewrite links to include ;jsessionid=... when the client has not yet sent a session cookie, which is the case on a script's first request. I tested in my code that the jsessionid does not affect the file download:
import requests, bs4

headers = {'User-Agent': 'Mozilla/5.0'}  # a simple User-Agent so the requests look like a browser
r = requests.get('http://www.keri.org/web/www/research_0201?p_p_id=EXT_BBS&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fview&_EXT_BBS_sCategory=&_EXT_BBS_sKeyType=&_EXT_BBS_sKeyword=&_EXT_BBS_curPage=1&_EXT_BBS_optKeyType1=&_EXT_BBS_optKeyType2=&_EXT_BBS_optKeyword1=&_EXT_BBS_optKeyword2=&_EXT_BBS_sLayoutId=0')
soup = bs4.BeautifulSoup(r.text, 'lxml')
down_links = [(a.get('href'), a.find_previous('a').text) for a in soup('a', class_="download")]
for link, title in down_links:
    filename = title + '.pdf'
    r = requests.get(link, stream=True, headers=headers)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)
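If you still want to normalize the URLs before saving them, stripping the session token is a one-liner. This is just a sketch (the helper name is mine, not part of the original code):
import re

def strip_jsessionid(url):
    # drop a ";jsessionid=..." path parameter, leaving the query string intact
    return re.sub(r';jsessionid=[^?#]*', '', url)

# e.g. strip_jsessionid('http://www.keri.org/web/www/research_0201;jsessionid=ABC123?p_p_id=EXT_BBS')
# -> 'http://www.keri.org/web/www/research_0201?p_p_id=EXT_BBS'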

Related

Trying to download a file from a link but failed (using python request)

I got a link that downloads a pdf file:
https://eclass.cccmyc.edu.hk/home/imail/downloadattachment.php?CampusMailID=441933&b_filename=TEkgSE8gWVVFTjREMTAucGRm
But instead of downloading the PDF, it redirects me to the login page of that platform. I am now trying to use Python to log in and then download the file. Here is my code:
import requests
url = "https://eclass.cccmyc.edu.hk/login.php"
download_url = "https://eclass.cccmyc.edu.hk/home/imail/downloadattachment.php?CampusMailID=441933&b_filename=TEkgSE8gWVVFTjREMTAucGRm"
data = {
    "UserLogin": "myUsername",
    "UserPassword": "secret"
}
r = requests.get(url, data=data)
I suppose I have logged in by now, so I download that file:
r = requests.get(download_url, allow_redirects=True)
open('my_pdf', 'wb').write(r.content)
Sadly, instead of downloading the PDF file, the program gives me a bunch of HTML. What should I do? I am pretty sure I have made a big error in the program.
Thanks!
Here is a code snippet you can use and try.
local_filename = 'my.pdf'  # wherever you want to save the download
with requests.get(download_url, allow_redirects=True, stream=True) as r:
    r.raise_for_status()
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            # for a chunk-encoded response, set chunk_size=None and keep
            # this "if chunk" guard to skip keep-alive chunks
            if chunk:
                f.write(chunk)
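That snippet only streams the download; the reason the original code gets HTML back is most likely that the login never happens, because the credentials are sent with a GET and no session cookie is kept between requests. Below is a minimal sketch of the usual pattern with requests.Session(); the form field names UserLogin and UserPassword are taken from the question, and the real form may use different names or require extra hidden fields, so inspect the login page to confirm.
import requests

login_url = "https://eclass.cccmyc.edu.hk/login.php"
download_url = "https://eclass.cccmyc.edu.hk/home/imail/downloadattachment.php?CampusMailID=441933&b_filename=TEkgSE8gWVVFTjREMTAucGRm"

credentials = {
    "UserLogin": "myUsername",     # field names assumed from the question's form data
    "UserPassword": "secret",
}

with requests.Session() as s:              # the Session object keeps the login cookie
    s.post(login_url, data=credentials)    # login forms normally expect POST, not GET
    r = s.get(download_url, allow_redirects=True)
    r.raise_for_status()
    with open("my.pdf", "wb") as f:
        f.write(r.content)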

Request.get not rendering all 'hrefs' in HTML Python

I am trying to fetch the "Contact Us" page of multiple websites. It works for some of the websites, but for others the text returned by requests.get does not contain all the 'href' links. When I inspect the page in the browser the links are visible, but they do not come through in requests.
I tried to look for a solution, but with no luck.
Below is the code; the webpage I am trying to scrape is https://portcullis.co/:
headers = {"Accept-Language": "en-US, en;q=0.5"}
def page_contact(url):
r = requests.get(url, headers = headers)
txt = BeautifulSoup(r.text, 'html.parser')
links = []
for link in txt.findAll('a'):
links.append(link.get('href'))
return r, links
The output generated is:
<Response [200]> []
Since it is working fine for some other websites, I would prefer a fix that is not specific to this website but works for all websites.
Any help is highly appreciated !!
Thanks !!!
The links on that page are most likely built by JavaScript, which requests does not execute, so they never appear in the downloaded HTML. This is another way to solve it using only Selenium and not BeautifulSoup:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import time

url = 'https://portcullis.co/'
browser = webdriver.Chrome(r'path to your chromedriver.exe')
browser.set_page_load_timeout(100)
browser.get(url)
time.sleep(3)
WebDriverWait(browser, 20).until(lambda d: d.find_element_by_tag_name("a"))
time.sleep(20)
final_link = []
elements = browser.find_elements_by_xpath("//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'contact')]")
for el in elements:
    final_link.append(el.get_attribute("href"))
This fetches the rendered page source, and you can then find the relevant links by passing it to BeautifulSoup:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Chrome(r'path to your chromedriver exe')
browser.get('Your url')
time.sleep(5)
htmlSource = browser.page_source
txt = BeautifulSoup(htmlSource, 'html.parser')
browser.close()
links = []
for link in txt.findAll('a'):
    links.append(link.get('href'))
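As a follow-up, if you only need the "Contact Us" link you can filter the collected hrefs and resolve any relative ones against the page URL. A small sketch, reusing the links list from above and assuming the site URL from the question:
from urllib.parse import urljoin

base_url = 'https://portcullis.co/'
contact_links = [urljoin(base_url, l) for l in links if l and 'contact' in l.lower()]
print(contact_links)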

Scraping reports from a website using BeautifulSoup in python

I am trying to download reports from a company's website, https://www.investorab.com/investors-media/reports-presentations/. In the end, I would like to download all the available reports.
I have next to no experience in web scraping, so I have some trouble defining the correct search pattern. Previously I have needed to pull out all links containing PDFs, i.e. I could use soup.select('div[id="id-name"] a[data-type="PDF"]'). But for this website, no data type is listed on the links. How do I select all links under "Report and presentations"? Here is what I have tried, but it returns an empty list:
from bs4 import BeautifulSoup
import requests
url = "https://www.investorab.com/investors-media/reports-presentations/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Select all reports, publication_dates
reports = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] a[href]')
pub_dates = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] div[class="field-content"]')
I would also like to select all publication dates, but that also ends up as an empty list. Any help in the right direction is appreciated.
What you'll need to do is iterate through the pages, or what I did was just iterate over the year parameter. Once you get the list for a year, get the link of each report, then within each link find the PDF link. You then use that PDF link to write the file:
from bs4 import BeautifulSoup
import requests
import os

# Gather all the report links, one year at a time
linkList = []
url = 'https://vp053.alertir.com/v3/en/events-archive?'
for year in range(1917, 2021):
    query = 'type%5B%5D=report&type%5B%5D=annual_report&type%5B%5D=cmd&type%5B%5D=misc&year%5Bvalue%5D%5Byear%5D=' + str(year)
    response = requests.get(url + query)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', href=True)
    linkList += [link['href'] for link in links if 'v3' in link['href']]
    print('Gathered links for year %s.' % year)

# Go to each link and download the PDFs within it
print('Downloading PDFs...')
for link in linkList:
    url = 'https://vp053.alertir.com' + link
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for pdflink in soup.select("a[href$='.pdf']"):
        folder_location = 'C:/test/pdfDownloads/'
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        try:
            filename = os.path.join(folder_location, pdflink['href'].split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get('https://vp053.alertir.com' + pdflink['href']).content)
            print('Saved: %s' % pdflink['href'].split('/')[-1])
        except Exception as ex:
            print('%s not saved. %s' % (pdflink['href'], ex))

How can I ensure that relative links are saved as absolute URLs in the output file?

I need to develop a web-link scraper in Python that extracts all of the unique links pointing to other web pages from the HTML of the "Current Estimates" page of the US Census Bureau website (see link below), both inside and outside that domain, and writes them to a comma-separated values (CSV) file as absolute uniform resource identifiers (URIs).
I use the code below in a Jupyter Notebook and it does generate a CSV, but part of my code prepends an extra http: to links that are already absolute, when it should only be fixing the relative links:
http:https://www.census.gov/data/training-workshops.html
http:https://www.census.gov/programs-surveys/sis.html
I need better code that converts the relative links to absolute ones. I believe full_url = urljoin(url, link.get("href")) should be doing this, but something is incorrect.
How can I ensure that relative links are saved as absolute URLs in the output file?
import requests
from bs4 import BeautifulSoup, SoupStrainer
import csv
from urllib.parse import urljoin
import re

url = 'https://www.census.gov/programs-surveys/popest.html'
r = requests.get(url)
raw_html = r.text
print(r.text)
soup = BeautifulSoup(raw_html, 'html.parser')
print(soup.prettify())

for link in soup.find_all('a', href=True):
    full_url = urljoin(url, link.get("href"))
    print(link.get('href'))

links_set = set()
for link in soup.find_all(href=re.compile('a')):
    print(link.get('href'))
for item in soup.find_all('a', href=re.compile(r'html')):
    links_set.add(item.get('href'))

links = [x[:1] == 'http' and x or 'http:' + x for x in links_set]

with open("C996FinalAssignment.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter="\n")
    writer.writerow(links)
Try this.
import requests
import csv
from simplified_scrapy.simplified_doc import SimplifiedDoc

url = 'https://www.census.gov/programs-surveys/popest.html'
r = requests.get(url)
raw_html = r.text
print(r.text)

doc = SimplifiedDoc(raw_html)
lstA = doc.listA(url=url)  # it will help you turn relative links into absolute links
links = [a.url for a in lstA]

with open("C996FinalAssignment.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter="\n")
    writer.writerow(links)
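If you would rather stay with requests and BeautifulSoup, the urljoin call from the question already does the right thing; the fix is to collect its result instead of the raw href and to drop the manual 'http:' + x step. A sketch along those lines (not the original poster's code):
import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.census.gov/programs-surveys/popest.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

links = set()
for a in soup.find_all('a', href=True):
    # urljoin leaves absolute URLs untouched and resolves relative ones against the page URL
    links.add(urljoin(url, a['href']))

with open("C996FinalAssignment.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for link in sorted(links):
        writer.writerow([link])  # one absolute URL per row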

Beautifulsoup get rid of embedded js and css in html

I need to parse multiple HTML pages fetched through requests.get(). I just need to keep the content of each page and get rid of the embedded JavaScript and CSS. I saw the following posts, but none of their solutions work for me:
http://stackoverflow.com/questions/14344476/how-to-strip-entire-html-css-and-js-code-or-tags-from-html-page-in-python, http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text, and http://stackoverflow.com/questions/2081586/web-scraping-with-python
I have code that runs, but it strips neither the JS nor the CSS... here is my code:
count = 1
for link in clean_urls[:2]:
    page = requests.get(link, timeout=5)
    try:
        page = BeautifulSoup(page.content, 'html.parser').text
        webpage_out = open(my_params['q'] + '_' + str(count) + '.txt', 'w')
        webpage_out.write(clean_page)
        count += 1
    except:
        pass
    webpage_out.close()
I tried to include the solutions from the links mentioned above, but none of that code works for me.
What line of code can get rid of the embedded JS and embedded CSS?
Question Update 4 OCT 2016
The CSV file that I read is something like this:
trump,clinton
data science, operating system
windows,linux
diabetes,cancer
I hit gigablast.com with those terms, searching one row at a time; one search would be trump clinton. The result is a list of URLs. I requests.get() each URL and process them, throwing away timeouts and status_code 400s, and build a clean list, clean_urls = []. After that I run the following code:
count = 1
for link in clean_urls[:2]:
    page = requests.get(link, timeout=5)
    try:
        page = BeautifulSoup(page.content, 'html.parser').text
        webpage_out = open(my_params['q'] + '_' + str(count) + '.txt', 'w')
        webpage_out.write(clean_page)
        count += 1
    except:
        pass
    webpage_out.close()
On this line of code, page = BeautifulSoup(page.content, 'html.parser').text, I get the text of the entire web page, including styles and scripts if they were embedded. I can't target them with BeautifulSoup afterwards because the tags are no longer there. I did try page = BeautifulSoup(page.content, 'html.parser') with find_all('<script>') and tried to get rid of the scripts, but I ended up erasing the entire file. The desired outcome is all the text of the HTML without any...
body {
font: something;
}
or any javascript...
$(document).ready(function(){
$some code
)};
The final file should have no code what so ever, just the content of the document.
I used this code to get rid of the JavaScript and CSS code while scraping an HTML page:
import requests
from bs4 import BeautifulSoup

url = 'https://corporate.walmart.com/our-story/our-business'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')
title = soup.title.string

# remove every <script> and <style> element from the parsed tree
for script in soup(["script", "style"]):
    script.decompose()

with open('output_file.txt', "a") as text_file:
    text_file.write("\nURL : " + url)
    text_file.write("\nTitle : " + title)
    for p_tag_data in soup.find_all('p'):
        text_file.write("\n" + p_tag_data.text)
    for li_tag_data in soup.find_all('li'):
        text_file.write("\n" + li_tag_data.text)
    for div_tag_data in soup.find_all('div'):
        text_file.write("\n" + div_tag_data.text)
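If you want all of the visible text in one pass rather than writing out the p, li and div tags separately (which duplicates text, since the div wrappers also contain the p and li content), soup.get_text() after the decompose loop is a common alternative. A minimal sketch, reusing the soup variable above; output_text.txt is just an example filename:
# after the <script>/<style> elements have been decomposed
clean_text = soup.get_text(separator='\n', strip=True)
with open('output_text.txt', 'w') as f:
    f.write(clean_text)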
