Web Links Scraping - python-3.x

I'm working on a project that requires me to scrape unique links from a website and save them to a CSV file. I've read through quite a bit of material on how to do this, watched videos, and done trainings on Pluralsight and LinkedIn Learning. I mostly have it figured out, but there is one aspect of the assignment that I'm not sure how to do.
The program is supposed to scrape web links from both the Domain that is given (see code below) and any web links outside of the domain.
import bs4 as bs
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
BASE_url = urllib.request.urlopen("https://www.census.gov/programs-surveys/popest.html").read()
soup = bs.BeautifulSoup(BASE_url, "html.parser")
filename = "C996JamieCooperTask1.csv"
file = open(filename, "w")
headers = "WebLinks as of 4/7/2019\n"
file.write(headers)
all_Weblinks = soup.find_all('a')
url_set = set()
def clean_links(tags, base_url):
    cleaned_links = set()
    for tag in tags:
        link = tag.get('href')
        if link is None:
            continue
        if link.endswith('/') or link.endswith('#'):
            link = link[-1]
        full_urls = urllib.parse.urljoin(base_url, link)
        cleaned_links.add(full_urls)
    return cleaned_links
baseURL = "https://www.census.gov/programs-surveys/popest.html"
cleaned_links = clean_links(all_Weblinks, baseURL)
for link in cleaned_links:
    file.write(str(link) + '\n')
print ("URI's written to .CSV File")
The code works for all web links that are internal to the baseURL, that is, links that exist within that website, but it doesn't grab any that point outside the site. I know the answer has to be something simple, but after working on this project for some time I just can't see what is wrong with it, so please help me.

You might try a selector such as the following inside a set comprehension. It looks for a tag elements whose href starts with http or /. It is a starting point you can tailor. You would need more logic, because there is at least one URL which is simply / by itself.
links = {item['href'] for item in soup.select('a[href^=http], a[href^="/"]')}
Also, check that all the expected URLs are present in soup, as I suspect some require JavaScript to run on the page.
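One thing worth checking in the question's code, offered as a possible explanation rather than a confirmed diagnosis: link[-1] keeps only the last character of the string, so an external URL ending in / collapses to "/" and urljoin resolves it back to the base site. The sketch below uses only the standard library. The sample hrefs and the split into internal and external links by host name are assumptions made for illustration, not part of the original code.

```python
from urllib.parse import urljoin, urlparse


def clean_links(hrefs, base_url):
    """Resolve hrefs against base_url and split them by host name."""
    base_host = urlparse(base_url).netloc
    internal, external = set(), set()
    for href in hrefs:
        if href is None:
            continue
        # [:-1] drops the final character; [-1] would keep ONLY that character.
        if href.endswith(('/', '#')):
            href = href[:-1]
        full_url = urljoin(base_url, href)
        if urlparse(full_url).netloc == base_host:
            internal.add(full_url)
        else:
            external.add(full_url)
    return internal, external


base = "https://www.census.gov/programs-surveys/popest.html"
# Hypothetical href values, standing in for tag.get('href') results.
sample_hrefs = ["/data/", "about.html#", "https://www.commerce.gov/", None, "#"]
internal, external = clean_links(sample_hrefs, base)

# What the [-1] slice does to an external link that ends in '/'.
collapsed = urljoin(base, "https://www.commerce.gov/"[-1])

print(sorted(internal))
print(sorted(external))
print(collapsed)
```

The href values could come from the selector above, for example by passing the resulting set of links into clean_links.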

Related

Download data from xml using xpath - returns empty list

I am fairly new to using python to collect data from the web. I am interested in writing a script that collects data from an xml webpage. Here is the address:
https://www.w3schools.com/xml/guestbook.asp
import requests
from lxml import html
url = "https://www.w3schools.com/xml/guestbook.asp"
page = requests.get(url)
extractedHtml = html.fromstring(page.content)
guest = extractedHtml.xpath("/guestbook/guest/fname")
print(guest)
I am not certain why this is returning an empty list. I've tried numerous variations of the syntax in the XPath statement, so I'm losing confidence that my overall structure is correct.
For context, I want to write something that will parse the entire xml webpage and return a csv that can be used within other programs. I'm starting with the basics to make sure I understand how the various packages work. Thank you for any help.
This should do it
import requests
from lxml import html
url = "https://www.w3schools.com/xml/guestbook.asp"
page = requests.get(url)
extractedHtml = html.fromstring(page.content)
guest = extractedHtml.xpath("//guestbook/guest/fname")
for i in guest:
    print(i.text)
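In the XPath, you need a double slash (//) at the beginning, so that the expression matches guestbook elements anywhere in the parsed tree rather than only at the root. Also, xpath() returns a list of elements. The text of each element can be extracted using .text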
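For the stated goal of turning the XML into a CSV, a minimal sketch follows. It swaps in the standard library's xml.etree.ElementTree and csv modules instead of lxml, and it parses an inline sample rather than the live page. The sample records and the lname field are hypothetical; only the guestbook, guest and fname names come from the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like the guestbook in the question.
SAMPLE_XML = """<guestbook>
  <guest><fname>Ada</fname><lname>Lovelace</lname></guest>
  <guest><fname>Alan</fname><lname>Turing</lname></guest>
</guestbook>"""

root = ET.fromstring(SAMPLE_XML)

# One dict per <guest>, keyed by the tag name of each child element.
rows = [
    {child.tag: (child.text or "").strip() for child in guest}
    for guest in root.iter("guest")
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["fname", "lname"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

root.iter("guest") walks every guest element at any depth, which plays the same role here as the leading // in the XPath above.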

Python web-scraping misses an element from the list of searched objects

I'm trying to scrape some data using beautifulsoup and requests libraries in Python 3.7. For each of the items (tag article) on this webpage, there is a youtube link. After finding all the instances of article, I can successfully extract the headlines. This code also successfully finds instances of youtube-player class inside each article, except at index 7, where the output is None.
from bs4 import BeautifulSoup
import requests
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
articles = soup.find_all('article')
for article in articles:
    headline = article.h2.a.text
    print(headline)
    link = article.find('iframe', {'class': 'youtube-player'})
    print(link)
However, from the source (output of beautifulsoup), if I directly search for youtube-player, I get all the instances correctly.
links = soup.find_all('iframe', {'class': 'youtube-player'})
for link in links:
    print(link)
How can I improve my code to get all the youtube-player instances within the article loop?
You can use the built-in zip() function to tie titles and youtube links together.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
for title, player in zip(soup.select('.entry-title'),
                         soup.select('iframe.youtube-player')):
    print('{:<75}{}'.format(title.text, player['src']))
Prints:
Git: Difference between “add -A”, “add -u”, “add .”, and “add *” https://www.youtube.com/embed/tcd4txbTtAY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Programming Terms: Combinations and Permutations https://www.youtube.com/embed/QI9EczPQzPQ?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Chrome Quick Tip: Quickly Bookmark Open Tabs for Later Viewing https://www.youtube.com/embed/tsiSg_beudo?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Comprehensions – How they work and why you should be using them https://www.youtube.com/embed/3dt4OGnU5sM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Generators – How to use them and the benefits you receive https://www.youtube.com/embed/bD05uGo_sVI?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Quickest and Easiest Way to Run a Local Web-Server https://www.youtube.com/embed/lE6Y6M9xPLw?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Git for Beginners: Command-Line Fundamentals https://www.youtube.com/embed/HVsySz-h9r4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Time-Saving Keyboard Shortcuts for the Mac Terminal https://www.youtube.com/embed/TXzrk3b9sKM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Overview of Online Learning Resources in 2015 https://www.youtube.com/embed/QGy6M8HZSC4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Else Clauses on Loops https://www.youtube.com/embed/Dh-0lAyc3Bc?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
EDIT: It seems that when you use html.parser, BeautifulSoup doesn't recognize the youtube link in one place. Use lxml or html5lib instead:
import requests
from bs4 import BeautifulSoup
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "lxml")
for article in soup.select('article'):
    title = article.select_one('.entry-title')
    player = article.select_one('iframe.youtube-player') or {'src':''}
    print('{:<75}{}'.format(title.text, player['src']))
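The difference between the two approaches can be illustrated without any parsing at all. zip() pairs items purely by position and stops at the shorter input, so a single missing player shifts every later pairing, while a per-article lookup keeps each title with its own player. The sketch below uses plain dictionaries as stand-ins for parsed articles; the titles and URLs are hypothetical.

```python
# Hypothetical stand-ins for parsed <article> elements; the second has no player.
articles = [
    {"title": "First post", "player": "https://example.com/embed/1"},
    {"title": "Second post", "player": None},
    {"title": "Third post", "player": "https://example.com/embed/3"},
]

titles = [a["title"] for a in articles]
players = [a["player"] for a in articles if a["player"] is not None]

# Positional pairing: "Second post" wrongly receives the third post's player,
# and "Third post" is dropped because zip stops at the shorter list.
zipped = list(zip(titles, players))

# Per-article pairing: every title stays with its own player (or an empty string).
per_article = [(a["title"], a["player"] or "") for a in articles]

print(zipped)
print(per_article)
```

This is why the per-article loop with a fallback value is the safer pattern whenever some articles might lack the element.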

are there alternate ways to download pdfs from the internet without them being corrupted?

I have written code for a web scraper program that is as follows (in python) -
import requests, bs4 #you probably need to install requests and bs4, just go online and type beautiful soup 4 installation and requests installation
link_list = []
res = requests.get('https://scholar.google.com/scholar?start=0&q=traffic+light+reinforcement+learning&hl=en&as_sdt=1,5&as_ylo=2019&as_vis=1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    if('pdf' in link.get('href')):
        link_list.append(link.get('href'))
for x in range(1,100):
    i = str(x*10)
    url = f'https://scholar.google.com/scholar?start={i}&q=traffic+light+reinforcement+learning&hl=en&as_sdt=1,5&as_ylo=2019&as_vis=1'
    res_2 = requests.get(url)
    soup = bs4.BeautifulSoup(res_2.text, 'html.parser')
    for link in soup.find_all('a'):
        if('pdf' in link.get('href')):
            link_list.append(link.get('href'))
if(link_list):
    for x in range(0,len(link_list)):
        res_3 = requests.get(link_list[x])
        #parameter 1 of the open function is set to a file path that is available only on my computer
        #Set it to something that is accessible on your computer.
        #Your final path should be - Something\something\something\{x}.pdf
        #Do not change the last part !
        with open(f'/Users/atharvanaik/Desktop/Cursed/{x}.pdf', 'wb') as f:
            f.write(res_3.content)
        print(x)
else:
    print('sorry, unavailable')
For context, I am trying to bulk download pdfs from a google scholar search, instead of doing it manually.
I managed to download the vast majority of the pdfs, but some of them, when I tried opening them, gave me this message -
"It may be damaged or use a file format that Preview doesn’t recognise."
As seen in the above code, I am using requests to download the content and write it to the file. Is there a way to work around this?
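One possible cause, which the question does not confirm, is that some links containing "pdf" return something other than a PDF, such as an HTML landing page, and that response is then saved under a .pdf name. PDF files carry the signature %PDF- at the start, so the downloaded bytes can be checked before writing. The sketch below is a minimal illustration; the helper name looks_like_pdf and the sample byte strings are hypothetical.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the %PDF- signature appears within the first 1024 bytes."""
    return b"%PDF-" in data[:1024]


# Hypothetical response bodies: one real-looking PDF header, one HTML page.
samples = {
    "paper.pdf": b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n",
    "landing_page.pdf": b"<!DOCTYPE html><html><body>Please log in</body></html>",
}

valid = [name for name, body in samples.items() if looks_like_pdf(body)]
rejected = [name for name, body in samples.items() if not looks_like_pdf(body)]

print(valid)
print(rejected)
```

In the question's loop, the same check could be applied to res_3.content before writing the file, skipping or logging the links that fail it.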

Web scraping issue searching for contents in Youtube trending page with BeautifulSoup

I am trying to build an app that writes the top 10 YouTube trending videos to an Excel file, but I ran into an issue right at the beginning. For some reason, whenever I try to use "soup.find" on any of the ids on this YouTube page, it returns "None" as the result.
I have made sure that my spelling is correct, but it still won't work. I have tried this same code on other sites and get the same result.
#What I did for Youtube which resulted in output being "None"
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')
videos = soup.find(id= "contents")
print(videos)
I expect it to provide me with the HTML code that has this id that I have specified but it keeps saying "None".
The page relies heavily on JavaScript to modify the classes and attributes of tags, so what you see in Developer Tools isn't always what requests gives you. I recommend calling print(soup.prettify()) to see what markup you're actually working with.
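The effect can be reproduced on a small scale. In the sketch below, the raw HTML contains a script that would create an element with id "contents" in a browser, but a parser that does not run JavaScript never sees that element. The sample markup and the IdCollector class are hypothetical and only stand in for a real page; the standard library's html.parser is used so the example has no dependencies.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the id attribute of every start tag in the raw markup."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


# Hypothetical page: the "contents" element exists only after the script runs.
RAW_HTML = """<html><body>
<div id="app"></div>
<script>
document.getElementById("app").innerHTML = '<div id="contents"></div>';
</script>
</body></html>"""

collector = IdCollector()
collector.feed(RAW_HTML)
print(sorted(collector.ids))
```

Only the id present in the served markup is collected; the one the script would add is treated as script text, which matches what soup.find(id="contents") returning None suggests.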
You can use this script to get first 10 trending videos:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')
for i, a in enumerate(soup.select('h3.yt-lockup-title a[title]')[:10], 1):
    print('{: <4}{}'.format(str(i)+'.', a['title']))
Prints (in my case in Estonia):
1. Jaanus Saks - Su säravad silmad
2. Егор Крид - Сердцеедка (Премьера клипа, 2019)
3. Comment Out #11/ Ольга Бузова х Фёдор Смолов
4. 5MIINUST x NUBLU - (ei ole) aluspükse
5. Артур Пирожков - Алкоголичка (Премьера клипа 2019)
6. Slav school of driving - driving instructor Boris
7. ЧТО ЕДЯТ В АРМИИ США VS РОССИИ?
8. RC Airplane Battle | Dude Perfect
9. ЧЕЙ КОРАБЛИК ОСТАНЕТСЯ ПОСЛЕДНИЙ, ПОЛУЧИТ 1000$ !
10. Khloé Kardashian's New Mom Beauty Routine | Beauty Secrets | Vogue
Since YouTube relies heavily on JavaScript to render and modify pages as they load, it's a better idea to load the page in a browser and then pass its page source to BeautifulSoup. Selenium can be used for this purpose. Once the soup object is obtained, you can do whatever you want with it.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import os
driver = webdriver.Firefox(executable_path="/home/rishabh/Documents/pythonProjects/webScarapping/geckodriver")
driver.get('https://www.youtube.com/feed/trending')
content = driver.page_source
driver.close()
soup = BeautifulSoup(content, 'html.parser')
#Do whatever you want with it
To configure Selenium, see https://selenium-python.readthedocs.io/installation.html

Making a webcrawler - Won't go into my for-loop

I'm making a webcrawler for fun. Basically what I want to do for example is to crawl this page
http://www.premierleague.com/content/premierleague/en-gb/matchday/results.html?paramClubId=ALL&paramComp_8=true&paramSeasonId=2010&view=.dateSeason
and first of all get all the home teams. Here is my code:
import requests
from bs4 import BeautifulSoup

def urslit_spider(max_years):
    year = 2010
    while year <= max_years:
        url = 'http://www.premierleague.com/content/premierleague/en-gb/matchday/results.html?paramClubId=ALL&paramComp_8=true&paramSeasonId=' + str(year) + '&view=.dateSeason'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class' : 'clubs rHome'}):
            lid = link.string
            print(lid)
        year += 1
I've found out that the code won't enter the for loop. It gives me no error, but it doesn't do anything. I tried to search for this but can't find what's wrong.
The link you provided redirected me to the homepage. Tinkering with the URL, I got to http://br.premierleague.com/en-gb/matchday/results.html
At this URL I get all the home team names using
soup.findAll('td', {'class' : 'home'})
How can I navigate to the link you provided? Maybe the HTML is different on that page.
Edit: It looks like the content of this website is loaded from this URL: http://br.premierleague.com/pa-services/api/football/lang_en_gb/i18n/competition/fandr/api/gameweek/1.json
By tinkering with the URL parameters, you can find a lot of information.
I still can't open the URL you provided, as it keeps redirecting me. On the page I linked, I can't extract the table info from the HTML (with BeautifulSoup) because the page gathers that info from the JSON above.
The best approach is to use that JSON to get the information you need. My advice is to use Python's json package.
If you are new to JSON, you can use this website to make the JSON more readable: https://jsonformatter.curiousconcept.com/
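As a rough illustration of the json approach, the sketch below parses an inline JSON string and pulls out the home teams. The structure and key names (matches, home, away) are hypothetical and are not the real schema of the Premier League endpoint, which would need to be inspected first; in a real script the text would come from the HTTP response rather than a literal.

```python
import json

# Hypothetical payload; the real API's keys and nesting will differ.
SAMPLE_JSON = """
{
  "matches": [
    {"home": "Team A", "away": "Team B"},
    {"home": "Team C", "away": "Team D"}
  ]
}
"""

data = json.loads(SAMPLE_JSON)

# Walk the decoded dict/list structure like any other Python data.
home_teams = [match["home"] for match in data["matches"]]
print(home_teams)

# Pretty-printing helps when exploring an unfamiliar payload.
print(json.dumps(data, indent=2))
```

Printing with json.dumps(data, indent=2) gives a readable view similar to what the formatter website provides.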
