Making a web crawler - won't go into my for loop - python-3.x

I'm making a web crawler for fun. Basically, what I want to do, for example, is to crawl this page
http://www.premierleague.com/content/premierleague/en-gb/matchday/results.html?paramClubId=ALL&paramComp_8=true&paramSeasonId=2010&view=.dateSeason
and first of all get all the home teams. Here is my code:
import requests
from bs4 import BeautifulSoup

def urslit_spider(max_years):
    year = 2010
    while year <= max_years:
        url = 'http://www.premierleague.com/content/premierleague/en-gb/matchday/results.html?paramClubId=ALL&paramComp_8=true&paramSeasonId=' + str(year) + '&view=.dateSeason'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class' : 'clubs rHome'}):
            lid = link.string
            print(lid)
        year += 1
I've found that the code won't enter the for loop. It gives me no error, but it doesn't do anything. I tried to search for this but can't find what's wrong.

The link you provided redirected me to the homepage. Tinkering with the URL, I get to http://br.premierleague.com/en-gb/matchday/results.html
On that page I can get all the home team names using
soup.findAll('td', {'class' : 'home'})
How can I navigate to the link you provided? Maybe the HTML is different on that page.
Edit: it looks like the content of this website is loaded from this URL: http://br.premierleague.com/pa-services/api/football/lang_en_gb/i18n/competition/fandr/api/gameweek/1.json
Tinkering with the URL parameters, you can find lots of information.
I still can't open the URL you provided (it keeps redirecting me), but on the page I linked I can't extract the table info from the HTML (with BeautifulSoup) because the page fills the table from that JSON above.
The best thing to do is to use that JSON to get the information you need. My advice is to use Python's JSON support; a sketch follows below.
If you are new to JSON, you can use this website to make it more readable: https://jsonformatter.curiousconcept.com/
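For example, a minimal sketch of that approach. Note that requests decodes JSON for you; the 'matches' and 'homeTeam' keys below are hypothetical placeholders, so inspect the real payload (e.g. with the formatter above) and adjust the paths:

import requests

url = ('http://br.premierleague.com/pa-services/api/football/'
       'lang_en_gb/i18n/competition/fandr/api/gameweek/1.json')
data = requests.get(url).json()  # decoded straight into Python dicts/lists

# The keys below are placeholders, not the real schema: drill into
# whatever structure the formatted JSON actually shows you.
for match in data.get('matches', []):
    print(match.get('homeTeam'))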

Related

Scraping from website list returns a null result based on XPath

So I'm trying to scrape the job listings off this site: https://www.dsdambuster.com/careers
I have the following code:
import requests
from lxml import html

url = "https://www.dsdambuster.com/careers"
page = requests.get(url, verify=False)
tree = html.fromstring(page.content)
path = '/html/body/div[1]/section/div/div/div[2]/div[1]/div/div[2]/div/div[9]/div[1]/div[3]/div[*]/div[1]/a[*]/div/div[1]/div'
jobs = tree.xpath(path)
for job in jobs:
    Title = job.text
    print(Title)
Not too sure why it wouldn't work...
I see 2 issues here:
You are using a very bad XPath. It is extremely fragile and not reliable.
Instead of
'/html/body/div[1]/section/div/div/div[2]/div[1]/div/div[2]/div/div[9]/div[1]/div[3]/div[*]/div[1]/a[*]/div/div[1]/div'
please use
'//div[@class="vf-vacancy-title"]'
You are possibly missing a wait / delay.
I'm not familiar with the library you are using here, but with Selenium, which I am familiar with, you need to wait for the elements to be completely loaded before extracting their text contents; see the sketch below.
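A minimal Selenium sketch of that wait (the vf-vacancy-title class comes from the XPath above; treat the rest as an illustrative assumption, not the asker's setup):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.dsdambuster.com/careers")

# Block for up to 10 seconds until the vacancy titles exist in the DOM.
titles = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, "div.vf-vacancy-title")))

for title in titles:
    print(title.text)
driver.quit()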

Scraping websites with Python 3 (Scrapy, BS4) yields incomplete data. Cannot figure out why

Some time ago I set up a web scraper using BS4, logging the value of a whisky each day:
import requests
from bs4 import BeautifulSoup

def getPrice() -> float:
    try:
        URL = "https://www.thewhiskyexchange.com/p/2940/suntory-yamazaki-12-year-old"
        website = requests.get(URL)
    except:
        print("ERROR requesting Price")
    try:
        soup = BeautifulSoup(website.content, 'html.parser')
        price = str(soup.find("p", class_="product-action__price").next)
        price = float(price[1::])
        return price
    except:
        print("ERROR parsing Price")
This worked as intended: the request contained the complete website and the correct value was extracted.
I am now trying to scrape other sites for data on other whiskies, this time using Scrapy.
I tried the following URLs:
https://www.thegrandwhiskyauction.com/past-auctions/q-macallan/180-per-page/relevance
https://www.ebay.de/sch/i.html?_sacat=0&LH_Complete=1&_udlo=&_udhi=&_samilow=&_samihi=&_sadis=10&_fpos=&LH_SALE_CURRENCY=0&_sop=12&_dmd=1&_fosrp=1&_nkw=macallan&rt=nc
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "whisky"

    def start_requests(self):
        user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'
        urls = [
            'https://www.thegrandwhiskyauction.com/past-auctions/q-macallan/180-per-page/relevance',
        ]
        for url in urls:
            # Send the user agent with the request so the site sees a browser-like client.
            yield scrapy.Request(url=url, callback=self.parse,
                                 headers={'User-Agent': user_agent})

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'whisky-{page}.html'
        # data = response.css('.itemDetails').getall()
        with open(filename, 'wb') as f:
            f.write(response.body)
I just customized the basic example from the tutorial to create the fast prototype above.
However, it did not return the complete website: the body of the response was missing several tags, and in particular the content I was looking for.
I tried to solve this with BS4 again like this:
import requests
from bs4 import BeautifulSoup

URL = "https://www.thegrandwhiskyauction.com/past-auctions/q-macallan/180-per-page/relevance"
website = requests.get(URL)
soup = BeautifulSoup(website.content, 'html.parser')
with open("whiskeySoup.html", 'w') as f:
    f.write(str(soup.body))
To my surprise, this produced the same result: the request and its body did not contain the complete website and were missing all the data I was looking for.
I also included a user-agent header, since I learned that some sites recognize requests from bots and spiders and do not deliver all their data. However, this did not solve the problem.
I am unable to figure out or debug why the data requested from those URLs is incomplete.
Is there a way to solve this using Scrapy?
A lot of websites rely heavily on JavaScript to generate the final HTML of a page. When you send a request, the server returns HTML containing some script; web browsers like Chrome and Firefox execute that JavaScript, and only then does the final HTML you see appear. Scrapy, requests, and similar libraries do not execute JavaScript, so the HTML your browser shows and the HTML the crawler sees are different.
If you want to see the webpage as the crawler sees it, run the command 'scrapy view {url}', which opens that version of the page in a browser, or 'scrapy fetch {url}' to print its HTML. When working with Scrapy, it is a good idea to open the URL in the shell (the command is 'scrapy shell {url}') and test your extraction logic there with the xpath or css methods (e.g. response.css('some_css').css('again_some_css')) before adding it to your final crawler. If you want to see which response you got in the shell, just type view(response) and it will open the received response in a browser. I hope that is clear.
If you need to process the JavaScript before finally processing the response, you can use Selenium, which drives a real (optionally headless) browser, or Splash, which is a lightweight web browser for scraping. Selenium is pretty easy to use; a sketch follows below.
Edit 1: for the first URL, go to the Scrapy shell and check the CSS path div.bidPrice::text. You will see that there is no static HTML there; the content inside is generated dynamically.
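For example, a minimal sketch of the Selenium route (assuming Chrome is available and reusing the .itemDetails selector from the prototype above; adjust both to taste):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.thegrandwhiskyauction.com/past-auctions/q-macallan/180-per-page/relevance"
driver = webdriver.Chrome()
driver.get(url)

# Wait until the page's JavaScript has rendered at least one item.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".itemDetails")))

# page_source now holds the JavaScript-generated HTML, so the familiar
# BeautifulSoup workflow applies again.
soup = BeautifulSoup(driver.page_source, "html.parser")
for item in soup.select(".itemDetails"):
    print(item.get_text(strip=True))
driver.quit()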

Unable to get values from a GET request

I am trying to scrape a website to extract values.
I get text back in the response, but cannot see any of the values shown on the website in that response.
How do I get the values, please?
I have used basic code from Stack Overflow as a test to explore the data. The code is posted below. It works on other sites, but not this one.
import requests

url = 'https://www.ishares.com/uk/individual/en/products/253741/'
data = requests.get(url).text
with open('F:\\webScrapeFolder\\out.txt', 'w', encoding='utf-8') as f:
    print(data, file=f)
print('--- end ---')
There is no error message, and the file is written correctly.
However, I do not see any of the numbers.
Check with this URL instead:
url = "https://www.ishares.com/uk/individual/en/products/253741/?switchLocale=y&siteEntryPassthrough=true"
and try to get the response. If you still cannot get what is expected, can you say more about what is needed?
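In context, a minimal sketch of that suggestion (the extra query parameters are meant to skip the locale/entry redirect; whether the numbers then appear still needs to be verified against the page):

import requests

url = ("https://www.ishares.com/uk/individual/en/products/253741/"
       "?switchLocale=y&siteEntryPassthrough=true")
data = requests.get(url).text

# Inspect the start of the response to confirm the real page came back
# instead of the redirect stub.
print(data[:500])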

Web Links Scraping

I'm working on a project that requires me to scrape unique links from a website and save them to a CSV file. I've read through quite a bit of material on how to do this; I've watched videos and done trainings on Pluralsight and LinkedIn Learning, and I mostly have this situation figured out, but there is one aspect of the assignment that I'm not sure how to do.
The program is supposed to scrape web links both from the domain that is given (see code below) and any web links outside of the domain.
import bs4 as bs
import urllib.request
import urllib.parse

BASE_url = urllib.request.urlopen("https://www.census.gov/programs-surveys/popest.html").read()
soup = bs.BeautifulSoup(BASE_url, "html.parser")

filename = "C996JamieCooperTask1.csv"
file = open(filename, "w")
headers = "WebLinks as of 4/7/2019\n"
file.write(headers)

all_Weblinks = soup.find_all('a')
url_set = set()

def clean_links(tags, base_url):
    cleaned_links = set()
    for tag in tags:
        link = tag.get('href')
        if link is None:
            continue
        if link.endswith('/') or link.endswith('#'):
            link = link[:-1]  # drop the trailing '/' or '#'
        full_urls = urllib.parse.urljoin(base_url, link)
        cleaned_links.add(full_urls)
    return cleaned_links

baseURL = "https://www.census.gov/programs-surveys/popest.html"
cleaned_links = clean_links(all_Weblinks, baseURL)
for link in cleaned_links:
    file.write(str(link) + '\n')
file.close()

print("URI's written to .CSV file")
The code works for all web links that are internal to the baseURL, i.e. that exist within that website, but it doesn't grab any that point outside the site. I know the answer has to be something simple, but after working on this project for some time I just can't see what is wrong with it, so please help me.
You might try a selector such as the following inside a set comprehension. It looks for a tag elements whose href starts with http or /. It is a starting point you can tailor; you would need more logic, because there is at least one URL which is simply / by itself (see the sketch below for one way to resolve those).
links = {item['href'] for item in soup.select('a[href^=http], a[href^="/"]')}
Also, check that all expected URLs are present in soup, as I suspect some require JavaScript to run on the page.
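A short sketch of that extra logic (self-contained for illustration; it swaps in requests for the question's urllib.request purely for brevity):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "https://www.census.gov/programs-surveys/popest.html"
soup = BeautifulSoup(requests.get(base).content, "html.parser")

links = {item['href'] for item in soup.select('a[href^=http], a[href^="/"]')}
# urljoin resolves relative hrefs (including the bare "/") against the
# base URL and passes absolute external links through unchanged, so both
# internal and external links end up in the result.
absolute_links = {urljoin(base, link) for link in links}
for link in sorted(absolute_links):
    print(link)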

Extract only first post content from URL that has multiple tumblr posts with Python

I am trying to extract only the actual content/text from a given input URL using the newspaper package in Python 3. I have succeeded in doing so, but one of my URLs consists of multiple tumblr posts on the same page.
From the URL below I want the content of the first post only, i.e. the paragraph starting with "The Karnataka Assembly election 2018 result is close to being known as vote counting is underway on Tuesday":
https://poonamparekh.tumblr.com/post/173920050130/karnataka-election-results-modi-rallies-set-to
In my working, when extracting content from the above URL, I am getting the 6th post's content as my output instead of the first post. But that's not what I need; I require the first post as my output. Can anyone help me out in achieving this?
Here is my code:
from newspaper import Article

url = "https://poonamparekh.tumblr.com/post/173920050130/karnataka-election-results-modi-rallies-set-to"
print(url)
article = Article(url, language='en')
article.download()
print('article_state : ', article.download_state)
if article.download_state == 2:
    try:
        article.parse()
        result = article.text
        print(result[:150])
        if result == '':
            print('----MESSAGE : No description written for this post')
    except Exception as e:
        print(e)
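One way to get at the first post (a sketch, assuming the posts end up concatenated in article.text separated by newlines; verify against the actual extracted text, since the order the extractor sees may differ from the page):

from newspaper import Article

url = "https://poonamparekh.tumblr.com/post/173920050130/karnataka-election-results-modi-rallies-set-to"
article = Article(url, language='en')
article.download()
if article.download_state == 2:  # download succeeded
    article.parse()
    # Split the extracted text into paragraphs and keep the first
    # non-empty one; this assumes the first post comes first in the text.
    paragraphs = [p for p in article.text.split('\n') if p.strip()]
    if paragraphs:
        print(paragraphs[0][:150])
    else:
        print('----MESSAGE : No description written for this post')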
