Using Scrapy to Extract Wikipedia Links - python-3.x

I am learning about scrapy and am trying to use it to scrape the below Wikipedia page:
https://en.wikipedia.org/wiki/List_of_sovereign_states_in_the_2020s
I would like to scrape each country and the hyperlink attached to that country and below is my code so far:
import scrapy
class CountrypopSpider(scrapy.Spider):
name = 'countryPop'
allowed_domains = ['en.wikipedia.org']
start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states_in_the_2020s']
def parse(self, response):
countries = response.xpath('//table//b//#title').extract()
for country in countries:
country_url = response.xpath('//table//b[contains(#href, 'Afghanistan')]').extract()
yield {'countries': country}
What it currently does is get all the countries from the main table and then I want it to loop through each of these countries, using the country name to get the url. I am having trouble though finding a way of using the country name to find the url, my latest attempt was using contains().
Any other comments on my scraping code would be appreciated.
Thanks

Try this
Approach 1
import scrapy
class CountrypopSpider(scrapy.Spider):
name = 'countryPop'
allowed_domains = ['en.wikipedia.org']
start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states_in_the_2020s']
def parse(self, response):
coutries=200
cnames=['Australia','Bhutan']
noduplicateset= set()
for cname in cnames:
for title in response.xpath('//table[1]//a[contains(#title,'+cname+')]'):
if cname not in noduplicateset:
yield {cname:'https://en.wikipedia.org'+title.css('a').get().split("\"")[1]}
noduplicateset.add(cname)
Approach 2
import scrapy
class CountrypopSpider(scrapy.Spider):
LOG_LEVEL = 'INFO'
name = 'countryPop'
allowed_domains = ['en.wikipedia.org']
start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states_in_the_2020s']
def parse(self, response):
coutries=200
cnames=['Australia','Bhutan']
for i in range(5,coutries):
for title in response.xpath('//*[#id="mw-content-text"]/div/table[1]/tbody/tr['+str(i+2)+']/td[1]/b/a'):
name=title.css('a ::text').get()
if name in cnames:
yield {name:'https://en.wikipedia.org'+title.css('a').get().split("\"")[1]}
If you output to json file, it will look like this

Related

Scrape Multiple articles from one page with each article with separate href

I am new to scrapy and writing my first spider make a scrapy spider for website similar to https://blogs.webmd.com/diabetes/default.htm
I want to scrape Headlines and then navigate to each article scrape the text content for each article.
I have tried by using rules and linkextractor but it's not able to navigate to next page and extract. i get the ERROR: Spider error processing https://blogs.webmd.com/diabetes/default.htm> (referer: None)
Below is my code
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
class MedicalSpider(scrapy.Spider):
name = 'medical'
allowed_domains = ['https://blogs.webmd.com/diabetes/default.htm']
start_urls = ['https://blogs.webmd.com/diabetes/default.htm']
Rules = (Rule(LinkExtractor(allow=(), restrict_css=('.posts-list-post-content a ::attr(href)')), callback="parse", follow=True),)
def parse(self, response):
headline = response.css('.posts-list-post-content::text').extract()
body = response.css('.posts-list-post-desc::text').extract()
print("%s : %s" % (headline, body))
next_page = response.css('.posts-list-post-content a ::attr(href)').extract()
if next_page:
next_href = next_page[0]
next_page_url = next_href
request = scrapy.Request(url=next_page_url)
yield request
Please guide a newbie in scrapy to get this spider right for multiple articles on each page.
Usually when using scrapy each response is parsed by parse callback. The main parse method is the callback for the initial response obtained for each of the start_urls.
The goal for that parse function should then be to "Identify article links", and issue requests for each of them. Those responses would then be parsed by another callback, say parse_article that would then extract all the contents from that particular article.
You don't even need that LinkExtractor. Consider:
import scrapy
class MedicalSpider(scrapy.Spider):
name = 'medical'
allowed_domains = ['blogs.webmd.com'] # Only the domain, not the URL
start_urls = ['https://blogs.webmd.com/diabetes/default.htm']
def parse(self, response):
article_links = response.css('.posts-list-post-content a ::attr(href)')
for link in article_links:
url = link.get()
if url:
yield response.follow(url=url, callback=self.parse_article)
def parse_article(self, response):
headline = 'some-css-selector-to-get-the-headline-from-the-aticle-page'
# The body is trickier, since it's spread through several tags on this particular site
body = 'loop-over-some-selector-to-get-the-article-text'
yield {
'headline': headline,
'body': body
}
I've not pasted the full code because I believe you still want some excitement learning how to do this, but you can find what I came up with on this gist
Note that the parse_article method is returning dictionaries. These are using Scrapy's items pipelines. You can get a neat json output by running your code using: scrapy runspider headlines/spiders/medical.py -o out.json

Where are my mistakes in my scrapy codes?

I want to crawl a website via scrapy but my codes come up with an error.
I have tried to use xpath but it seems I can not define the div class in the web site.
The following code raises an error on ("h2 ::text").extract().
import scrapy
from scrapy.selector import Selector
from artistlist.items import ArtistlistItem
class MySpider(scrapy.Spider):
name = "artistlist"
allowed_domains = ["baltictriennial13.org"]
start_urls = ["https://www.baltictriennial13.org/artist/caroline-achaintre/"]
def parse(self, response):
hxs = Selector(response)
titles = hxs.xpath("//div[#class='artist']")
items = []
for titles in titles:
item = ArtistlistItem()
item["artist"] = titles.select("h2 ::text").extract()
item["biograpy"] = titles.select("p::text").extract()
items.append(item)
return items
I want to crawl the web site and store the data in a .csv file.
The main issue with your code is using of .select instead of .css. Here is what do you need but I'm not sure about titles part (may be you need it on other pages):
def parse(self, response):
titles = response.xpath("//div[#class='artist']")
# items = []
for title in titles:
item = ArtistlistItem()
item["artist"] = title.css("h2::text").get()
item["biograpy"] = title.css("p::text").get()
# items.append(item)
yield item
try to remove the space in h2 ::text --> h2::text. If that doesn't work try h2/text()

Scrapy spider returns no items data

My scrapy script seems not to follow links, which ends up not extracting data from each of them (to pass some content as scrapy items).
I am trying to scrape a lot of data from a news website. I managed to copy/write a spider that, as I assumed, should read links from a file (I've generated it with another script), put them in start_urls list and start following these links to extract some data, and then pass it as items, and also -- write each item's data in a separate file (last part is actually for another question).
After running scrapy crawl PNS, script goes through all the links from start_urls but does nothing more -- it follows links read from start_urls list (I'm getting "GET link" message in bash), but seems not to enter them and read some more links to follow and extract data from.
import scrapy
import re
from ProjectName.items import ProjectNameArticle
class ProjectNameSpider(scrapy.Spider):
name = 'PNS'
allowed_domains = ['www.project-domain.com']
start_urls = []
with open('start_urls.txt', 'r') as file:
for line in file:
start_urls.append(line.strip())
def parse(self, response):
for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('#href').extract():
# extracted links look like this: "/document.html"
link = "https://project-domain.com" + link
yield scrapy.Request(link, callback=self.parse_news)
def parse_news(self, response):
data_dic = ProjectNameArticle()
data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
news_text = response.css('div.article__text').extract_first()
news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
data_dic['article_text'] = news_text
return data_dic
Expected result:
Script opens start_urls.txt file, reads its lines (every line contains a single link), puts these links to start_urls list,
For each link opened spider extracts deeper links to be followed (that's about 50-200 links for each start_urls link),
Followed links are the main target from which I want to extract specific data: article title, date, time, text.
For now never mind writing each scrapy item to a distinc .txt file.
Actual result:
Running my spider triggers GET for each start_urls link, goes through around 150000, doesn't create a list of deeper links, nor enters them to extract any data.
Dude, I have been coding in Python Scrapy for long time and I hate using start_urls
You can simply use start_requests which is very easy to read, and also very easy to learn for beginners
class ProjectNameSpider(scrapy.Spider):
name = 'PNS'
allowed_domains = ['www.project-domain.com']
def start_requests(self):
with open('start_urls.txt', 'r') as file:
for line in file:
yield Request(line.strip(),
callback=self.my_callback_func)
def my_callback_func(self, response):
for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('#href').extract():
# extracted links look like this: "/document.html"
link = "https://project-domain.com" + link
yield scrapy.Request(link, callback=self.parse_news)
def parse_news(self, response):
data_dic = ProjectNameArticle()
data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
news_text = response.css('div.article__text').extract_first()
news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
data_dic['article_text'] = news_text
return data_dic
I also have never used Item class and find it useless too
You can simply have data_dic = {} instead of data_dic = ProjectNameArticle()

Scrapy - xpath - extract returns null

My goal is to build a scraper that extract data from a table from this site.
Initially I followed the tutorial of Scrapy where I succeeded in extracting data from the test site. When I try to replicate it for Bitinfocharts, first issue is I need to use xpath, which the tutorial doesn't cover in detail (they use css only). I have been able to scrape the specific data I want through shell.
My current issue is understanding how I can scrape them all from my code and at the same time write the results to a .csv / .json file?
I'm probably missing something completely obvious. If you can have a look at my code and let me know I'm doing wrong, I would deeply appreciate it.
Thanks!
First attempt:
import scrapy
class RichlistTestItem(scrapy.Item):
# overview details
wallet = scrapy.Field()
balance = scrapy.Field()
percentage_of_coins = scrapy.Field()
class QuotesSpider(scrapy.Spider):
name = "quotes"
allowed_domain = ['https://bitinfocharts.com/']
start_urls = [
'https://bitinfocharts.com/top-100-richest-vertcoin-addresses.html'
]
def parse(self, response):
for sel in response.xpath("//*[#id='tblOne']/tbody/tr/"):
scrapy.Item in RichlistTestItem()
scrapy.Item['wallet'] = sel.xpath('td[2]/a/text()').extract()[0]
scrapy.Item['balance'] = sel.xpath('td[3]/a/text').extract()[0]
scrapy.Item['percentage_of_coins'] = sel.xpath('/td[4]/a/text').extract()[0]
yield('wallet', 'balance', 'percentage_of_coins')
Second attempt: (probably closer to 50th attempt)
import scrapy
class RichlistTestItem(scrapy.Item):
# overview details
wallet = scrapy.Field()
balance = scrapy.Field()
percentage_of_coins = scrapy.Field()
class QuotesSpider(scrapy.Spider):
name = "quotes"
allowed_domain = ['https://bitinfocharts.com/']
start_urls = [
'https://bitinfocharts.com/top-100-richest-vertcoin-addresses.html'
]
def parse(self, response):
for sel in response.xpath("//*[#id='tblOne']/tbody/tr/"):
wallet = sel.xpath('td[2]/a/text()').extract()
balance = sel.xpath('td[3]/a/text').extract()
percentage_of_coins = sel.xpath('/td[4]/a/text').extract()
print(wallet, balance, percentage_of_coins)
I have fixed your second trial, specifically the code snippet below
for sel in response.xpath("//*[#id=\"tblOne\"]/tbody/tr"):
wallet = sel.xpath('td[2]/a/text()').extract()
balance = sel.xpath('td[3]/text()').extract()
percentage_of_coins = sel.xpath('td[4]/text()').extract()
The problems, I found are
there was a trailing "/" for the table row selector.
for balance the
value was inside td not inside a link inside td
for percetag.. again
the value was inside td.
Also there is a data-val property for each of the td. Scraping those might be little easier than getting the value from inside of td.

yield scrapy.Request does not return the title

I am new to Scrapy and try to use it to practice crawling the website. However, even I followed the codes provided by the tutorial, it does not return the results. It looks like yield scrapy.Request does not work. My codes are as follow:
Import scrapy
from bs4 import BeautifulSoup
from apple.items import AppleItem
class Apple1Spider(scrapy.Spider):
name = 'apple'
allowed_domains = ['appledaily.com']
start_urls =['http://www.appledaily.com.tw/realtimenews/section/new/']
def parse(self, response):
domain = "http://www.appledaily.com.tw"
res = BeautifulSoup(response.body)
for news in res.select('.rtddt'):
yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)
def parse_detail(self, response):
res = BeautifulSoup(response.body)
appleitem = AppleItem()
appleitem['title'] = res.select('h1')[0].text
appleitem['content'] = res.select('.trans')[0].text
appleitem['time'] = res.select('.gggs time')[0].text
return appleitem
It shows that spider was opened and closed but it returns nothing. The version of Python is 3.6. Can anyone please help? Thanks.
EDIT I
The crawl log can be reached here.
EDIT II
Maybe if I change the codes as below will make the issue clearer:
Import scrapy
from bs4 import BeautifulSoup
class Apple1Spider(scrapy.Spider):
name = 'apple'
allowed_domains = ['appledaily.com']
start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']
def parse(self, response):
domain = "http://www.appledaily.com.tw"
res = BeautifulSoup(response.body)
for news in res.select('.rtddt'):
yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)
def parse_detail(self, response):
res = BeautifulSoup(response.body)
print(res.select('#h1')[0].text)
The codes should print out the url and the title separately but it does not return anything.
Your log states:
2017-07-10 19:12:47 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to
'www.appledaily.com.tw': http://www.appledaily.com.tw/realtimenews/article/life/201
70710/1158177/oBike%E7%A6%81%E5%81%9C%E6%A9%9F%E8%BB%8A%E6%A0%BC%E3%80%80%E6%96%B0%E5%8C%
97%E7%81%AB%E9%80%9F%E5%86%8D%E5%85%AC%E5%91%8A6%E5%8D%80%E7%A6%81%E5%81%9C>
Your spider is set to:
allowed_domains = ['appledaily.com']
So it should probably be:
allowed_domains = ['appledaily.com.tw']
It seems like the content you are interested in your parse method (i.e. list items with class rtddt) is generated dynamically -- it can be inspected for example using Chrome, but is not present in HTML source (what Scrapy obtains as a response).
You will have to use something to render the page for Scrapy first. I would recommend Splash together with scrapy-splash package.

Resources