My code (pasted below) almost does what I want, but it only covers 29 of the 30 pages and leaves out the last one. I would also like it to go beyond page 30, but the website has no button for that (the pages do work when you manually fill in page=31 in the link). With DEPTH_LIMIT set to 29 everything is fine, but with 30 I get the following error in the command prompt:
  File "C:\Users\Ewald\Scrapy\OB\OB\spiders\spider_OB.py", line 23, in parse
    next_link = 'https://zoek.officielebekendmakingen.nl/' + s.xpath('//a[@class="volgende"]/@href').extract()[0]
IndexError: list index out of range
I've tried various approaches, but they all seem to fail me...
class OB_Crawler(CrawlSpider):
    name = 'OB5'
    allowed_domains = ["https://www.officielebekendmakingen.nl/"]
    start_urls = ["https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=DatumPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4"]
    custom_settings = {
        'BOT_NAME': 'OB-crawler',
        'DEPTH_LIMIT': 30,
        'DOWNLOAD_DELAY': 0.1
    }

    def parse(self, response):
        s = Selector(response)
        next_link = 'https://zoek.officielebekendmakingen.nl/' + s.xpath('//a[@class="volgende"]/@href').extract()[0]
        if len(next_link):
            yield self.make_requests_from_url(next_link)
        posts = response.selector.xpath('//div[@class = "lijst"]/ul/li')
        for post in posts:
            i = TextPostItem()
            i['title'] = ' '.join(post.xpath('a/@href').extract()).replace(';', '').replace(' ', '').replace('\r\n', '')
            i['link'] = ' '.join(post.xpath('a/text()').extract()).replace(';', '').replace(' ', '').replace('\r\n', '')
            i['info'] = ' '.join(post.xpath('a/em/text()').extract()).replace(';', '').replace(' ', '').replace('\r\n', '').replace(',', '-')
            yield i
The "index out of range" error is the result of an incorrect XPath: the expression matches nothing, so you end up asking for the first item of an empty list.
Change your "next_link = ..." line to:
next_link = 'https://zoek.officielebekendmakingen.nl/' + s.xpath('//a[contains(@class, "volgende")]/@href').extract()[0]
You need to use contains(), which matches any element whose class attribute contains the given string instead of requiring the attribute to equal it exactly.
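Whichever XPath is used, indexing [0] will still raise an IndexError on any page where no "volgende" link is found, which is presumably the case on the last results page. A minimal sketch of a guard, assuming the extracted href values arrive as a plain list; build_next_link and BASE_URL are hypothetical names that are not part of the original spider:

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the string concatenation in the question.
BASE_URL = "https://zoek.officielebekendmakingen.nl/"


def build_next_link(hrefs, base_url=BASE_URL):
    """Return the absolute URL of the next page, or None when no link was found."""
    if not hrefs:
        # Empty list: there is no "next" link, so there is nothing to follow.
        return None
    # urljoin handles both relative and absolute href values.
    return urljoin(base_url, hrefs[0])
```

In parse, this would be called with the list returned by .extract(), and a new request would only be yielded when the result is not None.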
I am new to scraping and I am trying to scrape data from this website: https://seffaflik.epias.com.tr/transparency/uretim/gerceklesen-uretim/gercek-zamanli-uretim.xhtml
When I fetch the data without applying filters, everything works. But the data I need is for a specific power plant and date, and I am having a hard time figuring out why I cannot apply the filters.
import scrapy
from scrapy.http import FormRequest
from ..items import EpiasscrapingItem


class EpiasSpider(scrapy.Spider):
    name = 'epias'
    start_urls = [
        'https://seffaflik.epias.com.tr/transparency/uretim/gerceklesen-uretim/gercek-zamanli-uretim.xhtml'
    ]

    def parse(self, response):
        return FormRequest.from_response(response, formdata={
            'j_idt205': 'j_idt205',
            'j_idt205:date1_input': '20.03.2021',
            'j_idt205:date2_input': '20.03.2021',
            'j_idt205:powerPlant_input': '2614',
        }, callback=self.start_scraping)

    def start_scraping(self, response):
        items = EpiasscrapingItem()
        table_epias = response.css('.ui-datatable-odd')
        for epias in table_epias:
            date = epias.css('.ui-widget-content .TexAlCenter:nth-child(1)').css('::text').extract()
            time = epias.css('.ui-widget-content .TexAlCenter:nth-child(2)').css('::text').extract()
            biogas = epias.css('.ui-widget-content .TexAlCenter:nth-child(15)').css('::text').extract()
            items['date'] = date
            items['time'] = time
            items['biogas'] = biogas
            yield items
You forgot to include javax.faces.ViewState and a few other fields among the parameters that must be sent with the POST request. Once they are included, you can change the values of date1_input, date2_input and powerPlant_input to fetch the relevant content. The following script should work:
import scrapy


class EpiasSpider(scrapy.Spider):
    name = 'epias'
    start_urls = [
        'https://seffaflik.epias.com.tr/transparency/uretim/gerceklesen-uretim/gercek-zamanli-uretim.xhtml'
    ]
    post_url = 'https://seffaflik.epias.com.tr/transparency/uretim/gerceklesen-uretim/gercek-zamanli-uretim.xhtml'

    def parse(self, response):
        # The ViewState token is read from the page and sent back with the form.
        payload = {
            'j_idt205': 'j_idt205',
            'j_idt205:date1_input': '11.02.2021',
            'j_idt205:date2_input': '20.03.2021',
            'j_idt205:powerPlant_focus': '',
            'j_idt205:powerPlant_input': '2336',
            'j_idt205:goster': '',
            'j_idt205:dt_rppDD': '24',
            'javax.faces.ViewState': response.css(".ContainerIndent input[name='javax.faces.ViewState']::attr(value)").get()
        }
        yield scrapy.FormRequest(self.post_url, formdata=payload, callback=self.parse_content)

    def parse_content(self, response):
        for epias in response.css('.ui-datatable-odd'):
            items = {}
            date = epias.css('tr.ui-widget-content > .TexAlCenter:nth-child(1)::text').get()
            time = epias.css('tr.ui-widget-content > .TexAlCenter:nth-child(2)::text').get()
            total = epias.css('tr.ui-widget-content > .TexAlCenter:nth-child(3)::text').get()
            items['date'] = date
            items['time'] = time
            items['total'] = total
            yield items
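To illustrate why the hidden fields matter, here is a minimal standard-library sketch of collecting every hidden input (including javax.faces.ViewState) from a page and merging the filter values into it. HiddenInputParser, build_payload and SAMPLE_HTML are hypothetical, and the sample markup is made up for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("type") == "hidden" and "name" in attr:
            self.fields[attr["name"]] = attr.get("value") or ""


def build_payload(html, overrides):
    """Start from the page's hidden fields, then apply the filter values."""
    parser = HiddenInputParser()
    parser.feed(html)
    payload = dict(parser.fields)
    payload.update(overrides)
    return payload


# Made-up sample markup, only loosely modelled on a JSF form.
SAMPLE_HTML = (
    '<form id="j_idt205">'
    '<input type="hidden" name="javax.faces.ViewState" value="abc123:456" />'
    '<input type="text" name="j_idt205:date1_input" value="" />'
    '</form>'
)

payload = build_payload(SAMPLE_HTML, {"j_idt205:powerPlant_input": "2336"})
```

The resulting dict carries the ViewState token alongside the chosen filter values, which is the shape of the payload the answer builds by hand.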
I am scraping this website: https://www.epicery.com/c/promos?gclid=CjwKCAjw97P5BRBQEiwAGflV6bGzNEAz7MTIrgelBkTR277v3lhStP5tH0wgxuLj1ytlcQAAjb-cxBoCsVwQAvD_BwE
I am trying to retrieve some information, such as the description, from a script tag.
I get the script content with XPath, apply a regex, and try to load the result as JSON:
script_path = response.xpath('/html/body/script[1]').get()
j_list = re.findall(r'\[(.*)\}\]',script_path)
j = j_list[0].replace("'","")
json_script = json.loads(j)
But I get the following error, which I cannot resolve:
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 152446 (char 152445)
I'm not sure exactly what you want, but this works for me:
def parse(self, response):
    taxons_str = response.xpath('//script[contains(., "var taxons")]/text()').re_first(r'(?s)var taxons = (.+?)var shops')
    if taxons_str:
        taxons = json.loads(taxons_str)
        for product in taxons:
            process_your_product(product)
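The "Extra data" error means json.loads found a complete JSON value followed by more text. As an alternative sketch, the standard library's JSONDecoder.raw_decode parses one value starting at a given index and ignores whatever follows. The helper name extract_js_json and the sample script are hypothetical; only the variable name taxons comes from the answer above.

```python
import json


def extract_js_json(script_text, var_name):
    """Parse the JSON value assigned to `var <var_name> = ...` in a script."""
    marker = "var %s = " % var_name
    start = script_text.find(marker)
    if start == -1:
        return None
    # raw_decode stops at the end of the first complete JSON value,
    # so trailing JavaScript does not trigger "Extra data".
    value, _end = json.JSONDecoder().raw_decode(script_text, start + len(marker))
    return value


# Made-up script content for illustration.
sample_script = 'var taxons = [{"name": "promo", "id": 1}];\nvar shops = [];'
taxons = extract_js_json(sample_script, "taxons")
```

This assumes the assignment is written exactly as "var taxons = " and that the value is valid JSON; a JavaScript object literal with unquoted keys would still fail.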
I have configured server-side processing for DataTables. On the server side I use Python 3 and MongoDB.
I think my paging logic is sound, as you can see from the code:
PYTHON:
@bp.route('/_ajax_products', methods=['GET', 'POST'])
@login_required
def ajax_products():
    num = int(request.args.get('page_num')) + 1
    total_items = product_db.count()
    items_to_show = 100
    result = {"draw": num, "recordsTotal": total_items, "recordsFiltered": total_items}
    list_prod = product_db.find().sort([("Code", 1)]).skip(items_to_show * (num - 1)).limit(items_to_show)
    final_list = []
    for i in list_prod:
        # iteration on products and addition to the final list
    result['data'] = final_list
    return jsonify(result)
DATATABLE INITIALISATION:
$('#ProductsList').DataTable({
    "dom": 'Brlf<t><"clear">p',
    "pageLength": 100,
    select: true,
    "processing": true,
    "serverSide": true,
    "ajax": {
        url: "/_ajax_products",
        data: function ( d ) {
            var datatable = $('#ProductsList').DataTable();
            var currentPage = datatable.page.info().page;
            d.page_num = currentPage;
        }
    },
    "columns":[...]
    ...
})
The data loads fine in my datatable. Going to the next page also works without any problem.
The problem appears when I go back to a previous page.
The display starts on page 1. When I press, for example, pagination button 3, I can see in my console:
"GET /_ajax_products?draw=3&
But when I try to go back to page 1, the draw parameter goes to 4:
"GET /_ajax_products?draw=4&
... and it continues to increment.
On the server side the correct data is found, but it is not displayed in the datatable.
How can I solve this problem?
I finally found a solution. The error came from a misunderstanding of what the draw option does.
I thought that the value of draw corresponded to the page to be displayed, which is not the case.
Here is the new version of the code in case it can help someone:
@bp.route('/_ajax_products', methods=['GET', 'POST'])
@login_required
def ajax_products():
    num = int(request.args.get('page_num')) + 1
    total_items = product_db.count()
    items_to_show = 100
    result = {"recordsTotal": total_items, "recordsFiltered": total_items}
    list_prod = product_db.find().sort([("Code", 1)]).skip(items_to_show * (num - 1)).limit(items_to_show)
    final_list = []
    for i in list_prod:
        # iteration on products and addition to the final list
    result['data'] = final_list
    return jsonify(result)
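In the DataTables server-side protocol, draw is a request counter that the server is expected to echo back as an integer, while paging is driven by the start and length parameters that DataTables sends with each request. A minimal sketch of a response built that way, assuming an in-memory list stands in for the MongoDB cursor; build_datatables_response is a hypothetical helper, not part of the code above:

```python
def build_datatables_response(params, rows):
    """Build a server-side processing response from request parameters.

    params: mapping of query-string values (strings), as sent by DataTables.
    rows:   full list of records (stand-in for a database query).
    """
    draw = int(params.get("draw", 0))       # echoed back, cast to int
    start = int(params.get("start", 0))     # index of the first record to show
    length = int(params.get("length", 100)) # number of records per page
    total = len(rows)
    return {
        "draw": draw,
        "recordsTotal": total,
        "recordsFiltered": total,
        "data": rows[start:start + length],
    }
```

With a real database, start and length would map onto skip() and limit() instead of list slicing, which also makes the custom page_num parameter unnecessary.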
I am trying to create a bar graph in pygal that uses the Hacker News API and charts the most active stories by number of comments. I posted my code below, but I cannot figure out why my graph keeps saying "No data". Any suggestions? Thanks!
import requests
import pygal
from pygal.style import LightColorizedStyle as LCS, LightenStyle as LS
from operator import itemgetter

# Make an API call, and store the response.
url = 'https://hacker-news.firebaseio.com/v0/topstories.json'
r = requests.get(url)
print("Status code:", r.status_code)

# Process information about each submission.
submission_ids = r.json()
submission_dicts = []
for submission_id in submission_ids[:30]:
    # Make a separate API call for each submission.
    url = ('https://hacker-news.firebaseio.com/v0/item/' +
           str(submission_id) + '.json')
    submission_r = requests.get(url)
    print(submission_r.status_code)
    response_dict = submission_r.json()
    submission_dict = {
        'comments': int(response_dict.get('descendants', 0)),
        'title': response_dict['title'],
        'link': 'http://news.ycombinator.com/item?id=' + str(submission_id),
    }
    submission_dicts.append(submission_dict)

# Visualization
my_style = LS('#336699', base_style=LCS)
my_config = pygal.Config()
my_config.show_legend = False
my_config.title_font_size = 24
my_config.label_font_size = 14
my_config.major_label_font_size = 18
my_config.show_y_guides = False
my_config.width = 1000
chart = pygal.Bar(my_config, style=my_style)
chart.title = 'Most Active News on Hacker News'
chart.add('', submission_dicts)
chart.render_to_file('hn_submissons_repos.svg')
The values in the list passed to the add function need to be either numbers or dicts that contain the key 'value' (or a mixture of the two). The simplest solution would be to change the keys used when creating submission_dict:
submission_dict = {
    'value': int(response_dict.get('descendants', 0)),
    'label': response_dict['title'],
    'xlink': 'http://news.ycombinator.com/item?id=' + str(submission_id),
}
Notice that link has become xlink; this is one of the optional parameters defined in the Value Configuration section of the pygal docs.
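If the dicts have already been built with the old keys, they can also be converted afterwards. A minimal sketch, assuming the key names from the question (comments, title, link); KEY_MAP and to_pygal_values are hypothetical helpers, and the sample entry is made up:

```python
# Mapping from the question's keys to the keys pygal looks for.
KEY_MAP = {"comments": "value", "title": "label", "link": "xlink"}


def to_pygal_values(submission_dicts):
    """Return new dicts with the keys renamed; unknown keys are kept as-is."""
    return [
        {KEY_MAP.get(key, key): val for key, val in d.items()}
        for d in submission_dicts
    ]


# Made-up sample entry for illustration.
sample = [{"comments": 12, "title": "Example story", "link": "http://news.ycombinator.com/item?id=1"}]
converted = to_pygal_values(sample)
```

The converted list can then be passed to chart.add in place of the original one.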
"Update"
I finally resolved the issue by changing the try/except to also catch TypeError and by using pass instead of continue in the except block.
"End of update"
I wrote code to look up the distance between two locations using the Google Distance Matrix API. The origin location is fixed, but the destinations come from an xlsx file. I expected to get a dictionary with the destination as the key and the distance as the value. When I run the code below, after a certain number of iterations I run into this error:
TypeError: Expected a lat/lng dict or tuple, but got NoneType
Can you help me understand the cause of the error? Here is the code (pygmap.py):
import googlemaps
import openpyxl

#get origin and destination locations
def cleanWB(file_path):
    destination = list()
    wb = openpyxl.load_workbook(filename=file_path)
    ws = wb.get_sheet_by_name('Sheet1')
    for i in range(ws.max_row):
        cellValueLocation = ws.cell(row=i+2,column=1).value
        destination.append(cellValueLocation)
    #remove duplicates from destination list
    unique_location = list(set(destination))
    return unique_location

def getDistance(origin, destination):
    #Google distance matrix API key
    gmaps = googlemaps.Client(key = 'INSERT API KEY')
    distance = gmaps.distance_matrix(origin, destination)
    distance_status = distance['rows'][0]['elements'][0]['status']
    if distance_status != 'ZERO_RESULTS':
        jDistance = distance['rows'][0]['elements'][0]
        distance_location = jDistance['distance']['value']
    else:
        distance_location = 0
    return distance_location
And I run it using this code:
import pygmap

unique_location = pygmap.cleanWB('C:/Users/an_id/Documents/location.xlsx')
origin = 'alam sutera'
result = {}
for i in range(len(unique_location)):
    try:
        result[unique_location[i]] = pygmap.getDistance(origin, unique_location[i])
    except (KeyError, TypeError):
        pass
If I print result, it shows that I successfully got 46 results:
result
{'Pondok Pinang': 25905, 'Jatinegara Kaum': 40453, 'Serdang': 1623167,
'Jatiasih': 44737, 'Tanah Sereal': 77874, 'Jatikarya': 48399,
'Duri Kepa': 20716, 'Mampang Prapatan': 31880, 'Pondok Pucung': 12592,
'Johar Baru': 46791, 'Karet': 26889, 'Bukit Duri': 34039,
'Sukamaju': 55333, 'Pasir Gunung Selatan': 42140, 'Pinangsia': 30471,
'Pinang Ranti': 38099, 'Bantar Gebang': 50778, 'Sukabumi Utara': 20441,
'Kembangan Utara': 17708, 'Kwitang': 25860, 'Kuningan Barat': 31231,
'Cilodong': 58879, 'Pademangan Barat': 32585, 'Kebon Kelapa': 23452,
'Mekar Jaya': 53810, 'Kampung Bali': 1188894, 'Pajang': 30008,
'Sukamaju Baru': 53708, 'Benda Baru': 19965, 'Sukabumi Selatan': 19095,
'Gandaria Utara': 28429, 'Setia Mulya': 63534, 'Rawajati': 31724,
'Cireundeu': 28220, 'Cimuning': 55712, 'Lebak Bulus': 27361,
'Kayuringin Jaya': 47560, 'Kedaung Kali Angke': 19171, 'Pagedangan': 16791,
'Karang Anyar': 171165, 'Petukangan Selatan': 18959, 'Rawabadak Selatan': 42765,
'Bojong Sari Baru': 26978, 'Padurenan': 53216, 'Jati Mekar': 2594703,
'Jatirangga': 51119}
I resolved the issue by including TypeError in the try/except and by using pass instead of continue:
import pygmap

unique_location = pygmap.cleanWB('C:/Users/an_id/Documents/location.xlsx')
origin = 'alam sutera'
result = {}
#get getPlace
for i in range(len(unique_location)):
    try:
        result[unique_location[i]] = pygmap.getDistance(origin, unique_location[i])
    except (KeyError, TypeError):
        pass
This solution does skip some locations, though.
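The NoneType in the error message suggests that at least one destination is None before it reaches the API. A likely source is cleanWB: range(ws.max_row) combined with row=i+2 reads one row past the last filled row, and the value of an empty cell is None. A minimal sketch of filtering such values up front instead of catching the exception, assuming the cell values have already been read into a list; clean_locations is a hypothetical helper:

```python
def clean_locations(values):
    """Drop empty cells and duplicates, keeping the original order."""
    seen = set()
    cleaned = []
    for value in values:
        if value is None:
            # Empty spreadsheet cell: nothing to look up.
            continue
        name = str(value).strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned


# Made-up cell values for illustration.
sample_cells = ["Karet", None, " Karet ", "", "Pajang"]
locations = clean_locations(sample_cells)
```

With the None values removed before the loop, the TypeError handler should no longer be needed for this case, and the remaining KeyError handling only covers responses without a distance.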