Cannot import items.py in scrapy spider - python-3.x

I cannot run my spider with the shell command "scrapy crawl kbb" because Python cannot find my items module.
My folder layout follows the standard Scrapy project structure. Here is my spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.loader import ItemLoader

from kbb.items import KelleyItem


class KbbSpider(scrapy.Spider):
    name = 'kbb'
    allowed_domains = ['kbb.com']
    start_urls = ['https://www.kbb.com/cars-for-sale/cars/?distance=75']

    def parse(self, response):
        l = ItemLoader(item=KelleyItem(), response=response)
        l.add_xpath('title', '//div[@class="listings-container-redesign"]/div/div/a/text()')
        l.add_xpath('price', '//div[@class="listings-container-redesign"]/div/div/div/div/span/text()')
        return l.load_item()
items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class KelleyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
When I run this via the shell command "scrapy crawl kbb", I get the following error:
ModuleNotFoundError: No module named kbb

If your project uses the standard Scrapy folder structure, you can use a relative import:
from ..items import KelleyItem
See relative imports in Python.
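For context, a relative import like that assumes the layout that scrapy startproject kbb generates, with the spider module living inside the kbb.spiders subpackage. The spider file name below is only illustrative:

kbb/                      # project root, contains scrapy.cfg
    scrapy.cfg
    kbb/                  # project package
        __init__.py
        items.py          # defines KelleyItem
        spiders/
            __init__.py
            kbb_spider.py # the KbbSpider shown in the question

# in kbb_spider.py -- works because the spider lives in the kbb.spiders package
from ..items import KelleyItem

Running scrapy crawl kbb from the directory that contains scrapy.cfg should also let the absolute form from kbb.items import KelleyItem resolve, since Scrapy puts the project directory on the import path from there.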

Try this one:
from items import KelleyItem

Related

Getting error with imported module in Scrapy in Python

I am trying to implement a spider in Scrapy and I am getting an error when I run it. I have tried several things but couldn't resolve it. The error is as follows:
runspider: error: Unable to load 'articleSpider.py': No module named 'wikiSpider.wikiSpider'
I am still learning Python as well as the Scrapy package, but I think this has to do with importing a module from a different directory, so I have included an image of my project's directory tree from the virtual environment I created in PyCharm.
Also note that I am using Python 3.9 as the interpreter for my virtual environment.
The code I am using for the spider is as follows:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wikiSpider.wikiSpider.items import Article


class ArticleSpider(CrawlSpider):
    name = 'articleItems'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Benevolent'
                  '_dictator_for_life']
    rules = [Rule(LinkExtractor(allow='(/wiki/)((?!:).)*$'),
                  callback='parse_items', follow=True)]

    def parse_items(self, response):
        article = Article()
        article['url'] = response.url
        article['title'] = response.css('h1::text').extract_first()
        article['text'] = response.xpath('//div[@id='
                                         '"mw-content-text"]//text()').extract()
        lastUpdated = response.css('li#footer-info-lastmod::text').extract_first()
        article['lastUpdated'] = lastUpdated.replace('This page was last edited on ', '')
        return article
and this is the code in the items file that the failing import refers to:
import scrapy


class Article(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()
    lastUpdated = scrapy.Field()
from "wikiSpider".wikiSpider.items import Article
change this folder name.
and then edit: from wikiSpider.items import Article
Solved.
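To make the fix concrete, here is a minimal before/after sketch, assuming the usual layout where scrapy.cfg sits next to a single wikiSpider package containing items.py and spiders/ (file names follow the question):

# articleSpider.py
# before -- fails: the wikiSpider package on the import path has no wikiSpider submodule
# from wikiSpider.wikiSpider.items import Article

# after -- import straight from the project package
from wikiSpider.items import Article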

How to get returned list from scrapy spider

So I'm making a script to test my spiders, but I don't know how to capture the returned data in the script that runs the spider.
I have return [self.p_name, self.price, self.currency] at the end of the spider.
In the spider tester I have this script:
#!/usr/bin/python3
# -*- coding: utf-8 -*-

# Import external libraries
import scrapy
from scrapy.crawler import CrawlerProcess
from Ruby.spiders.furla import Furla

# Import internal libraries

# Variables
t = CrawlerProcess()

def test_furla():
    x = t.crawl(Furla, url='https://www.furla.com/pt/pt/eshop/furla-sleek-BAHMW64BW000ZN98.html?dwvar_BAHMW64BW000ZN98_color=N98&cgid=SS20-Main-Collection')
    return x

test_furla()
t.start()
It runs properly; the only problem is that I don't know how to catch that return value on the tester side. The output from the spider is ['FURLA SLEEK', '250.00', '€'].
If you need to access items yielded from the spider, I would probably use signals for the job, specifically the item_scraped signal. Adapting your code, it would look something like this:
from scrapy import signals
# other imports and stuff

t = CrawlerProcess()

def item_scraped(item, response, spider):
    # do something with the item
    ...

def test_furla():
    # we need a Crawler instance to access signals
    crawler = t.create_crawler(Furla)
    crawler.signals.connect(item_scraped, signal=signals.item_scraped)
    x = t.crawl(crawler, url='https://www.furla.com/pt/pt/eshop/furla-sleek-BAHMW64BW000ZN98.html?dwvar_BAHMW64BW000ZN98_color=N98&cgid=SS20-Main-Collection')
    return x

test_furla()
t.start()
Additional info can be found in the CrawlerProcess documentation. If, on the other hand, you need to work with the whole output from the spider (all the items), you would need to accumulate the items using the above mechanism and work with them once the crawl finishes.
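A minimal sketch of that accumulation idea, keeping the Furla spider and URL from the question (the collected list and the final print are illustrative, not part of the original answer):

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from Ruby.spiders.furla import Furla  # spider from the question

collected = []  # every item the spider yields ends up here

def item_scraped(item, response, spider):
    collected.append(item)

process = CrawlerProcess()
crawler = process.create_crawler(Furla)
crawler.signals.connect(item_scraped, signal=signals.item_scraped)
process.crawl(crawler, url='https://www.furla.com/pt/pt/eshop/furla-sleek-BAHMW64BW000ZN98.html?dwvar_BAHMW64BW000ZN98_color=N98&cgid=SS20-Main-Collection')
process.start()  # blocks until the crawl finishes

print(collected)  # the full list of scraped items is now available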

Cannot run 2 spiders successfully one after another in Scrapy using a script

I am trying to run a script with these requirements:
After running the demo10.py script, the AmazonfeedSpider will crawl the product information using the generated URLs saved in Purl and save the output into the dataset2.json file.
After successfully crawling and saving data into the dataset2.json file, the ProductfeedSpider will run and grab the 5 URLs returned by the Final_Product() method of the CompareString class.
Finally, after grabbing the final product_url list from the Comparestring4 class, the ProductfeedSpider will scrape data from the returned URL list and save the result into the Fproduct.json file.
Here is the demo10.py file:
import scrapy
from scrapy.crawler import CrawlerProcess
from AmazonScrap.spiders.Amazonfeed2 import AmazonfeedSpider
from scrapy.utils.project import get_project_settings
from AmazonScrap.spiders.Productfeed import ProductfeedSpider
import time

# from multiprocessing import Process
# def CrawlAmazon():

def main():
    process1 = CrawlerProcess(settings=get_project_settings())
    process1.crawl(AmazonfeedSpider)
    process1.start()
    process1.join()

    # time.sleep(20)

    process2 = CrawlerProcess(settings=get_project_settings())
    process2.crawl(ProductfeedSpider)
    process2.start()
    process2.join()

if __name__ == "__main__":
    main()
After running the file, it raises an exception saying that the dataset.json file doesn't exist. Do I need to use multiprocessing in order to create a delay between the spiders? If so, how can I implement it?
I am looking forward to hearing from the experts.
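For reference, the Scrapy documentation describes running spiders sequentially in a single process by chaining crawls with CrawlerRunner and Twisted deferreds. A sketch adapted to the spider names above (an illustration of that documented pattern, not a verified fix for this project) might look like this:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from AmazonScrap.spiders.Amazonfeed2 import AmazonfeedSpider
from AmazonScrap.spiders.Productfeed import ProductfeedSpider

configure_logging()
runner = CrawlerRunner(settings=get_project_settings())

@defer.inlineCallbacks
def crawl():
    # the second crawl only starts after the first has finished,
    # so dataset2.json is complete before ProductfeedSpider reads it
    yield runner.crawl(AmazonfeedSpider)
    yield runner.crawl(ProductfeedSpider)
    reactor.stop()

crawl()
reactor.run()  # the script blocks here until both crawls are done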

Scrapy Rules: Exclude certain urls with process links

I am very happy to have discovered the Scrapy CrawlSpider class with its Rule objects. However, when I try to exclude URLs which contain the word "login" using process_links, it doesn't work. The solution I implemented comes from here: Example code for Scrapy process_links and process_request, but it doesn't exclude the pages I want.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from accenture.items import AccentureItem


class AccentureSpiderSpider(CrawlSpider):
    name = 'accenture_spider'
    start_urls = ['https://www.accenture.com/us-en/internet-of-things-index']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[contains(@href, "insight")]'),
             callback='parse_item', process_links='process_links', follow=True),
    )

    def process_links(self, links):
        for link in links:
            if 'login' in link.text:
                continue  # skip all links that have "login" in their text
            yield link

    def parse_item(self, response):
        loader = ItemLoader(item=AccentureItem(), response=response)
        url = response.url
        loader.add_value('url', url)
        yield loader.load_item()
My mistake was to use link.text. When using link.url instead, it works fine :)
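Spelled out, the corrected callback from that fix (only the attribute changes; everything else stays as in the question):

    def process_links(self, links):
        for link in links:
            if 'login' in link.url:
                continue  # skip links whose URL contains "login"
            yield link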

Nmap 7.5: how to add scripts to a path other than /usr/share/nmap/scripts

I once installed Nmap 7.1 on my Ubuntu machine and added some NSE scripts to its path /usr/share/nmap/scripts. But when I removed Nmap 7.1 and installed Nmap 7.5 from the official source, compiling it myself, I found that the few NSE scripts I had added could no longer be used by the nmap command. My Python programs need these NSE scripts, so my question is: where should these NSE scripts be added, or which command can add these NSE scripts to the Nmap path so that they work properly? Thanks!
I have resolved my problem. This is my directory structure:
scan_s/
    nse/modicon-info.nse
    namp.py
This is my Python script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from libnmap.process import NmapProcess
from time import sleep

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import sys

# run nmap with the NSE script referenced by a path relative to this script
nmap_proc = NmapProcess(targets="45.76.197.202", options="--script nse/s7-info.nse -p 102")
nmap_proc.run_background()
while nmap_proc.is_running():
    sleep(2)

# save the XML report to disk
xml = nmap_proc.stdout
print(xml)
xmlfiled = open('testxml.xml', 'w')
xmlfiled.write(xml)
xmlfiled.close()

# pull the script output out of the XML report
try:
    tree = ET.parse('testxml.xml')
    root = tree.getroot()
    test = root.find('host').find('ports').find('port').find('script')
    test_dic = dict(test.attrib)
    s = test_dic['output']
except:
    s = 'unknown'
print(s)
hope it can help you :)
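On the original question of keeping scripts outside /usr/share/nmap/scripts: nmap's --script option also accepts a full path to an .nse file, so the NmapProcess call above could point at the custom script with an absolute path instead of a relative one. The path below is only an example:

nmap_proc = NmapProcess(
    targets="45.76.197.202",
    options="--script /home/user/scan_s/nse/s7-info.nse -p 102",  # absolute path to the custom NSE script
)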
