scrapy - [twisted] NameError: name 'connect' is not defined - python-3.x

I am trying to use Scrapy on Ubuntu 16 with Python 3.6. When I run "scrapy crawl mySpiderName", it fails with the following error:
2018-06-17 11:18:50 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: ToutiaoAppSpider)
2018-06-17 11:18:50 [scrapy.utils.log] INFO: Versions: lxml 3.5.0.0, libxml2 2.9.3, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.5.2 (default, Nov 23 2017, 16:37:01) - [GCC 5.4.0 20160609], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Linux-4.13.0-43-generic-x86_64-with-Ubuntu-16.04-xenial
2018-06-17 11:18:50 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_MODULES': ['ToutiaoAppSpider.spiders'], 'CONCURRENT_REQUESTS_PER_IP': 10, 'DOWNLOAD_TIMEOUT': 15, 'CONCURRENT_REQUESTS': 10, 'NEWSPIDER_MODULE': 'ToutiaoAppSpider.spiders', 'BOT_NAME': 'ToutiaoAppSpider', 'CONCURRENT_REQUESTS_PER_DOMAIN': 10}
2018-06-17 11:18:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.memusage.MemoryUsage']
2018-06-17 11:18:50 [ToutiaoHao] INFO: Reading start URLs from redis key 'ToutiaoHao:start_urls' (batch size: 10, encoding: utf-8)
2018-06-17 11:18:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-17 11:18:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
Unhandled error in Deferred:
2018-06-17 11:18:50 [twisted] CRITICAL: Unhandled error in Deferred:
2018-06-17 11:18:50 [twisted] CRITICAL:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 80, in crawl
self.engine = self._create_engine()
File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 105, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 70, in __init__
self.scraper = Scraper(crawler)
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/usr/local/lib/python3.5/dist-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/usr/local/lib/python3.5/dist-packages/scrapy/middleware.py", line 36, in from_settings
mw = mwcls.from_crawler(crawler)
File "/home/tuijian/crawler/ToutiaoAppSpider/ToutiaoAppSpider/pipelines.py", line 34, in from_crawler
redis_host=crawler.settings['REDIS_HOST'])
File "/home/tuijian/crawler/ToutiaoAppSpider/ToutiaoAppSpider/pipelines.py", line 17, in __init__
self.db = connect[mongodb_db]
NameError: name 'connect' is not defined
I searched for this error on Stack Overflow, but most of the questions I found don't match my problem.
My settings.py is shown below:
# -*- coding: utf-8 -*-
# Scrapy settings for ToutiaoAppSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'ToutiaoAppSpider'
SPIDER_MODULES = ['ToutiaoAppSpider.spiders']
NEWSPIDER_MODULE = 'ToutiaoAppSpider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ToutiaoAppSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 10
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 10
CONCURRENT_REQUESTS_PER_IP = 10
DOWNLOAD_TIMEOUT = 15
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'ToutiaoAppSpider.middlewares.ToutiaoappspiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
#'ToutiaoAppSpider.middlewares.ProxyMiddleware': 543,
#'ToutiaoAppSpider.middlewares.Push2RedisMiddleware': 544,
#'ToutiaoAppSpider.middlewares.SkipYg365Middleware': 545,
}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'ToutiaoAppSpider.pipelines.ToutiaoappspiderPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HEADERS = {
'Host': 'is.snssdk.com',
'Connection': 'keep-alive',
'Accept-Encoding': 'gzip',
'request_timestamp_client': '332763604',
'X-SS-REQ-TICKET': '',
'User-Agent': 'Dalvik/2.1.0 (Linux; U; Android 7.1.2; Redmi 4X MIUI/8.2.1) NewsArticle/6.5.8 cronet/58.0.2991.0'
}
DETAIL_HEADERS = {
'Accept-Encoding': 'gzip',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'User-Agent': 'Mozilla/5.0 (Linux; Android 7.1.2; Redmi 4X Build/N2G47H; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/62.0.3202.84 Mobile Safari/537.36 JsSdk/2 NewsArticle/6.5.8 NetType/wifi',
'Connection': 'Keep-Alive'
}
COMPLEX_DETAIL_HEADERS = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
'cache-control': 'max-age=0',
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'
}
TOUTIAOHAO_HEADERS = {
'accept': 'application/json, text/javascript',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
'content-type': 'application/x-www-form-urlencoded',
'dnt': '1',
'referer': 'https://www.toutiao.com/c/user/{user_id}/',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
'x-requested-with': 'XMLHttpRequest'}
FEED_URL_old = 'https://is.snssdk.com/api/news/feed/v77/?list_count={list_count}&category={category}&refer=1&count=20' \
'{max_behot_time}&last_refresh_sub_entrance_interval=1519286082&loc_mode=7&tt_from=pre_load_more' \
'&plugin_enable=3&_d_s=1&iid=26133660951&device_id=37559129771&ac=wifi&channel=tianzhuo_toutiao_sg' \
'&aid=13&app_name=news_article&version_code=658&version_name=6.5.8&device_platform=android' \
'&ab_version=281719%2C278039%2C249665%2C249684%2C249686%2C249642%2C249670%2C249673%2C281732%2C229304%2C249671' \
'%2C282686%2C282218%2C275584%2C277466%2C281426%2C280418%2C232362%2C265707%2C279809%2C239097%2C170988%2C281158' \
'%2C269426%2C273499%2C279386%2C281391%2C281612%2C276203%2C281098%2C257281%2C281472%2C280149%2C277718%2C278670' \
'%2C271717%2C259492%2C280773%2C282147%2C272683%2C251795%2C276758%2C282776%2C251713%2C280097%2C282669%2C31210' \
'%2C279129%2C270335%2C280969%2C227649%2C280220%2C264034%2C258356%2C247850%2C280448%2C282714%2C281293%2C278160' \
'%2C249045%2C244746%2C264615%2C260657%2C241181%2C282157%2C271178%2C252767%2C249828%2C246859' \
'&ab_client=a1%2Cc4%2Ce1%2Cf2%2Cg2%2Cf7&ab_group=100167&ab_feature=94563%2C102749&abflag=3&ssmix=a' \
'&device_type=Redmi+4X&device_brand=Xiaomi&language=zh&os_api=25&os_version=7.1.2&uuid=864698037116551' \
'&openudid=349f495868d2f06d&manifest_version_code=658&resolution=720*1280&dpi=320&update_version_code=65809' \
'&_rticket={_rticket}&plugin=10575&fp=TlTqLYFuLlXrFlwSPrU1FYmeFSwt&rom_version=miui_v9_8.2.1' \
'&ts={ts}&as={_as}&mas={mas}&cp={cp}'
FEED_URL = 'http://ib.snssdk.com/api/news/feed/v78/?fp=RrTqLWwuL2GSFlHSFrU1FYFeL2Fu&version_code=6.6.0' \
'&app_name=news_article&vid=49184A00-1B68-4FE9-9FD5-B24BED70032C&device_id=48155472608&channel=App%20Store' \
'&resolution=1242*2208&aid=13&ab_version=264071,275782,275652,271178,252769,249828,246859,285542,278039,' \
'249664,249685,249687,249675,249668,249669,249673,281730,229305,249672,285965,285209,283849,277467,283757,' \
'286568,285618,284439,280773,286862,286446,285535,251712,283794,285944,31210,285223,283836,286558,258356,' \
'247849,280449,284752,281296,249045,275612,278760,283491,264613,260652,286224,286477,261338,241181,283777,' \
'285703,285404,283370,286426,239096,286955,170988,273497,285788,279386,281389,276203,286300,286056,286878,' \
'257282,281472,277769,280147,284908&ab_feature=201616,z1&ab_group=z1,201616' \
'&openudid=a8b364d577ac6c59e96dbcf3cc57d9c4eaa9420c&idfv=49184A00-1B68-4FE9-9FD5-B24BED70032C&ac=WIFI' \
'&os_version=11.2.5&ssmix=a&device_platform=iphone&iid=25922108613&ab_client=a1,f2,f7,e1' \
'&device_type=iPhone%208%20Plus&idfa=2F0B6A5F-B939-4EB5-9D51-D81698D8729D&refresh_reason=4&category={category}' \
'&last_refresh_sub_entrance_interval=1519794464&detail=1&tt_from=unknown&count=20&list_count={list_count}' \
'&LBS_status=authroize&loc_mode=1&cp={cp}&{max_behot_time}&session_refresh_idx=1&image=1&strict=0&refer=1' \
'&city=%E5%8C%97%E4%BA%AC&concern_id=6215497896830175745&language=zh-Hans-CN&st_time=67&as={_as}&ts={ts}'
DETAIL_URL = 'http://is.snssdk.com/2/article/information/v23/?fp=RrTqLWwuL2GSFlHSFrU1FYFeL2Fu&version_code=6.6.0' \
'&app_name=news_article&vid=49184A00-1B68-4FE9-9FD5-B24BED70032C&device_id=48155472608' \
'&channel=App%20Store&resolution=1242*2208&aid=13&ab_version=264071,275782,275652,271178,252769,249828,' \
'246859,283790,278039,249664,249685,249687,249675,249668,249669,249673,281730,229305,249672,283849,' \
'277467,283757,284439,280773,283091,283787,282775,251712,283794,280097,284801,31210,283489,283836,280966,' \
'280220,258356,247849,280449,284752,281296,281724,249045,281414,275612,278760,283491,264613,282974,260652,' \
'261338,241181,283777,284201,282897,283370,239096,170988,269426,273497,279386,281389,281615,276203,279014,' \
'257282,281472,277769,280147&ab_feature=201616,z1&ab_group=z1,201616' \
'&openudid=a8b364d577ac6c59e96dbcf3cc57d9c4eaa9420c&idfv=49184A00-1B68-4FE9-9FD5-B24BED70032C' \
'&ac=WIFI&os_version=11.2.5&ssmix=a&device_platform=iphone&iid=25922108613&ab_client=a1,f2,f7,e1' \
'&device_type=iPhone%208%20Plus&idfa=2F0B6A5F-B939-4EB5-9D51-D81698D8729D&article_page=0' \
'&group_id={group_id}&device_id=48155472608&aggr_type=1&item_id={item_id}&from_category=__all__' \
'&as={_as}&ts={ts}'
COMMENTS_URL = 'https://is.snssdk.com/article/v2/tab_comments/?group_id={group_id}&item_id={item_id}&aggr_type=1' \
'&count=20&offset={offset}&tab_index=0&fold=1&iid=26133660951&device_id=37559129771&ac=wifi' \
'&channel=tianzhuo_toutiao_sg&aid=13&app_name=news_article&version_code=658&version_name=6.5.8' \
'&device_platform=android&ab_version=281719%2C278039%2C249665%2C249684%2C249686%2C283244%2C249642' \
'%2C249670%2C249673%2C281732%2C229304%2C249671%2C282686%2C282218%2C275584%2C277466%2C281426' \
'%2C280418%2C282898%2C232362%2C265707%2C279809%2C239097%2C170988%2C281158%2C269426%2C273499%2C279386' \
'%2C281391%2C281612%2C276203%2C281098%2C257281%2C281472%2C280149%2C277718%2C283104%2C271717%2C259492' \
'%2C283184%2C280773%2C282147%2C272683%2C251795%2C283177%2C282776%2C251713%2C280097%2C282669%2C31210' \
'%2C283097%2C283138%2C270335%2C280969%2C227649%2C280220%2C264034%2C258356%2C247850%2C280448%2C283165' \
'%2C281293%2C278160%2C249045%2C244746%2C264615%2C282973%2C260657%2C241181%2C282157%2C271178%2C252767' \
'%2C249828%2C246859&ab_client=a1%2Cc4%2Ce1%2Cf2%2Cg2%2Cf7&ab_group=100167&ab_feature=94563%2C102749' \
'&abflag=3&ssmix=a&device_type=Redmi+4X&device_brand=Xiaomi&language=zh&os_api=25&os_version=7.1.2' \
'&uuid=864698037116551&openudid=349f495868d2f06d&manifest_version_code=658&resolution=720*1280' \
'&dpi=320&update_version_code=65809&_rticket={_rticket}&plugin=10575&fp=TlTqLYFuLlXrFlwSPrU1FYmeFSwt' \
'&rom_version=miui_v9_8.2.1&ts={ts}&as={_as}&mas={mas}'
ACCOUNT_HEADERS = {
'accept': 'application/json, text/javascript',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
'content-type': 'application/x-www-form-urlencoded',
# 'cookie:WEATHER_CITY=%E5%8C%97%E4%BA%AC; UM_distinctid=160b9d56b07aa1-0322b0f040feaa-32607e02-13c680-160b9d56b08f0f; uuid="w:38808a712dd144679f3b524be9378a9e"; __utmc=24953151; __utmz=24953151.1515146646.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utma=24953151.940475929.1515146646.1515146646.1515149010.2; _ga=GA1.2.940475929.1515146646; utm_source=toutiao; tt_webid=74484831828; CNZZDATA1259612802=2101983015-1514944629-https%253A%252F%252Fwww.google.com.ph%252F%7C1519894157; __tasessionId=ni5yklwbb1519898143309; tt_webid=6527912825075254788
'dnt': '1',
# 'referer': 'https://www.toutiao.com/search/?keyword={}',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
'x-requested-with': 'XMLHttpRequest'
}
MONGODB_URI = 'mongodb://wifi:zenmen#10.19.83.217'
MONGODB_DB = 'toutiao_app'
MONGODB_COLLECTION_NEWS = 'news'
MONGODB_COLLECTION_DETAIL = 'detail'
MONGODB_COLLECTION_COMMENTS = 'comments'
REDIS_HOST = '54.169.202.250'
Can anyone help me fix this problem?
Thank you!
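For reference, the traceback shows that pipelines.py line 17 runs self.db = connect[mongodb_db], but connect is never imported or assigned, which is exactly what the NameError reports. Below is a minimal sketch of a pipeline constructor that avoids this, assuming pymongo is the MongoDB driver in use; the actual pipelines.py is not shown in the question, so the argument names and structure here are hypothetical:

# Hypothetical sketch of ToutiaoAppSpider/pipelines.py, assuming pymongo is installed.
import pymongo

class ToutiaoappspiderPipeline(object):
    def __init__(self, mongodb_uri, mongodb_db, redis_host):
        # Build the MongoDB client explicitly instead of referring to an undefined `connect`.
        self.client = pymongo.MongoClient(mongodb_uri)
        self.db = self.client[mongodb_db]
        self.redis_host = redis_host

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongodb_uri=crawler.settings['MONGODB_URI'],
            mongodb_db=crawler.settings['MONGODB_DB'],
            redis_host=crawler.settings['REDIS_HOST'],
        )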

Related

Find marked elements, using python

I need to get an element from this website: https://channelstore.roku.com/en-gb/details/38e7b84fe064cf927ad471ed632cc3d8/vlrpdd2
Task image: (screenshot omitted)
I tried this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://channelstore.roku.com/en-gb/details/38e7b84fe064cf927ad471ed632cc3d8/vlrpdd2')
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
I got a document, but I only see metadata, without the result I expected.
If you inspect your browser's network calls (press F12), you'll see that the data is loaded dynamically from:
https://channelstore.roku.com/api/v6/channels/detailsunion/38e7b84fe064cf927ad471ed632cc3d8
So you can send a GET request directly to that URL to get the same data.
Note that there's no need for BeautifulSoup:
import requests
headers = {
'authority': 'channelstore.roku.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'max-age=0',
'cookie': '_csrf=ZaFYG2W7HQA4xqKW3SUfuta0; ks.locale=j%3A%7B%22language%22%3A%22en%22%2C%22country%22%3A%22GB%22%7D; _usn=c2062c71-f89e-456f-9374-7c7767afc665; _uc=54c11aa8-597e-4155-bfa1-379add00fc85%3Aa044adeb2f02798a3c8335d874d49562; _ga=GA1.3.760826055.1671811832; _gid=GA1.3.2471563.1671811832; _ga=GA1.1.760826055.1671811832; _cs_c=0; roku_test; AWSELB=0DC9CDB91658555B919B869A2ED9157DFA13B446022D0100EBAAD7261A39D5A536AC0223E5570FAECF0099832FA9F5DB8028018FCCD9D0A49D8F2BDA087916BC1E51F73D1E; AWSELBCORS=0DC9CDB91658555B919B869A2ED9157DFA13B446022D0100EBAAD7261A39D5A536AC0223E5570FAECF0099832FA9F5DB8028018FCCD9D0A49D8F2BDA087916BC1E51F73D1E; _gat_UA-678051-1=1; _ga_ZZXW5ZLMQ5=GS1.1.1671811832.1.1.1671812598.0.0.0; _cs_id=8a1f1aec-e083-a585-e054-158b6714ab4a.1671811832.1.1671812598.1671811832.1.1705975832137; _cs_s=10.5.0.1671814398585',
'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"macOS"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
}
params = {
'country': 'GB',
'language': 'en',
}
response = requests.get(
'https://channelstore.roku.com/api/v6/channels/detailsunion/38e7b84fe064cf927ad471ed632cc3d8',
params=params,
headers=headers,
).json()
# Uncomment to print all the data
# from pprint import pprint
# pprint(response)
print(response.get("feedChannel").get("name"))
print("rating: ", response.get("feedChannel").get("starRatingCount"))
Prints:
The Silver Collection Comedies
rating: 3
You will need to use Selenium for Python, as you need to load the JavaScript.
BeautifulSoup can only really handle static websites.
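A minimal sketch of that approach, assuming Selenium 4 with a matching chromedriver on PATH; the h1 locator is only a guess at the page structure, not something taken from the site:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://channelstore.roku.com/en-gb/details/38e7b84fe064cf927ad471ed632cc3d8/vlrpdd2')
# Wait until the JavaScript has rendered something, then read from the live DOM.
title = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'h1'))
).text  # hypothetical locator
print(title)
driver.quit()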

Curl gives response but python does not and the request call does not terminate?

I am trying the following curl request:
curl 'https://www.nseindia.com/api/historical/cm/equity?symbol=COALINDIA&series=\[%22EQ%22\]&from=03-05-2020&to=03-05-2021&csv=true' \
-H 'authority: www.nseindia.com' \
-H 'accept: */*' \
-H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/88.0.4324.182 Safari/537.36' \
-H 'x-requested-with: XMLHttpRequest' \
-H 'sec-gpc: 1' \
-H 'sec-fetch-site: same-origin' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-dest: empty' \
-H 'referer: https://www.nseindia.com/get-quotes/equity?symbol=COALINDIA' \
-H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'cookie: ak_bmsc=2D5CCD6F330B77016DD02ADFD8BADB8A58DDD69E733C0000451A9060B2DF0E5C~pllIy1yQvFABwPqSfaqwV4quP8uVOfZBlZe9dhyP7+7vCW/YfXy32hQoUm4wxCSxUjj8K67PiZM+8wE7cp0WV5i3oFyw7HRmcg22nLtNY4Wb4xn0qLv0kcirhiGKsq4IO94j8oYTZIzN227I73UKWQBrCSiGOka/toHASjz/R10sX3nxqvmMSBlWvuuHkgKOzrkdvHP1YoLPMw3Cn6OyE/Z2G3oc+mg+DXe8eX1j8b9Hc=; nseQuoteSymbols=[{"symbol":"COALINDIA","identifier":null,"type":"equity"}]; nsit=X5ZCfROTTuLVwZzLBn7OOtf0; AKA_A2=A; bm_mi=6CE0B82205ACE5A1F72250ACDDFF563E~LZ4/HQ257rSMBPCrxy0uSDvrSxj4hHpLQqc8R5JZOzUZYo1OqZg5Q/GOt88XNtMbsWM8bB22vtCXzvksGwPcC/bH2nPFEZr0ci6spQ4GOpCa/TM7soc02HVf0tyDTkmg/ZdLZlWzond4r0vn+QpSB7f3fiVza1Gdx9OaFL1i3rvqe1OKmFONreHEue20PL0hlREVWeLcFM/5DxKArPwzCSopPp62Eea1510iivl7GmY=; nseappid=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJhcGkubnNlIiwiYXVkIjoiYXBpLm5zZSIsImlhdCI6MTYyMDA2MTQ5OSwiZXhwIjoxNjIwMDY1MDk5fQ.YBTQ0MqRayD3QBM3V6zUt5zbRRICkbIhWWNedkDYrdU; bm_sv=C49B743B48F174C77F3DDAD188AA6D87~bm5TD36snlaRLx9M5CS+FOUicUcbVV3OIKjZU2WLwd1PtHYUum7hnBfYeUCDv+5Xdb9ADklnmm1cwZGJJbiBstcA6c5vju53C7aTFBorl8SJZjBN/4ku61oz0ncrQYCaSxkFGkRRY9VMWm6SpQwHXfMsUzc/Qk7301zs7KZuGCY=' \
--compressed
This gives the required response (example below):
"Date ","series ","OPEN ","HIGH ","LOW ","PREV. CLOSE ","ltp ","close ","vwap ","52W H","52W L ","VOLUME ","VALUE ","No of trades "
"03-May-2021","EQ","133.00","133.45","131.20","133.05","132.20","132.20","132.21","163.00","109.55",10262391,"1,356,811,541.80",59409
But if I use the following Python script to get the data:
import requests
headers = {
'authority': 'www.nseindia.com',
'accept': '*/*',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36',
'x-requested-with': 'XMLHttpRequest',
'sec-gpc': '1',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.nseindia.com/get-quotes/equity?symbol=COALINDIA',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8','cookie':'ak_bmsc=2D5CCD6F330B77016DD02ADFD8BADB8A58DDD69E733C0000451A9060B2DF0E5C~pllIy1yQvFABwPqSfaqwV4quP8uVOfZBlZe9dhyP7+7vCW/YfXy32hQoUm4wxCSxUjj8K67PiZM+8wE7cp0WV5i3oFyw7HRmcg22nLtNY4Wb4xn0qLv0kcirhiGKsq4IO94j8oYTZIzN227I73UKWQBrCSiGOka/toHASjz/R10sX3nxqvmMSBlWvuuHkgKOzrkdvHP1YoLPMw3Cn6OyE/Z2G3oc+mg+DXe8eX1j8b9Hc=; nseQuoteSymbols=[{"symbol":"COALINDIA","identifier":null,"type":"equity"}]; nsit=X5ZCfROTTuLVwZzLBn7OOtf0; AKA_A2=A; bm_mi=6CE0B82205ACE5A1F72250ACDDFF563E~LZ4/HQ257rSMBPCrxy0uSDvrSxj4hHpLQqc8R5JZOzUZYo1OqZg5Q/GOt88XNtMbsWM8bB22vtCXzvksGwPcC/bH2nPFEZr0ci6spQ4GOpCa/TM7soc02HVf0tyDTkmg/ZdLZlWzond4r0vn+QpSB7f3fiVza1Gdx9OaFL1i3rvqe1OKmFONreHEue20PL0hlREVWeLcFM/5DxKArPwzCSopPp62Eea1510iivl7GmY=; nseappid=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJhcGkubnNlIiwiYXVkIjoiYXBpLm5zZSIsImlhdCI6MTYyMDA2MTQ5OSwiZXhwIjoxNjIwMDY1MDk5fQ.YBTQ0MqRayD3QBM3V6zUt5zbRRICkbIhWWNedkDYrdU; bm_sv=C49B743B48F174C77F3DDAD188AA6D87~bm5TD36snlaRLx9M5CS+FOUicUcbVV3OIKjZU2WLwd1PtHYUum7hnBfYeUCDv+5Xdb9ADklnmm1cwZGJJbiBstcA6c5vju53C7aTFBorl8SJZjBN/4ku61oz0ncrQYCaSxkFGkRRY9VMWm6SpQwHXfMsUzc/Qk7301zs7KZuGCY=',}
params = (
('symbol', 'COALINDIA'),
('series', '/["EQ"/]'),
('from', '30-04-2021'),
('to', '03-05-2021'),
('csv', 'true'),
)
response = requests.get('https://www.nseindia.com/api/historical/cm/equity', headers=headers, params=params)
It gets stuck on the last line.
I am using Python 3.9 and urllib3.
I am not sure what the problem is.
This URL downloads a CSV file from the website.
You have to jump through some hoops with Python to get the file you're after. Mainly, you need to get the cookie part of the request headers right, otherwise you'll keep getting a 401 response.
First, you need to get the regular cookies from the authority www.nseindia.com. Then, you need to get the bm_sv cookie from https://www.nseindia.com/json/quotes/equity-historical.json. Finally, add something called nseQuoteSymbols.
Glue all that together and make the request to get the file.
Here's how:
from urllib.parse import urlencode
import requests
headers = {
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/88.0.4324.182 Safari/537.36',
'x-requested-with': 'XMLHttpRequest',
'referer': 'https://www.nseindia.com/get-quotes/equity?symbol=COALINDIA',
}
payload = {
"symbol": "COALINDIA",
"series": '["EQ"]',
"from": "04-04-2021",
"to": "04-05-2021",
"csv": "true",
}
api_endpoint = "https://www.nseindia.com/api/historical/cm/equity?"
nseQuoteSymbols = 'nseQuoteSymbols=[{"symbol":"COALINDIA","identifier":null,"type":"equity"}]; '
def make_cookies(cookie_dict: dict) -> str:
    return "; ".join(f"{k}={v}" for k, v in cookie_dict.items())

with requests.Session() as connection:
    authority = connection.get("https://www.nseindia.com", headers=headers)
    historical_json = connection.get("https://www.nseindia.com/json/quotes/equity-historical.json", headers=headers)
    bm_sv_string = make_cookies(historical_json.cookies.get_dict())
    cookies = make_cookies(authority.cookies.get_dict()) + "; " + nseQuoteSymbols + bm_sv_string
    connection.headers.update({**headers, **{"cookie": cookies}})
    the_real_slim_shady = connection.get(f"{api_endpoint}{urlencode(payload)}")
    csv_file = the_real_slim_shady.headers["Content-disposition"].split("=")[-1]
    with open(csv_file, "wb") as f:
        f.write(the_real_slim_shady.content)
Output -> a .csv file with the requested data.

How can get the json data automatically instead of copy and paste manually?

I want to get the JSON data from the target URL:
https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300
To get it manually, I open it in a browser and copy and paste. I want a smarter way, programmatic and automatic, but the several approaches I have tried all failed.
Method 1--traditional way with wget or curl:
wget https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300
--2021-02-09 11:55:44-- https://xueqiu.com/stock/cata/stocktypelist.json?page=1
Resolving xueqiu.com (xueqiu.com)... 39.96.249.191
Connecting to xueqiu.com (xueqiu.com)|39.96.249.191|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-02-09 11:55:44 ERROR 403: Forbidden.
Method 2--scrapy with selenium:
>>> from selenium import webdriver
>>> browser = webdriver.Chrome()
>>> url="https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300"
>>> browser.get(url)
This is what I get in the browser:
{"error_description":"遇到错误,请刷新页面或者重新登录帐号后再试","error_uri":"/stock/cata/stocktypelist.json","error_code":"400016"}
(The error_description roughly translates to: "An error occurred; please refresh the page or log in to your account again and retry.")
Method 3--build a mitmproxy:
mitmweb --listen-host 127.0.0.1 -p 8080
Set proxy in browser and open the target url in browser
Error info in terminal:
Web server listening at http://127.0.0.1:8081/
Opening in existing browser session.
Proxy server listening at http://127.0.0.1:8080
127.0.0.1:41268: clientconnect
127.0.0.1:41270: clientconnect
127.0.0.1:41268: HTTP/2 connection terminated by client: error code: 0, last stream id: 0, additional data: None
Error info in browser:
error_description "遇到错误,请刷新页面或者重新登录帐号后再试"
error_uri "/stock/cata/stocktypelist.json"
error_code "400016"
The site protects its data aggressively; is there really no way to get it automatically?
You could use the requests module:
import json
import requests
cookies = {
'xq_a_token': '176b14b3953a7c8a2ae4e4fae4c848decc03a883',
'xqat': '176b14b3953a7c8a2ae4e4fae4c848decc03a883',
'xq_r_token': '2c9b0faa98159f39fa3f96606a9498edb9ddac60',
'xq_id_token': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTYxMzQ0MzE3MSwiY3RtIjoxNjEyODQ5MDY2ODI3LCJjaWQiOiJkOWQwbjRBWnVwIn0.VuyNicSjIvVkp9FrCzIlRyx8487XM4HH1C3X9KsFA2FipFiilSifBhux9pMNRyziHHiEifhX-xOgccc8IG1mn8cOylOVy3b-L1YG2T5Hs8MKgx7qm4gnV5Mzm_5_G5BiNtO44aczUcmp0g53dp7-0_Bvw3RlwXzT1DTvCKTV-s_zfBsOPyFTfiqyDUxU-oBRvkz1GpgVJzJL4EmZ8zDE2PBqeW00ueLLC7qPW50WeDCsEFS4ZPAvd2SbX9JPk-lU2WzlcMck2S9iFYmpDwuTeQuPbSeSl6jt5suwTImSgJDIUP9o2TX_Z7nNRDTYxvbP8XlejSt8X0pRDPDd_zpbMQ',
'u': '661612849116563',
'device_id': '24700f9f1986800ab4fcc880530dd0ed',
'Hm_lvt_1db88642e346389874251b5a1eded6e3': '1612849123',
's': 'c111f3y1kn',
'Hm_lpvt_1db88642e346389874251b5a1eded6e3': '1612849252',
}
headers = {
'Connection': 'keep-alive',
'Cache-Control': 'no-cache',
'sec-ch-ua': '"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
'Accept': 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'no-cors',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'image',
'Accept-Language': 'en-US,en;q=0.9',
'Pragma': 'no-cache',
'Referer': '',
}
params = (
('page', '1'),
('size', '300'),
)
response = requests.get('https://xueqiu.com/stock/cata/stocktypelist.json', headers=headers, params=params, cookies=cookies)
print(response.status_code)
json_data = response.json()
print(json_data)
You could use scrapy:
import json
import scrapy
class StockSpider(scrapy.Spider):
    name = 'stock_spider'
    start_urls = ['https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300']
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:85.0) Gecko/20100101 Firefox/85.0',
            'Host': 'xueqiu.com',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US',
            'Accept-Encoding': 'gzip,deflate,br',
            'Connection': 'keep-alive',
            'Cache-Control': 'no-cache',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'Pragma': 'no-cache',
            'Referer': '',
        },
        'ROBOTSTXT_OBEY': False
    }
    handle_httpstatus_list = [400]

    def parse(self, response):
        json_result = json.loads(response.body)
        yield json_result
Run spider: scrapy crawl stock_spider
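If you also want the yielded item written to a file, Scrapy's built-in feed export can do that from the command line (the output filename here is just an example):
scrapy crawl stock_spider -o stocktypes.json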

Python - web scraping, get(url) taking infinite time

get() fetches information from other websites, but not from this particular website ("nseindia"):
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Referer': 'https://cssspritegenerator.com',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

page_url = "https://www1.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=-9999&symbol=BANKNIFTY&symbol=BANKNIFTY&instrument=OPTIDX&date=-&segmentLink=17"
d = get(page_url, headers=hdr)
Try requests:
pip install requests
import requests
from requests.exceptions import HTTPError
page_url = "https://www1.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=-9999&symbol=BANKNIFTY&symbol=BANKNIFTY&instrument=OPTIDX&date=-&segmentLink=17"
headers = {'User-Agent': '... your headers ...'}  # fill in your request headers here
try:
    # Pass the headers to the request itself rather than assigning to response.headers.
    response = requests.get(page_url, headers=headers)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
except HTTPError as http_err:
    print(f'HTTP error occurred: {http_err}')
except Exception as err:
    print(f'Other error occurred: {err}')

why i can't get the result from lagou this web site by using web scraping

I'm using Python 3.6.5 and my OS is macOS 10.13.6.
I'm learning web scraping, and I want to fetch data from this website (https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=).
Here is my code:
# encoding: utf-8
import requests
from lxml import etree
def parse_list_page():
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': None,
        'X-Requested-With': 'XMLHttpRequest',
    }
    data = {
        'first': 'false',
        'pn': 1,
        'kd': 'python',
    }
    response = requests.post(url, headers=headers, data=data)
    print(response.json())

def main():
    parse_list_page()

if __name__ == '__main__':
    main()
I appreciate you spending the time to answer my question.
I found the answer; here is the code:
# encoding: utf-8
import requests
from lxml import etree
import time
def parse_list_page():
    url = 'https://www.lagou.com/jobs/list_python?px=default&city=%E6%B7%B1%E5%9C%B3'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/',
        'Connection': 'keep-alive',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Upgrade-Insecure-Requests': '1',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Cache-Control': 'no-cache',
        'Pragma': 'no-cache',
    }
    response = requests.get(url, headers=headers)
    # print(response.text)
    r = requests.utils.dict_from_cookiejar(response.cookies)
    print(r)
    print('=' * 30)
    # r['LGUID'] = r['LGRID']
    # r['user_trace_token'] = r['LGRID']
    # r['LGSID'] = r['LGRID']
    cookies = {
        # 'X_MIDDLE_TOKEN': 'df7c1d3cfdf279f0caf13df990723620',
        # 'JSESSIONID': 'ABAAABAAAIAACBI29FE9BDFB6838D8DD69C580E517292C9',
        # '_ga': 'GA1.2.820168368.1551196380',
        # '_gat': '1',
        # 'Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6': '1551196381',
        # 'user_trace_token': '20190226235303-99bc357a-39de-11e9-921f-525400f775ce',
        # 'LGSID': '20190311094827-c3bc2393-439f-11e9-a15a-525400f775ce',
        # 'PRE_UTM': '',
        # 'PRE_HOST': '',
        # 'PRE_SITE': '',
        # 'PRE_LAND': 'https%3A%2F%2Fwww.lagou.com%2F',
        # 'LGUID': '20190226235303-99bc3944-39de-11e9-921f-525400f775ce',
        # '_gid': 'GA1.2.1391680888.1552248111',
        # 'index_location_city': '%E6%B7%B1%E5%9C%B3',
        # 'TG-TRACK-CODE': 'index_search',
        # 'LGRID': '20190311100452-0ed0525c-43a2-11e9-9113-5254005c3644',
        # 'Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6': '1552269893',
        # 'SEARCH_ID': 'aae3c38ec76545fc86cd4e23153afe44',
    }
    cookies.update(r)
    print(r)
    print('=' * 30)
    print(cookies)
    print('=' * 30)
    headers = {
        'Origin': 'https://www.lagou.com',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': None,
        'X-Requested-With': 'XMLHttpRequest',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Referer': 'https://www.lagou.com/jobs/list_python?px=default&city=%E6%B7%B1%E5%9C%B3',
        'Connection': 'keep-alive',
    }
    params = {
        'px': 'default',
        'city': '深圳',
        'needAddtionalResult': 'false',
    }
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'python',
    }
    url_json = 'https://www.lagou.com/jobs/positionAjax.json'
    response = requests.post(url=url_json, headers=headers, params=params, cookies=cookies, data=data)
    print(response.json())

def main():
    parse_list_page()

if __name__ == '__main__':
    main()
The reason I couldn't get the JSON response is the site's anti-scraping rule: you have to send the cookies set by the first request when you make the second one.
So, when you send the first request, save its cookies and merge them into the cookie dict you use for the second (JSON) request. I hope this helps when you run into the same problem while web scraping.
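For the same idea in a more compact form, a requests.Session carries the cookies from the first GET over to the second request automatically; the sketch below reuses the URLs, headers, and form data from the answer above and is only meant to illustrate the cookie flow:

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?px=default&city=%E6%B7%B1%E5%9C%B3',
})
# First request: the HTML page sets the cookies that the JSON endpoint later checks.
session.get('https://www.lagou.com/jobs/list_python?px=default&city=%E6%B7%B1%E5%9C%B3')
# Second request: the same session (and therefore the same cookies) is reused.
resp = session.post(
    'https://www.lagou.com/jobs/positionAjax.json',
    params={'px': 'default', 'city': '深圳', 'needAddtionalResult': 'false'},
    data={'first': 'true', 'pn': 1, 'kd': 'python'},
    headers={'X-Requested-With': 'XMLHttpRequest'},
)
print(resp.json())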
