I want to crawl data using Python from this webpage:
https://www.discountoptiondata.com/freedata/
keeping the same values for the expiration date and symbol while iterating over all values of the start date.
The problem is that the URL stays the same for all combinations, so I cannot build a list of the URLs I want to crawl.
Does anybody have ideas about how I can do that?
The website you are trying to parse is dynamic, which means it runs some code when you load it in your browser. In your case, that code is set to fetch the data when the "Get OptionData" button is clicked.
You can actually see the browser fetch the data in the Network tab of your browser's Developer Tools: F12 → Network → (refresh the page) → fill out the form and click "Get OptionData". It will show up as an XHR request in the Network tab list.
The response of the data fetch will look a bit like this:
{
    "AskPrice": "5.7",
    "AskSize": "",
    "BidPrice": "0.85",
    "ExpirationDate": "2019-06-21",
    "LastPrice": "4.4",
    "StrikePrice": "1000",
    "Symbol": "SPX"
}
The data returned from the fetch is encoded as JSON, and luckily for us, it's very easy to parse in Python. You can get the above JSON by investigating the XHR request in the Network tab; this was the URL for me:
https://www.discountoptiondata.com/freedata/getoptiondatajson?symbol=spx&datadate=2018-06-01&expirationDate=2018-06-15
I am unfamiliar with Scrapy, but for JSON-based parsing, I would recommend the requests module. Here is an example program that will fetch the data shown on the webpage:
import requests

ROOT_URL = "https://www.discountoptiondata.com/freedata/getoptiondatajson"

def fetch_option_data(symbol, datadate, expiration_date):
    response = requests.get(ROOT_URL, params={"symbol": symbol, "datadate": datadate, "expirationDate": expiration_date})
    return response.json()

data = fetch_option_data('spx', '2018-06-01', '2018-06-15')
for item in data:
    print("AskPrice:", item['AskPrice'], "Last Price:", item["LastPrice"])
To view the request or response HTTP headers in Google Chrome, take the following steps:
In Chrome, visit a URL, right click, and select Inspect to open the developer tools.
Select the Network tab.
Reload the page, select any HTTP request in the left panel, and the HTTP headers will be displayed in the right panel.
In your case:
Open https://www.discountoptiondata.com/freedata/ in Google Chrome.
Right click, select Inspect, and select the Network tab.
Now if you select a start date, you will find the request URL under the "Headers" tab.
In the same way, you can view the response under the "Response" tab.
Example:
Start Date Request URL:
https://www.discountoptiondata.com/freedata/getexpirationdates?symbol=spx&datadate=2018-06-01
Option Data Request URL:
https://www.discountoptiondata.com/freedata/getoptiondatajson?symbol=spx&datadate=2018-06-01&expirationDate=2018-06-15
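For completeness, here is a short sketch combining the two endpoints above: first fetch the available expiration dates, then fetch the option data for each. It assumes getexpirationdates returns a plain JSON list of dates; verify that in the Network tab first.

import requests

FREE_DATA = "https://www.discountoptiondata.com/freedata"
symbol, datadate = "spx", "2018-06-01"

# Step 1: fetch the available expiration dates for this symbol and start date.
exp_dates = requests.get(FREE_DATA + "/getexpirationdates",
                         params={"symbol": symbol, "datadate": datadate}).json()

# Step 2: fetch the option data for each expiration date.
for exp in exp_dates:
    options = requests.get(FREE_DATA + "/getoptiondatajson",
                           params={"symbol": symbol, "datadate": datadate,
                                   "expirationDate": exp}).json()
    print(exp, len(options), "contracts")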
Is there a way to select a certain "link" in a list of links, e.g. link1 - link2 - link3, on a web page and "click" on it, thus downloading the content of that link only?
I would like to select "MAX" and, once it is selected, download the content of "download data".
I have already created a program using Selenium, but it is too slow for the number of downloads I have to run.
Here is a link to the web page from which I want to extract the data:
https://www.nasdaq.com/market-activity/stocks/clvs/historical
Just use the API on the backend and play around with the dates and limit. I changed the limit to 10000 (it defaults to just 18). Then handle the JSON response however you wish. You don't need BeautifulSoup for this, just requests:
https://api.nasdaq.com/api/quote/CLVS/historical?assetclass=stocks&fromdate=2012-01-17&limit=10000&todate=2022-01-17
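A minimal sketch of that, assuming the response layout is data → tradesTable → rows (check the actual JSON in your browser) and noting that this API tends to stall without a browser-like User-Agent header:

import requests

url = "https://api.nasdaq.com/api/quote/CLVS/historical"
params = {"assetclass": "stocks", "fromdate": "2012-01-17",
          "todate": "2022-01-17", "limit": 10000}
headers = {"User-Agent": "Mozilla/5.0"}  # assumption: the API wants a browser-like UA

resp = requests.get(url, params=params, headers=headers)
rows = resp.json()["data"]["tradesTable"]["rows"]  # assumed response layout
for row in rows[:5]:
    print(row["date"], row["close"])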
I am writing Python code to get data from a SharePoint website using Beautiful Soup.
Each page has 10 rows of details, so I should collect all the links up to the last page and then get the entire list of data I need.
Issues:
When I try to open the page 2 URL link using Python code, it still opens the page 1 (base URL) link.
When I open the base URL (page 1) in a browser and use the next button from there, I can navigate to page 2. But when I open a new tab and directly copy-paste the page 2 link, it refreshes and opens the page 1 (base URL) link.
Code:
import requests
from requests_ntlm import HttpNtlmAuth
session = requests.Session()
session.auth = HttpNtlmAuth('username','password')
r = session.get("UrlLinkOfPage2")
print(r.status_code)
print(r.content)
The issue with some websites is that they expect certain special headers to be sent. Open the site's home page in your browser (log in if required), open the Network tab in the developer tools, and inspect the request your browser makes to access the second page. Then copy all the headers it is sending and make a dictionary in your Python code containing those headers, like:
my_headers = {
    'some-header': 'some-value',
    'another-header': 'another-value'
}
Then use the requests library to send those headers when requesting the page, like:
response = session.get(second_page_url, headers=my_headers)
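Putting that together with the NTLM session from the question, a sketch might look like this (the header names and values below are placeholders; replace them with the headers you captured):

import requests
from requests_ntlm import HttpNtlmAuth

session = requests.Session()
session.auth = HttpNtlmAuth('username', 'password')
my_headers = {
    # placeholders: paste the headers captured from your browser here
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'UrlLinkOfPage1',
}
r = session.get("UrlLinkOfPage2", headers=my_headers)
print(r.status_code)
print(r.content)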
I am trying to scrape data from a website with Beautiful Soup, but to scrape all the content, I have to click a button
<button class="show-more">view all 102 items</button>
to load every item. I have heard that it could be done with Selenium, but that means I have to open a browser with the script and then scrape the data. Are there any other ways to solve this problem?
You can use the same API endpoint the page does, which returns all the info in JSON form. Set a record return count higher than the total expected number. I show parsing out the album titles/URLs from the JSON. You can explore the response here. You can find this endpoint in the browser Network tab when refreshing the URL you supplied.
import requests

# Ask for up to 1000 records so everything comes back in one call.
data = {"fan_id": 1812622, "older_than_token": "1557167238:2897209009:a::", "count": 1000}
r = requests.post('https://bandcamp.com/api/fancollection/1/wishlist_items', json=data).json()
# Pull out (album title, item URL) pairs from the JSON response.
details = [(item['album_title'], item['item_url']) for item in r['items']]
print(details)
The title of the post pretty much says it all. I fill in a one-box form, hit the submit button, and get back the results from the website. I need to find out what the text box data is and how it goes to the website. How do I find out what gets sent and how it's sent? I want to be able to 'autofill' the form and submit it, but I don't know how to do it.
I can bring up the headers, but, yeah, I'm a little lost in what/how they behave. I tried a simple program a couple of days ago, but I couldn't tell what happened with it: was it actually submitting anything or not? I don't know how to test it, since I'm not using a browser to make the call.
import requests
import json
from urllib.parse import quote_plus
URL = 'https://us.hidester.com/do.php?action=go'
data = {'langsel':'en','u':'https://www.esrl.noaa.gov/psd/cgi-bin/data/composites/comp.day.pl?var=Geopotential+Height&level=1000mb&monr1=1&dayr1=1&monr2=1&dayr2=1&iyr[1]=2018&filenamein=&type=1&proj=ALL','encodeURL':'on','allowCookies':'on','stripJS':'on'}
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
resp = requests.get(URL, params=data, headers=headers)
I did the above a couple of days ago and when I went and tried
print(resp)
All I got back was
<Response [200]>
Whatever that means.
How do I go about getting what the webpage returns to me, i.e. another webpage, when that page depends entirely on what I sent it initially, and that gets encrypted?
Yeah, I'm a little lost.
You can use the Chrome developer tools to find what the POST request is and what form data and headers are posted when you click the button.
After that, you can build your Python code to perform the form post with the required data and check the response.
To check the response, you can add print(resp.content) to print the body of the response to your request.
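A hedged sketch of that workflow, where the URL, field names, and values are placeholders to be replaced with the ones copied from the "Form Data" section of the captured request:

import requests

form_data = {"field1": "value1", "field2": "value2"}  # placeholder form fields
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.post("https://example.com/submit", data=form_data, headers=headers)

print(resp.status_code)    # e.g. 200
print(resp.content[:500])  # the first 500 bytes of the page the server sent back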
I created a spider that takes the information from the table below, but I cannot switch to the previous table because its button does not have an href. How do I do that?
https://br.soccerway.com/teams/italy/as-roma/1241/
The "previous" button has no href:
<a rel="previous" class="previous " id="page_team_1_block_team_matches_summary_7_previous">« anterior</a>
If you look at the network inspector in your browser, you can see an XHR request being made when you click the next button.
That request returns a JSON response with the HTML changes.
You need to reverse engineer how your page generated this URL:
https://br.soccerway.com/a/block_team_matches_summary?block_id=page_team_1_block_team_matches_summary_7&callback_params=%7B%22page%22%3A0%2C%22bookmaker_urls%22%3A%7B%2213%22%3A%5B%7B%22link%22%3A%22http%3A%2F%2Fwww.bet365.com%2Fhome%2F%3Faffiliate%3D365_371546%22%2C%22name%22%3A%22Bet%20365%22%7D%5D%7D%2C%22block_service_id%22%3A%22team_summary_block_teammatchessummary%22%2C%22team_id%22%3A1241%2C%22competition_id%22%3A0%2C%22filter%22%3A%22all%22%2C%22new_design%22%3Afalse%7D&action=changePage&params=%7B%22page%22%3A1%7D
And then you can use that to retrieve the following pages.
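A sketch of how that might look, assuming you paste the callback_params blob verbatim from the captured URL and only vary the page number in the params argument (the shape of the JSON response is an assumption to verify in the network inspector):

import requests

BASE = "https://br.soccerway.com/a/block_team_matches_summary"
CALLBACK_PARAMS = "..."  # paste the callback_params value from the captured URL

def fetch_page(page):
    params = {
        "block_id": "page_team_1_block_team_matches_summary_7",
        "callback_params": CALLBACK_PARAMS,
        "action": "changePage",
        "params": '{"page": %d}' % page,
    }
    return requests.get(BASE, params=params).json()

# Page 0 is the current table; higher numbers go further back.
data = fetch_page(1)
print(data)  # inspect the returned JSON for the replacement HTML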