I'm trying to get the address in the 'From' column for the first-ever transaction of any given token. Since new transactions come in so often that the table is effectively dynamic, I'd like to be able to fetch this info at any time using a parsel Selector. Here's my attempted approach:
First step: Fetch the total number of pages
Second step: Insert that number into the URL to get to the last page (which holds the earliest transactions).
Third step: Loop through the 'From' column and extract the first address.
It returns an empty list. I can't figure out the source of the issue. Any advice will be greatly appreciated.
import requests
from parsel import Selector

contract_address = "0x431e17fb6c8231340ce4c91d623e5f6d38282936"
pg_num_url = f"https://bscscan.com/token/generic-tokentxns2?contractAddress={contract_address}&mode=&sid=066c697ef6a537ed95ccec0084a464ec&m=normal&p=1"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"}

response = requests.get(pg_num_url, headers=headers)
sel = Selector(response.text)
pg_num = sel.xpath('//nav/ul/li[3]/span/strong[2]').get()  # Attempting to extract the last page number

url = f"https://bscscan.com/token/generic-tokentxns2?contractAddress={contract_address}&mode=&sid=066c697ef6a537ed95ccec0084a464ec&m=normal&p={pg_num}"  # page number inserted
response = requests.get(url, headers=headers)
sel = Selector(response.text)

addresses = []
for row in sel.css('tr'):
    addr = row.xpath('td[5]//a/@href').re('/token/([^#?]+)')[0][45:]
    addresses.append(addr)

print(addresses[-1])  # Desired address
It seems like the website is using server-side session tracking and a security token to make scraping a bit more difficult.
We can get around this by replicating their behaviour!
If you take a look at the web inspector you can see that some cookies are sent to us when we connect to the website for the first time.
Further, when we click the next page on one of the tables, we see these cookies being sent back to the server.
Finally, the URL of the table page contains something called sid; this often stands for something like "security id", and it can be found in the body of the first page. If you inspect the page source you can find it hidden away in the JavaScript.
Now we need to put all of this together:
start a requests Session which will keep track of cookies
go to token homepage and receive cookies
find sid in token homepage
use cookies and sid token to scrape the table pages
I've modified your code and it ends up looking something like this:
import re
import requests
from parsel import Selector

contract_address = "0x431e17fb6c8231340ce4c91d623e5f6d38282936"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
}

# we need to start a session to keep track of cookies
session = requests.Session()

# first we make a request to the homepage to pick up the server-side session cookies
resp_homepage = session.get(
    f"https://bscscan.com/token/{contract_address}", headers=headers
)

# in the homepage we also need to find the security token that is hidden in the html body
# we can do this with a simple regex pattern:
security_id = re.findall("sid = '(.+?)'", resp_homepage.text)[0]

# once we have cookies and the security token we can build the pagination url
pg_num_url = (
    f"https://bscscan.com/token/generic-tokentxns2?"
    f"contractAddress={contract_address}&mode=&sid={security_id}&m=normal&p=2"
)

# finally get the page response and scrape the data:
resp_pagination = session.get(pg_num_url, headers=headers)

addresses = []
for row in Selector(resp_pagination.text).css("tr"):
    addr = row.xpath("td[5]//a/@href").get()
    if addr:
        addresses.append(addr)

print(addresses)  # Desired addresses
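To get back to the original goal (the 'From' address of the first-ever transaction), you can reuse the same session and sid to jump to the last page and take the last row. This is only a sketch: the pagination XPath (//nav/ul/li[3]/span/strong[2]) and the column index td[5] are assumptions carried over from your original code, so adjust them if the markup differs.

# Sketch only: pagination XPath and td[5] column index are assumed from the original code
last_page = Selector(resp_pagination.text).xpath('//nav/ul/li[3]/span/strong[2]/text()').get()
last_page_url = (
    f"https://bscscan.com/token/generic-tokentxns2?"
    f"contractAddress={contract_address}&mode=&sid={security_id}&m=normal&p={last_page}"
)
resp_last = session.get(last_page_url, headers=headers)
from_links = Selector(resp_last.text).xpath("//tr/td[5]//a/@href").getall()
if from_links:
    print(from_links[-1])  # last row of the last page = first-ever transaction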
I wrote this code to extract the past filmography of the filmmaker Ken Loach. I wanted his past work as a director, which includes 56 entries, so I targeted the <ul> tag with the find() method, which worked. But when I tried to target all the <li> tags under it, which are supposed to be 56, I got only 15! Below is my code:
url="https://www.imdb.com/name/nm0516360/?ref_=nv_sr_srsg_0"
hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}
resp=requests.get(url, headers=hdr)
html=resp.content
soup=BeautifulSoup(html, "html.parser")
uls=soup.find('ul', 'ipc-metadata-list ipc-metadata-list--dividers-between ipc-metadata-list--base')
films=uls.find_all('li', 'ipc-metadata-list-summary-item ipc-metadata-list-summary-item--click sc-139216f7-1 fFMbUG')
print(len(films))
What is it that I am doing wrong?
PS: I'm learning Web-scraping and I am a beginner at it.
That's because the page only contains 15 links to begin with. It loads up more dynamically when you click the button to show more.
You can manually check this through the "view page source" option in your browser.
You can use Selenium or any other JS-rendering service to access the page and simulate the user clicks.
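For example, here is a minimal Selenium sketch of that idea; the "see more" button selector used below is an assumption and will likely need adjusting to IMDb's current markup:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a chromedriver is available on PATH
driver.get("https://www.imdb.com/name/nm0516360/?ref_=nv_sr_srsg_0")
time.sleep(3)  # crude wait for the page to render

# The selector below is a guess; inspect the page for the real "see all"/"more" button
buttons = driver.find_elements(By.CSS_SELECTOR, "button.ipc-see-more__button")
if buttons:
    driver.execute_script("arguments[0].click();", buttons[0])
    time.sleep(2)  # wait for the extra <li> items to load

films = driver.find_elements(By.CSS_SELECTOR, "li.ipc-metadata-list-summary-item")
print(len(films))
driver.quit()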
The problem is that your soup is empty. Use Cookie and Accept in your headers. Also, you have to click "view all" to get all the films; you can do that with Selenium or Playwright.
import requests
from bs4 import BeautifulSoup

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'Cookie': 'session-id=140-8492617-3903853; session-id-time=2082787201l; csm-hit=tb:7VKG5AZH38ZHJTWZ0TDQ+s-7VKG5AZH38ZHJTWZ0TDQ|1675775852695&t:1675775852695&adb:adblk_no',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
}
url = 'https://www.imdb.com/name/nm0516360/?ref_=nv_sr_srsg_0'
r = requests.get(url, headers=head)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup)

lists = soup.find_all('li', class_='ipc-metadata-list-summary-item ipc-metadata-list-summary-item--click sc-139216f7-1 fFMbUG')
for z in lists:
    print(z.text)
Companies are allowed to create their own concepts. The concept AccruedAndOtherCurrentLiabilities is generated by Tesla. Get all us-gaap concepts from the SEC's RESTful API with Python code:
import requests
import json

cik = '1318605'  # Tesla's CIK
url = 'https://data.sec.gov/api/xbrl/companyfacts/CIK{:>010s}.json'.format(cik)
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
}
res = requests.get(url=url, headers=headers)
result = json.loads(res.text)
us_gaap_concepts = list(result['facts']['us-gaap'].keys())
Revenues is a us-gaap concept; verify it with code:
'Revenues' in us_gaap_concepts
True
Verify that AccruedAndOtherCurrentLiabilities is not in us_gaap_concepts:
'AccruedAndOtherCurrentLiabilities' in us_gaap_concepts
False
How can I get all of a company's customized concepts from the SEC's data API or some file, then?
If I understand you correctly, one way to get the company's US GAAP taxonomy concept extensions (there may be others) is to do the following. Note that the data is in XML format, not JSON, so you will need to use an XML parser.
If you look at the company's 10-K filing for 2020, for example, you will notice that, at the bottom, there is a list of data files, the first one described as "XBRL TAXONOMY EXTENSION SCHEMA" and named "tsla-20201231.xsd". That's the file you're looking for. Copy the URL and get started. BTW, it's probably possible to automate all this, but that's a different topic.
from lxml import etree
import requests

# get the file
url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-20201231.xsd'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
}
req = requests.get(url, headers=headers)

# load it into lxml for parsing
doc = etree.XML(req.content)

# search for the customized concepts
tsla_concepts = doc.xpath('//*[@id[starts-with(.,"tsla_")]]/@name')
tsla_concepts
You get a list of 328 customized concepts. Your AccruedAndOtherCurrentLiabilities is somewhere near the top:
['FiniteLivedIntangibleAssetsLiabilitiesOther',
'IndefiniteLivedIntangibleAssetsGrossExcludingGoodwill',
'IndefiniteLivedIntangibleAssetsOtherAdjustments',
etc.
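If you want to confirm it programmatically, the same membership check as before now succeeds against the extension list:
'AccruedAndOtherCurrentLiabilities' in tsla_concepts
True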
I was trying to download materials off a website which didn't allow bots. I managed to pass a header to Request this way:
import urllib.request
from bs4 import BeautifulSoup as bs

url = 'https://www.superdatascience.com/machine-learning/'
req = urllib.request.Request(url, headers={'user-agent': 'Mozilla/5.0'})
res = urllib.request.urlopen(req)
soup = bs(res, 'lxml')
links = soup.findAll('a')
res.close()

hrefs = [link.attrs['href'] for link in links]
# Now I am filtering in zips only
zips = list(filter(lambda x: 'zip' in x, hrefs))
I hope that Kiril forgives me for that; honestly, I didn't mean anything unethical, I just wanted to do it programmatically.
Now that I have all the links to the zip files, I need to retrieve their content. And urllib.request.urlretrieve apparently refuses to download through a script here, so I'm doing it through URLopener:
opener = urllib.request.URLopener()
opener.version = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
for zip in zips:
    file_name = zip.split('/')[-1]
    opener.retrieve(zip, file_name)
The above returned:
HTTPError: HTTP Error 301: Moved Permanently
I also tried it without a loop, thinking I had done something silly, and set the header with the addheaders method instead:
opener = urllib.request.URLopener()
opener.addheaders = [('User-agent','Mozilla/5.0')]
opener.retrieve(zips[1], 'file.zip')
But it returned the same response with no resource being loaded.
I've two questions:
1. Is there something wrong with my code? And if yes, what did I do wrong?
2. Is there another way to make it work?
Thanks a lot in advance!
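For what it's worth, here is a minimal sketch of the same download done with urllib.request.Request and urlopen (the same header mechanism that already worked for fetching the page; urlopen also follows redirects like that 301 by default) instead of the legacy URLopener. Whether the server accepts these particular headers is an assumption:

import shutil
import urllib.request

for zip_url in zips:
    file_name = zip_url.split('/')[-1]
    req = urllib.request.Request(zip_url, headers={'user-agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp, open(file_name, 'wb') as out:
        shutil.copyfileobj(resp, out)  # stream each zip to disk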
I can't seem to log in to my university website using Python's requests.session() function. I have tried retrieving all the headers and cookies needed to log in, but it does not successfully log in with my credentials. It does not show any error, but the source code I review after it is supposed to have logged in shows that it is still not logged in.
All my code is below. I fill in the login and password with my credentials, but the rest is the exact code.
import requests

with requests.session() as r:
    url = "https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/user/login"
    page = r.get(url)

    aspsessionid = r.cookies["ASPSESSIONID"]
    ouacapply1 = r.cookies["OUACApply1"]

    LOGIN = ""
    PASSWORD = ""

    login_data = dict(ASPSESSIONID=aspsessionid, OUACApply1=ouacapply1, login=LOGIN, password=PASSWORD)

    header = {"Referer": "https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/user/login", "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"}

    logged_in = r.post(url, data=login_data, headers=header)

    new_page = r.get(url="https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/profile/")
    plain_text = new_page.text
    print(plain_text)
You are missing two inputs which need to be posted:
name='submitButton', value='Log In'
name='csrf', value=''
The value of the second one keeps changing, so you need to fetch its value dynamically.
If you want to see where this input is, go to the form's closing tag; just above the closing tag you will find an input which is hidden.
So include these two values in your login_data and you will be able to log in.
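A minimal sketch of that, reusing the variables from the code above (the field names csrf and submitButton with value 'Log In' are taken from this answer; confirm them against the actual form):

from bs4 import BeautifulSoup

# parse the login page that was fetched earlier with page = r.get(url)
soup = BeautifulSoup(page.text, "html.parser")

# the hidden csrf input sits just above the form's closing tag
csrf_value = soup.find("input", {"name": "csrf"})["value"]

# add the two missing fields to the existing login_data before posting
login_data["csrf"] = csrf_value
login_data["submitButton"] = "Log In"

logged_in = r.post(url, data=login_data, headers=header)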
This code should download the HTML page and just print it to the screen, but instead I get an HTTP 500 error exception, which I can't figure out how to handle.
Any ideas?
import requests
import bs4

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'}

# Load mainPage
_requestResult = requests.get("http://www.geometriancona.it/categoria_albo/albo/", headers=headers, timeout=20)
_requestResult.raise_for_status()
_htmlPage = bs4.BeautifulSoup(_requestResult.text, "lxml")
print(_htmlPage)
# search for stuff in the html code
You can use the urllib module to download individual URLs but this will just return the data. It will not parse the HTML and automatically download things like CSS files and images.
If you want to download the "whole" page you will need to parse the HTML and find the other things you need to download. You could use something like Beautiful Soup to parse the HTML you retrieve.
This question has some sample code doing exactly that.
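As a rough sketch of that idea with Beautiful Soup, you can collect image and stylesheet URLs from the fetched HTML; the tags checked here are just the common ones, and note that this particular site may still need the cookies discussed below:

import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "http://www.geometriancona.it/categoria_albo/albo/"
req = urllib.request.Request(page_url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, "lxml")

# collect the URLs of images and stylesheets referenced by the page
asset_urls = []
for img in soup.find_all("img", src=True):
    asset_urls.append(urljoin(page_url, img["src"]))
for link in soup.find_all("link", href=True):
    if "stylesheet" in (link.get("rel") or []):
        asset_urls.append(urljoin(page_url, link["href"]))

print(asset_urls)  # each of these could then be downloaded the same way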
Try to visit http://www.geometriancona.it/categoria_albo/albo/ with your browser in anonymous (incognito) mode; it gives an HTTP 500 error
because you need to log in, don't you?
Maybe you should try this syntax:
r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
Your code works, but you have to
print(_htmlPage)
Try it with:
_requestResult = requests.get("http://www.google.com", headers=headers, timeout=20)
UPDATE
The problem was the cookies; after packet analysis I found four cookies, so here's the code that works for me:
import requests
import bs4

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'}

jar = requests.cookies.RequestsCookieJar()
jar.set('PHPSESSID', '1bj8opfs9nb41l9dgtdlt5cl63', domain='geometriancona.it')
jar.set('wfvt', '587b6fcd2d87b', domain='geometriancona.it')
jar.set('_iub_cs-7987130', '%7B%22consent%22%3Atrue%2C%22timestamp%22%3A%222017-01-15T12%3A17%3A09.702Z%22%2C%22version%22%3A%220.13.9%22%2C%22id%22%3A7987130%7D', domain='geometriancona.it')
jar.set('wordfence_verifiedHuman', 'e8220859a74b2ee9689aada9fd7349bd', domain='geometriancona.it')

# Load mainPage
_requestResult = requests.get("http://www.geometriancona.it/categoria_albo/albo/", headers=headers, cookies=jar)
_requestResult.raise_for_status()
_htmlPage = bs4.BeautifulSoup(_requestResult.text, "lxml")
print(_htmlPage)
That's my output: http://prnt.sc/dvw2ec