Scraping earnings data - python-3.x

I posted an earlier thread on this, but I've reposted it here with some new changes. Basically, the website was detecting a web scraper, so I added a Firefox user agent. However, my code still seems to be failing: it works for some securities but fails for others.
Note: I've got a sleep in there because the full script loops over a list of securities (not shown in the code below).
sec = "BZUN" <- If you replace with ABAC, then it works
print ("Retrieving earnings for ", sec)
url = 'https://seekingalpha.com/symbol/' + sec + '/earnings'
r = requests.get(url, proxies={'http':'62.105.128.174'}, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}).text
s = soup(r, 'lxml')
panel = s.find_all('div', {'class':'panel-heading'})
#summary = s.find_all('div', {'class':'data-container'})
print(s)
time.sleep(7)
print(len(panel))
for item in panel:
period = item.find('span', {'class':'title-period'})
eps = item.find('span', {'class':'eps'})
if (item.text).find("Revenue of") == -1:
revenue = ""
else:
revenue = "Revenue of" + (item.text).split("Revenue of")[1].split("/")[0]
For example, if I request the security "BZUN" and print out the response, I get a "Please click I am not a robot to continue" page.
Any ideas?
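One thing that may help while debugging is to detect the captcha page explicitly and back off instead of parsing it. A minimal sketch along those lines (the helper name and the retry/backoff values are just illustrative, not from the original post):

import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}

def fetch_earnings_html(sec, retries=3, backoff=30):
    # Hypothetical helper: fetch the earnings page for one ticker and wait
    # longer between attempts whenever the captcha page comes back instead.
    url = 'https://seekingalpha.com/symbol/' + sec + '/earnings'
    for attempt in range(retries):
        html = requests.get(url, headers=HEADERS).text
        if 'not a robot' not in html:
            return html
        time.sleep(backoff * (attempt + 1))  # arbitrary back-off; tune as needed
    return None  # still blocked after all retries

This at least makes the failure visible per ticker instead of silently parsing the captcha page.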

Related

Scrape data from a dynamic web table using parsel selector

I'm trying to get the address in the 'From' column of the first-ever transaction for any token. Since new transactions appear so often, the table is dynamic, so I'd like to be able to fetch this info at any time using a parsel Selector. Here's my attempted approach:
First step: Fetch the total number of pages.
Second step: Insert that number into the URL to get to the earliest page number.
Third step: Loop through the 'From' column and extract the first address.
It returns an empty list, and I can't figure out the source of the issue. Any advice will be greatly appreciated.
import requests
from parsel import Selector

contract_address = "0x431e17fb6c8231340ce4c91d623e5f6d38282936"
pg_num_url = f"https://bscscan.com/token/generic-tokentxns2?contractAddress={contract_address}&mode=&sid=066c697ef6a537ed95ccec0084a464ec&m=normal&p=1"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"}
response = requests.get(pg_num_url, headers=headers)
sel = Selector(response.text)
pg_num = sel.xpath('//nav/ul/li[3]/span/strong[2]').get()  # attempting to extract the page number
url = f"https://bscscan.com/token/generic-tokentxns2?contractAddress={contract_address}&mode=&sid=066c697ef6a537ed95ccec0084a464ec&m=normal&p={pg_num}"  # page number inserted
response = requests.get(url, headers=headers)
sel = Selector(response.text)
addresses = []
for row in sel.css('tr'):
    addr = row.xpath('td[5]//a/@href').re('/token/([^#?]+)')[0][45:]
    addresses.append(addr)
print(addresses[-1])  # desired address
Seems like the website is using server-side session tracking and a security token to make scraping a bit more difficult.
We can get around this by replicating its behaviour!
If you take a look at the web inspector, you can see that some cookies are sent to us when we connect to the website for the first time.
Further, when we click to the next page of one of the tables, we can see those cookies being sent back to the server.
Finally, the URL of the table page contains a parameter called sid; this usually stands for something like "security id", and it can be found in the body of the first page. If you inspect the page source, you can find it hidden away in the JavaScript.
Now we need to put all of this together:
1. Start a requests Session, which will keep track of cookies.
2. Go to the token homepage and receive the cookies.
3. Find the sid in the token homepage.
4. Use the cookies and the sid token to scrape the table pages.
I've modified your code and it ends up looking something like this:
import re
import requests
from parsel import Selector

contract_address = "0x431e17fb6c8231340ce4c91d623e5f6d38282936"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
}
# we need to start a session to keep track of cookies
session = requests.session()
# first we make a request to the homepage to pick up the server-side session cookies
resp_homepage = session.get(
    f"https://bscscan.com/token/{contract_address}", headers=headers
)
# in the homepage we also need to find the security token that is hidden in the html body
# we can do this with a simple regex pattern:
security_id = re.findall("sid = '(.+?)'", resp_homepage.text)[0]
# once we have cookies and the security token we can build the pagination url
pg_num_url = (
    f"https://bscscan.com/token/generic-tokentxns2?"
    f"contractAddress={contract_address}&mode=&sid={security_id}&m=normal&p=2"
)
# finally get the page response and scrape the data:
resp_pagination = session.get(pg_num_url, headers=headers)
addresses = []
for row in Selector(resp_pagination.text).css("tr"):
    addr = row.xpath("td[5]//a/@href").get()
    if addr:
        addresses.append(addr)
print(addresses)  # desired addresses

Python - how to return to for loop using +=

I have an issue with a for loop I wrote: I can't get execution to return to the outer for statement:
import requests
from bs4 import BeautifulSoup

# get_dates() and get_news() are helper functions defined elsewhere in the script
def output(query, page, max_page):
    """
    Parameters:
        query: a string
        page: starting value of the search-result offset, integer
        max_page: maximum pages to be crawled per day, integer
    Returns:
        List of news dictionaries in a list: [[{...},{...}..],[{...},]]
    """
    news_dicts_all = []
    news_dicts = []
    # best to concatenate urls here
    date_range = get_dates()
    for date in get_dates():
        s_date = date.replace(".", "")
        while page < max_page:
            url = "https://search.naver.com/search.naver?where=news&query=" + query + "&sort=0&ds=" + date + "&de=" + date + "&nso=so%3Ar%2Cp%3Afrom" + s_date + "to" + s_date + "%2Ca%3A&start=" + str(page)
            header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
            req = requests.get(url, headers=header)
            cont = req.content
            soup = BeautifulSoup(cont, 'html.parser')
            for urls in soup.select("._sp_each_url"):
                try:
                    if urls["href"].startswith("https://news.naver.com"):
                        news_detail = get_news(urls["href"])
                        adict = dict()
                        adict["title"] = news_detail[0]
                        adict["date"] = news_detail[1]
                        adict["company"] = news_detail[3]
                        adict["text"] = news_detail[2]
                        news_dicts.append(adict)
                except Exception as e:
                    continue
            page += 10
        news_dicts_all.append(news_dicts)
    return news_dicts_all
I've executed the code, and it seems that page += 10 gets execution back to the while loop, but it never returns to the for date in get_dates() loop after page reaches max_page.
What I essentially want is for the code to move on to the next date in get_dates() once page reaches max_page, but I don't know how to make that work.
You never reset page, so when the loop moves on to the next date, the condition page < max_page is already false and the while loop is skipped completely.
You'll need to do something like rename your page argument to start_page and then set page = start_page at the start of your for loop.
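A minimal sketch of that change, showing only the loop scaffolding (the request/parse body stays the same as in the question):

def output(query, start_page, max_page):
    news_dicts_all = []
    news_dicts = []
    for date in get_dates():
        s_date = date.replace(".", "")
        page = start_page  # reset the counter for every date
        while page < max_page:
            # ... same request/parse code as above ...
            page += 10
        news_dicts_all.append(news_dicts)
    return news_dicts_all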

python urllib.request.URLOpener returns 301 response

I was trying to download materials from a website that doesn't allow bots. I managed to pass a header to Request this way:
import urllib.request
from bs4 import BeautifulSoup as bs

url = 'https://www.superdatascience.com/machine-learning/'
req = urllib.request.Request(url, headers={'user-agent': 'Mozilla/5.0'})
res = urllib.request.urlopen(req)
soup = bs(res, 'lxml')
links = soup.findAll('a')
res.close()
hrefs = [link.attrs['href'] for link in links]
# Now I am filtering in zips only
zips = list(filter(lambda x: 'zip' in x, hrefs))
I hope that Kiril forgives me for this; honestly, I didn't mean anything unethical, I just wanted to do it programmatically.
Now that I have all the links to the zip files, I need to retrieve their content, and the site obviously refuses downloads made through a script with urllib.request.urlretrieve. So I'm doing it through URLopener:
opener = urllib.request.URLopener()
opener.version = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
for zip in zips:
    file_name = zip.split('/')[-1]
    opener.retrieve(zip, file_name)
The above returned:
HTTPError: HTTP Error 301: Moved Permanently
I also tried it without a loop, thinking I had done something silly, and set the header with the addheaders method instead:
opener = urllib.request.URLopener()
opener.addheaders = [('User-agent','Mozilla/5.0')]
opener.retrieve(zips[1], 'file.zip')
But it returned the same response, and no resource was downloaded.
I have two questions:
1. Is there something wrong with my code? If yes, what did I do wrong?
2. Is there another way to make it work?
Thanks a lot in advance!
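For what it's worth, a minimal sketch of an alternative that avoids the legacy URLopener entirely: build a Request with the same User-Agent for each zip URL, let urlopen follow the redirect, and write the body to disk. This is an untested assumption about what the server accepts, not a confirmed fix:

import shutil
import urllib.request

UA = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

for zip_url in zips:  # zips collected as in the snippet above
    file_name = zip_url.split('/')[-1]
    req = urllib.request.Request(zip_url, headers={'User-Agent': UA})
    # urlopen follows the 301 redirect automatically and keeps the custom header
    with urllib.request.urlopen(req) as res, open(file_name, 'wb') as out:
        shutil.copyfileobj(res, out)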

Python requests login to website

I can't seem to log in to my university website using Python's requests.session(). I have tried sending all the headers and cookies needed to log in, but it does not successfully log in with my credentials. It shows no error, but the page source I check after it is supposed to have logged in shows that I am still not logged in.
All my code is below. I fill in the login and password with my credentials; the rest is the exact code.
import requests

with requests.session() as r:
    url = "https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/user/login"
    page = r.get(url)
    aspsessionid = r.cookies["ASPSESSIONID"]
    ouacapply1 = r.cookies["OUACApply1"]
    LOGIN = ""
    PASSWORD = ""
    login_data = dict(ASPSESSIONID=aspsessionid, OUACApply1=ouacapply1, login=LOGIN, password=PASSWORD)
    header = {"Referer": "https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/user/login", "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"}
    logged_in = r.post(url, data=login_data, headers=header)
    new_page = r.get(url="https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/profile/")
    plain_text = new_page.text
    print(plain_text)
You are missing two form inputs which need to be posted:
name='submitButton', value='Log In'
name='csrf', value=''
The value of the second one keeps changing, so you need to fetch it dynamically.
If you want to see where this input is, go to the form's closing tag; just above it you will find a hidden input.
Include these two values in your login_data and you will be able to log in.
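A minimal sketch of that suggestion, assuming BeautifulSoup to pull the hidden csrf value out of the login page (the field names come from the answer above; the session re-sends the cookies automatically, so they are not duplicated in the form data here):

import requests
from bs4 import BeautifulSoup

LOGIN = ""     # your credentials, as in the question
PASSWORD = ""
url = "https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/user/login"
header = {
    "Referer": url,
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0",
}

with requests.session() as r:
    page = r.get(url, headers=header)
    # the csrf value sits in a hidden input just above the form's closing tag
    csrf = BeautifulSoup(page.text, "html.parser").find("input", {"name": "csrf"})["value"]
    login_data = {
        "login": LOGIN,
        "password": PASSWORD,
        "csrf": csrf,
        "submitButton": "Log In",
    }
    logged_in = r.post(url, data=login_data, headers=header)
    profile = r.get("https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/profile/")
    print(profile.text)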

How to Force Close a scrapy spider from a script

I have a Scrapy bot that runs from a script. My problem is: after the spider has finished crawling, the program does not end, so it basically runs forever until I manually shut it down. This spider is part of a bigger program, so I can't afford to shut it down like that while other processes still haven't run. So how do I shut it down safely?
I have already searched Stack Overflow and other forums for this and found this and this; the first one is totally not usable (trust me, I have tried), and the second one looked promising, but for some reason close_spider doesn't seem to close my spider when I get the spider_closed signal.
Here is the bot:
import json
import scrapy
from scrapy.crawler import CrawlerProcess

def pricebot(prod_name):
    class PriceBot(scrapy.Spider):
        name = 'pricebot'
        query = prod_name
        if query.find(' ') != -1:
            query = query.replace(' ', '-')
        start_urls = ['http://www.shopping.com/' + query + '/products?CLT=SCH']

        def parse(self, response):
            prices_container = response.css('div:nth-child(2) > span:nth-child(1) > a:nth-child(1)')
            t_cont = response.css('div:nth-child(2)>h2:nth-child(1)>a:nth-child(1)>span:nth-child(1)')
            title = t_cont.xpath('@title').extract()
            price = prices_container.xpath('text()').extract()
            # Sanitise the price results
            prices = []
            for p in price:
                prices.append(p.strip('\n'))
            # Group prices with their actual products
            product_info = dict(zip(title, prices))
            with open('product_info.json', 'w') as f:
                f.write(json.dumps(product_info))

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(PriceBot)
    process.start()
After it is done, I need to do other things (call 3 other functions, to be exact).
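For context, a common pattern from the Scrapy docs (not confirmed against this exact script) is to drive the crawl with CrawlerRunner instead of CrawlerProcess and stop the Twisted reactor yourself once the crawl's deferred fires, so control returns to your code afterwards. A minimal sketch, where run_pricebot_then_continue and after_crawl are hypothetical names:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

def run_pricebot_then_continue(spider_cls, after_crawl):
    # spider_cls: the PriceBot class built inside pricebot() above
    # after_crawl: a callable wrapping the "3 other functions" to run next
    configure_logging()
    runner = CrawlerRunner({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
    d = runner.crawl(spider_cls)
    d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl finishes
    reactor.run()  # blocks only until reactor.stop() is called
    after_crawl()  # control returns here, unlike with CrawlerProcess.start()

Note that the Twisted reactor can only be started once per process, so this pattern covers a single crawl per run.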
