Facebook login - Python 3 Requests module - python-3.x

Why does this script still bring me to the main page (not logged in)?
Assume the email and password are valid.
import requests
par = {'email':'addddd','Pass':'ggdddssd'}
r = requests.post('https://www.facebook.com',data=par)
print (r.text)
Or even when trying to search with the search bar on YouTube:
<input id="masthead-search-term" name="search_query" value="" type="text" tabindex="1" title="Rechercher">
import requests
par = {'search_query':'good_songs_on_youtube'}
r = requests.post('https://www.youtube.com',data=par)
print (r.text)
it doesn't perform any search... any ideas why?

You won't be able to log in like this, for two reasons:
When you post an HTML form, you should use the form's "action" as the URL. For example:
import requests
url = 'https://www.facebook.com/login.php?login_attempt=1' #the form's action url
par = {"email": "********", "password": "********"} # the forms parameters
r = requests.post(url, data=par)
print (r.content) # you can't use r.text because of encoding
In your case, this is still not enough, because FB requires cookies; you might be able to program your way around that, but it won't be easy or straightforward.
So, how can you log in to FB programmatically and post queries?
You can install and use their SDK, and there are also open source projects such as this one that use the FB API and support OAuth as well.
You will have to go to https://developers.facebook.com/tools/explorer/ and get your API token in order to use it with OAuth authentication.
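For illustration, here is a minimal sketch of querying the Graph API with the facebook-sdk package, assuming you have already generated a user access token in the Graph API Explorer (the token below is a placeholder, and what you can read depends on the permissions granted to the token):
import facebook  # pip install facebook-sdk

ACCESS_TOKEN = 'PASTE_YOUR_TOKEN_HERE'  # placeholder: paste a token from the Graph API Explorer

graph = facebook.GraphAPI(ACCESS_TOKEN)
profile = graph.get_object('me')                   # basic profile of the token's owner
friends = graph.get_connections('me', 'friends')   # friends edge, if the token has permission

print(profile.get('name'))
print(len(friends.get('data', [])))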

For YouTube, you should use a GET request instead of POST to get a valid response. Below is a demo using your search query "good_songs" that shows how to fetch the results:
import requests
from lxml import html

url = 'https://www.youtube.com/results?search_query='
search = 'good_songs'
response = requests.get(url + search).text
tree = html.fromstring(response)
# Each result title is an <a> element carrying the yt-uix-tile-link class
for item in tree.xpath("//a[contains(concat(' ', @class, ' '), ' yt-uix-tile-link ')]/text()"):
    print(item)
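As a small variation, you can let requests build the query string for you via the params argument instead of concatenating it by hand; a minimal sketch (note that yt-uix-tile-link is the class the page used at the time of this answer and may have changed since):
import requests
from lxml import html

# requests encodes the query string from the params dict
response = requests.get('https://www.youtube.com/results',
                        params={'search_query': 'good_songs'})
tree = html.fromstring(response.text)

# Same XPath as above; the class name may no longer exist in current YouTube markup
for title in tree.xpath("//a[contains(concat(' ', @class, ' '), ' yt-uix-tile-link ')]/text()"):
    print(title)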

Related

Scraping Site Data without Selenium

Currently I am trying to pull CMS historical data from their site. I have some working code that pulls the download links from the page. My problem is that the links are split across pages, and I need to iterate through all the available pages and extract the download links. The obvious choice here would be to use Selenium to click through the next pages and get the data, but due to company policy I cannot run Selenium in this environment. Is there a way I can go through the pages and extract the links? The website does not expose the POST link once you try to go to the next page, and I am out of ideas for getting to the next page without that link and without using Selenium.
Current working code to pull links from the first page:
import pandas as pd
from datetime import datetime
#from selenium import webdriver
from lxml import html
import requests

def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """
    if payload is None:
        payload = {}
    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type": "text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type": "text"})
    content.raise_for_status()  # Raise HTTPError for bad requests (4xx or 5xx)
    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url

def get_html(link):
    """
    Returns the parsed html of a page.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed

cmslink = ("https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-"
           "Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report")
content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)
linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
headers = linkTable[0].xpath('//a/@href')
df1 = pd.DataFrame(headers, columns=['links'])
df1SubSet = df1[df1['links'].str.contains('contract-summary', case=False)]
These are the two URLs that will give you all 166 entries. I have also changed the condition for capturing the hrefs. Give this a try.
cmslinks = [
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1']

df = pd.DataFrame()
for cmslink in cmslinks:
    print(cmslink)
    content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)
    linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
    headers = linkTable[0].xpath("//a[contains(text(),'Contract Summary') or contains(text(),'Monthly Enrollment by CPSC')]/@href")
    df1 = pd.DataFrame(headers, columns=['links'])
    df = df.append(df1)
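If you would rather not hard-code the two page URLs, below is a sketch of the same idea driven by the page query parameter (which the URLs above already use); it assumes the listing keeps accepting increasing page values and simply stops when a page no longer yields matching links:
import pandas as pd

base_url = ('https://www.cms.gov/Research-Statistics-Data-and-Systems/'
            'Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/'
            'Monthly-Contract-and-Enrollment-Summary-Report')

frames = []
page = 0
while True:
    # items_per_page and page are the same query parameters used in the URLs above
    content, _ = http_request_get(url=base_url,
                                  payload={'items_per_page': 100, 'page': page},
                                  parse=True)
    cells = content.cssselect('td[headers="view-dlf-1-title-table-column"]')
    if not cells:
        break  # no results table on this page, so we have run out of pages
    hrefs = cells[0].xpath(
        "//a[contains(text(),'Contract Summary') or "
        "contains(text(),'Monthly Enrollment by CPSC')]/@href")
    if not hrefs:
        break
    frames.append(pd.DataFrame(hrefs, columns=['links']))
    page += 1

df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=['links'])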

How to fix "businessObject not defined"

I am a newbie to Python and web scraping. To practice, I am just trying to pull some business names from some HTML tags on a website. However, the code is not running and throws an 'object is not defined' error.
from bs4 import BeautifulSoup
import requests

url = 'https://marketplace.akc.org/groomers/?location=Michigan&page=1'
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
for business in content.find_all('div', attrs={"class": "groomer-salon-card__details"}):
    businessObject = {
        "BusinessName": business.find('h4', attrs={"class": "groomer-salon-card__name"}).text.encode('utf-8')
    }
print(businessObject)
Expected: I am trying to retrieve the business names from this web page.
Result:
NameError: name 'businessObject' is not defined
When you did
content.find_all('div', attrs={"class": "groomer-salon-card__details"})
you actually got back an empty list, because there was no match.
So when you did
for business in content.find_all('div', attrs={"class": "groomer-salon-card__details"}):
the loop body never ran, and
businessObject
was never created.
As mentioned in the comments, that is what led to your error.
The content is dynamically loaded from elsewhere in the DOM using JavaScript (along with other DOM modifications). You can still regex out the JavaScript object that contains the content used to update the DOM as you see it in the browser, and then parse it with the json parser as follows:
import requests, re, json

url = 'https://marketplace.akc.org/groomers/?location=Michigan&page=1'
response = requests.get(url, timeout=5)
p = re.compile(r'state: (.*?)\n', re.DOTALL)
data = json.loads(p.findall(response.text)[0])
for listing in data['content']['search_results']['pages']['data']:
    print(listing['organization_name'])
If you view the page source of the webpage you will see that the DOM is essentially populated dynamically from top to bottom, with mutation observers monitoring progress.
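A slightly more defensive version of the same idea is sketched below; it assumes the page still embeds its data in a state: JavaScript literal and guards against the pattern not matching, so you get a readable message instead of an IndexError:
import re
import json
import requests

url = 'https://marketplace.akc.org/groomers/?location=Michigan&page=1'
response = requests.get(url, timeout=5)

# The listing data is embedded as a JavaScript object assigned to `state:`
match = re.search(r'state: (.*?)\n', response.text, re.DOTALL)
if match is None:
    raise SystemExit('Could not find the embedded state object; the page layout may have changed')

data = json.loads(match.group(1))
for listing in data['content']['search_results']['pages']['data']:
    print(listing['organization_name'])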

Unable to get the expected html element details using Python

I am trying to scrape a website using Python. I have been able to scrape it successfully, however the expected result is not being fetched. I think it has something to do with the JavaScript of the web page.
My code is below:
from bs4 import BeautifulSoup
import re
# driver is an already-created Selenium WebDriver instance

driver.get("https://my website")
soup = BeautifulSoup(driver.page_source, 'lxml')
all_text = soup.text
ct = all_text.replace('\n', ' ')
cl_text = ct.replace('\t', ' ')
cln_text_t = cl_text.replace('\r', ' ')
cln_text = re.sub(' +', ' ', cln_text_t)
print(cln_text)
Instead of giving me the website details, it gives the output below. Any idea how I could fix this?
html, body {height:100%;margin:0;} You have to enable javascript in your browser to use an application built with Vaadin.........
Why do you need BeautifulSoup here at all? It doesn't execute JavaScript.
If you need to get the web page text you can fetch the document root using the simple XPath selector //html and read the innerText property of the resulting WebElement.
Suggested code change:
driver.get("my website")
root = driver.find_element_by_xpath("//html")
all_text = root.get_attribute("innerText")
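Put together as a self-contained sketch (the URL is a placeholder, and find_element_by_xpath is the Selenium 3 style call; on Selenium 4 you would use driver.find_element(By.XPATH, "//html") instead):
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is available on your PATH
try:
    driver.get("https://example.com")  # placeholder URL, replace with your site
    root = driver.find_element_by_xpath("//html")  # Selenium 3 style
    all_text = root.get_attribute("innerText")     # text as rendered after JavaScript runs
    print(all_text)
finally:
    driver.quit()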

Python 3.6 Downloading .csv files from finance.yahoo.com using requests module

I was trying to download a .csv file from this URL for the history of a stock. Here's my code:
import requests
r = requests.get("https://query1.finance.yahoo.com/v7/finance/download/CHOLAFIN.BO?period1=1514562437&period2=1517240837&interval=1d&events=history&crumb=JaCfCutLNr7")
file = open(r"history_of_stock.csv", 'w')
file.write(r.text)
file.close()
But when I opened the file history_of_stock.csv, this is what I found:
{
    "finance": {
        "error": {
            "code": "Unauthorized",
            "description": "Invalid cookie"
        }
    }
}
I couldn't find anything that could fix my problem. I found this thread in which someone has the same problem except that it is in C#: C# Download price data csv file from https instead of http
To complement the earlier answer and provide concrete, complete code, I wrote a script which accomplishes the task of getting historical stock prices from Yahoo Finance. I tried to write it as simply as possible. To give a summary: when you use requests to get a URL, in many instances you don't need to worry about crumbs or cookies. However, with Yahoo Finance, you need to get the crumb and the cookies. Once you have the cookies, you are good to go! Make sure to set a timeout on the requests.get call.
import re
import requests
import sys
from pdb import set_trace as pb

symbol = sys.argv[-1]
start_date = '1442203200'  # start date timestamp
end_date = '1531800000'    # end date timestamp
crumble_link = 'https://finance.yahoo.com/quote/{0}/history?p={0}'
crumble_regex = r'CrumbStore":{"crumb":"(.*?)"}'
cookie_regex = r'set-cookie: (.*?);'
quote_link = 'https://query1.finance.yahoo.com/v7/finance/download/{}?period1={}&period2={}&interval=1d&events=history&crumb={}'

link = crumble_link.format(symbol)
session = requests.Session()
response = session.get(link)

# get crumb
text = str(response.content)
match = re.search(crumble_regex, text)
crumbs = match.group(1)

# get cookie
cookie = session.cookies.get_dict()

url = quote_link.format(symbol, start_date, end_date, crumbs)
r = requests.get(url, cookies=session.cookies.get_dict(), timeout=5, stream=True)
out = r.text
filename = '{}.csv'.format(symbol)
with open(filename, 'w') as f:
    f.write(out)
There was a service for exactly this, but it was discontinued.
You can still do what you intend, but first you need to get a cookie. This post has an example of how to do it.
Basically, you first make a throwaway request just to obtain the cookie, and then, with this cookie in place, you can query whatever else you actually need.
There's also a post about another service which might make your life easier, as well as a Python module that works around this inconvenience, with code showing how to do it without the module.
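As a compact variant of the script above, the cookie handling can also be left entirely to a requests.Session, which re-sends whatever cookies it received while fetching the crumb; this sketch assumes the crumb is still embedded in the quote page as a CrumbStore object:
import re
import requests

symbol = 'CHOLAFIN.BO'
period1, period2 = '1514562437', '1517240837'  # timestamps taken from the original URL

session = requests.Session()

# First request: visit the quote page so the session picks up Yahoo's cookies,
# then pull the crumb token out of the embedded CrumbStore object.
page = session.get('https://finance.yahoo.com/quote/{0}/history?p={0}'.format(symbol),
                   timeout=5)
crumb = re.search(r'CrumbStore":{"crumb":"(.*?)"}', page.text).group(1)

# Second request: the actual CSV download, reusing the same session (and its cookies).
csv_url = ('https://query1.finance.yahoo.com/v7/finance/download/{}'
           '?period1={}&period2={}&interval=1d&events=history&crumb={}'
           ).format(symbol, period1, period2, crumb)
csv = session.get(csv_url, timeout=5)

with open('history_of_stock.csv', 'w') as f:
    f.write(csv.text)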

Python: Facebook Graph API - pagination request using facebook-sdk

I'm trying to query Facebook for different information, for example the friends list. It works fine, but of course it only gives a limited number of results. How do I access the next batch of results?
import facebook
import json

ACCESS_TOKEN = ''

def pp(o):
    with open('facebook.txt', 'a') as f:
        json.dump(o, f, indent=4)

g = facebook.GraphAPI(ACCESS_TOKEN)
pp(g.get_connections('me', 'friends'))
The resulting JSON does give me paging cursors (before and after values), but where do I put them?
I'm exploring the Facebook Graph API through the facepy library for Python (it works on Python 3 too), but I think I can help.
TL;DR:
You need to append &after=YOUR_AFTER_CODE to the URL you called (e.g. https://graph.facebook.com/v2.8/YOUR_FB_ID/friends/?fields=id,name), giving you a link like https://graph.facebook.com/v2.8/YOUR_FB_ID/friends/?fields=id,name&after=YOUR_AFTER_CODE, to which you should make a GET request.
You'll need requests in order to make a GET request to the Graph API using your user ID (I'm assuming you know how to find it programmatically) and a URL similar to the one I give you below (see the URL variable).
import facebook
import json
import requests

ACCESS_TOKEN = ''
YOUR_FB_ID = ''
URL = "https://graph.facebook.com/v2.8/{}/friends?access_token={}&fields=id,name&limit=50&after=".format(YOUR_FB_ID, ACCESS_TOKEN)

def pp(o):
    all_friends = []
    if 'data' in o:
        all_friends.extend(o['data'])
        paging = o.get('paging', {})
        if 'next' in paging:
            # The API hands you a ready-made URL for the next page
            resp = requests.get(paging['next'])
            all_friends.append(resp.json())
        elif 'after' in paging.get('cursors', {}):
            # Otherwise build the next-page URL from the 'after' cursor
            new_url = URL + paging['cursors']['after']
            resp = requests.get(new_url)
            all_friends.append(resp.json())
        else:
            print("Something went wrong")
    # Do whatever you want with all_friends...
    with open('facebook.txt', 'a') as f:
        json.dump(o, f, indent=4)

g = facebook.GraphAPI(ACCESS_TOKEN)
pp(g.get_connections('me', 'friends'))
Hope this helps!
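If you are already using the facebook-sdk package, newer versions also ship a generator that follows the paging cursors for you; a sketch, assuming your installed version provides get_all_connections:
import facebook

ACCESS_TOKEN = ''
g = facebook.GraphAPI(ACCESS_TOKEN)

# get_all_connections yields one friend at a time and follows the
# paging cursors internally, so you never handle 'after' yourself.
for friend in g.get_all_connections('me', 'friends'):
    print(friend.get('name'))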
