Trying to scrape a website that requires login - python-3.x

So I'm new to this and have been at it for almost a week now, trying to scrape a website I use to collect analytics data (think of it like Google Analytics).
I tried playing around with XPath to figure out what this script is able to pull, but all I get is "[]" as output after running it.
Please help me find what I'm missing.
import requests
from lxml import html

# credentials
payload = {
    'username': '<my username>',
    'password': '<my password>',
    'csrf-token': '<auth token>'
}

# open a session with login
session_requests = requests.session()
login_url = '<my website>'
result = session_requests.get(login_url)

# passing the auth token
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='form_token']/@value")))[0]
result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)

# scrape the analytics dashboard from this event link
url = '<my analytics webpage url>'
result = session_requests.get(
    url,
    headers=dict(referer=url)
)

# print output using xpath to find and load what i need
trees = html.fromstring(result.content)
bucket_names = trees.xpath("//*[@id='statistics_dashboard']/div[1]/text()")
print(bucket_names)
print(result.ok)
print(result.status_code)
This is what I get as a result:
[]
True
200
Process finished with exit code 0
That alone is a big step for me, because I'd been hitting so many errors just to get to this point.
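One thing that stands out in the code above: it extracts authenticity_token from the login page but never adds it to the payload, so the POST may be rejected and the dashboard request would then return the login page, where the XPath finds nothing. A minimal sketch of that fix, assuming the form field is really named form_token (both the field name and the sanity check below are assumptions on my part):

payload['form_token'] = authenticity_token  # send the freshly scraped token with the credentials
result = session_requests.post(login_url, data=payload, headers=dict(referer=login_url))
# sanity check: print the start of the page to confirm we are past the login form
print(result.text[:500])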

Related

python using requests and a webpage with a login issue

I'm trying to log in to a website via Python to print the info, so I don't have to keep logging into multiple accounts.
In the tutorial I followed, he just had a login and password, but this form submits extra fields (the website form data is shown in the payload below). Do the _wp* attributes change on each login?
The code I use:
import requests

mffloginurl = 'https://myforexfunds.com/login-signup/'
mffsecureurl = 'https://myforexfunds.com/account-2'
payload = {
    'log': '*****@gmail.com',
    'pwd': '*****',
    'brandnestor_action': 'login',
    '_wpnonce': '9d1753c0b6',
    '_wp_http_referer': '/login-signup/'
}
r = requests.post(mffloginurl, data=payload)
print(r.text)
Using the correct details, of course, but it doesn't log in. I tried with and without the extra WordPress elements, but it still just returns the sign-in page.
Yeah, the nonce will change with every new visit to the page.
I would use requests.session() so that it automatically stores session cookies and all that good stuff.
Do a session.get('some_login_page.com'), parse the response content with BeautifulSoup to retrieve the nonce, then add that into the payload of your POST request when you log in.
A very quick and dirty example:
import requests
from bs4 import BeautifulSoup as bs

email = 'test@email.com'
password = 'password1234'
url = 'https://myforexfunds.com/account-2/'

# Start a session
with requests.session() as session:
    # Send a GET request to the login page
    r = session.get(url)
    # Check if the request was successful
    if r.status_code != 200:
        print("Get Request Failed")
    # Parse the HTML content of the page
    soup = bs(r.content, 'lxml')
    # Extract the value of the nonce from the HTML
    nonce = soup.find(id='woocommerce-login-nonce')['value']
    # Set up the login form data
    params = {
        "username": email,
        "password": password,
        "woocommerce-login-nonce": nonce,
        "_wp_http_referer": "/account-2/",
        "login": "Log+in"
    }
    # Send a POST request with the login form data (form-encoded body, not query params)
    r = session.post(url, data=params)
    # Check if the request was successful
    if r.status_code != 200:
        print("Login Failed")

Getting status code 403 when working with Tweepy search API

I am trying to get the tweets in a 10-mile radius around Twitter's headquarters in San Francisco using the Tweepy API. I have the authentication information (API key, etc.) in an external .json file.
Below is my code:
import tweepy as tw
import json
import pandas as pd
import datetime
secrets = json.loads(open('secrets.json').read())
api_key = secrets['api_key']
api_secret_key = secrets['api_secret_key']
access_token = secrets['access_token']
access_token_secret = secrets['access_token_secret']
auth = tw.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth)
try:
    api.verify_credentials()
    print('Authenticated')
except:
    print("Error authenticating")
# After running the above, I get the status 'Authenticated'
start = '2020-09-01'
end = '2020-09-26'
location = '37.7749, 122.4149, 10miles'
query = "*"
max_tweets = 1000
tweet_search_results = [status for status in tw.Cursor(api.search, query=query, geocode=location, since=start, until=end).items(max_tweets)]
After running the above, I get this error:
TweepError: Twitter error response: status code = 403
This status code means access to the requested resource is forbidden for some reason: the server understood the request but refuses to fulfill it, typically for client-related reasons.
Within the same code I am able to get the list of tweets for a specific username, but without geolocation data.
Why am I getting 403 when trying to get geolocation tweets?
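No answer was posted for this one, but a few details in the snippet are worth checking, independent of the app's access level (which can also trigger a 403): the geocode string must be comma-separated with no spaces and end in a mi/km unit, San Francisco's longitude is negative, and in Tweepy 3.x api.search takes q rather than query. A hedged sketch with those corrections, continuing from the snippet above and assuming standard v1.1 search access:

# geocode format is "lat,long,radius" with no spaces; note the negative longitude and "mi" unit
location = '37.7749,-122.4194,10mi'
tweet_search_results = [
    status for status in tw.Cursor(
        api.search,      # Tweepy 3.x: the parameter is q, not query
        q='*',
        geocode=location
    ).items(max_tweets)
]
# note: the standard search API only covers roughly the last 7 days,
# so date ranges from September 2020 would return nothing even when authorized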

Scraping from site that requires login, how to access the contents?

So I am trying to scrape a website that requires a login. I have used requests and submitted my login details, but when I try to extract the data from the website, I do not get the page I am looking for.
USERNAME = "test#gmail.com"
PASSWORD = "test"
#MIDDLEWARE_TOKEN = "TESTTOKEN"
LOGIN_URL = "https://vrdistribution.com.au/auth/login/process"
VR_URL = "https://vrdistribution.com.au/categories/tabletop-gaming?page=1"
def main():
session_requests = requests.session()
# Get login csrf token
result = session_requests.get(LOGIN_URL)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[#name='_token']/#value")))
# Create payload
payload = {
"email": USERNAME,
"password": PASSWORD,
"csrfmiddlewaretoken": authenticity_token
}
# Perform login
result = session_requests.post(LOGIN_URL, data = payload, headers = dict(referer = LOGIN_URL))
#Scrape
result = session_requests.get(VR_URL, headers =dict(referer=VR_URL))
response = requests.get(VR_URL)
soup = BeautifulSoup(response.text, 'lxml')
print(soup)
The output is not the same content as the VR_URL (https://vrdistribution.com.au/categories/tabletop-gaming?page=1) I specified: when I inspect the page I want to scrape and compare it with the output of the soup object, they are completely different.
Is there a way for me to access and scrape the contents of the VR_URL?
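No answer was recorded here either, but two likely culprits stand out (a hedged guess, not a confirmed fix): the final request uses a fresh requests.get(), which carries none of the session's login cookies, and the token is passed into the payload as a one-element list instead of a string. A minimal sketch of those corrections inside main():

# take a single token string rather than a list
authenticity_token = tree.xpath("//input[@name='_token']/@value")[0]
payload = {"email": USERNAME, "password": PASSWORD, "csrfmiddlewaretoken": authenticity_token}
session_requests.post(LOGIN_URL, data=payload, headers=dict(referer=LOGIN_URL))
# reuse the logged-in session instead of a cookie-less requests.get()
result = session_requests.get(VR_URL, headers=dict(referer=VR_URL))
soup = BeautifulSoup(result.text, 'lxml')
print(soup)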

Bad Request using python_HTTP:400

I'm using Python 3 to scrape the Facebook page of "nytimes".
I first tried to create my own API app so I could get my app_id and app_secret.
When I tried to ping NYT's Facebook page to verify that the access_token works and the page_id is valid, I got this error:
HTTPError: HTTP Error 400: Bad Request
My code is below.
First, connect to my API via the id and secret code:
import urllib
import datetime
import json
import time
import urllib.request

id = "111111111111"
secret = "123123123123123113"
token = id + "|" + secret
Second, to ping the page I did:
page_id = 'nytimes'
Then
def testFacebookPageData(page_id, token):
    # construct the URL string
    base = "https://graph.facebook.com/v2.4"
    node = "/" + page_id
    parameters = "/?token=%s" % token
    url = base + node + parameters
    # retrieve data
    req = urllib.request.Request(url)
    response = urllib.request.urlopen(req)
    data = json.loads(response.read())
    print(json.dumps(data, indent=4, sort_keys=True))

testFacebookPageData(page_id, token)
I hope that I'm clear. I need your help, thank you.
According to Facebook's docs for the Graph API, the parameter should be named access_token, not token:
parameters = "/?access_token=%s" % token

Create Google Shortened URLs, Update My CSV File

I've got a list of ~3,000 URLs I'm trying to create Google-shortened links for; the idea is that this CSV has a list of links, and I want my code to output the shortened links in the column next to the original URLs.
I've been trying to modify the code found on this site here, but I'm not skilled enough to get it to work.
Here's my code (I would not normally post an API key, but the original person who asked this already posted it publicly on this site):
import json
import httplib2
import pandas as pd

df = pd.read_csv('Links_Test.csv')

def shorternUrl(my_URL):
    API_KEY = "AIzaSyCvhcU63u5OTnUsdYaCFtDkcutNm6lIEpw"
    apiUrl = 'https://www.googleapis.com/urlshortener/v1/url'
    longUrl = my_URL
    headers = {"Content-type": "application/json"}
    data = {"longUrl": longUrl}
    h = httplib2.Http('.cache')
    headers, response = h.request(apiUrl, "POST", json.dumps(data), headers)
    return response

for url in df['URL']:
    x = shorternUrl(url)
    # Then I want it to write x into the column next to the original URL
But I only get errors at this point, before I even started figuring out how to write the new URLs to the CSV file.
Here's some sample data:
URL
www.apple.com
www.google.com
www.microsoft.com
www.linux.org
Thank you for your help,
Me
I think the issue is that you did not include the API key in the request. By the way, the certifi package allows you to secure a connection to a link; you can get it with pip install certifi or pip install urllib3[secure].
Here I use my own API key, so you might want to replace it with yours.
from urllib3 import PoolManager
import json
import certifi

sampleURL = 'http://www.apple.com'
APIkey = 'AIzaSyD8F41CL3nJBpEf0avqdQELKO2n962VXpA'
APIurl = 'https://www.googleapis.com/urlshortener/v1/url?key=' + APIkey
http = PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

def shortenURL(url):
    data = {'key': APIkey, 'longUrl': url}
    response = http.request("POST", APIurl, body=json.dumps(data),
                            headers={'Content-Type': 'application/json'}).data.decode('utf-8')
    r = json.loads(response)
    return r['id']
The decoding part converts the response object into a string so that we can convert it to JSON and retrieve the data.
From there on, you can store the data in another column and so on.
For the sampleURL, I got back https://goo.gl/nujb from the function.
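To actually write the results back next to the originals, a minimal pandas sketch (the column name URL matches the sample data above; the output filename is my own choice, and note that Google shut the goo.gl shortener down in 2019, so this only illustrates the CSV side):

df['Short_URL'] = df['URL'].apply(shortenURL)  # one API call per row
df.to_csv('Links_Test_Shortened.csv', index=False)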
I found a solution here:
https://pypi.python.org/pypi/pyshorteners
Example from the linked page (print statement converted to Python 3):
from pyshorteners import Shortener

url = 'http://www.google.com'
api_key = 'YOUR_API_KEY'
shortener = Shortener('Google', api_key=api_key)
print("My short url is {}".format(shortener.short(url)))
