I am logged in to a page and I use a bookmark to download a CSV file: I just click the link and, after a few seconds, the file is downloaded to my PC. I am now trying to automate this file download from the URL in Python.
The URL that triggers a file download is the following:
app.example.com/export/org.jsp?media=yes&csv=yes
What I tried in Python 3 is shown below:
# First way
import requests

payload = {'inUserName': 'test.test@test.com', 'inUserPass': 'test'}

with requests.Session() as s:
    p = s.post('https://app.example.com/', data=payload)
    # print(p.text)
    r = s.get('https://app.example.com/export/org.jsp?media=yes&csv=yes')
# Second way
import requests

payload = {'inUserName': 'test.test@test.com', 'inUserPass': 'test'}
url = 'https://app.example.com/'
requests.post(url, data=payload)
# Third way
import urllib.request

with urllib.request.urlopen("http://app.example.com/export/org.jsp?media=yes&csv=yes") as url:
    s = url.read()
    # print(s)
I want to avoid the page-scraping technique where I log in to the page in a browser and then visit the URL. The platform does not offer an API through which I could request the file in a different way.
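For reference, here is a cleaned-up sketch of the first attempt that also saves the download. The login URL and field names are taken from the snippet above and are assumptions; check the actual form action and inputs in the browser's network tab, since posting to the wrong endpoint will silently return the login page again:

import requests

payload = {'inUserName': 'test.test@test.com', 'inUserPass': 'test'}

with requests.Session() as s:
    # Log in first so the session keeps the authentication cookies.
    login = s.post('https://app.example.com/', data=payload)
    login.raise_for_status()

    # Reuse the same session (and its cookies) for the export URL,
    # streaming the CSV straight to disk.
    r = s.get('https://app.example.com/export/org.jsp',
              params={'media': 'yes', 'csv': 'yes'}, stream=True)
    r.raise_for_status()
    with open('export.csv', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)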
I'm trying to log in to a website via Python to print the info, so I don't have to keep logging in to multiple accounts manually.
In the tutorial I followed, the login form just had a username and password, but this one also posts extra WordPress form data.
Do the _wp attributes change with each login?
The code I use:
mffloginurl = 'https://myforexfunds.com/login-signup/'
mffsecureurl = 'https://myforexfunds.com/account-2'

payload = {
    'log': '*****@gmail.com',
    'pwd': '*****',
    'brandnestor_action': 'login',
    '_wpnonce': '9d1753c0b6',
    '_wp_http_referer': '/login-signup/'
}
r = requests.post(mffloginurl, data=payload)
print(r.text)
using the correct details of course, but it doesn't log in.
I tried with and without the extra WordPress elements, but it still just returns the sign-in page.
(screenshot: the Python output, showing different site addresses and login details)
Yeah, the nonce will change with every new visit to the page.
I would use requests.Session() so that it automatically stores session cookies and all that good stuff.
Do a session.get('some_login_page.com').
Parse the response content with BeautifulSoup to retrieve the nonce.
Then add that into the payload of your POST request when you log in.
A very quick and dirty example:
import requests
from bs4 import BeautifulSoup as bs

email = 'test@email.com'
password = 'password1234'
url = 'https://myforexfunds.com/account-2/'

# Start a session
with requests.Session() as session:
    # Send a GET request to the login page
    r = session.get(url)
    # Check if the request was successful
    if r.status_code != 200:
        print("Get Request Failed")
    # Parse the HTML content of the page
    soup = bs(r.content, 'lxml')
    # Extract the value of the nonce from the HTML
    nonce = soup.find(id='woocommerce-login-nonce')['value']
    # Set up the login form data
    params = {
        "username": email,
        "password": password,
        "woocommerce-login-nonce": nonce,
        "_wp_http_referer": "/account-2/",
        "login": "Log+in"
    }
    # Send a POST request with the login form data; form fields belong in
    # the request body (data=), not the query string (params=)
    r = session.post(url, data=params)
    # Check if the request was successful
    if r.status_code != 200:
        print("Login Failed")
Following the documentation of BeautifulSoup, I am trying to download a specific file from a webpage. First, I try to find the link that contains the file name:
import re
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.bancentral.gov.do/a/d/2538-mercado-cambiario")
parsed = BeautifulSoup(url.text, "html.parser")
link = parsed.find("a", text=re.compile("TASA_DOLAR_REFERENCIA_MC.xls"))
path = link.get('href')
print(f"{path}")
But with no success. Then, trying to print every link on that page, I get no links:
import re
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.bancentral.gov.do/a/d/2538-mercado-cambiario")
parsed = BeautifulSoup(url.text, "html.parser")
link = parsed.find_all('a')
for links in parsed.find_all("a href"):
    print(links.get('a href'))
It looks like the URL of the file is dynamic: it appends a ?v=123456789 parameter to the end of the URL, like a file version, which is why I need to download the file using the file name.
(E.g. https://cdn.bancentral.gov.do/documents/estadisticas/mercado-cambiario/documents/TASA_DOLAR_REFERENCIA_MC.xls?v=1612902983415)
Thanks.
Actually, you are dealing with a dynamic JavaScript page which is fully loaded via an XHR request once the page loads.
Below is a direct call to the back-end API, which identifies the request using the page id (2538); from there we can extract and download your desired URL.
import requests
from bs4 import BeautifulSoup


def main(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0'
    }
    with requests.Session() as req:
        req.headers.update(headers)
        data = {
            "id": "2538",
            "languageName": "es"
        }
        r = req.post(url, data=data)
        soup = BeautifulSoup(r.json()['result']['article']['content'], 'lxml')
        target = soup.select_one('a[href*=TASA_DOLAR_REFERENCIA_MC]')['href']
        r = req.get(target)
        with open('data.xls', 'wb') as f:
            f.write(r.content)


if __name__ == "__main__":
    main('https://www.bancentral.gov.do/Home/GetContentForRender')
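A note on the design: the a[href*=TASA_DOLAR_REFERENCIA_MC] selector matches the link by file name alone, and the extracted href already carries the current ?v= version suffix, so req.get(target) always downloads the latest version of the file without hard-coding the version number.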
I am getting the following error when I try to upload an image to the Imgur API.
b'{"data":{"error":"Invalid URL (<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1600x6495 at 0x10E726050>)","request":"\\/3\\/upload","method":"POST"},"success":false,"status":400}'
My code is given below. Client ID is redacted.
#!/usr/bin/env python3
import io
from PIL import Image
import requests
import json
import base64
url = "http://www.tallheights.com/wp-content/uploads/2016/06/background_purple.jpg"
r = requests.get(url)
image = Image.open(io.BytesIO(r.content))
imagestring = str(image)
url = 'https://api.imgur.com/3/upload'
body = {'type':'file','image': imagestring , 'name' : 'abc.jpeg'}
headers = {'Authorization': 'Client-ID <redacted>'}
req = requests.post(url, data=body, headers=headers)
print (req.content)
My code is in Python 3, and I am not using the official client library provided by Imgur, for two reasons:
1. The library provides only two options: (a) upload by specifying a URL and (b) upload from a local file. In my case, the image I want to upload is neither; it is an image processed by PIL, existing as a PIL object in memory, and I do not want to use the file system for this particular implementation.
2. A simple POST request to the API would do the job for me, and I want to avoid the dependency on the library and keep the package as light as possible.
This works:
# imports
import requests
import io
from PIL import Image

# Get the image from a request
img_response = requests.get(
    "https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTVYP3ZsF72FSKPzxJghYkjz_-a1op7YxBK45O0Y4nTjAQ7PZKD"
)

# Some Pillow resizing
img = Image.open(io.BytesIO(img_response.content))
img_width, img_height = img.size
crop = min(img.size)
square_img = img.crop(
    (
        (img_width - crop) // 2,
        (img_height - crop) // 2,
        (img_width + crop) // 2,
        (img_height + crop) // 2,
    )
)  # square_img is of type PIL Image

imgByteArr = io.BytesIO()
square_img.save(imgByteArr, format="PNG")
imgByteArr = imgByteArr.getvalue()

url = "https://api.imgur.com/3/image"
payload = {"image": imgByteArr}
headers = {"Authorization": "Client-ID xxxxxxxxx"}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text.encode("utf8"))
Checking the API documentation, it seems the /upload part of your URL was changed to /image:
https://apidocs.imgur.com/?version=latest#c85c9dfc-7487-4de2-9ecd-66f727cf3139
(see the "sample request" on the right side)
But it seems this is all deprecated in general, and the information on that very page contradicts itself.
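Since the question already imports base64 and wants to stay off the file system, an alternative sketch is to send the in-memory image base64-encoded with type set to base64 (the field names follow the Imgur v3 docs as I understand them; the client ID is a placeholder):

import io
import base64
import requests
from PIL import Image

# Fetch the image and keep it purely in memory as a PIL object.
r = requests.get("http://www.tallheights.com/wp-content/uploads/2016/06/background_purple.jpg")
img = Image.open(io.BytesIO(r.content))

# Serialize the PIL image to bytes, then base64-encode it -- no file system involved.
buf = io.BytesIO()
img.save(buf, format="JPEG")
b64 = base64.b64encode(buf.getvalue())

url = "https://api.imgur.com/3/image"
body = {"image": b64, "type": "base64", "name": "abc.jpeg"}
headers = {"Authorization": "Client-ID <redacted>"}
resp = requests.post(url, data=body, headers=headers)
print(resp.json())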
I'm using Python 3. For debugging purposes, I'd like to get the exact text that is sent to the remote server via a requests.post command. The object returned by requests.post contains the response text, but I'm looking for the request text.
For example ...
import requests
import json
headers = {'User-Agent': 'Mozilla/5.0'}
payload = json.dumps({'abc':'def','ghi':'jkl'})
url = 'http://example.com/site.php'
r = requests.post(url, headers=headers, data=payload)
How do I get the text of the exact request that is sent to url via this POST command?
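One way to see exactly what goes on the wire is to build and prepare the request yourself instead of letting requests.post do it internally; a sketch:

import requests
import json

headers = {'User-Agent': 'Mozilla/5.0'}
payload = json.dumps({'abc': 'def', 'ghi': 'jkl'})
url = 'http://example.com/site.php'

# A PreparedRequest holds the exact method, URL, headers and body
# that would be sent over the wire.
req = requests.Request('POST', url, headers=headers, data=payload)
prepared = req.prepare()

print(prepared.method, prepared.url)
print(prepared.headers)
print(prepared.body)

# The same object is also attached to every response after the fact:
# r = requests.post(url, headers=headers, data=payload)
# print(r.request.headers, r.request.body)

# Once inspected, the prepared request can be sent through a session.
with requests.Session() as s:
    r = s.send(prepared)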
I've got a list of ~3,000 URLs that I'm trying to turn into Google-shortened links. The idea is that the CSV has a list of links, and I want my code to output the shortened links in the column next to the original URLs.
I've been trying to modify the code found on this site, but I'm not skilled enough to get it to work.
Here's my code (I would not normally post an API key, but the person who originally asked this already posted it publicly on this site):
import json
import httplib2
import pandas as pd

df = pd.read_csv('Links_Test.csv')


def shorternUrl(my_URL):
    API_KEY = "AIzaSyCvhcU63u5OTnUsdYaCFtDkcutNm6lIEpw"
    apiUrl = 'https://www.googleapis.com/urlshortener/v1/url'
    longUrl = my_URL
    headers = {"Content-type": "application/json"}
    data = {"longUrl": longUrl}
    h = httplib2.Http('.cache')
    # httplib2 returns (response, content); the shortened link is in the content
    response, content = h.request(apiUrl, "POST", json.dumps(data), headers)
    return content


for url in df['URL']:
    x = shorternUrl(url)
    # Then I want it to write x into the column next to the original URL
But I only get errors at this point, and I haven't even started figuring out how to write the new URLs to the CSV file.
Here's some sample data:
URL
www.apple.com
www.google.com
www.microsoft.com
www.linux.org
Thank you for your help,
Me
I think the issue is that you did not include the API key in the request. By the way, the certifi package allows you to secure a connection to a link; you can get it using pip install certifi or pip install urllib3[secure].
Here I use my own API key, so you might want to replace it with yours.
from urllib3 import PoolManager
import json
import certifi

sampleURL = 'http://www.apple.com'
APIkey = 'AIzaSyD8F41CL3nJBpEf0avqdQELKO2n962VXpA'
APIurl = 'https://www.googleapis.com/urlshortener/v1/url?key=' + APIkey

# A PoolManager that verifies certificates against certifi's CA bundle
http = PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())


def shortenURL(url):
    data = {'key': APIkey, 'longUrl': url}
    response = http.request("POST", APIurl, body=json.dumps(data),
                            headers={'Content-Type': 'application/json'}).data.decode('utf-8')
    r = json.loads(response)
    return r['id']
The decoding part converts the response bytes into a string so that we can parse it as JSON and retrieve the data.
From there on, you can store the data in another column, and so on.
For the sampleURL, I got back https://goo.gl/nujb from the function.
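To tie this back to the question's CSV, something like the following should work (a sketch: it assumes the shortenURL function above and a 'URL' column as in the sample data):

import pandas as pd

df = pd.read_csv('Links_Test.csv')

# Shorten every URL and store the result in a new column next to the
# originals, then write the whole frame back out.
df['Short URL'] = df['URL'].apply(shortenURL)
df.to_csv('Links_Test_shortened.csv', index=False)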
I found a solution here:
https://pypi.python.org/pypi/pyshorteners
Example copied from linked page:
from pyshorteners import Shortener

url = 'http://www.google.com'
api_key = 'YOUR_API_KEY'
shortener = Shortener('Google', api_key=api_key)
print("My short url is {}".format(shortener.short(url)))