When I send a GET request via Postman, there is a "Send and Download" option; the downloaded file (an HTML file) opens in a browser.
I want to do the same in Python with the following code:
from requests import get as GET
import webbrowser
tokenUrl = '...'
tokenParams = {}
r = GET(tokenUrl,params = tokenParams)
print(r.text)
webbrowser.open(r.text)
However, the browser displays a "file not found" error.
Complete error message:
Firefox can’t find the file at currentpath/<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head> <meta HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"><script>function redirectOnLoad() {if (this.SfdcApp && this.SfdcApp.projectOneNavigator) { SfdcApp.projectOneNavigator.handleRedirect('https://login.salesforce.com/?ec=302&startURL=/setup/secur/RemoteAccessAuthorizationPage.apexp?source=CAAAAYYrFX9vMDAwMDAwMDAwMDAwMDAwAAAA8EXaTiy4GdL6nQvL92NfYNg7dzLv4kyxjuZcHKHgNWQQqNF3iKqic0U94dDZSTuDkdwQlxNr5Iqs91cDizSR_QSWdQfg7Gv8wEkB-WyB_2D-57w3Glc28imEFrJSi3M3Xzf6JxZHMkfwRE4T2w5lYTAie-_LbnGu_BDzvlBF-VsyQeufgzH0x36e1zmxy4ef0pp3jjYOtWY9TGpHmURyoroEYE7cBMRXbkXdmGt_umLePR4lBzpAlKGI1q06FoQePUNAn_VAxZfUu8bw3KtdIrsLMxbNQH6aEDv29tCZJiZR274d5BE_GgRX-4r54Sw-6QhLs0XkcSOKO1ASX2Rm8tL5o-9SGYtlpVY8o9be3HoKAz4_HVqqMj45uUCvt-cUXp3wELtsZJmPC-pzTV0pdYymHpm4QiWn7g14CNz5OVi0anBwBeeytE1z-zEsKWdrnIrJO1Gkzbst_SIewi61lrwa0ZSg5TJmSsq3JRTI_AavON_rFG98ZHgie6Ow6keDVlgBgb154xDI-I3mw8NzVCWbXDKXbisOvAMqbaURUf67_XMHjhU-4h-PXtLM-k3jJrccvsf7E-clBSBJzBsfvUA%3D'); } else if (window.location.replace){ window.location.replace('https://login.salesforce.com/?ec=302&startURL=/setup/secur/RemoteAccessAuthorizationPage.apexp?source=CAAAAYYrFX9vMDAwMDAwMDAwMDAwMDAwAAAA8EXaTiy4GdL6nQvL92NfYNg7dzLv4kyxjuZcHKHgNWQQqNF3iKqic0U94dDZSTuDkdwQlxNr5Iqs91cDizSR_QSWdQfg7Gv8wEkB-WyB_2D-57w3Glc28imEFrJSi3M3Xzf6JxZHMkfwRE4T2w5lYTAie-_LbnGu_BDzvlBF-VsyQeufgzH0x36e1zmxy4ef0pp3jjYOtWY9TGpHmURyoroEYE7cBMRXbkXdmGt_umLePR4lBzpAlKGI1q06FoQePUNAn_VAxZfUu8bw3KtdIrsLMxbNQH6aEDv29tCZJiZR274d5BE_GgRX-4r54Sw-6QhLs0XkcSOKO1ASX2Rm8tL5o-9SGYtlpVY8o9be3HoKAz4_HVqqMj45uUCvt-cUXp3wELtsZJmPC-pzTV0pdYymHpm4QiWn7g14CNz5OVi0anBwBeeytE1z-zEsKWdrnIrJO1Gkzbst_SIewi61lrwa0ZSg5TJmSsq3JRTI_AavON_rFG98ZHgie6Ow6keDVlgBgb154xDI-I3mw8NzVCWbXDKXbisOvAMqbaURUf67_XMHjhU-4h-PXtLM-k3jJrccvsf7E-clBSBJzBsfvUA%3D');} else {window.location.href 
='https://login.salesforce.com/?ec=302&startURL=/setup/secur/RemoteAccessAuthorizationPage.apexp?source=CAAAAYYrFX9vMDAwMDAwMDAwMDAwMDAwAAAA8EXaTiy4GdL6nQvL92NfYNg7dzLv4kyxjuZcHKHgNWQQqNF3iKqic0U94dDZSTuDkdwQlxNr5Iqs91cDizSR_QSWdQfg7Gv8wEkB-WyB_2D-57w3Glc28imEFrJSi3M3Xzf6JxZHMkfwRE4T2w5lYTAie-_LbnGu_BDzvlBF-VsyQeufgzH0x36e1zmxy4ef0pp3jjYOtWY9TGpHmURyoroEYE7cBMRXbkXdmGt_umLePR4lBzpAlKGI1q06FoQePUNAn_VAxZfUu8bw3KtdIrsLMxbNQH6aEDv29tCZJiZR274d5BE_GgRX-4r54Sw-6QhLs0XkcSOKO1ASX2Rm8tL5o-9SGYtlpVY8o9be3HoKAz4_HVqqMj45uUCvt-cUXp3wELtsZJmPC-pzTV0pdYymHpm4QiWn7g14CNz5OVi0anBwBeeytE1z-zEsKWdrnIrJO1Gkzbst_SIewi61lrwa0ZSg5TJmSsq3JRTI_AavON_rFG98ZHgie6Ow6keDVlgBgb154xDI-I3mw8NzVCWbXDKXbisOvAMqbaURUf67_XMHjhU-4h-PXtLM-k3jJrccvsf7E-clBSBJzBsfvUA%3D';} } redirectOnLoad();</script></head></html><!-- Body events --><script type="text/javascript">function bodyOnLoad(){if(window.PreferenceBits){window.PreferenceBits.prototype.csrfToken="null";};}function bodyOnBeforeUnload(){}function bodyOnFocus(){}function bodyOnUnload(){}</script></body></html>
I found the requests documentation on handling response content helpful in resolving the error.
The code now looks like this:
r = GET(tokenUrl,params = tokenParams)
with open('response.html', 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)
webbrowser.open('response.html')
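The first attempt failed because webbrowser.open expects a URL or a file path, not page markup, so the HTML has to be written to disk first. Here is a minimal sketch of that step, assuming the body is already in memory as bytes. The helper names save_html and show_in_browser and the sample markup are made up for illustration. It also converts the path to a file:// URI, which browsers tend to handle more predictably than a bare relative filename.

```python
import tempfile
import webbrowser
from pathlib import Path


def save_html(content: bytes, path: Path) -> str:
    """Write the response body to `path` and return a file:// URI for it."""
    path.write_bytes(content)
    # as_uri() requires an absolute path, so resolve it first.
    return path.resolve().as_uri()


def show_in_browser(uri: str) -> bool:
    """Ask the default browser to open the saved page."""
    return webbrowser.open(uri)


# Stand-in for r.content from the request above (hypothetical markup).
sample = b"<!DOCTYPE html><html><body><h1>Hello</h1></body></html>"
target = Path(tempfile.mkdtemp()) / "response.html"
uri = save_html(sample, target)
```

Calling show_in_browser(uri) afterwards should open the saved page. With a real response you would pass r.content instead of sample.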
I am trying to learn web scraping on asynchronous, JavaScript-heavy sites, and chose a real estate website for practice. As a first step, I ran the search by hand and came up with this URL:
CW_url = "https://www.cushmanwakefield.com/en/united-states/properties/invest/invest-property-search#q=Los%20angeles&sort=%40propertylastupdateddate%20descending&f:PropertyType=[Office,Warehouse%2FDistribution]&f:Country=[United%20States]&f:StateProvince=[CA]"
I then tried to write code to read the page using Beautiful Soup:
while iterations < 10:
    time.sleep(5)
    html = driver.execute_script("return document.documentElement.outerHTML")
    sel_soup = bs(html, 'html.parser')
    forsales = sel_soup.findAll("for sale")
    iterations += 1
    print (f'iteration {iterations} - forsales: {forsales}')
I also tried using requests-html:
from requests_html import HTMLSession, HTML
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
r = await asession.get(CW_url)
r.html.arender(wait = 5, sleep = 5)
r.text.find('for sale')
But this gives me -1, which means the text could not be found. r.text does give me a wall of HTML, and inside it there seems to be some JavaScript that has not run yet:
<script type="text/javascript">
    var endpointConfiguration = {
        itemUri: "sitecore://web/{34F7EE0A-4405-44D6-BF43-13BC99AE8AEE}?lang=en&ver=4",
        siteName: "CushmanWakefield",
        restEndpointUri: "/coveo/rest"
    };
    if (typeof (CoveoForSitecore) !== "undefined") {
        CoveoForSitecore.SearchEndpoint.configureSitecoreEndpoint(endpointConfiguration);
        CoveoForSitecore.version = "5.0.788.5";
        var context = document.getElementById("coveo3a949f41");
        if (!!context) {
            CoveoForSitecore.Context.configureContext(context);
        }
    }
</script>
I thought that because the URL contains all the search criteria, the site would make the fetch request, return the data, and generate the HTML. Apparently not! So what am I doing wrong, and how should I deal with this or similar sites? Ideally, one would replace the search criteria in CW_url and let the code retrieve and store the data.
I am using the Python code below to download a csv file from Databricks Filestore. Usually, files can be downloaded via the browser when kept in Filestore.
When I enter the URL of the file directly in my browser, the file downloads fine. But when I try to do the same via the code below, the downloaded file contains some HTML instead of the CSV (see further below).
Here is my Python code:
def download_from_dbfs_filestore(file):
    url ="https://databricks-hot-url/files/{0}".format(file)
    req = requests.get(url)
    req_content = req.content
    my_file = open(file,'wb')
    my_file.write(req_content)
    my_file.close()
Here is the HTML. It appears to be a login page, but I am not sure what to do from here:
<!doctype html><html><head><meta charset="utf-8"/>
<meta http-equiv="Content-Language" content="en"/>
<title>Databricks - Sign In</title><meta name="viewport" content="width=960"/>
<link rel="icon" type="image/png" href="/favicon.ico"/>
<meta http-equiv="content-type" content="text/html; charset=UTF8"/><link rel="icon" href="favicon.ico">
</head><body class="light-mode"><uses-legacy-bootstrap><div id="login-page">
</div></uses-legacy-bootstrap><script src="login/login.xxxxx.js"></script>
</body>
</html>
I solved the problem by reading the file through the DBFS REST API with a token and decoding the returned data with b64decode from the base64 module:
import base64
import requests
DOMAIN = <your databricks 'host' url>
TOKEN = <your databricks 'token'>
jsonbody = {"path": <your dbfs Filestore path>}
response = requests.get('https://%s/api/2.0/dbfs/read/' % (DOMAIN), headers={'Authorization': 'Bearer %s' % TOKEN},json=jsonbody )
if response.status_code == 200:
    csv=base64.b64decode(response.json()["data"]).decode('utf-8')
    print(csv)
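One caveat, as far as I understand the DBFS API: a single read call returns only a limited amount of data (about 1 MB), along with a bytes_read count, so larger files need to be fetched in pieces using offset and length. Below is a rough sketch of that loop. The fetch callable, the chunk size, and the fake in-memory "server" are all assumptions made so the example runs offline; check the field names against the API documentation before relying on them.

```python
import base64
from typing import Callable, Dict

# Assumption: stay comfortably below the per-call read limit.
CHUNK_SIZE = 512 * 1024


def read_whole_file(fetch: Callable[[int, int], Dict]) -> bytes:
    """Call fetch(offset, length) repeatedly and join the decoded chunks.

    `fetch` must return the parsed JSON of one read call, shaped like
    {"bytes_read": <int>, "data": <base64 string>}.
    """
    parts = []
    offset = 0
    while True:
        payload = fetch(offset, CHUNK_SIZE)
        bytes_read = payload.get("bytes_read", 0)
        if bytes_read == 0:
            break  # nothing left to read
        parts.append(base64.b64decode(payload["data"]))
        offset += bytes_read
    return b"".join(parts)


def make_fake_fetch(content: bytes) -> Callable[[int, int], Dict]:
    """Hypothetical stand-in for the HTTP call, serving bytes from memory."""
    def fetch(offset: int, length: int) -> Dict:
        piece = content[offset:offset + length]
        return {
            "bytes_read": len(piece),
            "data": base64.b64encode(piece).decode("ascii"),
        }
    return fetch


# Made-up CSV content, a bit over 1 MB, so several chunks are needed.
csv_bytes = b"id,name\n1,alpha\n2,beta\n" * 50_000
result = read_whole_file(make_fake_fetch(csv_bytes))
```

With the real API, fetch would presumably wrap the requests.get call above and add "offset" and "length" next to "path" in the JSON body.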
This is the first time I've encountered a site where it wouldn't 'allow me access' to the webpage. I'm not sure why and I can't figure out how to scrape from this website.
My attempt:
import requests
from bs4 import BeautifulSoup
def html(url):
    return BeautifulSoup(requests.get(url).content, "lxml")
url = "https://www.g2a.com/"
soup = html(url)
print(soup.prettify())
Output:
<html>
<head>
<title>
Access Denied
</title>
</head>
<body>
<h1>
Access Denied
</h1>
You don't have permission to access "http://www.g2a.com/" on this server.
<p>
Reference #18.4d24db17.1592006766.55d2bc1
</p>
</body>
</html>
I've looked into it for a while now and found that there is supposed to be some type of token (access, refresh, etc.).
There is also action="/search", but I wasn't sure what to do with just that.
The request needs to specify some HTTP headers (here, Accept-Language) for the site to return the page:
import requests
from bs4 import BeautifulSoup
headers = {'Accept-Language': 'en-US,en;q=0.5'}
def html(url):
    return BeautifulSoup(requests.get(url, headers=headers).content, "lxml")
url = "https://www.g2a.com/"
soup = html(url)
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html lang="en-us">
<head>
<link href="polyfill.g2a.com" rel="dns-prefetch"/>
<link href="images.g2a.com" rel="dns-prefetch"/>
<link href="id.g2a.com" rel="dns-prefetch"/>
<link href="plus.g2a.com" rel="dns-prefetch"/>
... and so on.
I created a simple Pyramid app from the quick tutorial. These are the files relevant to the question:
tutorial/__init__.py:
from pyramid.config import Configurator
def main(global_config, **settings):
    config = Configurator(settings=settings)
    config.include('pyramid_chameleon')
    config.add_route('home', '/')
    config.add_route('hello', '/howdy')
    config.add_static_view(name='static', path='tutorial:static')
    config.scan('.views')
    return config.make_wsgi_app()
tutorial/views/views.py:
from pyramid.view import (
    view_config,
    view_defaults
    )
@view_defaults(renderer='../templates/home.pt')
class TutorialViews:
    def __init__(self, request):
        self.request = request
    @view_config(route_name='home')
    def home(self):
        return {'name': 'Home View'}
    @view_config(route_name='hello')
    def hello(self):
        return {'name': 'Hello View'}
tutorial/templates/image.pt:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Quick Tutorial: ${name}</title>
<link rel="stylesheet"
href="${request.static_url('tutorial:static/app.css') }"/>
</head>
<body>
<h1>Hi ${name}</h1>
<img src="${filepath}">
</body>
</html>
I want to add a URL route to my app such that when a user goes to www.foobar.com/{filename} (for example: www.foobar.com/test.jpeg), the app shows the image from the absolute path /Users/s/Downloads/${filename} on the server, using the template tutorial/templates/image.pt above.
This is what I tried:
In the file tutorial/__init__.py, added config.add_route('image', '/foo/{filename}').
In the file development.ini, added the setting image_dir = /Users/s/Downloads.
In the file tutorial/views/views.py, added the following view image:
@view_config(route_name='image', renderer='../templates/image.pt')
def image(request):
    import os
    filename = request.matchdict.get('filename')
    filepath = os.path.join(request.registry.settings['image_dir'], filename)
    return {'name': 'Hello View', 'filepath': filepath}
However this does not work. How can I get absolute paths for assets working with Pyramid?
filepath needs to be a URL if you expect to use it as the src attribute of an img tag in your HTML, because the browser has to request the asset. You should generate the value with something like request.static_url(...) or request.route_url(...). This is the URL generation part of the problem.
The second part is actually serving file contents when the client/browser requests that URL. You can use a static view to support mapping the folder on disk to a URL structure in your application in the same way as add_static_view is doing. There is a chapter on this in the Pyramid docs at https://docs.pylonsproject.org/projects/pyramid/en/latest/narr/assets.html. There is also a section specifically on using absolute paths on disk. You can register multiple static views in your app if you feel it's necessary.
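To make the two parts concrete, here is a rough configuration sketch, not tested against the asker's project. It assumes the directory is hard-coded as IMAGE_DIR rather than read from development.ini, that the static view is named images, and that the renderer path matches the tutorial layout. Both add_static_view and request.static_url accept an absolute path on disk, and the value returned under filepath is a URL, so the existing image.pt template can stay as it is.

```python
from pyramid.config import Configurator
from pyramid.view import view_config

# Assumption: the directory from the question, hard-coded for brevity.
IMAGE_DIR = '/Users/s/Downloads'


@view_config(route_name='image', renderer='../templates/image.pt')
def image(request):
    filename = request.matchdict['filename']
    # Part 1 (URL generation): map the on-disk path to a URL served by
    # the static view registered in main() below.
    image_url = request.static_url(f'{IMAGE_DIR}/{filename}')
    return {'name': filename, 'filepath': image_url}


def main(global_config, **settings):
    config = Configurator(settings=settings)
    config.include('pyramid_chameleon')
    config.add_route('image', '/foo/{filename}')
    # Part 2 (serving): expose the absolute directory under /images/.
    config.add_static_view(name='images', path=IMAGE_DIR)
    config.scan()
    return config.make_wsgi_app()
```

With this in place, a request to /foo/test.jpeg should render the template with an img tag pointing at the static view, and the browser then fetches the file from there.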
Setup: I'm trying to create a simple D3 js demo with Python 3 handling requests. Following the steps below I expected to see an error or some kind of HTML generated when I click on my html page with my d3 method. Instead, I receive a blank page.
1) Create custom HTTPRequestHandler class to handle GET requests
from http.server import BaseHTTPRequestHandler, HTTPServer
import os
class HTTPRequestHandler(BaseHTTPRequestHandler):
    #handle GET command
    def do_GET(self):
        rootdir = r'C:\Current_Working_Directory' #file location
        try:
            if self.path.endswith('.json'):
                f = open(rootdir + self.path, 'rb') #open requested file
                #send code 200 response
                self.send_response(200)
                #send header first
                self.send_header('Content-type','applicaon/json')
                self.end_headers()
                #send file content to client
                self.wfile.write(f.read())
                f.close()
                return
        except IOError:
            self.send_error(404, 'file not found')
def run():
    print('http server is starting...')
    #ip and port of server
    server_address = ('127.0.0.1', 80)
    httpd = HTTPServer(server_address, HTTPRequestHandler)
    print('http server is running...')
    httpd.serve_forever()
if __name__ == '__main__':
    run()
2) Save the above file as server.py and run python server.py. Create some dummy JSON data stored in data.json in $rootdir, and create d3.html:
<!DOCTYPE html>
<html>
<head>
<title>D3 tutorial 10</title>
<script src=”http://d3js.org/d3.v3.min.js”></script>
</head>
<body>
<script>
d3.json(“dummy.json”, function (error,data) {
if (error) return console.warn(error);
console.log(data)
});
</script>
</body>
</html>
4) Open the web page, expecting to see an error or some indication that d3 has sent the GET request to the server. Instead, I get a blank page.