CSV file download from Databricks FileStore in Python not working

I am using the Python code below to download a CSV file from the Databricks FileStore. Usually, files kept in the FileStore can be downloaded via the browser.
When I enter the URL of the file directly in my browser, the file downloads fine. But when I try to do the same via the code below, the downloaded file contains not the CSV but some HTML (shown further below).
Here is my Python code:
import requests

def download_from_dbfs_filestore(file):
    url = "https://databricks-hot-url/files/{0}".format(file)
    req = requests.get(url)
    req_content = req.content
    my_file = open(file, 'wb')
    my_file.write(req_content)
    my_file.close()
Here is the HTML. It appears to reference a login page, but I am not sure what to do from here:
<!doctype html><html><head><meta charset="utf-8"/>
<meta http-equiv="Content-Language" content="en"/>
<title>Databricks - Sign In</title><meta name="viewport" content="width=960"/>
<link rel="icon" type="image/png" href="/favicon.ico"/>
<meta http-equiv="content-type" content="text/html; charset=UTF8"/><link rel="icon" href="favicon.ico">
</head><body class="light-mode"><uses-legacy-bootstrap><div id="login-page">
</div></uses-legacy-bootstrap><script src="login/login.xxxxx.js"></script>
</body>
</html>

Solved the problem by calling the DBFS REST API with a token and decoding the returned data with b64decode from the base64 module:
import base64
import requests

DOMAIN = "<your databricks 'host' url>"
TOKEN = "<your databricks 'token'>"
jsonbody = {"path": "<your dbfs Filestore path>"}
response = requests.get(
    'https://%s/api/2.0/dbfs/read/' % DOMAIN,
    headers={'Authorization': 'Bearer %s' % TOKEN},
    json=jsonbody,
)
if response.status_code == 200:
    # the API returns the file contents base64-encoded in the "data" field
    csv = base64.b64decode(response.json()["data"]).decode('utf-8')
    print(csv)
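The browser download presumably works because the browser carries a logged-in session, while a bare requests.get is unauthenticated and receives the sign-in page. The token-based call above fixes that, but as far as I know a single dbfs/read call returns only a limited amount of data (about 1 MB), so a larger CSV may need several calls. A minimal sketch of that loop, assuming the endpoint accepts offset and length and reports bytes_read; read_dbfs_file and its http_get parameter are hypothetical names introduced here:

```python
import base64


def read_dbfs_file(domain, token, path, http_get=None, chunk_size=1024 * 1024):
    """Read a whole DBFS file by calling dbfs/read repeatedly with a growing offset."""
    if http_get is None:
        import requests  # imported lazily so a stand-in can be passed in for testing
        http_get = requests.get

    url = "https://%s/api/2.0/dbfs/read/" % domain
    headers = {"Authorization": "Bearer %s" % token}
    chunks = []
    offset = 0
    while True:
        response = http_get(
            url,
            headers=headers,
            json={"path": path, "offset": offset, "length": chunk_size},
        )
        response.raise_for_status()
        payload = response.json()
        bytes_read = payload.get("bytes_read", 0)
        if bytes_read:
            # each chunk arrives base64-encoded in the "data" field
            chunks.append(base64.b64decode(payload["data"]))
        offset += bytes_read
        if bytes_read < chunk_size:
            break  # a short (or empty) read means the end of the file
    return b"".join(chunks)
```

The result is raw bytes, so it can be written to disk with open(name, 'wb') or decoded with .decode('utf-8') as in the answer above.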

Related

How to upload a file from an html page in S3 bucket using boto3 and lambda?

I want to upload a file larger than 10 MB from an HTML page to S3.
I have this HTML:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Upload File to S3 Bucket</title>
  </head>
  <body>
    <h3>Upload File to S3 Bucket</h3>
    <form
      action="https://4oss3kkck9.execute-api.us-east-1.amazonaws.com/api/uploadhtml"
      method="post"
      enctype="multipart/form-data"
    >
      <h1>Select File:</h1>
      <input type="file" name="file" />
      <input type="submit" name="submit" value="Upload" />
    </form>
  </body>
</html>
And I have a Lambda function that is triggered by an API endpoint.
Here is the Lambda:
import json
import base64
import boto3
import email
import logging
import botocore
from create_bucket import create_bucket

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    buckets = s3.list_buckets()
    bucket_name = ""
    bucket_without_extension = "filetypewithoutextension"
    print("Received event: " + json.dumps(event))
    # decoding form-data into bytes
    post_data = base64.b64decode(event["body"])
    # fetching content-type
    try:
        content_type = event["headers"]["content-Type"]
    except:
        content_type = event["headers"]["content-type"]
    # concatenate "Content-Type: " with content_type from the event
    ct = "Content-Type: "+content_type+"\n"
    # parsing message from bytes
    msg = email.message_from_bytes(ct.encode()+post_data)
    print(msg)
    # checking if the message is multipart
    print("Multipart check : ", msg.is_multipart())
    # if message is multipart
    if msg.is_multipart():
        multipart_content = {}
        # retrieving form-data
        for part in msg.get_payload():
            # checking if filename exists as a part of the content-disposition header
            if part.get_filename():
                # fetching the filename
                file_name = part.get_filename()
                bucket_name = f"thisismycomplicatedbucketfiletype-{file_name.split('.')[-1]}"
            multipart_content[part.get_param("name", header="content-disposition")] = part.get_payload(decode=True)
        in_bucket = False
        if len(file_name.split(".")) == 1 or file_name.split(".")[1] == "":
            s3.put_object(Bucket=bucket_without_extension, Key=file_name, Body=multipart_content["file"],
                          ServerSideEncryption="aws:kms")
            return {
                "statusCode": 200,
                "body": json.dumps("File uploaded successfully!")
            }
        for bucket in buckets["Buckets"]:
            if bucket_name in bucket["Name"]:
                s3.put_object(Bucket=bucket_name, Key=file_name, Body=multipart_content["file"],
                              ServerSideEncryption="aws:kms")
                in_bucket = True
                break
        if not in_bucket:
            create_bucket(bucket_name)
            s3.put_object(Bucket=bucket_name, Key=file_name, Body=multipart_content["file"],
                          ServerSideEncryption="aws:kms")
        # on upload success
        return {
            "statusCode": 200,
            "body": json.dumps("File uploaded successfully!")
        }
    else:
        # on upload failure
        return {
            "statusCode": 500,
            "body": json.dumps("Upload failed!")
        }
But this works only for files up to 10 MB. I want to upload files of any extension, both small files and files larger than 10 MB.
How can I modify this code? Presigned URLs? Multipart upload?
I'm not sure how to implement this.
Thank you!
"files bigger than 10 MB?"
Sadly, you can't do this through API Gateway: its payload limit is a hard 10 MB. The common workaround is to upload with S3 pre-signed URLs instead. This requires a change to your architecture, because the file itself has to be sent straight to S3 rather than through API Gateway.
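One way the pre-signed approach could look: the Lambda no longer receives the file at all and only hands out a short-lived pre-signed POST that the browser then uses to send the file directly to S3. This is a sketch under assumptions, not the asker's setup: the bucket name my-upload-bucket, the filename query parameter and the helper build_presigned_upload are all hypothetical. generate_presigned_post is the boto3 S3 client call that returns the target url and the form fields.

```python
import json


def build_presigned_upload(s3_client, bucket, key, expires_in=3600):
    """Return an API Gateway style response carrying a pre-signed POST for one object."""
    post = s3_client.generate_presigned_post(
        Bucket=bucket,
        Key=key,
        ExpiresIn=expires_in,  # seconds the pre-signed POST stays valid
    )
    # post is a dict with "url" (where to send the form) and "fields" (hidden form fields)
    return {"statusCode": 200, "body": json.dumps(post)}


def lambda_handler(event, context):
    import boto3  # imported lazily so build_presigned_upload can be tried without AWS

    params = event.get("queryStringParameters") or {}
    file_name = params.get("filename", "upload.bin")  # hypothetical query parameter
    # "my-upload-bucket" is a placeholder bucket name
    return build_presigned_upload(boto3.client("s3"), "my-upload-bucket", file_name)
```

The page would first call this endpoint, then submit a multipart/form-data POST to the returned url containing every entry of fields followed by the file input as the last field. Only the small JSON response passes through API Gateway, so its 10 MB limit no longer applies to the file.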

Getting meta property with beautifulsoup

I am trying to extract the Open Graph ("og") properties from a website. What I want is a list of all the meta tag properties in the document that start with "og".
What I've tried is:
soup.find_all("meta", property="og:")
and
soup.find_all("meta", property="og")
But it does not find anything unless I specify the complete tag.
A few examples are:
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:url"/>,
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:secure_url"/>,
<meta content="text/html" property="og:video:type"/>,
<meta content="1280" property="og:video:width"/>,
<meta content="720" property="og:video:height"/>
Expected output would be:
l = ["og:video:url", "og:video:secure_url", "og:video:type", "og:video:width", "og:video:height"]
How can I do this?
Thank you
Use a CSS selector that matches meta tags whose property attribute starts with "og:", i.e. meta[property^="og:"]:
metas = soup.select('meta[property^="og:"]')
propValue = [v['property'] for v in metas]
print(propValue)
Is this what you want?
from bs4 import BeautifulSoup
sample = """
<html>
<body>
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:url"/>,
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:secure_url"/>,
<meta content="text/html" property="og:video:type"/>,
<meta content="1280" property="og:video:width"/>,
<meta content="720" property="og:video:height"/>
</body>
</html>
"""
print([m["property"] for m in BeautifulSoup(sample, "html.parser").find_all("meta")])
Output:
['og:video:url', 'og:video:secure_url', 'og:video:type', 'og:video:width', 'og:video:height']
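You can check whether the property starts with "og:" as follows. The lambda also guards against meta tags that have no property attribute, for which BeautifulSoup passes None:
...
soup = BeautifulSoup(html, "html.parser")
og_elements = [
    tag["property"]
    for tag in soup.find_all("meta", property=lambda t: t and t.startswith("og:"))
]
print(og_elements)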

BeautifulSoup Access Denied parsing error [duplicate]

This is the first time I've encountered a site where it wouldn't 'allow me access' to the webpage. I'm not sure why and I can't figure out how to scrape from this website.
My attempt:
import requests
from bs4 import BeautifulSoup

def html(url):
    return BeautifulSoup(requests.get(url).content, "lxml")

url = "https://www.g2a.com/"
soup = html(url)
print(soup.prettify())
Output:
<html>
<head>
<title>
Access Denied
</title>
</head>
<body>
<h1>
Access Denied
</h1>
You don't have permission to access "http://www.g2a.com/" on this server.
<p>
Reference #18.4d24db17.1592006766.55d2bc1
</p>
</body>
</html>
I've looked into it for a while now and found that there is supposed to be some type of token (access, refresh, etc.).
There is also action="/search", but I wasn't sure what to do with just that.
The request needs to include some HTTP headers (here Accept-Language) for the server to return the page:
import requests
from bs4 import BeautifulSoup

headers = {'Accept-Language': 'en-US,en;q=0.5'}

def html(url):
    return BeautifulSoup(requests.get(url, headers=headers).content, "lxml")

url = "https://www.g2a.com/"
soup = html(url)
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html lang="en-us">
<head>
<link href="polyfill.g2a.com" rel="dns-prefetch"/>
<link href="images.g2a.com" rel="dns-prefetch"/>
<link href="id.g2a.com" rel="dns-prefetch"/>
<link href="plus.g2a.com" rel="dns-prefetch"/>
... and so on.

requests how to send a payload to login and confirm the post was successful

Hello, I am trying to log in to Snapchat with the requests library, just for practice, and I don't know whether the payload or data is correct.
Here's my code:
import requests

try:
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36'}
    url = 'https://accounts.snapchat.com/accounts/login?continue=https%3A%2F%2Faccounts.snapchat.com%2Faccounts%2Fwelcome'
    payload = {'username':'myusername', 'password':'mypassword'}
    r = requests.post(url, headers=headers, data=payload)
    print(r.text)
except requests.RequestException as exc:  # stand-in: the except clause was omitted from the original post
    print(exc)
Note: I have an except clause and everything runs smoothly; I just don't know whether my payload worked or how to tell if the login was successful. This is what happens when I run the script:
<!DOCTYPE html><html lang="en"><head><title>Log In • Snapchat</title><!-- Meta --><meta charset="utf-8" /><meta name="referrer" content="origin" /><meta name="apple-mobile-web-app-capable" content="no" /><meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=0" /><script>const PAGE_LOAD_START_TIME_MS = Date.now();</script><!-- Styles --><link rel="stylesheet" href="/accounts/static/styles/semantic.min.css" /><link rel="stylesheet" href="/accounts/static/styles/dropdown.min.css" /><!-- Force reload of css file --><link rel="stylesheet" href="/accounts/static/styles/snapchat.css?t=0" /><link rel="stylesheet" href="/accounts/static/styles/accounts.css" /><link rel="stylesheet" href="/accounts/static/styles/auth.css" /><link rel="stylesheet" href="/accounts/static/styles/revoke.css" /><!-- Scripts --><script src="/accounts/static/scripts/jquery.min.js"></script><script src="/accounts/static/scripts/semantic.min.js"></script><script src="/accounts/static/scripts/dropdown.min.js"></script><script src="/accounts/static/scripts/gtm.js"></script><script src="/accounts/static/scripts/accounts.js"></script><script src="/accounts/static/scripts/pixel.js"></script><!-- Favicon --><link rel="shortcut icon" href="/accounts/static/images/favicon/favicon.png" type="image/png" /><link rel="stylesheet" type="text/css" href="https://snapnet-cdn.storage.googleapis.com/fonts/avenir-next/avenirnext.font.css" /><script src="https://www.google.com/recaptcha/enterprise.js?hl=en-us&render=explicit" async defer></script></head><body><!-- Pusher is Needed for Top Navigation Menu --><div class="pusher"><div id="login-root" data-xsrf="Jvw-azBPshXgY3Q_KRiPtQ"data-continue="https://accounts.snapchat.com/accounts/welcome" data-is-dev="false"data-web-client-id="dea1ed5e-209a-42e0-9c55-e5b465acd3c2"data-business-accounts-enabled="true"></div><script 
src="/accounts/static/scripts/main.en-us.js?v=41f9d4f7dd0ac6bd3489ff005a59a9f3f02032607240c0204f40d1d829eaf7ac"></script></div><!-- End Pusher --></body></html>
Process finished with exit code 0
There's no API, so I can't use JSON to see if it works. Thanks in advance.
When you log in to Snapchat via Firefox or Chrome, you can look at the requests that are sent in the dev tools.
There is indeed a POST request to https://accounts.snapchat.com/accounts/login with username and password, like in your payload. But that's not all: there is also an xsrf_token and a g-recaptcha-response.
Sadly, you won't be able to spoof the g-recaptcha-response with the tools that are available to us mere mortals. So Snapchat won't accept your login even if everything else is correct.
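As for telling whether a POST "worked": the response printed in the question is simply the login page again, which suggests the login was not accepted. A rough sketch of how one could check for that and read the data-xsrf value that appears in that HTML. Both helpers are hypothetical, and the login-root check is only a heuristic based on the markup shown in the question; none of this gets around the reCAPTCHA.

```python
import re


def extract_xsrf_token(html):
    """Return the value of the data-xsrf attribute, or None if it is absent."""
    match = re.search(r'data-xsrf="([^"]*)"', html)
    return match.group(1) if match else None


def looks_like_login_page(html):
    """Heuristic: the login form is mounted on an element with id="login-root"."""
    return 'id="login-root"' in html
```

With the response from the question, looks_like_login_page(r.text) would be true, meaning the server answered with the sign-in form rather than the welcome page.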

Can you insert a string literal that contains unicode (like a Chinese character) into a Jinja2 template?

In my Python code (.py file), I set a session variable to Chinese characters:
session.mystring = '你好'
At first, Flask choked. So, at the very top of the file, I put:
# -*- coding: utf-8 -*-
That solved the problem. I could then display the string in my Flask app by just doing:
{{ session.mystring }}
Here's the weird problem. If I write the Chinese characters directly into the template (.html file), like so:
<!-- this works -->
<p>{{ session.mystring }}</p>
<!-- this doesn't -->
<p>你好</p>
The browser (Edge) displays the following error:
'utf-8' codec can't decode byte 0xb5 in position 516: invalid start byte
I tried putting the following at the top of the template:
<head>
<meta charset="utf-8">
</head>
But that doesn't help. Is there any way I can insert the Chinese characters directly as a literal in the Jinja2 template?
EDIT (June 7, 2020):
Made my question clearer by actually putting the Chinese characters into the question (thought I couldn't previously).
This should work by simply adding the line below to the HTML file:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
A working example with a minimal Flask app:
app.py:
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def test():
    data = '中华人民共和国'
    return render_template('index.html', data=data)

if __name__ == '__main__':
    app.run()
In templates/index.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Document</title>
</head>
<body>
{{ data }}
</body>
</html>
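One more thing that may be worth checking if the meta tag alone does not help: the "'utf-8' codec can't decode byte" message is raised by Python while reading the template file, which Jinja2 decodes as UTF-8 by default, so it can also appear when the editor saved the .html file in another encoding. A minimal sketch that rewrites such a file as UTF-8, assuming it was saved as GBK (an assumption; the actual encoding depends on the editor, and reencode_to_utf8 is a hypothetical helper):

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="gbk"):
    """Rewrite a file as UTF-8 if it is not already valid UTF-8.

    Returns True if the file was rewritten, False if it was left untouched.
    """
    file_path = Path(path)
    raw = file_path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass
    # source_encoding is a guess supplied by the caller
    text = raw.decode(source_encoding)
    file_path.write_bytes(text.encode("utf-8"))
    return True
```

For example, reencode_to_utf8("templates/index.html") would convert the template in place. Saving the file as UTF-8 from the editor achieves the same thing.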
Output: the page renders the string passed in as data.
