Pass uploaded file to a thread in Flask - multithreading

I have a web application where I want to upload files from the user to an S3 bucket. To achieve this I want the upload process to take place in a separate thread so I can properly start a loading animation. But whenever I pass the file to the thread I get the error:
I/O operation on closed file
This is the code where I get the file, pass it to the thread and start it.
# Dashboard from which I can upload the file
@application.route('/dashboard', methods=['POST'])
@login_required
def dashboard_post():
    file = request.form.get('input_file')
    worker_thread = Thread(target=upload_file, args=[file])
    worker_thread.start()
    return redirect(url_for('dashboard_upload'))

# Thread function
def upload_file(file):
    filename = secure_filename(file.filename)
    try:
        S3_CLIENT.upload_fileobj(
            file,
            <bucketname>,
            filename,
        )
    except Exception as e:
        print("Something Happened: ", e)
What do I have to change to be able to pass the file? I don't want to save the file on the host machine and upload it from there, because the files can get pretty big.
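One common workaround (not from the original post; a minimal sketch reusing the names above: application, login_required, S3_CLIENT, <bucketname>) is to detach the upload from the request before handing it to the thread. The Werkzeug file object is backed by a stream that Flask closes when the request ends, which is what triggers "I/O operation on closed file" inside the worker. Copying the bytes into an io.BytesIO keeps everything in memory, so it avoids the disk but does need RAM proportional to the file size:

# Sketch only: the field name 'input_file' and the S3 client/bucket are taken from
# the question; the key idea is to read the bytes while the request is still alive.
import io
from threading import Thread

from flask import redirect, request, url_for
from werkzeug.utils import secure_filename

@application.route('/dashboard', methods=['POST'])
@login_required
def dashboard_post():
    upload = request.files.get('input_file')      # uploaded files live in request.files
    filename = secure_filename(upload.filename)   # resolve the name before the stream closes
    buffer = io.BytesIO(upload.read())            # copy the bytes out of the request context
    Thread(target=upload_file, args=(buffer, filename)).start()
    return redirect(url_for('dashboard_upload'))

def upload_file(buffer, filename):
    try:
        S3_CLIENT.upload_fileobj(buffer, '<bucketname>', filename)
    except Exception as e:
        print("Something Happened: ", e)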

Related

How to write non blocking code using PyQT5 for uploading to google drive with PyDrive?

I am trying to upload data to Google Drive using pydrive on the click of a PyQt5 button. I want to display a message like "Data back up in progress...." in the status bar (or on a label).
However, I am getting the message only when the upload is completed. It seems the pydrive upload process blocks the PyQt process until the upload is completed.
How can I display the message during the upload process? Below is my code:
def __init__(self, *args, **kwargs):
    super(HomeScreen, self).__init__()
    loadUi('uiScreens/HomeScreen.ui', self)
    self.pushButton.clicked.connect(self.doDataBackup)

def doDataBackup(self):
    dbfile = "mumop.db"  # File to upload
    self.statusBar().showMessage("Data back up in progress....")  # This is blocked by pydrive
    print("Data backup is in progress.....")  # This test line is not blocked by pydrive
    upload.uploadToGoogleDrive(dbfile)
# This method is in another file
def uploadToGoogleDrive(file):
    gauth = GoogleAuth()
    gauth.LoadCredentialsFile("upload/mumopcreds.txt")
    if gauth.credentials is None:
        gauth.LocalWebserverAuth()
    elif gauth.access_token_expired:
        gauth.Refresh()
    else:
        gauth.Authorize()
    gauth.SaveCredentialsFile("upload/mumopcreds.txt")

    drive = GoogleDrive(gauth)
    file1 = drive.CreateFile()
    file1.SetContentFile(file)
    file1.Upload()
An easy way would be to add QApplication.processEvents() after self.statusBar().showMessage(...).
This should take care of all pending UI updates before you block the event loop with your Google Drive access.
A more sophisticated way would be to outsource the Google Drive access into another thread (see e.g. this tutorial), but maybe this is a bit of overkill for your use case.
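For illustration, a minimal sketch of the processEvents() variant, reusing the names from the question (statusBar, upload.uploadToGoogleDrive); the upload still blocks the GUI, the message just gets painted first:

from PyQt5.QtWidgets import QApplication

def doDataBackup(self):
    dbfile = "mumop.db"
    self.statusBar().showMessage("Data back up in progress....")
    QApplication.processEvents()          # flush pending paint events so the message is drawn
    upload.uploadToGoogleDrive(dbfile)    # still blocks the event loop until the upload finishes
    self.statusBar().showMessage("Data backup finished")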

Can't upload large files to Python + Flask in GCP App Engine

UPDATE: (5/18/2020) Solution at the end of this post!
I'm attempting to upload big CSV files (30MB - 2GB) from a browser to GCP App Engine running Python 3.7 + Flask, and then push those files to GCP Storage. This works fine on local testing with large files, but errors out immediately on GCP with a "413 - Your client issued a request that was too large" if the file is larger than roughly 20MB. This error happens instantly on upload before it even reaches my custom Python logic (I suspect App Engine is checking the Content-Length header). I tried many solutions after lots of SO/blog research to no avail. Note that I am using the basic/free App Engine setup with the F1 instance running the Gunicorn server.
First, I tried setting app.config['MAX_CONTENT_LENGTH'] = 2147483648 but that didn't change anything (SO post). My app still threw an error before it even reached my Python code:
# main.py
app.config['MAX_CONTENT_LENGTH'] = 2147483648  # 2GB limit

@app.route('/', methods=['POST', 'GET'])
def upload():
    # COULDN'T GET THIS FAR WITH A LARGE UPLOAD!!!
    if flask.request.method == 'POST':
        uploaded_file = flask.request.files.get('file')
        storage_client = storage.Client()
        storage_bucket = storage_client.get_bucket('my_uploads')
        blob = storage_bucket.blob(uploaded_file.filename)
        blob.upload_from_string(uploaded_file.read())

<!-- index.html -->
<form method="POST" action='/upload' enctype="multipart/form-data">
    <input type="file" name="file">
</form>
After further research, I switched to chunked uploads with Flask-Dropzone, hoping I could upload the data in batches then append/build-up the CSV files as a Storage Blob:
# main.py
app = flask.Flask(__name__)
app.config['MAX_CONTENT_LENGTH'] = 2147483648  # 2GB limit
dropzone = Dropzone(app)

@app.route('/', methods=['POST', 'GET'])
def upload():
    if flask.request.method == 'POST':
        uploaded_file = flask.request.files.get('file')
        storage_client = storage.Client()
        storage_bucket = storage_client.get_bucket('my_uploads')

        CHUNK_SIZE = 10485760  # 10MB
        blob = storage_bucket.blob(uploaded_file.filename, chunk_size=self.CHUNK_SIZE)

        # hoping for a create-if-not-exists then append thereafter
        blob.upload_from_string(uploaded_file.read())
And the JS/HTML is straight from a few samples I found online:
<script>
    Dropzone.options.myDropzone = {
        timeout: 300000,
        chunking: true,
        chunkSize: 10485760
    };
</script>
....
<form method="POST" action='/upload' class="dropzone dz-clickable"
      id="dropper" enctype="multipart/form-data">
</form>
The above does upload in chunks (I can see repeated calls to POST /upload), but, the call to blob.upload_from_string(uploaded_file.read()) just keeps replacing the blob contents with the last chunk uploaded instead of appending. This also doesn't work even if I strip out the chunk_size=self.CHUNK_SIZE parameter.
Next I looked at writing to /tmp then to Storage but the docs say writing to /tmp takes up the little memory I have, and the filesystem elsewhere is read-only, so neither of those will work.
Is there an append API or approved methodology to upload big files to GCP App Engine and push/stream to Storage? Given the code works on my local server (and happily uploads to GCP Storage), I'm assuming this is a built-in limitation in App Engine that needs to be worked around.
SOLUTION (5/18/2020)
I was able to use Flask-Dropzone to have JavaScript split the upload into many 10MB chunks and send those chunks one at a time to the Python server. On the Python side of things we'd keep appending to a file in /tmp to "build up" the contents until all chunks came in. Finally, on the last chunk we'd upload to GCP Storage then delete the /tmp file.
@app.route('/upload', methods=['POST'])
def upload():
    uploaded_file = flask.request.files.get('file')

    tmp_file_path = '/tmp/' + uploaded_file.filename
    with open(tmp_file_path, 'a') as f:
        f.write(uploaded_file.read().decode("UTF8"))

    chunk_index = int(flask.request.form.get('dzchunkindex')) if (flask.request.form.get('dzchunkindex') is not None) else 0
    chunk_count = int(flask.request.form.get('dztotalchunkcount')) if (flask.request.form.get('dztotalchunkcount') is not None) else 1

    if (chunk_index == (chunk_count - 1)):
        print('Saving file to storage')
        storage_bucket = storage_client.get_bucket('prairi_uploads')
        blob = storage_bucket.blob(uploaded_file.filename)  # CHUNK??
        blob.upload_from_filename(tmp_file_path, client=storage_client)
        print('Saved to Storage')

        print('Deleting temp file')
        os.remove(tmp_file_path)
<!-- index.html -->
<script>
    Dropzone.options.myDropzone = {
        ... // configs
        timeout: 300000,
        chunking: true,
        chunkSize: 1000000
    };
</script>
Note that /tmp shares resources with RAM, so you need at least as much RAM as the uploaded file size, plus more for Python itself (I had to use an F4 instance). I would imagine there's a better solution to write to block storage instead of /tmp, but I haven't gotten that far yet.
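A hypothetical variant of that idea (not from the post) is to skip /tmp entirely: upload each Dropzone chunk as its own temporary object and stitch them together server-side with Cloud Storage's compose(), so neither /tmp nor RAM ever has to hold the whole file. Bucket and form-field names are reused from the code above; note compose() accepts at most 32 sources per call, so very large files need iterative composes:

# Hypothetical sketch: each chunk becomes its own object, then compose() concatenates
# them server-side into the final object.
@app.route('/upload', methods=['POST'])
def upload():
    uploaded_file = flask.request.files.get('file')
    chunk_index = int(flask.request.form.get('dzchunkindex', 0))
    chunk_count = int(flask.request.form.get('dztotalchunkcount', 1))

    bucket = storage_client.get_bucket('prairi_uploads')
    part = bucket.blob(f'{uploaded_file.filename}.part{chunk_index}')
    part.upload_from_file(uploaded_file.stream)              # store this chunk as its own object

    if chunk_index == chunk_count - 1:
        parts = [bucket.blob(f'{uploaded_file.filename}.part{i}') for i in range(chunk_count)]
        bucket.blob(uploaded_file.filename).compose(parts)   # server-side concatenation, in order
        for part in parts:
            part.delete()                                    # clean up the temporary chunk objects
    return 'ok'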
The answer is that you cannot upload or download files larger than 32 MB in a single HTTP request. Source
You either need to redesign your service to transfer data in multiple HTTP requests, transfer data directly to Cloud Storage using Presigned URLs, or select a different service that does NOT use the Global Front End (GFE) such as Compute Engine. This excludes services such as Cloud Functions, Cloud Run, App Engine Flexible.
If you use multiple HTTP requests, you will need to manage memory as all temporary files are stored in memory. This means you will have issues as you approach the maximum instance size of 2 GB.
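As a rough sketch of the presigned-URL route (assumptions: the google-cloud-storage client library, the 'my_uploads' bucket from the question, and credentials that are allowed to sign URLs, e.g. via the IAM signBlob permission), App Engine only hands out a URL and the browser PUTs the file straight to Cloud Storage, so the 32 MB front-end limit never applies:

import datetime

import flask
from google.cloud import storage

app = flask.Flask(__name__)
storage_client = storage.Client()

@app.route('/signed-url', methods=['POST'])
def signed_url():
    filename = flask.request.form['filename']
    blob = storage_client.bucket('my_uploads').blob(filename)
    url = blob.generate_signed_url(
        version='v4',
        expiration=datetime.timedelta(minutes=15),
        method='PUT',
        content_type='text/csv',
    )
    # The browser then uploads directly, bypassing App Engine:
    #   fetch(url, {method: 'PUT', headers: {'Content-Type': 'text/csv'}, body: file})
    return flask.jsonify({'url': url})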

Uploading a JSON file with Pyrebase to Firebase Storage from Google App Engine

I have a pretty simple Flask web app running in GAE that downloads a JSON file from Firebase Storage and replaces it with an updated one if necessary. Everything works ok but GAE throws an IOError exception whenever I try to create a new file. I'm using Firebase Storage because I know it isn't possible to read/write files in a GAE environment, but how am I supposed to use the Pyrebase storage.child('foo.json').put('foo.json') function then? What am I doing wrong? Please, check my code below.
firebase_config = {my_firebase_config_dict}
pyrebase_app = pyrebase.initialize_app(firebase_config)
storage = pyrebase_app.storage()

@app.route('/')
def check_for_updates():
    try:
        json_feeds = json.loads(requests.get('http://my-firebase-storage-url/example.json').text)
        # Here I check if I need to update example.json
        # ...
        with open("example.json", "w") as file:
            json.dump(info, file)
            file.close()
        storage.child('example.json').put('example.json')
        return 'finished successfully!'
    except IOError:
        return "example.json doesn't exist"
If I understand correctly, you just need this file temporarily in GAE and put it in Cloud Storage afterwards. According to this doc you can do it as on a normal OS, but in the /tmp folder:
if your app only needs to write temporary files, you can use standard
Python 3.7 methods to write files to a directory named /tmp
I hope it will help!
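For what it's worth, a minimal sketch of that suggestion (reusing app, storage and info from the question; /tmp on App Engine standard is backed by RAM, which is fine for a small JSON file):

import json

@app.route('/')
def check_for_updates():
    tmp_path = '/tmp/example.json'
    with open(tmp_path, 'w') as f:
        json.dump(info, f)                        # write the updated JSON to the writable /tmp dir
    storage.child('example.json').put(tmp_path)   # then push the temp file with Pyrebase
    return 'finished successfully!'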
I finally did it like this, but I don't know if it is better, worse or simply equivalent to @vitooh's solution. Please, let me know:
firebase_config = {my_firebase_config_dict}
pyrebase_app = pyrebase.initialize_app(firebase_config)
storage = pyrebase_app.storage()

@app.route('/')
def check_for_updates():
    try:
        blob = bucket.blob('example.json')
        example = json.loads(blob.download_as_string())
        # Here I check if I need to update example.json
        # ...
        if something_changed:
            blob.upload_from_string(example, content_type='application/json')
        return 'finished successfully!'
    except IOError:
        return "example.json doesn't exist"

Truncating logging of Post Request in RobotFramework

I am using the Requests library of robot framework to upload files to a server. The file RequestsKeywords.py has a line
logger.info('Post Request using : alias=%s, uri=%s, data=%s, headers=%s, files=%s, allow_redirects=%s '
            % (alias, uri, dataStr, headers, files, redir))
This prints out the whole contents of my upload file inside the request in my log file. Now, I could get rid of this log by changing the log level; however, my goal is to be able to see the log but truncate it to 80 characters, so I am not browsing through lines of hex values. Any idea how this could be done?
A solution would be to create a wrapper method that temporarily lowers the logging and restores it once completed.
The flow is: get an instance of the RequestsLibrary, call RF's Set Log Level with argument "ERROR" (so at least an error gets through, if needed), call the original keyword, set the log level back to what it was, and return the result.
Here's how it looks in Python:
from robot.libraries.BuiltIn import BuiltIn

def post_request_no_log(*args, **kwargs):
    req_lib = BuiltIn().get_library_instance('RequestsLibrary')
    current_level = BuiltIn().set_log_level('ERROR')
    try:
        result = req_lib.post_request(*args, **kwargs)
    except Exception as ex:
        raise ex
    finally:
        BuiltIn().set_log_level(current_level)
    return result
And the same, in Robot Framework syntax:
Post Request With No Logging
    [Documentation]    Runs RequestsLibrary's Post Request, with its logging suppressed
    [Arguments]    @{args}    &{kwargs}
    ${current level}=    Set Log Level    ERROR
    ${result}=    Post Request    @{args}    &{kwargs}
    [Return]    ${result}
    [Teardown]    Set Log Level    ${current level}
The Python version is bound to be milliseconds faster - no need to parse and match the text in the RF syntax, which with heavy usage may add up.
Perhaps not the answer you're looking for, but after having looked at the source of the RequestsLibrary I think this is indeed undesirable and should be corrected. It makes sense to have the file contents when running in a debug or trace setting, but not during regular operation.
As I consider this a bug, I'd recommend registering an issue with the GitHub project page or correcting it yourself and providing a pull request. In my opinion the code should be refactored to send the file name under the info setting and the file contents under the trace/debug setting:
logger.info('Post Request using : alias=%s, uri=%s, data=%s, headers=%s, allow_redirects=%s' % ...
logger.trace('Post Request files : files=%s' % ...
In the meantime you have two options. As you correctly said, temporarily reduce the log level settings in Robot code. If you can't change the script, then using a Robot Framework listener can help with that. Granted, it would be more work than making the change in the RequestsLibrary yourself.
A temporary alternative could be to use the RequestsLibrary Post keyword, which is deprecated but still present.
If you look at the method in the RequestsKeywords library, it's only calling self._body_request() at the end. What we ended up doing was writing another keyword that was identical to the original, except for the part where it called logger.info(). We modified it to log files=%.80s, which truncates the file to 80 chars.
def post_request_truncated_logs(
        self,
        alias,
        uri,
        data=None,
        params=None,
        headers=None,
        files=None,
        allow_redirects=None,
        timeout=None):
    session = self._cache.switch(alias)
    if not files:
        data = self._format_data_according_to_header(session, data, headers)
    redir = True if allow_redirects is None else allow_redirects

    response = self._body_request(
        "post",
        session,
        uri,
        data,
        params,
        files,
        headers,
        redir,
        timeout)

    dataStr = self._format_data_to_log_string_according_to_header(data, headers)
    logger.info('Post Request using : alias=%s, uri=%s, data=%s, headers=%s, files=%.80s, allow_redirects=%s '
                % (alias, uri, dataStr, headers, files, redir))

FileSavePicker Contract Implementation

I have implemented the FileSavePicker contract in my app, so when the user selects an attachment from the Mail app and wants to save it to my app, the OnTargetFileRequested(FileSavePickerUI^ sender, TargetFileRequestedEventArgs^ e) method gets triggered.
void OnTargetFileRequested(FileSavePickerUI^ sender, TargetFileRequestedEventArgs^ e)
{
    auto request = e->Request;
    auto deferral = request->GetDeferral();

    create_task(ApplicationData::Current->LocalFolder->CreateFileAsync(sender->FileName, CreationCollisionOption::GenerateUniqueName)).then([request, deferral](StorageFile^ file)
    {
        // Assigning the resulting file to the targetFile property indicates success
        request->TargetFile = file;
        // Complete the deferral to let the Picker know the request is finished.
        deferral->Complete();
        return file;
    }).then([=](StorageFile^ file)
    {
        // here I will upload the file to my metro app
    });
}
Now, whatever file was created needs to be uploaded to my Metro app, but I am facing an issue with deferral->Complete(): should deferral->Complete() be called after uploading the file to my app, or is the placement above, inside the CreateFileAsync() continuation, correct?
But when I call deferral->Complete() after uploading the file, a 0-byte file always gets uploaded.
If I call deferral->Complete() in the CreateFileAsync() continuation as shown in the code above, the file does not get uploaded at all. Please help me.
Can you tell me whether this is the correct approach?
Thanks in advance.
You should call deferral->Complete() after the last async call in your method - the purpose of the deferral is to inform the caller that, even though the called method has returned, there is still an async action in progress. Once the deferral is completed, the caller knows everything is done.
So you should probably call deferral->Complete() after uploading the file or after copying the file to your cache. If no bytes are transferred, make sure you transfer the file correctly - you have to open the original file using OpenReadAsync and copy the stream either to a memory stream (not recommended for large files) or to a cache file or somewhere else, and then send it.