UPDATE: (5/18/2020) Solution at the end of this post!
I'm attempting to upload big CSV files (30MB - 2GB) from a browser to GCP App Engine running Python 3.7 + Flask, and then push those files to GCP Storage. This works fine in local testing with large files, but errors out immediately on GCP with a "413 - Your client issued a request that was too large" if the file is larger than roughly 20MB. The error happens instantly on upload, before the request even reaches my custom Python logic (I suspect App Engine is checking the Content-Length header). I tried many solutions after lots of SO/blog research, to no avail. Note that I am using the basic/free App Engine setup with the F1 instance running the Gunicorn server.
First, I tried setting app.config['MAX_CONTENT_LENGTH'] = 2147483648 but that didn't change anything (SO post). My app still threw an error before it even reached my Python code:
# main.py
import flask
from google.cloud import storage

app = flask.Flask(__name__)
app.config['MAX_CONTENT_LENGTH'] = 2147483648  # 2GB limit

@app.route('/', methods=['POST', 'GET'])
def upload():
    # COULDN'T GET THIS FAR WITH A LARGE UPLOAD!!!
    if flask.request.method == 'POST':
        uploaded_file = flask.request.files.get('file')
        storage_client = storage.Client()
        storage_bucket = storage_client.get_bucket('my_uploads')
        blob = storage_bucket.blob(uploaded_file.filename)
        blob.upload_from_string(uploaded_file.read())
<!-- index.html -->
<form method="POST" action='/upload' enctype="multipart/form-data">
<input type="file" name="file">
</form>
After further research, I switched to chunked uploads with Flask-Dropzone, hoping I could upload the data in batches then append/build-up the CSV files as a Storage Blob:
# main.py
import flask
from flask_dropzone import Dropzone
from google.cloud import storage

app = flask.Flask(__name__)
app.config['MAX_CONTENT_LENGTH'] = 2147483648  # 2GB limit
dropzone = Dropzone(app)

@app.route('/', methods=['POST', 'GET'])
def upload():
    if flask.request.method == 'POST':
        uploaded_file = flask.request.files.get('file')
        storage_client = storage.Client()
        storage_bucket = storage_client.get_bucket('my_uploads')

        CHUNK_SIZE = 10485760  # 10MB
        blob = storage_bucket.blob(uploaded_file.filename, chunk_size=CHUNK_SIZE)

        # hoping for a create-if-not-exists then append thereafter
        blob.upload_from_string(uploaded_file.read())
And the JS/HTML is straight from a few samples I found online:
<script>
Dropzone.options.myDropzone = {
    timeout: 300000,
    chunking: true,
    chunkSize: 10485760
};
</script>
....
<form method="POST" action='/upload' class="dropzone dz-clickable"
id="dropper" enctype="multipart/form-data">
</form>
The above does upload in chunks (I can see repeated calls to POST /upload), but the call to blob.upload_from_string(uploaded_file.read()) just keeps replacing the blob contents with the last chunk uploaded instead of appending. This also doesn't work even if I strip out the chunk_size=CHUNK_SIZE parameter.
Next, I looked at writing to /tmp and then to Storage, but the docs say writing to /tmp uses up the little memory I have, and the filesystem elsewhere is read-only, so neither of those will work.
Is there an append API or approved methodology to upload big files to GCP App Engine and push/stream to Storage? Given the code works on my local server (and happily uploads to GCP Storage), I'm assuming this is a built-in limitation in App Engine that needs to be worked around.
SOLUTION (5/18/2020)
I was able to use Flask-Dropzone to have JavaScript split the upload into many 10MB chunks and send those chunks one at a time to the Python server. On the Python side of things we'd keep appending to a file in /tmp to "build up" the contents until all chunks came in. Finally, on the last chunk we'd upload to GCP Storage then delete the /tmp file.
import os
import flask
from google.cloud import storage

app = flask.Flask(__name__)
storage_client = storage.Client()

@app.route('/upload', methods=['POST'])
def upload():
    uploaded_file = flask.request.files.get('file')

    # Append each incoming chunk to a temp file to "build up" the full upload
    tmp_file_path = '/tmp/' + uploaded_file.filename
    with open(tmp_file_path, 'a') as f:
        f.write(uploaded_file.read().decode("UTF8"))

    chunk_index = int(flask.request.form.get('dzchunkindex') or 0)
    chunk_count = int(flask.request.form.get('dztotalchunkcount') or 1)

    # On the last chunk, push the assembled file to Storage and clean up
    if chunk_index == chunk_count - 1:
        print('Saving file to Storage')
        storage_bucket = storage_client.get_bucket('prairi_uploads')
        blob = storage_bucket.blob(uploaded_file.filename)  # CHUNK??
        blob.upload_from_filename(tmp_file_path, client=storage_client)
        print('Saved to Storage')

        print('Deleting temp file')
        os.remove(tmp_file_path)

    return 'OK'
<!-- index.html -->
<script>
Dropzone.options.myDropzone = {
    ... // configs
    timeout: 300000,
    chunking: true,
    chunkSize: 1000000
};
</script>
Note that /tmp shares resources with RAM, so you need at least as much RAM as the uploaded file size, plus more for Python itself (I had to use an F4 instance). I would imagine there's a better solution that writes to block storage instead of /tmp, but I haven't gotten that far yet.
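A possible improvement I haven't tried yet (a sketch only, so treat the names and flow as assumptions): Cloud Storage has a compose() operation that concatenates up to 32 existing objects into a new one, so each Dropzone chunk could be uploaded as its own small object and stitched together on the last chunk, skipping the RAM-backed /tmp file entirely.

# Hedged sketch: one GCS object per chunk, composed into the final file on
# the last chunk. The bucket name and ".partNNNN" naming scheme are made up.
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('my_uploads')

def handle_chunk(filename, chunk_index, chunk_count, chunk_bytes):
    part = bucket.blob('{}.part{:04d}'.format(filename, chunk_index))
    part.upload_from_string(chunk_bytes)

    if chunk_index == chunk_count - 1:
        parts = [bucket.blob('{}.part{:04d}'.format(filename, i))
                 for i in range(chunk_count)]
        final_blob = bucket.blob(filename)
        final_blob.compose(parts)      # concatenate the parts (max 32 per compose call)
        bucket.delete_blobs(parts)     # drop the intermediate chunk objects

With 10MB chunks a single compose call tops out around 320MB, so a 2GB file would need composing in stages, but nothing ever has to fit in instance memory.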
The answer is that you cannot upload or download files larger than 32 MB in a single HTTP request. Source
You either need to redesign your service to transfer data in multiple HTTP requests, transfer data directly to Cloud Storage using presigned URLs, or select a different service that does NOT sit behind the Global Front End (GFE), such as Compute Engine. That last option rules out services such as Cloud Functions, Cloud Run, and App Engine Flexible.
If you use multiple HTTP requests, you will need to manage memory as all temporary files are stored in memory. This means you will have issues as you approach the maximum instance size of 2 GB.
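For the signed-URL option above, here is a rough sketch of the server-side piece with the google-cloud-storage client (bucket/object names are placeholders and I haven't verified this on every runtime): the app hands the browser a short-lived V4 signed PUT URL, and the browser uploads straight to the bucket, so the 32 MB front-end limit never applies.

# Hedged sketch: generate a V4 signed upload URL so the client bypasses App Engine.
import datetime
from google.cloud import storage

def make_upload_url(bucket_name, object_name):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    # Note: with App Engine default credentials you may need to supply explicit
    # service-account credentials (or IAM-based signing) for URL signing to work.
    return blob.generate_signed_url(
        version='v4',
        expiration=datetime.timedelta(minutes=15),
        method='PUT',
        content_type='text/csv',  # the client must send the same Content-Type header
    )

# The browser then uploads directly, e.g.:
# fetch(url, {method: 'PUT', body: file, headers: {'Content-Type': 'text/csv'}})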
Related
I have a storage bucket with a lot of large files (500mb each). At times I need to load multiple files, referenced by name. I have been using the blob.download_as_string() function to download the files one-by-one, but it's extremely slow so I would like to try and download them in parallel instead.
I found the gcloud-aio-storage package, however the documentation is a bit sparse, especially for the download function.
I would prefer to download / store the files in memory instead of downloading them to the local machine and then loading them into the script.
This is what I've pieced together, though I can't seem to get this to work. I keep getting a timeout error. What am I doing wrong?
Note: Using python 3.7, and latest of all other packages.
test_download.py
from gcloud.aio.storage import Storage
import aiohttp
import asyncio

async def gcs_download(session, bucket_name, file, storage):
    async with session:
        bucket = storage.get_bucket(bucket_name)
        blob = await bucket.get_blob(file)
        return await blob.download()

async def get_gcsfiles_async(bucket_name, gcs_files):
    async with aiohttp.ClientSession() as session:
        storage = Storage(session=session)
        coros = (gcs_download(session, bucket_name, file, storage) for file in gcs_files)
        return await asyncio.gather(*coros)
Then the way I'm calling / passing in values are as follows:
import test_download as test
import asyncio

bucket_name = 'my_bucket_name'
project_name = 'my_project_name'  ### Where do I reference this???

gcs_files = ['bucket_folder/some-file-2020-10-06.txt',
             'bucket_folder/some-file-2020-10-07.txt',
             'bucket_folder/some-file-2020-10-08.txt']

result = asyncio.run(test.get_gcsfiles_async(bucket_name, gcs_files))
Any help would be appreciated!
Here is a related question, although there are two things to note: Google Storage python api download in parallel
When I run the code from the approved answer it ends up getting stuck and never downloads
It's from before the gcloud-aio-storage package was released and might not be leveraging the "best" current methods.
It looks like the documentation for that library is lacking, but I could get something running, and it works in my tests. Something I found out by looking at the code is that you don't need to use blob.download(), since it calls storage.download() anyway. I based the script below on the usage section, which deals with uploads, but it can be rewritten for downloading. storage.download() does not write to a file, since that is done by storage.download_to_filename(). You can check the available download methods here.
async_download.py
import asyncio

from gcloud.aio.auth import Token
from gcloud.aio.storage import Storage

# Used a token from a service account for authentication
sa_token = Token(service_file="../resources/gcs-test-service-account.json",
                 scopes=["https://www.googleapis.com/auth/devstorage.full_control"])

async def async_download(bucket, obj_names):
    async with Storage(token=sa_token) as client:
        # Used the built-in download method, with its required args
        tasks = (client.download(bucket, file) for file in obj_names)
        res = await asyncio.gather(*tasks)
        await sa_token.close()
        return res
main.py
import async_download as dl_test
import asyncio

bucket_name = "my-bucket-name"
obj_names = [
    "text1.txt",
    "text2.txt",
    "text3.txt"
]

res = asyncio.run(dl_test.async_download(bucket_name, obj_names))
print(res)
If you want to use a Service Account Token instead, you can follow this guide and use the relevant auth scopes. Since Service Accounts are tied to a project, specifying a project is not needed, and I did not see any project-name references for a Session either. While the GCP Python library for GCS does not yet support parallel downloads, there is a feature request open for it, with no ETA for a release yet.
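In the meantime, if you'd rather not depend on gcloud-aio-storage at all, a plain thread pool over the standard google-cloud-storage client also downloads in parallel, since each call mostly waits on the network. A small sketch (same bucket/file names as in the question, not benchmarked):

# Hedged sketch: parallel in-memory downloads with threads and the standard client.
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

def download_all(bucket_name, blob_names, max_workers=8):
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    def fetch(name):
        # download_as_bytes() keeps the file contents in memory
        # (use download_as_string() on older client versions)
        return name, bucket.blob(name).download_as_bytes()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch, blob_names))

# files = download_all('my_bucket_name', gcs_files)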
I have a pretty simple Flask web app running in GAE that downloads a JSON file from Firebase Storage and replaces it with the updated one if necessary. Everything works OK, but GAE throws an IOError exception whenever I try to create a new file. I'm using Firebase Storage because I know it isn't possible to read/write files in a GAE environment, but how am I supposed to use Pyrebase's storage.child('foo.json').put('foo.json') function then? What am I doing wrong? Please check my code below.
import json
import flask
import pyrebase
import requests

app = flask.Flask(__name__)

firebase_config = {my_firebase_config_dict}
pyrebase_app = pyrebase.initialize_app(firebase_config)
storage = pyrebase_app.storage()

@app.route('/')
def check_for_updates():
    try:
        json_feeds = json.loads(requests.get('http://my-firebase-storage-url/example.json').text)

        # Here I check if I need to update example.json
        # ...

        with open("example.json", "w") as file:
            json.dump(info, file)

        storage.child('example.json').put('example.json')
        return 'finished successfully!'
    except IOError:
        return "example.json doesn't exist"
If I understand correctly, you just need this file temporarily in GAE and you put it in Cloud Storage afterwards. According to this doc, you can do it as on a normal OS, but only in the /tmp folder:
if your app only needs to write temporary files, you can use standard
Python 3.7 methods to write files to a directory named /tmp
I hope it will help!
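Concretely, that would look something like the sketch below; storage is the Pyrebase storage object from the question, the helper name is mine, and I haven't run this on App Engine:

# Hedged sketch: write the JSON to /tmp (the only writable path on App Engine
# standard), upload that temp file with Pyrebase, then clean up.
import json
import os

def save_to_firebase(info, storage):
    tmp_path = '/tmp/example.json'
    with open(tmp_path, 'w') as f:
        json.dump(info, f)
    storage.child('example.json').put(tmp_path)  # Pyrebase accepts a file path
    os.remove(tmp_path)                          # /tmp counts against instance RAM

Inside check_for_updates() you would call save_to_firebase(info, storage) in place of the open("example.json", "w") block.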
I finally did it like this, but I don't know if it is better, worse or simply equivalent to @vitooh's solution. Please let me know:
firebase_config = {my_firebase_config_dict}
pyrebase_app = pyrebase.initialize_app(firebase_config)
storage = pyrebase_app.storage()

@app.route('/')
def check_for_updates():
    try:
        # bucket is a google.cloud.storage bucket configured elsewhere in the app
        blob = bucket.blob('example.json')
        example = json.loads(blob.download_as_string())

        # Here I check if I need to update example.json
        # ...

        if something_changed:
            blob.upload_from_string(json.dumps(example), content_type='application/json')
        return 'finished successfully!'
    except IOError:
        return "example.json doesn't exist"
I've followed this tutorial to create a custom object YoloV3 Keras model:
https://momoky.space/pythonlessons/YOLOv3-object-detection-tutorial/tree/master/YOLOv3-custom-training
The model works perfectly fine; my next goal is to create a Python Flask API which is capable of processing an image after it is uploaded.
I've started modifying the code here for image detection.
This is my added code:
import io
import cv2
import jsonpickle
import numpy as np
from PIL import Image
from flask import request, Response

@app.route('/api/test', methods=['POST'])
def main():
    img = request.files["image"].read()
    img = Image.open(io.BytesIO(img))
    npimg = np.array(img)
    image = npimg.copy()
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # cv2.imshow("Image", image)
    # cv2.waitKey()
    cv2.imwrite('c:\\yolo\\temp.jpg', image)
    image = 'c:\\yolo\\temp.jpg'

    yolo = YOLO()
    r_image, ObjectsList = yolo.detect_img(image)
    # response = {ObjectsList}
    response_pikled = jsonpickle.encode(ObjectsList)
    # yolo.close_session()
    return Response(response=response_pikled, status=200, mimetype="application/json")

app.run(host="localhost", port=5000)
So my problem is that it works only on the first iteration; when I upload a new image I receive the following error:
File "C:\Users\xxx\Anaconda3\envs\yolo\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
run_metadata_ptr)
File "C:\Users\xxx\Anaconda3\envs\yolo\lib\site-packages\tensorflow\python\client\session.py", line 1095, in _run
'Cannot interpret feed_dict key as Tensor: ' + e.args[0])
TypeError: Cannot interpret feed_dict key as Tensor: Tensor Tensor("Placeholder:0", shape=(3, 3, 3, 32), dtype=float32) is not an element of this graph.
This is the original static part of the code:
if __name__ == "__main__":
    yolo = YOLO()
    image = 'test.png'
    r_image, ObjectsList = yolo.detect_img(image)
    print(ObjectsList)
    # cv2.imshow(image, r_image)
    cv2.imwrite('detect.png', r_image)
    yolo.close_session()
The thing that really confuses me is how to load the model when the application starts, and how to execute detection every time a new image is posted.
Thank you
UPDATE
In the constructor there's a reference to the Keras backend session:
def __init__(self, **kwargs):
    self.__dict__.update(self._defaults)  # set up default values
    self.__dict__.update(kwargs)          # and update with user overrides
    self.class_names = self._get_class()
    self.anchors = self._get_anchors()
    self.sess = K.get_session()
    self.boxes, self.scores, self.classes = self.generate()
After adding K.clear_session() it works for multiple serial requests:
@app.route('/api/test', methods=['POST'])
def main():
    img = request.files["image"].read()
    img = Image.open(io.BytesIO(img))
    npimg = np.array(img)
    image = npimg.copy()
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # cv2.imshow("Image", image)
    # cv2.waitKey()
    cv2.imwrite('c:\\yolo\\temp.jpg', image)
    image = 'c:\\yolo\\temp.jpg'

    yolo = YOLO()
    r_image, ObjectsList = yolo.detect_img(image)
    # response = {ObjectsList}
    response_pikled = jsonpickle.encode(ObjectsList)
    # yolo.close_session()
    K.clear_session()
    return Response(response=response_pikled, status=200, mimetype="application/json")
Would it be possible to avoid loading the model, anchors and classes at every computation, avoiding this:
logs/000/trained_weights_final.h5 model, anchors, and classes loaded.
127.0.0.1 - - [27/Dec/2019 22:58:49] "POST /api/test HTTP/1.1" 200 -
logs/000/trained_weights_final.h5 model, anchors, and classes loaded.
127.0.0.1 - - [27/Dec/2019 22:59:08] "POST /api/test HTTP/1.1" 200 -
logs/000/trained_weights_final.h5 model, anchors, and classes loaded.
127.0.0.1 - - [27/Dec/2019 22:59:33] "POST /api/test HTTP/1.1" 200 -
I've managed to get this up and running as a prototype. I've uploaded a repo: vulcan25/image_processor which implements this all.
The first thing I investigated was the functionality of the method YOLO.detect_img in that code from the tutorial. This method takes a filename, which is immediately handled by cv2.imread in the original code: #L152-L153. The returned data from this is then processed internally by self.detect_image (note the difference in spelling) and the result displayed with cv2.imshow.
This behaviour isn't good for a webapp and I wanted to keep everything in memory, so I figured the best way to change that functionality was to subclass YOLO and override the detect_img method, making it behave differently. So in processor/my_yolo.py I do something like:
import cv2
import numpy

from image_detect import YOLO as stock_yolo

class custom_yolo(stock_yolo):
    def detect_img(self, input_stream):
        image = cv2.imdecode(numpy.fromstring(input_stream, numpy.uint8), 1)
        original_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        original_image_color = cv2.cvtColor(original_image, cv2.COLOR_BGR2RGB)
        r_image, ObjectsList = self.detect_image(original_image_color)
        is_success, output_stream = cv2.imencode(".jpg", r_image)
        return is_success, output_stream
Note: In a later decision, I pulled image_detect.py into my repo to append K.clear_session(). It would have been possible to put the above mod in that file also, but I've stuck with subclassing for that part.
This accepts a stream, then uses cv2.imencode (source) and cv2.imdecode (source) in place of imshow and imread respectively.
We can now define a single function which will in turn run all the image processing stuff. This separates that part of the code (and its dependencies) from your flask app, which is good:
yolo = custom_yolo()  # create an object from the custom class defined above

def process(input_stream):
    start_time = time.time()
    is_success, output_stream = yolo.detect_img(input_stream)
    io_buf = io.BytesIO(output_stream)
    print("--- %s seconds ---" % (time.time() - start_time))
    return is_success, io_buf.read()
From Flask we can call this in the same way, where we already have the stream of the uploaded file available to us as: request.files['file'].read() (which is actually a method of the werkzeug.datastructures.FileStorage object, as I've documented elsewhere).
As a side-note, this function could be run from the terminal. If you're launching my repo you'd need to do this in the processor container (see the docker exec syntax at the end of my README)...
from my_yolo import process

with open('image.jpg', 'rb') as f:
    is_success, processed_data = process(f.read())

Then the result can be written to a file:

with open('processed_image.jpg', 'wb') as f:
    f.write(processed_data)
Note that my repo actually has two separate flask apps (based on another upload script I put together which implements dropzone.js on the frontend).
It can run in two modes:
processor/src/app.py: Accessible on port 5001, this runs process directly (incoming requests block until the processed data is returned).
flask/src/app.py: Accessible on port 5000, this creates an rq job for each incoming request, the processor container then runs as a worker to process these requests from the queue.
Each app has its own index.html which does its own unique implementation on the frontend. Mode (1) writes images straight to the page, mode (2) adds a link to the page, which when clicked leads to a separate endpoint that serves the image (when processed).
The major difference is how process is invoked. With mode (1) processor/src/app.py:
from my_yolo import process

if file and allowed_file(file.filename):
    # process the upload immediately
    input_data = file.read()
    complete, data = process(input_data)
As mentioned in a comment, I was seeing pretty fast conversions with this mode: ~1.6s per image on CPU. This script also uses a redis set to maintain a list of uploaded files, which can be used for validation on the view endpoint further down.
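For reference, that redis bookkeeping boils down to something like this sketch (the key name and connection details are my own placeholders, not necessarily what the repo uses):

# Hedged sketch of the redis set used to remember uploaded filenames, so the
# view endpoint can reject IDs it has never seen.
from redis import Redis

r = Redis(host='redis', port=6379)

def remember_upload(filename):
    r.sadd('uploaded_files', filename)  # 'uploaded_files' is a made-up key name

def is_known_upload(filename):
    return r.sismember('uploaded_files', filename)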
In mode (2) flask/src/app.py:
from qu import img_enqueue

if file and allowed_file(file.filename):
    input_data = file.read()
    job = img_enqueue(input_data)
    return jsonify({'url': url_for('view', job_id=job.id)})
I've implemented a separate file flask/src/qu.py which implements this img_enqueue function, which ultimately loads the process function from flask/src/my_yolo.py where it is defined as:
def process(data): pass
This is an important distinction. Normally with rq the contents of this function would be defined in the same codebase as the flask service. In fact, I've actually put the business logic in processor/src/my_yolo.py, which allows us to detach the container with the image processing dependencies and ultimately host it somewhere else, as long as it shares a redis connection with the flask service.
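For anyone who hasn't used rq, the enqueue helper amounts to something like the sketch below (the queue name and connection details are placeholders; the real version lives in flask/src/qu.py). Passing the worker function as a dotted string means the flask container never has to import the heavy image-processing dependencies:

# Hedged sketch of an rq enqueue helper: hand the raw bytes to a worker queue.
from redis import Redis
from rq import Queue

q = Queue('images', connection=Redis(host='redis', port=6379))

def img_enqueue(input_data):
    # 'my_yolo.process' is resolved inside the worker container, where the
    # real implementation (and its dependencies) actually live.
    return q.enqueue('my_yolo.process', input_data)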
Please have a look at the code in the repo for further info, and feel free to log an issue against that repo with any further queries (or if you get stuck). Be aware I may introduce breaking changes, so you may wish to fork.
I've tried to keep this pretty simple. In theory this could be edited slightly to support a different processor/Dockerfile that handles any processing workload, while keeping the same frontend that lets you submit any type of data from a stream: images, CSV, other text, etc.
The thing that really confuses me is how to load the model when the application starts, and how to execute detection every time a new image is posted. Thank you
You'll notice when you run mode (1) that the dependencies load when the flask server boots (~17 s) and individual image processing takes ~1 s. This is ideal, although it probably leads to higher overall memory usage on the server, as each WSGI worker requires all the dependencies loaded.
When run in mode (2), where processing is passed to rq workers, the libraries are loaded each time an image is processed, so it's much slower. I will try to fix that; I just need to investigate how to pre-load the libraries in the rq deployment. I was close to this before, but that was around the time I stumbled on the K.clear_session() problem, so I haven't had time to retest a fix for this (yet).
Inside the YOLO constructor try adding this:
from keras import backend as K
K.clear_session()
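To also address the "load the model once" part of the question, here is a hedged sketch (not from the repo above, and untested): build the YOLO object a single time when the Flask app starts and only run detection per request. It assumes the tutorial's image_detect.YOLO class and a TF1-style Keras backend; the temp filename is an arbitrary placeholder because the stock detect_img expects a path. Note that if you keep a single YOLO instance you should not call K.clear_session() per request, since that would tear down the loaded model.

# Hedged sketch: load weights/anchors/classes once at startup, reuse per request.
import io

import cv2
import jsonpickle
import numpy as np
import tensorflow as tf
from PIL import Image
from flask import Flask, Response, request

from image_detect import YOLO  # the tutorial's class

app = Flask(__name__)
yolo = YOLO()                   # heavy model/anchors/classes load happens once, at boot
graph = tf.get_default_graph()  # remember the graph the model was built in

@app.route('/api/test', methods=['POST'])
def detect():
    img = Image.open(io.BytesIO(request.files['image'].read()))
    image = cv2.cvtColor(np.array(img), cv2.COLOR_BGR2RGB)
    cv2.imwrite('temp.jpg', image)   # stock detect_img wants a filename
    with graph.as_default():         # avoids "is not an element of this graph"
        r_image, objects_list = yolo.detect_img('temp.jpg')
    return Response(response=jsonpickle.encode(objects_list),
                    status=200, mimetype='application/json')

if __name__ == '__main__':
    app.run(host='localhost', port=5000)

This avoids the per-request "model, anchors, and classes loaded" reload shown in the logs above, at the cost of keeping the model resident in each worker process.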
I'm using multer v1.3.0 with express v4.15.4.
I have used fileSize limits as shown below:
multer({
    storage: storage,
    limits: {fileSize: 1*1000*50, files: 1},
    fileFilter: function (req, file, cb) {
        if (_.indexOf(['image/jpeg', 'image/png'], file.mimetype) == -1)
            cb(null, false);
        else
            cb(null, true);
    }
});
In this case I don't think the limits are working, because I'm not getting any LIMIT_FILE_SIZE error.
If I remove fileFilter then I do get a LIMIT_FILE_SIZE error.
But in both cases the whole file gets uploaded first, and only then is the fileSize checked.
It's not good at all to upload a 1GB file and then check whether it's under 1MB.
So I want to know whether there is any way to stop the upload mid-stream when the file size limit is exceeded. I don't want to rely on Content-Length.
From looking through the multer and busboy source it looks like it calculates the data size as a running total and stops reading the stream as soon as the fileSize is reached. See the code here:
https://github.com/mscdex/busboy/blob/8f6752507491c0c9b01198fca626a9fe6f578092/lib/types/multipart.js#L216
So while 1GB might be uploaded it shouldn't be saved anywhere.
If you want to stop the upload from continuing I think you'd have to close the socket though I haven't experimented with this myself. I'm not aware of any way to respond cleanly while the request is still trying to upload data.
Related:
YES or NO: Can a server send an HTTP response, while still uploading the file from the correlative HTTP request?
You can use client-side JavaScript to prevent users from uploading the whole 1GB only to get a file-size-exceeded error. However, all client-side checks can be bypassed, so you should still enforce the file limit on the backend.
Your code is correct and should work as intended. I am guessing you are worried about the file getting uploaded anyway; there is no workaround for that, since Multer checks the size after the upload.
Here is the JavaScript you can add to your client-side code to stop someone from uploading an oversized file.
function ValidateSize(file) {
    const FileSize = file.files[0].size / 1024 / 1024;
    if (FileSize > 20) {
        alert('File size exceeds 20 MB');
        document.getElementById('formFile').value = null;
    }
}
<label for="formFile" class="form-label">Select file to upload (Max File Size: 20MB):</label>
<input class="form-control" type="file" onchange="ValidateSize(this)" id="formFile" name="file">
As you can see here:
https://github.com/visionmedia/express/blob/master/examples/multipart/index.js
Express supports file uploads by default and stores each uploaded file in the temp folder for later use.
My question is: Is it safe?
As I see it, an attacker could fill up the temp folder with garbage files without any control over it.
Should I check each POST request and delete any unused file?
Let me suggest two solutions to your problem.
Use a virtual drive for your upload location. If your server is running on Linux, it is very easy to mount a virtual file system which lives in memory only. Files will be placed there faster than on a real hard drive, and if you have problems like the one you describe, it is only a matter of cleaning out the virtual drive or restarting the server. Look at this article for an explanation of RAM disks.
Make sure that you only accept a maximum of x uploads from the same IP address during a 24 hour period. Combine this solution with solution 1 for maximum effect. One way of implementing this is to have a global object with upload counts for each IP address, and then clear it out every 24 hours.
var fs = require('fs');

var uploads = {};
setInterval(function(){
    uploads = {};
}, 24*60*60*1000); // Run every 24 hours

var onUpload = function(req, file){
    if(uploads[req.ip] > maxUploadsAllowedPrUser)
        fs.unlink(file, function(){}); // Delete the file
    else
        uploads[req.ip] = (uploads[req.ip] || 0) + 1; // Keep the file, and increase count
};