Process large files from S3 - node.js

I am trying to fetch a large file (>10 GB, stored as CSV) from S3 and send it to the client as a CSV attachment. I am doing it with the following code:
async getS3Object(params: any) {
  s3.getObject(params, function (err, data) {
    if (err) {
      console.log('Error Fetching File');
    } else {
      // Buffers the entire object in memory before sending it
      const csv = data.Body.toString('utf-8');
      res.setHeader('Content-disposition', `attachment; filename=${fileId}.csv`);
      res.set('Content-Type', 'text/csv');
      res.status(200).send(csv);
    }
  });
}
This takes painfully long to process the file and send it as a CSV attachment. How can I make this faster?

You're dealing with a huge file; you could break it into chunks using Range requests (see also the docs; search for "calling the getobject property"). If you need the whole file, you could split the work across workers, though at some point the limit will probably be your connection, and if you need to send the whole file as an attachment that won't help much.
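As a sketch of the Range idea (the helper name and chunk size below are my own, not from the AWS docs), you can compute the byte ranges up front and request each chunk via the `Range` parameter of `getObject`:

```javascript
// Compute inclusive HTTP Range headers covering an object of `totalSize` bytes.
function byteRanges(totalSize, chunkSize) {
  const ranges = [];
  for (let start = 0; start < totalSize; start += chunkSize) {
    const end = Math.min(start + chunkSize, totalSize) - 1; // ranges are inclusive
    ranges.push(`bytes=${start}-${end}`);
  }
  return ranges;
}

// Usage sketch (AWS SDK v2), fetching each chunk separately:
//   for (const range of byteRanges(objectSize, 64 * 1024 * 1024)) {
//     await s3.getObject({ ...params, Range: range }).promise();
//   }
```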
A better solution would be to never download the file in the first place. You can do this by streaming it from S3 (see also this, and this), or by setting up a proxy in your server so that the bucket/subdirectory appears to the client to be part of your app.
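A minimal streaming sketch, assuming the AWS SDK v2 client from the question (the `contentDisposition` helper name is mine): pipe the object's read stream straight into the response instead of buffering the whole body, so memory use stays flat regardless of file size.

```javascript
// Build the attachment header for a given file id.
function contentDisposition(fileId) {
  return `attachment; filename=${fileId}.csv`;
}

// Stream the S3 object directly to the Express response.
// `s3` is an AWS SDK v2 client; getObject(...).createReadStream()
// exposes the body as a Node readable stream.
function streamCsv(s3, params, fileId, res) {
  res.setHeader('Content-disposition', contentDisposition(fileId));
  res.set('Content-Type', 'text/csv');
  s3.getObject(params)
    .createReadStream()
    .on('error', () => res.status(500).end('Error Fetching File'))
    .pipe(res);
}
```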

If you run this on EC2, note that network performance varies with the instance type and size: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html
A bottleneck can happen at multiple places:
Network (bandwidth and latency)
CPU
Memory
Local Storage
You can check each of these; CloudWatch metrics are your friend here.
CPU is the easiest to observe and to scale up with a bigger instance size.
Memory is a bit harder to observe, but you should have enough memory to keep the document in RAM so the OS does not swap.
Local storage - I/O can be observed. If the business logic just parses a CSV file and writes the result to, say, another S3 bucket, with no need to keep the file locally, then EC2 instances with local instance storage can be used - https://aws.amazon.com/ec2/instance-types/ - see Storage Optimized.
Network - the EC2 instance size can be increased, or network-optimized instances can be used.
Network - the way you connect to S3 matters. Usually the best approach is to use an S3 VPC endpoint: https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html. The gateway option is free to use; adopting it removes the NAT gateway/NAT instance bottleneck and is also more secure.
Network - sometimes the S3 bucket is in one region and the compute is in another. S3 supports cross-region replication, so data can be brought closer to the compute: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html
Maybe some kind of APM monitoring and code instrumentation can show whether the code itself can also be optimized.
Thank you.

Related

ASP.NET Core Higher memory use uploading files to Azure Blob Storage SDK using v12 compared to v11

I am building a service with an endpoint that images and other files will be uploaded to, and I need to stream each file directly to Blob Storage. This service will handle hundreds of images per second, so I cannot buffer the images in memory before sending them to Blob Storage.
I was following the article here and ran into this comment
Next, using the latest version (v12) of the Azure Blob Storage libraries and a Stream upload method. Notice that it’s not much better than IFormFile! Although BlobStorageClient is the latest way to interact with blob storage, when I look at the memory snapshots of this operation it has internal buffers (at least, at the time of this writing) that cause it to not perform too well when used in this way.
But, using almost identical code and the previous library version that uses CloudBlockBlob instead of BlobClient, we can see a much better memory performance. The same file uploads result in a small increase (due to resource consumption that eventually goes back down with garbage collection), but nothing near the ~600MB consumption like above
I tried this and found that yes, v11 has considerably lower memory usage than v12! When I ran my tests with a ~10MB file, each new upload (after the initial POST) increased memory usage by 40MB on v12, while v11 increased by only 20MB.
I then tried a 100MB file. On v12 the memory jumped by roughly 100MB almost instantly on each request and slowly climbed after that, passing 700MB after my second upload. Meanwhile v11 didn't really jump in memory, though it would still climb slowly, ending around 430MB after the second upload.
I tried experimenting with the BlobUploadOptions properties (InitialTransferSize, MaximumConcurrency, etc.), but that only seemed to make it worse.
It seems unlikely that v12 would be straight up worse in performance than v11, so I am wondering what I could be missing or misunderstanding.
Thanks!
Sometimes this issue can occur with the Azure Blob Storage (v12) libraries.
Try uploading large files in chunks (a technique called file chunking, which breaks a large file into smaller pieces that are uploaded individually) instead of uploading the whole file. Please refer to this link.
I tried reproducing the scenario in my lab:
public void uploadfile()
{
    string connectionString = "connection string";
    string containerName = "fileuploaded";
    string blobName = "test";
    string filePath = "filepath";

    BlobContainerClient container = new BlobContainerClient(connectionString, containerName);
    container.CreateIfNotExists();

    // Get a reference to the blob named "test" in the container
    BlobClient blob = container.GetBlobClient(blobName);

    // Upload the local file
    blob.Upload(filePath);
}
The output after uploading the file:

Is there a memory limit for User Code Deployment on Hazelcast Cloud? (free version)

I'm currently playing with Hazelcast Cloud. My use case requires me to upload 50MB of jar file dependencies to the Hazelcast Cloud servers. I found that the upload seems to give up after about a minute or so. I get an upload rate of about 1MB a second; it drops after a while and then stops. I have repeated it a few times and the same thing happens.
Here is the config code I'm using:
ClientConfig config = new ClientConfig();

ClientUserCodeDeploymentConfig clientUserCodeDeploymentConfig =
        new ClientUserCodeDeploymentConfig();
// added many jars here...
clientUserCodeDeploymentConfig.addJar("jar dependency path..");
clientUserCodeDeploymentConfig.addJar("jar dependency path..");
clientUserCodeDeploymentConfig.addJar("jar dependency path..");
clientUserCodeDeploymentConfig.setEnabled(true);
config.setUserCodeDeploymentConfig(clientUserCodeDeploymentConfig);

ClientNetworkConfig networkConfig = new ClientNetworkConfig();
networkConfig.setConnectionTimeout(9999999); // i.e. don't time out
networkConfig.setConnectionAttemptPeriod(9999999); // i.e. don't time out
config.setNetworkConfig(networkConfig);
Any idea what's the cause, maybe there's a limit on the free cloud cluster?
I'd suggest using smaller jars, because this feature (client user code deployment) was designed for somewhat different use cases:
You have objects that run on the cluster via the clients such as Runnable, Callable and Entry Processors.
You have new or amended user domain objects (in-memory format of the IMap set to Object) which need to be deployed into the cluster.
Please see more info here.

Firebase Dynamic Pages using Cloud Functions with Firestore are slow

We have dynamic pages being served by Firebase Cloud Functions, but the TTFB on these pages is very slow, 900ms - 2s. At first we assumed it was a cold start issue, but even with consistent traffic it stays slow, at 700ms - 1.2s.
This is a bit problematic for our project since it depends on organic traffic, and Google PageSpeed wants a server response under 200ms.
Anyway, we tried to check what might be causing the issue and pinpointed it to Firestore: when a Cloud Function accesses Firestore, we noticed delays. This is a basic sample of how we use Cloud Functions and Firestore:
dynamicPages.get('/ph/test/:id', (req, res) => {
  var globalStartTime = Date.now();
  var period = [];
  db.collection("CollectionTest")
    .get()
    .then((querySnapshot) => {
      period.push(Date.now() - globalStartTime);
      console.log('1', period);
      return db.collection("CollectionTest")
        .get();
    })
    .then((querySnapshot) => {
      period.push(Date.now() - globalStartTime);
      console.log('2', period);
      res.status(200)
        .send('Period: ' + JSON.stringify(period));
      return true;
    })
    .catch((error) => {
      console.log(error);
      res.end();
      return false;
    });
});
This is running on Firebase + Cloud Functions + NodeJS
CollectionTest is very small with only 100 documents inside, with each document having the following fields:
directorName: (string)
directorProfileUrl: (string)
duration: (string)
genre: (array)
posterUrl: (string)
rating: (string)
releaseDate: (string)
status: (int)
synopsis: (string)
title: (string)
trailerId: (string)
urlId: (string)
With this test, we would get the following results:
[467,762] 1.52s
[203,315] 1.09s
[203,502] 1.15s
[191,297] 1.00s
[206,319] 1.03s
[161,267] 1.03s
[115,222] 843ms
[192,301] 940ms
[201,308] 945ms
[208,312] 950ms
This data is [Firestore Call 1 Execution Time, Firestore Call 2 Execution Time] TTFB
Looking at the results, there are signs that the TTFB gets lower over time; maybe that is once the Cloud Function has warmed up? Even so, Firestore eats up 200-300ms in the Cloud Function based on our second Firestore call, and even if Firestore took less time to execute, the TTFB would still be 600-800ms, but that is a different story.
Anyway, can anyone help how we can improve Firestore performance in our Cloud Functions (or if possible, the TTFB performance)? Maybe we are doing something obviously wrong that we don't know about?
I will try to help; maybe I lack a bit of context about what you load before returning dynamicPages, but here are some clues:
First of all, the obvious part (I have to point it anyway):
1 - Take care how you measure your TTFB:
Measuring TTFB remotely means you're also measuring the network
latency at the same time which obscures the thing TTFB is actually
measuring: how fast the web server is able to respond to a request.
2 - And from Google Developers documentation about Understanding Resource Timing (here):
[...]. Either:
Bad network conditions between client and server, or
A slowly responding server application
To address a high TTFB, first cut out as much network as possible.
Ideally, host the application locally and see if there is still a big
TTFB. If there is, then the application needs to be optimized for
response speed. This could mean optimizing database queries,
implementing a cache for certain portions of content, or modifying
your web server configuration. There are many reasons a backend can be
slow. You will need to do research into your software and figure out
what is not meeting your performance budget.
If the TTFB is low locally then the networks between your client and
the server are the problem. The network traversal could be hindered by
any number of things. There are a lot of points between clients and
servers and each one has its own connection limitations and could
cause a problem. The simplest method to test reducing this is to put
your application on another host and see if the TTFB improves.
Not so obvious ones:
You can take a look at the official Google documentation regarding Cloud Functions Performance here: https://cloud.google.com/functions/docs/bestpractices/tips
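One tip from that page worth calling out: create heavyweight clients in global scope so warm instances reuse them instead of reinitializing on every request. A minimal sketch of the pattern (the `initFirestore` factory is hypothetical, e.g. `() => admin.firestore()`; the names are mine):

```javascript
// Cache the Firestore client across warm invocations of the same
// function instance, so only the first request pays the init cost.
let cachedDb = null;

function getDb(initFirestore) {
  if (!cachedDb) {
    cachedDb = initFirestore(); // runs once per instance
  }
  return cachedDb;
}

// Inside the handler you would then call, e.g.:
//   const db = getDb(() => admin.firestore());
//   const snap = await db.collection('CollectionTest').get();
```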
Did you require some files before?
According to this answer to Firebase cloud functions is very slow:
Looks like a lot of these problems can be solved using the hidden
variable process.env.FUNCTION_NAME as seen here:
https://github.com/firebase/functions-samples/issues/170#issuecomment-323375462
Are these dynamic pages being accessed by a guest user or a logged-in user? Maybe the first request has to sort out the authentication details, which is known to be slower...
If nothing of this works, I will take a look at the common performance issues like DB connection (here: Optimize Database Performance), improving server configuration, cache all you can and take care of possible redirections in your app...
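For the "cache all you can" point, a tiny in-memory TTL cache (illustrative names; it is per-instance only, so it only helps warm instances) could sit in front of the Firestore query:

```javascript
// Minimal per-instance TTL cache: serves a cached value until it expires.
function makeTtlCache(ttlMs) {
  const entries = new Map();
  return {
    get(key) {
      const e = entries.get(key);
      if (!e || Date.now() > e.expires) return undefined;
      return e.value;
    },
    set(key, value) {
      entries.set(key, { value, expires: Date.now() + ttlMs });
    },
  };
}

// Usage sketch inside the page handler:
//   const cached = pageCache.get(req.path);
//   if (cached) return res.status(200).send(cached);
//   ...query Firestore, render, then pageCache.set(req.path, html)...
```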
Finally, searching around the Internet, there are a lot of threads about your problem (low performance on simple Cloud Functions), like this one: https://github.com/GoogleCloudPlatform/google-cloud-node/issues/2374 and on S.O.: https://stackoverflow.com/search?q=%5Bgoogle-cloud-functions%5D+slow
With comments like:
since when using cloud functions, the penalty is incurred on each http
invocation the overhead is still very high (i.e. 0.8s per HTTP call).
or:
Bear in mind that both Cloud Functions and Cloud Firestore are both in
beta and provide no guarantees for performance. I'm sure if you
compare performance with Realtime Database, you will see better
numbers.
Maybe it is still an issue.
Hope it helps!

Firebase Functions - ERROR, but no Event Message in Console

I have written a function on firebase that downloads an image (base64) from firebase storage and sends that as response to the user:
const functions = require('firebase-functions');
import os from 'os';
import path from 'path';
const storage = require('firebase-admin').storage().bucket();

export default functions.https.onRequest((req, res) => {
  const name = req.query.name;
  let destination = path.join(os.tmpdir(), 'image-randomNumber');
  return storage.file('postPictures/' + name).download({
    destination
  }).then(() => {
    res.set({
      'Content-Type': 'image/jpeg'
    });
    return res.status(200).sendFile(destination);
  });
});
My client calls that function multiple times in series to load a range of images for display, about 20, averaging 4KB each.
After 10 or so pictures have loaded (the amount varies), all further pictures fail. The reason is that my function does not respond correctly, and the Firebase console shows me that my function threw an error:
The above image shows that
A request to the function (called "PostPictureView") succeeds
Afterwards, three requests to the controller fail
In the end, after executing a new request to the "UserLogin" function, that fails as well
The response given to the client is the default "Error: Could not handle request". After waiting a few seconds, all requests get handled again as they are supposed to be.
My best guesses:
The project is on free tier, maybe google is throttling something? (I did not hit any limits afaik)
Is there a limit of messages the google firebase console can handle?
Could the tmpdir of the Functions app run low? I never delete the temporary files so far, but I would expect that Google either deletes them automatically or warns me in some other way that space is running low.
Does someone know an alternative way to receive the error messages, or has experienced similar issues? (As Firebase Functions is still in Beta, it could also be an error from google)
Btw: Downloading the image from the client (android app, react-native) directly is not possible, because I will use the function to check for access permissions later. The problem is reproducable for me.
In Cloud Functions, the /tmp directory is backed by memory. So every file you download there is effectively taking up memory on the server instance that ran the function.
Cloud Functions may reuse server instances for repeated calls to the same function. This means your function downloads another file (to that same instance) with each invocation. Since the file names are different each time, you accumulate files in /tmp that each occupy memory.
At some point, this server instance is going to run out of memory with all these files in /tmp. This is bad.
It's a best practice to always clean up files after you're done with them. Better yet, if you can stream the file content from Cloud Storage to the client, you'll use even less memory (and be billed even less for the memory-hours you use).
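A sketch of that streaming approach, assuming the same Admin SDK bucket object as in the question (the helper names are mine): pipe the file from Cloud Storage straight into the response so nothing lands in /tmp at all.

```javascript
// Build the storage path used in the question.
function storagePath(name) {
  return 'postPictures/' + name;
}

// Stream the file from Cloud Storage directly to the HTTP response.
// `bucket` is the Admin SDK bucket; createReadStream() avoids staging
// the file in /tmp, so instance memory stays flat across invocations.
function streamPicture(bucket, name, res) {
  res.set({ 'Content-Type': 'image/jpeg' });
  bucket.file(storagePath(name))
    .createReadStream()
    .on('error', () => res.status(500).end())
    .pipe(res);
}
```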
After some more research, I've found the solution: The Firebase Console seems to not show all error information.
For detailed information to your functions, and errors that might be omitted in the Firebase Console, check out the website from google cloud functions.
There I saw: the memory usage (as suggested by @Doug Stevenson) never went over 80MB (limit of 256MB) and never shut the server down. Moreover, there is a DNS resolution limit on the free tier, which my application hit.
The documentation points to a limit of 40,000 DNS resolutions per 100 seconds. In my case this limit was never hit - Firebase counts a total of around 8,000 executions - but it seems there is a lower, undocumented limit for the free tier. After upgrading my account (I started the trial that GCP offers, so I'm actually not paying anything) and linking the project to the billing account, everything works perfectly.

Azure Table Storage Performance

How fast should I be expecting the performance of Azure Storage to be? I'm seeing ~100ms on basic operations like getEntity, updateEntity, etc.
This guy seems to be getting 4ms which makes it look like something is really wrong here!
http://www.troyhunt.com/2013/12/working-with-154-million-records-on.html
I'm using the azure-table-node npm plugin.
https://www.npmjs.org/package/azure-table-node
A simple getEntity call is taking ~90ms:
exports.get = function(table, pk, rk, callback) {
  var start = process.hrtime();
  client().getEntity(table, pk, rk, function(err, entity) {
    console.log(prettyhr(process.hrtime(start)));
    ...
The azure-storage module appears to be even slower:
https://www.npmjs.org/package/azure-storage
var start = process.hrtime();
azureClient.retrieveEntity(table, pk, rk, function(err, entity) {
  console.log('retrieveEntity', prettyhr(process.hrtime(start)));
  ...

retrieveEntity 174 ms
Well, it really depends on where you are accessing Azure Storage from.
Are you trying to access the storage from the same data center, or just from somewhere on the Internet?
If your code is not running in the same data center, then it's simply a matter of network latency for an HTTP request to reach the data center where your storage runs. This can vary a lot, depending on where you access the DC from and which region your DC is located in. (To get an idea, you can check the latency from your PC to Azure Storage in all regions here: http://azurespeedtest.azurewebsites.net/)
If your code is running in the same data center, everything should be pretty fast for simple operations such as the ones you are trying out, probably just a few milliseconds.
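To compare the two setups, a small stdlib-only timing helper built on `process.hrtime` (like the snippets in the question; the names are mine) makes per-call latency easy to read:

```javascript
// Convert a process.hrtime() diff ([seconds, nanoseconds]) to milliseconds.
function hrtimeMs(diff) {
  return diff[0] * 1e3 + diff[1] / 1e6;
}

// Time an async call and report the elapsed milliseconds.
function timeCall(fn, done) {
  const start = process.hrtime();
  fn(() => done(hrtimeMs(process.hrtime(start))));
}

// Usage sketch with the azure-table-node client from the question:
//   timeCall(cb => client().getEntity(table, pk, rk, cb),
//            ms => console.log('getEntity took', ms.toFixed(1), 'ms'));
```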