Is it worthwhile to zip a big array before sending it to a client (web)?

At present the web site loads a big array (70,000 elements, each a line of text) from a file as a script.
Is it worth zipping it (which reduces the size from 2 MB by a factor of about 6) and unzipping it in the client?
If so, what is the simplest way to do it?
I don't know whether sending the larger payload takes more time than unzipping it does in typical cases.
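One common approach, sketched below purely for illustration (it is not from the original thread): serve the data with HTTP compression, so the browser unzips it transparently and no client-side unzip code is needed. The file name data.txt and port 8000 are stand-ins.

```python
# Minimal sketch: serve the array gzip-compressed with Content-Encoding: gzip,
# so the browser decompresses it automatically. "data.txt" is a placeholder.
import gzip
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ArrayHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        with open("data.txt", encoding="utf-8") as f:
            lines = f.read().splitlines()          # the ~70,000 elements
        body = gzip.compress(json.dumps(lines).encode("utf-8"))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), ArrayHandler).serve_forever()
```

In practice a production web server (nginx, Apache, or Express with compression middleware) can apply Content-Encoding: gzip to the existing script file directly, which is usually the simplest route.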

Related

Decompressing lots of small files and compressing them again for efficiency and to avoid S3 API costs

I have 1B+ gzip files (about 50 KB each) that I want to upload to S3. Since I pay for each write operation, transferring them to S3 one by one becomes a huge cost problem. Those files are also very similar, so I want to combine them into larger archives, which should improve the compression efficiency too.
I'm a newbie when it comes to writing shell scripts, but I'm looking for a way to:
Find all .gz files,
Decompress the first 1K of them,
Compress them into a single archive,
Delete that 1K batch,
Iterate to the next 1K files.
I'd appreciate any help in thinking about this more creatively. The only way I can see is decompressing all of them and then compressing them in 1K chunks, but that is not possible because I don't have the disk space to do it.
Test with a few files how much additional space is used when decompressing them, and try to make more free space (e.g. move 90% of the files to another host).
When the files are similar, the compression ratio of a batch holding 10% of the files will already be high.
I guess that 10 chunks would fit, but it will be tight every time you want to decompress one, so I would go for 100 chunks.
But first think about what you want to do with the data in the future.
Never use it? Delete it.
Perhaps once, in the far future? Glacier.
Often? Use smaller chunks so you can find the right file more easily.
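For the batching itself, here is a minimal Python sketch of the decompress-a-batch, repack, delete loop (the directory names, the batch size of 1,000 and the batch-NNNNNN.tar.gz naming are all assumptions made for illustration):

```python
# Sketch: repack small .gz files into larger .tar.gz batches so that
# similar files share one compression stream. Paths and names are made up.
import gzip
import io
import tarfile
from pathlib import Path

SRC_DIR = Path("gz_files")      # hypothetical location of the .gz files
OUT_DIR = Path("batches")       # hypothetical destination for batch archives
BATCH_SIZE = 1000

OUT_DIR.mkdir(parents=True, exist_ok=True)
gz_files = sorted(SRC_DIR.rglob("*.gz"))

for batch_no, start in enumerate(range(0, len(gz_files), BATCH_SIZE)):
    batch = gz_files[start:start + BATCH_SIZE]
    with tarfile.open(OUT_DIR / f"batch-{batch_no:06d}.tar.gz", "w:gz") as archive:
        for gz_path in batch:
            raw = gzip.decompress(gz_path.read_bytes())   # decompress in memory
            info = tarfile.TarInfo(name=gz_path.stem)     # drop the .gz suffix
            info.size = len(raw)
            archive.addfile(info, io.BytesIO(raw))
    for gz_path in batch:           # free the originals only after
        gz_path.unlink()            # the batch archive has been written
```

Because each file is decompressed in memory and the originals are deleted only after their batch archive has been written, the extra disk space needed at any moment is roughly one batch's worth.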

How to overcome the size limits of MongoDB and, in general, store, send and retrieve large documents?

Currently I work on an application that can send and retrieve arbitrarily large files. In the beginning we decided to use JSON for this because it is quite easy to handle and store. This works until images, videos or larger content in general comes in.
The way we currently do this gives us at least the following problems:
1 MB file size limit of Express. Solution
10 MB file size limit of axios. Solution
16 MB file size limit of MongoDB. No solution currently
So currently we are trying to overcome the limit of MongoDB, but in general it feels like we are on the wrong path. As the files get bigger there will be more and more limits that are harder to overcome, and maybe MongoDB's limit is not solvable at all. So is there a more efficient way to do this than what we currently do?
There is one more thing to mention. In general we need to load the whole object back together on the server side, to verify that the structure is the one we expect and to hash the whole object. So we have not considered splitting it so far, but maybe that is the only option left. But even then, how would you send videos or similarly big chunks?
If you need to store files bigger than 16 MB in MongoDB, you can use GridFS.
GridFS works by splitting your file into smaller chunks of data and storing them separately. When the file is needed, it gets reassembled and becomes available.
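As a rough illustration (not from the original answer), this is what GridFS usage looks like with pymongo; the connection string, database name and file name are placeholders:

```python
# Minimal GridFS sketch (assumes a reachable MongoDB instance and pymongo
# installed; database and file names are made up for illustration).
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["media_db"]            # hypothetical database name
fs = gridfs.GridFS(db)

# Store a file larger than 16 MB: GridFS splits it into small chunks.
with open("video.mp4", "rb") as f:
    file_id = fs.put(f, filename="video.mp4")

# Read it back: the chunks are reassembled transparently.
data = fs.get(file_id).read()
print(len(data), "bytes read back")
```

The default chunk size is 255 KB, so the 16 MB document limit applies to each chunk document rather than to the file as a whole.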

Zip File - is it possible to paginate through the file data?

Say I have a really large zip file (80 GB) containing one massive CSV file (> 200 GB).
Is it possible to fetch a subsection of the 80GB file data, modify the central directory, and extract just that bit of data?
Background on my problem:
I have a cyclic process that sums a certain column of a large zipped CSV file stashed in the cloud.
What I do today is stream the file to my disk, extract it and then stream the extracted file line by line. This makes it a very disk-bound operation; disk IS the bottleneck for sure.
Sure, I can leverage other cloud services to get what I need faster, but that is not free.
I'm curious whether I can see speed gains by taking 1 GB subsections of the zip at a time until there's nothing left to read.
What I know:
The Zip file is stored using the deflate compression algorithm (always)
In the API I use to get the file from the cloud, I can specify a byte range to fetch. This means I can seek through the bytes of a file without hitting disk!
According to the zip file spec, there are three major parts to a zip file, in order:
1: A header describing the file and its attributes
2: The raw file data in deflated format
3: The central directory listing which files start and stop at which bytes
What I don't know:
How the deflate algorithm works exactly. Does it jumble the file up or does it just compress things in order of the original file? If it does jumble, this approach may not be possible.
Has anyone built a tool like this already?
You can always decompress starting from the beginning, going as far as you like, keeping only the last, say, 1 GB, once you get to where you want. You cannot just start decompressing somewhere in the middle. At least not with a normal .zip file that has not been very specially prepared somehow for random access.
The central directory has nothing to do with random access of a single entry. All it can do is tell you where an entry starts and how long it is (both compressed and uncompressed).
I would recommend that you reprocess the .zip file into a .zip file with many (~200) entries, each on the order of 1 GB uncompressed. The resulting .zip file will be very close to the same size, but you can then use the central directory to pick one of the 200 entries, randomly access it, and decompress just that one.
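To make the recommendation concrete, here is a Python sketch of reading just one entry from such a repacked archive; the archive name repacked.zip, the entry names and the column index are assumptions:

```python
# Sketch: use the central directory of a repacked archive (many ~1 GB
# entries) to decompress only the entry you need. Names are made up.
import zipfile

with zipfile.ZipFile("repacked.zip") as archive:
    # The central directory records each entry's name and sizes.
    for info in archive.infolist()[:3]:
        print(info.filename, info.compress_size, info.file_size)

    # Decompress just one ~1 GB entry instead of the whole 200 GB CSV.
    total = 0.0
    with archive.open("part-042.csv") as part:
        for line in part:                       # streamed, line by line
            fields = line.decode("utf-8").rstrip("\r\n").split(",")
            total += float(fields[3])           # hypothetical numeric column,
                                                # assumes no header row
    print("column sum for this part:", total)
```

Combined with the byte-range API, you could fetch only the central directory plus the byte range of the one entry you need, since the directory records where each entry starts and how long its compressed data is.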

Faster way to split a very big file into smaller files?

I have a (relatively) small file of about 6.5 GB and I tried to split it into files of 5 MB each using split -d --line-bytes=5MB. It took me over 6 minutes to split this file.
I have files of over 1 TB.
Is there a faster way to do this?
Faster than a tool specifically designed to do this kind of job? Doesn't sound likely in the general case. However, there are a few things you may be able to do:
Save the output files to a different physical storage unit. This avoids reading and writing data to the same disk at the same time, allowing more uninterrupted processing.
If the record size is fixed, you can use --bytes to avoid the processing overhead of dealing with full lines (see the sketch below).
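As a rough illustration of why --bytes is cheaper (a Python sketch, not a replacement for GNU split): copying fixed-size byte chunks never has to scan the data for newline boundaries.

```python
# Illustration only: byte-based splitting just copies fixed-size chunks,
# with no scanning for line boundaries. File names are placeholders.
def split_by_bytes(path: str, prefix: str, chunk_size: int = 5_000_000) -> None:
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)   # read the next 5 MB verbatim
            if not chunk:
                break
            with open(f"{prefix}{index:05d}", "wb") as dst:
                dst.write(chunk)
            index += 1

split_by_bytes("bigfile.bin", "part-")
```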

What's the best method to fetch huge files from the web server using C#?

Hi, I have a spec to fetch files from a server and predict which files in the directory are unused. In this situation I am going to fetch the files from the server and it will return huge files; the problem is that the CPU usage increases while I am fetching the large files, and I would like to avoid that. If anyone knows how to avoid this situation, please share it with me; it would be very helpful.
Thanks
You can split your large file on the server into several smaller pieces, fetch some metadata about the number of pieces, their sizes, etc., and then fetch the pieces one by one from your client C# code, joining them in binary mode back into the larger file.
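The question is about C#, but the piece-by-piece idea is language-agnostic. Here is a hedged Python sketch using HTTP Range requests; it assumes the server supports Range and reports Content-Length, and the URL and chunk size are made up:

```python
# Sketch: download a huge file in pieces via HTTP Range requests and
# append the pieces in binary mode. URL and sizes are placeholders.
import urllib.request

URL = "https://example.com/huge-file.bin"   # hypothetical file
CHUNK_SIZE = 8 * 1024 * 1024                # fetch 8 MB per request

def fetch_in_pieces(url: str, out_path: str) -> None:
    # Ask only for the headers first to learn the total size.
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        total = int(resp.headers["Content-Length"])

    with open(out_path, "wb") as out:
        for start in range(0, total, CHUNK_SIZE):
            end = min(start + CHUNK_SIZE, total) - 1
            req = urllib.request.Request(
                url, headers={"Range": f"bytes={start}-{end}"}
            )
            with urllib.request.urlopen(req) as resp:
                out.write(resp.read())      # append each piece in binary mode

fetch_in_pieces(URL, "huge-file.bin")
```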
