How to split a big file faster? - linux

I have a 6 TB file on an AWS EC2 instance, and I want to split it into multiple 1 TB files so that it can be uploaded to an AWS S3 bucket.
I use this command:
split -b1T -d myfile myfil.
but it runs so slowly that after 1 hour, only 60 GB had been split out.
How can I make it faster? Or is there another way to split binary files more quickly?
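One way to sidestep the slow split step entirely, sketched below on the assumption that the goal is only to land the data in S3 in objects under the 5 TB limit: stream fixed-size chunks of the file straight into separate S3 multipart uploads with boto3, so no split files are written to disk first. The bucket name, key pattern, and sizes are placeholders.
import boto3

BUCKET = "my-bucket"        # placeholder bucket name
SRC = "myfile"              # the 6 TB source file
CHUNK = 1024 ** 4           # 1 TiB per S3 object
PART = 128 * 1024 ** 2      # 128 MiB per multipart part (8192 parts per object)

s3 = boto3.client("s3")

with open(SRC, "rb") as f:
    index = 0
    while True:
        key = f"myfile.part{index:02d}"
        upload = s3.create_multipart_upload(Bucket=BUCKET, Key=key)
        parts, sent, part_no = [], 0, 1
        while sent < CHUNK:
            data = f.read(min(PART, CHUNK - sent))   # one part held in memory
            if not data:
                break
            resp = s3.upload_part(Bucket=BUCKET, Key=key,
                                  UploadId=upload["UploadId"],
                                  PartNumber=part_no, Body=data)
            parts.append({"ETag": resp["ETag"], "PartNumber": part_no})
            sent += len(data)
            part_no += 1
        if not parts:                                # source file exhausted
            s3.abort_multipart_upload(Bucket=BUCKET, Key=key,
                                      UploadId=upload["UploadId"])
            break
        s3.complete_multipart_upload(Bucket=BUCKET, Key=key,
                                     UploadId=upload["UploadId"],
                                     MultipartUpload={"Parts": parts})
        index += 1
        if sent < CHUNK:                             # short final chunk: done
            break
This reads the source exactly once and never rewrites it locally, so the extra full disk write that split performs is avoided.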

Related

Dealing with large number of small json files using pyspark

I have around 376K JSON files under a directory in S3. The files are 2.5 KB each and contain only a single record per file. When I tried to load the entire directory via Glue ETL with 20 workers using the code below:
spark.read.json("path")
It just didn't run; it timed out after 5 hours. So I wrote a shell script to merge the records from these files into a single file, and when I tried to load that file, it displayed only a single record. The merged file is 980 MB. It worked fine when tested locally with 4 records merged into a single file: it displayed 4 records as expected.
I used the command below to append the JSON records from the different files into a single file:
for f in Agent/*.txt; do cat ${f} >> merged.json;done;
The data doesn't have any nested JSON. I even tried the multiline option, but it didn't work. So what can be done in this case? My guess is that after the merge, the records aren't being treated as separate records, which causes the issue. I even tried head -n 10 to display the top 10 lines, but it goes into an infinite loop.
The problem was with the shell script I was using to merge the many small files. After the merge, the records weren't aligned properly, so they weren't treated as separate records.
Since I was dealing with a JSON dataset, I used the jq utility to process it. Below is the shell command that merges a large number of records into one file quickly:
find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt
Later on, I was able to load the JSON records as expected with the below code:
spark.read.option("multiline","true").json("path")
I have run into trouble in the past working with thousands of small files. In my case they were CSV files, not JSON. One thing I did to debug was to write a for loop that loaded smaller batches and then combined all the DataFrames together. During each iteration I would call an action to force execution. I would log the progress to get an idea of whether it was still making progress, and monitor how much it slowed down as the job went on.
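A minimal sketch of that batched approach, assuming Spark accepts a list of S3 paths and using a made-up bucket, prefix, and batch size:
import boto3
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List the object keys under the prefix first (a manifest file would also work).
s3 = boto3.client("s3")
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-bucket", Prefix="Agent/"):
    keys += [f"s3://my-bucket/{obj['Key']}" for obj in page.get("Contents", [])]

BATCH = 5000
frames = []
for start in range(0, len(keys), BATCH):
    df = spark.read.json(keys[start:start + BATCH])          # a list of paths is accepted
    print(f"batch {start // BATCH}: {df.count()} records")   # the action forces execution
    frames.append(df)

merged = reduce(lambda a, b: a.unionByName(b), frames)       # combine all batches
The count() per batch is only there to surface progress while debugging; drop it once the job behaves.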

AWS Lambda: how to give ffmpeg large files?

Scenario:
Using AWS Lambda (Node.js), I want to process large files from S3 (> 1 GB).
The /tmp fs limit of 512MB means that I can't copy the S3 input there.
I can certainly increase the Lambda memory space, in order to read in the files.
Do I pass the memory buffer to ffmpeg? (node.js, how?)
Or....should I just make an EFS mount point and use that as the transcoding scratchpad?
You can just use the HTTP(s) protocol as input for ffmpeg.
Lambda has a maximum memory limit of 10 GB, and the data transfer speed from S3 was around 300 MB per second the last time I tested. So if your videos are at most about 1 GB and you are not doing a memory-intensive transformation, this approach should work fine:
ffmpeg -i "https://public-qk.s3.ap-southeast-1.amazonaws.com/sample.mp4" -ss 00:00:10 -vframes 1 -f image2 "image%03d.jpg"
ffmpeg works on files, so an alternative might be to set up a Unix pipe and then read that pipe with ffmpeg, constantly feeding it from the S3 stream.
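For illustration only (the question itself uses Node.js), here is a Python sketch of that streaming idea, with a placeholder bucket, key, and ffmpeg invocation:
import subprocess
import boto3

s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-bucket", Key="videos/big.mp4")["Body"]

proc = subprocess.Popen(
    ["ffmpeg", "-i", "pipe:0",              # read the input container from stdin
     "-ss", "00:00:10", "-vframes", "1",
     "-f", "image2", "/tmp/frame%03d.jpg"],
    stdin=subprocess.PIPE,
)

try:
    for chunk in body.iter_chunks(chunk_size=8 * 1024 * 1024):  # 8 MiB pieces
        proc.stdin.write(chunk)
except BrokenPipeError:
    pass                                    # ffmpeg exited once it had what it needed
finally:
    proc.stdin.close()
    proc.wait()
This only works for container formats ffmpeg can parse from a stream (e.g. an MP4 with the moov atom at the front); otherwise the HTTP(S) input above is the safer option, since it lets ffmpeg seek.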
But maybe you'd want to consider running this as an ECS task instead: you wouldn't have the time constraint, nor the same storage constraint. A cold start on Fargate would be 1-2 minutes though, which may not be acceptable.
Lambda now supports up to 10 GB of ephemeral storage:
https://aws.amazon.com/blogs/aws/aws-lambda-now-supports-up-to-10-gb-ephemeral-storage/
To update it via the CLI:
$ aws lambda update-function-configuration --function-name PDFGenerator --ephemeral-storage '{"Size": 10240}'

Fastest way to get the files count and total size of a folder in GCS?

Assume there is a bucket with a folder root, which has subfolders and files. Is there any way to get the total file count and total size of the root folder?
What I tried:
With gsutil du I get the size quickly, but not the count. With gsutil ls ___ I get the list and sizes; if I pipe it into awk and sum them, I can get the expected result, but ls itself takes a lot of time.
So is there a better/faster way to handle this?
Doing an object listing of some sort is the way to go - both the ls and du commands in gsutil perform object listing API calls under the hood.
If you want to get a summary of all objects in a bucket, check Cloud Monitoring (as mentioned in the docs). But, this isn't applicable if you want statistics for a subset of objects - GCS doesn't support actual "folders", so all your objects under the "folder" foo are actually just objects named with a common prefix, foo/.
If you want to analyze the number of objects under a given prefix, you'll need to perform object listing API calls (either using a client library or using gsutil). The listing operations can only return so many objects per response and thus are paginated, meaning you'll have to make several calls if you have lots of objects under the desired prefix. The max number of results per listing call is currently 1,000. So as an example, if you had 200,000 objects to list, you'd have to make 200 sequential API calls.
A note on gsutil's ls:
There are several scenarios in which gsutil can do "extra" work when completing an ls command, like when doing a "long" listing using the -L flag or performing recursive listings using the -r flag. To save time and perform the fewest number of listings possible in order to obtain a total count of bytes under some prefix, you'll want to do a "flat" listing using gsutil's wildcard support, e.g.:
gsutil ls -l gs://my-bucket/some-prefix/**
Alternatively, you could try writing a script using one of the GCS client libraries, like the Python library and its list_blobs functionality.
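A minimal sketch of that, assuming the google-cloud-storage package and placeholder bucket and prefix names:
from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs("my-bucket", prefix="root/")   # pagination is handled for you

count = 0
total_bytes = 0
for blob in blobs:
    count += 1
    total_bytes += blob.size

print(f"objects: {count}, total size: {total_bytes / 1024**3:.2f} GiB")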
If you want to track the count of objects in a bucket over a long time, Cloud Monitoring offers the metric "storage/object_count". The metric updates about once per day, which makes it more useful for long-term trends.
As for counting instantaneously, unfortunately gsutil ls is probably your best bet.
Using gsutil du -sh could be a good idea for small directories. For big directories, I was not able to get a result even after a few hours; I only got a retrying message.
Using gsutil ls is more efficient. For big directories it can take tens of minutes, but at least it completes.
To retrieve the number of files and the total size of a directory with gsutil ls, you can use the following command:
gsutil ls -l gs://bucket/dir/** | awk '{size+=$1} END {print "nb_files:", NR, "\ntotal_size:",size,"B"}'
Then divide the value by:
1024 for KB
1024 * 1024 for MB
1024 * 1024 * 1024 for GB
...

What is the best way to transfer large files using aws s3 cp command of awscli

I am transferring around 150 files, each 1 GB, to S3 using the aws s3 cp command in a loop, which takes around 20 sec/file, so about 50 minutes in total. If I put all the files in a directory and copy the folder with --recursive, which is multithreaded, it takes up to 40 minutes. I tried to change the S3 config by setting the concurrent requests to 20 and increasing the bandwidth, but it takes almost the same time. What is the best way to reduce the time?
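If the CLI settings have plateaued, one alternative worth sketching (not what the question used) is to drive the uploads from the SDK, parallelizing across files on top of each file's own multipart concurrency. The bucket, key prefix, and file paths below are placeholders.
import pathlib
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(multipart_chunksize=64 * 1024 * 1024,  # 64 MiB parts
                        max_concurrency=10)                    # threads per file

def upload(path: pathlib.Path) -> None:
    s3.upload_file(str(path), "my-bucket", f"incoming/{path.name}", Config=config)

files = sorted(pathlib.Path("data").glob("*.bin"))
with ThreadPoolExecutor(max_workers=8) as pool:   # 8 files in flight at once
    list(pool.map(upload, files))
Whether this beats aws s3 cp --recursive depends on where the bottleneck actually is (instance network bandwidth, disk, or S3 request rate), so it is worth measuring before committing.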

ext performance handling millions of files

I have a filesystem with 40 million files in a 10-level tree structure (around 500 GB in total). The problem I have is the backup. An incremental backup (Bacula) takes 9 hours (for around 10 GB) with very low performance. Some directories have 50k files, others 10k. The HDs are in hardware RAID, and I have the default Ubuntu LVM setup on top. I think the bottleneck here is the number of files (the huge number of inodes). I'm trying to improve the performance (a full backup on the same FS takes 4+ days, at a 200k/s read speed).
- Do you think that partitioning the FS into several smaller FS would help? I can have 1000 smaller FS...
- Do you think that moving from HD to SSD would help?
- Any advice?
Thanks!
Moving to SSD will improve the speed of the backup. But the SSD will wear out soon enough, and then you will really need that backup...
Can't you organise things so that you know where to look for changed/new files? That way you only need to incrementally back up those folders.
Is it necessary for your files to be online? Could you keep tar files of old trees three levels deep?
I guess a find -mtime -1 will take hours as well.
I hope the backup is not using the same partition as the tree structure (putting everything under /tmp is a very bad plan); any temporary files the backup makes should be on a different partition.
Where are the new files coming from? When all the files are changed by a process you control, that process can write a logfile with a list of the files it changed.
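A rough sketch of building such a changed-file list, assuming a one-day cutoff and placeholder paths (find -mtime -1 does the same job in one line):
import os
import time

ROOT = "/data/tree"                      # placeholder root of the 40M-file tree
SINCE = time.time() - 24 * 3600          # files touched in the last day

with open("/var/tmp/changed-files.txt", "w") as out:
    for dirpath, _dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime >= SINCE:
                    out.write(path + "\n")
            except OSError:
                pass                     # file vanished between listing and stat
The resulting list can then be fed to the backup tool so it never has to walk the whole tree itself.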
