Dealing with large number of small json files using pyspark - linux

I have around 376K JSON files under a directory in S3. Each file is about 2.5 KB and contains a single record. When I tried to load the entire directory with the code below in a Glue ETL job with 20 workers:
spark.read.json("path")
It just didn't run; the job timed out after 5 hours. So I wrote and ran a shell script to merge the records from these files into a single file, but when I tried to load that file, it showed only a single record. The merged file is 980 MB. When I tested locally by merging just 4 records into a single file, it worked fine and displayed 4 records as expected.
I used the below command to append the JSON records from different files under a single file:
for f in Agent/*.txt; do cat ${f} >> merged.json;done;
The data doesn't contain any nested JSON. I even tried the multiline option, but it didn't work. So, what could be done in this case? My guess is that after merging, the records are not treated separately, which causes the issue. I even tried head -n 10 to display the first 10 lines, but it seems to go into an infinite loop.

The problem was with my shell script that was being used to merge multiple small files. Post merge, records weren't aligned properly due to which they weren't treated as separate records.
Since I was dealing with a JSON dataset, I used the jq utility to process it. Below is the shell script that would merge a large number of records in a faster way into one file:
find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt
Later on, I was able to load the JSON records as expected with the code below:
spark.read.option("multiline","true").json("path")
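As an alternative to the jq approach, the merge could also write one record per line (JSON Lines), which is the layout Spark's JSON reader handles without the multiline option. This is only a minimal sketch using the Python standard library; the function name, the `*.txt` pattern and the paths are illustrative assumptions, and the output file should live outside the source directory so it is not picked up as an input.

```python
import json
from pathlib import Path


def merge_to_json_lines(src_dir, out_path, pattern="*.txt"):
    """Merge single-record JSON files into one newline-delimited JSON file.

    Returns the number of records written. The name and defaults are
    hypothetical; adjust them to the actual directory layout.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).rglob(pattern)):
            # Parse each file so malformed input fails loudly here
            # rather than producing a silently broken merged file.
            record = json.loads(path.read_text(encoding="utf-8"))
            # json.dumps without indent never emits a raw newline, so each
            # record lands on exactly one line of the output.
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Parsing and re-serializing every record is slower than a plain `cat`, but it guarantees that record boundaries survive the merge, which was the root cause of the single-record result.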

I have run into trouble in the past working with thousands of small files. In my case they were CSV files, not JSON. One of the things I did to debug was to write a for loop that loaded smaller batches and then combined all the data frames. During each iteration I would call an action to force execution. I would log the progress to see whether it was advancing, and monitor how it slowed down as the job progressed.
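The batching idea can be sketched roughly as below. This is not the exact code I used; `load_batch` is a hypothetical callback, and in a Spark job it might be something like a function that reads the list of paths and calls an action such as `count()` to force execution. The timing in the log line is what makes a gradual slowdown visible.

```python
import time
from itertools import islice


def batched(paths, batch_size):
    """Yield successive lists of at most batch_size paths."""
    it = iter(paths)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def load_in_batches(paths, batch_size, load_batch, log=print):
    """Call load_batch on each batch, logging progress and elapsed time.

    load_batch is a hypothetical callback supplied by the caller; its
    results are collected so they can be combined afterwards.
    """
    results = []
    loaded = 0
    for i, batch in enumerate(batched(paths, batch_size), start=1):
        started = time.perf_counter()
        results.append(load_batch(batch))
        loaded += len(batch)
        elapsed = time.perf_counter() - started
        log(f"batch {i}: {loaded} files loaded, last batch took {elapsed:.2f}s")
    return results
```

If the per-batch time keeps growing, the bottleneck is probably in combining the results rather than in reading the files.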


Why is Spark much faster at reading a directory compared to a list of filepaths?

I have a directory in S3 containing millions of small files. They are small (<10MB) and gzipped, and I know that's inefficient for Spark. I am running a simple batch job to convert these files to Parquet format. I've tried two different ways of passing the input to the reader: the whole directory, "s3://input_bucket_name/data/", as well as an explicit list of files, "file1", "file2" ... "file8million", where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in a whole directory, there isn't as much delay at the beginning while the driver indexes the files (around 20 minutes before the batch starts). In the UI for the single directory, there is 1 task after these 20 minutes, which looks like the conversion itself.
However, with individual filenames, the indexing time increases to 2+ hours, and the conversion job doesn't show up in the UI until then. For the list of files, there are 2 tasks: (1) the first one lists leaf files for the 8 million paths, and then (2) a job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
spark assumes every path passed in is a directory
so when given a list of paths, it has to do a list call on each
which for s3 means: 8M LIST calls against the s3 servers
which is rate limited to about 3k/second, ignoring details like thread count on client, http connections etc
and with LIST billed at $0.005 per 1000 calls, 8M requests comes to $40
oh, and as the LIST returns nothing, the client falls back to a HEAD which adds another S3 API call, doubling execution time and adding another $32 to the query cost
in contrast,
listing a dir with 8M entries kicks off a single LIST request for the first 1K entries
and 7999 followups
s3a releases do async prefetch of the next page of results (faster, esp if the incremental list iterators are used). one thread to fetch, one to process and will cost you 4c
The big directory listing is the more efficient and cost-effective strategy, even ignoring EC2 server costs
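The arithmetic behind the LIST comparison can be written out as a small sketch. The price, page size and request rate are simply the figures quoted in this answer, not values looked up from a price list, and the function names are made up for illustration.

```python
import math

LIST_PRICE_PER_1000 = 0.005  # USD per 1000 LIST calls, as quoted above
PAGE_SIZE = 1000             # entries returned per LIST response, as quoted above
RATE_LIMIT_PER_SEC = 3000    # approximate request rate limit, as quoted above


def list_calls_for_paths(n_paths):
    """One LIST call per path, since every path is treated as a directory."""
    return n_paths


def list_calls_for_directory(n_entries, page_size=PAGE_SIZE):
    """One LIST call per page of results for a single directory listing."""
    return math.ceil(n_entries / page_size)


def list_cost_usd(n_calls, price_per_1000=LIST_PRICE_PER_1000):
    return n_calls / 1000 * price_per_1000


def min_seconds(n_calls, rate=RATE_LIMIT_PER_SEC):
    """Lower bound on wall time if calls are limited to `rate` per second."""
    return n_calls / rate


if __name__ == "__main__":
    n = 8_000_000
    for label, calls in (
        ("list of paths", list_calls_for_paths(n)),
        ("single directory", list_calls_for_directory(n)),
    ):
        print(label, calls, "calls,",
              round(list_cost_usd(calls), 2), "USD,",
              round(min_seconds(calls) / 60, 1), "min at the rate limit")
```

The rate-limit figure is only a lower bound on time; the fallback HEAD requests and client-side threading mentioned above are not modelled.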

How to make subsection of sbatch jobs run at a time

I have a bash script (essentially what it does is align all files in a directory to all the files of another specified directory). The number of jobs gets quite large if there are 100 files being aligned individually to another 100 files (10,000 jobs), which all get submitted to slurm individually.
I have been doing them in batches manually but I think there must be a way to include it in a script so that, for example, only 50 jobs are running at a time.
I tried
$ sbatch --array [1-1000]%50
but it didn't work

Fastest way to get the files count and total size of a folder in GCS?

Assume there is a bucket with a folder root that has subfolders and files. Is there any way to get the total file count and total size of the root folder?
What I tried:
With gsutil du I get the size quickly, but not the count. With gsutil ls ___ I get the list and sizes, and if I pipe it to awk and sum them I might get the expected result, but ls itself takes a lot of time.
So is there a better/faster way to handle this?
Doing an object listing of some sort is the way to go - both the ls and du commands in gsutil perform object listing API calls under the hood.
If you want to get a summary of all objects in a bucket, check Cloud Monitoring (as mentioned in the docs). But, this isn't applicable if you want statistics for a subset of objects - GCS doesn't support actual "folders", so all your objects under the "folder" foo are actually just objects named with a common prefix, foo/.
If you want to analyze the number of objects under a given prefix, you'll need to perform object listing API calls (either using a client library or using gsutil). The listing operations can only return so many objects per response and thus are paginated, meaning you'll have to make several calls if you have lots of objects under the desired prefix. The max number of results per listing call is currently 1,000. So as an example, if you had 200,000 objects to list, you'd have to make 200 sequential API calls.
A note on gsutil's ls:
There are several scenarios in which gsutil can do "extra" work when completing an ls command, like when doing a "long" listing using the -L flag or performing recursive listings using the -r flag. To save time and perform the fewest number of listings possible in order to obtain a total count of bytes under some prefix, you'll want to do a "flat" listing using gsutil's wildcard support, e.g.:
gsutil ls -l gs://my-bucket/some-prefix/**
Alternatively, you could try writing a script using one of the GCS client libraries, like the Python library and its list_blobs functionality.
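A minimal sketch of the client-library route, assuming the google-cloud-storage package is installed and credentials are configured. `summarize_blobs` and `summarize_prefix` are hypothetical helper names; `list_blobs` follows the pagination described above on its own, so the loop sees every object under the prefix.

```python
def summarize_blobs(blobs):
    """Return (object_count, total_bytes) for any iterable of blob-like
    objects exposing a .size attribute."""
    count = 0
    total = 0
    for blob in blobs:
        count += 1
        total += blob.size or 0  # treat a missing size as 0 bytes
    return count, total


def summarize_prefix(bucket_name, prefix):
    """Count objects and bytes under a prefix using the Python client.

    The import is done lazily so summarize_blobs stays usable (and
    testable) without the library installed.
    """
    from google.cloud import storage

    client = storage.Client()
    return summarize_blobs(client.list_blobs(bucket_name, prefix=prefix))
```

For example, `summarize_prefix("my-bucket", "some-prefix/")` would walk the same objects as the flat gsutil listing above. It still performs one listing call per page of results, so it is not inherently faster than gsutil.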
If you want to track the count of objects in a bucket over a long time, Cloud Monitoring offers the metric "storage/object_count". The metric updates about once per day, which makes it more useful for long-term trends.
As for counting instantaneously, unfortunately gsutil ls is probably your best bet.
Using gsutil du -sh could be a good idea for small directories.
For big directories, I was not able to get a result, even after a few hours, only a retrying message.
Using gsutil ls is more efficient.
For big directories, it can take tens of minutes, but at least it completes.
To retrieve the number of files and the total size of a directory with gsutil ls, you can use the following command:
gsutil ls -l gs://bucket/dir/** | awk '$1 ~ /^[0-9]+$/ {size+=$1; n++} END {print "nb_files:", n, "\ntotal_size:", size, "B"}'
Then divide the value by:
1024 for KB
1024 * 1024 for MB
1024 * 1024 * 1024 for GB
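The same summary and unit conversion can be sketched in Python, for instance when the listing output has been saved to a file. This assumes each object line of `gsutil ls -l` starts with the size in bytes; any line whose first field is not a plain number (such as a trailing summary line) is skipped. Both function names are made up for illustration.

```python
def summarize_ls_output(text):
    """Return (file_count, total_bytes) from `gsutil ls -l` style output."""
    count = 0
    total = 0
    for line in text.splitlines():
        fields = line.split()
        # Only lines that begin with a byte count describe an object.
        if fields and fields[0].isdigit():
            count += 1
            total += int(fields[0])
    return count, total


def human_size(n_bytes):
    """Format a byte count by repeatedly dividing by 1024."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024
```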

Spark: Cut down no. of output files

I wrote a Spark program that mimics functionality of an existing Map Reduce job. The MR job takes about 50 minutes every day, but the Spark job took only 9 minutes! That’s great!
When I looked at the output directory, I noticed that it created 1,020 part files. The MR job uses only 20 reducers so it creates only 20 files. We need to cut down on # of output files; otherwise our Namespace would be full in no time.
I am trying to figure out how I can reduce the number of output files under Spark. Seems like 1,020 tasks are getting triggered and each one creates a part file. Is this correct? Do I have to change the level of parallelism to cut down no. of tasks thereby reducing no. of output files? If so how do I set it? I am afraid cutting down no. of tasks will slow down this process – but I can test that!
Cutting down the number of reduce tasks will slow down the process for sure. However, it still should be considerably faster than Hadoop MapReduce for your use case.
In my opinion, the best method to limit the number of output files is using the coalesce(numPartitions) transformation. Below is an example:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext ctx = new JavaSparkContext(/*your configuration*/);
JavaRDD<String> myData = ctx.textFile("path/to/my/file.txt");
// Suppose we have 1020 partitions and thus 1020 map tasks
JavaRDD<String> mappedData = myData.map(line -> line /* your map function */);
// Suppose we need 20 output files
JavaRDD<String> newData = mappedData.coalesce(20);
newData.saveAsTextFile("output path");
In this example, the map function would be executed by 1020 tasks, which would not be altered in any way. However, after having coalesced the partitions, there should only be 20 partitions to work with. In that case, 20 output files would be saved at the end of the program.
As mentioned earlier, bear in mind that this will be slower than writing 1020 output files, because the data has to be packed into fewer partitions (from 1020 down to 20).
Note: also take a look at the repartition command.

Fastest way to shuffle lines in a file in Linux

I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R', but it is very slow (it takes about 50 minutes for a 16M file). Is there a faster utility that I can use instead?
Use shuf instead of sort -R (man page).
The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation so it doesn't have that problem.
(This was suggested in a comment but for some reason not written as an answer by anyone)
The 50 minutes is not caused by the actual mechanics of sorting, based on your description. The time is likely spent waiting on /dev/random to generate enough entropy.
One approach is to use an external source of random data along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.
To apply this to your problem:
generate a text file with random numbers, 1 per line, with the same number of lines as the file to be sorted. This can be done at any time, run in the background, run on a different server, downloaded, etc. The point is that this randomness is not generated while you are trying to sort.
create an enriched version of the file using paste:
paste random_number_file.txt string_data.txt > tmp_string_data.txt
sort this file:
sort tmp_string_data.txt > sorted_tmp_string_data.txt
remove the random data:
cut -f2- sorted_tmp_string_data.txt > random_string_data.txt
This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.
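For illustration, the same decorate, sort, undecorate pattern can be sketched in Python for data that fits in memory. The function name is made up, and unlike the paste/sort/cut pipeline this sorts on the random key alone rather than on the whole enriched line.

```python
import random


def shuffle_by_random_keys(lines, keys=None):
    """Shuffle lines by sorting them on externally supplied random keys.

    keys plays the role of random_number_file.txt; if omitted, keys are
    generated on the spot, which is fine for a sketch.
    """
    lines = list(lines)
    if keys is None:
        keys = [random.random() for _ in lines]
    keys = list(keys)
    if len(keys) != len(lines):
        raise ValueError("need exactly one random key per line")
    # Decorate: pair each line with its key (the `paste` step).
    decorated = list(zip(keys, lines))
    # Sort on the key only (the `sort` step).
    decorated.sort(key=lambda pair: pair[0])
    # Undecorate: drop the key again (the `cut -f2-` step).
    return [line for _, line in decorated]
```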
You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GBs in a reasonable time.
Here are the details of the shuffling implementation. It requires specifying batchSize, the number of lines to keep in RAM when writing to output. The larger the better (unless you run out of RAM), because total shuffling time is roughly (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles the whole file, not on a per-batch basis.
The algorithm is as follows.
Count lines in sourceFile. This is done simply by reading the whole file line by line. This also gives a measurement of how long it takes to read the whole file once, so we can estimate how long a complete shuffle will take, because it requires Ceil(linesCount / batchSize) complete file reads.
As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This would give us an order in which we want to have lines in a shuffled file. Note that this is a global order over the whole file, not per batch or chunk or something.
Now the actual shuffling. We need to get all lines from sourceFile in the order we just computed, but we can't read the whole file into memory. So we just split the task.
We would go through the sourceFile reading all lines and storing in memory only those lines that would be in first batchSize of the orderArray. When we get all these lines, we could write them into outFile in required order, and it's a batchSize/linesCount of work done.
Next we would repeat whole process again and again taking next parts of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
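The steps above can be sketched in Python as follows. This is an illustrative reimplementation of the described algorithm, not the HugeFileProcessor source; the function and parameter names are assumptions, and the index array is still held fully in memory, as in the description.

```python
import random


def shuffle_file(source_path, out_path, batch_size, seed=None):
    """Shuffle a text file using repeated sequential passes.

    Only batch_size lines are held in memory at a time; the source is
    re-read from start to end once per batch. Returns the line count.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    # Pass 0: count lines by reading the whole file once.
    with open(source_path, "r", encoding="utf-8") as src:
        lines_count = sum(1 for _ in src)

    # Global target order over the whole file; random.shuffle performs a
    # Fisher-Yates shuffle of the index array.
    order = list(range(lines_count))
    random.Random(seed).shuffle(order)

    with open(out_path, "w", encoding="utf-8") as out:
        for start in range(0, lines_count, batch_size):
            batch = order[start:start + batch_size]
            # Map source line number -> position inside this output batch.
            wanted = {line_no: pos for pos, line_no in enumerate(batch)}
            kept = [None] * len(batch)
            # Sequential read of the whole source, keeping only wanted lines.
            with open(source_path, "r", encoding="utf-8") as src:
                for line_no, line in enumerate(src):
                    pos = wanted.get(line_no)
                    if pos is not None:
                        # A final line without a newline must not merge
                        # with whatever is written after it.
                        kept[pos] = line if line.endswith("\n") else line + "\n"
            out.writelines(kept)
    return lines_count
```

A larger `batch_size` means fewer passes over the source file, which is the trade-off between RAM and total time described above.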
Why does it work?
Because all we do is read the source file from start to end. There are no seeks forward or backward, and that's what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.
Some numbers
On my machine (Core i5, 16GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using batchSize of 3 500 000. With batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118000 lines per second.
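As a rough sanity check, these figures can be plugged into the time estimate from the description (number of passes times the time for one full read). The sketch below uses only the numbers quoted above; the function name is made up, and the extra line-counting pass and write time are ignored.

```python
import math


def estimate_shuffle_hours(lines_count, batch_size, lines_per_second):
    """Return (passes, hours) for the multi-pass shuffle.

    passes = Ceil(linesCount / batchSize); each pass reads the whole file.
    """
    passes = math.ceil(lines_count / batch_size)
    seconds_per_pass = lines_count / lines_per_second
    return passes, passes * seconds_per_pass / 3600


if __name__ == "__main__":
    for batch_size in (3_500_000, 2_000_000):
        passes, hours = estimate_shuffle_hours(84_000_000, batch_size, 118_000)
        print(f"batchSize={batch_size}: {passes} passes, about {hours:.1f} h")
```

The estimates come out slightly under 5 hours and slightly over 8 hours, which is in line with the measured times.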
