I get an OutOfMemoryError when processing tar.gz files larger than 1 GB in Spark.
To get past this error I have tried splitting the tar.gz into multiple parts using the 'split' command, only to find out that each split is not a valid tar.gz on its own and so cannot be processed as one.
dir=/dbfs/mnt/data/temp      # where the original large archives get parked
b=524288000                  # 500 MB threshold in bytes
for file in /dbfs/mnt/data/*.tar.gz; do
  a=$(stat -c%s "$file")
  if [[ "$a" -gt "$b" ]]; then
    # cut the archive into 500 MB chunks named <name>_part00.tar.gz, <name>_part01.tar.gz, ...
    split -b 500M -d --additional-suffix=.tar.gz "$file" "${file%%.*}_part"
    mv "$file" "$dir"
  fi
done
Error when trying to process the split files:
Caused by: java.io.EOFException
at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:281)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.read(TarArchiveInputStream.java:590)
at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:98)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.Reader.read(Reader.java:140)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2001)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1980)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1957)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1907)
at org.apache.commons.io.IOUtils.toString(IOUtils.java:778)
at org.apache.commons.io.IOUtils.toString(IOUtils.java:803)
at linea3796c25fa964697ba042965141ff28825.$read$$iw$$iw$$iw$$iw$$iw$$iw$Unpacker$$anonfun$apply$1.apply(command-2152765781429277:33)
at linea3796c25fa964697ba042965141ff28825.$read$$iw$$iw$$iw$$iw$$iw$$iw$Unpacker$$anonfun$apply$1.apply(command-2152765781429277:31)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
at scala.collection.immutable.Stream.foreach(Stream.scala:595)
at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:316)
at scala.collection.AbstractTraversable.toMap(Traversable.scala:104)
at linea3796c25fa964697ba042965141ff28825.$read$$iw$$iw$$iw$$iw$$iw$$iw$Unpacker$.apply(command-2152765781429277:34)
at linea3796c25fa964697ba042965141ff28827.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-2152765781429278:3)
at linea3796c25fa964697ba042965141ff28827.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-2152765781429278:3)
I have tar.gz files that go up to 4 GB in size, and each of these can contain up to 7000 JSON documents whose sizes vary from 1 MB to 50 MB.
If I want to divide the large tar.gz files into smaller tar.gz files, is my only option to decompress and then somehow recompress based on file size or file count? Is this the way?
- Normal gzip files are not splittable.
- Gzipped tar archives are even harder to deal with.
- Spark can handle gzipped JSON files, but not gzipped tar files, and not plain tar files.
- Spark can handle binary files up to about 2 GB each.
- Spark can handle JSON that has been concatenated together.
I would recommend using a Pandas UDF or a .pipe() operator to process each tar gzipped file (one per worker). Each worker would unzip, untar and process each JSON document in a streaming fashion, never filling memory. Hopefully you have enough source files to run this in parallel and see a speed up.
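For illustration only, assuming GNU tar on the workers, a hypothetical archive path, and a placeholder downstream command: each task that owns one archive could stream its JSON documents out like this, without ever holding the whole archive in memory.
archive=/dbfs/mnt/data/part_0001.tar.gz   # hypothetical: the single archive this worker was handed
# stream-decompress, stream-untar, and write every *.json member to stdout as one concatenated stream
tar --extract --gzip --to-stdout --wildcards --file "$archive" '*.json' | process_json_stream
Here process_json_stream is a stand-in for whatever consumes the documents; with .pipe() that consumer is the rest of the Spark job, and with a Pandas UDF you would do the equivalent with a streaming tar reader inside the UDF.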
You might want to explore streaming approaches for delivering your compressed JSON files incrementally to ADLS Gen 2 / S3 and using Databricks Auto Loader features to load and process the files as soon as they arrive.
The answers to this question, "How to load tar.gz files in streaming datasets?", also look promising.
Related
I have large gzip files stored in an HDFS location:
- /dataset1/sample.tar.gz --> contains 1.csv & 2.csv .... and so on
I would like to extract them to
/dataset1/extracted/1.csv
/dataset1/extracted/2.csv
/dataset1/extracted/3.csv
.........................
.........................
/dataset1/extracted/1000.csv
Are there any HDFS commands that can be used to extract a tar.gz file (without copying it to the local machine), or can this be done with Python/Scala Spark?
I tried using Spark, but Spark cannot parallelize reading a gzip file, and the gzip file is very large, around 50 GB. I want to split the gzip file and use the resulting pieces for Spark aggregations.
I have a tar file which is 3.1 TB (terabytes) in size.
File name - Testfile.tar
I would like to split this tar file into 2 parts - Testfile1.tar and Testfile2.tar.
I tried the following so far
split -b 1T Testfile.tar "Testfile.tar"
What I get is Testfile.taraa (what is "aa"?).
And I just stopped my command. I also noticed that the output Testfile.taraa doesn't seem to be a tar file when I do ls in the directory; it looks like a text file. Maybe once the full split is completed it will look like a tar file?
The behavior of split is correct; from the man page online: http://man7.org/linux/man-pages/man1/split.1.html
Output pieces of FILE to PREFIXaa, PREFIXab, ...
Don't stop the command; let it run, and then you can use cat to concatenate (join) the pieces back together again.
Examples can be seen here: https://unix.stackexchange.com/questions/24630/whats-the-best-way-to-join-files-again-after-splitting-them
split -b 100m myImage.iso
# later
cat x* > myImage.iso
UPDATE
Just as clarification, since I believe you have not understood the approach: you split a big file like this to transport it, for example; the pieces are not usable on their own. To use the file again you need to concatenate (join) the pieces back together. If you want usable parts, then you need to decompress the file, split its contents into parts, and compress each part again. With split you are basically cutting the binary file, and I don't think you can use those parts as they are.
You are doing the compression first and the partitioning later.
If you want each part to be a tar file, you should split the original data first and then run 'tar' on each part.
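A minimal sketch of that "unpack, partition, re-archive" idea, assuming GNU tar and split; the 1000-files-per-part grouping and the names are only placeholders:
mkdir extracted
tar -xf Testfile.tar -C extracted                     # 1. unpack the original archive
find extracted -type f | split -l 1000 - filelist_    # 2. split the list of members into groups
n=0
for list in filelist_*; do
  n=$((n + 1))
  tar -cf "Testfile${n}.tar" -T "$list"               # 3. re-archive each group as its own, usable tar
done
Each TestfileN.tar produced this way is a complete archive you can extract on its own, unlike the raw chunks produced by split -b.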
I am trying to simply compress a directory using
nohup tar cvzf /home/ali/42.tar.gz /home/hadoop/ &
It creates the archive, but it is exactly the same size as the original directory.
I checked the file size of the original directory (it includes millions of text files) and it's exactly the same as the archive that is supposed to be compressed; it's as if the compression switch is not working.
I also used the gunzip command to get more details about the compression achieved.
[hadoop#node3 ali]$ gunzip -lv 42.tar.gz
method crc date time compressed uncompressed ratio uncompressed_name
defla 7716afb7 Feb 28 10:25 7437323730944 1927010989 -385851.3% 42.tar
update
total capacity of server = 14T
size of /home/hadoop directory = 7.2T
size of /home/ali/42.tar.gz = 6.8T
Note: in the middle of the compression process the server's capacity filled up, which is why the file size is 6.8T.
What am I doing wrong?
I am using lz4mt, a multi-threaded version of lz4. In my workflow I send thousands of large files (620 MB each) from a client to a server; when a file reaches the server, my rule triggers, compresses the file using lz4mt, and then removes the uncompressed file. The problem is that sometimes, when I remove the uncompressed file, I do not get a compressed file of the right size, because lz4mt returns immediately, before its output has been written to disk.
So is there any way to make lz4mt remove the uncompressed file itself after compressing, as bzip2 does?
Input: bzip2 uncompress_file
Output: Compressed file only
whereas
Input: lz4mt uncompress_file
Output: (Uncompressed + Compressed) file
I also think the sync command in the script below is not working properly.
The script which executes when my rule triggers is:
script.sh
/bin/lz4mt uncompressed_file output_file
/bin/sync
/bin/rm uncompressed_file
Please tell me how to solve the above issue.
Thanks a lot
Author here. You could try the following methods:
- Concatenate the commands with && or ; (see the sketch below).
- Add the lz4mt command-line options -q (suppress prompt) and -f (force overwrite).
- Try it with the original lz4.
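For example, a minimal version of the rule script with everything chained, so the removal only runs after compression (and sync) have returned successfully; I am assuming lz4mt accepts the options before the file names, as the original lz4 does:
#!/bin/sh
# delete the input only if lz4mt and sync both succeed
/bin/lz4mt -q -f uncompressed_file output_file && /bin/sync && /bin/rm uncompressed_file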
I made two compressed copies of my folder, first by using the command tar czf dir.tar.gz dir
This gives me an archive of size ~16 KB. Then I tried another method: first I gzipped each file inside the dir and then used
gzip ./dir/*
tar cf dir.tar dir/*.gz
but the second method gave me a dir.tar of size ~30 KB (almost double). Why is there so much difference in size?
Because compression is in general more efficient on one big sample than on many small files. Say you have gzipped 100 files of 1 KB each, for example: each file gets a certain amount of compression, plus the overhead of the gzip format.
file1.tar -> file1.tar.gz (say 30 bytes of headers/footers)
file2.tar -> file2.tar.gz (say 30 bytes of headers/footers)
...
file100.tar -> file100.tar.gz (say 30 bytes of headers/footers)
------------------------------
30 * 100 = 3 KB of overhead.
But if you compress a single tar file of 100 KB (which contains your 100 files), the overhead of the gzip format is added only once (instead of 100 times), and the compression itself can be better.
The overhead comes from the per-file metadata and from the suboptimal compression gzip achieves when processing files individually: gzip does not see the data in full, so it compresses with a suboptimal dictionary, which is reset after each file.
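A quick way to see this for yourself (just a sketch; it works on a copy so your original directory is left untouched, and the exact sizes will of course differ):
tar -czf whole.tar.gz dir          # method 1: one gzip stream over the whole archive
cp -r dir dir_copy
gzip -r dir_copy                   # method 2: gzip every file individually (on the copy)...
tar -cf pieces.tar dir_copy        # ...then pack the already-gzipped files into a plain tar
du -h whole.tar.gz pieces.tar      # whole.tar.gz is usually noticeably smaller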
tar cf creates an uncompressed archive, which means the size of your archive should be almost the same as your directory, maybe even a bit more.
tar czf will run gzip compression on it.
This can be checked by running man tar at a shell prompt on Linux:
-z, --gzip, --gunzip, --ungzip
filter the archive through gzip
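If in doubt, you can also check what you actually produced with standard tools:
file dir.tar.gz      # reports "gzip compressed data" only if the archive really went through gzip
gzip -l dir.tar.gz   # shows compressed vs. uncompressed sizes for a genuine .tar.gz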