extract gz files using hdfs/spark - apache-spark

I have large tar.gz files stored in an HDFS location:
- /dataset1/sample.tar.gz --> contains 1.csv, 2.csv, ... and so on
I would like to extract them to:
/dataset1/extracted/1.csv
/dataset1/extracted/2.csv
/dataset1/extracted/3.csv
.........................
.........................
/dataset1/extracted/1000.csv
Is there any HDFS command that can be used to extract a tar.gz file (without copying it to the local machine), or a way to do this with Python/Scala Spark?
I tried using Spark, but Spark cannot parallelize reading a gzip file, and the file is very large (around 50 GB).
I want to split the archive and use the extracted files for Spark aggregations.
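One possible approach (a sketch, not from the thread, assuming your Hadoop version's -put accepts - for stdin, as recent versions do): stream the archive out of HDFS with hdfs dfs -cat, untar it on the fly with Python's tarfile module, and pipe each member straight back into HDFS with hdfs dfs -put -, so nothing is staged on the local filesystem. The paths are the ones from the question; error handling is kept minimal.
import os
import subprocess
import tarfile

src = "/dataset1/sample.tar.gz"
dst_dir = "/dataset1/extracted"

# Stream the archive out of HDFS; nothing is written to local disk.
cat = subprocess.Popen(["hdfs", "dfs", "-cat", src], stdout=subprocess.PIPE)

# mode="r|gz" makes tarfile read the gzip stream sequentially.
with tarfile.open(fileobj=cat.stdout, mode="r|gz") as tar:
    for member in tar:
        if not member.isfile():
            continue
        fh = tar.extractfile(member)
        dst = dst_dir + "/" + os.path.basename(member.name)
        # "hdfs dfs -put - <dst>" reads the file content from stdin.
        put = subprocess.Popen(["hdfs", "dfs", "-put", "-", dst],
                               stdin=subprocess.PIPE)
        # Copy in fixed-size chunks so large members never sit fully in memory.
        while True:
            chunk = fh.read(1024 * 1024)
            if not chunk:
                break
            put.stdin.write(chunk)
        put.stdin.close()
        put.wait()

cat.stdout.close()
cat.wait()
Once the individual .csv files are sitting in /dataset1/extracted/, they are ordinary splittable text files and Spark can read them in parallel for the aggregations.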

Related

How to move files written with Pandas on Spark cluster to HDFS?

I'm running a Spark job in cluster mode and writing a few files using Pandas. I think it's writing to a temp directory; now I want to move these files to HDFS, or write them to HDFS directly.
You have multiple options:
convert the Pandas DataFrame into a PySpark DataFrame and simply save it to HDFS
spark_df = spark.createDataFrame(pandas_df)
spark_df.write.parquet("hdfs:///path/on/hdfs/file.parquet")
save the file locally using Pandas and use subprocess to copy the file to HDFS
import subprocess
# -f overwrites the destination file if it already exists
command = "hdfs dfs -copyFromLocal -f local/file.parquet /path/on/hdfs".split()
result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout)
print(result.stderr)
save the file locally and use a third-party library - hdfs3 - to copy the file to HDFS
from hdfs3 import HDFileSystem
hdfs = HDFileSystem()
hdfs.cp("local/file.parquet", "/path/on/hdfs")

Break a large tar.gz file into multiple smaller tar.gz files

I get an OutOfMemoryError when processing tar.gz files greater than 1 GB in Spark.
To get past this error I tried splitting the tar.gz into multiple parts using the 'split' command, only to find out that each split is not a valid tar.gz on its own and so cannot be processed as such.
dir=/dbfs/mnt/data/temp
b=524288000
for file in /dbfs/mnt/data/*.tar.gz;
do
  a=$(stat -c%s "$file");
  if [[ "$a" -gt "$b" ]] ; then
    split -b 500M -d --additional-suffix=.tar.gz $file "${file%%.*}_part"
    mv $file $dir
  fi
done
Error when trying to process the split files:
Caused by: java.io.EOFException
at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:281)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.read(TarArchiveInputStream.java:590)
at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:98)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.Reader.read(Reader.java:140)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2001)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1980)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1957)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1907)
at org.apache.commons.io.IOUtils.toString(IOUtils.java:778)
at org.apache.commons.io.IOUtils.toString(IOUtils.java:803)
at linea3796c25fa964697ba042965141ff28825.$read$$iw$$iw$$iw$$iw$$iw$$iw$Unpacker$$anonfun$apply$1.apply(command-2152765781429277:33)
at linea3796c25fa964697ba042965141ff28825.$read$$iw$$iw$$iw$$iw$$iw$$iw$Unpacker$$anonfun$apply$1.apply(command-2152765781429277:31)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
at scala.collection.immutable.Stream.foreach(Stream.scala:595)
at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:316)
at scala.collection.AbstractTraversable.toMap(Traversable.scala:104)
at linea3796c25fa964697ba042965141ff28825.$read$$iw$$iw$$iw$$iw$$iw$$iw$Unpacker$.apply(command-2152765781429277:34)
at linea3796c25fa964697ba042965141ff28827.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-2152765781429278:3)
at linea3796c25fa964697ba042965141ff28827.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-2152765781429278:3)
I have tar.gz files that go up to 4 GB in size, and each of these can contain up to 7000 JSON documents whose sizes vary from 1 MB to 50 MB.
If I want to divide the large tar.gz files into smaller tar.gz files, is my only option to decompress and then somehow recompress based on file size or file count? - "is this the way?"
Normal gzip files are not splittable.
GZip Tar archives are harder to deal with.
Spark can handle gzipped json files, but not gzipped tar files and not tar files.
Spark can handle binary files up to about 2GB each.
Spark can handle JSON that's been concatenated together.
I would recommend using a Pandas UDF or a .pipe() operator to process each tar-gzipped file (one per worker). Each worker would unzip, untar, and process each JSON document in a streaming fashion, never filling memory. Hopefully you have enough source files to run this in parallel and see a speed-up.
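Here is a minimal sketch of that idea, written as a flatMap over an RDD of archive paths rather than a literal Pandas UDF or .pipe(), but with the same streaming behaviour; archive_paths and the /dbfs mount are assumptions, not something from the answer.
import json
import os
import tarfile

def extract_json_docs(archive_path):
    # Walk the tar.gz member by member so only one JSON document
    # is held in memory at any time.
    with tarfile.open(archive_path, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fh = tar.extractfile(member)
            if fh is None:
                continue
            yield os.path.basename(member.name), json.loads(fh.read())

# archive_paths: a Python list of /dbfs/... paths to the tar.gz files,
# one archive per partition so each worker handles a whole file.
# sc is the SparkContext already available in the notebook.
rdd = sc.parallelize(archive_paths, len(archive_paths))
docs = rdd.flatMap(extract_json_docs)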
You might want to explore streaming approaches for delivering your compressed JSON files incrementally to ADLS Gen 2 / S3 and using Databricks Auto Loader features to load and process the files as soon as they arrive.
The answers to this question, How to load tar.gz files in streaming datasets?, also appear promising.

Pyspark: Load a tar.gz file into a dataframe and filter by filename

I have a tar.gz file that contains multiple files. The hierarchy looks as below. My intention is to read the tar.gz file and filter out the contents of b.tsv, as it is static metadata, while all the other files are actual records.
gzfile.tar.gz
|- a.tsv
|- b.tsv
|- thousand more files.
Using pyspark load, I'm able to load the file into a dataframe. I used the command:
from pyspark.sql import SparkSession

spark = SparkSession.\
    builder.\
    appName("Loading Gzip Files").\
    getOrCreate()
input = spark.read.load('/Users/jeevs/git/data/gzfile.tar.gz',\
    format='com.databricks.spark.csv',\
    sep='\t')
With the intention to filter, I added the filename:
from pyspark.sql.functions import input_file_name
input.withColumn("filename", input_file_name())
Which now generates the data like so:
|_c0 |_c1 |filename |
|b.tsv0000666000076500001440035235677713575350214013124 0ustar netsaintusers1|Lynx 2.7.1|file:///Users/jeevs/git/data/gzfile.tar.gz|
|2|Lynx 2.7|file:///Users/jeevs/git/data/gzfile.tar.gz|
Of course, the filename field is populated with the tar.gz path, making that approach useless.
A more irritating problem is that _c0 is getting populated with filename + garbage + first-row values.
At this point, I'm wondering if the file read itself is going wrong because it is a tar.gz file. When we did v1 of this processing (Spark 0.9), we had another step that loaded the data from S3 onto an EC2 box, extracted it, and wrote it back to S3. I'm trying to get rid of those steps.
Thanks in advance!
Databricks does not support direct *.tar.gz iteration. In order to process the files, they have to be unpacked into a temporary location. Databricks supports bash, which can do the job.
%sh find $source -name "*.tar.gz" -exec tar -xvzf {} -C $destination \;
The above command will unpack all files with the extension *.tar.gz from the source to the destination location.
If the path is passed via dbutils.widgets or is static in %scala or %pyspark, it must be declared as an environment variable.
This can be achieved in %pyspark:
import os
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
Use the following method to load a file, assuming the content is in a *.csv file:
DF = spark.read.format('csv').options(header='true', inferSchema='true').option("mode","DROPMALFORMED").load('/mnt/dl/raw/source/sample.csv')
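Coming back to the original goal of filtering by filename, a possible follow-up (the extraction directory below is an assumption, not from the answer): once the archive is unpacked, each .tsv is an ordinary file, so input_file_name() returns a per-file path and b.tsv can be dropped.
from pyspark.sql.functions import input_file_name

# Hypothetical extraction directory - wherever the %sh step unpacked
# the archive to (must be a location Spark can read, e.g. a DBFS mount).
extracted_dir = '/mnt/dl/raw/extracted/'

df = (spark.read
      .option('sep', '\t')
      .csv(extracted_dir)
      .withColumn('filename', input_file_name()))

# b.tsv is static metadata; keep only the record files.
records = df.filter(~df.filename.endswith('b.tsv'))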

Read csv file from Hadoop using Spark

I'm using spark-shell to read csv files from HDFS.
I can read those csv files using the following command in bash:
bin/hadoop fs -cat /input/housing.csv |tail -5
so this suggests that housing.csv is indeed in HDFS right now.
How can I read it using spark-shell?
Thanks in advance.
sc.textFile("hdfs://input/housing.csv").first()
I tried this way, but it failed.
Include the csv package in the shell and run:
var df = spark.read.format("csv").option("header", "true").load("hdfs://x.x.x.x:8020/folder/file.csv")
8020 is the default port.
You can read this easily with Spark using the csv method or by specifying format("csv"). In your case, either you should not specify hdfs:// at all, or you should specify the complete path, hdfs://localhost:8020/input/housing.csv.
Here is a snippet of code that can read a csv file (dataSchema is a previously defined schema, i.e. a StructType):
val df = spark.
  read.
  schema(dataSchema).
  csv("/input/housing.csv")

apache pig, store result in a txt file

Hello, I'm a new Pig user.
I'm trying to store some data in a txt file, but when I use the STORE command, it creates a folder that contains the following files: _SUCCESS and part-r-00000.
How can I get this result in a single txt file?
Thanks.
This is what STORE output usually looks like.
You can run Hadoop fs commands from inside Pig, so you can write something like the following inside your Pig script (see the documentation here):
fs -getmerge /my/hdfs/output/dir/* /my/local/dir/result.txt
fs -copyFromLocal /my/local/dir/result.txt /my/hdfs/other/output/dir/
Read the files using the cat command and pipe the output into a .txt file using the put command:
hadoop fs -cat /in_dir/part-* | hadoop fs -put - /out_dir/output.txt
or
Merge the files in the folder into an output .txt file using the getmerge command (note that the getmerge destination is a local path):
hadoop fs -getmerge /in_dir/ /out_dir/output.txt
That's the way a MapReduce job writes its output.
As Pig runs MapReduce jobs internally, the job writes output in the form of part files:
part-m-00000 (map output) or part-r-00000 (reduce output).
Let's say you give the following output dir ("/user/output1.txt") in your script; it will then contain:
/user/output1.txt/part-r-00000
/user/output1.txt/_SUCCESS
There may be multiple part files created inside output1.txt; in that case you can merge them into one:
hadoop fs -getmerge /user/output1.txt/* /localdir/output/result.txt
hadoop fs -copyFromLocal /localdir/output/result.txt /user/output/result.txt
