PySpark fails to read a gzip of multiple JSON files - apache-spark

I have no issue in reading a standalone JSON file which consists of a single line in the format of {xxx...}. However, when I compress it using tar -zcvf into one-file.json.gz and attempt to read it, I receive a single column named
root
|-- _corrupt_record: string (nullable = true)
This is the code I use to read that gzip file:
df = (
spark.read.option("recursiveFileLookup", "true")
.json("../../one-file.json.gz")
)
When I try to use AWS Glue, it throws exceptions like:
"Failure Reason": "Unable to parse file: one-file.json.gz\n"
I want to know what's wrong here.

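The reason is simple: tar -zcvf does not produce a gzip-compressed JSON file; it produces a tar archive that is then compressed with gzip. Spark can decompress gzip transparently, but it cannot read a tar archive, so what it sees is not JSON. You have to extract the archive before Spark reads it (or compress the file with plain gzip instead of tar). You can use the tarfile module to do it like:
import tarfile
# Extract the archive into a directory ("extracted" is an example name)
with tarfile.open("../../one-file.json.gz") as tar_file:
    tar_file.extractall("extracted")
# Read the extracted JSON files, not the TarFile object
df = spark.read.option("recursiveFileLookup", "true").json("extracted")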
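To see the difference between a plain gzip file and a gzip-compressed tar archive, a minimal sketch such as the following may help. The helper names classify_gz and make_samples and the file names plain.json.gz and tarred.json.gz are made up for illustration; only the standard library is used.

```python
import gzip
import os
import tarfile
import tempfile


def classify_gz(path):
    """Tell a (possibly compressed) tar archive apart from a plain gzip file."""
    if tarfile.is_tarfile(path):
        return "tar archive"
    with open(path, "rb") as f:
        if f.read(2) == b"\x1f\x8b":  # gzip magic bytes
            return "plain gzip"
    return "not gzip"


def make_samples(workdir):
    """Compress the same JSON file two ways (file names are hypothetical)."""
    json_path = os.path.join(workdir, "one-file.json")
    with open(json_path, "w") as f:
        f.write('{"a": 1}\n')

    # Like `gzip one-file.json`: the decompressed bytes are the JSON itself.
    plain_gz = os.path.join(workdir, "plain.json.gz")
    with open(json_path, "rb") as src, gzip.open(plain_gz, "wb") as dst:
        dst.write(src.read())

    # Like `tar -zcvf`: the decompressed bytes are a tar archive, not JSON.
    tarred_gz = os.path.join(workdir, "tarred.json.gz")
    with tarfile.open(tarred_gz, "w:gz") as tar:
        tar.add(json_path, arcname="one-file.json")

    return plain_gz, tarred_gz


with tempfile.TemporaryDirectory() as workdir:
    plain_gz, tarred_gz = make_samples(workdir)
    print(classify_gz(plain_gz))
    print(classify_gz(tarred_gz))
```

A file that classifies as a plain gzip can be passed to spark.read.json directly; one that classifies as a tar archive has to be extracted first.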

Related

Extract tar.gz{some integer} in python

I am trying to extract a file whose name has this format: filename.tar.gz10
I have tried multiple ways, but all of them fail with an unknown format error. Extraction works fine for files ending in tar.gz00. I tried renaming the file, but it still does not work.
Here is what I have tried:
import tarfile
file = tarfile.open('filename.tar.gz10')
file.extractall('./extracted_path')
file.close()
Another way is,
shutil.unpack_archive('./filename.tar.gz10', './extracted_path', 'tar.gz17')
Thanks for your help in advance.
This could be because the archive was split into smaller chunks. On Linux you can do that with the split -b command, so one big file is now actually multiple smaller ones, named like:
file.tar.gz01
file.tar.gz02
file.tar.gz03
file.tar.gz04
etc...
You won't be able to decompress these files individually, so you have to concatenate them into one file first and then decompress it.
To verify whether it was split, run file {filename}; if it does not recognize the file as a gzip compressed archive, then it is probably a chunk of a split archive (this is why you get the unknown format error).
You can try to do the following:
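from glob import glob
import os
path = '/path/to/' # location of your files
# list all chunks (file.tar.gz01, file.tar.gz02, ...), sorted so they are joined in order
list_of_files = sorted(glob(path + '*.tar.gz[0-9]*'))
# create bash command to concatenate the chunks into a single archive
bash_command = 'cat ' + ' '.join(list_of_files) + ' > ' + path + 'filename.tar.gz'
os.system(bash_command)
# the joined filename.tar.gz can now be extracted, e.g. with the tarfile module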
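The same concatenate-then-extract idea can be sketched in pure Python, without shelling out. This is only a sketch: the function names join_chunks and join_and_extract are made up, and it assumes the chunk suffixes are zero-padded (gz01, gz02, ...) so that sorting by name gives the right order, and that the joined file's name does not match the chunk pattern.

```python
import glob
import os
import shutil
import tarfile


def join_chunks(pattern, joined_path):
    """Concatenate split archive chunks, sorted by name, into one file."""
    chunks = sorted(glob.glob(pattern))
    with open(joined_path, "wb") as out:
        for chunk in chunks:
            with open(chunk, "rb") as part:
                shutil.copyfileobj(part, out)  # stream, so large chunks are fine
    return chunks


def join_and_extract(pattern, joined_path, dest):
    """Join the chunks, then extract the resulting tar archive into dest."""
    join_chunks(pattern, joined_path)
    with tarfile.open(joined_path) as tar:  # compression is auto-detected
        tar.extractall(dest)


# Example call (hypothetical paths):
# join_and_extract("/path/to/filename.tar.gz[0-9]*",
#                  "/path/to/joined.tgz",
#                  "./extracted_path")
```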

extract gz files using hdfs/spark

I have large gzip files stored in an HDFS location:
- /dataset1/sample.tar.gz --> contains 1.csv & 2.csv .... and so on
I would like to extract them to:
/dataset1/extracted/1.csv
/dataset1/extracted/2.csv
/dataset1/extracted/3.csv
.........................
.........................
/dataset1/extracted/1000.csv
Are there any HDFS commands that can be used to extract a tar.gz file (without copying it to the local machine), or can this be done with Python/Scala Spark?
I tried using Spark, but Spark cannot parallelize reading a gzip file, and the file is huge (about 50 GB).
I want to split the gzip file and use the pieces for Spark aggregations.

How to move files written with Pandas on Spark cluster to HDFS?

I'm running a Spark job in cluster mode and writing a few files using Pandas. I think they are being written to a temp directory; now I want to move these files to HDFS, or write them to HDFS directly.
You have multiple options:
Convert the Pandas DataFrame into a PySpark DataFrame and simply save it to HDFS:
spark_df = spark.createDataFrame(pandas_df)
spark_df.write.parquet("hdfs:///path/on/hdfs/file.parquet")
Save the file locally using Pandas and use subprocess to copy the file to HDFS:
import subprocess
command = "hdfs dfs -copyFromLocal -f local/file.parquet /path/on/hdfs".split()
result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout)
print(result.stderr)
Save the file locally and use a third-party library, hdfs3, to upload it to HDFS:
from hdfs3 import HDFileSystem
hdfs = HDFileSystem()
hdfs.put("local/file.parquet", "/path/on/hdfs/file.parquet")  # put uploads a local file; cp copies within HDFS

Pyspark: Load a tar.gz file into a dataframe and filter by filename

I have a tar.gz file that contains multiple files. The hierarchy looks as below. My intention is to read the tar.gz file and filter out the contents of b.tsv, which is static metadata, whereas all the other files are actual records.
gzfile.tar.gz
|- a.tsv
|- b.tsv
|- thousand more files.
With PySpark's load, I'm able to load the file into a dataframe. I used the command:
from pyspark.sql import SparkSession
spark = SparkSession.\
    builder.\
    appName("Loading Gzip Files").\
    getOrCreate()
input = spark.read.load('/Users/jeevs/git/data/gzfile.tar.gz',\
    format='com.databricks.spark.csv',\
    sep = '\t')
With the intention to filter, I added the filename
from pyspark.sql.functions import input_file_name
input.withColumn("filename", input_file_name())
Which now generates the data like so:
|_c0 |_c1 |filename |
|b.tsv0000666000076500001440035235677713575350214013124 0ustar netsaintusers1|Lynx 2.7.1|file:///Users/jeevs/git/data/gzfile.tar.gz|
|2|Lynx 2.7|file:///Users/jeevs/git/data/gzfile.tar.gz|
Of course, the filename field is populated with the tar.gz file itself, making that approach useless.
A more irritating problem is that _c0 is populated with the file name, some garbage, and the first row's values.
At this point, I'm wondering whether the file read itself is going wrong because it is a tar.gz file. When we did v1 of this processing (Spark 0.9), we had another step that loaded the data from S3 onto an EC2 box, extracted it, and wrote it back to S3. I'm trying to get rid of those steps.
Thanks in advance!
Databricks does not support iterating over *.tar.gz files directly. To process the files, they have to be extracted into a temporary location first. Databricks supports bash, which can do the job:
%sh find $source -name '*.tar.gz' -exec tar -xvzf {} -C $destination \;
The command above extracts all files with the extension *.tar.gz found in the source location into the destination location.
If the path is passed via dbutils.widgets, or is a static value in %scala or %pyspark, it must be declared as an environment variable.
This can be done in %pyspark:
import os
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
Use the following to load a file, assuming the content is in a *.csv file:
DF = spark.read.format('csv').options(header='true', inferSchema='true').option("mode","DROPMALFORMED").load('/mnt/dl/raw/source/sample.csv')
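Since the original goal was to leave out b.tsv, the extraction step can also do the filtering. The following is a minimal sketch using Python's tarfile module instead of the shell; the function name extract_except and the paths in the usage comment are made up for illustration.

```python
import os
import tarfile


def extract_except(archive_path, dest, skip_names):
    """Extract every regular file from the archive except those whose
    base name is listed in skip_names. Returns the extracted member names."""
    extracted = []
    with tarfile.open(archive_path) as tar:  # compression is auto-detected
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            if os.path.basename(member.name) in skip_names:
                continue  # e.g. the static metadata file
            tar.extract(member, dest)
            extracted.append(member.name)
    return extracted


# Example call (hypothetical paths):
# extract_except("/dbfs/mnt/dl/raw/source/gzfile.tar.gz",
#                "/dbfs/mnt/dl/raw/extracted/",
#                {"b.tsv"})
```

Once Spark reads the extracted directory rather than the archive, input_file_name() should report the individual .tsv path for each row, so further filtering by file name becomes possible.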

Read csv file from Hadoop using Spark

I'm using spark-shell to read csv files from HDFS.
I can read the csv file using the following command in bash:
bin/hadoop fs -cat /input/housing.csv |tail -5
so this suggests that housing.csv is indeed in HDFS right now.
How can I read it using spark-shell?
Thanks in advance.
sc.textFile("hdfs://input/housing.csv").first()
I tried this way, but failed.
Include the csv package in the shell and run:
var df = spark.read.format("csv").option("header", "true").load("hdfs://x.x.x.x:8020/folder/file.csv")
8020 is the default port.
Thanks,
Ash
You can read this easily with Spark using the csv method or by specifying format("csv"). In your case, hdfs://input/housing.csv is parsed with input as the NameNode host rather than as a directory, so either do not specify hdfs:// at all, or specify the complete path, hdfs://localhost:8020/input/housing.csv.
Here is a snippet of code that can read the csv:
val df = spark.
read.
schema(dataSchema).
csv(s"/input/housing.csv")
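The reason hdfs://input/housing.csv fails can be illustrated with generic URI parsing. The sketch below uses Python's urllib.parse rather than Hadoop's own parser, on the assumption that both follow the usual scheme://authority/path structure; split_hdfs_uri is a made-up helper name.

```python
from urllib.parse import urlparse


def split_hdfs_uri(uri):
    """Split a URI into (scheme, authority, path)."""
    parsed = urlparse(uri)
    return parsed.scheme, parsed.netloc, parsed.path


# "input" lands in the authority (host) slot, not in the path.
print(split_hdfs_uri("hdfs://input/housing.csv"))
# With an explicit host and port, the path is what was intended.
print(split_hdfs_uri("hdfs://localhost:8020/input/housing.csv"))
# With no scheme at all, the whole string is the path.
print(split_hdfs_uri("/input/housing.csv"))
```

In the first case the path that remains is /housing.csv on a host called input, which is not where the file lives.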
