Skip multiple lines of headers while reading multiple csv files in spark - apache-spark

I am trying to read multiple csv files using Spark. I need to skip more than one header line from each csv file.
I am able to achieve this with the code below:
rdd = df.rdd
schema = df.schema
# keep only the rows whose index is past the header block
rdd_without_header = rdd.zipWithIndex().filter(lambda row_index: row_index[1] > skip_header).keys()
df = spark_session.createDataFrame(rdd_without_header, schema=schema)
This code works fine, but when the input is multiple compressed files in gz format the operation takes very long to complete: roughly 10x slower than with uncompressed files.
Since I want to skip multiple header lines from every file, I cannot leverage Spark's built-in header option
option("header", "true")
What would be the best and most optimized way to handle this use case?
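One possible direction, sketched here as an assumption rather than a tested answer: because gzip is not splittable, each .gz file read with textFile lands in its own partition, so the header lines can be dropped per partition instead of zipWithIndex-ing the whole dataset. The glob path and skip_header value below are placeholders, and passing an RDD of strings to spark.read.csv assumes PySpark 2.2+.
from itertools import islice

skip_header = 2  # hypothetical number of header lines per file

# one partition per .gz file, since gzip cannot be split
lines = spark_session.sparkContext.textFile("path/to/folder/*.csv.gz")

# drop the first skip_header lines of every partition (i.e. of every file)
data_lines = lines.mapPartitions(lambda it: islice(it, skip_header, None))

# parse the remaining csv rows with the known schema
df = spark_session.read.csv(data_lines, schema=schema)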

Related

pyspark apply function in parallel to data in many csv files

Can pyspark be used to efficiently read and process many .csv files? As a minimal example, the data are many .csv files, each with 5 rows and 2 columns. My real use case is many thousands of files, each with a few million rows and hundreds of columns (approx. 10 GB per file), on a filesystem or a cluster.
A quick and dirty pandas implementation follows (assuming fns is a list of .csv filenames, and processing is implemented as the max of column-means), but it will be slow because the files are read serially and processing uses a single core.
import pandas as pd

func = 'mean'  # processing: per-column mean, then the max of those means
result = []
for fn in fns:
    df = pd.read_csv(fn, header=None)
    result.append(df.agg(func).max())
My expectation is that pyspark can both read and process files in parallel.
If all your files have the same schema then you can read them all at once using
spark.read.csv
Since your files don't seem to carry a schema, you can also provide a custom one:
import pyspark.sql.types as t

schema = t.StructType([t.StructField('id', t.IntegerType(), True),
                       t.StructField('name', t.StringType(), True)])
df = spark.read.csv('path/to/folder', schema=schema)
# perform your aggregations on df now
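As a follow-up, a hypothetical sketch of the per-file processing described in the question (the max of the column means for each file), using input_file_name() to track which file each row came from; the column names match the assumed schema above.
import pyspark.sql.functions as f

# tag every row with its source file, then aggregate per file
per_file = (df
    .withColumn('source_file', f.input_file_name())
    .groupBy('source_file')
    .agg(f.avg('id').alias('id_mean')))  # one mean per numeric column

# with several numeric columns, f.greatest() over the mean columns would give the
# per-file "max of column-means"; here 'id' is the only numeric column
result = [row['id_mean'] for row in per_file.collect()]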

Custom File Format to partition data while writing

Hi, I want to save my Spark dataframe with a custom file format,
such that it partitions the data into different files while writing.
I also need a single part file for each partition key.
I have tried extending TextBasedFileFormat and changing the writer to suit my needs.
The data is getting partitioned while writing, without a shuffle,
but I suspect that each RDD partition will write its data to a different part file.
When you write the dataframe, each partition of the underlying RDD is written by a separate task. Each of these RDD partitions might contain data belonging to different partition keys, so each task can end up creating multiple part files.
To solve this, you have to repartition your dataframe by the partitionKey. This involves a shuffle, and all the data for a given partitionKey ends up in the same RDD partition. This can be done by:
val newDf = df.repartition("partitionKey")
Now this dataframe can be written in any file format (say parquet, csv etc.) and there should be one file per partition. If a file grows too big, Spark might still create multiple files; this can be controlled with the config "spark.sql.files.maxRecordsPerFile".
val newDf = df.repartition("partitionKey")
newDf.write.partitionBy("partitionKey").parquet("<directory_path>")
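For reference, a hedged PySpark-flavoured sketch of the same idea, including the maxRecordsPerFile cap mentioned above (the limit value is purely illustrative):
# cap the number of rows written per part file (illustrative value)
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)

(df.repartition("partitionKey")
   .write
   .partitionBy("partitionKey")
   .parquet("<directory_path>"))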

How to convert multiple parquet files into TFrecord files using SPARK?

I would like to produce stratified TFrecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in SPARK, but this apparently does not work together with a write.partitionBy() operation. Therefore, I have not found another way than to try to work in two steps:
Repartition the dataframe according to my condition using partitionBy(), and write the resulting partitions to parquet files.
Read those parquet files to convert them into TFrecord files with the tensorflow-connector plugin.
It is the second step that I'm unable to do efficiently. My idea was to read in the individual parquet files on the executors and immediately write them into TFrecord files. But this needs access to the SQLContext which can only be done in the Driver (discussed here) so not in parallel. I would like to do something like this:
# List all parquet files to be converted
import glob, os
from pyspark.sql import SparkSession

files = glob.glob('/path/*.parquet')
sc = SparkSession.builder.getOrCreate().sparkContext
sc.parallelize(files, 2).foreach(lambda parquetFile: convert_parquet_to_tfrecord(parquetFile))
Could I construct the function convert_parquet_to_tfrecord that would be able to do this on the executors?
I've also tried just using the wildcard when reading all the parquet files:
SQLContext(sc).read.parquet('/path/*.parquet')
This indeed reads all parquet files, but unfortunately not into individual partitions. It appears that the original structure gets lost, so it doesn't help me if I want the exact contents of the individual parquet files converted into TFrecord files.
Any other suggestions?
Try spark-tfrecord.
Spark-TFRecord is a tool similar to spark-tensorflow-connector, but it supports partitionBy. The following example shows how to partition a dataset.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// create a dataframe
val df = Seq((8, "bat"), (8, "abc"), (1, "xyz"), (2, "aaa")).toDF("number", "word")
val tf_output_dir = "/tmp/tfrecord-test"
// dump the tfrecords to files, partitioned by "number"
df.repartition(3, col("number")).write.mode(SaveMode.Overwrite)
  .partitionBy("number").format("tfrecord").option("recordType", "Example")
  .save(tf_output_dir)
More information can be found at the GitHub repo:
https://github.com/linkedin/spark-tfrecord
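Since much of this thread is PySpark, here is a hypothetical Python translation of the Scala example above; it assumes the spark-tfrecord package is on the classpath (for example added via --packages).
# same toy dataframe and options as the Scala snippet
df = spark.createDataFrame([(8, "bat"), (8, "abc"), (1, "xyz"), (2, "aaa")],
                           ["number", "word"])
(df.repartition(3, "number")
   .write.mode("overwrite")
   .partitionBy("number")
   .format("tfrecord")
   .option("recordType", "Example")
   .save("/tmp/tfrecord-test"))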
If I understood your question correctly, you want to write the partitions locally on the workers' disk.
If that is the case then I would recommend looking at spark-tensorflow-connector's instructions on how to do so.
This is the code that you are looking for (as stated in the documentation linked above):
myDataFrame.write.format("tfrecords").option("writeLocality", "local").save("/path")
On a side note, if you are worried about efficiency, why are you using PySpark? It would be better to use Scala instead.

How to read ".gz" compressed file using spark DF or DS?

I have a compressed file in .gz format. Is it possible to read the file directly using Spark DF/DS?
Details: the file is a csv, tab delimited.
Reading a compressed csv is done in the same way as reading an uncompressed csv file. For Spark version 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):
val df = spark.read.option("sep", "\t").csv("file.csv.gz")
PySpark:
df = spark.read.csv("file.csv.gz", sep='\t')
The only extra consideration to take into account is that a gz file is not splittable, so Spark needs to read the whole file on a single core, which slows things down. After the read is done, the data can be shuffled to increase parallelism.
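For example, a minimal sketch (the file name and partition count are only illustrative) that reads the gzipped csv on a single core and then repartitions so downstream work runs in parallel:
df = spark.read.csv("file.csv.gz", sep="\t")
# spread the data across more partitions before doing heavy transformations
df = df.repartition(16)  # hypothetical parallelism; choose based on cluster size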

PySpark: Writing input files to separate output files without repartitioning

I have a sequence of very large daily gzipped files. I'm trying to use PySpark to re-save all the files in S3 in Parquet format for later use.
If, for a single file (for example, 2012-06-01), I do:
dataframe = spark.read.csv('s3://mybucket/input/20120601.gz', schema=my_schema, header=True)
dataframe.write.parquet('s3://mybucket/output/20120601')
it works, but since gzip isn't splittable it runs on a single host and I get no benefit from using the cluster.
I tried reading in a chunk of files at once and using partitionBy to write the output to daily files like this (for example, reading in a month):
dataframe = spark.read.csv('s3://mybucket/input/201206*.gz', schema=my_schema, header=True)
dataframe.write.partitionBy('dayColumn').parquet('s3://mybucket/output/')
This time, individual files are read by different executors like I want, but the executors later die and the process fails. I believe that because the files are so large, and partitionBy is somehow using unnecessary resources (a shuffle?), the tasks are crashing.
I don't actually need to repartition my dataframe since this is just a 1:1 mapping. Is there any way to make each individual task write to a separate, explicitly named parquet output file?
I was thinking something like
def write_file(date):
    # get input/output locations from date
    dataframe = spark.read.csv(input_location, schema=my_schema, header=True)
    dataframe.write.parquet(output_location)

spark.sparkContext.parallelize(my_dates).foreach(write_file)
except this doesn't work since you can't broadcast the spark session to the cluster. Any suggestions?
Writing input files to separate output files without repartitioning
TL;DR This is what your code is already doing.
partitionBy is causing an unnecessary shuffle
No. DataFrameWriter.partitionBy doesn't shuffle at all.
it works, but since gzip isn't splittable
You can:
Drop compression completely - Parquet uses internal compression.
Use splittable compression like bzip2 (a minimal sketch follows this list).
Unpack the files to temporary storage before submitting the job.
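A minimal sketch of the splittable-compression option, assuming a hypothetical bzip2-compressed copy of the daily file; unlike gzip, bzip2 can be split, so the read itself parallelizes:
# .bz2 is splittable, so this read can be served by multiple tasks
dataframe = spark.read.csv('s3://mybucket/input/20120601.bz2',
                           schema=my_schema, header=True)
dataframe.write.parquet('s3://mybucket/output/20120601')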
If you are concerned about the resources used by partitionBy (it might open a larger number of files for each executor thread), you can actually shuffle to improve performance, as in DataFrame partitionBy to a single Parquet file (per partition). A single file is probably too much, but
dataframe \
    .repartition(n, 'dayColumn', 'someOtherColumn') \
    .write.partitionBy('dayColumn') \
    .save(...)
where someOtherColumn is chosen to have reasonable cardinality, should improve things.
