Create a single CSV per partition with Spark - apache-spark

I have a ~10GB dataframe that should be written as a bunch of CSV files, one per partition.
The CSVs should be partitioned by 3 fields: "system", "date_month" and "customer".
Inside each folder exactly one CSV file should be written, and the data inside the CSV file should be ordered by two other fields: "date_day" and "date_hour".
The filesystem (an S3 bucket) should look like this:
/system=foo/date_month=2022-04/customer=CU000001/part-00000-x.c000.csv
/system=foo/date_month=2022-04/customer=CU000002/part-00000-x.c000.csv
/system=foo/date_month=2022-04/customer=CU000003/part-00000-x.c000.csv
/system=foo/date_month=2022-04/customer=CU000004/part-00000-x.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00000-x.c000.csv
/system=foo/date_month=2022-05/customer=CU000002/part-00000-x.c000.csv
/system=foo/date_month=2022-05/customer=CU000003/part-00000-x.c000.csv
/system=foo/date_month=2022-05/customer=CU000004/part-00000-x.c000.csv
I know I can easily achieve that using coalesce(1) but that will only use one worker and I'd like to avoid that.
I've tried this strategy:
mydataframe.
  repartition($"system", $"date_month", $"customer").
  sort("date_day", "date_hour").
  write.
  partitionBy("system", "date_month", "customer").
  option("header", "false").
  option("sep", "\t").
  format("csv").
  save(s"s3://bucket/spool/")
My idea was that each worker would get a different partition, so it could easily sort the data and write a single file in the partition path. After running the code I noticed many CSV files for each partition, something like this:
/system=foo/date_month=2022-05/customer=CU000001/part-00000-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00001-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00002-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00003-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00004-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00005-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00006-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00007-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
[...]
The data in each file is ordered as expected, and concatenating all the files would produce the correct file, but that takes too much time and I'd prefer to rely on Spark.
Is there a way to create a single ordered CSV file per partition, without moving all the data to a single worker with coalesce(1)?
I'm using Scala, if that matters.

sort() (and also orderBy()) triggers a shuffle because it sorts the whole dataframe: Spark range-partitions the data on the sort keys, which undoes your repartition by "system", "date_month" and "customer" and is why each output folder ends up with several files. To sort within each partition instead, use the aptly named sortWithinPartitions:
mydataframe.
  repartition($"system", $"date_month", $"customer").
  sortWithinPartitions("date_day", "date_hour").
  write.
  partitionBy("system", "date_month", "customer").
  option("header", "false").
  option("sep", "\t").
  format("csv").
  save(s"s3://bucket/spool/")

Related

Continuous appending of data on existing tabular data file (CSV, parquet) using PySpark

For a project I need to frequently, but non-periodically, append about one thousand or more data files (tabular data) to one existing CSV or Parquet file with the same schema in Hadoop/HDFS (master=yarn). In the end, I need to be able to do some filtering on the result file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100,000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of one single CSV/Parquet file or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have a different number of rows, I am concerned about getting partitions with different sizes in the resulting CSV/Parquet file. Sometimes all the data may be appended as one new file. Sometimes the data is so big that several files are appended.
And because input files may be appended many times from different users and different sources, I am also concerned that the resulting CSV/Parquet may contain too many part-files for the NameNode to handle.
I have done some small tests appending data on existing CSV / Parquet files and noticed that for each append, a new file was generated - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file inside 'uuid.csv' (which is actually a directory generated by PySpark containing all the pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to simply count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the data in a new file doesn't seem to be very efficient here.
NameNode memory overflow
Then increase the heap size of the NameNode.
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to single files. They append "into a directory", and create new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you'd mentioned, you wanted parquet, so write that then. That'll cause you to have even smaller file sizes in HDFS.
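A rough sketch of that advice (shown here in Scala; the input path and the target file count are assumptions, not from the question):
// Sketch: batch the incoming files into a few larger part files per append.
// "/user/applepy/incoming/" and the choice of 4 output files are hypothetical.
val newData = spark.read.option("header", "true").csv("/user/applepy/incoming/")
newData
  .repartition(4)                 // fewer, larger writer batches -> fewer part files per append
  .write
  .mode("append")
  .parquet("/user/applepy/pyspark_partition/uuid.parquet")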
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest / ETL tools being used, especially when data is streamed in "non-periodically" from Kafka.

How can I append to same file in HDFS(spark 2.11)

I am trying to store stream data into HDFS using Spark Streaming, but it keeps creating new files instead of appending to one single file or a few files.
If it keeps creating n numbers of files, I feel it won't be very efficient.
Code
lines.foreachRDD(f => {
  if (!f.isEmpty()) {
    val df = f.toDF().coalesce(1)
    df.write.mode(SaveMode.Append).json("hdfs://localhost:9000/MT9")
  }
})
In my pom I am using the following dependencies:
spark-core_2.11
spark-sql_2.11
spark-streaming_2.11
spark-streaming-kafka-0-10_2.11
As you already realized, Append in Spark means write-to-existing-directory, not append-to-file.
This is intentional and desired behavior (think about what would happen if the process failed in the middle of "appending", even if the format and file system allowed that).
Operations like merging files should be applied by a separate process, if necessary at all, which ensures correctness and fault tolerance. Unfortunately this requires a full copy which, for obvious reasons, is not desirable on a batch-to-batch basis.
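If such a separate compaction pass is needed, a minimal sketch (the paths and the target file count are placeholders) run as its own job could look like this:
// Read the many small part files, rewrite them as a few larger ones, then swap directories.
val small = spark.read.json("hdfs://localhost:9000/MT9")
small
  .repartition(8)                                   // arbitrary target for "a few larger files"
  .write
  .mode("overwrite")
  .json("hdfs://localhost:9000/MT9_compacted")      // write elsewhere, then swap when done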
It's creating a file for each RDD because you are reinitialising the DataFrame variable every time. I would suggest declaring a DataFrame variable outside the loop, initialised to null, unioning each RDD's DataFrame into it inside foreachRDD, and writing the outer DataFrame once after the loop, as sketched below.
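A literal sketch of that suggestion (assuming, as in the question's code, that lines is a DStream of case-class records and that import spark.implicits._ is in scope); note the final write can only run once the streaming context has been stopped:
// Accumulate every batch into one driver-side DataFrame, then write once at the end.
// foreachRDD runs its body on the driver, so mutating the outer var is fine.
var combined: org.apache.spark.sql.DataFrame = null
lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val batchDf = rdd.toDF()
    combined = if (combined == null) batchDf else combined.union(batchDf)
  }
}
// ...later, after the streaming context has been stopped:
// combined.coalesce(1).write.mode(SaveMode.Append).json("hdfs://localhost:9000/MT9")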

Adding Spark Partition as a column without reading all files

Using spark 2.1.
What is the best way to ensure that the partition column used when writing a dataframe out to Parquet gets added back into the dataframe after reading it back in, without using /* across all of the files? I just want to read s3a://my/path/part={2018-*} and make sure that the part column I originally used becomes available when I read it.
I thought the basePath option takes care of this, by adding any partition directories below that path back as columns, but I can't seem to get it to work.
Tried this out:
My files are pretty standard and look like this, and I want part added back in as a column:
s3a://my/path/part=20170101
s3a://my/path/part=20170102
This is not working:
spark.read
  .option("BasePath", "s3a://my/path/")
  .parquet(filePath)
Am I just thinking about this incorrectly, and should I be reading all of the files and then filtering afterwards? I thought a main benefit of partitioning by a column is that you can then read just a subset of the files by using the partition.
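For reference, a minimal sketch of how the option is usually written in the Spark documentation (basePath, lower-case b), combined with a partition glob for the layout above; whether that spelling is the issue here is not confirmed in the thread:
// basePath marks where partition discovery starts, so the part=... directory
// names come back as a column; only the globbed partitions are read.
val df = spark.read
  .option("basePath", "s3a://my/path/")
  .parquet("s3a://my/path/part=2017*")
df.printSchema()   // should now include the "part" column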

Spark process file in chunks

I would like to process chunks of data (from a csv file) and then do some analysis within each partition/chunk.
How do I do this and then process these multiple chunks in parallel fashion? I'd like to run map and reduce on each chunk
I don't think you can read only part of a file. Also I'm not quite sure if I understand your intent correctly or if you understood the concept of Spark correctly.
If you read a file and apply a map function on the Dataset/RDD, Spark will automatically process the function in parallel on your data.
That is, each worker in your cluster will be assigned a partition of your data, i.e. will process "n%" of the data. Which data items end up in the same partition is decided by the partitioner. By default, Spark uses a Hash Partitioner.
(Alternatively to map, you can apply mapPartitions.)
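For instance, a small sketch (the input path is a placeholder, and an active SparkSession named spark is assumed):
import spark.implicits._   // encoder for the Int result below
// Placeholder input; each partition of this Dataset is one "chunk" of the CSV.
val rows = spark.read.option("header", "true").csv("data/input.csv")
// mapPartitions is called once per partition, so per-chunk analysis goes inside it.
val perChunkCounts = rows.mapPartitions(chunk => Iterator(chunk.size))
perChunkCounts.show()      // one row per partition: that chunk's row count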
Here are some thoughts that came to my mind:
partition your data using the partitionBy method and create your own partitioner. This partitioner can for example put the first n rows into partition 1, the next n rows into partition 2, etc.
If your data is small enough to fit on the driver, you can read the whole file, collect it into an array, and skip the desired number of rows (in the first run, no row is skipped), take the next n rows, and then create an RDD again of these rows.
You can preprocess the data, create the partitions somehow, i.e. containing the n%, and then store it again. This will create different files on your disk/HDFS: part-00000, part-00001, etc. Then in your actual program you can read just the desired part files, one after the other (see the sketch below).
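A rough sketch of that third idea, with placeholder paths and an arbitrary chunk count:
// Preprocess once into 10 chunks (placeholder paths, arbitrary chunk count)...
spark.read.option("header", "true").csv("data/input.csv")
  .repartition(10)                        // -> part-00000 ... part-00009
  .write.mode("overwrite")
  .csv("data/chunks/")
// ...then, in the actual program, read just one chunk at a time:
val chunk0 = spark.read.csv("data/chunks/part-00000*")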

spark: dataframe.count yields way more rows than printing line by line or show()

New to Spark; using Databricks. Really puzzled.
I have this dataFrame: df.
df.count() yields Long = 5460
But if I print line by line:
df.collect.foreach(println) I get only 541 rows printed out. Similarly, df.show(5460) only shows 1017 rows. What could be the reason?
A related question: how can I save "df" with Databricks? And where does it save to? -- I tried to save before but couldn't find the file afterwards. I load the data by mounting an S3 bucket, if that's relevant.
Regarding your first question, Databricks truncates output by default. This applies both to the text output in cells and to the output of display(). I would trust .count().
Regarding your second question, there are four types of places you can save on Databricks:
To Hive-managed tables using df.write.saveAsTable(). These will end up in an S3 bucket managed by Databricks, which is mounted to /user/hive/warehouse. Note that you will not have access to the AWS credentials to work with that bucket. However, you can use the Databricks file utilities (dbutils.fs.*) or the Hadoop filesystem APIs to work with the files, should you need to.
Local SSD storage. This is best done with persist() or cache() but, if you really need to, you can write to, for example, /tmp using df.write.save("/dbfs/tmp/...").
Your own S3 buckets, which you need to mount.
To /FileStore/, which is the only "directory" whose files you can download directly from your cluster. This is useful, for example, for writing CSV files you want to bring into Excel immediately. You write the file and output a "Download File" HTML link into your notebook.
For more details see the Databricks FileSystem Guide.
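A hedged sketch of options (1) and (4) above (the table name and output path are placeholders):
// Option 1: a Hive-managed table, stored under /user/hive/warehouse.
df.write.mode("overwrite").saveAsTable("my_results")
// Option 4: a single CSV under /FileStore/ so it can be downloaded from the workspace.
df.coalesce(1)
  .write.mode("overwrite")
  .option("header", "true")
  .csv("/FileStore/exports/my_results_csv")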
The difference could be bad source data. Spark is lazy by nature, so it's not going to build a bunch of columns and fill them in just to count rows. So the data may not parse when you actually execute against it, or the rows may be null. Or your schema doesn't allow nulls for certain columns and they are null when the data is fully parsed. Or you are modifying the data between your count, collect and show. There is just not enough detail to tell for sure. You can open up a spark shell, create a small piece of data, and test those conditions by turning that data into a dataframe. Change the schema to allow and not allow nulls, or add nulls (and no nulls) to the source data. Make the source data strings but make the schema require integers.
As far as saving your DataFrame: you create a DataFrameWriter with write, then define the file format you want to save it as, and then the path. This example saves a Parquet file. There are many other file types and write options that are permitted here.
df.write.parquet("s3://myfile")
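For example (a sketch; the path is a placeholder), the same writer can produce tab-separated CSV with a header instead:
df.write
  .mode("overwrite")               // choose how to handle an existing path
  .option("header", "true")
  .option("sep", "\t")
  .csv("s3://bucket/my-output/")   // placeholder bucket/path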
