Custom File Format to partition data while writing - apache-spark

Hi I want to save my spark dataframe to a file with custom file format,
such that it partitions data to different files while writing to the file.
Also I need single part file for each partition key.
I have tried extending TextBasedFileFormat and change writer to suit my needs.
The data is getting partitioned while writing to file without shuffle.
But I feel each rdd partition will write data to different part file

When you write the dataframe, each partition of underlying RDD will be written by separate tasks. Now each of these RDD partitions might correspond to data which belongs to different partition key. So each task will end up creating multiple part files.
To solve this, you have to repartition your dataframe by the partitionKey. This will involve a shuffle and all the data corresponding to same partitionKey will come into same RDD partition. This can be done by -
val newDf = df.repartition("partitionKey")
Now this RDD can be written to any file format (say parquet, csv etc) and their should be 1 file per partition. If the file size is going big, it might create multiple files. This can be controlled by config "spark.sql.files.maxRecordsPerFile".
val newDf = df.repartition("partitionKey")
newDf.write.partitionBy("partitionKey").parquet("<directory_path>")

Related

Is one parquet files under the parquet folder a partition?

I saved my dataframe as parquet format
df.write.parquet('/my/path')
When checking on HDFS, I can see that there is 10 part-xxx.snappy.parquet files under the parquet directory /my/path
My question is: is one part-xxx.snappy.parquet file correspond to a partition of my dataframe ?
Yes, part-** files are created based on number of partitions in the dataframe while writing to HDFS.
To check number of partitions in the dataframe:
df.rdd.getNumPartitions()
To control number of files writing to filesystem we can use .repartition (or) .coalesce() (or) dynamically based on our requirement.
Yes, this creates one file per Spark-partition.
Note, that you can also partition files by some attribute:
df.write.partitionBy("key").parquet("/my/path")
in such case Spark is going to create up to Spark-partition number of files for each parquet-partition. Common way to reduce number of files in such case is to repartition data by key before writing (this effectively creates one file per partition).

How to write outputs of spark streaming application to a single file

I'm reading data from Kafka using spark streaming and passing to py file for prediction. It returns predictions as well as the original data. It's saving the original data with its predictions to file however it is creating a single file for each RDD.
I need a single file consisting of all the data collected till the I stop the program to be saved to a single file.
I have tried writeStream it does not create even a single file.
I have tried to save it to parquet using append but it creates multiple files that is 1 for each RDD.
I tried to write with append mode still multiple files as output.
The below code creates a folder output.csv and enters all the files into it.
def main(args: Array[String]): Unit = {
val ss = SparkSession.builder()
.appName("consumer")
.master("local[*]")
.getOrCreate()
val scc = new StreamingContext(ss.sparkContext, Seconds(2))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer"->
"org.apache.kafka.common.serialization.StringDeserializer",
"value.deserializer">
"org.apache.kafka.common.serialization.StringDeserializer",
"group.id"-> "group5" // clients can take
)
mappedData.foreachRDD(
x =>
x.map(y =>
ss.sparkContext.makeRDD(List(y)).pipe(pyPath).toDF().repartition(1)
.write.format("csv").mode("append").option("truncate","false")
.save("output.csv")
)
)
scc.start()
scc.awaitTermination()
I need to get just 1 file with all the statements one by one collected while streaming.
Any help will be appreciated, thank you in anticipation.
You cannot modify any file in hdfs once it has been written. If you wish to write the file in realtime(append the blocks of data from streaming job in the same file every 2 seconds), its simply isn't allowed as hdfs files are immutable. I suggest you try to write a read logic that reads from multiple files, if possible.
However, if you must read from a single file, I suggest either one of the two approaches, after you have written output to a single csv/parquet folder, with "Append" SaveMode(which will create part files for each block you write every 2 seconds).
You can create a hive table on top of this folder read data from that table.
You can write a simple logic in spark to read this folder with multiple files and write it to another hdfs location as a single file using reparation(1) or coalesce(1), and read the data from that location. See below:
spark.read.csv("oldLocation").coalesce(1).write.csv("newLocation")
repartition - its recommended to use repartition while increasing no of partitions, because it involve shuffling of all the data.
coalesce- it’s is recommended to use coalesce while reducing no of partitions. For example if you have 3 partitions and you want to reduce it to 2 partitions, Coalesce will move 3rd partition Data to partition 1 and 2. Partition 1 and 2 will remains in same Container.but repartition will shuffle data in all partitions so network usage between executor will be high and it impacts the performance.
Performance wise coalesce performance better than repartition while reducing no of partitions.
So while writing use option as coalesce.
For Ex: df.write.coalesce

Spark - Stream kafka to file that changes every day?

I have a kafka stream I will be processing in spark. I want to write the output of this stream to a file. However, I want to partition these files by day, so everyday it will start writing to a new file. Can something like this be done? I want this to be left running and when a new day occurs, it will switch to write to a new file.
val streamInputDf = spark.readStream.format("kafka")
.option("kafka.bootstrapservers", "XXXX")
.option("subscribe", "XXXX")
.load()
val streamSelectDf = streamInputDf.select(...)
streamSelectDf.writeStream.format("parquet)
.option("path", "xxx")
???
Adding partition from spark can be done with partitionBy provided in
DataFrameWriter for non-streamed or with DataStreamWriter for
streamed data.
Below are the signatures :
public DataFrameWriter partitionBy(scala.collection.Seq
colNames)
DataStreamWriter partitionBy(scala.collection.Seq colNames)
Partitions the output by the given columns on the file system.
DataStreamWriter partitionBy(String... colNames) Partitions the
output by the given columns on the file system.
Description :
partitionBy public DataStreamWriter partitionBy(String... colNames)
Partitions the output by the given columns on the file system. If
specified, the output is laid out on the file system similar to Hive's
partitioning scheme. As an example, when we partition a dataset by
year and then month, the directory layout would look like:
- year=2016/month=01/
- year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize
physical data layout. It provides a coarse-grained index for skipping
unnecessary data reads when queries have predicates on the partitioned
columns. In order for partitioning to work well, the number of
distinct values in each column should typically be less than tens of
thousands.
Parameters: colNames - (undocumented) Returns: (undocumented) Since:
2.0.0
so if you want to partition data by year and month spark will save the data to folder like:
year=2019/month=01/05
year=2019/month=02/05
Option 1 (Direct write):
You have mentioned parquet - you can use saving as a parquet format with:
df.write.partitionBy('year', 'month','day').format("parquet").save(path)
Option 2 (insert in to hive using same partitionBy ):
You can also insert into hive table like:
df.write.partitionBy('year', 'month', 'day').insertInto(String tableName)
Getting all hive partitions:
Spark sql is based on hive query language so you can use SHOW PARTITIONS
To get list of partitions in the specific table.
sparkSession.sql("SHOW PARTITIONS partitionedHiveParquetTable")
Conclusion :
I would suggest option 2 ... since Advantage is later you can query data based on partition (aka query on raw data to know what you have received) and underlying file can be parquet or orc.
Note :
Just make sure you have .enableHiveSupport() when you are creating session with SparkSessionBuilder and also make sure whether you have hive-conf.xml etc. configured properly.
Based on this answer spark should be able to write to a folder based on the year, month and day, which seems to be exactly what you are looking for. Have not tried it in spark streaming, but hopefully this example gets you on the right track:
df.write.partitionBy("year", "month", "day").format("parquet").save(outPath)
If not, you might be able to put in a variable filepath based on current_date()

Need less parquet files

I am doing the following process
rdd.toDF.write.mode(SaveMode.Append).partitionBy("Some Column").parquet(output_path)
However, under each partition, there are too many parquet files and each of them, the size is very small, that will makes my following steps become very slow to load all the parquet files. Is there a better way that under each partition, make less parquet files and increase the single parquet file size?
You can repartition before save:
rdd.toDF.repartition("Some Column").write.mode(SaveMode.Append).partitionBy("Some Column")
I used to have this problem.
Actually you can't control the partition of files because it depends on the executor doing.
The way to work around it is using method coalesce to make a shuffle and you can make how many partition you want but it's not efficient way you also need to set driver memory enough to handle this operation.
df = df.coalesce(numPartitions).write.partitionBy(""yyyyy").parquet("xxxx")
I also faced this issue. The problem is if you use coalesce each partition gets same number of parquet files. Now different partitions have different size so ideally I need different coalesce for each partition.
It's going to be really quite expensive if you open a lot of small files. Let's say you open 1k files and each filesize are far from the value of your parquet.block.size.
Here are my suggestions:
Create a job that will first merge your input parquet files to have smaller number of files where their sizes are near or equal to parquet.block.size. The default block size for 128Mb, though it's configurable by updating parquet.block.size. Spark would love if your parquet file is near before or equal the value of your parquet.block.size. The block size is the size of a row group being buffered in memory.
Or update your spark job to just read limited number of files
Or if you have a big machine and/or resources, just do the right tuning.
Hive query has a way to merge small files into larger one. This is not available in spark sql. Also, reducing spark.sql.shuffle.partitions wont help with Dataframe API.
I tried below solution and it generated lesser number of parquet files(from 800 parquet files to 29).
Suppose the data is loaded to a dataframe df
Create a temporary table in hive.
df.createOrReplaceTempView("tempTable")
spark.sql("CREATE TABLE test_temp LIKE test")
spark.sql("INSERT INTO TABLE test_temp SELECT * FROM tempTable")
The test_temp will contain small parquet files.
Populate final hive table from temporary table
spark.sql("INSERT INTO test SELECT * FROM test_temp")
The final table will contain lesser files. Drop temporary table after populating final table.

Spark - Saving data to Parquet file in case of dynamic schema

I have a JavaPairRDD of the following typing:
Tuple2<String, Iterable<Tuple2<String, Iterable<Tuple2<String, String>>>>>
that denotes the following object:
(Table_name, Iterable(Tuple_ID, Iterable(Column_name, Column_value)))
This means each record in the RDD will create one Parquet file.
The idea is, as you may have guessed, to save each object as a new Parquet table called Table_name. In this table, there is one column called ID that stores the value Tuple_ID, and each column Column_name stores the value Column_value.
The challenge I'm facing is that the table's columns (the schema) are collected on the fly on runtime, AND, as it is not possible to create nested RDDs in Spark, I can't create an RDD within the previous RDD (for each record) and save it finally to a Parquet file --after converting it to a DataFrame of course.
And I can't just convert the previous RDD to a DataFrame, for the obvious reason (need to iterate to get column/value).
As a temporarily workaround, I flattened the RDD into a list of the same typing as the RDD using collect(), but this is not the proper way as the data could be larger than the available disk space on the driver machine, causing an out of memory.
Any advice on how to achieve this? please let me know if the question is not clear enough.
Take a look at answer for this [question][1]
[1]: Writing RDD partitions to individual parquet files in its own directory. I used this answer to create separate (one or more) parquet file for each partition. This technique I believe you can use the same to create separate file each with different schema if you like.

Resources