spark save and read parquet on HDFS - apache-spark

I am writing this code:
val inputData = spark.read.parquet(inputFile)
spark.conf.set("spark.sql.shuffle.partitions",6)
val outputData = inputData.sort($"colname")
outputData.write.parquet(outputFile) //write on HDFS
When I read the contents of "outputFile" back from HDFS, I don't find the same number of partitions, and the data is not sorted. Is this normal?
I am using Spark 2.0

This is an unfortunate deficiency of Spark. While write.parquet saves files as part-00000.parquet, part-00001.parquet, ... , it saves no partition information, and does not guarantee that part-00000 on disk is read back as the first partition.
We have added functionality for our project to a) read back partitions in the same order (this involves doing some somewhat-unsafe partition casting and sorting based on the contained filename), and b) serialize partitioners to disk and read them back.
As far as I know, there is nothing you can do in stock Spark at the moment to solve this problem. I look forward to seeing a resolution in future versions of Spark!
Edit: My experience is in Spark 1.5.x and 1.6.x. If there is a way to do this in native Spark with 2.0, please let me know!
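For what it's worth, here is a rough DataFrame-level approximation of the filename trick described above. It is not our exact partition-casting approach; it assumes Spark 2.x's input_file_name function and adds an extra sort, and it only restores file order, not any finer-grained ordering:
import org.apache.spark.sql.functions.{col, input_file_name}
// tag each row with the part file it came from, then sort on that tag so
// rows from part-00000 come back before rows from part-00001, and so on
val readBack = spark.read.parquet("outputFile")
  .withColumn("srcFile", input_file_name())
  .sort(col("srcFile"))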

You should use repartition() instead. This will write the parquet file the way you want it:
outputData.repartition(6).write.parquet("outputFile")
Then it will be the same when you read it back.
Parquet preserves the order of rows. You should use take() instead of show() to check the contents: take(n) returns the first n rows. It works by first reading the first partition to get an idea of the partition size, and then fetching the rest of the data in batches.
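For instance, a quick check after reading back (a sketch, using the same path as the question):
val readBack = spark.read.parquet("outputFile")
// take(n) collects the first n rows to the driver, in order
readBack.take(10).foreach(println)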

Related

How parquet columns can be skipped when reading from hdfs?

We all know parquet is column-oriented, so we can read only the columns we want and reduce IO.
But what if the parquet file is stored in HDFS? Should we download the entire file first and then apply the column filter locally?
For example, if we use spark to read a parquet column from HDFS/Hive:
spark.sql("select name from wide_table")
Do we still have to download the entire parquet file?
Or is there a way to filter the columns before the network transfer?
Actually "predicate pushdown" which is a feature of Spark SQL will try to use column filters to reduce the amount of information that is processed by spark. Technically the entire hdfs block is still read into memory, but it uses smart logic to only return relevant results. This is normally called out in the physical plan. You can read this by using .explain() on your query to see if the feature is being used. (Not all versions of hdfs support this.)

How to write outputs of spark streaming application to a single file

I'm reading data from Kafka using Spark Streaming and passing it to a py file for prediction. It returns the predictions along with the original data. It saves the original data with its predictions to file, but it creates one file per RDD.
I need all the data collected until I stop the program to be saved to a single file.
I have tried writeStream; it does not create even a single file.
I have tried saving to parquet with append, but that creates multiple files, one per RDD.
Writing with append mode still produced multiple output files.
The code below creates a folder named output.csv and puts all the files into it.
def main(args: Array[String]): Unit = {
  val ss = SparkSession.builder()
    .appName("consumer")
    .master("local[*]")
    .getOrCreate()
  val scc = new StreamingContext(ss.sparkContext, Seconds(2))
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" ->
      "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer" ->
      "org.apache.kafka.common.serialization.StringDeserializer",
    "group.id" -> "group5"
  )
  // mappedData (the DStream built from the Kafka stream) and pyPath are
  // defined elsewhere in the program and are not shown here
  mappedData.foreachRDD(
    x =>
      x.map(y =>
        ss.sparkContext.makeRDD(List(y)).pipe(pyPath).toDF().repartition(1)
          .write.format("csv").mode("append").option("truncate", "false")
          .save("output.csv")
      )
  )
  scc.start()
  scc.awaitTermination()
}
I need to get just one file containing all the records, one after another, collected while streaming.
Any help will be appreciated; thank you in advance.
You cannot modify a file in HDFS once it has been written. If you wish to write the file in real time (appending blocks of data from the streaming job to the same file every 2 seconds), that simply isn't allowed, as HDFS files are immutable. I suggest you try to write read logic that reads from multiple files, if possible.
However, if you must read from a single file, I suggest one of the following two approaches, after you have written output to a single CSV/parquet folder with the "append" SaveMode (which creates a part file for each block you write every 2 seconds).
You can create a Hive table on top of this folder and read the data from that table (see the sketch after the snippet below).
You can write simple logic in Spark to read this folder of multiple files, write it to another HDFS location as a single file using repartition(1) or coalesce(1), and then read the data from that location. See below:
spark.read.csv("oldLocation").coalesce(1).write.csv("newLocation")
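For the first approach, here is a minimal sketch of creating a Hive table over the output folder (this assumes the SparkSession was built with Hive support; the table name, column, and HDFS location are placeholders):
// hypothetical single-column schema; adjust to match your CSV output
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS predictions (value STRING)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |LOCATION 'hdfs:///user/me/output.csv'""".stripMargin)
val allData = spark.sql("SELECT * FROM predictions")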
repartition: recommended when increasing the number of partitions, because it shuffles all of the data.
coalesce: recommended when reducing the number of partitions. For example, if you have 3 partitions and want to reduce to 2, coalesce moves the 3rd partition's data into partitions 1 and 2, and partitions 1 and 2 stay in the same containers. repartition, by contrast, shuffles data across all partitions, so network usage between executors is high and performance suffers.
Performance-wise, coalesce does better than repartition when reducing the number of partitions, so use coalesce when writing, for example:
df.coalesce(1).write.csv("newLocation")

Partitioning strategy in Parquet and Spark

I have a job that reads CSV files, converts them into data frames, and writes them in Parquet. I am using append mode while writing, so each write generates a separate Parquet file. My questions are:
1) If a new file gets appended every time I write data to the Parquet schema, will it impact read performance (as the data is now distributed across Parquet files of varying sizes)?
2) Is there a way to generate the Parquet partitions purely based on the size of the data?
3) Do we need to think about a custom partitioning strategy to implement point 2?
I am using Spark 2.3
1) It will affect read performance if spark.sql.parquet.mergeSchema=true, because in that case Spark needs to visit each file and read its schema. In other cases, I believe it does not affect read performance much.
2) There is no way to generate files purely based on data size. You may use repartition or coalesce; the latter creates uneven output files, but is much more performant. You also have the config spark.sql.files.maxRecordsPerFile (or the write option maxRecordsPerFile) to prevent files from growing too large, but usually that is not an issue.
3) Yes, I think Spark has no built-in API to distribute data evenly by size. Column statistics and SizeEstimator may help with this.
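As an illustration of point 2, a minimal sketch of capping output file size with the maxRecordsPerFile write option (available since Spark 2.2; the path and the record count are placeholders):
// cap each output file at roughly 500k records (illustrative value)
df.write
  .option("maxRecordsPerFile", 500000)
  .mode("append")
  .parquet("/data/output")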

Spark dataframe saveAsTable vs save

I am using Spark 1.6.1 and I am trying to save a dataframe in ORC format.
The problem I am facing is that the save method is very slow: it takes about 6 minutes for a 50 MB ORC file on each executor.
This is how I am saving the dataframe
dt.write.format("orc").mode("append").partitionBy("dt").save(path)
I tried using saveAsTable to a Hive table that also uses the ORC format, and that seems to be about 20% to 50% faster, but this method has its own problems: it seems that when a task fails, retries always fail because the file already exists.
This is how I am saving the dataframe
dt.write.format("orc").mode("append").partitionBy("dt").saveAsTable(tableName)
Is there a reason save method is so slow?
Am I doing something wrong?
The problem is due to the partitionBy method. partitionBy reads the values of the specified column and then segregates the data for every value of the partition column.
Try saving without partitionBy; you should see a significant performance difference.
See my previous comments above regarding cardinality and partitionBy.
If you really want to partition it, and it's just one 50 MB file, then use something like
dt.repartition(4).write.format("orc").mode("append").saveAsTable(tableName)
repartition will create 4 roughly even partitions, rather than what you are doing now, which is partitioning on a dt column that could end up writing a lot of ORC files.
The choice of 4 partitions is somewhat arbitrary. You're not going to get much performance/parallelism benefit from partitioning tiny files like that, and the overhead of reading more files is not worth it.
Use save() to write to a particular location, for example a blob storage path.
Use saveAsTable() to save the dataframe as a Spark SQL table.
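A minimal sketch of the difference (the path and table name are placeholders):
// save(): writes ORC files to a filesystem path; nothing is registered
// in the metastore, so you read it back by path
dt.write.format("orc").mode("append").save("/data/events_orc")
// saveAsTable(): writes the data and registers (or appends to) a table
// in the catalog, so it can be queried by name
dt.write.format("orc").mode("append").saveAsTable("events")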

How to combine small parquet files with Spark?

I have a Hive table that has a lot of small parquet files and I am creating a Spark data frame out of it to do some processing using SparkSQL. Since I have a large number of splits/files my Spark job creates a lot of tasks, which I don't want. Basically what I want is the same functionality that Hive provides, that is, to combine these small input splits into larger ones by specifying a max split size setting. How can I achieve this with Spark? I tried using the coalesce function, but I can only specify the number of partitions with it (I can only control the number of output files with it). Instead I really want some control over the (combined) input split size that a task processes.
Edit: I am using Spark itself, not Hive on Spark.
Edit 2: Here is the current code I have:
//create a data frame from a test table
val df = sqlContext.table("schema.test_table").filter($"my_partition_column" === "12345")
//coalesce it to a fixed number of partitions. But as I said in my question
//with coalesce I cannot control the file sizes, I can only specify
//the number of partitions
df.coalesce(8).write.mode(org.apache.spark.sql.SaveMode.Overwrite)
.insertInto("schema.test_table")
I have not tried it, but the Getting Started guide says that setting the property hive.merge.sparkfiles=true should work:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
If you are using Spark on Hive, Spark's abstractions don't provide an explicit split of the data. However, you can control the parallelism in several ways.
You can leverage DataFrame.repartition(numPartitions: Int) to explicitly control the number of partitions.
If you are using a HiveContext, ensure that hive-site.xml contains the CombinedInputFormat. That may help.
For more info, take a look at the following documentation about Spark data parallelism: http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism.
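As a side note, on Spark 2.x the file-based reader can combine small files into larger input partitions via spark.sql.files.maxPartitionBytes. A sketch, assuming the table is read through Spark's native parquet reader (the values are illustrative):
// target input partitions of roughly 256 MB, treating each extra file
// as costing about 4 MB to open when packing small files together
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)
spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)
val df = spark.table("schema.test_table")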
