Spark - Saving data to Parquet file in case of dynamic schema - apache-spark

I have a JavaPairRDD of the following typing:
Tuple2<String, Iterable<Tuple2<String, Iterable<Tuple2<String, String>>>>>
that denotes the following object:
(Table_name, Iterable(Tuple_ID, Iterable(Column_name, Column_value)))
This means each record in the RDD will create one Parquet file.
The idea is, as you may have guessed, to save each object as a new Parquet table called Table_name. In this table, there is one column called ID that stores the value Tuple_ID, and each column Column_name stores the value Column_value.
The challenge I'm facing is that the table's columns (the schema) are collected on the fly at runtime, and, since nested RDDs are not possible in Spark, I can't create an RDD within the previous RDD (for each record) and finally save it to a Parquet file, after converting it to a DataFrame of course.
And I can't just convert the previous RDD to a DataFrame directly, for the obvious reason (I would need to iterate to get each column/value).
As a temporary workaround, I flattened the RDD into a list of the same type using collect(), but this is not the proper way, since the data could be larger than what the driver machine can hold, causing an out-of-memory error.
Any advice on how to achieve this? Please let me know if the question is not clear enough.

Take a look at the answer to this question: Writing RDD partitions to individual parquet files in its own directory.
I used that answer to create separate (one or more) Parquet files for each partition. I believe you can use the same technique to create a separate file for each record, each with a different schema, if you like.
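For what it's worth, here is a rough sketch (in Scala, assuming Spark 2.x and that every column value is a string) of one way to avoid nested RDDs entirely: collect only the distinct table names to the driver, then build and write one DataFrame per table. The variable rdd stands for the pair RDD from the question, and the output path is hypothetical.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.appName("DynamicParquet").getOrCreate()

// rdd: RDD[(String, Iterable[(String, Iterable[(String, String)])])]
//      i.e. (tableName, rows of (tupleId, columns of (name, value)))
val tableNames = rdd.keys.distinct().collect()   // small: one entry per table

tableNames.foreach { table =>
  // Flatten this table's rows into (id, Map(columnName -> columnValue)).
  // Note: this scans the RDD once per table, which is fine for a modest number of tables.
  val rows = rdd.filter(_._1 == table).flatMap(_._2).map { case (id, cols) => (id, cols.toMap) }

  // Column names seen in this table; assumed small enough to collect
  val columns = rows.flatMap(_._2.keys).distinct().collect().sorted
  val schema = StructType(StructField("ID", StringType) +: columns.map(StructField(_, StringType)))

  val rowRdd = rows.map { case (id, m) => Row.fromSeq(id +: columns.map(c => m.getOrElse(c, null))) }
  spark.createDataFrame(rowRdd, schema).write.parquet(s"/output/$table")   // hypothetical path
}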

Related

Spark SQL Update/Delete

Currently, I am working on a project using pySpark that reads in a few Hive tables, stores them as DataFrames, and performs a few updates/filters on them. I am avoiding Spark syntax at all costs so that the framework only takes SQL from a parameter file, which is then run by my pySpark framework.
Now the problem is that I have to perform UPDATE/DELETE queries on my final DataFrame. Are there any possible workarounds for performing these operations on my DataFrame?
Thank you so much!
A DataFrame is immutable; you cannot change it, so you are not able to update/delete in place.
If you want to "delete", there is a .filter option (it will create a new DataFrame excluding the records that fail the condition you apply in the filter).
If you want to "update", the closest equivalent is .map, where you can "modify" your records and the new values will land in a new DataFrame; the catch is that the function will iterate over all the records of the DataFrame.
Another thing to keep in mind: if you load data into a DataFrame from some source (e.g. a Hive table) and perform some operations, the updated data won't be reflected in your source data. DataFrames live in memory until you persist them somewhere.
So you cannot work with a DataFrame like a SQL table for those operations. Depending on your requirements, you need to analyze whether Spark is the right solution for your specific problem.
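As a rough illustration (Scala shown, though the same DataFrame API exists in PySpark; the table and column names here are made up), the "delete" and "update" equivalents look something like this:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder.appName("UpdateDeleteWorkaround").getOrCreate()
import spark.implicits._

val df = spark.table("my_hive_table")   // hypothetical source table

// "DELETE FROM t WHERE status = 'inactive'"  ->  keep only the rows you want
val afterDelete = df.filter($"status" =!= "inactive")

// "UPDATE t SET amount = 0 WHERE amount < 0"  ->  derive the new column value
val afterUpdate = afterDelete.withColumn("amount", when($"amount" < 0, 0).otherwise($"amount"))

// Each step returns a new DataFrame; persist it if the change must outlive the job
afterUpdate.write.mode("overwrite").saveAsTable("my_hive_table_updated")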

Extract and analyze data from JSON - Hadoop vs Spark

I'm trying to learn the whole open source big data stack, and I've started with HDFS, Hadoop MapReduce and Spark. I'm more or less limited to MapReduce and Spark (SQL?) for "ETL", HDFS for storage, and have no other limitation for the rest.
I have a situation like this:
My Data Sources
Data Source 1 (DS1): Lots of data - totaling around 1 TB. I have IDs (let's call them ID1) inside each row - used as a key. Format: 1000s of JSON files.
Data Source 2 (DS2): Additional "metadata" for data source 1. I have IDs (let's call them ID2) inside each row - used as a key. Format: Single TXT file
Data Source 3 (DS3): Mapping between Data Source 1 and 2. Only pairs of ID1, ID2 in CSV files.
My workspace
I currently have a VM with enough disk space, about 128 GB of RAM and 16 CPUs to handle my problem (the whole project is for research, not a production use case). I have CentOS 7 and Cloudera 6.x installed. Currently, I'm using HDFS, MapReduce and Spark.
The task
I need only some attributes (ID and a few strings) from Data Source 1. My guess is that it comes to less than 10% in data size.
I need to connect ID1s from DS3 (pairs: ID1, ID2) to IDs in DS1 and ID2s from DS3 (pairs: ID1, ID2) to IDs in DS2.
I need to add attributes from DS2 (using the "mapping" from the previous bullet) to my extracted attributes from DS1.
I need to make some "queries", like:
Find the most used words by years
Find the most common words, used by a certain author
Find the most common words, used by a certain author, on a yearly basis
etc.
I need to visualize data (i.e. wordclouds, histograms, etc.) at the end.
My questions:
Which tool is the most efficient way to extract data from the JSON files: MapReduce or Spark (SQL?)?
I have arrays inside the JSON. I know the explode function in Spark can transpose my data, but what is the best way to go here? Is it best to extract the IDs from DS1, put the exploded data next to them, and write them to new files? Or is it better to combine everything? How do I achieve this - with Hadoop or Spark?
My current idea was to create something like this:
Extract attributes needed (except arrays) from DS1 with Spark and write them to CSV files.
Extract attributes needed (exploded arrays only + IDs) from DS1 with Spark and write them to CSV files - each exploded attribute to its own file(s).
This means I have extracted all the data I need, and I can easily connect them with only one ID. I then wanted to make queries for specific questions and run MapReduce jobs.
The question: Is this a good idea? If not, what can I do better? Should I insert data into a database? If yes, which one?
Thanks in advance!
Thanks for asking! Having been a big data developer for the last 1.5 years, with experience in both MR and Spark, I think I can point you in the right direction.
The final goals you want to achieve can be reached with both MapReduce and Spark. For visualization purposes you can use Apache Zeppelin, which can run on top of your final data.
Spark jobs are memory-expensive: the whole computation for a Spark job runs in memory (RAM), and only the final result is written to HDFS. MapReduce, on the other hand, uses less memory and writes intermediate stage results to HDFS, which means more I/O operations and a longer run time.
You can use Spark's DataFrame feature. You can load a DataFrame directly from structured data (it can also be a plain-text file), which gives you the required data in tabular format. You can write the DataFrame back to a plain-text file, or store it in a Hive table from which you can visualize the data. With MapReduce, on the other hand, you would first have to store the data in a Hive table, then write Hive operations to manipulate it, and store the final data in another Hive table. Writing native MapReduce jobs can be very hectic, so I would suggest refraining from that option.
In the end, I would suggest using Spark as the processing engine (128 GB and 16 cores is enough for Spark) to get your final result as soon as possible.
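To make the Spark route concrete, here is a rough sketch under some assumptions of my own: DS1 is JSON with an id1 key, a year, an author and an array field words; DS3 is a CSV of (id1, id2) pairs with a header; DS2 is a tab-separated text file keyed by id2. All paths and column names are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, explode}

val spark = SparkSession.builder.appName("JsonEtl").getOrCreate()

// DS1: keep only the attributes we need and flatten the array with explode
val ds1 = spark.read.json("hdfs:///data/ds1/*.json")
  .select(col("id1"), col("year"), col("author"), explode(col("words")).as("word"))

// DS3: the (id1, id2) mapping; DS2: the metadata keyed by id2
val ds3 = spark.read.option("header", "true").csv("hdfs:///data/ds3/*.csv")
val ds2 = spark.read.option("header", "true").option("sep", "\t").csv("hdfs:///data/ds2/metadata.txt")

val joined = ds1.join(ds3, "id1").join(ds2, "id2")

// Example query: most used words per year
joined.groupBy("year", "word")
  .agg(count("*").as("uses"))
  .orderBy(col("year"), col("uses").desc)
  .show()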

Custom File Format to partition data while writing

Hi, I want to save my Spark DataFrame to a file with a custom file format,
such that it partitions the data into different files while writing.
I also need a single part file for each partition key.
I have tried extending TextBasedFileFormat and changing the writer to suit my needs.
The data is getting partitioned while writing, without a shuffle,
but I suspect each RDD partition will write its data to a different part file.
When you write the DataFrame, each partition of the underlying RDD is written by a separate task. Each of these RDD partitions might contain data belonging to different partition keys, so each task can end up creating multiple part files.
To solve this, you have to repartition your DataFrame by the partition key. This involves a shuffle, and all the data corresponding to the same partition key ends up in the same RDD partition. This can be done by -
val newDf = df.repartition(col("partitionKey"))   // col is org.apache.spark.sql.functions.col
Now this DataFrame can be written to any file format (say Parquet, CSV, etc.), and there should be one file per partition. If a file grows too large, Spark may still split it into multiple files; this can be controlled by the config "spark.sql.files.maxRecordsPerFile".
val newDf = df.repartition(col("partitionKey"))
newDf.write.partitionBy("partitionKey").parquet("<directory_path>")
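As a small follow-up sketch of that config (the value is arbitrary and spark is the active SparkSession):
// Cap the number of records per output file before writing
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)
newDf.write.partitionBy("partitionKey").parquet("<directory_path>")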

How to partition and write DataFrame in Spark without deleting partitions with no new data?

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like this:
dataFrame.write.mode(SaveMode.Overwrite).partitionBy("eventdate", "hour", "processtime").parquet(path)
As mentioned in this question, partitionBy will delete the full existing hierarchy of partitions at path and replace them with the partitions in dataFrame. Since new incremental data for a particular day will come in periodically, what I want is to replace only those partitions in the hierarchy that dataFrame has data for, leaving the others untouched.
To do this it appears I need to save each partition individually using its full path, something like this:
singlePartition.write.mode(SaveMode.Overwrite).parquet(path + "/eventdate=2017-01-01/hour=0/processtime=1234567890")
However I'm having trouble understanding the best way to organize the data into single-partition DataFrames so that I can write them out using their full path. One idea was something like:
dataFrame.repartition("eventdate", "hour", "processtime").foreachPartition ...
But foreachPartition operates on an Iterator[Row] which is not ideal for writing out to Parquet format.
I also considered using a select ... distinct eventdate, hour, processtime query to obtain the list of partitions, then filtering the original DataFrame by each of those partitions and saving the results to their full partitioned paths. But the distinct query plus a filter for each partition doesn't seem very efficient, since it would mean a lot of filter/write operations.
Is there a cleaner way to preserve existing partitions for which dataFrame has no data?
Thanks for reading.
Spark version: 2.1
This is an old topic, but I was having the same problem and found another solution: just set your partition overwrite mode to dynamic (available since Spark 2.3, as far as I know) by using:
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
So, my spark session is configured like this:
spark = SparkSession.builder.appName('AppName').getOrCreate()
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
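With that mode set, the partitioned overwrite from the question only replaces the partitions that are actually present in dataFrame (Scala; the path is the same placeholder as in the question):
import org.apache.spark.sql.SaveMode

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
dataFrame.write
  .mode(SaveMode.Overwrite)
  .partitionBy("eventdate", "hour", "processtime")
  .parquet(path)   // partitions under path that dataFrame has no data for are left untouched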
The mode option Append has a catch!
df.write.partitionBy("y","m","d")
.mode(SaveMode.Append)
.parquet("/data/hive/warehouse/mydbname.db/" + tableName)
I've tested this and saw that it will keep the existing partition files. However, the problem this time is the following: if you run the same code twice (with the same data), it will create new Parquet files instead of replacing the existing ones for the same data (Spark 1.6). So, instead of using Append, we can still solve this problem with Overwrite: instead of overwriting at the table level, we should overwrite at the partition level.
df.write.mode(SaveMode.Overwrite)
.parquet("/data/hive/warehouse/mydbname.db/" + tableName + "/y=" + year + "/m=" + month + "/d=" + day)
See the following link for more information:
Overwrite specific partitions in spark dataframe write method
(I've updated my reply after suriyanto's comment. Thnx.)
I know this is very old, but since I cannot see any solution posted, I will go ahead and post one. This approach assumes you have a Hive table over the directory you want to write to.
One way to deal with this problem is to create a temp view from dataFrame which should be added to the table, and then use a normal Hive-style insert overwrite table ... command:
dataFrame.createOrReplaceTempView("temp_view")
spark.sql("insert overwrite table table_name partition ('eventdate', 'hour', 'processtime')select * from temp_view")
It preserves old partitions while (over)writing to only new partitions.
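A hedged note of my own: if the insert fails with a dynamic-partition error, the usual Hive settings typically need to be enabled first, for example:
// Allow dynamic partition inserts (standard Hive settings, not from the original answer)
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")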

Spark dataframe saveAsTable vs save

I am using Spark 1.6.1 and I am trying to save a DataFrame in ORC format.
The problem I am facing is that the save method is very slow; it takes about 6 minutes for a 50 MB ORC file on each executor.
This is how I am saving the dataframe
dt.write.format("orc").mode("append").partitionBy("dt").save(path)
I tried using saveAsTable to a Hive table that also uses the ORC format, and that seems to be about 20% to 50% faster, but this method has its own problems - it seems that when a task fails, retries will always fail because the file already exists.
This is how I am saving the dataframe
dt.write.format("orc").mode("append").partitionBy("dt").saveAsTable(tableName)
Is there a reason save method is so slow?
Am I doing something wrong?
The problem is due to the partitionBy method. partitionBy reads the values of the specified column and then segregates the data for every value of the partition column.
Try saving it without partitionBy; there should be a significant performance difference.
See my previous comments above regarding cardinality and partitionBy.
If you really want to partition it, and it's just one 50MB file, then use something like
dt.write.format("orc").mode("append").repartition(4).saveAsTable(tableName)
repartition will create 4 roughly even partitions, rather than what you are doing to partition on a dt column which could end up writing a lot of orc files.
The choice of 4 partitions is a bit arbitrary. You're not going to get much performance/parallelizing benefit from partitioning tiny files like that. The overhead of reading more files is not worth it.
Use save() to write the data to a particular location, for example some path in blob storage.
Use saveAsTable() to save the DataFrame as a Spark SQL table (registered in the metastore).
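A two-line sketch of the difference (the path and table name are placeholders):
// Path-based: files land at the given location, nothing is registered in the metastore
dt.write.format("orc").mode("append").save("/mnt/blob/output/orc_dir")
// Table-based: the data is also registered as a table in the metastore
dt.write.format("orc").mode("append").saveAsTable("mydb.my_table")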
