Spark enforce partitioning on read - apache-spark

I have a dataset that is partitioned like:
raw_data/year=2020/month=05/day=01/hour=00/minute=00/xxx.parquet
raw_data/year=2020/month=05/day=01/hour=00/minute=01/xxx.parquet
...
...
raw_data/year=2020/month=05/day=01/hour=01/minute=00/xxx.parquet
...
I want to load a large number of partitions (say 1 month period), aggregate them per hour, then save it with the following partitions:
processed_data/year=2020/month=05/day=01/hour=00/yyy.parquet
processed_data/year=2020/month=05/day=01/hour=01/yyy.parquet
...
I feel that if Spark could read the dataset such that each executor reads all of the files under one hour partition, it would minimize reshuffling. Is there any way to specify Spark's partition reading pattern?

Best approach is as per this document: http://tantusdata.com/spark-shuffle-case-1-partition-by-and-repartition/
Use df.repartition...write.partitionBy... to avoid shuffling and to get better subsequent read performance.
Spark partition discovery on read with base path could help as well.

I think it is best to save the data in the way you want to read it instead of trying to customize how Spark loads data.
You could read all the data and partition it by hours as you like. Probably you need to first create a column like "year-month-day-hour", but then you can repartition your data based on this column.
df.repartition(col("year-month-day-hour")).write.format("parquet").save("path/to/output")
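As a rough illustration of the "year-month-day-hour" key suggested above, here is a minimal plain-Python sketch (not Spark code) that derives the key from the partition directories in a path and groups files by it. The helper name `hour_key` and the regular expression are assumptions made for this example; in Spark the same key would be built as a column from the partition columns before calling repartition.

```python
import re
from collections import defaultdict

# Hypothetical helper pattern: matches the key=value directory segments.
PARTITION_RE = re.compile(r"(year|month|day|hour|minute)=(\d+)")

def hour_key(path):
    """Build 'YYYY-MM-DD-HH' from the year/month/day/hour path segments."""
    parts = dict(PARTITION_RE.findall(path))
    return "-".join(parts[k] for k in ("year", "month", "day", "hour"))

paths = [
    "raw_data/year=2020/month=05/day=01/hour=00/minute=00/xxx.parquet",
    "raw_data/year=2020/month=05/day=01/hour=00/minute=01/xxx.parquet",
    "raw_data/year=2020/month=05/day=01/hour=01/minute=00/xxx.parquet",
]

# Group the minute-level files under their hourly key.
groups = defaultdict(list)
for p in paths:
    groups[hour_key(p)].append(p)
```

All rows that share one key value end up in the same group, which is the property the repartition on that column relies on.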

Related

Glue Spark write data one partition at time

I need help understanding how this works: I have 2 TB of data which I am writing with Glue Spark, partitioned on a certain date column. I am using G.2X with 40 worker nodes.
These are a few observations:
The job writes one partition at a time, i.e. only one day's data is being written at any moment. (Shouldn't it write multiple partitions in parallel?)
It creates very small files within partitions.
As a result, writing the data is very slow. Are there any settings that can be changed to improve this?
To avoid creating very small files, you can use coalesce(k), where k is the number of partitions you want to end up with, probably 40 here.

Extract and analyze data from JSON - Hadoop vs Spark

I'm trying to learn the whole open source big data stack, and I've started with HDFS, Hadoop MapReduce and Spark. I'm more or less limited to MapReduce and Spark (SQL?) for "ETL" and HDFS for storage, with no limitations on anything else.
I have a situation like this:
My Data Sources
Data Source 1 (DS1): Lots of data - totaling to around 1TB. I have IDs (let's call them ID1) inside each row - used as a key. Format: 1000s of JSON files.
Data Source 2 (DS2): Additional "metadata" for data source 1. I have IDs (let's call them ID2) inside each row - used as a key. Format: Single TXT file
Data Source 3 (DS3): Mapping between Data Source 1 and 2. Only pairs of ID1, ID2 in CSV files.
My workspace
I currently have a VM with enough disk space, about 128GB of RAM and 16 CPUs to handle my problem (the whole project is for research, not production use). I have CentOS 7 and Cloudera 6.x installed. Currently, I'm using HDFS, MapReduce and Spark.
The task
I need only some attributes (ID and a few strings) from Data Source 1. My guess is that it comes to less than 10% in data size.
I need to connect ID1s from DS3 (pairs: ID1, ID2) to IDs in DS1 and ID2s from DS3 (pairs: ID1, ID2) to IDs in DS2.
I need to add attributes from DS2 (using "mapping" from the previous bullet) to my extracted attributes from DS1
I need to make some "queries", like:
Find the most used words by years
Find the most common words, used by a certain author
Find the most common words, used by a certain author, on a yearly basis
etc.
I need to visualize data (i.e. wordclouds, histograms, etc.) at the end.
My questions:
Which tool to use to extract data from JSON files the most efficient way? MapReduce or Spark (SQL?)?
I have arrays inside JSON. I know the explode function in Spark can transpose my data. But what is the best way to go here? Is it best to extract IDs from DS1, put the exploded data next to them, and write them to new files? Or is it better to combine everything? How to achieve this - Hadoop, Spark?
My current idea was to create something like this:
Extract attributes needed (except arrays) from DS1 with Spark and write them to CSV files.
Extract attributes needed (exploded arrays only + IDs) from DS1 with Spark and write them to CSV files - each exploded attribute to own file(s).
This means I have extracted all the data I need, and I can easily connect them with only one ID. I then wanted to make queries for specific questions and run MapReduce jobs.
The question: Is this a good idea? If not, what can I do better? Should I insert data into a database? If yes, which one?
Thanks in advance!
Thanks for asking! Having worked as a big data developer for the last 1.5 years, with experience in both MR and Spark, I think I can point you in the right direction.
The final goals you want to achieve can be reached with either MapReduce or Spark. For visualization you can use Apache Zeppelin, which can run on top of your final data.
Spark jobs are memory-hungry: Spark keeps intermediate data in memory (RAM) as far as it can, and typically only the final result is written to HDFS. MapReduce, on the other hand, uses less memory and writes intermediate stage results to HDFS, which means more I/O operations and longer run times.
You can use Spark's DataFrame feature. You can load structured data (including plain text files) directly into a DataFrame, which gives you the required data in tabular form. You can write the DataFrame to a plain text file, or store it in a Hive table from which you can visualize the data. With MapReduce, on the other hand, you would first have to store the data in a Hive table, then write Hive operations to manipulate it, and store the final data in another Hive table. Writing native MapReduce jobs can be very tedious, so I would suggest avoiding that option.
In the end, I would suggest using Spark as the processing engine (128GB and 16 cores is enough for Spark) to get your final result as quickly as possible.
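To make the "most used words by year / by author" queries from the question concrete, here is a minimal plain-Python sketch of the logic: explode each record's word array into (key, word) pairs, then count per key. The record layout and the function name `top_words_by` are assumptions for illustration only; with Spark DataFrames the same idea maps to an explode followed by a grouped count.

```python
from collections import Counter, defaultdict

# Hypothetical records, as they might look after joining DS1 with DS2.
records = [
    {"id": 1, "year": 2019, "author": "a", "words": ["spark", "hdfs", "spark"]},
    {"id": 2, "year": 2019, "author": "b", "words": ["spark"]},
    {"id": 3, "year": 2020, "author": "a", "words": ["yarn", "yarn", "spark"]},
]

def top_words_by(records, key, n=1):
    """Return the n most common words for each distinct value of `key`."""
    counts = defaultdict(Counter)
    for rec in records:
        # "Explode": one (key value, word) pair per array element.
        for word in rec["words"]:
            counts[rec[key]][word] += 1
    return {k: c.most_common(n) for k, c in counts.items()}

by_year = top_words_by(records, "year")
by_author = top_words_by(records, "author")
```

Grouping by a (author, year) pair instead of a single field would cover the "by a certain author, on a yearly basis" variant in the same way.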

In Apache Spark's `bucketBy`, how do you generate 1 file per bucket instead of 1 file per bucket per partition?

I am trying to use Spark's bucketBy feature on a pretty large dataset.
dataframe.write()
.format("parquet")
.bucketBy(500, bucketColumn1, bucketColumn2)
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
The problem is that my Spark cluster has about 500 partitions/tasks/executors (not sure of the terminology), so I end up with files that look like:
part-00001-{UUID}_00001.c000.snappy.parquet
part-00001-{UUID}_00002.c000.snappy.parquet
...
part-00001-{UUID}_00500.c000.snappy.parquet
part-00002-{UUID}_00001.c000.snappy.parquet
part-00002-{UUID}_00002.c000.snappy.parquet
...
part-00002-{UUID}_00500.c000.snappy.parquet
part-00500-{UUID}_00001.c000.snappy.parquet
part-00500-{UUID}_00002.c000.snappy.parquet
...
part-00500-{UUID}_00500.c000.snappy.parquet
That's 500x500=250000 bucketed parquet files! It takes forever for the FileOutputCommitter to commit that to S3.
Is there a way to generate one file per bucket, like in Hive? Or is there a better way to deal with this problem? As of now it seems like I have to choose between lowering the parallelism of my cluster (reduce number of writers) or reducing the parallelism of my parquet files (reduce number of buckets).
Thanks
To get 1 file per final bucket, do the following: right before writing the dataframe as a table, repartition it using exactly the same columns as the ones you are using for bucketing, and set the number of new partitions equal to the number of buckets you will use in bucketBy (or a smaller number which is a divisor of the number of buckets, though I don't see a reason to use a smaller number here).
In your case that would probably look like this:
dataframe.repartition(500, bucketColumn1, bucketColumn2)
.write()
.format("parquet")
.bucketBy(500, bucketColumn1, bucketColumn2)
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
When you're saving to an existing table, you need to make sure the column types match exactly (e.g. if your column X is INT in the dataframe but BIGINT in the table you're inserting into, repartitioning by X into 500 buckets won't match repartitioning by X treated as BIGINT, and you'll end up with each of the 500 executors writing 500 files again).
Just to be 100% clear - this repartitioning will add another step into your execution which is to gather the data for each bucket on 1 executor (so one full data reshuffle if the data was not partitioned same way before). I'm assuming that is exactly what you want.
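The file-count effect described above can be modelled in a few lines of plain Python (not Spark code). The sketch assumes each writing task emits one file per bucket for which it holds at least one row, and it uses a plain modulo as a stand-in for the bucket hash (Spark uses its own hash function); the names `bucket_of` and `count_files` are made up for this example.

```python
def bucket_of(key, num_buckets):
    # Stand-in for the bucket hash: plain modulo on an integer key.
    return key % num_buckets

def count_files(task_of_row, keys, num_buckets):
    """Count distinct (task, bucket) pairs; each pair models one output file."""
    return len({
        (task_of_row(i, k), bucket_of(k, num_buckets))
        for i, k in enumerate(keys)
    })

NUM_BUCKETS = 8
NUM_TASKS = 8
keys = list(range(1000))

# Rows spread over tasks with no relation to the bucket key:
# every task ends up holding rows for every bucket.
unaligned = count_files(lambda i, k: (i // NUM_BUCKETS) % NUM_TASKS,
                        keys, NUM_BUCKETS)

# Rows repartitioned by the bucket key first:
# each task holds rows for exactly one bucket.
aligned = count_files(lambda i, k: bucket_of(k, NUM_TASKS),
                      keys, NUM_BUCKETS)
```

With 8 tasks and 8 buckets the unaligned layout yields 64 files and the aligned one 8, the same ratio as 250000 versus 500 in the question.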
It was also mentioned in comments to another answer that you'll need to be prepared for possible issues if your bucketing keys are skewed. That is true, but default Spark behavior doesn't help you much if the first thing you do after loading the table is to aggregate/join on the same columns you bucketed by (which seems a very likely scenario for someone who chose to bucket by these columns). Instead, the issue is merely delayed: you only see the skew when you try to load the data after writing it.
In my opinion it would be really nice if Spark offered a setting to always repartition your data before writing a bucketed table (especially when inserting into existing tables).
This should solve it.
dataframe.write()
.format("parquet")
.bucketBy(1, bucketColumn1, bucketColumn2)
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
Change the first argument of the bucketBy function to 1.
You can look at the code of bucketBy from spark's git repository - https://github.com/apache/spark/blob/f8d59572b014e5254b0c574b26e101c2e4157bdd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
The first split part-00001, part-00002 is based on the number of parallel tasks running when you save the bucketed table. In your case you had 500 parallel tasks running. The number of files inside each part file is decided based on the input you provide for the bucketBy function.
To learn more about Spark tasks, partitions, executors, view my Medium articles - https://medium.com/#tharun026

Spark: Most efficient way to sort and partition data to be written as parquet

My data is in principle a table, which contains a column ID and a column GROUP_ID, besides other 'data'.
In the first step I am reading CSV's into Spark, do some processing to prepare the data for the second step, and write the data as parquet.
The second step does a lot of groupBy('GROUP_ID') and Window.partitionBy('GROUP_ID').orderBy('ID').
The goal now is -- in order to avoid shuffling in the second step -- to efficiently load the data in the first step, as this is a one-timer.
Question Part 1: AFAIK, Spark preserves the partitioning when loading from parquet (which is actually the basis of any "optimized write consideration" to be made) - correct?
I came up with three possibilities:
df.orderBy('ID').write.partitionBy('GROUP_ID').parquet('/path/to/parquet')
df.orderBy('ID').repartition(n, 'GROUP_ID').write.parquet('/path/to/parquet')
df.repartition(n, 'GROUP_ID').sortWithinPartitions('ID').write.parquet('/path/to/parquet')
I would set n such that the individual parquet files would be ~100MB.
Question Part 2: Is it correct that the three options produce "the same"/similar results in regard of the goal (avoid shuffling in the 2nd step)? If not, what is the difference? And which one is 'better'?
Question Part 3: Which of the three options performs better regarding step 1?
Thanks for sharing your knowledge!
EDIT 2017-07-24
After doing some tests (writing to and reading from parquet), it seems that Spark is not able to recover partitionBy and orderBy information by default in the second step. The number of partitions (as obtained from df.rdd.getNumPartitions()) seems to be determined by the number of cores and/or by spark.default.parallelism (if set), but not by the number of parquet partitions. So the answer to question 1 would be WRONG, and questions 2 and 3 would be irrelevant.
So it turns out the REAL QUESTION is: is there a way to tell Spark, that the data is already partitioned by column X and sorted by column Y?
You probably will be interested in bucketing support in Spark.
See details here
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-bucketing.html
large.write
.bucketBy(4, "id")
.sortBy("id")
.mode(SaveMode.Overwrite)
.saveAsTable(bucketedTableName)
Notice Spark 2.4 added support for bucket pruning (like partition pruning)
The more direct functionality you're looking for is Hive's bucketed-sorted tables
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
This is not yet available in Spark (see PS section below)
Also notice that the sorting information will not be loaded by Spark automatically, but since the data is already sorted, the sorting operation on it will actually be much faster because there is not much work to do, e.g. one pass over the data just to confirm that it is already sorted.
PS.
Spark and Hive bucketing are slightly different.
This is the umbrella ticket for providing compatibility in Spark with bucketed tables created in Hive -
https://issues.apache.org/jira/browse/SPARK-19256
As far as I know, no, there is no way to read data from parquet and tell Spark that it is already partitioned by some expression and ordered.
In short, one file on HDFS etc. is too big for one Spark partition. And even if you read a whole file into one partition by playing with Parquet properties such as parquet.split.files=false, parquet.task.side.metadata=true etc., that would cost more than just one shuffle.
Try bucketBy. Also, partition discovery can help.

Spark dataframe saveAsTable vs save

I am using spark 1.6.1 and I am trying to save a dataframe to an orc format.
The problem I am facing is that the save method is very slow: it takes about 6 minutes for a 50 MB ORC file on each executor.
This is how I am saving the dataframe
dt.write.format("orc").mode("append").partitionBy("dt").save(path)
I tried using saveAsTable to a Hive table which also uses the ORC format, and that seems to be about 20% to 50% faster, but this method has its own problems: it seems that when a task fails, retries will always fail because the file already exists.
This is how I am saving the dataframe
dt.write.format("orc").mode("append").partitionBy("dt").saveAsTable(tableName)
Is there a reason save method is so slow?
Am I doing something wrong?
The problem is due to the partitionBy method. partitionBy reads the values of the specified column and then segregates the data by each value of the partition column.
Try saving without partitionBy; there should be a significant performance difference.
See my previous comments above regarding cardinality and partitionBy.
If you really want to partition it, and it's just one 50MB file, then use something like
dt.repartition(4).write.format("orc").mode("append").saveAsTable(tableName)
repartition will create 4 roughly even partitions, rather than partitioning on the dt column as you are doing now, which could end up writing a lot of ORC files.
The choice of 4 partitions is a bit arbitrary. You're not going to get much performance/parallelization benefit from partitioning tiny files like that. The overhead of reading more files is not worth it.
Use save() to write to a particular location, for example a blob storage path.
Use saveAsTable() to save the dataframe as a Spark SQL table.
