read/write bucketed tables in Spark - apache-spark

I have a number of tables (with 100 million-ish rows) that are stored as external Hive tables using Parquet format. The Spark job needs to join several of them together, using a single column, with almost no filtering. The join column has unique values about 2/3X fewer than the number of rows.
I can see that there are shuffles happening by the join key; and I have been trying to utilize bucketing/partitioning to improve join performance. My thought is that if Spark can be made aware that each of these tables has been bucketed using the same column, it can load the dataframes and join them without shuffling. I have tried using Hive bucketing, but the shuffles don't go away. (From Spark's documentation it looks like Hive bucketing is not supported as of Spark 2.3.0 at least, which I found out later.) Can I use Spark's bucketing feature to do this? If yes, would I have to disable Hive support and just read the files directly? Or could I rewrite the tables once using Spark's bucketing scheme and still be able to read them as Hive tables?
EDIT: For writing out the Hive bucketed tables I was using something like:
customerDF
.write
.option("path", "/some/path")
.mode("overwrite")
.format("parquet")
.bucketBy(200, "customer_key")
.sortBy("customer_key")
.saveAsTable("table_name")
The writing part seems to work. However, reading from two tables written that way and joining them didn't work as I expected. That is, Spark was repartitioning both tables again into 200 partitions.
I don't have code for doing Spark bucketing right now but will update if I figure it out.

Related

How to paginate hive table in spark?

I wanted to do a pagination on a hive table having ~1.5 billion rows using pyspark. I came across one solution using ROW_NUMBER(). When I tried it, I am running out memory. Not sure whether spark is trying to bring in the complete table to it's memory and then doing a pagination.
After that, I came across this LIMIT clause in Hive SQL (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause) and tried it. But it failed in spark, the reason which I figured out was that hiveQL is not completely supported in spark.sql(). Spark SQL limit does not support multiple arguments for offset -> https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-limit.html
Is there a good approach where in I can do pagination using spark?
PS: The hive table does not have an ID column, with which I can sort and do a pagination. :)
basic use of spark :
# Extract the data
df = spark.read.table("my_table")
# Transform the data
df = df.withColumn("new_col", some_transformation())
# Load the data
df.write ... # write wherever you want

Performance of spark while reading from hive vs parquet

Assuming I have an external hive table on top parquet/orc files partitioned on date, what would be the performance impact of using
spark.read.parquet("s3a://....").filter("date_col='2021-06-20'")
v/s
spark.sql("select * from table").filter("date_col='2021-06-20'")
After reading into a dataframe, It will be followed by a series of transformations and aggregations.
spark version : 2.3.0 or 3.0.2
hive version : 1.2.1000
number of records per day : 300-700 Mn
My hunch is that there won't be any performance difference while using either of the above queries since parquet natively has most of the optimizations that a hive metastore can provide and spark is capable of using it. Like, predicate push-down, advantages of columnar storage etc.
As a follow-up question, what happens if
The underlying data was csv instead of parquet. Does having a hive table on top improves performance ?
Hive table was bucketed. Does it make sense to read the underlying file system in this case instead of reading from table ?
Also, are there any situations where reading directly from parquet is a better option compared to hive ?
Hive should actually be faster here because they both have pushdowns, Hive already has the schema stored. The parquet read as you have it here will need to infer the merged schema. You can make them about the same by providing the schema.
You can make the Parquet version even faster by navigating directly to the partition. This avoids having to do the initial filter on the available partitions.
So something like this would do it:
spark.read.option("basePath", "s3a://....").parquet("s3a://..../date_col=2021-06-20")
Note this works best if you already have a schema, because this also skips schema merging.
As to your follow-ups:
It would make a huge difference if it's CSV, as it would then have to parse all of the data and then filter out those columns. CSV is really bad for large datasets.
Shouldn't really gain you all that much and may get you into trouble. The metadata that Hive stores can allow Spark to navigate your data more efficiently here than you trying to do it yourself.

Hive partitions to Spark partitions

We need to work on a big dataset with partitioned data, for efficiency reasons. Data source resides in Hive, but with a different partition criteria. In other words, we need to retrieve data from Hive to Spark, and re-partition in Spark.
But there is an issue in Spark that causes reordering/redistributing partitioning when data is persisted (either to parquet or ORC). Therefore, our new partitioning in Spark is lost.
As an alternative, we are considering building our new partitioning in a new Hive table. The question is: is it possible to map Spark partitions from Hive partitions (for read)?
Partition Discovery --> might be what you are looking for:
" Passing the path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths. "

Optimized hive data aggregation using hive

I have a hive table (80 million records) with the followig schema (event_id ,country,unit_id,date) and i need to export this data to a text file as with the following requirments:
1-Rows are aggregated(combined) by event_id.
2-Aggregated rows must be sorted according to date.
For example rows with same event_id must be combined as a list of lists, ordered according to date.
What is the best performance wise solution to make this job using spark ?
Note: This is expected to be a batch job.
Performance-wise, I think the best solution is to write a spark program (scala or python) to read in the underlying files to the hive table, do your transformations, and then write the output as a file.
I've found that it's much quicker to just read the files in spark rather than querying hive through spark and pulling the result into a dataframe.

What is an optimized way of joining large tables in Spark SQL

I have a need of joining tables using Spark SQL or Dataframe API. Need to know what would be optimized way of achieving it.
Scenario is:
All data is present in Hive in ORC format (Base Dataframe and Reference files).
I need to join one Base file (Dataframe) read from Hive with 11-13 other reference file to create a big in-memory structure (400 columns) (around 1 TB in size)
What can be best approach to achieve this? Please share your experience if some one has encounter similar problem.
My default advice on how to optimize joins is:
Use a broadcast join if you can (see this notebook). From your question it seems your tables are large and a broadcast join is not an option.
Consider using a very large cluster (it's cheaper that you may think). $250 right now (6/2016) buys about 24 hours of 800 cores with 6Tb RAM and many SSDs on the EC2 spot instance market. When thinking about total cost of a big data solution, I find that humans tend to substantially undervalue their time.
Use the same partitioner. See this question for information on co-grouped joins.
If the data is huge and/or your clusters cannot grow such that even (3) above leads to OOM, use a two-pass approach. First, re-partition the data and persist using partitioned tables (dataframe.write.partitionBy()). Then, join sub-partitions serially in a loop, "appending" to the same final result table.
Side note: I say "appending" above because in production I never use SaveMode.Append. It is not idempotent and that's a dangerous thing. I use SaveMode.Overwrite deep into the subtree of a partitioned table tree structure. Prior to 2.0.0 and 1.6.2 you'll have to delete _SUCCESS or metadata files or dynamic partition discovery will choke.
Hope this helps.
Spark uses SortMerge joins to join large table. It consists of hashing each row on both table and shuffle the rows with the same hash into the same partition. There the keys are sorted on both side and the sortMerge algorithm is applied. That's the best approach as far as I know.
To drastically speed up your sortMerges, write your large datasets as a Hive table with pre-bucketing and pre-sorting option (same number of partitions) instead of flat parquet dataset.
tableA
.repartition(2200, $"A", $"B")
.write
.bucketBy(2200, "A", "B")
.sortBy("A", "B")
.mode("overwrite")
.format("parquet")
.saveAsTable("my_db.table_a")
tableb
.repartition(2200, $"A", $"B")
.write
.bucketBy(2200, "A", "B")
.sortBy("A", "B")
.mode("overwrite")
.format("parquet")
.saveAsTable("my_db.table_b")
The overhead cost of writing pre-bucketed/pre-sorted table is modest compared to the benefits.
The underlying dataset will still be parquet by default, but the Hive metastore (can be Glue metastore on AWS) will contain precious information about how the table is structured. Because all possible "joinable" rows are colocated, Spark won't shuffle the tables that are pre-bucketd (big savings!) and won't sort the rows within the partition of table that are pre-sorted.
val joined = tableA.join(tableB, Seq("A", "B"))
Look at the execution plan with and without pre-bucketing.
This will not only save you a lot of time during your joins, it will make it possible to run very large joins on relatively small cluster without OOM. At Amazon, we use that in prod most of the time (there are still a few cases where it is not required).
To know more about pre-bucketing/pre-sorting:
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
https://data-flair.training/blogs/bucketing-in-hive/
https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
https://databricks.com/session/hive-bucketing-in-apache-spark
Partition the source use hash partitions or range partitions or you can write custom partitions if you know better about the joining fields. Partition will help to avoid repartition during joins as spark data from same partition across tables will exist in same location.
ORC will definitely help the cause.
IF this is still causing spill, try using tachyon which will be faster than disk

Resources