Apache Spark Partitioning Data Using a SQL Function nTile - apache-spark

I am trying multiple ways to optimize execution over large datasets using partitioning. In particular I'm using a function commonly used with traditional SQL databases called nTile.
The objective is to place a certain number of rows into a bucket using a combination of bucketing and repartitioning. This should allow Apache Spark to process the data more efficiently when processing partitioned, or should I say bucketed, datasets.
Below are two examples. The first shows how I've used nTile to split a dataset into two buckets, followed by repartitioning the data into 2 partitions on the bucketed nTile column called skew_data.
I then follow with the same query but without any bucketing or repartitioning.
The problem is that the query without bucketing is faster than the query with bucketing, even though the query without bucketing places all the data into one partition whereas the query with bucketing splits it into 2 partitions.
Can someone let me know why that is?
FYI
I'm running the query on an Apache Spark cluster from Databricks.
The cluster has a single node with 2 cores and 15 GB of memory.
First example: with nTile/bucketing and repartitioning
from pyspark.sql.functions import col, rand

allin = spark.sql("""
    SELECT
        t1.make
        , t2.model
        , NTILE(2) OVER (ORDER BY t2.sale_price) AS skew_data
    FROM
        t1 INNER JOIN t2
        ON t1.engine_size = t2.engine_size2
    """) \
    .repartition(2, col("skew_data"), rand()) \
    .drop('skew_data')
The above code splits the data into 2 partitions, with the following partition distribution:
Number of partitions: 2
Partitioning distribution: [5556767, 5556797]
The second example: with no nTile/bucketing or repartitioning
allin_NO_nTile = spark.sql("""
    SELECT
        t1.make
        , t2.model
    FROM
        t1 INNER JOIN t2
        ON t1.engine_size = t2.engine_size2
    """)
The above code puts all the data into a single partition as shown below:
Number of partitions: 1
Partitioning distribution: [11113564]
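For reference, the partition counts and distributions shown in this question can be obtained with a small helper along these lines (the helper is my own sketch, not from the original post; glom() gathers each partition into a list, so its length is the per-partition row count):

def partition_stats(df):
    # number of partitions backing the DataFrame
    print("Number of partitions:", df.rdd.getNumPartitions())
    # rows per partition; fine for a quick check, but it materializes each
    # partition as a list, so avoid it on very large rows
    print("Partitioning distribution:", df.rdd.glom().map(len).collect())

partition_stats(allin)
partition_stats(allin_NO_nTile)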
My question is: why is the second query (without nTile or repartitioning) faster than the query with nTile and repartitioning?
I have gone to great lengths to write this question out as fully as possible, but if you need further explanation please don't hesitate to ask. I really want to get to the bottom of this.

I abandoned my original approach and used the DataFrameWriter method bucketBy() instead. If you want to know how to apply bucketBy() to bucket data, see
https://www.youtube.com/watch?v=dv7IIYuQOXI&list=PLOmMQN2IKdjvowfXo_7hnFJHjcE3JOKwu&index=39
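For anyone who wants a quick idea of what that looks like in code, here is a minimal sketch of the bucketBy() approach (the table name car_sales and bucketing column engine_size are illustrative assumptions, not from the question; bucketBy() lives on DataFrameWriter and requires saveAsTable rather than save):

(spark.table("car_sales")
    .write
    .bucketBy(2, "engine_size")          # hash rows into 2 buckets on the join key
    .sortBy("engine_size")               # optional: keep each bucket sorted
    .mode("overwrite")
    .saveAsTable("car_sales_bucketed"))  # bucket metadata is stored in the metastore

Bucketed tables let Spark avoid the shuffle when both sides of a join are bucketed on the join key with the same number of buckets.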

Related

Avoid data shuffle and coalesce-numPartitions is not applied to individual partition while doing left anti-join in spark dataframe

I have two dataframes - target_df and reference_df. I need to remove the account_ids in target_df which are present in reference_df.
target_df is created from a hive table and has hundreds of partitions. It is partitioned by date (20220101 to 20221101).
I am doing a left anti-join and writing the data to an hdfs location.
val numPartitions = 10
val df_purge = spark.sql(s"SELECT /*+ BROADCASTJOIN(ref) */ target.* FROM input_table target LEFT ANTI JOIN ${reference_table} ref ON target.${Customer_ID} = ref.${Customer_ID}")
df_purge.coalesce(numPartitions).write.partitionBy("date").mode("overwrite").parquet("hdfs_path")
I need the numPartitions value to apply to each date partition, but it is being applied to the entire dataframe. For example: if there are 100 date partitions, I need 100 * 10 = 1000 part files. This code is not working as expected. I tried repartitioning by "date", but that causes a huge data shuffle.
Can anyone please provide an optimized solution? Thanks!
I am afraid that you cannot skip the shuffle in this case. repartition, coalesce and partitionBy all work at the dataset level, and I don't think there is a way to just split each partition into 10 without a shuffle.
You tried to use coalesce, which indeed does not cause a shuffle, but coalesce can only be used to decrease the number of partitions, so it's not going to help you.
You can try to achieve what you want with a combination of repartition and partitionBy. Here is a description of both functions (the same applies to Scala), source: https://sparkbyexamples.com:
PySpark repartition() is a DataFrame method that is used to increase
or reduce the number of partitions in memory; when written to disk, it creates
all part files in a single directory.
PySpark partitionBy() is a method of DataFrameWriter class which is
used to write the DataFrame to disk in partitions, one sub-directory
for each unique value in partition columns.
If you first repartition your dataset with repartition(1000), Spark is going to create 1000 partitions in memory. Later, when you call partitionBy, Spark is going to create a sub-directory for each value and one part file for each in-memory partition that contains the given key.
So if, after the repartition, date X is present in 500 partitions out of 1000, you will find 500 files in the sub-directory for that date.
In the article I mentioned previously you can find a simple example of this behaviour; check chapter 1.3, partitionBy(colNames : String*) Example:
#Use repartition() and partitionBy() together
dfRepart.repartition(2) \
    .write.option("header", True) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("c:/tmp/zipcodes-state-more")

Splitting spark data into partitions and writing those partitions to disk in parallel

Problem outline: Say I have 300+ GB of data being processed with spark on an EMR cluster in AWS. This data has three attributes used to partition on the filesystem for use in Hive: date, hour, and (let's say) anotherAttr. I want to write this data to a fs in such a way that minimizes the number of files written.
What I'm doing right now is getting the distinct combinations of date, hour, and anotherAttr, and a count of how many rows make up each combination. I collect them into a List on the driver, and iterate over the list, building a new DataFrame for each combination, repartitioning that DataFrame using the number of rows to guesstimate file size, and writing the files to disk with DataFrameWriter, .orc finishing it off.
We aren't using Parquet for organizational reasons.
This method works reasonably well, and solves the problem that downstream teams using Hive instead of Spark don't see performance issues resulting from a high number of files. For example, if I take the whole 300 GB DataFrame, do a repartition with 1000 partitions (in spark) and the relevant columns, and dump it to disk, it all dumps in parallel and finishes in ~9 min. But that produces up to 1000 files for the larger partitions, and that destroys Hive performance. Or it destroys some kind of performance, honestly not 100% sure what. I've just been asked to keep the file count as low as possible. With the method I'm using, I can keep the files to whatever size I want (relatively close anyway), but there is no parallelism and it takes ~45 min to run, mostly waiting on file writes.
It seems to me that since there's a 1-to-1 relationship between some source row and some destination row, and that since I can organize the data into non-overlapping "folders" (partitions for Hive), I should be able to organize my code/DataFrames in such a way that I can ask spark to write all the destination files in parallel. Does anyone have suggestions for how to attack this?
Things I've tested that did not work:
Using a scala parallel collection to kick off the writes. Whatever spark was doing with the DataFrames, it didn't separate out the tasks very well and some machines were getting massive garbage collection problems.
DataFrame.map - I tried to map across a DataFrame of the unique combinations, and kick off writes from inside there, but there's no access to the DataFrame of the data that I actually need from within that map - the DataFrame reference is null on the executor.
DataFrame.mapPartitions - a non-starter, couldn't come up with any ideas for doing what I want from inside mapPartitions
The word 'partition' is also not especially helpful here because it refers both to the concept of spark splitting up the data by some criteria, and to the way that the data will be organized on disk for Hive. I think I was pretty clear in the usages above. So if I'm imagining a perfect solution to this problem, it's that I can create one DataFrame that has 1000 partitions based on the three attributes for fast querying, then from that create another collection of DataFrames, each one having exactly one unique combination of those attributes, repartitioned (in spark, but for Hive) with the number of partitions appropriate to the size of the data it contains. Most of the DataFrames will have 1 partition, a few will have up to 10. The files should be ~3 GB, and our EMR cluster has more RAM than that for each executor, so we shouldn't see a performance hit from these "large" partitions.
Once that list of DataFrames is created and each one is repartitioned, I could ask spark to write them all to disk in parallel.
Is something like this possible in spark?
One thing I'm conceptually unclear on: say I have
val x = spark.sql("select * from source")
and
val y = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr")
and
val z = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr2")
To what extent is y a different DataFrame than z? If I repartition y, what effect does the shuffle have on z, and on x for that matter?
We had the same problem (almost) and we ended up working directly with RDDs (instead of DataFrames) and implementing our own partitioning mechanism (by extending org.apache.spark.Partitioner).
Details: we are reading JSON messages from Kafka. The JSON should be grouped by customerid/date/more fields and written in Hadoop using Parquet format, without creating too many small files.
The steps are (simplified version):
a) Read the messages from Kafka and transform them into a structure of RDD[(GroupBy, Message)]. GroupBy is a case class containing all the fields that are used for grouping.
b) Use a reduceByKeyLocally transformation and obtain a map of metrics (number of messages / message size / etc.) for each group - e.g. Map[GroupBy, GroupByMetrics].
c) Create a GroupPartitioner that uses the previously collected metrics (and some input parameters like the desired Parquet size etc.) to compute how many partitions should be created for each GroupBy object. Basically we are extending org.apache.spark.Partitioner and overriding numPartitions and getPartition(key: Any).
d) Partition the RDD from a) using the previously defined partitioner: newPartitionedRdd = rdd.partitionBy(ourCustomGroupByPartitioner).
e) Invoke spark.sparkContext.runJob with two parameters: the first one is the RDD partitioned at d), the second one is a custom function (func: (TaskContext, Iterator[T]) => Unit) that will write the messages taken from Iterator[T] into Hadoop/Parquet.
Let's say that we have 100 mil messages, grouped like that
Group1 - 2 mil
Group2 - 80 mil
Group3 - 18 mil
and we decided that we have to use 1.5 mil messages per partition to obtain Parquet files greater than 500MB. We'll end up with 2 partitions for Group1, 54 for Group2, 12 for Group3.
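For readers more comfortable with PySpark, here is a rough, hedged analogue of steps a) to d) (the original is Scala with a custom org.apache.spark.Partitioner; the same idea is sketched below with RDD.partitionBy and a partition function; keyed_rdd stands for the RDD[(GroupBy, Message)] from step a), and 1.5 million messages per partition is the target from the example above):

import math
import random

# b) collect per-group message counts to the driver, e.g. {group_key: count}
counts = keyed_rdd.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b).collectAsMap()

# c) give each group a contiguous block of partition ids sized from its count;
#    with the counts above (2M, 80M, 18M) and 1.5M messages per partition,
#    ceil(count / 1.5M) yields 2, 54 and 12 partitions respectively
messages_per_partition = 1_500_000
blocks, total = {}, 0
for group, count in counts.items():
    n = max(1, math.ceil(count / messages_per_partition))
    blocks[group] = (total, n)
    total += n

def group_partitioner(group_key):
    # spread the group's messages randomly over its block of partitions;
    # a deterministic scheme (e.g. hashing a message field) may be preferable
    start, n = blocks[group_key]
    return start + random.randrange(n)

# d) repartition so each group occupies exactly its block of partitions
partitioned = keyed_rdd.partitionBy(total, group_partitioner)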
This statement:
I collect them into a List on the driver, and iterate over the list,
building a new DataFrame for each combination, repartitioning that
DataFrame using the number of rows to guestimate file size, and
writing the files to disk with DataFrameWriter, .orc finishing it off.
is completely off-beam where Spark is concerned. Collecting to the driver is never a good approach: the volumes, the OOM risk and the latency of your approach are all high.
Use the below instead, so as to simplify things and get the benefit of Spark's parallelism, saving time and money for your boss:
df.repartition(cols...)...write.partitionBy(cols...)...
A shuffle occurs via repartition; partitionBy itself never shuffles.
It's that simple, with Spark's default parallelism utilized.
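Spelled out for this question's layout (written in PySpark for consistency with the rest of this page; df stands for the 300+ GB DataFrame and the output path is an assumption), this becomes:

df.repartition("date", "hour", "anotherAttr") \
    .write \
    .partitionBy("date", "hour", "anotherAttr") \
    .mode("overwrite") \
    .orc("hdfs:///output/path")

Because the repartition hashes on the same columns that partitionBy writes by, every (date, hour, anotherAttr) combination lands in a single in-memory partition, so each output directory gets a single ORC file, and the directories are written in parallel.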

Efficient pyspark join

I've read a lot about how to do efficient joins in pyspark. The ways to achieve efficient joins I've found are basically:
Use a broadcast join if you can. (I usually can't because the dataframes are too large)
Consider using a very large cluster. (I'd rather not because of $$$).
Use the same partitioner.
The last one is the one I'd rather try, but I can't find a way to do it in pyspark. I've tried:
df.repartition(numberOfPartitions, 'partition_col1', 'partition_col2')
but it doesn't help; it still takes way too long until I stop it, because spark gets stuck in the last few jobs.
So, how can I use the same partitioner in pyspark and speed up my joins, or even get rid of the shuffles that take forever? Which code do I need to use?
PS: I've checked other articles, even on stackoverflow, but I still can't see code.
You can also use a two-pass approach, in case it suits your requirement. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table.
This was nicely explained by Sim; see the link below:
two pass approach to join big dataframes in pyspark
Based on the case explained above, I was able to join the sub-partitions serially in a loop and then persist the joined data to a hive table.
Here is the code.
from pyspark.sql.functions import *
emp_df_1.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_1")
emp_df_2.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_2")
So, if you are joining on an integer emp_id, you can partition by the ID modulo some number. This way you can redistribute the load across the spark partitions, and records having similar keys will be grouped together and reside on the same partition.
You can then read and loop through each sub-partition's data, join both dataframes, and persist them together.
counter = 0
paritioncount = 4
while counter <= paritioncount:
    query1 = "SELECT * FROM UDB.temptable_1 where par_id={}".format(counter)
    query2 = "SELECT * FROM UDB.temptable_2 where par_id={}".format(counter)
    EMP_DF1 = spark.sql(query1)
    EMP_DF2 = spark.sql(query2)
    df1 = EMP_DF1.alias('df1')
    df2 = EMP_DF2.alias('df2')
    innerjoin_EMP = df1.join(df2, df1.emp_id == df2.emp_id, 'inner').select('df1.*')
    innerjoin_EMP.show()
    innerjoin_EMP.write.format('orc').insertInto("UDB.temptable")
    counter = counter + 1
I have tried this and it is working fine. This is just an example to demo the two-pass approach; your join conditions may vary, and the number of partitions will also depend on your data size.
Thank you @vikrantrana for your answer, I will try it if I ever need it. I say this because I found out the problem wasn't with the 'big' joins; the problem was the amount of calculation prior to the join. Imagine this scenario:
I read a table and store it in a dataframe, called df1. I read another table and store it in df2. Then I perform a huge amount of calculations and joins on both, and I end up with a join between df1 and df2. The problem here wasn't the size; the problem was that spark's execution plan was huge and it couldn't keep all the intermediate tables in memory, so it started to write to disk and it took a long time.
The solution that worked for me was to persist df1 and df2 to disk before the join (I also persisted other intermediate dataframes that were the result of big and complex calculations).
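For completeness, a minimal sketch of that fix (the heavy_calculations_* helpers are hypothetical stand-ins for the expensive upstream work, and the count() calls are my addition to force materialization before the join):

from pyspark import StorageLevel

df1 = heavy_calculations_on_table_1(spark)   # hypothetical: expensive joins/aggregations
df2 = heavy_calculations_on_table_2(spark)   # hypothetical: expensive joins/aggregations

df1 = df1.persist(StorageLevel.DISK_ONLY)
df2 = df2.persist(StorageLevel.DISK_ONLY)
df1.count()   # actions so the persisted data is materialized once
df2.count()

result = df1.join(df2, "key")   # the final join now reads the persisted data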

What is the fastest way to get a large number of time ranges using Apache Spark?

I have about 100 GB of time series data in Hadoop. I'd like to use Spark to grab all data from 1000 different time ranges.
I have tried this using Apache Hive by creating an extremely long SQL statement that has about 1000 'OR BETWEEN X AND Y OR BETWEEN Q AND R' statements.
I have also tried using Spark. In this technique I've created a dataframe that has the time ranges in question and loaded that into spark with:
spark_session.createDataFrame()
and
df.registerTempTable()
With this, I'm doing a join with the newly created timestamp dataframe and the larger set of timestamped data.
This query is taking an extremely long time and I'm wondering if there's a more efficient way to do this.
Especially if the data is not partitioned or ordered in any special way, you or Spark will need to scan it all no matter what.
I would define a predicate given the set of time ranges:
import scala.collection.immutable.Range

val ranges: List[Range] = ??? // load your ranges here

def matches(timestamp: Int): Boolean = {
  // This is not efficient; a better data structure than a List
  // should be used, but this is just an example
  ranges.exists(_.contains(timestamp))
}

val data: RDD[(Int, T)] = ??? // load the data in an RDD
val filtered = data.filter(x => matches(x._1))
You can do the same with DataFrame/DataSet and UDFs.
This works well if the set of ranges is available on the driver. If instead it comes from a table, like the 100 GB data, first collect it back to the driver, provided it is not too big.
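A hedged PySpark sketch of the DataFrame version: instead of a UDF, one BETWEEN predicate per range can be OR-ed together, which keeps the filter in Spark's native expressions (the column name timestamp and the literal ranges below are placeholders):

from functools import reduce
from pyspark.sql.functions import col, lit

# (start, end) pairs, loaded from wherever the ranges live
ranges = [(1609459200, 1609545600), (1612137600, 1612224000)]

predicate = reduce(
    lambda acc, r: acc | col("timestamp").between(lit(r[0]), lit(r[1])),
    ranges,
    lit(False),
)
filtered = df.filter(predicate)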
Your Spark job goes through the 100 GB dataset to select the relevant data.
I don't think there is a big difference between using SQL or the dataframe api, as under the hood the full scan happens anyway.
I would consider re-structuring your data, so it is optimised for specific queries.
In your case, partitioning by time can give quite a significant improvement (for example, a Hive table partitioned by time).
If you perform the search using the same field that was used for partitioning, the Spark job will only look into the relevant partitions.
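As a hedged illustration of that restructuring (the column names and table name are assumptions; the idea is simply to write the data partitioned by a derived date column so that range queries only touch the matching directories):

from pyspark.sql.functions import to_date, col

df.withColumn("event_date", to_date(col("event_ts"))) \
    .write \
    .partitionBy("event_date") \
    .mode("overwrite") \
    .saveAsTable("timeseries_partitioned")

# filters on the partition column are pruned to the relevant directories
spark.table("timeseries_partitioned") \
    .filter(col("event_date").between("2021-01-01", "2021-01-07"))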

How to optimize spark sql operations on large data frame?

I have a large hive table (~9 billion records, ~45 GB in ORC format). I am using spark sql to do some profiling of the table, but it takes too much time to do any operation on it. Just a count on the input data frame takes ~11 minutes to complete, and min, max and avg on any single column take more than one and a half hours.
I am working on a limited-resource cluster (as it is the only one available): a total of 9 executors, each with 2 cores and 5 GB of memory, spread over 3 physical nodes.
Is there any way to optimise this, say to bring down the time for all the aggregate functions on each column to less than 30 minutes with the same cluster, or is bumping up my resources the only way? I am personally not very keen to do that.
One solution I came across to speed up data frame operations is to cache them. But I don't think it's a feasible option in my case.
All the real world scenarios I came across use huge clusters for this kind of load.
Any help is appreciated.
I use spark 1.6.0 in standalone mode with kryo serializer.
There are some cool features in sparkSQL like:
Cluster by/ Distribute by/ Sort by
Spark allows you to write queries in a SQL-like language - HiveQL. HiveQL lets you control the partitioning of data, and we can use this in SparkSQL queries as well.
Distribute By
In spark, with DISTRIBUTE BY the dataframe is partitioned by some expression; all the rows for which this expression is equal are on the same partition.
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df DISTRIBUTE BY KEY
So, look at how it works. Say the data starts out in three partitions:
par1: [(1,c), (3,b)]
par2: [(3,c), (1,b), (3,d)]
par3: [(3,a),(2,a)]
This will transform into:
par1: [(1,c), (3,b), (3,c), (1,b), (3,d), (3,a)]
par2: [(2,a)]
Sort By
SELECT * FROM df SORT BY key
for this case it will look like:
par1: [(1,c), (1,b), (3,b), (3,c), (3,d), (3,a)]
par2: [(2,a)]
Cluster By
This is a shortcut for using distribute by and sort by together on the same set of expressions.
SET spark.sql.shuffle.partitions =2
SELECT * FROM df CLUSTER BY key
Note: this is basic information. Let me know if this helps; otherwise, we can use various other methods to optimize your spark jobs and queries, depending on the situation and settings.
