Partitioning of Data Frame in Pyspark using Custom Partitioner - apache-spark

I am looking for information on using a custom partitioner in PySpark. I have a dataframe holding data for various countries. If I repartition on the country column, the data is distributed into n partitions, with each country's rows kept together in a specific partition. This creates skewed partitions, as I can see using the glom() method.
Some countries, like USA and CHN, have a huge amount of data in this dataframe. I want to repartition the dataframe so that USA and CHN are each split further into about 10 partitions, while the other countries (IND, THA, AUS, etc.) keep their partitions as they are. Can we extend the partitioner class in PySpark code?
I have read in the link below that we can extend the Scala partitioner class in a Scala Spark application and use custom logic to repartition the data as required, which is what I need. Please help me achieve this in PySpark. See the link below: What is an efficient way to partition by column but maintain a fixed partition count?
I am using Spark version 2.3.0.2 and below is my Dataframe structure:
datadf= spark.sql("""
SELECT
ID_NUMBER ,SENDER_NAME ,SENDER_ADDRESS ,REGION_CODE ,COUNTRY_CODE
from udb.sometable
""");
The incoming data covers six countries: AUS, IND, THA, RUS, CHN and USA.
CHN and USA have skewed data.
So if I repartition on COUNTRY_CODE, two partitions contain a lot of data whereas the others are fine. I checked this using the glom() method.
newdf = datadf.repartition("COUNTRY_CODE")
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext, DataFrameWriter, DataFrame
newDF = datadf.repartitionByRange(3,"COUNTRY_CODE","USA")
I was trying to repartition my data into 3 more partitions for USA and CHN only, and would like to keep each of the other countries' data in a single partition.
This is what I am expecting:
AUS- one partition
IND- one partition
THA- one partition
RUS- one partition
CHN- three partitions
USA- three partitions
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1182, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'repartitionByRange'

Try something like this with hashing:
newDf = oldDf.repartition(N, $"col1", $"coln")
or for ranging approach:
newDF = oldDF.repartitionByRange(N, $"col1", $"coln")
There is no custom partitioning for DataFrames just yet.
In your case I would go for hashing, but there are no guarantees.
If your data is skewed you may need some extra work; partitioning on 2 columns is the simplest approach.
E.g. use an existing or new column, in this case a column that assigns a group number within a given country, e.g. 1 .. N, and then partition on the two columns.
For countries with a lot of data you get N synthetic subdivisions; for the others, with less data, only 1 such group number. Not too hard. Both partitioning methods can take more than 1 column.
In my view, filling partitions uniformly takes a lot of effort and is not really attainable, but a next-best approach like this one can suffice. It amounts to custom partitioning to an extent.
Otherwise, using .withColumn on a DataFrame you can simulate custom partitioning by filling a new column according to those rules and then applying repartitionByRange. Also not so hard.
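The two-column idea can be sketched without Spark. This is only a plain-Python model of the rule, not the answerer's code: the `SPLITS` table, the `sub_group` helper and the toy rows are all hypothetical, and the record ID is assumed to be an integer.

```python
from collections import Counter

# Hypothetical split counts: skewed countries get several sub-groups,
# every other country implicitly gets one.
SPLITS = {"USA": 3, "CHN": 3}

def sub_group(country, record_id):
    """Return a synthetic group number in 0..N-1 for skewed countries, else 0."""
    return record_id % SPLITS.get(country, 1)

# Toy data: (record_id, country); USA and CHN dominate, as in the question.
countries = ["USA"] * 6 + ["CHN"] * 6 + ["IND", "THA", "AUS", "RUS"]
rows = list(enumerate(countries))

# Count rows per (country, sub_group) pair, the pair you would partition on.
sizes = Counter((country, sub_group(country, rid)) for rid, country in rows)
for key in sorted(sizes):
    print(key, sizes[key])
```

In a DataFrame the same rule would become a new column that is passed to repartition together with COUNTRY_CODE. Hash partitioning on the pair still does not guarantee one partition per pair, which is the "no guarantees" caveat above.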

There is no custom partitioner in the Structured API, so to use one you need to drop down to the RDD API. It takes 3 simple steps:
Convert Structured API to RDD API
dataRDD = dataDF.rdd
Apply custom partitioner in RDD API
import random
# Extract the key from each Row; this assumes the country is in COUNTRY_CODE
dataRDD = dataRDD.map(lambda r: (r["COUNTRY_CODE"], r))
def partitioner(key):
    if key == "CHN":
        return random.randint(1, 10)
    elif key == "USA":
        return random.randint(11, 20)
    else:
        # distinctCountryDict is a dict mapping distinct countries to distinct integers;
        # these integers should not overlap with 1..20 (randint is inclusive on both ends)
        return distinctCountryDict[key]
numPartitions = 100
dataRDD = dataRDD.partitionBy(numPartitions, partitioner)
# Remove key extracted previously
dataRDD = dataRDD.map(lambda r: r[1])
Convert RDD API back to Structured API
dataDF = dataRDD.toDF()
This way, you get the best of both worlds: Spark types and an optimized physical plan in the Structured API, as well as a custom partitioner in the low-level RDD API. And we drop down to the low-level API only when it's absolutely necessary.
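The answer leaves distinctCountryDict undefined. Below is one possible way to build it, as a Spark-free sketch: the `other_countries` list and the `target_partition` helper are assumptions for illustration. The helper models the fact that PySpark's partitionBy takes the partition function's result modulo the number of partitions.

```python
import random

NUM_PARTITIONS = 100  # same value as numPartitions in the answer

# Hypothetical list of the non-skewed countries from the question.
other_countries = ["AUS", "IND", "THA", "RUS"]

# Give each remaining country its own integer, starting after the 1..20
# range that the CHN and USA branches can return.
distinctCountryDict = {c: 21 + i for i, c in enumerate(sorted(other_countries))}

def partitioner(key):
    if key == "CHN":
        return random.randint(1, 10)   # inclusive on both ends
    elif key == "USA":
        return random.randint(11, 20)
    return distinctCountryDict[key]

def target_partition(key):
    # Model of partitionBy: the function's result is reduced modulo numPartitions.
    return partitioner(key) % NUM_PARTITIONS

print(distinctCountryDict)
print(target_partition("IND"))
```

With 100 partitions and only 24 distinct values in use, most partitions stay empty; a smaller partition count would work as long as the values do not collide after the modulo.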

There is no direct way to apply a user-defined partitioner in PySpark. The shortcut is to create a new column with a UDF that assigns each record a partition ID based on the business logic, and then use the new column for partitioning; that way the data gets spread evenly.
numPartitions= 3
df = df.withColumn("Hash#", udf_country_hash(df['Country']))
df = df.withColumn("Partition#", df["Hash#"] % numPartitions)
df = df.repartition(numPartitions, "Partition#")
Please check the online version of the code:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8963851468310921/2231943684776180/5846184720595634/latest.html
In my experience, converting a DataFrame to an RDD and back to a DataFrame is a costly operation, so it is better to avoid it.

Related

Why is UDF not running in parallel on available executors?

I have a tiny spark Dataframe that essentially pushes a string into a UDF. I'm expecting, because of .repartition(3), which is the same length as targets, for the processing inside run_sequential to be applied on available executors - i.e. applied to 3 different executors.
The issue is that only 1 executor is used. How can I parallelise this processing to force my pyspark script to assign each element of target to a different executor?
import pandas as pd
import pyspark.sql.functions as F
def run_parallel(config):
    def run_sequential(target):
        # process with target variable
        pass
    return F.udf(run_sequential)
targets = ["target_1", "target_2", "target_3"]
config = {}
pdf = spark.createDataFrame(pd.DataFrame({"targets": targets})).repartition(3)
pdf.withColumn(
    "apply_udf", run_parallel(config)("targets")
).collect()
The issue here is that repartitioning a DataFrame does not guarantee that all the created partitions will be of the same size. With such a small number of records there is a pretty high chance that some of them will map into the same partition. Spark is not meant to process such small datasets and its algorithms are tailored to work efficiently with large amounts of data - if your dataset has 3 million records and you split it in 3 partitions of approximately 1 million records each, a difference of several records per partition will be insignificant in most cases. This is obviously not the case when repartitioning 3 records.
You can use df.rdd.glom().map(len).collect() to examine the size of the partitions before and after repartitioning to see how the distribution changes.
$ pyspark --master "local[3]"
...
>>> pdf = spark.createDataFrame([("target_1",), ("target_2",), ("target_3",)]).toDF("targets")
>>> pdf.rdd.glom().map(len).collect()
[1, 1, 1]
>>> pdf.repartition(3).rdd.glom().map(len).collect()
[0, 2, 1]
As you can see, the resulting partitioning is uneven and the first partition in my case is actually empty. The irony here is that the original dataframe has the desired property and that one is getting destroyed by repartition().
While your particular case is not what Spark typically targets, it is still possible to forcefully distribute three records in three partitions. All you need to do is to provide an explicit partition key. RDDs have the zipWithIndex() method that extends each record with its ID. The ID is the perfect partition key since its value starts with 0 and increases by 1.
>>> new_df = (pdf
        .coalesce(1)                  # not part of the solution - see below
        .rdd                          # Convert to RDD
        .zipWithIndex()               # Append ID to each record
        .map(lambda x: (x[1], x[0]))  # Make record ID come first
        .partitionBy(3)               # Repartition
        .map(lambda x: x[1])          # Remove record ID
        .toDF())                      # Turn back into a dataframe
>>> new_df.rdd.glom().map(len).collect()
[1, 1, 1]
In the above code, coalesce(1) is added only to demonstrate that the final partitioning is not influenced by the fact that pdf initially has one record in each partition.
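Why the index works as a partition key can be seen in a small Spark-free model. It assumes that the default partition function hashes the key and takes the result modulo the partition count, and it relies on small non-negative integers hashing to themselves in Python. The variable names are illustrative only.

```python
records = ["target_1", "target_2", "target_3"]
num_partitions = 3

# zipWithIndex pairs each record with a 0-based index; keep it as (index, record).
keyed = list(enumerate(records))

# Model of hash partitioning on an integer key: hash(i) == i for small
# non-negative ints, so consecutive indices land in consecutive buckets.
buckets = [[] for _ in range(num_partitions)]
for idx, rec in keyed:
    buckets[hash(idx) % num_partitions].append(rec)

print([len(b) for b in buckets])  # [1, 1, 1]
```

With more records than partitions, the same rule deals them out in turn, so partition sizes differ by at most one.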
A DataFrame-only solution is to first coalesce pdf to a single partition and then use repartition(3). With no partitioning column(s) provided, DataFrame.repartition() uses the round-robin partitioner and hence the desired partitioning will be achieved. You cannot simply do pdf.coalesce(1).repartition(3) since Catalyst (the Spark query optimisation engine) optimises out the coalesce operation, so a partitioning-dependent operation must be inserted in between. Adding a column containing F.monotonically_increasing_id() is a good candidate for such an operation.
>>> new_df = (pdf
        .coalesce(1)
        .withColumn("id", F.monotonically_increasing_id())
        .repartition(3))
>>> new_df.rdd.glom().map(len).collect()
[1, 1, 1]
Note that, unlike in the RDD-based solution, coalesce(1) is required as part of the solution.

How to join efficiently 2 Spark dataframes partitioned by some column, when that column is one of multiple join keys?

I am currently facing some issues in Spark 3.0.2 to efficiently join 2 Spark dataframes when
The 2 Spark DataFrames are partitioned by some key id;
id is part of the join key, but it is not the only one.
My intuition is telling me that the query optimizer is, in this case, not choosing the optimal path. I will illustrate my issue through a minimal example (note that this particular example does not really require a join, it's just for illustrative purposes).
Let's start from the simple case: the 2 dataframes are partitioned by id, and we join by id only:
from pyspark.sql import SparkSession, Row, Window
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
# Make up some test dataframe
df = spark.createDataFrame([Row(id=i // 10, order=i % 10, value=i) for i in range(10000)])
# Create the left side of the join (repartitioned by id)
df2 = df.repartition(50, 'id')
# Create the right side of the join (also repartitioned by id)
df3 = df2.select('id', F.col('order').alias('order_alias'), F.lit(0).alias('dummy'))
# Perform the join
joined_df = df2.join(df3, on='id')
joined_df.foreach(lambda x: None)
This results in the following efficient plan:
This plan is efficient: it recognizes that the 2 dataframes are already partitioned by the join key and avoids re-shuffling them. The 2 dataframes are not only repartitioned, but also colocated.
What happens if there is an additional join key? It results in an inefficient plan:
joined_df = df2.join(df3, on=[df2.id==df3.id, df2.order==df3.order_alias])
joined_df.foreach(lambda x: None)
The plan is inefficient since it is repartitioning the 2 dataframes to do the join. This does not make sense to me. Intuitively, we could use the existing partitions: all keys to be joined will be found in the same partition as before, there is just one additional condition to apply! So I thought: perhaps we could phrase the 2nd condition as a filter?
joined_df = df2.join(df3, on='id')
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This however results in the same inefficient plan, since Spark query optimizer will just merge the 2nd filter with the join.
So, I finally thought that maybe I could force Spark to process the join as I want by adding a dummy cache step, by trying the following:
from pyspark import StorageLevel
joined_df = df2.join(df3, on='id')
# Note that this storage level will not cache anything, it's just to suggest to Spark that I need this intermediate result
joined_df.persist(StorageLevel(False, False, False, False))
# Do the filtering after "persisting" the join
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This results in an efficient plan! It is in fact much faster than the previous ones.
The workaround of "persisting" the first join to force Spark to use a more efficient processing plan is "good enough" for my use case, but I still have a few questions:
Am I missing something in my intuition that Spark should actually be reusing partitions when the partition key is part of the join key, instead of re-shuffling?
Is this expected behavior of the query optimizer? Should a ticket be filed for it?
Is there a better way to force the desired processing plan than adding the "persist" step? It seems more like an indirect workaround than a direct solution.

Spark scala partition dataframe for large cross joins

I have two dataframes that need to be cross joined on a 20-node cluster. However because of their size, a simple crossjoin is failing. I am looking to partition the data and perform the crossjoin and am looking for an efficient way to do it.
Simple Algorithm
Manually split file f1 into three and read the pieces into dataframes df1A, df1B, df1C. Manually split file f2 into four and read the pieces into dataframes df2A, df2B, df2C, df2D. Cross join df1A X df2A, df1A X df2B, ..., df1A X df2D, ..., df1C X df2D. Save each cross join in a file and manually put all the files together. This way Spark can perform each cross join in parallel and things should complete fairly quickly.
Question
Is there a more efficient way of accomplishing this by reading both files into two dataframes, then partitioning each dataframe into 3 and 4 "pieces", and cross joining each partition of one dataframe with every partition of the other?
A DataFrame can be partitioned either by range or by hash.
val df1 = spark.read.csv("file1.txt")
val df2 = spark.read.csv("file2.txt")
val partitionedByRange1 = df1.repartitionByRange(3, $"k")
val partitionedByRange2 = df2.repartitionByRange(4, $"k")
val result = partitionedByRange1.crossJoin(partitionedByRange2)
NOTE : set property spark.sql.crossJoin.enabled=true
You can convert each DataFrame into an RDD and then use the cartesian operation on the RDDs. You should then be able to save the resulting RDD to a file. Hope that helps.
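That the piecewise approach from the question covers the whole cross join can be checked on toy data. This is a plain-Python sketch under assumptions: lists stand in for the two files, and `split` is a hypothetical helper, not a Spark API.

```python
from itertools import product

def split(items, pieces):
    """Split a list into `pieces` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), pieces)
    chunks, start = [], 0
    for i in range(pieces):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

left = list(range(7))        # stand-in for the rows of f1
right = list("abcdefghij")   # stand-in for the rows of f2

# Cross join every chunk of the left side with every chunk of the right side.
blockwise = []
for l_chunk, r_chunk in product(split(left, 3), split(right, 4)):
    blockwise.extend(product(l_chunk, r_chunk))

# The 3 x 4 = 12 partial joins together equal the full cross join.
print(len(blockwise), sorted(blockwise) == sorted(product(left, right)))
```

Each of the 12 chunk pairs is independent of the others, which is what makes the manual scheme parallelisable.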

First element of each dataframe partition Spark 2.0

I need to retrieve the first element of each dataframe partition.
I know that I need to use mapPartitions but it is not clear to me how to use it.
Note: I am using Spark2.0, the dataframe is sorted.
I believe it should look something like following:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
...
implicit val encoder = RowEncoder(df.schema)
val newDf = df.mapPartitions(iterator => iterator.take(1))
This will take 1 element from each partition in DataFrame. Then you can collect all the data to your driver i.e.:
newDf.collect()
This will return an array with at most one element per partition; empty partitions contribute nothing, so the array has as many elements as there are non-empty partitions.
UPD updated in order to support Spark 2.0
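The per-partition behaviour of take(1) can be modelled in plain Python. This sketch is not Spark code: nested lists stand in for partitions, and `first_of_each` is a hypothetical helper that mirrors iterator.take(1).

```python
from itertools import islice

# Stand-in for a sorted DataFrame split into partitions (one is empty).
partitions = [[1, 2, 3], [], [4, 5], [6]]

def first_of_each(partition_iter):
    """Yield at most one element from a partition's iterator."""
    return islice(partition_iter, 1)

# Apply the function to every partition and flatten, as mapPartitions would.
firsts = [row for part in partitions for row in first_of_each(iter(part))]
print(firsts)  # [1, 4, 6]
```

The empty partition yields nothing, so four partitions produce only three rows here.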

Spark DataTables: where is partitionBy?

A common Spark processing flow we have is something like this:
Loading:
rdd = sqlContext.parquetFile("mydata/")
rdd = rdd.map(lambda row: (row.id,(some stuff)))
rdd = rdd.filter(....)
rdd = rdd.partitionBy(rdd.getNumPartitions())
Processing by id (this is why we do the partitionBy above!)
rdd.reduceByKey(....)
rdd.join(...)
However, Spark 1.3 changed sqlContext.parquetFile to return DataFrame instead of RDD, and it no longer has the partitionBy, getNumPartitions, and reduceByKey methods.
What do we do now with partitionBy?
We can replace the loading code with something like
rdd = sqlContext.parquetFile("mydata/").rdd
rdd = rdd.map(lambda row: (row.id,(some stuff)))
rdd = rdd.filter(....)
rdd = rdd.partitionBy(rdd.getNumPartitions())
df = rdd.map(lambda ...: Row(...)).toDF(???)
and use groupBy instead of reduceByKey.
Is this the right way?
PS. Yes, I understand that partitionBy is not necessary for groupBy et al. However, without a prior partitionBy, each join, groupBy &c may have to do cross-node operations. I am looking for a way to guarantee that all operations requiring grouping by my key will run local.
It appears that, since version 1.6, repartition(self, numPartitions, *cols) does what I need:
.. versionchanged:: 1.6
Added optional arguments to specify the partitioning columns.
Also made numPartitions optional if partitioning columns are specified.
Since DataFrame provides an abstraction of tables and columns over RDDs, the most convenient way to manipulate a DataFrame is to use these abstractions along with the table manipulation methods that DataFrame offers.
On a DataFrame, we could:
transform the table schema with select() \ udf() \ as()
filter rows out by filter() or where()
fire an aggregation through groupBy() and agg()
or other analytic job using sample() \ join() \ union()
persist your result using saveAsTable() \ saveAsParquetFile() \ insertIntoJDBC()
Please refer to Spark SQL and DataFrame Guide for more details.
Therefore, a common job looks like:
val people = sqlContext.parquetFile("...")
val department = sqlContext.parquetFile("...")
people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), "gender")
  .agg(avg(people("salary")), max(people("age")))
And for your specific requirements, this could look like:
val t = sqlContext.parquetFile()
t.filter().select().groupBy().agg()
