How to divide a column by its sum in a Spark DataFrame - apache-spark

How can I divide a column by its own sum in a Spark DataFrame, efficiently and without immediately triggering a computation?
Suppose we have some data:
import pyspark
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as spf
spark = SparkSession.builder.master('local').getOrCreate()
data = spark.range(0, 100)
data # --> DataFrame[id: bigint]
I’d like to create a new column on this data frame called “normalized” that contains id / sum(id). One way to do it is to pre-compute the sum, like this:
s = data.select(spf.sum('id')).collect()[0][0]
data2 = data.withColumn('normalized', spf.col('id') / s)
data2 # --> DataFrame[id: bigint, normalized: double]
That works fine, but it immediately triggers a computation; if you're defining something similar for many columns it will cause multiple redundant passes over the data.
Another way to do it is with a windowing specification that includes the whole table:
w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
data3 = data.withColumn('normalized', spf.col('id') / spf.sum('id').over(w))
data3 # --> DataFrame[id: bigint, normalized: double]
In this case, it's fine to define data3, but once you try to actually compute it, Spark 2.2.0 will move all the data into a single partition, which typically causes the job to fail for large data sets.
What other approaches are there to solving this problem, that don't trigger an immediate computation and that will work with large data sets? I'm interested in any solutions, not necessarily solutions based on pyspark.

crossJoin with aggregate is one approach:
data.crossJoin(
data.select(spf.sum('id').alias("sum_id"))
).withColumn("normalized", spf.col("id") / spf.col("sum_id"))
but I wouldn't worry too much about this part:
That works fine, but it immediately triggers a computation; if you're defining something similar for many columns it will cause multiple redundant passes over the data.
Just compute multiple statistics at once:
data2 = data.select(spf.rand(42).alias("x"), spf.randn(42).alias("y"))
mean_x, mean_y = data2.groupBy().mean().first()
and the rest is just an operation on local expressions:
data2.select(spf.col("x") - mean_x, spf.col("y") - mean_y)

Related

How to join efficiently 2 Spark dataframes partitioned by some column, when that column is one of multiple join keys?

I am currently facing some issues in Spark 3.0.2 with efficiently joining 2 Spark dataframes when:
The 2 Spark DataFrames are partitioned by some key id;
id is part of the join key, but it is not the only one.
My intuition is telling me that the query optimizer is, in this case, not choosing the optimal path. I will illustrate my issue through a minimal example (note that this particular example does not really require a join, it's just for illustrative purposes).
Let's start from the simple case: the 2 dataframes are partitioned by id, and we join by id only:
from pyspark.sql import SparkSession, Row, Window
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
# Make up some test dataframe
df = spark.createDataFrame([Row(id=i // 10, order=i % 10, value=i) for i in range(10000)])
# Create the left side of the join (repartitioned by id)
df2 = df.repartition(50, 'id')
# Create the right side of the join (also repartitioned by id)
df3 = df2.select('id', F.col('order').alias('order_alias'), F.lit(0).alias('dummy'))
# Perform the join
joined_df = df2.join(df3, on='id')
joined_df.foreach(lambda x: None)
This results in an efficient plan: Spark recognizes that the 2 dataframes are already partitioned by the join key and avoids re-shuffling them. The 2 dataframes are not only repartitioned, but also colocated.
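The plan itself is not reproduced here, but it can be inspected with Spark's explain() (a sketch):
# Print the physical plan for the id-only join
joined_df.explain()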
What happens if there is an additional join key? It results in an inefficient plan:
joined_df = df2.join(df3, on=[df2.id==df3.id, df2.order==df3.order_alias])
joined_df.foreach(lambda x: None)
The plan is inefficient since it is repartitioning the 2 dataframes to do the join. This does not make sense to me. Intuitively, we could use the existing partitions: all keys to be joined will be found in the same partition as before, there is just one additional condition to apply! So I thought: perhaps we could phrase the 2nd condition as a filter?
joined_df = df2.join(df3, on='id')
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This, however, results in the same inefficient plan, since the Spark query optimizer will simply merge the filter back into the join.
So I finally thought that maybe I could force Spark to process the join the way I want by adding a dummy cache step, trying the following:
from pyspark import StorageLevel
joined_df = df2.join(df3, on='id')
# Note that this storage level will not cache anything, it's just to suggest to Spark that I need this intermediate result
joined_df.persist(StorageLevel(False, False, False, False))
# Do the filtering after "persisting" the join
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This results in an efficient plan! It is in fact much faster than the previous ones.
The workaround of "persisting" the first join to force Spark to use a more efficient processing plan is "good enough" for my use case, but I still have a few questions:
Am I missing something in my intuition that Spark should actually be reusing partitions when the partition key is part of the join key, instead of re-shuffling?
Is this expected behavior of the query optimizer? Should a ticket be filed for it?
Is there a better way to force the desired processing plan than adding the "persist" step? It seems more like an indirect workaround than a direct solution.

Pyspark - Index from monotonically_increasing_id changes after list aggregation

I'm creating an index using the monotonically_increasing_id() function in Pyspark 3.1.1.
I'm aware of the specific characteristics of that function, but they don't explain my issue.
After creating the index I do a simple aggregation applying the collect_list() function on the created index.
If I compare the results, the index changes in certain cases, specifically at the upper end of the long range, when the input data is not too small.
Full example code:
import random
import string
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder\
.appName("test")\
.master("local")\
.config('spark.sql.shuffle.partitions', '8')\
.getOrCreate()
# Create random input data of around length 100000:
input_data = []
ii = 0
while ii <= 100000:
    L = random.randint(1, 3)
    B = ''.join(random.choices(string.ascii_uppercase, k=5))
    for i in range(L):
        C = random.randint(1, 100)
        input_data.append((B,))
    ii += 1
# Create Spark DataFrame:
input_rdd = spark.sparkContext.parallelize(tuple(input_data))
schema = StructType([StructField("B", StringType())])
dg = spark.createDataFrame(input_rdd, schema=schema)
# Create id and aggregate:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id())
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))
Output:
dg.sort('B', ascending=False).show(10, truncate=False)
dg2.sort('B', ascending=False).show(5, truncate=False)
This of course creates different data with every run, but if the input is large enough (the problem starts to appear at around 10,000 rows, though not at 1,000), it should show up every time. Here's an example result:
+-----+-----------+
|B |ID0 |
+-----+-----------+
|ZZZVB|60129554616|
|ZZZVB|60129554617|
|ZZZVB|60129554615|
|ZZZUH|60129554614|
|ZZZRW|60129554612|
|ZZZRW|60129554613|
|ZZZNH|60129554611|
|ZZZNH|60129554609|
|ZZZNH|60129554610|
|ZZZJH|60129554606|
+-----+-----------+
only showing top 10 rows
+-----+---------------------------------------+
|B |collect_list(ID0) |
+-----+---------------------------------------+
|ZZZVB|[60129554742, 60129554743, 60129554744]|
|ZZZUH|[60129554741] |
|ZZZRW|[60129554739, 60129554740] |
|ZZZNH|[60129554736, 60129554737, 60129554738]|
|ZZZJH|[60129554733, 60129554734, 60129554735]|
+-----+---------------------------------------+
only showing top 5 rows
The entry ZZZVB has the three IDs 60129554615, 60129554616, and 60129554617 before aggregation, but after aggregation the numbers have changed to 60129554742, 60129554743, 60129554744.
Why? I can't imagine this is supposed to happen. Isn't the result of monotonically_increasing_id() a simple long that keeps its value after having been created?
EDIT: As expected, a workaround is to coalesce(1) the DataFrame before creating the id.
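A sketch of that workaround, based on the question's own code:
# Collapsing to a single partition before assigning the id keeps the ids stable,
# at the cost of losing parallelism for this step
dg = dg.coalesce(1).sort("B").withColumn("ID0", f.monotonically_increasing_id())
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))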
dg and dg2 are two different dataframes, each with its own DAG. These DAGs are executed independently of each other when an action on one of the dataframes is called. So each time show() is called, the DAG of the respective dataframe is evaluated, and during that evaluation f.monotonically_increasing_id() is called.
To prevent f.monotonically_increasing_id() from being evaluated twice, you could add a cache after the withColumn transformation:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id()).cache()
With the cache, the result of the first evaluation of f.monotonically_increasing_id() is cached and reused when evaluating the second dataframe.
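With the cache in place, rerunning the question's output code (a sketch reusing its variables) shows matching ids before and after the aggregation:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id()).cache()
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))
# Both actions now evaluate against the same cached ids
dg.sort("B", ascending=False).show(10, truncate=False)
dg2.sort("B", ascending=False).show(5, truncate=False)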

Partitioning of Data Frame in Pyspark using Custom Partitioner

Looking for some info on using a custom partitioner in Pyspark. I have a dataframe holding country data for various countries. If I repartition on the country column, it distributes my data into n partitions, keeping each country's data in a specific partition. This creates skewed partition data, as I can see using the glom() method.
Some countries like USA and CHN have a huge amount of data in this particular dataframe. I want to repartition my dataframe such that if the country is USA or CHN, it is further split into some 10 partitions, while the partitions stay the same for the other countries like IND, THA, AUS etc. Can we extend the Partitioner class in Pyspark code?
I have read in the link below that we can extend the Scala Partitioner class in a Scala Spark application and modify it to use custom logic to repartition our data based on our requirements, like the one I have. Please help me achieve this in Pyspark. See the link below: What is an efficient way to partition by column but maintain a fixed partition count?
I am using Spark version 2.3.0.2 and below is my Dataframe structure:
datadf= spark.sql("""
SELECT
ID_NUMBER ,SENDER_NAME ,SENDER_ADDRESS ,REGION_CODE ,COUNTRY_CODE
from udb.sometable
""");
The incoming data has data for six countries: AUS, IND, THA, RUS, CHN and USA. CHN and USA have skewed data.
So if I repartition on COUNTRY_CODE, two partitions contain a lot of data whereas the others are fine. I checked this using the glom() method.
newdf = datadf.repartition("COUNTRY_CODE")
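The glom() check mentioned above can be done along these lines (a sketch):
# Rows per partition after repartitioning by country; the skewed countries
# show up as a couple of very large partitions
print(newdf.rdd.glom().map(len).collect())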
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext, DataFrameWriter, DataFrame
newDF = datadf.repartitionByRange(3,"COUNTRY_CODE","USA")
I was trying to repartition my data into 3 partitions each for the countries USA and CHN only, and would like to keep the other countries' data in a single partition each.
This is what I am expecting
AUS- one partition
IND- one partition
THA- one partition
RUS- one partition
CHN- three partitions
USA- three partitions
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1182, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'repartitionByRange'
Try something like this with hashing:
newDf = oldDf.repartition(N, $"col1", $"coln")
or for a range-based approach:
newDF = oldDF.repartitionByRange(N, $"col1", $"coln")
There is no custom partitioning for DataFrames just yet.
In your case I would go for hashing, but there are no guarantees.
But if your data is skewed you may need some extra work; using 2 columns for partitioning is the simplest approach: an existing or a new column, in this case a column that assigns a group number within a given country, e.g. 1 .. N, and then partition on the two columns (sketched below).
For countries with a lot of data you get N synthetic sub-divisions; for the others, with low cardinality, only 1 such group number. Not too hard. Both forms of partitioning can take more than 1 column.
In my view, uniformly filling partitions takes a lot of effort and is not really attainable, but a next-best approach like this can suffice well enough. It amounts to custom partitioning to an extent.
Otherwise, using .withColumn on a DataFrame you can simulate custom partitioning by filling a new column according to those rules and then applying repartitionByRange. Also not so hard.
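A minimal sketch of that idea in PySpark, using the question's datadf (the GROUP_NO column name and N are illustrative, not from the original answer): add a synthetic group number that ranges over N values for the skewed countries and stays 0 for the rest, then hash-partition on both columns.
import pyspark.sql.functions as F
N = 3  # synthetic sub-divisions for the skewed countries
skewed = ["USA", "CHN"]
salted = datadf.withColumn(
    "GROUP_NO",
    F.when(F.col("COUNTRY_CODE").isin(skewed), (F.rand() * N).cast("int"))  # 0 .. N-1
     .otherwise(F.lit(0)))  # a single group for the other countries
# Hash-partition on both columns: the skewed countries spread over N partitions each
newdf = salted.repartition("COUNTRY_CODE", "GROUP_NO")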
There is no custom partitioner in the Structured API, so in order to use a custom partitioner you'll need to drop down to the RDD API. It takes 3 simple steps:
Convert Structured API to RDD API
dataRDD = dataDF.rdd
Apply custom partitioner in RDD API
import random
# Extract the partitioning key (the country code) from the Row object
dataRDD = dataRDD.map(lambda r: (r['COUNTRY_CODE'], r))
def partitioner(key):
    if key == "CHN":
        return random.randint(1, 10)
    elif key == "USA":
        return random.randint(11, 20)
    else:
        # distinctCountryDict is a dict mapping distinct countries to distinct integers
        # these distinct integers should not overlap with range(1, 20)
        return distinctCountryDict[key]
numPartitions = 100
dataRDD = dataRDD.partitionBy(numPartitions, partitioner)
# Remove the key extracted previously
dataRDD = dataRDD.map(lambda r: r[1])
Convert RDD API back to Structured API
dataDF = dataRDD.toDF()
This way, you get the best of both worlds: Spark types and an optimized physical plan in the Structured API, as well as a custom partitioner in the low-level RDD API. And we drop down to the low-level API only when it's absolutely necessary.
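To verify the resulting distribution, the same glom() check the question already uses works on the repartitioned RDD (a sketch):
# Row count per partition after applying the custom partitioner
print(dataRDD.glom().map(len).collect())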
There is no direct way to apply a user-defined partitioner in PySpark; the shortcut is to create a new column with a UDF, assigning each record a partition ID based on the business logic, and then use the new column for partitioning. That way the data gets spread evenly.
numPartitions = 3
# udf_country_hash maps a country to an integer hash (see the sketch after the notebook link below)
df = df.withColumn("Hash#", udf_country_hash(df['Country']))
df = df.withColumn("Partition#", df["Hash#"] % numPartitions)
df = df.repartition(numPartitions, "Partition#")
Please check the online version of the code here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8963851468310921/2231943684776180/5846184720595634/latest.html
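For readers without access to the notebook, a hypothetical definition of udf_country_hash (not part of the original answer) could look like this; any deterministic string-to-integer hash will do:
import zlib
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
# Hypothetical stand-in: a deterministic, non-negative integer hash of the country,
# so the same country always gets the same Hash# on every executor
udf_country_hash = F.udf(lambda country: zlib.crc32(country.encode("utf-8")) & 0x7FFFFFFF, IntegerType())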
In my experience, converting a DataFrame to an RDD and back to a DataFrame is a costly operation; it is better to avoid it.

spark: No of records in DataFrame is different in different runs

I am running a Spark job that reads data from Teradata. The query looks like
select * from db_name.table_name sample 5000000;
I'm trying to pull a sample of 5 million rows of data. When I print the number of rows in the resulting DataFrame, it gives different results on each run. Sometimes it is 4999937 and sometimes it is 5000124. Is there any particular reason for this kind of behaviour?
EDIT #1:
The code I'm using:
val query = "(select * from db_name.table_name sample 5000000) as data"
var teradataConfig = Map("url"->"jdbc:teradata://HOSTNAME/DATABASE=db_name,DBS_PORT=1025,MAYBENULL=ON",
"TMODE"->"TERA",
"user"->"username",
"password"->"password",
"driver"->"com.teradata.jdbc.TeraDriver",
"dbtable" -> query)
var df = spark.read.format("jdbc").options(teradataConfig).load()
df.count
Try caching the resultant dataframe and then perform the count action on it:
df.cache()
println(s"Record count: ${df.count()}")
From here on, when you reuse df to create a new dataframe or apply any other transformation, you won't get mismatched counts, since it is already in the cache.
Make sure you have given enough memory to hold the cached dataframe in memory.

Appending data to an empty dataframe

I am creating an empty dataframe and later trying to append another dataframe to it. In fact, I want to append many dataframes to the initially empty dataframe dynamically, depending on the number of RDDs coming in.
The union() function works fine if I assign the result to a third dataframe.
val df3=df1.union(df2)
But I want to keep appending to the initial (empty) dataframe I created, because I want to store all the RDDs in one dataframe. The code below, however, does not show the right counts. It seems that it simply did not append:
df1.union(df2)
df1.count() // this shows 0 although df2 has some data, which is shown if I assign the union to a third dataframe
If I do the below, I get a reassignment error since df1 is a val. And if I change it to var, I get a "kafka multithreading not safe" error:
df1=df1.union(df2)
Any idea how to add all the dynamically created dataframes to one initially created dataframe?
Not sure if this is what you are looking for!
# Import pyspark functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Define your schema
field = [StructField("Col1", StringType(), True), StructField("Col2", IntegerType(), True)]
schema = StructType(field)
# Your empty data frame
df = spark.createDataFrame(sc.emptyRDD(), schema)
l = []
for i in range(5):
    # Build and append to the list dynamically
    l = l + [([str(i), i])]
    # Create a temporary data frame similar to your original schema
    temp_df = spark.createDataFrame(l, schema)
    # Do the union with the original data frame
    df = df.union(temp_df)
df.show()
DataFrames and other distributed data structures are immutable; therefore, methods which operate on them always return a new object. There is no appending, no modification in place, and no ALTER TABLE equivalent.
And if I change it to var type, I get kafka multithreading not safe error.
Without the actual code it is impossible to give you a definitive answer, but it is unlikely to be related to the union code.
There are a number of known Spark bugs caused by incorrect internal implementation (SPARK-19185 and SPARK-23623, to enumerate just a few).
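Given that immutability, a common idiom (a sketch, reusing the schema from the answer above) is to collect the dynamically created dataframes and fold union over them, reassigning the result each time rather than appending in place:
from functools import reduce
# dfs stands in for the dynamically created dataframes, all sharing the same schema
dfs = [spark.createDataFrame([(str(i), i)], schema) for i in range(5)]
# union always returns a new dataframe; folding it over the list combines them all
combined = reduce(lambda a, b: a.union(b), dfs)
combined.show()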
