I have a PySpark dataframe with columns ID and BALANCE.
I am trying to bucket the column balance into 100 percentile (1-100%) buckets and calculate how many IDs fall in each bucket.
I cannot use anything related to RDDs; I can only use the PySpark DataFrame API. I've tried the code below:
w = Window.orderBy(df.BALANCE)
test = df.withColumn('percentile_col', F.percent_rank().over(w))
I am hoping to get a new column that calculates the percentile of each data point in the BALANCE column while ignoring the missing values.
Spark 3.1+ has unionByName with an optional argument allowMissingColumns, which makes this easier.
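For reference, here is a minimal sketch of that argument (the two tiny frames below are made up purely for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df2 lacks the 'extra' column; allowMissingColumns fills it with nulls
df1 = spark.createDataFrame([(1, 'a')], ['id', 'extra'])
df2 = spark.createDataFrame([(2,)], ['id'])
df1.unionByName(df2, allowMissingColumns=True).show()
#+---+-----+
#| id|extra|
#+---+-----+
#|  1|    a|
#|  2| null|
#+---+-----+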
Test data:
from pyspark.sql import functions as F, Window as W
df = spark.range(12).withColumn(
    'balance',
    F.when(~F.col('id').isin([0, 1, 2, 3, 4]), F.col('id') + 500))
df.show()
#+---+-------+
#| id|balance|
#+---+-------+
#| 0| null|
#| 1| null|
#| 2| null|
#| 3| null|
#| 4| null|
#| 5| 505|
#| 6| 506|
#| 7| 507|
#| 8| 508|
#| 9| 509|
#| 10| 510|
#| 11| 511|
#+---+-------+
percent_rank gives you exact percentiles, so the results may have many digits after the decimal point. That's why percent_rank alone may not provide what you want.
# compute percent_rank over the non-null balances only, then union the
# null rows back (allowMissingColumns keeps their percentile as null)
df = (
    df.filter(~F.isnull('balance'))
    .withColumn('percentile', F.percent_rank().over(W.orderBy('balance')))
    .unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+-------------------+
#| id|balance| percentile|
#+---+-------+-------------------+
#| 5| 505| 0.0|
#| 6| 506|0.16666666666666666|
#| 7| 507| 0.3333333333333333|
#| 8| 508| 0.5|
#| 9| 509| 0.6666666666666666|
#| 10| 510| 0.8333333333333334|
#| 11| 511| 1.0|
#| 0| null| null|
#| 1| null| null|
#| 2| null| null|
#| 3| null| null|
#| 4| null| null|
#+---+-------+-------------------+
The following should work (starting again from the original test data); a rounding step turns the exact percentile into a 1-100 bucket.
pr = F.percent_rank().over(W.orderBy('balance'))
df = (
    df.filter(~F.isnull('balance'))
    .withColumn('bucket', F.when(pr == 0, 1).otherwise(F.ceil(pr * 100)))
    .unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+------+
#| id|balance|bucket|
#+---+-------+------+
#| 5| 505| 1|
#| 6| 506| 17|
#| 7| 507| 34|
#| 8| 508| 50|
#| 9| 509| 67|
#| 10| 510| 84|
#| 11| 511| 100|
#| 0| null| null|
#| 1| null| null|
#| 2| null| null|
#| 3| null| null|
#| 4| null| null|
#+---+-------+------+
You may also consider ntile, which assigns every value to one of n "buckets".
When n=100 (again starting from the original test data; the table has fewer than 100 rows, so only the first buckets get values):
df = (
    df.filter(~F.isnull('balance'))
    .withColumn('ntile_100', F.ntile(100).over(W.orderBy('balance')))
    .unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+---------+
#| id|balance|ntile_100|
#+---+-------+---------+
#| 5| 505| 1|
#| 6| 506| 2|
#| 7| 507| 3|
#| 8| 508| 4|
#| 9| 509| 5|
#| 10| 510| 6|
#| 11| 511| 7|
#| 0| null| null|
#| 1| null| null|
#| 2| null| null|
#| 3| null| null|
#| 4| null| null|
#+---+-------+---------+
When n=4:
df = (
    df.filter(~F.isnull('balance'))
    .withColumn('ntile_4', F.ntile(4).over(W.orderBy('balance')))
    .unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+-------+
#| id|balance|ntile_4|
#+---+-------+-------+
#|  5|    505|      1|
#|  6|    506|      1|
#|  7|    507|      2|
#|  8|    508|      2|
#|  9|    509|      3|
#| 10|    510|      3|
#| 11|    511|      4|
#|  0|   null|   null|
#|  1|   null|   null|
#|  2|   null|   null|
#|  3|   null|   null|
#|  4|   null|   null|
#+---+-------+-------+
Try this.
We first check whether the df.BALANCE column is null. If it is null, we return None; otherwise the percent_rank() function gets applied.
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.orderBy(df.BALANCE)
test = df.withColumn('percentile_col',
                     F.when(df.BALANCE.isNull(), F.lit(None)).otherwise(F.percent_rank().over(w)))
I'm learning Apache Spark through PySpark and having issues.
I have the following DF:
+----------+------------+-----------+----------------+
| game_id|posteam_type|total_plays|total_touchdowns|
+----------+------------+-----------+----------------+
|2009092003| home| 90| 3|
|2010091912| home| 95| 0|
|2010112106| home| 75| 0|
|2010121213| home| 85| 3|
|2009092011| null| 9| null|
|2010110703| null| 2| null|
|2010112111| null| 6| null|
|2011100909| home| 102| 3|
|2011120800| home| 72| 2|
|2012010110| home| 74| 6|
|2012110410| home| 68| 1|
|2012120911| away| 91| 2|
|2011103008| null| 6| null|
|2012111100| null| 3| null|
|2013092212| home| 86| 6|
|2013112407| home| 73| 4|
|2013120106| home| 99| 3|
|2014090705| home| 94| 3|
|2014101203| home| 77| 4|
|2014102611| home| 107| 6|
+----------+------------+-----------+----------------+
I'm attempting to find the average number of plays it takes to score a TD, i.e. sum(total_plays)/sum(total_touchdowns).
I figured out the code to get the sums but can't figure out how to get the total average:
plays = nfl_game_play.groupBy().agg({'total_plays': 'sum'}).collect()
touchdowns = nfl_game_play.groupBy().agg({'total_touchdowns': 'sum'}).collect()
As you can see, I tried storing each sum in a variable, but I can't get much further than remembering what each value is and doing the division manually.
Try the code below:
Example:
df.show()
#+-----------+----------------+
#|total_plays|total_touchdowns|
#+-----------+----------------+
#| 90| 3|
#| 95| 0|
#| 9| null|
#+-----------+----------------+
from pyspark.sql.functions import *
total_avg=df.groupBy().agg(sum("total_plays")/sum("total_touchdowns")).collect()[0][0]
#64.66666666666667
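If you'd rather avoid the wildcard import and keep the result in a DataFrame, an equivalent sketch (using the nfl_game_play frame from the question; the alias name is arbitrary) could be:
from pyspark.sql import functions as F

result = nfl_game_play.agg(
    (F.sum("total_plays") / F.sum("total_touchdowns")).alias("avg_plays_per_td"))
result.show()
# or pull it out as a plain Python value:
total_avg = result.first()["avg_plays_per_td"]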
I have a dataframe similar to the one below. I originally filled all null values with -1 to do my joins in PySpark.
import pandas as pd

df = pd.DataFrame({'Number': ['1', '2', '-1', '-1'],
                   'Letter': ['A', '-1', 'B', 'A'],
                   'Value': [30, 30, 30, -1]})
pyspark_df = spark.createDataFrame(df)
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| -1| 30|
| -1| B| 30|
| -1| A| -1|
+------+------+-----+
After processing the dataset, I need to replace all -1 back to null values.
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| null| 30|
| null| B| 30|
| null| A| null|
+------+------+-----+
What's the easiest way to do this?
A less verbose way to do this is to use replace.
pyspark_df.replace(-1,None).replace('-1',None).show()
when + otherwise will do the trick:
import pyspark.sql.functions as F
pyspark_df.select([F.when(F.col(i).cast("Integer") <0 , None).otherwise(F.col(i)).alias(i)
for i in df.columns]).show()
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| null| 30|
| null| B| 30|
| null| A| null|
+------+------+-----+
You can scan all columns and replace -1's with None:
import pyspark.sql.functions as F
for x in pyspark_df.columns:
    pyspark_df = pyspark_df.withColumn(x, F.when(F.col(x) == -1, F.lit(None)).otherwise(F.col(x)))
pyspark_df.show()
Output:
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| null| 30|
| null| B| 30|
| null| A| null|
+------+------+-----+
Use reduce to apply when + otherwise to all columns of the dataframe.
df.show()
#+------+------+-----+
#|Number|Letter|Value|
#+------+------+-----+
#| 1| A| 30|
#| 2| -1| 30|
#| -1| B| 30|
#+------+------+-----+
from functools import reduce
from pyspark.sql.functions import col, lit, when

(reduce(lambda new_df, col_name: new_df.withColumn(col_name, when(col(col_name) == '-1', lit(None)).otherwise(col(col_name))), df.columns, df)).show()
#+------+------+-----+
#|Number|Letter|Value|
#+------+------+-----+
#| 1| A| 30|
#| 2| null| 30|
#| null| B| 30|
#+------+------+-----+
I have a data frame read with the sqlContext.sql function in PySpark.
It contains 4 numeric columns with information per client (ClientId is the key).
I need to calculate the max value per client and join this value to the data frame:
+--------+-------+-------+-------+-------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
+--------+-------+-------+-------+-------+
| 0| null| null| null| null|
| 1| null| null| null| null|
| 2| null| null| null| null|
| 3| null| null| null| null|
| 4| null| null| null| null|
| 5| null| null| null| null|
| 6| 23| 13| 17| 8|
| 7| null| null| null| null|
| 8| null| null| null| null|
| 9| null| null| null| null|
| 10| 34| 2| 4| 0|
| 11| 0| 0| 0| 0|
| 12| 0| 0| 0| 0|
| 13| 0| 0| 30| 0|
| 14| null| null| null| null|
| 15| null| null| null| null|
| 16| 37| 29| 29| 29|
| 17| 0| 0| 16| 0|
| 18| 0| 0| 0| 0|
| 19| null| null| null| null|
+--------+-------+-------+-------+-------+
In this case, the max value for client six is 23 and for client ten it is 34; rows that are entirely null naturally stay null in the new column.
Please help me by showing how I can do this operation.
There is a function for that: pyspark.sql.functions.greatest.
>>> df = spark.createDataFrame([(1, 4, 3)], ['a', 'b', 'c'])
>>> df.select(greatest(df.a, df.b, df.c).alias("greatest")).collect()
[Row(greatest=4)]
The example was taken directly from the docs.
(Least does the opposite.)
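Applied to the dataframe from the question, a minimal sketch could look like this (clients_df and max_value are illustrative names; greatest skips nulls and returns null only when every input is null, so the all-null rows stay null):
from pyspark.sql import functions as F

# new column holding the row-wise max of the four m_ant columns
clients_df = clients_df.withColumn(
    "max_value",
    F.greatest("m_ant21", "m_ant22", "m_ant23", "m_ant24"))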
I think combining the values into a list and then finding the max of it would be the simplest approach.
from pyspark.sql.types import *
schema = StructType([
StructField("ClientId", IntegerType(), True),
StructField("m_ant21", IntegerType(), True),
StructField("m_ant22", IntegerType(), True),
StructField("m_ant23", IntegerType(), True),
StructField("m_ant24", IntegerType(), True)
])
df = spark\
.createDataFrame(
data=[(0, None, None, None, None),
(1, 23, 13, 17, 99),
(2, 0, 0, 0, 1),
(3, 0, None, 1, 0)],
schema=schema)
import pyspark.sql.functions as F
def agg_to_list(m21, m22, m23, m24):
    return [m21, m22, m23, m24]
u_agg_to_list = F.udf(agg_to_list, ArrayType(IntegerType()))
df2 = df.withColumn('all_values', u_agg_to_list('m_ant21', 'm_ant22', 'm_ant23', 'm_ant24'))\
.withColumn('max', F.sort_array("all_values", False)[0])\
.select('ClientId', 'max')
df2.show()
Output:
+--------+----+
|ClientId|max |
+--------+----+
|0 |null|
|1 |99 |
|2 |1 |
|3 |1 |
+--------+----+
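As a side note, on Spark 2.4+ a similar result should be achievable without a UDF by building an array and taking array_max (a sketch reusing df and the schema above):
from pyspark.sql import functions as F

# array_max skips null elements and returns null when all of them are null
df2 = df.select(
    'ClientId',
    F.array_max(F.array('m_ant21', 'm_ant22', 'm_ant23', 'm_ant24')).alias('max'))
df2.show()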
Hi, I created a data frame like below.
df = sc.parallelize([
(1, 3),
(2, 3),
(3, 2),
(4,2),
(1, 3)
]).toDF(["id",'t'])
It displays as below:
+---+---+
| id| t|
+---+---+
| 1| 3|
| 2| 3|
| 3| 2|
| 4| 2|
| 1| 3|
+---+---+
My main aim is to replace every repeated value in each column with the number of times it is repeated.
So I tried the following code, but it is not working as expected.
from pyspark.sql import Window
from pyspark.sql.functions import col, count

column_list = ["id", 't']
w = Window.partitionBy(column_list)
dfmax = df.select(*((count(col(c)).over(w)).alias(c) for c in df.columns))
dfmax.show()
+---+---+
| id| t|
+---+---+
| 2| 2|
| 2| 2|
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
My expected output would be:
+---+---+
| id| t|
+---+---+
| 2| 3|
| 1| 3|
| 1| 1|
| 1| 1|
| 2| 3|
+---+---+
If I understand you correctly, what you're looking for is simply:
df.select(*[count(c).over(Window.partitionBy(c)).alias(c) for c in df.columns]).show()
#+---+---+
#| id| t|
#+---+---+
#| 2| 3|
#| 2| 3|
#| 1| 2|
#| 1| 3|
#| 1| 2|
#+---+---+
The difference between this and what you posted is that we only partition by one column at a time.
Remember that DataFrames are unordered. If you wanted to maintain your row order, you could add an ordering column using pyspark.sql.functions.monotonically_increasing_id():
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("order", monotonically_increasing_id())\
.select(*[count(c).over(Window.partitionBy(c)).alias(c) for c in df.columns])\
.sort("order")\
.drop("order")\
.show()
#+---+---+
#| id| t|
#+---+---+
#| 2| 3|
#| 1| 3|
#| 1| 2|
#| 1| 2|
#| 2| 3|
#+---+---+
I'm trying to force Spark to only apply a window function to a specified subset of a dataframe, while the actual window has access to rows outside of this subset. Let me go through an example:
I have a Spark dataframe that has been saved to HDFS. The dataframe contains events, so each row has a timestamp, an id and an integer feature. There is also a column that I calculate, which is a cumulative sum window function built like this:
df = spark.table("some_table_in_hdfs")
w = Window.partitionBy("id").orderBy("date")
df = df.withColumn("feat_int_sum", F.sum("feat_int").over(w))
df.show()
+----------+--------+---+------------+
| date|feat_int| id|feat_int_sum|
+----------+--------+---+------------+
|2018-08-10| 5| 0| 5|
|2018-08-12| 27| 0| 32|
|2018-08-14| 3| 0| 35|
|2018-08-11| 32| 1| 32|
|2018-08-12| 552| 1| 584|
|2018-08-16| 2| 1| 586|
+----------+--------+---+------------+
When I load new data from a different source, I would like to append it to the above dataframe in HDFS. In order to do this, I have to apply the window function to the new data as well, and I would like to union the two dataframes so that the window function has access to the "old" feat_int values to do the sum over.
df_new = spark.table("some_new_table")
df_new.show()
+----------+--------+---+
| date|feat_int| id|
+----------+--------+---+
|2018-08-20| 65| 0|
|2018-08-23| 3| 0|
|2018-08-24| 4| 0|
|2018-08-21| 69| 1|
|2018-08-25| 37| 1|
|2018-08-26| 3| 1|
+----------+--------+---+
df_union = df.union(df_new.withColumn("feat_int_sum", F.lit(None)))
df_union.show()
+----------+--------+---+------------+
| date|feat_int| id|feat_int_sum|
+----------+--------+---+------------+
|2018-08-10| 5| 0| 5|
|2018-08-12| 27| 0| 32|
|2018-08-14| 3| 0| 35|
|2018-08-20| 65| 0| null|
|2018-08-23| 3| 0| null|
|2018-08-24| 4| 0| null|
|2018-08-11| 32| 1| 32|
|2018-08-12| 552| 1| 584|
|2018-08-16| 2| 1| 586|
|2018-08-21| 69| 1| null|
|2018-08-25| 37| 1| null|
|2018-08-26| 3| 1| null|
+----------+--------+---+------------+
The problem is that I would like to apply the sum window function on df_union, but only on the rows with null in feat_int_sum. The reason is that I don't want to recalculate the window function for all the values that are already calculated in df. So the desired result would be something like this:
+----------+--------+---+------------+-----------------+
| date|feat_int| id|feat_int_sum|feat_int_sum_temp|
+----------+--------+---+------------+-----------------+
|2018-08-10| 5| 0| 5| null|
|2018-08-12| 27| 0| 32| null|
|2018-08-14| 3| 0| 35| null|
|2018-08-20| 65| 0| null| 100|
|2018-08-23| 3| 0| null| 103|
|2018-08-24| 4| 0| null| 107|
|2018-08-11| 32| 1| 32| null|
|2018-08-12| 552| 1| 584| null|
|2018-08-16| 2| 1| 586| null|
|2018-08-21| 69| 1| null| 655|
|2018-08-25| 37| 1| null| 692|
|2018-08-26| 3| 1| null| 695|
+----------+--------+---+------------+-----------------+
I tried wrapping the window function in a when statement like this:
df_union.withColumn("feat_int_sum_temp", F.when(F.col('date') > '2018-08-16', F.sum('feat_int').over(w))
But looking at the Spark explain plan, it seems that it will run the window function for all rows, and afterwards apply the when condition.
The whole reason I don't want to run the window function on the old rows is that I'm dealing with some really big tables, and I don't want to waste computational resources recalculating values that will not be used.
After this step I would coalesce the feat_int_sum and the feat_int_sum_temp columns, and append only the new part of the data to HDFS.
I would appreciate any hints on how to force spark to only apply the window function on the specified subset, while the actual window has access to rows outside of this subset.
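For what it's worth, one possible sketch of the idea (not a definitive answer, and it assumes feat_int is non-negative as in the example, so the last running sum per id equals the per-id max of feat_int_sum): carry that last sum forward as an offset and run the window only over the new rows.
from pyspark.sql import functions as F, Window as W

# last cumulative sum per id from the old data
# (assumes feat_int >= 0; otherwise take the row with the latest date instead)
offsets = df.groupBy("id").agg(F.max("feat_int_sum").alias("offset"))

w_new = W.partitionBy("id").orderBy("date")
df_new_summed = (
    df_new.join(offsets, on="id", how="left")
          .withColumn("feat_int_sum",
                      F.coalesce(F.col("offset"), F.lit(0)) + F.sum("feat_int").over(w_new))
          .drop("offset"))
# df_new_summed holds only the new rows with their continued running sums,
# ready to be appended to the existing table in HDFS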