I have the following Spark DataFrame:
+--------+--------------+
|agent_id|payment_amount|
+--------+--------------+
|       a|          1000|
|       b|          1100|
|       a|          1100|
|       a|          1200|
|       b|          1200|
|       b|          1250|
|       a|         10000|
|       b|          9000|
+--------+--------------+
My desired output would be something like:
agent_id   95_quantile
a          whatever the 0.95 quantile is for agent a's payments
b          whatever the 0.95 quantile is for agent b's payments
For each group of agent_id I need to calculate the 0.95 quantile, so I tried the following approach:
test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)
but I get the following error:
'GroupedData' object has no attribute 'approxQuantile'
I need to have the 0.95 quantile (percentile) in a new column so it can later be used for filtering purposes.
I am using Spark 2.0.0
One solution would be to use percentile_approx:
>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")
>>> df2.show()
# +--------+-----------------+
# |agent_id|   approxQuantile|
# +--------+-----------------+
# |       a|8239.999999999998|
# |       b|7449.999999999998|
# +--------+-----------------+
Note 1: This solution was tested with Spark 1.6.2 and requires a HiveContext.
Note 2: approxQuantile isn't available in PySpark for Spark < 2.0.
Note 3: percentile_approx returns an approximate pth percentile of a numeric column (including floating point types) in the group. When the number of distinct values in the column is smaller than the optional accuracy argument, this gives an exact percentile value.
EDIT: As of Spark 2+, a HiveContext is no longer required.
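Since the question asks for the quantile as a new column to filter on, here is a minimal sketch of one way to do that with the DataFrame API, assuming Spark 2.x where percentile_approx is usable without a HiveContext (as noted in the EDIT above); the quantile_95 column name and the filter condition are illustrative assumptions, not part of the original answer:

from pyspark.sql import functions as F

# Approximate 0.95 quantile per agent_id (quantile_95 is an illustrative column name)
quantiles = test_df.groupBy("agent_id").agg(
    F.expr("percentile_approx(payment_amount, 0.95)").alias("quantile_95")
)

# Join the per-group quantile back so every row carries its group's value,
# then filter, e.g. keep only payments at or below that quantile
result = (test_df
          .join(quantiles, on="agent_id", how="left")
          .filter(F.col("payment_amount") <= F.col("quantile_95")))
result.show()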
Related
Can I get some help on how to write this logic in PySpark?
Suppose I have the table shown in the attached image.
So given date, userid, visit, and grouping (old) as inputs, I want to create a new column called grouping (new) such that it does the following:
For any given userid:
First check the grouping (old) value. If it is != Bad, then grouping (new) = grouping (old).
If grouping (old) = Bad, then apply the grouping (old) from the most recent date's most recent visit such that it is != Bad.
However, if the most recent non-Bad grouping (old) from a prior date is more than 30 days away, then set grouping (new) = Bad (as the data is out of date).
What I've attempted, which didn't work as expected:
days = lambda i: i * 86400

user_30d_tracker = Window.partitionBy("userid")\
    .orderBy(f.col("date").cast("timestamp").cast("long"))\
    .rangeBetween(-days(30), 0)\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
df = (df.withColumn("Grouping(old)_YN",
f.when(f.col("Grouping(old)")==f.lit("Bad"), "No")
.otherwise(f.lit("Yes"))))
df = df.withColumn("Grouping_new",
f.max(f.when(f.col("Grouping(old)_YN") == f.lit("Yes"),
f.col("Grouping(old)"))).over(user_30d_tracker))
Suppose this is what your dataframe looks like
import pyspark.sql.functions as f
from pyspark.sql.window import Window
data = [[123, "20200101", 1, "Good_sub1"],
[123, "20200101", 2, "Bad"],
[123, "20200115", 1, "Bad"],
[123, "20200115", 2, "Bad"],
[123, "20200116", 1, "Good_sub2"],
[123, "20200116", 2, "Bad"],
[123, "20200116", 3, "Good_sub3"],
[123, "20220901", 1, "Bad"]]
df = spark.createDataFrame(data,
"userid:int, date:string, visit:int, `grouping(old)`:string")
df.show()
# +------+--------+-----+-------------+
# |userid|    date|visit|grouping(old)|
# +------+--------+-----+-------------+
# |   123|20200101|    1|    Good_sub1|
# |   123|20200101|    2|          Bad|
# |   123|20200115|    1|          Bad|
# |   123|20200115|    2|          Bad|
# |   123|20200116|    1|    Good_sub2|
# |   123|20200116|    2|          Bad|
# |   123|20200116|    3|    Good_sub3|
# |   123|20220901|    1|          Bad|
# +------+--------+-----+-------------+
days = lambda i: i * 86400
user_30d_tracker = Window.partitionBy("userid")\
.orderBy(f.col("date").cast("timestamp").cast("long"))\
.rangeBetween(-days(30), 0)\
.rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
Let's take a look at the user_30d_tracker window. This window needs some changes, considering the points below.
The window ordering needs the visit column in addition to the unix timestamp of date. With only the timestamp of date in the orderBy clause, Spark will not guarantee that rows with visit = 1 come before rows with visit = 2, for example. So we somehow need to include the visit column in the orderBy clause.
The additional rowsBetween frame will overwrite the rangeBetween frame, so the result will be different from what was expected.
One option is to use only a rangeBetween frame for the window. But since a rangeBetween frame allows only one expression in the orderBy clause, we can use a workaround: add the visit value to the unix timestamp of date (this is like treating visit as the number of seconds that have passed since the start of date).
user_30d_tracker = Window\
.partitionBy("userid")\
.orderBy(f.unix_timestamp("date", "yyyyMMdd") + f.col("visit"))\
.rangeBetween(-days(30), 0)
Then, to get the most recent non-'Bad' grouping(old) value, it is better to use the last function (with ignorenulls=True) instead of max, since it takes the most recent value in the window rather than the maximum of the sorted strings. After that, use coalesce to fill the null values in the new column.
df = (df
.withColumn("Grouping(old)_YN",
f.when(f.col("Grouping(old)") == f.lit("Bad"), "No")
.otherwise(f.lit("Yes")))
.withColumn("Grouping_new",
f.last(f.when(f.col("Grouping(old)_YN") == f.lit("Yes"),
f.col("Grouping(old)")), ignorenulls=True).over(user_30d_tracker))
.withColumn("Grouping_new", f.coalesce(f.col("Grouping_new"), f.col("Grouping(old)")))
)
df.show()
# +------+--------+-----+-------------+----------------+------------+
# |userid|    date|visit|grouping(old)|Grouping(old)_YN|Grouping_new|
# +------+--------+-----+-------------+----------------+------------+
# |   123|20200101|    1|    Good_sub1|             Yes|   Good_sub1|
# |   123|20200101|    2|          Bad|              No|   Good_sub1|
# |   123|20200115|    1|          Bad|              No|   Good_sub1|
# |   123|20200115|    2|          Bad|              No|   Good_sub1|
# |   123|20200116|    1|    Good_sub2|             Yes|   Good_sub2|
# |   123|20200116|    2|          Bad|              No|   Good_sub2|
# |   123|20200116|    3|    Good_sub3|             Yes|   Good_sub3|
# |   123|20220901|    1|          Bad|              No|         Bad|
# +------+--------+-----+-------------+----------------+------------+
I have a pyspark DF of ids and purchases which I'm trying to transform for use with FP growth.
Currently I have multiple rows for a given id, with each row relating to only a single purchase.
I'd like to transform this dataframe to a form where there are two columns, one for id (with a single row per id ) and the second column containing a list of distinct purchases for that id.
I've tried to use a User Defined Function (UDF) to map the distinct purchases onto the distinct ids, but I get a "py4j.Py4JException: Method __getstate__([]) does not exist". Thanks to #Mithril,
I see that "You can't use sparkSession object, spark.DataFrame object or other Spark distributed objects in udf and pandas_udf, because they are unpickled."
So I've implemented the TERRIBLE approach below (which will work but is not scalable):
#Lets create some fake transactions
customers = [1,2,3,1,1]
purschases = ['cake','tea','beer','fruit','cake']
# Lets create a spark DF to capture the transactions
transactions = zip(customers,purschases)
spk_df_1 = spark.createDataFrame(list(transactions) , ["id", "item"])
# Lets have a look at the resulting spark dataframe
spk_df_1.show()
# Lets capture the ids and list of their distinct pruschases in a
# list of tuples
purschases_lst = []
nums1 = []
import pyspark.sql.functions as f
import pandas as pd

# Register the transactions so they can be queried with SQL
spk_df_1.createOrReplaceTempView("TBLdf")

# for each distinct id let's get the list of their distinct purchases
for id in spark.sql("SELECT distinct(id) FROM TBLdf").rdd.map(lambda row: row[0]).collect():
    purschase = spk_df_1.filter(f.col("id") == id).select("item").distinct().rdd.map(lambda row: row[0]).collect()
    nums1.append((id, purschase))

# Let's see what our list of transaction tuples looks like
print(nums1)
print("\n")

# Let's turn the list of transaction tuples into a pandas dataframe
df_pd = pd.DataFrame(nums1)

# Finally let's turn our pandas dataframe into a pyspark DataFrame
df2 = spark.createDataFrame(df_pd)
df2.show()
Output:
+---+-----+
| id| item|
+---+-----+
|  1| cake|
|  2|  tea|
|  3| beer|
|  1|fruit|
|  1| cake|
+---+-----+

[(1, ['fruit', 'cake']), (3, ['beer']), (2, ['tea'])]

+---+-------------+
|  0|            1|
+---+-------------+
|  1|[fruit, cake]|
|  3|       [beer]|
|  2|        [tea]|
+---+-------------+
If anybody has any suggestions I'd greatly appreciate it.
That is a task for collect_set, which creates a set of items without duplicates:
import pyspark.sql.functions as F
#Lets create some fake transactions
customers = [1,2,3,1,1]
purschases = ['cake','tea','beer','fruit','cake']
# Lets create a spark DF to capture the transactions
transactions = zip(customers,purschases)
spk_df_1 = spark.createDataFrame(list(transactions) , ["id", "item"])
spk_df_1.show()
spk_df_1.groupby('id').agg(F.collect_set('item')).show()
Output:
+---+-----+
| id| item|
+---+-----+
|  1| cake|
|  2|  tea|
|  3| beer|
|  1|fruit|
|  1| cake|
+---+-----+

+---+-----------------+
| id|collect_set(item)|
+---+-----------------+
|  1|    [fruit, cake]|
|  3|           [beer]|
|  2|            [tea]|
+---+-----------------+
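Since the end goal was FP-growth, a minimal follow-up sketch (assuming Spark 2.2+ where pyspark.ml.fpm.FPGrowth is available; the items alias and the minSupport/minConfidence thresholds are illustrative assumptions, not part of the original answer) could feed the aggregated sets straight into the model:

import pyspark.sql.functions as F
from pyspark.ml.fpm import FPGrowth

# Alias the aggregated column so it can be referenced as the items column
basket_df = spk_df_1.groupby("id").agg(F.collect_set("item").alias("items"))

# Fit FP-growth on the per-id item sets; the thresholds here are only examples
fp = FPGrowth(itemsCol="items", minSupport=0.2, minConfidence=0.5)
model = fp.fit(basket_df)

model.freqItemsets.show()
model.associationRules.show()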
I'm working with Spark 2.2.0.
I have a DataFrame holding more than 20 columns. In the example below, PERIOD is a week number and TYPE a type of store (hypermarket or supermarket).
table.show(10)
+--------------------+-------------------+-----------------+
|              PERIOD|               TYPE| etc......
+--------------------+-------------------+-----------------+
|                  W1|                 HM|
|                  W2|                 SM|
|                  W3|                 HM|
etc...
I want to do a simple groupBy (here with PySpark, but Scala or Spark SQL gives the same results):
total_stores = table.groupby("PERIOD", "TYPE").agg(countDistinct("STORE_DESC"))
total_stores2 = total_stores.withColumnRenamed("count(DISTINCT STORE_DESC)", "NB STORES (TOTAL)")
total_stores2.show(10)
+--------------------+-------------------+-----------------+
|              PERIOD|               TYPE|NB STORES (TOTAL)|
+--------------------+-------------------+-----------------+
|CMA BORGO -SANTA ...|              BORGO|                1|
|        C ATHIS MONS|   ATHIS MONS CEDEX|                1|
|    CMA BOSC LE HARD|       BOSC LE HARD|                1|
The problem is not in the calculation: the columns got mixed up. PERIOD contains store names, TYPE contains cities, and so on.
I have no clue why; everything else works fine.
I am trying to get the max alphabet value from a dataframe as a whole. I am not interested in which row or column it came from; I am just interested in a single max value within the dataframe.
This is what it looks like:
id conditionName
1 C
2 b
3 A
4 A
5 A
expected result is:
+---+-------------+
| id|conditionName|
+---+-------------+
|  3|            A|
|  4|            A|
|  5|            A|
+---+-------------+
because 'A' is the first letter of the alphabet
df= df.withColumn("conditionName", col("conditionName").cast("String"))
.groupBy("id,conditionName").max("conditionName");
df.show(false);
Exception: "conditionName" is not a numeric column. Aggregation function can only be applied on a numeric column.;
I need the max alphabet character from the entire dataframe.
What should I use to get the desired results?
Thanks in advance!
You can sort your DataFrame by your string column, grab the first value and use it to filter your original data:
from pyspark.sql.functions import lower, desc, first

# we need lower() because ordering strings is case sensitive
first_letter = df.orderBy(lower(df["condition"])) \
    .groupBy() \
    .agg(first("condition").alias("condition")) \
    .collect()[0][0]

df.filter(df["condition"] == first_letter).show()
#+---+---------+
#| id|condition|
#+---+---------+
#|  3|        A|
#|  4|        A|
#|  5|        A|
#+---+---------+
Or more elegantly using Spark SQL:
df.registerTempTable("table")
sqlContext.sql("""
    SELECT *
    FROM table
    WHERE lower(condition) = (SELECT min(lower(condition))
                              FROM table)
""")
The VectorIndexer in Spark indexes categorical features according to the frequency of the values. But I want to index the categorical features in a different way.
For example, with the dataset below, "a", "b", "c" will be indexed as 0, 1, 2 if I use the VectorIndexer in Spark. But I want to index them according to the label.
There are 4 rows labeled 1, and among them 3 rows have feature 'a' and 1 row has feature 'c'. So here I would index 'a' as 0, 'c' as 1 and 'b' as 2.
Is there any convenient way to implement this?
label|feature
-----------------
1 | a
1 | c
0 | a
0 | b
1 | a
0 | b
0 | b
0 | c
1 | a
If I understand your question correctly, you are looking to replicate the behaviour of StringIndexer() on grouped data. To do so (in PySpark), we first define a udf that will operate on a list column containing all the values per group. Note that elements with equal counts will be ordered arbitrarily.
from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def encoder(col):
    # Generate count per letter
    x = Counter(col)
    # Create a dictionary, mapping each letter to its rank
    ranking = {pair[0]: rank
               for rank, pair in enumerate(x.most_common())}
    # Use dictionary to replace letters by rank
    new_list = [ranking[i] for i in col]
    return new_list

encoder_udf = udf(encoder, ArrayType(IntegerType()))
Now we can aggregate the feature column into a list grouped by the column label using collect_list(), and apply our udf row-wise:
from pyspark.sql.functions import collect_list, explode

df1 = (df.groupBy("label")
         .agg(collect_list("feature").alias("features"))
         .withColumn("index", encoder_udf("features")))
Consequently, you can explode the index column to get the encoded values instead of the letters:
df1.select("label", explode(df1.index).alias("index")).show()
+-----+-----+
|label|index|
+-----+-----+
|    0|    1|
|    0|    0|
|    0|    0|
|    0|    0|
|    0|    2|
|    1|    0|
|    1|    1|
|    1|    0|
|    1|    0|
+-----+-----+
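If you also want to see which letter maps to which index (rather than only the exploded ranks), one possible follow-up, a minimal sketch that assumes Spark 2.4+ for arrays_zip and is not part of the original answer, is to zip the two array columns before exploding:

from pyspark.sql.functions import arrays_zip, col, explode

# Pair each original letter with its encoded rank, explode the pairs, and keep distinct mappings
mapping = (df1
           .select("label", explode(arrays_zip("features", "index")).alias("pair"))
           .select("label",
                   col("pair.features").alias("feature"),
                   col("pair.index").alias("index"))
           .distinct())

mapping.show()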