Calculate time difference between consecutive rows in pairs per group in pyspark - apache-spark

I want to calculate the time spent per SeqID for each user. I have a dataframe like this.
However, the time is split between two actions for every user, Action_A and Action_B.
The total time per user, per SeqID, is the sum across all such pairs.
For the first user it is 5 + 3 minutes [(2019-12-10 10:05:00 - 2019-12-10 10:00:00) + (2019-12-10 10:23:00 - 2019-12-10 10:20:00)].
So the first user has effectively spent 8 mins for SeqID 15 (and not 23 mins).
Similarly, user 2 has spent 1 + 5 = 6 mins.
How can I calculate this using pyspark?
data = [(("ID1", 15, "2019-12-10 10:00:00", "Action_A")),
(("ID1", 15, "2019-12-10 10:05:00", "Action_B")),
(("ID1", 15, "2019-12-10 10:20:00", "Action_A")),
(("ID1", 15, "2019-12-10 10:23:00", "Action_B")),
(("ID2", 23, "2019-12-10 11:10:00", "Action_A")),
(("ID2", 23, "2019-12-10 11:11:00", "Action_B")),
(("ID2", 23, "2019-12-10 11:30:00", "Action_A")),
(("ID2", 23, "2019-12-10 11:35:00", "Action_B"))]
df = spark.createDataFrame(data, ["ID", "SeqID", "Timestamp", "Action"])
df.show()
+---+-----+-------------------+--------+
| ID|SeqID| Timestamp| Action|
+---+-----+-------------------+--------+
|ID1| 15|2019-12-10 10:00:00|Action_A|
|ID1| 15|2019-12-10 10:05:00|Action_B|
|ID1| 15|2019-12-10 10:20:00|Action_A|
|ID1| 15|2019-12-10 10:23:00|Action_B|
|ID2| 23|2019-12-10 11:10:00|Action_A|
|ID2| 23|2019-12-10 11:11:00|Action_B|
|ID2| 23|2019-12-10 11:30:00|Action_A|
|ID2| 23|2019-12-10 11:35:00|Action_B|
+---+-----+-------------------+--------+
Once I have the duration for each pair, I can sum it across the group (ID, SeqID).
Expected output (could also be in seconds):
+---+-----+--------+
| ID|SeqID|Dur_Mins|
+---+-----+--------+
|ID1| 15| 8|
|ID2| 23| 6|
+---+-----+--------+

Here is a possible solution using Higher-Order Functions (Spark >=2.4):
transform_expr = "transform(ts_array, (x,i) -> (unix_timestamp(ts_array[i+1]) - unix_timestamp(x))/60 * ((i+1)%2))"
df.groupBy("ID", "SeqID").agg(array_sort(collect_list(col("Timestamp"))).alias("ts_array")) \
.withColumn("transformed_ts_array", expr(transform_expr)) \
.withColumn("Dur_Mins", expr("aggregate(transformed_ts_array, 0D, (acc, x) -> acc + coalesce(x, 0D))")) \
.drop("transformed_ts_array", "ts_array") \
.show(truncate=False)
Steps:
Collect all timestamps into an array for each group (ID, SeqID) and sort them in ascending order.
Apply transform to the array with a lambda function (x, i) => Double, where x is the element and i its index. For each timestamp in the array we calculate the diff with the next timestamp, and multiply it by (i+1) % 2 so that only the diffs of the pairs (first with second, third with fourth, ...) are kept, since there are always 2 actions per pair.
Finally, aggregate the resulting array to sum all its elements.
Output:
+---+-----+--------+
|ID |SeqID|Dur_Mins|
+---+-----+--------+
|ID1|15 |8.0 |
|ID2|23 |6.0 |
+---+-----+--------+
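If you want to inspect the per-pair differences before they are summed, a small variant of the same pipeline (reusing transform_expr and the imports above) keeps the intermediate column:
# Optional sanity check: show the per-pair differences in minutes before summing.
df.groupBy("ID", "SeqID") \
  .agg(array_sort(collect_list(col("Timestamp"))).alias("ts_array")) \
  .withColumn("pair_diffs_mins", expr(transform_expr)) \
  .show(truncate=False)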

A possible (though perhaps overcomplicated) way to do it with flatMapValues and the RDD API, using your data variable:
from pyspark.sql import functions as func

df = spark.createDataFrame(data, ["id", "seq_id", "ts", "action"]). \
    withColumn('ts', func.col('ts').cast('timestamp'))
# func to calculate the duration | applied on each group's sorted rows
def getDur(groupedrows):
    """Return each row plus a duration column: 0 for Action_A,
    seconds elapsed since the preceding Action_A for Action_B."""
    res = []
    for row in groupedrows:
        if row.action == 'Action_A':
            frst_ts = row.ts
            dur = 0
        elif row.action == 'Action_B':
            dur = (row.ts - frst_ts).total_seconds()
        res.append([val for val in row] + [float(dur)])
    return res
# run the rules on the base df | row by row
# grouped on (id, seq_id) - sorted on timestamp
dur_rdd = df.rdd. \
    groupBy(lambda k: (k.id, k.seq_id)). \
    flatMapValues(lambda r: getDur(sorted(r, key=lambda ok: ok.ts))). \
    values()
# specify final schema
dur_schema = df.schema. \
    add('dur', 'float')
# convert to DataFrame
dur_sdf = spark.createDataFrame(dur_rdd, dur_schema)
dur_sdf.orderBy('id', 'seq_id', 'ts').show()
+---+------+-------------------+--------+-----+
| id|seq_id| ts| action| dur|
+---+------+-------------------+--------+-----+
|ID1| 15|2019-12-10 10:00:00|Action_A| 0.0|
|ID1| 15|2019-12-10 10:05:00|Action_B|300.0|
|ID1| 15|2019-12-10 10:20:00|Action_A| 0.0|
|ID1| 15|2019-12-10 10:23:00|Action_B|180.0|
|ID2| 23|2019-12-10 11:10:00|Action_A| 0.0|
|ID2| 23|2019-12-10 11:11:00|Action_B| 60.0|
|ID2| 23|2019-12-10 11:30:00|Action_A| 0.0|
|ID2| 23|2019-12-10 11:35:00|Action_B|300.0|
+---+------+-------------------+--------+-----+
# Your required data
dur_sdf.groupBy('id', 'seq_id'). \
    agg((func.sum('dur') / func.lit(60)).alias('dur_mins')). \
    show()
+---+------+--------+
| id|seq_id|dur_mins|
+---+------+--------+
|ID1| 15| 8.0|
|ID2| 23| 6.0|
+---+------+--------+
This fits the data you've described, but check whether it fits all your cases.
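One case to watch: if a group's rows start with an Action_B (no preceding Action_A), frst_ts is undefined and getDur raises a NameError. A purely hypothetical guard, not part of the answer above, could look like this:
# Hypothetical defensive variant: unpaired Action_B rows get a duration of 0.
def getDur_safe(groupedrows):
    res, frst_ts = [], None
    for row in groupedrows:
        dur = 0
        if row.action == 'Action_A':
            frst_ts = row.ts
        elif row.action == 'Action_B' and frst_ts is not None:
            dur = (row.ts - frst_ts).total_seconds()
            frst_ts = None  # consume the pair
        res.append([val for val in row] + [float(dur)])
    return res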

Another possible solution using Window Functions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[3]").appName("TestApp").enableHiveSupport().getOrCreate()
data = [("ID1", 15, "2019-12-10 10:00:00", "Action_A"),
        ("ID1", 15, "2019-12-10 10:05:00", "Action_B"),
        ("ID1", 15, "2019-12-10 10:20:00", "Action_A"),
        ("ID1", 15, "2019-12-10 10:23:00", "Action_B"),
        ("ID2", 23, "2019-12-10 11:10:00", "Action_A"),
        ("ID2", 23, "2019-12-10 11:11:00", "Action_B"),
        ("ID2", 23, "2019-12-10 11:30:00", "Action_A"),
        ("ID2", 23, "2019-12-10 11:35:00", "Action_B")]
df = spark.createDataFrame(data, ["ID", "SeqID", "Timestamp", "Action"])
df.createOrReplaceTempView("tmpTbl")
df = spark.sql("select *, lead(Timestamp,1) over (partition by ID,SeqID order by Timestamp) Next_Timestamp from tmpTbl")
updated_df = df.filter("Action != 'Action_B'")
final_df = updated_df.withColumn("diff", (F.unix_timestamp('Next_Timestamp') - F.unix_timestamp('Timestamp'))/F.lit(60))
final_df.groupBy("ID","SeqID").agg(F.sum(F.col("diff")).alias("Duration")).show()
Output: Duration is 8.0 for (ID1, 15) and 6.0 for (ID2, 23), matching the expected result above.
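For reference, the same logic can be written with the DataFrame API instead of SQL. This is a sketch, assuming the additional Window import shown below and starting from the original df created above:
from pyspark.sql.window import Window

w = Window.partitionBy("ID", "SeqID").orderBy("Timestamp")
df.withColumn("Next_Timestamp", F.lead("Timestamp", 1).over(w)) \
  .filter("Action != 'Action_B'") \
  .withColumn("diff", (F.unix_timestamp("Next_Timestamp") - F.unix_timestamp("Timestamp")) / F.lit(60)) \
  .groupBy("ID", "SeqID") \
  .agg(F.sum(F.col("diff")).alias("Duration")) \
  .show()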

Related

Choose from multinomial distribution

I have a series of values and a probability with which I want each of those values sampled. Is there a PySpark method to sample from that distribution for each row? I know how to hard-code it with a random number generator, but I want this method to be flexible for any number of assignment values and probabilities:
assignment_values = ["foo", "buzz", "boo"]
value_probabilities = [0.3, 0.3, 0.4]
Hard-coded method with random number generator:
from pyspark.sql import Row
from pyspark.sql import functions as F

data = [
    {"person": 1, "company": "5g"},
    {"person": 2, "company": "9s"},
    {"person": 3, "company": "1m"},
    {"person": 4, "company": "3l"},
    {"person": 5, "company": "2k"},
    {"person": 6, "company": "7c"},
    {"person": 7, "company": "3m"},
    {"person": 8, "company": "2p"},
    {"person": 9, "company": "4s"},
    {"person": 10, "company": "8y"},
]
df = spark.createDataFrame(Row(**x) for x in data)
(
    df
    .withColumn("rand", F.rand())
    .withColumn(
        "assignment",
        F.when(F.col("rand") < F.lit(0.3), "foo")
        .when(F.col("rand") < F.lit(0.6), "buzz")
        .otherwise("boo")
    )
    .show()
)
+-------+------+-------------------+----------+
|company|person| rand|assignment|
+-------+------+-------------------+----------+
| 5g| 1| 0.8020603266148111| boo|
| 9s| 2| 0.1297179045352752| foo|
| 1m| 3|0.05170251723736685| foo|
| 3l| 4|0.07978240998283603| foo|
| 2k| 5| 0.5931269297050258| buzz|
| 7c| 6|0.44673560271164037| buzz|
| 3m| 7| 0.1398027427612647| foo|
| 2p| 8| 0.8281404801171598| boo|
| 4s| 9|0.15568513681001817| foo|
| 8y| 10| 0.6173220502731542| boo|
+-------+------+-------------------+----------+
I think randomSplit may serve you. It randomly makes several dataframes out of your one and puts them all into a list.
df.randomSplit([0.3, 0.3, 0.4])
You can also provide a seed to it.
You can join the dfs back together using reduce:
from pyspark.sql import functions as F
from functools import reduce
df = spark.createDataFrame(
[(1, "5g"),
(2, "9s"),
(3, "1m"),
(4, "3l"),
(5, "2k"),
(6, "7c"),
(7, "3m"),
(8, "2p"),
(9, "4s"),
(10, "8y")],
['person', 'company'])
assignment_values = ["foo", "buzz", "boo"]
value_probabilities = [0.3, 0.3, 0.4]
dfs = df.randomSplit(value_probabilities, 5)
dfs = [df.withColumn('assignment', F.lit(assignment_values[i])) for i, df in enumerate(dfs)]
df = reduce(lambda a, b: a.union(b), dfs)
df.show()
# +------+-------+----------+
# |person|company|assignment|
# +------+-------+----------+
# | 1| 5g| foo|
# | 2| 9s| foo|
# | 6| 7c| foo|
# | 4| 3l| buzz|
# | 5| 2k| buzz|
# | 8| 2p| buzz|
# | 3| 1m| boo|
# | 7| 3m| boo|
# | 9| 4s| boo|
# | 10| 8y| boo|
# +------+-------+----------+
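Note that randomSplit assigns rows by splitting the dataframe into approximate proportions rather than drawing a random number per row. If you want the per-row behaviour of the hard-coded when() chain in the question, but built dynamically from the two lists, a rough sketch of that generalisation (not part of the answer above; df is the original (person, company) dataframe) could be:
from itertools import accumulate
from pyspark.sql import functions as F

assignment_values = ["foo", "buzz", "boo"]
value_probabilities = [0.3, 0.3, 0.4]

# Cumulative upper bounds for all but the last bucket: [0.3, 0.6]
cuts = list(accumulate(value_probabilities))[:-1]

# Build the same when() chain as in the question, but from the lists;
# the last value becomes the otherwise() branch.
assignment = F.when(F.col("rand") < F.lit(cuts[0]), assignment_values[0])
for cut, value in zip(cuts[1:], assignment_values[1:-1]):
    assignment = assignment.when(F.col("rand") < F.lit(cut), value)
assignment = assignment.otherwise(assignment_values[-1])

df.withColumn("rand", F.rand()).withColumn("assignment", assignment).drop("rand").show()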

Can I use regexp_replace or some equivalent to replace multiple values in a pyspark dataframe column with one line of code?

Here is the code to create my dataframe:
from pyspark import SparkContext, SparkConf, SQLContext
from datetime import datetime

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
data1 = [
    ('George', datetime(2010, 3, 24, 3, 19, 58), 13),
    ('George', datetime(2020, 9, 24, 3, 19, 6), 8),
    ('George', datetime(2009, 12, 12, 17, 21, 30), 5),
    ('Micheal', datetime(2010, 11, 22, 13, 29, 40), 12),
    ('Maggie', datetime(2010, 2, 8, 3, 31, 23), 8),
    ('Ravi', datetime(2009, 1, 1, 4, 19, 47), 2),
    ('Xien', datetime(2010, 3, 2, 4, 33, 51), 3),
]
df1 = sqlContext.createDataFrame(data1, ['name', 'trial_start_time', 'purchase_time'])
df1.show(truncate=False)
Here is the dataframe:
+-------+-------------------+-------------+
|name |trial_start_time |purchase_time|
+-------+-------------------+-------------+
|George |2010-03-24 07:19:58|13 |
|George |2020-09-24 07:19:06|8 |
|George |2009-12-12 22:21:30|5 |
|Micheal|2010-11-22 18:29:40|12 |
|Maggie |2010-02-08 08:31:23|8 |
|Ravi |2009-01-01 09:19:47|2 |
|Xien |2010-03-02 09:33:51|3 |
+-------+-------------------+-------------+
Here is a working example to replace one string:
from pyspark.sql.functions import regexp_replace, regexp_extract, col
df1.withColumn("name", regexp_replace('name', "Ravi", "Ravi_renamed")).show()
Here is the output:
+------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+------------+-------------------+-------------+
| George|2010-03-24 07:19:58| 13|
| George|2020-09-24 07:19:06| 8|
| George|2009-12-12 22:21:30| 5|
| Micheal|2010-11-22 18:29:40| 12|
| Maggie|2010-02-08 08:31:23| 8|
|Ravi_renamed|2009-01-01 09:19:47| 2|
| Xien|2010-03-02 09:33:51| 3|
+------------+-------------------+-------------+
In pandas I could replace multiple strings in one line of code with a lambda expression:
df1['name'].apply(lambda x: x.replace('George', 'George_renamed1').replace('Ravi', 'Ravi_renamed2'))
I am not sure if this can be done in pyspark with regexp_replace. Is there perhaps another alternative? When I read about using lambda expressions in pyspark, it seems I have to create udf functions (which seem to get a little long). But I am curious whether I can simply run some type of regex expression on multiple strings, like above, in one line of code.
This is what you're looking for:
Using when() (most readable):
from pyspark.sql.functions import when, col

df1.withColumn('name',
    when(col('name') == 'George', 'George_renamed1')
    .when(col('name') == 'Ravi', 'Ravi_renamed2')
    .otherwise(col('name'))
)
With a mapping expr (less explicit, but handy if there are many values to replace):
from pyspark.sql import functions as F

df1 = df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name], name)"))
or, if you already have a list to use, i.e.
name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']
# str()[1:-1] converts the list to a string and removes the [ ]
df1 = df1.withColumn('name', F.expr(f'coalesce(map({str(name_changes)[1:-1]})[name], name)'))
The above, but using only imported pyspark functions:
from pyspark.sql.functions import create_map, lit, coalesce

mapping_expr = create_map([lit(x) for x in name_changes])
df1 = df1.withColumn('name', coalesce(mapping_expr[df1['name']], 'name'))
Result
df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name],name)")).show()
+---------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+---------------+-------------------+-------------+
|George_renamed1|2010-03-24 03:19:58| 13|
|George_renamed1|2020-09-24 03:19:06| 8|
|George_renamed1|2009-12-12 17:21:30| 5|
| Micheal|2010-11-22 13:29:40| 12|
| Maggie|2010-02-08 03:31:23| 8|
| Ravi_renamed2|2009-01-01 04:19:47| 2|
| Xien|2010-03-02 04:33:51| 3|
+---------------+-------------------+-------------+
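To answer the literal question: regexp_replace calls can also be nested in a single line, although the map/when approaches above scale better when there are many replacements. A quick sketch, not from the answer above:
from pyspark.sql.functions import regexp_replace

# Nested regexp_replace: each call handles one pattern.
df1.withColumn('name',
    regexp_replace(regexp_replace('name', 'George', 'George_renamed1'),
                   'Ravi', 'Ravi_renamed2')).show()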

Sum of array elements depending on value condition pyspark

I have a pyspark dataframe:
id | column
---|-----------------------
1  | [0.2, 2, 3, 4, 3, 0.5]
2  | [7, 0.3, 0.3, 8, 2]
I would like to create 3 columns:
Column 1: contains the sum of the elements < 2
Column 2: contains the sum of the elements > 2
Column 3: contains the sum of the elements = 2 (sometimes I have duplicate values, so I sum them). If there are no such values, I put null.
Expected result:
id | column                 | column<2 | column>2 | column=2
---|------------------------|----------|----------|---------
1  | [0.2, 2, 3, 4, 3, 0.5] | [0.7]    | [12]     | null
2  | [7, 0.3, 0.3, 8, 2]    | [0.6]    | [15]     | [2]
Can you help me please ?
Thank you
For Spark 2.4+, you can use aggregate and filter higher-order functions like this:
df.withColumn("column<2", expr("aggregate(filter(column, x -> x < 2), 0D, (x, acc) -> acc + x)")) \
.withColumn("column>2", expr("aggregate(filter(column, x -> x > 2), 0D, (x, acc) -> acc + x)")) \
.withColumn("column=2", expr("aggregate(filter(column, x -> x == 2), 0D, (x, acc) -> acc + x)")) \
.show(truncate=False)
Gives:
+---+------------------------------+--------+--------+--------+
|id |column |column<2|column>2|column=2|
+---+------------------------------+--------+--------+--------+
|1 |[0.2, 2.0, 3.0, 4.0, 3.0, 0.5]|0.7 |10.0 |2.0 |
|2 |[7.0, 0.3, 0.3, 8.0, 2.0] |0.6 |15.0 |2.0 |
+---+------------------------------+--------+--------+--------+
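Note that with 0D as the start value, a group with no matching elements gets 0.0 rather than the null the question asks for. If you need null, one possible tweak (a sketch using the same expr import; apply it to the other two columns in the same way) is:
# Return null instead of 0.0 when no element matches the condition.
df.withColumn("column=2", expr("""
    CASE WHEN size(filter(column, x -> x == 2)) > 0
         THEN aggregate(filter(column, x -> x == 2), 0D, (acc, x) -> acc + x)
    END""")) \
  .show(truncate=False)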
Here's a way you can try:
import pyspark.sql.functions as F
from pyspark.sql import Row

# using map, filter the list and sum based on the condition
s = (df
     .select('column')
     .rdd
     .map(lambda x: [[i for i in x.column if i < 2],
                     [i for i in x.column if i > 2],
                     [i for i in x.column if i == 2]])
     .map(lambda x: [Row(round(sum(i), 2)) for i in x])
     .toDF(['col<2', 'col>2', 'col=2']))

# create a dummy id so we can join both data frames
df = df.withColumn('mid', F.monotonically_increasing_id())
s = s.withColumn('mid', F.monotonically_increasing_id())

# simple join on the dummy id
df = df.join(s, on='mid').drop('mid')
df.show()
+---+--------------------+-----+------+-----+
| id| column|col<2| col>2|col=2|
+---+--------------------+-----+------+-----+
| 0|[0.2, 2.0, 3.0, 4...|[0.7]|[10.0]|[2.0]|
| 1|[7.0, 0.3, 0.3, 8...|[0.6]|[15.0]|[2.0]|
+---+--------------------+-----+------+-----+
For Spark 2.4+, you can use aggregate function and do the calculation in one step:
from pyspark.sql.functions import expr
# I adjusted the 2nd array-item in id=1 from 2.0 to 2.1 so there is no `2.0` when id=1
df = spark.createDataFrame([(1,[0.2, 2.1, 3., 4., 3., 0.5]),(2,[7., 0.3, 0.3, 8., 2.,])],['id','column'])
df.withColumn('data', expr("""
    aggregate(
      /* ArrayType argument */
      column,
      /* zero: initialize acc as an array of three nulls, one slot per output column */
      array(double(null), double(null), double(null)),
      /* merge: iterate through `column` and reduce based on the value of y and the array indices of acc */
      (acc, y) ->
        CASE
          WHEN y < 2.0 THEN array(IFNULL(acc[0],0) + y, acc[1], acc[2])
          WHEN y > 2.0 THEN array(acc[0], IFNULL(acc[1],0) + y, acc[2])
          ELSE array(acc[0], acc[1], IFNULL(acc[2],0) + y)
        END,
      /* finish: convert the array into a named_struct */
      acc -> (acc[0] as `column<2`, acc[1] as `column>2`, acc[2] as `column=2`)
    )
""")).selectExpr('id', 'data.*').show()
#+---+--------+--------+--------+
#| id|column<2|column>2|column=2|
#+---+--------+--------+--------+
#| 1| 0.7| 12.1| null|
#| 2| 0.6| 15.0| 2.0|
#+---+--------+--------+--------+
Before Spark 2.4, functional support for ArrayType is limited; you can do it with explode and then groupby + pivot:
from pyspark.sql.functions import sum as fsum, expr
df.selectExpr('id', 'explode_outer(column) as item') \
.withColumn('g', expr('if(item < 2, "column<2", if(item > 2, "column>2", "column=2"))')) \
.groupby('id') \
.pivot('g', ["column<2", "column>2", "column=2"]) \
.agg(fsum('item')) \
.show()
#+---+--------+--------+--------+
#| id|column<2|column>2|column=2|
#+---+--------+--------+--------+
#| 1| 0.7| 12.1| null|
#| 2| 0.6| 15.0| 2.0|
#+---+--------+--------+--------+
In case explode is slow (i.e. SPARK-21657, present before Spark 2.3), use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([
    StructField("column>2", DoubleType()),
    StructField("column<2", DoubleType()),
    StructField("column=2", DoubleType())
])

def split_data(arr):
    # accumulate the per-bucket sums in a dict keyed by the output column names
    d = {}
    if arr is None: arr = []
    for y in arr:
        if y > 2:
            d['column>2'] = d.get('column>2', 0) + y
        elif y < 2:
            d['column<2'] = d.get('column<2', 0) + y
        else:
            d['column=2'] = d.get('column=2', 0) + y
    return d

udf_split_data = udf(split_data, schema)

df.withColumn('data', udf_split_data('column')).selectExpr('id', 'data.*').show()

Convert string list to binary list in pyspark

I have a dataframe like this
data = [(("ID1", ['October', 'September', 'August'])), (("ID2", ['August', 'June', 'May'])),
(("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)
+---+----------------------------+
|ID |MonthList |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2|[August, June, May] |
|ID3|[October, June] |
+---+----------------------------+
I want to compare every row with a default list, such that if the value is present assign 1 else 0
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']
Hence my expected output is this
+---+----------------------------+------------------+
|ID |MonthList |Binary_MonthList |
+---+----------------------------+------------------+
|ID1|[October, September, August]|[1, 1, 1, 0, 0, 0]|
|ID2|[August, June, May] |[0, 0, 1, 0, 1, 1]|
|ID3|[October, June] |[1, 0, 0, 0, 1, 0]|
+---+----------------------------+------------------+
I am able to do this in python, but don't know how to do this in pyspark
You can try to use a udf like this one.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']
def_month_list_func = udf(lambda x: [1 if i in x else 0 for i in default_month_list], ArrayType(IntegerType()))
df = df.withColumn("Binary_MonthList", def_month_list_func(col("MonthList")))
df.show()
# output
+---+--------------------+------------------+
| ID| MonthList| Binary_MonthList|
+---+--------------------+------------------+
|ID1|[October, Septemb...|[1, 1, 1, 0, 0, 0]|
|ID2| [August, June, May]|[0, 0, 1, 0, 1, 1]|
|ID3| [October, June]|[1, 0, 0, 0, 1, 0]|
+---+--------------------+------------------+
How about using array_contains():
from pyspark.sql.functions import array, array_contains
df.withColumn('Binary_MonthList', array([array_contains('MonthList', c).astype('int') for c in default_month_list])).show()
+---+--------------------+------------------+
| ID| MonthList| Binary_MonthList|
+---+--------------------+------------------+
|ID1|[October, Septemb...|[1, 1, 1, 0, 0, 0]|
|ID2| [August, June, May]|[0, 0, 1, 0, 1, 1]|
|ID3| [October, June]|[1, 0, 0, 0, 1, 0]|
+---+--------------------+------------------+
pissall's answer is completely fine. I'm just posting a more general solution that works without a udf and doesn't require you to know the possible values in advance.
A CountVectorizer does exactly what you want. This algorithm adds all distinct values to its vocabulary as long as they fulfil certain criteria (e.g. minimum or maximum occurrence). You can apply the fitted model to a dataframe, and it will return a one-hot encoded sparse vector column (which can be converted to a dense vector column) representing the items of the given input column.
from pyspark.ml.feature import CountVectorizer
data = [(("ID1", ['October', 'September', 'August']))
, (("ID2", ['August', 'June', 'May', 'August']))
, (("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)
#binary=True checks only if a item of the dictionary is present and not how often
#vocabSize defines the maximum size of the dictionary
#minDF=1.0 defines in how much rows (1.0 means one row is enough) a values has to be present to be added to the vocabulary
cv = CountVectorizer(inputCol="MonthList", outputCol="Binary_MonthList", vocabSize=12, minDF=1.0, binary=True)
cvModel = cv.fit(df)
df = cvModel.transform(df)
df.show(truncate=False)
cvModel.vocabulary
Output:
+---+----------------------------+
|ID | MonthList |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2| [August, June, May, August]|
|ID3| [October, June] |
+---+----------------------------+
+---+----------------------------+-------------------------+
|ID | MonthList | Binary_MonthList |
+---+----------------------------+-------------------------+
|ID1|[October, September, August]|(5,[1,2,3],[1.0,1.0,1.0])|
|ID2|[August, June, May, August] |(5,[0,1,4],[1.0,1.0,1.0])|
|ID3|[October, June] | (5,[0,2],[1.0,1.0]) |
+---+----------------------------+-------------------------+
['June', 'August', 'October', 'September', 'May']
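If you need a plain 0/1 integer array rather than a SparseVector, one option on Spark 3.0+ is pyspark.ml.functions.vector_to_array (a sketch; note the element order follows cvModel.vocabulary, not your default_month_list):
from pyspark.ml.functions import vector_to_array

# Densify the CountVectorizer output and cast the 1.0/0.0 entries to ints.
df.withColumn("Binary_MonthList", vector_to_array("Binary_MonthList").cast("array<int>")) \
  .show(truncate=False)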

PySpark :: FP-growth algorithm ( raise ValueError("Params must be either a param map or a list/tuple of param maps, ")

I am a beginner with PySpark. I am using FPGrowth to compute associations in PySpark. I followed the steps below.
Data Example
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
# make some test data
columns = ['customer_id', 'product_id']
vals = [
(370, 154),
(41, 40),
(109, 173),
(18, 55),
(105, 126),
(370, 121),
(41, 32323),
(109, 22),
(18, 55),
(105, 133),
(109, 22),
(18, 55),
(105, 133)
]
df = spark.createDataFrame(vals, columns)
df.show()
+-----------+----------+
|customer_id|product_id|
+-----------+----------+
| 370| 154|
| 41| 40|
| 109| 173|
| 18| 55|
| 105| 126|
| 370| 121|
| 41| 32323|
| 109| 22|
| 18| 55|
| 105| 133|
| 109| 22|
| 18| 55|
| 105| 133|
+-----------+----------+
### Prepare input data
from pyspark.sql.functions import collect_list, col
transactions = df.groupBy("customer_id")\
.agg(collect_list("product_id").alias("product_ids"))\
.rdd\
.map(lambda x: (x.customer_id, x.product_ids))
transactions.collect()
[(370, [121, 154]),
(41, [32323, 40]),
(105, [133, 133, 126]),
(18, [55, 55, 55]),
(109, [22, 173, 22])]
## Convert .rdd to spark dataframe
df2 = spark.createDataFrame(transactions)
df2.show()
+---+---------------+
| _1| _2|
+---+---------------+
|370| [121, 154]|
| 41| [32323, 40]|
|105|[126, 133, 133]|
| 18| [55, 55, 55]|
|109| [22, 173, 22]|
+---+---------------+
df3 = df2.selectExpr("_1 as customer_id", "_2 as product_id")
df3.show()
df3.printSchema()
+-----------+---------------+
|customer_id| product_id|
+-----------+---------------+
| 370| [154, 121]|
| 41| [32323, 40]|
| 105|[126, 133, 133]|
| 18| [55, 55, 55]|
| 109| [173, 22, 22]|
+-----------+---------------+
root
|-- customer_id: long (nullable = true)
|-- product_id: array (nullable = true)
| |-- element: long (containsNull = true)
## FPGrowth Model Building
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="product_id", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df3)
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-12-aa1f71745240> in <module>()
----> 1 model = fpGrowth.fit(df3)
/usr/lib/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
/usr/lib/spark/python/pyspark/ml/wrapper.py in _fit(self, dataset)
263
264 def _fit(self, dataset):
--> 265 java_model = self._fit_java(dataset)
266 return self._create_model(java_model)
267
/usr/lib/spark/python/pyspark/ml/wrapper.py in _fit_java(self, dataset)
260 """
261 self._transfer_params_to_java()
--> 262 return self._java_obj.fit(dataset._jdf)
263
264 def _fit(self, dataset):
/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
I looked it up but could not figure out what went wrong. The only thing I can point to is that I converted the RDD to a dataframe. Can anybody point out what I am doing wrong?
If you carefully check the traceback you'll see the source of the problem:
Caused by: org.apache.spark.SparkException: Items in a transaction must be unique but got ....
Replace collect_list with collect_set and problem will be fixed.
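Applied to the pipeline above, that fix is a one-line change (a sketch; the next answer shows a fully reworked version):
# Replace collect_list with collect_set so every transaction contains unique items.
from pyspark.sql.functions import collect_set

df3 = df.groupBy("customer_id").agg(collect_set("product_id").alias("product_id"))
model = fpGrowth.fit(df3)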
Well, I just realised that FPGrowth from pyspark.ml.fpm takes a PySpark dataframe, not an RDD, so the method above converted my dataset to an RDD unnecessarily. I was able to avoid the situation by using collect_set with groupBy to get a dataframe directly and pass that on.
from pyspark.sql.session import SparkSession
# instantiate Spark
spark = SparkSession.builder.getOrCreate()
# make some test data
columns = ['customer_id', 'product_id']
vals = [
(370, 154),
(370, 40),
(370, 173),
(41, 55),
(41, 126),
(41, 121),
(41, 321),
(105, 22),
(105, 55),
(105, 133),
(109, 22),
(109, 55),
(109, 133)
]
# create DataFrame
df = spark.createDataFrame(vals, columns)
df.show()
+-----------+----------+
|customer_id|product_id|
+-----------+----------+
| 370| 154|
| 370| 40|
| 370| 173|
| 41| 55|
| 41| 126|
| 41| 121|
| 41| 321|
| 105| 22|
| 105| 55|
| 105| 133|
| 109| 22|
| 109| 55|
| 109| 133|
+-----------+----------+
# Create dataframe for FPGrowth model input
from pyspark.sql import functions as F
transactions = df.groupBy("customer_id")\
.agg(F.collect_set("product_id"))
transactions.show()
+-----------+-----------------------+
|customer_id|collect_set(product_id)|
+-----------+-----------------------+
| 370| [154, 173, 40]|
| 41| [321, 121, 126, 55]|
| 105| [133, 22, 55]|
| 109| [133, 22, 55]|
+-----------+-----------------------+
# FPGrowth model
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="collect_set(product_id)", minSupport=0.5, minConfidence=0.6)
model_working = fpGrowth.fit(transactions)
# Display frequent itemsets
model_working.freqItemsets.show()
+-------------+----+
| items|freq|
+-------------+----+
| [55]| 3|
| [22]| 2|
| [22, 55]| 2|
| [133]| 2|
| [133, 22]| 2|
|[133, 22, 55]| 2|
| [133, 55]| 2|
+-------------+----+
# Display generated association rules.
model_working.associationRules.show()
# transform examines the input items against all the association rules and summarises the
# consequents as the prediction
model_working.transform(transactions).show()
+----------+----------+------------------+
|antecedent|consequent| confidence|
+----------+----------+------------------+
| [133]| [22]| 1.0|
| [133]| [55]| 1.0|
| [133, 55]| [22]| 1.0|
| [133, 22]| [55]| 1.0|
| [22]| [55]| 1.0|
| [22]| [133]| 1.0|
| [55]| [22]|0.6666666666666666|
| [55]| [133]|0.6666666666666666|
| [22, 55]| [133]| 1.0|
+----------+----------+------------------+
+-----------+-----------------------+----------+
|customer_id|collect_set(product_id)|prediction|
+-----------+-----------------------+----------+
| 370| [154, 173, 40]| []|
| 41| [321, 121, 126, 55]| [22, 133]|
| 105| [133, 22, 55]| []|
| 109| [133, 22, 55]| []|
+-----------+-----------------------+----------+
