I am trying to create a JSON document in the structure shown below from the following sample data.
Sample data:
+---------+---------+---------+---------+
|  Country|SegmentID|total_cnt|max_value|
+---------+---------+---------+---------+
|     Pune|        1|     10.0|       15|
|    Delhi|        1|     10.0|       15|
|Bangalore|        1|     10.0|       15|
|     Pune|        2|     10.0|       16|
|    Delhi|        2|     10.0|       16|
|Bangalore|        2|     10.0|       16|
|     Pune|        3|     15.0|       16|
|    Delhi|        3|     10.0|       16|
|Bangalore|        3|     15.0|       16|
+---------+---------+---------+---------+
Expected JSON Structure:
[
  {
    "NAME": "SEG1",
    "VAL": 15,
    "CITIES": {
      "Bangalore": 10,
      "Delhi": 10,
      "Pune": 10
    }
  },
  {
    "NAME": "SEG2",
    "VAL": 16,
    "CITIES": {
      "Bangalore": 10,
      "Delhi": 10,
      "Pune": 10
    }
  },
  {
    "NAME": "SEG3",
    "VAL": 16,
    "CITIES": {
      "Bangalore": 15,
      "Delhi": 10,
      "Pune": 15
    }
  }
]
I can create a one-level hierarchy, but that still does not satisfy my requirement. Here is my code:
join_df = join_df.toPandas()
j = (join_df.groupby(['SegmentID', 'max_value'], as_index=False)
     .apply(lambda x: x[['Country', 'total_cnt']].to_dict('records'))
     .reset_index().rename(columns={0: 'CITIES'})
     .to_json(orient='records'))
This gives the following result:
[{"SegmentID":1,"max_value":15,"Cities":[{"Country":"Pune","total_cnt":10.0},{"Country":"Delhi","total_cnt":10.0},{"Country":"Bangalore","total_cnt":10.0}]},{"SegmentID":2,"max_value":16,"Cities":[{"Country":"Pune","total_cnt":10.0},{"Country":"Delhi","total_cnt":10.0},{"Country":"Bangalore","total_cnt":10.0}]},{"SegmentID":3,"max_value":16,"Cities":[{"Country":"Pune","total_cnt":15.0},{"Country":"Delhi","total_cnt":10.0},{"Country":"Bangalore","total_cnt":15.0}]}]
You can convert the DataFrame to an RDD and apply your transformations:
from pyspark.sql.types import *
import json

NewSchema = StructType([StructField("Name", StringType()),
                        StructField("VAL", IntegerType()),
                        StructField("CITIES", StringType())])

def reduceKeys(row1, row2):
    row1[0].update(row2[0])
    return row1

res_df = join_df.rdd \
    .map(lambda row: ("SEG" + str(row[1]), ({row[0]: row[2]}, row[3]))) \
    .reduceByKey(lambda x, y: reduceKeys(x, y)) \
    .map(lambda row: (row[0], row[1][1], json.dumps(row[1][0]))) \
    .toDF(NewSchema)
Here is the result:
res_df.show(20, False)
+----+---+------------------------------------------------+
|Name|VAL|CITIES |
+----+---+------------------------------------------------+
|SEG1|15 |{"Pune": 10.0, "Delhi": 10.0, "Bangalore": 10.0}|
|SEG3|16 |{"Pune": 15.0, "Delhi": 10.0, "Bangalore": 15.0}|
|SEG2|16 |{"Pune": 10.0, "Delhi": 10.0, "Bangalore": 10.0}|
+----+---+------------------------------------------------+
Now you can save it as JSON (note that Spark writes a directory named output.json containing the part files):
res_df.coalesce(1).write.format('json').save('output.json')
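If you prefer to stay in the DataFrame API (Spark >= 2.4), here is a rough, untested sketch of the same result without dropping to the RDD; map_from_entries builds the CITIES map and to_json serialises it:

from pyspark.sql import functions as F

res_df2 = (join_df
           .groupBy("SegmentID", "max_value")
           .agg(F.map_from_entries(F.collect_list(F.struct("Country", "total_cnt"))).alias("CITIES"))
           .select(F.concat(F.lit("SEG"), F.col("SegmentID").cast("string")).alias("NAME"),
                   F.col("max_value").alias("VAL"),
                   F.to_json("CITIES").alias("CITIES")))
res_df2.show(truncate=False)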
I have a series of values and a probability with which I want each of those values sampled. Is there a PySpark method to sample from that distribution for each row? I know how to hard-code it with a random number generator, but I want the method to be flexible for any number of assignment values and probabilities:
assignment_values = ["foo", "buzz", "boo"]
value_probabilities = [0.3, 0.3, 0.4]
Hard-coded method with random number generator:
from pyspark.sql import Row
from pyspark.sql import functions as F

data = [
    {"person": 1, "company": "5g"},
    {"person": 2, "company": "9s"},
    {"person": 3, "company": "1m"},
    {"person": 4, "company": "3l"},
    {"person": 5, "company": "2k"},
    {"person": 6, "company": "7c"},
    {"person": 7, "company": "3m"},
    {"person": 8, "company": "2p"},
    {"person": 9, "company": "4s"},
    {"person": 10, "company": "8y"},
]
df = spark.createDataFrame(Row(**x) for x in data)

(
    df
    .withColumn("rand", F.rand())
    .withColumn(
        "assignment",
        F.when(F.col("rand") < F.lit(0.3), "foo")
        .when(F.col("rand") < F.lit(0.6), "buzz")
        .otherwise("boo")
    )
    .show()
)
+-------+------+-------------------+----------+
|company|person| rand|assignment|
+-------+------+-------------------+----------+
| 5g| 1| 0.8020603266148111| boo|
| 9s| 2| 0.1297179045352752| foo|
| 1m| 3|0.05170251723736685| foo|
| 3l| 4|0.07978240998283603| foo|
| 2k| 5| 0.5931269297050258| buzz|
| 7c| 6|0.44673560271164037| buzz|
| 3m| 7| 0.1398027427612647| foo|
| 2p| 8| 0.8281404801171598| boo|
| 4s| 9|0.15568513681001817| foo|
| 8y| 10| 0.6173220502731542| boo|
+-------+------+-------------------+----------+
I think randomSplit may serve you. It randomly splits your DataFrame into several DataFrames, according to the weights you pass, and returns them in a list:
df.randomSplit([0.3, 0.3, 0.4])
You can also provide a seed to it. You can then join the DataFrames back together using reduce:
from pyspark.sql import functions as F
from functools import reduce
df = spark.createDataFrame(
    [(1, "5g"),
     (2, "9s"),
     (3, "1m"),
     (4, "3l"),
     (5, "2k"),
     (6, "7c"),
     (7, "3m"),
     (8, "2p"),
     (9, "4s"),
     (10, "8y")],
    ['person', 'company'])
assignment_values = ["foo", "buzz", "boo"]
value_probabilities = [0.3, 0.3, 0.4]
dfs = df.randomSplit(value_probabilities, 5)
dfs = [df.withColumn('assignment', F.lit(assignment_values[i])) for i, df in enumerate(dfs)]
df = reduce(lambda a, b: a.union(b), dfs)
df.show()
# +------+-------+----------+
# |person|company|assignment|
# +------+-------+----------+
# | 1| 5g| foo|
# | 2| 9s| foo|
# | 6| 7c| foo|
# | 4| 3l| buzz|
# | 5| 2k| buzz|
# | 8| 2p| buzz|
# | 3| 1m| boo|
# | 7| 3m| boo|
# | 9| 4s| boo|
# | 10| 8y| boo|
# +------+-------+----------+
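Note that randomSplit partitions the rows by approximate fractions rather than sampling each row independently. If you need true per-row sampling for an arbitrary list of values and probabilities, one untested sketch is to build the when() chain from cumulative thresholds:

from itertools import accumulate
from pyspark.sql import functions as F

thresholds = list(accumulate(value_probabilities))  # [0.3, 0.6, 1.0]

# chain one when() per value, comparing a single rand column against cumulative thresholds
expr = F.when(F.col("rand") < thresholds[0], assignment_values[0])
for value, threshold in zip(assignment_values[1:], thresholds[1:]):
    expr = expr.when(F.col("rand") < threshold, value)

df_sampled = (df.withColumn("rand", F.rand(seed=5))
                .withColumn("assignment", expr)
                .drop("rand"))
df_sampled.show()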
Can I use regexp_replace or some equivalent to replace multiple values in a pyspark dataframe column with one line of code?
Here is the code to create my dataframe:
from pyspark import SparkContext, SparkConf, SQLContext
from datetime import datetime

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

data1 = [
    ('George', datetime(2010, 3, 24, 3, 19, 58), 13),
    ('George', datetime(2020, 9, 24, 3, 19, 6), 8),
    ('George', datetime(2009, 12, 12, 17, 21, 30), 5),
    ('Micheal', datetime(2010, 11, 22, 13, 29, 40), 12),
    ('Maggie', datetime(2010, 2, 8, 3, 31, 23), 8),
    ('Ravi', datetime(2009, 1, 1, 4, 19, 47), 2),
    ('Xien', datetime(2010, 3, 2, 4, 33, 51), 3),
]

df1 = sqlContext.createDataFrame(data1, ['name', 'trial_start_time', 'purchase_time'])
df1.show(truncate=False)
Here is the dataframe:
+-------+-------------------+-------------+
|name |trial_start_time |purchase_time|
+-------+-------------------+-------------+
|George |2010-03-24 07:19:58|13 |
|George |2020-09-24 07:19:06|8 |
|George |2009-12-12 22:21:30|5 |
|Micheal|2010-11-22 18:29:40|12 |
|Maggie |2010-02-08 08:31:23|8 |
|Ravi |2009-01-01 09:19:47|2 |
|Xien |2010-03-02 09:33:51|3 |
+-------+-------------------+-------------+
Here is a working example to replace one string:
from pyspark.sql.functions import regexp_replace, regexp_extract, col
df1.withColumn("name", regexp_replace('name', "Ravi", "Ravi_renamed")).show()
Here is the output:
+------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+------------+-------------------+-------------+
| George|2010-03-24 07:19:58| 13|
| George|2020-09-24 07:19:06| 8|
| George|2009-12-12 22:21:30| 5|
| Micheal|2010-11-22 18:29:40| 12|
| Maggie|2010-02-08 08:31:23| 8|
|Ravi_renamed|2009-01-01 09:19:47| 2|
| Xien|2010-03-02 09:33:51| 3|
+------------+-------------------+-------------+
In pandas I could replace multiple strings in one line of code with a lambda expression:
df1['name'] = df1['name'].apply(lambda x: x.replace('George', 'George_renamed1').replace('Ravi', 'Ravi_renamed2'))
I am not sure whether this can be done in PySpark with regexp_replace. Perhaps there is another alternative? From what I have read about using lambda expressions in PySpark, it seems I have to create UDFs (which get a little long). But I am curious whether I can simply run some type of regex expression on multiple strings, like above, in one line of code.
This is what you're looking for:
Using when() (most readable):
from pyspark.sql.functions import when, col

df1.withColumn('name',
               when(col('name') == 'George', 'George_renamed1')
               .when(col('name') == 'Ravi', 'Ravi_renamed2')
               .otherwise(col('name')))
With a mapping expr (less explicit but handy if there are many values to replace):
from pyspark.sql import functions as F

df1 = df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name], name)"))
Or, if you already have a list to use, i.e.
name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']
# str()[1:-1] to convert the list to a string and remove [ ]
df1 = df1.withColumn('name', F.expr(f'coalesce(map({str(name_changes)[1:-1]})[name], name)'))
The above, but using only imported PySpark functions:
from pyspark.sql.functions import create_map, lit, coalesce

mapping_expr = create_map([lit(x) for x in name_changes])
df1 = df1.withColumn('name', coalesce(mapping_expr[df1['name']], 'name'))
Result
df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name],name)")).show()
+---------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+---------------+-------------------+-------------+
|George_renamed1|2010-03-24 03:19:58| 13|
|George_renamed1|2020-09-24 03:19:06| 8|
|George_renamed1|2009-12-12 17:21:30| 5|
| Micheal|2010-11-22 13:29:40| 12|
| Maggie|2010-02-08 03:31:23| 8|
| Ravi_renamed2|2009-01-01 04:19:47| 2|
| Xien|2010-03-02 04:33:51| 3|
+---------------+-------------------+-------------+
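Another option worth knowing about (my own addition, not shown in the output above) is DataFrame.replace, which accepts a dict of exact-match replacements in one line; unlike regexp_replace it only swaps whole values, not substrings:

df1.replace({'George': 'George_renamed1', 'Ravi': 'Ravi_renamed2'}, subset=['name']).show()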
Using a Spark DataFrame:
scala> val df_input = Seq( ("p1", """{"a": 1, "b": 2}"""), ("p2", """{"c": 3}""") ).toDF("p_id", "p_meta")
df_input: org.apache.spark.sql.DataFrame = [p_id: string, p_meta: string]
scala> df_input.show()
+----+----------------+
|p_id| p_meta|
+----+----------------+
| p1|{"a": 1, "b": 2}|
| p2| {"c": 3}|
+----+----------------+
Given this input df, is it possible to split it by json key to create a new df_output like the output below?
df_output =
p_id p_meta_key p_meta_value
p1 a 1
p1 b 2
p2 c 3
I am using Spark 3.0.0 / Scala 2.12.x, and I prefer to use spark.sql.functions._.
Another alternative: from_json + explode
val df_input = Seq( ("p1", """{"a": 1, "b": 2}"""), ("p2", """{"c": 3}""") )
.toDF("p_id", "p_meta")
df_input.show(false)
/**
* +----+----------------+
* |p_id|p_meta |
* +----+----------------+
* |p1 |{"a": 1, "b": 2}|
* |p2 |{"c": 3} |
* +----+----------------+
*/
df_input.withColumn("p_meta", from_json($"p_meta", "map<string, string>", Map.empty[String, String]))
.selectExpr("p_id", "explode(p_meta) as (p_meta_key, p_meta_value)")
.show(false)
/**
* +----+----------+------------+
* |p_id|p_meta_key|p_meta_value|
* +----+----------+------------+
* |p1 |a |1 |
* |p1 |b |2 |
* |p2 |c |3 |
* +----+----------+------------+
*/
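For PySpark users, a rough equivalent of the from_json + explode approach might look like this (an untested sketch, assuming an existing SparkSession named spark):

from pyspark.sql import functions as F

df_input = spark.createDataFrame([("p1", '{"a": 1, "b": 2}'), ("p2", '{"c": 3}')],
                                 ["p_id", "p_meta"])

(df_input
 .withColumn("p_meta", F.from_json("p_meta", "map<string,string>"))
 .select("p_id", F.explode("p_meta").alias("p_meta_key", "p_meta_value"))
 .show(truncate=False))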
The code below will solve your problem; I have tested it in Spark 3.0.0 / Scala 2.12.10.
import org.apache.spark.sql.functions._
val df_input = Seq( ("p1", """{"a": 1, "b": 2}"""), ("p2", """{"c": 3}""") ).toDF("p_id", "p_meta")
df_input.show()
/*
+----+----------------+
|p_id| p_meta|
+----+----------------+
| p1|{"a": 1, "b": 2}|
| p2| {"c": 3}|
+----+----------------+
*/
// UDF to convert a JSON string to a Map
def convert(str: String): Map[String, String] = {
  "(\\w+): (\\w+)".r.findAllIn(str).matchData.map(i => {
    (i.group(1), i.group(2))
  }).toMap
}

val udfConvert = spark.udf.register("udfConvert", convert _)
//Remove double quotes
val df=df_input.withColumn("p_meta", regexp_replace($"p_meta", "\"", ""))
df.show()
/*
+----+------------+
|p_id| p_meta|
+----+------------+
| p1|{a: 1, b: 2}|
| p2| {c: 3}|
+----+------------+
*/
val df1=df.withColumn("new_col",udfConvert($"p_meta"))
/*
+----+------------+----------------+
|p_id| p_meta| new_col|
+----+------------+----------------+
| p1|{a: 1, b: 2}|[a -> 1, b -> 2]|
| p2| {c: 3}| [c -> 3]|
+----+------------+----------------+
*/
df1.select($"p_id",$"p_meta",$"new_col",explode($"new_col")).drop($"p_meta").drop($"new_col").withColumn("p_meta_key",$"key").withColumn("p_mata_value",$"value").drop($"key").drop($"value").show()
/*
+----+----------+------------+
|p_id|p_meta_key|p_meta_value|
+----+----------+------------+
| p1| a| 1|
| p1| b| 2|
| p2| c| 3|
+----+----------+------------+
*/
I want to calculate the time spent per SeqID for each user. I have a dataframe like the one below; the time for every user is split between two actions, Action_A and Action_B.
The total time per user, per SeqID, is the sum across all such pairs.
For the first user, it is 5 + 3 minutes [(10:05:00 - 10:00:00) + (10:23:00 - 10:20:00)].
So the first user has spent 8 minutes for SeqID 15 (not 23 minutes).
Similarly, user 2 has spent 1 + 5 = 6 minutes.
How can I calculate this using PySpark?
data = [(("ID1", 15, "2019-12-10 10:00:00", "Action_A")),
(("ID1", 15, "2019-12-10 10:05:00", "Action_B")),
(("ID1", 15, "2019-12-10 10:20:00", "Action_A")),
(("ID1", 15, "2019-12-10 10:23:00", "Action_B")),
(("ID2", 23, "2019-12-10 11:10:00", "Action_A")),
(("ID2", 23, "2019-12-10 11:11:00", "Action_B")),
(("ID2", 23, "2019-12-10 11:30:00", "Action_A")),
(("ID2", 23, "2019-12-10 11:35:00", "Action_B"))]
df = spark.createDataFrame(data, ["ID", "SeqID", "Timestamp", "Action"])
df.show()
+---+-----+-------------------+--------+
| ID|SeqID| Timestamp| Action|
+---+-----+-------------------+--------+
|ID1| 15|2019-12-10 10:00:00|Action_A|
|ID1| 15|2019-12-10 10:05:00|Action_B|
|ID1| 15|2019-12-10 10:20:00|Action_A|
|ID1| 15|2019-12-10 10:23:00|Action_B|
|ID2| 23|2019-12-10 11:10:00|Action_A|
|ID2| 23|2019-12-10 11:11:00|Action_B|
|ID2| 23|2019-12-10 11:30:00|Action_A|
|ID2| 23|2019-12-10 11:35:00|Action_B|
+---+-----+-------------------+--------+
Once I have the data for each pair, I can sum across the group (ID, SeqID)
Expected output (could be seconds also)
+---+-----+--------+
| ID|SeqID|Dur_Mins|
+---+-----+--------+
|ID1| 15| 8|
|ID2| 23| 6|
+---+-----+--------+
Here is a possible solution using Higher-Order Functions (Spark >=2.4):
from pyspark.sql.functions import expr, col, array_sort, collect_list

transform_expr = "transform(ts_array, (x, i) -> (unix_timestamp(ts_array[i+1]) - unix_timestamp(x)) / 60 * ((i+1) % 2))"

df.groupBy("ID", "SeqID").agg(array_sort(collect_list(col("Timestamp"))).alias("ts_array")) \
  .withColumn("transformed_ts_array", expr(transform_expr)) \
  .withColumn("Dur_Mins", expr("aggregate(transformed_ts_array, 0D, (acc, x) -> acc + coalesce(x, 0D))")) \
  .drop("transformed_ts_array", "ts_array") \
  .show(truncate=False)
Steps:
Collect all timestamps into an array for each group (ID, SeqID) and sort them in ascending order.
Apply a transform on the array with the lambda function (x, i) -> Double, where x is the element and i its index. For each timestamp in the array, we calculate the diff with the next timestamp, and we multiply by (i+1) % 2 so that only pairwise diffs survive (first with second, third with fourth, ...), since there are always 2 actions.
Finally, aggregate the transformed array to sum all the elements.
Output:
+---+-----+--------+
|ID |SeqID|Dur_Mins|
+---+-----+--------+
|ID1|15 |8.0 |
|ID2|23 |6.0 |
+---+-----+--------+
A possible (though perhaps convoluted) way to do it with flatMapValues and the RDD API, using your data variable:
from pyspark.sql import functions as func

df = spark.createDataFrame(data, ["id", "seq_id", "ts", "action"]) \
    .withColumn('ts', func.col('ts').cast('timestamp'))

# func to calculate the duration | applied on each row
def getDur(groupedrows):
    res = []
    for row in groupedrows:
        if row.action == 'Action_A':
            frst_ts = row.ts
            dur = 0
        elif row.action == 'Action_B':
            dur = (row.ts - frst_ts).total_seconds()
        res.append([val for val in row] + [float(dur)])
    return res

# run the rules on the base df | row by row
# grouped on ID, SeqID - sorted on timestamp
dur_rdd = df.rdd. \
    groupBy(lambda k: (k.id, k.seq_id)). \
    flatMapValues(lambda r: getDur(sorted(r, key=lambda ok: ok.ts))). \
    values()

# specify final schema
dur_schema = df.schema. \
    add('dur', 'float')

# convert to DataFrame
dur_sdf = spark.createDataFrame(dur_rdd, dur_schema)
dur_sdf.orderBy('id', 'seq_id', 'ts').show()
+---+------+-------------------+--------+-----+
| id|seq_id| ts| action| dur|
+---+------+-------------------+--------+-----+
|ID1| 15|2019-12-10 10:00:00|Action_A| 0.0|
|ID1| 15|2019-12-10 10:05:00|Action_B|300.0|
|ID1| 15|2019-12-10 10:20:00|Action_A| 0.0|
|ID1| 15|2019-12-10 10:23:00|Action_B|180.0|
|ID2| 23|2019-12-10 11:10:00|Action_A| 0.0|
|ID2| 23|2019-12-10 11:11:00|Action_B| 60.0|
|ID2| 23|2019-12-10 11:30:00|Action_A| 0.0|
|ID2| 23|2019-12-10 11:35:00|Action_B|300.0|
+---+------+-------------------+--------+-----+
# Your required data
dur_sdf.groupBy('id', 'seq_id'). \
agg((func.sum('dur') / func.lit(60)).alias('dur_mins')). \
show()
+---+------+--------+
| id|seq_id|dur_mins|
+---+------+--------+
|ID1| 15| 8.0|
|ID2| 23| 6.0|
+---+------+--------+
This fits the data you've described, but check if it fits all your cases.
Another possible solution using Window Functions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[3]").appName("TestApp").enableHiveSupport().getOrCreate()

data = [("ID1", 15, "2019-12-10 10:00:00", "Action_A"),
        ("ID1", 15, "2019-12-10 10:05:00", "Action_B"),
        ("ID1", 15, "2019-12-10 10:20:00", "Action_A"),
        ("ID1", 15, "2019-12-10 10:23:00", "Action_B"),
        ("ID2", 23, "2019-12-10 11:10:00", "Action_A"),
        ("ID2", 23, "2019-12-10 11:11:00", "Action_B"),
        ("ID2", 23, "2019-12-10 11:30:00", "Action_A"),
        ("ID2", 23, "2019-12-10 11:35:00", "Action_B")]

df = spark.createDataFrame(data, ["ID", "SeqID", "Timestamp", "Action"])
df.createOrReplaceTempView("tmpTbl")

df = spark.sql("select *, lead(Timestamp, 1) over (partition by ID, SeqID order by Timestamp) Next_Timestamp from tmpTbl")

updated_df = df.filter("Action != 'Action_B'")
final_df = updated_df.withColumn("diff", (F.unix_timestamp('Next_Timestamp') - F.unix_timestamp('Timestamp')) / F.lit(60))
final_df.groupBy("ID", "SeqID").agg(F.sum(F.col("diff")).alias("Duration")).show()
Output: the same durations as expected above, 8.0 minutes for (ID1, 15) and 6.0 minutes for (ID2, 23).
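The same idea can also be written with the DataFrame API instead of SQL. Here is a rough, untested sketch that rebuilds df from the data list above and uses lead over a window:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(data, ["ID", "SeqID", "Timestamp", "Action"])
w = Window.partitionBy("ID", "SeqID").orderBy("Timestamp")

(df
 .withColumn("Next_Timestamp", F.lead("Timestamp", 1).over(w))
 .filter(F.col("Action") == "Action_A")
 .withColumn("diff", (F.unix_timestamp("Next_Timestamp") - F.unix_timestamp("Timestamp")) / 60)
 .groupBy("ID", "SeqID")
 .agg(F.sum("diff").alias("Dur_Mins"))
 .show())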
I am a beginner with PySpark. I am using FPGrowth to compute associations in PySpark. I followed the steps below.
Data Example
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

# make some test data
columns = ['customer_id', 'product_id']
vals = [
    (370, 154),
    (41, 40),
    (109, 173),
    (18, 55),
    (105, 126),
    (370, 121),
    (41, 32323),
    (109, 22),
    (18, 55),
    (105, 133),
    (109, 22),
    (18, 55),
    (105, 133)
]

df = spark.createDataFrame(vals, columns)
df.show()
+-----------+----------+
|customer_id|product_id|
+-----------+----------+
| 370| 154|
| 41| 40|
| 109| 173|
| 18| 55|
| 105| 126|
| 370| 121|
| 41| 32323|
| 109| 22|
| 18| 55|
| 105| 133|
| 109| 22|
| 18| 55|
| 105| 133|
+-----------+----------+
### Prepare input data
from pyspark.sql.functions import collect_list, col
transactions = df.groupBy("customer_id")\
.agg(collect_list("product_id").alias("product_ids"))\
.rdd\
.map(lambda x: (x.customer_id, x.product_ids))
transactions.collect()
[(370, [121, 154]),
(41, [32323, 40]),
(105, [133, 133, 126]),
(18, [55, 55, 55]),
(109, [22, 173, 22])]
## Convert .rdd to spark dataframe
df2 = spark.createDataFrame(transactions)
df2.show()
+---+---------------+
| _1| _2|
+---+---------------+
|370| [121, 154]|
| 41| [32323, 40]|
|105|[126, 133, 133]|
| 18| [55, 55, 55]|
|109| [22, 173, 22]|
+---+---------------+
df3 = df2.selectExpr("_1 as customer_id", "_2 as product_id")
df3.show()
df3.printSchema()
+-----------+---------------+
|customer_id| product_id|
+-----------+---------------+
| 370| [154, 121]|
| 41| [32323, 40]|
| 105|[126, 133, 133]|
| 18| [55, 55, 55]|
| 109| [173, 22, 22]|
+-----------+---------------+
root
|-- customer_id: long (nullable = true)
|-- product_id: array (nullable = true)
| |-- element: long (containsNull = true)
## FPGrowth Model Building
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="product_id", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df3)
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-12-aa1f71745240> in <module>()
----> 1 model = fpGrowth.fit(df3)
/usr/lib/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
/usr/lib/spark/python/pyspark/ml/wrapper.py in _fit(self, dataset)
263
264 def _fit(self, dataset):
--> 265 java_model = self._fit_java(dataset)
266 return self._create_model(java_model)
267
/usr/lib/spark/python/pyspark/ml/wrapper.py in _fit_java(self, dataset)
260 """
261 self._transfer_params_to_java()
--> 262 return self._java_obj.fit(dataset._jdf)
263
264 def _fit(self, dataset):
/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
I looked it up but could not figure out what went wrong. The only thing I can point to is that I converted the RDD back to a DataFrame.
Can anybody point me to what I am doing wrong?
If you carefully check the traceback you'll see the source of the problem:
Caused by: org.apache.spark.SparkException: Items in a transaction must be unique but got ....
Replace collect_list with collect_set and problem will be fixed.
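For example, a minimal sketch of that fix applied to the pipeline from the question (the product_ids alias is my own choice):

from pyspark.ml.fpm import FPGrowth
from pyspark.sql.functions import collect_set

# collect_set de-duplicates items within each transaction, which is what FPGrowth requires
transactions = df.groupBy("customer_id") \
    .agg(collect_set("product_id").alias("product_ids"))

fpGrowth = FPGrowth(itemsCol="product_ids", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(transactions)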
Well, I just realised that FPGrowth from pyspark.ml.fpm takes a PySpark DataFrame, not an RDD, so the method above needlessly converted my dataset to an RDD and back.
I was able to avoid the problem by using collect_set with groupBy to get a DataFrame directly and pass that on.
from pyspark.sql.session import SparkSession

# instantiate Spark
spark = SparkSession.builder.getOrCreate()

# make some test data
columns = ['customer_id', 'product_id']
vals = [
    (370, 154),
    (370, 40),
    (370, 173),
    (41, 55),
    (41, 126),
    (41, 121),
    (41, 321),
    (105, 22),
    (105, 55),
    (105, 133),
    (109, 22),
    (109, 55),
    (109, 133)
]

# create DataFrame
df = spark.createDataFrame(vals, columns)
df.show()
+-----------+----------+
|customer_id|product_id|
+-----------+----------+
| 370| 154|
| 370| 40|
| 370| 173|
| 41| 55|
| 41| 126|
| 41| 121|
| 41| 321|
| 105| 22|
| 105| 55|
| 105| 133|
| 109| 22|
| 109| 55|
| 109| 133|
+-----------+----------+
# Create dataframe for FPGrowth model input
from pyspark.sql import functions as F
transactions = df.groupBy("customer_id")\
.agg(F.collect_set("product_id"))
transactions.show()
+-----------+-----------------------+
|customer_id|collect_set(product_id)|
+-----------+-----------------------+
| 370| [154, 173, 40]|
| 41| [321, 121, 126, 55]|
| 105| [133, 22, 55]|
| 109| [133, 22, 55]|
+-----------+-----------------------+
# FPGrowth model
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="collect_set(product_id)", minSupport=0.5, minConfidence=0.6
model_working = fpGrowth.fit(transactions)
# Display frequent itemsets
model_working.freqItemsets.show()
+-------------+----+
| items|freq|
+-------------+----+
| [55]| 3|
| [22]| 2|
| [22, 55]| 2|
| [133]| 2|
| [133, 22]| 2|
|[133, 22, 55]| 2|
| [133, 55]| 2|
+-------------+----+
# Display generated association rules.
model_working.associationRules.show()
# transform examines the input items against all the association rules and summarises the
# consequents as the prediction
model_working.transform(transactions).show()
+----------+----------+------------------+
|antecedent|consequent| confidence|
+----------+----------+------------------+
| [133]| [22]| 1.0|
| [133]| [55]| 1.0|
| [133, 55]| [22]| 1.0|
| [133, 22]| [55]| 1.0|
| [22]| [55]| 1.0|
| [22]| [133]| 1.0|
| [55]| [22]|0.6666666666666666|
| [55]| [133]|0.6666666666666666|
| [22, 55]| [133]| 1.0|
+----------+----------+------------------+
+-----------+-----------------------+----------+
|customer_id|collect_set(product_id)|prediction|
+-----------+-----------------------+----------+
| 370| [154, 173, 40]| []|
| 41| [321, 121, 126, 55]| [22, 133]|
| 105| [133, 22, 55]| []|
| 109| [133, 22, 55]| []|
+-----------+-----------------------+----------+