I have data like the example input below. Using PySpark, I'm trying to add a new column that holds the category of each customer's first event, based on the timestamp, as in the example output below.
Below is a SQL window function that I think would accomplish this.
I'm pretty new to PySpark. I understand you can run SQL inside PySpark, and I'm wondering whether my code below is correct, that is, whether I can just paste the SQL inside spark.sql as I have done.
Input:
eventid customerid category timestamp
1 3 a 1/1/12
2 3 b 2/3/14
4 2 c 4/1/12
Output:
eventid customerid category timestamp first_event
1 3 a 1/1/12 a
2 3 b 2/3/14 a
4 2 c 4/1/12 c
window function example:
select eventid, customerid, category, timestamp,
  FIRST_VALUE(category) over (partition by customerid order by timestamp) as first_event
from table
PySpark:
# implementing the window function example with PySpark
# Note: assume df is a DataFrame with the structure of the table above
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Operations").getOrCreate()

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("table")

sql_results = spark.sql("""
    select eventid, customerid, category, timestamp,
      FIRST_VALUE(category) over (partition by customerid order by timestamp) as first_event
    from table
""")

# display the results
sql_results.show()
You can use window functions with the PySpark DataFrame API as well:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.window import Window
>>>
>>> df.show()
+-------+----------+--------+---------+
|eventid|customerid|category|timestamp|
+-------+----------+--------+---------+
| 1| 3| a| 1/1/12|
| 2| 3| b| 2/3/14|
| 4| 2| c| 4/1/12|
+-------+----------+--------+---------+
>>> # order by timestamp so first() deterministically picks the earliest event per customer
>>> window = Window.partitionBy('customerid').orderBy('timestamp')
>>> df = df.withColumn('first_event', F.first('category').over(window))
>>>
>>> df.show()
+-------+----------+--------+---------+-----------+
|eventid|customerid|category|timestamp|first_event|
+-------+----------+--------+---------+-----------+
| 1| 3| a| 1/1/12| a|
| 2| 3| b| 2/3/14| a|
| 4| 2| c| 4/1/12| c|
+-------+----------+--------+---------+-----------+
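To answer the original question directly: yes, once the typos are fixed you can register the DataFrame as a temporary view and paste the SQL into spark.sql. A minimal sketch, assuming df is the DataFrame shown above and using a hypothetical view name events:

df.createOrReplaceTempView("events")
spark.sql("""
    select eventid, customerid, category, timestamp,
           FIRST_VALUE(category) over (partition by customerid order by timestamp) as first_event
    from events
""").show()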
I am using spark dataframes.
The task is to calculate and display, in descending order, the number of cities per country, grouped by country and region.
Initial data:
from pyspark.sql.functions import col
from pyspark.sql.functions import count
df = spark.read.json("/content/world-cities.json")
df.printSchema()
df.show()
Desired result: the count of cities grouped by country and subcountry, sorted in descending order.
I can only get grouping by the country column. How do I add grouping by the second column, subcountry?
df.groupBy(col('country')).agg(count("*").alias("cnt"))\
.orderBy(col('cnt').desc())\
.show()
If I understand you correctly, you just need to add the second column to your groupBy:
import pyspark.sql.functions as F
x = [("USA","usa-subcountry", "usa-city"),("USA","usa-subcountry", "usa-city-2"),("USA","usa-subcountry-2", "usa-city"), ("Argentina","argentina-subcountry", "argentina-city")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])
df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("cnt"))\
.orderBy(F.col('cnt').desc())\
.show()
Output is:
+---------+--------------------+---+
| country| subcountry|cnt|
+---------+--------------------+---+
| USA| usa-subcountry| 2|
| USA| usa-subcountry-2| 1|
|Argentina|argentina-subcountry| 1|
+---------+--------------------+---+
Edit: another attempt, based on the comment:
import pyspark.sql.functions as F
x = [("USA","usa-subcountry", "usa-city"),
("USA","usa-subcountry", "usa-city-2"),
("USA","usa-subcountry", "usa-city-3"),
("USA","usa-subcountry-2", "usa-city"),
("Argentina","argentina-subcountry", "argentina-city"),
("Argentina","argentina-subcountry-2", "argentina-city-2"),
("UK","UK-subcountry", "UK-city-1")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])
df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("city_count"))\
.groupBy(F.col('country')).agg(F.count("*").alias("subcountry_count"), F.sum('city_count').alias("city_count"))\
.orderBy(F.col('city_count').desc())\
.show()
output:
+---------+----------------+----------+
| country|subcountry_count|city_count|
+---------+----------------+----------+
| USA| 2| 4|
|Argentina| 2| 2|
| UK| 1| 1|
+---------+----------------+----------+
I am assuming that cities and subcountries are unique; if not, you may want to use countDistinct instead of count.
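For example, a minimal sketch of the countDistinct variant, assuming the same columns as above, which counts distinct city names per (country, subcountry) instead of raw rows:

import pyspark.sql.functions as F

df.groupBy('country', 'subcountry')\
    .agg(F.countDistinct('city').alias('city_count'))\
    .orderBy(F.col('city_count').desc())\
    .show()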
I have a pyspark DF of ids and purchases which I'm trying to transform for use with FP growth.
Currently I have multiple rows for a given id, with each row relating to a single purchase.
I'd like to transform this dataframe into a form with two columns: one for id (with a single row per id) and a second column containing the list of distinct purchases for that id.
I've tried to use a User Defined Function (UDF) to map the distinct purchases onto the distinct ids, but I get "py4j.Py4JException: Method __getstate__([]) does not exist". Thanks to @Mithril,
I see that "You can't use the sparkSession object, spark DataFrame objects or other Spark distributed objects in udf and pandas_udf, because they can't be pickled."
So I've implemented the TERRIBLE approach below (which will work but is not scalable):
import pandas as pd
import pyspark.sql.functions as f

# Let's create some fake transactions
customers = [1, 2, 3, 1, 1]
purschases = ['cake', 'tea', 'beer', 'fruit', 'cake']

# Let's create a Spark DF to capture the transactions
transactions = zip(customers, purschases)
spk_df_1 = spark.createDataFrame(list(transactions), ["id", "item"])

# Let's have a look at the resulting Spark dataframe
spk_df_1.show()

# Let's capture the ids and the list of their distinct purchases in a
# list of tuples
nums1 = []

# Register a temp view so we can query the distinct ids with SQL
spk_df_1.createOrReplaceTempView("TBLdf")

# For each distinct id, let's get the list of their distinct purchases
for cust_id in spark.sql("SELECT DISTINCT id FROM TBLdf").rdd.map(lambda row: row[0]).collect():
    purchase = spk_df_1.filter(f.col("id") == cust_id).select("item").distinct()\
        .rdd.map(lambda row: row[0]).collect()
    nums1.append((cust_id, purchase))

# Let's see what our list of transaction tuples looks like
print(nums1)
print("\n")

# Let's turn the list of transaction tuples into a pandas dataframe
df_pd = pd.DataFrame(nums1)

# Finally, let's turn our pandas dataframe into a PySpark dataframe
df2 = spark.createDataFrame(df_pd)
df2.show()
Output:
+---+-----+
| id| item|
+---+-----+
| 1| cake|
| 2| tea|
| 3| beer|
| 1|fruit|
| 1| cake|
+---+-----+
[(1, ['fruit', 'cake']), (3, ['beer']), (2, ['tea'])]
+---+-------------+
| 0| 1|
+---+-------------+
| 1|[fruit, cake]|
| 3| [beer]|
| 2| [tea]|
+---+-------------+
If anybody has any suggestions I'd greatly appreciate it.
That is a task for collect_set, which creates a set of items without duplicates:
import pyspark.sql.functions as F
#Lets create some fake transactions
customers = [1,2,3,1,1]
purschases = ['cake','tea','beer','fruit','cake']
# Lets create a spark DF to capture the transactions
transactions = zip(customers,purschases)
spk_df_1 = spark.createDataFrame(list(transactions) , ["id", "item"])
spk_df_1.show()
spk_df_1.groupby('id').agg(F.collect_set('item')).show()
Output:
+---+-----+
| id| item|
+---+-----+
| 1| cake|
| 2| tea|
| 3| beer|
| 1|fruit|
| 1| cake|
+---+-----+
+---+-----------------+
| id|collect_set(item)|
+---+-----------------+
| 1| [fruit, cake]|
| 3| [beer]|
| 2| [tea]|
+---+-----------------+
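Since the end goal is FP-growth, the grouped frame can be fed to pyspark.ml.fpm.FPGrowth; a minimal sketch, where the items alias and the minSupport/minConfidence values are illustrative choices:

import pyspark.sql.functions as F
from pyspark.ml.fpm import FPGrowth

baskets = spk_df_1.groupby('id').agg(F.collect_set('item').alias('items'))

# minSupport/minConfidence are placeholder values, tune them for your data
fp = FPGrowth(itemsCol='items', minSupport=0.2, minConfidence=0.5)
model = fp.fit(baskets)
model.freqItemsets.show()
model.associationRules.show()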
I have two dataframes that I need to join on one column, keeping only the rows from the first dataframe whose id is contained in the same column of the second dataframe:
df1:
id a b
2 1 1
3 0.5 1
4 1 2
5 2 1
df2:
id c d
2 fs a
5 fa f
Desired output:
df:
id a b
2 1 1
5 2 1
I have tried df1.join(df2("id"), "left"), but it gives me the error: 'DataFrame' object is not callable.
df2("id") is not a valid python syntax for selecting columns, you'd either need df2[["id"]] or use select df2.select("id"); For your example, you can do:
df1.join(df2.select("id"), "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
or:
df1.join(df2[["id"]], "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
If you only need to check whether id exists in df2 and don't need any columns from df2 in your output, then isin() is the more efficient solution (this is similar to EXISTS and IN in SQL).
df1 = spark.createDataFrame([(2,1,1) ,(3,5,1,),(4,1,2),(5,2,1)], "id: Int, a : Int , b : Int")
df2 = spark.createDataFrame([(2,'fs','a') ,(5,'fa','f')], ['id','c','d'])
Collect df2.id as a list and pass it to isin():
from pyspark.sql.functions import col
df2_list = df2.select('id').rdd.map(lambda row : row[0]).collect()
df1.where(col('id').isin(df2_list)).show()
#+---+---+---+
#| id| a| b|
#+---+---+---+
#| 2| 1| 1|
#| 5| 2| 1|
#+---+---+---+
It is recommended to use isin() if:
You don't need to return data from the reference dataframe/table
You have duplicates in the reference dataframe/table (a JOIN can produce duplicate rows if values are repeated)
You just want to check the existence of a particular value
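Note that collecting df2.id to the driver only scales while the reference set is small. A left semi join expresses the same EXISTS-style filter entirely on the cluster; a minimal sketch with the same df1/df2:

# keep rows of df1 whose id appears in df2, without returning any df2 columns
df1.join(df2, on="id", how="left_semi").show()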
I have data that looks like this:
userid,eventtime,location_point
4e191908,2017-06-04 03:00:00,18685891
4e191908,2017-06-04 03:04:00,18685891
3136afcb,2017-06-04 03:03:00,18382821
661212dd,2017-06-04 03:06:00,80831484
40e8a7c3,2017-06-04 03:12:00,18825769
I would like to add a new boolean column that is true when there are 2 or more userids within a 5-minute window at the same location_point. My idea was to use the lag function to look over a window partitioned by userid, with a range between the current timestamp and the next 5 minutes:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col
days = lambda i: i * 60*5
windowSpec = W.partitionBy(col("userid")).orderBy(col("eventtime").cast("timestamp").cast("long")).rangeBetween(0, days(5))
last_location_point = F.lag(col("location_point"), 1).over(windowSpec)
visitCheck = (last_location_point == output.location_point)
output.withColumn("visit_check", visitCheck).select("userid", "eventtime", "location_point", "visit_check")
This code is giving me an analysis exception when I use the RangeBetween function:
AnalysisException: u'Window Frame RANGE BETWEEN CURRENT ROW AND 1500
FOLLOWING must match the required frame ROWS BETWEEN 1 PRECEDING AND 1
PRECEDING;
Do you know any way to tackle this problem?
Given your data:
Let's add a column with a timestamp in seconds:
df = df.withColumn('timestamp', df.eventtime.cast('timestamp').cast('long'))
df.show()
+--------+-------------------+--------------+----------+
| userid| eventtime|location_point| timestamp|
+--------+-------------------+--------------+----------+
|4e191908|2017-06-04 03:00:00| 18685891|1496545200|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560|
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890|
+--------+-------------------+--------------+----------+
Now, let's define a window function with a partition by location_point, an order by timestamp, and a range between -300 seconds and the current row. We can count the number of elements in this window and put the result in a column named 'occurrences_in_5_min':
from pyspark.sql.window import Window
import pyspark.sql.functions as F

w = Window.partitionBy('location_point').orderBy('timestamp').rangeBetween(-60*5, 0)
df = df.withColumn('occurrences_in_5_min', F.count('timestamp').over(w))
df.show()
+--------+-------------------+--------------+----------+--------------------+
| userid| eventtime|location_point| timestamp|occurrences_in_5_min|
+--------+-------------------+--------------+----------+--------------------+
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920| 1|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380| 1|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560| 1|
|4e191908|2017-06-04 03:00:00| 18685891|1496545200| 1|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440| 2|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890| 1|
+--------+-------------------+--------------+----------+--------------------+
Now you can add the desired column, which is True when the number of occurrences is strictly greater than 1 in the last 5 minutes at a particular location:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

add_bool = udf(lambda c: True if c > 1 else False, BooleanType())
df = df.withColumn('already_occured', add_bool('occurrences_in_5_min'))
df.show()
+--------+-------------------+--------------+----------+--------------------+---------------+
| userid| eventtime|location_point| timestamp|occurrences_in_5_min|already_occured|
+--------+-------------------+--------------+----------+--------------------+---------------+
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920| 1| false|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380| 1| false|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560| 1| false|
|4e191908|2017-06-04 03:00:00| 18685891|1496545200| 1| false|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440| 2| true|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890| 1| false|
+--------+-------------------+--------------+----------+--------------------+---------------+
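As an aside, the UDF is not strictly needed here; a plain column expression produces the same boolean flag and avoids Python serialization overhead. A minimal sketch with the same column names:

from pyspark.sql import functions as F

# same flag without a Python UDF
df = df.withColumn('already_occured', F.col('occurrences_in_5_min') > 1)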
rangeBetween just doesn't make sense for a non-aggregate function like lag. lag always takes a specific row, denoted by the offset argument, so specifying a frame is pointless.
To get a window over time series you can use window grouping with standard aggregates:
from pyspark.sql.functions import window, countDistinct
(df
.groupBy("location_point", window("eventtime", "5 minutes"))
.agg( countDistinct("userid")))
You can add more arguments to modify slide duration.
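For example, a sliding-window sketch (a 5-minute window that slides every minute; both durations are illustrative):

from pyspark.sql.functions import window, countDistinct

(df
    .groupBy("location_point", window("eventtime", "5 minutes", "1 minute"))
    .agg(countDistinct("userid").alias("distinct_users")))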
You can try something similar with window functions if you partition by location_point:

windowSpec = (W.partitionBy(col("location_point"))
    .orderBy(col("eventtime").cast("timestamp").cast("long"))
    .rangeBetween(0, 5 * 60))

# distinct aggregates are not supported inside a window frame,
# so count distinct userids with collect_set + size instead of countDistinct
df.withColumn("id_count", F.size(F.collect_set("userid").over(windowSpec)))
I have 2 dataframes in Spark, train and test, with a categorical column in both, say Product_ID. What I want to do is put a -1 value for those categories which are in test but not present in train.
So for that I first found the distinct categories for that column in p_not_in_test, but I am not able to proceed further. How do I do that?
p_not_in_test = test.select('Product_ID').subtract(train.select('Product_ID'))
p_not_in_test = p_not_in_test.distinct()
Regards
Here's a reproducible example, first we create dummy data:
test = sc.parallelize([("ID1", 1,5),("ID2", 2,4),
("ID3", 5,8),("ID4", 9,0),
("ID5", 0,3)]).toDF(["PRODUCT_ID", "val1", "val2"])
train = sc.parallelize([("ID1", 4,7),("ID3", 1,4),
("ID5", 9,2)]).toDF(["PRODUCT_ID", "val1", "val2"])
Now we need to extend your definition of p_not_in_test so we get a list as an output:
p_not_in_test = (test.select('PRODUCT_ID')
.subtract(train.select('PRODUCT_ID'))
.rdd.map(lambda x: x[0]).collect())
Finally, we can create a UDF that will add "-1" in front of each ID that is not present in train.
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
addString = udf(lambda x: '-1 ' + x if x in p_not_in_test else x, StringType())
test.withColumn("NEW_ID",addString(test["PRODUCT_ID"])).show()
+----------+----+----+------+
|PRODUCT_ID|val1|val2|NEW_ID|
+----------+----+----+------+
| ID1| 1| 5| ID1|
| ID2| 2| 4|-1 ID2|
| ID3| 5| 8| ID3|
| ID4| 9| 0|-1 ID4|
| ID5| 0| 3| ID5|
+----------+----+----+------+
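A UDF-free variant of the same idea, using when/otherwise with isin (same column names and p_not_in_test list as above):

from pyspark.sql import functions as F

test.withColumn(
    "NEW_ID",
    F.when(F.col("PRODUCT_ID").isin(p_not_in_test),
           F.concat(F.lit("-1 "), F.col("PRODUCT_ID")))
     .otherwise(F.col("PRODUCT_ID"))
).show()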