I am trying to do a few transformations on my RDD, and for that I am calling a function using map. However, the function does not seem to get invoked. Could someone please let me know what I am doing wrong here?
I can see the test function getting invoked, but not store_past_info:
def store_past_info(row):
    print "------------------- store_past_info ------------------------------"
    if row["transactiontype"] == "Return":
        global prv_transaction_number
        prv_transaction_number = row["transnumber"]
        global return_occured
        return_occured = True
        global group_id
        group_id.append(row["transnumber"])
    if row["transactiontype"] == "Purchase":
        if return_occured:
            global group_id
            group_id.append(prv_transaction_number)
        else:
            global group_id
            group_id.append(row["transnumber"])
    print group_id

def test(rdd):
    print "------------------- test ------------------------------"
    rdd.map(store_past_info).collect()
    print group_id
This is how it works in the store:
1. If some item is purchased, an id is generated.
2. If you want to return a few items from your purchase, two entries are made:
2.1. A Return entry with a new id for the return of all the products, with org_id set to the id of the purchase order you want to return.
2.2. A new Purchase entry with the same id as your last purchase, for the things you want to keep.
Input
Date        Type      Id   org_id
25-03-2018  Purchase  111
25-03-2018  Purchase  112
26-03-2018  Return    113  111
26-03-2018  Purchase  111
Output
I want to add a new column group_id, which should show the same id for a Return and the corresponding Purchase that happens after the return (the customer doesn't actually make this purchase; it is how the system keeps an entry for every return, as in step 2.1 above).
Date        Type      Id   org_id  group_id
25-03-2018  Purchase  111          111
25-03-2018  Purchase  112          112
26-03-2018  Return    113  111     113
26-03-2018  Purchase  111          113
IIUC, I believe you can get your output using DataFrames, a pyspark.sql.Window function, and crossJoin()
First convert your rdd to a DataFrame using
df = rdd.toDF() # you may have to specify the column names
df.show()
#+----------+--------+---+------+
#| Date| Type| Id|org_id|
#+----------+--------+---+------+
#|25-03-2018|Purchase|111| null|
#|25-03-2018|Purchase|112| null|
#|26-03-2018| Return|113| 111|
#|26-03-2018|Purchase|111| null|
#+----------+--------+---+------+
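For example, if the rdd contains plain tuples without field names, one way (using the column names from your sample data) is to pass them explicitly:
df = rdd.toDF(['Date', 'Type', 'Id', 'org_id'])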
Then we will need to add an Index column to keep track of the order of the rows. We can use pyspark.sql.functions.monotonically_increasing_id(). This will guarantee that the values will be increasing (so they can be ordered), but does not mean that they will be sequential.
import pyspark.sql.functions as f
df = df.withColumn('Index', f.monotonically_increasing_id())
df.show()
#+----------+--------+---+------+-----------+
#| Date| Type| Id|org_id| Index|
#+----------+--------+---+------+-----------+
#|25-03-2018|Purchase|111| null| 8589934592|
#|25-03-2018|Purchase|112| null|17179869184|
#|26-03-2018| Return|113| 111|34359738368|
#|26-03-2018|Purchase|111| null|42949672960|
#+----------+--------+---+------+-----------+
The ordering is important because you want to look for rows that come after a Return.
Next use crossJoin to join the DataFrame to itself.
Since this returns the Cartesian product, we will filter it to just the rows that meet either of the following conditions:
l.Index = r.Index (essentially join a row to itself)
(l.Id = r.org_id) AND (l.Index > r.Index) (an Id is equal to an org_id from an earlier row; this is where the Index column is helpful)
Then we add a column for group_id and set it equal to r.Id if the second condition is met. Otherwise we set this column to None.
df1 = df.alias('l').crossJoin(df.alias('r'))\
    .where('(l.Index = r.Index) OR ((l.Id = r.org_id) AND (l.Index > r.Index))')\
    .select(
        'l.Index',
        'l.Date',
        'l.Type',
        'l.Id',
        'l.org_id',
        f.when(
            (f.col('l.Id') == f.col('r.org_id')) & (f.col('l.Index') > f.col('r.Index')),
            f.col('r.Id')
        ).otherwise(f.lit(None)).alias('group_id')
    )
df1.show()
#+-----------+----------+--------+---+------+--------+
#| Index| Date| Type| Id|org_id|group_id|
#+-----------+----------+--------+---+------+--------+
#| 8589934592|25-03-2018|Purchase|111| null| null|
#|17179869184|25-03-2018|Purchase|112| null| null|
#|34359738368|26-03-2018| Return|113| 111| null|
#|42949672960|26-03-2018|Purchase|111| null| 113|
#|42949672960|26-03-2018|Purchase|111| null| null|
#+-----------+----------+--------+---+------+--------+
We are almost there, but as you can see there are two things that still need to be done:
We need to eliminate the duplicate row for Index = 42949672960
We need to fill in the group_id for rows where it is null using the value from Id.
For the first step, we will use a Window function to create a temporary column called rowNum. This will be the pyspark.sql.functions.row_number() for each Index ordered by the boolean condition group_id IS NULL.
For the Index values where there are multiple rows, the one where the group_id has already been set will sort first. Thus we just need to select the rows where the rowNum is equal to 1 (row_number() starts at 1, not 0).
After this is done, the second step is trivial: just replace the remaining null values with the value from Id.
from pyspark.sql import Window

w = Window.partitionBy(f.col('Index')).orderBy(f.isnull('group_id'))
df2 = df1.withColumn('rowNum', f.row_number().over(w))\
    .where(f.col('rowNum') == 1)\
    .sort('Index')\
    .select(
        'Date',
        'Type',
        'Id',
        'org_id',
        f.when(
            f.isnull('group_id'),
            f.col('Id')
        ).otherwise(f.col('group_id')).alias('group_id')
    )
df2.show()
#+----------+--------+---+------+--------+
#| Date| Type| Id|org_id|group_id|
#+----------+--------+---+------+--------+
#|25-03-2018|Purchase|111| null| 111|
#|25-03-2018|Purchase|112| null| 112|
#|26-03-2018| Return|113| 111| 113|
#|26-03-2018|Purchase|111| null| 113|
#+----------+--------+---+------+--------+
Related
Suppose I have the following pyspark dataframe df:
id date var1 var2
1 1 NULL 2
1 2 b 3
2 1 a NULL
2 2 a 1
I want the first non-missing observation for all var* columns and, additionally, the value of date it comes from, i.e. the final result should look like:
id var1 dt_var1 var2 dt_var2
1 b 2 2 1
2 a 1 1 2
Getting the values is straightforward using
df.orderBy(['id', 'date']).groupby('id').agg(
    *[F.first(x, ignorenulls=True).alias(x) for x in ['var1', 'var2']]
)
But I fail to see how I could get the respective dates. I could loop over the variables one by one, drop missing values, and keep the first row. But this sounds like a poor solution that will not scale well, as it would require a separate dataframe for each variable.
I would prefer a solution that scales to many columns (var3, var4,...)
You should not use groupby if you want to get the first non-null according to date ordering. The order is not guaranteed after a groupby operation, even if you call orderBy just before.
You need to use window functions instead. To get the date associated with each var value you can use this trick with structs:
from pyspark.sql import Window, functions as F

w = (Window.partitionBy("id").orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df1 = df.select(
    "id",
    *[F.first(
        F.when(F.col(x).isNotNull(), F.struct(x, F.col("date").alias(f"dt_{x}"))),
        ignorenulls=True
    ).over(w).alias(x)
      for x in ["var1", "var2"]]
).distinct().select("id", "var1.*", "var2.*")
df1.show()
#+---+----+-------+----+-------+
#| id|var1|dt_var1|var2|dt_var2|
#+---+----+-------+----+-------+
#| 1| b| 2| 2| 1|
#| 2| a| 1| 1| 2|
#+---+----+-------+----+-------+
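Since the columns are handled in a list comprehension, this scales to more columns (var3, var4, ...) just by extending the list. A sketch, reusing w and F from above and assuming the extra columns actually exist in df:
cols = ["var1", "var2", "var3"]  # "var3" is hypothetical; list every var* column you have

df1 = df.select(
    "id",
    *[F.first(
        F.when(F.col(x).isNotNull(), F.struct(x, F.col("date").alias(f"dt_{x}"))),
        ignorenulls=True
    ).over(w).alias(x) for x in cols]
).distinct().select("id", *[f"{x}.*" for x in cols])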
I have a dataframe with an id column that defines groups. For each id, I want to pair its elements in the following way:
title id
sal 1
summer 1
fada 1
row 2
winter 2
gole 2
jack 3
noway 3
output
title id pair
sal 1 None
summer 1 summer,sal
fada 1 fada,summer
row 2 None
winter 2 winter, row
gole 2 gole,winter
jack 3 None
noway 3 noway,jack
As you can see in the output, each element is paired with the element above it within its id group. Since the first element of a group does not have a pair, I put None. I should also mention that this can be done in pandas with the following code, but I need PySpark code since my data is big.
df=data.assign(pair=data.groupby('id')['title'].apply(lambda x: x.str.cat(x.shift(1),sep=',')))
I can't emphasise enough that a Spark dataframe is an unordered collection of rows, so saying something like "the element above it" is undefined without a column to order by. You can fake an ordering using F.monotonically_increasing_id(), but I'm not sure if that's what you want.
from pyspark.sql import functions as F, Window

w = Window.partitionBy('id').orderBy(F.monotonically_increasing_id())
df2 = df.withColumn(
    'pair',
    F.when(
        F.lag('title').over(w).isNotNull(),
        F.concat_ws(',', 'title', F.lag('title').over(w))
    )
)
df2.show()
+------+---+-----------+
| title| id| pair|
+------+---+-----------+
| sal| 1| null|
|summer| 1| summer,sal|
| fada| 1|fada,summer|
| jack| 3| null|
| noway| 3| noway,jack|
| row| 2| null|
|winter| 2| winter,row|
| gole| 2|gole,winter|
+------+---+-----------+
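If your data does have a column that encodes the real order within each group (say a hypothetical seq column), it is safer to order the window by that instead of the fake id:
w = Window.partitionBy('id').orderBy('seq')  # 'seq' is a hypothetical column holding the real row order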
I wrote the spark logic below.
High level:
The code loops through some data, pulls some records back in batches, applies some logic to those records and appends the output to another table created at run time. The job completes successfully but the table is empty.
Detailed:
The code should create a spark data frame with 3 names.
For each name, the code constructs a query using the name as a filter condition, applies some logic to the returned data, and stores the result in a new Spark data frame (output_spark_df). This data frame is then registered as a temp table, and spark.sql is used to insert the results into my_database.my_results, so my_database.my_results should have data loaded into it 3 times. Despite the job completing successfully, my_database.my_results remains empty.
Any guidance would be greatly appreciated.
if __name__ == "__main__":
    spark = SparkSession.builder.appName('batch_job')\
        .config("spark.kryoserializer.buffer.max", "2047mb")\
        .config("spark.sql.broadcastTimeout", "-1")\
        .config("spark.sql.autoBroadcastJoinThreshold", "-1")\
        .getOrCreate()

    # Set up hive table to capture results
    #-------------------------------------
    spark.sql("DROP TABLE IF EXISTS my_database.my_results")
    spark.sql("CREATE TABLE IF NOT EXISTS my_database.my_results (field1 STRING, field2 INT) STORED AS PARQUET")

    names = spark.sql("select distinct name from my_database.my_input where name IN ('mike','jane','ryan')")

    for n in names:
        input_spark_df = spark.sql("select * from my_database.my_input where name = '{}'".format(n))
        .
        .
        .
        <APPLY LOGIC>
        .
        .
        .
        output_spark_df = <logic applied>

        # Capture output and append to pre-created hive table
        #----------------------------------------------------
        output_spark_df.registerTempTable("results")
        spark.sql("INSERT INTO TABLE my_database.my_results SELECT * FROM results")

    spark.stop()
names is still a dataframe in your code: you are looping over the dataframe itself, which results in no matching records inside your for loop.
To make the names variable a list, we need to do flatMap and collect on the underlying rdd to create a list, and then loop over that list.
Fix:
# create names list
names = spark.sql("select distinct id as id from default.i").\
    rdd.\
    flatMap(lambda z: z).\
    collect()

# to print values in the list
for n in names:
    print(n)
Example with sample data:
#sample data
spark.sql("select distinct id as id from default.i").show()
#+---+
#| id|
#+---+
#| 1|
#| 2|
#| 3|
#+---+
#creating a list
names = spark.sql("select distinct id as id from default.i").rdd.flatMap(lambda z: z).collect()

#looping over the list
for n in names:
    spark.sql("select * from default.i where id = '{}'".format(n)).show()
#result
#+---+
#| id|
#+---+
#| 1|
#+---+
#
#+---+
#| id|
#+---+
#| 2|
#+---+
#
#+---+
#| id|
#+---+
#| 3|
#+---+
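Applied to your job, a minimal sketch of the fix is to collect the names into a Python list before the loop; the rest of your code stays as it is:
names = spark.sql("select distinct name from my_database.my_input where name IN ('mike','jane','ryan')").\
    rdd.\
    flatMap(lambda z: z).\
    collect()

for n in names:
    input_spark_df = spark.sql("select * from my_database.my_input where name = '{}'".format(n))
    # ... <APPLY LOGIC>, registerTempTable and the INSERT, exactly as in your code ...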
I want to loop through a Spark dataframe, check whether a condition (an aggregated value over multiple rows) is true or false, and then create a dataframe. Please see the code outline below; can you help fix the code? I'm pretty new to Spark and Python and struggling my way through it, so any help is greatly appreciated.
# sort trades by Instrument and Date (in asc order)
dfsorted = df.orderBy('Instrument', 'Date').show()

# new temp variable to keep track of the quantity sum
sumofquantity = 0

# for each row in dfsorted:
sumofquantity = sumofquantity + dfsorted['Quantity']
# keep appending the rows looped through so far into a new dataframe called dftemp
dftemp = dfsorted  # (how do I write this?)
if sumofquantity == 0:
    # once sumofquantity becomes zero, add a new column with a unique sequential number
    # to all the rows in the temp view, and append those rows to the final dataframe
    dffinal = dftemp.withColumn('trade#', <assign a unique trade number>)
    # reset sumofquantity back to 0
    sumofquantity = 0
    # clear dftemp - how do I clear the dataframe so I can start with zero rows for the next iteration?
trade_sample.csv ( raw input file)
Customer ID,Instrument,Action,Date,Price,Quantity
U16,ADM6,BUY,20160516,0.7337,2
U16,ADM6,SELL,20160516,0.7337,-1
U16,ADM6,SELL,20160516,0.9439,-1
U16,CLM6,BUY,20160516,48.09,1
U16,CLM6,SELL,20160517,48.08,-1
U16,ZSM6,BUY,20160517,48.09,1
U16,ZSM6,SELL,20160518,48.08,-1
Expected Result ( notice last new column-that is all that I'm trying to add)
Customer ID,Instrument,Action,Date,Price,Quantity,trade#
U16,ADM6,BUY,20160516,0.7337,2,10001
U16,ADM6,SELL,20160516,0.7337,-1,10001
U16,ADM6,SELL,20160516,0.9439,-1,10001
U16,CLM6,BUY,20160516,48.09,1,10002
U16,CLM6,SELL,20160517,48.08,-1,10002
U16,ZSM6,BUY,20160517,48.09,1,10003
U16,ZSM6,SELL,20160518,48.08,-1,10003
Looping in such a way is not good practice. You cannot add to or sum a dataframe cumulatively and clear an immutable dataframe. For your problem you can use the Spark windowing concept.
As far as I understand your problem, you want to calculate a running sum of Quantity for each Customer ID, and reset sumofquantity to zero once the sum completes for one Customer ID. If so, you can partition by Customer ID, order by Instrument and Date, and calculate the running sum for each Customer ID. Once you have the sum, you can derive trade# from your conditions, as sketched after the output below.
Just refer to the code below:
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import row_number,col,sum
>>> w = Window.partitionBy("Customer ID").orderBy("Instrument","Date")
>>> w1 = Window.partitionBy("Customer ID").orderBy("Instrument","Date","rn")
>>> dftemp = Df.withColumn("rn", (row_number().over(w))).withColumn("sumofquantity", sum("Quantity").over(w1)).select("Customer_ID","Instrument","Action","Date","Price","Quantity","sumofquantity")
>>> dftemp.show()
+-----------+----------+------+--------+------+--------+-------------+
|Customer_ID|Instrument|Action| Date| Price|Quantity|sumofquantity|
+-----------+----------+------+--------+------+--------+-------------+
| U16| ADM6| BUY|20160516|0.7337| 2| 2|
| U16| ADM6| SELL|20160516|0.7337| -1| 1|
| U16| ADM6| SELL|20160516|0.9439| -1| 0|
| U16| CLM6| BUY|20160516| 48.09| 1| 1|
| U16| CLM6| SELL|20160517| 48.08| -1| 0|
| U16| ZSM6| BUY|20160517| 48.09| 1| 1|
| U16| ZSM6| SELL|20160518| 48.08| -1| 0|
+-----------+----------+------+--------+------+--------+-------------+
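From here, one way to derive trade# is to flag the first row of each trade and take a running count of those flags. This is only a sketch: it assumes the rn column is also kept in dftemp (so the ordering stays deterministic) and that a new trade starts wherever there is no previous row or the previous running sum is zero:
from pyspark.sql.window import Window
import pyspark.sql.functions as F

w1 = Window.partitionBy("Customer_ID").orderBy("Instrument", "Date", "rn")

# flag the first row of each trade: no previous row, or the previous running sum reached zero
flagged = dftemp.withColumn(
    "new_trade",
    F.when(
        F.lag("sumofquantity").over(w1).isNull() | (F.lag("sumofquantity").over(w1) == 0),
        1
    ).otherwise(0)
)

# a running count of the flags gives a sequential trade number; 10000 is just the offset from your expected result
dffinal = flagged.withColumn("trade#", F.lit(10000) + F.sum("new_trade").over(w1))\
    .drop("new_trade", "rn", "sumofquantity")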
You can refer to window functions at the links below:
https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
I have data that looks like this:
userid,eventtime,location_point
4e191908,2017-06-04 03:00:00,18685891
4e191908,2017-06-04 03:04:00,18685891
3136afcb,2017-06-04 03:03:00,18382821
661212dd,2017-06-04 03:06:00,80831484
40e8a7c3,2017-06-04 03:12:00,18825769
I would like to add a new boolean column that is true if there are 2 or more userid within a 5-minute window at the same location_point. I had the idea of using the lag function to look up over a window partitioned by userid, with the range between the current timestamp and the next 5 minutes:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col
days = lambda i: i * 60*5
windowSpec = W.partitionBy(col("userid")).orderBy(col("eventtime").cast("timestamp").cast("long")).rangeBetween(0, days(5))
lastURN = F.lag(col("location_point"), 1).over(windowSpec)
visitCheck = (last_location_point == output.location_pont)
output.withColumn("visit_check", visitCheck).select("userid","eventtime", "location_pont", "visit_check")
This code is giving me an analysis exception when I use the rangeBetween function:
AnalysisException: u'Window Frame RANGE BETWEEN CURRENT ROW AND 1500
FOLLOWING must match the required frame ROWS BETWEEN 1 PRECEDING AND 1
PRECEDING;
Do you know any way to tackle this problem?
Given your data:
Let's add a column with a timestamp in seconds:
df = df.withColumn('timestamp', df.eventtime.astype('Timestamp').cast("long"))
df.show()
+--------+-------------------+--------------+----------+
| userid| eventtime|location_point| timestamp|
+--------+-------------------+--------------+----------+
|4e191908|2017-06-04 03:00:00| 18685891|1496545200|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560|
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890|
+--------+-------------------+--------------+----------+
Now, let's define a window function with a partition by location_point, an order by timestamp, and a range between -300 s and the current time. We can count the number of elements in this window and put this count in a column named 'occurrences_in_5_min':
from pyspark.sql import Window, functions as F

w = Window.partitionBy('location_point').orderBy('timestamp').rangeBetween(-60*5, 0)
df = df.withColumn('occurrences_in_5_min', F.count('timestamp').over(w))
df.show()
+--------+-------------------+--------------+----------+--------------------+
| userid| eventtime|location_point| timestamp|occurrences_in_5_min|
+--------+-------------------+--------------+----------+--------------------+
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920| 1|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380| 1|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560| 1|
|4e191908|2017-06-04 03:00:00| 18685891|1496545200| 1|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440| 2|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890| 1|
+--------+-------------------+--------------+----------+--------------------+
Now you can add the desired column, which is True if the number of occurrences is strictly greater than 1 in the last 5 minutes at a particular location:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

add_bool = udf(lambda col: True if col > 1 else False, BooleanType())
df = df.withColumn('already_occured', add_bool('occurrences_in_5_min'))
df.show()
+--------+-------------------+--------------+----------+--------------------+---------------+
| userid| eventtime|location_point| timestamp|occurrences_in_5_min|already_occured|
+--------+-------------------+--------------+----------+--------------------+---------------+
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920| 1| false|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380| 1| false|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560| 1| false|
|4e191908|2017-06-04 03:00:00| 18685891|1496545200| 1| false|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440| 2| true|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890| 1| false|
+--------+-------------------+--------------+----------+--------------------+---------------+
rangeBetween just doesn't make sense for a non-aggregate function like lag. lag always takes a specific row, denoted by the offset argument, so specifying a frame is pointless.
To get a window over a time series you can use window grouping with standard aggregates:
from pyspark.sql.functions import window, countDistinct
(df
    .groupBy("location_point", window("eventtime", "5 minutes"))
    .agg(countDistinct("userid")))
You can add more arguments to modify slide duration.
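For example, a sketch of a 5-minute window that slides every minute (the slide duration is the third argument to window), so each event is counted against every overlapping window:
(df
    .groupBy("location_point", window("eventtime", "5 minutes", "1 minute"))
    .agg(countDistinct("userid")))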
You can try something similar with window functions if you partition by location_point:
windowSpec = (W.partitionBy(col("location_point"))
    .orderBy(col("eventtime").cast("timestamp").cast("long"))
    .rangeBetween(0, days(5)))

df.withColumn("id_count", countDistinct("userid").over(windowSpec))