Looping in Spark dataframes using python - apache-spark

I want to loop through a Spark DataFrame, check whether a condition (an aggregated value over multiple rows) is true or false, and then build a new DataFrame. Please see the code outline below; can you help me fix it? I'm pretty new to Spark and Python and am struggling my way through it, so any help is greatly appreciated.
# sort trades by Instrument and Date (in ascending order)
dfsorted = df.orderBy('Instrument', 'Date').show()
# new temp variable to keep track of the quantity sum
sumofquantity = 0
# for each row in dfsorted:
    sumofquantity = sumofquantity + dfsorted['Quantity']
    # keep appending the rows looped over so far into a new dataframe called dftemp
    dftemp = dfsorted  # (how do I write this?)
    if sumofquantity == 0:
        # once sumofquantity becomes zero, add a new column with a unique
        # sequential number to all the rows in dftemp,
        # and append those rows into the final dataframe
        dffinal = dftemp.withColumn('trade#', assign a unique trade number)
        # reset sumofquantity back to 0
        sumofquantity = 0
        # clear dftemp: how do I clear the dataframe so I can start with zero rows for the next iteration?
trade_sample.csv (raw input file):
Customer ID,Instrument,Action,Date,Price,Quantity
U16,ADM6,BUY,20160516,0.7337,2
U16,ADM6,SELL,20160516,0.7337,-1
U16,ADM6,SELL,20160516,0.9439,-1
U16,CLM6,BUY,20160516,48.09,1
U16,CLM6,SELL,20160517,48.08,-1
U16,ZSM6,BUY,20160517,48.09,1
U16,ZSM6,SELL,20160518,48.08,-1
Expected result (notice the last new column; that is all I'm trying to add):
Customer ID,Instrument,Action,Date,Price,Quantity,trade#
U16,ADM6,BUY,20160516,0.7337,2,10001
U16,ADM6,SELL,20160516,0.7337,-1,10001
U16,ADM6,SELL,20160516,0.9439,-1,10001
U16,CLM6,BUY,20160516,48.09,1,10002
U16,CLM6,SELL,20160517,48.08,-1,10002
U16,ZSM6,BUY,20160517,48.09,1,10003
U16,ZSM6,SELL,20160518,48.08,-1,10003

Looping this way is not good practice. You cannot accumulate values across rows of a DataFrame in a loop, and you cannot clear an immutable DataFrame. For your problem you can use Spark's window functions.
As far as I understand your problem, you want a running sum of Quantity for each Customer ID, and the running sum resets to zero when a trade completes. If so, you can partition by Customer ID, order by Instrument and Date, and compute the running sum within each partition. Once you have the running sum you can derive trade# from your conditions.
Just refer to the code below:
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import row_number, col, sum
>>> w = Window.partitionBy("Customer_ID").orderBy("Instrument", "Date")
>>> w1 = Window.partitionBy("Customer_ID").orderBy("Instrument", "Date", "rn")
>>> dftemp = df.withColumn("rn", row_number().over(w)) \
...            .withColumn("sumofquantity", sum("Quantity").over(w1)) \
...            .select("Customer_ID", "Instrument", "Action", "Date", "Price", "Quantity", "sumofquantity")
>>> dftemp.show()
+-----------+----------+------+--------+------+--------+-------------+
|Customer_ID|Instrument|Action|    Date| Price|Quantity|sumofquantity|
+-----------+----------+------+--------+------+--------+-------------+
|        U16|      ADM6|   BUY|20160516|0.7337|       2|            2|
|        U16|      ADM6|  SELL|20160516|0.7337|      -1|            1|
|        U16|      ADM6|  SELL|20160516|0.9439|      -1|            0|
|        U16|      CLM6|   BUY|20160516| 48.09|       1|            1|
|        U16|      CLM6|  SELL|20160517| 48.08|      -1|            0|
|        U16|      ZSM6|   BUY|20160517| 48.09|       1|            1|
|        U16|      ZSM6|  SELL|20160518| 48.08|      -1|            0|
+-----------+----------+------+--------+------+--------+-------------+
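To turn the running sum into the trade# column from your expected output, one option is to flag each row where sumofquantity drops back to zero and count how many such rows come strictly before the current one. This is only a sketch, not a tested recipe; the 10001 starting offset is taken from your expected result, and rn is kept here so that ties on (Instrument, Date) order deterministically:
>>> from pyspark.sql.functions import when, lit, coalesce
>>> # recompute, keeping rn this time
>>> dftemp2 = df.withColumn("rn", row_number().over(w)) \
...             .withColumn("sumofquantity", sum("Quantity").over(w1))
>>> # 1 on the row that closes a trade (running quantity back at zero), else 0
>>> closed = when(col("sumofquantity") == 0, 1).otherwise(0)
>>> # number of trades already closed strictly before the current row
>>> w2 = Window.partitionBy("Customer_ID").orderBy("Instrument", "Date", "rn") \
...            .rowsBetween(Window.unboundedPreceding, -1)
>>> dffinal = dftemp2.withColumn("trade#", lit(10001) + coalesce(sum(closed).over(w2), lit(0))) \
...                  .drop("rn", "sumofquantity")
>>> dffinal.show()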
You can refer to the window function documentation at the links below:
https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Related

pyspark value of column when other column has first nonmissing value

Suppose I have the following pyspark dataframe df:
id date var1 var2
1 1 NULL 2
1 2 b 3
2 1 a NULL
2 2 a 1
I want the first non-missing observation for all var* columns and, additionally, the value of date it came from, i.e. the final result should look like:
id var1 dt_var1 var2 dt_var2
1 b 2 2 1
2 a 1 1 2
Getting the values is straightforward using
df.orderBy(['id','date']).groupby('id').agg(
    *[F.first(x, ignorenulls=True).alias(x) for x in ['var1', 'var2']]
)
But I fail to see how I could get the respective dates. I could loop variable for variable, drop missing, and keep the first row. But this sounds like a poor solution that will not scale well, as it would require a separate dataframe for each variable.
I would prefer a solution that scales to many columns (var3, var4,...)
You should not use groupBy if you want the first non-null value according to date ordering. Row order is not guaranteed after a groupBy operation, even if you call orderBy just before it.
You need to use window functions instead. To get the date associated with each var value you can use this trick with structs:
from pyspark.sql import Window, functions as F
w = (Window.partitionBy("id").orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df1 = df.select(
    "id",
    *[F.first(
          F.when(F.col(x).isNotNull(), F.struct(x, F.col("date").alias(f"dt_{x}"))),
          ignorenulls=True
      ).over(w).alias(x)
      for x in ["var1", "var2"]]
).distinct().select("id", "var1.*", "var2.*")
df1.show()
df1.show()
#+---+----+-------+----+-------+
#| id|var1|dt_var1|var2|dt_var2|
#+---+----+-------+----+-------+
#| 1| b| 2| 2| 1|
#| 2| a| 1| 1| 2|
#+---+----+-------+----+-------+
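Since the column names are generated in a list comprehension, the same pattern scales to more columns; here is a sketch, where var3 and var4 are hypothetical extra columns assumed to exist in df:
var_cols = ["var1", "var2", "var3", "var4"]  # extend as needed (var3/var4 are assumed here)
df1 = df.select(
    "id",
    *[F.first(
          F.when(F.col(x).isNotNull(), F.struct(x, F.col("date").alias(f"dt_{x}"))),
          ignorenulls=True
      ).over(w).alias(x)
      for x in var_cols]
).distinct().select("id", *[f"{x}.*" for x in var_cols])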

Generate unique ID in a mutable pyspark data frame

I want to add a sequential unique ID to a data frame that is subject to change. By change I mean that more rows will be added tomorrow, after I generate the IDs today. When more rows are added, I want to look up the ID column that holds the already generated IDs and continue incrementing for the newly added data.
+-------+---------+---------+
|deal_id|deal_name|Unique_id|
+-------+---------+---------+
| 613760|ABCDEFGHI|        1|
| 613740|  TEST123|        2|
| 598946|      OMG|        3|
+-------+---------+---------+
Say I get more data tomorrow; I want to append it to this data frame, and the unique ID should continue from 4 onward.
+-------+---------+---------+
|deal_id|deal_name|Unique_id|
+-------+---------+---------+
| 613760|ABCDEFGHI|        1|
| 613740|  TEST123|        2|
| 598946|      OMG|        3|
| 591234|     OM21|        4|
| 988217|    Otres|        5|
.
.
.
Code Snippet
deals_df_final = deals_df.withColumn("Unique_id",F.monotonically_increasing_id())
But this didn't give sequential IDs.
I can try row_number or RDD zipWithIndex, but it looks like the dataframe will be immutable.
Any help, please? I want to be able to generate the IDs and also increment them as and when data is added.
Very brief note if it helps - I had the same problem, and the 2nd example in this post helped me: https://kb.databricks.com/sql/gen-unique-increasing-values.html
My current in-progress code:
from pyspark.sql import (
    SparkSession,
    functions as F,
    window as W
)
df_with_increasing_id = df.withColumn("monotonically_increasing_id", F.monotonically_increasing_id())
window = W.Window.orderBy(F.col('monotonically_increasing_id'))
df_with_consecutive_increasing_id = df_with_increasing_id.withColumn('increasing_id', F.row_number().over(window))
df = df_with_consecutive_increasing_id.drop('monotonically_increasing_id')
# now find the maximum value in the `increasing_id` column in the current dataframe before appending new
previous_max_id = df.agg({'increasing_id': 'max'}).collect()[0]
previous_max_id = previous_max_id['max(increasing_id)']
# CREATE NEW ROW HERE
# and then create new ids (same way as creating them originally)
# then union or vertically concatenate it with the old dataframe to get the combined one
df.withColumn("cnsecutiv_increase", F.col("increasing_id") + F.lit(previous_max_id)).show()

pyspark rdd map is not calling function

I am trying to do a few transformations on my RDD, and for that I am calling a function using map. However, this function is not getting invoked. Can someone please let me know what I am doing wrong here?
I can see the test function getting invoked, but not store_past_info:
def store_past_info(row):
    print "------------------- store_past_info ------------------------------"
    if row["transactiontype"] == "Return":
        global prv_transaction_number
        prv_transaction_number = row["transnumber"]
        global return_occured
        return_occured = True
        global group_id
        group_id.append(row["transnumber"])
    if row["transactiontype"] == "Purchase":
        if return_occured:
            global group_id
            group_id.append(prv_transaction_number)
        else:
            global group_id
            group_id.append(row["transnumber"])
    print group_id

def test(rdd):
    print "------------------- test ------------------------------"
    rdd.map(store_past_info).collect()
    print group_id
This is how it works in the store:
If an item is purchased, an id is generated.
If you want to return a few items from your purchase, two entries are made:
a Return entry with a new id covering all the returned products, where org_id is the id of the purchase order you are returning from;
a new Purchase entry with the same id as your last purchase id, for the things you want to keep.
Input
Date Type Id org_id
25-03-2018 Purchase 111
25-03-2018 Purchase 112
26-03-2018 Return 113 111
26-03-2018 Purchase 111
Output
I want to add a new column group_id, which shows the same id for a Return and the corresponding Purchase that happens after the return (the customer doesn't make this purchase; this is how the system keeps an entry for every return, as described above).
Date Type Id org_id group_id
25-03-2018 Purchase 111 111
25-03-2018 Purchase 112 112
26-03-2018 Return 113 111 113
26-03-2018 Purchase 111 113
IIUC, I believe you can get your output using DataFrames, a pyspark.sql.Window function, and crossJoin()
First convert your rdd to a DataFrame using
df = rdd.toDF() # you may have to specify the column names
df.show()
#+----------+--------+---+------+
#| Date| Type| Id|org_id|
#+----------+--------+---+------+
#|25-03-2018|Purchase|111| null|
#|25-03-2018|Purchase|112| null|
#|26-03-2018| Return|113| 111|
#|26-03-2018|Purchase|111| null|
#+----------+--------+---+------+
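For example, if the RDD rows are plain tuples, the column names from the table above could be passed explicitly (just a sketch):
df = rdd.toDF(["Date", "Type", "Id", "org_id"])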
Then we will need to add an Index column to keep track of the order of the rows. We can use pyspark.sql.functions.monotonically_increasing_id(). This will guarantee that the values will be increasing (so they can be ordered), but does not mean that they will be sequential.
import pyspark.sql.functions as f
df = df.withColumn('Index', f.monotonically_increasing_id())
df.show()
#+----------+--------+---+------+-----------+
#| Date| Type| Id|org_id| Index|
#+----------+--------+---+------+-----------+
#|25-03-2018|Purchase|111| null| 8589934592|
#|25-03-2018|Purchase|112| null|17179869184|
#|26-03-2018| Return|113| 111|34359738368|
#|26-03-2018|Purchase|111| null|42949672960|
#+----------+--------+---+------+-----------+
The ordering is important because you want to look for rows that come after a Return.
Next use crossJoin to join the DataFrame to itself.
Since this returns the Cartesian product, we will filter it to just the rows that meet either of the following conditions:
l.Index = r.Index (essentially join a row to itself)
(l.Id = r.org_id) AND (l.Index > r.Index) (an Id is equal to an org_id from an earlier row; this is where the Index column is helpful)
Then we add a column for group_id and set it equal to r.Id if the second condition is met. Otherwise we set this column to None.
df1 = df.alias('l').crossJoin(df.alias('r'))\
    .where('(l.Index = r.Index) OR ((l.Id = r.org_id) AND (l.Index > r.Index))')\
    .select(
        'l.Index',
        'l.Date',
        'l.Type',
        'l.Id',
        'l.org_id',
        f.when(
            (f.col('l.Id') == f.col('r.org_id')) & (f.col('l.Index') > f.col('r.Index')),
            f.col('r.Id')
        ).otherwise(f.lit(None)).alias('group_id')
    )
df1.show()
df1.show()
#+-----------+----------+--------+---+------+--------+
#| Index| Date| Type| Id|org_id|group_id|
#+-----------+----------+--------+---+------+--------+
#| 8589934592|25-03-2018|Purchase|111| null| null|
#|17179869184|25-03-2018|Purchase|112| null| null|
#|34359738368|26-03-2018| Return|113| 111| null|
#|42949672960|26-03-2018|Purchase|111| null| 113|
#|42949672960|26-03-2018|Purchase|111| null| null|
#+-----------+----------+--------+---+------+--------+
We are almost there but as you can see there are two things that still need to be done.
We need to eliminate the duplicate row for Index = 42949672960
We need to fill in the group_id for rows where it is null using the value from Id.
For the first step, we will use a Window function to create a temporary column called rowNum. This will be the pyspark.sql.functions.row_number() for each Index ordered by the boolean condition group_id IS NULL.
For the Index values where there are multiple rows, the one where the group_id has already been set will sort first. Thus we just need to select the rows where the rowNum is equal to 1 (row_number() starts at 1, not 0).
After this is done, the second step is trivial- just replace the remaining null values with the value from Id.
from pyspark.sql import Window
w = Window.partitionBy(f.col('Index')).orderBy(f.isnull('group_id'))
df2 = df1.withColumn('rowNum', f.row_number().over(w))\
    .where(f.col('rowNum') == 1)\
    .sort('Index')\
    .select(
        'Date',
        'Type',
        'Id',
        'org_id',
        f.when(
            f.isnull('group_id'),
            f.col('Id')
        ).otherwise(f.col('group_id')).alias('group_id')
    )
df2.show()
df2.show()
#+----------+--------+---+------+--------+
#| Date| Type| Id|org_id|group_id|
#+----------+--------+---+------+--------+
#|25-03-2018|Purchase|111| null| 111|
#|25-03-2018|Purchase|112| null| 112|
#|26-03-2018| Return|113| 111| 113|
#|26-03-2018|Purchase|111| null| 113|
#+----------+--------+---+------+--------+

How to find max value Alphabet from DataFrame apache spark?

I am trying to get the max-value letter of the alphabet from a dataframe as a whole. I am not interested in what row or column it came from; I am just interested in a single max value within the dataframe.
This is what it looks like:
id conditionName
1 C
2 b
3 A
4 A
5 A
expected result is:
+---+-------------+
| id|conditionName|
+---+-------------+
|  3|            A|
|  4|            A|
|  5|            A|
+---+-------------+
because 'A' is the first letter of the alphabet
df= df.withColumn("conditionName", col("conditionName").cast("String"))
.groupBy("id,conditionName").max("conditionName");
df.show(false);
Exception: "conditionName" is not a numeric column. Aggregation function can only be applied on a numeric column.;
I need the max alphabet character from the entire dataframe.
What should I use to get the desired result?
Thanks in advance!
You can sort your DataFrame by your string column, grab the first value and use it to filter your original data:
from pyspark.sql.functions import lower, first
# we need lower() because ordering strings is case sensitive
first_letter = df.orderBy(lower(df["conditionName"])) \
    .groupBy() \
    .agg(first("conditionName").alias("conditionName")) \
    .collect()[0][0]
df.filter(df["conditionName"] == first_letter).show()
#+---+-------------+
#| id|conditionName|
#+---+-------------+
#|  3|            A|
#|  4|            A|
#|  5|            A|
#+---+-------------+
Or more elegantly using Spark SQL:
df.registerTempTable("table")
sqlContext.sql("""
    SELECT *
    FROM table
    WHERE lower(conditionName) = (SELECT min(lower(conditionName))
                                  FROM table)
""").show()

how to index categorical features in another way when using spark ml

The VectorIndexer in Spark indexes categorical features according to the frequency of the values. But I want to index the categorical features in a different way.
For example, with the dataset below, "a", "b", "c" will be indexed as 0, 1, 2 if I use the VectorIndexer in Spark. But I want to index them according to the label.
There are 4 rows labelled 1, and among them 3 rows have feature 'a' and 1 row has feature 'c'. So here I want to index 'a' as 0, 'c' as 1 and 'b' as 2.
Is there any convenient way to implement this?
label|feature
-----------------
1 | a
1 | c
0 | a
0 | b
1 | a
0 | b
0 | b
0 | c
1 | a
If I understand your question correctly, you are looking to replicate the behaviour of StringIndexer() on grouped data. To do so (in PySpark), we first define a udf that will operate on a list column containing all the values per group. Note that elements with equal counts will be ordered arbitrarily.
from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType
def encoder(col):
    # Generate count per letter
    x = Counter(col)
    # Create a dictionary, mapping each letter to its rank
    ranking = {pair[0]: rank
               for rank, pair in enumerate(x.most_common())}
    # Use dictionary to replace letters by rank
    new_list = [ranking[i] for i in col]
    return new_list
encoder_udf = udf(encoder, ArrayType(IntegerType()))
Now we can aggregate the feature column into a list grouped by the label column using collect_list(), and apply our udf row-wise:
from pyspark.sql.functions import collect_list, explode
df1 = (df.groupBy("label")
         .agg(collect_list("feature").alias("features"))
         .withColumn("index", encoder_udf("features")))
Consequently, you can explode the index column to get the encoded values instead of the letters:
df1.select("label", explode(df1.index).alias("index")).show()
+-----+-----+
|label|index|
+-----+-----+
| 0| 1|
| 0| 0|
| 0| 0|
| 0| 0|
| 0| 2|
| 1| 0|
| 1| 1|
| 1| 0|
| 1| 0|
+-----+-----+
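If you also want to see which letter maps to which index within each label, the same Counter logic can return the mapping itself. This is a hypothetical sketch; ranking_map and ranking_udf are new names, not part of the answer above:
from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType, IntegerType
def ranking_map(col):
    # letter -> rank, most frequent letter first
    x = Counter(col)
    return {pair[0]: rank for rank, pair in enumerate(x.most_common())}
ranking_udf = udf(ranking_map, MapType(StringType(), IntegerType()))
df1.select("label", ranking_udf("features").alias("mapping")).show(truncate=False)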
