I have a Spark (2.4.0) data frame with a column that has just two values (either 0 or 1). I need to calculate the streak of consecutive 0s and 1s in this data, resetting the streak to zero if the value changes.
An example:
from pyspark.sql import (SparkSession, Window)
from pyspark.sql.functions import (to_date, row_number, lead, col)
spark = SparkSession.builder.appName('test').getOrCreate()
# Create dataframe
df = spark.createDataFrame([
('2018-01-01', 'John', 0, 0),
('2018-01-01', 'Paul', 1, 0),
('2018-01-08', 'Paul', 3, 1),
('2018-01-08', 'Pete', 4, 0),
('2018-01-08', 'John', 3, 0),
('2018-01-15', 'Mary', 6, 0),
('2018-01-15', 'Pete', 6, 0),
('2018-01-15', 'John', 6, 1),
('2018-01-15', 'Paul', 6, 1),
], ['str_date', 'name', 'value', 'flag'])
df.orderBy('name', 'str_date').show()
## +----------+----+-----+----+
## | str_date|name|value|flag|
## +----------+----+-----+----+
## |2018-01-01|John| 0| 0|
## |2018-01-08|John| 3| 0|
## |2018-01-15|John| 6| 1|
## |2018-01-15|Mary| 6| 0|
## |2018-01-01|Paul| 1| 0|
## |2018-01-08|Paul| 3| 1|
## |2018-01-15|Paul| 6| 1|
## |2018-01-08|Pete| 4| 0|
## |2018-01-15|Pete| 6| 0|
## +----------+----+-----+----+
With this data, I'd like to calculate the streak of consecutive zeros and ones, ordered by date and "windowed" by name:
# Expected result:
## +----------+----+-----+----+--------+--------+
## | str_date|name|value|flag|streak_0|streak_1|
## +----------+----+-----+----+--------+--------+
## |2018-01-01|John| 0| 0| 1| 0|
## |2018-01-08|John| 3| 0| 2| 0|
## |2018-01-15|John| 6| 1| 0| 1|
## |2018-01-15|Mary| 6| 0| 1| 0|
## |2018-01-01|Paul| 1| 0| 1| 0|
## |2018-01-08|Paul| 3| 1| 0| 1|
## |2018-01-15|Paul| 6| 1| 0| 2|
## |2018-01-08|Pete| 4| 0| 1| 0|
## |2018-01-15|Pete| 6| 0| 2| 0|
## +----------+----+-----+----+--------+--------+
Of course, I would need the streak to reset itself to zero if the 'flag' changes.
Is there a way of doing this?
This can be done with a difference-of-row-numbers approach: first group consecutive rows with the same flag value, then rank within each group.
from pyspark.sql import Window
from pyspark.sql import functions as f

# Window definitions
w1 = Window.partitionBy(df.name).orderBy(df.str_date)
w2 = Window.partitionBy(df.name, df.flag).orderBy(df.str_date)

res = df.withColumn('grp', f.row_number().over(w1) - f.row_number().over(w2))

# Window definition for the streak
w3 = Window.partitionBy(res.name, res.flag, res.grp).orderBy(res.str_date)

streak_res = res.withColumn('streak_0', f.when(res.flag == 1, 0).otherwise(f.row_number().over(w3))) \
                .withColumn('streak_1', f.when(res.flag == 0, 0).otherwise(f.row_number().over(w3)))
streak_res.show()
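To line the result up with the expected output above, you can optionally drop the helper grp column and sort (just for display, not part of the logic):
# Drop the helper column and order for readability
streak_res.drop('grp').orderBy('name', 'str_date').show()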
There is a more intuitive solution that avoids row_number() if you already have a natural ordering column (str_date in this case).
In short, to find the streak of 1's, take the cumulative sum of the flag and then multiply it by the flag.
To find the streak of 0's, invert the flag first and then do the same.
(Note that this shortcut only resets correctly when, within each name, the flag changes value at most once, as in the sample data above.)
First we define a function to calculate cumulative sum:
from pyspark.sql import Window
from pyspark.sql import functions as f
def cum_sum(df, new_col_name, partition_cols, order_col, value_col):
    windowval = (Window.partitionBy(partition_cols).orderBy(order_col)
                 .rowsBetween(Window.unboundedPreceding, 0))
    return df.withColumn(new_col_name, f.sum(value_col).over(windowval))
Note the use of rowsBetween (instead of rangeBetween). This is important to get the correct cumulative sum when there are duplicate values in the order column.
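As a quick illustration of the difference (a made-up mini example, assuming the spark session and the imports above): with a duplicate value in the order column, rangeBetween treats the tied rows as peers and gives them the same running total, while rowsBetween advances one row at a time.
# Two rows share '2018-01-08'; compare the two frame types
demo = spark.createDataFrame(
    [('a', '2018-01-01', 1), ('a', '2018-01-08', 1), ('a', '2018-01-08', 1)],
    ['name', 'str_date', 'flag'])
w_rows = (Window.partitionBy('name').orderBy('str_date')
          .rowsBetween(Window.unboundedPreceding, 0))
w_range = (Window.partitionBy('name').orderBy('str_date')
           .rangeBetween(Window.unboundedPreceding, 0))
demo.select('str_date',
            f.sum('flag').over(w_rows).alias('rows_sum'),   # 1, 2, 3
            f.sum('flag').over(w_range).alias('range_sum')  # 1, 3, 3
            ).show()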
Calculate streak of 1's
df = cum_sum(df,
new_col_name='1_group',
partition_cols='name',
order_col='str_date',
value_col='flag')
df = df.withColumn('streak_1', f.col('flag')*f.col('1_group'))
Calculate streak of 0's
df = df.withColumn('flag_inverted', 1-f.col('flag'))
df = cum_sum(df,
new_col_name='0_group',
partition_cols='name',
order_col='str_date',
value_col='flag_inverted')
df = df.withColumn('streak_0', f.col('flag_inverted')*f.col('0_group'))
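To compare against the expected output, select the relevant columns (the intermediate flag_inverted, 1_group and 0_group columns can be dropped):
df.select('str_date', 'name', 'value', 'flag', 'streak_0', 'streak_1') \
  .orderBy('name', 'str_date') \
  .show()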
Related
In this example, I have the following dataframe:
client_id rule_1 rule_2 rule_3 rule_4 rule_5
1 1 0 1 0 0
2 0 1 0 0 0
3 0 1 1 1 0
4 1 0 1 1 1
It shows the client_id and whether the client obeys each rule (1) or not (0).
How can I count the number of clients that obey or violate each rule, and show all of that information in one dataframe?
rule obeys count
rule_1 0 23852
rule_1 1 95102
rule_2 0 12942
rule_2 1 45884
rule_3 0 29319
rule_3 1 9238
rule_4 0 55321
rule_4 1 23013
rule_5 0 96842
rule_5 1 86739
The operation of moving column names into rows is called unpivoting. In Spark, it can be done with the stack function.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(1, 1, 0, 1, 0, 0),
(2, 0, 1, 0, 0, 0),
(3, 0, 1, 1, 1, 0),
(4, 1, 0, 1, 1, 1)],
["client_id", "rule_1", "rule_2", "rule_3", "rule_4", "rule_5"])
Script:
to_unpivot = [f"\'{c}\', `{c}`" for c in df.columns if c != "client_id"]
stack_str = ",".join(to_unpivot)
df = (df
.select(F.expr(f"stack({len(to_unpivot)}, {stack_str}) as (rule, obeys)"))
.groupBy("rule", "obeys")
.count()
)
df.show()
# +------+-----+-----+
# | rule|obeys|count|
# +------+-----+-----+
# |rule_1| 1| 2|
# |rule_2| 1| 2|
# |rule_1| 0| 2|
# |rule_3| 1| 3|
# |rule_2| 0| 2|
# |rule_4| 0| 2|
# |rule_3| 0| 1|
# |rule_5| 0| 3|
# |rule_5| 1| 1|
# |rule_4| 1| 2|
# +------+-----+-----+
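As an aside, Spark 3.4+ also has a built-in DataFrame.unpivot (alias melt) that avoids building the stack expression by hand. A minimal sketch, assuming the original dataframe (before it was overwritten above) is still available under the hypothetical name input_df:
# input_df: the original dataframe with client_id and rule_* columns (hypothetical name)
# unpivot(ids, values, variableColumnName, valueColumnName) -- Spark 3.4+
result = (input_df
    .unpivot("client_id",
             [c for c in input_df.columns if c != "client_id"],
             "rule", "obeys")
    .groupBy("rule", "obeys")
    .count())
result.show()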
We can transpose down the rule columns and take a count. Here's an example using the sample in your question.
from pyspark.sql import functions as func

# data_sdf is the input dataframe from the question
rule_cols = [k for k in data_sdf.columns if 'rule' in k]
# ['rule_1', 'rule_2', 'rule_3', 'rule_4', 'rule_5']

data_sdf. \
    withColumn('arr_rule_structs',
               func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('val')) for k in rule_cols])
               ). \
    selectExpr('client_id', 'inline(arr_rule_structs)'). \
    groupBy('key', 'val'). \
    agg(func.count('client_id').alias('cnt')). \
    orderBy('key', 'val'). \
    show()
# +------+---+---+
# | key|val|cnt|
# +------+---+---+
# |rule_1| 0| 2|
# |rule_1| 1| 2|
# |rule_2| 0| 2|
# |rule_2| 1| 2|
# |rule_3| 0| 1|
# |rule_3| 1| 3|
# |rule_4| 0| 2|
# |rule_4| 1| 2|
# |rule_5| 0| 3|
# |rule_5| 1| 1|
# +------+---+---+
Feel free to use your own field names within the struct instead of key and val.
I need to apply a when function on multiple columns. I want to check if at least one of the columns has a value greater than 0.
This is my solution:
df.withColumn("any value", F.when(
(col("col1") > 0) |
(col("col2") > 0) |
(col("col3") > 0) |
...
(col("colX") > 0)
, "any greater than 0").otherwise(None))
Is it possible to do the same task with a regex, so I don't have to write all the column names?
So let's create sample data:
df = spark.createDataFrame(
[(0, 0, 0, 0), (0, 0, 2, 0), (0, 0, 0, 0), (1, 0, 0, 0)],
['a', 'b', 'c', 'd']
)
Then, you can build your condition from a list of columns (say all the columns of the dataframe) using map and reduce like this:
from functools import reduce
from pyspark.sql import functions as F

cols = df.columns
condition = reduce(lambda a, b: a | b, map(lambda c: F.col(c) > 0, cols))
df.withColumn("any value", F.when(condition, "any greater than 0")).show()
which yields:
+---+---+---+---+------------------+
| a| b| c| d| any value|
+---+---+---+---+------------------+
| 0| 0| 0| 0| null|
| 0| 0| 2| 0|any greater than 0|
| 0| 0| 0| 0| null|
| 1| 0| 0| 0|any greater than 0|
+---+---+---+---+------------------+
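As a side note, the same condition can also be built without reduce by taking the row-wise maximum with F.greatest (a sketch on the same df):
# greatest() returns the row-wise maximum; if it is > 0, at least one column is > 0
df.withColumn(
    "any value",
    F.when(F.greatest(*df.columns) > 0, "any greater than 0")
).show()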
Another way to do this is to create an array of the columns, use forall (Spark 3.1+) to check it, and assign the value conditionally. Code below:
from pyspark.sql.functions import array, forall, when

df = df.withColumn('any value', array(df.columns)) \
       .withColumn('any value',
                   when(forall('any value', lambda x: x == 0), None)
                   .otherwise("any greater than 0"))
df.show()
+---+---+---+---+------------------+
| a| b| c| d| any value|
+---+---+---+---+------------------+
| 0| 0| 0| 0| null|
| 0| 0| 2| 0|any greater than 0|
| 0| 0| 0| 0| null|
| 1| 0| 0| 0|any greater than 0|
+---+---+---+---+------------------+
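An equivalent sketch with exists (another Spark 3.1+ higher-order function), assuming the original numeric columns a, b, c, d:
from pyspark.sql.functions import array, exists, when

value_cols = ['a', 'b', 'c', 'd']  # the columns to check
df.withColumn(
    'any value',
    when(exists(array(*value_cols), lambda x: x > 0), "any greater than 0")
).show()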
I have created a dataframe like this from a table:
from pyspark.sql.functions import collect_list

df = spark.sql("select * from test")  # it has two columns: id and name
df2 = df.groupby('id').agg(collect_list('name').alias('name'))
df2.show()
|id|name|
|44038:4572|[0032477212299451]|
|44038:5439|[00324772, 0032477, 003247, 00324]|
|44038:4429|[0032477212299308]|
Up to here it's correct: for one id I can store multiple names (values).
Now, when I try to create dynamic columns in the dataframe based on those values, it does not work:
df3 = df2.select([df2.id] + [df2.name[i] for i in range (length)])
Output:
|id |name[0]|
|44038:4572|0032477212299451|
|44038:5439|00324772|
|44038:4429|032477212299308|
Expected output in dataframe:
|id|name[0]|name[1]|name[2]|name[3]|
|44038:4572|0032477212299451|null|null|null|
|44038:5439|00324772|0032477|003247|0034|
|44038:4429|032477212299308|null|null|null|
And then I need to replace null with 0.
You might be better off doing pivot instead of collect_list:
from pyspark.sql import functions as F, Window
df2 = (df.withColumn('rn', F.row_number().over(Window.partitionBy('id').orderBy(F.desc('name'))))
.groupBy('id')
.pivot('rn')
.agg(F.first('name'))
.fillna("0")
)
df2.show()
+----------+----------------+-------+------+-----+
| id| 1| 2| 3| 4|
+----------+----------------+-------+------+-----+
|44038:4572|0032477212299451| 0| 0| 0|
|44038:5439| 00324772|0032477|003247|00324|
|44038:4429|0032477212299308| 0| 0| 0|
+----------+----------------+-------+------+-----+
If you want pretty column names, you can do
df3 = df2.toDF('id', *[f'name{i}' for i in range(len(df2.columns) - 1)])
df3.show()
+----------+----------------+-------+------+-----+
| id| name0| name1| name2|name3|
+----------+----------------+-------+------+-----+
|44038:4572|0032477212299451| 0| 0| 0|
|44038:5439| 00324772|0032477|003247|00324|
|44038:4429|0032477212299308| 0| 0| 0|
+----------+----------------+-------+------+-----+
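If you would rather stay with the collect_list output from the question, you can compute the maximum array length first and then index into the array; a sketch, assuming df2 is the collect_list result with columns id and name (an array):
from pyspark.sql import functions as F

max_len = df2.select(F.max(F.size('name'))).first()[0]  # longest name list
df3 = df2.select(
    'id',
    *[F.coalesce(F.col('name')[i], F.lit('0')).alias(f'name{i}') for i in range(max_len)])
df3.show()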
I have a dataframe in pyspark
id | value
1 0
1 1
1 0
2 1
2 0
3 0
3 0
3 1
I want to extract all the rows after the first occurrence of 1 in the value column within the same id group. I have created a Window partitioned by id, but I don't know how to get the rows that come after the first 1.
I'm expecting the result to be:
id | value
1 1
1 0
2 1
2 0
3 1
The solution below may be relevant here (it works perfectly for small data, but may cause problems on big data if the rows for one id end up on multiple partitions):
df = sqlContext.createDataFrame([
[1, 0],
[1, 1],
[1, 0],
[2, 1],
[2, 0],
[3, 0],
[3, 0],
[3, 1]
],
['id', 'Value']
)
df.show()
+---+-----+
| id|Value|
+---+-----+
| 1| 0|
| 1| 1|
| 1| 0|
| 2| 1|
| 2| 0|
| 3| 0|
| 3| 0|
| 3| 1|
+---+-----+
#importing Libraries
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
import sys
#This way we can generate a cumulative sum for values
df.withColumn(
"sum",
F.sum(
"value"
).over(W.partitionBy(["id"]).rowsBetween(-sys.maxsize, 0))
).show()
+---+-----+-----+
| id|Value|sum |
+---+-----+-----+
| 1| 0| 0|
| 1| 1| 1|
| 1| 0| 1|
| 3| 0| 0|
| 3| 0| 0|
| 3| 1| 1|
| 2| 1| 1|
| 2| 0| 1|
+---+-----+-----+
#Filter all those which are having sum > 0
df.withColumn(
"sum",
F.sum(
"value"
).over(W.partitionBy(["id"]).rowsBetween(-sys.maxsize, 0))
).where("sum > 0").show()
+---+-----+-----+
| id|Value|sum |
+---+-----+-----+
| 1| 1| 1|
| 1| 0| 1|
| 3| 1| 1|
| 2| 1| 1|
| 2| 0| 1|
+---+-----+-----+
Before running this, you must make sure that the data for each id sits on a single partition (no id should be split across two partitions).
Ideally, you would need to:
Create a window partitioned by id and ordered the same way the dataframe already is
Keep only the rows for which there is a "one" before them in the window
AFAIK, there is no lookup function within windows in Spark. Still, you can follow this idea and work something out. Let's first create the data and import functions and windows.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
l = [(1, 0), (1, 1), (1, 0), (2, 1), (2, 0), (3, 0), (3, 0), (3, 1)]
df = spark.createDataFrame(l, ['id', 'value'])
Then, let's add an index on the dataframe (it's free) to be able to order the windows.
indexedDf = df.withColumn("index", F.monotonically_increasing_id())
Then we create a window that only looks at the values before the current row, ordered by that index and partitioned by id.
w = Window.partitionBy("id").orderBy("index").rowsBetween(Window.unboundedPreceding, 0)
Finally, we use that window to collect the set of preceding values of each row, and filter out the ones that do not contain 1. Optionally, we order back by index because the windowing does not preserve the order by id column.
indexedDf\
.withColumn('set', F.collect_set(F.col('value')).over(w))\
.where(F.array_contains(F.col('set'), 1))\
.orderBy("index")\
.select("id", "value").show()
+---+-----+
| id|value|
+---+-----+
| 1| 1|
| 1| 0|
| 2| 1|
| 2| 0|
| 3| 1|
+---+-----+
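As a small variant, a running max over the same window marks every row at or after the first 1 and avoids collecting a set; a sketch reusing indexedDf and w from above:
# max('value') over the cumulative window becomes 1 from the first 1 onwards
indexedDf\
    .withColumn('seen_one', F.max('value').over(w))\
    .where(F.col('seen_one') == 1)\
    .orderBy('index')\
    .select('id', 'value').show()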
I have a dataframe in the format below:
+---+---+------+---+
| sp|sp2|colour|sp3|
+---+---+------+---+
| 0| 1| 1| 0|
| 1| 0| 0| 1|
| 0| 0| 1| 0|
+---+---+------+---+
Another dataframe contains the coefficients for each column of the first dataframe, for example:
+------+------+---------+------+
| CE_sp|CE_sp2|CE_colour|CE_sp3|
+------+------+---------+------+
| 0.94| 0.31| 0.11| 0.72|
+------+------+---------+------+
Now I want to add a column to the first dataframe that is calculated by applying the coefficients from the second dataframe, for example:
+---+---+------+---+-----+
| sp|sp2|colour|sp3|Score|
+---+---+------+---+-----+
| 0| 1| 1| 0| 0.42|
| 1| 0| 0| 1| 1.66|
| 0| 0| 1| 0| 0.11|
+---+---+------+---+-----+
i.e., for each row r of the first dataframe:
score = r(0)*CE_sp + r(1)*CE_sp2 + r(2)*CE_colour + r(3)*CE_sp3
There can be n number of columns and order of columns can be different.
Thanks in Advance!!!
Quick and simple:
import org.apache.spark.sql.functions.col
val df = Seq(
(0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)
).toDF("sp","sp2", "colour", "sp3")
val coefs = Map("sp" -> 0.94, "sp2" -> 0.31, "colour" -> 0.11, "sp3" -> 0.72)
val score = df.columns.map(
c => col(c) * coefs.getOrElse(c, 0.0)).reduce(_ + _)
df.withColumn("score", score)
And the same thing in PySpark:
from pyspark.sql.functions import col
df = sc.parallelize([
(0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)
]).toDF(["sp","sp2", "colour", "sp3"])
coefs = {"sp": 0.94, "sp2": 0.31, "colour": 0.11, "sp3": 0.72}
df.withColumn("score", sum(col(c) * coefs.get(c, 0) for c in df.columns))
I believe there are many ways to accomplish what you are trying to do. In any case, you don't need that second DataFrame, as I said in the comments.
Here is one way :
import org.apache.spark.ml.feature.{ElementwiseProduct, VectorAssembler}
import org.apache.spark.ml.linalg.{Vectors, Vector => MLVector}
val df = Seq((0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)).toDF("sp", "sp2", "colour", "sp3")
// Your coefficient represents a dense Vector
val coeffSp = 0.94
val coeffSp2 = 0.31
val coeffColour = 0.11
val coeffSp3 = 0.72
val weightVectors = Vectors.dense(Array(coeffSp, coeffSp2, coeffColour, coeffSp3))
// You can assemble the features with VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(df.columns) // since you need to compute on all your columns
.setOutputCol("features")
// Once these features assembled we can perform an element wise product with the weight vector
val output = assembler.transform(df)
val transformer = new ElementwiseProduct()
.setScalingVec(weightVectors)
.setInputCol("features")
.setOutputCol("weightedFeatures")
// Create a UDF to sum the values of the weighted vector
import org.apache.spark.sql.functions.udf
def score = udf((score: MLVector) => { score.toDense.toArray.sum })
// Apply the UDF on the weightedFeatures
val scores = transformer.transform(output).withColumn("score",score('weightedFeatures))
scores.show
// +---+---+------+---+-----------------+-------------------+-----+
// | sp|sp2|colour|sp3| features| weightedFeatures|score|
// +---+---+------+---+-----------------+-------------------+-----+
// | 0| 1| 1| 0|[0.0,1.0,1.0,0.0]|[0.0,0.31,0.11,0.0]| 0.42|
// | 1| 0| 0| 1|[1.0,0.0,0.0,1.0]|[0.94,0.0,0.0,0.72]| 1.66|
// | 0| 0| 1| 0| (4,[2],[1.0])| (4,[2],[0.11])| 0.11|
// +---+---+------+---+-----------------+-------------------+-----+
I hope this helps. Don't hesitate if you have more questions.
Here is a simple solution:
scala> df_wght.show
+-----+------+---------+------+
|ce_sp|ce_sp2|ce_colour|ce_sp3|
+-----+------+---------+------+
| 1| 2| 3| 4|
+-----+------+---------+------+
scala> df.show
+---+---+------+---+
| sp|sp2|colour|sp3|
+---+---+------+---+
| 0| 1| 1| 0|
| 1| 0| 0| 1|
| 0| 0| 1| 0|
+---+---+------+---+
Then we can just do a simple cross join and compute the dot product as a column expression.
val scored = df.join(df_wght).selectExpr("(sp*ce_sp + sp2*ce_sp2 + colour*ce_colour + sp3*ce_sp3) as final_score")
The output:
scala> scored.show
+-----------+
|final_score|
+-----------+
| 5|
| 5|
| 3|
+-----------+
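For reference, the same cross-join idea in PySpark (a sketch, assuming a df with the sp columns and a one-row df_wght with the ce_ coefficient columns):
from pyspark.sql import functions as F

scored = df.crossJoin(df_wght).withColumn(
    'final_score',
    F.col('sp') * F.col('ce_sp') + F.col('sp2') * F.col('ce_sp2')
    + F.col('colour') * F.col('ce_colour') + F.col('sp3') * F.col('ce_sp3'))
scored.select('final_score').show()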