I have a Spark dataframe like the input below. It has a date column "dates" and an int column "qty". I would like to create a new column "daysout" that has the difference in days between the current date value and the first date of the most recent consecutive stretch where qty = 0. I've provided example input and output below. Any tips are greatly appreciated.
input df:
dates qty
2020-04-01 1
2020-04-02 0
2020-04-03 0
2020-04-04 3
2020-04-05 0
2020-04-06 7
output:
dates qty daysout
2020-04-01 1 0
2020-04-02 0 0
2020-04-03 0 1
2020-04-04 3 2
2020-04-05 0 0
2020-04-06 7 1
Here is a possible approach: flag each row where the current qty is 0 and the lagged qty is not 0, take a running sum of that flag over the window to form group ids, then use those groups as the partition for a row number and subtract 1 to get your desired result:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# running window over the whole frame, ordered by date
w = Window().partitionBy().orderBy(F.col("dates"))
# cumulative sum that increments each time qty drops to 0 -> group id for each stretch
w1 = F.sum(F.when((F.col("qty") == 0) & (F.lag("qty").over(w) != 0), 1).otherwise(0)).over(w)
# window per group, ordered by date
w2 = Window.partitionBy(w1).orderBy("dates")
df.withColumn("daysout", F.row_number().over(w2) - 1).show()
+----------+---+-------+
| dates|qty|daysout|
+----------+---+-------+
|2020-04-01| 1| 0|
|2020-04-02| 0| 0|
|2020-04-03| 0| 1|
|2020-04-04| 3| 2|
|2020-04-05| 0| 0|
|2020-04-06| 7| 1|
+----------+---+-------+
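If the dates can have calendar gaps, row_number counts rows rather than days. A small sketch of a variant under that assumption (my own addition), reusing the same grouping column w1 but taking the calendar difference from each group's first date:

# Sketch only: anchor each group at its earliest date; for strictly
# consecutive dates this equals row_number() - 1.
days_variant = df.withColumn(
    "daysout",
    F.datediff(F.col("dates"), F.min("dates").over(Window.partitionBy(w1)))
)
days_variant.show()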
I have two columns (such as):

from | to
1    | 2
1    | 3
2    | 4
4    | 2
4    | 2
4    | 3
3    | 3
And I want to create a transition matrix (where the values in each column sum to 1):

      1.    2.    3.    4.
1.    0     0     0     0
2.    0.5*  0     0     2/3
3.    0.5   0.5   1     1/3
4.    0     0.5   0     0

*where 1 -> 2 would be: (the number of times 1 (in 'from') is next to 2 (in 'to')) / (total times 1 points to any value).
You can create this kind of transition matrix using a window and pivot.
First some dummy data:
import pandas as pd
import numpy as np
np.random.seed(42)
x = np.random.randint(1,5,100)
y = np.random.randint(1,5,100)
df = spark.createDataFrame(pd.DataFrame({'from': x, 'to': y}))
df.show()
+----+---+
|from| to|
+----+---+
| 3| 3|
| 4| 2|
| 1| 2|
...
To create a pct column, first group the data by unique combinations of from/to and get the counts. With that aggregated dataframe, create a new column, pct, that uses the Window to find the total number of records for each from group, which is used as the denominator.
Lastly, pivot the table to make the to values columns and the pct data the values of the matrix.
from pyspark.sql import functions as F, Window
w = Window().partitionBy('from')
grp = df.groupBy('from', 'to').count().withColumn('pct', F.col('count') / F.sum('count').over(w))
res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2))
res.show()
+----+----+----+----+----+
|from| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 0.2| 0.2|0.25|0.35|
| 2|0.27|0.31|0.19|0.23|
| 3|0.46|0.17|0.21|0.17|
| 4|0.13|0.13| 0.5|0.23|
+----+----+----+----+----+
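Note that the random dummy data happens to contain every from/to combination, so no cell comes out null. With sparse data like the original example, you can pin the pivot values and fill the missing transitions with 0; a small sketch of that adjustment (my own addition, not part of the original answer):

# pass the expected 'to' values so the columns always appear in a fixed order,
# then replace combinations that never occur (null after the pivot) with 0
res = grp.groupBy('from').pivot('to', [1, 2, 3, 4]).agg(F.round(F.first('pct'), 2)).fillna(0)
res.show()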
I want to calculate the row-wise mean, subtract the mean from each value of the row, and get the maximum at the end.
here is my dataframe
col1 | col2 | col3
0 | 2 | 3
4 | 2 | 3
1 | 0 | 3
0 | 0 | 0
df = df.withColumn("mean_value", sum(col(x) for x in df.columns[0:3]) / 3)
I can calculate the row-wise mean with this line of code, but I want to subtract the mean from each value of the row and get the maximum value of the row after the subtraction.
Required results:
col1 | col2 | col3 | mean_value | max_difference_value
0    | 2    | 3    | 1.66       | 1.34
4    | 2    | 3    | 3.0        | 1.0
1    | 0    | 3    | 1.33       | 1.67
1    | 0    | 1    | 0.66       | 0.66
Note, this is the main formula: abs(mean - column value).max()
Using greatest and a list comprehension:
from pyspark.sql import functions as func

# sample data matching the question (data_ls was not shown in the original answer)
data_ls = [(0, 2, 3), (4, 2, 3), (1, 0, 3), (0, 0, 0)]

spark.sparkContext.parallelize(data_ls).toDF(['col1', 'col2', 'col3']). \
    withColumn('mean_value', sum(func.col(x) for x in ['col1', 'col2', 'col3']) / 3). \
    withColumn('max_diff_val',
               func.greatest(*[func.abs(func.col(x) - func.col('mean_value')) for x in ['col1', 'col2', 'col3']])
               ). \
    show()
# +----+----+----+------------------+------------------+
# |col1|col2|col3| mean_value| max_diff_val|
# +----+----+----+------------------+------------------+
# | 0| 2| 3|1.6666666666666667|1.6666666666666667|
# | 4| 2| 3| 3.0| 1.0|
# | 1| 0| 3|1.3333333333333333|1.6666666666666667|
# | 0| 0| 0| 0.0| 0.0|
# +----+----+----+------------------+------------------+
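With many more columns, an array-based variant can be tidier. A sketch of that idea (my own assumption; needs Spark 3.1+ for aggregate/transform with Python lambdas):

cols = ['col1', 'col2', 'col3']
df_arr = (spark.sparkContext.parallelize(data_ls).toDF(cols)
          .withColumn('arr', func.array(*[func.col(c) for c in cols]))
          # sum the array elements and divide by the number of columns
          .withColumn('mean_value', func.aggregate('arr', func.lit(0.0), lambda acc, x: acc + x) / len(cols))
          # largest absolute deviation from the row mean
          .withColumn('max_diff_val', func.array_max(func.transform('arr', lambda x: func.abs(x - func.col('mean_value')))))
          .drop('arr'))
df_arr.show()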
Have you tried UDFs?
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf, col
import numpy as np

@udf(returnType=FloatType())
def udf_mean(col1, col2, col3):
    # FloatType expects a plain Python float, so convert the numpy result
    return float(np.mean([col1, col2, col3]))

df = df.withColumn("mean_value", udf_mean(col("col1"), col("col2"), col("col3")))
You can take a similar approach for the max difference value.
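For completeness, a minimal sketch of the analogous UDF (the name udf_max_diff is hypothetical, not from the original answer):

@udf(returnType=FloatType())
def udf_max_diff(col1, col2, col3):
    # absolute deviation of each value from the row mean, then the largest one
    vals = np.array([col1, col2, col3], dtype=float)
    return float(np.max(np.abs(vals - vals.mean())))

df = df.withColumn("max_diff_val", udf_max_diff(col("col1"), col("col2"), col("col3")))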
I have a date and want to create new attributes based on the date. For example:
date                    quarter1  quarter2  quarter3  quarter4
2/3/2020 (mm/dd/yyyy)   1         0         0         0
11/11/2020              0         0         0         1
You can convert the date column with to_date and use the quarter function, then apply a when + otherwise condition to create the columns:
from pyspark.sql import functions as F
qtrs = ['quarter1','quarter2','quarter3','quarter4']
df = df.select("*",F.concat(F.lit("quarter"),
F.quarter(F.to_date("date",'M/d/yyyy'))).alias("quarters"))\
.select("*",*[F.when(F.col("quarters")==col,1).otherwise(0).alias(col) for col in qtrs])\
.drop("quarters")
df.show()
+----------+--------+--------+--------+--------+
| date|quarter1|quarter2|quarter3|quarter4|
+----------+--------+--------+--------+--------+
| 2/3/2020| 1| 0| 0| 0|
|11/11/2020| 0| 0| 0| 1|
+----------+--------+--------+--------+--------+
Per OP's request, adding approach with withColumn:
df = (df.withColumn("quarters",F.concat(F.lit("quarter"),
F.quarter(F.to_date("date",'M/d/yyyy'))))
.withColumn("quarter1",F.when(F.col("quarters")=='quarter1',1).otherwise(0))
.withColumn("quarter2",F.when(F.col("quarters")=='quarter2',1).otherwise(0))
.withColumn("quarter3",F.when(F.col("quarters")=='quarter3',1).otherwise(0))
.withColumn("quarter4",F.when(F.col("quarters")=='quarter4',1).otherwise(0))
.drop("quarters")
)
df.show()
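A more compact variant of the same idea (just a sketch): compare the quarter number directly instead of building the intermediate "quarters" string column:

q = F.quarter(F.to_date("date", 'M/d/yyyy'))
# (q == i) is a boolean column; casting to int gives the 0/1 flags
df = df.select("date", *[(q == i).cast("int").alias(f"quarter{i}") for i in range(1, 5)])
df.show()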
I want to compare an Input DataFrame with a Main DataFrame and return the Point value of the matching row to the input data.
Consider the example below
Input DataFrame

A | B | C
1 | 0 | 1
0 | 0 | 0
1 | 1 | 1
0 | 1 | 1
Main DataFrame

A | B | C | Point
1 | 1 | 1 | P1
1 | 0 | 1 | P2
After comparing the Input with the Main DataFrame, the result should be like below.

Output DataFrame

A | B | C | Point
1 | 0 | 1 | P2
0 | 0 | 0 | NA
1 | 1 | 1 | P1
0 | 1 | 1 | NA
You can use a left join:
from pyspark.sql import functions as F
result_df = input_df.join(main_df, ["A", "B", "C"], "left") \
.withColumn("Point", F.coalesce(F.col("Point"), F.lit("NA")))
result_df.show()
#+---+---+---+-----+
#| A| B| C|Point|
#+---+---+---+-----+
#| 0| 0| 0| NA|
#| 1| 0| 1| P2|
#| 1| 1| 1| P1|
#| 0| 1| 1| NA|
#+---+---+---+-----+
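If main_df could ever contain duplicate (A, B, C) keys, the left join would multiply the matching input rows. A defensive sketch (my own addition, not part of the original answer) that deduplicates first and fills the non-matches in one step:

result_df = input_df.join(main_df.dropDuplicates(["A", "B", "C"]), ["A", "B", "C"], "left") \
    .fillna("NA", subset=["Point"])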
I have a table like this.
Id prod val
1 0 0
2 0 0
3 1 1000
4 0 0
5 1 2000
6 0 0
7 0 0
I want to add a new column new_val. The condition for this column is: if prod = 0, then new_val should take the val from the next row where prod = 1.
If prod = 1 it should have the same value as the val column. How do I achieve this using Spark SQL?
Id prod val new_val
1 0 0 1000
2 0 0 1000
3 1 1000 1000
4 0 0 2000
5 1 2000 2000
6 1 4000 4000
7 1 3000 3000
Any help is greatly appreciated
You can use something like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window().orderBy("id")
df = df.withColumn("new_val", F.when(F.col("prod") == 0, F.lag("val").over(w)).otherwise(F.col("val")))
What we are basically doing is using an if-else condition:
When prod == 0, take the lag of val, which is the value of the previous row (over a window ordered by the id column); and if prod == 1, we use the present value of the column.
You can achieve that with:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, when}

val w = Window.orderBy("id").rowsBetween(0, Window.unboundedFollowing)
df
  .withColumn("new_val", when($"prod" === 0, null).otherwise($"val"))
  .withColumn("new_val", first("new_val", ignoreNulls = true).over(w))
It first creates a new column that is null wherever prod is 0 (the rows whose value needs to be filled):
+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
| 1| 0| 0| null|
| 2| 0| 0| null|
| 3| 1|1000| 1000|
| 4| 0| 0| null|
| 5| 1|2000| 2000|
| 6| 1|4000| 4000|
| 7| 1|3000| 3000|
+---+----+----+-------+
Then it replaces the nulls with the first non-null value among the following rows:
+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
| 1| 0| 0| 1000|
| 2| 0| 0| 1000|
| 3| 1|1000| 1000|
| 4| 0| 0| 2000|
| 5| 1|2000| 2000|
| 6| 1|4000| 4000|
| 7| 1|3000| 3000|
+---+----+----+-------+
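Since the question asks for Spark SQL specifically, here is a sketch of the same idea expressed as SQL (my own translation, assuming the table is registered as a temp view named t):

df.createOrReplaceTempView("t")
spark.sql("""
    SELECT id, prod, val,
           first_value(CASE WHEN prod = 1 THEN val END, true)
               OVER (ORDER BY id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS new_val
    FROM t
""").show()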