Populate a column based on previous value and row Pyspark - apache-spark

I have a Spark dataframe with 5 columns (group, date, a, b, and c) and I want to do the following:
Given df:
group     date  a  b     c
    a  2018-01  2  3    10
    a  2018-02  4  5  null
    a  2018-03  2  1  null
Expected output:
group     date  a  b             c
    a  2018-01  2  3            10
    a  2018-02  4  5   10*3+2 = 32
    a  2018-03  2  1  32*5+4 = 164
For each group, calculate b * c + a on each row and use the result as the c of the next row.
I tried using lag and a window function but couldn't find the right way to do this.

Within a window you cannot access results of a column that you are currently about to calculate. This would force Spark to do the calculations sequentially and should be avoided. Another approach is to transform the recursive calculation c_n = func(c_(n-1)) into a formula that only uses the (constant) values of a, b and the first value of c:
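Spelled out, the closed form is c_n = c_1 * (b_1 * ... * b_(n-1)) + sum over i = 1..n-1 of a_i * (b_(i+1) * ... * b_(n-1)), where the index runs over the rows of one group in date order. A quick check against the sample group: c_2 = 10*3 + 2 = 32 and c_3 = 10*3*5 + 2*5 + 4 = 164, matching the expected output.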
All input values for this formula can be collected with a window, and the formula itself is implemented as a UDF:
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window
df = ...
w = Window.partitionBy('group').orderBy('date')
df1 = df.withColumn("la", F.collect_list("a").over(w)) \
    .withColumn("lb", F.collect_list("b").over(w)) \
    .withColumn("c0", F.first("c").over(w))
import numpy as np

def calc_c(c0, a, b):
    if c0 is None:
        return 0.0
    if len(a) == 1:
        return float(c0)
    e1 = c0 * np.prod(b[:-1])
    e2 = 0.0
    for i, an in enumerate(a[:-1]):
        e2 = e2 + an * np.prod(b[i+1:-1])
    return float(e1 + e2)

calc_c_udf = F.udf(calc_c, T.DoubleType())

df1.withColumn("result", calc_c_udf("c0", "la", "lb")) \
    .show()
Output:
+-----+-------+---+---+----+---------+---------+---+------+
|group| date| a| b| c| la| lb| c0|result|
+-----+-------+---+---+----+---------+---------+---+------+
| a|2018-01| 2| 3| 10| [2]| [3]| 10| 10.0|
| a|2018-02| 4| 5|null| [2, 4]| [3, 5]| 10| 32.0|
| a|2018-03| 2| 1|null|[2, 4, 2]|[3, 5, 1]| 10| 164.0|
+-----+-------+---+---+----+---------+---------+---+------+

Related

transition matrix from pyspark dataframe

I have two columns (such as):
from  to
   1   2
   1   3
   2   4
   4   2
   4   2
   4   3
   3   3
And I want to create a transition matrix (where the entries in each column sum to 1):
       1.    2.   3.   4.
1.      0     0    0    0
2.   0.5*     0    0  2/3
3.    0.5   0.5    1  1/3
4.      0   0.5    0    0
where the 1 -> 2 entry would be: (the number of times 1 (in 'from') transitions to 2 (in 'to')) / (the total number of times 1 points to any value).
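For example, with the pairs above, 1 appears twice in 'from' (1 -> 2 and 1 -> 3), so the 1 -> 2 entry is 1/2 = 0.5 (the starred cell).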
You can create this kind of transition matrix using a window and pivot.
First some dummy data:
import pandas as pd
import numpy as np
np.random.seed(42)
x = np.random.randint(1,5,100)
y = np.random.randint(1,5,100)
df = spark.createDataFrame(pd.DataFrame({'from': x, 'to': y}))
df.show()
+----+---+
|from| to|
+----+---+
| 3| 3|
| 4| 2|
| 1| 2|
...
To create a pct column, first group the data by unique combinations of from/to and get the counts. On that aggregated dataframe, create a new column, pct, that uses a window to find the total number of records for each from group, which serves as the denominator.
Lastly, pivot the table so the to values become columns and the pct data becomes the values of the matrix.
from pyspark.sql import functions as F, Window
w = Window().partitionBy('from')
grp = df.groupBy('from', 'to').count().withColumn('pct', F.col('count') / F.sum('count').over(w))
res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2))
res.show()
+----+----+----+----+----+
|from| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 0.2| 0.2|0.25|0.35|
| 2|0.27|0.31|0.19|0.23|
| 3|0.46|0.17|0.21|0.17|
| 4|0.13|0.13| 0.5|0.23|
+----+----+----+----+----+
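Note that with this random sample every from/to combination occurs at least once. On sparser data (such as the 7 rows in the question) the pivot leaves nulls for transitions that never occur; a small, optional follow-up, assuming you want zeros there instead, is:
res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2)).na.fill(0)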

Find top n results for multiple fields in Spark dataframe

I have a dataframe like this one:
name  field1  field2  field3
   a       4      10       8
   b       5       0      11
   c      10       7       4
   d       0       1       5
I need to find top 3 names for each field.
Expected output:
top3-field1 top3-field2 top3-field3
c a b
b c a
a d d
So, I tried to sort the field(n) column values, limit the top 3 results and generate new columns using the withColumn method, like this:
df1 = df.orderBy(f.col("field1").desc(), "name") \
    .limit(3) \
    .withColumn("top3-field1", df["name"]) \
    .select("top3-field1", "field1")
With this approach I have to create a different dataframe for each field(n), and then join them to get the result described above. I feel that there must be a better solution for this problem. Hope someone can give me suggestions.
You can first stack the df, then compute a descending rank, keep only ranks less than or equal to 3, and finally pivot the names.
Note that I am using a helper function (stack_multiple_col, shown under Full code below) to make the stack expression a little easier to type:
from pyspark.sql import functions as F, Window as W  # imports

w = W.partitionBy("col").orderBy(F.desc("values"))
out = (df.selectExpr("name", stack_multiple_col(df, df.columns[1:]))
         .withColumn("Rnk", F.dense_rank().over(w))
         .where("Rnk <= 3").groupBy("Rnk").pivot("col").agg(F.first("name")))
out.show()
+---+------+------+------+
|Rnk|field1|field2|field3|
+---+------+------+------+
| 1| c| a| b|
| 2| b| c| a|
| 3| a| d| d|
+---+------+------+------+
If you are not willing to use the function, you can write the same as :
w = W.partitionBy("col").orderBy(F.desc("values"))
out = (df.selectExpr("name",
                     'stack(3,"field1",field1,"field2",field2,"field3",field3) as (col,values)')
         .withColumn("Rnk", F.dense_rank().over(w))
         .where("Rnk <= 3").groupBy("Rnk").pivot("col").agg(F.first("name")))
Full code:
def stack_multiple_col(df, cols=None, output_columns=["col", "values"]):
    """Stacks multiple columns in a dataframe;
    takes all columns by default unless passed a list of column names."""
    cols = cols if cols is not None else df.columns
    pairs = ','.join(map(','.join, zip([f'"{i}"' for i in cols], cols)))
    return f"stack({len(cols)},{pairs}) as ({','.join(output_columns)})"

w = W.partitionBy("col").orderBy(F.desc("values"))
out = (df.selectExpr("name", stack_multiple_col(df, df.columns[1:]))
         .withColumn("Rnk", F.dense_rank().over(w))
         .where("Rnk <= 3").groupBy("Rnk").pivot("col").agg(F.first("name")))
out.show()
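For reference, calling the helper on the sample columns simply builds the same SQL expression as the hand-written stack above:
print(stack_multiple_col(df, df.columns[1:]))
# stack(3,"field1",field1,"field2",field2,"field3",field3) as (col,values)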

groupby category and sum the count

Let's say I have a table (df) like so:
type  count
   A   5000
   B   5000
   C    200
   D    123
 ...    ...
 ...    ...
   Z    453
How can I sum the count column by type, keeping A and B and having all other types fall into an Others category?
I currently have this:
df = df.withColumn('type', when(col("type").isnot("A", "B"))
My expected output would be like so:
 type  count
    A   5000
    B   5000
Other   3043
You want to group by a when expression and sum the count:
from pyspark.sql import functions as F

df1 = df.groupBy(
    F.when(
        F.col("type").isin("A", "B"), F.col("type")
    ).otherwise("Others").alias("type")
).agg(
    F.sum("count").alias("count")
)
df1.show()
#+------+-----+
#| type|count|
#+------+-----+
#| B| 5000|
#| A| 5000|
#|Others| 776|
#+------+-----+
You can divide the dataframe into two parts based on the type, aggregate a sum for the second part, and do a unionAll to combine them.
import pyspark.sql.functions as F

result = df.filter("type in ('A', 'B')").unionAll(
    df.filter("type not in ('A', 'B')")
      .select(F.lit('Other'), F.sum('count'))
)
result.show()
+-----+-----+
| type|count|
+-----+-----+
| A| 5000|
| B| 5000|
|Other| 776|
+-----+-----+
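Since unionAll matches columns by position, the result keeps the type and count names from the first branch. If you prefer the aggregated branch to carry its names explicitly, a small variant with the same output is:
result = df.filter("type in ('A', 'B')").unionAll(
    df.filter("type not in ('A', 'B')")
      .select(F.lit('Other').alias('type'), F.sum('count').alias('count'))
)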

Cell wise operation on data frame, determine precision

I have a data frame with different data types in it.
I would like to determine precision of float types.
I can select only float64 with this code:
df_float64 = df.loc[:, df.dtypes == np.float64]
(Not sure why columns with only 'NaN' values are also selected, but this is just a side note.)
Now, to determine the precision, I think about such an approach:
precision = len(cell.split(".")[1])
if cell were a string.
And have output in the form of a CSV with the maximum precision for each column.
So having a data frame like this:
|     A|     B|    C|  D|
|  0.01|0.0923|  1.0|1.2|
| 100.1| 203.3|1.093|1.9|
|   0.0|  0.23| 1.03|1.0|
I would like to have this:
| A| B| C| D|
| 2| 4| 3| 1|
Is this possible using Pandas?
Thanks
You can use:
- fillna first to remove NaNs
- cast to str by astype
- loop over the columns with apply or a list comprehension with a lambda function
- for each column split by '.', get the second value of the list by str[1] and get its len
- get the max values - the output is a Series
- convert the Series to a one-row DataFrame if necessary
a = df.fillna(0).astype(str).apply(lambda x: x.str.split('.').str[1].str.len()).max()
print (a)
A 2
B 4
C 3
D 1
dtype: int64
df = a.to_frame().T
print (df)
A B C D
0 2 4 3 1
Another solution:
df = df.fillna(0).astype(str)
a = [df[x].str.split('.').str[1].str.len().max() for x in df]
df = pd.DataFrame([a], columns=df.columns)
print (df)
A B C D
0 2 4 3 1
I think you are looking for applymap, i.e.
if you have a dataframe df:
A B C D
0 0.01 0.0923 1.000 1.2
1 100.10 203.3000 1.093 1.9
2 0.00 0.2300 1.030 1.0
ndf = pd.DataFrame(df.astype(str).applymap(lambda x: len(x.split(".")[-1])).max()).T
If you have NaN you can use an if-else, i.e.
ndf = pd.DataFrame(df.astype(str).applymap(lambda x: len(x.split(".")[-1]) if x != 'nan' else 0 ).max()).T
Output:
A B C D
0 2 4 3 1

Calculate per row and add new column in DataFrame PySpark - better solution?

I work with a DataFrame in PySpark.
I have the following task: for each row, check how many of the column values are > 2 and store that count in a times column. For u1 it is 0, for u2 it is 2, etc.
user  a  b  c  d  times
  u1  1  0  1  0      0
  u2  0  1  4  3      2
  u3  2  1  7  0      1
My solution is below. It works, but I'm not sure that it is the best way and I haven't tried it on really big data yet. I don't like transforming to an RDD and back to a data frame. Is there anything better? In the beginning I thought about calculating by UDF per column, but didn't find a way to accumulate and sum all the results per row:
def calculate_times(row):
    times = 0
    for index, item in enumerate(row):
        if not isinstance(item, basestring):
            if item > 2:
                times = times + 1
    return times

def add_column(pair):
    return dict(pair[0].asDict().items() + [("is_outlier", pair[1])])

def calculate_times_for_all(df):
    rdd_with_times = df.map(lambda row: calculate_times(row))
    rdd_final = df.rdd.zip(rdd_with_times).map(add_column)
    df_final = sqlContext.createDataFrame(rdd_final)
    return df_final
For this solution I used this topic:
How do you add a numpy.array as a new column to a pyspark.SQL DataFrame?
Thanks!
It is just a simple one-liner. Example data:
df = sc.parallelize([
    ("u1", 1, 0, 1, 0), ("u2", 0, 1, 4, 3), ("u3", 2, 1, 7, 0)
]).toDF(["user", "a", "b", "c", "d"])
withColumn:
df.withColumn("times", sum((df[c] > 2).cast("int") for c in df.columns[1:]))
and the result:
+----+---+---+---+---+-----+
|user| a| b| c| d|times|
+----+---+---+---+---+-----+
| u1| 1| 0| 1| 0| 0|
| u2| 0| 1| 4| 3| 2|
| u3| 2| 1| 7| 0| 1|
+----+---+---+---+---+-----+
Note: if the columns are nullable you should correct for that, for example using coalesce:
from pyspark.sql.functions import coalesce
sum(coalesce((df[c] > 2).cast("int"), 0) for c in df.columns[1:])
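Putting the two pieces above together, the null-safe version of the one-liner is:
from pyspark.sql.functions import coalesce

df.withColumn("times", sum(coalesce((df[c] > 2).cast("int"), 0) for c in df.columns[1:]))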
