Cell-wise operation on data frame, determine precision - python-3.x

I have a data frame with different data types in it.
I would like to determine the precision of the float types.
I can select only float64 with this code:
df_float64 = df.loc[:, df.dtypes == np.float64]
(Not sure why columns with only NaN values are also selected, but this is just a side note.)
Now, to determine precision, I am thinking about an approach like this:
precision = len(cell.split(".")[1])
if cell were a string.
And I'd like the output as a CSV with the maximum precision for each column.
So having data frame like this:
| A     | B      | C     | D   |
| 0.01  | 0.0923 | 1.0   | 1.2 |
| 100.1 | 203.3  | 1.093 | 1.9 |
| 0.0   | 0.23   | 1.03  | 1.0 |
I would like to have this:
| A | B | C | D |
| 2 | 4 | 3 | 1 |
Is this possible using Pandas?
Thanks

You can use:
fillna first to remove NaNs
cast to str with astype
loop over the columns with apply or a list comprehension with a lambda function
for each column split on '.', take the second element of the list with str[1], and get its len
get the max values - the output is a Series
convert the Series to a one-row DataFrame if necessary
a = df.fillna(0).astype(str).apply(lambda x: x.str.split('.').str[1].str.len()).max()
print (a)
A 2
B 4
C 3
D 1
dtype: int64
df = a.to_frame().T
print (df)
A B C D
0 2 4 3 1
Another solution:
df = df.fillna(0).astype(str)
a = [df[x].str.split('.').str[1].str.len().max() for x in df]
df = pd.DataFrame([a], columns=df.columns)
print (df)
A B C D
0 2 4 3 1

I think you are looking for applymap, i.e.
If you have a dataframe df
A B C D
0 0.01 0.0923 1.000 1.2
1 100.10 203.3000 1.093 1.9
2 0.00 0.2300 1.030 1.0
ndf = pd.DataFrame(df.astype(str).applymap(lambda x: len(x.split(".")[-1])).max()).T
If you have NaN you can use if-else, i.e.
ndf = pd.DataFrame(df.astype(str).applymap(lambda x: len(x.split(".")[-1]) if x != 'nan' else 0 ).max()).T
Output:
A B C D
0 2 4 3 1
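On a recent pandas (2.1 or later, where applymap was deprecated in favour of DataFrame.map), the same idea should work as the following sketch; the logic is unchanged, only the method name differs:
# same logic with DataFrame.map, the applymap replacement in newer pandas
ndf = pd.DataFrame(df.astype(str).map(lambda x: len(x.split(".")[-1]) if x != 'nan' else 0).max()).T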

Related

Transition matrix from PySpark dataframe

I have two columns (such as):
from | to
1    | 2
1    | 3
2    | 4
4    | 2
4    | 2
4    | 3
3    | 3
And I want to create a transition matrix (where the rows in each column add up to 1):
     1.    2.    3.    4.
1.   0     0     0     0
2.   0.5*  0     0     2/3
3.   0.5   0.5   1     1/3
4.   0     0.5   0     0
where 1 -> 2 would be: (the number of times 1 (in 'from') appears next to 2 (in 'to')) / (the total number of times 1 points to any value).
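Purely to illustrate that definition on the sample pairs (a pandas sketch, separate from the PySpark solution below):
import pandas as pd
# the seven (from, to) pairs from the question
pdf = pd.DataFrame({'from': [1, 1, 2, 4, 4, 4, 3],
                    'to':   [2, 3, 4, 2, 2, 3, 3]})
# rows are 'to', columns are 'from'; normalize='columns' makes each column sum to 1
print(pd.crosstab(pdf['to'], pdf['from'], normalize='columns'))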
You can create this kind of transition matrix using a window and pivot.
First some dummy data:
import pandas as pd
import numpy as np
np.random.seed(42)
x = np.random.randint(1,5,100)
y = np.random.randint(1,5,100)
df = spark.createDataFrame(pd.DataFrame({'from': x, 'to': y}))
df.show()
+----+---+
|from| to|
+----+---+
| 3| 3|
| 4| 2|
| 1| 2|
...
To create a pct column, first group the data by unique combinations of from/to and get the counts. With that aggregated dataframe, create a new column, pct, that uses the Window to find the total number of records for each from group, which is used as the denominator.
Lastly, pivot the table to make the to values columns and the pct data the values of the matrix.
from pyspark.sql import functions as F, Window
w = Window().partitionBy('from')
grp = df.groupBy('from', 'to').count().withColumn('pct', F.col('count') / F.sum('count').over(w))
res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2))
res.show()
+----+----+----+----+----+
|from| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 0.2| 0.2|0.25|0.35|
| 2|0.27|0.31|0.19|0.23|
| 3|0.46|0.17|0.21|0.17|
| 4|0.13|0.13| 0.5|0.23|
+----+----+----+----+----+

How to subtract the row-wise mean from each value of a column and get the row-wise max after subtracting the mean value in PySpark

I want to calculate the row-wise mean, subtract the mean from each value of the row, and get the maximum at the end.
Here is my dataframe:
col1 | col2 | col3
0 | 2 | 3
4 | 2 | 3
1 | 0 | 3
0 | 0 | 0
df = df.withColumn("mean_value", (sum(col(x) for x in df.columns[0:3]) / 3).alias("mean"))
I can calculate the row-wise mean with this line of code, but I want to subtract the mean value from each value of the row and get the maximum value of the row after subtracting the mean.
Required results:
col1 | col2 | col3 | mean_Value | Max_difference_Value
0    | 2    | 3    | 1.66       | 1.34
4    | 2    | 3    | 3.0        | 1.0
1    | 0    | 3    | 1.33       | 1.67
1    | 0    | 1    | 0.66       | 0.66
Note, this is the main formula per row: abs(mean - column value).max()
Using greatest and a list comprehension:
from pyspark.sql import functions as func

# sample data matching the question (data_ls was not shown in the original)
data_ls = [(0, 2, 3), (4, 2, 3), (1, 0, 3), (0, 0, 0)]

spark.sparkContext.parallelize(data_ls).toDF(['col1', 'col2', 'col3']). \
    withColumn('mean_value', (sum(func.col(x) for x in ['col1', 'col2', 'col3']) / 3)). \
    withColumn('max_diff_val',
               func.greatest(*[func.abs(func.col(x) - func.col('mean_value')) for x in ['col1', 'col2', 'col3']])
               ). \
    show()
# +----+----+----+------------------+------------------+
# |col1|col2|col3| mean_value| max_diff_val|
# +----+----+----+------------------+------------------+
# | 0| 2| 3|1.6666666666666667|1.6666666666666667|
# | 4| 2| 3| 3.0| 1.0|
# | 1| 0| 3|1.3333333333333333|1.6666666666666667|
# | 0| 0| 0| 0.0| 0.0|
# +----+----+----+------------------+------------------+
Have you tried UDFs?
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
import numpy as np
@udf(FloatType())
def udf_mean(col1, col2, col3):
    return float(np.mean([col1, col2, col3]))

df = df.withColumn("mean_value", udf_mean("col1", "col2", "col3"))
Similarly you can try for max difference value.
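A sketch of the analogous UDF for the max difference value (this companion udf_max_diff and the reuse of the mean_value column are assumptions, not part of the original answer):
# hypothetical companion UDF: max absolute difference from the row mean
@udf(FloatType())
def udf_max_diff(col1, col2, col3, mean_value):
    return float(max(abs(v - mean_value) for v in [col1, col2, col3]))

df = df.withColumn("max_diff_val", udf_max_diff("col1", "col2", "col3", "mean_value"))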

Find top n results for multiple fields in Spark dataframe

I have a dataframe like this one:
name field1 field2 field3
a 4 10 8
b 5 0 11
c 10 7 4
d 0 1 5
I need to find top 3 names for each field.
Expected output:
top3-field1 top3-field2 top3-field3
c a b
b c a
a d d
So, I tried to sort the field(n) column values, limit to the top 3 results, and generate new columns using the withColumn method, like this:
df1 = df.orderBy(f.col("field1").desc(), "name") \
    .limit(3) \
    .withColumn("top3-field1", df["name"]) \
    .select("top3-field1", "field1")
With this approach I have to create a different dataframe for each field(n), and then join them to get the result described above. I feel that there must be a better solution for this problem. I hope someone can give me suggestions.
You can first stack the df, then rank the values in descending order, then keep only ranks less than or equal to 3, and finally pivot the names.
Note that I am using a helper function (stack_multiple_col, defined in the full code below) to make the stack expression a little easier to type:
from pyspark.sql import functions as F, Window as W  # imports

w = W.partitionBy("col").orderBy(F.desc("values"))
out = (df.selectExpr("name", stack_multiple_col(df, df.columns[1:]))
         .withColumn("Rnk", F.dense_rank().over(w))
         .where("Rnk<=3").groupBy("Rnk").pivot("col").agg(F.first("name")))
out.show()
+---+------+------+------+
|Rnk|field1|field2|field3|
+---+------+------+------+
| 1| c| a| b|
| 2| b| c| a|
| 3| a| d| d|
+---+------+------+------+
If you prefer not to use the helper function, you can write the same as:
w = W.partitionBy("col").orderBy(F.desc("values"))
out = (df.selectExpr("name",
                     'stack(3,"field1",field1,"field2",field2,"field3",field3) as (col,values)')
         .withColumn("Rnk", F.dense_rank().over(w))
         .where("Rnk<=3").groupBy("Rnk").pivot("col").agg(F.first("name")))
Full code:
def stack_multiple_col(df, cols=None, output_columns=("col", "values")):
    """stacks multiple columns in a dataframe,
    takes all columns by default unless passed a list of column names"""
    cols = df.columns if cols is None else cols
    pairs = ','.join(f'"{c}",{c}' for c in cols)
    return f"stack({len(cols)},{pairs}) as ({','.join(output_columns)})"
w = W.partitionBy("col").orderBy(F.desc("values"))
out = (df.selectExpr("name", stack_multiple_col(df, df.columns[1:]))
         .withColumn("Rnk", F.dense_rank().over(w))
         .where("Rnk<=3").groupBy("Rnk").pivot("col").agg(F.first("name")))
out.show()

Populate a column based on previous value and row Pyspark

I have a Spark dataframe with 5 columns (group, date, a, b, and c) and I want to do the following:
given df
group date a b c
a 2018-01 2 3 10
a 2018-02 4 5 null
a 2018-03 2 1 null
expected output
group date a b c
a 2018-01 2 3 10
a 2018-02 4 5 10*3+2=32
a 2018-03 2 1 32*5+4=164
For each group, calculate c as b * c + a and use the output as the c of the next row.
I tried using lag and window functions but couldn't find the right way to do this.
Within a window you cannot access results of a column that you are currently about to calculate. This would force Spark to do the calculations sequentially and should be avoided. Another approach is to transform the recursive calculation c_n = func(c_(n-1)) into a formula that only uses the (constant) values of a, b and the first value of c:
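Reconstructed from the calc_c implementation below (so treat it as an inferred restatement rather than the answer's original formula), the closed form amounts to c_n = c_1 * (b_1 * ... * b_(n-1)) + sum over k = 1..n-1 of a_k * (b_(k+1) * ... * b_(n-1)); for the sample group this gives c_2 = 10*3 + 2 = 32 and c_3 = 10*3*5 + 2*5 + 4 = 164.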
All input values for this formula can be collected with a window, and the formula itself is implemented as a UDF:
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window
df = ...
w = Window.partitionBy('group').orderBy('date')
df1 = df.withColumn("la", F.collect_list("a").over(w)) \
    .withColumn("lb", F.collect_list("b").over(w)) \
    .withColumn("c0", F.first("c").over(w))
import numpy as np
def calc_c(c0, a, b):
    if c0 is None:
        return 0.0
    if len(a) == 1:
        return float(c0)
    e1 = c0 * np.prod(b[:-1])
    e2 = 0.0
    for i, an in enumerate(a[:-1]):
        e2 = e2 + an * np.prod(b[i+1:-1])
    return float(e1 + e2)

calc_c_udf = F.udf(calc_c, T.DoubleType())

df1.withColumn("result", calc_c_udf("c0", "la", "lb")) \
    .show()
Output:
+-----+-------+---+---+----+---------+---------+---+------+
|group| date| a| b| c| la| lb| c0|result|
+-----+-------+---+---+----+---------+---------+---+------+
| a|2018-01| 2| 3| 10| [2]| [3]| 10| 10.0|
| a|2018-02| 4| 5|null| [2, 4]| [3, 5]| 10| 32.0|
| a|2018-03| 2| 1|null|[2, 4, 2]|[3, 5, 1]| 10| 164.0|
+-----+-------+---+---+----+---------+---------+---+------+

How to calculate the number of column values that changed by comparing two dataframes with the same columns in Spark

How do I compare two data frames and get the count of the number of columns that changed from the first dataframe to the second, based on a joining key, using Spark?
df1
id val1 val2 val3 val4
1 a b c d
2 d f k e
4 r t y u
df2
id val1 val2 val3 val4
1 a h c l
2 d f k e
4 g a w u
count:
id count
1 2
2 0
4 3
from pyspark.sql.functions import col

# change aliases to avoid duplicate columns in the joined dataframe
df2 = df2.select(*(col(x).alias('d2' + x) for x in df2.columns))
joineddf = df1.alias('df1').join(df2.alias('df2'), df1.id == df2.d2id)
cols = [z for z in df1.columns]
jd = joineddf.rdd.map(lambda row: (row.id, sum([int(not x) for x in [row[y] == row['d2' + y] for y in cols]])))
spark.createDataFrame(jd, ['id', 'count']).show()
Output:
+---+-----+
| id|count|
+---+-----+
| 1| 2|
| 2| 0|
| 4| 3|
+---+-----+
I have taken all the columns in the sum, including the 'id' field, as its resulting 0 does not add to the sum.
Hope that helps!
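A DataFrame-only variant (avoiding the RDD round trip) could look like the sketch below, starting again from the original df1 and df2 and assuming the same column names; it relies on the null-free sample data, and a null-safe comparison (eqNullSafe) would be needed if nulls can occur:
# sketch: count differing columns per id with the DataFrame API instead of an RDD map
from pyspark.sql import functions as F

df2r = df2.select(*[F.col(x).alias('d2' + x) for x in df2.columns])
joined = df1.join(df2r, df1.id == df2r.d2id)
# sum of 0/1 flags, one per compared column (the 'id' column is excluded here)
diff_count = sum((F.col(x) != F.col('d2' + x)).cast('int') for x in df1.columns if x != 'id')
joined.select('id', diff_count.alias('count')).show()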
