I have the following dataframe and need to compute the standard deviation of each vector in the column salary.
dept_name   salary
Sales       [30, 36]
Finance     [10, 98]
Marketing   [20, 22]
IT          [40, 90]
Option 1 - using UDF
Create a function to calculate the standard deviation for a python list.
Assign that function to a pyspark sql udf.
Create a new stdev_salary column that applies the udf to the salary column using withColumn.
# imports required for this solution
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
# calculate std dev for list input
def stdev_list(salary_list):
    mean = sum(salary_list) / len(salary_list)
    variance = sum([(x - mean) ** 2 for x in salary_list]) / len(salary_list)
    stdev = variance ** 0.5
    return stdev
# apply std dev function to pyspark sql udf
stdev_udf = udf(stdev_list, FloatType())
# make a new column using the pyspark sql udf
df = df.withColumn('stdev_salary',stdev_udf('salary'))
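With the sample data above, df.show() should then give roughly the following (the UDF divides by n, i.e. it computes the population standard deviation):
df.show()
+---------+--------+------------+
|dept_name|  salary|stdev_salary|
+---------+--------+------------+
|    Sales|[30, 36]|         3.0|
|  Finance|[10, 98]|        44.0|
|Marketing|[20, 22]|         1.0|
|       IT|[40, 90]|        25.0|
+---------+--------+------------+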
More about the pyspark sql udf function here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html
Option 2 - not using UDF
First explode the salary column so each salary item is represented on a new row
from pyspark.sql import functions as F
df_exploded = df.select('dept_name', 'salary', F.explode('salary').alias('salary_item'))
Then, calculate the standard deviation using the salary_item column while grouping by dept_name and salary
df_final = df_exploded.groupBy('dept_name', 'salary').agg(F.stddev('salary_item').alias('stddev_salary'))
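Note that F.stddev is the sample standard deviation (n-1 in the denominator), while the UDF in Option 1 divides by n. If the two options need to agree, a minimal sketch using the population variant would be:
df_final = df_exploded.groupBy('dept_name', 'salary').agg(
    F.stddev_pop('salary_item').alias('stddev_salary'))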
Related
I have a dataframe with two numeric columns item_cnt_day and item_price.
I want to create a new column called rev in my dataframe, calculated as item_cnt_day * item_price. However, I want rev = item_cnt_day * item_price only if item_cnt_day is greater than or equal to 0; otherwise rev = 0. Could you help me write the code for this condition when creating the new column rev?
You can use loc accessor with boolean masking:
df['rev']=0
df.loc[df['item_cnt_day'].ge(0),'rev']=df['item_cnt_day'].mul(df['item_price'])
OR
You can use where():
df['rev']=df['item_cnt_day'].mul(df['item_price']).where(df['item_cnt_day'].ge(0),0)
#df['item_cnt_day'].mul(df['item_price']).mask(df['item_cnt_day'].lt(0),0)
OR
via np.where():
import numpy as np
df['rev']=np.where(df['item_cnt_day'].ge(0),df['item_cnt_day'].mul(df['item_price']),0)
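A quick way to sanity-check any of the three variants is a tiny dataframe with made-up numbers (column names as in the question, values purely illustrative):
import numpy as np
import pandas as pd

# hypothetical sample data
df = pd.DataFrame({'item_cnt_day': [2, -1, 3], 'item_price': [10.0, 5.0, 4.0]})

df['rev'] = np.where(df['item_cnt_day'].ge(0),
                     df['item_cnt_day'].mul(df['item_price']), 0)
print(df['rev'].tolist())  # [20.0, 0.0, 12.0] -- the negative count is zeroed out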
Given a Spark dataframe that I have:
val df = Seq(
("2019-01-01",100),
("2019-01-02",101),
("2019-01-03",102),
("2019-01-04",103),
("2019-01-05",102),
("2019-01-06",99),
("2019-01-07",98),
("2019-01-08",100),
("2019-01-09",47)
).toDF("day","records")
I want to add a new column to this so that I get the average value of the last N records on a given day. For example, if N=3, then on a given day that value should be the average of the last 3 values, EXCLUDING the current record.
For example, for day 2019-01-05, it would be (103+102+101)/3
How can I efficiently use the over() clause to do this in Spark?
PySpark solution.
The window definition should be 3 PRECEDING AND 1 PRECEDING, which translates to positions (-3, -1) with both boundaries included.
from pyspark.sql import Window
from pyspark.sql.functions import avg
w = Window.orderBy(df.day).rowsBetween(-3, -1)
df_with_rsum = df.withColumn("rsum_prev_3_days", avg(df.records).over(w))
df_with_rsum.show()
The solution assumes there is one row per date in the dataframe without missing dates in between. If not, aggregate the rows by date before applying the window function.
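For reference, a self-contained PySpark sketch of the same data (values copied from the Scala snippet in the question) that can be run as-is:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()

# same rows as the Scala toDF example above
data = [("2019-01-01", 100), ("2019-01-02", 101), ("2019-01-03", 102),
        ("2019-01-04", 103), ("2019-01-05", 102), ("2019-01-06", 99),
        ("2019-01-07", 98), ("2019-01-08", 100), ("2019-01-09", 47)]
df = spark.createDataFrame(data, ["day", "records"])

# average of the previous 3 rows, excluding the current row
w = Window.orderBy("day").rowsBetween(-3, -1)
df.withColumn("rsum_prev_3_days", avg("records").over(w)).show()
For day 2019-01-05 this yields (101 + 102 + 103) / 3 = 102.0, matching the example in the question.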
I have calculated the cdf for a data set in a pandas df and want to determine the respective percentile from the cdf chart.
code for cdf:
import numpy as np
import pandas as pd

def cdf(x):
    df_1 = pmf(x)  # pmf() is the asker's own helper, defined elsewhere
    df1 = pd.DataFrame()
    df1['pmf'] = df_1['pmf'].sort_index()
    df1['x'] = df_1['x']
    df1['cdf'] = np.cumsum(df1['pmf'])
    return df1
This is the generated cdf df:
Now I want to write a simple logic to fetch the "x" value corresponding to a cdf value, in order to determine the percentile.
Appreciate any help in this regard.
You can do it as below (use your dataframe's name in place of df):
df.loc[df['cdf'] == 0.999083, 'x']
output:
12.375
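Comparing floats for exact equality can silently return an empty result; a more defensive sketch (assuming df is the frame returned by cdf() and its cdf column is sorted ascending) picks the first row whose cdf reaches the target value:
import numpy as np

target = 0.999083  # desired cdf value
# position of the first cdf value >= target (cdf assumed sorted ascending)
idx = np.searchsorted(df['cdf'].values, target)
x_at_percentile = df['x'].iloc[idx]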
Is there any function in Spark which can calculate the mean of a column in a DataFrame by ignoring null/NaN? Like in R, we can pass an option such as na.rm=TRUE.
When I apply avg() on a column with a NaN, I get NaN only.
You can do the following:
df.na.drop(Seq("c_name")).select(avg(col("c_name")))
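In PySpark, avg() already skips nulls but propagates NaN, so a hedged equivalent of the line above is to filter the NaNs out first (column name c_name taken from the Scala snippet):
from pyspark.sql.functions import avg, col, isnan

df.filter(~isnan(col("c_name"))).select(avg(col("c_name"))).show()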
Create a dataframe without the rows containing null values, so that the column means can be calculated in the next step:
removeAllDF = df.na.drop()
Create a list of columns in which the null values have to be replaced with column means and call the list "columns_with_nas"
Now iterate through the list "columns_with_nas" and replace all the null values with the calculated mean values:
for x in columns_with_nas:
    meanValue = removeAllDF.agg(avg(x)).first()[0]
    print(x, meanValue)
    df = df.na.fill(meanValue, [x])
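The loop assumes columns_with_nas already exists; one hedged way to build it (names illustrative) is to keep the numeric columns that actually contain nulls:
from pyspark.sql.functions import col, count, when

# numeric columns with at least one null value
numeric_cols = [c for c, t in df.dtypes if t in ('int', 'bigint', 'float', 'double')]
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in numeric_cols]).first()
columns_with_nas = [c for c in numeric_cols if null_counts[c] > 0]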
This seems to work for me in Spark 2.1.0:
In [16]: mydesc=[{'name':'Fela', 'age':46},
{'name':'Menelik','age':None},
{'name':'Zara','age':39}]
In [17]: mydf = sc.parallelize(mydesc).toDF()
In [18]: from pyspark.sql.functions import avg
In [20]: mydf.select(avg('age')).collect()[0][0]
Out[20]: 42.5
I have a pandas dataframe with 1 million rows and hierarchical indexes (country, state, city, in this order) with price observations of a product for each row. How can I calculate the mean and standard deviation (std) for each country, state and city (keeping in mind I am avoiding loops as my df is big)?
For each level of mean and std, I want to save the values in new columns in this dataframe for future access.
Use groupby with the level argument to group your data, then use mean and std. If you want the mean as a new column in your existing dataframe, use transform, which returns a Series with the same index as your df:
grouped = df.groupby(level=['Country', 'State', 'City'])
df['Mean'] = grouped['price_observation'].transform('mean')
df['Std'] = grouped['price_observation'].transform('std')
If you want to read more on grouping, you can read the pandas documentation.
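If a separate summary table (one row per Country/State/City) is also useful, the same groupby can be aggregated directly; a minimal sketch:
# one row per group, with both statistics as columns
summary = grouped['price_observation'].agg(['mean', 'std'])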