How to agg a column to get standard dev in pyspark - apache-spark

I have a DataFrame which I group by Date and name, and I aggregate the average of price:
(df.filter("some filter")
   .withColumn("price_int", df.price.cast('integer'))
   .groupBy("Date", 'name')
   .agg(avg(col('price_int'))))
I want to get the max, and stddev of price_int column.
(df.filter("some filter")
   .withColumn("price_int", df.price.cast('integer'))
   .groupBy("Date", 'name')
   .agg(avg(col('price_int')),
        max(col('price_int')),
        stddev(col('price_int'))))
But when I add max(col('price_int')), I get an error saying Column is not iterable.
And when I add stddev(col('price_int')), I get an error saying 'stddev' is not defined.
Can you please tell me how I can get the max and standard deviation of the column 'price_int'?

I'm not sure exactly where the issue is, as the code you provided has its syntax a bit messed up, but in any case the below should work:
from pyspark.sql import functions as F

(df.filter("some filter")
   .withColumn("price_int", df.price.cast('integer'))
   .groupBy("Date", 'name')
   .agg(F.avg(F.col('price_int')).alias("Average"),
        F.stddev(F.col("price_int")).alias("Std_Deviation"),
        F.max(F.col("price_int")).alias("Maximum")))
This worked for me.
Also, note Spark has two different standard deviation functions:
stddev() or stddev_samp() - returns the unbiased sample standard deviation of the expression in a group
stddev_pop() - returns the population standard deviation of the expression in a group
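For illustration, here is a minimal, self-contained sketch showing both variants side by side (the SparkSession and toy values are assumptions made up for the example, not taken from the question):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame([("2021-04-20", "a", 10),
                             ("2021-04-20", "a", 20),
                             ("2021-04-20", "a", 30)],
                            ["Date", "name", "price_int"])
(toy.groupBy("Date", "name")
    .agg(F.stddev_samp("price_int").alias("sample_std"),      # divides by n - 1
         F.stddev_pop("price_int").alias("population_std"),   # divides by n
         F.max("price_int").alias("maximum"))
    .show())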

Related

How to use seasonal_decompose: Practical Implementation

How to use seasonal_decompose, how to deal with the various errors it raises, and how to implement it in practice.
Get all imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
from statsmodels.tsa.seasonal import seasonal_decompose
Prepare test data
data = {'Unix Timestamp': ['1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12'],
'Date': ['4/20/2021 0:02','4/20/2021 0:01','4/20/2021 0:00','4/19/2021 23:59','4/19/2021 23:58','4/19/2021 23:57','4/19/2021 23:56','4/19/2021 23:55','4/19/2021 23:54','4/19/2021 23:53','4/19/2021 23:52','4/19/2021 23:51','4/19/2021 23:50','4/19/2021 23:49','4/19/2021 23:48','4/19/2021 23:47','4/19/2021 23:46','4/20/2021 0:02','4/20/2021 0:01','4/20/2021 0:00','4/19/2021 23:59','4/19/2021 23:58','4/19/2021 23:57','4/19/2021 23:56','4/19/2021 23:55','4/19/2021 23:54','4/19/2021 23:53','4/19/2021 23:52','4/19/2021 23:51','4/19/2021 23:50','4/19/2021 23:49','4/19/2021 23:48','4/19/2021 23:47','4/19/2021 23:46'],
'Symbol': ['BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD'],
'Open': [55717.47,55768.94,55691.79,55777.86,55803.5,55690.64,55624.69,55651.82,55688.08,55749.28,55704.59,55779.38,55816.61,55843.69,55880.12,55890.88,0,55717.47,55768.94,55691.79,55777.86,55803.5,55690.64,55624.69,55651.82,55688.08,55749.28,55704.59,55779.38,55816.61,55843.69,55880.12,55890.88,0],
'High': [55723,55849.82,55793.15,55777.86,55823.88,55822.91,55713.02,55675.92,55730.21,55749.28,55759.27,55779.38,55835.57,55863.89,55916.47,55918.87,0,55723,55849.82,55793.15,55777.86,55823.88,55822.91,55713.02,55675.92,55730.21,55749.28,55759.27,55779.38,55835.57,55863.89,55916.47,55918.87,0],
'Low': [55541.69,55711.74,55691.79,55677.92,55773.08,55682.56,55624.63,55621.58,55641.46,55688.08,55695.42,55688.66,55769.46,55797.08,55815.99,55826.84,0,55541.69,55711.74,55691.79,55677.92,55773.08,55682.56,55624.63,55621.58,55641.46,55688.08,55695.42,55688.66,55769.46,55797.08,55815.99,55826.84,0]}
df = pd.DataFrame(data)
Perform decomposition
df_seasonal = seasonal_decompose(df)
We get our first error
ValueError: could not convert string to float:
Let's fix the above error; to do so, run the code below:
df['Date'] = df['Date'].apply(
    lambda x: datetime.datetime.strptime(str(x), '%m/%d/%Y %H:%M')
)
Now if you run seasonal_decompose again, you will get a new error:
df_seasonal = seasonal_decompose(df)
The new error will be:
TypeError: float() argument must be a string or a number, not 'Timestamp'
To fix this error, we pass one column at a time, and the column passed should contain strings or numbers. Try the decomposition using the code below:
df_seasonal = seasonal_decompose(df['Open'])
Now you get a new error, as shown below
ValueError: You must specify a period or x must be a pandas object with a PeriodIndex or a DatetimeIndex with a freq not set to None
There are two solutions to this error.
First solution: use the period parameter of seasonal_decompose
df_seasonal = seasonal_decompose(df['Open'], period=1)  ## here we have data for every minute, hence period is 1, but this need not be correct
In the above code we have data for every minute, hence period is 1. However, this need not be correct: period is actually the cycle period of the input data. To learn more about how to decide on the period, read this page. For the complete list of freq abbreviations, click here.
Second solution: create a datetime index for the data along with a frequency
df = df.set_index(df.Date).asfreq('2Min')  ## 'M' for months, 'S' for seconds. We cannot resample with a frequency of 1Min, as the data is already at 1Min frequency, hence 2Min is used here
df_seasonal = seasonal_decompose(df['Open'])  ## here we didn't use the period or freq argument
In seasonal_decompose we also have to set the model (by default it is additive). We can set the model to be either additive or multiplicative. A rule of thumb for selecting the right model is to check in our plot whether the trend and seasonal variation are relatively constant over time, in other words linear. If yes, we select the additive model. Otherwise, if the trend and seasonal variation increase or decrease over time, we use the multiplicative model. That means before we run seasonal_decompose we should plot the preprocessed data over time and check whether there are any trends or cycles.
Finally, we can run it without errors.
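As a hedged illustration of the model argument, reusing df['Open'] and the plt import from above (note that the multiplicative model requires strictly positive values, and this test data contains zeros):
# additive decomposition, stated explicitly
df_seasonal = seasonal_decompose(df['Open'], model='additive', period=1)
df_seasonal.plot()
plt.show()
# the multiplicative model would be requested like this, but it raises a
# ValueError for series containing zeros or negative values:
# df_seasonal = seasonal_decompose(df['Open'], model='multiplicative', period=1)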
Another error that we might see is TypeError: Index(...) must be called with a collection of some kind, 'seasonal' was passed. This again happens due to wrong usage of seasonal_decompose, for example as below:
df_bt_decomp = seasonal_decompose(df_bt[['Open','High']], period=1)  ## this is wrong because we passed two columns together; both are metrics, not an index
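A hedged sketch of the correct pattern for several metric columns, using the df prepared above and looping over the columns one at a time, since seasonal_decompose expects a single series:
decompositions = {}
for column in ['Open', 'High']:
    # each metric column is decomposed individually
    decompositions[column] = seasonal_decompose(df[column], period=1)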

How to preprocess a dataset with many types of missing data

I'm trying to do the beginner machine learning project Big Mart Sales.
The data set of this project contains many types of missing values (NaN), and values that need to be changed (lf -> Low Fat, reg -> Regular, etc.)
My current approach to preprocessing this data is to create an imputer for every type of data that needs to be fixed:
import numpy as np
from sklearn.impute import SimpleImputer as Imputer
# make the values consistent
lf_imputer = Imputer(missing_values='LF', strategy='constant', fill_value='Low Fat')
lowfat_imputer = Imputer(missing_values='low fat', strategy='constant', fill_value='Low Fat')
X[:,1:2] = lf_imputer.fit_transform(X[:,1:2])
X[:,1:2] = lowfat_imputer.fit_transform(X[:,1:2])
# nan for a categorical variable
nan_imputer = Imputer(missing_values=np.nan, strategy='most_frequent')
X[:, 7:8] = nan_imputer.fit_transform(X[:, 7:8])
# nan for a numerical variable
nan_num_imputer = Imputer(missing_values=np.nan, strategy='mean')
X[:, 0:1] = nan_num_imputer.fit_transform(X[:, 0:1])
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
In addition, it is frustrating that imputer.fit_transform() requires a 2D array as an input whereas I only want to fix the values in a single column (1D). Thus, I always have to use the column that I want to fix plus a column next to it as inputs. Is there any other way to get around this? Thanks.
There is a Python package, ctrl4ai, which can do this for you in a simple way:
pip install ctrl4ai
from ctrl4ai import preprocessing
preprocessing.impute_nulls(dataset)
Usage: [arg1]: [pandas dataframe], [method (default=central_tendency)]: [choose either central_tendency or KNN]
Description: auto-identifies the type of distribution in the column and imputes null values
Note: KNN consumes more system memory if the dataset is huge
Returns: DataFrame [with a separate column for each categorical value]
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
If you have a numerical column, there are several approaches you can use to fill in the missing data:
A constant value that has meaning within the domain, such as 0, distinct from all other values.
A value from another randomly selected record.
A mean, median or mode value for the column.
A value estimated by another predictive model.
Let's see how it works with a mean for one column. One method would be to use fillna from pandas:
X['Name'].fillna(X['Name'].mean(), inplace=True)
For categorical data please have a look here: Impute categorical missing values in scikit-learn
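For a more compact alternative, here is a hedged pandas-only sketch; the column names Item_Fat_Content, Item_Weight and Outlet_Size are assumptions about the Big Mart Sales dataset, not taken from the question:
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # normalise the inconsistent category labels in one pass
    df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(
        {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})
    # numerical column: fill NaN with the column mean
    df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
    # categorical column: fill NaN with the most frequent value
    df['Outlet_Size'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])
    return df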

Iterating through dataframe columns and using .apply() gives KeyError

So I'm trying to normalize my features by using .apply() iteratively on all columns of the dataframe, but it gives a KeyError. Can someone help?
I've tried using the code below, but it doesn't work:
for x in df.columns:
    df[x + '_norm'] = df[x].apply(lambda x: (x - df[x].mean()) / df[x].std())
I don't think it's a good idea to use the mean and std functions inside the apply: you recompute them every time a row gets its new value. (Also note that the lambda parameter x shadows the loop variable x, which is what causes the KeyError, since df ends up being indexed by the cell's value.) Instead you can calculate them at the beginning of each loop iteration and use them in the apply function, like below:
for x in df.columns:
    mean = df[x].mean()
    std = df[x].std()
    df[x + '_norm'] = df[x].apply(lambda y: (y - mean) / std)
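A fully vectorised alternative (a sketch, assuming every column of df is numeric) avoids apply altogether:
import pandas as pd

# mean() and std() broadcast column-wise, so no explicit loop is needed
normalized = (df - df.mean()) / df.std()
normalized.columns = [c + '_norm' for c in df.columns]
df = df.join(normalized)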

error in calculating minutes from timedelta objects in pandas dataframe

I tried calculating minutes by subtracting two columns, but an error shows up: "ValueError: unit abbreviation w/o a number".
However, I tried a similar operation on two different columns and it worked.
I tried calculating arr_delay and there was no error, but the error came up when I tried calculating dep_delay.
data['arr_delay'] = (pd.to_timedelta(data.ATA) - pd.to_timedelta(data.STA)).dt.total_seconds()/60
data['dep_delay'] = (pd.to_timedelta(data.ATD) - pd.to_timedelta(data.STD)).dt.total_seconds()/60
I was able to calculate arr_delay, but an error came up while calculating dep_delay: "ValueError: unit abbreviation w/o a number".
In my opinion there are some bad values in one of the columns, so use the parameter errors='coerce' to convert these values to NaT:
data['dep_delay'] = (pd.to_timedelta(data.ATD, errors='coerce') -
                     pd.to_timedelta(data.STD, errors='coerce')).dt.total_seconds()/60
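If you also want to see which values are the bad ones, a hedged sketch (the column names ATD and STD are taken from the question) could be:
# rows where ATD or STD exists but cannot be parsed as a timedelta
bad = data[(pd.to_timedelta(data.ATD, errors='coerce').isna() & data.ATD.notna()) |
           (pd.to_timedelta(data.STD, errors='coerce').isna() & data.STD.notna())]
print(bad[['ATD', 'STD']])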

How to use select() transformation in Apache Spark?

I am following the Intro to Spark course on edX. However, I can't understand a few things; the following is a lab assignment. FYI, I am not looking for the solution.
I am not able to understand why I am receiving the error
TypeError: 'Column' object is not callable
The following is the code:
from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    """
    Args:
        column (Column): A Column containing a sentence.
    """
    # The following line is giving the error. I believe I am selecting all the rows
    # from the dataframe 'column' where the attribute is named 'sentence'.
    result = column.select('sentence')
    return result

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)],
                                        ['sentence'])
sentenceDF.show(truncate=False)

(sentenceDF
    .select(removePunctuation(col('sentence')))
    .show(truncate=False))
Can you elaborate a little? TIA.
The column parameter is not a DataFrame object and, therefore, does not have access to the select method. You'll need to use other functions to solve this problem.
Hint: Look at the import statement.
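For illustration only (not the course's exact solution), functions in pyspark.sql.functions take a Column and return a new Column, and the DataFrame's own select then evaluates the resulting expression. A minimal sketch of that pattern, reusing sentenceDF from the question (normalizeSentence is a hypothetical helper name):
from pyspark.sql.functions import col, lower, trim

def normalizeSentence(column):
    # builds and returns a new Column expression; no .select is called on the Column itself
    return lower(trim(column))

sentenceDF.select(normalizeSentence(col('sentence')).alias('sentence')).show(truncate=False)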
