Iterating through dataframe columns and using .apply() gives KeyError - python-3.x

So I'm trying to normalize my features by using .apply() iteratively on all columns of the dataframe, but it gives a KeyError. Can someone help?
I've tried the code below, but it doesn't work:
for x in df.columns:
    df[x+'_norm'] = df[x].apply(lambda x: (x - df[x].mean()) / df[x].std())

The KeyError comes from the lambda: its parameter x shadows the loop variable, so df[x] inside the lambda indexes the frame by the cell value rather than by the column name. It's also not a good idea to call mean and std inside apply, since they get recomputed for every single row. Instead, calculate them once at the start of each loop iteration and give the lambda parameter a different name:
for x in df.columns:
    mean = df[x].mean()
    std = df[x].std()
    df[x+'_norm'] = df[x].apply(lambda y: (y - mean) / std)
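For completeness, the same normalization can be done without apply at all, since pandas arithmetic is vectorized over whole columns. A sketch with made-up toy data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 40.0]})

# Snapshot the column list first, since the loop adds new columns
for x in list(df.columns):
    df[x + '_norm'] = (df[x] - df[x].mean()) / df[x].std()
```

This avoids a Python-level function call per row, which matters on large frames.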

Related

Iterate over 4 pandas data frame columns and store them into 4 lists with one for loop instead of 4 for loops

I am currently working with pandas in Python. I wrote a function that extracts data from a pandas data frame and stores it in lists. The code works, but I feel there is a part that I could write with one for loop instead of four. I will give you an example below. The idea of this part of the code is to extract four columns of a pandas data frame into four lists. I did it with four separate for loops, but I want one loop that does the same thing.
col1, col2, col3, col4 = [], [], [], []
for j in abc['col1']:
    col1.append(j)
for k in abc['col2']:
    col2.append(k)
for l in abc['col3']:
    col3.append(l)
for n in abc['col4']:
    col4.append(n)
And my idea is to write one for loop that does all of it. I tried something like this, but it doesn't work:
col1, col2, col3, col4 = [], [], [], []
for j, k, l, n in abc[['col1', 'col2', 'col3', 'col4']]:
    col1.append(j)
    col2.append(k)
    col3.append(l)
    col4.append(n)
Can you help me wrap the four for loops into one? I would appreciate your help!
You don't need to use loops at all; you can convert each column into a list directly:
list_1 = df["col1"].to_list()
Have a look at this previous question.
Treating a pandas dataframe like a list usually works, but it is very bad for performance. I'd consider using the iterrows() function instead.
This would work as in the following example:
col1, col2, col3, col4 = [], [], [], []
for index, row in df.iterrows():
    col1.append(row['col1'])
    col2.append(row['col2'])
    col3.append(row['col3'])
    col4.append(row['col4'])
It's probably easier to grab each column's .values array and call numpy.ndarray.tolist() on it:
col = ['col1', 'col2', 'col3', 'col4']
data = [None] * len(col)
for i in range(len(col)):
    data[i] = df[col[i]].values.tolist()
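The literal one-loop version the question asks for can also be had with zip. A sketch, assuming the four columns exist:

```python
import pandas as pd

abc = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
                    'col3': [5, 6], 'col4': [7, 8]})

col1, col2, col3, col4 = [], [], [], []
# zip yields one value from each column per row
for j, k, l, n in zip(abc['col1'], abc['col2'], abc['col3'], abc['col4']):
    col1.append(j)
    col2.append(k)
    col3.append(l)
    col4.append(n)
```

Iterating the Series directly with zip skips the per-row overhead of iterrows().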

pd.to_numeric not working

I am facing a weird problem with pandas, and I don't know where I am going wrong: assigning the converted columns back in place with .iloc leaves the dtypes unchanged. But when I assign the result to a new df, there is no problem. Any idea why?
Edit :
sat=pd.read_csv("2012_SAT_Results.csv")
sat.head()
#converted columns to numeric types
sat.iloc[:,2:]=sat.iloc[:,2:].apply(pd.to_numeric,errors="coerce")
sat.dtypes
sat_1=sat.iloc[:,2:].apply(pd.to_numeric,errors="coerce")
sat_1.head()
The fact that you can't apply to_numeric directly using .iloc appears to be a bug, but to get the same results that you're looking for (applying to_numeric to multiple columns at the same time), you could instead use:
df = pd.DataFrame({'a':['1','2'],'b':['3','4']})
# If you're applying to entire columns
df[df.columns[1:]] = df[df.columns[1:]].apply(pd.to_numeric, errors = 'coerce')
# If you want to apply to specific rows within columns
df.loc[df.index[1:], df.columns[1:]] = df.loc[df.index[1:], df.columns[1:]].apply(pd.to_numeric, errors = 'coerce')
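A minimal illustration of the errors='coerce' behaviour, on toy data rather than the SAT file from the question:

```python
import pandas as pd

df = pd.DataFrame({'school': ['A', 'B'], 'score': ['400', 'n/a']})

# Non-numeric strings become NaN instead of raising an error
df['score'] = pd.to_numeric(df['score'], errors='coerce')
```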

Hash each row of pandas dataframe column using apply

I'm trying to hash each value of a python 3.6 pandas dataframe column with the following algorithm on the dataframe-column ORIG:
HK_ORIG = base64.b64encode(hashlib.sha1(str(df.ORIG).encode("UTF-8")).digest())
However, the code above does not hash each value of the column; it hashes the string representation of the whole Series. So, in order to hash each value of the df column ORIG, I need to use the apply function. Unfortunately, I don't seem to be able to get this done.
I imagine it to look like the following code:
df["HK_ORIG"] = str(df['ORIG']).encode("UTF-8")).apply(hashlib.sha1)
I'm looking very much forward to your answers!
Many thanks in advance!
You can either create a named function and apply it, or apply a lambda function. In either case, do as much processing as possible within the dataframe.
A lambda-based solution:
df['ORIG'].astype(str).str.encode('UTF-8')\
.apply(lambda x: base64.b64encode(hashlib.sha1(x).digest()))
A named function solution:
def hashme(x):
    return base64.b64encode(hashlib.sha1(x).digest())
df['ORIG'].astype(str).str.encode('UTF-8')\
.apply(hashme)
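Putting the answer together as a runnable sketch (the ORIG values here are made up; the base64 of a SHA-1 digest is always 28 bytes):

```python
import base64
import hashlib
import pandas as pd

df = pd.DataFrame({'ORIG': ['alpha', 'beta']})

def hashme(x):
    # x is a bytes object produced by .str.encode below
    return base64.b64encode(hashlib.sha1(x).digest())

df['HK_ORIG'] = df['ORIG'].astype(str).str.encode('UTF-8').apply(hashme)
```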

Warning: A value is trying to be set on a copy of a slice from a DataFrame -- Using List of Columns

I am getting the following warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Here is my code that is getting the warning:
col_names = ['Column1', 'Column2']
features = X_train[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
X_train[col_names] = features
I realize this is happening because I'm copying the dataframe. But what I am doing here is not like any of the answers I found by googling, so I can't figure out how to apply their answers to my particular situation. The usual scenario for this warning seems to be something like this:
d2 = data[data['name'] == 'fred']
So .loc doesn't work. And .assign doesn't either, because I have a list of columns instead of a single column I can assign. I'm just not quite sure how to handle this the way it wants me to.
It works fine the way it is, other than the warning. So the way I have it is correct.
I think the warning is telling you to do something like
X_train.loc[:, col_names] = features
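A self-contained sketch of the .loc fix, with toy data and a plain z-score standing in for StandardScaler so no sklearn is needed:

```python
import pandas as pd

X_train = pd.DataFrame({'Column1': [1.0, 2.0, 3.0],
                        'Column2': [10.0, 20.0, 30.0],
                        'Other': ['a', 'b', 'c']})
col_names = ['Column1', 'Column2']

features = X_train[col_names]
scaled = (features - features.mean()) / features.std()

# Assigning through .loc writes into the original frame, silencing the warning
X_train.loc[:, col_names] = scaled.values
```

The untouched columns keep their values; only the listed columns are overwritten.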

Spark - Optimize calculation time over a data frame, by using groupBy() instead of filter()

I have a data frame which contains different columns ('features').
My goal is to calculate statistical measures for column X: mean, standard deviation, and variance, but each of them dependent on column Y.
E.g. take all rows where Y = 1 and calculate their mean, stddev, and var, then do the same for all rows where Y = 2, and so on.
My current implementation is:
print "For CONGESTION_FLAG = 0:"
log_df.filter(log_df[flag_col] == 0).select([mean(size_col), stddev(size_col),
                                             pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 1:"
log_df.filter(log_df[flag_col] == 1).select([mean(size_col), stddev(size_col),
                                             pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 2:"
log_df.filter(log_df[flag_col] == 2).select([mean(size_col), stddev(size_col),
                                             pow(stddev(size_col), 2)]).show(20, False)
I was told that filter() is wasteful in terms of computation time here, and was advised that to make these calculations run faster (I'm using this on a 1 GB data file), it would be better to use the groupBy() method.
Can someone please help me transform these lines to do the same calculations using groupBy instead?
I got mixed up with the syntax and didn't manage to do it correctly.
Thanks.
filter by itself is not wasteful. The problem is that you are calling it multiple times (once for each value), meaning you scan the data three times. The operation you are describing is best achieved with groupBy, which aggregates the data per value of the grouped column.
You could do something like this:
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"), pow(stddev(size_col),2).alias("pow"))
You might also get better performance by calculating stddev^2 after the aggregation (you should try it on your data):
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"))
agg_df2 = agg_df.withColumn("pow", agg_df["stddev"] * agg_df["stddev"])
You can also write it in one expression:
log_df.groupBy(log_df[flag_col]).agg(
mean(size_col), stddev(size_col), pow(stddev(size_col), 2)
)
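The same single-scan idea, sketched in pandas for anyone without a Spark session at hand (column names here are made up to match the question's prints):

```python
import pandas as pd

log_df = pd.DataFrame({'CONGESTION_FLAG': [0, 0, 1, 1, 2, 2],
                       'SIZE': [1.0, 3.0, 2.0, 4.0, 5.0, 7.0]})

# One scan of the data: aggregate mean and stddev per flag value
stats = log_df.groupby('CONGESTION_FLAG')['SIZE'].agg(['mean', 'std'])
stats['var'] = stats['std'] ** 2
```

As in the Spark answer, squaring the stddev after aggregation avoids computing it twice.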
