How can you identify the best companies for each variable and copy the cases? - python-3.x

I want to compare the means of subgroups. The cases of the subgroups with the lowest and the highest mean should be copied and appended to the end of the dataset:
Input
df.head(10)
Outcome
Company Satisfaction Image Forecast Contact
0 Blue 2 3 3 1
1 Blue 2 1 3 2
2 Yellow 4 3 3 3
3 Yellow 3 4 3 2
4 Yellow 4 2 1 5
5 Blue 1 5 1 2
6 Blue 4 2 4 3
7 Yellow 5 4 1 5
8 Red 3 1 2 2
9 Red 1 1 1 2
I have around 100 cases in my sample. Now I look at the means for each company.
Input
df.groupby(['Company']).mean()
Outcome
Satisfaction Image Forecast Contact
Company
Blue 2.666667 2.583333 2.916667 2.750000
Green 3.095238 3.095238 3.476190 3.142857
Orange 3.125000 2.916667 3.416667 2.625000
Red 3.066667 2.800000 2.866667 3.066667
Yellow 3.857143 3.142857 3.000000 2.714286
So for Satisfaction, Yellow got the best and Blue the worst value. I want to copy the cases of Yellow and Blue and add them to the dataset, but with the new labels "Best" and "Worst". I don't want to rename the originals, and I want to iterate over the dataset and do this for the other columns too (for example Image). Is there a solution for this? After I have added the cases, I want an output like this:
Input
df.groupby(['Company']).mean()
Expected Outcome
Satisfaction Image Forecast Contact
Company
Blue 2.666667 2.583333 2.916667 2.750000
Green 3.095238 3.095238 3.476190 3.142857
Orange 3.125000 2.916667 3.416667 2.625000
Red 3.066667 2.800000 2.866667 3.066667
Yellow 3.857143 3.142857 3.000000 2.714286
Best 3.857143 3.142857 3.000000 3.142857
Worst 2.666667 2.583333 2.866667 2.625000
But as I said, it is really important that the cases of the companies with the best and worst values for each column are added again and not just renamed, because I want to do further data processing with another software.
UPDATE
I found out how to copy the correct cases:
Input
df2 = df.loc[df['Company'] == 'Yellow']
df2 = df2.replace('Yellow','Best')
df2 = df2[['Company','Satisfaction']]
new = [df,df2]
result = pd.concat(new)
result
Output
Company Contact Forecast Image Satisfaction
0 Blue 1.0 3.0 3.0 2
1 Blue 2.0 3.0 1.0 2
2 Yellow 3.0 3.0 3.0 4
3 Yellow 2.0 3.0 4.0 3
..........................................
87 Best NaN NaN NaN 3
90 Best NaN NaN NaN 4
99 Best NaN NaN NaN 1
111 Best NaN NaN NaN 2
Now I want to copy the cases of the company with the best values for the other variables, too. But then I have to identify manually which company is best in each category. Isn't there a more convenient solution?

I have a solution. First I create a list of the variables I want to create a dummy "Best" and "Worst" company for:
variables = ['Contact','Forecast','Satisfaction','Image']
Then I loop over these columns, adding the cases again with the new label "Best" or "Worst":
for Start in variables:
    # mean of the current variable per company
    neu = df.groupby(['Company'], as_index=False)[Start].mean()
    # companies with the highest and lowest mean
    Best = neu['Company'].loc[neu[Start].idxmax()]
    Worst = neu['Company'].loc[neu[Start].idxmin()]
    # copy their cases and relabel them
    dfBest = df.loc[df['Company'] == Best]
    dfWorst = df.loc[df['Company'] == Worst]
    dfBest = dfBest.replace(Best, 'Best')
    dfWorst = dfWorst.replace(Worst, 'Worst')
    # keep only the label and the current variable
    dfBest = dfBest[['Company', Start]]
    dfWorst = dfWorst[['Company', Start]]
    df = pd.concat([df, dfBest, dfWorst])
Thanks guys :)
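For reference, a more compact variant is possible: compute all group means once on the original frame, then collect the "Best"/"Worst" copies before concatenating. This is only a sketch under the same column names; result is a hypothetical name for the extended frame:

```python
import pandas as pd

variables = ['Contact', 'Forecast', 'Satisfaction', 'Image']
means = df.groupby('Company')[variables].mean()

extra = []
for col in variables:
    best, worst = means[col].idxmax(), means[col].idxmin()
    # copy the winning/losing company's cases under the new labels
    extra.append(df.loc[df['Company'] == best, ['Company', col]].assign(Company='Best'))
    extra.append(df.loc[df['Company'] == worst, ['Company', col]].assign(Company='Worst'))

result = pd.concat([df] + extra, ignore_index=True)
```

Because the means are taken from the original df before anything is appended, later iterations cannot accidentally pick up the already-added "Best"/"Worst" rows.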

Related

Replace value with nan based on required quarters

As part of a model requirement, I am stuck on a weird spot where I need to replace actual values with NaN for extra quarters.
In the example below, ID 1 should have NaN in column Q4, ID 2 should have no NaN, and ID 3 should have both Q3 and Q4 as NaN.
d = {'ID': [1, 2,3], 'QTR_req': [3,4,2],'Q1':[1,1,1],'Q2':[2,2,2],'Q3':[3,3,3],'Q4':[4,4,4]}
df2 = pd.DataFrame(data=d)
I have reached the part of accessing QTR_req using df.loc, but I am stuck on how to set the specific quarters to NaN. Could you suggest what I am looking for here?
Maybe this:
import numpy as np

cols_needed = ['Q1', 'Q2', 'Q3', 'Q4']
df2[cols_needed] = df2[cols_needed].where(
    df2['QTR_req'].values[:, None] > np.arange(len(cols_needed))
)
Output:
ID QTR_req Q1 Q2 Q3 Q4
0 1 3 1 2 3.0 NaN
1 2 4 1 2 3.0 4.0
2 3 2 1 2 NaN NaN
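The where call relies on broadcasting: QTR_req, reshaped to a column vector, is compared against the quarter positions 0..3, so each row keeps only its first QTR_req quarter columns. A row-wise equivalent, shown here only as a readability sketch under the same names:

```python
import numpy as np

for i, row in df2.iterrows():
    extra = cols_needed[int(row['QTR_req']):]  # quarters beyond the requirement
    if extra:
        df2.loc[i, extra] = np.nan
```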

How to access a column of grouped data to perform linear regression in pandas?

I want to perform a linear regression on the groups of a grouped data frame in pandas. The function I am calling throws a KeyError that I cannot resolve.
I have an environmental data set called dat that includes concentration data of a chemical in different tree species of various age classes, at sites in different countries, over the course of several time steps. I now want to do a regression of concentration over time steps within each group of (site, species, age).
This is my code:
```
import pandas as pd
import statsmodels.api as sm
dat = pd.read_csv('data.csv')
dat.head(15)
SampleName Concentration Site Species Age Time_steps
0 batch1 2.18 Germany pine 1 1
1 batch2 5.19 Germany pine 1 2
2 batch3 11.52 Germany pine 1 3
3 batch4 16.64 Norway spruce 0 1
4 batch5 25.30 Norway spruce 0 2
5 batch6 31.20 Norway spruce 0 3
6 batch7 12.63 Norway spruce 1 1
7 batch8 18.70 Norway spruce 1 2
8 batch9 43.91 Norway spruce 1 3
9 batch10 9.41 Sweden birch 0 1
10 batch11 11.10 Sweden birch 0 2
11 batch12 15.73 Sweden birch 0 3
12 batch13 16.87 Switzerland beech 0 1
13 batch14 22.64 Switzerland beech 0 2
14 batch15 29.75 Switzerland beech 0 3
def ols_res_grouped(group):
    xcols_const = sm.add_constant(group['Time_steps'])
    linmod = sm.OLS(group['Concentration'], xcols_const).fit()
    return linmod.params[1]

grouped = dat.groupby(['Site','Species','Age']).agg(ols_res_grouped)
```
I want to get the regression coefficient of concentration data over Time_steps but get a KeyError: 'Time_steps'. How can the sm method access group["Time_steps"]?
According to pandas' documentation, agg applies the function to each column independently, so ols_res_grouped only ever receives one column at a time and group['Time_steps'] raises the KeyError.
It might be possible to use NamedAgg, but I am not sure.
I think it is a lot easier to just use a for loop for this:
for _, group in dat.groupby(['Site','Species','Age']):
    coeff = ols_res_grouped(group)
    # if you want to put the coeff inside the dataframe
    dat.loc[group.index, 'coeff'] = coeff
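Alternatively, groupby().apply() hands each whole sub-frame to the function, which is what the agg call was implicitly expected to do. A minimal sketch reusing the function defined above:

```python
# One slope coefficient per (Site, Species, Age) group
coeffs = dat.groupby(['Site', 'Species', 'Age']).apply(ols_res_grouped)
```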

How to change a value of a cell that contains NaN to another specific value?

I have a dataframe that contains NaN values in a particular column. While iterating through the rows, whenever I come across a NaN (using the isnan() method), I need to change it to some other value (since I have some conditions). I tried using replace() and fillna() with the limit parameter, but they modify the whole column as soon as they come across the first NaN value. Is there any method to assign a value to a specific NaN rather than changing all the values of a column?
Example: the dataframe looks like this:
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 NaN
2 x3 3 'cat' 1 2 3 1 1 NaN
3 x4 6 'lion' 8 4 3 7 1 NaN
4 x5 4 'lion' 1 1 3 1 1 NaN
5 x6 8 'cat' 10 10 9 7 1 0.0
and I have a list like
a = [1.0, 0.0]
and I expect the result to be like
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
I want to change the target_class values based on some conditions and assign the values from the above list.
I believe you need to replace the NaN values with 1 only for the indexes specified in the list idx (and with 0 otherwise):
mask = df['target_class'].isnull()
idx = [1,2,3]
df.loc[mask, 'target_class'] = df[mask].index.isin(idx).astype(int)
print (df)
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
Or:
idx = [1,2,3]
s = pd.Series(df.index.isin(idx).astype(int), index=df.index)
df['target_class'] = df['target_class'].fillna(s)
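As a self-contained check, here is a minimal sketch of the fillna(s) approach with only the target_class column of the example (other columns omitted):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'target_class': [np.nan, np.nan, np.nan, np.nan, 0.0]},
                  index=[1, 2, 3, 4, 5])
idx = [1, 2, 3]

s = pd.Series(df.index.isin(idx).astype(int), index=df.index)
df['target_class'] = df['target_class'].fillna(s)
# rows 1-3 become 1, row 4 becomes 0, row 5 keeps its 0.0
```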
EDIT:
From the comments, the solution is to assign values by index and column with DataFrame.loc:
df2.loc['x2', 'target_class'] = list1[0]
I suppose your conditions for imputing the NaN values do not depend on the number of them in a column. In the code below I stored all the imputation rules in one function that receives as parameters the entire row (containing the NaN) and the column you are investigating. If you also need the whole dataframe for the imputation rules, just pass it through the replace_nan function. In the example I impute the col element with the mean of the other columns.
import pandas as pd
import numpy as np
def replace_nan(row, col):
    row[col] = row.drop(col).mean()
    return row
df = pd.DataFrame(np.random.rand(5,3), columns = ['col1', 'col2', 'col3'])
col_to_impute = 'col1'
df.loc[[1, 3], col_to_impute] = np.nan
df = df.apply(lambda x: replace_nan(x, col_to_impute) if np.isnan(x[col_to_impute]) else x, axis=1)
The only thing you should do is make the right assignment, that is, an assignment only in the rows that contain nulls.
Example dataset:
,event_id,type,timestamp,label
0,asd12e,click,12322232,0.0
1,asj123,click,212312312,0.0
2,asd321,touch,12312323,0.0
3,asdas3,click,33332233,
4,sdsaa3,touch,33211333,
Note: the last two rows contain nulls in the column 'label'. Then we load the dataset:
df = pd.read_csv('dataset.csv')
Now we build the appropriate condition:
cond = df['label'].isnull()
Now we make the assignment over these rows (I don't know your assignment logic, so I assign the value 1 to the NaN's):
df.loc[cond,'label'] = 1
There are other, more precise approaches; the fillna() method could also be used, given the right logic.
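For instance, a one-line fillna() sketch on the same dataset, where the constant 1 is just a stand-in for your actual logic:

```python
df['label'] = df['label'].fillna(1)
```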

Replacing values in specific columns in a Pandas Dataframe, when number of columns are unknown

I am brand new to Python and Stack Exchange. I have been trying to replace invalid values (x < -3 and x > 12) with np.nan in specific columns.
I don't know how many columns I will have to deal with, so I have to write general code that takes this into account. I do know, however, that the first two columns are ids and names respectively. I have searched Google and Stack Exchange for a solution but haven't found one that solves my specific objective.
My question is: how would one replace values found in the third column and onwards?
My dataframe, Data, was posted as a screenshot.
1st attempt — I tried this line:
Data[Data > 12.0] = np.nan
but this replaced values in the first two columns with NaN as well.
2nd attempt — I tried this line:
Data[(Data.iloc[(range(2,Columns))] >= 12) & (Data.iloc[(range(2,Columns))] <= -3)] = np.nan
where
Columns = len(Data.columns)
This is clearly wrong, replacing all values in rows 2 to 6 (Columns = 7).
Any thoughts would be greatly appreciated.
You're looking for the applymap() method.
import pandas as pd
import numpy as np
# get the columns after the second one
cols = Data.columns[2:]
# apply mask to those columns
new_df = Data[cols].applymap(lambda x: np.nan if x > 12 or x < -3 else x)
Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html
This approach assumes your columns after the second contain float or int values.
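If you would rather modify Data in place instead of building new_df, the same mask can be assigned back to the sliced columns (a sketch under the same assumptions):

```python
Data[cols] = Data[cols].applymap(lambda x: np.nan if x > 12 or x < -3 else x)
```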
You can set values in specific columns of a dataframe by using iloc to slice the columns that you need. Then we can set the values using where.
A short example using some random data
df = pd.DataFrame(np.random.randint(0,10,(4,10)))
0 1 2 3 4 5 6 7 8 9
0 7 7 9 4 2 6 6 1 7 9
1 0 1 2 4 5 5 3 9 0 7
2 0 1 4 4 3 8 7 0 6 1
3 1 4 0 2 5 7 2 7 9 9
Now we select the region we want to update using iloc, slicing from the column at index 2 to the last column:
df.iloc[:,2:] = df.iloc[:,2:].where((df < 7) & (df > 2))
This sets the values outside that range to NaN:
0 1 2 3 4 5 6 7 8 9
0 7 7 NaN 4.0 NaN 6.0 6.0 NaN NaN NaN
1 0 1 NaN 4.0 5.0 5.0 3.0 NaN NaN NaN
2 0 1 4.0 4.0 3.0 NaN NaN NaN 6.0 NaN
3 1 4 NaN NaN 5.0 NaN NaN NaN NaN NaN
For your data the code would be this
Data.iloc[:,2:] = Data.iloc[:,2:].where((Data <= 12) & (Data >= -3))
Operator clarification
The setup shown directly above keeps everything satisfying
-3 <= Data <= 12
and replaces the rest with NaN.
If we instead write the condition for the invalid values, the & operator no longer works: a number cannot be both less than -3 and greater than 12 at the same time, so the condition never matches. We need the or operator | instead, and since the condition now names the values to replace rather than the values to keep, mask is the right method:
Data.iloc[:,2:] = Data.iloc[:,2:].mask((Data > 12) | (Data < -3))
So the data is replaced wherever
Data < -3 or Data > 12

Append Two Dataframes Together (Pandas, Python3)

I am trying to append/join(?) two different dataframes together that don't share any overlapping data.
DF1 looks like
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
....
Brown 6
and DF2 looks like
Area Miles
2 3
1 2
....
7 12
I am trying to append these together using
bigdata = df1.append(df2,ignore_index = True).reset_index()
but I get this
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
Area Miles
2 3
1 2
How do I get something like this?
Teams Points Area Miles
Red 2 2 3
Green 1 1 2
Orange 3
Yellow 4
EDIT: in regards to Edchum's answers, I have tried merge and join but each creates a somewhat strange table. Instead of what I am looking for (as listed above), it returns something like this:
Teams Points Area Miles
Red 2 2 3
Green 1
Orange 3 1 2
Yellow 4
Use concat and pass param axis=1:
In [4]:
pd.concat([df1,df2], axis=1)
Out[4]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
join also works:
In [8]:
df1.join(df2)
Out[8]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
As does merge:
In [11]:
df1.merge(df2,left_index=True, right_index=True, how='left')
Out[11]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
EDIT
In the case where the indices do not align, for example when your first df has index [0,1,2,3] and your second df has index [0,2], the above operations will naturally align against the first df's index, resulting in a NaN row for index 1. To fix this you can reindex the second df, either by calling reset_index() or by assigning directly: df2.index = [0,1].
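A short sketch of that reindex fix, using hypothetical frames where df2's index is [0, 2]:

```python
import pandas as pd

df1 = pd.DataFrame({'Teams': ['Red', 'Green', 'Orange', 'Yellow'],
                    'Points': [2, 1, 3, 4]})
df2 = pd.DataFrame({'Area': [2, 1], 'Miles': [3, 2]}, index=[0, 2])

df2 = df2.reset_index(drop=True)      # index becomes [0, 1]
out = pd.concat([df1, df2], axis=1)   # rows now align positionally
```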
