I want to calculate, for each value in one dataframe's column, how often it appears in a column of another dataframe. Right now, I have the code below:
df2["freq"] = df1[["col1"]].groupby(df2["col2"])["col1"].transform('count')
But it gives a freq of 1.0 for every value in df2["col2"], even for values that don't exist in df1["col1"].
df1:
col1
0 636
1 636
2 801
3 802
df2:
col2
0 636
1 734
2 801
3 803
df2 after adding freq column:
col2 freq
0 636 1.0
1 734 1.0
2 801 1.0
3 803 1.0
What I actually want:
col2 freq
0 636 2
1 734 0
2 801 1
3 803 0
I am new to pandas, so I am not getting what I am doing wrong. Any help is appreciated! Thanks!
Use Series.map with a Series created by Series.value_counts, then replace missing values with 0:
df2["freq"] = df2["col2"].map(df1["col1"].value_counts()).fillna(0).astype(int)
print (df2)
col2 freq
0 636 2
1 734 0
2 801 1
3 803 0
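For reference, the lookup Series that map uses is just the value counts of df1["col1"] (shown here with the df1 above); 734 and 803 are absent from it, so map produces NaN for them until fillna(0) runs:

# the mapping Series: each unique value in df1["col1"] and its count
print(df1["col1"].value_counts())
636    2
801    1
802    1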
Related
I want to replace outliers with NaN so that I can concat that dataframe with the other dataframe where I don't want to remove the outliers. Following is the dataset. I want to perform outlier removal only on 'age', 'height', 'weight', 'ap_hi', 'ap_lo'.
id age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active cardio
988 22469 1 155 69.0 130 80 2 2 0 0 1 0
989 14648 1 163 71.0 110 70 1 1 0 0 1 1
990 21901 1 165 70.0 120 80 1 1 0 0 1 0
991 14549 2 165 85.0 120 80 1 1 1 1 1 0
992 23393 1 155 62.0 120 80 1 1 0 0 1 0
I tried the following method but it's taking all columns into consideration:
import numpy as np
from scipy import stats

df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
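A possible fix (a sketch, not tested against your full dataset): compute the z-scores only for the chosen columns and use DataFrame.mask to replace those outliers with NaN, leaving every other column untouched:

import numpy as np
from scipy import stats

# only these columns should have outliers replaced with NaN
cols = ['age', 'height', 'weight', 'ap_hi', 'ap_lo']

# column-wise z-scores for the selected columns only
z = np.abs(stats.zscore(df[cols]))

# mask() replaces cells where the condition is True with NaN;
# all other columns of df are left as they are
df[cols] = df[cols].mask(z >= 3)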
I have a df like this,
Date Value
0 2019-03-01 0
1 2019-04-01 1
2 2019-09-01 0
3 2019-10-01 1
4 2019-12-01 0
5 2019-12-20 0
6 2019-12-20 0
7 2020-01-01 0
Now, I need to group them by quarter and get the proportions of 1 and 0. So, I get my final output like this,
Date Value1 Value0
0 2019-03-31 0 1
1 2019-06-30 1 0
2 2019-09-30 0 1
3 2019-12-31 0.25 0.75
4 2020-03-31 0 1
I tried the following code, but it doesn't seem to work.
def custom_resampler(array):
    import numpy as np
    return array / np.sum(array)

df.set_index('Date').resample('Q')['Value'].apply(custom_resampler)
Is there a pandastic way I can achieve my desired output?
Resample by quarter, get the value_counts, and unstack. Next, rename the columns using the name property of the columns. Last, divide each row's values by the row total:
df = pd.read_clipboard(sep=r'\s{2,}', parse_dates=['Date'])
res = (df
       .resample(rule="Q", on="Date")
       .Value
       .value_counts()
       .unstack("Value", fill_value=0)
       )
res.columns = [f"{res.columns.name}{ent}" for ent in res.columns]
res = res.div(res.sum(axis=1), axis=0)
res
Value0 Value1
Date
2019-03-31 1.00 0.00
2019-06-30 0.00 1.00
2019-09-30 1.00 0.00
2019-12-31 0.75 0.25
2020-03-31 1.00 0.00
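Note that pd.read_clipboard simply pulls the example table from the clipboard; for a self-contained run, the same frame can be built directly from the data in the question:

import pandas as pd

# the example data from the question, constructed directly
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-03-01", "2019-04-01", "2019-09-01",
                            "2019-10-01", "2019-12-01", "2019-12-20",
                            "2019-12-20", "2020-01-01"]),
    "Value": [0, 1, 0, 1, 0, 0, 0, 0],
})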
I have three columns in a dataframe, x1, X2, X3. For each column, I want to count the rows where the value is 0 after a value greater than 1; if the value before the 0s is less than or equal to 1, those rows don't need to be counted.
input df:
import pandas as pd

df1 = pd.DataFrame({'x1': [3, 4, 7, 0, 0, 0, 0, 20, 15, 16, 0, 0, 70],
                    'X2': [3, 4, 7, 0, 0, 0, 0, 20, 15, 16, 0, 0, 70],
                    'X3': [6, 3, 0.5, 0, 0, 0, 0, 20, 15, 16, 0, 0, 70]})
print(df1)
x1 X2 X3
0 3 3 6.0
1 4 4 3.0
2 7 7 0.5
3 0 0 0.0
4 0 0 0.0
5 0 0 0.0
6 0 0 0.0
7 20 20 20.0
8 15 15 15.0
9 16 16 16.0
10 0 0 0.0
11 0 0 0.0
12 70 70 70.0
desired output:
x1_count X2_count X3_count
0 6 6 2
The idea is to replace 0 with missing values and forward fill them, then keep only the positions that were originally 0, compare those filled values with greater than 1, and count the True values with sum. The resulting Series is converted to a one-row DataFrame by transposing:
m = df1.eq(0)
df2 = (df1.mask(m)
          .ffill()
          .where(m)
          .gt(1)
          .sum()
          .add_suffix('_count')
          .to_frame()
          .T)
print(df2)
x1_count X2_count X3_count
0 6 6 2
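To see how the chain works step by step (using the df1 above), inspect the intermediate frame before the comparison and sum:

# m marks the cells that are 0 in the original frame
m = df1.eq(0)

# hide the zeros, forward fill the last non-zero value into the gaps,
# then keep only the positions that were originally 0
filled = df1.mask(m).ffill().where(m)
print(filled)

# every remaining cell holds the last non-zero value before that 0, so
# filled.gt(1).sum() counts the zeros that follow a value greater than 1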
I have a df that looks something like the one below
Index Col1 Col2 Col3 Col4 Col5
0 12 121 346 abc 747
1 156 121 146 68 75967
2 234 121 346 567
3 gj 161 646
4 214 171
5 fhg
....
.....
And I want to transform the dataframe so that, in the columns containing null values, the data shifts to the bottom of the dataframe.
Eg it should look like:
Index Col1 Col2 Col3 Col4 Col5
0 12
1 156 121
2 234 121 346
3 gj 121 146 abc
4 214 161 346 68 747
5 fhg 171 646 567 75967
I have thought along the lines of shift and/or justify. However, I am not sure how it can be accomplished efficiently for a large dataframe.
You can use a slightly modified justify function that also works with non-numeric values:
import numpy as np
import pandas as pd

def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = pd.notnull(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
arr = justify(df.values, invalid_val=np.nan, side='down', axis=0)
df = pd.DataFrame(arr, columns=df.columns, index=df.index).astype(df.dtypes)
print (df)
Col1 Col2 Col3 Col4 Col5
0 12 NaN NaN NaN NaN
1 156 121 NaN NaN NaN
2 234 121 346 NaN NaN
3 gj 121 146 abc NaN
4 214 161 346 68 747
5 fhg 171 646 567 75967
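As a quick sanity check of what side='down' does, here is a minimal sketch on a toy array, independent of the df above:

import numpy as np

# non-NaN values in each column are pushed to the bottom, order preserved
a = np.array([[1.0, np.nan],
              [np.nan, 2.0],
              [3.0, np.nan]])
print(justify(a, invalid_val=np.nan, side='down', axis=0))
# [[nan nan]
#  [1.0 nan]
#  [3.0 2.0]]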
I tried this,
t = df.isnull().sum()
for val in zip(t.index.values, t.values):
    df[val[0]] = df[val[0]].shift(val[1])
print(df)
Output:
Index Col1 Col2 Col3 Col4 Col5
0 0 12 NaN NaN NaN NaN
1 1 156 121.0 NaN NaN NaN
2 2 234 121.0 346.0 NaN NaN
3 3 gj 121.0 146.0 abc NaN
4 4 214 161.0 346.0 68 747.0
5 5 fhg 171.0 646.0 567 75967.0
Note: I have used a loop here, which may not be the best solution, but it should give you an idea of how to solve this.
Using Python/Pandas I am trying to transform a dataframe by creating two new columns (A and B) conditional on values from different lines (from column ID3), but from within the same group (as determined by ID1).
For each ID1 group, I want to take the ID2 value where ID3 is equal to 31 and put this value in a new column called A conditional on ID3 being a 1 or a 2. Similarly, I want to take the ID2 value where ID3 is equal to 41 and put this value in a new column called B, again conditional on ID3 being a 1 or a 2.
Assuming I have a dataframe in the following format:
import pandas as pd
df = pd.DataFrame({'ID1': (1, 1, 1, 1, 2, 2, 2), 'ID2': (151, 152, 153, 154, 261, 262, 263), 'ID3': (1, 2, 31, 41, 1, 2, 41), 'ID4': (2, 2, 1, 2, 1, 1, 2)})
print(df)
ID1 ID2 ID3 ID4
0 1 151 1 2
1 1 152 2 2
2 1 153 31 1
3 1 154 41 2
4 2 261 1 1
5 2 262 2 1
6 2 263 41 2
Post-transformation, the format should look like what is shown below, where columns A and B are populated with values from ID2, conditional on the values within ID3.
ID1 ID2 ID3 ID4 A B
0 1 151 1 2 153 154
1 1 152 2 2 153 154
2 1 153 31 1
3 1 154 41 2
4 2 261 1 1 263
5 2 262 2 1 263
6 2 263 41 2
I have attempted what is shown below, but transform retains the same number of values as the original dataset. This poses a problem for the lines in which ID3 = 31 or 41. Also, it returns the original ID2 value by default if there is no ID3 value of 31 within the group.
df['A'] = df.groupby('ID1')['ID2'].transform(lambda x: x.loc[df['ID3'] == 31])
df['B'] = df.groupby('ID1')['ID2'].transform(lambda x: x.loc[df['ID3'] == 41])
Result:
ID1 ID2 ID3 ID4 A B
0 1 151 1 2 153 154
1 1 152 2 2 153 154
2 1 153 31 1 153 154
3 1 154 41 2 153 154
4 2 261 1 1 261 263
5 2 262 2 1 262 263
6 2 263 41 2 263 263
Any suggestions? Thank you in advance!
In no way do I think this is the best solution, but it is a solution.
You can replace .loc with .where, which will return NaN wherever the condition is not true. Then backfill the NaNs, and filter again with .where on ID3 being 1 or 2:
df['A'] = df.groupby('ID1')['ID2'].transform(
    lambda x: x.where(df.ID3 == 31).fillna(method='bfill').where(df.ID3.isin([1, 2])))
df['B'] = df.groupby('ID1')['ID2'].transform(
    lambda x: x.where(df.ID3 == 41).fillna(method='bfill').where(df.ID3.isin([1, 2])))
ID1 ID2 ID3 ID4 A B
0 1 151 1 2 153.0 154.0
1 1 152 2 2 153.0 154.0
2 1 153 31 1 NaN NaN
3 1 154 41 2 NaN NaN
4 2 261 1 1 NaN 263.0
5 2 262 2 1 NaN 263.0
6 2 263 41 2 NaN NaN
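To trace the mechanics, here is a sketch following column A through one group only (ID1 == 1), reusing the df from the question:

# follow the chain for group ID1 == 1
sub = df[df.ID1 == 1]

step1 = sub['ID2'].where(sub.ID3 == 31)    # NaN everywhere except the ID3 == 31 row (153)
step2 = step1.fillna(method='bfill')       # 153 backfilled into the two earlier rows
step3 = step2.where(sub.ID3.isin([1, 2]))  # kept only where ID3 is 1 or 2

print(step3)  # 153.0, 153.0, NaN, NaN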