How to calculate percentile in a pd dataframe? - python-3.x

I am trying to calculate the percentile of each value in a pandas column, but only over the X values preceding it:
Column name = 'Signal'
50
20
30
20
102
22
1
2
43
49
38
173
371
312
Next to each value, I would like to calculate, on a rolling basis, the percentile of that value among the X values preceding it (inclusive of the value itself).
Thanks!

You can use scipy.stats.rankdata() to calculate your data's rank and Series.rolling() to apply that calculation on a rolling basis.
import pandas as pd
from scipy.stats import rankdata

df = pd.DataFrame({'Signal': [50, 20, 30, 20, 102, 22, 1, 2, 43, 49, 38, 173, 371, 312]})

def last_value_rank(arr):
    """
    Calculates the last value's rank within the window.
    Rankings start from 1.
    Duplicates share the average rank (e.g. for [1, 3, 2, 2] the ranks are
    [1, 4, 2.5, 2.5], so the result would be 2.5).
    """
    return rankdata(arr)[-1]

def last_value_rank_normalized(arr):
    return (last_value_rank(arr) - 1) * 100 / (len(arr) - 1)

print(df.join(df['Signal'].rolling(5).agg([last_value_rank, last_value_rank_normalized])))
output:
Signal last_value_rank last_value_rank_normalized
0 50 NaN NaN
1 20 NaN NaN
2 30 NaN NaN
3 20 NaN NaN
4 102 5.0 100.0
5 22 3.0 50.0
6 1 1.0 0.0
7 2 2.0 25.0
8 43 4.0 75.0
9 49 5.0 100.0
10 38 3.0 50.0
11 173 5.0 100.0
12 371 5.0 100.0
13 312 4.0 75.0
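If you prefer to stay within pandas, Series.rank inside rolling().apply() gives a similar percentile directly. Note that this sketch (my addition, not part of the original answer) uses rank(pct=True), which maps each window to (0, 100] rather than the [0, 100] normalization above:
# Percentile of the last value in each 5-value window, using only pandas.
# With raw=False the window arrives as a Series, so .rank() is available.
df['pct_rank'] = (df['Signal']
                  .rolling(5)
                  .apply(lambda s: s.rank(pct=True).iloc[-1] * 100, raw=False))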

Related

Interpolate above and below a range of values in a column - Pandas

I am looking for a way to extend the range of values in a Pandas column by interpolation, but I still don't know how to set the 'limits' of the interpolation. I mean, it's something like:
[Distance] [Radiation]
12 120
13 130
14 140
15 150
16 160
17 170
So, what I'm trying to get is the full range of the [Radiation] column, covering the complete sequence of the [Distance] column, by interpolation.
[Distance] [Radiation]
1 10
2 20
. .
. .
12 120
13 130
14 140
15 150
16 160
. .
. .
20 200
I looked through the pandas and scipy documentation but I couldn't find it yet.
Thanks for your insights.
One idea is to use DataFrame.reindex to add all missing distance values and then DataFrame.interpolate with the barycentric method:
df = (df.set_index('Distance')
        .reindex(range(1, 21))
        .interpolate(method='barycentric', limit_direction='both')
        .reset_index())
print(df)
Distance Radiation
0 1 10.0
1 2 20.0
2 3 30.0
3 4 40.0
4 5 50.0
5 6 60.0
6 7 70.0
7 8 80.0
8 9 90.0
9 10 100.0
10 11 110.0
11 12 120.0
12 13 130.0
13 14 140.0
14 15 150.0
15 16 160.0
16 17 170.0
17 18 180.0
18 19 190.0
19 20 200.0
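If the data is only approximately linear, barycentric extrapolation can overshoot far outside the observed range. A linear alternative (a sketch of my own, not from the answer above) is scipy.interpolate.interp1d with fill_value='extrapolate':
import pandas as pd
from scipy.interpolate import interp1d

df = pd.DataFrame({'Distance': [12, 13, 14, 15, 16, 17],
                   'Radiation': [120, 130, 140, 150, 160, 170]})

# Linear interpolator that also extrapolates beyond the observed distances.
f = interp1d(df['Distance'], df['Radiation'], fill_value='extrapolate')

full = pd.DataFrame({'Distance': range(1, 21)})
full['Radiation'] = f(full['Distance'])
print(full)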

Is it possible to only fill in 50% of the missing values in pandas?

This is the DF:
amount cost
5 NaN
7 NaN
9 78.0
6 80.0
12 NaN
14 NaN
And I only want to fill 50% of the NANs so that I would get something like this:
amount cost
5 'hello'
7 NaN
9 78.0
6 80.0
12 NaN
14 'hello'
And is it possible to fill, let's say, 28% of the missing data on bigger datasets?
Thanks for the help.
We can do:
import numpy as np

idx = df.index[df.cost.isna()]
df.loc[np.random.choice(idx, size=int(len(idx) / 2), replace=False), 'cost'] = 'somevalue'
df
Out[16]:
amount cost
0 5 NaN
1 7 somevalue
2 9 78
3 6 80
4 12 somevalue
5 14 NaN
Try with df.update():
nans = df.loc[df.cost.isna()].copy()
nans.iloc[:int(len(nans) * 0.5), nans.columns.get_loc('cost')] = 'hello'
df.update(nans.cost)
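To generalize the random-choice idea from the first answer to an arbitrary fraction such as 28%, the size argument parameterizes cleanly. A small sketch (the helper name fill_fraction is mine):
import numpy as np

def fill_fraction(df, col, frac, value):
    # Pick a random `frac` share of the NaN positions in `col` and fill them.
    idx = df.index[df[col].isna()]
    chosen = np.random.choice(idx, size=int(len(idx) * frac), replace=False)
    df.loc[chosen, col] = value
    return df

df = fill_fraction(df, 'cost', 0.28, 'hello')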

Create a specific column by looping over the user defined dictionary in pandas

I have a df as shown below.
Date t_factor
2020-02-01 5
2020-02-03 23
2020-02-06 14
2020-02-09 23
2020-02-10 23
2020-02-11 23
2020-02-13 30
2020-02-20 29
2020-02-29 100
2020-03-01 38
2020-03-10 38
2020-03-11 38
2020-03-26 70
2020-03-29 70
From that I would like to create a function that calculates a column called t_function based on the calculated values t1, t2 and t3, where the input parameters are stored in a dictionary as shown below.
d1 = {'b1': {'s': '2020-02-01', 'e': '2020-02-06', 'coef': [3, 1, 0]},
      'b2': {'s': '2020-02-13', 'e': '2020-02-29', 'coef': [2, 0, 1]},
      'b3': {'s': '2020-03-11', 'e': '2020-03-29', 'coef': [4, 0, 0]}}
Expected output:
Date t_factor t1 t2 t3 t_function
2020-02-01 5 4 NaN NaN 4
2020-02-03 23 6 NaN NaN 6
2020-02-06 14 9 NaN NaN 9
2020-02-09 23 NaN NaN NaN 0
2020-02-10 23 NaN NaN NaN 0
2020-02-11 23 NaN NaN NaN 0
2020-02-13 30 NaN 3 NaN 3
2020-02-20 29 NaN 66 NaN 66
2020-02-29 100 NaN 291 NaN 291
2020-03-01 38 NaN NaN NaN 0
2020-03-10 38 NaN NaN NaN 0
2020-03-11 38 NaN NaN 4 4
2020-03-26 70 NaN NaN 4 4
2020-03-29 70 NaN NaN 4 4
I tried the code below:
from datetime import datetime
import numpy as np

def fun(x, start="2020-02-01", end="2020-02-06", a0=3, a1=1, a2=0):
    start = datetime.strptime(start, "%Y-%m-%d")
    end = datetime.strptime(end, "%Y-%m-%d")
    if start <= x.Date <= end:
        t = (x.Date - start) / np.timedelta64(1, 'D') + 1
        diff = a0 + a1*t + a2*t**2
    else:
        diff = np.NaN
    return diff

df["t1"] = df.apply(lambda x: fun(x), axis=1)
df["t2"] = df.apply(lambda x: fun(x, "2020-02-13", "2020-02-29", 2, 0, 1), axis=1)
df["t3"] = df.apply(lambda x: fun(x, "2020-03-11", "2020-03-29", 4, 0, 0), axis=1)
df["t_function"] = df['t1'].fillna(0) + df['t2'].fillna(0) + df['t3'].fillna(0)
I would like to change the above code to loop over the dictionary d1.
Note:
The dictionary d1 may have more than three keys, such as 'b1', 'b2', 'b3', 'b4'; then we would have to create t1, t2, t3 and t4 columns. I would like to automate this by looping over the dictionary d1:
I would propose that you store the data as a list of tuples, like so:
params = [('2020-02-01', '2020-02-06', 3, 1, 0),
          ('2020-02-13', '2020-02-29', 2, 0, 1),
          ('2020-03-11', '2020-03-29', 4, 0, 0)]
Now all you need is to loop over params and add the columns to your dataframe df.
total = None
for i, param in enumerate(params):
    s, e, a0, a1, a2 = param
    df[f"t{i+1}"] = df.apply(lambda x: fun(x, s, e, a0, a1, a2), axis=1)
    if i == 0:
        total = df[f"t{i+1}"].fillna(0)
    else:
        total += df[f"t{i+1}"].fillna(0)
df["t_function"] = total
This gives the desired output:
Date t_factor t1 t2 t3 t_function
0 2020-02-01 5 4.0 NaN NaN 4.0
1 2020-02-03 23 6.0 NaN NaN 6.0
2 2020-02-06 14 9.0 NaN NaN 9.0
3 2020-02-09 23 NaN NaN NaN 0.0
4 2020-02-10 23 NaN NaN NaN 0.0
5 2020-02-11 23 NaN NaN NaN 0.0
6 2020-02-13 30 NaN 3.0 NaN 3.0
7 2020-02-20 29 NaN 66.0 NaN 66.0
8 2020-02-29 100 NaN 291.0 NaN 291.0
9 2020-03-01 38 NaN NaN NaN 0.0
10 2020-03-10 38 NaN NaN NaN 0.0
11 2020-03-11 38 NaN NaN 4.0 4.0
12 2020-03-26 70 NaN NaN 4.0 4.0
13 2020-03-29 70 NaN NaN 4.0 4.0
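If you would rather keep the dictionary d1 from the question instead of converting it to tuples, the same loop can read the bounds and coefficients from it directly. A sketch assuming the 's'/'e'/'coef' layout shown above:
total = None
for i, key in enumerate(sorted(d1)):   # 'b1', 'b2', 'b3', ...
    p = d1[key]
    a0, a1, a2 = p['coef']
    df[f"t{i+1}"] = df.apply(lambda x: fun(x, p['s'], p['e'], a0, a1, a2), axis=1)
    col = df[f"t{i+1}"].fillna(0)
    total = col if total is None else total + col
df["t_function"] = total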

How to sum variable ranges in a pandas column to another column

I'm relatively new to pandas and I don't know the best approach to solve my problem.
I have a df with an index, the data in a column called 'Data', and an empty column called 'Sum'.
I need help creating a function that writes the sum of each variable-length group of rows of the 'Data' column into the 'Sum' column. The grouping criterion is that a group is a run of consecutive non-empty rows.
Here is an example:
index Data Sum
0 1
1 1 2
2
3
4 1
5 1
6 1 3
7
8 1
9 1 2
10
11 1
12 1
13 1
14 1
15 1 5
16
17 1 1
18
19 1 1
20
As you can see, the length of each group of data in 'Data' is variable: it could be one row or any number of rows. The sum must always be at the end of the group. For example, the sum of the group of rows 4, 5 and 6 of the 'Data' column should be at row 6 in the 'Sum' column.
Any insight will be appreciated.
UPDATE
The problem was solved by implementing Method 3 suggested by ansev. However, due to a change in the main program, the sum of each block now needs to be at the beginning of each one (in case the block has more than one row). So I use the df = df.iloc[::-1] instruction twice, once to reverse the column and once to restore the original order. Thank you very much!
df = df.iloc[::-1]
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['Sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
df = df.iloc[::-1]
print(df)
Data Sum
0 1.0 2.0
1 1.0 NaN
2 NaN NaN
3 NaN NaN
4 1.0 3.0
5 1.0 NaN
6 1.0 NaN
7 NaN NaN
8 1.0 2.0
9 1.0 NaN
10 NaN NaN
11 1.0 5.0
12 1.0 NaN
13 1.0 NaN
14 1.0 NaN
15 1.0 NaN
16 NaN NaN
17 1.0 1.0
18 NaN NaN
19 1.0 1.0
20 NaN NaN
We can use GroupBy.cumsum:
# if you need replace blanks
#df = df.replace(r'^\s*$', np.nan, regex=True)
s = df['Data'].isnull()
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
print(df)
index Data sum
0 0 1.0 NaN
1 1 1.0 2.0
2 2 NaN NaN
3 3 NaN NaN
4 4 1.0 NaN
5 5 1.0 NaN
6 6 1.0 3.0
7 7 NaN NaN
8 8 1.0 NaN
9 9 1.0 2.0
10 10 NaN NaN
11 11 1.0 NaN
12 12 1.0 NaN
13 13 1.0 NaN
14 14 1.0 NaN
15 15 1.0 5.0
16 16 NaN NaN
17 17 1.0 1.0
18 18 NaN NaN
19 19 1.0 1.0
20 20 NaN NaN
Method 2
# df = df.drop(columns='index')  # if necessary
g = df.reset_index().groupby(df['Data'].isnull().cumsum())
df['sum'] = g['Data'].cumsum().where(lambda x: x.index == g['index'].transform('idxmax'))
Method 3
Using Series.duplicated and Series.mask:
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
As you can see, the methods differ only in how they mask the values we don't need in the sum column.
We can also use .transform('sum') instead of .cumsum().
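For example, the Method 3 masking combined with transform (a variant sketch of mine, not in the original answer) would be:
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
# transform('sum') broadcasts each block's total to every row, so we also
# blank out the separator (NaN) rows, which cumsum handled automatically.
df['sum'] = df.groupby(blocks)['Data'].transform('sum').mask(m | df['Data'].isnull())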
Performance with the sample dataframe:
%%timeit
s = df['Data'].isnull()
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
4.52 ms ± 901 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
g = df.reset_index().groupby(df['Data'].isnull().cumsum())
df['sum'] = g['Data'].cumsum().where(lambda x: x.index == g['index'].transform('idxmax'))
8.52 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
3.02 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Code used for replication:
import pandas as pd
import numpy as np

data = {'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan, 1, 1, np.nan, 1, 1, 1, 1, 1, np.nan, 1, np.nan, 1, np.nan]}
df = pd.DataFrame(data)
Iterative approach solution:
count = 0
for i in range(df.shape[0]):
    if df.iloc[i, 0] == 1:
        count += 1
    elif i != 0 and count != 0:
        df.at[i - 1, 'Sum'] = count
        print(count)
        count = 0
# Write the count for a run that reaches the last row (otherwise it would be lost).
if count != 0:
    df.at[df.shape[0] - 1, 'Sum'] = count
Create a new column that is equal to the index in the data gaps and undefined otherwise:
df.loc[:, 'Sum'] = np.where(df.Data.isnull(), df.index, np.nan)
Fill the column backward, count the lengths of the identically labeled spans, redefine the column:
df.Sum = df.groupby(df.Sum.bfill())['Data'].count()
Align the new column with the original data:
df.Sum = df.Sum.shift(-1)
Eliminate 0-length spans:
df.loc[df.Sum == 0, 'Sum'] = np.nan
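Put together as one runnable sketch (using the replication data from the earlier answer; selecting the 'Data' column on the groupby is my reading of the intended fix):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan, 1, 1,
                            np.nan, 1, 1, 1, 1, 1, np.nan, 1, np.nan, 1, np.nan]})

# Label each gap row with its own index, bfill so every row carries the
# index of the next gap, then count the data rows in each labeled span.
df['Sum'] = np.where(df.Data.isnull(), df.index, np.nan)
df['Sum'] = df.groupby(df['Sum'].bfill())['Data'].count()
df['Sum'] = df['Sum'].shift(-1)           # move each count onto the last data row of its span
df.loc[df['Sum'] == 0, 'Sum'] = np.nan    # blank out zero-length spans (consecutive gaps)
print(df)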

Data Cleaning Python: Replacing the values of a column not within a range with NaN and then dropping the rows which contain NaN

I am doing some research and need to delete the rows containing values which are not within a specific range, using Python.
My dataset is in Excel (screenshot not reproduced here).
I want to replace the values of column A that are not within the range 1-20 with NaN, replace the out-of-range values of column B (not within 21-40), and so on.
Now I want to drop/delete the rows that contain the NaN values.
The expected output should keep only the rows where every column is in range (screenshot not reproduced here).
You can try this to solve your problem. Here, I simulated your problem and solved it with the code given below:
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
data['A'] = data['A'].apply(lambda x: np.nan if x in range(1,10,1) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in range(10,20,1) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in range(20,30,1) else x)
print(data)
data = data.dropna()
print(data)
Original data:
A B C
0 1 10 20
1 2 11 22
2 4 15 25
3 8 20 30
4 12 25 35
5 18 40 55
6 20 45 60
Output with NaN:
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.0 30.0
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Final Output:
A B C
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Try this for non-integer numbers:
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
data['A'] = data['A'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(1.00,10.00,0.01)) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(10.00,20.00,0.01)) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(20.00,30.00,0.01)) else x)
print(data)
data = data.dropna()
print(data)
Output:
A B C
0 1.25 10.56 20.11
1 2.39 11.19 22.92
2 4.00 15.65 25.27
3 8.89 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
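Membership tests against a generated float sequence are fragile, since they depend on exact rounding of every candidate value. A more robust sketch (my alternative, assuming the same column names and approximately the same cut-offs, with bounds inclusive) compares against the bounds directly using Series.between and Series.mask:
import pandas as pd

data = pd.DataFrame({'A': [1.25, 12.15], 'B': [10.56, 25.91], 'C': [20.11, 35.64]})

# NaN out values that fall inside each column's excluded range, then drop those rows.
data['A'] = data['A'].mask(data['A'].between(1, 10))
data['B'] = data['B'].mask(data['B'].between(10, 20))
data['C'] = data['C'].mask(data['C'].between(20, 30))
data = data.dropna()
print(data)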
Try this:
df = df.drop(df.index[df.idxmax()])
Output:
A B C D
0 1 21 41 61
1 2 22 42 62
2 3 23 43 63
3 4 24 44 64
4 5 25 45 65
5 6 26 46 66
6 7 27 47 67
7 8 28 48 68
8 9 29 49 69
13 14 34 54 74
14 15 35 55 75
15 16 36 56 76
16 17 37 57 77
17 18 38 58 78
18 19 39 59 79
19 20 40 60 80
Use idxmax and drop the returned index.
