I was looking for a way to extend the range of values in a pandas column by interpolation, but I still don't know how to set the 'limits' of the interpolation. My data looks something like this:
[Distance] [Radiation]
12 120
13 130
14 140
15 150
16 160
17 170
What I'm trying to get is the full range of column [Radiation] for the complete sequence of column [Distance], by interpolation:
[Distance] [Radiation]
1 10
2 20
. .
. .
12 120
13 130
14 140
15 150
16 160
. .
. .
20 200
I looked through the documentation of the pandas and SciPy methods but couldn't find a way to do this.
Thanks for your insights.
One idea is to use DataFrame.reindex to add all the missing values of Distance, and then DataFrame.interpolate with the barycentric method (which requires SciPy):
df = (df.set_index('Distance')
        .reindex(range(1, 21))
        .interpolate(method='barycentric', limit_direction='both')
        .reset_index())
print(df)
Distance Radiation
0 1 10.0
1 2 20.0
2 3 30.0
3 4 40.0
4 5 50.0
5 6 60.0
6 7 70.0
7 8 80.0
8 9 90.0
9 10 100.0
10 11 110.0
11 12 120.0
12 13 130.0
13 14 140.0
14 15 150.0
15 16 160.0
16 17 170.0
17 18 180.0
18 19 190.0
19 20 200.0
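If SciPy is not available, a similar result can be sketched with a straight-line fit in NumPy. This is only an illustration, not the method from the answer above: it assumes the relationship between Distance and Radiation is linear, and the variable names `slope`, `intercept` and `full` are made up for the example.

```python
import numpy as np
import pandas as pd

# Known measurements from the question
df = pd.DataFrame({"Distance": range(12, 18),
                   "Radiation": range(120, 180, 10)})

# Fit a degree-1 polynomial (assumption: the data is linear)
slope, intercept = np.polyfit(df["Distance"].to_numpy(),
                              df["Radiation"].to_numpy(), deg=1)

# Evaluate the fitted line over the complete Distance sequence 1..20
full = pd.DataFrame({"Distance": np.arange(1, 21)})
full["Radiation"] = slope * full["Distance"] + intercept
print(full.round(1))
```

Unlike interpolation, a fit does not have to pass exactly through the known points if they are noisy, so check that this is acceptable for your data.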
I'm trying to calculate the percentile of a value in a pandas column, but only over the last X values:
Column name = 'Signal'
50
20
30
20
102
22
1
2
43
49
38
173
371
312
Next to each value, I would like to calculate, on a rolling basis, the percentile of that value among the X values preceding it (inclusive).
Thanks!
You can use scipy.stats.rankdata() to calculate the ranks of your data and Series.rolling() to apply that calculation on a rolling basis.
import pandas as pd
import numpy as np
from scipy.stats import rankdata
df = pd.DataFrame({'Signal':[50,20,30,20,102,22,1,2,43,49,38,173,371,312]})
def last_value_rank(arr):
    """
    Calculates the last value's rank.
    Rankings start from 1.
    Duplicates are considered (e.g. [1,3,2,2] ranks are [1,4,2.5,2.5], result = 2.5)
    """
    return rankdata(arr)[-1]

def last_value_rank_normalized(arr):
    return (last_value_rank(arr) - 1) * 100 / (len(arr) - 1)
print(df.join(df['Signal'].rolling(5).agg([last_value_rank, last_value_rank_normalized])))
output:
Signal last_value_rank last_value_rank_normalized
0 50 NaN NaN
1 20 NaN NaN
2 30 NaN NaN
3 20 NaN NaN
4 102 5.0 100.0
5 22 3.0 50.0
6 1 1.0 0.0
7 2 2.0 25.0
8 43 4.0 75.0
9 49 5.0 100.0
10 38 3.0 50.0
11 173 5.0 100.0
12 371 5.0 100.0
13 312 4.0 75.0
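For reference, the same normalized rank can be sketched without SciPy by counting how many values in the window are smaller than, or equal to, the last one. The function name `last_value_percentile` is hypothetical, and the tie handling (average rank) is chosen to mirror the default behaviour of rankdata.

```python
import numpy as np
import pandas as pd

signal = pd.Series([50, 20, 30, 20, 102, 22, 1, 2, 43, 49, 38, 173, 371, 312],
                   name="Signal")

def last_value_percentile(window):
    """Percentile (0-100) of the last value within the window."""
    last = window[-1]
    # Average rank: values strictly below, plus the midpoint of the tie group
    rank = (window < last).sum() + ((window == last).sum() + 1) / 2
    return (rank - 1) * 100 / (len(window) - 1)

# raw=True passes each window as a NumPy array instead of a Series
pct = signal.rolling(5).apply(last_value_percentile, raw=True)
print(pd.concat([signal, pct.rename("percentile")], axis=1))
```

The first four rows stay NaN because a window of 5 is not yet full there.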
If we have the following df,
df
A A B B B
0 10 2 0 3 3
1 20 4 19 21 36
2 30 20 24 24 12
3 40 10 39 23 46
How can I combine the content of the columns with the same names?
e.g.
A B
0 10 0
1 20 19
2 30 24
3 40 39
4 2 3
5 4 21
6 20 24
7 10 23
8 Na 3
9 Na 36
10 Na 12
11 Na 46
I tried groupby and merge, but neither does the job.
Any help is appreciated.
If column names are duplicated, you can use DataFrame.melt with concat:
df = pd.concat([df['A'].melt()['value'], df['B'].melt()['value']], axis=1, keys=['A','B'])
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
EDIT: A more general version that loops over the unique column names:
uniq = df.columns.unique()
df = pd.concat([df[c].melt()['value'] for c in uniq], axis=1, keys=uniq)
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
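The same stacking can also be sketched with NumPy instead of melt, which makes the column-by-column order explicit. This is an alternative illustration under the assumption that all same-named columns should simply be appended one after another; the names `stacked` and `out` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    [[10, 2, 0, 3, 3],
     [20, 4, 19, 21, 36],
     [30, 20, 24, 24, 12],
     [40, 10, 39, 23, 46]],
    columns=["A", "A", "B", "B", "B"],
)

# For each unique name, take all columns with that name and flatten them
# column by column (order="F" walks down the first column, then the next).
stacked = {
    name: pd.Series(df.loc[:, df.columns == name].to_numpy().ravel(order="F"))
    for name in df.columns.unique()
}

# Series of different lengths are aligned on the index; gaps become NaN
out = pd.concat(stacked, axis=1)
print(out)
```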
I have this dataframe (df) in python:
Cumulative sales
0 12
1 28
2 56
3 87
I want to create a new column containing the number of new sales (N - (N-1)), as below:
Cumulative sales New Sales
0 12 12
1 28 16
2 56 28
3 87 31
You can do:
df['new sale'] = df.Cumulativesales.diff().fillna(df.Cumulativesales)
df
Cumulativesales new sale
0 12 12.0
1 28 16.0
2 56 28.0
3 87 31.0
Do this:
df['New_sales'] = df['Cumulative_sales'].diff()
df.fillna(df.iloc[0]['Cumulative_sales'], inplace=True)
print(df)
Output:
Cumulative_sales New_sales
0 12 12.0
1 28 16.0
2 56 28.0
3 87 31.0
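Both answers rename the column; with the column name from the question, which contains a space, bracket access is needed instead of attribute access. A minimal sketch, which also casts the result back to integers (diff() produces floats because of the NaN in the first row):

```python
import pandas as pd

df = pd.DataFrame({"Cumulative sales": [12, 28, 56, 87]})

cumulative = df["Cumulative sales"]
# diff() leaves NaN in the first row; fill it with the first cumulative value,
# then cast back to int because the NaN forced a float dtype.
df["New Sales"] = cumulative.diff().fillna(cumulative).astype(int)
print(df)
```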
I am doing some research and need to delete the rows containing values that are not in a specific range, using Python.
My Dataset in Excel:
I want to replace the large values of column A (those not within the range 1-20) with NaN, the large values of column B (those not within the range 21-40) with NaN, and so on.
Then I want to drop the rows that contain NaN values.
Expected output should be like:
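You can try the following. I simulated your problem with a small CSV file. Note that the code below sets values that fall inside each given range to NaN; use `not in` instead of `in` to blank out values outside the range: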
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
data['A'] = data['A'].apply(lambda x: np.nan if x in range(1,10,1) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in range(10,20,1) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in range(20,30,1) else x)
print(data)
data = data.dropna()
print(data)
Original data:
A B C
0 1 10 20
1 2 11 22
2 4 15 25
3 8 20 30
4 12 25 35
5 18 40 55
6 20 45 60
Output with NaN:
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.0 30.0
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Final Output:
A B C
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Try this for non-integer numbers:
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
# compare against the bounds directly instead of enumerating every 0.01 step
data['A'] = data['A'].apply(lambda x: np.nan if 1.00 <= x < 10.00 else x)
data['B'] = data['B'].apply(lambda x: np.nan if 10.00 <= x < 20.00 else x)
data['C'] = data['C'].apply(lambda x: np.nan if 20.00 <= x < 30.00 else x)
print(data)
data = data.dropna()
print(data)
Output:
A B C
0 1.25 10.56 20.11
1 2.39 11.19 22.92
2 4.00 15.65 25.27
3 8.89 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
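The per-column apply calls can also be written as vectorized comparisons. The sketch below is only an illustration: the `limits` dictionary is a hypothetical way to store the bounds (lower bound inclusive, upper bound exclusive, as in the code above), and it shows both readings of the task.

```python
import pandas as pd

data = pd.DataFrame({
    "A": [1.25, 2.39, 4.00, 8.89, 12.15, 18.29, 20.46],
    "B": [10.56, 11.19, 15.65, 20.31, 25.91, 40.15, 45.00],
    "C": [20.11, 22.92, 25.27, 30.15, 35.64, 55.98, 60.48],
})

# Hypothetical per-column bounds: lower inclusive, upper exclusive
limits = {"A": (1, 10), "B": (10, 20), "C": (20, 30)}

# Boolean frame: True where a value lies inside its column's range
in_range = pd.DataFrame({
    col: (data[col] >= lo) & (data[col] < hi)
    for col, (lo, hi) in limits.items()
})

# Reading 1 (the question): keep rows where every value is inside its range
kept_inside = data[in_range.all(axis=1)]

# Reading 2 (the answer above): drop rows where any value is inside its range
kept_outside = data[~in_range.any(axis=1)]

print(kept_inside)
print(kept_outside)
```

Working with a boolean mask avoids the intermediate NaN step entirely and works the same way for integer and non-integer data.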
Try this:
df = df.drop(df.index[df.idxmax()])
Output:
A B C D
0 1 21 41 61
1 2 22 42 62
2 3 23 43 63
3 4 24 44 64
4 5 25 45 65
5 6 26 46 66
6 7 27 47 67
7 8 28 48 68
8 9 29 49 69
13 14 34 54 74
14 15 35 55 75
15 16 36 56 76
16 17 37 57 77
17 18 38 58 78
18 19 39 59 79
19 20 40 60 80
Use idxmax to find the row holding each column's maximum, then drop those rows.
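A small self-contained sketch of the idxmax approach follows. The data here is invented for illustration (the original spreadsheet is not shown), and the approach assumes each column contains exactly one outlier: it always removes the row of each column's maximum, whether or not that maximum is actually out of range.

```python
import pandas as pd

# Invented example: one oversized value per column
df = pd.DataFrame({
    "A": [1, 2, 300, 4, 5],
    "B": [21, 22, 23, 900, 25],
})

# idxmax() returns, per column, the index label of the largest value
rows_with_max = df.idxmax().unique()

# Drop every row that holds some column's maximum
trimmed = df.drop(index=rows_with_max)
print(trimmed)
```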
I have a large DataFrame which is indexed by datetime, in particular, by days. I am looking for an efficient function which, for each column, checks the most common non-null value in each week, and outputs a dataframe which is indexed by weeks consisting of these within-week most common values.
Here is an example. The following DataFrame consists of two weeks of daily data:
0 1
2015-11-12 00:00:00 8 nan
2015-11-13 00:00:00 7 nan
2015-11-14 00:00:00 nan 5
2015-11-15 00:00:00 7 nan
2015-11-16 00:00:00 8 nan
2015-11-17 00:00:00 7 nan
2015-11-18 00:00:00 5 nan
2015-11-19 00:00:00 9 nan
2015-11-20 00:00:00 8 nan
2015-11-21 00:00:00 6 nan
2015-11-22 00:00:00 6 nan
2015-11-23 00:00:00 6 nan
2015-11-24 00:00:00 6 nan
2015-11-25 00:00:00 2 nan
and should be transformed into:
0 1
2015-11-12 00:00:00 7 5
2015-11-19 00:00:00 6 nan
My DataFrame is very large so efficiency is important. Thanks.
EDIT: If possible, can someone suggest a method that would be applicable if the entries are tuples (instead of floats as in my example)?
You can use resample to group your data into weekly intervals. Then count the number of occurrences via pd.value_counts and select the most common value with idxmax:
df.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
0 1
2015-11-12 00:00:00 7.0 5.0
2015-11-19 00:00:00 6.0 NaN
Edit
Here is another numpy version which is faster than the above solution:
def get_mode(series):
    values = series.values
    dropped = values[~np.isnan(values)]
    # check for empty array and return NaN
    if not dropped.size:
        return np.nan
    uniques, counts = np.unique(dropped, return_counts=True)
    return uniques[np.argmax(counts)]
df2.resample("7D").apply(lambda x: x.apply(get_mode))
0 1
2015-11-12 00:00:00 7.0 5.0
2015-11-19 00:00:00 6.0 NaN
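To try the NumPy approach end to end, here is a minimal sketch that rebuilds the two-week example from the question. It resamples one column at a time, which is an assumption made to keep the example simple rather than the exact call used above.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2015-11-12", periods=14, freq="D")
df = pd.DataFrame(
    {0: [8, 7, np.nan, 7, 8, 7, 5, 9, 8, 6, 6, 6, 6, 2],
     1: [np.nan, np.nan, 5] + [np.nan] * 11},
    index=index,
)

def get_mode(series):
    """Most common non-null value of a numeric Series, or NaN if it is empty."""
    values = series.to_numpy(dtype=float)
    values = values[~np.isnan(values)]
    if values.size == 0:
        return np.nan
    uniques, counts = np.unique(values, return_counts=True)
    # On ties, argmax picks the first (smallest) of the most common values
    return uniques[np.argmax(counts)]

# Resample each column into 7-day bins and take the mode of each bin
weekly = pd.DataFrame(
    {col: df[col].resample("7D").apply(get_mode) for col in df.columns}
)
print(weekly)
```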
And here are the timings based on the dummy data (for further improvements, have a look here):
%%timeit
df2.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
>>> 100 loops, best of 3: 18.6 ms per loop
%%timeit
df2.resample("7D").apply(lambda x: x.apply(get_mode))
>>> 100 loops, best of 3: 3.72 ms per loop
I also tried scipy.stats.mode; however, it was also slower than the numpy solution:
size = 1000
index = pd.date_range(start="2012-12-12", periods=size, freq="D")
dummy = pd.DataFrame(np.random.randint(0, 20, size=(size, 50)), index=index)
print(dummy.head())
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
2012-12-12 18 2 7 1 7 9 16 2 19 19 ... 10 2 18 16 15 10 7 19 9 6
2012-12-13 7 4 11 19 17 10 18 0 10 7 ... 19 11 5 5 11 4 0 16 12 19
2012-12-14 14 0 14 5 1 11 2 19 5 9 ... 2 9 4 2 9 5 19 2 16 2
2012-12-15 12 2 7 2 12 12 11 11 19 5 ... 16 0 4 9 13 5 10 2 14 4
2012-12-16 8 15 2 18 3 16 15 0 14 14 ... 18 2 6 13 19 10 3 16 11 4
%%timeit
dummy.resample("7D").apply(lambda x: x.apply(get_mode))
>>> 1 loop, best of 3: 926 ms per loop
%%timeit
dummy.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
>>> 1 loop, best of 3: 5.84 s per loop
%%timeit
dummy.resample("7D").apply(lambda x: stats.mode(x).mode)
>>> 1 loop, best of 3: 1.32 s per loop