how to get quartiles and classify a value according to this quartile range - python-3.x

I have this df:
import pandas as pd

d = pd.DataFrame({'Name': ['Andres', 'Lars', 'Paul', 'Mike'],
                  'target': ['A', 'A', 'B', 'C'],
                  'number': [10, 12.3, 11, 6]})
And I want to classify each number into a quartile. I am doing this:
(d.groupby(['Name','target','number'])['number']
.quantile([0.25,0.5,0.75,1]).unstack()
.reset_index()
.rename(columns={0.25:"1Q",0.5:"2Q",0.75:"3Q",1:"4Q"})
)
But as you can see, the 4 quartiles are all equal, because the code above computes the quantiles per group, and with only one number per group all the quartiles collapse to that number.
If I run instead:
d['number'].quantile([0.25,0.5,0.75,1])
Then I have the 4 quartiles I am looking for:
0.25 9.000
0.50 10.500
0.75 11.325
1.00 12.300
What I need as output (showing only the first 2 rows):
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.30 1
1 Lars A 12.3 9.0 10.5 11.325 12.30 4
You can see all the quartiles now hold the values computed over the whole number column. Besides that, we have a new column named Rank that classifies the number according to its quartile, e.g. in the first row 10 is within the 1st quartile.

Here's one way that builds on the quantiles you've created, by making them a DataFrame and joining it to d. It also assigns the "Rank" column using the rank method:
out = (d.join(d['number'].quantile([0.25, 0.5, 0.75, 1])
               .set_axis([f'{i}Q' for i in range(1, 5)], axis=0)
               .to_frame().T
               # repeat the single row of quantiles once per row of d
               .pipe(lambda x: x.loc[x.index.repeat(len(d))])
               .reset_index(drop=True))
         .assign(Rank=d['number'].rank(method='dense')))
Output:
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.3 2.0
1 Lars A 12.3 9.0 10.5 11.325 12.3 4.0
2 Paul B 11.0 9.0 10.5 11.325 12.3 3.0
3 Mike C 6.0 9.0 10.5 11.325 12.3 1.0
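As an aside: if the goal is literally "which quartile bucket does the value fall into" (the Rank in the question's expected output) rather than an ordinal rank, pd.qcut is a more direct route. A minimal sketch; note qcut's edge convention places 10 in the 2nd bucket (it lies between Q1 = 9.0 and Q2 = 10.5), which matches the rank output above rather than the question's expected Rank of 1:

import pandas as pd

d = pd.DataFrame({'Name': ['Andres', 'Lars', 'Paul', 'Mike'],
                  'target': ['A', 'A', 'B', 'C'],
                  'number': [10, 12.3, 11, 6]})

# bin the whole column by its quartiles and label each row's bucket 1-4
d['Rank'] = pd.qcut(d['number'], q=4, labels=[1, 2, 3, 4])
print(d)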

Related

how to update rows based on previous row of dataframe python

I have time series data, given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
I have high-dimensional data; I have just shown a simplified version with the two columns {price, amount}. I am trying to transform it into relative changes along the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative change of each product between consecutive time indexes. If a previous date does not exist for a given product, I add "NaN".
Can you please tell me whether there is a function to do this?
Group by product and use .diff():
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
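For reference, a minimal self-contained version of the above (dates assumed to parse as month/day):

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["11/01/2019", "11/02/2019", "11/03/2019",
                            "11/04/2019", "11/05/2019"]),
    "product": ["A", "A", "A", "C", "C"],
    "price": [10, 10, 25, 40, 50],
    "amount": [20, 20, 15, 50, 60],
})

# diff() runs within each product group, so the first row of each group
# has no previous value and becomes NaN
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
print(df)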

Fill missing value in different columns of dataframe using mean or median of last n values

I have a dataframe which contains time series data. What I want to do is efficiently fill all the missing values in different columns by substituting a median computed over a timedelta of, say, "N" mins. E.g. if for a column I have data for 10:20, 10:21, 10:22, 10:23, 10:24, ... and the data at 10:22 is missing, then with a timedelta of, say, 2 mins I would want it filled with the median of the values at 10:20, 10:21, 10:23 and 10:24.
One way I can do this is:
for each column in the dataframe:
    find the indexes which have NaN values
    for each index with a NaN value:
        extract all values using between_time(index - timedelta, index + timedelta)
        find the median of the extracted values
        set the value at that index to that median
This is two nested for loops, which is not very efficient. Is there a more efficient way to do it?
Thanks
IIUC you can resample your time column, then fillna with a centered rolling window:
import numpy as np
import pandas as pd

# dummy data setup
np.random.seed(500)
n = 2  # rows (= minutes) to look at on each side of a gap
df = pd.DataFrame({"time": pd.to_timedelta([f"10:{i}:00" for i in range(15)]),
                   "value": np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5, 10]]).reset_index(drop=True)
print (df)
time value
0 10:00:00 4
1 10:01:00 9
2 10:02:00 3
3 10:03:00 3
4 10:04:00 8
5 10:06:00 9
6 10:07:00 2
7 10:08:00 9
8 10:09:00 9
9 10:11:00 7
10 10:12:00 3
11 10:13:00 3
12 10:14:00 7
s = df.set_index("time").resample("60S").asfreq()
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).mean()))
value
time
10:00:00 4.0
10:01:00 9.0
10:02:00 3.0
10:03:00 3.0
10:04:00 8.0
10:05:00 5.5
10:06:00 9.0
10:07:00 2.0
10:08:00 9.0
10:09:00 9.0
10:10:00 7.0
10:11:00 7.0
10:12:00 3.0
10:13:00 3.0
10:14:00 7.0
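Note the question asked for the median rather than the mean; rolling windows support that too, so it's a one-word swap:

# same centered window of n*2+1 rows, aggregated with the median instead
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).median()))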

Replace missing values based on another column

I am trying to replace the missing values in a dataframe based on grouping by another column, "Country".
>>> data.head()
Country Advanced skiers, freeriders Snow parks
0 Greece NaN NaN
1 Switzerland 5.0 5.0
2 USA NaN NaN
3 Norway NaN NaN
4 Norway 3.0 4.0
Obviously this is just a small snippet of the data, but I am looking to replace all the NaN values with the average value for each feature.
I have tried grouping the data by country and then calculating the mean of each column. When I print out the resulting array, it comes up with the expected values. However, when I put it into the .fillna() method, the data appears unchanged.
I've tried #DSM's solution from this similar post, but I am not sure how to apply it to multiple columns.
listOfRatings = ['Advanced skiers, freeriders', 'Snow parks']
print (data.groupby('Country')[listOfRatings].mean().fillna(0))
-> displays the expected results
data[listOfRatings] = data[listOfRatings].fillna(data.groupby('Country')[listOfRatings].mean().fillna(0))
-> appears to do nothing to the dataframe
Assuming this is the complete dataset, this is what I would expect the results to be.
Country Advanced skiers, freeriders Snow parks
0 Greece 0.0 0.0
1 Switzerland 5.0 5.0
2 USA 0.0 0.0
3 Norway 3.0 4.0
4 Norway 3.0 4.0
Can anyone explain what I am doing wrong, and how to fix the code?
You can use transform to return a new DataFrame the same size as the original, filled with the aggregated values. (Your original attempt did nothing because .fillna() aligns on the row index, while the groupby().mean() result is indexed by Country, so no rows matched.)
print (data.groupby('Country')[listOfRatings].transform('mean').fillna(0))
Advanced skiers, freeriders Snow parks
0 0.0 0.0
1 5.0 5.0
2 0.0 0.0
3 3.0 4.0
4 3.0 4.0
# dynamically generate all column names except Country
listOfRatings = data.columns.difference(['Country'])
df1 = data.groupby('Country')[listOfRatings].transform('mean').fillna(0)
data[listOfRatings] = data[listOfRatings].fillna(df1)
print (data)
Country Advanced skiers, freeriders Snow parks
0 Greece 0.0 0.0
1 Switzerland 5.0 5.0
2 USA 0.0 0.0
3 Norway 3.0 4.0
4 Norway 3.0 4.0

How to fill NaN with user defined value in pandas dataframe

For text columns like A and B, a user-defined string like 'Missing' should be imputed. For discrete numeric variables like C and D, the median value should be imputed. I have many columns like these, and I would like to apply this rule to all variables in the dataframe.
DF
A B C D
A0A1 Railway 10 NaN
A1A1 Shipping NaN 1
NaN Shipping 3 2
B1A1 NaN 1 7
DF out:
A B C D
A0A1 Railway 10 2
A1A1 Shipping 3 1
Missing Shipping 3 2
B1A1 Missing 1 7
You can fillna by passing a dict:
df.fillna({'A':'Miss','B':"Your2",'C':df.C.median(),'D':df.D.mean()})
Out[373]:
A B C D
0 A0A1 Railway 10.0 3.333333
1 A1A1 Shipping 3.0 1.000000
2 Miss Shipping 3.0 2.000000
3 B1A1 Your2 1.0 7.000000
Fun way!
# map object-dtype columns to 'Missing', the rest to their column median,
# then fill each column's NaNs from that per-column Series
d = {np.dtype('O'): 'Missing'}
df.fillna(df.dtypes.map(d).fillna(df.median()))
A B C D
0 A0A1 Railway 10.0 2.0
1 A1A1 Shipping 3.0 1.0
2 Missing Shipping 3.0 2.0
3 B1A1 Missing 1.0 7.0
First fill with the median for numeric columns, then fillna for the non-numeric ones:
df = df.fillna(df.median()).fillna('Missing')
print (df)
A B C D
0 A0A1 Railway 10.0 2.0
1 A1A1 Shipping 3.0 1.0
2 Missing Shipping 3.0 2.0
3 B1A1 Missing 1.0 7.0
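One caveat, assuming a recent pandas (2.0+): DataFrame.median() no longer silently skips object columns like A and B, so the last two snippets need numeric_only=True:

df = df.fillna(df.median(numeric_only=True)).fillna('Missing')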

Split rows with same name into multiple columns

My excel sheet looks like this:
Name C.p Value
a 1 1.75
b 1 2.35
c 1 1.32
d 1 2.45
a 2 2.7
b 2 1.85
c 2 1.9
d 2 2.6
a 3 3.2
b 3 4.5
c 3 9.2
d 3 5.01
Like this: 4-5 names, 50-60 check points, and values at those check points.
I want the Excel sheet to look like:
C.p a b c d
1 1.75 2.35 1.32 2.45
2 2.7 1.85 1.9 2.6
3 3.2 4.5 9.2 5.01
Here C.p is the check point; it is not always 1, 2, 3... the values change from sheet to sheet.
Could someone help with the code?
Thank you
If that is the only thing you want to do, you can do it quickly with a pivot table in Excel itself. You will get some extra columns like Grand Total, which you can remove; compared to writing code, the effort of removing the unwanted columns is quite small.
See the pic below.
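If you'd rather do it in pandas, DataFrame.pivot performs the same reshape without the Grand Total columns. A minimal sketch ('sheet.xlsx' is a placeholder for your workbook):

import pandas as pd

df = pd.read_excel('sheet.xlsx')  # hypothetical path to the sheet above

# one row per check point, one column per name
out = df.pivot(index='C.p', columns='Name', values='Value').reset_index()
out.columns.name = None
print(out)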
