Is there a way to find location of the top n elements in a group by - python-3.x

I need to extract an attribute of the top n elements within each group of a pandas DataFrame. The input data looks like this:
KEY variable value
0 1 A 0.476970
101 1 B 0.513333
202 1 C 0.376970
203 2 B 0.5667
101 2 A 0.513333
202 2 C 0.376970
...
I need the output for the top two per KEY to be:
KEY variable value
1 A 0.476970
1 B 0.513333
2 B 0.5667
2 A 0.513333
...
The code I tried is as follows:
test=pred_melt.groupby(['KEY'])['value'].nlargest(2)
This gives me:
KEY
1 101 0.513333
0 0.476970
...
Name: value, Length: 198, dtype: float64
The idea was to join back to the original on the index (101, 0, etc.) to add the variable column, but I cannot get the index out to produce the desired output above. Note that the group-by column is KEY, not variable.

Thanks Supratim, yes, the index was the key. I have added the rest of the details that I had to work out; please comment if anything is needed.
test=pred_melt.groupby(['KEY'])['value'].nlargest(2)
test.index
returns a MultiIndex;
as per
https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
the structure is
MultiIndex(levels=[[...], [...]],
           codes=[[...], [...]],
           names=[...])
I am interested in the second level of that index:
test.index.get_level_values(1)
(index.levels[1] holds only the unique values of the level, while get_level_values(1) returns one label per row, in order.) This gives me the second column of
KEY
1 101 0.513333
0 0.476970
...
Name: value, Length: 198, dtype: float64
as 0, 101, etc., which I can use to get the records from pred_melt:
KEY variable value
0 1 A 0.476970
101 1 B 0.513333
202 1 C 0.376970
203 2 B 0.5667
101 2 A 0.513333
202 2 C 0.376970
as
pred_melt.loc[test.index.get_level_values(1)]
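A minimal end-to-end sketch of the whole approach, rebuilt from the sample above (unique index labels are assumed here so that .loc maps each label to exactly one row):

import pandas as pd

# Sample data from the question; the index labels stand in for the
# original row labels and are assumed unique for this sketch.
pred_melt = pd.DataFrame(
    {"KEY": [1, 1, 1, 2, 2, 2],
     "variable": ["A", "B", "C", "B", "A", "C"],
     "value": [0.476970, 0.513333, 0.376970, 0.5667, 0.513333, 0.376970]},
    index=[0, 101, 202, 303, 404, 505],
)

# Top-2 values per KEY; the result is indexed by (KEY, original label)
test = pred_melt.groupby("KEY")["value"].nlargest(2)

# Pull the original labels out of the second index level and recover
# the full rows, including the variable column
top2 = pred_melt.loc[test.index.get_level_values(1)]
print(top2)

An equivalent one-liner that avoids the index round-trip is pred_melt.sort_values("value", ascending=False).groupby("KEY").head(2).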

Related

Unable to understand DataFrame method "loc" logic if we use incorrect names of labels

I am using the loc method to extract columns by label. I encountered an issue when using incorrect label names, which produced the output below. Please help me understand the logic behind the loc method in terms of label use.
import pandas as pd
Dic={'empno':(101,102,103,104),'name':('a','b','c','d'),'salary':(3000,5000,8000,9000)}
df=pd.DataFrame(Dic)
print(df)
print()
print(df.loc[0:2,'empsfgsdzfsdfsdaf':'salary'])
print(df.loc[0:2,'empno':'salarysadfsa'])
print(df.loc[0:2,'name':'asdfsdafsdaf'])
print(df.loc[0:2,'sadfsadfsadf':'sasdfsdflasdfsdfsdry'])
print(df.loc[0:2,'':'nasdfsd'])
OUTPUT:
empno name salary
0 101 a 3000
1 102 b 5000
2 103 c 8000
3 104 d 9000
name salary
0 a 3000
1 b 5000
2 c 8000
empno name salary
0 101 a 3000
1 102 b 5000
2 103 c 8000
Empty DataFrame
Columns: []
Index: [0, 1, 2]
salary
0 3000
1 5000
2 8000
empno name
0 101 a
1 102 b
2 103 c
.loc[A : B, C : D] will select:
index (row) labels from (and including) A to (and including) B; and
column labels from (and including) C to (and including) D.
Let's look at the column label slice 'a' : 'salary'. Since 'a' sorts before the first column label, we get empno, name, salary.
print(df.loc[0:2, 'a':'salary'])
empno name salary
0 101 a 3000
1 102 b 5000
2 103 c 8000
It works the same way at the upper end of the slice:
print(df.loc[0:2, 'name':'z'])
name salary
0 a 3000
1 b 5000
2 c 8000
Here is a list comprehension that shows how the second slice works:
# code
[col for col in df.columns if 'name' <= col <= 'z']
# result
['name', 'salary']
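To see the positions a label slice actually resolves to, Index.slice_locs can be used (a small illustration on the question's columns; the returned tuple is the (start, stop) integer positions):

import pandas as pd

columns = pd.Index(["empno", "name", "salary"])

# On a sorted index, missing labels are located by where they would sort
print(columns.slice_locs("a", "salary"))  # (0, 3): all three columns
print(columns.slice_locs("name", "z"))    # (1, 3): name, salary
print(columns.slice_locs("t", "z"))       # (3, 3): empty slice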
There is a good description of the most commonly used subsetting methods here:
https://www.kdnuggets.com/2019/06/select-rows-columns-pandas.html

Create two new Dataframes from existing one based on unique and repeated values of a column

colA colB
A 125
B 546
C 4586
D 547
A 869
B 789
A 258
E 123
I want to create two new dataframes: the first based on the first occurrence of each unique value in 'colA', and the second containing the rows where 'colA' is repeated (colB has no repeated values). The first output looks like this:
ColA colB
A 125
B 546
C 4586
D 547
E 123
The second output is like this:
colA colB
A 869
B 789
A 258
For the first output, use drop_duplicates. For the second, use duplicated:
print(df.drop_duplicates("colA"))
colA colB
0 A 125
1 B 546
2 C 4586
3 D 547
7 E 123
print(df[df.duplicated("colA")])
colA colB
4 A 869
5 B 789
6 A 258
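The two calls are complementary, so a single duplicated mask can produce both frames at once (a small sketch rebuilding the sample data):

import pandas as pd

df = pd.DataFrame({"colA": ["A", "B", "C", "D", "A", "B", "A", "E"],
                   "colB": [125, 546, 4586, 547, 869, 789, 258, 123]})

# True for every repeat of a colA value after its first occurrence
mask = df.duplicated("colA")

firsts = df[~mask]   # same rows as df.drop_duplicates("colA")
repeats = df[mask]   # the duplicated rows
print(firsts)
print(repeats)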

pandas value_counts shows duplicates

Here is the code that I am using
all_data.groupby('BsmtFullBath').BsmtFullBath.count()
and the output is coming up as
BsmtFullBath
0 856
1 588
2 15
3 1
0 849
1 584
2 23
3 1
NA 2
Name: BsmtFullBath, dtype: int64
I expected a single row for each unique value, but "0" appears twice.
I believe that if you want to get rid of the duplicated values, you can use the map function to recode the column, as in the example below:
df_final['DC'] = df_final['DC'].map({'NO':0, 'WT':1, 'BU':2,'CT':3,'BT':4, 'CD':5})
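One plausible cause of the duplicate keys (an assumption, since the underlying data is not shown) is that the column mixes numeric and string values; groupby hashes int 0 and the string "0" as distinct keys, and normalizing the dtype merges them:

import pandas as pd

# Hypothetical column mixing int 0 and the string "0"
s = pd.Series([0, 0, 1, "0", "1", "NA"])

print(s.groupby(s).count())          # 0 (int) and "0" (str) counted separately
print(s.astype(str).value_counts())  # casting to one dtype merges the keys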

Sum of next n rows in python

I have a dataframe grouped at the product-store-day_id level. Say it looks like the below, and I need to create a column with a rolling sum:
prod store day_id visits
111 123 1 2
111 123 2 3
111 123 3 1
111 123 4 0
111 123 5 1
111 123 6 0
111 123 7 1
111 123 8 1
111 123 9 2
I need to create a dataframe as below:
prod store day_id visits rolling_4_sum cond
111 123 1 2 6 1
111 123 2 3 5 1
111 123 3 1 2 1
111 123 4 0 2 1
111 123 5 1 4 0
111 123 6 0 4 0
111 123 7 1 NA 0
111 123 8 1 NA 0
111 123 9 2 NA 0
I am looking to create a cond column that recursively checks a condition: say, if rolling_4_sum is greater than 5, set the next 4 rows to 1; otherwise do nothing, i.e. even if the condition is not met, retain what was already filled before. Do this check for each row up to the 7th row.
How can I achieve this using Python? I am trying
d1['rolling_4_sum'] = d1.groupby(['prod', 'store']).visits.rolling(4).sum()
but getting an error.
The rolling sums can be formed with the rolling method, using a boxcar (uniform-weight) window:
df['rolling_4_sum'] = df.visits.rolling(4, win_type='boxcar', center=True).sum().shift(-2)
The shift by -2 is because you apparently want the sums to be placed at the left edge of the window.
Next, the condition on the rolling sums:
df['cond'] = 0
for k in range(1, 4):
    df.loc[df.rolling_4_sum.shift(k) < 7, 'cond'] = 1
A new column is inserted and filled with 0; then for each k = 1, 2, 3 we look k steps back, and if the sum there is less than 7, set the condition to 1.
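As for the error in the question's own attempt: groupby(...).rolling(...) returns a Series carrying an extra (prod, store) index level, so it cannot be assigned straight back to the frame. One way to realign it is transform with a shifted window (a sketch rebuilding the sample data; rolling(4).sum().shift(-3) places each 4-row sum at the left edge of its window):

import pandas as pd

d1 = pd.DataFrame({
    "prod": [111] * 9,
    "store": [123] * 9,
    "day_id": range(1, 10),
    "visits": [2, 3, 1, 0, 1, 0, 1, 1, 2],
})

# transform keeps the original index, so the result assigns cleanly
d1["rolling_4_sum"] = (
    d1.groupby(["prod", "store"])["visits"]
      .transform(lambda s: s.rolling(4).sum().shift(-3))
)
print(d1)

Note that win_type='boxcar' requires SciPy; since boxcar weights are uniform, a plain rolling(4).sum() gives the same sums.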

Spotfire Consecutive Count

I am new to Spotfire so I hope that I ask this question correctly.
My table contains Corp_ID, Date and Flagged columns. The flagged column is either "1" or "0" based on if that Corp_ID had production on that date.
I need a custom expression that will return "0" if the flagged column is "0", BUT if the flagged column is "1" then I need it to return how many consecutive "1"s are in that string for that Corp_ID.
Corp_ID Date Flagged New Column
101 1/1/2016 1 1
101 1/2/2016 0 0
101 1/3/2016 1 4
101 1/4/2016 1 4
101 1/5/2016 1 4
101 1/6/2016 1 4
101 1/7/2016 0 0
101 1/8/2016 0 0
101 1/9/2016 1 2
101 1/10/2016 1 2
102 1/2/2016 1 3
102 1/3/2016 1 3
102 1/4/2016 1 3
102 1/5/2016 0 0
102 1/6/2016 0 0
102 1/7/2016 0 0
102 1/8/2016 1 4
102 1/9/2016 1 4
102 1/10/2016 1 4
102 1/11/2016 1 4
Thanks in advance for any assistance!
KC
This would be a lot easier to implement as part of the query you're using to return the data, but if you have to do it in Spotfire, I suggest the following.
1- Create a hierarchy column containing [Corp_ID] and [Date] (Named ‘DateHr’)
2- Add a calculated column named ‘Concat Flags’ which concatenates all the previous flag values: Concatenate([Flagged]) OVER (Intersect(Parent([Hierarchy.DateHr]),allPrevious([Hierarchy.DateHr])))
3- Add a calculated column which will return the number of 0’s in the Concat Flags field (Named ‘# of 0s’): Len([Concat Flags]) - Len(Substitute([Concat Flags],"0",""))
4- Add a hierarchy column containing [Corp_ID] and [# of 0s] (Named ‘CorpHr’)
5- Add a calculated column to return your desired value: case when [Flagged]=1 then Sum([Flagged]) OVER (Intersect([Hierarchy.CorpHr])) else 0 end
Note: the above assumes you are working in Spotfire version 7.5; the syntax for using hierarchies in calculated columns differs slightly in earlier versions.
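For comparison, outside Spotfire the same consecutive-count logic is compact in pandas (a sketch over the sample table; run ids come from a cumulative sum that increments whenever the flag changes within a Corp_ID):

import pandas as pd

df = pd.DataFrame({
    "Corp_ID": [101] * 10 + [102] * 10,
    "Flagged": [1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
                1, 1, 1, 0, 0, 0, 1, 1, 1, 1],
})

# New run id every time the flag changes, restarted per Corp_ID
run_id = df.groupby("Corp_ID")["Flagged"].transform(
    lambda s: (s != s.shift()).cumsum()
)

# Length of each run; zero out the rows where Flagged is 0
run_len = df.groupby(["Corp_ID", run_id])["Flagged"].transform("size")
df["New Column"] = run_len.where(df["Flagged"] == 1, 0)
print(df)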
