pandas value_counts shows duplicates - pandas-groupby

Here is the code that I am using
all_data.groupby('BsmtFullBath').BsmtFullBath.count()
and the output is coming up as
BsmtFullBath
0 856
1 588
2 15
3 1
0 849
1 584
2 23
3 1
NA 2
Name: BsmtFullBath, dtype: int64
I was expecting a single row for each unique value, but "0" appears twice.

I believe if you want to get rid of the duplicated values, you can use the map function just like in the example below, changing the mapping to suit your own column's values:
df_final['DC'] = df_final['DC'].map({'NO':0, 'WT':1, 'BU':2,'CT':3,'BT':4, 'CD':5})
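As an aside on why the duplicate groups appear at all: a common cause (an assumption here, since the dtype of all_data isn't shown) is that the column mixes integers and strings, e.g. after concatenating two datasets, so groupby treats 0 and '0' as different keys. A minimal sketch of diagnosing and normalizing that:
import pandas as pd

# Hypothetical reproduction: integer and string values print identically
# but form separate group keys.
all_data = pd.DataFrame({'BsmtFullBath': [0, 1, 0, '0', '1', 'NA']})

# Diagnose: how many distinct Python types does the column hold?
print(all_data['BsmtFullBath'].map(type).value_counts())

# Normalize to a single numeric dtype; unparseable entries such as 'NA' become NaN.
all_data['BsmtFullBath'] = pd.to_numeric(all_data['BsmtFullBath'], errors='coerce')
print(all_data.groupby('BsmtFullBath').BsmtFullBath.count())
After the conversion each value appears only once in the counts.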

Related

Get OrderID with min score [duplicate]

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()
df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If multiple minimal values per group are possible and you want all of the min rows, use boolean indexing with transform to get the minimum per group:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer works great if there is (or you want) one min. In my case there could be multiple mins and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]
df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
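For anyone wondering why .filter() doesn't fit: it keeps or drops whole groups based on a single boolean per group, rather than selecting rows within a group. A small self-contained sketch (using the question's column names on a tiny illustrative frame):
import pandas as pd

demo = pd.DataFrame({'item': [1, 1, 2, 2], 'diff': [2, 1, -1, -6]})

# filter() returns every row of any group that satisfies the test -- not just the min rows.
print(demo.groupby('item').filter(lambda g: g['diff'].min() < 0))

# Selecting only the per-group minimum rows needs boolean indexing instead:
print(demo[demo['diff'] == demo.groupby('item')['diff'].transform('min')])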
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
Sort items by the minimum value you want
Drop the duplicates of the column you want to sort with
Re-sort the data because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record, you can sort and then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
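To actually select the rows from that boolean mask, invert it and index the sorted frame; a minimal sketch (equivalent to drop_duplicates, but with the mask kept explicit):
# After sorting by 'diff', the first row of each 'item' is its minimum,
# so the inverted duplicated() mask keeps exactly those rows.
sorted_df = df.sort_values(by='diff')
print(sorted_df[~sorted_df.duplicated(subset='item', keep='first')])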

Selecting columns by their position while using a function in pandas

My dataframe looks something like this
frame = pd.DataFrame({'id':[1,2,3,4,5],
                      'week1_values':[0,0,13,39,64],
                      'week2_values':[32,35,25,78,200]})
I am trying to apply a function to calculate the Week over Week percentage difference between two columns('week1_values' and 'week2_values') whose names are being generated dynamically.
I want to create a function to calculate the percentage difference between weeks keeping in mind the zero values in the 'week1_values' column.
My function is something like this:
def WoW(df):
    if df.iloc[:,1] == 0:
        return (df.iloc[:,1] - df.iloc[:,2])
    else:
        return ((df.iloc[:,1] - df.iloc[:,2]) / df.iloc[:,1]) *100
frame['WoW%'] = frame.apply(WoW, axis=1)
When I try to do that, I end up with this error:
IndexingError: ('Too many indexers', 'occurred at index 0')
How is it that one is supposed to specify columns by their positions inside a function?
PS: Just want to clarify that since the column names are being generated dynamically, I am trying to select them by their position with the iloc function.
Because apply with axis=1 passes each row as a Series, remove the column indexing:
def WoW(df):
    if df.iloc[1] == 0:
        return (df.iloc[1] - df.iloc[2])
    else:
        return ((df.iloc[1] - df.iloc[2]) / df.iloc[1]) *100
frame['WoW%'] = frame.apply(WoW, axis=1)
Vectorized alternative:
s = frame.iloc[:,1] - frame.iloc[:,2]
frame['WoW%1'] = np.where(frame.iloc[:, 1] == 0, s, (s / frame.iloc[:,1]) *100)
print (frame)
id week1_values week2_values WoW% WoW%1
0 1 0 32 -32.000000 -32.000000
1 2 0 35 -35.000000 -35.000000
2 3 13 25 -92.307692 -92.307692
3 4 39 78 -100.000000 -100.000000
4 5 64 200 -212.500000 -212.500000
You can use pandas pct_change method to automatically compute the percent change.
s = (frame.iloc[:, 1:].pct_change(axis=1).iloc[:, -1]*100)
frame['WoW%'] = s.mask(np.isinf(s), frame.iloc[:, -1])
output:
id week1_values week2_values WoW%
0 1 0 32 32.000000
1 2 0 35 35.000000
2 3 13 25 92.307692
3 4 39 78 100.000000
4 5 64 200 212.500000
Note however that the way you currently do it in your custom function is biased. Changes from 0->20, 10->12, or 100->120 would all produce -20 as output, which seems ambiguous.
Suggested alternative: use a classical percent increase, even if it leads to infinite values:
frame['WoW'] = frame.iloc[:, 1:].pct_change(axis=1).iloc[:, -1]*100
output:
id week1_values week2_values WoW
0 1 0 32 inf
1 2 0 35 inf
2 3 13 25 92.307692
3 4 39 78 100.000000
4 5 64 200 212.500000

how to pass pandas series element to another dataframe

I want to check if an error occurred.
I have these two dataframes, from Excel files:
Log_frame is a dataframe of log files, reporting data recording and errors:
Time Voltage[V] Freq[Hz] Speed Motor_Stt: ErrNo
0 10:00 220 50 30 1 0
1 10:10 220 50 30 1 0
2 10:20 220 50 0 2 3601
3 10:30 220 47 0 1 1500
4 10:40 250 50 0 1 7707
5 10:50 220 50 0 2 3601
6 11:00 220 50 0 2 3601
7 11:10 220 47 0 1 1500
8 11:20 220 50 30 1 0
9 11:30 220 50 30 1 0
Dev_frame is the dataframe of error description:
Fehler-Nr. Descr Cause
0 1500 Chk_Voltage Voltage out of range
1 7707 Chk_Freq. Freq. out of range
2 3601 Chk_Motor_Stt Motor_defec
3 7704 switch_trip chk_over_curr
From Log_frame I can check whether, which, and how many errors occurred during a day by:
Err_log = Log_frame['ErrNo']
p = Err_log[Err_log != 0].drop_duplicates(keep='first').reset_index(drop=True)
and this result is a pandas series:
<class 'pandas.core.series.Series'>
0 3601
1 1500
2 7707
I can "pass" first error (or second and all the other) by this:
Dev_Err = Dev_frame['Fehler-Nr.']
n = Dev_Err[Dev_Err == p.iloc[0]] #or 1, 2 and so on
I was wondering how to loop through p.iloc[i].
Should I use a for loop, or can this be done with a pandas function?
EDIT: e.g. if I put 1 in p.iloc[] I can get:
0 1500
if 2:
1 7707
No need to create a loop to check each value; you can use the isin method that pandas.DataFrame has, as follows:
n = Dev_frame[Dev_frame['Fehler-Nr.'].isin(p)]['Fehler-Nr.']
which is going to return:
0 1500
1 7707
2 3601
Name: Fehler-Nr., dtype: int64
Ref: pandas.DataFrame.isin
If you're using pandas and going for for loops, you are wrong. Use pandas vectorised operations. These are done using (simple example):
df.apply(some function, axis)
I'm not 100% convinced I understood what you're trying to achieve, but I believe you just want to merge/join the error descriptions onto the error numbers you found. If so, pandas.DataFrame.join() and pandas.merge() can help. Check the docs.
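If that is indeed the goal, a minimal sketch of the merge approach (assuming the column names shown in the question) would be:
# Unique non-zero error numbers from the log, joined to their descriptions.
errors = Log_frame.loc[Log_frame['ErrNo'] != 0, 'ErrNo'].drop_duplicates().to_frame()
described = errors.merge(Dev_frame, left_on='ErrNo', right_on='Fehler-Nr.', how='left')
print(described[['ErrNo', 'Descr', 'Cause']])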

Selective multiplication of a pandas dataframe

I have a pandas Dataframe and Series of the form
df = pd.DataFrame({'Key':[2345,2542,5436,2468,7463],
                   'Segment':[0] * 5,
                   'Values':[2,4,6,6,4]})
print (df)
Key Segment Values
0 2345 0 2
1 2542 0 4
2 5436 0 6
3 2468 0 6
4 7463 0 4
s = pd.Series([5436, 2345])
print (s)
0 5436
1 2345
dtype: int64
In the original df, I want to multiply the 3rd column (Values) by 7 except for the keys which are present in the series. So my final df should look like:
Key Segment Values
0 2345 0 2
1 2542 0 28
2 5436 0 6
3 2468 0 42
4 7463 0 28
What should be the best way to achieve this in Python 3.x?
Use DataFrame.loc with Series.isin to filter the Values column, with the condition inverted for non-membership, and multiply by the scalar:
df.loc[~df['Key'].isin(s), 'Values'] *= 7
print (df)
Key Segment Values
0 2345 0 2
1 2542 0 28
2 5436 0 6
3 2468 0 42
4 7463 0 28
Another method could be using numpy.where():
df['Values'] *= np.where(~df['Key'].isin([5436, 2345]), 7,1)

Spotfire Consecutive Count

I am new to Spotfire so I hope that I ask this question correctly.
My table contains Corp_ID, Date and Flagged columns. The flagged column is either "1" or "0" based on if that Corp_ID had production on that date.
I need a custom expression that will return "0" if the flagged column is "0", BUT if the flagged column is "1" then I need it to return how many consecutive "1"s are in that string for that Corp_ID.
Corp_ID Date Flagged New Column
101 1/1/2016 1 1
101 1/2/2016 0 0
101 1/3/2016 1 4
101 1/4/2016 1 4
101 1/5/2016 1 4
101 1/6/2016 1 4
101 1/7/2016 0 0
101 1/8/2016 0 0
101 1/9/2016 1 2
101 1/10/2016 1 2
102 1/2/2016 1 3
102 1/3/2016 1 3
102 1/4/2016 1 3
102 1/5/2016 0 0
102 1/6/2016 0 0
102 1/7/2016 0 0
102 1/8/2016 1 4
102 1/9/2016 1 4
102 1/10/2016 1 4
102 1/11/2016 1 4
Thanks in advance for any assistance!
KC
This would be a lot easier to implement as part of the query you’re using to return the data, but if you have to do it in Spotfire, I suggest this.
1- Create a hierarchy column containing [Corp_ID] and [Date] (Named ‘DateHr’)
2- Add a calculated column named ‘Concat Flags’ which concatenates all the previous flag values: Concatenate([Flagged]) OVER (Intersect(Parent([Hierarchy.DateHr]),allPrevious([Hierarchy.DateHr])))
3- Add a calculated column which will return the number of 0’s in the Concat Flags field (Named ‘# of 0s’): Len([Concat Flags]) - Len(Substitute([Concat Flags],"0",""))
4- Add a hierarchy column containing [Corp_ID] and [# of 0s] (Named ‘CorpHr’)
5- Add a calculated column to return your desired value: case when [Flagged]=1 then Sum([Flagged]) OVER (Intersect([Hierarchy.CorpHr])) else 0 end
Note: the above assumes you are working in Spotfire version 7.5 (the syntax for using hierarchies in calculated columns differs slightly in earlier versions).
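For comparison only, here is the same consecutive-run logic expressed in pandas (not a Spotfire expression), using a small slice of the table above:
import pandas as pd

# Rows are assumed to be ordered by Corp_ID and Date, as in the table.
df = pd.DataFrame({
    'Corp_ID': [101, 101, 101, 101, 101, 101, 101],
    'Date': ['1/1/2016', '1/2/2016', '1/3/2016', '1/4/2016', '1/5/2016', '1/6/2016', '1/7/2016'],
    'Flagged': [1, 0, 1, 1, 1, 1, 0],
})

# A new run starts whenever Flagged changes within a Corp_ID.
run_id = (df['Flagged'] != df.groupby('Corp_ID')['Flagged'].shift()).cumsum()
# Length of the run each row belongs to; keep it only where Flagged is 1.
run_len = df.groupby(['Corp_ID', run_id])['Flagged'].transform('size')
df['New Column'] = run_len.where(df['Flagged'] == 1, 0)
print(df)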
