Highest frequency in a dataframe - python-3.x

I am looking for a way to get the most frequent value in an entire pandas DataFrame, not in a particular column. I have looked at value_counts, but it seems to work in a column-specific way. Is there any way to do that?

Use DataFrame.stack with Series.mode to get the most frequent values, then select the first one by position:
import pandas as pd

df = pd.DataFrame({
    'B': [4, 5, 4, 5, 4, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
})
a = df.stack().mode().iat[0]
print (a)
4
Or, if the frequency is also needed, use Series.value_counts:
s = df.stack().value_counts()
print (s)
4 6
5 4
3 3
9 2
7 2
2 2
1 2
8 1
6 1
0 1
dtype: int64
print (s.index[0])
4
print (s.iat[0])
6
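If you prefer to work on the raw values instead of stacking, a roughly equivalent sketch with numpy.unique (reusing df from above and assuming all columns are numeric, as in this example):
import numpy as np

# Flatten all values, count each distinct one, and take the most frequent
vals, counts = np.unique(df.to_numpy().ravel(), return_counts=True)
print(vals[counts.argmax()], counts.max())  # 4 6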

Related

Getting rows with minimum col2 given same col1 [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates which rows to keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
I had a similar situation, but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
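For intuition, here is a small self-contained sketch of what the intermediate idxmin step returns on the question's example frame:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})

# idxmin returns, per group of A, the index label of the row with the smallest B;
# df.loc then picks exactly those rows.
print(df.groupby('A')['B'].idxmin())
# A
# 1    2
# 2    4
# Name: B, dtype: int64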
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can easily be expanded to select the n rows with the smallest values in a specific column (see the sketch after this answer)
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
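As mentioned above, the same pattern extends to the n smallest rows per group; a small self-contained sketch with n=2 on the question's frame (recent pandas versions may emit a deprecation warning about apply touching the grouping column):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})

# Two rows with the smallest B per value of A
two_smallest = (df.sort_values('B')
                  .groupby('A')
                  .apply(pd.DataFrame.head, n=2)
                  .reset_index(drop=True))
print(two_smallest)
#    A  B   C
# 0  1  2  10
# 1  1  4   3
# 2  2  4   4
# 3  2  6   6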
I found an answer that is a bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First, we get the min values as a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this Series result onto the original DataFrame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we keep only the rows where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can use sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
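A note on this approach (a sketch reusing df from the question): drop_duplicates keeps the first row per A after the sort, i.e. the one with the smallest B; if deterministic tie-breaking among equal B values matters, a stable sort preserves the original row order:
# Keep the first (smallest-B) row per A; kind='stable' preserves input order on ties
out = (df.sort_values('B', kind='stable')
         .drop_duplicates('A')
         .reset_index(drop=True))
print(out)
#    A  B   C
# 0  1  2  10
# 1  2  4   4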
The solution is, as written before:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals the group-wise minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
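For intuition (reusing df from the question), transform('min') broadcasts each group's minimum back onto the original rows, so the comparison produces a row-aligned boolean mask:
# Each row receives its group's minimum B, aligned with df's index
print(df.groupby('A')['B'].transform('min'))
# 0    2
# 1    2
# 2    2
# 3    4
# 4    4
# 5    4
# Name: B, dtype: int64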

How can I transform this dataset in pandas so that it is easy to filter and compare?

I have the following DataFrame:
Segments Airline_pct_tesco Airline_pct_asda food_pct_tesco food_pct_asda Airline_diff food_diff
A 1 2 4 2 -1 2
B 2 2 4 4 0 0
c 10 5 12 10 5 2
I want to convert it to this format:
Segments Category Asda% Tesco% Diff%
A Airline 2 1 -1
b Food 4 4 0
c Airline 5 10 5
A Food 2 4 2
(only partially shown). Note:
Category is the column name without the '_pct_tesco', '_pct_asda', or '_diff' suffix.
I am unsure how to go about this - I have tried transform, but I just don't know how to get it into a form that is easy for any user to work with. I am doing this in pandas and am not sure how to even begin! The Asda% values come from the '_pct_asda' columns, and likewise for the Tesco% and Diff% columns respectively.
Let's try set_index to set aside the Segments column, then use str.extract on the column names with MultiIndex.from_frame to build a MultiIndex from the part before each suffix and the suffix itself, then stack to go to long form.
new_df = df.set_index('Segments')

# Define allowed suffixes here
suffixes = ['_pct_asda', '_pct_tesco', '_diff']

# Extract values and build a (Category, suffix) MultiIndex on the columns
new_df.columns = pd.MultiIndex.from_frame(
    new_df.columns.str.extract(rf'(.*?)({"|".join(suffixes)})'),
    names=['Category', None]
)

new_df = new_df.stack(0)
new_df:
_diff _pct_asda _pct_tesco
Segments Category
A Airline -1 2 1
food 2 2 4
B Airline 0 2 2
food 0 4 4
c Airline 5 5 10
food 2 10 12
To get cleaner output, add reset_index + rename to fix the column names and index, and also re-order the columns.
new_df = new_df.reset_index().rename(columns={
    '_pct_asda': 'Asda%',
    '_pct_tesco': 'Tesco%',
    '_diff': 'Diff%'
})[['Segments', 'Category', 'Asda%', 'Tesco%', 'Diff%']]
new_df:
Segments Category Asda% Tesco% Diff%
0 A Airline 2 1 -1
1 A food 2 4 2
2 B Airline 2 2 0
3 B food 4 4 0
4 c Airline 5 10 5
5 c food 10 12 2
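For clarity, a small self-contained sketch of what the str.extract step alone produces on these column names (the two capture groups split each name into the category and its suffix):
import pandas as pd

cols = pd.Index(['Airline_pct_tesco', 'Airline_pct_asda', 'food_pct_tesco',
                 'food_pct_asda', 'Airline_diff', 'food_diff'])
suffixes = ['_pct_asda', '_pct_tesco', '_diff']

# Non-greedy group 0 captures the category, group 1 the matching suffix,
# e.g. 'Airline_pct_tesco' -> ('Airline', '_pct_tesco'), 'food_diff' -> ('food', '_diff')
print(cols.str.extract(rf'(.*?)({"|".join(suffixes)})'))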

Replacing str with int for all the columns of a dataframe without making a dictionary for each column

Suppose I have the following dataframe:
d = {'col1': ['a','b','c','a','c','c','c','c','c','c'],
     'col2': ['a1','b1','c1','a1','c1','c1','c1','c1','c1','c1'],
     'col3': [1,2,3,2,3,3,3,3,3,3]}
data = pd.DataFrame(d)
I want to go through the categorical columns and replace the strings with integers. The usual way of doing this is:
col1 = {'a': 1,'b': 2, 'c':3}
data.col1 = [col1[item] for item in data.col1]
Namely, make a dictionary for each categorical column and do the replacement. But if you have many columns, making a dictionary for each one is time consuming, so I wonder if there is a better way of doing it, ideally without a dictionary at all. In this example col1 has only 3 distinct values, but with many more we would have to write the whole mapping by hand (say {'a': 1, 'b': 2, 'c': 3, ..., 'z': 26}). What is the most efficient way to go through all the categorical columns and replace the strings with numbers without building dictionaries column by column?
First get only the object columns with DataFrame.select_dtypes, then use factorize on each of them via DataFrame.apply:
cols = data.select_dtypes(object).columns
data[cols] = data[cols].apply(lambda x: pd.factorize(x)[0]) + 1
print (data)
col1 col2 col3
0 1 1 1
1 2 2 2
2 3 3 3
3 1 1 2
4 3 3 3
5 3 3 3
6 3 3 3
7 3 3 3
8 3 3 3
9 3 3 3
If possible, you could avoid the apply by using a dictionary comprehension in an assign expression (I feel the dictionary comprehension is going to be more efficient; I may be wrong):
values = {col: data[col].factorize()[0] + 1
          for col in data.select_dtypes(object)}
data.assign(**values)
col1 col2 col3
0 1 1 1
1 2 2 2
2 3 3 3
3 1 1 2
4 3 3 3
5 3 3 3
6 3 3 3
7 3 3 3
8 3 3 3
9 3 3 3
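If you also need to keep (and later invert) the mapping that was applied, the second value returned by factorize holds the original labels in code order; a sketch, assuming data still holds the original string columns from the question:
# factorize returns (codes, uniques); enumerate(uniques, 1) rebuilds the
# value -> integer mapping that corresponds to codes + 1 above
mappings = {}
for col in data.select_dtypes(object):
    codes, uniques = pd.factorize(data[col])
    data[col] = codes + 1
    mappings[col] = {val: i for i, val in enumerate(uniques, start=1)}

print(mappings['col1'])  # {'a': 1, 'b': 2, 'c': 3}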

Pandas DataFrame concat returns the same data as the first dataframe

I have this dataframe:
PNN_sh NN_shap PNN_corr NN_corr
1 25005 1 25005
2 25012 2 25001
3 25011 3 25009
4 25397 4 25445
5 25006 5 25205
Then I made 2 dataframes from this one:
NN_sh = data[['PNN_sh', 'NN_shap']]
NN_corr = data[['PNN_corr', 'NN_corr']]
Thereafter, I sorted them and saved the results in new dataframes:
NN_sh_sort = NN_sh.sort_values(by=['NN_shap'])
NN_corr_sort = NN_corr.sort_values(by=['NN_corr'])
Now I want to combine 2 columns from the 2 dataframes above:
all_pd = pd.concat([NN_sh_sort['PNN_sh'], NN_corr_sort['PNN_corr']], axis=1, join='inner')
But what I get is just the first column copied into the second one as well:
PNN_sh PNN_corr
1 1
5 5
3 3
2 2
4 4
The second column should be
PNN_corr
2
1
3
5
4
Any idea how to fix it? Thanks in advance
Pass ignore_index=True to sort_values():
NN_sh_sort = NN_sh.sort_values(by=['NN_shap'], ignore_index=True)
NN_corr_sort = NN_corr.sort_values(by=['NN_corr'], ignore_index=True)
Then the result after concat will be:
PNN_sh PNN_corr
0 1 2
1 5 1
2 3 3
3 2 5
4 4 4
I think when you sort you are preserving the original indices of the example DataFrames. Therefore, it is joining the PNN_corr value that was originally in the same row (at the same index). Try resetting the index of each DataFrame after sorting, then join/concat:
NN_sh_sort = NN_sh.sort_values(by=['NN_shap']).reset_index()
NN_corr_sort = NN_corr.sort_values(by=['NN_corr']).reset_index()
all_pd = pd.concat([NN_sh_sort['PNN_sh'], NN_corr_sort['PNN_corr']], axis=1, join='inner')
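The underlying reason, as a minimal self-contained sketch (not the question's data): pd.concat with axis=1 aligns on index labels rather than row positions, so Series that kept their pre-sort indices are matched back into their original pairing; resetting (or ignoring) the index switches the pairing to positional.
import pandas as pd

s1 = pd.Series([1, 5, 3, 2, 4], index=[0, 4, 2, 1, 3], name='left')
s2 = pd.Series([2, 1, 3, 5, 4], index=[1, 0, 2, 4, 3], name='right')

# Aligned by label: each row re-pairs the values that shared an original index
print(pd.concat([s1, s2], axis=1))

# Aligned by position once the old indices are dropped
print(pd.concat([s1.reset_index(drop=True),
                 s2.reset_index(drop=True)], axis=1))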

Complex group by using Pandas

I am facing a situation where I need to group a dataframe by a column 'ID' and also calculate the total time frame it took each particular ID to complete. I only want to calculate the difference between Date_Open and Date_Closed for each ID, together with the count of that ID.
We only need to focus on the Date_Open and Date_Closed fields, so it needs to take the max closing date and the min open date per ID and subtract the two.
The dataframe looks as follows:
ID Date_Open Date_Closed
1 01/01/2019 02/01/2019
1 07/01/2019 09/01/2019
2 10/01/2019 11/01/2019
2 13/01/2019 19/01/2019
3 10/01/2019 11/01/2019
The output should look like this :
ID Count_of_ID Total_Time_In_Days
1 2 8
2 2 9
3 1 1
How should I achieve this?
Use GroupBy with named aggregation and the min and max of the dates:
df[['Date_Open', 'Date_Closed']] = (
    df[['Date_Open', 'Date_Closed']].apply(lambda x: pd.to_datetime(x, format='%d/%m/%Y'))
)

dfg = df.groupby('ID').agg(
    Count_of_ID=('ID', 'size'),
    Date_Open=('Date_Open', 'min'),
    Date_Closed=('Date_Closed', 'max')
)
dfg['Total_Time_In_Days'] = dfg['Date_Closed'].sub(dfg['Date_Open']).dt.days
dfg = dfg.drop(columns=['Date_Closed', 'Date_Open']).reset_index()
ID Count_of_ID Total_Time_In_Days
0 1 2 8
1 2 2 9
2 3 1 1
Now we have Total_Time_In_Days as int:
print(dfg.dtypes)
ID int64
Count_of_ID int64
Total_Time_In_Days int64
dtype: object
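A more compact variant, just as a sketch (assuming Date_Open/Date_Closed are already parsed as datetimes as above; recent pandas versions may warn about apply touching the grouping column): compute both aggregates in a single apply that returns a Series per group.
out = (df.groupby('ID')
         .apply(lambda g: pd.Series({
             'Count_of_ID': len(g),
             'Total_Time_In_Days': (g['Date_Closed'].max() - g['Date_Open'].min()).days,
         }))
         .reset_index())
print(out)
#    ID  Count_of_ID  Total_Time_In_Days
# 0   1            2                   8
# 1   2            2                   9
# 2   3            1                   1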
This can also be used:
df['Date_Open'] = pd.to_datetime(df['Date_Open'], dayfirst=True)
df['Date_Closed'] = pd.to_datetime(df['Date_Closed'], dayfirst=True)
df_grouped = df.groupby(by='ID').count()
df_grouped['Total_Time_In_Days'] = df.groupby(by='ID')['Date_Closed'].max() - df.groupby(by='ID')['Date_Open'].min()
df_grouped = df_grouped.drop(columns=['Date_Open'])
df_grouped.columns=['Count', 'Total_Time_In_Days']
print(df_grouped)
Count Total_Time_In_Days
ID
1 2 8 days
2 2 9 days
3 1 1 days
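If integer days are preferred over the Timedelta display above, the column can be converted in place; a small follow-up to the snippet above:
# Convert the Timedelta column to whole days (int64)
df_grouped['Total_Time_In_Days'] = df_grouped['Total_Time_In_Days'].dt.days
print(df_grouped)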
I would first create a column depicting how much time passed from Date_Open to Date_Closed for each row of the dataframe, like this:
df['Total_Time_In_Days'] = df.Date_Closed - df.Date_Open
Then you can use groupby:
df.groupby('ID').agg({'ID': 'count', 'Total_Time_In_Days': 'sum'})
If you need any help with the .agg function, you can refer to its official documentation.
