Finding the largest (N) proportion of percentage in pandas dataframe - python-3.x

Suppose I have the following df:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['Sara', 'John', 'Christine', 'Paul', 'Jo', 'Zack', 'Chris', 'Mathew', 'Suzan'],
                   'visits': [0, 0, 1, 2, 3, 9, 6, 10, 3]})
df
looks like:
name visits
0 Sara 0
1 John 0
2 Christine 1
3 Paul 2
4 Jo 3
5 Zack 9
6 Chris 6
7 Mathew 10
8 Suzan 3
I wrote a few lines of code to get the percentage of visits per name and sort them in descending order:
df['percent'] = df['visits'] / np.sum(df['visits'])
df = df.sort_values(by='percent', ascending=False).reset_index(drop=True)
Now I have got the percent of visits to total visits by all names:
name visits percent
0 Mathew 10 0.294118
1 Zack 9 0.264706
2 Chris 6 0.176471
3 Jo 3 0.088235
4 Suzan 3 0.088235
5 Paul 2 0.058824
6 Christine 1 0.029412
7 Sara 0 0.000000
8 John 0 0.000000
What I need to get is the smallest group of names that accounts for most of the visits. For example, the first 3 rows represent ~73% of the total visits, and the others can be neglected compared to the combined percentage of those first 3 rows.
I know I can select the top 3 by using nlargest:
df.nlargest(3, 'percent')
But there is high variability in the data and the largest proportion could be the first 2 or 3 rows or even more.
EDIT:
How can I do this automatically, finding the largest N rows (by percentage) out of the total number of rows without hard-coding N?

You have to define outliers in some way. One way is to use scipy.stats.zscore like in this answer:
import pandas as pd
import numpy as np
from scipy import stats
df = pd.DataFrame({'name': ['Sara', 'John', 'Christine', 'Paul', 'Jo', 'Zack', 'Chris', 'Mathew', 'Suzan'],
                   'visits': [0, 0, 1, 2, 3, 9, 6, 10, 3]})
df['percent'] = (df['visits'] / np.sum(df['visits']))
df.loc[df['percent'][stats.zscore(df['percent']) > 0.6].index]
which prints
name visits percent
5 Zack 9 0.264706
6 Chris 6 0.176471
7 Mathew 10 0.294118
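If you would rather define the group by the share of total visits it covers than by z-scores, a cumulative-percentage cutoff is another option. A minimal sketch, assuming a hypothetical 70% target share:
import pandas as pd

df = pd.DataFrame({'name': ['Sara', 'John', 'Christine', 'Paul', 'Jo', 'Zack', 'Chris', 'Mathew', 'Suzan'],
                   'visits': [0, 0, 1, 2, 3, 9, 6, 10, 3]})
df['percent'] = df['visits'] / df['visits'].sum()

threshold = 0.7  # hypothetical cutoff; tune to your data
top = df.sort_values('percent', ascending=False).reset_index(drop=True)
# keep rows up to and including the one whose cumulative share first reaches the threshold
n = (top['percent'].cumsum() < threshold).sum() + 1
print(top.head(n))
With these numbers the first three rows (Mathew, Zack, Chris) are returned, but n adapts automatically when the distribution changes, instead of being hard-coded as in nlargest(3, 'percent').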

Related

Getting rows with minimum col2 given same col1 [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me which rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
I had a similar situation, but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
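For the single-row case the same pipe-friendly idea can also be written without apply, since GroupBy.head(1) keeps the first row of each group after sorting. A small sketch of that shortcut:
df.sort_values('B').groupby('A').head(1).reset_index(drop=True)
GroupBy.head does not reorder rows, so the preceding sort decides which row survives in each group.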
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this Series back onto the original DataFrame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
it may be because of NaN values. In my case, there were NaN values in column B, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals the group-wise minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4

list of visited intervals

Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-1', '2015-3-5', '2015-3-6', '2016-3-4', '2016-3-6', '2016-3-8'],
                   'name': ['John', 'John', 'John', 'Mary', 'Mary', 'Mary']})
df['visited'] = pd.to_datetime(df['visited'])
visited name
0 2015-03-01 John
1 2015-03-05 John
2 2015-03-06 John
3 2016-03-04 Mary
4 2016-03-06 Mary
5 2016-03-08 Mary
I wish to get the list of visit intervals (in days) for the two people; in this example, the outcome should be
avg_visited_interval name
0 [4,1] John
1 [2,2] Mary
How should I achieve this?
(e.g., for John there are 4 days between rows 0 and 1 and 1 day between rows 1 and 2, which results in [4, 1])
Use a custom lambda function with Series.diff, remove the first value by position, and convert to integers and then to lists:
df = (df.groupby('name')['visited']
        .apply(lambda x: x.diff().iloc[1:].dt.days.astype(int).tolist())
        .reset_index(name='intervals'))
print(df)
name intervals
0 John [4, 1]
1 Mary [2, 2]
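A vectorised variant that avoids the per-group lambda is also possible: compute the day gaps once with a grouped diff, then collect them into lists. A sketch, rebuilding df from the question:
import pandas as pd

df = pd.DataFrame({'visited': ['2015-3-1', '2015-3-5', '2015-3-6', '2016-3-4', '2016-3-6', '2016-3-8'],
                   'name': ['John', 'John', 'John', 'Mary', 'Mary', 'Mary']})
df['visited'] = pd.to_datetime(df['visited'])

# per-row gap (in days) to the previous visit of the same person; the first visit has no gap
df['interval'] = df.groupby('name')['visited'].diff().dt.days

out = (df.dropna(subset=['interval'])
         .astype({'interval': int})
         .groupby('name', as_index=False)['interval'].agg(list)
         .rename(columns={'interval': 'intervals'}))
print(out)
This produces the same [4, 1] and [2, 2] lists as the lambda version.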

taking top 3 in a groupby, and lumping rest into 'other category'

I am currently doing a groupby in pandas like this:
df.groupby(['grade'])['students'].nunique()
and the result I get is this:
grade
grade 1 12
grade 2 8
grade 3 30
grade 4 2
grade 5 600
grade 6 90
Is there a way to get the output such that I see the top 3 groups, and everything else is classified under "other"?
This is what I am looking for:
grade
grade 3 30
grade 5 600
grade 6 90
other (3 other grades) 22
I think you can add a helper column to the df and call it something like "Grouping".
Keep the original name for the top 3 rows, label the remaining rows as "other", and then group by the "Grouping" column; a sketch of this idea is shown below.
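To make that concrete, here is a minimal sketch of the helper-column idea (my illustration; counts stands in for the Series produced by the original groupby, with the numbers taken from the question):
import pandas as pd

# hypothetical result of the original groupby: unique students per grade
counts = pd.Series({'grade 1': 12, 'grade 2': 8, 'grade 3': 30,
                    'grade 4': 2, 'grade 5': 600, 'grade 6': 90})

top3 = counts.nlargest(3).index
# helper "Grouping" label: keep the top-3 names, lump everything else together
grouping = counts.index.where(counts.index.isin(top3),
                              'other ({} other grades)'.format(len(counts) - 3))
print(counts.groupby(grouping).sum())
This prints grade 3, grade 5 and grade 6 with their counts, plus an "other (3 other grades)" row summing to 22.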
Can't do much without the actual input data, but if this is your starting dataframe (df) after your groupby -
grade unique
0 grade_1 12
1 grade_2 8
2 grade_3 30
3 grade_4 2
4 grade_5 600
5 grade_6 90
You can do a few more steps to get to your table -
ddf = df.nlargest(3, 'unique')
# DataFrame.append was removed in pandas 2.x, so build the "Other" row and concat it
other = pd.DataFrame([{'grade': 'Other', 'unique': df['unique'].sum() - ddf['unique'].sum()}])
ddf = pd.concat([ddf, other], ignore_index=True)
grade unique
0 grade_5 600
1 grade_6 90
2 grade_3 30
3 Other 22

Increasing iteration speed

Good afternoon,
I'm iterating through a huge Dataframe (104062 x 20) with the following code:
import pandas as pd

df_tot = pd.read_csv("C:\\Users\\XXXXX\\Desktop\\XXXXXXX\\LOGS\\DF_TOT.txt", header=None)

df_tot = df_tot.replace(r"\[", "", regex=True)
df_tot = df_tot.replace(r"\]", "", regex=True)
df_tot = df_tot.replace(r"\'", "", regex=True)

i = 0
while i < len(df_tot):
    to_compare = df_tot.iloc[i].tolist()
    for j in range(len(df_tot)):
        if to_compare == df_tot.iloc[j].tolist():
            if i == j:
                print('Matched itself.')
            else:
                print('MATCH FOUND - row: {} --- match row: {}'.format(i, j))
    i += 1
I am looking to optimize the time spent on each iteration as much as possible, since this code performs 104062² comparisons (roughly ten billion iterations).
With my computing power, one pass comparing to_compare against the whole DataFrame takes around 26 seconds.
I want to clarify that, if needed, the whole code could be replaced with faster constructs.
As usual, Thanks in advance.
As far as I understand, you just want to find duplicated rows.
Sample data (the last 2 rows duplicate earlier rows):
In [1]: df = pd.DataFrame([[1,2], [3,4], [5,6], [7,8], [1,2], [5,6]], columns=['a', 'b'])
df
Out[1]:
a b
0 1 2
1 3 4
2 5 6
3 7 8
4 1 2
5 5 6
This will return all duplicated rows:
In [2]: df[df.duplicated(keep=False)]
Out[2]:
a b
0 1 2
2 5 6
4 1 2
5 5 6
And indexes, grouped by duplicated row:
In [3]: df[df.duplicated(keep=False)].reset_index().groupby(list(df.columns), as_index=False)['index'].apply(list)
Out[3]:
a  b
1  2    [0, 4]
5  6    [2, 5]
You can also just remove duplicates from the dataframe:
In [4]: df.drop_duplicates()
Out[4]:
a b
0 1 2
1 3 4
2 5 6
3 7 8

How to remove the repeated row spanning two dataframe indexes in python

I have a dataframe as follow:
import pandas as pd
d = {'location1': [1, 2, 3, 8, 6], 'location2': [2, 1, 4, 6, 8]}
df = pd.DataFrame(data=d)
Each row of the dataframe df means there is a road between the two locations. It looks like:
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
The first row means there is a road between location1 and location2; however, the second row encodes the same information. The fourth and fifth rows also carry repeated information. I am trying to remove those repeated rows, keeping only one row per pair. Either row is okay.
For example, my expected output is
location1 location2
0 1 2
2 3 4
4 6 8
Is there any efficient way to do that? I have a large dataframe with lots of repeated rows.
Thanks a lot,
It looks like you want every other row in your dataframe. This should work.
import pandas as pd
d = {'location1': [1, 2, 3, 8, 6], 'location2': [2, 1, 4, 6, 8]}
df = pd.DataFrame(data=d)
print(df)
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
def Every_other_row(a):
    return a[::2]

Every_other_row(df)
location1 location2
0 1 2
2 3 4
4 6 8
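The slicing trick works here only because each duplicated pair happens to sit on adjacent rows. A more general sketch (my addition, assuming a road a-b is the same as a road b-a wherever the rows appear) normalises each pair first and then drops duplicates:
import pandas as pd
import numpy as np

d = {'location1': [1, 2, 3, 8, 6], 'location2': [2, 1, 4, 6, 8]}
df = pd.DataFrame(data=d)

# sort each pair so (2, 1) becomes (1, 2), then drop rows whose sorted pair already appeared
pairs = pd.DataFrame(np.sort(df[['location1', 'location2']].to_numpy(), axis=1),
                     columns=['location1', 'location2'], index=df.index)
print(df[~pairs.duplicated()])
This keeps the first occurrence of each pair (rows 0, 2 and 3 here); the question states that either row of a pair is acceptable.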
