Filter rows based on the count of unique values - python-3.x

I need to count how often each value occurs in column A and keep only the rows whose value occurs at least, say, 2 times.
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the counts with df.value_counts(), but I can't figure out how to filter on them.
I want to keep the rows whose value in column A occurs at least 2 times. Expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1

value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Mango 1
Orange 2
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
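As a self-contained sketch of the steps above, the counting and filtering can be wrapped in a small helper. The function name filter_by_count and its parameters are made up for illustration; the data is the sample from the question.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": ["Apple", "Orange", "Apple", "Mango", "Orange"],
    "C": [4, 5, 3, 5, 1],
})


def filter_by_count(frame, column, min_count):
    """Keep rows whose value in `column` occurs at least `min_count` times."""
    counts = frame[column].value_counts()
    # Values that are frequent enough end up in the index of the filtered counts.
    keep = counts[counts >= min_count].index
    return frame[frame[column].isin(keep)]


result = filter_by_count(df, "A", 2)
print(result)
```

The original row labels are preserved, so the Mango row (label 3) is simply missing from the result.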

Use duplicated with the parameter keep=False, which marks every row whose value in A occurs more than once:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
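Note that duplicated only covers a threshold of 2. For a different threshold, one option (not part of the answer above, just a sketch) is to compare the per-group size obtained with groupby and transform. The extra Apple row below is added to the question's data so that the two thresholds give different results.

```python
import pandas as pd

# Question's data plus one extra Apple row (added for illustration).
df = pd.DataFrame({
    "A": ["Apple", "Orange", "Apple", "Mango", "Orange", "Apple"],
    "C": [4, 5, 3, 5, 1, 7],
})

# duplicated(keep=False) flags every row whose value appears more than once.
at_least_two = df[df.duplicated(["A"], keep=False)]

# For any other threshold, broadcast the group size back to each row.
group_size = df.groupby("A")["A"].transform("size")
at_least_three = df[group_size >= 3]

print(at_least_two)
print(at_least_three)
```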

Related

How to find the total length of a column value that has multiple values in different rows for another column

Is there a way to find the IDs that have both Apple and Strawberry and count them? And likewise the IDs that have only Apple, and the IDs that have only Strawberry?
df:
ID Fruit
0 ABC Apple <-ABC has Apple and Strawberry
1 ABC Strawberry <-ABC has Apple and Strawberry
2 EFG Apple <-EFG has Apple only
3 XYZ Apple <-XYZ has Apple and Strawberry
4 XYZ Strawberry <-XYZ has Apple and Strawberry
5 CDF Strawberry <-CDF has Strawberry
6 AAA Apple <-AAA has Apple only
Desired output:
Length of IDs that has Apple and Strawberry: 2
Length of IDs that has Apple only: 2
Length of IDs that has Strawberry: 1
Thanks!
If the Fruit column only ever contains Apple or Strawberry, you can compare each group's set of values against the full set and count the matching IDs by summing the True values:
v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print (out)
2
EDIT: If there are many values:
s = df.groupby('ID')['Fruit'].agg(frozenset).value_counts()
print (s)
{Apple} 2
{Strawberry, Apple} 2
{Strawberry} 1
Name: Fruit, dtype: int64
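To get the three numbers from the desired output, the per-ID sets can be tallied and looked up by combination. This is only a sketch: it builds the sets with agg(frozenset) as above, but counts them with collections.Counter instead of value_counts so that each combination can be looked up as a plain dictionary key. The variable names are illustrative.

```python
from collections import Counter

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": ["ABC", "ABC", "EFG", "XYZ", "XYZ", "CDF", "AAA"],
    "Fruit": ["Apple", "Strawberry", "Apple", "Apple",
              "Strawberry", "Strawberry", "Apple"],
})

# One frozenset of fruits per ID; frozensets are hashable, so they can be counted.
sets_per_id = df.groupby("ID")["Fruit"].agg(frozenset)
combo_counts = Counter(sets_per_id.tolist())

both = combo_counts[frozenset({"Apple", "Strawberry"})]
apple_only = combo_counts[frozenset({"Apple"})]
strawberry_only = combo_counts[frozenset({"Strawberry"})]

print("Length of IDs that has Apple and Strawberry:", both)
print("Length of IDs that has Apple only:", apple_only)
print("Length of IDs that has Strawberry:", strawberry_only)
```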
You can use pivot_table and then DataFrame.value_counts (pandas 1.1.0 or later):
df.pivot_table(index='ID', columns='Fruit', aggfunc='size', fill_value=0)\
.value_counts()
Output:
Apple Strawberry
1 1 2
0 2
0 1 1
Alternatively you can use:
df.groupby(['ID', 'Fruit']).size().unstack('Fruit', fill_value=0)\
.value_counts()

Sum of all rows based on specific column values

I have a df like this:
Index Parameters A B C D E
1 Apple 1 2 3 4 5
2 Banana 2 4 5 3 5
3 Potato 3 5 3 2 1
4 Tomato 1 1 1 1 1
5 Pear 4 5 5 4 3
I want to add up all the rows whose Parameters value is "Apple", "Banana" or "Pear".
Output:
Index Parameters A B C D E
1 Apple 1 2 3 4 5
2 Banana 2 4 5 3 5
3 Potato 3 5 3 2 1
4 Tomato 1 1 1 1 1
5 Pear 4 5 5 4 3
6 Total 7 11 13 11 13
My Effort:
df[:,'Total'] = df.sum(axis=1) -- Works but I want specific values only and not all
Tried by the index in my case 1,2 and 5 but in my original df the index can vary from time to time and hence rejected that solution.
Saw various answers on SO but none of them could solve my problem!!
The first idea is to set the Parameters column as the index, select the rows to sum, and finally convert the index back to a column:
L = ["Apple" , "Banana" , "Pear"]
df = df.set_index('Parameters')
df.loc['Total'] = df.loc[L].sum()
df = df.reset_index()
print (df)
Parameters A B C D E
0 Apple 1 2 3 4 5
1 Banana 2 4 5 3 5
2 Potato 3 5 3 2 1
3 Tomato 1 1 1 1 1
4 Pear 4 5 5 4 3
5 Total 7 11 13 11 13
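Or append a new row holding the sum of the rows selected with Series.isin, then overwrite the Parameters value of that row with Total. The new label is taken from the largest existing index label, because len(df) would collide with an existing label (and overwrite the Pear row) when the index starts at 1:
last = df.index.max() + 1
df.loc[last] = df[df['Parameters'].isin(L)].sum()
df.loc[last, 'Parameters'] = 'Total'
print (df)
Parameters A B C D E
Index
1 Apple 1 2 3 4 5
2 Banana 2 4 5 3 5
3 Potato 3 5 3 2 1
4 Tomato 1 1 1 1 1
5 Pear 4 5 5 4 3
6 Total 7 11 13 11 13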
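Another similar solution sums every column except the first for the filtered rows and prepends 'Total' as a one-element list. Here too the new label comes from the largest existing index label, so no existing row is overwritten:
df.loc[df.index.max() + 1] = ['Total'] + df.iloc[df['Parameters'].isin(L).values, 1:].sum().tolist()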
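For a runnable end-to-end version, here is a sketch that builds the question's frame (with its 1-based index) and appends the total with pd.concat. Using concat instead of loc enlargement is a swapped-in variant, not the method from the answer above; the names wanted, totals and total_row are illustrative.

```python
import pandas as pd

# Sample data from the question, including its 1-based index.
df = pd.DataFrame(
    {
        "Parameters": ["Apple", "Banana", "Potato", "Tomato", "Pear"],
        "A": [1, 2, 3, 1, 4],
        "B": [2, 4, 5, 1, 5],
        "C": [3, 5, 3, 1, 5],
        "D": [4, 3, 2, 1, 4],
        "E": [5, 5, 1, 1, 3],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Index"),
)

wanted = ["Apple", "Banana", "Pear"]

# Sum only the numeric columns of the selected rows.
totals = df.loc[df["Parameters"].isin(wanted)].sum(numeric_only=True)

# Build a one-row frame labelled with the next free index value.
total_row = pd.DataFrame(
    [{"Parameters": "Total", **totals.to_dict()}],
    index=pd.Index([df.index.max() + 1], name="Index"),
)

result = pd.concat([df, total_row])
print(result)
```

Because the total is built as a separate one-row frame, the numeric columns keep their integer dtype and no existing row can be overwritten.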

Add rows according to other rows

My DataFrame object similar to this one:
Product StoreFrom StoreTo Date
1 out melon StoreQ StoreP 20170602
2 out cherry StoreW StoreO 20170614
3 out Apple StoreE StoreU 20170802
4 in Apple StoreE StoreU 20170812
I want to avoid duplicates: the 3rd and 4th rows describe the same movement. I am trying to get:
Product StoreFrom StoreTo Date Days
1 out melon StoreQ StoreP 20170602
2 out cherry StoreW StoreO 20170614
5 in Apple StoreE StoreU 20170812 10
I have more than 10k entries and could not find a similar question. Any help would be very useful.
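# cols is not defined in the original snippet; these key columns are assumed
cols = ['Product', 'StoreFrom', 'StoreTo']
d1 = df.assign(Date=pd.to_datetime(df.Date.astype(str)))
# time elapsed since the first row of each group
d2 = d1.assign(Days=d1.Date - d1.groupby(cols).Date.transform('first'))
d2.drop_duplicates(cols, keep='last')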
io Product StoreFrom StoreTo Date Days
1 out melon StoreQ StoreP 2017-06-02 0 days
2 out cherry StoreW StoreO 2017-06-14 0 days
4 in Apple StoreE StoreU 2017-08-12 10 days
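A self-contained sketch of the same idea on the question's data follows. It assumes the first column is named io (as in the output above) and that Product, StoreFrom and StoreTo together identify one movement; it also stores Days as a plain integer rather than a timedelta, which is a small departure from the output shown.

```python
import pandas as pd

# Sample data from the question; the "io" column name is taken from the output above.
df = pd.DataFrame(
    {
        "io": ["out", "out", "out", "in"],
        "Product": ["melon", "cherry", "Apple", "Apple"],
        "StoreFrom": ["StoreQ", "StoreW", "StoreE", "StoreE"],
        "StoreTo": ["StoreP", "StoreO", "StoreU", "StoreU"],
        "Date": [20170602, 20170614, 20170802, 20170812],
    },
    index=[1, 2, 3, 4],
)

# Assumed key: rows sharing these values describe the same movement.
cols = ["Product", "StoreFrom", "StoreTo"]

# Parse the integer dates, e.g. 20170602 -> 2017-06-02.
d1 = df.assign(Date=pd.to_datetime(df["Date"].astype(str), format="%Y%m%d"))

# Days elapsed since the first row of each group.
first_date = d1.groupby(cols)["Date"].transform("first")
d2 = d1.assign(Days=(d1["Date"] - first_date).dt.days)

# Keep only the last row of each group.
result = d2.drop_duplicates(cols, keep="last")
print(result)
```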

Append Two Dataframes Together (Pandas, Python3)

I am trying to append/join(?) two different dataframes together that don't share any overlapping data.
DF1 looks like
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
....
Brown 6
and DF2 looks like
Area Miles
2 3
1 2
....
7 12
I am trying to append these together using
bigdata = df1.append(df2,ignore_index = True).reset_index()
but I get this
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
Area Miles
2 3
1 2
How do I get something like this?
Teams Points Area Miles
Red 2 2 3
Green 1 1 2
Orange 3
Yellow 4
EDIT: in regards to Edchum's answers, I have tried merge and join but each create somewhat strange tables. Instead of what I am looking for (as listed above) it will return something like this:
Teams Points Area Miles
Red 2 2 3
Green 1
Orange 3 1 2
Yellow 4
Use concat and pass param axis=1:
In [4]:
pd.concat([df1,df2], axis=1)
Out[4]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
join also works:
In [8]:
df1.join(df2)
Out[8]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
As does merge:
In [11]:
df1.merge(df2,left_index=True, right_index=True, how='left')
Out[11]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
EDIT
If the indices do not align, for example when the first df has index [0,1,2,3] and the second df has index [0,2], the operations above align on the first df's index, so the second df's rows land at index 0 and 2 and row 1 is filled with NaN. To fix this, reindex the second df, either by calling reset_index(drop=True) or by assigning directly: df2.index = [0,1].
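The misalignment case can be reproduced with a small sketch. The frames below are cut-down versions of the question's data, and the index [0, 2] on df2 is chosen to match the example in the paragraph above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Teams": ["Red", "Green", "Orange", "Yellow"],
    "Points": [2, 1, 3, 4],
})
# df2 carries a non-contiguous index, as in the example above.
df2 = pd.DataFrame({"Area": [2, 1], "Miles": [3, 2]}, index=[0, 2])

# Aligns on index labels: df2's second row lands next to Orange (label 2).
misaligned = pd.concat([df1, df2], axis=1)

# Dropping the old index first makes the rows line up positionally.
aligned = pd.concat([df1, df2.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```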

excel/vba - find first and last occurrence of a particular value in a column

So if I have a column such as:
A1
1 Apple
2 Apple
3 Apple
4 Oj
5 Oj
6 Oj
7 Oj
8 Pear
9 Pear
How could I return the values 1 & 3 for Apple, 4 & 7 for OJ, etc?
Formula-wise you can use MATCH functions, e.g. for first Apple position
=MATCH("Apple",A1:A9,0)
for last
=MATCH(2,INDEX(1/(A1:A9="Apple"),0))
or, if the fruits are sorted as in your example (or merely grouped), you can get the last position by adding the number of apples to the first position and subtracting 1.
So with the first MATCH function in C1, that would be
=COUNTIF(A1:A9,"Apple")+C1-1
