How to find the total length of a column value that has multiple values in different rows for another column - python-3.x

Is there a way to find the IDs that have both Apple and Strawberry and count them? And likewise count the IDs that have only Apple, and the IDs that have only Strawberry?
df:
ID Fruit
0 ABC Apple <-ABC has Apple and Strawberry
1 ABC Strawberry <-ABC has Apple and Strawberry
2 EFG Apple <-EFG has Apple only
3 XYZ Apple <-XYZ has Apple and Strawberry
4 XYZ Strawberry <-XYZ has Apple and Strawberry
5 CDF Strawberry <-CDF has Strawberry
6 AAA Apple <-AAA has Apple only
Desired output:
Length of IDs that has Apple and Strawberry: 2
Length of IDs that has Apple only: 2
Length of IDs that has Strawberry: 1
Thanks!

If the Fruit column only ever contains Apple or Strawberry, you can compare sets per group and then count the matching IDs by summing the True values:
v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print (out)
2
EDIT: If there are many possible values:
s = df.groupby('ID')['Fruit'].agg(frozenset).value_counts()
print (s)
{Apple} 2
{Strawberry, Apple} 2
{Strawberry} 1
Name: Fruit, dtype: int64
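To get all three counts from the question, the set comparison above can be wrapped in a small helper. This is only a sketch: the function name count_ids_with_exactly is hypothetical, and the DataFrame is rebuilt from the sample data in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'EFG', 'XYZ', 'XYZ', 'CDF', 'AAA'],
    'Fruit': ['Apple', 'Strawberry', 'Apple', 'Apple',
              'Strawberry', 'Strawberry', 'Apple'],
})


def count_ids_with_exactly(frame, wanted):
    """Count IDs whose set of fruits equals `wanted` exactly (hypothetical helper)."""
    wanted = set(wanted)
    # One boolean per ID: does this ID's set of fruits match the wanted set?
    matches = frame.groupby('ID')['Fruit'].apply(lambda fruits: set(fruits) == wanted)
    return int(matches.sum())


both = count_ids_with_exactly(df, ['Apple', 'Strawberry'])
apple_only = count_ids_with_exactly(df, ['Apple'])
strawberry_only = count_ids_with_exactly(df, ['Strawberry'])

print('IDs with Apple and Strawberry:', both)
print('IDs with Apple only:', apple_only)
print('IDs with Strawberry only:', strawberry_only)
```

Because the comparison is on exact sets, an ID with a third fruit would not be counted in any of the three groups.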

You can use pivot_table together with DataFrame.value_counts (available since pandas 1.1.0):
df.pivot_table(index='ID', columns='Fruit', aggfunc='size', fill_value=0)\
.value_counts()
Output:
Apple  Strawberry
1      1             2
       0             2
0      1             1
Alternatively you can use:
df.groupby(['ID', 'Fruit']).size().unstack('Fruit', fill_value=0)\
.value_counts()
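If you need the three numbers as separate values rather than as a MultiIndex Series, one option is to turn the per-ID table into boolean presence flags. This is a sketch under the assumption that only the Apple and Strawberry columns matter; the variable names are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'EFG', 'XYZ', 'XYZ', 'CDF', 'AAA'],
    'Fruit': ['Apple', 'Strawberry', 'Apple', 'Apple',
              'Strawberry', 'Strawberry', 'Apple'],
})

# One row per ID, one column per fruit, cell = number of occurrences.
table = df.groupby(['ID', 'Fruit']).size().unstack('Fruit', fill_value=0)

has_apple = table['Apple'] > 0
has_strawberry = table['Strawberry'] > 0

both = int((has_apple & has_strawberry).sum())
apple_only = int((has_apple & ~has_strawberry).sum())
strawberry_only = int((~has_apple & has_strawberry).sum())

print(both, apple_only, strawberry_only)
```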

Related

Given a column value, check if another column value is present in preceding or next 'n' rows in a Pandas data frame

I have the following data
jsonDict = {'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple','banana'], 'price': [1, 2, 1, 3, 2, 1, 3]}
Fruit price
0 apple 1
1 orange 2
2 apple 1
3 banana 3
4 orange 2
5 apple 1
6 banana 3
What I want to do is check if Fruit == banana and if yes, I want the code to scan the preceding as well as the next n rows from the index position of the 'banana' row, for an instance where Fruit == apple. An example of the expected output is shown below taking n=2.
Fruit price
2 apple 1
5 apple 1
I have tried doing
position = df[df['Fruit'] == 'banana'].index
resultdf= df.loc[((df.index).isin(position)) & (((df['Fruit'].index+2).isin(['apple']))|((df['Fruit'].index-2).isin(['apple'])))]
# Output is an empty dataframe
Empty DataFrame
Columns: [Fruit, price]
Index: []
Preference will be given to vectorized approaches.
IIUC, you can use 2 masks and boolean indexing:
# df = pd.DataFrame(jsonDict)
n = 2
m1 = df['Fruit'].eq('banana')
# is the row ±n of a banana?
m2 = m1.rolling(2*n+1, min_periods=1, center=True).max().eq(1)
# is the row an apple?
m3 = df['Fruit'].eq('apple')
out = df[m2&m3]
output:
Fruit price
2 apple 1
5 apple 1
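For reference, here is the same idea as a self-contained sketch with n as a parameter. The function name apples_near_bananas is hypothetical, and the cast to int before rolling is an added precaution rather than part of the answer above.

```python
import pandas as pd


def apples_near_bananas(df, n):
    """Return rows where Fruit is 'apple' and a 'banana' row lies within n rows."""
    is_banana = df['Fruit'].eq('banana')
    # A centered window of 2n+1 rows covers positions i-n .. i+n;
    # its max is 1 when at least one banana falls inside the window.
    near_banana = (
        is_banana.astype(int)
        .rolling(2 * n + 1, min_periods=1, center=True)
        .max()
        .eq(1)
    )
    is_apple = df['Fruit'].eq('apple')
    return df[near_banana & is_apple]


# Sample data copied from the question.
df = pd.DataFrame({
    'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple', 'banana'],
    'price': [1, 2, 1, 3, 2, 1, 3],
})

result = apples_near_bananas(df, n=2)
print(result)
```

This relies on row position, so it assumes the rows are already in the order you want to scan.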

On SQL with one-to-many merging and many as a narrowing condition

I am using SQLAlchemy.
Parent table
id name
1 sea bass
2 Tanaka
3 Mike
4 Louis
5 Jack
Child table
id user_id pname number
1 1 Apples 2
2 1 Banana 1
3 1 Grapes 3
4 2 Apples 2
5 2 Banana 2
6 2 Grapes 1
7 3 Strawberry 5
8 3 Banana 3
9 3 Grapes 1
I want to sort by parent id with apples and number of bananas, but when I search for "parent id with apples", the search is filtered and the bananas disappear. I have searched for a way to achieve this, but have not been able to find it.
Thank you in advance for your help.

Compare values in two different pandas columns

I have a dataframe that looks like this:
Fruit Cost Quantity Fruit_Copy
Apple 0.5 6 Watermelon
Orange 0.3 2 Orange
Apple 0.5 8 Apple
Apple 0.5 7 Apple
Banana 0.25 8 Banana
Banana 0.25 7 Banana
Apple 0.5 6 Apple
Apple 0.5 3 Apple
I want to write a snippet that, in pandas, compares Fruit and Fruit_Copy and outputs a new column "Match" that indicates if the values in Fruit = Fruit_Copy.
Thanks in advance!
Let's say your DataFrame is named fruits. Then you can use the pandas Series.eq method:
fruits['Match'] = pd.Series.eq(fruits['Fruit'],fruits['Fruit_Copy'])
Something like this would work.
df.loc[df['Fruit'] == df['Fruit_Copy'], 'Match'] = 'Yes'
Using numpy.where:
df['Match'] = np.where(df['Fruit'] == df['Fruit_Copy'], 'Yes', 'No')
You could try something like this:
import pandas as pd
import numpy as np
fruits = pd.DataFrame({'Fruit':['Apple', 'Orange', 'Apple', 'Apple', 'Banana', 'Banana', 'Apple', 'Apple'], 'Cost':[0.5,0.3,0.5,0.5,0.25,0.25,0.5,0.5], 'Quantity':[6,2,8,7,8,7,6,3], 'Fruit_Copy':['Watermelon', 'Orange', 'Apple', 'Apple', 'Banana', 'Banana', 'Apple', 'Apple']})
fruits['Match'] = np.where(fruits['Fruit'] == fruits['Fruit_Copy'], 1, 0)
fruits
Fruit Cost Quantity Fruit_Copy Match
0 Apple 0.50 6 Watermelon 0
1 Orange 0.30 2 Orange 1
2 Apple 0.50 8 Apple 1
3 Apple 0.50 7 Apple 1
4 Banana 0.25 8 Banana 1
5 Banana 0.25 7 Banana 1
6 Apple 0.50 6 Apple 1
7 Apple 0.50 3 Apple 1
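The answers above differ in what the Match column ends up containing. The sketch below puts the three variants side by side on a three-row subset of the question's data (chosen here only for illustration); the variable names are not from the original answers.

```python
import numpy as np
import pandas as pd

# Three rows taken from the question's data, for illustration only.
fruits = pd.DataFrame({
    'Fruit': ['Apple', 'Orange', 'Banana'],
    'Fruit_Copy': ['Watermelon', 'Orange', 'Banana'],
})

same = fruits['Fruit'].eq(fruits['Fruit_Copy'])

# Variant 1: boolean column (True / False).
as_bool = fruits.assign(Match=same)

# Variant 2: labels for both outcomes via numpy.where.
as_label = fruits.assign(Match=np.where(same, 'Yes', 'No'))

# Variant 3: .loc assignment only fills matching rows; the rest stay NaN.
as_partial = fruits.copy()
as_partial.loc[same, 'Match'] = 'Yes'

print(as_bool, as_label, as_partial, sep='\n\n')
```

If non-matching rows should carry an explicit value, the numpy.where or boolean variants are probably the more convenient choice.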

Filter rows based on the count of unique values

I need to count how often each value occurs in column A and keep only the rows whose value occurs at least, say, 2 times.
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the counts with df.value_counts(), but I cannot figure out how to filter on them.
I want to keep the rows whose value in column A occurs at least 2 times. Expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Mango 1
Orange 2
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Use duplicated with parameter keep=False:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
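Another option, not taken from the answers above, is groupby with transform('size'), which attaches each value's count to every row and works for any threshold. A minimal sketch, with min_count as an illustrative parameter:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'A': ['Apple', 'Orange', 'Apple', 'Mango', 'Orange'],
    'C': [4, 5, 3, 5, 1],
})

min_count = 2  # keep values of A that occur at least this many times

# Series aligned with df: for each row, how many rows share its value of A.
group_size = df.groupby('A')['A'].transform('size')
filtered = df[group_size >= min_count]

print(filtered)
```

Unlike duplicated(keep=False), which only distinguishes "occurs once" from "occurs more than once", this lets you raise the threshold to 3 or more.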

Add rows according to other rows

My DataFrame object similar to this one:
Product StoreFrom StoreTo Date
1 out melon StoreQ StoreP 20170602
2 out cherry StoreW StoreO 20170614
3 out Apple StoreE StoreU 20170802
4 in Apple StoreE StoreU 20170812
I want to avoid duplicates: the 3rd and 4th rows describe the same movement. I am trying to get:
Product StoreFrom StoreTo Date Days
1 out melon StoreQ StoreP 20170602
2 out cherry StoreW StoreO 20170614
5 in Apple StoreE StoreU 20170812 10
I have more than 10k entries and could not find a similar question. Any help would be very useful.
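# assumption: these columns identify the same movement (cols is not defined in the original answer)
cols = ['Product', 'StoreFrom', 'StoreTo']
d1 = df.assign(Date=pd.to_datetime(df.Date.astype(str)))
# time elapsed since the first row of each group
d2 = d1.assign(Days=d1.Date - d1.groupby(cols).Date.transform('first'))
d2.drop_duplicates(cols, keep='last')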
io Product StoreFrom StoreTo Date Days
1 out melon StoreQ StoreP 2017-06-02 0 days
2 out cherry StoreW StoreO 2017-06-14 0 days
4 in Apple StoreE StoreU 2017-08-12 10 days
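A self-contained sketch of the same approach follows. It assumes that the unnamed first column is called io (as in the output above) and that Product, StoreFrom and StoreTo together identify a movement; both are assumptions, since the question does not state them. Days is stored as an integer here rather than a timedelta.

```python
import pandas as pd

# Sample data copied from the question; the 'io' column name is an assumption.
df = pd.DataFrame({
    'io': ['out', 'out', 'out', 'in'],
    'Product': ['melon', 'cherry', 'Apple', 'Apple'],
    'StoreFrom': ['StoreQ', 'StoreW', 'StoreE', 'StoreE'],
    'StoreTo': ['StoreP', 'StoreO', 'StoreU', 'StoreU'],
    'Date': [20170602, 20170614, 20170802, 20170812],
}, index=[1, 2, 3, 4])

# Assumption: these columns identify one movement.
cols = ['Product', 'StoreFrom', 'StoreTo']

d1 = df.assign(Date=pd.to_datetime(df['Date'].astype(str), format='%Y%m%d'))
# Whole days elapsed since the first row of each group.
first_date = d1.groupby(cols)['Date'].transform('first')
d1['Days'] = (d1['Date'] - first_date).dt.days

# Keep only the most recent row of each movement.
result = d1.drop_duplicates(cols, keep='last')
print(result)
```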
