I have a dataframe that looks like this:
Fruit Cost Quantity Fruit_Copy
Apple 0.5 6 Watermelon
Orange 0.3 2 Orange
Apple 0.5 8 Apple
Apple 0.5 7 Apple
Banana 0.25 8 Banana
Banana 0.25 7 Banana
Apple 0.5 6 Apple
Apple 0.5 3 Apple
I want to write a pandas snippet that compares Fruit and Fruit_Copy and outputs a new column "Match" indicating whether the two values are equal in each row.
Thanks in advance!
Let's say your dataframe is called 'fruits'. Then you can use the element-wise equality method pd.Series.eq:
fruits['Match'] = pd.Series.eq(fruits['Fruit'],fruits['Fruit_Copy'])
Something like this would work.
df.loc[df['Fruit'] == df['Fruit_Copy'], 'Match'] = 'Yes'
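One thing to watch with the .loc assignment: rows where the two columns differ are never assigned, so they end up as NaN in 'Match'. A minimal sketch, assuming the frame is called df as above and that you want 'No' for the non-matching rows (the 'No' label is my assumption, not part of the answer), is to set a default first:

```python
import pandas as pd

# Three rows taken from the question, reduced to the two compared columns.
df = pd.DataFrame({
    'Fruit': ['Apple', 'Orange', 'Apple'],
    'Fruit_Copy': ['Watermelon', 'Orange', 'Apple'],
})

# Fill every row with the default label first ...
df['Match'] = 'No'
# ... then overwrite only the rows where both columns agree.
df.loc[df['Fruit'] == df['Fruit_Copy'], 'Match'] = 'Yes'

print(df)
```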
Using numpy.where:
df['Match'] = np.where(df['Fruit'] == df['Fruit_Copy'], 'Yes', 'No')
You could try something like this:
import pandas as pd
import numpy as np
fruits = pd.DataFrame({'Fruit':['Apple', 'Orange', 'Apple', 'Apple', 'Banana', 'Banana', 'Apple', 'Apple'], 'Cost':[0.5,0.3,0.5,0.5,0.25,0.25,0.5,0.5], 'Quantity':[6,2,8,7,8,7,6,3], 'Fruit_Copy':['Watermelon', 'Orange', 'Apple', 'Apple', 'Banana', 'Banana', 'Apple', 'Apple']})
fruits['Match'] = np.where(fruits['Fruit'] == fruits['Fruit_Copy'], 1, 0)
fruits
Fruit Cost Quantity Fruit_Copy Match
0 Apple 0.50 6 Watermelon 0
1 Orange 0.30 2 Orange 1
2 Apple 0.50 8 Apple 1
3 Apple 0.50 7 Apple 1
4 Banana 0.25 8 Banana 1
5 Banana 0.25 7 Banana 1
6 Apple 0.50 6 Apple 1
7 Apple 0.50 3 Apple 1
I need to find the difference between values with the same names.
I have two csv files that I merged together and placed in another csv file to have a side by side comparison of the number differences.
Below is the sample merged csv file:
Q1Count Q1Names Q2Count Q2Names
2 candy 2 candy
9 apple 8 apple
10 bread 5 pineapple
4 pies 12 bread
3 cookies 4 pies
32 chocolate 3 cookies
[Total count: 60] 27 chocolate
NaN NaN [Total count: 61]
All the names are (almost) the same, but I would like a way to insert a new row for the new name that appeared under Q2Names, pineapple.
Below is the code I implemented so far:
import pandas as pd
import csv
Q1ReportsDir='/path/to/Q1/Reports/'
Q2ReportsDir='/path/to/Q2/Reports/'
Q1lineCount = f'{Q1ReportsDir}Q1Report.csv'
Q2lineCount = f'{Q2ReportsDir}Q2Report.csv'
merged_destination = f'{Q2ReportsDir}DifferenceReport.csv'
diffDF = [pd.read_csv(p) for p in (Q1lineCount, Q2lineCount)]
merged_dataframe = pd.concat(diffDF, axis=1)
merged_dataframe.to_csv(merged_destination, index=False)
diffGenDF = pd.read_csv(merged_destination)
# getting Difference
diffGenDF['Difference'] = diffGenDF['Q1Count'] - diffGenDF['Q2Count']
diffGenDF = diffGenDF[['Difference', 'Q1Count', 'Q1Names', 'Q2Count', 'Q2Names']]
diffGenDF.to_csv(merged_destination, index=False)
So, leaving Q1Names blank and putting a 0 under Q1Count in the row where pineapple appears under Q2Names would line the names up and make the differences accurate:
Q1Count Q1Names Q2Count Q2Names
2 candy 2 candy
9 apple 8 apple
0 5 pineapple
10 bread 12 bread
4 pies 4 pies
3 cookies 3 cookies
32 chocolate 27 chocolate
[Total count: 60] [Total count: 61]
The final desired output I would get if I can get past that part is this:
Difference Q1Count Q1Names Q2Count Q2Names
0 2 candy 2 candy
1 9 apple 8 apple
-5 0 5 pineapple
-2 10 bread 12 bread
0 4 pies 4 pies
0 3 cookies 3 cookies
5 32 chocolate 27 chocolate
[Total count: 60] [Total count: 61]
I was able to get your desired result using pd.merge with an outer join on the data you provided (df1 and df2 being the Q1 and Q2 dataframes):
df_merge = pd.merge(df1, df2, left_on = 'Q1Names', right_on = 'Q2Names', how = 'outer')
df_merge[['Q1Count', 'Q2Count']] = df_merge[['Q1Count', 'Q2Count']].fillna(0)
df_merge[['Q1Names', 'Q2Names']] = df_merge[['Q1Names', 'Q2Names']].fillna('')
df_merge['Difference'] = df_merge['Q1Count'].sub(df_merge['Q2Count'])
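For anyone who wants to run this end to end, here is a rough sketch. The df1 and df2 frames are hypothetical stand-ins that I typed in from the sample in the question, and I am assuming the "[Total count: ...]" footer rows have already been dropped from both reports:

```python
import pandas as pd

# Hypothetical stand-ins for the two reports, copied from the sample in the
# question; the "[Total count: ...]" footer rows are assumed to be removed.
df1 = pd.DataFrame({
    'Q1Count': [2, 9, 10, 4, 3, 32],
    'Q1Names': ['candy', 'apple', 'bread', 'pies', 'cookies', 'chocolate'],
})
df2 = pd.DataFrame({
    'Q2Count': [2, 8, 5, 12, 4, 3, 27],
    'Q2Names': ['candy', 'apple', 'pineapple', 'bread', 'pies', 'cookies', 'chocolate'],
})

# The outer join keeps names that exist in only one of the two reports.
df_merge = pd.merge(df1, df2, left_on='Q1Names', right_on='Q2Names', how='outer')

# Missing counts become 0 and missing names become empty strings.
df_merge[['Q1Count', 'Q2Count']] = df_merge[['Q1Count', 'Q2Count']].fillna(0)
df_merge[['Q1Names', 'Q2Names']] = df_merge[['Q1Names', 'Q2Names']].fillna('')

df_merge['Difference'] = df_merge['Q1Count'].sub(df_merge['Q2Count'])

# Same column order as the desired output in the question.
df_merge = df_merge[['Difference', 'Q1Count', 'Q1Names', 'Q2Count', 'Q2Names']]
print(df_merge)
```

The row order may not match the desired output exactly, since an outer merge can reorder the keys, and the counts come back as floats because of the NaN introduced before fillna.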
I have a df with a "State" column and a "Text" column. I want to make a new column called "my_new_col" that extracts the word "Lime" from the "Text" column, but only when the "State" column equals "Idaho".
df = {'State': ["Idaho", "Washington","Oregon","Idaho","Oregon"], 'Text': ["Lime Light","New Egg","Lime Inc","Monteray","NovaDing"]}
df = pd.DataFrame(df)
df
Output:
State Text
0 Idaho Lime Light
1 Washington New Egg
2 Oregon Lime Inc
3 Idaho Monteray
4 Oregon NovaDing
How do I get a dataframe that looks like this?
State Text my_new_col
0 Idaho Lime Light Lime
1 Washington New Egg None
2 Oregon Lime Inc None
3 Idaho Monteray None
4 Oregon NovaDing None
Another example would be pulling text that matches a regex into a new column:
df = {'State': ["Idaho", "Washington","Oregon","Idaho","Oregon"], 'Text': ["1,234 Light","New Egg","Lime Inc","1223 Ring","NovaDing"]}
df = pd.DataFrame(df)
df
Output:
State Text
0 Idaho 1,234 Light
1 Washington New Egg
2 Oregon Lime Inc
3 Idaho 1223 Ring
4 Oregon NovaDing
How do I get a dataframe that looks like this? The regex would be \d,\d\d\d
State Text my_new_col
0 Idaho 1,234 Light 1,234
1 Washington New Egg None
2 Oregon Lime Inc None
3 Idaho 1223 Ring None
4 Oregon NovaDing None
If it's case-sensitive:
df['my_new_col'] = None
df.loc[(df['State']=='Idaho') & (df['Text'].str.contains("Lime")), 'my_new_col'] = 'Lime'
print(df)
State Text my_new_col
0 Idaho Lime Light Lime
1 Washington New Egg None
2 Oregon Lime Inc None
3 Idaho Monteray None
4 Oregon NovaDing None
If case-insensitive:
df.loc[(df['State']=='Idaho') & (df['Text'].str.contains("Lime", case=False)), 'my_new_col'] = 'Lime'
Based on the update to the question, using the second example dataframe:
df.loc[(df['State']=='Idaho'), 'my_new_col'] = df['Text'].str.extract(r"(\d,\d\d\d)")[0]
That puts NaN values in the column instead of None. If that matters:
df['my_new_col'] = None
df.loc[(df['State']=='Idaho'), 'my_new_col'] = df['Text'].str.extract(r"(\d,\d\d\d)")[0]
df.loc[df['my_new_col'].isnull(), 'my_new_col'] = None
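As a variation on the regex case, the extraction and the State condition can be combined without .loc by masking the extracted Series. This is only a sketch of an alternative, not the answer's method. The intermediate name extracted is mine, and it leaves NaN (not None) in the non-matching rows:

```python
import pandas as pd

# Second example dataframe from the question.
df = pd.DataFrame({
    'State': ['Idaho', 'Washington', 'Oregon', 'Idaho', 'Oregon'],
    'Text': ['1,234 Light', 'New Egg', 'Lime Inc', '1223 Ring', 'NovaDing'],
})

# Pull the first match of the pattern out of every row; rows without a
# match get NaN. expand=False returns a Series for a single capture group.
extracted = df['Text'].str.extract(r'(\d,\d\d\d)', expand=False)

# Keep the extracted value only where State is Idaho; all other rows
# are replaced with NaN.
df['my_new_col'] = extracted.where(df['State'].eq('Idaho'))

print(df)
```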
Is there a way to find the IDs that have both Apple and Strawberry and count how many there are? And likewise the IDs that have only Apple, and the IDs that have only Strawberry?
df:
ID Fruit
0 ABC Apple <-ABC has Apple and Strawberry
1 ABC Strawberry <-ABC has Apple and Strawberry
2 EFG Apple <-EFG has Apple only
3 XYZ Apple <-XYZ has Apple and Strawberry
4 XYZ Strawberry <-XYZ has Apple and Strawberry
5 CDF Strawberry <-CDF has Strawberry
6 AAA Apple <-AAA has Apple only
Desired output:
Length of IDs that has Apple and Strawberry: 2
Length of IDs that has Apple only: 2
Length of IDs that has Strawberry: 1
Thanks!
If the Fruit column only ever contains Apple or Strawberry, you can compare sets per group and then count the IDs by summing the True values:
v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print (out)
2
EDIT: If there are many possible values:
s = df.groupby('ID')['Fruit'].agg(frozenset).value_counts()
print (s)
{Apple} 2
{Strawberry, Apple} 2
{Strawberry} 1
Name: Fruit, dtype: int64
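To get from the per-ID sets to the three lines in the desired output, one option is to tally the sets with a plain collections.Counter and look each combination up. This is just a sketch. The both, apple_only and strawberry_only names are mine, and it assumes the same df as in the question:

```python
from collections import Counter

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'EFG', 'XYZ', 'XYZ', 'CDF', 'AAA'],
    'Fruit': ['Apple', 'Strawberry', 'Apple', 'Apple', 'Strawberry',
              'Strawberry', 'Apple'],
})

# One frozenset of fruits per ID, then count how often each set occurs.
# frozensets are hashable, so they can be used as Counter keys.
counts = Counter(df.groupby('ID')['Fruit'].agg(frozenset))

both = counts[frozenset({'Apple', 'Strawberry'})]
apple_only = counts[frozenset({'Apple'})]
strawberry_only = counts[frozenset({'Strawberry'})]

print('Length of IDs that has Apple and Strawberry:', both)
print('Length of IDs that has Apple only:', apple_only)
print('Length of IDs that has Strawberry:', strawberry_only)
```

A Counter returns 0 for a combination that never occurs, so the lookups do not need extra guarding.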
You can use pivot_table and DataFrame.value_counts (added in pandas 1.1.0):
df.pivot_table(index='ID', columns='Fruit', aggfunc='size', fill_value=0)\
.value_counts()
Output:
Apple Strawberry
1 1 2
0 2
0 1 1
Alternatively you can use:
df.groupby(['ID', 'Fruit']).size().unstack('Fruit', fill_value=0)\
.value_counts()
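A caveat with counting occurrences this way: if an ID could list the same fruit twice, the cell would hold 2 and show up as a separate pattern. A hedged sketch of one way around that, using pd.crosstab and converting counts into 0/1 presence flags first (the table, presence and pattern_counts names are mine):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'EFG', 'XYZ', 'XYZ', 'CDF', 'AAA'],
    'Fruit': ['Apple', 'Strawberry', 'Apple', 'Apple', 'Strawberry',
              'Strawberry', 'Apple'],
})

# One row per ID, one column per fruit; cells hold occurrence counts.
table = pd.crosstab(df['ID'], df['Fruit'])

# Collapse counts to 0/1 so repeated rows for an ID do not create
# additional patterns.
presence = table.gt(0).astype(int)

# Keys are (Apple, Strawberry) tuples, values are the number of IDs.
pattern_counts = presence.value_counts().to_dict()
print(pattern_counts)
```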
I need to count how often each value occurs in column A and keep only the rows whose value occurs at least, say, 2 times.
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the counts but cannot figure out how to filter on them with df.value_counts().
I want to keep the rows whose value in column A occurs at least 2 times. Expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Mango 1
Orange 2
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
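The same filter can also be written in one step by broadcasting each group's size back onto its rows with groupby(...).transform('size'). A small sketch, where min_count is a hypothetical name for the threshold:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'A': ['Apple', 'Orange', 'Apple', 'Mango', 'Orange'],
    'C': [4, 5, 3, 5, 1],
})

min_count = 2  # hypothetical threshold; adjust as needed

# For every row, the number of rows that share its value in column A.
group_sizes = df.groupby('A')['A'].transform('size')

# Keep only rows whose value occurs at least min_count times.
result = df[group_sizes >= min_count]
print(result)
```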
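Use duplicated with parameter keep=False, which marks every row whose value in A occurs more than once (so this only covers a threshold of 2):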
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Input
Fruit Count Price tag
Apple 55 35 red
Orange 60 40 orange
Apple 60 36 red
Apple 70 41 red
Output 1
Fruit Mean tag
Apple 35.5 red
Orange 40 orange
I need the mean of Price per Fruit, using only the rows where Price is between 31 and 40.
Output 2
Fruit Count tag
Apple 2 red
Orange 1 orange
I need the row count per Fruit under the same condition (Price between 31 and 40).
Please help.
Use between with boolean indexing for filtering:
df1 = df[df['Price'].between(31, 40)]
print (df1)
Fruit Count Price tag
0 Apple 55 35 red
1 Orange 60 40 orange
2 Apple 60 36 red
To compute several aggregations at once, pass a list of functions to agg:
df2 = df1.groupby(['Fruit', 'tag'])['Price'].agg(['mean','size']).reset_index()
print (df2)
Fruit tag mean size
0 Apple red 35.5 2
1 Orange orange 40.0 1
Or as two separate DataFrames:
df3 = df1.groupby(['Fruit', 'tag'], as_index=False)['Price'].mean()
print (df3)
Fruit tag Price
0 Apple red 35.5
1 Orange orange 40.0
df4 = df1.groupby(['Fruit', 'tag'])['Price'].size().reset_index()
print (df4)
Fruit tag Price
0 Apple red 2
1 Orange orange 1
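If you would rather have the result columns named Mean and Count as in the question, named aggregation is one possible way to do it. A sketch under that assumption (the in_range and summary names are mine), starting from the raw input:

```python
import pandas as pd

# Input data from the question.
df = pd.DataFrame({
    'Fruit': ['Apple', 'Orange', 'Apple', 'Apple'],
    'Count': [55, 60, 60, 70],
    'Price': [35, 40, 36, 41],
    'tag': ['red', 'orange', 'red', 'red'],
})

# between() is inclusive on both ends by default, so 40 is kept.
in_range = df[df['Price'].between(31, 40)]

# Named aggregation: output_column=(input_column, function).
summary = (
    in_range.groupby(['Fruit', 'tag'], as_index=False)
            .agg(Mean=('Price', 'mean'), Count=('Price', 'count'))
)
print(summary)
```

Here 'count' counts non-null Price values per group, which equals the row count for this data because Price has no missing values.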