Here is my issue. I have data like this:
data = {
'name': ["Jack ;; Josh ;; John", "Apple ;; Fruit ;; Pear"],
'grade': [11, 12],
'color':['black', 'blue']
}
df = pd.DataFrame(data)
It looks like:
name grade color
0 Jack ;; Josh ;; John 11 black
1 Apple ;; Fruit ;; Pear 12 blue
I want it to look like:
name grade color
0 Jack 11 black
1 Josh 11 black
2 John 11 black
3 Apple 12 blue
4 Fruit 12 blue
5 Pear 12 blue
So first I need to split name on ";;" and then explode the resulting list into separate rows.
Use Series.str.split, reshape with DataFrame.stack, and add the original remaining columns back with DataFrame.join:
c = df.columns
s = (df.pop('name')
.str.split(' ;; ', expand=True)
.stack()
.reset_index(level=1, drop=True)
.rename('name'))
df = df.join(s).reset_index(drop=True).reindex(columns=c)
print (df)
name grade color
0 Jack 11 black
1 Josh 11 black
2 John 11 black
3 Apple 12 blue
4 Fruit 12 blue
5 Pear 12 blue
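As an alternative sketch, assuming pandas 0.25 or newer (where DataFrame.explode is available), the split-then-explode idea from the question can be written directly. The variable name out is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Jack ;; Josh ;; John", "Apple ;; Fruit ;; Pear"],
    'grade': [11, 12],
    'color': ['black', 'blue'],
})

# Split each name string into a list, then give every list item its own row.
out = (df.assign(name=df['name'].str.split(' ;; '))
         .explode('name')
         .reset_index(drop=True))
print(out)
```

Splitting on ' ;; ' (with the surrounding spaces) avoids leftover whitespace in the names.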
There are two steps:
Split name on ' ;; ' into a list and put each item of the list in its own column:
df['name'] = df['name'].str.split(' ;; ')
df_temp = df.name.apply(pd.Series)
df = pd.concat([df[:], df_temp[:]], axis=1)
df.drop('name', inplace=True, axis=1)
Result:
grade color 0 1 2
0 11 black Jack Josh John
1 12 blue Apple Fruit Pear
Then melt those columns to get the desired result:
df.melt(id_vars=["grade", "color"],
value_name="Name").sort_values('grade').drop('variable', axis=1)
Desired result:
grade color Name
0 11 black Jack
2 11 black Josh
4 11 black John
1 12 blue Apple
3 12 blue Fruit
5 12 blue Pear
I need to find the difference between values with the same names.
I have two csv files that I merged together and placed in another csv file to have a side by side comparison of the number differences.
Below is the sample merged csv file:
Q1Count Q1Names Q2Count Q2Names
2 candy 2 candy
9 apple 8 apple
10 bread 5 pineapple
4 pies 12 bread
3 cookies 4 pies
32 chocolate 3 cookies
[Total count: 60] 27 chocolate
NaN NaN [Total count: 61]
Almost all the names are the same, but I would like a way to make a new row for the new name that appeared under Q2Names, pineapple.
Below is the code I implemented so far:
import pandas as pd
import csv
Q1ReportsDir='/path/to/Q1/Reports/'
Q2ReportsDir='/path/to/Q2/Reports/'
Q1lineCount = f'{Q1ReportsDir}Q1Report.csv'
Q2lineCount = f'{Q2ReportsDir}Q2Report.csv'
merged_destination = f'{Q2ReportsDir}DifferenceReport.csv'
diffDF = [pd.read_csv(p) for p in (Q1lineCount, Q2lineCount)]
merged_dataframe = pd.concat(diffDF, axis=1)
merged_dataframe.to_csv(merged_destination, index=False)
diffGenDF = pd.read_csv(merged_destination)
# getting Difference
diffGenDF['Difference'] = diffGenDF['Q1Count'] - diffGenDF['Q2Count']
diffGenDF = diffGenDF[['Difference', 'Q1Count', 'Q1Names', 'Q2Count', 'Q2Names']]
diffGenDF.to_csv(merged_destination, index=False)
So, leaving a blank under Q1Names and adding a 0 under Q1Count in the row where pineapple appears under Q2Names would make it easier to see an accurate difference between the values.
Q1Count Q1Names Q2Count Q2Names
2 candy 2 candy
9 apple 8 apple
0 5 pineapple
10 bread 12 bread
4 pies 4 pies
3 cookies 3 cookies
32 chocolate 27 chocolate
[Total count: 60] [Total count: 61]
The final desired output I would get if I can get past that part is this:
Difference Q1Count Q1Names Q2Count Q2Names
0 2 candy 2 candy
1 9 apple 8 apple
-5 0 5 pineapple
-2 10 bread 12 bread
0 4 pies 4 pies
0 3 cookies 3 cookies
5 32 chocolate 27 chocolate
[Total count: 60] [Total count: 61]
I was able to get the result you want using pd.merge with how='outer' on the two dataframes you provided (df1 and df2 here):
df_merge = pd.merge(df1, df2, left_on = 'Q1Names', right_on = 'Q2Names', how = 'outer')
df_merge[['Q1Count', 'Q2Count']] = df_merge[['Q1Count', 'Q2Count']].fillna(0)
df_merge[['Q1Names', 'Q2Names']] = df_merge[['Q1Names', 'Q2Names']].fillna('')
df_merge['Difference'] = df_merge['Q1Count'].sub(df_merge['Q2Count'])
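To try this end to end, here is a minimal sketch that builds df1 and df2 by hand from the sample table in the question. It assumes the "[Total count: ...]" rows have been left out before merging, and the by_name helper is only illustrative:

```python
import pandas as pd

# Hand-built stand-ins for the two CSV reports (total-count rows omitted).
df1 = pd.DataFrame({
    'Q1Count': [2, 9, 10, 4, 3, 32],
    'Q1Names': ['candy', 'apple', 'bread', 'pies', 'cookies', 'chocolate'],
})
df2 = pd.DataFrame({
    'Q2Count': [2, 8, 5, 12, 4, 3, 27],
    'Q2Names': ['candy', 'apple', 'pineapple', 'bread', 'pies', 'cookies', 'chocolate'],
})

# The outer merge keeps names that appear in only one of the reports.
df_merge = pd.merge(df1, df2, left_on='Q1Names', right_on='Q2Names', how='outer')
df_merge[['Q1Count', 'Q2Count']] = df_merge[['Q1Count', 'Q2Count']].fillna(0)
df_merge[['Q1Names', 'Q2Names']] = df_merge[['Q1Names', 'Q2Names']].fillna('')
df_merge['Difference'] = df_merge['Q1Count'].sub(df_merge['Q2Count'])

# Look up differences by name; row order of an outer merge varies by pandas version.
by_name = df_merge.set_index('Q2Names')['Difference']
print(df_merge)
```

Because a missing name is filled with a count of 0, the pineapple row ends up with a Difference of -5.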
I have a Dataframe in Pandas where there are 2 columns that are almost identical but not quite and hence sometimes I want to group by both columns ignoring the order.
As an example:
mydf = pd.DataFrame({'Colour1': ['Red', 'Red', 'Blue', 'Green', 'Blue'], 'Colour2': ['Red', 'Blue', 'Red', 'Blue', 'Green'], 'Rating': [4, 5, 7, 8, 2]})
Colour1 Colour2 Rating
0 Red Red 4
1 Red Blue 5
2 Blue Red 7
3 Green Blue 8
4 Blue Green 2
I would like to group by Colour1 and Colour2 whilst ignoring the order and then transforming the Dataframe by taking the mean to produce the following Dataframe:
Colour1 Colour2 Rating MeanRating
0 Red Red 4 4
1 Red Blue 5 6
2 Blue Red 7 6
3 Green Blue 8 5
4 Blue Green 2 5
Is there a good way of doing this? Thanks in advance.
You can first sort Colour1 and Colour2 within each row using np.sort, then group by the sorted pairs:
import numpy as np
s = pd.Series(map(tuple,np.sort(mydf[['Colour1','Colour2']],axis=1)),index=mydf.index)
mydf['MeanRating'] = mydf['Rating'].groupby(s).transform('mean')
print(mydf)
Colour1 Colour2 Rating MeanRating
0 Red Red 4 4.0
1 Red Blue 5 6.0
2 Blue Red 7 6.0
3 Green Blue 8 5.0
4 Blue Green 2 5.0
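The same order-insensitive key can also be built without NumPy. The following is a sketch of that variant, not a replacement for the answer above. The name key is illustrative:

```python
import pandas as pd

mydf = pd.DataFrame({
    'Colour1': ['Red', 'Red', 'Blue', 'Green', 'Blue'],
    'Colour2': ['Red', 'Blue', 'Red', 'Blue', 'Green'],
    'Rating': [4, 5, 7, 8, 2],
})

# Sort each (Colour1, Colour2) pair so that ('Red', 'Blue') and
# ('Blue', 'Red') map to the same grouping key.
key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(mydf['Colour1'], mydf['Colour2'])],
    index=mydf.index,
)
mydf['MeanRating'] = mydf['Rating'].groupby(key).transform('mean')
print(mydf)
```

transform returns one value per original row, so the result can be assigned straight back as a new column.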
I need to count how often each value occurs in column A and keep only the rows whose value occurs at least, say, 2 times.
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the counts but cannot figure out how to filter on them: df.value_count()
I want to keep the rows whose value in column A occurs at least 2 times. Expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Orange 2
Mango 1
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Use duplicated with parameter keep=False:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
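duplicated with keep=False only covers the "occurs at least twice" case. For an arbitrary threshold, one possible sketch uses groupby with transform('size'). The names min_count and result are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['Apple', 'Orange', 'Apple', 'Mango', 'Orange'],
    'C': [4, 5, 3, 5, 1],
})

min_count = 2  # keep values of A that occur at least this many times

# transform('size') broadcasts each group's row count back to every row,
# so the comparison yields a boolean mask aligned with df.
mask = df.groupby('A')['A'].transform('size') >= min_count
result = df[mask]
print(result)
```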
Input
Fruit Count Price tag
Apple 55 35 red
Orange 60 40 orange
Apple 60 36 red
Apple 70 41 red
Output 1
Fruit Mean tag
Apple 35.5 red
Orange 40 orange
I need the mean of Price for rows where Price is between 31 and 40.
Output 2
Fruit Count tag
Apple 2 red
Orange 1 orange
I need the row count for rows where Price is between 31 and 40.
Please help.
Use between with boolean indexing for filtering:
df1 = df[df['Price'].between(31, 40)]
print (df1)
Fruit Count Price tag
0 Apple 55 35 red
1 Orange 60 40 orange
2 Apple 60 36 red
If both results can go in one DataFrame, aggregate with several functions at once:
df2 = df1.groupby(['Fruit', 'tag'])['Price'].agg(['mean','size']).reset_index()
print (df2)
Fruit tag mean size
0 Apple red 35.5 2
1 Orange orange 40.0 1
Or build two separate DataFrames:
df3 = df1.groupby(['Fruit', 'tag'], as_index=False)['Price'].mean()
print (df3)
Fruit tag Price
0 Apple red 35.5
1 Orange orange 40.0
df4 = df1.groupby(['Fruit', 'tag'])['Price'].size().reset_index()
print (df4)
Fruit tag Price
0 Apple red 2
1 Orange orange 1
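If the output columns should carry the names from the question (Mean and Count), named aggregation is one option. This is a sketch assuming pandas 0.25 or newer, and the name out is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Orange', 'Apple', 'Apple'],
    'Count': [55, 60, 60, 70],
    'Price': [35, 40, 36, 41],
    'tag': ['red', 'orange', 'red', 'red'],
})

# between is inclusive on both ends by default, so 40 is kept and 41 is dropped.
df1 = df[df['Price'].between(31, 40)]

# Named aggregation: output_column=(input_column, function).
out = (df1.groupby(['Fruit', 'tag'], as_index=False)
          .agg(Mean=('Price', 'mean'), Count=('Price', 'size')))
print(out)
```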
I am trying to append/join(?) two different dataframes together that don't share any overlapping data.
DF1 looks like
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
....
Brown 6
and DF2 looks like
Area Miles
2 3
1 2
....
7 12
I am trying to append these together using
bigdata = df1.append(df2,ignore_index = True).reset_index()
but I get this
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
Area Miles
2 3
1 2
How do I get something like this?
Teams Points Area Miles
Red 2 2 3
Green 1 1 2
Orange 3
Yellow 4
EDIT: regarding Edchum's answer, I have tried merge and join, but each creates a somewhat strange table. Instead of what I am looking for (as listed above), they return something like this:
Teams Points Area Miles
Red 2 2 3
Green 1
Orange 3 1 2
Yellow 4
Use concat and pass param axis=1:
In [4]:
pd.concat([df1,df2], axis=1)
Out[4]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
join also works:
In [8]:
df1.join(df2)
Out[8]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
As does merge:
In [11]:
df1.merge(df2,left_index=True, right_index=True, how='left')
Out[11]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
EDIT
If the indices do not align, for example when your first df has index [0,1,2,3] and your second df has index [0,2], the operations above align on the first df's index, so the row at index 1 gets NaN in the second df's columns. To fix this, reindex the second df, either by calling reset_index(drop=True) or by assigning directly: df2.index = [0,1].
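A small sketch of that situation, using hand-built frames in which df2 is given index [0, 2] on purpose. The names misaligned and aligned are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'Teams': ['Red', 'Green', 'Orange', 'Yellow'],
                    'Points': [2, 1, 3, 4]})
# df2 deliberately carries a non-contiguous index to reproduce the problem.
df2 = pd.DataFrame({'Area': [2, 1], 'Miles': [3, 2]}, index=[0, 2])

# Aligns on index labels: df2's rows land on rows 0 and 2 of df1.
misaligned = pd.concat([df1, df2], axis=1)

# Resetting df2's index to 0..n-1 first places its rows at the top.
aligned = pd.concat([df1, df2.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```

Passing drop=True keeps the old index from being added as an extra column.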