How to split Pandas string column into different rows? - python-3.x

Here is my issue. I have data like this:
data = {
'name': ["Jack ;; Josh ;; John", "Apple ;; Fruit ;; Pear"],
'grade': [11, 12],
'color':['black', 'blue']
}
df = pd.DataFrame(data)
It looks like:
name grade color
0 Jack ;; Josh ;; John 11 black
1 Apple ;; Fruit ;; Pear 12 blue
I want it to look like:
name age color
0 Jack 11 black
1 Josh 11 black
2 John 11 black
3 Apple 12 blue
4 Fruit 12 blue
5 Pear 12 blue
So first I'd need to split name by using ";;" and then explode that list into different rows

Use Series.str.split with reshape by DataFrame.stack and add orriginal another columns by DataFrame.join:
c = df.columns
s = (df.pop('name')
.str.split(' ;; ', expand=True)
.stack()
.reset_index(level=1, drop=True)
.rename('name'))
df = df.join(s).reset_index(drop=True).reindex(columns=c)
print (df)
name grade color
0 Jack 11 black
1 Josh 11 black
2 John 11 black
3 Apple 12 blue
4 Fruit 12 blue
5 Pear 12 blue

You have 2 challenges:
split the name with ;; into a list AND have each item in the list as a column such that:
df['name']=df.name.str.split(';;')
df_temp = df.name.apply(pd.Series)
df = pd.concat([df[:], df_temp[:]], axis=1)
df.drop('name', inplace=True, axis=1)
result:
grade color 0 1 2
0 11 black Jack Josh John
1 12 blue Apple Fruit Pear
Melt the list to get desired result:
df.melt(id_vars=["grade", "color"],
value_name="Name").sort_values('grade').drop('variable', axis=1)
desired result:
grade color Name
0 11 black Jack
2 11 black Josh
4 11 black John
1 12 blue Apple
3 12 blue Fruit
5 12 blue Pear

Related

Finding the difference between values with the same name in a merged CSV file

I need to find the difference between values with the same names.
I have two csv files that I merged together and placed in another csv file to have a side by side comparison of the number differences.
Below is the sample merged csv file:
Q1Count Q1Names Q2Count Q2Names
2 candy 2 candy
9 apple 8 apple
10 bread 5 pineapple
4 pies 12 bread
3 cookies 4 pies
32 chocolate 3 cookies
[Total count: 60] 27 chocolate
NaN NaN [Total count: 61]
All the names are the same (almost), but I would like to have a way to make a new row space for the new name that popped up under Q2Names, pinapple.
Below is the code I implemented so far:
import pandas as pd
import csv
Q1ReportsDir='/path/to/Q1/Reports/'
Q2ReportsDir='/path/to/Q2/Reports/'
Q1lineCount = f'{Q1ReportsDir}Q1Report.csv'
Q2lineCount = f'{Q2ReportsDir}Q2Report.csv'
merged_destination = f'{Q2ReportsDir}DifferenceReport.csv'
diffDF = [pd.read_csv(p) for p in (Q1lineCount, Q2lineCount)]
merged_dataframe = pd.concat(diffDF, axis=1)
merged_dataframe.to_csv(merged_destination, index=False)
diffGenDF = pd.read_csv(merged_destination)
# getting Difference
diffGenDF ['Difference'] = diffGenDF ['Q1Count'] - diffGenDF ['Q2Count']
diffGenDF = diffGenDF [['Difference', 'Q1Count', 'Q1Names', 'Q2Count ', 'Q2Names']]
diffGenDF.to_csv(merged_destination, index=False)
So, making a space under Q1Names and adding a 0 under Q1Count in the same row where pineapple is under column Q2Names would make this easier to see an accurate difference between the values.
Q1Count Q1Names Q2Count Q2Names
2 candy 2 candy
9 apple 8 apple
0 5 pineapple
10 bread 12 bread
4 pies 4 pies
3 cookies 3 cookies
32 chocolate 27 chocolate
[Total count: 60] [Total count: 61]
The final desired output I would get if I can get past that part is this:
Difference Q1Count Q1Names Q2Count Q2Names
0 2 candy 2 candy
1 9 apple 8 apple
-5 0 5 pineapple
-2 10 bread 12 bread
0 4 pies 4 pies
0 3 cookies 3 cookies
5 32 chocolate 27 chocolate
[Total count: 60] [Total count: 61]
I was able to get your same results using a pd.merge with the dataframe you provided
df_merge = pd.merge(df1, df2, left_on = 'Q1Names', right_on = 'Q2Names', how = 'outer')
df_merge[['Q1Count', 'Q2Count']] = df_merge[['Q1Count', 'Q2Count']].fillna(0)
df_merge[['Q1Names', 'Q2Names']] = df_merge[['Q1Names', 'Q2Names']].fillna('')
df_merge['Difference'] = df_merge['Q1Count'].sub(df_merge['Q2Count'])

Grouping By 2 Columns In Pandas Ignoring Order

I have a Dataframe in Pandas where there are 2 columns that are almost identical but not quite and hence sometimes I want to group by both columns ignoring the order.
As an example:
mydf = pd.DataFrame({'Colour1': ['Red', 'Red', 'Blue', 'Green', 'Blue'], 'Colour2': ['Red', 'Blue', 'Red', 'Blue', 'Green'], 'Rating': [4, 5, 7, 8, 2]})
Colour1 Colour2 Rating
0 Red Red 4
1 Red Blue 5
2 Blue Red 7
3 Green Blue 8
4 Blue Green 2
I would like to group by Colour1 and Colour2 whilst ignoring the order and then transforming the Dataframe by taking the mean to produce the following Dataframe:
Colour1 Colour2 Rating MeanRating
0 Red Red 4 4
1 Red Blue 5 6
2 Blue Red 7 6
3 Green Blue 8 5
4 Blue Green 2 5
Is there a good way of doing this? Thanks in advance.
You can first sort the column1 and 2 using np.sort then groupby:
s = pd.Series(map(tuple,np.sort(mydf[['Colour1','Colour2']],axis=1)),index=mydf.index)
mydf['MeanRating'] = mydf['Rating'].groupby(s).transform('mean')
print(mydf)
Colour1 Colour2 Rating MeanRating
0 Red Red 4 4
1 Red Blue 5 6
2 Blue Red 7 6
3 Green Blue 8 5
4 Blue Green 2 5

Filter rows based on the count of unique values

I need to count the unique values of column A and filter out the column with values greater than say 2
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the unique values but not able to figure out how to filer them df.value_count()
I want to filter column A that have greater than 2, expected Dataframe
A B
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Mango 1
Orange 2
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Use duplicated with parameter keep=False:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1

calculating mean with a condition on python pandas Group by on two columns. And print only the mean for each category?

Input
Fruit Count Price tag
Apple 55 35 red
Orange 60 40 orange
Apple 60 36 red
Apple 70 41 red
Output 1
Fruit Mean tag
Apple 35.5 red
Orange 40 orange
I need mean on condition price between 31 and 40
Output 2
Fruit Count tag
Apple 2 red
Orange 1 orange
I need count on condition price between 31 and 40
pls help
Use between with boolean indexing for filtering:
df1 = df[df['Price'].between(31, 40)]
print (df1)
Fruit Count Price tag
0 Apple 55 35 red
1 Orange 60 40 orange
2 Apple 60 36 red
If possible multiple columns by aggregated functions:
df2 = df1.groupby(['Fruit', 'tag'])['Price'].agg(['mean','size']).reset_index()
print (df2)
Fruit tag mean size
0 Apple red 35.5 2
1 Orange orange 40.0 1
Or 2 separately DataFrames:
df3 = df1.groupby(['Fruit', 'tag'], as_index=False)['Price'].mean()
print (df3)
Fruit tag Price
0 Apple red 35.5
1 Orange orange 40.0
df4 = df1.groupby(['Fruit', 'tag'])['Price'].size().reset_index()
print (df4)
Fruit tag Price
0 Apple red 2
1 Orange orange 1

Append Two Dataframes Together (Pandas, Python3)

I am trying to append/join(?) two different dataframes together that don't share any overlapping data.
DF1 looks like
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
....
Brown 6
and DF2 looks like
Area Miles
2 3
1 2
....
7 12
I am trying to append these together using
bigdata = df1.append(df2,ignore_index = True).reset_index()
but I get this
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
Area Miles
2 3
1 2
How do I get something like this?
Teams Points Area Miles
Red 2 2 3
Green 1 1 2
Orange 3
Yellow 4
EDIT: in regards to Edchum's answers, I have tried merge and join but each create somewhat strange tables. Instead of what I am looking for (as listed above) it will return something like this:
Teams Points Area Miles
Red 2 2 3
Green 1
Orange 3 1 2
Yellow 4
Use concat and pass param axis=1:
In [4]:
pd.concat([df1,df2], axis=1)
Out[4]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
join also works:
In [8]:
df1.join(df2)
Out[8]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
As does merge:
In [11]:
df1.merge(df2,left_index=True, right_index=True, how='left')
Out[11]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
EDIT
In the case where the indices do not align where for example your first df has index [0,1,2,3] and your second df has index [0,2] this will mean that the above operations will naturally align against the first df's index resulting in a NaN row for index row 1. To fix this you can reindex the second df either by calling reset_index() or assign directly like so: df2.index =[0,1].

Resources