Making a list from a pandas column containing multiple values - python-3.x

Let's use this as an example data set:
   Year               Breeds
0  2009               Collie
1  2010             Shepherd
2  2011     Collie, Shepherd
3  2012  Shepherd, Retriever
4  2013             Shepherd
5  2014    Shepherd, Bulldog
6  2015    Collie, Retriever
7  2016   Retriever, Bulldog
I want to create a list dogs that contains the unique dog breeds Collie, Shepherd, Retriever and Bulldog. I know it would normally be as simple as calling .unique() on the appropriate column, but I am running into the issue of having more than one value per cell in the Breeds column. Any ideas to circumvent that?
Thanks!

EDIT:
If you need to extract all possible values, use split:
df['new'] = df['Breeds'].str.split(', ')
For unique values, convert the lists to sets:
df['new'] = df['Breeds'].str.split(', ').apply(lambda x: list(set(x)))
Or use a list comprehension:
df['new'] = [list(set(x.split(', '))) for x in df['Breeds']]
If you want to extract only some values, use findall with a regex built by joining the list with | (OR):
L = ["Collie", "Shepherd", "Retriever", "Bulldog"]
df['new'] = df['Breeds'].str.findall('|'.join(L))
If duplicates are possible, deduplicate with set:
df['new'] = df['Breeds'].str.findall('|'.join(L)).apply(lambda x: list(set(x)))
print (df)
   Year               Breeds                    new
0  2009               Collie               [Collie]
1  2010             Shepherd             [Shepherd]
2  2011     Collie, Shepherd     [Collie, Shepherd]
3  2012  Shepherd, Retriever  [Shepherd, Retriever]
4  2013             Shepherd             [Shepherd]
5  2014    Shepherd, Bulldog    [Shepherd, Bulldog]
6  2015    Collie, Retriever    [Collie, Retriever]
7  2016   Retriever, Bulldog   [Retriever, Bulldog]
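If the goal is the single flat list dogs of unique breeds, rather than a per-row column, one possible sketch (assuming pandas 0.25+ for explode) is:
# split each cell, flatten the lists into one Series, then take unique values in first-seen order
dogs = df['Breeds'].str.split(', ').explode().unique().tolist()
print (dogs)  # ['Collie', 'Shepherd', 'Retriever', 'Bulldog']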

Related

Merging two pandas dataframes with date variable

I want to merge two pandas dataframes based on a common date variable. Below is my code:
import pandas as pd
data = pd.DataFrame({'date' : pd.to_datetime(['2010-12-31', '2012-12-31']), 'val' : [1,2]})
datarange = pd.DataFrame(pd.period_range('2009-12-31', '2012-12-31', freq='A'), columns = ['date'])
pd.merge(datarange, data, how = 'left', on = 'date')
With this I get the result below:
   date  val
0  2009  NaN
1  2010  NaN
2  2011  NaN
3  2012  NaN
Could you please help me merge these two dataframes correctly?
Use right_on to convert data's datetimes to the same annual periods as in the datarange['date'] column:
df = pd.merge(datarange,
              data,
              how='left',
              left_on='date',
              right_on=data['date'].dt.to_period('A'))
print (df)
   date date_x     date_y  val
0  2009   2009        NaT  NaN
1  2010   2010 2010-12-31  1.0
2  2011   2011        NaT  NaN
3  2012   2012 2012-12-31  2.0
Or create a helper column:
df = pd.merge(datarange,
              data.assign(datetimes=data['date'], date=data['date'].dt.to_period('A')),
              how='left',
              on='date')
print (df)
   date  val  datetimes
0  2009  NaN        NaT
1  2010  1.0 2010-12-31
2  2011  NaN        NaT
3  2012  2.0 2012-12-31
You need to merge on a common type.
For example, you can use the year as the merge key on each side:
pd.merge(datarange, data, how='left',
         left_on=datarange['date'].dt.year,
         right_on=data['date'].dt.year)
output:
   key_0 date_x     date_y  val
0   2009   2009        NaT  NaN
1   2010   2010 2010-12-31  1.0
2   2011   2011        NaT  NaN
3   2012   2012 2012-12-31  2.0
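Another way to reach a common type, sketched here as an alternative rather than taken from the answers above, is to convert the annual periods to their end-of-period dates so both sides share the datetime64 dtype:
# assumption: the year-end dates are what should match on both sides
merged = pd.merge(datarange.assign(date=datarange['date'].dt.to_timestamp(how='end').dt.normalize()),
                  data, how='left', on='date')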

Filter and display all duplicated rows based on multiple columns in Pandas [duplicate]

This question already has answers here: How do I get a list of all the duplicate items using pandas in python? (13 answers)
Given a dataset as follows:
    name     month  year
0    Joe  December  2017
1  James   January  2018
2    Bob     April  2018
3    Joe  December  2017
4   Jack  February  2018
5   Jack     April  2018
I need to filter and display all duplicated rows based on columns month and year in Pandas.
With the code below, I get:
df = df[df.duplicated(subset = ['month', 'year'])]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
Out:
   name     month  year
3   Joe  December  2017
5  Jack     April  2018
But I want the result as follows:
   name     month  year
0   Joe  December  2017
1   Joe  December  2017
2   Bob     April  2018
3  Jack     April  2018
How could I do that in Pandas?
The following code works by adding keep=False:
df = df[df.duplicated(subset = ['month', 'year'], keep = False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
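To also reproduce the renumbered index from the desired output, a small extension of the code above (a sketch, not part of the original answer) is to sort ascending and reset the index:
df = df[df.duplicated(subset=['month', 'year'], keep=False)]
df = df.sort_values(by=['year', 'month', 'name']).reset_index(drop=True)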

Optimizing pandas iteration

Customer  Year  Customer Lost/Retained
A         2009  Retained
A         2010  Retained
A         2011  Lost
B         2008  Lost
C         2008  Retained
C         2009  lost
If a customer appears again in the consecutive year, he is retained, else lost. I have used iterrows() to create the Customer Lost/Retained column based on this logic:
for i, row in df.iterrows():
    if (df[df['Year'] == row['Year']+1]['Customer']).str.contains(df['Customer'].iloc[i]).any():
        df['Customer Lost/Retained'].iloc[i] = 'Retained'
    else:
        df['Customer Lost/Retained'].iloc[i] = 'Lost'
Can this code be optimized further?
import numpy as np

# group the years by customer
g = df.groupby('Customer')['Year']
# mask is True where the same customer also appears in the following year
mask = (g.shift(0) == g.shift(-1) - 1)
# use np.where to pick a label based on the mask
df['Retained/lost'] = np.where(mask, 'Retained', 'Lost')
  Customer  Year Retained/lost
0        A  2009      Retained
1        A  2010      Retained
2        A  2011          Lost
3        B  2008          Lost
4        C  2008      Retained
5        C  2009          Lost
You could do this as a merge with itself but modifying the year:
In [83]: df['retained'] = pd.notnull(df.merge(
    ...:     df,
    ...:     how="left",
    ...:     left_on=["Customer", "Year"],
    ...:     right_on=["Customer", df["Year"].sub(1)],
    ...:     suffixes=["", "_match"]
    ...: )["Year_match"]).map({True: 'Retained', False: 'Lost'})
In [84]: df
Out[84]:
  Customer  Year Customer Lost/Retained  retained
0        A  2009               Retained  Retained
1        A  2010               Retained  Retained
2        A  2011                   Lost      Lost
3        B  2008                   Lost      Lost
4        C  2008               Retained  Retained
5        C  2009                   lost      Lost
We add a column which says 'Retained':
df['Customer Lost/Retained'] = 'Retained'
Except for the rows with the highest year per customer; those get the value 'Lost':
mask = df.groupby('Customer')['Year'].idxmax()
df.loc[mask, 'Customer Lost/Retained'] = 'Lost'
  Customer  Year Customer Lost/Retained
0        A  2009               Retained
1        A  2010               Retained
2        A  2011                   Lost
3        B  2008                   Lost
4        C  2008               Retained
5        C  2009                   Lost
Or, alternatively, insert the 'Lost' first and then .fillna():
df.loc[df.groupby('Customer')['Year'].idxmax(), 'Customer Lost/Retained'] = 'Lost'
df['Customer Lost/Retained'] = df['Customer Lost/Retained'].fillna('Retained')
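For completeness, a compact variant of the shift idea (a sketch that assumes the frame is sorted by Customer and Year, like the sample data) avoids numpy entirely:
# next year within each customer group; the last row per group becomes NaN
s = df.groupby('Customer')['Year'].shift(-1)
# True where the customer reappears in the following year; NaN compares as False, i.e. 'Lost'
df['Customer Lost/Retained'] = (s == df['Year'] + 1).map({True: 'Retained', False: 'Lost'})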

Pandas - How do you groupby multiple columns and get the lowest value?

I have a data frame with 75+ columns. I am trying to eliminate rows and keep only the relevant data rows for a test, so I created a sample data set. I know how I would tackle this in SQL with GROUP BY and get all the columns. How do I do this here? I have posted one of my many tries which made sense to me.
import pandas as pd

u_id = ['A123','A123','A123','A124','A124','A125']
year = [2016,2017,2018,2018,1997,2015]
text = ['text1','text2','text1','text1','text56','text100']
df = pd.DataFrame({'u_id': u_id, 'year': year, 'text': text})
df
Data Input
   u_id  year     text
0  A123  2016    text1
1  A123  2017    text2
2  A123  2018    text1
3  A124  2018    text1
4  A124  1997   text56
5  A125  2015  text100
Tried:
df[df.groupby(['u_id','year'])['year'].min()]
# error: `KeyError: '[2016 2017 2018 1997 2018 2015] not in index'`
# Key exists here, why is this an error? 'groupby/having' in SQL?
Output Needed:
u_id  year  text     ...  col1  col2  ...  col_x
A123  2016  text1    ...
A124  1997  text56   ...
A125  2015  text100  ...
I think what you need is to group by u_id and keep the row with the minimum year:
df["year"] = pd.to_numeric(df["year"])
newdf = df.loc[df.groupby(['u_id'])['year'].idxmin()].reset_index(drop=True)
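An equivalent approach, offered as a sketch rather than part of the original answer, is to sort by year and keep the first row per u_id:
# the smallest year sorts first within each u_id; keep that row, then restore u_id order
newdf = (df.sort_values('year')
           .drop_duplicates('u_id', keep='first')
           .sort_values('u_id')
           .reset_index(drop=True))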

assigning a column to be index and dropping it

As the title suggests, I am reassigning a column as the index, but I want the column to appear only as the index.
df.set_index(df['col_name'], drop = True, inplace = True)
The documentation as I understand it says that the above will reassign the column as the index of the df and drop the initial column. But when I print out the df, the column is now duplicated (as index and still as a column). Can anyone point out where I am missing something?
You only need to pass the column name col_name and the parameter inplace to set_index. If you pass not the column name but the column itself, like df['a'], set_index doesn't drop the column, it only copies it:
print df
   col_name     a  b
0     1.255  2003  1
1     3.090  2003  2
2     3.155  2003  3
3     3.115  2004  1
4     3.010  2004  2
5     2.985  2004  3
df.set_index('col_name', inplace = True)
print df
             a  b
col_name
1.255     2003  1
3.090     2003  2
3.155     2003  3
3.115     2004  1
3.010     2004  2
2.985     2004  3
df.set_index(df['a'], inplace = True)
print df
         a  b
a
2003  2003  1
2003  2003  2
2003  2003  3
2004  2004  1
2004  2004  2
2004  2004  3
This works for me in Python 2.x, should work for you too (3.x)!
pandas.__version__
u'0.17.0'
df.set_index('col_name', inplace = True)
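A minimal sketch of the same distinction in a current pandas (Python 3 print syntax; the behavior is assumed unchanged):
import pandas as pd

df = pd.DataFrame({'col_name': [1.255, 3.090], 'a': [2003, 2003], 'b': [1, 2]})
print(df.set_index('col_name').columns.tolist())      # ['a', 'b'] - column dropped
print(df.set_index(df['col_name']).columns.tolist())  # ['col_name', 'a', 'b'] - column kept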
