Insert list of strings as a column in a dataframe - python-3.x

I have a pandas dataframe to which I would like to add a new column ('colors') that contains, for each row, a list of all colors (column 'color') of that item in that year previous to that row (i.e. grouped by the columns 'year' and 'item' and only including the rows above).
Suppose my df looks like this:
id item year color
0 shirt 2021 yellow
1 shoes 2022 pink
2 shirt 2021 green
3 shirt 2021 black
My goal would be:
id item year color colors
0 shirt 2021 yellow []
1 shoes 2022 pink []
2 shirt 2021 green [yellow]
3 shirt 2021 black [yellow, green]
So far I have played around with code like this:
self.df['colors'] = self.df.groupby(by = ['year', 'item'], group_keys = False)['color'].apply(list())
or
self.df['colors'] = self.df.groupby(by = ['year', 'item'], group_keys = False)['color'].apply(lambda x : list(x.shift()))
But I ran into errors around re-indexing etc., so I would be glad if some of you experts could help me here.

Here is one way you could do it:
import itertools
import pandas as pd

# accumulate the colors within each (item, year) group, then shift *within*
# the group so values cannot leak from one group into another
df['colors'] = (df.groupby(['item', 'year'])['color']
                  .transform(lambda x: pd.Series(list(itertools.accumulate(x, '{} {}'.format)),
                                                 index=x.index).shift()))
print(df)
   id   item  year   color        colors
0   0  shirt  2021  yellow           NaN
1   1  shoes  2022    pink           NaN
2   2  shirt  2021   green        yellow
3   3  shirt  2021   black  yellow green
If you need them to be stored as lists, just add df['colors'] = df['colors'].str.split(). Replacing the NaN with an empty list is also possible if you want that.
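A small follow-up sketch (my addition, not from the original answer) to turn the leftover NaN into empty lists, which reproduces the goal table exactly:
df['colors'] = df['colors'].str.split()  # strings -> lists, NaN stays NaN
df['colors'] = df['colors'].apply(lambda v: v if isinstance(v, list) else [])
print(df['colors'].tolist())
# [[], [], ['yellow'], ['yellow', 'green']]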

Related

Pandas Drop an Entire Column if All of the Values equal a Certain Value

Let's say I have dataframes that look like this:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
The logic I am trying to implement is something like this:
If every value in column c is NaN or "Member", then drop the entire column
else do nothing
Any suggestions?
Edit: Added Expected Output
Expected output if run on both data frames:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b
0 dave blue
1 bill red
2 sally green
Edit #2: Why am I doing this in the first place?
I am ripping text from a PDF file and placing it into CSV files using the Tabula library.
The data is not coming out in the way that I am hoping it would, so I am applying ETL concepts to move the data around.
The final outcome would be for management to be able to open the final result into a nicely formatted Excel file.
Some of the columns have part of the headers put into a separate row and things got shifted around for some reason while ripping the data out of the PDF.
The headers look something like this:
Team Type Member Contact
Count
What I am doing is checking an entire column for certain header values. If the entire column has a header value, I'm dropping the entire column.
The idea is to replace 'Member' with missing values first, then test whether at least one value is non-missing with notna and any, and then add True for all the other columns to the mask with Series.reindex:
import numpy as np

mask = (df[['c']].replace('Member', np.nan)
                 .notna()
                 .any()
                 .reindex(df.columns, fill_value=True))
print(mask)
Another idea is to chain both masks with & for bitwise AND:
mask = ((df[['c']].notna() & df[['c']].ne('Member'))
        .any()
        .reindex(df.columns, fill_value=True))
print(mask)
Last, filter the columns with DataFrame.loc:
df = df.loc[:, mask]
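Putting it together, here is a minimal runnable sketch on both sample frames (the frame construction is my own; the mask logic is the one above):
import numpy as np
import pandas as pd

df_one = pd.DataFrame({'a': ['dave', 'bill', 'sally', 'Ian'],
                       'b': ['blue', 'red', 'green', 'Org'],
                       'c': [np.nan, np.nan, 'Member', 'Paid']})
df_two = df_one.iloc[:3].copy()

for df in (df_one, df_two):
    # column 'c' survives only if it holds something besides NaN/'Member'
    mask = (df[['c']].replace('Member', np.nan)
                     .notna()
                     .any()
                     .reindex(df.columns, fill_value=True))
    print(df.loc[:, mask])  # df_one keeps 'c' (it has 'Paid'); df_two drops it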
Here's an alternate approach to do this.
import pandas as pd
import numpy as np

c = ['a', 'b', 'c']
d = [['dave', 'blue', np.nan],
     ['bill', 'red', np.nan],
     ['sally', 'green', 'Member'],
     ['Ian', 'Org', 'Paid']]
df1 = pd.DataFrame(d, columns=c)
df2 = df1.loc[df1['a'] != 'Ian']
print(df1)
print(df2)

# drop column 'c' when it holds nothing but NaN/'Member'
if df1.c.replace('Member', np.nan).isnull().all():
    df1 = df1[df1.columns.drop(['c'])]
print(df1)

if df2.c.replace('Member', np.nan).isnull().all():
    df2 = df2[df2.columns.drop(['c'])]
print(df2)
Output of this is:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b
0 dave blue
1 bill red
2 sally green
My idea is simple; maybe it will help you. I want to make sure this is what you want: drop the whole column if it contains only NaN or 'Member', else do nothing.
So I need to check the column first (does it contain only NaN or 'Member'?). We change 'Member' to NaN and then test whether everything is missing.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['dave', 'bill', 'sally', 'ian'],
                   'B': ['blue', 'red', 'green', 'org'],
                   'C': [np.nan, np.nan, 'Member', 'Paid']})
df2 = df.drop(index=[3], axis=0)
print(df)
print(df2)

# df
col = pd.Series([np.nan if x == 'Member' else x for x in df['C'].tolist()])
if col.isnull().all():
    df = df.drop(columns='C')

# df2
col = pd.Series([np.nan if x == 'Member' else x for x in df2['C'].tolist()])
if col.isnull().all():
    df2 = df2.drop(columns='C')

print(df)
print(df2)
A B C
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 ian org Paid
A B
0 dave blue
1 bill red
2 sally green
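A more compact variant of the same test (my sketch, not from the thread) builds one boolean mask for "NaN or 'Member'" and drops the column when it is all True:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['dave', 'bill', 'sally'],
                   'b': ['blue', 'red', 'green'],
                   'c': [np.nan, np.nan, 'Member']})

# True where the value is missing or equals 'Member'
droppable = df['c'].isna() | df['c'].eq('Member')
if droppable.all():
    df = df.drop(columns='c')
print(df)  # column 'c' is gone for this frame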

Pandas create a new data frame from counting rows into columns

I have something like this data frame:
item color
0 A red
1 A red
2 A green
3 B red
4 B green
5 B green
6 C red
7 C green
And I want to count the times each color repeats for each item and pivot the counts into columns like this:
item red green
0 A 2 1
1 B 1 2
2 C 1 1
Any thoughts? Thanks in advance.
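One common way to get this shape (a sketch, not from this thread) is pd.crosstab, which counts the (item, color) pairs and spreads the colors into columns:
import pandas as pd

df = pd.DataFrame({'item': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'color': ['red', 'red', 'green', 'red', 'green',
                             'green', 'red', 'green']})

# rows become items, columns become colors, cells hold the counts
out = pd.crosstab(df['item'], df['color']).reset_index()
out.columns.name = None
print(out)
#   item  green  red
# 0    A      1    2
# 1    B      2    1
# 2    C      1    1
Note the color columns come out in alphabetical order; reorder with out[['item', 'red', 'green']] if needed.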

How to count matching words from 2 csv files

I have 2 csv files, dictionary.csv and story.csv. I want to count how many words in story.csv per row match words from dictionary.csv.
Below are truncated examples
Story.csv
id STORY
0 Jennie have 2 shoes, a red heels and a blue sneakers
1 The skies are pretty today
2 One of aesthetic color is grey
Dictionary.csv
red
green
grey
blue
black
The output I expected is
output.csv
id STORY Found
0 Jennie have 2 shoes, a red heels and a blue sneakers 2
1 The skies are pretty today 0
2 One of aesthetic color is grey 1
This is the code I have so far, but I only got NaN (empty cells):
import pandas as pd
import csv
news=pd.read_csv("Story.csv")
dictionary=pd.read_csv("Dictionary.csv")
news['STORY'].value_counts()
news['How many found in 1'] = dictionary['Lists'].map(news['STORY'].value_counts())
news.to_csv("output.csv")
I tried using .str.count as well, but I kept on getting zeros.
Try this
import pandas as pd

# create the sample data frame
data = {'id': [0, 1, 2],
        'STORY': ['Jennie have 2 shoes, a red heels and a blue sneakers',
                  'The skies are pretty today',
                  'One of aesthetic color is grey']}
word_list = ['red', 'green', 'grey', 'blue', 'black']
df = pd.DataFrame(data)

# start counting: per row, count each word and sum the counts
df['Found'] = df['STORY'].astype(str).apply(
    lambda t: pd.Series({word: t.count(word) for word in word_list}).sum())

# alternatively, can use this
# df['Found'] = df['STORY'].astype(str).apply(
#     lambda t: sum(t.count(word) for word in word_list))
Output
df
# id STORY Found
#0 0 Jennie have 2 shoes, a red heels and a blue sneakers 2
#1 1 The skies are pretty today 0
#2 2 One of aesthetic color is grey 1
Bonus edit: if you want to see the detailed breakdown of the word count by word, run this:
df['STORY'].astype(str).apply(lambda t: pd.Series({word: t.count(word) for word in word_list}))
# red green grey blue black
#0 1 0 0 1 0
#1 0 0 0 0 0
#2 0 0 1 0 0
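One caveat worth noting (my addition, not from the answer): str.count matches substrings, so 'red' inside 'bored' would also be counted. A whole-word variant with a regex and Series.str.count, reusing the df and word_list built above:
import re

# \b word boundaries ensure only whole words are counted
pattern = r'\b(?:' + '|'.join(map(re.escape, word_list)) + r')\b'
df['Found'] = df['STORY'].astype(str).str.count(pattern)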

Concatenate two rows based on the same value in the next row of a new column

I am creating a new column and trying to concatenate the rows where the column value is the same. The 1st row would have the initial value in that row; the second row would have the values of the 1st and 2nd rows. I have been able to make it work where the column has two values, but if the column has 3 or more values, only two values are concatenated in the final row.
import numpy as np
import pandas as pd

data = {'Fruit': ['Apple', 'Apple', 'Mango', 'Mango', 'Mango', 'Watermelon'],
        'Color': ['Red', 'Green', 'Yellow', 'Green', 'Orange', 'Green']}
df = pd.DataFrame(data)
df['length'] = df['Fruit'].str.len()
df['Fruit_color'] = df['Fruit'] + df['length'].map(lambda x: ' ' * x)
df['same_fruit'] = np.where(df['Fruit'] != df['Fruit'].shift(1),
                            df['Fruit_color'],
                            df['Fruit_color'].shift(1) + " " + df['Fruit_color'])
Current output and expected output were attached as screenshots and are not reproduced here. How do I get the expected output?
Regards,
Ren.
Here is an answer:
In [1]:
import pandas as pd
data = {'Fruit': ['Apple', 'Apple', 'Mango', 'Mango', 'Mango', 'Watermelon'],
        'Color': ['Red', 'Green', 'Yellow', 'Green', 'Orange', 'Green']}
df = pd.DataFrame(data)
df['length'] = df['Fruit'].str.len()
df['Fruit_color'] = df['Fruit'] + ' ' + df['Color']
df.sort_values(by=['Fruit_color'], inplace=True)

## Get the maximum number of occurrences of a fruit
maximum = df[['Fruit', 'Color']].groupby(['Fruit']).count().max().tolist()[0]

## Shift as many times as the highest occurrence count
new_cols = []
for i in range(maximum):
    temporary_col = 'Fruit_' + str(i)
    df[temporary_col] = df['Fruit'].shift(i + 1)
    new_col = 'new_col_' + str(i)
    df[new_col] = df['Fruit_color'].shift(i + 1)
    df.loc[df[temporary_col] != df['Fruit'], new_col] = ''
    df.drop(columns=[temporary_col], axis=1, inplace=True)
    new_cols.append(new_col)

## Use these shifted columns to create `same_fruit`, then drop them
df['same_fruit'] = df['Fruit_color']
for col in new_cols:
    df['same_fruit'] = df['same_fruit'] + ' ' + df[col]
    df.drop(columns=[col], axis=1, inplace=True)
Out [1]:
Fruit Color length Fruit_color same_fruit
1 Apple Green 5 Apple Green Apple Green
0 Apple Red 5 Apple Red Apple Red Apple Green
3 Mango Green 5 Mango Green Mango Green
4 Mango Orange 5 Mango Orange Mango Orange Mango Green
2 Mango Yellow 5 Mango Yellow Mango Yellow Mango Orange Mango Green
5 Watermelon Green 10 Watermelon Green Watermelon Green
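For comparison, a shorter route to the same cumulative concatenation (my sketch under the same sorting, not from the original answer) uses itertools.accumulate inside a groupby transform:
import itertools
import pandas as pd

data = {'Fruit': ['Apple', 'Apple', 'Mango', 'Mango', 'Mango', 'Watermelon'],
        'Color': ['Red', 'Green', 'Yellow', 'Green', 'Orange', 'Green']}
df = pd.DataFrame(data)
df['Fruit_color'] = df['Fruit'] + ' ' + df['Color']
df = df.sort_values('Fruit_color')

# within each fruit group, prepend the current value to everything seen so far
df['same_fruit'] = (df.groupby('Fruit')['Fruit_color']
                      .transform(lambda s: list(itertools.accumulate(
                          s, lambda acc, new: new + ' ' + acc))))
print(df)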

How to split Pandas string column into different rows?

Here is my issue. I have data like this:
import pandas as pd

data = {
    'name': ["Jack ;; Josh ;; John", "Apple ;; Fruit ;; Pear"],
    'grade': [11, 12],
    'color': ['black', 'blue']
}
df = pd.DataFrame(data)
It looks like:
name grade color
0 Jack ;; Josh ;; John 11 black
1 Apple ;; Fruit ;; Pear 12 blue
I want it to look like:
name grade color
0 Jack 11 black
1 Josh 11 black
2 John 11 black
3 Apple 12 blue
4 Fruit 12 blue
5 Pear 12 blue
So first I'd need to split name by using ";;" and then explode that list into different rows
Use Series.str.split, reshape with DataFrame.stack, and add back the original remaining columns with DataFrame.join:
c = df.columns
s = (df.pop('name')
       .str.split(' ;; ', expand=True)
       .stack()
       .reset_index(level=1, drop=True)
       .rename('name'))
df = df.join(s).reset_index(drop=True).reindex(columns=c)
print(df)
name grade color
0 Jack 11 black
1 Josh 11 black
2 John 11 black
3 Apple 12 blue
4 Fruit 12 blue
5 Pear 12 blue
You have 2 challenges:
First, split the name on ' ;; ' into a list AND turn each item in the list into its own column, such that:
df['name'] = df.name.str.split(' ;; ')  # split on ' ;; ' so the names keep no stray spaces
df_temp = df.name.apply(pd.Series)
df = pd.concat([df[:], df_temp[:]], axis=1)
df.drop('name', inplace=True, axis=1)
result:
grade color 0 1 2
0 11 black Jack Josh John
1 12 blue Apple Fruit Pear
Then melt the frame to get the desired result:
df.melt(id_vars=["grade", "color"], value_name="Name") \
  .sort_values('grade').drop('variable', axis=1)
desired result:
grade color Name
0 11 black Jack
2 11 black Josh
4 11 black John
1 12 blue Apple
3 12 blue Fruit
5 12 blue Pear
