How to count matching words from 2 csv files - python-3.x

I have 2 CSV files, dictionary.csv and story.csv. I want to count how many words in each row of story.csv match words from dictionary.csv.
Below are truncated examples
Story.csv
id STORY
0 Jennie have 2 shoes, a red heels and a blue sneakers
1 The skies are pretty today
2 One of aesthetic color is grey
Dictionary.csv
red
green
grey
blue
black
The output I expect is
output.csv
id STORY Found
0 Jennie have 2 shoes, a red heels and a blue sneakers 2
1 The skies are pretty today 0
2 One of aesthetic color is grey 1
This is the code I have so far, but I only get NaN (empty cells):
import pandas as pd
import csv
news=pd.read_csv("Story.csv")
dictionary=pd.read_csv("Dictionary.csv")
news['STORY'].value_counts()
news['How many found in 1'] = dictionary['Lists'].map(news['STORY'].value_counts())
news.to_csv("output.csv")
I tried using .str.count as well, but I kept getting zeros.

Try this:
import pandas as pd
#create the sample data frame
data = {'id':[0,1,2],'STORY':['Jennie have 2 shoes, a red heels and a blue sneakers',\
'The skies are pretty today',\
'One of aesthetic color is grey']}
word_list = ['red', 'green', 'grey', 'blue', 'black']
df = pd.DataFrame(data)
#start counting
df['Found'] = df['STORY'].astype(str).apply(lambda t: pd.Series({word: t.count(word) for word in word_list}).sum())
#alternatively, can use this
#df['Found'] = df['STORY'].astype(str).apply(lambda t: sum([t.count(word) for word in word_list]))
Output
df
# id STORY Found
#0 0 Jennie have 2 shoes, a red heels and a blue sneakers 2
#1 1 The skies are pretty today 0
#2 2 One of aesthetic color is grey 1
Bonus edit: if you want to see the detailed breakdown of the word count by word, then run this:
df['STORY'].astype(str).apply(lambda t: pd.Series({word: t.count(word) for word in word_list}))
# red green grey blue black
#0 1 0 0 1 0
#1 0 0 0 0 0
#2 0 0 1 0 0
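One caveat with t.count (and a plain .str.count on a literal) is that it counts substrings, so 'red' would also be found inside a word like 'hundred'. If whole-word matches are wanted, a word-boundary regex is one way around that; a sketch on the same sample data:

```python
import re
import pandas as pd

data = {'id': [0, 1, 2],
        'STORY': ['Jennie have 2 shoes, a red heels and a blue sneakers',
                  'The skies are pretty today',
                  'One of aesthetic color is grey']}
word_list = ['red', 'green', 'grey', 'blue', 'black']
df = pd.DataFrame(data)

# Build a single alternation pattern; \b word boundaries keep 'red'
# from also matching inside longer words such as 'hundred'.
pattern = r'\b(?:' + '|'.join(map(re.escape, word_list)) + r')\b'
df['Found'] = df['STORY'].astype(str).str.count(pattern)
```

This gives the same 2 / 0 / 1 counts on the sample, but will not over-count when a dictionary word happens to be embedded in a longer word.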

Related

Insert list of strings as a column in a dataframe

I have a pandas dataframe to which I would like to add a new column ('colors') that contains a list of all colors (column 'color') of an item in that year prior to that row (i.e., grouped by the columns 'year' and 'item', including only the rows above).
Suppose my df looks like this:
id item year color
0 shirt 2021 yellow
1 shoes 2022 pink
2 shirt 2021 green
3 shirt 2021 black
My goal would be:
id item year color colors
0 shirt 2021 yellow []
1 shoes 2022 pink [pink]
2 shirt 2021 green [yellow]
3 shirt 2021 black [yellow, green]
So far I have played around with code like this:
self.df['colors'] = self.df.groupby(by = ['year', 'item'], group_keys = False)['color'].apply(list())
or
self.df['colors'] = self.df.groupby(by = ['year', 'item'], group_keys = False)['color'].apply(lambda x : list(x.shift()))
But I ran into errors around re-indexing etc., so I would be glad if some of you experts could help me here.
Here is one way you could do it:
import itertools
df['colors'] = df.groupby(['item', 'year'])['color'].transform(lambda x: list(itertools.accumulate(x, '{} {}'.format))).shift()
print(df)
id item year color colors
0 0 shirt 2021 yellow NaN
1 1 shoes 2022 pink yellow
2 2 shirt 2021 green pink
3 3 shirt 2021 black yellow green
If you need them to be stored as lists, just add df['colors'] = df['colors'].str.split(). Replacing the NaN with an empty list is also possible if you want that.
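Note that the .shift() above runs over the whole column, so the last accumulated value of one group leaks into the first row of the next group (visible in the shoes/pink row above). If the shift should instead happen within each group, one variant (a sketch) moves the shift inside the transform:

```python
import itertools
import pandas as pd

df = pd.DataFrame({'item': ['shirt', 'shoes', 'shirt', 'shirt'],
                   'year': [2021, 2022, 2021, 2021],
                   'color': ['yellow', 'pink', 'green', 'black']})

# Accumulate colors within each (item, year) group, then shift inside
# the group so each row only sees colors from earlier rows of its group.
df['colors'] = df.groupby(['item', 'year'])['color'].transform(
    lambda x: pd.Series(list(itertools.accumulate(x, '{} {}'.format)),
                        index=x.index).shift())
df['colors'] = df['colors'].str.split()
```

With this, the first row of every group gets NaN (no previous colors in its group), which you can fill with an empty list afterwards if preferred.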

Pandas create a new data frame from counting rows into columns

I have something like this data frame:
item color
0 A red
1 A red
2 A green
3 B red
4 B green
5 B green
6 C red
7 C green
And I want to count the times a color repeat for each item and group-by it into columns like this:
item red green
0 A 2 1
1 B 1 2
2 C 1 1
Any thoughts? Thanks in advance
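One way to get exactly that shape is pd.crosstab, which counts occurrences directly; a sketch on the sample data (note the color columns come out in alphabetical order):

```python
import pandas as pd

df = pd.DataFrame({'item': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'color': ['red', 'red', 'green', 'red', 'green', 'green',
                             'red', 'green']})

# Count how often each color appears per item, one column per color.
out = pd.crosstab(df['item'], df['color']).reset_index().rename_axis(None, axis=1)
```

reset_index turns the item index back into a regular column, and rename_axis drops the leftover 'color' label on the columns.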

How to draw venn diagram from a dummy variable in Python Matplotlib_venn?

I have the following code to draw the venn diagram.
import numpy as np
import pandas as pd
import matplotlib_venn as vplt
x = np.random.randint(2, size=(10,3))
df = pd.DataFrame(x, columns=['A', 'B','C'])
print(df)
v = vplt.venn3(subsets=(1,1,1,1,1,1,1))
and the output looks like this:
I actually want to find the numbers in subsets() from the data set. How do I do that? Or is there another easy way to make these Venn diagrams directly from the dataset?
I also want to draw a box around the diagram and annotate the remaining area as the people for whom A, B, and C are all 0. Then I would calculate the percentage of the people in each circle and use it as the label. I am not sure how to achieve this.
Background of the Problem:
I have a dataset of more than 500 observations and these three columns are recorded from one variable where multiple choices can be chosen as answers.
I want to visualize the data in a graph which shows how many people have chosen the 1st, 2nd, etc., as well as how many have chosen both the 1st and 2nd, the 1st and 3rd, etc.
Use numpy.argwhere to get the indices of the 1s in each column and plot the resultant sets:
In [85]: df
Out[85]:
A B C
0 0 1 1
1 1 1 0
2 1 1 0
3 0 0 1
4 1 1 0
5 1 1 0
6 0 0 0
7 0 0 0
8 1 1 0
9 1 0 0
In [86]: sets = [set(np.argwhere(v).ravel()) for k,v in df.items()]
...: venn3(sets, df.columns)
...: plt.show()
Note: if you want to draw an additional box with the number of items in none of the categories, add these lines:
In [87]: ax = plt.gca()
In [88]: xmin, _, ymin, _ = ax.axes.axis('on')
In [89]: ax.text(xmin, ymin, (df == 0).all(1).sum(), ha='left', va='bottom')
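If you do want the actual subsets tuple rather than placeholder ones, the seven region sizes can also be computed from the dummy columns with boolean masks; a sketch on the same sample frame, using the region order venn3 expects (100, 010, 110, 001, 101, 011, 111):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 1, 0, 1, 1, 0, 0, 1, 1],
                   'B': [1, 1, 1, 0, 1, 1, 0, 0, 1, 0],
                   'C': [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]})

a, b, c = (df[col].astype(bool) for col in ['A', 'B', 'C'])
subsets = ((a & ~b & ~c).sum(),   # A only
           (~a & b & ~c).sum(),   # B only
           (a & b & ~c).sum(),    # A and B
           (~a & ~b & c).sum(),   # C only
           (a & ~b & c).sum(),    # A and C
           (~a & b & c).sum(),    # B and C
           (a & b & c).sum())     # A, B and C
outside = (~a & ~b & ~c).sum()    # people with A, B, C all 0
```

subsets can then be passed straight to venn3 (with df.columns as set_labels), and outside is the count for the box annotation; percentage labels can be formatted from these same numbers divided by len(df).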

Merging two sheets of one excel into single sheet

I am trying to merge 2 sheets from excel.xlsx using a Python script. When sheet1's CLASS matches sheet2's C_MAP, I want to merge DSC and ASC after CLASS, either in sheet1 or in a new sheet.
To clarify, I am attaching my Excel sheets.
this is my Sheet1:
P_MAP Q_GROUP CLASS
0 ram 2 pink
1 4 silver
2 sham 5 green
3 0 default
4 nil 2 pink
it contains P_MAP,Q_GROUP,CLASS
this is my Sheet2:
C_MAP DSC ASC
0 pink h1 match
1 green h2 match
2 silver h3 match
it contains C_MAP,ASC,DSC
So, when CLASS matches C_MAP it should add ASC and DSC, and if it doesn't match, add NA.
The output I want looks like this:
P_MAP Q_GROUP CLASS DSC ASC
0 ram 2 pink h1 match
1 4 silver h3 match
2 sham 5 green h2 match
3 0 default 0 NA
4 nil 2 pink h1 match
What you want is pd.merge:
df1 = pd.read_excel('filename.xlsx', sheet_name='Sheet1') # fill in the correct excel filename
df2 = pd.read_excel('filename.xlsx', sheet_name='Sheet2') # fill in the correct excel filename
df_final = df1.merge(df2,
left_on='CLASS',
right_on='C_MAP',
how='left').drop('C_MAP', axis=1)
df_final.to_excel('filename2.xlsx')
Output
P_MAP Q_GROUP CLASS DSC ASC
0 ram 2 pink h1 match
1 4 silver h3 match
2 sham 5 green h2 match
3 0 default NaN NaN
4 nil 2 pink h1 match
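If you want the literal string NA from the question's expected output instead of pandas' NaN, the merged columns can be filled afterwards; a sketch with the frames built inline instead of read from Excel:

```python
import pandas as pd

df1 = pd.DataFrame({'P_MAP': ['ram', '', 'sham', '', 'nil'],
                    'Q_GROUP': [2, 4, 5, 0, 2],
                    'CLASS': ['pink', 'silver', 'green', 'default', 'pink']})
df2 = pd.DataFrame({'C_MAP': ['pink', 'green', 'silver'],
                    'DSC': ['h1', 'h2', 'h3'],
                    'ASC': ['match', 'match', 'match']})

# Left merge keeps every row of sheet1; drop the now-redundant key column.
df_final = df1.merge(df2, left_on='CLASS', right_on='C_MAP',
                     how='left').drop('C_MAP', axis=1)
# Non-matching rows come back as NaN; replace with a literal 'NA'.
df_final[['DSC', 'ASC']] = df_final[['DSC', 'ASC']].fillna('NA')
```

The same two lines work unchanged on the frames loaded with pd.read_excel.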

Splitting Column Lists in Pandas DataFrame

I'm looking for a good way to solve the following problem. My current fix is not particularly clean, and I'm hoping to learn from your insight.
Suppose I have a pandas DataFrame whose entries look like this:
>>> df=pd.DataFrame(index=[1,2,3],columns=['Color','Texture','IsGlass'])
>>> df['Color']=[np.nan,['Red','Blue'],['Blue', 'Green', 'Purple']]
>>> df['Texture']=[['Rough'],np.nan,['Silky', 'Shiny', 'Fuzzy']]
>>> df['IsGlass']=[1,0,1]
>>> df
Color Texture IsGlass
1 NaN ['Rough'] 1
2 ['Red', 'Blue'] NaN 0
3 ['Blue', 'Green', 'Purple'] ['Silky','Shiny','Fuzzy'] 1
So each observation in the index corresponds to something I measured about its color, texture, and whether it's glass or not. What I'd like to do is turn this into a new "indicator" DataFrame, by creating a column for each observed value, and changing the corresponding entry to a one if I observed it, and NaN if I have no information.
>>> df
Red Blue Green Purple Rough Silky Shiny Fuzzy Is Glass
1 Nan Nan Nan Nan 1 NaN Nan Nan 1
2 1 1 Nan Nan Nan Nan Nan Nan 0
3 Nan 1 1 1 Nan 1 1 1 1
I have a solution that loops over each column, looks at its values, and through a series of try/excepts for non-NaN values splits the lists, creates a new column, etc., and concatenates.
This is my first post to StackOverflow - I hope this post conforms to the posting guidelines. Thanks.
Stacking Hacks!
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.stack().unstack(fill_value=[])
def b(c):
    d = mlb.fit_transform(c)
    return pd.DataFrame(d, c.index, mlb.classes_)
pd.concat([b(df[c]) for c in ['Color', 'Texture']], axis=1).join(df.IsGlass)
Blue Green Purple Red Fuzzy Rough Shiny Silky IsGlass
1 0 0 0 0 0 1 0 0 1
2 1 0 0 1 0 0 0 0 0
3 1 1 1 0 1 0 1 1 1
I am just using pandas' get_dummies:
l=[pd.get_dummies(df[x].apply(pd.Series).stack(dropna=False)).sum(level=0) for x in ['Color','Texture']]
pd.concat(l,axis=1).assign(IsGlass=df.IsGlass)
Out[662]:
Blue Green Purple Red Fuzzy Rough Shiny Silky IsGlass
1 0 0 0 0 0 1 0 0 1
2 1 0 0 1 0 0 0 0 0
3 1 1 1 0 1 0 1 1 1
For each texture/color in each row, check whether the value is null; if not, add that value as a column equal to 1 for that row.
import numpy as np
import pandas as pd
df=pd.DataFrame(index=[1,2,3],columns=['Color','Texture','IsGlass'])
df['Color']=[np.nan,['Red','Blue'],['Blue', 'Green', 'Purple']]
df['Texture']=[['Rough'],np.nan,['Silky', 'Shiny', 'Fuzzy']]
df['IsGlass']=[1,0,1]
for row in df.itertuples():
    if not np.all(pd.isnull(row.Color)):
        for val in row.Color:
            df.loc[row.Index, val] = 1
    if not np.all(pd.isnull(row.Texture)):
        for val in row.Texture:
            df.loc[row.Index, val] = 1
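On newer pandas versions, explode offers another route to the same indicator frame; a sketch (unobserved values come out as 0 rather than NaN, as in the other answers):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[1, 2, 3], columns=['Color', 'Texture', 'IsGlass'])
df['Color'] = [np.nan, ['Red', 'Blue'], ['Blue', 'Green', 'Purple']]
df['Texture'] = [['Rough'], np.nan, ['Silky', 'Shiny', 'Fuzzy']]
df['IsGlass'] = [1, 0, 1]

# Explode each list column to one value per row, one-hot encode it,
# then collapse back to the original index with max().
parts = [pd.get_dummies(df[c].explode()).groupby(level=0).max()
         for c in ['Color', 'Texture']]
out = pd.concat(parts, axis=1).join(df['IsGlass'])
```

explode leaves a NaN placeholder row for the missing lists, so every original index survives the round trip and the all-zero rows (e.g. Color for row 1) fall out naturally.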