keep unique words in a pandas dataframe row - python-3.x

Dataframe:
>>> type(df)
pandas.core.frame.DataFrame
>>> df
ID       Property Type                               Amenities
1952043  Apartment, Villa, Apartment                 Park, Jogging Track, Park
1918916  Bungalow, Cottage House, Cottage, Bungalow  Garden, Play Ground
How can I keep just the unique comma-separated terms in each dataframe cell? In this case "Cottage House" and "Cottage" must not be considered the same term. This must be checked for all columns of the dataframe. So my desired output should look like below:
Desired Output:
ID       Property Type                     Amenities
1952043  Apartment, Villa                  Park, Jogging Track
1918916  Bungalow, Cottage House, Cottage  Garden, Play Ground

First, I create a function that does what you want for a given string. Second, I apply this function to all strings in the column.
import pandas as pd

df = pd.DataFrame([['Apartment, Villa, Apartment',
                    'Park, Jogging Track, Park'],
                   ['Bungalow, Cottage House, Cottage, Bungalow',
                    'Garden, Play Ground']],
                  columns=['Property Type', 'Amenities'])

def drop_duplicates(row):
    # Split the string on ', ', drop duplicates and join back.
    # pd.unique keeps first-seen order, matching the desired output
    # (np.unique would sort the terms alphabetically).
    words = row.split(', ')
    return ', '.join(pd.unique(words).tolist())

# drop_duplicates is applied to all rows of df.
df['Property Type'] = df['Property Type'].apply(drop_duplicates)
df['Amenities'] = df['Amenities'].apply(drop_duplicates)
print(df)
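The question asks for this across all columns, so rather than naming each one you can apply the helper to every text column at once. A minimal sketch, assuming the string columns are the only object-dtype columns, as in the question's frame where ID is numeric (note that applymap was renamed DataFrame.map in pandas 2.1+):
# Select every string (object-dtype) column and dedupe each cell.
text_cols = df.select_dtypes(include='object').columns
df[text_cols] = df[text_cols].applymap(drop_duplicates)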

Read the file into a pandas DataFrame:
>>> import pandas as pd
>>> df = pd.read_csv('test.txt', sep='\t')
The main idea is to:
1. iterate through every row,
2. split the string in the target column on ', ',
3. return the unique set() of the list from step 2.
Code:
>>> proptype_column = df['Property Type']
>>> for row in proptype_column:                # Step 1.
...     items_in_row = row.split(', ')         # Step 2.
...     uniq_items_in_row = set(items_in_row)  # Step 3.
...     print(uniq_items_in_row)
...
{'Apartment', 'Villa'}
{'Cottage', 'Bungalow', 'Cottage House'}
Now you can achieve the same with the DataFrame.apply() function:
>>> df['Property Type'].apply(lambda cell: set([c.strip() for c in cell.split(',')]))
0 {Apartment, Villa}
1 {Cottage, Bungalow, Cottage House}
Name: Property Type, dtype: object
>>> proptype_uniq = df['Property Type'].apply(lambda cell: set(cell.split(', ')))
>>> df['Property Type (Unique)'] = proptype_uniq
>>> df
      ID                               Property Type  \
0  12345                 Apartment, Villa, Apartment
1  67890  Bungalow, Cottage House, Cottage, Bungalow

                   Amenities              Property Type (Unique)
0  Park, Jogging Track, Park                  {Apartment, Villa}
1        Garden, Play Ground  {Cottage, Bungalow, Cottage House}
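The set() output loses the original order of the terms. If you want comma-joined strings back in first-seen order instead, dict.fromkeys works as an ordered dedup on Python 3.7+; a small sketch on the same column:
>>> df['Property Type'] = df['Property Type'].apply(
...     lambda cell: ', '.join(dict.fromkeys(c.strip() for c in cell.split(','))))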

Related

how to extract a specific text from str column and take it to another column

I have a df with str columns like: All Unidades, Peter Lopez [QX1234]
And I need to select QX1234 and create a new column with it.
How can I create column2?
You can use a regex and then set the new column with the result. Since you didn't provide any code, here's a sample based on what you posted.
>>> import re
>>> sample = 'Peter Lopez [QX1234] '
>>> sample
'Peter Lopez [QX1234] '
>>> match = re.search(r'\[(.*?)\]', sample).group(1)
>>> match
'QX1234'
>>> df['column2'] = match
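Note that df['column2'] = match broadcasts that one matched string to every row. To extract a value per row, Series.str.extract applies the pattern down the whole column; a sketch, where 'column1' is a stand-in for your actual source column name:
>>> df['column2'] = df['column1'].str.extract(r'\[(.*?)\]', expand=False)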

pandas: calculate fuzzywuzzy for each category separately

I have a dataset as follows, only with more rows:
import pandas as pd

data = {'First': ['First value', 'Third value', 'Second value',
                  'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there',
                   'the old woman is here', 'the young boy is there',
                   'the young girl is here', 'the old girl is here']}
df = pd.DataFrame(data, columns=['First', 'Second'])
I have calculated the fuzzywuzzy average for the entire dataset like this:
from itertools import combinations
from fuzzywuzzy import fuzz

def similarity_measure(doc1, doc2):
    return fuzz.token_set_ratio(doc1, doc2)

d = df.groupby('First')['Second'].apply(lambda x: ', '.join(x))
d = d.reset_index()

scores = []
for val in list(combinations(range(len(d)), 2)):
    scores.append(similarity_measure(d.iloc[val[0], 1], d.iloc[val[1], 1]))
avg = sum(scores) / len(scores)
print('lexical overlap between all example pairs in the dataset is: ', avg)
However, I would like to also get this average for each category in the first column separately. So I would like something like (for example):
similarity average for sentences in First value: 85.56
similarity average for sentences in Second value: 89.01
similarity average for sentences in Third value: 90.01
That is, I would like to modify the for loop so that it produces the above output.
To compute the mean within each group, you need two steps:
1. Group by some criterion, in your case column First. It seems that you already know how.
2. Create a function that computes the similarity for a group: the all_similarity_measure function in the code below.
Code
import pandas as pd
from fuzzywuzzy import fuzz
from itertools import combinations

def similarity_measure(doc1, doc2):
    return fuzz.token_set_ratio(doc1, doc2)

data = {'First': ['First value', 'Third value', 'Second value',
                  'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there',
                   'the old woman is here', 'the young boy is there',
                   'the young girl is here', 'the old girl is here']}
df = pd.DataFrame(data, columns=['First', 'Second'])

def all_similarity_measure(gdf):
    """Compute the mean similarity between all pairs of sentences in a Series."""
    return pd.Series([similarity_measure(*docs) for docs in combinations(gdf, 2)]).mean()

res = df.groupby('First', as_index=False)['Second'].apply(all_similarity_measure)
print(res)
Output
          First  Second
0   First value    63.0
1  Second value    86.0
2   Third value    98.0
The key to computing the mean similarity is this expression:
return pd.Series([similarity_measure(*docs) for docs in combinations(gdf, 2)]).mean()
Basically you generate the pairs of sentences using combinations (no need to access by index), construct a Series from the scores and compute mean on it.
Any function for computing the mean can be used instead, for example statistics.mean, which avoids constructing a Series:
from statistics import mean

def all_similarity_measure(gdf):
    """Compute the mean similarity between all pairs of sentences in a Series."""
    return mean(similarity_measure(*docs) for docs in combinations(gdf, 2))
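One caveat: a group containing a single sentence yields no pairs, so the pd.Series version returns NaN while statistics.mean raises StatisticsError on an empty iterable. A small guard (a sketch using the same helpers as above) covers that case:
import math

def all_similarity_measure(gdf):
    pairs = list(combinations(gdf, 2))
    if not pairs:  # a single-sentence group has nothing to compare
        return math.nan
    return mean(similarity_measure(a, b) for a, b in pairs)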

How to get only different words from two pandas.DataFrame columns

I have a DataFrame with columns id, keywords1 and keywords2. I would like to get only the words from column keywords2 that are not in column keywords1. I also need to clean the new column of meaningless words like phph, wfgh... I'm only interested in English words.
Example:
data = [[1, 'detergent', 'detergent for cleaning stains'],
        [2, 'battery charger', 'wwfgh, old, glass'],
        [3, 'sunglasses, black, metal', 'glass gggg jik xxx,'],
        [4, 'chemicals, flammable', 'chemicals, phph']]
df = pd.DataFrame(data, columns=['id', 'keywords1', 'keywords2'])
df
Try:
import numpy as np

# Tokenize each cell into a set of words (str.findall avoids the empty
# strings that splitting on punctuation would leave behind).
df["keywords1"] = df["keywords1"].str.findall(r"\w+").map(set)
df["keywords2"] = df["keywords2"].str.findall(r"\w+").map(set)
# (kw1 ^ kw2) & kw2 keeps exactly the words of keywords2 not in keywords1.
df["keywords3"] = np.bitwise_and(np.bitwise_xor(df["keywords1"], df["keywords2"]), df["keywords2"])
# Optional - if you wish to keep it as a string rather than a set:
df["keywords3"] = df["keywords3"].str.join(", ")
Outputs:
   id  ...              keywords3
0   1  ...  cleaning, for, stains
1   2  ...      wwfgh, glass, old
2   3  ...  jik, xxx, glass, gggg
3   4  ...                   phph
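Since (kw1 ^ kw2) & kw2 is just the set difference kw2 - kw1, an equivalent and arguably plainer sketch over the same tokenized columns is:
df["keywords3"] = [kw2 - kw1 for kw1, kw2 in zip(df["keywords1"], df["keywords2"])]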
Let's try:
def words_diff(words1, words2):
    # Strip commas so 'old,' and 'old' compare equal, then split on whitespace.
    kw1 = words1.replace(',', ' ').split()
    kw2 = words2.replace(',', ' ').split()
    return [x for x in kw2 if x not in kw1]

df['diff'] = df.apply(lambda x: words_diff(x['keywords1'], x['keywords2']), axis=1)
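Neither snippet filters out the meaningless tokens the question mentions. One option for keeping only English words is to check each token against a word list, for example NLTK's words corpus; a sketch, assuming nltk is installed and nltk.download('words') has been run:
from nltk.corpus import words

english = set(words.words())
df['diff'] = df['diff'].map(lambda ws: [w for w in ws if w.lower() in english])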

How to pair each column in the list form in pandas by Python3

df = pd.DataFrame([['an apple is red', 'pop is here'],
                   ['pear is green', 'see is blue']], columns=['A', 'B'])
from nltk.tokenize import TweetTokenizer
df['A'] = [TweetTokenizer().tokenize(text) for text in df['A']]
df['id'] = [1, 2]

for k in df['A']:
    print(k)
    pid = df[df['A'] == k]['id'].values[0]
['an', 'apple', 'is', 'red']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-156-f9e4fa5143b0> in <module>
      1 for k in df['A']:
      2     print(k)
----> 3     pid = df[df['A']==k]['id'].values[0]

~/Library/Python/3.7/lib/python/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1743                 # as it will broadcast
   1744                 if other.ndim != 0 and len(self) != len(other):
-> 1745                     raise ValueError('Lengths must match to compare')
   1746
   1747                 res_values = na_op(self.values, np.asarray(other))

ValueError: Lengths must match to compare
I want to get the pid when k is equal to each row of column A. I can make it work when each row is a string; how do I do it when each row is a list? Once matched, I want the corresponding value of 'id'.
Expected pid output:
1
2
I have made some changes to your code and it is returning the pid correctly as per your question.
df = pd.DataFrame([['an apple is red', 'pop is here'],
                   ['pear is green', 'see is blue']], columns=['A', 'B'])
from nltk.tokenize import TweetTokenizer
df['A'] = [TweetTokenizer().tokenize(text) for text in df['A']]
df['id'] = [1, 2]

for k in df['A']:
    print(k)
    pid = df[df['A'].map(tuple) == tuple(k)]['id'].values[0]
    print(pid)
Since list comparisons are messy in a dataframe, it is better to convert the column values to tuples and then do the comparison.
You can also look into getting the index of dataframe rows while iterating if you just need the index of each row here; see the sketch below.
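If all you actually need is the id sitting on the same row as k, iterating the two columns together avoids the comparison entirely. A minimal sketch over the same df as above:
for pid, k in zip(df['id'], df['A']):
    print(k)
    print(pid)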

Pandas dataframe column names seem wrong

I'm a student and therefore a rookie. I'm trying to create a Pandas dataframe of crime statistics by neighborhood in San Francisco. My problem is that I want the column names to be simply "Neighborhood" and "Count". Instead I seem to be stuck with a separate line that says "('Neighborhood', 'count')" instead of the proper labels. Here's the code:
df_counts = df_incidents.copy()
df_counts.rename(columns={'PdDistrict':'Neighborhood'}, inplace=True)
df_counts.drop(['IncidntNum', 'Category', 'Descript', 'DayOfWeek', 'Date', 'Time', 'Location', 'Resolution', 'Address', 'X', 'Y', 'PdId'], axis=1, inplace=True)
df_totals=df_counts.groupby(['Neighborhood']).agg({'Neighborhood':['count']})
df_totals.columns = list(map(str, df_totals.columns)) # Not sure if I need this
df_totals
Output:
('Neighborhood', 'count')
Neighborhood
BAYVIEW 14303
CENTRAL 17666
INGLESIDE 11594
MISSION 19503
NORTHERN 20100
PARK 8699
RICHMOND 8922
SOUTHERN 28445
TARAVAL 11325
TENDERLOIN 9942
No need for agg() here. Note that after the drop, Neighborhood is the only column left, so groupby().count() would have no column left to count; size() counts the rows in each group instead:
df_totals = (df_counts.groupby('Neighborhood')
                      .size()                      # rows per neighborhood
                      .reset_index(name='count'))  # back to a flat two-column frame
And if you want to print the output without the numerical index:
print(df_totals.to_string(index=False))
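An equivalent alternative is value_counts on the column itself (a sketch; the column naming of value_counts has shifted across pandas versions, hence the explicit rename):
df_totals = (df_counts['Neighborhood']
             .value_counts()
             .rename_axis('Neighborhood')
             .reset_index(name='count'))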
