Best python data structure to replace values in a column? - python-3.x

I am working with a dataframe where I need to replace values in 1 column. My natural instinct is to go towards a python dictionary HOWEVER, this is an example of what my data looks like (original_col):
original_col desired_col
cat animal
dog animal
bunny animal
cat animal
chair furniture
couch furniture
Bob person
Lisa person
A dictionary would look something like:
my_dict = {'animal': ['cat', 'dog', 'bunny'], 'furniture': ['chair', 'couch'], 'person': ['Bob', 'Lisa']}
I can't use the typical my_dict.get() since I am looking to retrieve the corresponding KEY rather than the value. Is a dictionary the best data structure? Any suggestions?

Flip your dictionary:
my_new_dict = {v: k for k, vals in my_dict.items() for v in vals}
Note: this will not work if the same value appears under more than one key (e.g. dog->animal and dog->person), because the later key silently overwrites the earlier one.
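A minimal sketch of the same flip with a check for that collision case, assuming the my_dict from the question. The helper name invert_mapping is hypothetical and not part of any library:

```python
def invert_mapping(groups):
    """Turn {category: [members]} into {member: category}.

    Raises ValueError if a member is listed under more than one category,
    because a plain dict can hold only one category per member.
    """
    inverted = {}
    for category, members in groups.items():
        for member in members:
            if member in inverted and inverted[member] != category:
                raise ValueError(
                    f"{member!r} is listed under both "
                    f"{inverted[member]!r} and {category!r}"
                )
            inverted[member] = category
    return inverted


my_dict = {
    'animal': ['cat', 'dog', 'bunny'],
    'furniture': ['chair', 'couch'],
    'person': ['Bob', 'Lisa'],
}
my_new_dict = invert_mapping(my_dict)
print(my_new_dict['dog'])  # animal
```

Failing loudly is one design choice; keeping a list of categories per member would be another if duplicates are legitimate in your data.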

DataFrame.replace already accepts a dictionary in a specific structure so you don't need to re-invent the wheel: {col_name: {old_value: new_value}}
df.replace({'original_col': {'cat': 'animal', 'dog': 'animal', 'bunny': 'animal',
'chair': 'furniture', 'couch': 'furniture',
'Bob': 'person', 'Lisa': 'person'}})
Alternatively you could use Series.replace, then only the inner dictionary is required:
df['original_col'].replace({'cat': 'animal', 'dog': 'animal', 'bunny': 'animal',
'chair': 'furniture', 'couch': 'furniture',
'Bob': 'person', 'Lisa': 'person'})
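One practical difference between replace and map is how they treat values that are missing from the dictionary. The sketch below uses a small made-up Series (the value 'lamp' is an assumption added for illustration) to show it. Note that both calls return a new Series rather than modifying the original:

```python
import pandas as pd

lookup = {'cat': 'animal', 'dog': 'animal', 'chair': 'furniture'}
s = pd.Series(['cat', 'chair', 'lamp'], name='original_col')

replaced = s.replace(lookup)  # values missing from the dict stay as they are
mapped = s.map(lookup)        # values missing from the dict become NaN

print(replaced.tolist())
print(mapped.tolist())
```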

If I understand correctly, the pandas Series.map() method accepts a dictionary or another pandas Series to perform this kind of lookup:
import pandas as pd
# original column / data
data = ['cat', 'dog', 'bunny', 'cat', 'chair', 'couch', 'Bob', 'Lisa']
# original dict
my_dict = {'animal': ['cat', 'dog', 'bunny'],
           'furniture': ['chair', 'couch'],
           'person': ['Bob', 'Lisa']
           }
# invert the dictionary
new_dict = {v: k
            for k, vs in my_dict.items()
            for v in vs}
# create series and use `map()` to perform dictionary lookup
df = pd.concat([
    pd.Series(data).rename('original_col'),
    pd.Series(data).map(new_dict).rename('desired_col')], axis=1)
print(df)
original_col desired_col
0 cat animal
1 dog animal
2 bunny animal
3 cat animal
4 chair furniture
5 couch furniture
6 Bob person
7 Lisa person

Related

How to get only different words from two pandas.DataFrame columns

I have a DataFrame with columns id, keywords1 and keywords2. I would like to get only words from column keywords2 that are not in the column keywords1. Also I need to clean my new column with different words from meaningless words like phph, wfgh... I'm only interested in English words.
Example:
data = [[1, 'detergent', 'detergent for cleaning stains'], [2, 'battery charger', 'wwfgh, old, glass'], [3, 'sunglasses, black, metal', 'glass gggg jik xxx,'], [4, 'chemicals, flammable', 'chemicals, phph']]
df = pd.DataFrame(data, columns = ['id', 'keywords1','keywords2'])
df
Try:
import numpy as np
# split on every single character that is not a word character or '+';
# adjacent separators such as ", " therefore leave empty strings in the result
df["keywords1"]=df["keywords1"].str.split(r"[^\w+]").map(set)
df["keywords2"]=df["keywords2"].str.split(r"[^\w+]").map(set)
df["keywords3"]=np.bitwise_and(np.bitwise_xor(df["keywords1"], df["keywords2"]), df["keywords2"])
#optional-if you wish to keep it as a string, and not set:
df["keywords3"]=df["keywords3"].str.join(", ")
Outputs:
id ... keywords3
0 1 ... cleaning, for, stains
1 2 ... , wwfgh, glass, old
2 3 ... jik, xxx, glass, gggg
3 4 ... phph
Let's try:
def words_diff(words1, words2):
    # each argument is a plain string taken from one row, so use str.split directly
    kw1 = words1.split()
    kw2 = words2.split()
    diff = [x for x in kw2 if x not in kw1]
    return diff
df['diff'] = df.apply(lambda x: words_diff(x['keywords1'], x['keywords2']), axis=1)
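Neither answer above handles the "only English words" part of the question. A minimal sketch of one way to do it, assuming you have some vocabulary set to check against. The function name words_only_in_second and the tiny english_words set are hypothetical stand-ins, not a real word list:

```python
import re


def words_only_in_second(text1, text2, vocabulary=None):
    """Return the words of text2 that do not occur in text1, in original order.

    If a vocabulary set is given, words outside it are dropped as well.
    """
    seen = set(re.findall(r"[A-Za-z]+", text1.lower()))
    result = []
    for word in re.findall(r"[A-Za-z]+", text2.lower()):
        if word in seen:
            continue
        if vocabulary is not None and word not in vocabulary:
            continue
        seen.add(word)  # also drops repeats within text2
        result.append(word)
    return result


# hypothetical vocabulary; a real word list would be much larger
english_words = {'detergent', 'for', 'cleaning', 'stains', 'old', 'glass'}

print(words_only_in_second('detergent', 'detergent for cleaning stains'))
# ['for', 'cleaning', 'stains']
print(words_only_in_second('battery charger', 'wwfgh, old, glass', english_words))
# ['old', 'glass']
```

The function works on plain strings, so it can be applied row by row with df.apply(..., axis=1) in the same way as words_diff.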

filter dataframe columns as you iterate through rows and create dictionary

I have the following table of data in a spreadsheet:
Name Description Value
foo foobar 5
baz foobaz 4
bar foofoo 8
I'm reading the spreadsheet and passing the data as a dataframe.
I need to transform this table of data to json following a specific schema.
I have the following script:
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row.to_dict())
which returns:
{'Name': 'bar', 'Description': 'foofoo', 'Value': '8'}
I want to be able to filter out a specific column. For example, to return this:
{'Name': 'bar', 'Description': 'foofoo'}
I know that I can print only the columns I want with print(row['Name'], row['Description']), but that returns only the values, and I also want the keys.
How can I do this?
I wrote this entire thing only to realize that @anky_91 had already suggested it. Oh well...
import pandas as pd
data = {
    "name": ["foo", "abc", "baz", "bar"],
    "description": ["foobar", "foofoo", "foobaz", "foofoo"],
    "value": [5, 3, 4, 8],
}
df = pd.DataFrame(data=data)
print(df, end='\n\n')
rec_dicts = df.loc[df["description"] == "foofoo", ["name", "description"]].to_dict(
    "records"
)
print(rec_dicts)
Output:
name description value
0 foo foobar 5
1 abc foofoo 3
2 baz foobaz 4
3 bar foofoo 8
[{'name': 'abc', 'description': 'foofoo'}, {'name': 'bar', 'description': 'foofoo'}]
After converting the row to a dictionary, you can delete the key you don't need with:
row_dict = row.to_dict()
del row_dict['Value']
Now the dictionary will have only Name and Description.
You can try this:
import io
import pandas as pd
s="""Name,Description,Value
foo,foobar,5
baz,foobaz,4
bar,foofoo,8
"""
df = pd.read_csv(io.StringIO(s))
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row[['Name', 'Description']].to_dict())
Result:
{'Name': 'bar', 'Description': 'foofoo'}

join strings within a list of lists by 4

Background
I have a list of lists as seen below
l = [['NAME',':', 'Mickey', 'Mouse', 'was', 'here', 'and', 'Micky', 'mouse', 'went', 'out'],
['Donal', 'duck', 'was','Date', 'of', 'Service', 'for', 'Donald', 'D', 'Duck', 'was', 'yesterday'],
['I', 'like','Pluto', 'the', 'carton','Dog', 'bc','he', 'is','fun']]
Goal
Join l by every 4 elements (when possible)
Problem
But sometimes 4 elements won't cleanly join as 4 as seen in my desired output
Desired Output
desired_l = [['NAME : Mickey Mouse', 'was here and Micky', 'mouse went out'],
['Donal duck was Date', 'of Service for Donald', 'D Duck was yesterday'],
['I like Pluto the', 'carton Dog bc he', 'is fun']]
Question
How do I achieve desired_l?
itertools has some nifty functions; the grouper recipe from its documentation does just this.
from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
# pad the last chunk with '' and let filter(None, ...) drop the padding before joining
[[' '.join(filter(None, x)) for x in list(grouper(sentence, 4, fillvalue=''))] for sentence in l]
Result:
[['NAME : Mickey Mouse', 'was here and Micky', 'mouse went out'],
['Donal duck was Date', 'of Service for Donald', 'D Duck was yesterday'],
['I like Pluto the', 'carton Dog bc he', 'is fun']]
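Because each inner list is already a list (not an arbitrary iterator), plain slicing is an alternative that needs no padding. A minimal sketch; the function name join_in_chunks is hypothetical:

```python
def join_in_chunks(words, size=4):
    """Join consecutive runs of `size` words; the last run may be shorter."""
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]


l = [['NAME', ':', 'Mickey', 'Mouse', 'was', 'here', 'and', 'Micky', 'mouse', 'went', 'out'],
     ['Donal', 'duck', 'was', 'Date', 'of', 'Service', 'for', 'Donald', 'D', 'Duck', 'was', 'yesterday'],
     ['I', 'like', 'Pluto', 'the', 'carton', 'Dog', 'bc', 'he', 'is', 'fun']]

result = [join_in_chunks(sentence) for sentence in l]
print(result)
```

Unlike the filter(None, ...) approach, this keeps empty strings that are genuinely part of the input.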

Pandas dataframe column names seem wrong

I'm a student and therefore a rookie. I'm trying to create a Pandas dataframe of crime statistics by neighborhood in San Francisco. My problem is that I want the column names to be simply "Neighborhood" and "Count". Instead I seem to be stuck with a separate line that says "('Neighborhood', 'count')" instead of the proper labels. Here's the code:
df_counts = df_incidents.copy()
df_counts.rename(columns={'PdDistrict':'Neighborhood'}, inplace=True)
df_counts.drop(['IncidntNum', 'Category', 'Descript', 'DayOfWeek', 'Date', 'Time', 'Location', 'Resolution', 'Address', 'X', 'Y', 'PdId'], axis=1, inplace=True)
df_totals=df_counts.groupby(['Neighborhood']).agg({'Neighborhood':['count']})
df_totals.columns = list(map(str, df_totals.columns)) # Not sure if I need this
df_totals
Output:
('Neighborhood', 'count')
Neighborhood
BAYVIEW 14303
CENTRAL 17666
INGLESIDE 11594
MISSION 19503
NORTHERN 20100
PARK 8699
RICHMOND 8922
SOUTHERN 28445
TARAVAL 11325
TENDERLOIN 9942
No need for agg() here, you can simply do:
df_totals = df_counts.groupby(['Neighborhood']).count()
df_totals.columns = ['count']
df_totals = df_totals.reset_index() # turn the Neighborhood index back into a regular column
And if you want to print the output without the numerical index:
print(df_totals.to_string(index=False))
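Another option is groupby(...).size(), which counts rows per group and does not depend on any other column being left in the frame. A minimal sketch on stand-in data; the six district values below are made up for illustration and are not the real San Francisco figures:

```python
import pandas as pd

# stand-in for the real incidents table (assumption: only PdDistrict matters here)
df_incidents = pd.DataFrame(
    {'PdDistrict': ['MISSION', 'PARK', 'MISSION', 'CENTRAL', 'MISSION', 'PARK']}
)

df_totals = (
    df_incidents.rename(columns={'PdDistrict': 'Neighborhood'})
    .groupby('Neighborhood')
    .size()                      # one row count per neighborhood
    .reset_index(name='Count')   # name the count column while flattening
)

print(list(df_totals.columns))   # ['Neighborhood', 'Count']
print(df_totals.to_string(index=False))
```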

Assign column contents to categories

I have a data frame with one column of sub-instances of a larger group, and want to categorize this into a smaller number of groups. How do I do this?
Consider the following sample data:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a':np.random.randn(60),
    'b':np.random.choice( [5,7,np.nan], 60),
    'c':np.random.choice( ['panda', 'elephant', 'python', 'anaconda', 'shark', 'clown fish'], 60),
    # some ways to create systematic groups for indexing or groupby
    'e':np.tile( range(20), 3 ),
    # a date range and set of random dates
    })
I would now like a new column in which, e.g., panda and elephant are categorized as mammals, etc.
The most intuitive would be to create a new series, create a dict and then remap according to it:
mapping_dict = {'panda': 'mammal', 'elephant': 'mammal', 'python': 'snake', 'anaconda': 'snake', 'shark': 'fish', 'clown fish': 'fish'}
c_Series = pd.Series(df['c']) # create new series
classified_c = c_Series.map(mapping_dict) # remap new series
if 'c_classified' not in df.columns: df.insert(3, 'c_classified', classified_c) # insert only if not in df already (so the code can be run multiple times)
I think you need map with fillna to replace the NaNs produced for values that have no match:
#borrowed dict from Ivo's answer
mapping_dict = {'panda': 'mammal', 'elephant': 'mammal',
'python': 'snake', 'anaconda': 'snake',
'shark': 'fish', 'clown fish': 'fish'}
df['d'] = df['c'].map(mapping_dict).fillna('not_matched')
Also, if changing the format of the dictionary is possible, you can generate the final dictionary by swapping keys with values:
d = {'mammal':['panda','elephant'],
'snake':['python','anaconda'],
'fish':['shark','clown fish']}
mapping_dict = {k: oldk for oldk, oldv in d.items() for k in oldv}
df['d'] = df['c'].map(mapping_dict).fillna('not_matched')
