I have a data frame with one column of sub-instances of a larger group, and I want to categorize it into a smaller number of groups. How do I do this?
Consider the following sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.random.randn(60),
    'b': np.random.choice([5, 7, np.nan], 60),
    'c': np.random.choice(['panda', 'elephant', 'python',
                           'anaconda', 'shark', 'clown fish'], 60),
    # a systematic group column for indexing or groupby
    'e': np.tile(range(20), 3),
})
I now want, in a new column, e.g. panda and elephant categorized as mammals, etc.
The most intuitive approach would be to create a new series, create a dict, and then remap according to it:
mapping_dict = {'panda': 'mammal', 'elephant': 'mammal', 'python': 'snake', 'anaconda': 'snake', 'shark': 'fish', 'clown fish': 'fish'}
c_series = pd.Series(df['c'])              # create a new series from the column
classified_c = c_series.map(mapping_dict)  # remap it according to the dict
# insert only if not already present (so the code can be run multiple times)
if 'c_classified' not in df.columns:
    df.insert(3, 'c_classified', classified_c)
I think you need map with fillna to replace the NaNs produced by non-matching values:
# borrowed dict from Ivo's answer
mapping_dict = {'panda': 'mammal', 'elephant': 'mammal',
                'python': 'snake', 'anaconda': 'snake',
                'shark': 'fish', 'clown fish': 'fish'}
df['d'] = df['c'].map(mapping_dict).fillna('not_matched')
Also, if changing the format of the dictionary is possible, you can generate the final dictionary by swapping keys with values:
d = {'mammal': ['panda', 'elephant'],
     'snake': ['python', 'anaconda'],
     'fish': ['shark', 'clown fish']}
mapping_dict = {k: oldk for oldk, oldv in d.items() for k in oldv}
df['d'] = df['c'].map(mapping_dict).fillna('not_matched')
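For reference, the inverted dictionary produced by the comprehension is:
{'panda': 'mammal', 'elephant': 'mammal',
 'python': 'snake', 'anaconda': 'snake',
 'shark': 'fish', 'clown fish': 'fish'}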
I have only been able to create a two-column data frame from a defaultdict (named output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
                         columns=['id', 'value'])
What I would like to be able to do, using this same basic format, is to initialize the dataframe with three columns: 'id', 'id2' and 'value'. I have a separately defined dict that contains the necessary lookup info, called id_lookup.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
                         columns=['id', 'id2', 'value'])
I think I'm doing it right, but I get key errors. I will only know in hindsight whether id_lookup is exhaustive for all possible encounters. For my purposes, simply putting it all together and placing 'N/A' or something in place of those errors will be acceptable.
Would the above be appropriate for calculating a new column of data using a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}
new_column = defaultdict(str)
# Loop through the df and populate the defaultdict
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        new_column[index] = 'N/A'
# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)
# Print the updated DataFrame
print(df)
which gives:
id value id2
0 1 10 A
1 2 20 B
2 3 30 C
3 4 40 N/A
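As an aside, a vectorized alternative that skips the loop entirely, reusing the same df and id_lookup as above (the map/fillna idiom shown earlier in this thread):
# map the lookup dict over the id column; ids missing from id_lookup
# become NaN, which fillna then replaces with 'N/A'
df['id2'] = df['id'].map(id_lookup).fillna('N/A')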
I have a dataframe containing 4 columns. I want to use 2 of the columns as keys for a dictionary of dictionaries, where the values inside are the remaining 2 columns (so a dataframe)
birdies = pd.DataFrame({'Habitat': ['Captive', 'Wild', 'Captive', 'Wild'],
                        'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                        'Max Speed': [380., 370., 24., 26.],
                        'Color': ['white', 'grey', 'green', 'blue']})
# this should output speed and color
birdies_dict["Falcon"]["Wild"]
# this should contain a dictionary whose keys are 'Captive', 'Wild'
birdies_dict["Falcon"]
I have found a way to generate a dictionary of dataframes with a single column as a key, but not with 2 columns as keys:
birdies_dict = {k:table for k,table in birdies.groupby("Animal")}
I suggest using defaultdict for this; a solution for the 2-column problem is:
from collections import defaultdict
d = defaultdict(dict)
# group by Animal first, then Habitat, to match the desired access order
for (ani, hab), _df in birdies.groupby(['Animal', 'Habitat']):
    d[ani][hab] = _df
This is hard-coded to a depth of 2; if you want a higher depth, you can just define a recursive defaultdict:
from collections import defaultdict
recursive_dict = lambda: defaultdict(recursive_dict)
dct = recursive_dict()
dct[1][2][3] = ...
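Applied to the birdies frame, a sketch of the same grouping with the recursive version (again keyed Animal first, then Habitat):
birdies_dict = recursive_dict()
for (ani, hab), _df in birdies.groupby(['Animal', 'Habitat']):
    birdies_dict[ani][hab] = _df
birdies_dict['Falcon']['Wild']  # the one-row DataFrame for the wild falcon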
Call to_dict() on the inner tables:
birdies_dict = {k: d.to_dict() for k, d in birdies.groupby('Animal')}
birdies_dict['Falcon']['Habitat']
Output:
{0: 'Captive', 1: 'Wild'}
Or do you mean:
out = birdies.set_index(['Animal','Habitat'])
out.loc[('Falcon','Captive')]
which gives:
Max Speed    380.0
Color        white
Name: (Falcon, Captive), dtype: object
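If you want a plain dict for a given pair instead of a Series, chain to_dict() onto the lookup:
out.loc[('Falcon', 'Captive')].to_dict()
# {'Max Speed': 380.0, 'Color': 'white'}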
IIUC (note the rows need to be filtered by habitat, otherwise every inner value would contain the whole per-animal table):
birdies_dict = {k: {habitat: table.loc[table['Habitat'] == habitat, ['Max Speed', 'Color']].to_numpy()
                    for habitat in table['Habitat'].to_numpy()}
                for k, table in birdies.groupby("Animal")}
OR
birdies_dict = {k: {habitat: table.loc[table['Habitat'] == habitat, ['Max Speed', 'Color']]
                    for habitat in table['Habitat'].to_numpy()}
                for k, table in birdies.groupby("Animal")}
# in this case the inner key will map to a dataframe
OR
birdies_dict = {k: {inner_key: inner_table for inner_key, inner_table in table.groupby('Habitat')}
                for k, table in birdies.groupby("Animal")}
I have the following dictionary of lists
d = {1: ['1','B1',['C1','C2','C3']], 2: ['2','B2','C15','D12'], 3: ['3','B3'], 4: ['4', 'B4', 'C4', ['D1', 'D2']]}
writing that to a csv using
import csv

headers = ['A', 'B', 'C', 'D']
with open('test.csv', "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(d.values())
gives me a csv that looks like
A B C D
1 B1 ['C1','C2','C3']
2 B2 C15 D12
3 B3
4 B4 C4 ['D1','D2']
If there is a multiple-item list in the value (a nested list?), I would like that list to be expanded down the column like this:
A B C D
1 B1 C1
1 C2
1 C3
2 B2 C15 D12
3 B3
4 B4 C4 D1
4 D2
I'm fairly new to python and can't seem to figure out a way to do what I need after a few days sifting through forums and banging my head on the wall. I think I may need to break apart the nested lists, but I need to keep them tied to their respective "A" value. Columns A and B will always have 1 entry, columns C and D can have 1 to X number of entries.
Any help is much appreciated
Seems like it might be easier to make a list of lists, with appropriately-located empty spaces, than what you're doing. Here's something that might do:
import csv
from itertools import zip_longest
def condense(dct):
    # get the maximum number of columns of any list
    num_cols = len(max(dct.values(), key=len)) - 1
    # Ignore the key, it's not really relevant.
    for _, v in dct.items():
        # first, memorize the index of this list,
        # since we need to repeat it no matter what
        idx = v[0]
        # next, use zip_longest to make a correspondence.
        # We will deliberately make a 2d list,
        # and we will later withdraw elements from it one by one.
        matrix = [([] if elem is None else
                   [elem] if not isinstance(elem, list) else
                   elem[:]  # soft copy to avoid altering the original dict
                   ) for elem, _ in zip_longest(v[1:], range(num_cols), fillvalue=None)
                  ]
        # Now, we output the top row of the matrix as long as it has contents.
        while any(matrix):
            # If a column in the matrix is empty, we put an empty string.
            # Otherwise, as we output a row, we also remove that row from
            # the matrix, progressively emptying it top-to-bottom.
            # *-notation is more convenient than concatenating these two lists.
            yield [idx, *((col.pop(0) if col else '') for col in matrix)]
        # e.g. for key 0 and a matrix that looks like this:
        # [['a1', 'a2'],
        #  ['b1'],
        #  ['c1', 'c2', 'c3']]
        # this would yield the following three lists before moving on:
        # ['0', 'a1', 'b1', 'c1']
        # ['0', 'a2', '',   'c2']
        # ['0', '',   '',   'c3']
        # where '' should parse into an empty column in the resulting CSV.
The biggest thing to note here is that I use isinstance(elem, list) as a shorthand to check whether the thing is a list (which you need to be able to do, one way or another, to flatten nested lists as we do here). If you have more complicated or more varied data structures, you'll need to improvise with this check - maybe write a helper function isiterable() that tries to iterate through its argument and returns a boolean based on whether doing so produced an error.
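One possible sketch of such a helper, assuming you want strings treated as scalar cells rather than as iterables of characters:
def isiterable(obj):
    # strings are iterable, but should count as single cells here
    if isinstance(obj, str):
        return False
    try:
        iter(obj)  # raises TypeError for non-iterables
        return True
    except TypeError:
        return False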
That done, we can call condense() on d and have the csv module deal with the output.
headers = ['A', 'B', 'C', 'D']
d = {1: ['1','B1',['C1','C2','C3']], 2: ['2','B2','C15','D12'], 3: ['3','B3'], 4: ['4', 'B4', 'C4', ['D1', 'D2']]}
# condense(d) produces
# [['1', 'B1', 'C1', '' ],
# ['1', '', 'C2', '' ],
# ['1', '', 'C3', '' ],
# ['2', 'B2', 'C15', 'D12'],
# ['3', 'B3', '', '' ],
# ['4', 'B4', 'C4', 'D1' ],
# ['4', '', '', 'D2' ]]
with open('test.csv', "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(condense(d))
Which produces the following file:
A,B,C,D
1,B1,C1,
1,,C2,
1,,C3,
2,B2,C15,D12
3,B3,,
4,B4,C4,D1
4,,,D2
This is equivalent to your expected output. Hopefully the solution is sufficiently extensible for you to apply it to your non-MCVE problem.
I'm a student and therefore a rookie. I'm trying to create a Pandas dataframe of crime statistics by neighborhood in San Francisco. My problem is that I want the column names to be simply "Neighborhood" and "Count". Instead I seem to be stuck with a separate line that says "('Neighborhood', 'count')" instead of the proper labels. Here's the code:
df_counts = df_incidents.copy()
df_counts.rename(columns={'PdDistrict':'Neighborhood'}, inplace=True)
df_counts.drop(['IncidntNum', 'Category', 'Descript', 'DayOfWeek', 'Date', 'Time', 'Location', 'Resolution', 'Address', 'X', 'Y', 'PdId'], axis=1, inplace=True)
df_totals = df_counts.groupby(['Neighborhood']).agg({'Neighborhood': ['count']})
df_totals.columns = list(map(str, df_totals.columns)) # Not sure if I need this
df_totals
Output:
('Neighborhood', 'count')
Neighborhood
BAYVIEW 14303
CENTRAL 17666
INGLESIDE 11594
MISSION 19503
NORTHERN 20100
PARK 8699
RICHMOND 8922
SOUTHERN 28445
TARAVAL 11325
TENDERLOIN 9942
No need for agg() here; since df_counts is down to the single 'Neighborhood' column after the drops, count the group sizes directly (this also gives flat column headers):
df_totals = df_counts.groupby('Neighborhood').size().reset_index(name='count')
And if you want to print the output without the numerical index:
print(df_totals.to_string(index=False))
I am trying to prune texts in a list based on texts in another list. The following function works fine when called directly on two lists
def remove_texts(texts, texts2):
    to_remove = []
    for i in texts2:
        if i in texts:
            to_remove.append(i)
    texts = [j for j in texts if j not in to_remove]
    return texts
However, the following does nothing and I get no errors
df_other.texts = df_other.texts.map(lambda x: remove_texts(x, df_other.to_remove_split))
Nor does the following. Again no error is returned
for i, row in df_other.iterrows():
    row['texts'] = remove_texts(row['texts'], row['to_remove_split'])
Any thoughts appreciated.
You actually want to find the set difference between texts and texts2. Assume that they contain:
texts = [ 'AAA', 'BBB', 'DDD', 'EEE', 'FFF', 'GGG', 'HHH' ]
texts2 = [ 'CCC', 'EEE' ]
Then the shortest solution is to compute just the set difference, without using Pandas:
set(texts).difference(texts2)
gives:
{'AAA', 'BBB', 'DDD', 'FFF', 'GGG', 'HHH'}
Or if you want just a list (not set), write:
sorted(set(texts).difference(texts2))
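which, for the sample lists above, gives:
['AAA', 'BBB', 'DDD', 'FFF', 'GGG', 'HHH']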
And if for some reason you want to use Pandas, then start by creating both DataFrames:
df = pd.DataFrame(texts, columns=['texts'])
df2 = pd.DataFrame(texts2, columns=['texts'])
Then you can compute the set difference as:
df.query('texts not in @df2.texts')
or
df.texts[~df.texts.isin(df2.texts)]
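If you do need the row-wise behaviour your original lambda was after (each row's texts pruned by that same row's to_remove_split), note that the lambda passed the whole to_remove_split column to remove_texts rather than the row's own value, so nothing ever matched. A minimal sketch with apply, assuming both columns hold lists and using hypothetical sample data:
df_other = pd.DataFrame({
    'texts': [['AAA', 'BBB', 'EEE'], ['CCC', 'DDD']],
    'to_remove_split': [['EEE'], ['CCC', 'XXX']],
})
# prune each row's texts against that same row's removal list
df_other['texts'] = df_other.apply(
    lambda row: [t for t in row['texts'] if t not in set(row['to_remove_split'])],
    axis=1)
# df_other['texts'] is now [['AAA', 'BBB'], ['DDD']]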