I'm a student and therefore a rookie. I'm trying to create a Pandas dataframe of crime statistics by neighborhood in San Francisco. My problem is that I want the column names to be simply "Neighborhood" and "Count". Instead I seem to be stuck with a separate line that says "('Neighborhood', 'count')" instead of the proper labels. Here's the code:
df_counts = df_incidents.copy()
df_counts.rename(columns={'PdDistrict':'Neighborhood'}, inplace=True)
df_counts.drop(['IncidntNum', 'Category', 'Descript', 'DayOfWeek', 'Date', 'Time', 'Location', 'Resolution', 'Address', 'X', 'Y', 'PdId'], axis=1, inplace=True)
df_totals=df_counts.groupby(['Neighborhood']).agg({'Neighborhood':['count']})
df_totals.columns = list(map(str, df_totals.columns)) # Not sure if I need this
df_totals
Output:
('Neighborhood', 'count')
Neighborhood
BAYVIEW 14303
CENTRAL 17666
INGLESIDE 11594
MISSION 19503
NORTHERN 20100
PARK 8699
RICHMOND 8922
SOUTHERN 28445
TARAVAL 11325
TENDERLOIN 9942
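No need for agg() here. Passing a list to agg() is what creates the two-level ('Neighborhood', 'count') column label. If Neighborhood is the only column left after the drop, count() has no other column to count, so use size() to count the rows in each group and name the result directly:
df_totals = df_counts.groupby('Neighborhood').size() # number of rows per neighborhood
df_totals = df_totals.reset_index(name='Count') # move Neighborhood from the index into a column and name the counts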
And if you want to print the output without the numerical index:
print(df_totals.to_string(index=False))
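For reference, here is a minimal self-contained sketch of the same idea on made-up data. The district names and their frequencies below are invented for illustration and are not taken from the real dataset. It shows two ways to end up with exactly two columns named Neighborhood and Count.

```python
import pandas as pd

# Toy stand-in for df_counts: one row per incident (values are invented).
df_counts = pd.DataFrame(
    {"Neighborhood": ["MISSION", "PARK", "MISSION", "SOUTHERN", "MISSION", "PARK"]}
)

# size() counts rows per group; reset_index(name=...) turns the result into
# an ordinary two-column frame and names the count column in one step.
df_totals = df_counts.groupby("Neighborhood").size().reset_index(name="Count")

# value_counts() gives the same numbers, ordered by frequency instead of by name.
df_by_freq = (
    df_counts["Neighborhood"]
    .value_counts()
    .rename_axis("Neighborhood")
    .reset_index(name="Count")
)

print(df_totals.to_string(index=False))
print(df_by_freq.to_string(index=False))
```

Both results are plain frames with a default integer index, so to_string(index=False) prints only the two labelled columns.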
I am working with a dataframe where I need to replace values in 1 column. My natural instinct is to go towards a python dictionary HOWEVER, this is an example of what my data looks like (original_col):
original_col desired_col
cat animal
dog animal
bunny animal
cat animal
chair furniture
couch furniture
Bob person
Lisa person
A dictionary would look something like:
my_dict = {'animal': ['cat', 'dog', 'bunny'], 'furniture': ['chair', 'couch'], 'person': ['Bob', 'Lisa']}
I can't use the typical my_dict.get() since I am looking to retrieve corresponding KEY rather than the value. Is dictionary the best data structure? Any suggestions?
Flip your dictionary:
my_new_dict = {v: k for k, vals in my_dict.items() for v in vals}
Note that this will not work if one value belongs to several keys, e.g. dog->animal and dog->person: the later key silently overwrites the earlier one.
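If the duplicate-value caveat is a concern, the inversion can check for it explicitly. The following is only a sketch: the helper name invert_groups is hypothetical (it is not part of pandas or the standard library), and raising an error is just one possible way to handle an ambiguous value.

```python
def invert_groups(groups):
    """Turn {group: [members]} into {member: group}, refusing ambiguous members."""
    inverted = {}
    for group, members in groups.items():
        for member in members:
            # A member already filed under a different group is ambiguous.
            if member in inverted and inverted[member] != group:
                raise ValueError(
                    f"{member!r} appears under both {inverted[member]!r} and {group!r}"
                )
            inverted[member] = group
    return inverted


my_dict = {
    "animal": ["cat", "dog", "bunny"],
    "furniture": ["chair", "couch"],
    "person": ["Bob", "Lisa"],
}
lookup = invert_groups(my_dict)
print(lookup["couch"])  # furniture
```

With the sample dictionary every value is unique, so the result is the same as the one-line comprehension.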
DataFrame.replace already accepts a dictionary in a specific structure so you don't need to re-invent the wheel: {col_name: {old_value: new_value}}
df.replace({'original_col': {'cat': 'animal', 'dog': 'animal', 'bunny': 'animal',
'chair': 'furniture', 'couch': 'furniture',
'Bob': 'person', 'Lisa': 'person'}})
Alternatively you could use Series.replace, then only the inner dictionary is required:
df['original_col'].replace({'cat': 'animal', 'dog': 'animal', 'bunny': 'animal',
'chair': 'furniture', 'couch': 'furniture',
'Bob': 'person', 'Lisa': 'person'})
The pandas map() function uses a dictionary or another pandas Series to perform this kind of lookup, IIUC:
import pandas as pd
# original column / data
data = ['cat', 'dog', 'bunny', 'cat', 'chair', 'couch', 'Bob', 'Lisa']
# original dict
my_dict = {'animal': ['cat', 'dog', 'bunny'],
           'furniture': ['chair', 'couch'],
           'person': ['Bob', 'Lisa']
          }
# invert the dictionary
new_dict = {v: k
            for k, vs in my_dict.items()
            for v in vs}
# create series and use `map()` to perform dictionary lookup
df = pd.concat([
    pd.Series(data).rename('original_col'),
    pd.Series(data).map(new_dict).rename('desired_col')], axis=1)
print(df)
original_col desired_col
0 cat animal
1 dog animal
2 bunny animal
3 cat animal
4 chair furniture
5 couch furniture
6 Bob person
7 Lisa person
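One practical difference between map() and replace() is what happens to values that have no entry in the dictionary. A small sketch, using an invented value "lamp" that is deliberately missing from the lookup:

```python
import pandas as pd

lookup = {"cat": "animal", "chair": "furniture"}
s = pd.Series(["cat", "chair", "lamp"], name="original_col")  # "lamp" has no mapping

mapped = s.map(lookup)        # values without a mapping become NaN
replaced = s.replace(lookup)  # values without a mapping are left unchanged

print(mapped.tolist())
print(replaced.tolist())
```

So map() makes unmatched values easy to spot, while replace() is the safer choice when unmatched values should pass through untouched.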
Name = [list(['Amy', 'A', 'Angu']),
list(['Jon', 'Johnson']),
list(['Bob', 'Barker'])]
Other = [list(['Amy', 'Any', 'Anguish']),
list(['Jon', 'Jan']),
list(['Baker', 'barker'])]
import pandas as pd
df = pd.DataFrame({'Other' : Other,
'ID': ['E123','E456','E789'],
'Other_ID': ['A123','A456','A789'],
'Name' : Name,
})
ID Name Other Other_ID
0 E123 [Amy, A, Angu] [Amy, Any, Anguish] A123
1 E456 [Jon, Johnson] [Jon, Jan] A456
2 E789 [Bob, Barker] [Baker, barker] A789
I have the df shown above. I want to turn the columns ID, Name and Other into a dictionary with the key being ID. I tried this, following the question "python pandas dataframe columns convert to dict key and value":
todict = dict(zip(df.ID, df.Name))
Which is close to what I want
{'E123': ['Amy', 'A', 'Angu'],
'E456': ['Jon', 'Johnson'],
'E789': ['Bob', 'Barker']}
But I would like to get this output that includes values from Other column
{'E123': ['Amy', 'A', 'Angu','Amy', 'Any','Anguish'],
'E456': ['Jon', 'Johnson','Jon','Jan'],
'E789': ['Bob', 'Barker','Baker','barker']
}
But if I add the third column, Other, it gives me an error:
todict = dict(zip(df.ID, df.Name, df.Other))
How do I get the output I want?
Why not just combine the Name and Other columns before creating the dict? Adding two columns of lists concatenates the lists row by row (note that this overwrites the Name column):
df['Name'] = df['Name'] + df['Other']
dict(zip(df.ID, df.Name))
Gives
{'E123': ['Amy', 'A', 'Angu', 'Amy', 'Any', 'Anguish'],
'E456': ['Jon', 'Johnson', 'Jon', 'Jan'],
'E789': ['Bob', 'Barker', 'Baker', 'barker']}
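If the Name column should stay untouched, the same dictionary can be built in one pass. This is a sketch of an alternative, not the only way to do it. It also shows why dict(zip(df.ID, df.Name, df.Other)) fails: zip() then yields three-element tuples, and dict() only accepts key/value pairs.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["E123", "E456", "E789"],
    "Name": [["Amy", "A", "Angu"], ["Jon", "Johnson"], ["Bob", "Barker"]],
    "Other": [["Amy", "Any", "Anguish"], ["Jon", "Jan"], ["Baker", "barker"]],
})

# zip() yields (ID, Name, Other) triples; the two lists are concatenated
# explicitly while building each entry, so df itself is not modified.
todict = {
    id_: name + other
    for id_, name, other in zip(df["ID"], df["Name"], df["Other"])
}

print(todict["E456"])  # ['Jon', 'Johnson', 'Jon', 'Jan']
```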
I am using python-docx to extract two tables from a document.
I have iterated over the tables and created a list of lists. Each individual list represents a table, and within that I have dictionaries per row. Each dictionary contains a key / value pair. The key is the column heading from the table and value is the cell contents for that row's data for that column.
I am facing difficulty when creating a data frame for each table and writing each table to a separate Excel sheet.
from docx.api import Document
import pandas as pd
import csv
import json
import unicodedata
document = Document('Sampletable1.docx')
tables = document.tables
print (len(tables))
big_data = []
for table in document.tables:
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        # the first row holds the column headings
        if i == 0:
            keys = tuple(text)
            continue
        # every other row becomes a {heading: cell text} dictionary
        dic = dict(zip(keys, text))
        data.append(dic)
    big_data.append(data)
print(big_data)
The output of the above code is:
2
[[{'Asset': 'Growth investments', 'Target investment mix': '66.50%', 'Actual investment mix': '66.30%', 'Variance': '-0.20%'}, {'Asset': 'Defensive investments', 'Target investment mix': '33.50%', 'Actual investment mix': '33.70%', 'Variance': '0.20%'}], [{'Owner': 'REST Super', 'Product': 'Superannuation', 'Type': 'Existing', 'Status': 'Existing', 'Customer 2': 'Customer 1'}, {'Owner': 'TWUSUPER TransPension', 'Product': 'TTR Pension', 'Type': 'New', 'Status': 'New', 'Customer 2': 'Customer 1'}, {'Owner': 'TWUSUPER', 'Product': 'Superannuation', 'Type': 'Existing', 'Status': 'Existing'}]]
How do I access the above lists??
Further I tried to create a pandas data frame
# write the data into a data frame
for thing in big_data:
    #print(thing)
    df = pd.DataFrame(thing)
    print(df)
    writer = pd.ExcelWriter('dftable3.xlsx', engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Sheet1')
    writer.save()
I got one table into the Excel file but am unable to write the second one.
I am expecting both tables to be in the same Excel workbook (dftable3.xlsx) but in different worksheets (Sheet1, Sheet2).
I have attached the images of the tables.
Thanks in advance
How do I access the above lists??
You already did, by iterating over them, or printing them.
Consider using the pretty-print library:
import pprint
pprint.pprint(big_data)
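To pick out individual pieces, ordinary list indexing and dictionary lookups are enough. A minimal sketch on a shortened, hand-copied subset of the printed data (only some keys are kept, to keep the example short):

```python
import pandas as pd

# Shortened copy of the printed structure: a list of tables,
# each table being a list of row dictionaries.
big_data = [
    [
        {"Asset": "Growth investments", "Variance": "-0.20%"},
        {"Asset": "Defensive investments", "Variance": "0.20%"},
    ],
    [
        {"Owner": "REST Super", "Status": "Existing", "Customer 2": "Customer 1"},
        {"Owner": "TWUSUPER", "Status": "Existing"},  # no 'Customer 2' key
    ],
]

second_table = big_data[1]    # index 1 selects the second table
last_row = second_table[-1]   # last row dictionary of that table
print(last_row["Owner"])      # TWUSUPER

# A list of dicts converts directly to a DataFrame;
# a key missing from a row shows up as NaN in that row.
df_second = pd.DataFrame(second_table)
print(df_second)
```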
I am expecting ... different worksheets(Sheet1,Sheet2)
Well, that's unlikely, given the constant 'Sheet1' argument you supplied.
Here is one way to accomplish that:
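writer = pd.ExcelWriter('dftable3.xlsx', engine='xlsxwriter')
for i, thing in enumerate(big_data, start=1): # start=1 gives Sheet1, Sheet2, ...
    df = pd.DataFrame(thing)
    df.to_excel(writer, sheet_name=f'Sheet{i}')
writer.close() # writes the workbook to disk; writer.save() was removed in pandas 2.0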
Note the scope of writer -- it must be longer lived than each of the constituent dfs.
I have a data frame with one column of sub-instances of a larger group, and want to categorize this into a smaller number of groups. How do I do this?
Consider the following sample data:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': np.random.randn(60),
    'b': np.random.choice([5, 7, np.nan], 60),
    'c': np.random.choice(['panda', 'elephant', 'python', 'anaconda', 'shark', 'clown fish'], 60),
    # some ways to create systematic groups for indexing or groupby
    'e': np.tile(range(20), 3),
    # a date range and set of random dates
})
I would now like a new column in which, e.g., panda and elephant are categorized as mammals, etc.
The most intuitive would be to create a new series, create a dict and then remap according to it:
mapping_dict = {'panda': 'mammal', 'elephant': 'mammal', 'python': 'snake', 'anaconda': 'snake', 'shark': 'fish', 'clown fish': 'fish'}
c_Series = pd.Series(df['c']) # create new series
classified_c = c_Series.map(mapping_dict) # remap new series
# insert only if not in df already (so the code can be run multiple times)
if 'c_classified' not in df.columns:
    df.insert(3, 'c_classified', classified_c)
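The existence check is needed because df.insert refuses to add a column name that is already present. If the column position does not matter, plain assignment avoids the check. A small sketch on invented sample rows:

```python
import pandas as pd

mapping_dict = {"panda": "mammal", "python": "snake", "shark": "fish"}
df = pd.DataFrame({"c": ["panda", "shark", "python", "panda"]})

# Plain assignment creates the column the first time and overwrites it
# afterwards, so running it twice is harmless.
for _ in range(2):
    df["c_classified"] = df["c"].map(mapping_dict)

# df.insert, by contrast, raises ValueError for an existing column name.
try:
    df.insert(1, "c_classified", df["c"].map(mapping_dict))
    insert_failed = False
except ValueError:
    insert_failed = True

print(df)
print(insert_failed)
```

Assignment always appends a new column at the end, so df.insert is still the tool to use when the column has to land at a specific position.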
I think you need map with fillna to replace the NaNs produced for values that have no match:
#borrowed dict from Ivo's answer
mapping_dict = {'panda': 'mammal', 'elephant': 'mammal',
'python': 'snake', 'anaconda': 'snake',
'shark': 'fish', 'clown fish': 'fish'}
df['d'] = df['c'].map(mapping_dict).fillna('not_matched')
Also, if you can change the format of the dictionary, you can generate the final dictionary by swapping keys and values:
d = {'mammal':['panda','elephant'],
'snake':['python','anaconda'],
'fish':['shark','clown fish']}
mapping_dict = {k: oldk for oldk, oldv in d.items() for k in oldv}
df['d'] = df['c'].map(mapping_dict).fillna('not_matched')
Dataframe:
>type(df)
pandas.core.frame.DataFrame
> df
ID       Property Type                               Amenities
1952043  Apartment, Villa, Apartment                 Park, Jogging Track, Park
1918916  Bungalow, Cottage House, Cottage, Bungalow  Garden, Play Ground
How can I keep only the unique comma-separated items in each row of the dataframe? "Cottage House" and "Cottage" must not be treated as the same item, and the check must apply to all columns of the dataframe. My desired output looks like this:
Desired Output :
ID       Property Type                      Amenities
1952043  Apartment, Villa                   Park, Jogging Track
1918916  Bungalow, Cottage House, Cottage   Garden, Play Ground
First, I create a function that does what you want for a given string. Secondly, I apply this function to all strings in the column.
import pandas as pd
df = pd.DataFrame([['Apartment, Villa, Apartment',
                    'Park, Jogging Track, Park'],
                   ['Bungalow, Cottage House, Cottage, Bungalow',
                    'Garden, Play Ground']],
                  columns=['Property Type', 'Amenities'])
def drop_duplicates(row):
    # Split string by ', ', drop duplicates and join back.
    # dict.fromkeys keeps the first occurrence of each item in its original
    # order (np.unique would sort the items alphabetically instead).
    words = row.split(', ')
    return ', '.join(dict.fromkeys(words))
# drop_duplicates is applied to every cell of the two columns.
df['Property Type'] = df['Property Type'].apply(drop_duplicates)
df['Amenities'] = df['Amenities'].apply(drop_duplicates)
print(df)
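Since the question asks for the check to run on all columns, the same idea can be looped over every text column instead of naming each one. A sketch under the assumption that the ID column is numeric and should be skipped. The helper name unique_items is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1952043, 1918916],
    "Property Type": ["Apartment, Villa, Apartment",
                      "Bungalow, Cottage House, Cottage, Bungalow"],
    "Amenities": ["Park, Jogging Track, Park", "Garden, Play Ground"],
})


def unique_items(cell):
    """Drop repeated comma-separated items, keeping first-seen order."""
    items = [item.strip() for item in cell.split(",")]
    return ", ".join(dict.fromkeys(items))


# Treat a column as text only if every value in it is a str (ID is skipped).
text_cols = [
    col for col in df.columns
    if df[col].map(lambda value: isinstance(value, str)).all()
]
for col in text_cols:
    df[col] = df[col].map(unique_items)

print(df.to_string(index=False))
```

Because whole items are compared, "Cottage House" and "Cottage" stay separate entries.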
Read the file into pandas DataFrame
>>> import pandas as pd
>>> df = pd.read_csv('test.txt', sep='\t')
>>> proptype_column = df['Property Type'] # the target column
The main idea is to:
1. iterate through every row,
2. split the string in the target column by ', ',
3. return the unique set() of the list from step 2.
Code:
>>> for row in proptype_column: # Step 1.
... items_in_row = row.split(', ') # Step 2.
... uniq_items_in_row = set(items_in_row) # Step 3.
... print(uniq_items_in_row)
...
set(['Apartment', 'Villa'])
set(['Cottage', 'Bungalow', 'Cottage House'])
Now you can achieve the same with the Series.apply() function:
>>> import pandas as pd
>>> df = pd.read_csv('test.txt', sep='\t')
>>> df['Property Type'].apply(lambda cell: set([c.strip() for c in cell.split(',')]))
0 {Apartment, Villa}
1 {Cottage, Bungalow, Cottage House}
Name: Property Type, dtype: object
>>> proptype_uniq = df['Property Type'].apply(lambda cell: set(cell.split(', ')))
>>> df['Property Type (Unique)'] = proptype_uniq
>>> df
ID Property Type \
0 12345 Apartment, Villa, Apartment
1 67890 Bungalow, Cottage House, Cottage, Bungalow
Amenities Property Type (Unique)
0 Park, Jogging Track, Park {Apartment, Villa}
1 Garden, Play Ground {Cottage, Bungalow, Cottage House}