Create Multiple New Columns From Multiple Dictionaries - python-3.x

I can successfully create a single new column on a DataFrame called df using a single dictionary, as follows:
Create Precursor DataFrame mye2_men:
In [13]: mye2_men = pd.read_csv("~/03_Maps_March_2020/mye2_men.csv",index_col="Code")
...: mye2_men.head()
Out[13]:
Name Geography1 All ages 0 1 2 3 4 5 6 7 8 9 ... 78 79 80 81 82 83 84 85 86 87 88 89 90
Code ...
K02000001 UNITED KINGDOM Country 32790202 382332 395273 408684 408882 412553 421934 434333 427809 419161 414994 ... 192839 186251 175626 160475 146314 132941 116050 103669 93155 81174 68110 55652 183486
K03000001 GREAT BRITAIN Country 31864002 370474 382933 395754 396181 399764 409061 420947 414613 406062 401647 ... 188073 181546 171350 156506 142851 129815 113306 101194 91038 79342 66699 54387 179629
K04000001 ENGLAND AND WALES Country 29215251 343642 355122 366722 366885 370156 379046 389944 382853 375940 370701 ... 172046 166392 157065 143896 131207 119193 104143 93055 83798 73224 61794 50297 167009
E92000001 ENGLAND Country 27667942 327309 338368 349229 349199 352148 360688 370995 363496 356965 351790 ... 161540 156343 147733 135514 123492 112133 98000 87528 79030 69067 58264 47498 157788
E12000001 NORTH EAST Region 1305486 13992 14423 15124 15159 15542 15839 16314 16283 16068 15748 ... 8130 8108 7601 6977 6118 5723 4958 4383 3889 3360 2747 2148 6822
[5 rows x 94 columns]
Create Target DataFrame df:
In [14]: df = pd.DataFrame({"A":[num for num in range(0,430)],
...: "B":[num**2 for num in range(0,430)],
...: "Code":mye2_men.index})
...: df.head()
Out[14]:
A B Code
0 0 0 K02000001
1 1 1 K03000001
2 2 4 K04000001
3 3 9 E92000001
4 4 16 E12000001
Create Dictionary To Use In Mapping:
In [15]: male_counts = mye2_men["All ages"].to_dict()
...: male_counts
Out[15]:
{'K02000001': 32790202,
'K03000001': 31864002,
'K04000001': 29215251,
'E92000001': 27667942,
'E12000001': 1305486,
'E06000047': 259299,
'E06000005': 51919,
'E06000001': 45524 ....}
Map the male_counts Dictionary to DataFrame df To Create New Column "male_count":
In [19]: # CREATE NEW male_count COLUMN IN df
...: df["male_count"] = df["Code"].map(male_counts)
...: df.head()
Out[19]:
A B Code male_count
0 0 0 K02000001 32790202
1 1 1 K03000001 31864002
2 2 4 K04000001 29215251
3 3 9 E92000001 27667942
4 4 16 E12000001 1305486
For the 2nd dictionary:
In [20]: female_counts = (mye2_men["All ages"]+10).to_dict()
...: female_counts
Out[20]:
{'K02000001': 32790212,
'K03000001': 31864012,
'K04000001': 29215261,
'E92000001': 27667952,
'E12000001': 1305496,
'E06000047': 259309,
'E06000005': 51929 ...}
I can successfully produce a second column, df["female_count"], by repeating the mapping step above, this time using the female_counts dictionary.
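That repeated step is just the following (a one-line sketch using the female_counts dictionary from above):
df["female_count"] = df["Code"].map(female_counts)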
How can I create multiple new df columns (i.e. df["male_count"] and df["female_count"]) in a single step?
Many thanks
Note:
The mye2_men data is from the "MYE2 - Males" tab of the following excel doc:
https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fpopulationandmigration%2fpopulationestimates%2fdatasets%2fpopulationestimatesforukenglandandwalesscotlandandnorthernireland%2fmid2019april2020localauthoritydistrictcodes/ukmidyearestimates20192020ladcodes.xls

Create a DataFrame from the dictionaries and then use DataFrame.join:
new = pd.DataFrame({'male_count': male_counts, 'female_count': female_counts})
df = df.join(new, on='Code')
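An alternative sketch, using the two dictionaries above: both columns can also be built in one assign() call by mapping each dictionary against the Code column, which gives the same result without the intermediate DataFrame:
# build both new columns in a single step by mapping each dictionary
df = df.assign(
    male_count=df["Code"].map(male_counts),
    female_count=df["Code"].map(female_counts),
)
df.head()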

Related

The output of my code comes out too slowly. How can I speed up my process?

Thanks to the help from some users of this site, my code seems to work fine, but it's taking too long.
I'm trying to compare two data frames (df1 has 1,291,250 rows / df2 has 1,286,692 rows).
If df1.iloc[0,0] == df2.iloc[0,0] and df1.iloc[0,1] == df2.iloc[0,1], then compare df1.iloc[0,2] and df2.iloc[0,2].
If the first (df1.iloc[0,2]) is larger, I want to put the first index into one list, and if the second (df2.iloc[0,2]) is larger, I want to put the second index into another list.
Example DataFrame
In [1]: df1 = pd.DataFrame([[0, 1, 98], [1, 1, 198], [2, 2, 228]], columns = ['A1', 'B1', 'C1'])
In [2]: df1
Out[3]:
A1 B1 C1
0 0 1 98
1 1 1 198
2 2 2 228
In [4]: df2 = pd.DataFrame([[0, 1, 228], [1, 2, 110], [2, 2, 130]], columns = ['A2', 'B2', 'C2'])
In [5]: df2
Out[6]:
A2 B2 C2
0 0 1 228
1 1 2 110
2 2 2 130
In [7]: find_high(df1, df2) # the function definition is shown below
Out[8]: ([2], [0]) # the result I want
This is just a simple example; my real data is bigger than this.
My code is:
import numpy as np
import pandas as pd
import parmap

# 'mod' is defined elsewhere in the original script
for i in range(60):
    setattr(mod, f'df_1_{i}', np.array_split(df1, 60)[i])
    getattr(mod, f'df_1_{i}').to_pickle(f'df_1_{i}')

import glob
files = glob.glob('df_1_*')

def find_high_pre(df1, df2=files):
    subtract_df2 = []
    subtract_df1 = []
    same_data = []
    for df1_idx, line in enumerate(df1.to_numpy()):
        for df2_idx, row in enumerate(df2.to_numpy()):
            if (line[0:2] == row[0:2]).all():
                if line[2] < row[2]:
                    subtract_df2.append(df2_idx)
                    break
                elif line[2] > row[2]:
                    subtract_df1.append(df1_idx)
                    break
            else:
                continue
            break
    return df1.iloc[subtract_df1].index.tolist(), df2.iloc[subtract_df2].index.tolist(), df1.iloc[same_data].index.to_list()

data_1 = []
for i in files:
    e_data = pd.read_pickle(i)
    num_cores = 30
    df_split = np.array_split(e_data, num_cores)
    data_1 += parmap.map(find_high_pre, df_split, pm_pbar=True, pm_processes=num_cores)
Chances are that replacing your nested for loops with a DataFrame.merge operation will take less time:
keys = ['A', 'B']
df1.columns = [*keys, 'C1']
df2.columns = [*keys, 'C2']
df = df1.reset_index().set_index(keys).merge(
    df2.reset_index().set_index(keys), on=keys)

# now we have a merged dataframe like this:
#      index_x   C1  index_y   C2
# A B
# 0 1        0   98        0  228
# 2 2        2  228        2  130

# therefrom we can easily extract the wanted indexes
data = [df.loc[df['C1'] > df['C2'], 'index_x'].values,
        df.loc[df['C1'] < df['C2'], 'index_y'].values]
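As a rough sketch, the same merge logic could be wrapped in a function shaped like the find_high call in the question (the function name and return shape follow the example output ([2], [0]); using set_axis to rename the columns is an assumption):
def find_high(df1, df2):
    # rename the key columns to shared names, merge, then compare C1 vs C2
    keys = ['A', 'B']
    left = df1.set_axis([*keys, 'C1'], axis=1).reset_index().set_index(keys)
    right = df2.set_axis([*keys, 'C2'], axis=1).reset_index().set_index(keys)
    merged = left.merge(right, on=keys)
    higher_in_df1 = merged.loc[merged['C1'] > merged['C2'], 'index_x'].tolist()
    higher_in_df2 = merged.loc[merged['C1'] < merged['C2'], 'index_y'].tolist()
    return higher_in_df1, higher_in_df2

find_high(df1, df2)  # ([2], [0]) with the example frames above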

Python - Pandas: perform column value based data grouping across separate dataframe chunks

I was handling a large csv file, and came across this problem. I am reading in the csv file in chunks and want to extract sub-dataframes based on values for a particular column.
To explain the problem, here is a minimal version:
The CSV (save it as test1.csv, for example)
1,10
1,11
1,12
2,13
2,14
2,15
2,16
3,17
3,18
3,19
3,20
4,21
4,22
4,23
4,24
Now, as you can see, if I read the csv in chunks of 5 rows, the first column's values will be spread across the chunks. What I want to be able to do is load into memory only the rows for a particular value of the first column.
I achieved it using the following:
import pandas as pd

list_of_ids = dict()  # this will contain all "id"s and the start and end row index for each id

# read the csv in chunks of 5 rows
for df_chunk in pd.read_csv('test1.csv', chunksize=5, names=['id','val'], iterator=True):
    # print(df_chunk)
    # In each chunk, get the unique id values and add to the list
    for i in df_chunk['id'].unique().tolist():
        if i not in list_of_ids:
            list_of_ids[i] = []  # initially new values do not have the start and end row index
    for i in list_of_ids.keys():  # ---------MARKER 1-----------
        idx = df_chunk[df_chunk['id'] == i].index  # get row index for particular value of id
        if len(idx) != 0:  # if id is in this chunk
            if len(list_of_ids[i]) == 0:  # if the id is new in the final dictionary
                list_of_ids[i].append(idx.tolist()[0])  # start
                list_of_ids[i].append(idx.tolist()[-1])  # end
            else:  # if the id was there in previous chunk
                list_of_ids[i] = [list_of_ids[i][0], idx.tolist()[-1]]  # keep old start, add new end
            # print(df_chunk.iloc[idx, :])
            # print(df_chunk.iloc[list_of_ids[i][0]:list_of_ids[i][-1], :])

print(list_of_ids)

skip = None
rows = None

# Now from the file, I will read only a particular id group using the following
# I can again use the chunksize argument to read the particular group in pieces
for id, se in list_of_ids.items():
    print('Data for id: {}'.format(id))
    skip, rows = se[0], (se[-1] - se[0] + 1)
    for df_chunk in pd.read_csv('test1.csv', chunksize=2, nrows=rows, skiprows=skip, names=['id','val'], iterator=True):
        print(df_chunk)
Truncated output from my code:
{1: [0, 2], 2: [3, 6], 3: [7, 10], 4: [11, 14]}
Data for id: 1
id val
0 1 10
1 1 11
id val
2 1 12
Data for id: 2
id val
0 2 13
1 2 14
id val
2 2 15
3 2 16
Data for id: 3
id val
0 3 17
1 3 18
What I want to ask is: do we have a better way of doing this? If you consider MARKER 1 in the code, it is bound to be inefficient as the size grows. I did save memory usage, but time still remains a problem. Do we have some existing method for this?
(I am looking for complete code in answer)
I suggest you use itertools for this, as follows:
import pandas as pd
import csv
import io
from itertools import groupby, islice
from operator import itemgetter

def chunker(n, iterable):
    """
    From answer: https://stackoverflow.com/a/31185097/4001592
    >>> list(chunker(3, 'ABCDEFG'))
    [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
    """
    iterable = iter(iterable)
    return iter(lambda: list(islice(iterable, n)), [])

chunk_size = 5
with open('test1.csv') as csv_file:
    reader = csv.reader(csv_file)
    for _, group in groupby(reader, itemgetter(0)):
        for chunk in chunker(chunk_size, group):
            g = [','.join(e) for e in chunk]
            df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
            print(df)
            print('---')
print('---')
Output (partial)
0 1
0 1 10
1 1 11
2 1 12
---
0 1
0 2 13
1 2 14
2 2 15
3 2 16
---
0 1
0 3 17
1 3 18
2 3 19
3 3 20
---
...
This approach first reads the rows in groups by column 1:
for _, group in groupby(reader, itemgetter(0)):
and each group will be read in chunks of 5 rows (this can be changed via chunk_size):
for chunk in chunker(chunk_size, group):
The last part:
g = [','.join(e) for e in chunk]
df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
print(df)
print('---')
builds a suitable CSV string to pass to pandas.
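If the goal is to keep each id's rows as its own DataFrame rather than printing chunk by chunk, a small variation of the same idea could look like this (a sketch reusing the imports above and assuming, as before, that rows for an id are contiguous in the file):
frames_by_id = {}
with open('test1.csv') as csv_file:
    reader = csv.reader(csv_file)
    # consecutive rows sharing the same first column form one group
    for key, group in groupby(reader, itemgetter(0)):
        frames_by_id[key] = pd.DataFrame(group, columns=['id', 'val']).astype(int)

frames_by_id['2']  # all rows whose id column is 2 (keys are strings, as read from the CSV)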

Structural Question Regarding pandas .drop method

df2=df.drop(df[df['issue']=="prob"].index)
df2.head()
The code immediately above works fine.
But why is there a need to type df[df[ rather than the version below?
df2=df.drop((df['issue']=="prob").index)
df2.head()
I know that the snippet immediately above won't work while the first one does. I would like to understand why, or know what exactly I should google.
Also, any advice on a more relevant title would be appreciated.
Thanks!
Option 1: df[df['issue']=="prob"] produces a DataFrame with a subset of values.
Option 2: df['issue']=="prob" produces a pandas.Series with a Boolean for every row.
.drop works with Option 1 because it receives only the indices of the selected rows; the Boolean Series from Option 2 carries the full DataFrame index, so passing its .index to .drop would drop every row.
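A tiny sketch of the difference, assuming a df with an 'issue' column as in the question:
mask = df['issue'] == "prob"     # Option 2: Boolean Series; mask.index is the full index of df
subset = df[mask]                # Option 1: DataFrame containing only the matching rows
df2 = df.drop(subset.index)      # drops only the rows where issue == "prob"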
I would use the following methods to remove rows.
Use ~ (not) to select the opposite of the Boolean selection:
df = df[~(df.treatment == 'Yes')]
Or select only the rows with the desired value:
df = df[(df.treatment == 'No')]
import pandas as pd
import numpy as np
import random
from datetime import datetime  # needed for datetime.today() below

# sample dataframe
np.random.seed(365)
random.seed(365)
rows = 25
data = {'a': np.random.randint(10, size=(rows)),
        'groups': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(rows)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(rows)],
        'date': pd.bdate_range(datetime.today(), freq='d', periods=rows).tolist()}
df = pd.DataFrame(data)
df[df.treatment == 'Yes'].index
Produces just the indices where treatment is 'Yes'; therefore df.drop(df[df.treatment == 'Yes'].index) drops only those rows.
df[df.treatment == 'Yes'].index
[out]:
Int64Index([0, 1, 2, 4, 6, 7, 8, 11, 12, 13, 14, 15, 19, 21], dtype='int64')
df.drop(df[df.treatment == 'Yes'].index)
[out]:
a groups treatment date
3 5 6-25 No 2020-08-15
5 2 500-1000 No 2020-08-17
9 0 500-1000 No 2020-08-21
10 3 100-500 No 2020-08-22
16 8 1-5 No 2020-08-28
17 4 1-5 No 2020-08-29
18 3 1-5 No 2020-08-30
20 6 500-1000 No 2020-09-01
22 6 6-25 No 2020-09-03
23 8 100-500 No 2020-09-04
24 9 26-100 No 2020-09-05
(df.treatment == 'Yes').index
Produces all of the indices (the Boolean Series keeps the full DataFrame index), therefore df.drop((df.treatment == 'Yes').index) drops every row, leaving an empty dataframe.
(df.treatment == 'Yes').index
[out]:
RangeIndex(start=0, stop=25, step=1)
df.drop((df.treatment == 'Yes').index)
[out]:
Empty DataFrame
Columns: [a, groups, treatment, date]
Index: []

How to append dataframes from different files that have the same structure?

I have different datasets in JSON format; each file contains details of a different match but has the same column names. I've isolated the 'Shots' taken by one team in a single match. How should I modify my code to take only the shots of that particular team across different matches?
import json
import pandas as pd

def key_pass(filename):
    with open(filename) as f:
        comp = json.load(f)
    eng = pd.json_normalize(comp)
    # find the opposition team name (the one that is not 'Belgium')
    for team in eng['possession_team.name'].unique():
        if team != 'Belgium':
            opp = team
    eng = pd.json_normalize(comp).assign(Oppn=opp)
    eng_pan = eng[['shot.statsbomb_xg','minute','player.name','shot.outcome.name','shot.key_pass_id','location','type.name','play_pattern.name','possession_team.name']]
    eng_pan = eng_pan.rename(columns={'shot.statsbomb_xg':'Statsbomb_xG','shot.outcome.name':'Outcome','shot.key_pass_id':'Keypass_id'})
    total_attempts = eng_pan.loc[(eng_pan['type.name'] == 'Shot') & (eng_pan['possession_team.name'] == 'Belgium')]
    total_attempts.reset_index(drop=True, inplace=True)
    return total_attempts
When I call the function,
total_attempts = key_pass('7584.json')
total_attempts
the output I get is the shots table for that single match.
Now, if I call the function on another file, I need the shots from that file to continue from where the previous file finished.
Should I pass the file names as a list and add a for loop in the function? But then how do I append the shots?
You can use the pandas DataFrame append method easily if both df's have the same structure (notice the ignore_index parameter):
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df
   A  B
0  1  2
1  3  4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df2
   A  B
0  5  6
1  7  8
df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
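Applying this to the original question, a rough sketch might loop over the match files and concatenate the per-match results. Note that recent pandas versions have removed DataFrame.append in favour of pd.concat; the second filename below is hypothetical:
import pandas as pd

match_files = ['7584.json', '7585.json']  # '7585.json' is a hypothetical second match file
frames = [key_pass(f) for f in match_files]  # one shots DataFrame per match
total_attempts = pd.concat(frames, ignore_index=True)  # row numbering continues across files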

Python Passing Dynamic Table Name in For Loop

table_name = []
counter = 0
for year in ['2017', '2018', '2019']:
    table_name.append(f'temp_df_{year}')
    print(table_name[counter])
    table_name[counter] = pd.merge(table1, table2.loc[table2.loc[:, 'year'] == year, :], left_on='col1', right_on='col1', how='left')
    counter += 1

temp_df_2017
The print statement outputs are correct:
temp_df_2017,
temp_df_2018,
temp_df_2019
However, when I try to see what's in temp_df_2017, I get an error: name 'temp_df_2017' is not defined
I would like to create those three tables. How can I make this work?
PS: ['2017', '2018', '2019'] list will vary. It can be a list of quarters. That's why I want to do this in a loop, instead of using the merge statement 3x.
I think the easiest/most practical approach would be to create a dictionary to store names/df.
import pandas as pd
import numpy as np
# Create dummy data
data = np.arange(9).reshape(3,3)
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df
Out:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df_year_names = ['2017', '2018', '2019']
dict_of_dfs = {}
for year in df_year_names:
    df_name = f'some_name_year_{year}'
    dict_of_dfs[df_name] = df
dict_of_dfs.keys()
Out:
dict_keys(['some_name_year_2017', 'some_name_year_2018', 'some_name_year_2019'])
Then to access a particular year:
dict_of_dfs['some_name_year_2018']
Out:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
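Applying the same idea to the merge loop from the question might look roughly like this (table1, table2, and the column names come from the question; the dictionary name is an assumption):
dfs_by_year = {}
for year in ['2017', '2018', '2019']:
    # key the merged result by the name you would have given the variable
    dfs_by_year[f'temp_df_{year}'] = pd.merge(
        table1,
        table2.loc[table2['year'] == year, :],
        on='col1',
        how='left',
    )

dfs_by_year['temp_df_2017'].head()  # access via the dictionary instead of a temp_df_2017 variable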
