table_name = []
counter = 0
for year in ['2017', '2018', '2019']:
    table_name.append(f'temp_df_{year}')
    print(table_name[counter])
    table_name[counter] = pd.merge(table1, table2.loc[table2.loc[:, 'year'] == year, :],
                                   left_on='col1', right_on='col1', how='left')
    counter += 1
temp_df_2017
The print statement outputs are correct:
temp_df_2017
temp_df_2018
temp_df_2019
However, when I try to see what's in temp_df_2017, I get an error: NameError: name 'temp_df_2017' is not defined.
I would like to create those three tables. How can I make this work?
PS: the ['2017', '2018', '2019'] list will vary; it could just as well be a list of quarters. That's why I want to do this in a loop instead of writing the merge statement three times.
I think the easiest and most practical approach is to create a dictionary that maps names to DataFrames.
import pandas as pd
import numpy as np
# Create dummy data
data = np.arange(9).reshape(3,3)
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df
Out:
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
df_year_names = ['2017', '2018', '2019']
dict_of_dfs = {}
for year in df_year_names:
    df_name = f'some_name_year_{year}'
    dict_of_dfs[df_name] = df
dict_of_dfs.keys()
Out:
dict_keys(['some_name_year_2017', 'some_name_year_2018', 'some_name_year_2019'])
Then to access a particular year:
dict_of_dfs['some_name_year_2018']
Out:
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
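Applied to the merge loop from the question, a minimal sketch (assuming the table1 and table2 from the original snippet):

dict_of_dfs = {}
for year in ['2017', '2018', '2019']:
    # the key is the name you wanted; the value is the merged frame
    dict_of_dfs[f'temp_df_{year}'] = pd.merge(
        table1,
        table2.loc[table2['year'] == year, :],
        on='col1',
        how='left',
    )

dict_of_dfs['temp_df_2017']  # instead of the bare name temp_df_2017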
I was handling a large CSV file and came across this problem. I am reading the CSV in chunks and want to extract sub-dataframes based on the values of a particular column.
To explain the problem, here is a minimal version:
The CSV (save it as test1.csv, for example)
1,10
1,11
1,12
2,13
2,14
2,15
2,16
3,17
3,18
3,19
3,20
4,21
4,22
4,23
4,24
Now, as you can see, if I read the CSV in chunks of 5 rows, the rows for a given value in the first column end up distributed across chunks. What I want is to load into memory only the rows for one particular value.
I achieved it using the following:
import pandas as pd

list_of_ids = dict()  # this will contain all "id"s and the start and end row index for each id

# read the csv in chunks of 5 rows
for df_chunk in pd.read_csv('test1.csv', chunksize=5, names=['id', 'val'], iterator=True):
    #print(df_chunk)
    # in each chunk, get the unique id values and add them to the dictionary
    for i in df_chunk['id'].unique().tolist():
        if i not in list_of_ids:
            list_of_ids[i] = []  # new values do not have the start and end row index yet
    for i in list_of_ids.keys():  # ---------MARKER 1-----------
        idx = df_chunk[df_chunk['id'] == i].index  # get row index for this value of id
        if len(idx) != 0:  # if id is in this chunk
            if len(list_of_ids[i]) == 0:  # if the id is new in the final dictionary
                list_of_ids[i].append(idx.tolist()[0])   # start
                list_of_ids[i].append(idx.tolist()[-1])  # end
            else:  # if the id was there in a previous chunk
                list_of_ids[i] = [list_of_ids[i][0], idx.tolist()[-1]]  # keep old start, add new end
            #print(df_chunk.iloc[idx, :])
            #print(df_chunk.iloc[list_of_ids[i][0]:list_of_ids[i][-1], :])

print(list_of_ids)

skip = None
rows = None

# Now, from the file, read only a particular id group.
# The chunksize argument can again be used to read that group in pieces.
for id, se in list_of_ids.items():
    print('Data for id: {}'.format(id))
    skip, rows = se[0], (se[-1] - se[0] + 1)
    for df_chunk in pd.read_csv('test1.csv', chunksize=2, nrows=rows, skiprows=skip, names=['id', 'val'], iterator=True):
        print(df_chunk)
Truncated output from my code:
{1: [0, 2], 2: [3, 6], 3: [7, 10], 4: [11, 14]}
Data for id: 1
   id  val
0   1   10
1   1   11
   id  val
2   1   12
Data for id: 2
   id  val
0   2   13
1   2   14
   id  val
2   2   15
3   2   16
Data for id: 3
   id  val
0   3   17
1   3   18
What I want to ask is: is there a better way to do this? Look at MARKER 1 in the code: the inner loop over every known id is bound to become inefficient as the number of ids grows. I did save memory, but time remains a problem. Is there an existing method for this?
(I am looking for complete code in the answer.)
I suggest you use itertools for this, as follows:
import pandas as pd
import csv
import io
from itertools import groupby, islice
from operator import itemgetter


def chunker(n, iterable):
    """
    From answer: https://stackoverflow.com/a/31185097/4001592

    >>> list(chunker(3, 'ABCDEFG'))
    [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
    """
    iterable = iter(iterable)
    return iter(lambda: list(islice(iterable, n)), [])


chunk_size = 5
with open('test1.csv') as csv_file:
    reader = csv.reader(csv_file)
    for _, group in groupby(reader, itemgetter(0)):
        for chunk in chunker(chunk_size, group):
            g = [','.join(e) for e in chunk]
            df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
            print(df)
            print('---')
Output (partial)
   0   1
0  1  10
1  1  11
2  1  12
---
   0   1
0  2  13
1  2  14
2  2  15
3  2  16
---
   0   1
0  3  17
1  3  18
2  3  19
3  3  20
---
...
This approach first reads the file in groups keyed on column 1 (note that itertools.groupby only groups consecutive rows, so this assumes rows with the same id are adjacent in the file, as they are in test1.csv):
for _, group in groupby(reader, itemgetter(0)):
and each group is then read in chunks of 5 rows (this can be changed via chunk_size):
for chunk in chunker(chunk_size, group):
The last part:
g = [','.join(e) for e in chunk]
df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
print(df)
print('---')
rebuilds each chunk into a suitable CSV string to be passed to pandas.
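As an aside, the string round-trip can be skipped by building the frame straight from the parsed rows; a minimal sketch for the loop body, assuming both columns are numeric as in test1.csv:

df = pd.DataFrame(chunk, columns=['id', 'val']).astype(int)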
I have several datasets in JSON format; each file contains the details of a different match, but all share the same column names. I've isolated the shots taken by one team in a single match. How should I modify my code to collect only that team's shots across several matches?
import json

import pandas as pd

def key_pass(filename):
    with open(filename) as f:
        comp = json.load(f)
    eng = pd.json_normalize(comp)
    for team in eng['possession_team.name'].unique():
        if team != 'Belgium':
            opp = team
    eng = pd.json_normalize(comp).assign(Oppn=opp)
    eng_pan = eng[['shot.statsbomb_xg', 'minute', 'player.name', 'shot.outcome.name',
                   'shot.key_pass_id', 'location', 'type.name', 'play_pattern.name',
                   'possession_team.name']]
    eng_pan = eng_pan.rename(columns={'shot.statsbomb_xg': 'Statsbomb_xG',
                                      'shot.outcome.name': 'Outcome',
                                      'shot.key_pass_id': 'Keypass_id'})
    total_attempts = eng_pan.loc[(eng_pan['type.name'] == 'Shot') &
                                 (eng_pan['possession_team.name'] == 'Belgium')]
    total_attempts.reset_index(drop=True, inplace=True)
    return total_attempts
When I call the function,
total_attempts = key_pass('7584.json')
total_attempts
I get the expected table of Belgium's shots for that match.
Now, if I call the function on another file, I need the shots from that file to continue from where the previous file finished.
Should I pass the file names as a list and add a for loop in the function? But then how do I append the shots?
You can use the pandas DataFrame.append method if both DataFrames have the same structure (notice the ignore_index parameter):
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df
   A  B
0  1  2
1  3  4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df2
   A  B
0  5  6
1  7  8
df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
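Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat does the same job. A sketch applying it to your key_pass function (the second file name here is hypothetical):

files = ['7584.json', '7585.json']  # '7585.json' stands in for your next match file
total_attempts = pd.concat([key_pass(f) for f in files], ignore_index=True)

Because ignore_index=True renumbers the combined rows, the shots from the second file continue from where the first file finished.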
I am trying to preprocess a dataset to use for XGBoost by mapping the classes in each column to numerical values. A working example looks like this:
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df1 = pd.DataFrame(data = {'col1': ['A', 'B','C','B','A'], 'col2': ['Z', 'X','Z','Z','Y'], 'col3':['I','J','I','J','J']})
d = defaultdict(LabelEncoder)
encodedDF = df1.apply(lambda x: d[x.name].fit_transform(x))
inv = encodedDF.apply(lambda x: d[x.name].inverse_transform(x))
Where encodedDF gives the output:
   col1  col2  col3
0     0     2     0
1     1     0     1
2     2     2     0
3     1     2     1
4     0     1     1
And inv just reverts it back to the original dataframe. My issue is when null values get introduced:
df2 = pd.DataFrame(data = {'col1': ['A', 'B',None,'B','A'], 'col2': ['Z', 'X','Z',None,'Y'], 'col3':['I','J','I','J','J']})
encodedDF = df2.apply(lambda x: d[x.name].fit_transform(x))
Running the above will throw the error:
"TypeError: ('argument must be a string or number', 'occurred at index col1')"
Basically, I want to apply the encoding, but skip over the individual cell values that are null to get an output like this:
   col1  col2  col3
0     0     2     0
1     1     0     1
2   NaN     2     0
3     1   NaN     1
4     0     1     1
I can't use dropna() before applying the encoding, because then I lose data that I will later be imputing with XGBoost. I also can't use a conditional to skip null values (e.g. x.notnull() in the lambda), because fit_transform(x) takes a pandas Series as its argument, and none of the logical operators I could use in such a conditional do what I'm after. I'm not sure what else to try to get this to work. I hope what I'm trying to do makes sense; let me know if I need to clarify.
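One way to express the skip-the-nulls idea is to encode only the non-null cells and let pandas re-align the rest; a minimal sketch, assuming the d and df2 defined above:

encodedDF = df2.apply(
    lambda s: pd.Series(d[s.name].fit_transform(s.dropna()), index=s.dropna().index)
)

The missing positions come back as NaN because DataFrame.apply aligns the partial Series on the original index (the affected columns become floats as a result).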
I think I figured out a workaround. I probably should have been using sklearn's OneHotEncoder class from the beginning instead of the LabelEncoder/defaultdict combo; I'm brand new to all this. I replaced the NaNs with dummy values and then had the encoder drop those dummy categories when encoding the dataframe.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame(data = {'col1': ['A', 'B','C',None,'A'], 'col2': ['Z', 'X',None,'Z','Y'], 'col3':['I','J',None,'J','J'], 'col4':[45,67,None,32,94]})
replaceVals = {'col1':'missing','col2':'missing','col3':'missing','col4':-1}
df = df.fillna(value = replaceVals,axis=0)
drop = [['missing'],['missing'],['missing'],[-1]]
enc = OneHotEncoder(drop=drop)
encodeDF = enc.fit_transform(df)
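Note that fit_transform here returns a SciPy sparse matrix by default (pass sparse_output=False to the encoder in scikit-learn >= 1.2, or sparse=False in older releases, to get a plain array up front). A quick sketch to inspect the result:

print(encodeDF.toarray())           # dense 0/1 matrix
print(enc.get_feature_names_out())  # which column/category each output column encodes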
I am comparing specific columns from two different dataframes and counting whether the row subsets match or not.
Condition:
If any element of small['genes of cluster'] matches the corresponding big['genes of cluster'], that row should count as a match.
In the example below, only OR4F16 appears in both dataframes.
So the output should be: match: 1; unmatch: 3.
file1: big <tab separated>
cl nP genes of cluster
1 11 DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138C, FAM138F, FAM138A, OR4F5, LOC729737, LOC102725121, FAM138D
2 4 OR4F16, OR4F3, OR4F29, LOC100132287
3 64 LOC100133331, LOC100288069, FAM87B, LINC00115, LINC01128, FAM41C, LINC02593, SAMD11
4 7 GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC105378591, PRKCZ
file2: small <tab separated>
cl nP genes of cluster
1 11 A, B, C, D
2 4 OR4F16, X, Y, Z
My Code: Python3
def genes_coordinates(big, small):
    b = pd.read_csv(big, header=0, sep="\t")
    s = pd.read_csv(small, header=0, sep="\t")
    match = 0
    unmatch = 0
    for index, row in b.iterrows():
        if row[row['genes of cluster'].isin(s['genes of cluster'])]:
            match+1
        else:
            unmatch+1
    print("match: ", match, "\nunmatch: ", unmatch)

genes_coordinates('big', 'small')
I would go with pandas.merge() followed by a counting pass over the merged rows.
import pandas as pd
df1 = pd.DataFrame({'cl':[1,2], 'nP':[11,4], 'gene of cluster':[['A', 'B', 'C', 'D'], ['OR4F16', 'X', 'Y', 'Z']]})
df2 = pd.DataFrame({'cl':[1,2,3,4], 'nP':[11,4,64,7], 'gene of cluster':[['DDX11L1', 'MIR6859-3', 'WASH7P', 'MIR1302-2', 'FAM138C', 'FAM138F', 'FAM138A', 'OR4F5', 'LOC729737', 'LOC102725121', 'FAM138D'], ['OR4F16', 'OR4F3', 'OR4F29', 'LOC100132287'], ['LOC100133331', 'LOC100288069', 'FAM87B', 'LINC00115', 'LINC01128', 'FAM41C', 'LINC02593', 'SAMD11'], ['GNB1', 'CALML6', 'TMEM52', 'CFAP74', 'GABRD', 'LOC105378591', 'PRKCZ']]})
df_m = df1.merge(df2, on=['cl', 'nP'], how='outer')
>>> df_m
   cl  nP  gene of cluster_x                                  gene of cluster_y
0   1  11       [A, B, C, D]  [DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138...
1   2   4  [OR4F16, X, Y, Z]              [OR4F16, OR4F3, OR4F29, LOC100132287]
2   3  64                NaN  [LOC100133331, LOC100288069, FAM87B, LINC00115...
3   4   7                NaN  [GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC10537...
# An np.nan value is an outright 'unmatch'
found = []
for x in df_m.index:
    if isinstance(df_m.iloc[x]['gene of cluster_x'], float):
        found.append(0)
    else:
        if isinstance(df_m.iloc[x]['gene of cluster_y'], float):
            found.append(0)
        elif any([y in df_m.iloc[x]['gene of cluster_y'] for y in df_m.iloc[x]['gene of cluster_x']]):
            found.append(1)
        else:
            found.append(0)
# The counts
match = sum(found)
unmatch = len(found) - match
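For the example data above, this reproduces the counts from the question:

print("match:", match)      # match: 1
print("unmatch:", unmatch)  # unmatch: 3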
I have the dataframe below and would like to calculate, within a function, the difference between columns 'animal1' and 'animal2' divided by their sum, taking into account only the rows where the values of both 'animal1' and 'animal2' are greater than 0.
How could I do this?
import pandas as pd
animal1 = pd.Series({'Cat': 4, 'Dog': 0,'Mouse': 2, 'Cow': 0,'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3,'Mouse': 0, 'Cow': 1,'Chicken': 2})
data = pd.DataFrame({'animal1':animal1, 'animal2':animal2})
def animals():
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

print(data)
I believe you need to check that all values in a row are greater than 0 with DataFrame.gt, test the rows with DataFrame.all, and filter by boolean indexing:
def animals(data):
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

df = data[data.gt(0).all(axis=1)].copy()
# alternative for "not equal 0"
# df = data[data.ne(0).all(axis=1)].copy()

print(df)
         animal1  animal2
Cat            4        2
Chicken        3        2
print(animals(df))
Cat