Calling a data frame via a string - python-3.x

I have a list of countries such as:
country = ["Brazil", "Chile", "Colombia", "Mexico", "Panama", "Peru", "Venezuela"]
I created data frames using the names from the country list:
for c in country:
    c = pd.read_excel(str(c + ".xls"), skiprows = 1)
    c = pd.to_datetime(c.Date, infer_datetime_format=True)
    c = c[["Date", "spreads"]]
Now I want to merge all the country data frames using the Date column as the key. The idea is to create a loop like the following:
df = Brazil #this is the first dataframe, which also corresponds to the first element of the list country.
for i in range(len(country)-1):
    df = df.merge(country[i+1], on = "Date", how = "inner")
df.set_index("Date", inplace=True)
I got the error ValueError: can not merge DataFrame with instance of type <class 'str'>. It seems Python is not resolving the names in the country list to data frames. How can I refer to those data frames starting from the country list?
Thanks masters!

Your loop doesn't modify the contents of the country list, so country is still a list of strings.
Consider building a new list of dataframes and looping over that:
country_dfs = []
for c in country:
    df = pd.read_excel(c + ".xls", skiprows=1)
    df["Date"] = pd.to_datetime(df["Date"], infer_datetime_format=True)
    df = df[["Date", "spreads"]]
    # add the new dataframe to our list of dataframes
    country_dfs.append(df)
then to merge,
merged_df = country_dfs[0]
for df in country_dfs[1:]:
    merged_df = merged_df.merge(df, on='Date', how='inner')
merged_df.set_index('Date', inplace=True)
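If you'd rather not write the loop by hand, functools.reduce expresses the same chain of inner merges; a minimal sketch, assuming country_dfs is the list built above:
from functools import reduce

# successively inner-merge every dataframe in the list on Date
merged_df = reduce(lambda left, right: left.merge(right, on="Date", how="inner"), country_dfs)
merged_df.set_index("Date", inplace=True)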

Related

Adding new columns in a dataframe gives length mismatch error

From a csv file (initial.csv):
"Id","Name"
1,"CLO"
2,"FEV"
2,"GEN"
3,"HYP"
4,"DIA"
1,"COL"
1,"EOS"
4,"GAS"
1,"AEK"
I am grouping by the Id column and aggregating the Name column values so that each unique Id has all the Name values appended on the same row (new.csv):
"Id","Name"
1,"CLO","COL","EOS","AEK"
2,"FEV","GEN"
3,"HYP"
4,"DIA","GAS"
Now some rows have extra Name values, for which I want to append corresponding columns according to the maximum count of Name values that exist on any row, i.e.
"Id","Name","Name2","Name3","Name4"
1,"CLO","COL","EOS","AEK"
2,"FEV","GEN"
3,"HYP"
4,"DIA","GAS"
I do not understand how I can add new columns to the dataframe to match the data.
Below is my code:
import pandas as pd
df = pd.read_csv('initial.csv', delimiter=',')
unique_ids_list = df['Id'].unique()
max_names_count = 0
for id in unique_ids_list:
    mask = df['Id'] == id
    names_count = len(df[mask])
    if names_count > max_names_count:
        max_names_count = names_count
group_by_id = df.groupby(["Id"]).agg({"Name": ','.join})
# Create new columns 'Id', 'Name', 'Name2', 'Name3', 'Name4'
new_column_names = ["Id", "Name"] + ['Name' + str(i) for i in range(2, max_names_count+1)]
group_by_id.columns = new_column_names # <-- ValueError: Length mismatch: Expected axis has 1 elements, new values have 5 elements
group_by_id.to_csv('new.csv', encoding='utf-8')
Try:
df = pd.read_csv("initial.csv")
df_out = (
    df.groupby("Id")["Name"]
    .agg(list)
    .to_frame()["Name"]
    .apply(pd.Series)
    .rename(columns=lambda x: "Name" if x == 0 else "Name{}".format(x + 1))
    .reset_index()
)
df_out.to_csv("out.csv", index=False)
Creates out.csv:
Id,Name,Name2,Name3,Name4
1,CLO,COL,EOS,AEK
2,FEV,GEN,,
3,HYP,,,
4,DIA,GAS,,
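As an aside, the .apply(pd.Series) step can be slow on large inputs. A sketch of an equivalent construction (my own variant, not part of the answer above) that builds the padded frame directly:
s = df.groupby("Id")["Name"].agg(list)
df_out = pd.DataFrame(s.tolist(), index=s.index)  # pads ragged rows with NaN
df_out.columns = ["Name" if i == 0 else "Name{}".format(i + 1) for i in df_out.columns]
df_out = df_out.reset_index()
df_out.to_csv("out.csv", index=False)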

Iteratively append new data into pandas dataframe column and join with another dataframe

I have been extracting data from many APIs, and I would like to add a common id column across all of them. I have tried the below:
df = pd.DataFrame()
for i in range(1, 200):
    url = '{id}/values'.format(id=i)
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            df['id'] = i
            test = pd.json_normalize(data[parent][child])
            df = df.append(test, ignore_index=True)
But in the dataframe's id column I'm getting only the last iterated id, and for APIs that return many rows I'm getting invalid data.
For performance reasons it is better to first store the data in a dictionary and then create the dataframe from that dictionary once:
import pandas as pd
from collections import defaultdict
d = defaultdict(list)
for i in range(1, 200):
    # simulate the dataframe retrieved from a pd.json_normalize() call
    row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
    for k, v in row.to_dict().items():
        d[k].append(v[0])
df = pd.DataFrame(d)
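For completeness, a sketch of the pattern closest to the original loop: collect the normalized frames in a list, tag each chunk with its id, and concatenate once at the end. Note that df['id'] = i in the original loop overwrites the entire id column on every iteration, which is why only the last id survives. Here url, headers, parent, and child are the question's own placeholders and are left undefined:
import requests
import pandas as pd

frames = []
for i in range(1, 200):
    res = requests.get('{id}/values'.format(id=i), headers=headers)
    if res.status_code == 200:
        data = res.json()
        if data['success']:
            part = pd.json_normalize(data[parent][child])
            part['id'] = i  # repeat the id on every row of this chunk
            frames.append(part)
df = pd.concat(frames, ignore_index=True)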

how to traverse a pandas dataframe to form a nested json?

I have a pandas dataframe with the following structure. It can be created using the following code:
import pandas as pd
import numpy as np
word = ['this','is','a','test','call','this','is','a','test','call','this','is ','a','test','call', np.NaN]
level_3_start = [np.NaN,np.NaN,'<tyre>','<steering>',np.NaN,np.NaN,np.NaN,np.NaN,'<leg>',np.NaN,'<clutch>',np.NaN,np.NaN,'<break>',np.NaN]
level_3_end = [np.NaN,np.NaN,'</tyre>',np.NaN,'</steering>',np.NaN,np.NaN,np.NaN,'</leg>',np.NaN,np.NaN,np.NaN,'</clutch>','</break>',np.NaN]
level_2_start = [np.NaN,np.NaN,'<car>',np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'<dog>',np.NaN,'<car>',np.NaN,np.NaN,'<bus>',np.NaN]
level_2_end = [np.NaN,np.NaN,np.NaN,np.NaN,'</car>',np.NaN,np.NaN,np.NaN,'</dog>',np.NaN,np.NaN,np.NaN,'</car>','</bus>',np.NaN]
level_1_start= [np.NaN,np.NaN,'<vehicle>',np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'<animal>',np.NaN,'<vehicle>',np.NaN,np.NaN,np.NaN,np.NaN]
level_1_end= [np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'</vehicle>',np.NaN,'</animal>',np.NaN,np.NaN,np.NaN,np.NaN,'</vehicle>',np.NaN]
df1 = pd.DataFrame(list(zip(word, level_3_start, level_3_end, level_2_start, level_2_end, level_1_start, level_1_end)),
                   columns=['word', 'level_3_start', 'level_3_end', 'level_2_start', 'level_2_end', 'level_1_start', 'level_1_end'])
I want to traverse the dataframe and turn it into a nested JSON. The output should look like the one below:
{
    "vehicle": {
        "car": {
            "tyre": True,
            "steering": True,
            "clutch": True
        },
        "bus": {
            "break": True
        }
    },
    "animal": {
        "dog": {
            "leg": True
        }
    }
}
What is the best way to achieve this in pandas?
You are capturing more information than required; the end columns are not needed.
remove rows that have nothing in them with dropna()
forward fill the tags and strip < and > from the strings
use a comprehension to build the dictionary from the dataframe's to_dict() records
df = pd.DataFrame({"word":["this","is","a","test","call","this","is","a","test","call","this","is","a","test","call"],
"level_3_start":["","","<tyre>","<steering>","","","","","<leg>","","<clutch>","","","<break>",""],
"level_3_end":["","","</tyre>","","</steering>","","","","</leg>","","","","</clutch>","</break>",""],
"level_2_start":["","","<car>","","","","","","<dog>","","<car>","","","<bus>",""],
"level_2_end":["","","","","</car>","","","","</dog>","","","","</car>","</bus>",""],
"level_1_start":["","","<vehicle>","","","","","","<animal>","","<vehicle>","","","",""],
"level_1_end":["","","","","","","</vehicle>","","</animal>","","","","","</vehicle>",""]})
# cleanup
df = df.replace({"":np.nan}).dropna(subset=[c for c in df.columns if c!="word"], how="all")
for c in [c for c in df.columns if "start" in c]:
    df[c] = df[c].ffill()
    df[c] = df[c].str.replace("<", "")
    df[c] = df[c].str.replace(">", "")
dfd = df.loc[:,[c for c in df.columns if "level" in c]].drop_duplicates().to_dict(orient="records")
{d["level_1_start"]:
{d2["level_2_start"]:
{d3["level_3_start"]:True
for d3 in dfd if d3["level_1_start"]==d["level_1_start"] and d3["level_2_start"]==d2["level_2_start"]
}
for d2 in dfd if d2["level_1_start"]==d["level_1_start"]
}
for d in dfd
}
output
{'vehicle': {'car': {'tyre': True, 'steering': True, 'clutch': True},
'bus': {'break': True}},
'animal': {'dog': {'leg': True}}}
To get the final results, your data has to go through a 3 step process:
step 1: remove all columns that are not required for processing
step 2: clean data to remove tags and sort them in level_1, level_2, level_3 order
step 3: create the nested dictionary
Here's how I have done it. Commented each section to show clearly what we are doing.
import pandas as pd
import numpy as np
import collections
word = ['this','is','a','test','call','this','is','a','test','call','this','is ','a','test','call', np.NaN]
level_3_start = [np.NaN,np.NaN,'<tyre>','<steering>',np.NaN,np.NaN,np.NaN,np.NaN,'<leg>',np.NaN,'<clutch>',np.NaN,np.NaN,'<break>',np.NaN]
level_3_end = [np.NaN,np.NaN,'</tyre>',np.NaN,'</steering>',np.NaN,np.NaN,np.NaN,'</leg>',np.NaN,np.NaN,np.NaN,'</clutch>','</break>',np.NaN]
level_2_start = [np.NaN,np.NaN,'<car>',np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'<dog>',np.NaN,'<car>',np.NaN,np.NaN,'<bus>',np.NaN]
level_2_end = [np.NaN,np.NaN,np.NaN,np.NaN,'</car>',np.NaN,np.NaN,np.NaN,'</dog>',np.NaN,np.NaN,np.NaN,'</car>','</bus>',np.NaN]
level_1_start= [np.NaN,np.NaN,'<vehicle>',np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'<animal>',np.NaN,'<vehicle>',np.NaN,np.NaN,np.NaN,np.NaN]
level_1_end= [np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'</vehicle>',np.NaN,'</animal>',np.NaN,np.NaN,np.NaN,np.NaN,'</vehicle>',np.NaN]
df1 = pd.DataFrame(list(zip(word, level_3_start, level_3_end, level_2_start, level_2_end, level_1_start, level_1_end)),
                   columns=['word', 'level_3_start', 'level_3_end', 'level_2_start', 'level_2_end', 'level_1_start', 'level_1_end'])
#creating df_temp for processing
df_temp = df1
#drop columns that are not important for this problem statement
df_temp = df_temp.drop(columns=['word','level_1_end','level_2_end','level_3_end'])
#remove all < and >
df_temp['level_1_start'] = df_temp['level_1_start'].str.replace("<","").str.replace(">","")
df_temp['level_2_start'] = df_temp['level_2_start'].str.replace("<","").str.replace(">","")
df_temp['level_3_start'] = df_temp['level_3_start'].str.replace("<","").str.replace(">","")
#drop all rows that don't have any value
df_temp.dropna(how='all', inplace = True)
#forwardfill all level_1 columns
df_temp['level_1_start'] = df_temp['level_1_start'].ffill()
#drop rows that have no data in level_2 and level_3
df_temp = df_temp.dropna(subset=['level_3_start','level_2_start'],how='all')
#forwardfill all level_2_start columns
df_temp['level_2_start'] = df_temp['level_2_start'].ffill()
#drop rows that have no data in level_3
df_temp = df_temp.dropna(subset=['level_3_start'],how='all')
#now we have the all data ready for processing
#sort them in level_1, level_2, level_3 order
df_temp = df_temp.sort_values(by=['level_1_start', 'level_2_start','level_3_start'])
#to create nested dictionary, you need to use collections.defaultdict
df_dict = collections.defaultdict(dict)
#iterate through the dataframe. each row will have a unique record for level_3
for idx, row in df_temp.iterrows():
    lev_1 = row['level_1_start']
    lev_2 = row['level_2_start']
    lev_3 = row['level_3_start']
    #if level_1 does not exist, create new entry for level_1, level_2, & level_3 (ex: animal does not exist)
    #if level_1 exists but no level_2, create new entry for level_2 & level_3 (ex: car does not exist but bus exists)
    #if level_1 and level 2 exists, then create a new entry for level 3 (ex: vehicle, car exists, but tyre does not)
    if lev_1 in df_dict:
        if lev_2 in df_dict[lev_1]:
            df_dict[lev_1][lev_2][lev_3] = True
        else:
            df_dict[lev_1][lev_2] = {lev_3: True}
    else:
        df_dict[lev_1] = {lev_2: {lev_3: True}}
#convert collection back to normal dictionary
df_dict = dict(df_dict)
print(df_dict)
Output will be as follows:
{'animal':
{'dog': {'leg': True}
},
'vehicle':
{'bus': {'break': True},
'car': {'clutch': True, 'steering': True, 'tyre': True}
}
}
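Both answers build a Python dict rather than a JSON string; if actual JSON is required, json.dumps will serialize the nested result (Python's True becomes JSON's true):
import json

print(json.dumps(df_dict, indent=2))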

Filter dataframe based on groupby sum()

I want to filter my dataframe based on a groupby sum(). I am looking for rows where the amounts for a specific date sum to zero.
I have solved this by creating a for loop. I suspect this will hurt performance if the dataframe is large, and it also seems clunky.
newdf = pd.DataFrame()
newdf['name'] = ('leon','eurika','monica','wian')
newdf['surname'] = ('swart','swart','swart','swart')
newdf['birthdate'] = ('14051981','198001','20081012','20100621')
newdf['tdate'] = ('13/05/2015','14/05/2015','15/05/2015', '13/05/2015')
newdf['tamount'] = (100.10, 111.11, 123.45, -100.10)
df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
df2 = df.loc[df["tamount"] == 0, "tdate"]
df3 = pd.DataFrame()
for i in df2:
    df3 = df3.append(newdf.loc[newdf["tdate"] == i])
print (df3)
The code above produces the two rows that net to zero when combined on tamount:
name surname birthdate tdate tamount
0 leon swart 1981-05-14 13/05/2015 100.1
3 wian swart 2010-06-21 13/05/2015 -100.1
Just use basic numpy :)
import numpy as np
df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
dates = df['tdate'][np.where(df['tamount'] == 0)[0]]
newdf[np.isin(newdf['tdate'], dates)]
Hope this helps; let me know if you have any questions.
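For what it's worth, the same filter can be written in pure pandas without building the intermediate summed dataframe; a sketch using groupby (either transform or filter works):
# keep rows whose tdate group sums to zero
out = newdf[newdf.groupby('tdate')['tamount'].transform('sum') == 0]

# equivalent, though usually slower when there are many groups
out = newdf.groupby('tdate').filter(lambda g: g['tamount'].sum() == 0)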

Merge and then sort columns of a dataframe based on the columns of the merging dataframe

I have two dataframes, both indexed with timestamps. I would like the merged result to preserve the column order of the first dataframe.
For example:
#required packages
import pandas as pd
import numpy as np
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=['_1', '_2'])
print("\nData Frame Three:\n", df3)
The above code generates two data frames, the first with columns C, B, and A, and the second with columns B, C, and D. The current output has the columns in the following order: C_1, B_1, A, B_2, C_2, D. What I want is for the merged output's columns to be C_1, C_2, B_1, B_2, A, D: the column order of the first data frame is preserved, and any column shared with the second data frame is placed next to its counterpart.
Could there be a setting in merge or can I use sort_index to do this?
EDIT: Maybe a better way to phrase the sorting process would be to call it uncollated, where columns with the same prefix are put together, and so on.
Using an OrderedDict, as you suggested.
from collections import OrderedDict
from itertools import chain
c = df3.columns.tolist()
o = OrderedDict()
for x in c:
    o.setdefault(x.split('_')[0], []).append(x)
c = list(chain.from_iterable(o.values()))
df3 = df3[c]
An alternative that involves extracting the prefixes and then calling sorted on the column list (this relies on each prefix being a single character):
# https://stackoverflow.com/a/46839182/4909087
c = df3.columns.tolist()
p = [s[0] for s in c]
c = sorted(c, key=lambda x: (p.index(x[0]), x))
df3 = df3[c]
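Since the first-character key only works for single-letter prefixes, a slightly more general sketch (my own variant, not from the linked answer) keys on the full prefix before the suffix underscore, ordering by df1's original columns and pushing df2-only columns to the end:
order = list(df1.columns)  # ['C', 'B', 'A']

def col_key(col):
    prefix = col.split('_')[0]
    # prefixes not in df1 (e.g. D) sort after all of df1's columns
    return (order.index(prefix) if prefix in order else len(order), col)

df3 = df3[sorted(df3.columns, key=col_key)]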