Adding new columns in a dataframe gives length mismatch error - python-3.x

From a csv file (initial.csv):
"Id","Name"
1,"CLO"
2,"FEV"
2,"GEN"
3,"HYP"
4,"DIA"
1,"COL"
1,"EOS"
4,"GAS"
1,"AEK"
I am grouping by the Id column and aggregating the Name column values so that each unique Id has all of its Name values appended on the same row (new.csv):
"Id","Name"
1,"CLO","COL","EOS","AEK"
2,"FEV","GEN"
3,"HYP"
4,"DIA","GAS"
Now some rows have extra Name values, for which I want to add corresponding columns according to the maximum count of Name values on any row, i.e.
"Id","Name","Name2","Name3","Name4"
1,"CLO","COL","EOS","AEK"
2,"FEV","GEN"
3,"HYP"
4,"DIA","GAS"
I do not understand how I can add new columns on dataframe to match the data.
Below is my code:
import pandas as pd

df = pd.read_csv('initial.csv', delimiter=',')

unique_ids_list = df['Id'].unique()
max_names_count = 0
for id in unique_ids_list:
    mask = df['Id'] == id
    names_count = len(df[mask])
    if names_count > max_names_count:
        max_names_count = names_count

group_by_id = df.groupby(["Id"]).agg({"Name": ','.join})

# Create new columns 'Id', 'Name', 'Name2', 'Name3', 'Name4'
new_column_names = ["Id", "Name"] + ['Name' + str(i) for i in range(2, max_names_count + 1)]
group_by_id.columns = new_column_names  # <-- ValueError: Length mismatch: Expected axis has 1 elements, new values have 5 elements

group_by_id.to_csv('new.csv', encoding='utf-8')

The agg({"Name": ','.join}) aggregation produces a single Name column of comma-joined strings, so assigning five column names to a one-column frame raises the length mismatch. Aggregate to lists instead and expand them into columns. Try:
df = pd.read_csv("initial.csv")
df_out = (
    df.groupby("Id")["Name"]
    .agg(list)
    .to_frame()["Name"]
    .apply(pd.Series)
    .rename(columns=lambda x: "Name" if x == 0 else "Name{}".format(x + 1))
    .reset_index()
)
df_out.to_csv("out.csv", index=False)
Creates out.csv:
Id,Name,Name2,Name3,Name4
1,CLO,COL,EOS,AEK
2,FEV,GEN,,
3,HYP,,,
4,DIA,GAS,,
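
For reference, the asker's original ','.join aggregation can also be expanded into columns directly with str.split; a minimal sketch, assuming the same initial.csv and the Name/Name2/... naming from the desired output:

import pandas as pd

df = pd.read_csv("initial.csv")

# join the names per Id, then split the joined strings back out into columns
joined = df.groupby("Id")["Name"].agg(",".join)
expanded = joined.str.split(",", expand=True)
expanded.columns = ["Name"] + ["Name{}".format(i) for i in range(2, expanded.shape[1] + 1)]
expanded.reset_index().to_csv("new.csv", index=False)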

Related

Python3 multiple equal sign in the same line

There is a function in the Python 2 code that I am rewriting in Python 3:
def abc(self, id):
    if not isinstance(id, int):
        id = int(id)
    mask = self.programs['ID'] == id
    assert sum(mask) > 0
    name = self.programs[mask]['name'].values[0]
"id" here is a panda series where the index is strings and the column is int like the following
data = np.array(['1', '2', '3', '4', '5'])
# providing an index
ser = pd.Series(data, index =['a', 'b', 'c'])
print(ser)
self.programs['ID'] is a dataframe column where there is one row with integer data, like 1:
import pandas as pd
# initialize list of lists
data = [[1, 'abc']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'name'])
I am really confused by the lines mask = self.programs['ID'] == id and assert sum(mask) > 0. Could someone enlighten me?
Basically, mask = self.programs['ID'] == id returns a Series of boolean values indicating, for each row, whether the 'ID' value equals id.
Then assert sum(mask) > 0 sums up the boolean Series. Note that bool True is treated as 1 in Python and False as 0, so this asserts that at least one row of the programs['ID'] column has a value equal to id.
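A minimal runnable sketch of the idiom, using a throwaway dataframe whose column names are assumptions for illustration:

import pandas as pd

programs = pd.DataFrame({'ID': [1, 2, 2, 3], 'name': ['a', 'b', 'c', 'd']})

mask = programs['ID'] == 2                # boolean Series: [False, True, True, False]
print(sum(mask))                          # 2 -- each True counts as 1, each False as 0
assert sum(mask) > 0                      # passes: at least one row matched
print(programs[mask]['name'].values[0])   # 'b' -- the first matching name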

how to traverse a pandas dataframe to form a nested json?

I have a pandas dataframe with the following structure. It can be created using the following code:
import pandas as pd
import numpy as np
word = ['this','is','a','test','call','this','is','a','test','call','this','is ','a','test','call', np.NaN]
level_3_start = [np.NaN,np.NaN,'<tyre>','<steering>',np.NaN,np.NaN,np.NaN,np.NaN,'<leg>',np.NaN,'<clutch>',np.NaN,np.NaN,'<break>',np.NaN]
level_3_end = [np.NaN,np.NaN,'</tyre>',np.NaN,'</steering>',np.NaN,np.NaN,np.NaN,'</leg>',np.NaN,np.NaN,np.NaN,'</clutch>','</break>',np.NaN]
level_2_start = [np.NaN,np.NaN,'<car>',np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'<dog>',np.NaN,'<car>',np.NaN,np.NaN,'<bus>',np.NaN]
level_2_end = [np.NaN,np.NaN,np.NaN,np.NaN,'</car>',np.NaN,np.NaN,np.NaN,'</dog>',np.NaN,np.NaN,np.NaN,'</car>','</bus>',np.NaN]
level_1_start= [np.NaN,np.NaN,'<vehicle>',np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'<animal>',np.NaN,'<vehicle>',np.NaN,np.NaN,np.NaN,np.NaN]
level_1_end= [np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'</vehicle>',np.NaN,'</animal>',np.NaN,np.NaN,np.NaN,np.NaN,'</vehicle>',np.NaN]
df1 = pd.DataFrame(list(zip(word, level_3_start, level_3_end, level_2_start, level_2_end, level_1_start, level_1_end)),
                   columns=['word', 'level_3_start', 'level_3_end', 'level_2_start', 'level_2_end', 'level_1_start', 'level_1_end'])
I want to traverse the dataframe to build a nested JSON. The output should look like the one below:
{
    "vehicle": {
        "car": {
            "tyre": True,
            "steering": True,
            "clutch": True
        },
        "bus": {
            "break": True
        }
    },
    "animal": {
        "dog": {
            "leg": True
        }
    }
}
What is the best way to achieve this in pandas?
You are capturing more information than required; the end columns are not needed.
remove rows that have nothing in them with dropna()
forward fill the tags and strip < and > from the strings
use a comprehension to build the dictionary from the dataframe via to_dict()
df = pd.DataFrame({"word":["this","is","a","test","call","this","is","a","test","call","this","is","a","test","call"],
"level_3_start":["","","<tyre>","<steering>","","","","","<leg>","","<clutch>","","","<break>",""],
"level_3_end":["","","</tyre>","","</steering>","","","","</leg>","","","","</clutch>","</break>",""],
"level_2_start":["","","<car>","","","","","","<dog>","","<car>","","","<bus>",""],
"level_2_end":["","","","","</car>","","","","</dog>","","","","</car>","</bus>",""],
"level_1_start":["","","<vehicle>","","","","","","<animal>","","<vehicle>","","","",""],
"level_1_end":["","","","","","","</vehicle>","","</animal>","","","","","</vehicle>",""]})
# cleanup
df = df.replace({"":np.nan}).dropna(subset=[c for c in df.columns if c!="word"], how="all")
for c in [c for c in df.columns if "start" in c]:
df[c].fillna(method="ffill", inplace=True)
df[c] = df[c].str.replace("<","")
df[c] = df[c].str.replace(">","")
dfd = df.loc[:,[c for c in df.columns if "level" in c]].drop_duplicates().to_dict(orient="records")
{d["level_1_start"]:
{d2["level_2_start"]:
{d3["level_3_start"]:True
for d3 in dfd if d3["level_1_start"]==d["level_1_start"] and d3["level_2_start"]==d2["level_2_start"]
}
for d2 in dfd if d2["level_1_start"]==d["level_1_start"]
}
for d in dfd
}
output
{'vehicle': {'car': {'tyre': True, 'steering': True, 'clutch': True},
'bus': {'break': True}},
'animal': {'dog': {'leg': True}}}
To get the final result, your data has to go through a 3-step process:
step 1: remove all columns that are not required for processing
step 2: clean the data to remove the tags and sort the rows in level_1, level_2, level_3 order
step 3: create the nested dictionary
Here's how I have done it. I have commented each section to show clearly what we are doing.
import pandas as pd
import numpy as np
import collections
word = ['this','is','a','test','call','this','is','a','test','call','this','is ','a','test','call', np.NaN]
level_3_start = [np.NaN,np.NaN,'<tyre>','<steering>',np.NaN,np.NaN,np.NaN,np.NaN,'<leg>',np.NaN,'<clutch>',np.NaN,np.NaN,'<break>',np.NaN]
level_3_end = [np.NaN,np.NaN,'</tyre>',np.NaN,'</steering>',np.NaN,np.NaN,np.NaN,'</leg>',np.NaN,np.NaN,np.NaN,'</clutch>','</break>',np.NaN]
level_2_start = [np.NaN,np.NaN,'<car>',np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'<dog>',np.NaN,'<car>',np.NaN,np.NaN,'<bus>',np.NaN]
level_2_end = [np.NaN,np.NaN,np.NaN,np.NaN,'</car>',np.NaN,np.NaN,np.NaN,'</dog>',np.NaN,np.NaN,np.NaN,'</car>','</bus>',np.NaN]
level_1_start= [np.NaN,np.NaN,'<vehicle>',np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'<animal>',np.NaN,'<vehicle>',np.NaN,np.NaN,np.NaN,np.NaN]
level_1_end= [np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,'</vehicle>',np.NaN,'</animal>',np.NaN,np.NaN,np.NaN,np.NaN,'</vehicle>',np.NaN]
df1 = pd.DataFrame(list(zip(word, level_3_start, level_3_end, level_2_start, level_2_end, level_1_start, level_1_end)),
                   columns=['word', 'level_3_start', 'level_3_end', 'level_2_start', 'level_2_end', 'level_1_start', 'level_1_end'])
#creating df_temp for processing
df_temp = df1
#drop columns that are not important for this problem statement
df_temp = df_temp.drop(columns=['word','level_1_end','level_2_end','level_3_end'])
#remove all < and >
df_temp['level_1_start'] = df_temp['level_1_start'].str.replace("<","").str.replace(">","")
df_temp['level_2_start'] = df_temp['level_2_start'].str.replace("<","").str.replace(">","")
df_temp['level_3_start'] = df_temp['level_3_start'].str.replace("<","").str.replace(">","")
#drop all rows that don't have any value
df_temp.dropna(how='all', inplace = True)
#forwardfill all level_1 columns
df_temp['level_1_start'] = df_temp['level_1_start'].ffill()
#drop rows that have no data in level_2 and level_3
df_temp = df_temp.dropna(subset=['level_3_start','level_2_start'],how='all')
#forwardfill all level_2_start columns
df_temp['level_2_start'] = df_temp['level_2_start'].ffill()
#drop rows that have no data in level_3
df_temp = df_temp.dropna(subset=['level_3_start'],how='all')
#now we have the all data ready for processing
#sort them in level_1, level_2, level_3 order
df_temp = df_temp.sort_values(by=['level_1_start', 'level_2_start','level_3_start'])
#to create nested dictionary, you need to use collections.defaultdict
df_dict = collections.defaultdict(dict)
#iterate through the dataframe. each row will have a unique record for level_3
for idx, row in df_temp.iterrows():
    lev_1 = row['level_1_start']
    lev_2 = row['level_2_start']
    lev_3 = row['level_3_start']
    # if level_1 does not exist, create new entry for level_1, level_2, & level_3 (ex: animal does not exist)
    # if level_1 exists but no level_2, create new entry for level_2 & level_3 (ex: car does not exist but bus exists)
    # if level_1 and level_2 exist, then create a new entry for level_3 (ex: vehicle, car exist, but tyre does not)
    if lev_1 in df_dict:
        if lev_2 in df_dict[lev_1]:
            df_dict[lev_1][lev_2][lev_3] = True
        else:
            df_dict[lev_1][lev_2] = {lev_3: True}
    else:
        df_dict[lev_1] = {lev_2: {lev_3: True}}
#convert collection back to normal dictionary
df_dict = dict(df_dict)
print(df_dict)
Output will be as follows:
{'animal':
{'dog': {'leg': True}
},
'vehicle':
{'bus': {'break': True},
'car': {'clutch': True, 'steering': True, 'tyre': True}
}
}
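
Note that both answers build a Python dict rather than JSON text. If actual JSON is needed, a json.dumps call over the final dictionary (df_dict here, from the answer above) serializes it, turning True into true along the way:

import json

# pretty-print the nested dictionary as real JSON
print(json.dumps(df_dict, indent=2, sort_keys=True))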

Python Pandas : Extend operation of a column if a condition matches

I have two different dataframes:
firstDF = pd.DataFrame([{'mac':1,'location':['kitchen']}])
predictedDF = pd.DataFrame([{'mac':1,'location':['lab']}])
If the mac column value of predictedDF is contained in the mac column of firstDF, then the location value of predictedDF should extend the location list of firstDF, and the result in firstDF should be:
firstDF
mac location
0 1 ['kitchen','lab']
I have tried:
firstDF.loc[firstDF['mac'] == predictedDF['mac'], 'mac'] = firstDF.loc[firstDF['location'].extend(predictedDF['location']), 'location']
which returns:
AttributeError: 'Series' object has no attribute 'extend'
If the location columns contain lists, first use DataFrame.merge to align the two DataFrames on mac, then join the lists with + and use DataFrame.pop to extract each column (use it and drop it in one step):
df = firstDF.merge(predictedDF, on='mac', how='left')
df['location'] = df.pop('location_x') + df.pop('location_y')
print(df)
mac location
0 1 [kitchen, lab]
Test with more values - if there are missing values, replace them with []:
firstDF = pd.DataFrame({'mac': [1, 2], 'location': [['kitchen'], ['kitchen']]})
predictedDF = pd.DataFrame([{'mac': 1, 'location': ['lab']}])

# NaN != NaN, so the lambda replaces missing merge results with empty lists
df = firstDF.merge(predictedDF, on='mac', how='left').applymap(lambda x: x if x == x else [])
df['location'] = df.pop('location_x') + df.pop('location_y')
print(df)
mac location
0 1 [kitchen, lab]
1 2 [kitchen]

PySpark: Replace Punctuations with Space Looping Through Columns

I have the following code running successfully in PySpark:
from pyspark.sql import functions as F

def pd(data):
    df = data
    df = df.select('oproblem')
    text_col = ['oproblem']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
But when I add a second column and try to loop over both, it doesn't work:
def pd(data):
    df = data
    df = df.select('oproblem', 'lca')
    text_col = ['oproblem', 'lca']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
Below is the error I get:
TypeError: 'Column' object is not callable
I think it should be df = df.select(['oproblem', 'lca']) instead of df = df.select('oproblem', 'lca').
Better yet, for code quality purposes, have the select statement use the text_col variable, so you only have to change one line of code if you need to do this with more columns or if your column names change. E.g.,
def pd(data):
    df = data
    text_col = ['oproblem', 'lca']
    df = df.select(text_col)
    ....
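
For completeness, here is a sketch of the full function with that change applied. The column list and regex come from the question; the name clean_text_cols is a stand-in, since calling the function pd as in the question shadows the usual pandas alias:

from pyspark.sql import functions as F

def clean_text_cols(data):
    # one list drives both the select and the loop
    text_col = ['oproblem', 'lca']
    df = data.select(text_col)
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df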

Calling a data frame via a string

I have a list of countries such as:
country = ["Brazil", "Chile", "Colombia", "Mexico", "Panama", "Peru", "Venezuela"]
I created data frames using the names from the country list:
for c in country:
    c = pd.read_excel(str(c + ".xls"), skiprows=1)
    c = pd.to_datetime(c.Date, infer_datetime_format=True)
    c = c[["Date", "spreads"]]
Now I want to merge all the country data frames using the Date column as the key. The idea is to create a loop like the following:
df = Brazil  # the first dataframe, corresponding to the first element of the country list
for i in range(len(country) - 1):
    df = df.merge(country[i + 1], on="Date", how="inner")
df.set_index("Date", inplace=True)
I got the error ValueError: can not merge DataFrame with instance of type <class 'str'>. It seems Python is not resolving the data frames whose names are stored in the country list. How can I access those data frames starting from the country list?
Thanks masters!
Your loop doesn't modify the contents of the country list, so country is still a list of strings.
Consider building a new list of dataframes and looping over that:
country_dfs = []
for c in country:
    df = pd.read_excel(c + ".xls", skiprows=1)
    # convert the column in place; assigning pd.to_datetime's result to df would replace the whole frame with a Series
    df["Date"] = pd.to_datetime(df["Date"], infer_datetime_format=True)
    df = df[["Date", "spreads"]]
    # add the new dataframe to our list of dataframes
    country_dfs.append(df)
then to merge,
merged_df = country_dfs[0]
for df in country_dfs[1:]:
    merged_df = merged_df.merge(df, on='Date', how='inner')
merged_df.set_index('Date', inplace=True)
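
Equivalently, the pairwise merge loop can be collapsed with functools.reduce; a minimal sketch over the same country_dfs list:

from functools import reduce

# fold the list of dataframes into one inner-joined frame keyed on Date
merged_df = reduce(lambda left, right: left.merge(right, on='Date', how='inner'), country_dfs)
merged_df.set_index('Date', inplace=True)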
