Pandas dataframe manipulation with conditions - python-3.x

How can I iterate through a Pandas DataFrame field and fill null values with input from another field within the same data frame?
My objective is to fill the NA values in column y with the corresponding values in column z.

It's best to avoid iterating through dataframes when the work can be done with vectorized expressions. Something like this should work, although it may need to be adjusted a little for your specific case.
# Set your dataframe
df = ...
# Gets a boolean vector for positions where you have na in column b
nulls_in_b = df["b"].isna()
# Set the places where it's null to values from column c
# (use a single df.loc call to avoid chained assignment)
df.loc[nulls_in_b, "b"] = df.loc[nulls_in_b, "c"]

Related

Expand list columns with variable length into rows of a pandas dataframe

I have an input dataframe containing multiple list columns, with an unequal number of elements within the lists. I need to expand all the list columns into rows so that each bin has its corresponding value in the same row.
code for generating the df:
import pandas as pd

df_dict = {'vin':['VIN123','VIN123','VIN123','VIN234','VIN345'],
'date':['01-22-2022','01-23-2022','01-23-2022','01-23-2022','01-22-2022'],
'celltype':['A','A','B','A','B'],
'soc_bins':[['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170']],
'soc_value': [[10,300,85,20,5,0],[20,400,125,670,5,7],[20,500,55,60,9,9],[40,300,65,90,1,0],[20,700,35,50,2,0]],
'temp_bins':[['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f']],
'temp_value':[[1,2,3],[4,3,4],[5,3,5],[6,900,7],[3,600,9]]}
df = pd.DataFrame(df_dict)
Input_df: (the dataframe generated by the code above)
Output_df (first two rows shown):
vin     date        celltype  soc_bins  soc_value  temp_bins  temp_value
VIN123  01-22-2022  A         0-10      10         50f-55f    1
VIN123  01-22-2022  A         10-20     300        60f-70f    2
In short, each value in the soc_value column corresponds to the bin at the same position in the soc_bins column, and the same goes for the temp columns.
A few problems I encountered with the explode method and similar approaches:
The number of bins in soc_bins (6) and temp_bins (3) is not equal.
Also, the same value can occur for two bins (e.g., in the 3rd row soc_value contains 9 twice), so when I first expand the soc_value column there is no way for the explode function to identify the two rows as different, and I get the error "cannot handle a non-unique multi-index!"
There are many more columns that have to be manipulated in the same way.
I can use df.set_index(['date','vin','celltype']).apply(lambda x: x.apply(pd.Series).stack()).reset_index(), but I am getting NaNs in the indexed columns.
To fill the NaNs I can use .ffill(), but then I am unable to distinguish them from the original null values.
Also, with this method, if some of the index values are null I get the error "cannot handle a non-unique multi-index!"
Current output:
Required output: I need the output similar to my current output but without the null values. I could use .ffill() to fill the null values, but then I am unable to differentiate the actual null values from the ones created by df.set_index().
Assigning a row_number to the df before exploding it has solved the "cannot handle a non-unique multi-index!" issue.
import numpy as np
df['row_number'] = np.arange(len(df))
df.set_index(['row_number', 'date', 'vin', 'celltype']).apply(lambda x: x.apply(pd.Series).stack()).reset_index()
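On pandas >= 1.3 the list columns can also be exploded directly (explode accepts a list of columns) and the soc and temp parts joined positionally — a rough sketch, assuming df is built from df_dict above and that soc and temp bins should pair by position:
import pandas as pd

df = pd.DataFrame(df_dict)
df['row_number'] = range(len(df))  # unique key per original row

keys = ['row_number', 'vin', 'date', 'celltype']
soc = df[keys + ['soc_bins', 'soc_value']].explode(['soc_bins', 'soc_value'])
temp = df[keys + ['temp_bins', 'temp_value']].explode(['temp_bins', 'temp_value'])

# Pair the i-th soc bin with the i-th temp bin of the same original row;
# soc rows beyond the number of temp bins get NaN in the temp columns.
soc['pos'] = soc.groupby('row_number').cumcount()
temp['pos'] = temp.groupby('row_number').cumcount()
out = soc.merge(temp, on=keys + ['pos'], how='left').drop(columns='pos')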

How to store a df.column in a list without index in a loop?

df.shape (15,4)
I want to store the 4th column of df in a list within the loop. What I'm trying is:
l = []
n = 1000  # no. of iterations
for i in range(0, n):
    # df expressions and result calculation equations
    l.append(df.iloc[:, 2])  # This is storing values with the index. I want to store them without indices while keeping inplace=True.
df_new = pd.DataFrame(np.array(l), columns=df.index)
I want the l list to append only the values from df column 3, not a Series object of the pandas.core.series module in each cell.
Use df.iloc[:,2].tolist() inside append to get the desired result.
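In context, the loop would then look like this — a minimal sketch, assuming the rest of the loop body stays unchanged:
import numpy as np
import pandas as pd

# df is the asker's (15, 4) dataframe
l = []
n = 1000  # no. of iterations
for i in range(n):
    # ... per-iteration df calculations ...
    l.append(df.iloc[:, 2].tolist())  # plain list of values, no index attached
df_new = pd.DataFrame(np.array(l), columns=df.index)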

Split dataframe into several ones (not the same size) based on consecutive zero values

I have a dataframe with numeric values (I show here only the column used for the "condition").
I would like to split it into several other dataframes (the sizes of the split dataframes can differ). The splitting should be based on runs of consecutive non-zero values.
In the following case, from the initial dataframe (shown as an image in the original post), I would like to get three dataframes into new variables.
Is there any function to achieve that without iterating over the whole initial dataframe?
Thank you
I think I found the solution...
df['index'] = df.index.values  # create a column holding the index
s = df.iloc[:, 3].eq(0)  # boolean mask of zero values
new_df = df.groupby([s, s.cumsum()]).apply(lambda x: list(x.index))  # found on Stack Overflow: groups values based on the mask above
out = new_df.loc[False]  # select only the False groups, i.e. only values > 0
Finally I have, for each group of consecutive non-zero values, the list of its row indices.
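Building on the same mask, the sub-dataframes themselves can be collected directly — a short sketch, assuming (as in the answer above) that the condition column is the 4th one:
s = df.iloc[:, 3].eq(0)   # mask of zero values
run_id = s.cumsum()       # increments at each zero, so every non-zero run shares one label
sub_dfs = [g for _, g in df[~s].groupby(run_id[~s])]  # one dataframe per run of non-zero values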

Cleaner way to one-by-one construct a row and add it to a final dataframe?

I have large Excel files, which contain observations of objects. I read the files via pandas, group them and then iterate over each group. For each group I calculate specific and quite complex results - let's say result1, result2 and optional result3.
I define an empty df with predefined columns, into which I insert my calculated values. In the end I combine all dfs into one final df.
Maybe better explained by code:
data = pd.read_excel()
grouped = data.groupby('obj_id')
columns = ['result1','result2','result3']
combined_results = pd.DataFrame(columns=columns)
for obj_id, obj_df in grouped:
    obj_results = pd.DataFrame(columns=columns, index=[0])  # creates an empty df with one all-NaN row
    obj_results['result1'] = fooCalculation(obj_df)
    obj_results['result2'] = fooCalculation(obj_df)
    combined_results = combined_results.append(obj_results, sort=False)
I like my current method because if optional columns end up having no value, the column still exists, since it was set (to NaN) initially. This way I can calculate my result values per object one by one and update my df row as soon as I have updates.
I can't help but think that this is not the cleanest way, especially because obj_results['result1'] = fooCalculation(obj_df) sets an entire column and I am misusing it to set a single value.
What is the clean / best-practice way here?
Should I instead "cache" the results in a dict and insert it into combined_results?
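One common pattern is exactly the dict-based caching the question suggests: build one dict per group, collect them in a list, and construct the frame once at the end. A hedged sketch, reusing the asker's fooCalculation:
rows = []
for obj_id, obj_df in grouped:
    row = {'result1': fooCalculation(obj_df),
           'result2': fooCalculation(obj_df)}
    # 'result3' may be added conditionally; the columns= argument below
    # keeps the column (filled with NaN) even when the key is absent
    rows.append(row)
combined_results = pd.DataFrame(rows, columns=columns)
This also avoids repeatedly appending to a DataFrame, which copies the whole frame on every iteration.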

Handle missing data for a dataframe column of datatype object

I have a pandas dataframe and one of the columns is of datatype object. There is a blank element present in this column, so I tried to check whether there are other empty elements in this column using df['colname'].isnull().sum(), but it returns 0. How can I replace the blank value with some arbitrary numeric value so that I can convert this column into a column of float datatype for further computation?
pandas.to_numeric
df['colname'] = pd.to_numeric(df['colname'], errors='coerce')
This will produce np.nan for anything it can't convert to a number. After this, you can fill in any value you'd like with fillna:
df['colname'] = df['colname'].fillna(0)
All in one go
df['colname'] = pd.to_numeric(df['colname'], errors='coerce').fillna(0)
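If you want to inspect which entries failed to convert before overwriting them — a small optional addition, not part of the answer above:
numeric = pd.to_numeric(df['colname'], errors='coerce')
bad = df.loc[numeric.isna() & df['colname'].notna(), 'colname']  # the original non-numeric entries
df['colname'] = numeric.fillna(0)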
