Iterating over rows of dataframe but keep each row as a dataframe - python-3.x

I want to iterate over the rows of a dataframe, but keep each row as a dataframe that has the exact same format of the parent dataframe, except with only one row. I know about calling DataFrame() and passing in the index and columns, but for some reason this doesn't always give me the same format of the parent dataframe. Calling to_frame() on the series (i.e. the row) does cast it back to a dataframe, but often transposed or in some way different from the parent dataframe format. Isn't there some easy way to do this and guarantee it will always be the same format for each row?
Here is what I came up with as my best solution so far:
def transact(self, orders):
# Buy or Sell
if len(orders) > 1:
empty_order = orders.iloc[0:0]
for index, order in orders.iterrows():
empty_order.loc[index] = order
#empty_order.append(order)
self.sub_transact(empty_order)
else:
self.sub_transact(orders)
In essence, I empty the dataframe and then insert the series, from the For loop, back into it. This works correctly, but gives the following warning:
C:\Users\BNielson\Google Drive\My Files\machine-learning\Python-Machine-Learning\ML4T_Ex2_1.py:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
empty_order.loc[index] = order
C:\Users\BNielson\Anaconda3\envs\PythonMachineLearning\lib\site-packages\pandas\core\indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
So it's this line giving the warning:
empty_order.loc[index] = order
This is particularly strange because I am using .loc already, when normally you get this error when you don't use .loc.

There is a much much easier way to do what I want.
order.to_frame().T
So...
if len(orders) > 1:
for index, order in orders.iterrows():
self.sub_transact(order.to_frame().T)
else:
self.sub_transact(orders)
What this actually does is translates the series (which still contains the necessary column and index information) back to a dataframe. But for some Moronic (but I'm sure Pythonic) reason it transposes it so that the previous row is now the column and the previous columns are now multiple rows! So you just transpose it back.

Use groupby with a unique list. groupby does exactly what you are asking for as in, it iterates over each group and each group is a dataframe. So, if you manipulate it such that you groupby a value that is unique for each and every row, you'll get a single row dataframe when you iterate over the group
for n, group in df.groupby(np.arange(len(df))):
pass
# do stuff

If I can suggest an alternative way than it would be like this:
for index, order in orders.iterrows():
orders.loc[index:index]
orders.loc[index:index] is exactly one row dataframe slice with the same structure, including index and column names.

Related

Expand list columns with variable length into rows of a pandas dataframe

I have a input dataframe containing multiple list columns with unequal number of elements with in the list. I need to expand all the list columns into rows so that each bin has the corresponding value in the same row.
code for generating the df:
df_dict = {'vin':['VIN123','VIN123','VIN123','VIN234','VIN345'],
'date':['01-22-2022','01-23-2022','01-23-2022','01-23-2022','01-22-2022'],
'celltype':['A','A','B','A','B'],
'soc_bins':[['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170']],
'soc_value': [[10,300,85,20,5,0],[20,400,125,670,5,7],[20,500,55,60,9,9],[40,300,65,90,1,0],[20,700,35,50,2,0]],
'temp_bins':[['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f']],
'temp_value':[[1,2,3],[4,3,4],[5,3,5],[6,900,7],[3,600,9]],
'temp_bins':[['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f']]}
Input_df:
Output_df:
vin
date
celltype
soc_bins
soc_value
temp_bins
temp_value
VIN123
01-22-2022
A
0-10
10
50f-55f
1
VIN123
01-22-2022
A
10-20
300
60f-70f
2
In short, each value in the soc_value column corresponds to the corresponding bin in the soc_bin column and same goes for the temp columns.
Few problems I encountered using the explode method or similar methods is:
The number of bins in soc_bins (5) and temp_bins (3) are not equal.
Also, there might be a same value for two bins (ex: 3rd row, soc_value contains two values as 9) so when I first expand the soc_value column there is no way for the explode fucntion to identify the two rows as different and hence i am getting an error "cannot handle a non-unique multi-index!"
There are a lot many columns that has to be manipulated in the same way.
Can use df.set_index('date','vin','celltype').apply(lambda x: x.apply(pd.Series).stack()).reset_index() but i am getting NaN's in the indexed columns.
To fill the NaN's I can use the .ffill() but I am unable to distinguish between original null values.
Also, in this method if some of indexes are null's i'm getting an error "cannot handle a non-unique multi-index!"
Current output:
Required output: I need the output similar to my current output but without the null values. I could use .ffill() to fill the null values, but then i am unable to differentiate the actual null values vs the the ones created from the df.set_index().
Assigning a row_number to the df before exploding it into columns has solved the "cannot handle a non-unique multi-index!" issue.
df['row_number'] = np.arange(len(df))
df.set_index('date','vin','celltype').apply(lambda x: x.apply(pd.Series).stack()).reset_index()

How can I groupby rows by the columns in which they actually posses a data point?

I don't even know if groupby is the correct function to use for this. It's a bit hard to understand so Ill include a screenshot of my dataframe: screenshot
Basically, this dataframe has way too many columns because each column is specific to only one or a few rows. You can see in the screenshot that the first few columns are specific towards the first row and the last few columns are specific to the last row. I want to make it so that each row only has the columns that actually pertain to it. I've tried several methods of using groupby('equipment name') and several methods using dropna but none work in the way I need it to. I'm also open to separating it into multiple dataframes.
Any method is acceptable, this bug has been driving me crazy. It took me a while to get to this point because this started out as an unintelligible 10,000 line json. I'm pretty new to programming as well.
This is a very cool answer that could be one option - and it does use groupby so sorry for dismissing!!! This will group your data into DataFrames where each DataFrame has a unique group of columns, and any row which only contains values for those columns will be in that DataFrame. If your data are such that there are multiple groups of rows which share the exact same columns, this solution is ideal I think.
Just to note, though, if your null values are more randomly spread out throughout the dataset, or if one row in a group of rows is missing a single entry (compared to related rows), you will end up with more combinations of unique non-null columns, and then more output DataFrames.
There are also (in my opinion) nice ways to search a DataFrame, even if it is very sparse. You can check the non-null values for a row:
df.loc[index_name].dropna()
Or for an index number:
df.iloc[index_number].dropna()
You could further store these values, say in a dictionary (this is a dictionary of Series, but could be converted to DataFrame:
row_dict = {row : df.loc[row].dropna() for row in df.index}
I could imagine some scenarios where something based off these options is more helpful for searching. But that linked answer is slick, I would try that.
EDIT: Expanding on the answer above based on comments with OP.
The dictionary created in the linked post contain the DataFrames . Basically you can use this dictionary to do comparisons with the original source data. My only issue with that answer was that it may be hard to search the dictionary if the column names are janky (as it looks like in your data), so here's a slight modification:
for i, (name,df) in enumerate(df.groupby(df.isnull().dot(df.columns))):
d['df' + str(i)] = df.dropna(1)
Now the dictionary keys are "df#", and the values are the DataFrames. So if you wanted to inspect the content one DataFrame, you can call:
d['df1'].head()
#OR
print(d['df0'])
If you wanted to look at all the DataFrames, you could call
for df in d.values():
print(df.head()) #you can also pass an integer to head to show more rows than 5
Or if you wanted to save each DataFrame you could call:
for name in sorted(d.keys()):
d[name].to_csv('path/to/file/' + name + '.csv')
The point is, you've gotten to a data structure where you can look at the original data, separated into DataFrames without missing data. Joining these back into a single DataFrame would be redundant, as it would create a single DataFrame (equal to the original) or multiple with some amount of missing data.
I think it comes down to what you are looking for and how you need to search the data. You could rename the dictionary keys / output .CSV files based on the types of machinery inside, for example.
I thought your last comment might mean that objects of similar type might not share the same columns; say for example if not all "Exhaust Fans" have the same columns, they will end up in different DataFrames in the dictionary. This maybe the type of case where it might be easier to just look at individual rows, rather than grouping them into weird categories:
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
You could again then save these DataFrames as CSV files or look at them one by one (or e.g. search for Exhaust Fans by seeing if "Exhaust" is in they key). You could also print them all at once:
import pandas as pd
import numpy as np
import natsort
#making some randomly sparse data
columns = ['Column ' + str(i+1) for i in range(10)]
index = ['Row ' + str(i+1) for i in range(100)]
df = pd.DataFrame(np.random.rand(100,10), columns=columns,index=index)
df[df<.7] = np.nan
#creating the dictionary where each key is a row name
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
#printing all the output
for key in natsort.natsorted(df_dict.keys())[:5]: #using [:5] to limit output
print(df_dict[key], '\n')
Out[1]:
Column 1 Column 4 Column 7 Column 9 Column 10
Row 1 0.790282 0.710857 0.949141 0.82537 0.998411
Column 5 Column 8 Column 10
Row 2 0.941822 0.722561 0.796324
Column 2 Column 4 Column 5 Column 6
Row 3 0.8187 0.894869 0.997043 0.987833
Column 1 Column 7
Row 4 0.832628 0.8349
Column 1 Column 4 Column 6
Row 5 0.863212 0.811487 0.924363
Instead of printing, you could write the output to a text file; maybe that's the type of document that you could look at (and search) to compare to the input tables. Bute not that even though the printed data are tabular, they can't be made into a DataFrame without accepting that there will be missing data for rows which don't have entries for all columns.

Spark: Filter & withColumn using row values?

I need to create a column called sim_count for every row in my spark dataframe, whose value is the count of all other rows from the dataframe that match some conditions based on the current row's values. Is it possible to access row values while using when?
Is something like this possible? I have already implemented this logic using a UDF, but serialization of the dataframe's rdd map is very costly and I am trying to see if there is a faster alternative to find this count value.
Edit
<Row's col_1 val> refer's to the outer scope row I am calculating the count for, NOT the inner scope row inside the df.where. For example, I know this is incorrect syntax, but I'm looking for something like:
df.withColumn('sim_count',
f.when(
f.col("col_1").isNotNull(),
(
df.where(
f.col("price_list").between(f.col("col1"), f.col("col2"))
).count()
)
).otherwise(f.lit(None).cast(LongType()))
)

Cleaner way to one-by-one construct a row and add it to a final dataframe?

I have large Excel files, which contain observations of objects. I read the files via pandas, group them and then iterate over each group. For each group I calculate specific and quite complex results - let's say result1, result2 and optional result3.
I define an empty df with predefined columns, in which I insert my calculated values. In the end I combine all dfs to one final df.
Maybe better explained by code:
data = pd.read_excel()
grouped = data.groupby('obj_id')
columns = ['result1','result2','result3']
combined_results = pd.DataFrame(columns=columns)
for obj_id, obj_df in grouped:
obj_results = pd.DataFrame(columns=columns,index=[0])
# creates empty df with one all NaN row
obj_results['result1'] = fooCalculation(obj_df)
obj_results['result2'] = fooCalculation(obj_df)
combined_results = combined_results.append(obj_results, sort=False)
I like my current method, because if optional columns end up having no value, the col still exists, because it was set (to NaN) initially. This way I can one by one calculate my result values per object and update my df row as soon as I have updates.
I can't help but think, that this is not the cleanest way. Especially because obj_results['result1'] = fooCalculation() sets an entire column and I falsely use it to set one value.
What is the clean / best-pratice way here.
Should I instead "cache" the results in a dict and insert it into the combine_results?

How to fillna() all columns of a dataframe from a single row of another dataframe with identical structure

I have a train_df and a test_df, which come from the same original dataframe, but were split up in some proportion to form the training and test datasets, respectively.
Both train and test dataframes have identical structure:
A PeriodIndex with daily buckets
n number of columns that represent observed values in those time buckets e.g. Sales, Price, etc.
I now want to construct a yhat_df, which stores predicted values for each of the columns. In the "naive" case, yhat_df columns values are simply the last observed training dataset value.
So I go about constructing yhat_df as below:
import pandas as pd
yhat_df = pd.DataFrame().reindex_like(test_df)
yhat_df[train_df.columns[0]].fillna(train_df.tail(1).values[0][0], inplace=True)
yhat_df(train_df.columns[1]].fillna(train_df.tail(1).values[0][1], inplace=True)
This appears to work, and since I have only two columns, the extra typing is bearable.
I was wondering if there is simpler way, especially one that does not need me to go column by column.
I tried the following but that just populates the column values correctly where the PeriodIndex values match. It seems fillna() attempts to do a join() of sorts internally on the Index:
yhat_df.fillna(train_df.tail(1), inplace=True)
If I could figure out a way for fillna() to ignore index, maybe this would work?
you can use fillna with a dictionary to fill each column with a different value, so I think:
yhat_df = yhat_df.fillna(train_df.tail(1).to_dict('records')[0])
should work, but if I understand well what you do, then even directly create the dataframe with:
yhat_df = pd.DataFrame(train_df.tail(1).to_dict('records')[0],
index = test_df.index, columns = test_df.columns)

Resources