Pandas DataFrame won't pivot. Says duplicate indices - python-3.x

So basically I have 3 columns in my dataframe as follows:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 158143 entries, 0 to 203270
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   users          158143 non-null  int64
 1   dates          158143 non-null  datetime64[ns]
 2   medium_of_ans  158143 non-null  object
I want to reshape it so that each distinct medium_of_ans value becomes a separate column, dates become the row index, and the users count for a particular answer medium on a particular date sits at the intersection of that row and column. In pandas this is normally achieved by pivoting the dataframe, but the following attempt:
df.pivot(columns= 'medium_of_ans', index = 'dates', values = 'users')
throws this error:
ValueError: Index contains duplicate entries, cannot reshape
I'm not sure why, since a dataframe that is about to be pivoted will naturally contain duplicate values in the column used as the index; that is the whole point of pivoting it. Resetting the dataframe index as follows:
df.reset_index().pivot(columns= 'medium_of_ans', index = 'dates', values = 'users')
does not help either, and the error persists.

You have duplicates not just in the index (dates) but in the combination of index and column together: the same (dates, medium_of_ans) pair occurs more than once, so pivot cannot decide which users value belongs in that cell.
You can find these duplicates with something like this:
counts = df.groupby(['dates', 'medium_of_ans']).size().reset_index(name='n')
duplicates = counts[counts['n'] > 1]
If you want to combine the duplicates, for example by taking the mean of users for the cell, then you can use pivot_table.
df.pivot_table(columns='medium_of_ans', index='dates', values='users', aggfunc='mean')
Taking the mean is the default, but I have added the explicit parameter for clarity.
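For a self-contained illustration, here is a minimal sketch with made-up data (not the original dataframe) that reproduces both the error and the pivot_table fix:
import pandas as pd

# Two rows share the same (dates, medium_of_ans) pair, so pivot() cannot
# place both of their users values into a single cell.
df = pd.DataFrame({
    'users': [10, 20, 30],
    'dates': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02']),
    'medium_of_ans': ['email', 'email', 'chat'],
})

# df.pivot(columns='medium_of_ans', index='dates', values='users')  # raises ValueError
df.pivot_table(columns='medium_of_ans', index='dates', values='users', aggfunc='mean')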

Related

Expand list columns with variable length into rows of a pandas dataframe

I have an input dataframe containing multiple list columns with an unequal number of elements within each list. I need to expand all the list columns into rows so that each bin has its corresponding value in the same row.
code for generating the df:
df_dict = {'vin': ['VIN123','VIN123','VIN123','VIN234','VIN345'],
           'date': ['01-22-2022','01-23-2022','01-23-2022','01-23-2022','01-22-2022'],
           'celltype': ['A','A','B','A','B'],
           'soc_bins': [['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170']],
           'soc_value': [[10,300,85,20,5,0],[20,400,125,670,5,7],[20,500,55,60,9,9],[40,300,65,90,1,0],[20,700,35,50,2,0]],
           'temp_bins': [['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f']],
           'temp_value': [[1,2,3],[4,3,4],[5,3,5],[6,900,7],[3,600,9]]}
Input_df: the dataframe built from df_dict above, with one list per cell.
Output_df (first two rows shown):
vin     date        celltype  soc_bins  soc_value  temp_bins  temp_value
VIN123  01-22-2022  A         0-10      10         50f-55f    1
VIN123  01-22-2022  A         10-20     300        60f-70f    2
In short, each value in the soc_value column corresponds to the matching bin in the soc_bins column, and the same goes for the temp columns.
A few problems I encountered using the explode method or similar methods are:
The number of bins in soc_bins (6) and temp_bins (3) is not equal.
Also, the same value can appear for two bins (e.g. in the 3rd row, soc_value contains the value 9 twice), so when I first expand the soc_value column there is no way for the explode function to tell the two rows apart, and I get the error "cannot handle a non-unique multi-index!"
There are many more columns that have to be manipulated in the same way.
I can use df.set_index('date','vin','celltype').apply(lambda x: x.apply(pd.Series).stack()).reset_index(), but I get NaNs in the indexed columns.
To fill the NaNs I can use .ffill(), but then I am unable to distinguish them from the original null values.
Also, with this method, if some of the index values are null I get the error "cannot handle a non-unique multi-index!"
Current output: (not reproduced here; it contains the NaNs described above)
Required output: I need output similar to my current output but without the null values. I could use .ffill() to fill them, but then I am unable to differentiate the actual null values from the ones created by df.set_index().
Assigning a row_number to the df before exploding it has solved the "cannot handle a non-unique multi-index!" issue, because it gives every original row a unique key:
df['row_number'] = np.arange(len(df))
df.set_index(['row_number', 'date', 'vin', 'celltype']).apply(lambda x: x.apply(pd.Series).stack()).reset_index()
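For reference, a fully runnable version of this approach (a sketch, assuming df is built from the df_dict above; the level_4 rename is illustrative, and positions beyond the length of the shorter temp lists still come out as NaN in the temp columns):
import numpy as np
import pandas as pd

df = pd.DataFrame(df_dict)

# A unique per-row key keeps the stacked MultiIndex unique even when several
# rows share the same (vin, date, celltype) combination.
df['row_number'] = np.arange(len(df))

long_df = (
    df.set_index(['row_number', 'vin', 'date', 'celltype'])
      .apply(lambda col: col.apply(pd.Series).stack())  # expand each list column, then stack
      .reset_index()
      .rename(columns={'level_4': 'bin_position'})      # position of each element inside its list
)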

Pandas: Compare between same columns for same Ids between 2 dataframes and create a new dataframe with the differences for each column

Hello,
I have 2 dataframes, one old and one new. After comparing the two dataframes I want to generate output with the column names for each Id and only the changed values, as shown below.
I could merge the 2 dataframes and find the differences for each column separately like
a=df1.merge(df2, on='Ids')
a[a['ColA_x'] != a['ColA_y']]
But I have 80 columns and I want to get the difference with column names and values as shown in the output. Is there a way to do this?
Stack each dataframe to convert column names into row indexes. Concatenate the dataframes side by side:
combined = pd.concat([df1.stack(), df2.stack()], axis=1)
Now, extract the rows with the values that do not match:
combined[combined[0] != combined[1]]
#           0   1
# Ids
# 123 ColA  AH  AB
# 234 ColB  GO  MO
# 456 ColA  GH  AB
# ...
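A runnable sketch with made-up data (the values below are invented just to match the sample output above; if Ids is a regular column, call set_index('Ids') on each frame first):
import pandas as pd

df1 = pd.DataFrame({'ColA': ['AH', 'XX', 'GH'], 'ColB': ['GO', 'GO', 'YY']},
                   index=pd.Index([123, 234, 456], name='Ids'))
df2 = pd.DataFrame({'ColA': ['AB', 'XX', 'AB'], 'ColB': ['GO', 'MO', 'YY']},
                   index=pd.Index([123, 234, 456], name='Ids'))

# stack() moves the column names into the row index, so each cell becomes one row.
combined = pd.concat([df1.stack(), df2.stack()], axis=1)

# Keep only the cells whose old and new values differ.
diff = combined[combined[0] != combined[1]]
diff.columns = ['old', 'new']  # optional, purely for readability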

Merging dataframes is causing me to lose rows

I have a dataframe that I divide into 3 sub-dataframes. I then apply aggregate functions to each, and afterwards merge the 3 dataframes back together.
However, when comparing the number of rows before and after the merge, there is a significant loss, even though I used fillna to preserve the row count. I think the aggregation code is what trimmed everything. Maybe there is a better way to write that portion of the code, which would fix the rest of it.
In: df.info()
Out:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 505960 entries, 640051 to 204623
Data columns (total 4 columns):
id 505960 non-null int64
session_number 505960 non-null int64
date 505960 non-null datetime64[ns]
purchases 505960 non-null int64
dtypes: datetime64[ns](1), int64(3)
memory usage: 19.3 MB
In: df.shape
Out: (505960, 4)
In:
#slice main dataframe
df_test=df[['id','purchases','session_number','date']].copy()
#aggregations I THINK HERE IS THE PROBLEM SOURCE!
df_1=df_test.groupby(['id'])["purchases"].apply(lambda x : x.astype(int).sum()).reset_index()
df_2=df_test.groupby(['id'])["session_number"].apply(lambda y : y.max()-y.min()).astype(int).reset_index()
df_3=df_test.groupby(['id'])["date"].apply(lambda z : z.max()-z.min()).reset_index()
#merge dfs sequentially by id
df_a=pd.merge(df_1, df_2, on='id', how='left').fillna(0)
df=pd.merge(df_a, df_3, on='id', how='left').fillna(0)
in: df.shape
Out: (292291, 4)
You can see that my rows shrank from 505,960 to 292,291! What am I doing wrong in the aggregation portion of the code, and how do I fix it?
Looking at the given code and the metadata about the data: groupby aggregates all records with the same id into a single group (one output row per id), so the total record count will decrease whenever the id values are not unique. The count of unique ids should equal the final record count after the groupby.
df['id'].nunique() will give you the count of unique ids, which should match your final count.
When you call df_test.groupby(['id']), pandas builds a GroupBy object and uses the group key, 'id' in this case, as the index of the aggregated result.
Hence, if you skip the reset_index() calls so that 'id' stays in the index, you can merge on the index instead:
df_a = df_1.merge(df_2, left_index=True, right_index=True).fillna(0)
df = df_a.merge(df_3, left_index=True, right_index=True).fillna(0)
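As a side note, the three groupbys and two merges can be collapsed into a single pass with named aggregation (a sketch reusing the column names above); it still yields one row per unique id, which is why the merged result has df_test['id'].nunique() rows rather than len(df_test):
# One aggregated row per unique id.
summary = df_test.groupby('id').agg(
    purchases=('purchases', 'sum'),
    session_number=('session_number', lambda s: s.max() - s.min()),
    date=('date', lambda s: s.max() - s.min()),
).reset_index()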

PySpark: Update column values for a given number of rows of a DataFrame

I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
from pyspark.sql import Row  # assumes an active SparkSession named `spark`

vals = [
    Row(ID=1, VAL=None),
    Row(ID=2, VAL=None),
    Row(ID=3, VAL=None),
    Row(ID=4, VAL=None),
    Row(ID=5, VAL=None),
    Row(ID=6, VAL=None),
    Row(ID=7, VAL=None),
    Row(ID=8, VAL=None),
    Row(ID=9, VAL=None),
    Row(ID=10, VAL=None),
]
df = spark.createDataFrame(vals)
Now let's say I want to update the VAL column for 3 rows with the value "lets", 3 rows with the value "bucket" and 4 rows with the value "this".
Is there a straightforward way of doing this in PySpark?
Note: the ID values are not necessarily consecutive, and the bucket distribution is not necessarily even.
I'll try to explain the idea with some pseudo-code, and you can map it to your solution.
Using a window function over a single partition, we can generate a sequential row_number() for each row of the dataframe and store it in, say, a column row_num.
Next, your "rules" can be represented as another little dataframe: [min_row_num, max_row_num, label].
All you need is to join those two datasets on the row number and add the new column:
df1.join(df2,
         on=col('df1.row_num').between(col('min_row_num'), col('max_row_num'))
        ).select('df1.*', 'df2.label')
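Spelled out as runnable PySpark (a sketch: the names numbered, rules and w are illustrative, and it reuses df and the spark session from the question above):
from pyspark.sql import Row, Window
from pyspark.sql import functions as F

# Number the rows 1..10 over a single partition (fine for small data;
# Spark will warn about the missing partitionBy).
w = Window.orderBy('ID')
numbered = df.withColumn('row_num', F.row_number().over(w))

# The bucket "rules": row-number ranges and the label to assign to each range.
rules = spark.createDataFrame([
    Row(min_row_num=1, max_row_num=3, label='lets'),
    Row(min_row_num=4, max_row_num=6, label='bucket'),
    Row(min_row_num=7, max_row_num=10, label='this'),
])

result = (numbered
          .join(rules, on=numbered.row_num.between(rules.min_row_num, rules.max_row_num))
          .select('ID', F.col('label').alias('VAL')))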

Copy data from one dataframe into the column of another dataframe depending on 'n' conditions [duplicate]

I have 2 dataframes:
restaurant_ids_dataframe
Data columns (total 13 columns):
business_id 4503 non-null values
categories 4503 non-null values
city 4503 non-null values
full_address 4503 non-null values
latitude 4503 non-null values
longitude 4503 non-null values
name 4503 non-null values
neighborhoods 4503 non-null values
open 4503 non-null values
review_count 4503 non-null values
stars 4503 non-null values
state 4503 non-null values
type 4503 non-null values
dtypes: bool(1), float64(3), int64(1), object(8)
and
restaurant_review_frame
Int64Index: 158430 entries, 0 to 229905
Data columns (total 8 columns):
business_id 158430 non-null values
date 158430 non-null values
review_id 158430 non-null values
stars 158430 non-null values
text 158430 non-null values
type 158430 non-null values
user_id 158430 non-null values
votes 158430 non-null values
dtypes: int64(1), object(7)
I would like to join these two DataFrames to make them into a single dataframe using the DataFrame.join() command in pandas.
I have tried the following line of code:
#the following line of code creates a left join of restaurant_ids_frame and restaurant_review_frame on the column 'business_id'
restaurant_review_frame.join(other=restaurant_ids_dataframe,on='business_id',how='left')
But when I try this I get the following error:
Exception: columns overlap: Index([business_id, stars, type], dtype=object)
I am very new to pandas and have no clue what I am doing wrong as far as executing the join statement is concerned.
Any help would be much appreciated.
You can use merge to combine two dataframes into one:
import pandas as pd
pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer')
where on specifies the field name that exists in both dataframes to join on, and how
defines whether it's an inner/outer/left/right join, with outer using the 'union of keys from both frames (SQL: full outer join)'. Since you have a 'stars' column in both dataframes, this will by default create two columns, stars_x and stars_y, in the combined dataframe. As @DanAllan mentioned for the join method, you can modify the suffixes for merge by passing them as a kwarg; the default is suffixes=('_x', '_y'). If you wanted something like stars_restaurant_id and stars_restaurant_review, you can do:
pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer', suffixes=('_restaurant_id', '_restaurant_review'))
The parameters are explained in detail in the pandas documentation for merge.
Joining fails if the DataFrames have some column names in common. The simplest way around it is to include an lsuffix or rsuffix keyword like so:
restaurant_review_frame.join(restaurant_ids_dataframe, on='business_id', how='left', lsuffix="_review")
This way, the columns have distinct names. The documentation addresses this very problem.
Or, you could get around this by simply deleting the offending columns before you join. If, for example, the stars in restaurant_ids_dataframe are redundant to the stars in restaurant_review_frame, you could del restaurant_ids_dataframe['stars'].
In case anyone needs to merge two dataframes together on the index (instead of on another column), this also works!
T1 and T2 are dataframes that have the same indices:
import pandas as pd
T1 = pd.merge(T1, T2, left_index=True, right_index=True, how='outer')
P.S. I had to use merge because append would fill NaNs in unnecessarily.
If you want to merge two DataFrames horizontally, you can use this code:
df3 = pd.concat([df1, df2], axis=1, ignore_index=True, sort=False)
