python pandas data frame: single column to multiple columns based on values - python-3.x

I am new to pandas.
I am trying to split a single column into multiple columns based on an index value using groupby. Below is the program I wrote.
import pandas as pd

data = [(0, 1.1),
        (1, 1.2),
        (2, 1.3),
        (0, 2.1),
        (1, 2.2),
        (0, 3.1),
        (1, 3.2),
        (2, 3.3),
        (3, 3.4)]
df = pd.DataFrame(data, columns=['ID', 'test_data'])
df = df.groupby('ID', sort=True).apply(lambda g: pd.Series(g['test_data'].values))
print(df)
df = df.unstack(level=-1).rename(columns=lambda x: 'test_data%s' % x)
print(df)
I have to use unstack(level=-1) because, when the groups have uneven sizes, the groupby and Series steps store the result as shown below:
ID
0   0    1.1
    1    2.1
    2    3.1
1   0    1.2
    1    2.2
    2    3.2
2   0    1.3
    1    3.3
3   0    3.4
dtype: float64
The end result I am getting after unstack is:
    test_data0  test_data1  test_data2
ID
0          1.1         2.1         3.1
1          1.2         2.2         3.2
2          1.3         3.3         NaN
3          3.4         NaN         NaN
but what I am expecting is:
    test_data0  test_data1  test_data2
ID
0          1.1         2.1         3.1
1          1.2         2.2         3.2
2          1.3         NaN         3.3
3          NaN         NaN         3.4
Let me know if there is any better way to do this other than groupby.

This will work if your DataFrame is sorted as you show:
df['num_zeros_seen'] = df['ID'].eq(0).cumsum()
# reshape the table
df = df.pivot(
    index='ID',
    columns='num_zeros_seen',
    values='test_data',
)
print(df)
Output:
num_zeros_seen    1    2    3
ID
0               1.1  2.1  3.1
1               1.2  2.2  3.2
2               1.3  NaN  3.3
3               NaN  NaN  3.4
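For clarity, here is what the helper column looks like before the pivot (a sketch not in the original answer, rebuilt from the question's data): num_zeros_seen increments each time ID resets to 0, so it labels every row with the block it belongs to.
import pandas as pd

df = pd.DataFrame([(0, 1.1), (1, 1.2), (2, 1.3),
                   (0, 2.1), (1, 2.2),
                   (0, 3.1), (1, 3.2), (2, 3.3), (3, 3.4)],
                  columns=['ID', 'test_data'])
# A True (== 1) appears at the start of each block, so the running
# sum numbers the blocks 1, 2, 3, ...
df['num_zeros_seen'] = df['ID'].eq(0).cumsum()
print(df)
#    ID  test_data  num_zeros_seen
# 0   0        1.1               1
# 1   1        1.2               1
# 2   2        1.3               1
# 3   0        2.1               2
# 4   1        2.2               2
# 5   0        3.1               3
# 6   1        3.2               3
# 7   2        3.3               3
# 8   3        3.4               3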

Related

DataFrame of Dates into sequential dates

I would like to turn a dataframe like the following into a dataframe of sequential dates.
Date
01/25/1995
01/20/1995
01/20/1995
01/23/1995
into
      Date  Value  Cumsum
01/20/1995      2       2
01/21/1995      0       2
01/22/1995      0       2
01/23/1995      1       3
01/24/1995      0       3
01/25/1995      1       4
Try this:
df['Date'] = pd.to_datetime(df['Date'])
df_out = df.assign(Value=1).set_index('Date').resample('D').asfreq().fillna(0)
df_out = df_out.assign(Cumsum=df_out['Value'].cumsum())
print(df_out)
Output:
            Value  Cumsum
Date
1995-01-20    1.0     1.0
1995-01-21    0.0     1.0
1995-01-22    0.0     1.0
1995-01-23    1.0     2.0
1995-01-24    0.0     2.0
1995-01-25    1.0     3.0
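As the output shows, the .asfreq() route ends up with Value 1.0 for the duplicated 01/20/1995, where the question expected 2. If duplicate dates should be counted, summing a constant per day does it; a sketch, not part of the original answer:
import pandas as pd

df = pd.DataFrame({'Date': ['01/25/1995', '01/20/1995',
                            '01/20/1995', '01/23/1995']})
df['Date'] = pd.to_datetime(df['Date'])

# One 'Value' per row, summed within each day: duplicate dates count
# twice and days with no rows become 0 automatically.
df_out = df.assign(Value=1).set_index('Date').resample('D').sum()
df_out['Cumsum'] = df_out['Value'].cumsum()
print(df_out)
#             Value  Cumsum
# Date
# 1995-01-20      2       2
# 1995-01-21      0       2
# 1995-01-22      0       2
# 1995-01-23      1       3
# 1995-01-24      0       3
# 1995-01-25      1       4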

can DataFrames have "holes" and how to implement it?

Talking about pandas, we can have a DataFrame of the form
    a0   a1   a2
1  2.3  3.4  2.1
2  3.1  2.1  2.2
3  3.4  4.5  3.3
but is it possible to have something like
    a0   a1   a2
1  2.3  3.4  2.1
2       2.1  2.2
3  3.4       3.3
meaning no values (holes) inside the DataFrame?
How can I achieve something like this?
(My plan is later I will plot this including plotting nothing in the holes)
Are you trying to use NaN?
import pandas as pd
import numpy as np

df = pd.DataFrame({'a0': [2.3, np.nan, 3.4],
                   'a1': [3.4, 2.1, np.nan],
                   'a2': [2.1, 2.2, 3.3]})
print(df)
Output
    a0   a1   a2
0  2.3  3.4  2.1
1  NaN  2.1  2.2
2  3.4  NaN  3.3
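As for the plotting plan: this is exactly what NaN buys you, since line plots simply leave gaps wherever a value is missing. A minimal sketch (assuming matplotlib is installed; not part of the original answer):
import matplotlib.pyplot as plt

# Line segments are broken at NaN cells, so the holes show up
# as gaps in the plot with no extra handling needed.
df.plot(marker='o')
plt.show()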

Table has several columns with the same type of information

My table has 4 columns: order_id, item_id_1, item_id_2 and item_id_3. The last three columns cover the same type of information (the ids of products). I want to transform this table into a 2-column table with order_id and item_id, so each column covers a unique type of information. That means if a particular order_id had 3 products ordered, I will get three rows (instead of one) in my new table.
This will allow me, for example, to perform a groupby operation on the item_id column to count how many times a particular product was ordered.
What is this table transformation process called?
This kind of wide-to-long reshaping is usually called melting (or unpivoting). For example, if you have a dataframe like this:
df = pd.DataFrame({'order_id': [1, 2, 3],
                   'item_id_1': ['a', 'b', 'c'],
                   'item_id_2': ['x', 'y', ''],
                   'item_id_3': ['', 'q', '']})
df
   order_id item_id_1 item_id_2 item_id_3
0         1         a         x
1         2         b         y         q
2         3         c
import numpy as np

tidy = (pd.melt(df, id_vars=['order_id'],
                value_vars=['item_id_1', 'item_id_2', 'item_id_3'],
                var_name='item_slot', value_name='item_id')
          .replace('', np.nan)   # empty strings become NaN...
          .dropna()              # ...so slots with no item are dropped
          .sort_values('order_id')
          .reset_index(drop=True)
          [['order_id', 'item_id']])
print(tidy)
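That melted frame directly supports the counting the asker mentions. A hypothetical follow-up, assuming the result above is kept in tidy:
# How many times was each product ordered?
print(tidy.groupby('item_id').size())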
So I'm not aware of any method that expands rows automatically as you're suggesting, but you can easily reach your goal without one. Let's start from a similar data frame; I put NaN in the cells of items that have not been ordered:
import pandas as pd
import numpy as np

data = {'order_id': [1, 2, 3],
        'item_id_1': [11, 12, 13],
        'item_id_2': [21, np.nan, 23],
        'item_id_3': [31, np.nan, np.nan]}
df = pd.DataFrame(data)
cols = ['item_id_1', 'item_id_2', 'item_id_3']
print(df)
Out:
   order_id  item_id_1  item_id_2  item_id_3
0         1         11       21.0       31.0
1         2         12        NaN        NaN
2         3         13       23.0        NaN
Then you can define a new, empty data frame and fill it by iterating through the rows of the initial one. For every item, a new row is added to the empty data frame with the same order_id and a different item_id.
new_df = pd.DataFrame(columns=['order_id', 'item_id'])  # ,'item_num'
for ind, row in df.iterrows():
    new_row = {}
    new_row['order_id'] = row['order_id']
    for col in cols:  # for num, col in enumerate(cols):
        item = row[col]
        if not pd.isna(item):
            new_row['item_id'] = item
            # new_row['item_num'] = num + 1
            new_df = new_df.append(new_row, ignore_index=True)
print(new_df)
Out:  # shape (6, 2), correct because 6 items have been ordered
   order_id  item_id
0       1.0     11.0
1       1.0     21.0
2       1.0     31.0
3       2.0     12.0
4       3.0     13.0
5       3.0     23.0
If you want, you could also add a third column to keep track of the category of each item (i.e. whether it was item 1, 2 or 3) by uncommenting the lines in the code, which gives this output:
   order_id  item_id  item_num
0       1.0     11.0       1.0
1       1.0     21.0       2.0
2       1.0     31.0       3.0
3       2.0     12.0       1.0
4       3.0     13.0       1.0
5       3.0     23.0       2.0
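One caveat not in the original answer: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the loop above fails. Collecting the rows in a plain list and building the frame once is the forward-compatible (and faster) equivalent:
# Same logic as the loop above, without DataFrame.append
rows = []
for _, row in df.iterrows():
    for num, col in enumerate(cols, start=1):
        if not pd.isna(row[col]):
            rows.append({'order_id': row['order_id'],
                         'item_id': row[col],
                         'item_num': num})
new_df = pd.DataFrame(rows)
print(new_df)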

Pandas groupby and append the original values, then compute the mean of each row

I have a dataframe of IDs and Values, where the IDs mark repeated trials and the Values are the results.
I want to group by ID so that Values sharing the same ID are placed in adjacent columns, and then calculate the mean of each row.
>>> df
   ID  Value
0   1    1.1
1   2    1.2
2   3    2.4
3   1    1.7
4   2    4.3
5   3    2.2
>>> groups = df.groupby(by='ID')
# Now I cannot figure out what to do for my desired output.
I want the output like:
   ID  Value_1  Value_2  Mean
0   1      1.1      1.7  1.40
1   2      1.2      4.3  2.75
2   3      2.4      2.2  2.30
Create a per-group counter with GroupBy.cumcount and attach it via DataFrame.assign, reshape with DataFrame.pivot, rename the columns with DataFrame.add_prefix, add a new column holding the row means, and finally clean up with DataFrame.reset_index and DataFrame.rename_axis:
df = (df.assign(g=df.groupby('ID').cumcount().add(1))
        .pivot(index='ID', columns='g', values='Value')
        .add_prefix('Value_')
        .assign(Mean=lambda x: x.mean(axis=1))
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
   ID  Value_1  Value_2  Mean
0   1      1.1      1.7  1.40
1   2      1.2      4.3  2.75
2   3      2.4      2.2  2.30
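For reference, the intermediate counter g that drives the pivot looks like this when computed on the original df (illustration only, since the chain above reassigns df):
print(df.assign(g=df.groupby('ID').cumcount().add(1)))
#    ID  Value  g
# 0   1    1.1  1
# 1   2    1.2  1
# 2   3    2.4  1
# 3   1    1.7  2
# 4   2    4.3  2
# 5   3    2.2  2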
One possible solution, assuming that you have 2 rows for each ID:
Define a function to be applied to the groups:
def fn(grp):
    vals = grp.Value.values
    return [vals[0], vals[-1], grp.Value.mean()]
Then apply it and "move" the ID column from the index back to a regular column:
df2 = df.groupby('ID').apply(fn).apply(pd.Series).reset_index()
Finally, set proper column names:
df2.columns = ['ID', 'Value_1', 'Value_2', 'Mean']
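Under the same two-rows-per-ID assumption, named aggregation (pandas 0.25+) produces the same table more compactly; a sketch not taken from the original answers, run against the original df:
df2 = (df.groupby('ID')['Value']
         .agg(Value_1='first', Value_2='last', Mean='mean')
         .reset_index())
print(df2)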

Multiple-column difference of 2 Pandas DataFrames

I am new to Python and Pandas; can someone help me with the report below?
I want to report the difference of N columns and create new columns holding the difference values. Is it possible to make this dynamic, as I have more than 30 columns? (The set of columns is fixed; row values can change.)
A and B can be alphanumeric.
Use join with sub for the difference of the DataFrames:
# if the columns are strings, cast them first
df1 = df1.astype(int)
df2 = df2.astype(int)
# if the first columns are not indices:
# df1 = df1.set_index('ID')
# df2 = df2.set_index('ID')
df = df1.join(df2.sub(df1).add_prefix('sum'))
print(df)
    A    B  sumA  sumB
ID
0  10  2.0     5   3.0
1  11  3.0     6   5.0
2  12  4.0     7   5.0
Or similarly:
df = df1.join(df2.sub(df1), rsuffix='sum')
print(df)
    A    B  Asum  Bsum
ID
0  10  2.0     5   3.0
1  11  3.0     6   5.0
2  12  4.0     7   5.0
Detail:
print(df2.sub(df1))
   A    B
ID
0  5  3.0
1  6  5.0
2  7  5.0
IIUC:
df1[['C', 'D']] = (df2 - df1)[['A', 'B']]
df1
Out[868]:
   ID   A    B  C    D
0   0  10  2.0  5  3.0
1   1  11  3.0  6  5.0
2   2  12  4.0  7  5.0
df1.assign(B=0)
Out[869]:
   ID   A  B  C    D
0   0  10  0  5  3.0
1   1  11  0  6  5.0
2   2  12  0  7  5.0
The 'ID' column should really be an index. See the Pandas tutorial on indexing for why this is a good idea.
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df = df1.copy()
df[['C', 'D']] = df2 - df1
df['B'] = 0
print(df)
outputs
    A  B  C    D
ID
0  10  0  5  3.0
1  11  0  6  5.0
2  12  0  7  5.0
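To make this dynamic for 30+ columns, as the question asks, the subtraction can cover every shared column in one shot (a sketch assuming df1 and df2 share the same columns and index):
# Subtract all shared columns at once and attach the results
# with a '_diff' suffix; works for any number of columns.
diff = df2[df1.columns].sub(df1).add_suffix('_diff')
df = df1.join(diff)
print(df)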
