python pandas data frame: single column to multiple columns based on values - python-3.x

I am new to pandas.
I am trying to split a single column into multiple columns based on an index value using groupby. Below is the program I wrote.
import pandas as pd

data = [(0, 1.1),
        (1, 1.2),
        (2, 1.3),
        (0, 2.1),
        (1, 2.2),
        (0, 3.1),
        (1, 3.2),
        (2, 3.3),
        (3, 3.4)]
df = pd.DataFrame(data, columns=['ID', 'test_data'])
df = df.groupby('ID', sort=True).apply(lambda g: pd.Series(g['test_data'].values))
print(df)
df = df.unstack(level=-1).rename(columns=lambda x: 'test_data%s' % x)
print(df)
I have to use unstack(level=-1) because, when the groups have uneven sizes, the groupby and Series steps store the result as shown below:
ID
0   0    1.1
    1    2.1
    2    3.1
1   0    1.2
    1    2.2
    2    3.2
2   0    1.3
    1    3.3
3   0    3.4
dtype: float64
The end result I am getting after unstack is:
    test_data0  test_data1  test_data2
ID
0          1.1         2.1         3.1
1          1.2         2.2         3.2
2          1.3         3.3         NaN
3          3.4         NaN         NaN
but what I am expecting is:
    test_data0  test_data1  test_data2
ID
0          1.1         2.1         3.1
1          1.2         2.2         3.2
2          1.3         NaN         3.3
3          NaN         NaN         3.4
Let me know if there is any better way to do this other than groupby.

This will work if your DataFrame is sorted as you show:
df['num_zeros_seen'] = df['ID'].eq(0).cumsum()
# reshape the table
df = df.pivot(
    index='ID',
    columns='num_zeros_seen',
    values='test_data',
)
print(df)
Output:
num_zeros_seen    1    2    3
ID
0               1.1  2.1  3.1
1               1.2  2.2  3.2
2               1.3  NaN  3.3
3               NaN  NaN  3.4
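For clarity, here is what the helper column looks like before the pivot (a sketch not in the original answer, rebuilt from the question's data): num_zeros_seen increments each time ID resets to 0, so it labels every row with the block it belongs to.
import pandas as pd

df = pd.DataFrame([(0, 1.1), (1, 1.2), (2, 1.3),
                   (0, 2.1), (1, 2.2),
                   (0, 3.1), (1, 3.2), (2, 3.3), (3, 3.4)],
                  columns=['ID', 'test_data'])
# A True (== 1) appears at the start of each block, so the running
# sum numbers the blocks 1, 2, 3, ...
df['num_zeros_seen'] = df['ID'].eq(0).cumsum()
print(df)
#    ID  test_data  num_zeros_seen
# 0   0        1.1               1
# 1   1        1.2               1
# 2   2        1.3               1
# 3   0        2.1               2
# 4   1        2.2               2
# 5   0        3.1               3
# 6   1        3.2               3
# 7   2        3.3               3
# 8   3        3.4               3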

Related

DataFrame of Dates into sequential dates

I would like to turn a dataframe like the following into a dataframe of sequential dates.
Date
01/25/1995
01/20/1995
01/20/1995
01/23/1995
into
      Date  Value  Cumsum
01/20/1995      2       2
01/21/1995      0       2
01/22/1995      0       2
01/23/1995      1       3
01/24/1995      0       3
01/25/1995      1       4
Try this:
df['Date'] = pd.to_datetime(df['Date'])
df_out = df.assign(Value=1).set_index('Date').resample('D').asfreq().fillna(0)
df_out = df_out.assign(Cumsum=df_out['Value'].cumsum())
print(df_out)
Output:
            Value  Cumsum
Date
1995-01-20    1.0     1.0
1995-01-21    0.0     1.0
1995-01-22    0.0     1.0
1995-01-23    1.0     2.0
1995-01-24    0.0     2.0
1995-01-25    1.0     3.0
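As the output shows, the .asfreq() route ends up with Value 1.0 for the duplicated 01/20/1995, where the question expected 2. If duplicate dates should be counted, summing a constant per day does it; a sketch, not part of the original answer:
import pandas as pd

df = pd.DataFrame({'Date': ['01/25/1995', '01/20/1995',
                            '01/20/1995', '01/23/1995']})
df['Date'] = pd.to_datetime(df['Date'])

# One 'Value' per row, summed within each day: duplicate dates count
# twice and days with no rows become 0 automatically.
df_out = df.assign(Value=1).set_index('Date').resample('D').sum()
df_out['Cumsum'] = df_out['Value'].cumsum()
print(df_out)
#             Value  Cumsum
# Date
# 1995-01-20      2       2
# 1995-01-21      0       2
# 1995-01-22      0       2
# 1995-01-23      1       3
# 1995-01-24      0       3
# 1995-01-25      1       4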

can DataFrames have "holes" and how to implement it?

Talking about pandas, we can have a DataFrame of the form
    a0   a1   a2
1  2.3  3.4  2.1
2  3.1  2.1  2.2
3  3.4  4.5  3.3
but is it possible to have something like
    a0   a1   a2
1  2.3  3.4  2.1
2       2.1  2.2
3  3.4       3.3
meaning no values (holes) inside the DataFrame?
How can I achieve something like this?
(My plan is later I will plot this including plotting nothing in the holes)
Are you trying to use NaN?
import pandas as pd
import numpy as np

df = pd.DataFrame({'a0': [2.3, np.nan, 3.4],
                   'a1': [3.4, 2.1, np.nan],
                   'a2': [2.1, 2.2, 3.3]})
print(df)
Output
    a0   a1   a2
0  2.3  3.4  2.1
1  NaN  2.1  2.2
2  3.4  NaN  3.3
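As for the plotting plan: this is exactly what NaN buys you, since line plots simply leave gaps wherever a value is missing. A minimal sketch (assuming matplotlib is installed; not part of the original answer):
import matplotlib.pyplot as plt

# Line segments are broken at NaN cells, so the holes show up
# as gaps in the plot with no extra handling needed.
df.plot(marker='o')
plt.show()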

Table has several columns with the same type of information

My table has 4 columns: order_id, item_id_1, item_id_2 and item_id_3. The last three columns cover the same type of information (the ids of products). I want to transform this table into a 2-column table with order_id and item_id, so each column covers a unique type of information. That means if a particular order_id had 3 products ordered, I will get three rows (instead of one) in my new table.
This will allow me, for example, to perform a groupby operation on the item_id column to count how many times a particular product was ordered.
What is this table transformation process called?
This kind of wide-to-long reshaping is usually called melting (or unpivoting). For example, if you have a dataframe like this:
df = pd.DataFrame({'order_id': [1, 2, 3],
                   'item_id_1': ['a', 'b', 'c'],
                   'item_id_2': ['x', 'y', ''],
                   'item_id_3': ['', 'q', '']})
df
   order_id item_id_1 item_id_2 item_id_3
0         1         a         x
1         2         b         y         q
2         3         c
import numpy as np

tidy = (pd.melt(df, id_vars=['order_id'],
                value_vars=['item_id_1', 'item_id_2', 'item_id_3'],
                var_name='item_slot', value_name='item_id')
          .replace('', np.nan)   # empty strings become NaN...
          .dropna()              # ...so slots with no item are dropped
          .sort_values('order_id')
          .reset_index(drop=True)
          [['order_id', 'item_id']])
print(tidy)
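That melted frame directly supports the counting the asker mentions. A hypothetical follow-up, assuming the result above is kept in tidy:
# How many times was each product ordered?
print(tidy.groupby('item_id').size())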
So I'm not aware of any method that expands rows automatically as you're suggesting, but you can easily reach your goal without one. Let's start from a similar data frame; I put NaN in the cells of items that have not been ordered:
import pandas as pd
import numpy as np

data = {'order_id': [1, 2, 3],
        'item_id_1': [11, 12, 13],
        'item_id_2': [21, np.nan, 23],
        'item_id_3': [31, np.nan, np.nan]}
df = pd.DataFrame(data)
cols = ['item_id_1', 'item_id_2', 'item_id_3']
print(df)
Out:
   order_id  item_id_1  item_id_2  item_id_3
0         1         11       21.0       31.0
1         2         12        NaN        NaN
2         3         13       23.0        NaN
Then you can define a new, empty data frame and fill it by iterating through the rows of the initial one. For every item, a new row is added to the empty data frame with the same order_id and a different item_id.
new_df = pd.DataFrame(columns=['order_id', 'item_id'])  # ,'item_num'
for ind, row in df.iterrows():
    new_row = {}
    new_row['order_id'] = row['order_id']
    for col in cols:  # for num, col in enumerate(cols):
        item = row[col]
        if not pd.isna(item):
            new_row['item_id'] = item
            # new_row['item_num'] = num + 1
            new_df = new_df.append(new_row, ignore_index=True)
print(new_df)
Out:  # shape (6, 2), correct because 6 items have been ordered
   order_id  item_id
0       1.0     11.0
1       1.0     21.0
2       1.0     31.0
3       2.0     12.0
4       3.0     13.0
5       3.0     23.0
If you want, you could also add a third column to keep track of the category of each item (i.e. whether it was item 1, 2 or 3) by uncommenting the lines in the code, which gives this output:
   order_id  item_id  item_num
0       1.0     11.0       1.0
1       1.0     21.0       2.0
2       1.0     31.0       3.0
3       2.0     12.0       1.0
4       3.0     13.0       1.0
5       3.0     23.0       2.0
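One caveat not in the original answer: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the loop above fails. Collecting the rows in a plain list and building the frame once is the forward-compatible (and faster) equivalent:
# Same logic as the loop above, without DataFrame.append
rows = []
for _, row in df.iterrows():
    for num, col in enumerate(cols, start=1):
        if not pd.isna(row[col]):
            rows.append({'order_id': row['order_id'],
                         'item_id': row[col],
                         'item_num': num})
new_df = pd.DataFrame(rows)
print(new_df)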

Pandas groupby and append the original values, then compute the mean of each row

I have a dataframe of IDs and Values, where the IDs mark repeated trials and the Values are the results.
I want to group by ID so that Values sharing the same ID are placed in adjacent columns, and then calculate the mean of each row.
>>> df
   ID  Value
0   1    1.1
1   2    1.2
2   3    2.4
3   1    1.7
4   2    4.3
5   3    2.2
>>> groups = df.groupby(by='ID')
# Now I cannot figure out what to do for my desired output.
I want the output like:
   ID  Value_1  Value_2  Mean
0   1      1.1      1.7  1.40
1   2      1.2      4.3  2.75
2   3      2.4      2.2  2.30
Create a per-group counter with GroupBy.cumcount and attach it via DataFrame.assign, reshape with DataFrame.pivot, rename the columns with DataFrame.add_prefix, add a new column holding the row means, and finally clean up with DataFrame.reset_index and DataFrame.rename_axis:
df = (df.assign(g=df.groupby('ID').cumcount().add(1))
        .pivot(index='ID', columns='g', values='Value')
        .add_prefix('Value_')
        .assign(Mean=lambda x: x.mean(axis=1))
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
   ID  Value_1  Value_2  Mean
0   1      1.1      1.7  1.40
1   2      1.2      4.3  2.75
2   3      2.4      2.2  2.30
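For reference, the intermediate counter g that drives the pivot looks like this when computed on the original df (illustration only, since the chain above reassigns df):
print(df.assign(g=df.groupby('ID').cumcount().add(1)))
#    ID  Value  g
# 0   1    1.1  1
# 1   2    1.2  1
# 2   3    2.4  1
# 3   1    1.7  2
# 4   2    4.3  2
# 5   3    2.2  2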
One possible solution, assuming that you have 2 rows for each ID:
Define a function to be applied to the groups:
def fn(grp):
    vals = grp.Value.values
    return [vals[0], vals[-1], grp.Value.mean()]
Then apply it and "move" the ID column from the index back to a regular column:
df2 = df.groupby('ID').apply(fn).apply(pd.Series).reset_index()
Finally, set proper column names:
df2.columns = ['ID', 'Value_1', 'Value_2', 'Mean']
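Under the same two-rows-per-ID assumption, named aggregation (pandas 0.25+) produces the same table more compactly; a sketch not taken from the original answers, run against the original df:
df2 = (df.groupby('ID')['Value']
         .agg(Value_1='first', Value_2='last', Mean='mean')
         .reset_index())
print(df2)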

Multiple-column difference of 2 Pandas DataFrames

I am new to Python and Pandas; can someone help me with the report below?
I want to report the difference of N columns and create new columns holding the difference values. Is it possible to make this dynamic, as I have more than 30 columns? (The set of columns is fixed; row values can change.)
A and B can be alphanumeric.
Use join with sub for the difference of the DataFrames:
# if the columns are strings, cast them first
df1 = df1.astype(int)
df2 = df2.astype(int)
# if the first columns are not indices:
# df1 = df1.set_index('ID')
# df2 = df2.set_index('ID')
df = df1.join(df2.sub(df1).add_prefix('sum'))
print(df)
    A    B  sumA  sumB
ID
0  10  2.0     5   3.0
1  11  3.0     6   5.0
2  12  4.0     7   5.0
Or similarly:
df = df1.join(df2.sub(df1), rsuffix='sum')
print(df)
    A    B  Asum  Bsum
ID
0  10  2.0     5   3.0
1  11  3.0     6   5.0
2  12  4.0     7   5.0
Detail:
print(df2.sub(df1))
   A    B
ID
0  5  3.0
1  6  5.0
2  7  5.0
IIUC:
df1[['C', 'D']] = (df2 - df1)[['A', 'B']]
df1
Out[868]:
   ID   A    B  C    D
0   0  10  2.0  5  3.0
1   1  11  3.0  6  5.0
2   2  12  4.0  7  5.0
df1.assign(B=0)
Out[869]:
   ID   A  B  C    D
0   0  10  0  5  3.0
1   1  11  0  6  5.0
2   2  12  0  7  5.0
The 'ID' column should really be an index. See the Pandas tutorial on indexing for why this is a good idea.
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df = df1.copy()
df[['C', 'D']] = df2 - df1
df['B'] = 0
print(df)
outputs
    A  B  C    D
ID
0  10  0  5  3.0
1  11  0  6  5.0
2  12  0  7  5.0
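To make this dynamic for 30+ columns, as the question asks, the subtraction can cover every shared column in one shot (a sketch assuming df1 and df2 share the same columns and index):
# Subtract all shared columns at once and attach the results
# with a '_diff' suffix; works for any number of columns.
diff = df2[df1.columns].sub(df1).add_suffix('_diff')
df = df1.join(diff)
print(df)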
