Add random data between rows in a pandas dataframe - python-3.x

I've got a pandas dataframe like this. This contains a timestamp, id, foo and bar.
The timestamp data is around every 10 minutes.
timestamp            id   foo   bar
2019-04-14 00:00:10   1  0.10  0.05
2019-04-14 00:10:02   1  0.30  0.10
2019-04-14 00:00:00   2  0.10  0.05
2019-04-14 00:10:00   2  0.30  0.10
For each id, I'd like to create 5 additional rows with timestamp split equally between successive rows, and foo & bar values containing random values between the successive rows.
The start time should be the earliest timestamp for each id and the end time should be the latest timestamp for each id
So the output would be like this.
timestamp            id   foo   bar
2019-04-14 00:00:10   1  0.10  0.05
2019-04-14 00:02:10   1  0.14  0.06
2019-04-14 00:04:10   1  0.11  0.06
2019-04-14 00:06:10   1  0.29  0.07
2019-04-14 00:08:10   1  0.22  0.09
2019-04-14 00:10:02   1  0.30  0.10
2019-04-14 00:00:00   2  0.80  0.50
2019-04-14 00:02:00   2  0.45  0.48
2019-04-14 00:04:00   2  0.52  0.42
2019-04-14 00:06:00   2  0.74  0.48
2019-04-14 00:08:00   2  0.41  0.45
2019-04-14 00:10:00   2  0.40  0.40
I can reindex the timestamp column and create additional timestamp rows (e.g. Pandas create new date rows and forward fill column values).
But I can't seem to wrap my head around how to compute the random values for foo and bar between the successive rows.
Appreciate if someone can point me in the right direction!

Close. What you need is date_range with DataFrame.reindex, using the first and last values of the DatetimeIndex:
import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])
df = (df.set_index('timestamp')
        .groupby('id')[['foo', 'bar']]
        .apply(lambda x: x.reindex(pd.date_range(x.index[0], x.index[-1], periods=6))))
Then create a helper DataFrame of the same shape as the original, filled with random values, and use DataFrame.fillna to fill the missing values:
import numpy as np

# random values in [0, 1) with the same shape, index and columns as df
df1 = pd.DataFrame(np.random.rand(*df.shape), index=df.index, columns=df.columns)
df = df.fillna(df1)
print (df)
foo bar
id
1 2019-04-14 00:00:10.000 0.100000 0.050000
2019-04-14 00:02:08.400 0.903435 0.755841
2019-04-14 00:04:06.800 0.956002 0.253878
2019-04-14 00:06:05.200 0.388454 0.257639
2019-04-14 00:08:03.600 0.225535 0.195306
2019-04-14 00:10:02.000 0.300000 0.100000
2 2019-04-14 00:00:00.000 0.100000 0.050000
2019-04-14 00:02:00.000 0.180865 0.327581
2019-04-14 00:04:00.000 0.417956 0.414400
2019-04-14 00:06:00.000 0.012686 0.800948
2019-04-14 00:08:00.000 0.716216 0.941396
2019-04-14 00:10:00.000 0.300000 0.100000
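The random fill above is uniform on [0, 1), which ignores the requirement that the new foo and bar values lie between the surrounding values. A minimal sketch of one way to respect each id's own range, assuming df is the reindexed frame that still contains NaNs (i.e. before the fillna call) and fill_random_between is a hypothetical helper:
import numpy as np
import pandas as pd

# draw the random fill between each group's own per-column min and max
def fill_random_between(group):
    lo, hi = group.min(), group.max()                     # per-column bounds
    rand = pd.DataFrame(np.random.rand(*group.shape),
                        index=group.index, columns=group.columns)
    return group.fillna(lo + rand * (hi - lo))            # scaled into [lo, hi]

df = df.groupby(level=0, group_keys=False).apply(fill_random_between)
This keeps every generated value inside the [min, max] band observed for that id, though it does not force each value to lie between its two immediate neighbours.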

If the 'randomness' is not crucial, we can use Series.interpolate, which keeps the values between your min and max per group:
# build a 6-point date range per id and reindex each group onto it
df_new = pd.concat([
    d.reindex(pd.date_range(d.timestamp.min(), d.timestamp.max(), periods=6))
    for _, d in df.groupby('id')
])

# move the generated timestamps back into a column and merge the original rows back in
df_new['timestamp'] = df_new.index
df_new.reset_index(drop=True, inplace=True)
df_new = df_new[['timestamp']].merge(df, on='timestamp', how='left')

# forward-fill the id and interpolate foo/bar linearly between the known rows
df_new['id'] = df_new['id'].ffill()
df_new[['foo', 'bar']] = df_new[['foo', 'bar']].apply(lambda x: x.interpolate())
Which gives the following output:
print(df_new)
timestamp id foo bar
0 2019-04-14 00:00:10.000 1.0 0.10 0.05
1 2019-04-14 00:02:08.400 1.0 0.14 0.06
2 2019-04-14 00:04:06.800 1.0 0.18 0.07
3 2019-04-14 00:06:05.200 1.0 0.22 0.08
4 2019-04-14 00:08:03.600 1.0 0.26 0.09
5 2019-04-14 00:10:02.000 1.0 0.30 0.10
6 2019-04-14 00:00:00.000 2.0 0.10 0.05
7 2019-04-14 00:02:00.000 2.0 0.14 0.06
8 2019-04-14 00:04:00.000 2.0 0.18 0.07
9 2019-04-14 00:06:00.000 2.0 0.22 0.08
10 2019-04-14 00:08:00.000 2.0 0.26 0.09
11 2019-04-14 00:10:00.000 2.0 0.30 0.10
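If you still want some randomness on top of the interpolated trend, one possible tweak (my own variation, not part of the answer above) is to flag which rows were inserted before calling interpolate and then perturb only those rows, replacing the final interpolate line of the snippet above:
import numpy as np

# rows created by the reindex still have NaN in foo at this point
inserted = df_new['foo'].isna()
df_new[['foo', 'bar']] = df_new[['foo', 'bar']].interpolate()
# the +/-0.02 noise range is arbitrary; clip afterwards if the values
# must stay strictly inside each pair's original range
df_new.loc[inserted, ['foo', 'bar']] += np.random.uniform(
    -0.02, 0.02, size=(inserted.sum(), 2))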

Related

Rescale pandas column based on a value within that column?

I'm trying to normalize a column of data to 1 based on an internal standard control across several batches of data. However, I'm struggling to do this natively in pandas without splitting things into multiple chunks with for loops.
import pandas as pd

Test_Data = {"Sample": ["Control", "Test1", "Test2", "Test3", "Test4",
                        "Control", "Test1", "Test2", "Test3", "Test4"],
             "Batch": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
             "Input": [0.1, 0.15, 0.08, 0.11, 0.2, 0.15, 0.1, 0.04, 0.11, 0.2],
             "Output": [0.1, 0.6, 0.08, 0.22, 0.01, 0.08, 0.22, 0.02, 0.13, 0.004]}
DB = pd.DataFrame(Test_Data)
DB.loc[:, "Ratio"] = DB["Output"] / DB["Input"]
DB:
Sample Batch Input Output Ratio
0 Control A 0.10 0.100 1.000000
1 Test1 A 0.15 0.600 4.000000
2 Test2 A 0.08 0.080 1.000000
3 Test3 A 0.11 0.220 2.000000
4 Test4 A 0.20 0.010 0.050000
5 Control B 0.15 0.080 0.533333
6 Test1 B 0.10 0.220 2.200000
7 Test2 B 0.04 0.020 0.500000
8 Test3 B 0.11 0.130 1.181818
9 Test4 B 0.20 0.004 0.020000
My desired output would be to normalize each ratio per Batch based on the Control sample, effectively multiplying all the Batch "B" samples by 1.875.
DB:
Sample Batch Input Output Ratio Norm_Ratio
0 Control A 0.10 0.100 1.000000 1.000000
1 Test1 A 0.15 0.600 4.000000 4.000000
2 Test2 A 0.08 0.080 1.000000 1.000000
3 Test3 A 0.11 0.220 2.000000 2.000000
4 Test4 A 0.20 0.010 0.050000 0.050000
5 Control B 0.15 0.080 0.533333 1.000000
6 Test1 B 0.10 0.220 2.200000 4.125000
7 Test2 B 0.04 0.020 0.500000 0.937500
8 Test3 B 0.11 0.130 1.181818 2.215909
9 Test4 B 0.20 0.004 0.020000 0.037500
I can do this by breaking up the dataframe using for loops and manually extracting the "Control" values, but this is slow and messy for large datasets.
Use where and groupby.transform:
# mask non-Control rows to NaN, then broadcast each Batch's Control ratio
# over its group with transform('first') and divide by it
DB['Norm_Ratio'] = DB['Ratio'].div(
    DB['Ratio'].where(DB['Sample'].eq('Control'))
               .groupby(DB['Batch']).transform('first')
)
Output:
Sample Batch Input Output Ratio Norm_Ratio
0 Control A 0.10 0.100 1.000000 1.000000
1 Test1 A 0.15 0.600 4.000000 4.000000
2 Test2 A 0.08 0.080 1.000000 1.000000
3 Test3 A 0.11 0.220 2.000000 2.000000
4 Test4 A 0.20 0.010 0.050000 0.050000
5 Control B 0.15 0.080 0.533333 1.000000
6 Test1 B 0.10 0.220 2.200000 4.125000
7 Test2 B 0.04 0.020 0.500000 0.937500
8 Test3 B 0.11 0.130 1.181818 2.215909
9 Test4 B 0.20 0.004 0.020000 0.037500
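If the where/groupby.transform combination feels opaque, an equivalent two-step version (a sketch, not part of the answer above) builds a Batch-to-Control-ratio mapping first and then divides by it:
# Series mapping each Batch to its Control ratio, e.g. A -> 1.0, B -> 0.533333
control = DB.loc[DB['Sample'].eq('Control')].set_index('Batch')['Ratio']
DB['Norm_Ratio'] = DB['Ratio'] / DB['Batch'].map(control)
This produces the same Norm_Ratio column, at the cost of an extra intermediate Series.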

How to groupby a dataframe based on one column and transpose based on another column

I have a dataframe with values as:
col_1  Timestamp   data_1  data_2
aaa    22/12/2001  0.21    0.2
abb    22/12/2001  0.20    0
acc    22/12/2001  0.12    0.19
aaa    23/12/2001  0.23    0.21
abb    23/12/2001  0.32    0.18
acc    23/12/2001  0.52    0.20
I need to group the dataframe based on the timestamp and add columns w.r.t the col_1 column for data_1 and data_2 such as:
Timestamp   aaa_data_1  abb_data_1  acc_data_1  aaa_data_2  abb_data_2  acc_data_2
22/12/2001  0.21        0.20        0.12        0.2         0           0.19
23/12/2001  0.23        0.32        0.52        0.21        0.18        0.20
I am able to group by the timestamp, but I can't find a way to update/add the columns.
And with df.pivot(index='Timestamp', columns='col_1'), I get
Timestamp   data_1  data_2
22/12/2001  0.12    0.19
22/12/2001  0.20    0
22/12/2001  0.21    0.2
23/12/2001  0.52    0.20
23/12/2001  0.32    0.18
23/12/2001  0.23    0.21
A pivot plus a column rename is all you need:
result = df.pivot(index='Timestamp', columns='col_1')
result.columns = [f'{col_1}_{data}' for data, col_1 in result.columns]
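If you also want Timestamp back as an ordinary column (as in the desired output), a reset_index afterwards is enough; this extra line is my addition, not part of the quoted answer:
# Timestamp is the index after the pivot; bring it back as a column if needed
result = result.reset_index()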
@CodeDifferent's answer suffices, since your data does not need aggregation; an alternative option is the dev version of pivot_wider from pyjanitor (these are wrappers around pandas functions):
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn

df.pivot_wider(index='Timestamp',
               names_from='col_1',
               levels_order=['col_1', None],
               names_sep='_')
Timestamp aaa_data_1 abb_data_1 acc_data_1 aaa_data_2 abb_data_2 acc_data_2
0 22/12/2001 0.21 0.20 0.12 0.20 0.00 0.19
1 23/12/2001 0.23 0.32 0.52 0.21 0.18 0.20
This will fail if there are duplicates in the combination of index and names_from; in that case you can use pivot_table, which takes care of duplicates:
(df.pivot_table(index='Timestamp', columns='col_1')
   .swaplevel(axis=1)
   .pipe(lambda df: df.set_axis(df.columns.map('_'.join), axis=1))
)
aaa_data_1 abb_data_1 acc_data_1 aaa_data_2 abb_data_2 acc_data_2
Timestamp
22/12/2001 0.21 0.20 0.12 0.20 0.00 0.19
23/12/2001 0.23 0.32 0.52 0.21 0.18 0.20
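Note that pivot_table aggregates duplicate (Timestamp, col_1) pairs with the mean by default; if averaging is not what you want, pass a different aggfunc (the choice of 'first' below is just an illustration):
out = (df.pivot_table(index='Timestamp', columns='col_1', aggfunc='first')
         .swaplevel(axis=1)
         .pipe(lambda d: d.set_axis(d.columns.map('_'.join), axis=1)))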
Or with a helper method from pyjanitor, for a bit of a cleaner method-chaining syntax:
(df.pivot_table(index='Timestamp', columns='col_1')
   .swaplevel(axis=1)
   .collapse_levels()
)

Pandas: How to separate a large df into multiple dfs based on column value

I was wondering whether there is a way to separate the table below into multiple sub-dfs using the periodicity in the first column, e.g. between ~5, ..., ~0.
before:
a b c
5.10 1.00 0.00
4.20 2.00 0.00
3.01 3.00 0.00
2.10 4.00 0.00
1.20 5.00 0.00
0.52 6.00 0.00
0.02 6.00 1.00
5.30 7.00 0.40
4.20 8.00 0.00
3.10 9.00 0.00
2.40 10.00 0.00
1.30 11.00 0.00
0.20 12.00 0.00
5.98 13.00 0.00
4.23 14.00 0.30
3.33 15.00 0.00
2.11 16.00 0.00
1.30 17.00 0.00
0.30 18.00 0.00
5.50 13.00 0.00
output after separating into multiple dfs:
"sub_df1"
5.10 1.00 0.00
4.20 2.00 0.00
3.01 3.00 0.00
2.10 4.00 0.00
1.20 5.00 0.00
0.52 6.00 0.00
0.02 6.00 0.00
"sub_df2"
5.30 7.00 0.00
4.20 8.00 0.00
3.10 9.00 0.00
2.40 10.00 0.00
1.30 11.00 0.00
0.20 12.00 0.00
"sub_df3"
5.98 13.00 0.00
4.23 14.00 0.00
3.33 15.00 0.00
2.11 16.00 0.00
1.30 17.00 0.00
0.30 18.00 0.00
"sub_df4"
5.50 13.00 0.00
The periodicity is variable in length, so I cannot assume a fixed length to separate on. Therefore, I thought of first adding another column 'id' like
df['id'] = (df['a'].shift(1) > df['a']).astype(int)
which would at least show me from where (the first 0) to where (the next 0) to collect the values. However, I don't quite know how to continue from here:
a b c id
0 4.20 2.0 0.0 0
1 3.01 3.0 0.0 1
2 2.10 4.0 0.0 1
3 1.20 5.0 0.0 1
4 0.52 6.0 0.0 1
5 0.02 6.0 1.0 1
6 5.30 7.0 0.4 0
7 4.20 8.0 0.0 1
8 3.10 9.0 0.0 1
9 2.40 10.0 0.0 1
10 1.30 11.0 0.0 1
11 0.20 12.0 0.0 1
12 5.98 13.0 0.0 0
13 4.23 14.0 0.3 1
14 3.33 15.0 0.0 1
15 2.11 16.0 0.0 1
16 1.30 17.0 0.0 1
17 0.30 18.0 0.0 1
18 5.50 13.0 0.0 0
You can create a helper Series s to identify the different groups. From there, you can create multiple dataframes and add them to a dictionary of dataframes, df_dict. I show you how to access these in the print statement:
s = (df['a'] > df['a'].shift()).cumsum() + 1

df_dict = {}
for frame, data in df.groupby(s):
    df_dict[f'df{frame}'] = data

print(df_dict['df1'], '\n\n',
      df_dict['df2'], '\n\n',
      df_dict['df3'], '\n\n',
      df_dict['df4'])
a b c
0 5.10 1.0 0.0
1 4.20 2.0 0.0
2 3.01 3.0 0.0
3 2.10 4.0 0.0
4 1.20 5.0 0.0
5 0.52 6.0 0.0
6 0.02 6.0 1.0
a b c
7 5.3 7.0 0.4
8 4.2 8.0 0.0
9 3.1 9.0 0.0
10 2.4 10.0 0.0
11 1.3 11.0 0.0
12 0.2 12.0 0.0
a b c
13 5.98 13.0 0.0
14 4.23 14.0 0.3
15 3.33 15.0 0.0
16 2.11 16.0 0.0
17 1.30 17.0 0.0
18 0.30 18.0 0.0
a b c
19 5.5 13.0 0.0
Try this:
listofdfs = [y for x,y in df.groupby(df['a'].diff().gt(0).cumsum())]
or
dict(list(df.groupby(df['a'].diff().gt(0).cumsum())))
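If the goal is to persist the pieces rather than keep them all in memory, you could, for example, iterate over the dictionary version and write each one out (the file names here are just an illustration):
groups = dict(list(df.groupby(df['a'].diff().gt(0).cumsum())))
for key, sub_df in groups.items():
    sub_df.to_csv(f'sub_df{key + 1}.csv', index=False)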

Get first and last Index of a Pandas DataFrame subset

I have some data in a pandas DataFrame that looks like this.
df =
A B
time
0.1 10.0 1
0.15 12.1 2
0.19 4.0 2
0.21 5.0 2
0.22 6.0 2
0.25 7.0 1
0.3 8.1 1
0.4 9.45 2
0.5 3.0 1
Based on the following condition, I am looking for a generic solution to find the first and last index of every subset.
cond = df.B == 2
So far I have tried using groupby, but without the expected result.
df_1 = cond.reset_index()
df_2 = df_1.groupby(df_1['B']).agg(['first','last']).reset_index()
This is the output I got.
B time
first last
0 False 0.1 0.5
1 True 0.15 0.4
This is the output I'd like to get.
B time
first last
0 False 0.1 0.1
1 True 0.15 0.22
2 False 0.25 0.3
3 True 0.4 0.4
4 False 0.5 0.5
How can I accomplish this by a more or less generic approach?
Create a helper Series with Series.shift, Series.ne and Series.cumsum to group consecutive values, then pass a dictionary to agg for the aggregation:
df_1 = df.reset_index()
df_1.B = df_1.B == 2
g = df_1.B.ne(df_1.B.shift()).cumsum()
df_2 = df_1.groupby(g).agg({'B':'first','time': ['first','last']}).reset_index(drop=True)
print (df_2)
B time
first first last
0 False 0.10 0.10
1 True 0.15 0.22
2 False 0.25 0.30
3 True 0.40 0.40
4 False 0.50 0.50
If you want to avoid a MultiIndex, use named aggregation:
df_1 = df.reset_index()
df_1.B = df_1.B == 2
g = df_1.B.ne(df_1.B.shift()).cumsum()
df_2 = df_1.groupby(g).agg(B=('B', 'first'),
                           first=('time', 'first'),
                           last=('time', 'last')).reset_index(drop=True)
print(df_2)
B first last
0 False 0.10 0.10
1 True 0.15 0.22
2 False 0.25 0.30
3 True 0.40 0.40
4 False 0.50 0.50
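If you only care about the stretches where the original condition holds (B == 2), you can simply filter the named-aggregation result; this last step is my addition:
only_true = df_2[df_2['B']].drop(columns='B').reset_index(drop=True)
print(only_true)
#    first  last
# 0   0.15  0.22
# 1   0.40  0.40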

get the value from another values if value is nan [duplicate]

I am trying to create a column which contains, for each row, the minimum over a few columns. For example:
A0 A1 A2 B0 B1 B2 C0 C1
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72
Here I am trying to create a column which contains the minimum for each row of columns B0, B1, B2.
The output would look like this:
A0 A1 A2 B0 B1 B2 C0 C1 Minimum
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75 0.42
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73 0.00
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03 0.51
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61 0.51
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53 0.17
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72 0.01
Here is part of the code, but it is not doing what I want it to do:
for i in range(0, 2):
    df['Minimum'] = df.loc[0, 'B' + str(i)].min()
This is a one-liner; you just need to use the axis argument of min to tell it to work across the columns rather than down them:
df['Minimum'] = df.loc[:, ['B0', 'B1', 'B2']].min(axis=1)
If you need to use this solution for different numbers of columns, you can use a for loop or list comprehension to construct the list of columns:
n_columns = 2
cols_to_use = ['B' + str(i) for i in range(n_columns)]
df['Minimum'] = df.loc[:, cols_to_use].min(axis=1)
For my tasks, a universal and flexible approach is the following example:
df['Minimum'] = df[['B0', 'B1', 'B2']].apply(lambda x: min(x[0],x[1],x[2]), axis=1)
The target column 'Minimum' is assigned the result of the lambda function applied to the selected columns ['B0', 'B1', 'B2']. Inside the function, elements are accessed through the lambda argument and its index (when there is more than one element). Be sure to specify axis=1, which tells apply to work row by row.
This is very convenient when you need to make complex calculations.
However, such a solution is likely to be slower.
As for the selection of columns, in addition to the for-loop approach, I can suggest using a filter like this:
cols_to_use = list(filter(lambda f: 'B' in f, df.columns))
Literally, a filter is applied to the list of DataFrame columns through a lambda function that checks for the occurrence of the letter 'B'.
After that, the first example can be written as follows:
cols_to_use = list(filter(lambda f: 'B' in f, df.columns))
df['Minimum'] = df[cols_to_use].apply(lambda x: min(x), axis=1)
Although after pre-selecting the columns, this would be preferable:
df['Minimum'] = df[cols_to_use].min(axis=1)
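If the columns of interest all share the 'B' prefix, DataFrame.filter offers an even shorter way to select them (an alternative of mine, not from the answers above):
df['Minimum'] = df.filter(like='B').min(axis=1)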
