I'm trying to normalize a column of data to 1, based on an internal standard control, across several batches of data. However, I'm struggling to do this natively in pandas without splitting things into multiple chunks with for loops.
import pandas as pd
Test_Data = {"Sample":["Control","Test1","Test2","Test3","Test4","Control","Test1","Test2","Test3","Test4"],
"Batch":["A","A","A","A","A","B","B","B","B","B"],
"Input":[0.1,0.15,0.08,0.11,0.2,0.15,0.1,0.04,0.11,0.2],
"Output":[0.1,0.6,0.08,0.22,0.01,0.08,0.22,0.02,0.13,0.004]}
DB = pd.DataFrame(Test_Data)
DB.loc[:,"Ratio"] = DB["Output"]/DB["Input"]
DB:
Sample Batch Input Output Ratio
0 Control A 0.10 0.100 1.000000
1 Test1 A 0.15 0.600 4.000000
2 Test2 A 0.08 0.080 1.000000
3 Test3 A 0.11 0.220 2.000000
4 Test4 A 0.20 0.010 0.050000
5 Control B 0.15 0.080 0.533333
6 Test1 B 0.10 0.220 2.200000
7 Test2 B 0.04 0.020 0.500000
8 Test3 B 0.11 0.130 1.181818
9 Test4 B 0.20 0.004 0.020000
My desired output would be to normalize each Ratio per Batch based on that batch's Control sample, effectively multiplying all the Batch "B" ratios by 1/0.533333 ≈ 1.875.
DB:
Sample Batch Input Output Ratio Norm_Ratio
0 Control A 0.10 0.100 1.000000 1.000000
1 Test1 A 0.15 0.600 4.000000 4.000000
2 Test2 A 0.08 0.080 1.000000 1.000000
3 Test3 A 0.11 0.220 2.000000 2.000000
4 Test4 A 0.20 0.010 0.050000 0.050000
5 Control B 0.15 0.080 0.533333 1.000000
6 Test1 B 0.10 0.220 2.200000 4.125000
7 Test2 B 0.04 0.020 0.500000 0.937500
8 Test3 B 0.11 0.130 1.181818 2.215909
9 Test4 B 0.20 0.004 0.020000 0.037500
I can do this by breaking up the dataframe using for loops and manually extracting the "Control" values, but this is slow and messy for large datasets.
Use where and groupby.transform: where masks every non-Control ratio to NaN, groupby.transform('first') broadcasts the one remaining value per Batch, and div divides by it:
DB['Norm_Ratio'] = DB['Ratio'].div(
    DB['Ratio'].where(DB['Sample'].eq('Control'))
               .groupby(DB['Batch'])
               .transform('first')
)
Output:
Sample Batch Input Output Ratio Norm_Ratio
0 Control A 0.10 0.100 1.000000 1.000000
1 Test1 A 0.15 0.600 4.000000 4.000000
2 Test2 A 0.08 0.080 1.000000 1.000000
3 Test3 A 0.11 0.220 2.000000 2.000000
4 Test4 A 0.20 0.010 0.050000 0.050000
5 Control B 0.15 0.080 0.533333 1.000000
6 Test1 B 0.10 0.220 2.200000 4.125000
7 Test2 B 0.04 0.020 0.500000 0.937500
8 Test3 B 0.11 0.130 1.181818 2.215909
9 Test4 B 0.20 0.004 0.020000 0.037500
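For a more explicit spelling of the same idea (a sketch, assuming exactly one Control row per Batch), you can build a Batch-to-control-ratio mapping and divide by it:
# Series mapping each Batch to its Control ratio (assumes one Control per Batch)
control_ratio = DB.loc[DB['Sample'].eq('Control')].set_index('Batch')['Ratio']
DB['Norm_Ratio'] = DB['Ratio'] / DB['Batch'].map(control_ratio)
Both routes avoid looping over batches; the where/transform version stays a single expression, while the map version makes the per-batch denominator explicit.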
I'm trying to avoid defining multiple individual polygons/quads, so I'm using polydata.
I need to define multiple polydata in a MATLAB-generated VTK file, but each one should be assigned a different color (defined in a lookup table).
The following file gives an error and only accepts the first color, which it assigns to all polydata.
# vtk DataFile Version 5.1
vtk output
ASCII
DATASET POLYDATA
POINTS 12 float
0.500000 1.000000 0.000000
0.353553 1.000000 -0.353553
0.000000 1.000000 -0.500000
-0.353553 1.000000 -0.353553
-0.500000 1.000000 0.000000
-0.353553 1.000000 0.353553
0.000000 1.000000 0.500000
0.353553 1.000000 0.353553
0. 0. 0.
1. 1. 1.
2. 2. 2.
1. 2. 1.
POLYGONS 3 12
OFFSETS vtktypeint64
0 8 12
CONNECTIVITY vtktypeint64
0 1 2 3 4 5 6 7
9 10 11 12
CELL_DATA 2
SCALARS SMEARED float 1
LOOKUP_TABLE victor
0 1
LOOKUP_TABLE victor 1
1.000000 0.000000 0.000000 1.000000
0.000000 1.000000 0.000000 1.000000
LOOKUP_TABLE victor 1
This should be LOOKUP_TABLE victor 2, since you define 2 RGBA entries in your table.
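A minimal corrected excerpt of the cell-data section (only the table count changes, everything else stays as in the file above):
CELL_DATA 2
SCALARS SMEARED float 1
LOOKUP_TABLE victor
0 1
LOOKUP_TABLE victor 2
1.000000 0.000000 0.000000 1.000000
0.000000 1.000000 0.000000 1.000000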
I'd like to convert a dataframe to a matrix.
I took the titanic dataset as an example.
The dataframe looks like so:
x y ppscore
0 pclass pclass 1.000000
1 pclass survived 0.000000
2 pclass name 0.000000
3 pclass sex 0.000000
4 pclass age 0.088131
5 pclass sibsp 0.000000
6 pclass parch 0.000000
7 pclass ticket 0.000000
8 pclass fare 0.188278
9 pclass cabin 0.064250
and I want to have it in a matrix shape like so:
pclass survived age sibsp parch fare body
pclass 1.000000 -0.312469 -0.408106 0.060832 0.018322 -0.558629 -0.034642
survived -0.312469 1.000000 -0.055513 -0.027825 0.082660 0.244265 NaN
age -0.408106 -0.055513 1.000000 -0.243699 -0.150917 0.178739 0.058809
sibsp 0.060832 -0.027825 -0.243699 1.000000 0.373587 0.160238 -0.099961
parch 0.018322 0.082660 -0.150917 0.373587 1.000000 0.221539 0.051099
fare -0.558629 0.244265 0.178739 0.160238 0.221539 1.000000 -0.043110
body -0.034642 NaN 0.058809 -0.099961 0.051099 -0.043110 1.000000
Appreciate your help
Thanks!
I'm sure there are more efficient ways to do this, but this solved my problem:
#this is the method I wanted to compare to the MIC
import ppscore as pps
df = pps.matrix(titanic)
this creates the following dataframe:
x y ppscore
0 pclass pclass 1.000000
1 pclass survived 0.000000
2 pclass name 0.000000
3 pclass sex 0.000000
4 pclass age 0.088131
5 pclass sibsp 0.000000
6 pclass parch 0.000000
7 pclass ticket 0.000000
8 pclass fare 0.188278
9 pclass cabin 0.064250
Next, this function did the job:
import numpy as np

def to_matrix(df):
    # since the data is symmetric, the square root gives the required dimensions
    leng = int(np.sqrt(len(df['ppscore'])))
    # create the values for the matrix
    val = df['ppscore'].values.reshape((leng, leng))
    # create the index and columns for the matrix, preserving their original order
    X, ind_x = np.unique(df['x'], return_index=True)
    X = X[np.argsort(ind_x)]
    Y, ind_y = np.unique(df['y'], return_index=True)
    Y = Y[np.argsort(ind_y)]
    matrix = pd.DataFrame(val, index=X, columns=Y)
    return matrix
The result (run here on a different dataset than the titanic example above) is:
longitude latitude housing_median_age total_rooms \
longitude 1.00 0.78 0.13 0.00
latitude 0.76 1.00 0.09 0.00
housing_median_age 0.00 0.00 1.00 0.02
total_rooms 0.00 0.00 0.00 1.00
total_bedrooms 0.00 0.00 0.00 0.51
population 0.00 0.00 0.00 0.33
households 0.00 0.00 0.00 0.52
median_income 0.00 0.00 0.00 0.00
median_house_value 0.00 0.00 0.00 0.00
ocean_proximity 0.24 0.29 0.05 0.00
total_bedrooms population households median_income \
longitude 0.00 0.00 0.00 0.01
latitude 0.00 0.00 0.00 0.02
housing_median_age 0.02 0.00 0.00 0.00
total_rooms 0.48 0.31 0.46 0.00
total_bedrooms 1.00 0.42 0.81 0.00
population 0.38 1.00 0.49 0.00
households 0.81 0.54 1.00 0.00
median_income 0.00 0.00 0.00 1.00
median_house_value 0.00 0.00 0.00 0.13
ocean_proximity 0.00 0.00 0.00 0.01
median_house_value ocean_proximity
longitude 0.14 0.63
latitude 0.12 0.56
housing_median_age 0.00 0.15
total_rooms 0.00 0.01
total_bedrooms 0.00 0.04
population 0.00 0.01
households 0.00 0.03
median_income 0.04 0.05
median_house_value 1.00 0.25
ocean_proximity 0.14 1.00
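For what it's worth, a simpler route from the long format to the matrix (a sketch, assuming each x/y pair appears exactly once, as in the pps.matrix output) is pandas' pivot:
matrix = df.pivot(index='x', columns='y', values='ppscore')
This avoids the reshape and the assumption that the rows arrive in a fixed order, though pivot sorts the labels, so you may want to reindex rows and columns afterwards to restore the original order.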
I am trying to create a column which contains, for each row, the minimum over a few columns. For example:
A0 A1 A2 B0 B1 B2 C0 C1
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72
Here I am trying to create a column which contains the minimum for each row of columns B0, B1, B2.
The output would look like this:
A0 A1 A2 B0 B1 B2 C0 C1 Minimum
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75 0.42
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73 0.00
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03 0.51
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61 0.51
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53 0.17
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72 0.01
Here is part of the code, but it is not doing what I want it to do:
for i in range(0,2):
    df['Minimum'] = df.loc[0,'B'+str(i)].min()
This is a one-liner; you just need to use the axis argument for min to tell it to work across the columns rather than down them:
df['Minimum'] = df.loc[:, ['B0', 'B1', 'B2']].min(axis=1)
If you need to use this solution for different numbers of columns, you can use a for loop or list comprehension to construct the list of columns:
n_columns = 3
cols_to_use = ['B' + str(i) for i in range(n_columns)]
df['Minimum'] = df.loc[:, cols_to_use].min(axis=1)
For my tasks, a universal and flexible approach is the following:
df['Minimum'] = df[['B0', 'B1', 'B2']].apply(lambda x: min(x.iloc[0], x.iloc[1], x.iloc[2]), axis=1)
The target column 'Minimum' is assigned the result of the lambda function applied to the selected columns ['B0', 'B1', 'B2']. Inside the function, the elements are accessed positionally through the lambda's argument (via .iloc, since each row arrives as a Series). Be sure to specify axis=1, which requests row-by-row calculation.
This is very convenient when you need to make complex calculations.
However, I assume that such a solution may be slower.
As for the selection of columns, in addition to the for-loop method, I can suggest a filter like this:
cols_to_use = list(filter(lambda f: 'B' in f, df.columns))
Literally, a filter is applied to the list of DataFrame columns through a lambda function that checks for the occurrence of the letter 'B'.
After that, the first example can be written as follows:
cols_to_use = list(filter(lambda f: 'B' in f, df.columns))
df['Minimum'] = df[cols_to_use].apply(lambda x: min(x), axis=1)
Although, after pre-selecting the columns, this would be preferable:
df['Minimum'] = df[cols_to_use].min(axis=1)
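As a related sketch (assuming the target columns all match the pattern B followed by digits), DataFrame.filter can do the selection in one step:
# select columns named B0, B1, B2, ... and take the row-wise minimum
df['Minimum'] = df.filter(regex=r'^B\d+$').min(axis=1)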
I have an input dataframe as shown below. Taking two rows at a time, there are 4C2 combinations. I want the output saved in a dataframe as shown in the output dataframe: for each possible combination, the columns of the two rows are placed side by side.
Input df
A B
0 0.5 12
1 0.7 16
2 0.9 20
3 0.11 24
Output df
combination A B A' B'
(0,1) 0.5 12 0.7 16
(0,2) 0.5 12 0.9 20
.................................
.................................
Method 1
Create an artificial key column, then merge the df to itself:
df['key'] = 1
df.merge(df, on='key',suffixes=["", "'"]).reset_index(drop=True).drop('key', axis=1)
A B A' B'
0 0.50 12 0.50 12
1 0.50 12 0.70 16
2 0.50 12 0.90 20
3 0.50 12 0.11 24
4 0.70 16 0.50 12
5 0.70 16 0.70 16
6 0.70 16 0.90 20
7 0.70 16 0.11 24
8 0.90 20 0.50 12
9 0.90 20 0.70 16
10 0.90 20 0.90 20
11 0.90 20 0.11 24
12 0.11 24 0.50 12
13 0.11 24 0.70 16
14 0.11 24 0.90 20
15 0.11 24 0.11 24
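As a side note (assuming pandas >= 1.2), the helper key can be skipped with a cross merge, and if you only want the unordered 4C2 pairs from the question, one sketch is to drop the self-pairs and mirrored pairs by index arithmetic:
out = df.merge(df, how='cross', suffixes=["", "'"])
n = len(df)
# row k of the cross join pairs input rows k // n and k % n; keep only k // n < k % n
out = out[out.index // n < out.index % n].reset_index(drop=True)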
Method 2
First prepare a dataframe with all the possible pairings, then merge the original dataframe onto it to get the pairs side by side:
idx = [x for x in range(len(df))] * len(df)
idx.sort()
df2 = pd.concat([df]*len(df))
df2.index = idx
df.merge(df2, left_index=True, right_index=True, suffixes=["", "'"]).reset_index(drop=True)
A B A' B'
0 0.50 12 0.50 12
1 0.50 12 0.70 16
2 0.50 12 0.90 20
3 0.50 12 0.11 24
4 0.70 16 0.50 12
5 0.70 16 0.70 16
6 0.70 16 0.90 20
7 0.70 16 0.11 24
8 0.90 20 0.50 12
9 0.90 20 0.70 16
10 0.90 20 0.90 20
11 0.90 20 0.11 24
12 0.11 24 0.50 12
13 0.11 24 0.70 16
14 0.11 24 0.90 20
15 0.11 24 0.11 24
Let's use itertools.combinations:
from itertools import combinations
pd.concat([df.loc[[i, j]]
             .unstack()
             .set_axis(["A", "A'", "B", "B'"])
             .to_frame(name=(i, j)).T
           for i, j in combinations(df.index, 2)])
Output dataframe with a MultiIndex:
A A' B B'
0 1 0.5 0.70 12.0 16.0
2 0.5 0.90 12.0 20.0
3 0.5 0.11 12.0 24.0
1 2 0.7 0.90 16.0 20.0
3 0.7 0.11 16.0 24.0
2 3 0.9 0.11 20.0 24.0
Or with a string index:
pd.concat([df.loc[[i, j]]
             .unstack()
             .set_axis(["A", "A'", "B", "B'"])
             .to_frame(name='(' + str(i) + ',' + str(j) + ')').T
           for i, j in combinations(df.index, 2)])
Output:
A A' B B'
(0,1) 0.5 0.70 12.0 16.0
(0,2) 0.5 0.90 12.0 20.0
(0,3) 0.5 0.11 12.0 24.0
(1,2) 0.7 0.90 16.0 20.0
(1,3) 0.7 0.11 16.0 24.0
(2,3) 0.9 0.11 20.0 24.0
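If you want the pair labels as a named 'combination' column, as in the desired output, a small follow-up on the string-index version (storing the concat result in a variable, here called out, a name I'm introducing for illustration):
out = out.rename_axis('combination').reset_index()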
I've got a pandas dataframe like this, containing a timestamp, id, foo, and bar.
The timestamps are roughly 10 minutes apart.
timestamp id foo bar
2019-04-14 00:00:10 1 0.10 0.05
2019-04-14 00:10:02 1 0.30 0.10
2019-04-14 00:00:00 2 0.10 0.05
2019-04-14 00:10:00 2 0.30 0.10
For each id, I'd like to create additional rows (four per id in the example below) with timestamps split equally between the successive rows, and foo & bar values containing random values between those of the successive rows.
The start time should be the earliest timestamp for each id and the end time should be the latest timestamp for each id
So the output would be like this.
timestamp id foo bar
2019-04-14 00:00:10 1 0.10 0.05
2019-04-14 00:02:10 1 0.14 0.06
2019-04-14 00:04:10 1 0.11 0.06
2019-04-14 00:06:10 1 0.29 0.07
2019-04-14 00:08:10 1 0.22 0.09
2019-04-14 00:10:02 1 0.30 0.10
2019-04-14 00:00:00 2 0.80 0.50
2019-04-14 00:02:00 2 0.45 0.48
2019-04-14 00:04:00 2 0.52 0.42
2019-04-14 00:06:00 2 0.74 0.48
2019-04-14 00:08:00 2 0.41 0.45
2019-04-14 00:10:00 2 0.40 0.40
I can reindex the timestamp column and create additional timestamp rows (e.g., Pandas create new date rows and forward fill column values).
But I can't seem to wrap my head around how to compute the random values for foo and bar between the successive rows.
Appreciate if someone can point me in the right direction!
Close to what you need: use date_range with DataFrame.reindex, taking the first and last values of the DatetimeIndex per group:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = (df.set_index('timestamp')
        .groupby('id')[['foo', 'bar']]
        .apply(lambda x: x.reindex(pd.date_range(x.index[0], x.index[-1], periods=6))))
Then create a helper DataFrame of the same shape as the original, filled with random values, and fill in the missing rows with DataFrame.fillna:
import numpy as np

df1 = pd.DataFrame(np.random.rand(*df.shape), index=df.index, columns=df.columns)
df = df.fillna(df1)
print(df)
foo bar
id
1 2019-04-14 00:00:10.000 0.100000 0.050000
2019-04-14 00:02:08.400 0.903435 0.755841
2019-04-14 00:04:06.800 0.956002 0.253878
2019-04-14 00:06:05.200 0.388454 0.257639
2019-04-14 00:08:03.600 0.225535 0.195306
2019-04-14 00:10:02.000 0.300000 0.100000
2 2019-04-14 00:00:00.000 0.100000 0.050000
2019-04-14 00:02:00.000 0.180865 0.327581
2019-04-14 00:04:00.000 0.417956 0.414400
2019-04-14 00:06:00.000 0.012686 0.800948
2019-04-14 00:08:00.000 0.716216 0.941396
2019-04-14 00:10:00.000 0.300000 0.100000
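Note that np.random.rand draws from [0, 1) regardless of the surrounding values. If the filled values should instead lie between each group's first and last value (my reading of the question's intent, not part of the original answer), one sketch is to scale the draws per group before filling:
lo = df.groupby(level=0).transform('first')   # first non-NaN value per id
hi = df.groupby(level=0).transform('last')    # last non-NaN value per id
rand = pd.DataFrame(np.random.rand(*df.shape), index=df.index, columns=df.columns)
df = df.fillna(lo + (hi - lo) * rand)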
If the 'randomness' is not crucial, we can use Series.interpolate, which will keep the values between your min and max per group:
df_new = pd.concat([
    d.reindex(pd.date_range(d.timestamp.min(), d.timestamp.max(), periods=6))
    for _, d in df.groupby('id')
])
df_new['timestamp'] = df_new.index
df_new.reset_index(drop=True, inplace=True)
df_new = df_new[['timestamp']].merge(df, on='timestamp', how='left')
df_new['id'] = df_new['id'].ffill()
df_new[['foo', 'bar']] = df_new[['foo', 'bar']].apply(lambda x: x.interpolate())
Which gives the following output:
print(df_new)
timestamp id foo bar
0 2019-04-14 00:00:10.000 1.0 0.10 0.05
1 2019-04-14 00:02:08.400 1.0 0.14 0.06
2 2019-04-14 00:04:06.800 1.0 0.18 0.07
3 2019-04-14 00:06:05.200 1.0 0.22 0.08
4 2019-04-14 00:08:03.600 1.0 0.26 0.09
5 2019-04-14 00:10:02.000 1.0 0.30 0.10
6 2019-04-14 00:00:00.000 2.0 0.10 0.05
7 2019-04-14 00:02:00.000 2.0 0.14 0.06
8 2019-04-14 00:04:00.000 2.0 0.18 0.07
9 2019-04-14 00:06:00.000 2.0 0.22 0.08
10 2019-04-14 00:08:00.000 2.0 0.26 0.09
11 2019-04-14 00:10:00.000 2.0 0.30 0.10
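A more compact sketch of the interpolation route (assuming timestamp has already been converted to datetime, and that linear interpolation across the six equally spaced rows is acceptable):
df_new = (df.set_index('timestamp')
            .groupby('id')[['foo', 'bar']]
            .apply(lambda x: x.reindex(pd.date_range(x.index[0], x.index[-1], periods=6))
                              .interpolate())
            .reset_index()
            .rename(columns={'level_1': 'timestamp'}))
Here level_1 is the name pandas assigns to the unnamed datetime level created by the groupby-apply, so the rename restores the timestamp column name.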