I have two DataFrames I want to multiply together without looping. For each of the rows below, I want to take each of the arm angles,
sample pos arm1_angle arm2_angle arm3_angle arm4_angle
0 0 0.000000 0.000000 0.250000 0.500000 0.750000
1 1 0.134438 0.134438 0.384438 0.634438 0.884438
2 2 0.838681 0.838681 0.088681 0.338681 0.588681
3 3 1.755019 0.755019 0.005019 0.255019 0.505019
4 4 3.007274 0.007274 0.257274 0.507274 0.757274
5 5 4.186825 0.186825 0.436825 0.686825 0.936825
6 6 3.455513 0.455513 0.705513 0.955513 0.205513
7 7 4.916564 0.916564 0.166564 0.416564 0.666564
8 8 2.876257 0.876257 0.126257 0.376257 0.626257
9 9 2.549585 0.549585 0.799585 0.049585 0.299585
10 10 1.034488 0.034488 0.284488 0.534488 0.784488
multiply it by the table below, and concatenate the results. For example, if there are 10k rows above, the result will be 10k x 27 = 270,000 rows.
So for index 0, multiply the entire table below by 0 for arm1, 0.25 for arm2, 0.5 for arm3, and 0.75 for arm4.
I can easily loop through, multiply, and concatenate. Is there a more efficient way?
id radius bag_count arm1 arm2 arm3 arm4
0 1 0.440 4 1.0 0.0 0.0 0.0
1 2 0.562 8 0.0 1.0 0.0 0.0
2 3 0.666 12 0.0 0.0 1.0 0.0
3 4 0.818 16 1.0 0.0 0.0 0.0
4 5 0.912 16 0.0 1.0 0.0 0.0
5 6 1.022 20 0.0 0.0 1.0 0.0
6 7 1.120 24 1.0 0.0 0.0 0.0
7 8 1.220 28 0.0 1.0 0.0 0.0
8 9 1.350 32 0.0 0.0 1.0 0.0
9 10 1.460 36 1.0 0.0 1.0 0.0
10 11 1.570 40 0.0 1.0 0.0 1.0
11 12 1.680 44 1.0 0.0 1.0 0.0
12 13 1.800 44 0.0 1.0 0.0 1.0
13 14 1.920 48 1.0 0.0 1.0 0.0
14 15 2.030 52 0.0 1.0 0.0 1.0
15 16 2.140 56 1.0 0.0 1.0 0.0
16 17 2.250 60 0.0 1.0 1.0 1.0
17 18 2.360 64 1.0 0.0 1.0 1.0
18 19 2.470 68 1.0 1.0 0.0 1.0
19 20 2.580 72 1.0 1.0 1.0 0.0
20 21 2.700 72 0.0 1.0 1.0 1.0
21 22 2.810 76 1.0 0.0 1.0 1.0
22 23 2.940 80 1.0 1.0 0.0 1.0
23 24 3.060 84 1.0 1.0 1.0 0.0
24 25 3.180 88 1.0 1.0 1.0 1.0
25 26 3.300 92 1.0 1.0 1.0 1.0
26 27 3.420 96 1.0 1.0 1.0 1.0
Use a cross join to combine every row of the first table with every row of the second, then multiply the arm columns and drop the helpers:
df22 = df2.filter(like='arm')             # arm indicator columns from the second table
cols = df1.filter(like='arm').columns     # arm angle columns in the first table
df = df1.merge(df22, how='cross')         # every df1 row paired with every df2 row
df[cols] = df[cols].mul(df[df22.columns].to_numpy())  # zero out angles where the indicator is 0
df = df.drop(df22.columns, axis=1)        # drop the helper indicator columns
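As a sanity check, here is a minimal runnable sketch of the cross-join approach on two tiny made-up frames (the column names follow the question; the values are illustrative):

```python
import pandas as pd

# Tiny stand-ins for the two tables in the question
df1 = pd.DataFrame({'pos': [0.0, 0.134438],
                    'arm1_angle': [0.0, 0.134438],
                    'arm2_angle': [0.25, 0.384438]})
df2 = pd.DataFrame({'radius': [0.440, 0.562],
                    'arm1': [1.0, 0.0],
                    'arm2': [0.0, 1.0]})

df22 = df2.filter(like='arm')              # indicator columns
cols = df1.filter(like='arm').columns      # angle columns to scale
out = df1.merge(df22, how='cross')         # len(df1) * len(df2) rows
out[cols] = out[cols].mul(out[df22.columns].to_numpy())
out = out.drop(df22.columns, axis=1)
print(out)
```

Each of the 2 x 2 = 4 output rows is one df1 row scaled by one row of indicators; with 10k rows and 27 indicator rows this yields the 270,000 rows mentioned above without a Python-level loop.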
Given the following dataframe df:
date mom_pct
0 2020-1-31 1.0
1 2020-2-29 0.8
2 2020-3-31 -1.2
3 2020-4-30 -0.9
4 2020-5-31 -0.8
5 2020-6-30 -0.1
6 2020-7-31 0.6
7 2020-8-31 0.4
8 2020-9-30 0.2
9 2020-10-31 -0.3
10 2020-11-30 -0.6
11 2020-12-31 0.7
12 2021-1-31 1.0
13 2021-2-28 0.6
14 2021-3-31 -0.5
15 2021-4-30 -0.3
16 2021-5-31 -0.2
17 2021-6-30 -0.4
18 2021-7-31 0.3
19 2021-8-31 0.1
20 2021-9-30 0.0
21 2021-10-31 0.7
22 2021-11-30 0.4
23 2021-12-31 -0.3
24 2022-1-31 0.4
25 2022-2-28 0.6
26 2022-3-31 0.0
27 2022-4-30 0.4
28 2022-5-31 -0.2
I want to compare the chain-ratio value of a month in the current year to the value of the same month in the previous year. Let the value of the same period last year be y_t-1 and the current value be y_t. I will create a new column according to the following rules:
If y_t = y_t-1, returns 0 for new column;
If y_t ∈ (y_t-1, y_t-1 + 0.3], returns 1;
If y_t ∈ (y_t-1 + 0.3, y_t-1 + 0.5], returns 2;
If y_t > (y_t-1 + 0.5), returns 3;
If y_t ∈ [y_t-1 - 0.3, y_t-1), returns -1;
If y_t ∈ [y_t-1 - 0.5, y_t-1 - 0.3), returns -2;
If y_t < (y_t-1 - 0.5), returns -3
The expected result:
date mom_pct categorial_mom_pct
0 2020-1-31 1.0 NaN
1 2020-2-29 0.8 NaN
2 2020-3-31 -1.2 NaN
3 2020-4-30 -0.9 NaN
4 2020-5-31 -0.8 NaN
5 2020-6-30 -0.1 NaN
6 2020-7-31 0.6 NaN
7 2020-8-31 0.4 NaN
8 2020-9-30 0.2 NaN
9 2020-10-31 -0.3 NaN
10 2020-11-30 -0.6 NaN
11 2020-12-31 0.7 NaN
12 2021-1-31 1.0 0.0
13 2021-2-28 0.6 -1.0
14 2021-3-31 -0.5 3.0
15 2021-4-30 -0.3 3.0
16 2021-5-31 -0.2 3.0
17 2021-6-30 -0.4 -1.0
18 2021-7-31 0.3 -1.0
19 2021-8-31 0.1 -1.0
20 2021-9-30 0.0 -1.0
21 2021-10-31 0.7 3.0
22 2021-11-30 0.4 3.0
23 2021-12-31 -0.3 -3.0
24 2022-1-31 0.4 -3.0
25 2022-2-28 0.6 0.0
26 2022-3-31 0.0 2.0
27 2022-4-30 0.4 3.0
28 2022-5-31 -0.2 0.0
I attempted to create multiple columns of range bounds and then check which range mom_pct falls into. Is it possible to do this in a more efficient way? Thanks.
df1['mom_pct_zero'] = df1['mom_pct'].shift(12)
df1['mom_pct_pos1'] = df1['mom_pct'].shift(12) + 0.3
df1['mom_pct_pos2'] = df1['mom_pct'].shift(12) + 0.5
df1['mom_pct_neg1'] = df1['mom_pct'].shift(12) - 0.3
df1['mom_pct_neg2'] = df1['mom_pct'].shift(12) - 0.5
I would do it as follows:
import numpy as np

def categorize(v):
    if np.isnan(v) or v == 0.:
        return v
    sign = -1 if v < 0 else 1
    eps = 1e-10
    if abs(v) <= 0.3 + eps:
        return sign * 1
    if abs(v) <= 0.5 + eps:
        return sign * 2
    return sign * 3

df['categorial_mom_pct'] = df['mom_pct'].diff(12).map(categorize)
print(df)
Note that I added a very small eps to the thresholds to counter precision issues with floating-point arithmetic:
abs(-0.3) <= 0.3 # True
abs(-0.4 + 0.1) <= 0.3 # False
abs(-0.4 + 0.1) <= 0.3 + 1e-10 # True
Out:
date mom_pct categorial_mom_pct
0 2020-1-31 1.0 NaN
1 2020-2-29 0.8 NaN
2 2020-3-31 -1.2 NaN
3 2020-4-30 -0.9 NaN
4 2020-5-31 -0.8 NaN
5 2020-6-30 -0.1 NaN
6 2020-7-31 0.6 NaN
7 2020-8-31 0.4 NaN
8 2020-9-30 0.2 NaN
9 2020-10-31 -0.3 NaN
10 2020-11-30 -0.6 NaN
11 2020-12-31 0.7 NaN
12 2021-1-31 1.0 0.0
13 2021-2-28 0.6 -1.0
14 2021-3-31 -0.5 3.0
15 2021-4-30 -0.3 3.0
16 2021-5-31 -0.2 3.0
17 2021-6-30 -0.4 -1.0
18 2021-7-31 0.3 -1.0
19 2021-8-31 0.1 -1.0
20 2021-9-30 0.0 -1.0
21 2021-10-31 0.7 3.0
22 2021-11-30 0.4 3.0
23 2021-12-31 -0.3 -3.0
24 2022-1-31 0.4 -3.0
25 2022-2-28 0.6 0.0
26 2022-3-31 0.0 2.0
27 2022-4-30 0.4 3.0
28 2022-5-31 -0.2 0.0
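A fully vectorized alternative is also possible; the sketch below (my own suggestion, not part of the answer above) reproduces the same bucketing with np.select on the 12-period difference, keeping the same eps guard:

```python
import numpy as np
import pandas as pd

def categorize_vec(s, eps=1e-10):
    # sign of the year-over-year diff times the bucket (1, 2 or 3)
    d = s.diff(12)
    a = d.abs()
    bucket = np.select([a <= 0.3 + eps, a <= 0.5 + eps], [1, 2], default=3)
    return np.sign(d) * bucket   # NaN diffs stay NaN, zero diffs stay 0

# Illustrative series: 12 baseline months, then two comparable months
s = pd.Series([1.0] * 12 + [1.3, 0.6])
out = categorize_vec(s)
```

np.sign carries NaN through the multiplication, so the first 12 rows remain NaN just as with map(categorize).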
I want to calculate a rolling mean only where a Marker column is 1. This is a small example, but the real-world data is massive and needs to be processed efficiently.
import pandas as pd

df = pd.DataFrame()
df['Obs']=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
df['Marker']=[0,0,0,0,1,0,0,0,0,1,0,0,0,0,1]
df['Mean']=(df.Obs.rolling(5).mean())
How can I create a Desired column like this:
df['Desired']=[0,0,0,0,3.0,0,0,0,0,8.0,0,0,0,0,13.0]
print(df)
Obs Marker Mean Desired
0 1 0 NaN 0.0
1 2 0 NaN 0.0
2 3 0 NaN 0.0
3 4 0 NaN 0.0
4 5 1 3.0 3.0
5 6 0 4.0 0.0
6 7 0 5.0 0.0
7 8 0 6.0 0.0
8 9 0 7.0 0.0
9 10 1 8.0 8.0
10 11 0 9.0 0.0
11 12 0 10.0 0.0
12 13 0 11.0 0.0
13 14 0 12.0 0.0
14 15 1 13.0 13.0
You are close, just need a where:
df['Mean']= df.Obs.rolling(5).mean().where(df['Marker']==1, 0)
Output:
Obs Marker Mean
0 1 0 0.0
1 2 0 0.0
2 3 0 0.0
3 4 0 0.0
4 5 1 3.0
5 6 0 0.0
6 7 0 0.0
7 8 0 0.0
8 9 0 0.0
9 10 1 8.0
10 11 0 0.0
11 12 0 0.0
12 13 0 0.0
13 14 0 0.0
14 15 1 13.0
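If even the full rolling mean is too slow on massive data, a prefix-sum sketch (an alternative I'm suggesting, not part of the answer above) computes the window mean only at the marker positions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Obs': range(1, 16),
                   'Marker': [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]})

window = 5
obs = df['Obs'].to_numpy(dtype=float)
csum = np.concatenate(([0.0], np.cumsum(obs)))      # csum[i] = sum of first i values
idx = np.flatnonzero(df['Marker'].to_numpy() == 1)  # marker positions
idx = idx[idx >= window - 1]                        # keep only full windows
desired = np.zeros(len(df))
desired[idx] = (csum[idx + 1] - csum[idx - window + 1]) / window
df['Desired'] = desired
print(df)
```

This touches each value once for the cumulative sum and then does O(number of markers) work, instead of evaluating a mean over every window.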
I am trying to get new columns a and b based on the following dataframe:
a_x b_x a_y b_y
0 13.67 0.0 13.67 0.0
1 13.42 0.0 13.42 0.0
2 13.52 1.0 13.17 1.0
3 13.61 1.0 13.11 1.0
4 12.68 1.0 13.06 1.0
5 12.70 1.0 12.93 1.0
6 13.60 1.0 NaN NaN
7 12.89 1.0 NaN NaN
8 11.68 1.0 NaN NaN
9 NaN NaN 8.87 0.0
10 NaN NaN 8.77 0.0
11 NaN NaN 7.97 0.0
If b_x or b_y is 0.0 (in that case they have the same value when both exist), then a_x and a_y share the same value, so I take either of them as the new column a and either b column as b. If b_x and b_y are 1.0, a_x and a_y have different values, so I take the mean of a_x and a_y as a, and either of b_x and b_y as b.
If only one of the pairs a_x, b_x or a_y, b_y is non-null, I take the existing values as a and b.
My expected results will like this:
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0
1 13.42 0.0 13.42 0.0 13.420 0
2 13.52 1.0 13.17 1.0 13.345 1
3 13.61 1.0 13.11 1.0 13.360 1
4 12.68 1.0 13.06 1.0 12.870 1
5 12.70 1.0 12.93 1.0 12.815 1
6 13.60 1.0 NaN NaN 13.600 1
7 12.89 1.0 NaN NaN 12.890 1
8 11.68 1.0 NaN NaN 11.680 1
9 NaN NaN 8.87 0.0 8.870 0
10 NaN NaN 8.77 0.0 8.770 0
11 NaN NaN 7.97 0.0 7.970 0
How can I get the result above? Thank you.
Use:
#filter all a and b columns
b = df.filter(like='b')
a = df.filter(like='a')
#test if at least one 0 or 1 value
m1 = b.eq(0).any(axis=1)
m2 = b.eq(1).any(axis=1)
#get means of a columns
a1 = a.mean(axis=1)
#forward fill missing values and select the last column
b1 = b.ffill(axis=1).iloc[:, -1]
a2 = a.ffill(axis=1).iloc[:, -1]
#new Dataframe with 2 conditions
df1 = pd.DataFrame(np.select([m1, m2], [[a2, b1], [a1, b1]]), index=['a','b']).T
#join to original
df = df.join(df1)
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
But I think the solution can be simplified, because the mean can be used for both conditions (the mean of identical values is the same as either value):
b = df.filter(like='b')
a = df.filter(like='a')
a1 = a.mean(axis=1)
b1 = b.ffill(axis=1).iloc[:, -1]
df['a'] = a1
df['b'] = b1
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
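The simplification relies on DataFrame.mean skipping NaN by default, so rows where only one of a_x/a_y exists reduce to that single value. A minimal runnable sketch with made-up rows covering the question's cases:

```python
import numpy as np
import pandas as pd

# one row per case: both b == 0, both b == 1, only the x pair, only the y pair
df = pd.DataFrame({'a_x': [13.67, 13.52, 13.60, np.nan],
                   'b_x': [0.0, 1.0, 1.0, np.nan],
                   'a_y': [13.67, 13.17, np.nan, 8.87],
                   'b_y': [0.0, 1.0, np.nan, 0.0]})

a = df.filter(like='a')
b = df.filter(like='b')
df['a'] = a.mean(axis=1)               # skipna=True by default
df['b'] = b.ffill(axis=1).iloc[:, -1]  # last non-NaN b per row
print(df)
```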
I have a table which looks something like this:
import numpy as np
import pandas as pd
tmp=[["","5-9",""],["","",""],["17-","","4- -9 27-"],["-6","",""],["","","-15"]]
dat=pd.DataFrame(tmp).rename(columns={0:"V0",1:"V1",2:"V2"})
dat["Month"]=np.arange(1,6)
dat["Year"]=np.repeat(2015,5)
V0 V1 V2 Month Year
0 5-9 1 2015
1 2 2015
2 17- 4- -9 27- 3 2015
3 -6 4 2015
4 -15 5 2015
...
The numbers in the table represent the days (in the month) when a certain event happened. Note: months can have multiple events and events can span over multiple months.
V0, V1 and V2 are three different devices, each having its own separate events. So we have three different time series.
I would like to convert this table to a time-series data frame, that is, break it down per day for each device. Each row would be one day of one month (of one year), and each column would now only have values of 0 or 1: 0 if no event happened on that day, 1 otherwise (a dummy variable). The result would contain three different time series, one per device. How would I do that?
This is what the output would look like
V0 V1 V2 Day Month Year
0 0 0 0 1 1 2015
1 0 0 0 2 1 2015
2 0 0 0 3 1 2015
3 0 0 0 4 1 2015
4 0 0 0 5 1 2015
5 0 1 0 6 1 2015
6 0 1 0 7 1 2015
7 0 1 0 8 1 2015
8 0 1 0 9 1 2015
9 0 1 0 10 1 2015
10 0 0 0 11 1 2015
11 0 0 0 12 1 2015
12 0 0 0 13 1 2015
...
You can do this with a series of transformations, as shown below. I don't know if this is the most efficient way of doing it, though ...
import numpy as np
import pandas as pd
tmp=[["","5-9",""],["","",""],["17-","","4- -9 27-"],["-6","",""],["","","-15"]]
df=pd.DataFrame(tmp).rename(columns={0:"V0",1:"V1",2:"V2"})
df["Month"]=np.arange(1,6)
df["Year"]=np.repeat(2015,5)
print(df)
V0 V1 V2 Month Year
0 5-9 1 2015
1 2 2015
2 17- 4- -9 27- 3 2015
3 -6 4 2015
4 -15 5 2015
1. Stack Only Non-Empty Values
days = df.set_index(['Year', 'Month']).stack().replace('', np.nan).dropna()
print(days)
Year Month
2015 1 V1 5-9
3 V0 17-
V2 4- -9 27-
4 V0 -6
5 V2 -15
dtype: object
2. Expand Date Ranges
Strings such as "5-9" need to be converted to an array of length 31, with the values for days 5-9 set to 1 and the rest to 0, and similarly for the other rows. This is a string-parsing problem left as an exercise :-). In my example below, I am hard-coding the solution based on the values in the question.
def _fill(arr, start, stop):
    arr[np.arange(start-1, stop)] = 1
    return arr

def expand_days(df_in):
    df_out = df_in.copy()
    days_all = np.zeros(31)
    df_out.loc[2015, 1, 'V1'] = _fill(days_all.copy(), 5, 9)
    df_out.loc[2015, 3, 'V0'] = _fill(days_all.copy(), 17, 31)
    df_out.loc[2015, 3, 'V2'] = _fill(_fill(days_all.copy(), 4, 9), 27, 31)
    df_out.loc[2015, 4, 'V0'] = _fill(days_all.copy(), 1, 6)
    df_out.loc[2015, 5, 'V2'] = _fill(days_all.copy(), 1, 15)
    return df_out
days_ex = expand_days(days)
print(days_ex)
Year Month
2015 1 V1 [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...
3 V0 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
V2 [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...
4 V0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, ...
5 V2 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...
dtype: object
3. Convert Each Array to a Series of Columns
days_fr = days_ex.apply(lambda x: pd.Series(x, index=np.arange(1, 32)))
print(days_fr)
1 2 3 4 5 6 7 8 9 10 ... 22 \
Year Month ...
2015 1 V1 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 ... 0.0
3 V0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0
V2 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 ... 0.0
4 V0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 ... 0.0
5 V2 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 0.0
23 24 25 26 27 28 29 30 31
Year Month
2015 1 V1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 V0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
V2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0
4 V0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 V2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[5 rows x 31 columns]
4. Set Correct Index Names And Stack
days_unstacked = days_fr.stack()
days_unstacked.index.set_names(['Year', 'Month', 'Devices', 'Days'], inplace=True)
print(days_unstacked.head())
Year Month Devices Days
2015 1 V1 1 0.0
2 0.0
3 0.0
4 0.0
5 1.0
dtype: float64
5. Unstack And Fill NA's With Zeros
days_stacked = days_unstacked.unstack('Devices').fillna(0).reset_index()
print(days_stacked.head(10))
Devices Year Month Days V0 V1 V2
0 2015 1 1 0.0 0.0 0.0
1 2015 1 2 0.0 0.0 0.0
2 2015 1 3 0.0 0.0 0.0
3 2015 1 4 0.0 0.0 0.0
4 2015 1 5 0.0 1.0 0.0
5 2015 1 6 0.0 1.0 0.0
6 2015 1 7 0.0 1.0 0.0
7 2015 1 8 0.0 1.0 0.0
8 2015 1 9 0.0 1.0 0.0
9 2015 1 10 0.0 0.0 0.0
The index name of the resulting frame is set to Devices, which is an artifact of how we set up the problem; it will need to be changed to something else.
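The hard-coded expand_days above can be replaced with an actual parser. The sketch below is my own guess at the format: 'a-b' means days a through b, 'a-' means a through month end, '-b' means day 1 through b, and an 'a-' token immediately followed by a '-b' token in the same cell (as in '4- -9') means a through b:

```python
import re
import numpy as np

def parse_days(cell, month_len=31):
    """Turn a cell like '5-9', '17-', '-6' or '4- -9 27-' into a 0/1 day vector."""
    days = np.zeros(month_len, dtype=int)
    tokens = cell.split()
    i = 0
    while i < len(tokens):
        m = re.fullmatch(r'(\d*)-(\d*)', tokens[i])
        start = int(m.group(1)) if m.group(1) else 1          # '-b' starts at day 1
        stop = int(m.group(2)) if m.group(2) else month_len   # 'a-' runs to month end
        # an open-ended 'a-' directly followed by '-b' is one range a..b
        if not m.group(2) and i + 1 < len(tokens):
            m2 = re.fullmatch(r'-(\d+)', tokens[i + 1])
            if m2:
                stop = int(m2.group(1))
                i += 1
        days[start - 1:stop] = 1
        i += 1
    return days
```

With this, step 2 would become days.map(parse_days); months shorter than 31 days would still need handling.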
I'm trying to work with a dataset that has None values:
My uploading code is the following:
import pandas as pd
import io
import requests
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
c = pd.DataFrame(s_rows_cols, columns = header_row)
The output from c shows that some columns have None values. How do I replace these None values with zeros?
Thanks
I think it is not necessary if you use read_csv with sep='\s+' for the whitespace separator, plus the names parameter to specify the new column names:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
cols = ['age','sex','chestpain','restBP','chol','sugar','ecg',
'maxhr','angina','dep','exercise','fluor','thal','diagnosis']
df = pd.read_csv(url, sep='\s+', names=cols)
print (df)
age sex chestpain restBP chol sugar ecg maxhr angina dep \
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2
.. ... ... ... ... ... ... ... ... ... ...
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5
exercise fluor thal diagnosis
0 2.0 3.0 3.0 2
1 2.0 0.0 7.0 1
2 1.0 0.0 7.0 2
3 2.0 1.0 7.0 1
4 1.0 1.0 3.0 1
.. ... ... ... ...
265 1.0 0.0 7.0 1
266 1.0 0.0 7.0 1
267 2.0 0.0 3.0 1
268 2.0 0.0 6.0 1
269 2.0 3.0 3.0 2
[270 rows x 14 columns]
Then there are no Nones or missing values in the data:
print (df.isna().any(axis=1).any())
False
EDIT:
If you need to replace missing values or Nones with a scalar, use fillna:
c = c.fillna(0)
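For completeness, a tiny runnable sketch of fillna with made-up values:

```python
import numpy as np
import pandas as pd

c = pd.DataFrame({'age': [70.0, np.nan], 'chol': [322.0, None]})
c = c.fillna(0)   # both None and NaN are treated as missing
print(c)
```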