Using Regression - Differentiate between two dataframe columns: which is a linear and which is a polynomial function? - scikit-learn

In a dataframe with 6 columns (A B C D E F), one of the columns E and F is a linear combination of the first 4 columns with varying coefficients, while the other is a polynomial function of the same inputs. Find which column is the linear function and which is the polynomial function.
Here are 30 sample rows from the dataframe (512 total rows):
A B C D E F
0 28400 28482 28025 28060 738.0 117.570740
1 28136 28382 28135 28184 -146.0 295.430176
2 28145 28255 28097 28119 30.0 132.123714
3 28125 28192 27947 27981 357.0 101.298064
4 28060 28146 27981 28007 124.0 112.153318
5 27995 28100 27945 28022 149.0 182.427089
6 28088 28195 27985 28019 167.0 141.255137
7 28049 28157 27996 28008 22.0 120.069010
8 28025 28159 28025 28109 34.0 218.401641
9 28170 28638 28170 28614 420.0 919.376358
10 28666 28980 28551 28710 234.0 475.389093
11 28660 28779 28531 28634 345.0 222.895307
12 28590 28799 28568 28783 265.0 425.738484
13 28804 28930 28740 28808 138.0 194.449548
14 28770 28770 28650 28719 378.0 69.289005
15 28769 28770 28600 28638 413.0 39.225874
16 28694 28866 28674 28847 214.0 346.158401
17 28843 28928 28807 28874 121.0 152.281425
18 28921 28960 28680 28704 491.0 63.234310
19 28683 28950 28628 28905 397.0 547.115621
20 28877 28877 28712 28749 404.0 37.212629
21 28685 29011 28680 28949 222.0 598.104568
22 29045 29180 29045 29111 -3.0 201.306765
23 29220 29499 29216 29481 259.0 546.566915
24 29439 29485 29310 29376 344.0 112.394063
25 29319 29345 28951 29049 906.0 125.333702
26 29001 29009 28836 28938 526.0 110.611943
27 28905 28971 28851 28917 174.0 132.274514
28 28907 28916 28711 28862 685.0 161.078158
29 28890 29025 28802 28946 329.0 280.114923
I performed linear regression on all 512 rows, twice: columns A, B, C, D as inputs, first with column E and then with column F as the target. Code and output below.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# df is the full 512-row dataframe with columns A, B, C, D, E, F

For column E:
x = df.iloc[:, :4].values   # columns A, B, C, D as features
y = df.iloc[:, [4]].values  # column E as target
regressor = LinearRegression()
regressor.fit(x, y)
print(regressor.intercept_)
print(regressor.coef_)
Output:
[-2.67164069e-12]
[[ 2.  3. -1. -4.]]
For column F:
x_new = df.iloc[:, :4].values   # same features
y_new = df.iloc[:, [5]].values  # column F as target
regressor_new = LinearRegression()
regressor_new.fit(x_new, y_new)
print(regressor_new.intercept_)
print(regressor_new.coef_)
Output:
[0.32815962]
[[ 1.01293825 -1.0003835   1.00503772 -1.01765453]]
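The goodness of fit of the two regressions can also be compared directly. A minimal sketch, reusing the fitted models and arrays from above (the column that is an exact linear combination of A, B, C, D should give R² ≈ 1):

# R^2 of each fit; an exact linear combination is fit perfectly
print(regressor.score(x, y))              # E as target
print(regressor_new.score(x_new, y_new))  # F as target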
One of the two columns is a linear combination of the first 4 columns with varying coefficients, while the other is a polynomial function of the same inputs. State which column is the linear function and which is the polynomial.

I think the column that is a linear combination can be found by checking the multicollinearity between the columns: a column that is a linear combination of the remaining columns will have a high VIF (variance inflation factor).
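A minimal sketch of that check with statsmodels (assuming df holds the six numeric columns A through F):

from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df.values.astype(float)
for i, col in enumerate(df.columns):
    # a column that is an exact linear combination of the others
    # shows an extremely large (effectively infinite) VIF
    print(col, variance_inflation_factor(X, i))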

Try plotting graphs (histograms) of the two columns and see whether you can identify each function as linear or polynomial from the shape of the plot.
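A minimal plotting sketch (assuming matplotlib is available and the columns are named 'E' and 'F'); besides histograms, it also plots each column against the corresponding linear fit's predictions, where the linear column should fall on a straight line:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
axes[0, 0].hist(df['E'], bins=30)   # distribution of E
axes[0, 0].set_title('E')
axes[0, 1].hist(df['F'], bins=30)   # distribution of F
axes[0, 1].set_title('F')
axes[1, 0].scatter(regressor.predict(x).ravel(), df['E'], s=5)
axes[1, 0].set_title('E vs. linear prediction')
axes[1, 1].scatter(regressor_new.predict(x_new).ravel(), df['F'], s=5)
axes[1, 1].set_title('F vs. linear prediction')
plt.tight_layout()
plt.show()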

Related

Pandas rolling slope on groupby objects

I would like to estimate a rolling slope on a grouped dataframe.
Let's say that I have the following df:
Date tags weight
22 2004-05-12 a 0.000081
23 2004-05-13 a 0.000073
24 2004-05-14 a 0.000085
25 2004-05-17 a 0.000089
26 2004-05-18 b 0.000034
27 2004-05-19 b 0.000048
......
1000 2004-05-20 b 0.000034
1001 2004-05-21 b 0.000037
1002 2004-05-24 c 0.000043
1003 2004-05-25 c 0.000038
1004 2004-05-26 c 0.000029
How could I calculate a rolling slope over 10 dates and for each group?
I tried:
from scipy.stats import linregress
df['rolling_slope'] = df.groupby('tags').rolling(window=10,
    min_periods=2).apply(lambda v: linregress(v.Date, v.weight))
but it seems that I can't apply the function to a Series
Try:
import numpy as np  # needed for np.arange

df['rolling_slope'] = (df.groupby('tags')['weight']
                         .rolling(window=10, min_periods=2)
                         .apply(lambda v: linregress(np.arange(len(v)), v).slope)
                         .reset_index(level=0, drop=True)
                      )
But this rolls over a fixed number of rows only, not a true 10-day lookback. There is also the option rolling('10D'), but you would need to set the date as the index, as sketched below.
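A hedged sketch of that time-based variant (assuming the frame is sorted by tags and then Date, so the result aligns row-for-row when assigned back):

import numpy as np
import pandas as pd
from scipy.stats import linregress

df['Date'] = pd.to_datetime(df['Date'])
df['rolling_slope'] = (
    df.set_index('Date')            # rolling('10D') needs a DatetimeIndex
      .groupby('tags')['weight']
      .rolling('10D', min_periods=2)
      .apply(lambda v: linregress(np.arange(len(v)), v).slope)
      .values                       # relies on the sort order assumed above
)
# note: np.arange gives a slope per observation; regress on the actual
# timestamps instead if you want a slope per day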

How to convert values of panda dataframe to columns

I have a dataset given below:
weekid type amount
1 A 10
1 B 20
1 C 30
1 D 40
1 F 50
2 A 70
2 E 80
2 B 100
I am trying to convert it to another pandas frame, with one column per unique value of type, starting from:
import pandas as pd
import numpy as np
df = pd.read_csv(INPUT_FILE)
for t in df["type"].unique():
    pass  # todo: create a column for each unique type
My aim is to get the data given below:
weekid type_A type_B type_C type_D type_E type_F
1 10 20 30 40 0 50
2 70 100 0 0 80 0
Is there any specific function that converts the unique values to columns and fills the missing values with 0 for each weekid group? I am wondering how this conversion can be done efficiently.
You can use the following:
dfp = df.pivot(index='weekid', columns=['type'], values=['amount'])
dfp = dfp.fillna(0)
dfp.columns = dfp.columns.droplevel(0)
Given your input this yields:
type      A      B     C     D     E     F
weekid
1      10.0   20.0  30.0  40.0   0.0  50.0
2      70.0  100.0   0.0   0.0  80.0   0.0
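A hedged alternative that reproduces the type_A … type_F layout from the question directly, using pivot_table with fill_value:

out = (df.pivot_table(index='weekid', columns='type', values='amount',
                      fill_value=0)   # fill missing (weekid, type) cells with 0
         .add_prefix('type_')
         .reset_index())
print(out)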

Locate dataframe rows where values are outside bounds specified for each column

I have a dataframe with k columns and n rows, k ~= 10, n ~= 1000. I have a (2, k) array representing bounds on values for each column, e.g.:
# For 5 columns
bounds = ([0.1, 1, 0.1, 5, 10],
          [10, 1000, 1, 1000, 50])
# Example df
a b c d e
0 5 3 0.3 17 12
1 12 50 0.5 2 31
2 9 982 0.2 321 21
3 1 3 1.2 92 48
# Expected output with bounds given above
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
Crucially, the bounds on each column are different.
I would like to identify and exclude all rows of the dataframe where any column value falls outside the bounds for that respective column, preferably using array operations rather than iterating over the dataframe. The best I can think of so far involves iterating over the columns (which isn't too bad but still seems less than ideal):
for i in range(len(df.columns)):
    col = df.columns[i]
    df = df[(bounds[0][i] < df[col]) & (df[col] < bounds[1][i])]
Is there a better way to do this? Or alternatively, to select only the rows where all column values are within the respective bounds?
One way using pandas.DataFrame.apply with pandas.Series.between:
bounds = dict(zip(df.columns, zip(*bounds)))
new_df = df[~df.apply(lambda x: ~x.between(*bounds[x.name])).any(axis=1)]
print(new_df)
Output:
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
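A hedged, fully vectorized alternative that keeps the original (2, k) bounds lists and avoids apply entirely:

import numpy as np

lo = np.asarray(bounds[0])  # per-column lower bounds
hi = np.asarray(bounds[1])  # per-column upper bounds
# broadcast the bounds across all rows; keep rows inside on every column
mask = ((df.values > lo) & (df.values < hi)).all(axis=1)
print(df[mask])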

Calculating Cumulative Average every x successive rows in Excel (not to be confused with Average every x rows gap interval)

I want to calculate a cumulative average over every 3 consecutive rows of the value field. The figure above shows the cumulative-average column, which is the expected output. I tried the OFFSET method, but it gives the average at every 3-row gap interval, not the cumulative average over every 3 consecutive rows.
Use Series.rolling with mean and then Series.shift:
N = 3
df = pd.DataFrame({'Value': [6, 9, 15, 3, 27, 33]})
# mean of each 3-row window, shifted to align with the window's first row
df['Cum_sum'] = df['Value'].rolling(N).mean().shift(-N + 1)
print(df)
Value Cum_sum
0 6 10.0
1 9 9.0
2 15 15.0
3 3 21.0
4 27 NaN
5 33 NaN
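A hedged alternative (pandas 1.1 or later) that expresses the forward-looking window directly instead of shifting afterwards:

from pandas.api.indexers import FixedForwardWindowIndexer

# each window covers the current row and the N-1 rows after it
indexer = FixedForwardWindowIndexer(window_size=N)
df['Cum_sum'] = df['Value'].rolling(indexer, min_periods=N).mean()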

groupby and ranking based on the string in one column

I am working on a data frame which contains over 70 actions and a column that groups those actions. I want to create a new column that ranks the strings in an existing column. The following is a sample of the data frame:
DF = pd.DataFrame()
DF['template'] = ['Attk','Attk','Attk','Attk','Attk','Attk',
                  'Def','Def','Def','Def','Def','Def',
                  'Accuracy','Accuracy','Accuracy','Accuracy','Accuracy','Accuracy']
DF['Stats'] = ['Goal','xG','xA','Goal','xG','xA',
               'Block','interception','tackles','Block','interception','tackles',
               'Acc.passes','Acc.actions','Acc.crosses','Acc.passes','Acc.actions','Acc.crosses']
DF = DF.sort_values(['template','Stats'])
The new column I want groups by [template] and ranks the Stats in alphabetical order. The expected data frame is shown in the output below. I have 10 to 15 Stats under each template.
Use GroupBy.transform with a lambda function and pd.factorize; because Python counts from 0, 1 is added:
f = lambda x: pd.factorize(x)[0]
DF['Order'] = DF.groupby('template')['Stats'].transform(f) + 1
print (DF)
template Stats Order
13 Accuracy Acc.actions 1
16 Accuracy Acc.actions 1
14 Accuracy Acc.crosses 2
17 Accuracy Acc.crosses 2
12 Accuracy Acc.passes 3
15 Accuracy Acc.passes 3
0 Attk Goal 1
3 Attk Goal 1
2 Attk xA 2
5 Attk xA 2
1 Attk xG 3
4 Attk xG 3
6 Def Block 1
9 Def Block 1
7 Def interception 2
10 Def interception 2
8 Def tackles 3
11 Def tackles 3
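Note that pd.factorize numbers values in order of first appearance, so the result above is alphabetical only because of the sort_values call. A hedged alternative that ranks alphabetically regardless of row order, using Series.rank inside transform:

DF['Order'] = (DF.groupby('template')['Stats']
                 .transform(lambda s: s.rank(method='dense'))
                 .astype(int))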
