I am new to Excel. How can I get an estimate for the 2013 values of something like this:
I need an estimate that extrapolates the values, via linear regression, from the counterparts observed in recent years.
Thanks
To answer this, I plotted data in two ways: (a) showing each year separately, and (b) showing all the data as one line through time. The graphs are as follows:
Looking at the first graph, if there is any seasonality in the data, it's not very strong. However, looking at all the data plotted on one line through time, it looks as though there is an upward trend. So my suggestion is to do the most basic regression and fit a straight line to the data. The graph with the trend line added is as follows:
In numbers, the results are:
Month   Data   Best fit straight line
Jan-10 218 232.7
Feb-10 251 235.0
Mar-10 221 237.1
Apr-10 241 239.4
May-10 261 241.7
Jun-10 227 244.0
Jul-10 253 246.3
Aug-10 266 248.6
Sep-10 238 250.9
Oct-10 255 253.2
Nov-10 238 255.5
Dec-10 219 257.7
Jan-11 263 260.0
Feb-11 239 262.4
Mar-11 255 264.5
Apr-11 297 266.8
May-11 299 269.0
Jun-11 256 271.4
Jul-11 292 273.6
Aug-11 247 275.9
Sep-11 254 278.2
Oct-11 258 280.5
Nov-11 264 282.8
Dec-11 301 285.1
Jan-12 319 287.4
Feb-12 314 289.7
Mar-12 274 291.9
Apr-12 325 294.2
May-12 319 296.4
Jun-12 339 298.8
Jul-12 339 301.0
Aug-12 271 303.3
Sep-12 310 305.7
Oct-12 291 307.9
Nov-12 259 310.2
Dec-12 286 312.5
Jan-13 314.8
Feb-13 317.1
Mar-13 319.2
Apr-13 321.5
May-13 323.8
Jun-13 326.1
Jul-13 328.4
Aug-13 330.7
Sep-13 333.0
Oct-13 335.2
Nov-13 337.6
Dec-13 339.8
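If you want to reproduce this kind of trend line outside of Excel (inside Excel, the SLOPE/INTERCEPT or TREND worksheet functions do the same job), here is a minimal Python sketch using numpy; the month index 1-36 and the observed values are simply copied from the table above, and the output should match the best-fit column up to rounding.

import numpy as np

# Observed values, Jan-10 through Dec-12, copied from the table above
y = np.array([218, 251, 221, 241, 261, 227, 253, 266, 238, 255, 238, 219,
              263, 239, 255, 297, 299, 256, 292, 247, 254, 258, 264, 301,
              319, 314, 274, 325, 319, 339, 339, 271, 310, 291, 259, 286])
x = np.arange(1, len(y) + 1)              # month index: 1 = Jan-10, ..., 36 = Dec-12

slope, intercept = np.polyfit(x, y, 1)    # ordinary least-squares straight line

fitted = intercept + slope * x                         # best-fit values for 2010-2012
forecast_2013 = intercept + slope * np.arange(37, 49)  # extrapolation for Jan-13 to Dec-13
print(forecast_2013.round(1))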
There are different ways you can apply linear regression. You could, for example, use all your data points to create one equation and calculate values for all the subsequent months. However, if there are yearly cycles, you might want to use the data for each January to estimate the next January, each February to estimate the next February, and so on. To keep it simple, let's just work with January for now. In order to keep the numbers smaller, I'm just going to use the last two digits of the year:
X Y
10 218
11 263
12 319
Next calculate 4 different sums:
S[x] = Sum of all Xs = 33
S[y] = Sum of all Ys = 800
S[xx] = Sum of X squared = 100 + 121 + 144 = 365
S[xy] = Sum of X*Y = 2180 + 2893 + 3828 = 8901
Calculate slope and intercept:
N = Number of data points sampled = 3
M = Slope = (N*S[xy] - S[x]*S[y])/(N*S[xx] - S[x]^2)
M = (3*8901 - 33*800)/(3*365 - 33^2) = 303/6 = 50.5
B = Intercept = (S[y] - M*S[x])/N
B = (800 - 50.5*33)/3 = -866.5/3 ≈ -289
Therefore the equation for January would be:
Y = M*X + B
Y = 50.5*X - 289
Calculate for the year 2013:
Y = 50.5*13 - 289 = 367.5 ≈ 368
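The same January-only fit can be checked with a few lines of Python (a sketch using the same three (year, value) pairs as above):

# January values for 2010-2012, using the last two digits of the year as X
xs = [10, 11, 12]
ys = [218, 263, 319]

n = len(xs)
s_x, s_y = sum(xs), sum(ys)
s_xx = sum(x * x for x in xs)
s_xy = sum(x * y for x, y in zip(xs, ys))

m = (n * s_xy - s_x * s_y) / (n * s_xx - s_x ** 2)  # slope: 50.5
b = (s_y - m * s_x) / n                             # intercept: about -288.8
print(round(m * 13 + b))                            # January 2013 estimate: 368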
Start by plotting your data. Decide what kind of function will be a good fit.
You can either create a fit for each month or try to create one that has both year and month as independent variables.
Let's assume that a polynomial fit for each month will work for you:
y = c0 + c1*m + c2*m^2, where m is the year
So for January:
218 = c0 + c1*2010 + c2*2010^2
263 = c0 + c1*2011 + c2*2011^2
319 = c0 + c1*2012 + c2*2012^2
So now you have three equations in three unknowns. Solve for (c0, c1, c2) and then substitute m = 2013 for your extrapolation.
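For January, for example, the three equations can be solved with numpy (a sketch; shifting the years to 0, 1, 2 keeps the system well conditioned and does not change the 2013 prediction):

import numpy as np

t = np.array([0, 1, 2])                # years 2010, 2011, 2012, shifted
values = np.array([218, 263, 319])     # January observations

# Vandermonde matrix with columns 1, t, t^2; solve for (c0, c1, c2)
A = np.vander(t, 3, increasing=True)
c0, c1, c2 = np.linalg.solve(A, values)

t_2013 = 3                             # 2013 on the shifted scale
print(c0 + c1 * t_2013 + c2 * t_2013 ** 2)   # about 386, as in the table below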
Here are the results I get:
Month 2010 2011 2012 2013
1 218 263 319 386
2 251 239 314 476
3 221 255 274 278
4 241 297 325 325
5 261 299 319 321
6 227 256 339 476
7 253 292 339 394
8 266 247 271 338
9 238 254 310 406
10 255 258 291 354
11 238 264 259 223
12 219 301 286 174
See how you do.
I was following this question to address a similar situation:
How to Calculate Area Under the Curve in Spotfire?
My data is in the following format:
PLANT  OBS_DATE_RECORDED  TRAIT_VALUE  period
A      3/16/2021          225          A3/16/2021
A      3/23/2021          227          A3/23/2021
A      3/30/2021          220          A3/30/2021
A      4/7/2021           240          A4/7/2021
A      4/13/2021          197          A4/13/2021
A      4/20/2021          197          A4/20/2021
A      4/27/2021          218          A4/27/2021
B      3/16/2021          253          B3/16/2021
B      3/23/2021          274          B3/23/2021
B      3/30/2021          271          B3/30/2021
B      4/7/2021           257          B4/7/2021
B      4/13/2021          250          B4/13/2021
A      4/20/2021          241          A4/20/2021
B      4/27/2021          255          B4/27/2021
Following the answer's formula as a calculated column:
([TRAIT_VALUE] + Avg([TRAIT_VALUE]) over (Intersect(NextPeriod([period]),[PLANT]))) / 2 * (Avg([OBS_DATE_RECORDED]) over (Intersect(NextPeriod([period]),[PLANT])) - [OBS_DATE_RECORDED])
However, the results don't appear correct.
AUDPC
1603.19:59:59.928
1608.17:59:59.956
2924.20:0:0.100
7732.21:0:0.000
1395.14:41:44.404
1461.23:30:0.050
-4393.7:59:59.712
I think the problem might be the date format, but I don't understand the formula well enough to troubleshoot. In Excel I usually compute the AUDPC with SUMPRODUCT, multiplying the days between two dates by the average TRAIT_VALUE between those two dates.
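The output above looks like it contains time spans rather than plain numbers, which suggests the date difference in the expression is not being turned into a number of days. For comparison, here is a minimal pandas sketch of the Excel-style calculation described above (a sum of trapezoids: days between consecutive dates times the average TRAIT_VALUE of the two dates); the column names come from the table above, and only the first three rows for plant A are used as sample data:

import pandas as pd

df = pd.DataFrame({
    "PLANT": ["A", "A", "A"],
    "OBS_DATE_RECORDED": ["3/16/2021", "3/23/2021", "3/30/2021"],
    "TRAIT_VALUE": [225, 227, 220],
})
df["OBS_DATE_RECORDED"] = pd.to_datetime(df["OBS_DATE_RECORDED"])
df = df.sort_values(["PLANT", "OBS_DATE_RECORDED"])

def audpc(group):
    days = group["OBS_DATE_RECORDED"].diff().dt.days             # interval length in days
    mean_value = (group["TRAIT_VALUE"] + group["TRAIT_VALUE"].shift()) / 2
    return (days * mean_value).sum()                             # sum of trapezoid areas

print(df.groupby("PLANT").apply(audpc))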
I am trying to apply normalization to my data. I have tried the conventional scaling techniques using the sklearn packages readily available for this kind of requirement; however, I am looking to implement something called decimal scaling.
I read about it in this research paper, and it looks like a technique that can improve the results of a neural network regression. As per my understanding, this is what needs to be done:
Suppose the range of attribute X is −4856 to 28. The maximum absolute value of X is 4856.
To normalize by decimal scaling I will need to divide each value by 10000 (c = 4). In this case, −4856 becomes −0.4856 while 28 becomes 0.0028.
So for all values: new value = old value / 10^c
How can I reproduce this as a function in Python so as to normalize all the features (column by column) in my data set?
Input:
A B C
30 90 75
56 168 140
28 84 70
369 1107 922.5
485 1455 1212.5
4856 14568 12140
40 120 100
56 168 140
45 135 112.5
78 234 195
899 2697 2247.5
Output:
A B C
0.003 0.0009 0.0075
0.0056 0.00168 0.014
0.0028 0.00084 0.007
0.0369 0.01107 0.09225
0.0485 0.01455 0.12125
0.4856 0.14568 1.214
0.004 0.0012 0.01
0.0056 0.00168 0.014
0.0045 0.00135 0.01125
0.0078 0.00234 0.0195
0.0899 0.02697 0.22475
Thank you guys for asking questions that led me to think about my problem more clearly and break it into steps. I have arrived at a solution. Here's what my solution looks like:
def Dec_scale(df):
    # Scale each column in place by 10**q, where q is the number of digits
    # of the column's largest absolute value.
    for x in df:
        p = df[x].abs().max()       # largest absolute value in the column
        q = len(str(int(p)))        # number of digits before the decimal point
        df[x] = df[x] / 10**q
I hope this solution looks agreeable!
import math
import numpy as np

def decimal_scaling(df):
    # Divide each column by 10**c, where c is the number of digits of the
    # column's largest absolute value, so every scaled value lies in (-1, 1).
    df_abs = abs(df)
    max_values = df_abs.max()
    log_num = []
    for i in range(max_values.shape[0]):
        log_num.append(int(math.log10(max_values.iloc[i])) + 1)
    scale = [pow(10, number) for number in log_num]
    X_full = df / np.array(scale)
    return X_full
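As a quick sanity check, either function can be applied to a few rows of the example input (a sketch assuming pandas is installed and decimal_scaling is defined as above):

import pandas as pd

data = pd.DataFrame({"A": [30, 56, 28, 4856],
                     "B": [90, 168, 84, 14568],
                     "C": [75, 140, 70, 12140]})

print(decimal_scaling(data))   # A is divided by 10**4, B and C by 10**5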
When extracting and moving data, the first column of the criteria works, but the second criterion is not being applied: it returns the movement for all stores that sold that item.
List of column headers:
R2 = Left Len, S2 = Store
A2 = Left Len, B2 = UPC, C2 = Store, D2 = Movement
The file is just short of 900k rows of data in total.
I believe it to be an issue with CurrentRegion.
I also need this to return zero if there is no movement for that store. This will be repeated 39 more times to the right in order to get results for each location.
The ultimate goal is to find the zero movers that need to be addressed, so the rows of UPCs need to stay aligned with the criteria.
Any help would be greatly appreciated.
Using Windows 7,
Office 2016
Sub Find_Fill_Data()
    ' Clear the previous results in columns U:X
    Range("U2:X" & Range("X" & Rows.Count).End(xlUp).Row).ClearContents
    ' Copy rows from A:D that match the criteria in R2's current region to U2
    Range("A2:D" & Range("D" & Rows.Count).End(xlUp).Row).AdvancedFilter Action:=xlFilterCopy, _
        CriteriaRange:=Range("R2").CurrentRegion, CopyToRange:=Range("U2"), Unique:=False
    Range("Q4").Select
End Sub
Left Len     Item                        5    7    8    9
1070002152 MILK DUDS THEATER BOX 123 254 181 196
1070002385 WHOPPERS MALT BALLS 19 0 28 42
1070002440 WHOPPERS MALT BALLS 92 188 79 133
1070002660 WHOPPERS MALT BALLS 22 21 11 22
1070006080 CANDY BAR 575 463 446 303
1070006611 WHOPPER ROBIN EGGS 22 28 25 0
1070008807 CANDY 132 57 59 0
1070008813 THEATER BOX 331 127 101 171
1070013272 J/RANCHER CRNCH CHEW ASST 61 0 0 0
1070050180 WHOPPERS MALT BALLS CARTN 119 24 99 99
I have a very large tab-separated file, part of which looks like this:
33 x 171 297 126
4 x 171 300 129
2 x 171 303 132
11 y 163 289 126
5 y 163 290 127
3 y 163 291 128
2 y 163 292 129
2 y 170 289 119
2 z 166 307 141
2 z 166 308 142
6 z 166 309 143
4 z 166 329 163
2 z 166 330 164
I want to sort the file and select only one line for each of x, y, and z, based on the highest value associated with it in the first column (in Unix).
You can do this with awk:
awk '
{
    key = $2
    flag = 0
    if (key in value) { max = value[key]; flag = 1 }
    # keep this line if the key is new, or if field 1 beats the stored maximum
    if (flag == 0 || max < $1) { value[key] = $1; line[key] = $0 }
}
END {
    for (key in line) { print line[key] }
}
' data.tsv
I am trying to use awk to parse a tab delimited table -- there are several duplicate entries in the first column, and I need to remove the duplicate rows that have a smaller total sum of the other 4 columns in the table. I can remove the first or second row easily, and sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than 2 duplicates.
Example file: http://pastebin.com/u2GBnm2D
Desired output in this case would be to remove the rows:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
And keep the other two rows with the same gene id in the column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here's what I have tried (this doesn't do anything useful as written, but the first part removes the second duplicate and the second part sums the counts):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
I tried modifying the 2nd part of the script in the best answer of this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
But unfortunately I don't really understand what's going on well enough to get it working. Can anyone help me out? I think I need to replace the a[$1] > 1 part with [remove the first or second duplicate, depending on which total count is larger].
EDIT: I'm also using GNU Awk 3.1.7 if that matters.
You can use this awk command:
awk 'NR == 1 {                       # always keep the header line
    print
    next
}
{
    s = $2 + $3 + $4 + $5            # total of the four count columns
}
s >= sum[$1] {                       # keep the row whose total is at least the stored one
    sum[$1] = s
    if (!($1 in rows))
        a[++n] = $1                  # remember the order in which gene ids first appear
    rows[$1] = $0
}
END {
    for (i = 1; i <= n; i++)
        print rows[a[i]]
}' file | column -t
Output:
gene SRR034450.out.rpkm_0 SRR034451.out.rpkm_0 SRR034452.out.rpkm_0 SRR034453.out.rpkm_0
lmo0001 160 323 533 293
lmo0002 135 317 504 306
lmo0003 1 4 5 3
lmo0004 35 59 58 48
lmo0005 113 218 257 187
lmo0006 279 519 653 539
lmo0007 563 1053 1165 1069
lmo0008 34 84 203 107
lmo0009 13 45 90 49
lmo0010 57 210 237 169
lmo0011 65 224 247 179
lmo0012 65 226 250 215
lmo0013 342 500 738 682
lmo0014 662 1032 1283 1311
lmo0015 321 413 631 637
lmo0016 175 253 273 325
lmo0017 3 6 6 6
lmo0018 33 38 46 45
lmo0019 13 1 39 1
lmo0020 3 12 28 15
lmo0021 3 4 14 12
lmo0022 2 3 5 1
lmo0023 2 0 3 2
lmo0024 1 0 2 6
lmo0330 1 1 1 3
lmo0506 151 232 60 204