I have a table with x,y values. I want to interpolate between the y values for a given x1 value using the HLOOKUP function. I have found formulas for VLOOKUP and XLOOKUP but not for HLOOKUP. I cannot use XLOOKUP because of the version of Excel I use.
Example:
x-values 0.2 0.5 0.8 1.0 1.25 1.5 1.75 2.0 2.5 3.0 4.0
y-values 0.1 0.11 0.12 0.15 0.18 0.2 0.23 0.24 0.28 0.31 0.32
I need the y-value for x=1.1
I appreciate any help
There are various ways to interpolate: spline, polynomial, linear and so on.
I assume that you want linear interpolation between 2 x values.
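In this case, you first need to find the closest lower and closest larger x values. Here the x-values are in B1:L1, the y-values in B2:L2 and the target x in B5; in Excel versions without dynamic arrays, enter these as array formulas with Ctrl+Shift+Enter: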
Lower x:
=MAX(IF(B1:L1<B5,B1:L1))
Larger x:
=MIN(IF(B1:L1>B5,B1:L1))
Now you need to find the corresponding y's with HLOOKUP.
Lower x's y:
=HLOOKUP(A9,B1:L2,2,FALSE)
Larger x's y:
=HLOOKUP(B9,B1:L2,2,FALSE)
Now that you have all the needed values, you can write the linear interpolation formula yourself or use the Excel function FORECAST. With 2 x's and 2 y's it works as linear interpolation.
=FORECAST(B5,A11:B11,A9:B9)
Formula without using helper cells:
=FORECAST(B5,CHOOSE({1,2},HLOOKUP(MAX(IF(B1:L1<B5,B1:L1)),B1:L2,2,FALSE),HLOOKUP(MIN(IF(B1:L1>B5,B1:L1)),B1:L2,2,FALSE)),CHOOSE({1,2},MAX(IF(B1:L1<B5,B1:L1)),MIN(IF(B1:L1>B5,B1:L1))))
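As a cross-check outside Excel, the same bracketing-and-interpolating logic can be sketched in Python. This is only an illustration under assumptions: the function name `interpolate` is made up here, and unlike the strict `<` and `>` comparisons in the formulas above, the sketch returns the table value directly when x matches a table entry exactly.

```python
from bisect import bisect_left

# Table from the question
xs = [0.2, 0.5, 0.8, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 4.0]
ys = [0.1, 0.11, 0.12, 0.15, 0.18, 0.2, 0.23, 0.24, 0.28, 0.31, 0.32]


def interpolate(x, xs, ys):
    """Linear interpolation between the two table points that bracket x."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the table range")
    i = bisect_left(xs, x)  # index of the first table x that is >= x
    if xs[i] == x:
        return ys[i]  # exact match: no interpolation needed
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


result = interpolate(1.1, xs, ys)
print(round(result, 4))  # 0.162
```

For x = 1.1 the bracketing points are (1.0, 0.15) and (1.25, 0.18), which is what the MAX/MIN and HLOOKUP steps pick out in the spreadsheet.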
I just learned that you can handle missing data/NaN with imputation and interpolation. What I have found so far is that interpolation is a type of estimation, a method of constructing new data points within the range of a discrete set of known data points, while imputation is replacing the missing data with the mean of the column. But are there any differences beyond that? When is it best practice to use each of them?
Interpolation
Interpolation (linear) is basically a straight line between two given points where data points between these two are missing:
(Figure: the two red points are known; the blue point between them is missing. Source: Wikipedia.)
OK, nice explanation, but show me with data.
First of all, the formula for linear interpolation is the following:
y = y0 + (x - x0) * (y1 - y0) / (x1 - x0)
Let's say we have the three data points from the graph above:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Value':[0, np.nan, 3]})
Value
0 0.0
1 NaN
2 3.0
As we can see, row 1 (blue point) is missing.
So, following the formula above with (x0, y0) = (0, 0), (x1, y1) = (2, 3) and x = 1:
0 + (1-0) * (3-0) / (2-0) = 1.5
If we interpolate these using the pandas method Series.interpolate:
df['Value'].interpolate()
0 0.0
1 1.5
2 3.0
Name: Value, dtype: float64
For a bigger dataset it would look as follows:
df = pd.DataFrame({'Value':[1, np.nan, 4, np.nan, np.nan, 7]})
Value
0 1.0
1 NaN
2 4.0
3 NaN
4 NaN
5 7.0
df['Value'].interpolate()
0 1.0
1 2.5
2 4.0
3 5.0
4 6.0
5 7.0
Name: Value, dtype: float64
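To make the arithmetic behind that result visible, here is a minimal pure-Python sketch that fills gaps the same way for this example. It assumes equally spaced rows and known first and last values; `fill_linear` is a hypothetical helper written for illustration, not part of pandas.

```python
def fill_linear(values):
    """Fill None gaps by linear interpolation over the list positions.

    Assumes the first and last entries are known (no extrapolation).
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of neighbouring known points
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


print(fill_linear([1, None, 4, None, None, 7]))
# [1, 2.5, 4, 5.0, 6.0, 7]
```

Each gap is filled from its own pair of neighbours: 2.5 lies halfway between 1 and 4, while 5 and 6 split the distance between 4 and 7 into thirds.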
Imputation
When we impute the data with the (arithmetic) mean, we use the following formula, computed over the known (non-missing) points:
sum(all points) / n
So for our second dataframe we get:
(1 + 4 + 7) / 3 = 4
So if we impute our dataframe with Series.fillna and Series.mean:
df['Value'].fillna(df['Value'].mean())
0 1.0
1 4.0
2 4.0
3 4.0
4 4.0
5 7.0
Name: Value, dtype: float64
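For comparison with interpolation, mean imputation can be sketched in plain Python as well. `impute_mean` is a hypothetical helper for illustration; it computes the mean over the known values only and writes that single number into every gap.

```python
def impute_mean(values):
    """Replace every None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


print(impute_mean([1, None, 4, None, None, 7]))
# [1, 4.0, 4.0, 4.0, 4.0, 7]
```

Unlike interpolation, the filled value does not depend on where the gap sits: every missing row gets 4.0 regardless of its neighbours.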
I will answer the second part of your question i.e. when to use what.
We use both techniques depending upon the use case.
Imputation:
Suppose you are given a dataset of patients with a disease (say pneumonia) and one of the features is body temperature. If there are null values for this feature, you can replace them with the average value, i.e. imputation.
Interpolation:
Suppose you are given a dataset of the share price of a company. You know that the market is closed every Saturday and Sunday, so those values are missing. These values can be estimated from the Friday value and the Monday value, i.e. interpolation.
So, you can choose the technique depending upon the use case.
I'm trying to count matches between columns B and C when the value in A is above a certain threshold.
0.99 p269 p269
0.99 p312 p312
0.64 p249 p249
0.64 p247 p247
0.09 p243 p284
I'm trying the COUNTIFS function but it doesn't work.
=COUNTIFS(
A1:A31968,">" & F2,
B1:B31968,C1:C31968
)
The first part works (F2 is my threshold), but then I want to check all rows.
So when my threshold is 0.5 I want 4 as a result. When the threshold is 0.08 I still want 4 because the labels of the fifth row don't match. How do I do this?
One option would be to add a fourth column, D, to the spreadsheet, containing the following formula:
=IF(B1=C1, 1, 0)
Here is what your spreadsheet looks like now:
A B C D
0.99 p269 p269 1
0.99 p312 p312 1
0.64 p249 p249 1
0.64 p247 p247 1
0.09 p243 p284 0
In other words, if columns B and C agree, D contains 1, otherwise 0. Then you can use the following COUNTIFS formula:
=COUNTIFS(A1:A5,">0.5",D1:D5,"=1")
Here we check the 0.5 threshold on column A as you were already doing, but we also check that the B and C values are in agreement.
The other option is to use a pseudo-array formula:
=SUMPRODUCT((A1:A5>F2)*(B1:B5=C1:C5))
to combine the two conditions. It doesn't have to be entered as an array formula, but may have performance issues if used on several thousand rows of data.
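The row-by-row logic that SUMPRODUCT applies can be sketched in Python to verify the expected counts. The data is the five rows from the question; `count_matches` is a name chosen here for illustration.

```python
# (A, B, C) rows from the question
rows = [
    (0.99, "p269", "p269"),
    (0.99, "p312", "p312"),
    (0.64, "p249", "p249"),
    (0.64, "p247", "p247"),
    (0.09, "p243", "p284"),
]


def count_matches(rows, threshold):
    """Count rows where A exceeds the threshold AND B equals C."""
    return sum(1 for a, b, c in rows if a > threshold and b == c)


print(count_matches(rows, 0.5))   # 4
print(count_matches(rows, 0.08))  # 4
```

With the threshold at 0.08 the fifth row passes the first condition but fails the label comparison, so the count stays at 4, as the question requires.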
I have a column of coefficients and values
(Column A) (column B)
0.5 17.0
0.2 15.0
1.0 21.0
0.7 30.0
I want to combine a constant with each coefficient in the column and sum the results, e.g.
(1.0-0.5)*17.0 + (1.0-0.2)*15.0 + (1.0-1.0)*21.0 + (1.0-0.7)*30.0
Here, the constant is 1.0. What is the equation that is needed to achieve that? I have tried something like
SUMPRODUCT((1-A:A),B:B)
Without success.
How about:
=COUNT(A:A)-SUM(A:A)
This is for the constant 1. If the constant were 7, then use:
=7 * COUNT(A:A)-SUM(A:A)
EDIT#1:
Based on your edit, I tried your proposed formula and it worked just fine!
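To see how the two formulas in this thread differ, here is a short Python sketch using the four rows from the question. The variable names are illustrative only. The COUNT/SUM formula sums (constant - coefficient) on its own, while the SUMPRODUCT formula additionally weights each term by the value in column B.

```python
coefficients = [0.5, 0.2, 1.0, 0.7]   # column A
values = [17.0, 15.0, 21.0, 30.0]     # column B
constant = 1.0

# Equivalent of =COUNT(A:A)-SUM(A:A), generalised to any constant
unweighted = constant * len(coefficients) - sum(coefficients)

# Equivalent of =SUMPRODUCT((1-A:A),B:B) restricted to the data rows
weighted = sum((constant - a) * b for a, b in zip(coefficients, values))

print(round(unweighted, 6))  # 1.6
print(round(weighted, 6))    # 29.5
```

The weighted result, 29.5, matches the expression written out in the question: 8.5 + 12 + 0 + 9.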
I have two series that overlay pretty close in time but not exactly. So I need to plot each series with the corresponding time component to get the match. Also the number of points is different, by a factor of 10.
How can I plot two scatter series on the same chart using the time domain as the x-axis? E.g.
t1: 0.1 0.3 0.5 ...
y1: 3 7 9 ...
t2: 0.18 0.21 0.34 0.41 0.56 ...
y2: 32 55 4 7 1 ...
As you can see I can't just highlight all because the series don't match up so well in time.
If you want both series on one chart sharing the time axis, it may be simplest to plot one series (say y1), select the Plot Area, choose Select Data... and add your second (y2) series.
It's also pretty easy to make the chart using the first range, then copy the X and Y values for the second series (hopefully it's in adjacent columns, but you can use Ctrl+Select to select multiple areas). Then select the chart, and use Paste Special to add the copied data as a new series, in columns, X values in first column.
I am a beginner in programming in general and R specifically.
I would like to generate a set of random numbers in a normal distribution but to limit the decimal places in these numbers to only 2.
I have been using x1 <- runif() to generate my numbers.
Can I add something to it to enable me to only get results rounded off to 2 decimal places?
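You can limit the decimal places using the round() function.
If I understand your question correctly, this should do the trick. Note that runif() draws from a uniform distribution; for a normal distribution, rnorm() is the counterpart, and round() works on its output in the same way: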
x1 <- round(runif(5, min = 0, max = 1), digits = 2)
x1
The results, which will be different each time, are:
[1] 0.55 0.55 0.75 0.85 0.13
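For readers working in Python rather than R, the same idea can be sketched with the standard library. This is an illustrative counterpart, not a translation of the answer: `random.uniform` plays the role of runif(), `random.gauss` draws from a normal distribution, and the fixed seed is an assumption added only to make the sketch reproducible.

```python
import random

random.seed(0)  # fixed seed only so repeated runs give the same draws

# Five uniform draws on [0, 1], rounded to 2 decimal places
uniform_draws = [round(random.uniform(0, 1), 2) for _ in range(5)]

# Five normal draws (mean 0, standard deviation 1), rounded the same way
normal_draws = [round(random.gauss(0, 1), 2) for _ in range(5)]

print(uniform_draws)
print(normal_draws)
```

Rounding happens after the draw, so the values are still random samples; only their precision is reduced.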