Algorithm / Function to find weighting of parameters and how they influence the result - Excel

I'm trying to find a way to predict high Result values based on statistical data, which is why I need to find the weights of the parameters.
I have a data sample with the following structure:
Price A | Price B | Price C | Result
      5 |       4 |       9 |     80
      2 |       3 |       0 |     30
From that structure I would like to calculate weights for Price A, Price B and Price C in order to predict the highest values of the Result column. Based on the sample above, Price C seems to matter most, so the weight for Price C should be the highest one.
So for the data above:
weightA * priceA + weightB * priceB + weightC * priceC should generate a Result value of around 80 for the first row, and around 30 for the second row.
I've tried something with the Pearson correlation (in Excel), but I didn't get 'good' results :(.
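With more rows than prices this is an ordinary multiple linear regression, which Excel can fit directly with LINEST (or the Analysis ToolPak's Regression tool). As a minimal sketch of the same idea outside Excel, here is a least-squares fit in numpy; the two sample rows are used only for illustration, since two rows alone cannot pin down three weights:

import numpy as np

# Illustration only: the two sample rows from the question. With just two rows and
# three unknown weights the system is underdetermined, so lstsq returns the
# minimum-norm solution; with the full dataset it returns the least-squares weights.
X = np.array([[5.0, 4.0, 9.0],    # Price A, Price B, Price C
              [2.0, 3.0, 0.0]])
y = np.array([80.0, 30.0])        # Result

weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print(weights)      # weightA, weightB, weightC
print(X @ weights)  # reproduces the Results (80 and 30)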


Rank order data

I have the loan dataset below -
Sector       | Total Units | Bad units | Bad Rate
Retail Trade |          16 |         5 |      31%
Construction |         500 |      1100 |      20%
Healthcare   |         165 |        55 |      33%
Mining       |           3 |         2 |      67%
Utilities    |          56 |        19 |      34%
Other        |         300 |        44 |      15%
How can I create a ranking function to sort this data based on the bad_rate while also accounting for the number of units?
e.g. this is the result when I sort in descending order based on bad_rate:
Sector       | Total Units | Bad units | Bad Rate
Mining       |           3 |         2 |      67%
Utilities    |          56 |        19 |      34%
Healthcare   |         165 |        55 |      33%
Retail Trade |          16 |         5 |      31%
Construction |         500 |      1100 |      20%
Other        |         300 |        44 |      15%
Here, Mining shows up first, but I don't really care about this sector as it only has a total of 3 units. I would like Construction, Other and Healthcare to show up at the top, as they have a higher number of total units as well as bad units.
STEP 1) is easy...
Use SORT(range, by_col_number, order).
Just put it in the top-left cell of where you want your sorted data:
=SORT(B3:E8,4,-1)
STEP 2)
Here's the tricky part... you need to decide how to weight the outage.
Here, I found that multiplying the Rate% by the rank of Total Units gives a usable score.
I think this approach gives pretty good results... you just need to play with the formula!
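If you want to prototype that weighting outside Excel first, here is a rough pandas sketch of the same "Rate% times units rank" idea (the sector data is copied from the question; treat the score as a starting point to tweak, not a definitive formula):

import pandas as pd

# Sector data from the question.
df = pd.DataFrame({
    "Sector": ["Retail Trade", "Construction", "Healthcare", "Mining", "Utilities", "Other"],
    "Total Units": [16, 500, 165, 3, 56, 300],
    "Bad units": [5, 1100, 55, 2, 19, 44],
    "Bad Rate": [0.31, 0.20, 0.33, 0.67, 0.34, 0.15],
})

# Rank 1 = fewest units, so sectors with more units get a bigger multiplier.
df["units_rank"] = df["Total Units"].rank(method="min")
df["score"] = df["Bad Rate"] * df["units_rank"]
print(df.sort_values("score", ascending=False))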
Please let me know what formula you eventually use!
You would need to define a sorting criterion, since you don't have a priority based on a single column but on a combination instead. I would suggest defining a function that weights both columns, Total Units and Bad Rate. Before weighting, we need to normalize both columns, for example by putting the data in the range 0-100, so each column contributes on a similar scale. Once the data is normalized you can use a criterion like this:
w_1 * x + w_2 * y
This is the main idea. Now let's put this logic into Excel. We create an additional temporary variable with the previous calculation and name it crit. We define a user LAMBDA function SORT_BY for calculating crit as follows:
LAMBDA(a,b, wu*a + wbr*b)
and we use MAP to calculate it with the normalized data. For convenience we define another user LAMBDA function to normalize the data: NORM as follows:
LAMBDA(x, 100*(x-MIN(x))/(MAX(x) - MIN(x)))
Note: The above formula ensures a 0-100 range, but because we are going to use weights it may be better to use a 1-100 range, so the weight takes effect for the minimum value too. In that case it can be defined as follows:
LAMBDA(x, ( 100*(x-MIN(x)) + (MAX(x)-x) )/(MAX(x)-MIN(x)))
Here is the formula normalizing for 0-100 range:
=LET(wu, 0.6, wbr, 0.8, u, B2:B7, br, D2:D7, SORT_BY, LAMBDA(a,b, wu*a + wbr*b),
NORM, LAMBDA(x, 100*(x-MIN(x))/(MAX(x) - MIN(x))),
crit, MAP(NORM(u), NORM(br), LAMBDA(a,b, SORT_BY(a,b))),
DROP(SORT(HSTACK(A2:D7, crit),5,-1),,-1))
You can customize how to weight each column (via wu for Total Units and wbr for Bad Rates columns). Finally, we present the result removing the sorting criteria (crit) via the DROP function. If you want to show it, then remove this step.
If you put the formula in F2, the sorted table spills there as the output.
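For anyone who prefers prototyping the same normalize-then-weight logic outside Excel, here is a rough pandas equivalent (same sector data as above; wu and wbr mirror the 0.6 and 0.8 weights in the LET formula):

import pandas as pd

df = pd.DataFrame({
    "Sector": ["Retail Trade", "Construction", "Healthcare", "Mining", "Utilities", "Other"],
    "Total Units": [16, 500, 165, 3, 56, 300],
    "Bad units": [5, 1100, 55, 2, 19, 44],
    "Bad Rate": [0.31, 0.20, 0.33, 0.67, 0.34, 0.15],
})

def norm(s):
    # Scale a column to the 0-100 range, mirroring the NORM LAMBDA.
    return 100 * (s - s.min()) / (s.max() - s.min())

wu, wbr = 0.6, 0.8  # weights for Total Units and Bad Rate
crit = wu * norm(df["Total Units"]) + wbr * norm(df["Bad Rate"])
print(df.assign(crit=crit).sort_values("crit", ascending=False).drop(columns="crit"))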

Find Rank of a Variable in my Dataframe within For Loop

I understand how to add a new column that shows the Rank of the number, but I am looking to change this to show the rank of a variable in that column...
list_of_values = [1,14,125,23,12]
df['price'] contains all 500 of my prices, and I'd like to see how 1 compares to these 500, or how 125 ranks. Ties should reflect the minimum (e.g. if there are two values of price=1, the ranking should be 500/500 for both).
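No answer is shown above, so here is a minimal sketch of one way to read the requirement, assuming a descending rank (1 = highest price) where tied values share the worst position, so that two prices of 1 both come out as 500/500 as in the example; the prices below are random stand-ins for the real column:

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: df['price'] holds 500 prices.
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.integers(1, 200, size=500)})
list_of_values = [1, 14, 125, 23, 12]

n = len(df)
for v in list_of_values:
    # Descending rank: counting how many prices are >= v puts tied values at the
    # same (worst) position, e.g. two prices of 1 would both rank 500/500.
    rank = int((df["price"] >= v).sum())
    print(f"{v} ranks {rank}/{n}")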

Survival rates >1 when using mrOpen in the package FSA

I am currently doing some population analyses with the package "FSA" in R.
By using the mrOpen command, I want to get the survival rate.
My raw data is a simple table with one row per individual, one column per sample date, and values of 0 and 1 (for not captured or captured during that respective sampling).
id | total.captures | date1 | date2 | date3 | etc
 1 |              3 |     1 |     1 |     1 | ...
 2 |              1 |     1 |     0 |     0 | ...
The first two columns contain the individual id and the aggregated number of captures, which is why I excluded them from the analysis.
This is the exact code:
hold.data <- capHistSum(data, cols2use = c(3:13))  # capture-history summary from the date columns
est.data <- mrOpen(hold.data)                      # Jolly-Seber open-population estimates
summary(est.data)
confint(est.data)
It seems to work out, as I get the tables and summaries with all the parameters. See here as an example:
[Screenshot: mrOpen summary output]
However, there's a problem with the survival estimate phi: the value is not between 0 and 1 but, in some cases, exceeds 1.
Any idea what went wrong here?
Thanks,
Pia

Calculations within groupby in pandas, python

My data set contains house prices for 4 different house types (A, B, C, D) in 4 different countries (USA, Germany, UK, Sweden). The house price movement can be only one of three types (Upward, Downward, and Not Changed). I want to calculate the Diffusion Index (DI) for the different house types (A, B, C, D) in the different countries (USA, Germany, UK, Sweden) based on the house price.
The formula that I want to use to calculate the Diffusion Index (DI) is:
DI = (Total Number of Upward * 1 + Total Number of Downward * 0 + Total Number of Not Changed * 0.5) / (Total Number of Upward + Total Number of Downward + Total Number of Not Changed)
My data and the expected result are attached as screenshots.
I really need your help.
Thanks.
You can do this by using groupby, assuming your file is named test.xlsx:
import pandas as pd

df = pd.read_excel('test.xlsx')
# Map the three movement types to their DI scores; the group mean is then the DI.
df = df.replace({'Upward': 1, 'Downward': 0, 'Notchanged': 0.5})
df.groupby('Country').mean().reset_index()
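Since the original spreadsheet isn't available, here is a self-contained sketch with made-up rows (the column names House type and Price are assumptions). Grouping by both Country and House type gives one DI per combination, because the mean of the mapped {1, 0, 0.5} scores is exactly the formula from the question:

import pandas as pd

# Illustrative data only; the real file has 4 countries and 4 house types.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Germany", "Germany", "Germany"],
    "House type": ["A", "A", "B", "A", "B", "B"],
    "Price": ["Upward", "Downward", "Notchanged", "Upward", "Upward", "Downward"],
})

# Map movements to DI scores, then average within each (Country, House type) group.
scores = df["Price"].map({"Upward": 1, "Downward": 0, "Notchanged": 0.5})
di = scores.groupby([df["Country"], df["House type"]]).mean().rename("DI").reset_index()
print(di)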

Resampling on non-time related buckets

Say I have a df looking like this:
   price  quantity
0    100        20
1    102        31
2    105        25
3     99        40
4    104        10
5    103        20
6    101        55
There are no time intervals here. I need to calculate a Volume Weighted Average Price for every 50 items of quantity. Every row (index) in the output would represent 50 units (as opposed to, say, 5-min intervals), and the output column would be the volume-weighted price.
Any neat way to do this using pandas, or numpy for that matter? I tried using a loop that splits every row into single-item prices and then groups them like this:
import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk
But it takes forever and I run out of memory... The df is a few million rows.
EDIT:
The output I want to see based on the above is:
     vwap
0  101.20
1  102.12
2  103.36
3  101.00
Each 50 items gets a new average price.
I struck out on my first at-bat facing this problem. Here's my next plate appearance. Hopefully I can put the ball in play and score a run.
First, let's address some of the comments related to the expected outcome of this effort. The OP posted what he thought the results should be using the small sample data he provided. However, @user7138814 and I both came up with the same outcome, which differed from the OP's. Let me explain how I believe the weighted average of exactly 50 units should be calculated using the OP's example. I'll use this worksheet as an illustration.
The first 2 columns (A and B) are the original values given by the OP. Given those values, the goal is to calculate a weighted average for each block of exactly 50 units. Unfortunately, the quantities are not evenly divisible by 50. Columns C and D represent how to create even blocks of 50 units by subdividing the original quantities as needed. The yellow shaded areas show how the original quantity was subdivided, and each of the green bounded cells sums to exactly 50 units. Once 50 units are determined, the weighted average can be calculated in column E. As you can see, the values in E match what @user7138814 posted in his comment, so I think we agree on the methodology.
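To make that concrete with the OP's numbers: the first block of 50 units takes all 20 units at 100 plus 30 of the 31 units at 102, giving (20*100 + 30*102)/50 = 101.2; the leftover 1 unit at 102, the 25 units at 105 and 24 of the 40 units at 99 form the second block, giving (1*102 + 25*105 + 24*99)/50 = 102.06. Those are the first two values produced by the function below.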
After much trial and error, the final solution is a function that operates on the numpy arrays of the underlying price and quantity series. The function is further optimized with the Numba decorator to jit-compile the Python code into machine-level code. On my laptop, it processed a 3 million row array in a second.
Here's the function.
import numba
import numpy as np

@numba.jit
def vwap50_jit(price_col, quantity_col):
    n_rows = len(price_col)
    assert len(price_col) == len(quantity_col)
    qty_cumdif = 50  # cum difference of quantity to track when 50 units are reached
    pq = 0.0         # cumsum of price * quantity
    vwap50 = []      # list of weighted averages
    for i in range(n_rows):
        price, qty = price_col[i], quantity_col[i]
        # if current qty will cause more than 50 units
        # divide the units
        if qty_cumdif < qty:
            pq += qty_cumdif * price
            # at this point, 50 units accumulated. calculate average.
            vwap50.append(pq / 50)
            qty -= qty_cumdif
            # continue dividing
            while qty >= 50:
                qty -= 50
                vwap50.append(price)
            # remaining qty and pq become starting
            # values for next group of 50
            qty_cumdif = 50 - qty
            pq = qty * price
        # process price, qty pair as-is
        else:
            qty_cumdif -= qty
            pq += qty * price
    return np.array(vwap50)
Results of processing the OP's sample data.
Out[6]:
   price  quantity
0    100        20
1    102        31
2    105        25
3     99        40
4    104        10
5    103        20
6    101        55
vwap50_jit(df.price.values, df.quantity.values)
Out[7]: array([101.2 , 102.06, 101.76, 101. ])
Notice that I pass the numpy arrays of the pandas series via .values. That's one of the requirements of using numba: numba is numpy-aware and doesn't work on pandas objects.
It performs pretty well on 3 million row arrays, creating an output array of 2.25 million weighted averages.
df = pd.DataFrame({'price': np.random.randint(95, 150, 3000000),
                   'quantity': np.random.randint(1, 75, 3000000)})
%timeit vwap50_jit(df.price.values, df.quantity.values)
154 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vwap = vwap50_jit(df.price.values, df.quantity.values)
vwap.shape
Out[11]: (2250037,)
