I have a dataset of size (61573, 25). The rows represent users and the columns represent views of particular movie genres. For example, data[i, j] == 3 means that user i has viewed 3 movies of genre j in total. As expected, the rows are sparse and right-skewed.
What I would like to do is compute how engaged a user is with each of the 25 movie genres by assigning one of the following tags: {VL, L, A, H, VH}.
What I have tried so far is to compute z-scores, either row- or column-wise (I haven't tried standardizing twice, though, i.e. first over rows and then over columns), and then apply the following mapping depending on how far the z-scores are from 0:
(-oo, -2] --> VL
(-2, -1] --> L
(-1, +1) --> A
[+1, +2) --> H
[+2, +oo) --> VH
In either case, my problem is that the results seem very bad in most cases, probably because the z-scores lie between -1 and +1 and are thus almost always mapped to A (i.e. average). So, what else should I try? How would YOU approach this problem?
Z-scores are clearly not the right way to go.
The reason is that they are based on the assumption that your data is normally distributed. I have strong doubts that your data is normally distributed - in particular, it probably doesn't have any negative values, does it?
Have you tried just using quantiles? Top 10%, bottom 10%, etc.?
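To make the quantile idea concrete, here is a minimal Python/pandas sketch; the 10/30/70/90 cut-offs and the randomly generated counts are purely illustrative stand-ins, not your data or a recommendation:

import numpy as np
import pandas as pd

# Stand-in for the (61573, 25) users-by-genres view-count matrix.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.poisson(2, size=(61573, 25)))

# Per-genre quantile bins: bottom 10% -> VL, 10-30% -> L, 30-70% -> A,
# 70-90% -> H, top 10% -> VH. Ranking first breaks ties so qcut always
# finds five distinct bin edges despite the many repeated small counts.
def tag_column(col):
    return pd.qcut(col.rank(method="first"),
                   q=[0, 0.10, 0.30, 0.70, 0.90, 1],
                   labels=["VL", "L", "A", "H", "VH"])

tags = data.apply(tag_column, axis=0)
print(tags.head())

Because the bins are defined per genre, every genre gets the full spread of tags instead of nearly everything collapsing to A.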
I have a dataset with 9448 data points (rows)
Whenever I choose values of K between 1 and 10, the accuracy comes out to be 100 percent (which is the ideal case, of course!), which is weird.
If I choose my K value to be 100 or above, the accuracy decreases gradually (95% to 90%).
How does one choose the value of K? We want decent accuracy, not an implausible 100 percent.
Well, a simple approach to selecting k is sqrt(no. of data points). In this case, that is sqrt(9448) = 97.2 ≈ 97. Please keep in mind that it is inappropriate to say which k value suits best without looking at the data. If training samples of similar classes form clusters, then k values from 1 to 10 will achieve good accuracy. If the data is randomly distributed, then one cannot say which k value will give the best results. In such cases, you need to find it by performing an empirical analysis.
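If you want to do that empirical analysis rather than rely on the sqrt rule, a small cross-validation sweep over k is usually enough. Here is a rough sketch with scikit-learn; the generated dataset is just a stand-in for your 9448 rows:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; swap in your own features X and labels y.
X, y = make_classification(n_samples=9448, n_features=20, random_state=0)

# Score a spread of k values with 5-fold cross-validation and keep the best.
candidate_ks = [1, 5, 11, 21, 51, 97, 151]
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in candidate_ks}
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)

If accuracy stays at 100% across folds for many k, it is worth checking for duplicated rows or label leakage before tuning k any further.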
I am trying to port a very simple Excel sheet into Matlab code (I am not completely satisfied with Excel Solver!). My problem is this:
I have two materials (say A and B) with their properties (density, viscosity, etc.) and prices, and I mix them to obtain a third material (say C), whose properties are a mix (not necessarily linear) of the two and which, if it respects some limits (i.e. max density X, max viscosity Y), can be sold for a certain price. What I have is a function that takes the quantities of A and B, their properties, their prices, material C's limits, and material C's price. It then computes a profit, i.e. price C * (quantity A + quantity B) - (price A * quantity A + price B * quantity B), and an indicator vector that tells me whether all the property limits are satisfied in material C (basically it compares limits and actual properties, putting 0 if ok and 1 otherwise, so if all properties are respected, the mean of that vector should be 0).
Thus I have:
[profit, ok] = blend([qA, qB], [specA, specB], [pA, pB], [limits], pC)
and I want to maximize profit by changing quantity A and quantity B, subject to the ok vector being 0 and qA + qB being less than a specified maximum quantity. The real problem is imposing that the ok vector equals 0. I thought about moving the limit check outside of the function, but I can only check whether the limits are respected once the function has calculated the properties of the blend, so I cannot move it outside. Is there a solution to this? Many thanks!
What you are looking for is called nonlinear constrained optimization, and most likely specifically the function fmincon.
I am afraid it is probably best if you unpack your function to fit into the standard scheme. This is incidentally a very good thing to learn, as it is the common way to express such problems.
The call is
x=fmincon(fun,x0,A,b,Aeq,beq,lb,ub,nonlcon)
You have two parameters, giving the quantities of your materials, or if you normalize the total quantity to 1, you could get by with just one parameter x and express the other as (1-x) in all equations.
So you would need to write one function fun that just computes the profit based on the parameters.
The material constraints then go into the remaining parameters. You could put all constraints into the ceq return of nonlcon, so that it returns zero when the mix is ok.
However, cleaner and more efficient is to encode all linear constraints using the A and b matrices.
For more details I would require the actual constraints and functions that you have.
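For readers who would rather prototype outside MATLAB, the same unpacking can be sketched with SciPy's minimize; the blend function, property values, and prices below are placeholders, not the asker's actual model:

import numpy as np
from scipy.optimize import LinearConstraint, NonlinearConstraint, minimize

# Placeholder prices, limits, and maximum total quantity.
pA, pB, pC = 10.0, 15.0, 14.0
limits = np.array([0.9, 3.0])          # e.g. max density, max viscosity
max_qty = 100.0

def blend_properties(q):
    # Toy mixing rule: quantity-weighted average of two made-up property vectors.
    qA, qB = q
    propsA = np.array([0.85, 2.0])
    propsB = np.array([0.95, 4.0])
    return (qA * propsA + qB * propsB) / max(qA + qB, 1e-9)

def neg_profit(q):
    qA, qB = q
    return -(pC * (qA + qB) - (pA * qA + pB * qB))

constraints = [
    # Nonlinear property limits (the role of nonlcon): properties - limits <= 0.
    NonlinearConstraint(lambda q: blend_properties(q) - limits, -np.inf, 0.0),
    # Linear total-quantity limit (the role of A and b): qA + qB <= max_qty.
    LinearConstraint([[1.0, 1.0]], -np.inf, max_qty),
]

res = minimize(neg_profit, x0=[1.0, 1.0], method="trust-constr",
               bounds=[(0.0, np.inf), (0.0, np.inf)], constraints=constraints)
print(res.x, -res.fun)

The structure mirrors the fmincon call: an objective, bounds, a linear constraint, and a nonlinear constraint that must stay non-positive when the mix is within spec.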
I'm trying to find the rolling mean of a time series while ignoring values that do not follow the trend.
x
869
1570
946
0
1136
So, what I would want the result to look like is...
x | y
869 | 0
1570 | 0
946 | 1128.33
3 | 0
1136 | 1217.33 ([1136+1570+946]/3)
900 | 2982 ([946+1136+900]/3)
860 | 2896
The tough part here is that if the row I'm on is a trending value, I want to take the 3 previous trending values and find the mean of them, but if it's a non-trending value I want it to just zero out. Sometimes I might have to skip 2 or 3 previous lines to get 3 trending values to average as well.
So far I've been using array formulas and RC-style (R1C1) references in a VBA macro, but I'm not sure I can use RC here or whether it has to be something else completely. Any help would be greatly appreciated.
I believe I can help you with your problem. First, three notes:
1) It appears to me that you are trying to do DCA on smoothed production profiles, ignoring months with an incomplete record or no data. I'm making this assumption since you mentioned this is time series data but didn't give a sample rate.
2) I've added some extra 'data' for the sake of demoing.
3) In the example you shared, for the last two values in your 'Y' column it looks like you summed but forgot to divide.
The solution I came up with has three parts: 1) create a metric to identify 'outliers'; 2) flag 'outliers'; 3) smooth the non-flagged data. Let's establish some worksheet infrastructure and say that your production values are in column B and the associated times are in column A, as follows:
Part 1) In column 'C', estimate a rough value based on a trend approximated from two points on either side of your current time step, and subtract the actual value from this approximation. The result will be large and positive for a time step with little or no production. Copy this down to the end of your data.
=(INTERCEPT(B1:B6,A1:A6)+(A4*SLOPE(B1:B6,A1:A6)))-B4
Part 2) In column 'D', add a condition for when the value computed above is larger than the actual data point. Have it use '0' to identify a point that shouldn't be included in your average. Copy this down to the end of your data as well.
=IF(C4>B4,0,1)
Our sheet now looks like this:
Part 3) Your three-element average can now be computed. In the last cell of column 'E', enter the following array formula. You have to accept this formula by pressing Ctrl + Shift + Enter. Once that is done, fill the column from bottom to top:
=IFERROR(IF(D17=1,AVERAGE(INDEX(B12:B17,MATCH(2,1/(FIND(1,D12:D17)))),INDEX(B12:B16,MATCH(2,1/(FIND(1,D12:D16)))-COUNTIF(D17,"=0")),INDEX(B12:B15,MATCH(2,1/(FIND(1,D12:D15)))-COUNTIF(D16:D17,"=0"))),0),"")
This averages the most recent three trending values and allows for skipping up to three time steps of outlier data, per your problem statement. For an idea of how the completed sheet looks:
This was a fun challenge. I have some ideas for a more efficient formula, but this should get the job done. Please let me know how this works for you!
Cheers
[EDIT]
An alternative approach, which allows the user to specify the number of previous entries to include, is detailed below. This is more general (and the preferred alternative) and takes the place of step 3 described above.
3 Alt) In cell G2, enter the number of previous values to average; for this example I am sticking with 3. In cell E4, enter the following array expression (Ctrl + Shift + Enter) and drag it to the end of column E:
=IFERROR(IF(D4=1,SUM(INDEX(D:D,LARGE(($D$4:D4=1)*ROW($D$4:D4),$G$2)):D4 * INDEX(B:B,LARGE(($D$4:D4=1)*ROW($D$4:D4),$G$2)):B4)/$G$2,0),"")
This uses the LARGE function to find the nth largest value, where n is the number of preceding values from the current time step to average. It then builds a range that extends from the found cell to the current time step, multiplies the flags (0's and 1's) by each month's production value, sums them, and divides by n. In this way, months flagged as bad are set to 0 and not included in the sum.
This is a much cleaner way to achieve the desired result and has the flexibility to average over different periods of time. See the example of the final value below.
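For anyone who later wants the same flag-then-average logic outside Excel, here is a rough Python/pandas sketch; the outlier rule below is a simple placeholder (anything far below the series median), not the regression-based check from Part 1, and the output uses the corrected sum-divided-by-3 averages:

import pandas as pd

# Toy series mirroring the example; the 0 stands in for a non-trending month.
s = pd.Series([869, 1570, 946, 0, 1136, 900, 860], name="x")

# Placeholder flag: anything below 10% of the median is treated as non-trending.
good = s > 0.1 * s.median()

def last_three_good_mean(i):
    # Most recent 3 non-flagged values up to and including row i.
    vals = s.iloc[:i + 1][good.iloc[:i + 1]].tail(3)
    return round(vals.mean(), 2) if len(vals) == 3 else 0

y = [last_three_good_mean(i) if good.iloc[i] else 0 for i in range(len(s))]
print(pd.concat([s, pd.Series(y, name="y")], axis=1))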
I'm going to try to explain this the best that I can.
Right now I have a spreadsheet with a list of football players, each of whom has an assigned salary and projected point total for the week.
My goal is to use Solver or some other method to determine the best combination of players to maximize the projected point total while staying under a salary cap.
In this example I have 4 separate player lists, like this:
QB: Player A, Player B, Player C...Player N
RB: Player a, Player b, Player c...Player N
WR: Player X, Player Y, Player Z...Player N
TE: Player x, Player y, Player z...Player N
I need the best combination that includes 2 QBs, 2 RBs, 2 WRs, 1 TE, and 2 "Flex", which means any of RB/WR/TE.
I have tried using Solver to maximize the projected point total, but the variable fields in this case would be the players' names, and it seems like the variable fields need to be numbers, not a list of strings.
Any ideas?
My favorite kind of question :)
Here is the model setup:
The top table shows the decision variables: 1 if player i = A, B, ..., N of list L = QB, ..., TE is selected, and 0 otherwise.
The entries in column R (next to the top table) are the sums of each row. These must be constrained using the numbers in column T. Cell R7 is the total number of players selected, which should be 9: 2 flex plus 7 as per the individual list requirements.
The middle table shows the salaries (randomly generated between 50,000 and 150,000). The Sum of Salaries formula is =SUMPRODUCT(C11:P14,C3:P6). The idea here is that only the salaries of selected players are taken into account. This SUMPRODUCT should be constrained by the budget, which is in cell T14. For my experiment, I set it equal to 80% of the total sum of all salaries.
Objective: The bottom table shows the projected points for each player. The formula in cell R22 is =SUMPRODUCT(C19:P22,C3:P6) (same logic as with the salaries above). This is the value to be maximized.
Solver Model shown below:
I suggest selecting Simplex LP and going to Options and setting the Integer Optimality to zero (0).
Result:
Solver manages to find an optimal solution. The problem is really small, so it is very quick. Solver works with up to 200 variables and 100 constraints; for larger problems you will need the (commercial) extended version.
Of course, you can simply order the real player names so that they fit this setup. For example, if you sort the players of each list alphabetically, then (Player A, QB) is the first player of the QB list, etc.
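If the problem ever grows past Solver's 200-variable limit, the same binary model can be sketched in Python with the PuLP package (assumed to be installed; the roster sizes, salaries, and points below are made up):

import random
from pulp import LpBinary, LpMaximize, LpProblem, LpVariable, lpSum

random.seed(0)
positions = {"QB": 8, "RB": 12, "WR": 12, "TE": 8}          # made-up pool sizes
players = [(pos, i, random.randint(50_000, 150_000), random.uniform(5.0, 25.0))
           for pos, n in positions.items() for i in range(n)]
budget = 900_000                                            # illustrative salary cap

prob = LpProblem("lineup", LpMaximize)
x = {p: LpVariable(f"x_{p[0]}_{p[1]}", cat=LpBinary) for p in players}

# Objective: total projected points of the selected players.
prob += lpSum(x[p] * p[3] for p in players)

# Salary cap.
prob += lpSum(x[p] * p[2] for p in players) <= budget

# Exactly 2 QBs; at least 2 RB, 2 WR, 1 TE; 9 players in total,
# which leaves the 2 flex spots to be filled from RB/WR/TE.
prob += lpSum(x[p] for p in players if p[0] == "QB") == 2
for pos, minimum in [("RB", 2), ("WR", 2), ("TE", 1)]:
    prob += lpSum(x[p] for p in players if p[0] == pos) >= minimum
prob += lpSum(x[p] for p in players) == 9

prob.solve()
picked = [p for p in players if x[p].value() == 1]
print(sorted(picked))

The binary variables play the same role as the 0/1 cells in the top table, and the salary-cap and position constraints are the SUMPRODUCT and column-R constraints expressed in code.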
I hope this helps! Let me know if you would like me to upload the file for you.
Best,
Ioannis
Excel's Solver is built on numerical methods. Applying it to a domain that consists of discrete values, like strings or football players, is probably going to fail. You should consider writing a brute-force solver in a "real" programming language, like C#, Java, Python, Ruby, or JavaScript. If there are performance problems, then optimize from there.
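As a rough illustration of that brute-force idea in Python (all of the pools, salaries, and points here are made up, and the number of combinations explodes quickly, so prune aggressively for realistic list sizes):

from collections import Counter
from itertools import combinations

# Made-up pools of (name, position, salary, projected_points).
qbs = [(f"QB{i}", "QB", 6000 + 300 * i, 15.0 + i) for i in range(4)]
skill = [(f"P{i}", pos, 4000 + 250 * i, 8.0 + 0.7 * i)
         for i, pos in enumerate(["RB"] * 5 + ["WR"] * 5 + ["TE"] * 3)]
salary_cap = 55_000

best_points, best_lineup = float("-inf"), None
for qb_pair in combinations(qbs, 2):
    for seven in combinations(skill, 7):
        counts = Counter(p[1] for p in seven)
        # Need at least 2 RB, 2 WR, 1 TE; the remaining 2 of the 7 are flex.
        if counts["RB"] < 2 or counts["WR"] < 2 or counts["TE"] < 1:
            continue
        lineup = qb_pair + seven
        if sum(p[2] for p in lineup) > salary_cap:
            continue
        points = sum(p[3] for p in lineup)
        if points > best_points:
            best_points, best_lineup = points, lineup

print(best_points, [p[0] for p in best_lineup])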
Solver won't work here because it's not a numeric solution you're after.
Make a spreadsheet that has every possible combination of position players (that meets your criteria) on each row. Then add an Excel formula that calculates the projected point total for the players in that row. Sort the spreadsheet by your projected-point column.
I am not sure whether this is the right place to ask this question, as it is more of a logic question, but hey, there's no harm in asking.
Suppose I have a huge list of data (customers), and they all have a data_id.
Now I want to split the data in some ratio, let's say a 10:90 split.
Now, rather than stating a condition like (for example):
the sum of digits is even...go to bin 1
the sum of digits is odd.. go to bin 2
or sum of last three digits are x then go to bin 1
sum of last three digits is not x then go to bin 2
Such rules might result in uneven splits: sometimes the rule finds more data than needed (which is fine), but sometimes it cannot find enough data.
Is there a way (probabilistically speaking) to guarantee that the sample size is always greater than x%?
Thanks
You want to partition your data by a feature that is uniformly distributed. Hash functions are designed to have this property, so if you compute a hash of your customer ID and then partition by the first n bits to get 2^n bins, each bin should have approximately the same number of items. (You can then select, say, 90% of your bins to get 90% of the data.) Hope this helps.
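A minimal Python sketch of that hashing idea (using the hash modulo a bin count rather than the first n bits, but the effect is the same; the range of IDs is just a stand-in for real customer IDs):

import hashlib

def hash_bin(data_id, n_bins=10):
    # Deterministically map an ID to one of n_bins roughly equal-sized bins.
    digest = hashlib.sha256(str(data_id).encode()).hexdigest()
    return int(digest, 16) % n_bins

# Illustrative 10:90 split: bin 0 is the 10% sample, bins 1-9 are the 90%.
ids = range(100_000)
small_sample = [i for i in ids if hash_bin(i) == 0]
print(len(small_sample) / 100_000)      # close to 0.10, and stable across runs

Because the assignment depends only on the ID, the same customer always lands in the same bin, which also makes the split reproducible.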