Excel formula for hierarchical data analysis Excel 2010 - excel-formula

I am trying to write an Excel formula that will take data and look at the starting type and then ending type and then depending on the final type will decide on what worksheet, column and row it will go into.
I want to set up a priority system: a planet ranks first, a moon second, a large asteroid third, and so on. If a planet is subsequently reclassified as a moon, it stays in the planet column, because its initial classification has a higher priority than its final classification.
So if an object is identified as a planet then it comes top and does not go into any other columns. If it is a moon it would go into moon column on initial identification but would move to the correct column on final classification.
So everything would be classified, and if at any point something is classified as a planet it will stay in the planet column. If something is classified as a moon it will stay in the moon column despite its final reclassification, unless it is reclassified as a planet, in which case it moves to the planet column because planet has a higher priority than moon, and so on. There are going to be 56 classifications in order of priority. This is what I have come up with:
=IF(OR(Data!B2="Planet", Data!C2="Planet"), 1, 0)
=IF(OR(Data!B3="planet", Data!C3="moon"), 1, 0)
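The priority rule described above can be sketched outside Excel as follows. This is only an illustration in Python, reading the rule as "the higher-priority of the initial and final classification wins". The three-entry PRIORITY list and the function name assigned_column are hypothetical; the real list would hold all 56 classifications.

```python
# Hypothetical priority order: earlier in the list means higher priority.
# The real list would contain all 56 classifications.
PRIORITY = ["Planet", "Moon", "Large asteroid"]


def assigned_column(initial, final, priority=PRIORITY):
    """Return whichever of the two classifications ranks higher.

    Matching ignores case, as Excel's "=" comparison does.
    Returns None when neither classification is in the priority list.
    """
    rank = {name.lower(): i for i, name in enumerate(priority)}
    candidates = [c.lower() for c in (initial, final) if c.lower() in rank]
    if not candidates:
        return None
    best = min(candidates, key=lambda c: rank[c])
    return priority[rank[best]]


print(assigned_column("Planet", "Moon"))          # initial classification wins
print(assigned_column("Moon", "Planet"))          # final classification wins
print(assigned_column("Moon", "Large asteroid"))  # stays with Moon
```

In a worksheet the same idea would amount to looking up the rank of both cells in a priority table and keeping the smaller rank, rather than writing one IF per classification.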

Related

How can I calculate the difference between the previous row and the current row when there are blank rows in between, in Excel?

I am trying to calculate the time between oscillations on a wave, where the time between is the difference between the low point and the high point on any section of the entire wavelength. I have the sheet set up so it tells me if the previous value is lower (wave going up) or higher (wave going down). I want to calculate the time difference between the last-occurring different marker and the current marker. For example, if there is (up, up, up, up, down, up, up, up, down, down, down, down, down, down, down, up, up, up, down) I want to know the time difference between the first 'up' and first 'down' (entry 1 and entry 5), the second change (entry 5 and 8), etc.
How can I calculate the time difference between times when the entry changes?
I cannot figure this out and could not get it to work.
I would usually accomplish this by assigning an instance number for each thing I am trying to measure, so in this case it would be a wave direction number. Then, use that wave direction number to drive the calculation and identification of when to do some calculation, in this case, the distance from the first up to the first down.
I have data setup in the following way:
Column A: Time Entry
Column B: Wave Direction
Column C: Wave Direction Number
Column D: Wave Length
C2 Formula to identify the wave direction number =IF(B2=B1,C1,IFERROR(C1+1,1)) which is comparing the current row's wave direction to the previous row's wave direction.
D2 Formula to calculate the change since the first occurrence of the previous wave direction number =IF(C2<>C1,IFERROR(A2-XLOOKUP(C1,C:C,A:A),0),0)
The IFERROR is just to accommodate the header row.
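For readers who want to check the logic of columns C and D outside Excel, here is a minimal Python sketch of the same two steps. The function name wave_lengths and the sample data are made up for illustration, and times are plain numbers rather than Excel time values.

```python
def wave_lengths(times, directions):
    """Mirror columns C and D of the layout above.

    Column C: number each run of identical directions (1, 2, 3, ...).
    Column D: on the first row of a new run, the time elapsed since the
    first row of the previous run; 0 on every other row.
    """
    run_numbers = []
    lengths = []
    run_start_time = {}  # run number -> time of the first row in that run
    for i, (t, d) in enumerate(zip(times, directions)):
        if i == 0:
            run = 1
        elif d == directions[i - 1]:
            run = run_numbers[-1]
        else:
            run = run_numbers[-1] + 1
        run_start_time.setdefault(run, t)
        if run_numbers and run != run_numbers[-1]:
            lengths.append(t - run_start_time[run_numbers[-1]])
        else:
            lengths.append(0)
        run_numbers.append(run)
    return run_numbers, lengths


times = [0, 1, 2, 3, 4, 5]
directions = ["up", "up", "down", "down", "down", "up"]
print(wave_lengths(times, directions))
# → ([1, 1, 2, 2, 2, 3], [0, 0, 2, 0, 0, 3])
```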

Which is the best Data Mining model to extrapolate known values to missing values in a table? (General question)

I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.
So, let's say we have a table with three columns and around 4000 rows:
YEAR   COLOR    NAME
1900   Green    David
1901   Yellow   Sarah
1902   Green    ???
1902   Red      Sarah
…      …        …
2020   Purple   John
Any value for any field can be repeated in the dataset (also Year values).
In the first two columns we don't have missing values, but we only have around 20% of the Name values in the third column. The Name value depends somewhat on the first two columns (not a causal relation).
My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each name value (for example in a boxplot)
I have imagined a process like this, although I am not sure whether it makes sense statistically (any objections and suggestions are appreciated):
For every unknown NAME value, the algorithm randomly chooses one of the already known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' values tend to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen if the input values for Year and Color are "1900, Purple".
When the above process ends, the number of occurrences for each name is counted.
The above process is applied 30 times and the results for each name are displayed in a boxplot.
However, I don't know which is the best model to implement an idea similar to this. I have drawn the process in a simple paint drawing:
Possible output for the task
What do you think would be a good approach to this task? I appreciate any help.
I think you have the process down, it's converting the data which may be the first hurdle.
I would look at using from sklearn.preprocessing import OrdinalEncoder to encode the data to convert from categorical to numeric.
You could then use a random number generator to produce a number within the range defined by the encoding which would randomly select a name.
Loop through this 30 times with a for loop to achieve the result.
It also looks like you will need to provide the ranking values for year and colour prior to building out your code. From here you would just provide bands, for example, if year > 1985, etc within your for loop to specify the names.
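As a rough illustration of the process described in the question (weighted random choice, repeated 30 times, then counted), here is a sketch using only the Python standard library. The weighting scheme is an assumption made up for this example, not a recommended statistical model: each known row adds one point to its name if it shares the color and one point if its year is within year_window years. The names impute_counts and year_window are hypothetical.

```python
import random
from collections import Counter


def impute_counts(rows, runs=30, year_window=10, seed=0):
    """rows: list of (year, color, name_or_None).

    Returns one Counter of name occurrences per run, with every missing
    name filled in by a weighted random draw from the known names.
    """
    rng = random.Random(seed)
    known = [(y, c, n) for y, c, n in rows if n is not None]
    names = sorted({n for _, _, n in known})
    results = []
    for _ in range(runs):
        counts = Counter(n for _, _, n in known)
        for year, color, name in rows:
            if name is not None:
                continue
            # Assumed weighting: a base weight of 1 keeps every name possible;
            # each known row of that name adds 1 for a matching color and
            # 1 for a year within year_window.
            weights = [
                1 + sum((c == color) + (abs(y - year) <= year_window)
                        for y, c, n in known if n == candidate)
                for candidate in names
            ]
            counts[rng.choices(names, weights=weights, k=1)[0]] += 1
        results.append(counts)
    return results


rows = [(1900, "Green", "David"), (1901, "Yellow", "Sarah"),
        (1902, "Green", None), (1902, "Red", "Sarah")]
per_run = impute_counts(rows)
print(len(per_run), per_run[0])
```

The list of 30 counters can then be turned into one series per name for a boxplot. A fitted classifier that outputs class probabilities could replace the hand-made weights; that would be a different technique from the one sketched here.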

How to sum total hours in a row while skipping certain values?

I study wildlife and currently, I am doing an analysis regarding how long my focal species goes off of the mountain (its main habitat) and into human settlements.
Here is a picture with the data: data
As you can see, there are three coloured columns. Yellow is the date, green is the time, and blue is whether the animal is on or off the mountain (with red being when the animal is off).
As you can see, this one particular animal went off on several occasions. In this case, he went off the mountain three times but stayed off at various lengths. As I have thousands of data points, I essentially would like to determine how long each "off the mountain" event lasted. That is, since I consider every time the animal went off the mountain to be a separate event, I would like to determine how long the animal was off the mountain for each excursion, separately. In this case, the animal went off three times and I would like to total those three events individually.
So, as stated, an event would be every single occasion that the animal left the mountain, stayed there (for however long), and eventually made its way back up.
Any help would be greatly appreciated.
The simplest way would be to count how many consecutive "Off" periods there are in a run following an "On" period, then multiply by 3 hours 20 minutes, which you could do like this (starting in, say, K2):
=IF(AND(G1="On",G2="Off"), MATCH("On",G3:G$100,0)*TIME(3,20,0)*24,0)
You could take it further by looking at the individual times of the fixes as well to get an upper and lower limit (e.g. for the first excursion it could be between 3 hours 20 minutes and 10 hours 40 minutes roughly).
Upper limit
=IF(AND(G1="On",G2="Off"), (INDEX(J3:J$100,MATCH("On",G3:G$100,0))-J1)*24,0)
Lower limit
=IFERROR(IF(AND(G1="On",G2="Off"), (INDEX(J3:J$100,MATCH("On",G3:G$100,0)-1)-J2)*24,0),0)
where my column J contains a datetime value formed by adding the date and time in columns A and B together.
This raises an issue about what happens when the animal is still off-mountain at the end of its data (currently gives #N/A because MATCH is unable to find a cell containing "On"). Would need to decide how to treat this case if it ever occurs in practice.
Note when there is only one off-mountain measurement the lower limit is zero because in theory the animal could have left immediately before the measurement and returned immediately afterwards.
EDIT
To address the above issue where the animal is still off-mountain at the end of its data (and looking at the sample data it looks as if a different animal's data is immediately following the first animal's data) you would need this
=IF(AND(G1="On",G2="Off"), IFERROR(MATCH(1,(G3:G$100="On")*(E3:E$100=E2),0),MATCH(TRUE,E3:E$100<>E2,0))*TIME(3,20,0)*24,0)
which would have to be entered as an array formula using Ctrl+Shift+Enter.
You could argue that you might need to do some averaging for an incomplete off-mountain excursion like this which would make it even more complicated, but this is an Excel answer and can't go too far into the rights or wrongs of the analysis.
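The upper and lower limits described above can also be checked with a short script. The following Python sketch assumes one animal's fixes are already in time order as (datetime, "On"/"Off") pairs; the function name excursions and the sample fixes are invented for illustration. Like the first formulas, it ignores an excursion that is still open at the end of the data.

```python
from datetime import datetime, timedelta


def excursions(fixes):
    """fixes: list of (datetime, "On" or "Off") in time order for one animal.

    Returns one (lower_hours, upper_hours) pair per completed excursion:
    lower = last "Off" fix minus first "Off" fix,
    upper = first "On" fix after the run minus last "On" fix before it.
    """
    results = []
    i = 1
    while i < len(fixes):
        if fixes[i][1] == "Off" and fixes[i - 1][1] == "On":
            j = i
            while j < len(fixes) and fixes[j][1] == "Off":
                j += 1
            if j < len(fixes):  # the animal came back within the data
                lower = (fixes[j - 1][0] - fixes[i][0]).total_seconds() / 3600
                upper = (fixes[j][0] - fixes[i - 1][0]).total_seconds() / 3600
                results.append((lower, upper))
            i = j
        else:
            i += 1
    return results


# Made-up fixes, one every 3 hours 20 minutes.
start = datetime(2020, 1, 1)
states = ["On", "Off", "Off", "On", "Off", "On"]
fixes = [(start + timedelta(minutes=200 * k), s) for k, s in enumerate(states)]
for lower, upper in excursions(fixes):
    print(round(lower, 2), round(upper, 2))
```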
I guess a good starting-point would be knowing how you gather these statistics in the first place.

Show on PivotChart... sum of two fields

OK, let's say I have this PivotChart:
I have an Excel sheet of football matches and I want to see the highest-scoring team, but the chart only shows either home or away goals and I want to combine the two. How can I do that with my PivotChart fields?
I suggest inserting a column immediately to the right of B say labelled GSum with:
=SUMIF(F:F,B2,E:E)+SUMIF(B:B,B2,D:D)
in C2 and copied down to suit. In the PT add a Calculated Field, say Goals with =SUM(GSum)/2 and Sum of Goals at the bottom of Σ Values.
With luck, on refresh the results might be similar to those shown in this simplified example:
Note that, for example, C has not scored at home (so no blue) but has had three goals scored against it at home (brown). You might prefer the latter to indicate how many goals C has scored away (ie 4 - the same as it has scored in total) instead.
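To verify the totals independently of the PivotTable, the two SUMIF terms can be reproduced in a few lines. This Python sketch assumes each match is a (home team, home goals, away goals, away team) tuple, which is a guess at the sheet layout; the function name total_goals and the sample matches are made up.

```python
from collections import Counter


def total_goals(matches):
    """matches: iterable of (home_team, home_goals, away_goals, away_team).

    Adds each team's goals scored at home and away into one total.
    """
    totals = Counter()
    for home, home_goals, away_goals, away in matches:
        totals[home] += home_goals
        totals[away] += away_goals
    return totals


matches = [("A", 2, 1, "B"), ("B", 0, 3, "C"), ("C", 0, 1, "A")]
print(total_goals(matches).most_common())
```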

Excel Solver Using Strings

I'm going to try to explain this the best that I can.
Right now I have a spreadsheet with a list of football players, each of which has an assigned salary and projected point total for the week.
My goal is to use Solver or some other method to determine the best combination of players to maximize the projected point total while staying under a salary cap.
In this example I have 4 separate player lists, like this:
QB: Player A, Player B, Player C...Player N
RB: Player a, Player b, Player c...Player N
WR: Player X, Player Y, Player Z...Player N
TE: Player x, Player y, Player z...Player N
I need the best combination that includes 2 QBs, 2 RBs, 2 WRs, 1 TE, and 2 "Flex", which means any of RB/WR/TE.
I have tried using Solver to maximize the projected point total, but the variable fields in this case would be the Player's Names and it seems like the variable field needs to be a number, not a list of strings.
Any ideas?
My favorite kind of question :)
Here is the model setup:
Top table shows the decision variables: = 1 if player i = A, B, ..., N of list L = QB, .., TE is selected, =0 otherwise.
Entries in column R, (next to the top table) are the sums of each row. These must be constrained with the numbers in column T. Cell R7 is the total sum of players, which should be 9: 2 flexible and 7 as per the individual list requirements.
Middle table shows the salaries (randomly generated between 50,000 and 150,000). The Sum of Salaries formula is =SUMPRODUCT(C11:P14,C3:P6). The idea here is that only the salaries of players that are selected are taken into account. This SUMPRODUCT should be constrained with the budget, which is in cell T14. For my experiment, I put it equal to 80% of the total sum of all salaries.
Objective: Bottom table shows the projected points for each player. The formula in cell R22 is =SUMPRODUCT(C19:P22,C3:P6) (same logic as with salaries above). This is the value to be maximized.
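The model above boils down to two SUMPRODUCTs and a few row-sum constraints over a 0/1 table. The Python sketch below evaluates one candidate selection in that way; it does not optimize anything. The exact form of the constraints is an assumption (exactly 2 QBs, at least 2 RBs, 2 WRs and 1 TE, and 9 players in total), and the names evaluate and sumproduct are hypothetical.

```python
# Assumed reading of the roster rule: exactly 2 QBs, at least 2 RBs,
# 2 WRs and 1 TE, and 9 players in total, so the 2 flex slots can only
# be filled by RB/WR/TE.
EXACT = {"QB": 2}
AT_LEAST = {"RB": 2, "WR": 2, "TE": 1}
TOTAL_PLAYERS = 9


def sumproduct(table, selection):
    """Sum of table values for the players whose selection flag is 1."""
    return sum(v * x for pos in table
               for v, x in zip(table[pos], selection[pos]))


def evaluate(selection, salaries, points, budget):
    """Return (feasible, total_points) for a 0/1 selection table.

    All three tables map a position name to a list with one entry per player.
    """
    picked = {pos: sum(flags) for pos, flags in selection.items()}
    feasible = (
        all(picked[pos] == n for pos, n in EXACT.items())
        and all(picked[pos] >= n for pos, n in AT_LEAST.items())
        and sum(picked.values()) == TOTAL_PLAYERS
        and sumproduct(salaries, selection) <= budget
    )
    return feasible, sumproduct(points, selection)


positions = ("QB", "RB", "WR", "TE")
salaries = {p: [10, 10, 10] for p in positions}  # made-up numbers
points = {p: [1, 1, 1] for p in positions}       # made-up numbers
selection = {"QB": [1, 1, 0], "RB": [1, 1, 1], "WR": [1, 1, 1], "TE": [1, 0, 0]}
print(evaluate(selection, salaries, points, budget=100))
```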
Solver Model shown below:
I suggest selecting Simplex LP and going to Options and setting the Integer Optimality to zero (0).
Result:
Solver manages to find an optimal solution. The problem is really small, so it is very quick. Solver works with up to 200 variables and 100 constraints; for larger problems you will need the (commercial) extended version:
Of course, you can just order the real player names so that they fit this setting. For example, if you sort the players of each list alphabetically, then (Player A, QB) = first player of team QB, etc.
I hope this helps! Let me know if you would like me to upload the file for you.
Best,
Ioannis
Excel's Solver is built on numerical methods. Applying it to a domain that consists of discrete values, like strings or football players, is probably going to fail. You should consider writing a brute-force solver in a "real" programming language, like C#, Java, Python, Ruby, or JavaScript. If there are performance problems, then optimize from there.
Solver won't work here because it's not a numeric solution you're after.
Make a spreadsheet that has every possible combination of position players (that meet your criteria) on each row. Then make an Excel formula that calculates projected point total based on the players in that row. Sort the spreadsheet by your projected point column.
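The enumerate-and-sort idea can be sketched as a brute-force search. This Python example tries every way of filling 2 QB, 2 RB, 2 WR, 1 TE and 2 flex slots and keeps the highest-scoring lineup under the cap. The player data and the function name best_lineup are invented, and the approach is only practical for small lists because the number of combinations grows quickly.

```python
from itertools import combinations


def best_lineup(players, cap):
    """players: list of (name, position, salary, points) with unique names.

    Returns (total_points, lineup) for the best lineup with total salary
    at or under cap, or None if no lineup fits.
    """
    by_pos = {p: [x for x in players if x[1] == p]
              for p in ("QB", "RB", "WR", "TE")}
    flex_pool = by_pos["RB"] + by_pos["WR"] + by_pos["TE"]
    best = None
    for qbs in combinations(by_pos["QB"], 2):
        for rbs in combinations(by_pos["RB"], 2):
            for wrs in combinations(by_pos["WR"], 2):
                for te in combinations(by_pos["TE"], 1):
                    core = qbs + rbs + wrs + te
                    rest = [x for x in flex_pool if x not in core]
                    # The same nine players can appear under several
                    # core/flex splits; that is wasteful but harmless.
                    for flex in combinations(rest, 2):
                        lineup = core + flex
                        if sum(x[2] for x in lineup) > cap:
                            continue
                        score = sum(x[3] for x in lineup)
                        if best is None or score > best[0]:
                            best = (score, lineup)
    return best


players = [  # made-up salaries and projections
    ("Q1", "QB", 10, 20), ("Q2", "QB", 10, 18),
    ("R1", "RB", 10, 15), ("R2", "RB", 10, 12), ("R3", "RB", 10, 9),
    ("W1", "WR", 10, 14), ("W2", "WR", 10, 11), ("W3", "WR", 10, 8),
    ("T1", "TE", 10, 7), ("T2", "TE", 10, 6),
]
score, lineup = best_lineup(players, cap=90)
print(score, sorted(x[0] for x in lineup))
```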
