How to get the second largest value in a column - excel

Recently I discovered the LARGE and SMALL worksheet functions, which one can use to determine the first, second, third, ... largest or smallest value in an array.
At least, that's what I thought:
Looking at the array [1, 3, 5, 7, 9] (in one column or row), LARGE(...;2) gives 7 as expected, but:
Looking at the array [1, 1, 5, 9, 9], I expect LARGE(...;2) to give 5, but instead I get 9.
Now this makes sense: it seems that LARGE(...;2) takes the largest entry in the array (the value 9 in the second-to-last position), removes it, and returns the largest entry of the reduced array (which still contains another 9). But this is not what one might expect intuitively.
In order to get 5 from [1, 1, 5, 9, 9], I would need something like:
=LARGE_OF_UNIQUE_VALUES_OF(...;2)
I didn't find this in the LARGE documentation.
Does anybody know an easy way to achieve this?

If you have the new Dynamic Array formulas:
=LARGE(UNIQUE(...),2)
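If not, use AGGREGATE (this assumes the data starts in row 1, because MATCH returns a position within the range and that position is compared with ROW):
=AGGREGATE(14,7,A1:A5/(MATCH(A1:A5,A1:A5,0)=ROW(A1:A5)),2)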
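To sanity-check what either formula should return, here is a minimal Python sketch (not Excel) of the two behaviours. The function names large and large_of_unique are made up for this illustration: the first mimics LARGE on the raw array, the second mimics LARGE applied to the de-duplicated values.

```python
def large(values, k):
    """k-th largest entry, duplicates counted separately (like LARGE)."""
    return sorted(values, reverse=True)[k - 1]


def large_of_unique(values, k):
    """k-th largest among the distinct values (like LARGE(UNIQUE(...), k))."""
    return sorted(set(values), reverse=True)[k - 1]


data = [1, 1, 5, 9, 9]
print(large(data, 2))            # 9, the second 9 still counts
print(large_of_unique(data, 2))  # 5
```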

This is a bit of a hack.
=LARGE(IF(YOUR_DATA=LARGE(YOUR_DATA,1),SMALL(YOUR_DATA,1)-1,YOUR_DATA),1)
The idea is to (a) take any value in your data that is equal to the largest element and set it to less than the smallest element, then (b) find the (new) largest element. It's OK if you want the 2nd largest, but extending to 3rd largest etc. gets progressively uglier.
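The same two steps, written out as a small Python sketch so the logic is easier to follow. This is only an illustration of the idea, not something Excel runs, and the function name is hypothetical.

```python
def second_largest_via_masking(values):
    """Mask every copy of the maximum, then take the new maximum."""
    top = max(values)
    below_everything = min(values) - 1  # same role as SMALL(YOUR_DATA,1)-1
    masked = [below_everything if v == top else v for v in values]
    return max(masked)


print(second_largest_via_masking([1, 1, 5, 9, 9]))  # 5
```

Note that if every value in the data is identical, this returns the placeholder (smallest value minus one), so that case would need separate handling.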
Hope that helps

Related

Comparing the word-counts of two files, accounting for the number of occurrences

I'm currently working on a program which is supposed to find exploits for vulnerabilities in web-applications by looking at the "Document Object Model" (DOM) of the application.
One approach for narrowing down the number of possible DB-entries follows the strategy of further filtering the entries by comparing the word-count of the DOM and the database entry.
I already have two dicts (actually DataFrames, but showing dicts here for better presentation), each containing the top 10 words in descending order of their number of occurrences in the text.
word_count_dom = {"Peter": 10, "is": 6, "eating": 2, ...}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1, ...}
Now I would like to calculate some kind of value which represents how similar the two dicts are while accounting for the number of occurrences.
Currently I'm using:
>>> len([word for word in word_count_dom if word in word_count_db])
3
but this does not account for the number of occurrences at all.
Looking at the example, I would like the program to give more value to the "is"-match, because of the generally better "Ranking-Position to Number of Occurrences"-value.
Just an idea:
Compute, for each dict, the relative probability of each entry occurring (e.g. among all the top counts, "Peter" occurs 20% of the time). Do this for each word occurring in either dict. Then use something like:
https://en.wikipedia.org/wiki/Bhattacharyya_distance
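A minimal sketch of that idea, assuming plain dicts of counts and treating a word missing from one dict as having probability 0 there. The function name is hypothetical, and the example dicts are just the three or four entries visible in the question, not the full top 10.

```python
import math


def bhattacharyya_distance(counts_a, counts_b):
    """Distance between two word-count dicts; 0 means identical distributions."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    words = set(counts_a) | set(counts_b)
    # Bhattacharyya coefficient: sum over words of sqrt(p(word) * q(word)).
    coefficient = sum(
        math.sqrt((counts_a.get(w, 0) / total_a) * (counts_b.get(w, 0) / total_b))
        for w in words
    )
    coefficient = min(coefficient, 1.0)  # guard against floating-point overshoot
    if coefficient == 0:
        return math.inf  # no shared words at all
    return -math.log(coefficient)


word_count_dom = {"Peter": 10, "is": 6, "eating": 2}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1}
print(bhattacharyya_distance(word_count_dom, word_count_db))
```

Smaller values mean more similar distributions. Words with a high share in both dicts (such as "is" here) contribute the most to the coefficient, which is roughly the weighting the question asks for.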

Longest “increasing” subsequence with two consecutive numbers whose average is less than the third number

Problem Statement
Given an array of integers, find, in O(n^3) time, the length of the longest subsequence in which the average of every two consecutive numbers is less than the number that follows them.
Example:
[20, 10, 5, 0, 6, 4, 15, 6, 9, 8], the longest subsequence that satisfies the requirement is 5, 0, 6, 4, 6, 9, 8, and the length of that sequence is 7. (5 + 0) / 2 = 2.5 < 6, (0 + 6) / 2 = 3.0 < 4, (6 + 4) / 2 = 5.0 < 6, etc.
What I tried
1st approach: O(n^2)
A generic dynamic programming approach: I define the DP array to be the length of the longest subsequence that satisfies the condition.
If the average of the (i-2)th and (i-1)th integers is less than the ith integer, we add one to the DP array. The solution is the last element of the DP array.
This didn’t work as I realized it is only considering the numbers in the original array, not the subsequence I am trying to achieve. So, this approach only gave me 5 as the answer for the example input above, and the answer would be 5, 0, 6, 4, 15. The approach did not account for disjoint parts of the original sequence to create the new subsequence.
1.5th approach
While writing out the problem on my notes, I realized the corresponding average subsequence for the example input is the longest. Following the idea of a LIS problem, I created an array of all the average numbers to find the longest increasing subsequence in that array. This solved the example input but failed more complicated inputs.
2nd approach: O(n^3)
Using the hint in the problem statement that the algorithm can be O(n^3), I tried coming up with a definition for a 2D DP array and a loop to make it O(n^3). I defined DP[i][j] to be the length of the longest subsequence from the start element to the ith element, while considering the jth element.
Considering the example input, for instance, DP[2][6] = 3 because the subsequence would be 10, 5, 15. From the first element to the 2nd index element, we consider the subsequence 10, 5, and the 6th index element is 15, so the subsequence here is 10, 5, 15, and the length is 3. Repeat until every half above the main diagonal of the table is filled, and the solution is the last element (last row, last column) in that half.
I thought this was it, but I ran into problems, such as not knowing which part of the DP table I should be reusing and not knowing exactly what the last two numbers of the subsequence I am trying to build are. Ultimately, I didn't know where to go next.
Other thoughts
I think a 3D DP array could also work, but I haven’t really thought about how I would define the array…
Any help would be greatly appreciated!
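One possible way to make the 2D table work, offered as a sketch rather than a reference solution: let dp[j][k] be the length of the longest valid subsequence whose last two elements are a[j] and a[k] (with j < k). That definition pins down the "last two numbers" directly, and every pair can be extended from any earlier i with (a[i] + a[j]) / 2 < a[k]. Three nested loops give O(n^3). The function name is hypothetical.

```python
def longest_average_subsequence(a):
    """Longest subsequence where each element exceeds the mean of the two before it."""
    n = len(a)
    if n < 3:
        return n  # with fewer than three elements there is nothing to check
    # dp[j][k]: best length of a valid subsequence ending with a[j], a[k] (j < k).
    dp = [[0] * n for _ in range(n)]
    best = 2
    for k in range(n):
        for j in range(k):
            dp[j][k] = 2  # the pair (a[j], a[k]) on its own
            for i in range(j):
                # Compare a[i] + a[j] with 2 * a[k] to stay in integer arithmetic.
                if a[i] + a[j] < 2 * a[k]:
                    dp[j][k] = max(dp[j][k], dp[i][j] + 1)
            best = max(best, dp[j][k])
    return best


print(longest_average_subsequence([20, 10, 5, 0, 6, 4, 15, 6, 9, 8]))  # 7
```

The answer is the maximum over the whole table, not the last cell, because the best subsequence does not have to end at the last element.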

Interview question about "largest range" makes no sense

Here's the question. I'm actually dumbfounded. I don't even get the question. What are they on about?
What even is a largest range? What do they mean by largest? What's a range? They say a range is a collection of numbers that come right after each other in the set of real integers. Okay, so 1, 2, 3, 4, stuff like that, right? But then they say the numbers need not be ordered or even adjacent.... but then they're not coming right after each other!! They are contradicting their own previous statement. Now I have no idea what a range is.
Their example doesn't help either. Why is [0, 15, 5, 2, 4, 10, 7] the largest range in that vector?
What is going on?
It's not very clear in the question, but I'm pretty sure the interviewer means a "range" is a set of consecutive numbers (n, n+1).
The range [0,7] is actually [0,1,2,3,4,5,6,7] since all of those appear in the full set.
The actual order doesn't matter.
In the example you were given in the interview, which you list in your question as well, the input array is: [1, 11, 3, 0, 15, 5, 2, 4, 10, 7, 12, 6]. The reason that the "largest range" is identified as [0, 7] is because all the numbers between 0 and 7 are included in that array.
There isn't another range in the input array that is longer than 0 to 7. For instance, there is a [10, 12] range in the input array, but that range has a length of 3, which is smaller than the length of the [0, 7] range, which is 8.
In this case, a range is understood as a continuous list of integers, and the largest range is the list with the greatest number of integers.
It means: find the largest continuous range of numbers.
For example, in the array [0,1,2,5,6,7,8,9,10] there are 2 continuous lists:
[0,1,2] and [5,6,7,8,9,10]. The second one is the larger range, so the output must be [5,10],
i.e. the smallest and largest values of the largest range.
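Under that reading of the question, a short Python sketch could look like the following. It uses a set so that input order does not matter, and only starts counting from numbers that have no predecessor in the set. The function name is hypothetical.

```python
def largest_range(numbers):
    """Return [low, high] of the longest run of consecutive integers present."""
    values = set(numbers)
    best = None
    for start in values:
        if start - 1 in values:
            continue  # not the beginning of a run
        end = start
        while end + 1 in values:
            end += 1
        if best is None or end - start > best[1] - best[0]:
            best = [start, end]
    return best


print(largest_range([1, 11, 3, 0, 15, 5, 2, 4, 10, 7, 12, 6]))  # [0, 7]
```

Each number is visited a constant number of times, so this runs in roughly linear time. An empty input returns None.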

Using result of an Excel array function in a calculation

I am attempting to count instances of a particular value in Excel, from the last instance of a prior value.
Assume a vertical list starting in cell A1: 1, 2, 3, 4, 5, 4, 5, 3, 4, 5, 2, 3, 4, 3, 4, 2, 3, 4, 5
I can use an array function in, say, B14 (A14 value: 3), of {=MAX(ROW($1:14)*(A$1:A14=A14-1))} to give me the row number of the last instance of a "2" (row 11).
I can then have, in C15, a function =COUNTIF(OFFSET(A14,0,0,B14-ROW(A14),1):A14,A14), which will count the instances of 3's since the last 2.
The question is: how do I integrate that array function directly into the final formula, so as not to have to waste a column with the interim calculation?
Edit
The list of numbers represents a level of indentation, so the end result will be a compound of these calculations with different offset checking to provide section numbering: 1; 1.1; 1.1.1, 1.2, 1.2.1, 1.2.2, etc
I want a single function that can calculate this entire depth level, without having to waste several columns identifying how many rows above the previous indent layer was defined.
Try this array formula in cell B14:
{=COUNTIF(OFFSET($A14,0,0,
MAX(ROW($1:14)*($A$1:$A14=$A14-1))
-ROW($A14),1):$A14,$A14)}
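To check what the combined formula should return for a given row, here is a rough Python equivalent of its logic, using 1-based row numbers like the worksheet. It is only an illustration with a hypothetical function name: find the last row holding the parent level (current level minus 1), then count the rows after it, up to and including the current row, that hold the current level.

```python
def count_since_parent(levels, row):
    """Occurrences of the current level since the last row with level - 1."""
    current = levels[row - 1]
    # Mirrors MAX(ROW(...)*(A1:A_row = current - 1)); 0 when no parent exists.
    last_parent = max(
        (r for r in range(1, row + 1) if levels[r - 1] == current - 1),
        default=0,
    )
    # Mirrors COUNTIF over the rows below the parent, down to the current row.
    return sum(1 for r in range(last_parent + 1, row + 1) if levels[r - 1] == current)


levels = [1, 2, 3, 4, 5, 4, 5, 3, 4, 5, 2, 3, 4, 3, 4, 2, 3, 4, 5]
print(count_since_parent(levels, 14))  # 2
```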

Power Pivot previous value

I have the following table & data that can be seen in Excel PowerPivot
item, timeframe, total
1, 1, 15
1, 2, 20
1, 3, 15
2, 1, 10
2, 2, 11
2, 3, 10
While I can easily get the last timeframe, I need to get the previous timeframe's total like:
item, timeframe, total, last_timeframe, last_timeframe_total
1, 1, 15, 0, 0
1, 2, 20, 1, 15
1, 3, 15, 2, 20
2, 1, 10, 0, 0
2, 2, 11, 1, 10
2, 3, 10, 2, 11
I've tried a CALCULATE formula, but that didn't seem to work and only returns blanks:
=CALCULATE(SUM(MyTable[total]), MyTable[timeframe] = EARLIER(MyTable[timeframe]) - 1)
EARLIER() doesn't understand any sort of ordering of rows.
EARLIER() refers to nested row contexts.
What you actually want is LOOKUPVALUE() here, which matches the values in specified fields with search criteria you provide, and returns the value that exists for the row which matches those criteria.
Based on your sample it looks like [Timeframe] is a one-incremented index for each item. If this assumption is not true, LOOKUPVALUE() is probably not the function you want.
last_timeframe_total =
LOOKUPVALUE(
    MyTable[total]
    ,MyTable[item]          // This is the field we're searching
    ,MyTable[item]          // This is the value to search for in the field
                            // identified in argument 2 - this evaluates
                            // in the current row context of the table
                            // where we are defining last_timeframe_total
                            // as a calculated column.
    ,MyTable[timeframe]     // The second field we are searching.
    ,MyTable[timeframe] - 1 // Similar to argument 3.
)
This will give you the value for the prior timeframe for the current item.
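For anyone who wants to see the lookup logic outside of DAX, here is a small Python sketch over the sample rows from the question. It is not Power Pivot code, and the function name is hypothetical. It also fills in 0 where no previous timeframe exists, to match the desired output in the question, whereas LOOKUPVALUE would return a blank there.

```python
rows = [
    (1, 1, 15), (1, 2, 20), (1, 3, 15),
    (2, 1, 10), (2, 2, 11), (2, 3, 10),
]


def add_previous_totals(rows):
    """Append (last_timeframe, last_timeframe_total) to each (item, timeframe, total)."""
    # Index totals by (item, timeframe) so the previous period is a direct lookup.
    totals = {(item, timeframe): total for item, timeframe, total in rows}
    result = []
    for item, timeframe, total in rows:
        previous = totals.get((item, timeframe - 1))
        if previous is None:
            result.append((item, timeframe, total, 0, 0))
        else:
            result.append((item, timeframe, total, timeframe - 1, previous))
    return result


for row in add_previous_totals(rows):
    print(row)
```

Like the LOOKUPVALUE version, this relies on timeframe being a one-incremented index within each item.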
Ninja edit: Forgot to mention that Power Pivot isn't really the layer to be doing this sort of work in. Lookups and this sort of data shaping are better done in your ETL from transactional sources. If this is not possible, then it's better done in the query to populate Power Pivot than in Power Pivot. Power Query is a good tool to use for this sort of transformation that easily fits into the Microsoft ecosystem, being another Microsoft add-in for Excel.
Power Pivot is an analytical database optimized for aggregations. In general, if you ever find yourself thinking "for every row," it's a sign that what you're trying to accomplish is probably better suited for a different layer of the BI solution.

Resources