Comparing the word-counts of two files, accounting for the number of occurrences - python-3.x

I'm currently working on a program that is supposed to find exploits for vulnerabilities in web applications by looking at the "Document Object Model" (DOM) of the application.
One approach for narrowing down the number of possible DB entries is to filter them further by comparing the word counts of the DOM and the database entry.
I already have two dicts (actually DataFrames, but shown as dicts here for better presentation), each containing the top 10 words in descending order of their number of occurrences in the text.
word_count_dom = {"Peter": 10, "is": 6, "eating": 2, ...}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1, ...}
Now I would like to calculate some kind of value that represents how similar the two dicts are while accounting for the number of occurrences.
Currently I'm using:
len([word for word in word_count_dom if word in word_count_db])
>>> 3
but this does not account for the number of occurrences at all.
Looking at the example, I would like the program to give more weight to the "is" match, because of its generally better "ranking position to number of occurrences" value.

Just an idea:
Compute for each dict the relative probability of each entry (e.g. among all the top counts, "Peter" occurs 20% of the time). Do this for each word occurring in either dict. Then use something like:
https://en.wikipedia.org/wiki/Bhattacharyya_distance
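A minimal sketch of that idea in Python, assuming both inputs are plain, non-empty dicts mapping words to counts. The function name bhattacharyya is my own, not from any library. It normalises each dict to relative frequencies, sums sqrt(p * q) over every word that occurs in either dict to get the Bhattacharyya coefficient (1.0 for identical distributions, 0.0 when no word is shared), and takes the negative logarithm to get the distance.

```python
import math


def bhattacharyya(counts_a, counts_b):
    """Return (coefficient, distance) for two word-count dicts."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both dicts need at least one non-zero count")

    # Every word that occurs in either dict; a missing word counts as 0.
    words = set(counts_a) | set(counts_b)
    coefficient = sum(
        math.sqrt((counts_a.get(w, 0) / total_a) * (counts_b.get(w, 0) / total_b))
        for w in words
    )
    # Guard against floating-point overshoot slightly above 1.0.
    coefficient = min(1.0, coefficient)
    distance = -math.log(coefficient) if coefficient > 0 else math.inf
    return coefficient, distance


word_count_dom = {"Peter": 10, "is": 6, "eating": 2}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1}
coefficient, distance = bhattacharyya(word_count_dom, word_count_db)
print(coefficient, distance)
```

A coefficient closer to 1 (distance closer to 0) means the two word distributions are more alike. Words with a similar share in both dicts, such as "is" in the example, contribute most to the coefficient.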

Related

Excel : Find numbers in a list which when summed total a target number

I've got a list of several thousand numbers.
I have a target number which is a sum of some of the numbers in the list.
I want to find which numbers in the list sum to the target number.
E.g.
List :
1, 2, 3, 4, 5, 6, 7, 8
Target :
5
Result :
The target could be made from the sum of 1+4, 2+3, 5
If there is no way the target can be achieved from the numbers in the list (e.g. list: 10, 20 and target: 5), then the formula should output "no available matches".
That is a very simple example, but in practice there are thousands of numbers in the list and some of the numbers are up to seven digits long.
Is there a formula that could be used in Excel (or Google Sheets) to work this out automatically? Preferably a native function rather than VBA / Script.
Your request is so complex that there is no native function that can do it. On top of that, the result of your function is a whole collection of possibilities (like [5, 1+4, 2+3] in your particular case), while Excel functions only generate a single result, which can be placed in a single cell.
On top of this, it's quite a difficult task to program. Let me explain why with some simple examples:
List : 1,2,3,4,5
Desired result: 36
=> no way: even when you sum all elements of the list, the result is not large enough.
List : 6,7,8,9,10
Desired result: 5
=> no way: even the smallest element in the list is larger than the result.
List : 1,2,20,21,23
Desired result: 10
=> no way (but how to make a computer see this?)
List : 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
Desired result: 119
=> easy as hell: the sum of all numbers of the list is 120, just subtract 1.
List: 2,4,6,8,10,12,14,16,18,20
Desired result: 11
=> no way: all numbers in the list are even, so you can't obtain an odd number.
As you see, for a human this is fairly simple, but how to get a computer to do this?
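One way a computer can approach this is a backtracking search with pruning. The following is a minimal Python sketch, not an Excel formula, assuming all numbers are positive integers; the function name find_subsets is hypothetical. It sorts the list and stops exploring a branch as soon as the next number is larger than what is still missing, which covers the "smallest element is already too large" case above. The worst case is still exponential, so with thousands of numbers it may not finish in reasonable time.

```python
def find_subsets(numbers, target):
    """Return every combination of list entries that sums to target.

    Assumes positive integers. An empty result means there are no matches.
    """
    numbers = sorted(numbers)
    results = []

    def search(start, remaining, chosen):
        if remaining == 0:
            results.append(list(chosen))
            return
        for i in range(start, len(numbers)):
            value = numbers[i]
            if value > remaining:
                break  # sorted input: every later value is too large as well
            chosen.append(value)
            search(i + 1, remaining - value, chosen)
            chosen.pop()

    search(0, target, [])
    return results


print(find_subsets([1, 2, 3, 4, 5, 6, 7, 8], 5))   # [[1, 4], [2, 3], [5]]
print(find_subsets([10, 20], 5) or "no available matches")
```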

PowerQuery remove items from a list matching a pattern

Using PowerQuery in Excel, how can I remove items from a list that match a pattern?
I have a column with cells that contain names and numeric IDs. I want to be left with just a list of names.
LastName, FirstName;#123;#LastName, FirstName;#321;
The numbers are all unique. So if I had regex the pattern would be similar to
/^#\d+$/
I can split the cell into a list using ';' as a separator.
= Text.Split([Consultant],";")
If there was a way to remove every 2nd item until the end that could work too. Unfortunately it seems there is no way to specify patterns to match.
List.RemoveItems({1, 2, 3, 4, 2, 5, 5}, {2, 4, 6})
This would be awesome; however, I would have to list every number that exists, so this fails:
List.RemoveMatchingItems(Text.Split([Consultant], ";#"), {1,2,3,4,5,6,7,8,9})
Method 2
I split the text into a list as above, which gave me a column of lists. I then expanded the lists into new rows. My plan was to remove alternate rows. However, removing alternate rows requires an end number; I would need it to continue until there are no more rows to process.
There are many ways.
One way is to select every other item with List.Select
In your example, these would be the items with an even number position.
let
x = Text.Split([Column1],";#"),
y = List.Select(x, each Number.IsEven(List.PositionOf(x, _)))
in
y
Edit Nov 2022
Another method of removing every other item would be:
=List.Alternate(Text.Split([Column1],";#"),1,1,1)
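For readers who want to see the logic outside Power Query, here is a minimal Python sketch of both ideas. It is not M code; the function names and the sample names are hypothetical. The first function keeps every other item after splitting on ";#". The second drops any item that consists only of digits, which is the pattern-based variant the question asks about.

```python
def keep_every_other(cell, separator=";#"):
    """Split the cell and keep the items at even positions (0, 2, 4, ...)."""
    return cell.split(separator)[::2]


def drop_numeric_ids(cell, separator=";#"):
    """Split the cell and drop items that are purely numeric IDs."""
    parts = cell.split(separator)
    # A trailing ';' can remain on the last item, so strip it before testing.
    return [p for p in parts if p and not p.rstrip(";").isdigit()]


cell = "Doe, Jane;#123;#Roe, Rick;#321;"  # made-up sample in the question's layout
print(keep_every_other(cell))   # ['Doe, Jane', 'Roe, Rick']
print(drop_numeric_ids(cell))   # ['Doe, Jane', 'Roe, Rick']
```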

How to get the second largest value in a column

Recently I discovered the LARGE and SMALL worksheet functions, which one can use to determine the first, second, third, ... largest or smallest value in an array.
At least, that's what I thought:
When having a look at the array [1, 3, 5, 7, 9] (in one column or row), the LARGE(...;2) gives 7 as expected, but:
When having a look at the array [1, 1, 5, 9, 9], I expect LARGE(...;2) to give 5 but instead I get 9.
Now this makes sense: it seems that LARGE(...;2) takes the largest entry in the array (value 9 in the last-but-one place), deletes it, and gives the largest entry of the reduced array (which still contains another 9), but this is not what one might expect intuitively.
In order to get 5 from [1, 1, 5, 9, 9], I would need something like:
=LARGE_OF_UNIQUE_VALUES_OF(...;2)
I didn't find this in the LARGE documentation.
Does anybody know an easy way to achieve this?
If you have the new Dynamic Array formulas:
=LARGE(UNIQUE(...),2)
If not use AGGREGATE:
=AGGREGATE(14,7,A1:A5/(MATCH(A1:A5,A1:A5,0)=ROW(A1:A5)),2)
This is a bit of a hack.
=LARGE(IF(YOUR_DATA=LARGE(YOUR_DATA,1),SMALL(YOUR_DATA,1)-1,YOUR_DATA),1)
The idea is to (a) take any value in your data that is equal to the largest element and set it to less than the smallest element, then (b) find the (new) largest element. It's OK if you want the 2nd largest, but extending to 3rd largest etc. gets progressively uglier.
Hope that helps
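The difference between the two behaviours can be sketched in a few lines of Python. This is only an illustration of the logic, not an Excel formula, and both function names are hypothetical. The first function mirrors what LARGE does with duplicates; the second removes duplicates first, like LARGE(UNIQUE(...),2).

```python
def large(values, n):
    """n-th largest entry, counting duplicates separately."""
    return sorted(values, reverse=True)[n - 1]


def large_of_unique(values, n):
    """n-th largest distinct value."""
    return sorted(set(values), reverse=True)[n - 1]


data = [1, 1, 5, 9, 9]
print(large(data, 2))            # 9
print(large_of_unique(data, 2))  # 5
```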

Advance sorting in Excel

In our warehouse we have an even/odd system of locations.
Here is an example:
1-101-1
1-103-1
1-105-1
....
1-285-1
and
2-102-1
2-104-1
2-116-1
2-240-1
....
2-286-1
and there are levels too:
1-101-2
1-101-3
1-101-4
There is a lot of data, and I need to sort it as shown below.
Example numbers:
1-101-1
2-130-1
1-131-1
1-150-2
2-132-3
3-229-5
4-262-1
4-286-5
7-267-1
5-239-1
6-270-1
7-267-3
I need them sorted like this:
1-101-1
2-130-1
1-131-1
2-132-3
4-286-5
4-262-1
3-229-5
5-239-1
6-270-1
7-267-1
7-267-1
The point is that locations starting with the first two numbers (1-101-1; 2-102-1) go from smallest to biggest, the next two (3-285-1; 4-286) go from biggest to smallest, 5 and 6 go from smallest to biggest again, and so on in that pattern to the end.
The second sort criterion is the middle number: it goes first from smallest to biggest, then from biggest to smallest. The last number is the level; other levels should be sorted together with level 1, or at least be placed near it, as with 7-267-1 and 7-267-3.
Is there any solution? Thanks.
Edit:
Here is an image for easier understanding, because it is hard to explain.
Thanks to all for the answers, especially Daniel, who is an expert in Excel and understood what I need.
I thought there was no solution for a sort like that without VBA, but Daniel showed me that I was wrong. Thanks again.
That is what I need, but there are some errors; perhaps you can help me with them.
This is another example with other locations:
These are the unsorted locations with the formulas you gave me,
and these are sorted, but in the wrong order,
and here it is with errors:
We have 120 rows; numbers bigger than 99 display an error, and the number 22-250-1 comes out as -25 in the second row.
I tried the formulas with the numbers you entered in this example and got the same good sort as you, but after entering other locations, the sort goes wrong.
Welcome to StackOverflow!
I think I understand what is being requested. It's a bit difficult to explain but I'll give it a try.
The primary sorting is to be as follows:
1. If the first digit is either 3 or 4, it should be in descending order, otherwise ascending.
2. If the middle three digits are from a 3- or 4-numbered sequence (see #1 above), then they should be in descending order.
3. All sequences should be in ascending order based on their final digit.
My solution breaks the sequence into distinct columns:
For example, create three columns: First, Second, Third.
Formula for First:
=INT(LEFT(A2, 1))
Formula for Second:
=INT(RIGHT(LEFT(A2,5), 3))
Formula for Third:
=INT(RIGHT(A2,1))
Next, we assign values for sorting these three fields:
Create a column labeled First_Sort_Pair:
=IF(OR(B2=1,B2=2),1,
IF(B2=3,3,
IF(B2=4,2,
IF(OR(B2=5,B2=6),4,
IF(OR(B2=7,B2=8),5,6)))))
Create a column labeled First_Sort:
=IF(OR(B2=3, B2=4), 2, 1)
Create a column labeled Second_Sort:
=IF(E2=4, 2, IF(E2=3, 3, 1))
Create a column labeled Sort_3_4:
=IF(OR(B2=3,B2=4),RANK(C2,C:C,0),)
You can now begin sorting:
Result:
You will now have your data sorted as intended.
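For comparison, the same ordering can be sketched in Python. This is an illustration under my own reading of the question, not a translation of the formulas above: leading numbers are grouped in pairs (1 and 2, 3 and 4, 5 and 6, ...), the middle number runs ascending within the first pair, descending within the next, and so on, with the level as a tie-breaker. The function names are hypothetical. Splitting on "-" instead of using fixed character positions also copes with codes such as 22-250-1, where LEFT(A2, 1) would only see the first character.

```python
def parse_location(code):
    """Split 'first-middle-level' into three integers, whatever their width."""
    first, middle, level = (int(part) for part in code.split("-"))
    return first, middle, level


def serpentine_key(code):
    """Sort key: pair of leading numbers, then middle number, then level."""
    first, middle, level = parse_location(code)
    pair = (first - 1) // 2              # 1,2 -> 0; 3,4 -> 1; 5,6 -> 2; ...
    # Even pairs run ascending, odd pairs descending (negate the middle number).
    signed_middle = middle if pair % 2 == 0 else -middle
    return pair, signed_middle, level


locations = [
    "1-101-1", "2-130-1", "1-131-1", "1-150-2", "2-132-3", "3-229-5",
    "4-262-1", "4-286-5", "7-267-1", "5-239-1", "6-270-1", "7-267-3",
]
for code in sorted(locations, key=serpentine_key):
    print(code)
```

With the example numbers from the question, this prints 1-101-1, 2-130-1, 1-131-1, 2-132-3, 1-150-2, 4-286-5, 4-262-1, 3-229-5, 5-239-1, 6-270-1, 7-267-1, 7-267-3.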

Obtain every nth row of filtered records

I'm looking for information on how to copy every nth row of records from one Excel sheet to the next, and now I am wondering if there is a way to do this for filtered data. For example, I have 400 students enrolled at school, and I want every 15th male whose parents have not graduated from college (flags have been created for both gender and parent education, which I am using to filter on). Are there any ideas on how to do this? If not, I could just use the OFFSET function for each combination of variables I am filtering on, but that's over 30-40 combinations if I did my math right. Thanks for any help you can provide.
There are a few standard formulas used for retrieving the first, second, third, etc set of values that match criteria. I prefer a standard formula model using the INDEX function and SMALL function. By throwing a little maths at the increment to change it from 1, 2, 3 ... to 1, 16, 31, 46, ... you should be able to achieve your offset results. In the following example image, I've used a stagger of 4 rather than 15 in order to accommodate sample data vertically while still producing more than a single result.
The formula in F2 is,
=IFERROR(INDEX(A$2:A$999, SMALL(INDEX(ROW($1:$998)+((C$2:C$999<>"M")+(D$2:D$999<>"N"))*1E+99, , ), 1+(ROW(1:1)-1)*4)), "")
For your purposes the 4 in 1+(ROW(1:1)-1)*4 will need to be changed to 15.
=IFERROR(INDEX(A$2:A$999, SMALL(INDEX(ROW($1:$998)+((C$2:C$999<>"M")+(D$2:D$999<>"N"))*1E+99, , ), 1+(ROW(1:1)-1)*15)), "")
Fill down as necessary.
Once you have retrieved a unique identifier, the remainder can be retrieved with a simple VLOOKUP function.
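The logic behind the formula (keep only the rows that match the criteria, then take the 1st, (n+1)th, (2n+1)th, ... of them) can be sketched in Python as follows. This is only an illustration, not a replacement for the worksheet formula; the function name, the field names and the sample records are hypothetical.

```python
def every_nth_match(records, predicate, n):
    """Return the 1st, (n+1)th, (2n+1)th, ... record that satisfies predicate."""
    matches = [record for record in records if predicate(record)]
    return matches[::n]


# Made-up sample: 40 students, odd ids are male, none of the parents graduated.
students = [
    {"id": i, "gender": "M" if i % 2 else "F", "parent_grad": "N"}
    for i in range(1, 41)
]


def is_target(student):
    return student["gender"] == "M" and student["parent_grad"] == "N"


picked = every_nth_match(students, is_target, 4)
print([student["id"] for student in picked])  # [1, 9, 17, 25, 33]
```

As in the formula, a stagger of 4 is used here so that a small sample still produces several results; for the question's case n would be 15.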
