Find common subsets between "big" sets - string

So, I have a file that contains 13,000+ rows. Each row has a list of destinations separated by the character ";". Across all those lists I need to find the 10 most common subsets of destinations (ignoring the empty set and sets containing only 1 destination), and the number of times each of these subsets appears in the data.
An example may make this easier to understand:
This would be the file (each letter represents a destination)
A;B;C;D
A;B
A;B;C;D;E
A;B;C;D;E;F;G
A;B;C;D;E;F;G;H;L
C;G;B
K;H
So, the most common subsets of destinations together would be:
1. A;B : 5
2. A;C : 4
3. A;D : 4
4. A;B;C : 4
5. A;B;C;D : 4
6. A;E : 3
7. A;B;C;D;E : 3
8. B;C;D;E : 3
9. C;D;E : 3
10. A;B;C;D;E;F : 2
This problem seems very complex to me. I think it would be easier to solve by limiting the size of the subsets to n (or a fixed number like 3).
Any ideas on how to solve it? I think I need something like FP-Growth, but without generating the association rules.
Thanks!

You can solve this with one loop.
You have to create a hashmap to store the results.
Give every destination a unique prime number and multiply the prime numbers of one line; the product is the key of the hashmap. If the key does not exist, add it with a value of 1. If it exists, increase the value. This works because prime factorization is unique: a given product can only come from one set of destinations. At the end you have to find the highest value in your hashmap.
(Hint: also save the destination names in the value of the hashmap, then you do not have to convert the number back to the destinations.)
(2nd hint: remember the highest count and its key as you go, so you don't have to search the hashmap at the end.)
EDIT: for the combinations like A;B;C => A;B and also B;C, you can use 2 for loops to go through the line.
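A minimal Python sketch of the counting idea above, under a few assumptions. It swaps the prime-product key for a sorted tuple of destination names (which identifies a subset just as uniquely), it enumerates subsets with itertools.combinations instead of two hand-written loops, and it caps the subset size at max_size as the question suggests. The name top_subsets and its parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_subsets(rows, max_size=3, top=10):
    """Count every subset of 2..max_size destinations across all rows."""
    counts = Counter()
    for row in rows:
        # Sort and de-duplicate so the same subset always yields the same key.
        items = sorted({d.strip() for d in row.split(";") if d.strip()})
        for size in range(2, min(max_size, len(items)) + 1):
            counts.update(combinations(items, size))
    return counts.most_common(top)


rows = [
    "A;B;C;D",
    "A;B",
    "A;B;C;D;E",
    "A;B;C;D;E;F;G",
    "A;B;C;D;E;F;G;H;L",
    "C;G;B",
    "K;H",
]
for subset, count in top_subsets(rows):
    print(";".join(subset), ":", count)
```

The cap matters: a row with k destinations has 2^k subsets, so without a limit the number of keys can grow very quickly on long rows.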

Related

Counting if part of string is within interval

I am currently trying to check if a number in a comma-separated string is within a number interval. What I am trying to do is to check if an area code (from the comma-separated string) is within the interval of an area.
The data:
AREAS
Area interval | Name   | Number of locations
1000-1499     | Area 1 | ?
1500-1799     | Area 2 | ?
1800-1999     | Area 3 | ?
GEOLOCATIONS
Name       | Areas List
Location A | 1200, 1400
Location B | 1020, 1720
Location C | 1700, 1920
Location D | 1940, 1950, 1730
The result I want here is the number of unique locations whose "Areas List" contains a code within the area interval. So Location D should only count ONCE in the 1800-1999 area, and likewise Location A in the 1000-1499 area. But Location B should count once in both 1000-1499 and 1500-1799 (because a number from each interval is in the comma-separated string in "Areas List"):
Area interval | Name   | Number of locations
1000-1499     | Area 1 | 2
1500-1799     | Area 2 | 3
1800-1999     | Area 3 | 2
How is this possible?
I have tried with COUNTIFS, but it doesn't seem to do the job.
Here is one option using FILTERXML():
Formula in C2:
=SUM(FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>","//t[count(.//*[.>="&SUBSTITUTE(A2,"-","][.<=")&"])>0]"))
Where:
"<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>" - This is the part where we construct a valid piece of XML. The theory is that we use three axes. Each t-node is given a literal 1 to make sure that once we return the t-nodes with XPath we can sum the result. The outer x-node is there to make sure Excel will handle the inner axes correctly. If you are curious to know how this XML looks at the end, it's best to step through using the 'Evaluate Formula' function on the Formulas tab;
//t[count(.//*[.>="&SUBSTITUTE(A2,"-","][.<=")&"])>0]")) - Basically means that we collect all t-nodes where the count of child s-nodes that are >= the leftmost number and <= the rightmost number is larger than zero. For A2 the XPath would look like //t[count(.//*[.>=1000][.<=1499])>0] after substitution. In short: //t selects t-nodes, .//* selects all their child nodes, and a t-node is kept when the count of child nodes that fulfill both requirements [.>=1000][.<=1499] is larger than zero;
Since all t-nodes equal the number 1, the SUM() of these t-nodes equals the number of unique locations that have at least one area code within the interval in their Areas List;
Important to note that FILTERXML() will result into an error if no t-nodes could be found. That would mean we need to wrap the FILTERXML() in an IFERROR(...., 0) to counter that and make the SUM() still work correctly.
Or, wrap the above in BYROW():
Formula in C2:
=BYROW(A2:A4,LAMBDA(a,SUM(FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>","//t[count(.//*[.>="&SUBSTITUTE(a,"-","][.<=")&"])>0]"))))
Using MMULT and TEXTSPLIT:
=LET(rng,TEXTSPLIT(D2,"-"),
tarr,IFERROR(--TRIM(TEXTSPLIT(TEXTJOIN(";",,$B$2:$B$5),",",";")),0),
SUM(--(MMULT((tarr>=--TAKE(rng,,1))*(tarr<=--TAKE(rng,,-1)),SEQUENCE(COLUMNS(tarr),,1,0))>0)))
I am in very distinguished company but will add my version anyway as byrow probably is a slightly different approach
=LET(range,B$2:B$5,
lowerLimit,--@TEXTSPLIT(E2,"-"),
upperLimit,--INDEX(TEXTSPLIT(E2,"-"),2),
counts,BYROW(range,LAMBDA(r,SUM((--TEXTSPLIT(r,",")>=lowerLimit)*(--TEXTSPLIT(r,",")<=upperLimit)))),
SUM(--(counts>0))
)
Here is the ugly way to do it, with A LOT of helper columns. But not so complicated 🙂
F4= =TRANSPOSE(FILTERXML("<m><r>"&SUBSTITUTE(B4;",";"</r><r>")&"</r></m>";"//r"))
F11= =TRANSPOSE(FILTERXML("<m><r>"&SUBSTITUTE(A11;"-";"</r><r>")&"</r></m>";"//r"))
F16= =SUM(F18:F21)
F18= =IF(SUM(($F4:$O4>=$F$11)*($F4:$O4<=$G$11))>0;1;"")
G18= =IF(SUM(($F4:$O4>=$F$12)*($F4:$O4<=$G$12))>0;1;"")
H18= =IF(SUM(($F4:$O4>=$F$13)*($F4:$O4<=$G$13))>0;1;"")
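To check the expected counts outside Excel, the logic shared by the formulas above (split each Areas List, test each code against the interval, count a location once if any code matches) can be sketched in Python. This is only an illustration: count_locations and the locations dictionary are hypothetical names, and the data is copied from the question.

```python
def count_locations(interval, locations):
    """Count locations that have at least one area code inside the interval."""
    low, high = (int(part) for part in interval.split("-"))
    return sum(
        # any() makes a location count once, however many codes match.
        any(low <= int(code) <= high for code in codes.split(","))
        for codes in locations.values()
    )


locations = {
    "Location A": "1200, 1400",
    "Location B": "1020, 1720",
    "Location C": "1700, 1920",
    "Location D": "1940, 1950, 1730",
}
for interval in ("1000-1499", "1500-1799", "1800-1999"):
    print(interval, count_locations(interval, locations))
```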

Excel : Find numbers in a list which when summed total a target number

I've got a list of several thousand numbers.
I have a target number which is the sum of some of the numbers in the list.
I want to be able to find which numbers in the list, when summed, total the target number.
Eg.
List :
1, 2, 3, 4, 5, 6, 7, 8
Target :
5
Result :
The target could be made from the sum of 1+4, 2+3, 5
If there is no way the target can be achieved from the numbers in the list (e.g. list: 10, 20 and target: 5), then the formula should output "no available matches".
That is a very simple example but in practice there are thousands of numbers in the list and some of the numbers are up to seven digits long.
Is there a formula that could be used in Excel (or Google Sheets) that would work this out automatically? Preferably as a native function rather than VBA / Script.
Your request is so complex that there is no native function which can do that. On top of that, the result of your function is a whole collection of possibilities (like [5, 1+4, 2+3] in your particular case), while Excel functions only generate a single result, which can be placed in a single cell.
Besides this, it's quite a difficult task to program. Let me explain why with some simple examples:
List : 1,2,3,4,5
Desired result: 36
=> no way: even when you sum all elements of the list, the result is not large enough.
List : 6,7,8,9,10
Desired result: 5
=> no way: even the smallest element in the list is larger than the result.
List : 1,2,20,21,23
Desired result: 10
=> no way (but how to make a computer see this?)
List : 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
Desired result: 119
=> easy as hell: the sum of all numbers of the list is 120, just subtract 1.
List: 2,4,6,8,10,12,14,16,18,20
Desired result: 11
=> no way: all numbers in the list are even, so you can't obtain an odd number.
As you see, for a human this is fairly simple, but how to get a computer to do this?
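One common way to get a computer to do this is a dynamic-programming search over reachable sums. The Python sketch below is an illustration, not something from the answer above: find_subset is a hypothetical name, it assumes positive integers, and it returns only one matching combination (or None) rather than all of them. With thousands of seven-digit numbers, the table of reachable sums can become very large, so treat this as a sketch for modest inputs.

```python
def find_subset(numbers, target):
    """Return one list of numbers summing to target, or None if impossible.

    Assumes positive integers; each list entry is used at most once.
    """
    # Maps a reachable sum to (previous_sum, number_added); 0 is the start.
    reachable = {0: None}
    for n in numbers:
        # Iterate over a snapshot so n is not added to sums that already use n.
        for s in list(reachable):
            t = s + n
            if t <= target and t not in reachable:
                reachable[t] = (s, n)
    if target not in reachable:
        return None
    # Walk back from the target to 0 to recover the numbers used.
    result = []
    s = target
    while reachable[s] is not None:
        s, n = reachable[s]
        result.append(n)
    return result[::-1]


match = find_subset([1, 2, 3, 4, 5, 6, 7, 8], 5)
print(match if match is not None else "no available matches")
print(find_subset([10, 20], 5) or "no available matches")
```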

2nd largest no. in python list

I'm a beginner in python and I've been solving this problem to find the second largest element of a python list. There are a no. of ways to solve this problem but the way I tried to solve it was removing the largest value (no matter how many times it would be present in the list) and then printing the maximum value of the modified list.
n = int(input("Enter the no. of list entries"))
list_students = []
for i in range(0, n):
    the_input = int(input("Enter the list element"))
    list_students.append(the_input)
highest = max(list_students)
for i in list_students:
    print("Considering",i)
    if i==highest:
        print("to be deleted ",i)
        list_students.remove(i)
print("the max value is",max(list_students))
Output-
Enter the list element 4
Enter the list element 4
Enter the list element 4
Enter the list element 3
Enter the list element 1
Considering 4
to be deleted 4
Considering 4
to be deleted 4
Considering 1
the max value is 4
While it was expected to be 3. It can be clearly seen that the loop doesn't even consider the third 4 and its neighboring element which is 3. And it happens every time no matter how many times the highest element is repeatedly entered. Can anyone please explain the reason behind this behavior?
The remove function removes the first matching value, not the current value you are testing in the loop. So you are modifying the list at the same time as you go through it.
What you could try is to call remove(highest) until the maximum value changes.
Like this:
while max(list_students) == highest:
    list_students.remove(highest)
The reason for the strange behavior you're observing is that you're removing items from the list you're iterating through. Removing an item shifts the later items one position to the left, so the iteration skips the item that moved into the current position.
To fix your code without rewriting it, you can simply make a copy of the list before iterating through it:
for i in list_students.copy():
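Another way to sidestep the problem, sketched here as one option among several, is to build a new list instead of removing items from the one being iterated. The function name second_largest is hypothetical, and returning None when no second value exists is an assumption about the desired behavior.

```python
def second_largest(values):
    """Return the largest value below the maximum, or None if there is none."""
    if not values:
        return None
    highest = max(values)
    # A new list is built, so nothing is removed while iterating.
    remaining = [v for v in values if v != highest]
    return max(remaining) if remaining else None


print(second_largest([4, 4, 4, 3, 1]))  # → 3
```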

Time complexity of my backtracking to find the optimal solution of the maximum sum non adjacent

I'm trying to do dynamic programming backtracking of maximum sum of non adjacent elements to construct the optimal solution to get the max sum.
Background:
Say if input list is [1,2,3,4,5]
The memoization should be [1,2,4,6,9]
And my maximum sum is 9, right?
My solution:
1. I find the first occurrence of the max sum in memo (as we may not choose the last item) [this is O(N)]
2. Then I find the previous item chosen by using this formula:
max_sum -= a_list[index]
In this example, 9 - 5 = 4, and 4 is at index 2 in the memo, so the previous item chosen is "3", which is also at index 2 in the input list.
3. I find the first occurrence of 4, which is at index 2 (I take the first occurrence for the same reason as in step 1: the item may not have been chosen when the same amount appears several times in a row) [Also O(N) but...]
The issue:
The third step of my solution is done in a while loop, let's say the non adjacent constraint is 1, the max amount we have to backtrack when the length of list is 5 is 3 times, approx N//2 times.
But the 3rd step uses Python's index function to find the first occurrence of the previous sum [which is O(N)]: memo.index(that_previous_sum)
So the total time complexity is about O(N//2 * N)
Which is O(N^2) !!!
Am I correct on the time complexity? Or am I wrong? Is there a more efficient way to backtrack the memoization list?
P.S. Sorry if I got the formatting wrong, thanks!
Solved:
I looped from the back, checking whether the item before the current one is the same or not.
If it's the same, the current one is not the first occurrence. If not, it's the first occurrence.
Tada! No need for Python's index function searching from the first index! We now find it from the back.
So the total time complexity was about O(N//2 * N).
Now it is O(N//2 + 1), which is O(N).
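A minimal Python sketch of a linear-time backtrack, for comparison. It is a standard variant rather than the exact procedure described above: instead of searching for a sum, it walks the memo from the back and compares each entry with its left neighbor. It assumes non-negative integers and a non-adjacency gap of 1; the name max_non_adjacent is hypothetical.

```python
def max_non_adjacent(a_list):
    """Return (max_sum, chosen_items) for non-adjacent picks from a_list."""
    if not a_list:
        return 0, []
    memo = []
    for i, value in enumerate(a_list):
        skip = memo[i - 1] if i >= 1 else 0            # best without item i
        take = value + (memo[i - 2] if i >= 2 else 0)  # best with item i
        memo.append(max(skip, take))
    # Backtrack: if memo[i] differs from memo[i - 1], item i must have been
    # taken, so jump two places back; otherwise move one place back.
    chosen = []
    i = len(a_list) - 1
    while i >= 0:
        if i == 0 or memo[i] != memo[i - 1]:
            chosen.append(a_list[i])
            i -= 2
        else:
            i -= 1
    return memo[-1], chosen[::-1]


print(max_non_adjacent([1, 2, 3, 4, 5]))  # → (9, [1, 3, 5])
```

Each loop iteration does constant work and moves the index at least one step toward the front, so the backtrack is O(N).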

in excel, I want to count the number of cells that do not contain a specific character

in excel, I want to count the number of cells that do not contain a specific character (in this case, a "." /period).
I tried something like countif(A1:A10,"<>.*") but this is wrong and I can't seem to figure it out.
Say I have these data in column A:
D
N
P
.
.
A
N
.
P
.
And the count would be 6
For your example:
=COUNTIF(A1:A10,"<>.")
returns 6. But it would be a different story if say you wanted to exclude P. from the count also.
Your data may not be quite what you think it is however, because including the * should make no difference for your example.
Or you could subtract periods from the total and be left with the non periods
=COUNTIF(A1:A10,"*")-COUNTIF(A1:A10,"=.")
gives 6.
If your data includes periods along with other characters in the same cell and you want a similar count, then this:
=COUNTA(A1:A10)-COUNTIF(A1:A10,"*.*")
will return 5
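The difference between "not equal to a period" and "does not contain a period" can be illustrated outside Excel with a small Python sketch. The function names are hypothetical and the second data set is made up for the example; only the first list comes from the question.

```python
def count_not_equal(cells, char="."):
    """Like COUNTIF(range, "<>."): cells whose whole content is not char."""
    return sum(1 for cell in cells if cell != char)


def count_not_containing(cells, char="."):
    """Like COUNTA(range) - COUNTIF(range, "*.*"): cells without char anywhere."""
    return sum(1 for cell in cells if char not in cell)


column_a = ["D", "N", "P", ".", ".", "A", "N", ".", "P", "."]
print(count_not_equal(column_a))       # → 6
print(count_not_containing(column_a))  # → 6

mixed = ["P.", "A", ".", "N"]          # hypothetical data with "P."
print(count_not_equal(mixed))          # → 3
print(count_not_containing(mixed))     # → 2
```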
