Analyse runtime of my algorithm - python-3.x

I am working on two algorithms for a course, both for the vertex cover problem.
For the first part I created a brute-force algorithm: it creates every possible combination of vertices, removes sets that are not covers, then analyses them. This size I already have.
The second part is the same brute force with an added heuristic: I eliminate the lower portion of combos that are unlikely to make a cover, based on the number of edges.
Since both of these do work proportional to the sum of all base elements in the combos, I need to understand the size of that list.
The graphs are randomly generated with integers for vertices and edges created randomly from pairs of vertices.
combos = []
vertices = [1, 2, 3,...]
edges = [(1, 2), (2, 3),...]
E = len(edges)
V = len(vertices)
Brute force:

import itertools

for x in range(1, V + 1):
    for subset in itertools.combinations(vertices, x):
        combos.append(subset)

total = 0   # renamed from `sum` to avoid shadowing the built-in
for combo in combos:
    for _ in combo:
        total += 1
The sum of brute force is: the sum of x * C(V, x) for x = 1..V, which equals V * 2^(V-1).
Heuristic:

from math import ceil

combos = []   # start fresh for the heuristic run
for x in range(ceil((V ** 2) / E), V + 1):
    for subset in itertools.combinations(vertices, x):
        combos.append(subset)

total = 0
for combo in combos:
    for _ in combo:
        total += 1
The sum as I thought it would end up being: V * 2^(V-1) minus the sum of x * C(V, x) for x = 1..ceil(V^2/E).
However, my test runs do not match up for the heuristic; the brute force does match.
Sample runs:
V E Brute Heuristic
5 10 80 25
6 11 192 36
7 17 448 294
8 23 1024 792
9 25 2304 1467
10 36 5120 4660

Ok, so I was doing my math formula wrong: the heuristic count is not the sum over all of V minus the sum over the lower end. It is the sum from the lower end up to V: the sum of x * C(V, x) for x = ceil(V^2/E)..V.
I do not know exactly why this works and my original does not, since logically the other one is just an expanded version of this one.
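For anyone checking the closed forms against the loops, here is a small verification sketch (Python 3.8+ for math.comb; the variable names are mine, not from the original code):

from itertools import combinations
from math import ceil, comb

V, E = 5, 10
vertices = range(1, V + 1)

# brute force: count every element of every subset directly
brute = sum(len(s) for x in range(1, V + 1)
            for s in combinations(vertices, x))
assert brute == V * 2 ** (V - 1)   # closed form; 80 for V = 5

# heuristic: the same count, starting from the lower bound
lo = ceil(V ** 2 / E)
heuristic = sum(x * comb(V, x) for x in range(lo, V + 1))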


TRUE/FALSE ← VLOOKUP ← Identify the ROW! of the first negative value within a column

Firstly, we have an array of predetermined factors, i.e. V-Z; each has 3 attributes, the first two (• x m) multiplied giving the 3rd.
f ... factor
• ... cap, the maximum the values in the data set may increase by
m ... fixed multiplier
p ... let's call it power
This is a separate, standalone array; we'd access it with e.g. VLOOKUP.
f • m pwr
V 1 9 9
W 2 8 16
X 3 7 21
Y 4 6 24
Z 5 5 25
—————————————————————————————————————————————
Then we have 6 columns containing the actual data to be processed, from which we derive the next-level result, based on the interaction of the two samples introduced.
In addition, two columns are added, for balance & profit.
Here's a short data sample:
f • m bal profit
V 2 3 377 1
Y 2 3 156 7
Y 1 1 122 0
X 1 2 -27 2
Z 3 3 223 3
—————————————————————————————————————————————
Ultimately, starting at the end, we are comparing IF -27 inverted (so 27) is within X's power range, i.e. 21 (as per the first sample), which is then fed into a bigger formula, beyond the scope of this post.
This can be done with VLOOKUP, all fine by now.
—————————————————————————————————————————————
To get to that, for the working example we are focusing (coincidentally) on row 5, since that's the one with the first negative value in the 'balance' column. So:
factor X = which factor exactly is, to us, unknown, &
balance -27 = which we have to locate amongst potentially dozens to hundreds of rows.
Why!?
Once we know that the factor is X, based on the • & multiplier pertaining to it, then we also know which 'power' (from the top array) to compare the identified first negative value in the balance column, -27, against.
Is that clear?
I'd like to know the formula for how to achieve that, & (get to) move on with the broader-scope work.
—————————————————————————————————————————————
The main issue for me is not knowing how to identify the first negative value, or the row -27 pertains to; then, having that piece of information, how to leverage it to get the X, i.e. identify the factor type. This is especially tricky since the factor is positioned to the left of the balance column, and to the best of my knowledge I cannot use a negative column index number (so that route, even if possible, is out of the question anyway).
To recap:
IF(21>27) = IF(-21<-27)
27 → LOCATE ROW with the first negative number (-27)
21 → IDENTIFY the FACTOR TYPE, same row as (-27)
→ VLOOKUP pwr, based on factor type identified (top array, 4th column right)
→ invert either 21 to a negative number or (-27) to the positive number
= TRUE/FALSE
Guessing your columns, I'll say your first chart is in columns A to D, and the second in columns G to K.
You could find the letter of that factor with something like this:
=INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0)))
INDEX(J:J<0) converts that column to TRUE/FALSE depending on whether each value is negative, and with XMATCH you find the first TRUE. You can then use that in VLOOKUP:
=VLOOKUP(INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0))),A:D,4,0)
That would return the 21. You can use the same concept to find the -27 and, with ABS, get its positive value:
=VLOOKUP(INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0))),A:D,4,0) > ABS(INDEX(J:J,XMATCH(TRUE,INDEX(J:J<0))))
That should return TRUE or FALSE for the comparison.
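Outside Excel, the intended logic can be sanity-checked with a quick Python sketch of the same lookups (the names here are made up for illustration):

# first chart: factor -> pwr (its 4th column)
powers = {"V": 9, "W": 16, "X": 21, "Y": 24, "Z": 25}
# second chart: (factor, balance) pairs from the sample rows
rows = [("V", 377), ("Y", 156), ("Y", 122), ("X", -27), ("Z", 223)]

factor, balance = next((f, b) for f, b in rows if b < 0)  # first negative balance
print(powers[factor] > abs(balance))                      # 21 > 27 -> False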

Sum of arrays with repeated indices

How can I add an array of numbers to another array by indices, especially with repeated indices? Like this:
x
1 2 3 4
idx
0 1 0
y
5 6 7
] x add idx;y NB. (1 + 5 + 7) , (2 + 6) , 3 , 4
13 8 3 4
All nouns (x, idx, y) can be millions of items long, and I need a fast 'add' verb.
UPDATE
Solution (thanks to Dan Bron):
cumIdx =: 1 : 0
:
'i z' =. y
n =. ~. i
x n}~ (n{x) + i u//. z
)
(1 2 3 4) + cumIdx (0 1 0);(5 6 7)
13 8 3 4
For now, a short answer in the "get it done" mode:
data =. 1 2 3 4
idx =. 0 1 0
updat =. 5 6 7
cumIdx =: adverb define
:
n =. ~. m
y n}~ (n{y) + m +//. x
)
updat idx cumIdx data NB. 13 8 3 4
In brief:
Start by grouping the update array (in your post, y¹) where your index array has the same value, and taking the sum of each group
Accomplish this using the adverb key (/.) with sum (+/) as its verbal argument, deriving a dyadic verb whose arguments are idx on the left and the update array (your y, my updat) on the right.
Get the nub (~.) of your index array
Select these (unique) indices from your value array (your x, my data)
This will, by definition, have the same length as the cumulative sums we calculated in (1.)
Add these to the cumulative sum
Now you have your final updates to the data; updat and idx have the same length, so you just merge them into your value array using }, as you did in your code
Since we kept the update array small (never greater than its original length), this should have decent performance on larger inputs, though I haven't run any tests. The only performance drawback is the double computation of the nub of idx (once explicitly with ~. and once implicitly with /.), though since your values are integers, this should be relatively cheap; it's one of J's stronger areas, performance-wise.
¹ I realize renaming your arrays makes this answer more verbose than it needs to be. However, since you named your primary data x rather than y (which is the convention), if I had just kept your naming convention, then when I invoked cumIdx, the names of the nouns inside the definition would have the opposite meanings to the ones outside the definition, which I thought would cause greater confusion. For this reason, it's best to keep "primary data" on the right (y) and "control data" on the left (x). You might also consider constraining your use of the special names x, y, u, v, m, and n to where they're already implicitly defined by invoking an explicit definition; definitely never change their nameclasses.
This approach also uses key (/.) but is a bit more simplistic.
It is likely to use more space than Dan Bron's, especially for big updates.
addByIdx=: {{ (m , i.#y) +//. x,y }}
updat idx addByIdx data
13 8 3 4
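For comparison outside J: in Python/NumPy the same repeated-index accumulation is done with np.add.at, which performs unbuffered in-place addition, so duplicate indices accumulate instead of overwriting each other:

import numpy as np

x = np.array([1, 2, 3, 4])
idx = np.array([0, 1, 0])
y = np.array([5, 6, 7])

np.add.at(x, idx, y)   # x[0] += 5, x[1] += 6, x[0] += 7
print(x)               # [13  8  3  4]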

I want to remove rows where a specific value doesn't increase. Is there a faster/more elegant way?

I have a dataframe with 30 columns, 1,000,000 rows, and about 150 MB in size. One column is categorical with 7 different elements, and another column (Depth) contains mostly increasing numbers. The graph for each of the elements looks more or less like this.
I tried to save the column Depth as a series and iterate through it while dropping rows that don't match the criteria. This was reeeeeaaaally slow.
Afterwards I added a boolean column to the dataframe which indicates whether a row will be dropped, so I could drop all the rows at the end in a single step. Still slow. My last try (the code for it is in this post) was to create a boolean list and record there whether each row passes the criteria. Still really slow (about 5 hours).
dropList = [True] * len(df.index)
for element in elements:
    currentMax = 0
    minIdx = df.loc[df['Element'] == element]['Depth'].index.min()
    maxIdx = df.loc[df['Element'] == element]['Depth'].index.max()
    for x in range(minIdx, maxIdx):
        if df.loc[df['Element'] == element]['Depth'][x] < currentMax:
            dropList[x] = False
        else:
            currentMax = df.loc[df['Element'] == element]['Depth'][x]
df: the main dataframe
elements: a list of the 7 different elements (same as in the categorical column of df)
All rows in an element where the value Depth isn't bigger than all previous ones should be dropped. With the next element, it should start at 0 again.
Example:
Input: 'Depth' = [0 1 2 3 4 2 3 5 6]
'AnyOtherColumn' = [a b c d e f g h i]
Output: 'Depth' [0 1 2 3 4 5 6]
'AnyOtherColumn' = [a b c d e h i]
This should apply to whole rows in the dataframe of course.
Is there a way to get this faster?
EDIT:
The whole rows of the input dataframe should stay as they are. Just the ones where the 'Depth' does not increase should be dropped.
EDIT2:
The remaining rows should stay in their initial order.
How about a 2-step approach: first use a fast sorting algorithm (for example quicksort), and then get rid of all the duplicates?
Okay, I found a way that's faster. Here is the code:
from tqdm import tqdm

dropList = [True] * len(df.index)
for element in elements:
    currentMax = 0
    minIdx = df.loc[df['Element'] == element]['Tiefe'].index.min()
    # maxIdx = df.loc[df['Element'] == element]['Tiefe'].index.max()
    elementList = df.loc[df['Element'] == element]['Tiefe'].to_list()  # 'Tiefe' is the Depth column
    for x in tqdm(range(len(elementList))):
        if elementList[x] < currentMax:
            dropList[x + minIdx] = False
        else:
            currentMax = elementList[x]
I took the column and saved it as a list. To preserve the index of the dataframe, I saved the lowest one, and within the loop it gets added back.
Overall it seems the problem was the loc function. From an initial runtime of 5 hours, it's now about 10 seconds.
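For what it's worth, the explicit loop can be avoided entirely with a groupby/cummax mask: a row survives exactly when its Depth equals the running maximum of its Element group. A sketch on the example data from the question ('e1' is a stand-in element name):

import pandas as pd

df = pd.DataFrame({'Element': ['e1'] * 9,
                   'Depth':   [0, 1, 2, 3, 4, 2, 3, 5, 6],
                   'AnyOtherColumn': list('abcdefghi')})

# keep a row iff its Depth equals the running maximum of its Element group,
# i.e. iff it is >= every earlier Depth in that group
mask = df['Depth'] == df.groupby('Element')['Depth'].cummax()
print(df[mask])   # Depth 0 1 2 3 4 5 6, rows a b c d e h i; order preserved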

Sort numbers into groups so that the difference of their sums is minimal

I found a few threads that were similar; however, I believe mine is a bit unique. This will be difficult to write, so please bear with me.
I have a set of 10 accounts, each of which has a static number that cannot be split up. I have 3 employees who need these accounts split as evenly as possible. They cannot share an account.
For example:
(A)lpha = 15
(B)eta = 30
(C)harlie = 22
(D)elta = 19
(E)cho = 28
(F)ranklin = 3
(G)roto = 7
(H)enry = 28
(I)ndia = 38
(J)uliet = 48
The total sum is 238. In a perfect world, two people would get 79 and one person would get 80. However, remember we cannot break apart an account, so we need to add accounts together to get as close to an even spread as possible.
I need a formula for this, since situations like this occur regularly and it takes some time to figure out by hand. I believe this would be best executed with a helper column.
The closest I have come to is:
FHJ = 79
ABCG = 74
DEI = 85
But since this is recurring and can happen over even more accounts, I need something I can reuse over and over.
Another, less complex but approximate solution (sketched in Python below) would be to
sort your accounts from highest to lowest number,
then sort the numbers into 3 groups (A, B, C):
starting with the 3 highest numbers (48|J, 38|I, 30|B), one to each of groups A, B and C;
the next highest number (28|E) goes to the group with the lowest sum (C);
the next highest number (28|H) goes to the group with the lowest sum (B);
and so on …
You should end up with this:
A: J(48), C(22), G(7), F(3) -> 80
B: I(38), H(28), A(15) -> 81
C: B(30), E(28), D(19) -> 77
This is different from your manual solution, but closer. Compare the differences:
Solution from above: 81 - 77 = 4
Your manual solution: 85 - 74 = 11
This algorithm is an approximation; it will not always find the best solution, but if the difference between the lowest and highest number is not too large, the result is very close to the best solution.
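A minimal sketch of that greedy heuristic in Python (the account names and values are taken from the question; heapq keeps the group with the smallest running sum on top):

import heapq

accounts = {'A': 15, 'B': 30, 'C': 22, 'D': 19, 'E': 28,
            'F': 3, 'G': 7, 'H': 28, 'I': 38, 'J': 48}

heap = [(0, g) for g in range(3)]      # (group sum, group id) min-heap
groups = [[] for _ in range(3)]

for name, value in sorted(accounts.items(), key=lambda kv: -kv[1]):
    total, g = heapq.heappop(heap)     # group with the lowest sum so far
    groups[g].append(name)
    heapq.heappush(heap, (total + value, g))

for members in groups:
    print(members, sum(accounts[n] for n in members))
# prints group sums 80, 81 and 77: max - min = 4, as above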
This is known as the partition problem. You could try implementing the pseudo-polynomial time algorithm from the Wikipedia page; you'll have to modify it for 3 partitions instead of 2. Translated from the article's pseudocode into Python:
def find_partition(S):
    """Return True if the integer list S can be partitioned into two subsets with equal sum."""
    n = len(S)
    K = sum(S)
    half = K // 2
    # P[i][j] is True if some subset of the first j items of S sums to i
    P = [[False] * (n + 1) for _ in range(half + 1)]
    for j in range(n + 1):        # top row: a sum of 0 is always reachable
        P[0][j] = True
    for i in range(1, half + 1):
        for j in range(1, n + 1):
            if i - S[j - 1] >= 0:
                P[i][j] = P[i][j - 1] or P[i - S[j - 1]][j - 1]
            else:
                P[i][j] = P[i][j - 1]
    return P[half][n]
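For example, with the account values from the question, find_partition([15, 30, 22, 19, 28, 3, 7, 28, 38, 48]) returns True, since 48 + 38 + 30 + 3 = 119 = 238 // 2.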

Need Hint for ProjectEuler Problem

What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?
I could easily brute-force the solution in an imperative programming language with loops. But I want to do this in Haskell, and not having loops makes it much harder. I was thinking of doing something like this:
[n | n <- [1..], d <- [1..20], n `mod` d == 0] !! 0
But I know that won't work, because "d" will make the condition True at d = 1. I need a hint on how to make it so that n `mod` d is checked for every d in [1..20] and verified for all 20 numbers.
Again, please don't give me a solution. Thanks.
As with many of the Project Euler problems, this is at least as much about math as it is about programming.
What you're looking for is the least common multiple of a set of numbers, which happen to be in a sequence starting at 1.
A likely tactic in a functional language is trying to make it recursive based on figuring out the relation between the smallest number divisible by all of [1..n] and the smallest number divisible by all of [1..n+1]. Play with this with some smaller numbers than 20 and try to understand the mathematical relation or perhaps discern a pattern.
Instead of a search until you find such a number, consider instead a constructive algorithm, where, given a set of numbers, you construct the smallest (or least) positive number that is evenly divisible by (aka "is a common multiple of") all those numbers. Look at the algorithms there, and consider how Euclid's algorithm (which they mention) might apply.
Can you think of any relationship between two numbers in terms of their greatest common divisor and their least common multiple? How about among a set of numbers?
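To make the constructive idea concrete, here is a sketch in Python rather than Haskell (so the original exercise stays intact): the smallest number divisible by all of [1..n+1] is the lcm of the answer for [1..n] and n+1, and lcm can be built from Euclid's gcd:

from math import gcd

def smallest_multiple(n):
    # lcm(a, b) = a * b // gcd(a, b), folded over 1..n
    acc = 1
    for k in range(2, n + 1):
        acc = acc * k // gcd(acc, k)
    return acc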
If you look at it, this is a list filtering operation: an infinite list of numbers, to be filtered based on whether each number is divisible by all numbers from 1 to 20.
So what we need is a function which takes an integer and a list of integers and tells whether it is divisible by all the numbers in the list:
isDivisible :: [Int] -> Int -> Bool
isDivisible ds n = all (\d -> n `mod` d == 0) ds
and then use this in a list filter as
filter (isDivisible [1..20]) [1..]
Now, as Haskell is a lazy language, you just need to take the required number of items from the above filter result (in your case you need just one, so head sounds good).
I hope this helps you. This is a simple solution and there will be many other single line solutions for this too :)
Alternative answer: You can just take advantage of the lcm function provided in the Prelude.
For efficiently solving this, go with Don Roby's answer. If you just want a little hint on the brute-force approach, translate what you wrote back into English and see how it differs from the problem description.
You wrote something like "filter the product of the positive naturals and the positive naturals from 1 to 20"
what you want is more like "filter the positive naturals by some function of the positive naturals from 1 to 20"
You have to get mathy in this case. You are going to do a foldl through [1..20], starting with an accumulator n = 1. For each number p of that list, you only proceed if p is a prime. For each such prime p, find the largest integer q such that p^q <= 20, and multiply n by p^q. Once the foldl finishes, n is the number you want.
A possible brute force implementation would be
head [n|n <- [1..], all ((==0).(n `mod`)) [1..20]]
but in this case it would take way too long. The all function tests whether a predicate holds for all elements of a list. The section ((==0).(n `mod`)) is short for (\d -> n `mod` d == 0).
So how could you speed up the calculation? Let's factorize our divisors into prime factors and search for the highest power of every prime factor:
 2 = 2
 3 = 3
 4 = 2^2
 5 = 5
 6 = 2 * 3
 7 = 7
 8 = 2^3
 9 = 3^2
10 = 2 * 5
11 = 11
12 = 2^2 * 3
13 = 13
14 = 2 * 7
15 = 3 * 5
16 = 2^4
17 = 17
18 = 2 * 3^2
19 = 19
20 = 2^2 * 5
--------------------------------
max = 2^4 * 3^2 * 5 * 7 * 11 * 13 * 17 * 19
Using this number we have:
all ((==0).(2^4*3^2*5*7*11*13*17*19 `mod`)) [1..20]
--True
Hey, it is divisible by all numbers from 1 to 20. Not very surprising. E.g. it is divisible by 15 because it "contains" the factors 3 and 5, and it is divisible by 16, because it "contains" the factor 2^4. But is it the smallest possible number? Think about it...
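As a cross-check, Python 3.9+ ships math.lcm, which accepts any number of arguments; the assert below confirms the factorization really is the least common multiple of 1..20, without printing the spoiler:

from math import lcm   # Python 3.9+

candidate = 2**4 * 3**2 * 5 * 7 * 11 * 13 * 17 * 19
assert candidate == lcm(*range(1, 21))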
