How to model complex if-then statements in integer programming

I am trying to model an if-then condition for a MIP. The MIP looks like:

Maximize SUM_i H(i) - C
s.t.
SUM_j x(i, j) <= D(i)   for all i
SUM_i x(i, j) <= S(j)   for all j

where H(i) = 1 if SUM_j x(i, j) = D(i) and 0 otherwise,
and C = SUM_i,j (1 if x(i, j) > 1, 0 otherwise).

I know how to model a simple if-then condition in a MIP, but I am not able to model this one.

We can handle this by introducing a slack variable v(i) into the demand constraint, turning it into SUM_j x(i, j) + v(i) = D(i) with v(i) >= 0, and then saying:

h(i) = 1 => v(i) = 0

This implication is easily implemented as an inequality:

v(i) <= (1 - h(i)) * D(i)

The complete model can look like:
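(A sketch, assuming the x(i, j) are nonnegative integers; the big-M constants M(i, j) = min(D(i), S(j)) - 1 are my assumption, derived from the bounds above.)

Maximize SUM_i h(i) - SUM_i,j c(i, j)
s.t.
SUM_j x(i, j) + v(i) = D(i)           for all i
SUM_i x(i, j) <= S(j)                 for all j
v(i) <= (1 - h(i)) * D(i)             for all i
x(i, j) <= 1 + M(i, j) * c(i, j)      for all i, j
x(i, j) >= 0 integer, v(i) >= 0, h(i) in {0, 1}, c(i, j) in {0, 1}

Since the objective rewards h(i) = 1 and penalizes c(i, j) = 1, the solver sets h(i) = 0 only when v(i) > 0 (demand not fully met) and sets c(i, j) = 1 only when x(i, j) > 1, which is exactly what H and C encode in the original formulation.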


Improving the performance of multiple subsets on a large Dataframe

I have a dataframe containing 6.3 million records and 111 columns. For this example I've limited the dataframe to 26 columns (A-Z). On this dataframe I am trying to run an analysis in which I use different combinations of the columns (5 columns per combination), subset the dataframe on each of those, count the number of occurrences for each combination, and finally check whether this count exceeds a certain threshold, in which case I store the combination. The code is already optimized with an efficient way of running the individual subsets, using Numba. But the overall script I have still takes quite some time (7-8 hours). This is because if you use, for example, 90 columns (my actual number) to make combinations of 5, you get 43,949,268 different combinations. In my case I also use shifted versions of some columns (the value of the day before), so for this example I've limited it to 20 queries (A-J two times, including the shifted versions).
The columns used are stored in a list, which is converted to numbers because it otherwise gets too big using long strings. The names in the list correspond to a dictionary containing the subset variables.
Here is the full code example:
import pandas as pd
import numpy as np
import numba as nb
import time
from itertools import combinations

# Numba preparation
@nb.njit('int64(bool_[::1],bool_[::1],bool_[::1],bool_[::1],bool_[::1])', parallel=True)
def computeSubsetLength5(cond1, cond2, cond3, cond4, cond5):
    n = len(cond1)
    assert len(cond2) == n and len(cond3) == n and len(cond4) == n and len(cond5) == n
    subsetLength = 0
    # Parallel reduction over all rows: a row is counted when all five conditions hold
    for i in nb.prange(n):
        subsetLength += cond1[i] & cond2[i] & cond3[i] & cond4[i] & cond5[i]
    return subsetLength
# Example dataframe
np.random.seed(101)
bigDF = pd.DataFrame(np.random.randint(0, 11, size=(6300000, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

# Example query list
queryList = ['A_shift0','B_shift0','C_shift0','D_shift0','E_shift0','F_shift0','G_shift0','H_shift0','I_shift0','J_shift0','A_shift1','B_shift1','C_shift1','D_shift1','E_shift1','F_shift1','G_shift1','H_shift1','I_shift1','J_shift1']

# Convert the list to numbers for creating combinations
listToNum = list(range(len(queryList)))

# Generate the 15504 combinations of the 20 queries without repetition
queryCombinations = combinations(listToNum, 5)
# Example query dict
queryDict = {
    'query_A_shift0': ((bigDF.A >= 1) & (bigDF.A < 3)),
    'query_B_shift0': ((bigDF.B >= 3) & (bigDF.B < 5)),
    'query_C_shift0': ((bigDF.C >= 5) & (bigDF.C < 7)),
    'query_D_shift0': ((bigDF.D >= 7) & (bigDF.D < 9)),
    'query_E_shift0': ((bigDF.E >= 9) & (bigDF.E < 11)),
    'query_F_shift0': ((bigDF.F >= 1) & (bigDF.F < 3)),
    'query_G_shift0': ((bigDF.G >= 3) & (bigDF.G < 5)),
    'query_H_shift0': ((bigDF.H >= 5) & (bigDF.H < 7)),
    'query_I_shift0': ((bigDF.I >= 7) & (bigDF.I < 9)),
    'query_J_shift0': ((bigDF.J >= 7) & (bigDF.J < 11)),
    'query_A_shift1': ((bigDF.A.shift(1) >= 1) & (bigDF.A.shift(1) < 3)),
    'query_B_shift1': ((bigDF.B.shift(1) >= 3) & (bigDF.B.shift(1) < 5)),
    'query_C_shift1': ((bigDF.C.shift(1) >= 5) & (bigDF.C.shift(1) < 7)),
    'query_D_shift1': ((bigDF.D.shift(1) >= 7) & (bigDF.D.shift(1) < 9)),
    'query_E_shift1': ((bigDF.E.shift(1) >= 9) & (bigDF.E.shift(1) < 11)),
    'query_F_shift1': ((bigDF.F.shift(1) >= 1) & (bigDF.F.shift(1) < 3)),
    'query_G_shift1': ((bigDF.G.shift(1) >= 3) & (bigDF.G.shift(1) < 5)),
    'query_H_shift1': ((bigDF.H.shift(1) >= 5) & (bigDF.H.shift(1) < 7)),
    'query_I_shift1': ((bigDF.I.shift(1) >= 7) & (bigDF.I.shift(1) < 9)),
    'query_J_shift1': ((bigDF.J.shift(1) >= 7) & (bigDF.J.shift(1) < 11))
}
totalCountDict = {'queryStrings': [], 'totalCounts': []}

# Loop through all query combinations and count subset lengths
start = time.time()
for combi in list(queryCombinations):
    tempList = list(combi)
    queryOne = str(queryList[tempList[0]])
    queryTwo = str(queryList[tempList[1]])
    queryThree = str(queryList[tempList[2]])
    queryFour = str(queryList[tempList[3]])
    queryFive = str(queryList[tempList[4]])
    queryString = '-'.join(map(str, tempList))
    count = computeSubsetLength5(queryDict["query_" + queryOne].to_numpy(), queryDict["query_" + queryTwo].to_numpy(), queryDict["query_" + queryThree].to_numpy(), queryDict["query_" + queryFour].to_numpy(), queryDict["query_" + queryFive].to_numpy())
    if count > 1300:
        totalCountDict['queryStrings'].append(queryString)
        totalCountDict['totalCounts'].append(count)

print(len(totalCountDict['totalCounts']))
stop = time.time()
print("Loop time:", stop - start)
This currently takes about 20 seconds on my MacBook Pro 2020 (Intel version) for the 15504 combinations. Any thoughts on how this could be improved? I have tried using multiprocessing, but since I am already using Numba for the individual subsets, the two did not work well together. Am I using an inefficient way of doing multiple subsets (a list, a dictionary, and a for loop over all the combinations), or is 7-8 hours realistic for doing 44 million subsets on a dataframe of 6.3 million records?
One way to speed this code up by a large factor is to pack the bits of the boolean arrays stored in queryDict. Indeed, computeSubsetLength5 is likely memory-bound (I thought the speed-up provided in my previous answer would be sufficient for your needs).
Here is the function to pack the bits of a boolean array:
@nb.njit('uint64[::1](bool_[::1])')
def toPackedArray(cond):
    n = len(cond)
    res = np.empty((n + 63) // 64, dtype=np.uint64)
    # Pack each full chunk of 64 booleans into one 64-bit integer
    for i in range(n // 64):
        tmp = np.uint64(0)
        for j in range(64):
            tmp |= np.uint64(cond[i * 64 + j]) << np.uint64(j)
        res[i] = tmp
    # Remainder: pack the last, partial chunk (note the shift by j % 64, not j)
    if n % 64 > 0:
        tmp = np.uint64(0)
        for j in range(n - (n % 64), n):
            tmp |= np.uint64(cond[j]) << np.uint64(j % 64)
        res[len(res) - 1] = tmp
    return res
Note that the end of the last chunk is padded with zeros, which does not affect the following computation (this may not be the case if you plan to use the packed arrays in another context).
This function is called once for each array, like this:
'query_A_shift0': toPackedArray(((bigDF.A >= 1) & (bigDF.A < 3)).to_numpy()),
Once packed, the arrays can be processed much more efficiently by working directly on 64-bit integers (64 boolean values are handled per integer operation). Here is the resulting code:
# See: https://en.wikipedia.org/wiki/Hamming_weight
@nb.njit('uint64(uint64)', inline='always')
def popcount64c(x):
    # Classic SWAR popcount of a 64-bit integer
    m1 = 0x5555555555555555
    m2 = 0x3333333333333333
    m4 = 0x0f0f0f0f0f0f0f0f
    h01 = 0x0101010101010101
    x -= (x >> 1) & m1
    x = (x & m2) + ((x >> 2) & m2)
    x = (x + (x >> 4)) & m4
    return (x * h01) >> 56

# Numba preparation
@nb.njit('uint64(uint64[::1],uint64[::1],uint64[::1],uint64[::1],uint64[::1])', parallel=True)
def computeSubsetLength5(cond1, cond2, cond3, cond4, cond5):
    n = len(cond1)
    assert len(cond2) == n and len(cond3) == n and len(cond4) == n and len(cond5) == n
    subsetLength = 0
    for i in nb.prange(n):
        subsetLength += popcount64c(cond1[i] & cond2[i] & cond3[i] & cond4[i] & cond5[i])
    return subsetLength
popcount64c counts the number of bits set to 1 in each 64-bit chunk (for example, it maps 0b10110001 to 4).
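As a quick sanity check (my own snippet, assuming the two functions above are defined), the packed count should match a plain boolean count:

# The packed popcount must agree with numpy's boolean sum
conds = [np.random.rand(1000) < 0.5 for _ in range(5)]
packed = [toPackedArray(c) for c in conds]
assert computeSubsetLength5(*packed) == np.sum(conds[0] & conds[1] & conds[2] & conds[3] & conds[4])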
Here are the results on my 6-core i5-9600KF machine:
Reference implementation: 13.41 s
Proposed implementation: 0.38 s
The proposed implementation is 35 times faster than the (already optimized) Numba implementation. The reason it is much faster than the 8x you would expect from the memory reduction alone is that the packed data now fits in the last-level cache of your processor, which is often much faster than the RAM (about 5 times faster on my machine).
If you want to optimize this code further, you can work on smaller chunks that fit in the L2 cache and use threads in the combinatorial loop rather than in the still memory-bound computeSubsetLength5 function. This should give you a significant speed-up (I expect at least 2x).
The biggest optimization to apply then probably comes from the overall algorithm. The same logical AND operations are computed over and over. Pre-computing most of them on the fly, while keeping only the most useful ones, should speed the algorithm up significantly (I expect a speed-up of 2x), as sketched below.
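For instance, here is a sketch of the idea (my own illustration, not a drop-in replacement: computeSubsetLength3 is a hypothetical 3-array variant of the kernel above, and it assumes queryDict now stores the packed arrays):

@nb.njit('uint64(uint64[::1],uint64[::1],uint64[::1])', parallel=True)
def computeSubsetLength3(cond1, cond2, cond3):
    subsetLength = 0
    for i in nb.prange(len(cond1)):
        subsetLength += popcount64c(cond1[i] & cond2[i] & cond3[i])
    return subsetLength

# Pre-compute the packed AND of every pair once (190 pairs for 20 queries);
# each 5-combination then needs two lookups and one single array instead of
# four ANDs computed from scratch.
packed = [queryDict['query_' + name] for name in queryList]
pairAnd = {(a, b): packed[a] & packed[b]
           for a, b in combinations(range(len(packed)), 2)}

# Count for the combination (0, 1, 2, 3, 4)
count = computeSubsetLength3(pairAnd[(0, 1)], pairAnd[(2, 3)], packed[4])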
I am pretty sure there are many other optimizations that can be applied to the overall algorithm. Brute force is often sufficient to solve a problem, but it is hardly a requirement.

Add Boundary Condition for Goal Seek

I am trying to create an automated Goal Seek script for interlinked cells and workbooks. However, perhaps due to the complexity and the number of interlinks, under a certain condition the Goal Seek function converges to a very high or very low x-value.
Is there a way to improve its accuracy by setting some kind of boundary (a < x < b), similar to that in Solver? The reason I don't want to add Solver in VBA is that some of the other users may not have activated their Solver add-ins.
This is what Goal Seek gives me for an initial guess of x = 0.5h = 500. This is what the x-value should be, with a random guess of x = 100.
Another alternative I could think of is to create some sort of manual iteration subroutine (e.g. the bisection method), but the equations are pretty complex, so this may not be ideal.
What I am doing at the moment is to preset an initial value for x depending on whether y (another parameter) is negative or positive. I reckon this has eliminated most of the invalid results, but it still gives an error on one or two occasions. I appreciate your input. Thanks.
Sub Guess()
    ' ------------- For guessing the initial x-value -----------
    Dim i As Integer, j As Integer
    For i = 4 To 11
        For j = 18 To 25
            If Worksheets("Crack Width").Range("I" & j) < 0 Then
                '---------- Pre-guess x-value to be 0.5*X_bal if N < 0 ----------
                Worksheets("Calcs").Range("B" & i) = Worksheets("Calcs").Range("C" & i).Value * 0.5
                If Worksheets("Calcs").Range("B" & i) = 0 Then Worksheets("Calcs").Range("B" & i).ClearContents
                i = i + 1
            ElseIf Worksheets("Crack Width").Range("I" & j) >= 0 Then
                '---------- Pre-guess x-value to be 0.5*h if N >= 0 ----------
                Worksheets("Calcs").Range("B" & i) = Worksheets("Calcs").Range("E" & i).Value * 0.5
                If Worksheets("Calcs").Range("B" & i) = 0 Then Worksheets("Calcs").Range("B" & i).ClearContents
                i = i + 1
            End If
        Next j
    Next i
End Sub

Multiple Coin Toss in Excel VBA

In Excel VBA, I am tossing four coins and counting the number of heads. The code I am using is:
CoinHeads = Int(Round(Rnd(), 0)) + Int(Round(Rnd(), 0)) + Int(Round(Rnd(), 0)) + Int(Round(Rnd(), 0))
This works, but I am wondering if there is a simpler way to do this in Excel VBA code that would still give me the same distribution of head counts from 0 to 4. Thanks for any advice!
If you wanted just to simplify your statement a little bit, you could use Int(2 * Rnd()) instead:
CoinHeads = Int(2 * Rnd()) + Int(2 * Rnd()) + Int(2 * Rnd()) + Int(2 * Rnd())
Other than that, you can segment the number of heads as @Comintern says in their comment.
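For reference, the distribution both approaches must preserve is binomial: P(k heads) = C(4, k) / 16, i.e. 1/16, 4/16, 6/16, 4/16, 1/16 for k = 0..4. So the segmentation approach draws a single Rnd() and returns k according to the cumulative thresholds 1/16, 5/16, 11/16 and 15/16.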
You could also write a little function and pass the number of coins as a parameter to generalize your code (here a toss counts as heads if the random number is at least 0.5):
Public Function getNumberOfHeads(ByVal nb As Integer) As Integer
    Dim nbHeads As Integer: nbHeads = 0
    Dim j As Integer
    Randomize
    For j = 1 To nb    ' one iteration per coin (0 To nb would toss nb + 1 coins)
        If Rnd() >= 0.5 Then nbHeads = nbHeads + 1
    Next j
    getNumberOfHeads = nbHeads
End Function
And then you use it like this in your code:
numberOfHeads = getNumberOfHeads(4)

Is Excel VBA's Rnd() really this bad?

I need a pseudo-random number generator for a 2D Monte Carlo simulation that doesn't have the characteristic hyperplanes that you get with simple LCGs. I tested the random number generator Rnd() in Excel 2013 using the following code (it takes about 5 seconds to run):
Sub ZoomRNG()
    Randomize
    For i = 1 To 1000
        Found = False
        Do
            x = Rnd() ' 2 random numbers between 0.0 and 1.0
            y = Rnd()
            If ((x > 0.5) And (x < 0.51)) Then
                If ((y > 0.5) And (y < 0.51)) Then
                    ' Write if both x & y are in a narrow range
                    Cells(i, 1) = i
                    Cells(i, 2) = x
                    Cells(i, 3) = y
                    Found = True
                End If
            End If
        Loop While (Not Found)
    Next i
End Sub
Here is a simple plot of x vs y from running the above code
Not only is it not very random-looking, it has more obvious hyperplanes than the infamous RANDU algorithm does in 2D. Basically, am I using the function incorrectly, or is the Rnd() function in VBA actually not the least bit usable?
For comparison, here's what I get for the Mersenne Twister MT19937 in C++.
To get a better random generator and to make it faster, I modified your code like this:
Const N = 1000 ' Put this at the top of your code module

Sub ZoomRNG()
    Dim RandXY(1 To N, 1 To 3) As Single, i As Single, x As Single, y As Single
    For i = 1 To N
        Randomize ' Put this in the loop to generate better random numbers
        Do
            x = Rnd
            y = Rnd
            If x > 0.5 And x < 0.51 Then
                If y > 0.5 And y < 0.51 Then
                    RandXY(i, 1) = i
                    RandXY(i, 2) = x
                    RandXY(i, 3) = y
                    Exit Do
                End If
            End If
        Loop
    Next
    Cells(1, 9).Resize(N, 3) = RandXY
End Sub
I obtain this after plotting the result
The result looks better than your code's output. Modifying the above code a little bit to something like this
Const N = 1000

Sub ZoomRNG()
    Dim RandXY(1 To N, 1 To 3) As Single, i As Single, x As Single, y As Single
    For i = 1 To N
        Randomize
        Do
            x = Rnd
            If x > 0.5 And x < 0.51 Then
                y = Rnd
                If y > 0.5 And y < 0.51 Then
                    RandXY(i, 1) = i
                    RandXY(i, 2) = x
                    RandXY(i, 3) = y
                    Exit Do
                End If
            End If
        Loop
    Next
    Cells(1, 9).Resize(N, 3) = RandXY
End Sub
yields a better result than the previous one
Sure, the Mersenne Twister MT19937 in C++ is still better, but the last result is quite good for conducting Monte Carlo simulations. FWIW, you might be interested in reading this paper: On the accuracy of statistical procedures in Microsoft Excel 2010.
That seems like it would take on average 1000 * 100 * 100 iterations to complete, and VBA is usually a bit slower than native Excel formulas. Consider this example:
Sub ZoomRNG()
    t = Timer
    [a1:a1000] = "=ROW()"
    [b1:c1000] = "=RAND()/100+0.5"
    [a1:c1000] = [A1:C1000].Value
    Debug.Print CDbl(Timer - t) ' 0.0546875 seconds
End Sub
Update
It's not that bad at all! This will work too even without Randomize
Sub ZoomRNGs() ' VBA.Rnd returns Single
    t = Timer
    For i = 1 To 1000
        Cells(i, 1) = i
        Cells(i, 2) = Rnd / 100 + 0.5
        Cells(i, 3) = Rnd / 100 + 0.5
    Next i
    Debug.Print Timer - t ' 0.25 seconds
End Sub

Sub ZoomRNGd() ' the Excel function RAND() returns Double
    t = Timer
    For i = 1 To 1000
        Cells(i, 1) = i
        Cells(i, 2) = [RAND()] / 100 + 0.5
        Cells(i, 3) = [RAND()] / 100 + 0.5
    Next i
    Debug.Print Timer - t ' 0.625 seconds
End Sub
and Single has about half the precision of Double:
s = Rnd: d = [RAND()]
Debug.Print s; d; Len(Str(s)); Len(Str(d)) ' " 0.2895625 0.580839555868045 9 17 "
Update 2
I found a C alternative that is as fast as VBA's Rnd.
C:\Windows\System32\msvcrt.dll is the Microsoft C runtime library:
Declare Function rand Lib "msvcrt" () As Long ' put this in a VBA module
and then you can use it like this in your code: x = rand / 32767
Sub ZoomRNG()
    t = Timer
    Dim i%, x#, y#, Found As Boolean
    For i = 1 To 1000
        Found = False
        Do
            x = rand / 32767 ' RAND_MAX = 32,767
            y = rand / 32767
            If ((x > 0.5) And (x < 0.51)) Then
                If ((y > 0.5) And (y < 0.51)) Then
                    ' Write if both x & y are in a narrow range
                    Cells(i, 1) = i
                    Cells(i, 2) = x
                    Cells(i, 3) = y
                    Found = True
                End If
            End If
        Loop While (Not Found)
    Next i
    Debug.Print Timer - t ' 2.875 seconds
End Sub
After reading this question I got curious and found the paper "Assessing Excel VBA Suitability for Monte Carlo Simulation" by Alexei Botchkarev, which is available here. Neither the RAND nor the RND function is recommended, but as pointed out in the paper, the Mersenne Twister has been implemented in VBA by Jerry Wang.
A quick search led me to this nicely commented version, last updated 2015/2/28: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/VERSIONS/BASIC/MTwister.xlsb
Source: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/VERSIONS/BASIC/basic.html
All LCGs will generate hyperplanes. The quality of the LCG increases with decreasing distance between these hyperplanes. So, having more hyperplanes than RANDU is a good thing.
The MT plot looks much better because it is NOT an LCG. Indeed, any non-LCG pRNG could have a random-looking plot and still be a bad generator.
To avoid the problem of 2D correlations, you could use the same LCG for x and y but with different seeds for x and y. Of course, this will not work with RND, because you cannot have two separate streams. You will need an LCG pRNG that takes the seed as an argument by reference, as sketched below.
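To illustrate the idea, here is a minimal sketch in Python (the multiplier and increment are the Numerical Recipes LCG constants, an assumption for illustration, not VBA's internal ones; passing the state in and out plays the role of a ByRef seed argument):

# One LCG step: takes the current state, returns (new state, uniform value in [0, 1))
def lcg(state):
    state = (1664525 * state + 1013904223) % 2**32  # Numerical Recipes constants
    return state, state / 2**32

sx, sy = 12345, 67890   # two separate seeds, hence two independent streams
sx, x = lcg(sx)
sy, y = lcg(sy)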
As a balance between speed and quality, I was thinking of combining them like this:
For ...
    z = [RAND()] ' good but slow
    For ... ' just a few iterations
        t = z + Rnd()
        t = t - Int(t)
        ...
Remember that good entropy + bad entropy = better entropy.
That said, [RAND()] only costs about 0.05 ms per call.

Solving string alignment in O(mn)

I need to solve the following problem within O(nm), where

n = |T|, m = |P|,

T and P are two strings, and f is a scoring function. The algorithm should return a substring T' of T such that score(P, T') is maximal, where score(A, B) is the maximum value over all alignments of A and B according to f.
I know I can get it from the DIST matrix, which is a Monge matrix if f is discrete (meaning the diagonal edges have weights not larger than some constant C, and the horizontal and vertical edges are 0 or some other constant), but in this case f is a general function from (Sigma ∪ {-}) x (Sigma ∪ {-}) to R (where '-' is a gap).
Any ideas?
You've noticed that there are several algorithms that compute a shortest path in a graph whose arcs are (i, j) → (i + 1, j), (i, j) → (i + 1, j + 1), and (i, j) → (i, j + 1). The most general form of this algorithm would allow every arc length to be specified separately, with the following meanings:
(i, j) → (i + 1, j): cost of aligning the (i+1)th letter of P with a gap in T
(i, j) → (i + 1, j + 1): cost of aligning the (i+1)th letter of P with the (j+1)th letter of T
(i, j) → (i, j + 1): cost of aligning a gap in P with the (j+1)th letter of T
Costs can be negative. To solve your substring problem, make the costs of all of the (i, j) → (i, j + 1) arcs zero so that we can delete from T without penalty.
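Here is a minimal sketch of that DP in Python, maximizing score rather than minimizing cost (the function name and the scoring function f are mine, for illustration; f(a, '-') scores a letter of P against a gap):

def best_alignment_score(P, T, f):
    # dp[i][j] = best score of aligning P[:i] against T[:j]
    m, n = len(P), len(T)
    NEG = float('-inf')
    dp = [[NEG] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0:            # P[i-1] aligned with a gap in T
                dp[i][j] = max(dp[i][j], dp[i - 1][j] + f(P[i - 1], '-'))
            if i > 0 and j > 0:  # P[i-1] aligned with T[j-1]
                dp[i][j] = max(dp[i][j], dp[i - 1][j - 1] + f(P[i - 1], T[j - 1]))
            if j > 0:            # gap in P vs T[j-1]: zero cost, as suggested above
                dp[i][j] = max(dp[i][j], dp[i][j - 1])
    return dp[m][n]

For example, with f = lambda a, b: 1.0 if a == b else -1.0, best_alignment_score('abc', 'xxabcx', f) returns 3.0: the letters of T outside the matched region are skipped for free.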
