fsolve cannot find a solution even when given one

I reproduced the issue I am experiencing with this simple code.
To validate that fsolve works with the functions below, I pre-calculated the function values with the t values below, so I am sure that t1 to t4 are a solution.
But even when given the solution as the starting point, fsolve always returns the same solution:
[ 1.50000000e+02 7.00000000e-01 2.00000000e+02 1.00000000e-01]
What am I doing wrong?
Is there a way to set constraints on the solution, for example all the t variables are between 0 and 1000?
import numpy as np
from scipy.optimize import fsolve
t1 = 150.0
t2 = 0.7
t3 = 200.00
t4 = 0.1
def FS(z):
    x1=z[0]
    x2=z[1]
    x3=z[2]
    x4=z[3]
    f = np.zeros(4)
    f[0] = x1*x2 + x3*x4 - 125.0
    f[1] = (x1**2/500)*x2 + (x3**2/500)*x4 - 39.5
    f[2] = (x1**3/500**2)*x2 + (x3**3/500**2)*x4 - 12.649999999999999
    f[3] = (x1**4/500**3)*x2 + (x3**4/500**3)*x4 - 4.115
    return f
res = fsolve(FS, [t1, t2, t3, t3])
print(res)
[ 1.50000000e+02 7.00000000e-01 2.00000000e+02 1.00000000e-01]

First of all, I don't see your problem. The algorithm converges as expected on (one of) the solutions. This happens even when your starting values deviate from the solution a lot, e.g.
t1 = 10
t2 = 10
t3 = 190
t4 = 10
This gives the solution [150, 0.7, 200, 0.1]. But part of the problem is probably that there is more than one solution. Try for instance
t1 = 190
t2 = 10
t3 = 10
t4 = 10
If [t1, t2, t3, t4] is a solution, then [t3, t4, t1, t2] is a solution as well, which probably makes the algorithm non-convergent for certain starting values such as [1, 1, 1, 1]. See some discussion about the underlying algorithm for instance here
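The symmetry claim is easy to check numerically. The sketch below re-implements the question's FS as a plain-Python function (the name `residuals` is introduced here for illustration, it is not from the question) and evaluates it at the known solution and at the swapped one. Both should give residuals that are zero up to rounding.

```python
def residuals(z):
    """Same four equations as FS in the question, written without NumPy."""
    x1, x2, x3, x4 = z
    return [
        x1 * x2 + x3 * x4 - 125.0,
        (x1**2 / 500) * x2 + (x3**2 / 500) * x4 - 39.5,
        (x1**3 / 500**2) * x2 + (x3**3 / 500**2) * x4 - 12.649999999999999,
        (x1**4 / 500**3) * x2 + (x3**4 / 500**3) * x4 - 4.115,
    ]

solution = [150.0, 0.7, 200.0, 0.1]
swapped = solution[2:] + solution[:2]  # [t3, t4, t1, t2]

for point in (solution, swapped):
    worst = max(abs(r) for r in residuals(point))
    print(point, "largest residual:", worst)
```

On the bounds part of the question: fsolve has no bounds argument, while scipy.optimize.least_squares accepts bounds=(lower, upper), so it may be worth trying on the same residual function. Whether it converges from a given starting point is not something this sketch checks.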

Related

perf experiments to understand what the hardware can do

Multiplying larger and larger arrays of fp64 values takes the same time up to some point, after which the time increases. However, the pattern isn't what I expect. First the results, then the code. The first number is the number of elements in the array and the last number is the time in seconds for 1,000,000 executions. The device is a 4090.
While looking, consider a few questions. If there are only 16,384 CUDA cores, why does the time stay the same from below 16,384 fp64 values until well beyond that? Only at 262,144 multiplies does it take significantly longer. After that, the time doesn't quite double (1.7x to 1.8x) for reasons I don't understand: once the device is saturated, doubling the work should be at least 2x slower. Finally, going from 2,097,152 to 4,194,304 multiplies takes 4.5 times as long.
8192: t1 * t2 took 2.081
16384: t1 * t2 took 2.095
32768: t1 * t2 took 2.066
65536: t1 * t2 took 2.057
131072: t1 * t2 took 2.209 Q1: Why still about 2 seconds with far more elements than CUDA cores?
262144: t1 * t2 took 2.991
524288: t1 * t2 took 5.989 2X slower, which makes sense
1048576: t1 * t2 took 10.388 Only 1.7X slower, which is a surprise given it is twice the work
2097152: t1 * t2 took 18.95 Q2: 1.8X slower, but why ONLY 1.8X?
4194304: t1 * t2 took 86.161 Q3: 4.5X slower for twice the work. What is going on here?
import torch
import time
from datetime import datetime
from datetime import timedelta
with torch.cuda.device(0):
    dim1 = 256
    dim2 = 16
    while dim2 <= 16384:
        t1 = 1 + torch.rand((dim1,dim2), device='cuda', dtype=torch.float64)/10000
        t2 = 1 + torch.rand((dim1,dim2), device='cuda', dtype=torch.float64)/10000
        i = 0
        tm0 = datetime.now()
        while i < 1000000:
            t1 = t1 * t2
            #torch.cuda.synchronize() # MULT is dependent on previous result
            i += 1
        torch.cuda.synchronize()
        print(f"{dim1*dim2}: t1 * t2 took {round(timedelta.total_seconds(datetime.now()-tm0)+.0001, 3)}")
        dim2 *= 2
I tried the code above and expected 32,768 multiplies to take twice as long as 16,384, given the actual number of CUDA cores.
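It can help to turn the posted timings into time per launch, step-to-step slowdown, and a rough working-set size. The sketch below only does that arithmetic on the numbers from the question. The factor of three tensors per multiply (two inputs read, one output written) is an assumption about the memory traffic, not something measured in the question.

```python
# Timings copied from the question: (number of fp64 elements, seconds for 1,000,000 multiplies).
sizes = [8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304]
seconds = [2.081, 2.095, 2.066, 2.057, 2.209, 2.991, 5.989, 10.388, 18.95, 86.161]

ITERATIONS = 1_000_000
BYTES_PER_ELEMENT = 3 * 8  # assumption: read t1, read t2, write the result, 8 bytes each

# Slowdown of each run relative to the previous (half-sized) run.
slowdowns = [later / earlier for earlier, later in zip(seconds, seconds[1:])]
# Approximate memory touched by one multiply, in MiB.
working_set_mib = [n * BYTES_PER_ELEMENT / 2**20 for n in sizes]

for i, (n, s) in enumerate(zip(sizes, seconds)):
    per_launch_us = s / ITERATIONS * 1e6
    ratio = f"{slowdowns[i - 1]:.2f}x" if i > 0 else "-"
    print(f"{n:>8} elements: {per_launch_us:7.2f} us/launch, "
          f"{working_set_mib[i]:6.2f} MiB touched, slowdown {ratio}")
```

If the flat region reflects a fixed per-launch cost and the last jump reflects a working set that no longer fits in cache, these are the numbers to compare against the device's specifications. The question itself does not confirm either explanation.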

Splitting an int64 into two int32, performing math, then re-joining

I am working within the constraints of hardware that has a 64-bit integer limit and does not support floating point. I am dealing with very large integers that I need to multiply and divide. When multiplying, I overflow the 64 bits. I am prototyping a solution in Python. This is what I have in my function:
upper = x >> 32 #x is cast as int64 before being passed to this function
lower = x & 0x00000000FFFFFFFF
temp_upper = upper * y // z #Dividing first is not an option, as this is not the actual equation I am working with. This is just to make sure in my testing I overflow unless I do the splitting.
temp_lower = lower * y // z
return temp_upper << 32 | lower
This works, somewhat, but I end up losing a lot of precision (my result is sometimes off by a few million). From looking at it, this appears to happen because of the division: if the divisor is large enough, it shifts the upper part to the right. Then when I shift it back into place I have a gap of zeroes.
Unfortunately this topic is very hard to google, since anything with upper/lower brings up results about rounding up/down, anything about splitting ints returns results about splitting them into a char array, and anything about int arithmetic brings up basic algebra with integer math. Maybe I am just not good at googling. But can you give me some pointers on how to do this?
Splitting like this is just one thing I am trying; it doesn't have to be the solution. All I need is to be able to temporarily go over the 64-bit integer limit. The final result will be under 64 bits (after the division). I remember learning in college about splitting a number up like this, doing the math, and then recombining. But as I said, I am having trouble finding anything online on how to do the actual math.
Lastly, my numbers are sometimes small, so I can't chop off the right bits. I need the results to be equivalent to what I would get with something like int128.
I suppose a different way to look at this problem is this. Since I have no problem with splitting the int64, we can forget about that part. So we can pretend that two int64s are being fed to me, one upper and one lower. I can't combine them, because they won't fit into a single int64. So I need to divide them by Z first. The combining step is easy. How do I do the division?
Thanks.

As I understand it, you want to compute (x*y)//z.
Your numbers x, y, z all fit in 64 bits, except that you need 128 bits for the intermediate x*y.
The problem you have is indeed related to division: you have
h * y = qh * z + rh
l * y = ql * z + rl
h * y << 32 + l*y = (qh<<32 + ql) * z + (rh<<32 + rl)
but nothing says that (rh<<32 + rl) < z, and in your case the high bits of l*y overlap the low bits of h*y, so you get the wrong quotient, potentially off by many units.
What you should do as the second operation is rather:
rh<<32 + l * y = ql' * z + rl'
Then get the total quotient qh<<32 + ql'
But of course, you must take care to avoid overflow when evaluating the left operand...
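The remainder-carrying step above can be sketched in Python as follows. The function names are made up for illustration. Python integers never overflow, so this only demonstrates that carrying rh into the second division gives the exact quotient; on 64-bit hardware the term (rh << 32) + l * y can still exceed 64 bits, which is the overflow concern just mentioned.

```python
def muldiv_carry(x, y, z):
    """(x * y) // z with x split into 32-bit halves and the remainder carried over."""
    h, l = x >> 32, x & 0xFFFFFFFF
    qh, rh = divmod(h * y, z)
    # The remainder of the high division is shifted into place and divided
    # together with the low partial product.
    ql, _ = divmod((rh << 32) + l * y, z)
    return (qh << 32) + ql

def muldiv_naive(x, y, z):
    """The approach from the question: divide each half separately, drop rh."""
    h, l = x >> 32, x & 0xFFFFFFFF
    return ((h * y // z) << 32) + (l * y // z)

x, y, z = 123456789012345678, 987654321, 1000003
exact = x * y // z
print("exact:", exact)
print("carry:", muldiv_carry(x, y, z))
print("naive:", muldiv_naive(x, y, z))
```

Since the dropped remainder is never negative, the naive result can only come out too small, never too large.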
Since you are splitting only one of the operands of x*y, I'll assume that the intermediate result always fits in 96 bits.
If that is correct, then your problem is to divide x*y, which has three 32-bit limbs, by z, which has two 32-bit limbs.
It is thus like the Burnikel-Ziegler divide-and-conquer algorithm for division.
The algorithm can be decomposed like this:
obtain the 3 limbs a2,a1,a0 of the product x*y, using Karatsuba for example
split z into 2 limbs z1,z0
perform div32( (a2,a1,a0) , (z1,z0) )
Here is some pseudocode, dealing only with positive operands and with no guarantee of being correct, but it gives an idea of the implementation:
p = 1<<32;
function (a1,a0) = split(a)
    a1 = a >> 32;
    a0 = a - (a1 * p);
function (a2,a1,a0) = mul22(x,y)
    (x1,x0) = split(x) ;
    (y1,y0) = split(y) ;
    (h1,h0) = split(x1 * y1);
    assert(h1 == 0); -- assume that results fits on 96 bits
    (l1,l0) = split(x0 * y0);
    (m1,m0) = split((x1 - x0) * (y0 - y1)); -- karatsuba trick
    a0 = l0;
    (carry,a1) = split( l1 + l0 + h0 + m0 );
    a2 = l1 + m1 + h0 + carry;
function (q,r) = quorem(a,b)
    q = a // b;
    r = a - (b * q);
function (q1,q0,r0) = div21(a1,a0,b0)
    (q1,r1) = quorem(a1,b0);
    (q0,r0) = quorem( r1 * p + a0 , b0 );
    (q1,q0) = split( q1 * p + q0 );
function q = div32(a2,a1,a0,b1,b0)
    (q,r) = quorem(a2*p+a1,b1*p+b0);
    q = q * p;
    (a2,a1)=split(r);
    if a2<b1
        (q1,q0,r)=div21(a2,a1,b1);
        assert(q1==0); -- since a2<b1...
    else
        q0=p-1;
        r=(a2-b1)*p+a1+b1;
    (d1,d0) = split(q0*b0);
    r = (r-d1)*p + a0 - d0;
    while(r < 0)
        q = q - 1;
        r = r + b1*p + b0;
function t=muldiv(x,y,z)
    (a2,a1,a0) = mul22(x,y);
    (z1,z0) = split(z);
    if z1 == 0
        (q2,q1,r1)=div21(a2,a1,z0);
        assert(q2==0); -- otherwise result will not fit on 64 bits
        t = q1*p + ( ( r1*p + a0 )//z0);
    else
        t = div32(a2,a1,a0,z1,z0);
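For prototyping in Python, a simpler but slower alternative to the limb-based pseudocode is schoolbook multiplication into a 128-bit (hi, lo) pair followed by bit-by-bit shift-and-subtract division. This is a different technique from the Burnikel-Ziegler style division above, shown only as a sketch. It assumes non-negative operands, unsigned 64-bit registers, z below 2**63, and a final quotient that fits in 64 bits. The helper `u64` is a hypothetical stand-in that asserts every intermediate value would fit in such a register.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def u64(v):
    """Stand-in for an unsigned 64-bit register: fail loudly on overflow."""
    assert 0 <= v <= MASK64, "intermediate does not fit in 64 bits"
    return v

def mul64(x, y):
    """Return (hi, lo), the 128-bit product of two 64-bit values."""
    x1, x0 = x >> 32, x & MASK32
    y1, y0 = y >> 32, y & MASK32
    ll = u64(x0 * y0)
    lh = u64(x0 * y1)
    hl = u64(x1 * y0)
    hh = u64(x1 * y1)
    # Middle column: carries out of the low word plus the two cross terms.
    mid = u64((ll >> 32) + (lh & MASK32) + (hl & MASK32))
    lo = u64(((mid & MASK32) << 32) | (ll & MASK32))
    hi = u64(hh + (lh >> 32) + (hl >> 32) + (mid >> 32))
    return hi, lo

def muldiv(x, y, z):
    """(x * y) // z by long division of the 128-bit product, one bit at a time."""
    assert 0 < z < (1 << 63)  # keeps (rem << 1) | bit inside 64 bits
    hi, lo = mul64(x, y)
    q = rem = 0
    for i in range(127, -1, -1):
        bit = (hi >> (i - 64)) & 1 if i >= 64 else (lo >> i) & 1
        rem = u64((rem << 1) | bit)
        q = u64(q << 1)  # fails if the quotient needs more than 64 bits
        if rem >= z:
            rem -= z
            q |= 1
    return q

x, y, z = 2**62 + 12345, 2**61 + 6789, 2**62 + 1
print(muldiv(x, y, z), x * y // z)
```

The loop always runs 128 iterations, so this trades speed for simplicity. Comparing against Python's own x * y // z, as in the last line, is a convenient way to test any faster variant.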

How to use np.where in another np.where (context: ray tracing)

The question is: how to use two np.where in the same statement, like this (oversimplified):
np.where((ndarr1==ndarr2),np.where((ndarr1+ndarr2==ndarr3),True,False),False)
The goal is to avoid computing the second condition where the first one is not met.
My first objective is to find the intersection of a ray in a triangle, if there is one. This problem can be solved by this algorithm (found on stackoverflow):
def intersect_line_triangle(q1,q2,p1,p2,p3):
    def signed_tetra_volume(a,b,c,d):
        return np.sign(np.dot(np.cross(b-a,c-a),d-a)/6.0)
    s1 = signed_tetra_volume(q1,p1,p2,p3)
    s2 = signed_tetra_volume(q2,p1,p2,p3)
    if s1 != s2:
        s3 = signed_tetra_volume(q1,q2,p1,p2)
        s4 = signed_tetra_volume(q1,q2,p2,p3)
        s5 = signed_tetra_volume(q1,q2,p3,p1)
        if s3 == s4 and s4 == s5:
            n = np.cross(p2-p1,p3-p1)
            t = np.dot(p1-q1,n) / np.dot(q2-q1,n)
            return q1 + t * (q2-q1)
    return None
Here are two conditional statements:
s1!=s2
s3==s4 & s4==s5
Now since I have >20k triangles to check, I want to apply this function on all triangles at the same time.
First solution is:
s1 = vol(r0,tri[:,0,:],tri[:,1,:],tri[:,2,:])
s2 = vol(r1,tri[:,0,:],tri[:,1,:],tri[:,2,:])
s3 = vol(r1,r2,tri[:,0,:],tri[:,1,:])
s4 = vol(r1,r2,tri[:,1,:],tri[:,2,:])
s5 = vol(r1,r2,tri[:,2,:],tri[:,0,:])
np.where((s1!=s2) & (s3+s4==s4+s5),intersect(),False)
where s1, s2, s3, s4, s5 are arrays containing the value S for each triangle. The problem is that this means I have to compute s3, s4, and s5 for all triangles.
Ideally, statement 2 (and s3, s4, s5) would be computed only where statement 1 is True, with something like this:
check = np.where((s1!=s2), np.where((compute(s3)==compute(s4)) & (compute(s4)==compute(s5)), compute(intersection), False), False)
(To simplify the explanation, I wrote 'compute' instead of the whole computing process. Here, 'compute' is done only on the appropriate triangles.)
Of course this option doesn't work (and computes s4 twice), but I'd gladly take recommendations for a similar process.

Here's how I used masked arrays to answer this problem:
loTrue= np.where((s1!=s2),False,True)
s3=ma.masked_array(np.sign(dot(np.cross(r0r1, r0t0), r0t1)),mask=loTrue)
s4=ma.masked_array(np.sign(dot(np.cross(r0r1, r0t1), r0t2)),mask=loTrue)
s5=ma.masked_array(np.sign(dot(np.cross(r0r1, r0t2), r0t0)),mask=loTrue)
loTrue= ma.masked_array(np.where((abs(s3-s4)<1e-4) & ( abs(s5-s4)<1e-4),True,False),mask=loTrue)
#also works when computing s3,s4 and s5 inside loTrue, like this:
loTrue= np.where((s1!=s2),False,True)
loTrue= ma.masked_array(np.where(
(abs(np.sign(dot(np.cross(r0r1, r0t0), r0t1))-np.sign(dot(np.cross(r0r1, r0t1), r0t2)))<1e-4) &
(abs(np.sign(dot(np.cross(r0r1, r0t2), r0t0))-np.sign(dot(np.cross(r0r1, r0t1), r0t2)))<1e-4),True,False)
,mask=loTrue)
Note that the same process, when not using such approach, is done like this:
s3= np.sign(dot(np.cross(r0r1, r0t0), r0t1) /6.0)
s4= np.sign(dot(np.cross(r0r1, r0t1), r0t2) /6.0)
s5= np.sign(dot(np.cross(r0r1, r0t2), r0t0) /6.0)
loTrue= np.where((s1!=s2) & (abs(s3-s4)<1e-4) & ( abs(s5-s4)<1e-4) ,True,False)
Both give the same results. However, when looping over this process for only 10k iterations, NOT using masked arrays is faster: 26 secs without masked arrays, 31 secs with masked arrays, and 33 secs when using masked arrays in one line only (not computing s3, s4 and s5 separately, or computing s4 beforehand).
Conclusion: nesting the conditions this way is solved here (note that the mask indicates where it won't be computed, hence the first loTrue must be set to False (0) where the condition is verified). However, in this scenario, it's not faster.
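For comparison, here is a sketch of the two-stage idea with plain integer indexing instead of masked arrays. np.where evaluates both of its value arguments in full before selecting, so the only way to skip work is to compute the second-stage quantities on a subset of rows. The names `tetra_sign` and `hit_indices` and the three test triangles are made up for illustration. The sketch returns only the indices of the triangles that are hit, not the intersection points.

```python
import numpy as np

def tetra_sign(a, b, c, d):
    """Row-wise sign of the signed volume of the tetrahedra (a, b, c, d)."""
    a, b, c, d = np.broadcast_arrays(a, b, c, d)
    return np.sign(np.einsum("ij,ij->i", np.cross(b - a, c - a), d - a))

def hit_indices(q1, q2, p1, p2, p3):
    """Indices i of triangles (p1[i], p2[i], p3[i]) crossed by the segment q1-q2."""
    # Stage 1: cheap plane-side test on every triangle.
    s1 = tetra_sign(q1, p1, p2, p3)
    s2 = tetra_sign(q2, p1, p2, p3)
    cand = np.flatnonzero(s1 != s2)
    # Stage 2: the three remaining volumes, computed only for the survivors.
    a, b, c = p1[cand], p2[cand], p3[cand]
    s3 = tetra_sign(q1, q2, a, b)
    s4 = tetra_sign(q1, q2, b, c)
    s5 = tetra_sign(q1, q2, c, a)
    return cand[(s3 == s4) & (s4 == s5)]

# Three triangles: one hit, one in the crossed plane but off to the side,
# and one in a plane the segment never reaches.
base = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=float)
tri = np.stack([base, base + [5, 5, 0], base + [0, 0, 5]])
q1 = np.array([0.2, 0.2, -1.0])
q2 = np.array([0.2, 0.2, 1.0])
hits = hit_indices(q1, q2, tri[:, 0], tri[:, 1], tri[:, 2])
print(hits)
```

Whether this is faster than the full computation depends on how many triangles survive the first stage, since the fancy indexing itself copies data.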

I can get a small speedup from short circuiting but I'm not convinced it is worth the additional admin.
full computation 4.463818839867599 ms per iteration (one ray, 20,000 triangles)
short circuiting 3.0060838296776637 ms per iteration (one ray, 20,000 triangles)
Code:
import numpy as np
def ilt_cut(q1,q2,p1,p2,p3):
    qm = (q1+q2)/2
    qd = qm-q2
    p12 = p1-p2
    aux = np.cross(qd,q2-p2)
    s3 = np.einsum("ij,ij->i",aux,p12)
    s4 = np.einsum("ij,ij->i",aux,p2-p3)
    ge = (s3>=0)&(s4>=0)
    le = (s3<=0)&(s4<=0)
    keep = np.flatnonzero(ge|le)
    aux = p1[keep]
    qpm1 = qm-aux
    p31 = p3[keep]-aux
    s5 = np.einsum("ij,ij->i",np.cross(qpm1,p31),qd)
    ge = ge[keep]&(s5>=0)
    le = le[keep]&(s5<=0)
    flt = np.flatnonzero(ge|le)
    keep = keep[flt]
    n = np.cross(p31[flt], p12[keep])
    s12 = np.einsum("ij,ij->i",n,qpm1[flt])
    flt = np.abs(s12) <= np.abs(s3[keep]+s4[keep]+s5[flt])
    return keep[flt],qm-(s12[flt]/np.einsum("ij,ij->i",qd,n[flt]))[:,None]*qd
def ilt_full(q1,q2,p1,p2,p3):
    qm = (q1+q2)/2
    qd = qm-q2
    p12 = p1-p2
    qpm1 = qm-p1
    p31 = p3-p1
    aux = np.cross(qd,q2-p2)
    s3 = np.einsum("ij,ij->i",aux,p12)
    s4 = np.einsum("ij,ij->i",aux,p2-p3)
    s5 = np.einsum("ij,ij->i",np.cross(qpm1,p31),qd)
    n = np.cross(p31, p12)
    s12 = np.einsum("ij,ij->i",n,qpm1)
    ge = (s3>=0)&(s4>=0)&(s5>=0)
    le = (s3<=0)&(s4<=0)&(s5<=0)
    keep = np.flatnonzero((np.abs(s12) <= np.abs(s3+s4+s5)) & (ge|le))
    return keep,qm-(s12[keep]/np.einsum("ij,ij->i",qd,n[keep]))[:,None]*qd
tri = np.random.uniform(1, 10, (20_000, 3, 3))
p0, p1 = np.random.uniform(1, 10, (2, 3))
from timeit import timeit
A,B,C = tri.transpose(1,0,2)
print('full computation', timeit(lambda: ilt_full(p0[None], p1[None], A, B, C), number=100)*10, 'ms per iteration (one ray, 20,000 triangles)')
print('short circuiting', timeit(lambda: ilt_cut(p0[None], p1[None], A, B, C), number=100)*10, 'ms per iteration (one ray, 20,000 triangles)')
Note that I played a bit with the algorithm, so this may not give the same result as yours in every edge case.
What I changed:
I inlined the tetra volume, which saves a few repeated subcomputations.
I replaced one of the ray ends with the midpoint M of the ray. This saves computing one tetra volume (s1 or s2), because one can check whether the ray crosses the plane of triangle ABC by comparing the volume of tetra ABCM to the sum of s3, s4, s5 (if they have the same signs).

Haskell beginner - optimal way to solve this

I was assigned the task to create a function which is able to calculate the grade of your exam by adding the points you achieved to the extra points you collected through other means and then converting them into the grade system. I also had to add an error message if either the extra points or the exam points exceeded their maximum (20 and 100 respectively).
The function I created works, but it probably isn't close to being optimal.
calcGrade :: Double -> Double -> Double
calcGrade x y
  | x > 20 = error "Can't add more than 20 extra points"
  | y > 100 = error "Can't achieve more than 100 points"
  | x + y < 50 = 5.0
  | x + y >= 50 && x + y < 54 = 4.0
  | x + y >= 54 && x + y < 58 = 3.7
  | x + y >= 58 && x + y < 62 = 3.3
  | x + y >= 62 && x + y < 66 = 3.0
  | x + y >= 66 && x + y < 70 = 2.7
  | x + y >= 70 && x + y < 74 = 2.3
  | x + y >= 74 && x + y < 78 = 2.0
  | x + y >= 78 && x + y < 82 = 1.7
  | x + y >= 82 && x + y < 86 = 1.3
  | x + y >= 86 = 1.0
Is there another way of doing this or is there anything I could do more efficiently? I'm pretty new to Haskell and programming in general so I'm thankful for any advice!

If I wanted to implement that exact same function (as opposed to changing the specification so that I could make the code cleaner -- which is sometimes possible), I think I would use Map's lookupGT to encode the lookup table that is currently done with guards. So:
import Data.Map (fromAscList, lookupGT)
calcGrade :: Double -> Double -> Double
calcGrade x y
  | x > 20 = error "Can't add more than 20 extra points"
  | y > 100 = error "Can't achieve more than 100 points"
  | otherwise = case lookupGT (x+y) cutoffs of
      Just (_, v) -> v
      Nothing -> 1.0
  where
    cutoffs = fromAscList [(50, 5.0), (54, 4.0), (58, 3.7), (62, 3.3), (66, 3.0), (70, 2.7), (74, 2.3), (78, 2.0), (82, 1.7), (86, 1.3)]
This has a few advantages:
Much less repetition. This is a virtue in and of itself.
With the guards, the reader must carefully check that the condition is on x+y in every case, and not, say, the very visually similar x+v for some case for some reason. With this encoding, that's clear without careful attention.
Each guard would be checked in turn, giving a linear runtime in the number of cutoffs. With Map's lookupGT, only a log number of comparisons is done. Since you probably don't intend to vary the cutoffs dynamically, this probably doesn't matter; but the trick used here can occasionally be useful elsewhere, so it's nice to remember it for those cases where asymptotics do matter.
Because cutoffs appear in only one place, if this changes later (you'd be surprised...) you don't have to be careful to change, e.g., 58 to 59 in two places as one would need to do with your code.
The only wart, in my opinion, is that the default score case (Nothing -> 1.0) doesn't live next to the cutoffs; though it's not clear to me how one might go about doing that sanely.

If you only accepted Int values (and still returned Double) then you could write it as
calcGrade x y =
  let score = (max 46 (x + y) - 46) `div` 4
      grades = [5.0, 4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 1.7, 1.3] ++ repeat 1.0
  in grades !! score
But this leaves out the first 2 checks. You could put them in rather easily, but it might be better to put that in a different function (also, the use of error is frowned upon in Haskell; it is better to use a type that indicates the function might fail, such as Maybe or Either).
What this function does is calculate the sum of x and y first, then ask "which is larger, x + y or 46?". This handles the case of x + y < 50. Next, it subtracts 46, so a score of 50 becomes 4, a score of 54 becomes 8, and so on. The div function does integer division by 4, so a score of 50 becomes 4 becomes 1, a score less than 50 becomes 0, and a score of 73 becomes 27 becomes 6.
The grades themselves are stored in a list, any score less than 50 will index out the first element of 5.0, and then each range indexes out its corresponding element, so 73 indexes out the 7th element (index 6) of 2.3. The ++ repeat 1.0 handles a score >= 86.
Another way you could solve this might make a bit more sense. Just build a list of the ranges:
let score = x + y
    mins = [0, 50, 54, 58, 62, 66, 70, 74, 78, 82, 86, 120]
    ranges = zip mins (tail mins) -- [(0, 50), (50, 54), ..., (86, 120)]
    grades = [5.0, 4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 1.7, 1.3, 1.0]
    inRange = map (\(lower, upper) -> lower <= score && score < upper) ranges
in snd $ head $ filter fst $ zip inRange grades
I think most of this logic is pretty clear, but the last line might be confusing. It zips the inRange list of Bools with the grades, filters by the first element (whether that range included the score), takes the first element from the list, then grabs the second element of that (Bool, Double) tuple.

Some things you can try to clean it up a bit:
Use a where binding to extract the common subexpression x + y.
Better yet, just accept the already added number as an argument. Then this function will be your "lookup grade" function, and you can call it like lookupGrade (exam + extra).
Don't use error. Instead, return a Maybe Double (which is either Just a Double, or Nothing if you can't compute the score).
List your guards in reverse order. This way, you only have to check one bound on each, not both. It is ok to overlap because they are checked in order.
Try to extract the actual meaning of the grading algorithm, rather than trying to list out cases. Try to find a math formula that converts it like you want, then write that formula in Haskell.
Applying these transformations is probably the best way to write this function, unless you want to use a Map (from Data.Map) to list larger numbers of cases. This code will express your intent better than using a bunch of unneeded lists.

Not sure about the performance, but this should work.
import Data.Map (fromAscList, filterWithKey)
calculate :: Double -> Double -> String
calculate exam bonus
  | exam > 100 || bonus > 20 = "Be real!"
  | otherwise = (foldr (++) "" . filterWithKey isInRange) letterGrade
  where
    isInRange k _ = percent `elem` k
    percent = truncate $ (exam + bonus) * 10 / 12
    letterGrade = fromAscList [ ([0 ..59], "F"), ([60..69], "D"), ([70..79], "C"), ([80..89], "B"), ([90..100], "A")]

Interpolate between three colors

I need to linearly interpolate between colors A and B so that at t=.1 the color is 0.5A + 0.5B. How can I achieve this?
I need to use this in a shader, so a conditional statement isn't an option, because branching can get quite slow. I need the solution in an expression-form.
Here's my solution, which is quite slow:
fixed3 color1= lerp(colorA, colorB, (1.0 - t) * .5 );
fixed3 borderColor= lerp(colorA, colorB, (1.0 - .1) * .5);
fixed3 color2= lerp(colorA, borderColor, t * 10.0 );
finalColor = lerp(color2, color1, saturate( (t - .1) * 100000.0) );
The last lerp is to ensure that the interpolation value is either 0 or 1.
Thanks.

I will concentrate on the first line of your question: linear interpolation between two color values or color vectors.
You define an arbitrary point in time that marks the middle point of the interpolation:
tm = 0.1
For the calculation of the color vector C(t) you also need to define a starting time t1 or end time t2, or both:
t1 = t2 - 2*(t2-tm) = 2*tm - t2
t2 = t1 + 2*(tm-t1) = 2*tm - t1
To make the transition between color A to color B, you calculate the color components like this:
C = A + (B - A) * (t - t1) / (t2 - t1)
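As a quick check of the formula, here is a small Python sketch with tm = 0.1 and an assumed start time t1 = 0, which gives t2 = 0.2. The function name `ramp_color` and the clamp to [0, 1] are additions for illustration; the clamp plays the role that saturate() plays in the question's shader code and is not part of the formula above.

```python
def ramp_color(a, b, t, tm=0.1, t1=0.0):
    """Linear ramp from color a to color b that reaches the midpoint at t = tm."""
    t2 = 2 * tm - t1                      # end time, from t2 = 2*tm - t1
    w = (t - t1) / (t2 - t1)              # interpolation weight
    w = min(max(w, 0.0), 1.0)             # assumption: clamp outside [t1, t2]
    return tuple(ca + (cb - ca) * w for ca, cb in zip(a, b))

red, blue = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
for t in (0.0, 0.05, 0.1, 0.2, 1.0):
    print(t, ramp_color(red, blue, t))
```

The min/max clamp is branch-free in shader languages as well, so the same expression carries over without a conditional.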
