Semaphores: the output of the following program

I wanted to know the output of this program, step by step, please.
Initially, the values are: S1=0, S2=1, S3=0, a=1.
p1:
while(1) {
    P(S1);
    a = 2*a;
    V(S3);
}

p2:
while(1) {
    P(S2);
    a = a+1;
    V(S1);
    V(S3);
}

p3:
while(1) {
    P(S3);
    P(S3);
    printf("%d\n", a);
    V(S2);
}

The solution you have linked is correct.
In informal terms: P blocks while the semaphore == 0, otherwise it decrements the semaphore and continues; V increments the semaphore.
At first S1 and S3 == 0, so p1 and p3 are blocked.
p2 is the only one that can run. It increments a (== 2) and increments S1 and S3. It cannot continue because S2 is now 0.
p3 can only do one step, because after the first P call, S3 == 0 again.
p1 is now the only process that can do work. It doubles a (== 4), then unblocks S3. It cannot continue because S1 is 0 again.
Now p3 is the only one that can run. It prints 4. Then unblocks p2.
Notice the situation is exactly the same as at the beginning of the problem (except a == 4 now). So each cycle runs exactly the same way: p2 increments a, p1 doubles a, p3 prints a. Repeat.
Hence printed values are 4, 10, 22, 46...
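To replay the trace mechanically, here is a small Python sketch of my own (not part of the question) that maps P to acquire and V to release on threading.Semaphore; it prints 4, 10, 22, 46 and stops after four rounds:

import threading

S1 = threading.Semaphore(0)
S2 = threading.Semaphore(1)
S3 = threading.Semaphore(0)
a = 1
ROUNDS = 4   # one round prints one value

def p1():
    global a
    for _ in range(ROUNDS):
        S1.acquire()        # P(S1)
        a = 2 * a
        S3.release()        # V(S3)

def p2():
    global a
    for _ in range(ROUNDS):
        S2.acquire()        # P(S2)
        a = a + 1
        S1.release()        # V(S1)
        S3.release()        # V(S3)

def p3():
    for _ in range(ROUNDS):
        S3.acquire()        # P(S3) -- one token from p2 ...
        S3.acquire()        # P(S3) -- ... and one from p1
        print(a)            # prints 4, 10, 22, 46
        S2.release()        # V(S2)

threads = [threading.Thread(target=f) for f in (p1, p2, p3)]
for t in threads: t.start()
for t in threads: t.join()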


How to use np.where in another np.where (context: ray tracing)

The question is: how to use two np.where in the same statement, like this (oversimplified):
np.where((ndarr1==ndarr2),np.where((ndarr1+ndarr2==ndarr3),True,False),False)
the goal being to avoid evaluating the second condition where the first one is False.
My first objective is to find the intersection of a ray with a triangle, if there is one. This problem can be solved with the following algorithm (found on Stack Overflow):
import numpy as np

def intersect_line_triangle(q1, q2, p1, p2, p3):
    def signed_tetra_volume(a, b, c, d):
        return np.sign(np.dot(np.cross(b - a, c - a), d - a) / 6.0)

    s1 = signed_tetra_volume(q1, p1, p2, p3)
    s2 = signed_tetra_volume(q2, p1, p2, p3)
    if s1 != s2:
        s3 = signed_tetra_volume(q1, q2, p1, p2)
        s4 = signed_tetra_volume(q1, q2, p2, p3)
        s5 = signed_tetra_volume(q1, q2, p3, p1)
        if s3 == s4 and s4 == s5:
            n = np.cross(p2 - p1, p3 - p1)
            t = np.dot(p1 - q1, n) / np.dot(q2 - q1, n)
            return q1 + t * (q2 - q1)
    return None
Here are two conditional statements:
s1!=s2
s3==s4 and s4==s5
Now since I have >20k triangles to check, I want to apply this function on all triangles at the same time.
First solution is:
s1 = vol(r0,tri[:,0,:],tri[:,1,:],tri[:,2,:])
s2 = vol(r1,tri[:,0,:],tri[:,1,:],tri[:,2,:])
s3 = vol(r1,r2,tri[:,0,:],tri[:,1,:])
s4 = vol(r1,r2,tri[:,1,:],tri[:,2,:])
s5 = vol(r1,r2,tri[:,2,:],tri[:,0,:])
np.where((s1!=s2) & (s3+s4==s4+s5),intersect(),False)
where s1,s2,s3,s4,s5 are arrays containing the value S for each triangle. Problem is, it means I have to compute s3,s4,and s5 for all triangles.
Now the ideal would be to compute statement 2 (and s3, s4, s5) only where statement 1 is True, with something like this:
check = np.where((s1!=s2), np.where((compute(s3)==compute(s4)) & (compute(s4)==compute(s5)), compute(intersection), False), False)
(To simplify the explanation, I just wrote 'compute' instead of the whole computing process; 'compute' is done only on the appropriate triangles.)
Now of course this option doesn't work (and computes s4 twice), but I'd gladly take recommendations on a similar approach.
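As an aside, the generic numpy idiom for this kind of short-circuiting (a self-contained sketch of my own with made-up conditions, not the asker's functions) is to index with the first mask, run the expensive computation only on the surviving subset, and scatter the results back into a full-size array:

import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, 20_000)
b = rng.uniform(-1, 1, 20_000)

cheap = a > 0                               # condition 1, evaluated everywhere
idx = np.flatnonzero(cheap)                 # indices where condition 1 holds

expensive = np.sqrt(b[idx]**2 + a[idx])     # evaluated ONLY on the subset
hit = expensive < 1.0                       # condition 2, on the subset

result = np.zeros(a.shape, dtype=bool)      # full-size output
result[idx[hit]] = True                     # scatter the survivors back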
Here's how I used masked arrays to answer this problem (dot is presumably a row-wise dot product; r0r1, r0t0, r0t1, r0t2 are precomputed vectors):

import numpy.ma as ma

# the mask is True where the first condition failed, i.e. where we must NOT compute
loTrue = np.where((s1 != s2), False, True)
s3 = ma.masked_array(np.sign(dot(np.cross(r0r1, r0t0), r0t1)), mask=loTrue)
s4 = ma.masked_array(np.sign(dot(np.cross(r0r1, r0t1), r0t2)), mask=loTrue)
s5 = ma.masked_array(np.sign(dot(np.cross(r0r1, r0t2), r0t0)), mask=loTrue)
loTrue = ma.masked_array(np.where((abs(s3 - s4) < 1e-4) & (abs(s5 - s4) < 1e-4), True, False), mask=loTrue)

# also works when computing s3, s4 and s5 inside loTrue, like this:
loTrue = np.where((s1 != s2), False, True)
loTrue = ma.masked_array(np.where(
    (abs(np.sign(dot(np.cross(r0r1, r0t0), r0t1)) - np.sign(dot(np.cross(r0r1, r0t1), r0t2))) < 1e-4) &
    (abs(np.sign(dot(np.cross(r0r1, r0t2), r0t0)) - np.sign(dot(np.cross(r0r1, r0t1), r0t2))) < 1e-4),
    True, False), mask=loTrue)
For comparison, the same process without masked arrays is done like this:

s3 = np.sign(dot(np.cross(r0r1, r0t0), r0t1) / 6.0)
s4 = np.sign(dot(np.cross(r0r1, r0t1), r0t2) / 6.0)
s5 = np.sign(dot(np.cross(r0r1, r0t2), r0t0) / 6.0)
loTrue = np.where((s1 != s2) & (abs(s3 - s4) < 1e-4) & (abs(s5 - s4) < 1e-4), True, False)

Both give the same results. However, when looping on this process for 10k iterations, NOT using masked arrays is faster: 26 seconds without masked arrays, 31 seconds with them, and 33 seconds when the masked arrays are built in a single statement (not computing s3, s4 and s5 separately).
Conclusion: nesting the conditions is solved here (note that the mask indicates where values will NOT be computed, hence loTrue must be set to False (0) where the first condition is verified). However, in this scenario, it's not faster.
I can get a small speedup from short circuiting but I'm not convinced it is worth the additional admin.
full computation 4.463818839867599 ms per iteration (one ray, 20,000 triangles)
short circuiting 3.0060838296776637 ms per iteration (one ray, 20,000 triangles)
Code:

import numpy as np

def ilt_cut(q1, q2, p1, p2, p3):
    qm = (q1 + q2) / 2
    qd = qm - q2
    p12 = p1 - p2
    aux = np.cross(qd, q2 - p2)
    s3 = np.einsum("ij,ij->i", aux, p12)
    s4 = np.einsum("ij,ij->i", aux, p2 - p3)
    ge = (s3 >= 0) & (s4 >= 0)
    le = (s3 <= 0) & (s4 <= 0)
    keep = np.flatnonzero(ge | le)
    aux = p1[keep]
    qpm1 = qm - aux
    p31 = p3[keep] - aux
    s5 = np.einsum("ij,ij->i", np.cross(qpm1, p31), qd)
    ge = ge[keep] & (s5 >= 0)
    le = le[keep] & (s5 <= 0)
    flt = np.flatnonzero(ge | le)
    keep = keep[flt]
    n = np.cross(p31[flt], p12[keep])
    s12 = np.einsum("ij,ij->i", n, qpm1[flt])
    flt = np.abs(s12) <= np.abs(s3[keep] + s4[keep] + s5[flt])
    return keep[flt], qm - (s12[flt] / np.einsum("ij,ij->i", qd, n[flt]))[:, None] * qd

def ilt_full(q1, q2, p1, p2, p3):
    qm = (q1 + q2) / 2
    qd = qm - q2
    p12 = p1 - p2
    qpm1 = qm - p1
    p31 = p3 - p1
    aux = np.cross(qd, q2 - p2)
    s3 = np.einsum("ij,ij->i", aux, p12)
    s4 = np.einsum("ij,ij->i", aux, p2 - p3)
    s5 = np.einsum("ij,ij->i", np.cross(qpm1, p31), qd)
    n = np.cross(p31, p12)
    s12 = np.einsum("ij,ij->i", n, qpm1)
    ge = (s3 >= 0) & (s4 >= 0) & (s5 >= 0)
    le = (s3 <= 0) & (s4 <= 0) & (s5 <= 0)
    keep = np.flatnonzero((np.abs(s12) <= np.abs(s3 + s4 + s5)) & (ge | le))
    return keep, qm - (s12[keep] / np.einsum("ij,ij->i", qd, n[keep]))[:, None] * qd

tri = np.random.uniform(1, 10, (20_000, 3, 3))
p0, p1 = np.random.uniform(1, 10, (2, 3))

from timeit import timeit

A, B, C = tri.transpose(1, 0, 2)
print('full computation', timeit(lambda: ilt_full(p0[None], p1[None], A, B, C), number=100)*10, 'ms per iteration (one ray, 20,000 triangles)')
print('short circuiting', timeit(lambda: ilt_cut(p0[None], p1[None], A, B, C), number=100)*10, 'ms per iteration (one ray, 20,000 triangles)')
Note that I played a bit with the algorithm, so it may not give the same result as yours in every edge case.
What I changed:
I inlined the tetra volume, which saves a few repeated subcomputations.
I replaced one of the ray ends with the midpoint M of the ray. This saves computing one tetra volume (s1 or s2), because one can check whether the ray crosses the plane of triangle ABC by comparing the volume of the tetrahedron ABCM to the sum of s3, s4, s5 (if they have the same signs).

Stuck on a Concurrent programming example, in pseudocode (atomic actions / fine-grained atomicity)

My book presents a simple example which I'm a bit confused about:
It says, "consider the following program, and assume that the fine-grained atomic actions are reading and writing the variables:"
int y = 0, z = 0;
co x = y+z; // y=1; z=2; oc;
"If x = y + z is implemented by loading a register with y and then adding z to it, the final value of x can be 0,1,2, or 3. "
2? How does 2 work?
Note: co starts a concurrent process and // denotes parallel-running statements.
In your program there are two parallel sequences:
Sequence 1: x = y+z;
Sequence 2: y=1; z=2;
The operations of sequence 1 are:
y Copy the value of y into a register.
+ z Add the value of z to the value in the register.
x = Copy the value of the register into x.
The operations of sequence 2 are:
y=1; Set the value of y to 1.
z=2; Set the value of z to 2.
These two sequences are running at the same time, though the steps within a sequence must occur in order. Therefore, you can get an x value of '2' in the following sequence:
y=0
z=0
y Copy the value of y into a register. (register value is now '0')
y=1; Set the value of y to 1. (has no effect on the result, we've already copied y to the register)
z=2; Set the value of z to 2.
+ z Add the value of z to the value in the register. (register value is now '2')
x = Copy the value of the register into x. (the value of x is now '2')
Since they are assumed to run in parallel, I think an even simpler case could be y=0, z=2 when the assignment x = y + z occurs.
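Since the two sequences contain only five atomic steps in total, you can also enumerate every legal interleaving by brute force. Here is a small Python sketch of my own (names and structure are illustrative, not from the book) that does exactly that and prints the set of reachable final values, {0, 1, 2, 3}:

from itertools import combinations

def simulate(slots1):
    # shared state: y, z and x, plus the private register of sequence 1
    s = {"y": 0, "z": 0, "reg": 0, "x": 0}
    seq1 = [
        lambda s: s.update(reg=s["y"]),             # load y into the register
        lambda s: s.update(reg=s["reg"] + s["z"]),  # add z to the register
        lambda s: s.update(x=s["reg"]),             # store the register into x
    ]
    seq2 = [
        lambda s: s.update(y=1),                    # y = 1
        lambda s: s.update(z=2),                    # z = 2
    ]
    i1 = i2 = 0
    for slot in range(5):                           # 5 atomic steps in total
        if slot in slots1:
            seq1[i1](s); i1 += 1
        else:
            seq2[i2](s); i2 += 1
    return s["x"]

# every legal interleaving = every choice of 3 of the 5 slots for sequence 1
print({simulate(set(c)) for c in combinations(range(5), 3)})   # {0, 1, 2, 3}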

basic multithreading

I have the following interview question:
class someClass
{
    int sum = 0;

    public void foo()
    {
        for (int i = 0; i < 100; i++)
        {
            sum++;
        }
    }
}
There are two parallel threads running through the foo method, and the value of sum at the end will vary from 100 to 200.
The question is why. As I understand it, only one thread gets the CPU at a time and threads get preempted while running. At what point can the disturbance cause sum not to reach 200?
The increment isn't atomic. It reads the value, then writes out the incremented value. In between these two operations, the other thread can interact with the sum in complex ways.
The range of values is not in fact 100 to 200. This range is based on the incorrect assumption that threads either take turns or they perform each read simultaneously. There are many more possible interleavings, some of which yield markedly different values. The worst case is as follows (x represents the implicit temporary used in the expression sum++):
Thread A                Thread B
----------------        ----------------
x ← sum (0)
                        x ← sum (0)
                        x + 1 → sum (1)
                        x ← sum (1)
                        x + 1 → sum (2)
                        ⋮
                        x ← sum (98)
                        x + 1 → sum (99)
x + 1 → sum (1)
                        x ← sum (1)
x ← sum (1)
x + 1 → sum (2)
⋮
x ← sum (99)
x + 1 → sum (100)
                        x + 1 → sum (2)
Thus the lowest possible value is 2. In simple terms, the two threads undo each other's hard work. You can't go below 2 because thread B can't feed a zero to thread A (it can only hand over an incremented value), and thread A must in turn increment the 1 that thread B gave it.
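To make the schedule concrete, here is a deterministic Python replay of my own (a sketch, not from the answer) that drives two cooperative "threads" by hand, pausing each one between the read and the write of sum++; it ends with sum == 2:

sum_ = 0

def worker(n=100):
    global sum_
    for _ in range(n):
        x = sum_          # x ← sum
        yield             # possible preemption between read and write
        sum_ = x + 1      # x + 1 → sum
        yield             # possible preemption after the write

A, B = worker(), worker()

next(A)                          # A reads 0
for _ in range(99 * 2):
    next(B)                      # B runs 99 full iterations → sum = 99
next(A)                          # A writes 1 → sum = 1
next(B)                          # B reads 1 for its 100th iteration
for _ in range(99 * 2):
    next(A)                      # A runs its remaining 99 iterations → sum = 100
next(B)                          # B writes 2 → sum = 2
print(sum_)                      # 2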
The line sum++ is a race condition.
Both threads can read the value of sum as say 0, then each increments its value to 1 and stores that value back. This means that the value of sum will be 1 instead of 2.
Continue like that and you will get a result between 100 and 200.
Most CPUs have multiple cores now. So if we lock a mutex at the beginning of the function foo() and unlock it after the for loop finishes, the 2 threads will still yield the answer 200 even when running on different cores.
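In Python terms (a hypothetical sketch of my own; the question is Java, but the locking pattern is identical), that fix looks like this:

import threading

sum_ = 0
lock = threading.Lock()

def foo():
    global sum_
    with lock:                    # lock at the start of foo() ...
        for _ in range(100):
            sum_ += 1
    # ... released here, after the loop finishes

threads = [threading.Thread(target=foo) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum_)                       # always 200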

Problem detecting cyclic numbers in Haskell

I am doing Problem 61 at Project Euler and came up with the following code (to test the case they give):
p3 n = n*(n+1) `div` 2
p4 n = n*n
p5 n = n*(3*n -1) `div` 2
p6 n = n*(2*n -1)
p7 n = n*(5*n -3) `div` 2
p8 n = n*(3*n -2)
x n = take 2 $ show n
x2 n = reverse $ take 2 $ reverse $ show n
pX p = dropWhile (< 999) $ takeWhile (< 10000) [p n|n<-[1..]]
isCyclic2 (a,b,c) = x2 b == x c && x2 c == x a && x2 a == x b
ns2 = [(a,b,c)|a <- pX p3 , b <- pX p4 , c <- pX p5 , isCyclic2 (a,b,c)]
All ns2 does is return an empty list, even though isCyclic2 returns True for the example values given in the question; the series doesn't come up in the solution. The problem must lie in the list comprehension ns2, but I can't see where. What have I done wrong?
Also, how can I make it so that the pX only gets the pX (n) up to the pX used in the previous pX?
PS: in case you thought I completely missed the problem, I will get my final solution with this:
isCyclic (a,b,c,d,e,f) = x2 a == x b && x2 b == x c && x2 c == x d && x2 d == x e && x2 e == x f && x2 f == x a
ns = [[a,b,c,d,e,f]|a <- pX p3 , b <- pX p4 , c <- pX p5 , d <- pX p6 , e <- pX p7 , f <- pX p8 ,isCyclic (a,b,c,d,e,f)]
answer = sum $ head ns
The order is important. The cyclic numbers in the question are 8128, 2882, 8281, and these are not P3/127, P4/91, P5/44 but P3/127, P5/44, P4/91.
Your code is only checking in the order 8128, 8281, 2882, which is not cyclic.
You would get the result if you check for
isCyclic2 (a,c,b)
in your list comprehension.
EDIT: Wrong Problem!
I assumed you were talking about the circular number problem, Sorry!
There is a more efficient way to do this with something like this:
take (2 * l x -1) . cycle $ show x
where l = length . show
Try that and see where it gets you.
If I understand you right, you're no longer asking why your code doesn't work but how to make it faster. That's actually the whole fun of Project Euler: finding an efficient way to solve the problems. So proceed with care, and first try to think of reducing your search space yourself. I suggest you let Haskell print out the three lists pX p3, pX p4, pX p5 and see how you'd go about looking for a cycle.
If you proceeded like your list comprehension does, you'd start with the first element of each list: 1035, 1024, 1080. I'm pretty sure you would stop right after picking 1035 and 1024, and not test for cycles with any value from P5, let alone try all the permutations of the combinations involving these two numbers.
(I haven't actually worked on this problem yet, so this is how I would go about speeding it up. There may be some math wizardry out there that's even faster)
First, start looking at the numbers you get from pX. You can drop more of them. For example, P3 contains 6105: there's no way you're going to find a number in the other sets starting with '05'. So you can also drop the numbers where the value modulo 100 is less than 10.
Then (for the case of 3 sets), we can sometimes see after drawing two numbers that there can't be any number in the last set that will give you a cycle, no matter how you permute (e.g. 1035 from P3 and 3136 from P4: there can't be a cycle here).
I'd probably try to build a chain by starting with the elements from one list, one by one, and for each element, find the elements from the remaining lists that are valid successors. For those that you've found, continue trying to find the next chain element from the remaining lists. When you've built a chain with one number from every list, you just have to check if the last two digits of the last number match the first two digits of the first number.
Note when looking for successors, you again don't have to traverse the entire lists. If you're looking for a successor to 3015 from P5, for example, you can stop when you hit a number that's 1600 or larger.
If that's too slow still, you could transform the lists other than the first one to maps where the map key is the first two digits and the associated values are lists of numbers that start with those digits. Saves you from going through the lists from the start again and again.
I hope this helps a bit.
By the way, I sense some repetition in your code.
You can unite your p3, p4, p5, p6, p7, p8 functions into one function that takes the 3 from p3 as a parameter, etc.
To find the pattern, you can first rewrite all the functions in the form
pX n = ... `div` 2
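For reference, the closed form they all share (a standard fact about polygonal numbers, not something stated in the answer) is

P(k, n) = n * ((k-2)*n - (k-4)) / 2

which reproduces p3 through p8 for k = 3..8: for instance k = 3 gives n*(n+1)/2 and k = 8 gives n*(6*n-4)/2 = n*(3*n-2).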

Do atomic operations work the same across processes as they do across threads?

Obviously, atomic operations make sure that different threads don't clobber a value. But is this still true across processes, when using shared memory? Even if the processes happen to be scheduled by the OS to run on different cores? Or across different distinct CPUs?
Edit: Also, if it's not safe, is it not safe even on an operating system like Linux, where processes and threads are the same from the scheduler's point of view?
tl;dr: Read the fine print in the documentation of the atomic operations. Some will be atomic by design but may trip over certain variable types. In general, though, an atomic operation will maintain its contract between different processes just as it does between threads.
An atomic operation really only ensures that you won't have an inconsistent state if called by two entities simultaneously. For example, an atomic increment that is called by two different threads or processes on the same integer will always behave like so:
x = initial value (zero for the sake of this discussion)
Entity A increments x and returns the result to itself: result = x = 1.
Entity B increments x and returns the result to itself: result = x = 2.
where A and B indicate the first and second thread or process that makes the call.
A non-atomic operation can result in inconsistent or generally crazy results due to race conditions, incomplete writes to the address space, etc. For example, you can easily see this:
x = initial value = zero again.
Entity A calls x = x + 1. To evaluate x + 1, A checks the value of x (zero) and adds 1.
Entity B calls x = x + 1. To evaluate x + 1, B checks the value of x (still zero) and adds 1.
Entity B (by luck) finishes first and assigns the result of x + 1 = 1 (step 3) to x. x is now 1.
Entity A finishes second and assigns the result of x + 1 = 1 (step 2) to x. x is now 1.
Note the race condition as entity B races past A and completes the expression first.
Now imagine if x were a 64-bit double that is not ensured to have atomic assignments. In that case you could easily see something like this:
A 64-bit double x = 0.
Entity A tries to assign 0x1122334455667788 to x. The first 32 bits are assigned first, leaving x with 0x1122334400000000.
Entity B races in and assigns 0xffeeddccbbaa9988 to x. By chance, both 32 bit halves are updated and x is now = 0xffeeddccbbaa9988.
Entity A completes its assignment with the second half and x is now = 0xffeeddcc55667788.
These non-atomic assignments are some of the most hideous concurrent bugs you'll ever have to diagnose.
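To see the cross-process contract in action, here is a minimal Python sketch of my own (not from the answer) in which two processes increment a counter living in shared memory; multiprocessing.Value provides the lock that makes the read-modify-write atomic:

from multiprocessing import Process, Value

def bump(counter, n):
    for _ in range(n):
        with counter.get_lock():      # serialize the read-modify-write
            counter.value += 1

if __name__ == "__main__":
    c = Value("i", 0)                 # a 32-bit int in shared memory
    ps = [Process(target=bump, args=(c, 100_000)) for _ in range(2)]
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    print(c.value)                    # always 200000 with the lock; without
                                      # it, updates could be lost as above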
