When to use tail call optimization in a Cairo smart contract - tail-recursion

I can often write a tail-recursive version of my functions at the cost of slightly less elegant code. Should I do it because it could reduce the fees, or should I keep the unoptimised version?
For example, here is an "unoptimised" function which sums elements of an array:
@view
func get_convoy_strength{syscall_ptr : felt*, pedersen_ptr : HashBuiltin*, range_check_ptr}(
    convoy_id : felt
) -> (strength : felt):
    alloc_locals
    let (convoyables_len : felt, convoyables : felt*) = get_convoyables(convoy_id)
    return _get_convoyables_strength(convoyables_len, convoyables)
end
and here is the tail call optimization:
func _get_convoyables_strength_tc{
    syscall_ptr : felt*, pedersen_ptr : HashBuiltin*, range_check_ptr
}(convoyables_len : felt, convoyables : felt*, sum : felt) -> (strength : felt):
    alloc_locals
    if convoyables_len == 0:
        return (sum)
    else:
        let convoyable_id = [convoyables]
        let (convoyable_strength) = _get_strength(convoyable_id)
        return _get_convoyables_strength_tc(
            convoyables_len - 1, convoyables + 1, sum + convoyable_strength
        )
    end
end
As you can see, it's a little less friendly because it requires an additional argument (which will always be 0 at the top-level call). On a normal computer this could be optimized so as not to fill the call stack, but as FeedTheFed pointed out, memory is immutable here, so that doesn't seem to be useful. He did say, however, that it could be interesting for "not wasting memory cells for the intermediate return values". It would be very helpful to me to have a more detailed explanation, as I'm not sure I understand.
Here is the Cairo doc related to this: https://www.cairo-lang.org/docs/how_cairo_works/functions.html?highlight=tail#tail-recursion

The short answer: the cost of a few additional Cairo steps is likely to be negligible relative to accessing storage and using other system calls, so I'd start from the "non-optimized" version and try to optimize only if the function uses a lot of Cairo steps and it seems to affect the overall cost significantly.
The longer answer:
As you mentioned, the usual tail-call optimization of reusing the stack is not relevant in Cairo because of the immutable memory.
The advantage of a tail-call recursion in Cairo is that when the called function returns you don't need to do anything with the return value. This means that in a tail-call recursion, returning from the inner calls is just a sequence of ret instructions.
On the other hand, a non-tail-call recursion (like the one in your example) will have instructions that process the return values of the inner call, such as copying the implicit arguments and computing the sum of the current result with the next element.
In some (very simple) cases, it won't be worse than the tail-call version. Consider for example the following function that computes 1 + 2 + ... + i:
func sum(i : felt) -> (res : felt):
    if i == 0:
        return (res=0)
    end
    let (partial_sum) = sum(i=i - 1)
    return (res=partial_sum + i)
end
This function costs 5 steps per iteration: 3 before the recursive call (if, push i-1, call sum) and 2 after (compute the sum, ret). The "optimized" tail-call version will also cost 5 steps per iteration: 4 steps before and 1 step after.
But this is a very simple case with no implicit arguments and only one return argument. If we add an implicit argument (even if it's unused in the function), the tail-call version will perform better: only 1 additional step per iteration compared to 2 additional steps in the non-tail-call version.
In your example, you have 3 implicit arguments, so the tail-call version will probably be better (although, my guess is that it won't be significant, but this depends on the rest of the code).
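If the accumulator transformation itself is unfamiliar, here is the same sum function sketched in Haskell, purely as an illustration (the step counting above is specific to Cairo, not Haskell):

-- Naive version: the addition still has to happen after the recursive call returns.
sumNaive :: Int -> Int
sumNaive 0 = 0
sumNaive i = sumNaive (i - 1) + i

-- Tail-call version: the extra accumulator argument carries the partial sum,
-- so the recursive call is the very last thing the function does.
sumTC :: Int -> Int -> Int
sumTC 0 acc = acc
sumTC i acc = sumTC (i - 1) (acc + i)

-- Top-level call: sumTC i 0  (the accumulator always starts at 0)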

Related

Tail-recursive string splitting in Haskell

I'm considering the problem of splitting a string s at a character c.
This is expressed as
break (c ==) s
where the Haskell library definition of break (c ==) is close enough to
br [] = ([], [])
br s@(h:t) = if c == h
             then ([], s)
             else let (h', t') = br t in (h:h', t')
(And let's suppose that I immediately wanted access to the second item of the return value, so that any lazy evaluation has been forced through.) The recursive call to br t appears to store h on the call stack, but a general sense of the algorithm indicates that this shouldn't be necessary. Here is one way of doing it in constant stack space, in a pseudocode language with mutability, where & denotes passage by reference, and lists are implemented as LISPy pairs:
br(c, s) =
  allocate res_head, res_rest
  iter(c, s, &res_head, &res_rest)
  return (res_head, res_rest)

iter(c, s, &res_head, &res_rest) =
  case s of
    []   -> set res_head = res_rest = []        -- and terminate
    c:ss -> set res_head = [], res_rest = s     -- and terminate
    x:ss -> allocate new_pair
            set res_head = new_pair, new_pair.head = x
            iter(c, ss, &new_pair.tail, &res_rest)  -- tail call / jump
Whether or not GHC is smart enough to find this optimization, I'd like to formulate the computation in Haskell in a manner that is patently tail-recursive. How might one do this?
Tail recursive breakAt
The standard accumulator introduction trick would produce something like this:
breakAt :: Char -> String -> (String, String)
breakAt needle = breakAtAcc []
  where
    breakAtAcc :: String -> String -> (String, String)
    breakAtAcc seen [] = (reverse seen, [])
    breakAtAcc seen cs@(c:cs')
      | c == needle = (reverse seen, cs)
      | otherwise   = breakAtAcc (c : seen) cs'
The recursive part of this is tail recursive, although we process the characters that make up the pre-split part of the return value in the wrong order for building up a list, so they need to be reversed at the end. However even ignoring that (using a version without the reverse), this is probably worse.
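For concreteness, one way to drop the reverse is to accumulate a function (a difference list) instead of a reversed list; a minimal sketch, with the name breakAtDL made up for this example:

breakAtDL :: Char -> String -> (String, String)
breakAtDL needle = go id
  where
    -- the accumulator is a function that prepends the characters seen so far
    go acc []         = (acc [], [])
    go acc cs@(c:cs')
      | c == needle   = (acc [], cs)
      | otherwise     = go (acc . (c :)) cs'

This is still tail recursive and still not productive: nothing is returned until the split point is found, which is exactly the problem discussed below.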
In Haskell you're worrying about the wrong thing if you're concerned about the stack overflow errors you would see from deep recursion in many other languages (often prevented by tail call optimisation, hence tail recursion being desirable). Haskell does not have this kind of stack overflow. Haskell does have a stack, which can overflow, but it's not the normal call stack from imperative languages.
For example, if I start GHCi with ghci +RTS -K65k to explicitly set the maximum stack size to 65 KB (about the smallest value I could get it to start up with), then tripping the standard foldr (+) stack overflow doesn't take much:
λ foldr (+) 0 [1..3000]
*** Exception: stack overflow
A mere 3,000 recursive steps kills it. But I can run your br on much larger lists without problem:
λ let (pre, post) = br 'b' (replicate 100000000 'a' ++ "b") in (length pre, length post)
(100000000,1)
it :: (Int, Int)
That's 100 million non-tail recursive steps. If each of those took a stack frame and they were fitting in our 65 KB stack, we'd be getting about 1500 stack frames for every byte. Clearly this kind of recursion does not actually cause the stack consumption problems it does in other languages! That's because it's not the recursion depth itself that's causing the stack overflow in foldr (+) 0 [1..3000]. (See the last section at the end if you want to know what does cause it.)
The advantage br has over a tail-recursive version like breakAt is that it's productive. If you're only interested in the first n characters of the prefix, then at most n characters of the input string will be examined (if you're interested in the post-split string, then obviously it will need to examine enough of the string to find the split). You can observe this by running br and breakAt on a long input string and taking small bit of prefix, something like this:
λ let (pre, post) = br 'b' (replicate 100000000 'a' ++ "b") in take 5 pre
"aaaaa"
it :: [Char]
If you try the same thing with breakAt (even if you take out the call to reverse), it'll at first print only the opening " and then spend a long time thinking before eventually coming up with the rest of "aaaaa". That's because it has to find the split point before it returns anything except another recursive call; the first character of the prefix is not available until the split point has been reached. And that's the essence of tail recursion; there's no way to fix it.
You can see it even more definitively by using undefined:
λ let (pre, post) = br 'b' ("12345" ++ undefined) in take 5 pre
"12345"
it :: [Char]
λ let (pre, post) = breakAtRev 'b' ("12345" ++ undefined) in take 5 pre
"*** Exception: Prelude.undefined
CallStack (from HasCallStack):
error, called at libraries/base/GHC/Err.hs:74:14 in base:GHC.Err
undefined, called at <interactive>:18:46 in interactive:Ghci8
br can return the first 5 characters without examining whether or not there is a 6th. breakAt (with or without reverse) forces more of the input, and so hits the undefined.
This is a common pattern in Haskell. Changing an algorithm to make it tail recursive frequently makes performance worse. You do want tail recursion if the final return value is a small type like an Int, Double, etc that can't be consumed in a "gradual" way; but you need to make sure any accumulator parameter you're using is strictly evaluated in that case! That's why for summing a list foldl' is better than foldr; there's no way to consume the sum "gradually" so we want tail recursion like foldl, but it has to be the strict variant foldl' or we still get stack overflows even though it's tail recursive! But when you're returning something like a list or a tree, it's much better if you can arrange for consuming the result gradually to cause the input to be read gradually. Tail recursion fundamentally does not allow this.
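To make the foldl'-versus-foldr point concrete, a minimal pair (foldl' comes from Data.List):

import Data.List (foldl')

-- Strict accumulator: the partial sum is forced at each step; constant stack.
sumStrict :: [Int] -> Int
sumStrict = foldl' (+) 0

-- Builds a right-nested chain of (+) thunks; fine for small lists,
-- but forcing the result for a large list can exhaust the stack.
sumLazy :: [Int] -> Int
sumLazy = foldr (+) 0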
What causes stack consumption in Haskell?
Haskell is lazy. So when you call a recursive function it doesn't necessarily run all the way to the "bottom" of the recursion immediately as it would in a strict language (requiring the stack frames from every level of the recursion to be "live" at once, if they can't be optimised away by something like tail call elimination). It doesn't necessarily run at all of course, only when the result is demanded, but even then "demand" causes the function to run only as far as "weak head normal form". That has a fancy technical definition, but it more-or-less means the function will run until it has produced a data constructor.
So if the function's code itself returns a data constructor, as br does (all of its cases return the pair constructor (,)), then entering the function will be complete at the end of that one single step. The data constructor's fields may contain thunks for further recursive calls (as they do in br), but those recursive calls will only be actually run when something pattern matches on this constructor to extract those fields, and then pattern matches on them. Often that is just about to happen, because the pattern match on the returned constructor is what caused the demand to run this function in the first place, but it is still resolved after this function returns. And so any recursive calls in the constructor's fields don't have to be made while the first call is "still running", and thus we don't have to keep a call stack frame around for it when we enter the recursive calls. (I'm sure the actual GHC implementation does lots of fancy tricks I'm not covering, so this picture probably isn't correct in detail, but it's an accurate enough mental model for how the language "works")
But what if the code for the function doesn't return a data constructor directly? What if instead it returns another function call? Function calls aren't run until their return value is demanded, but the function we're considering was only run because its return value was demanded. That means the function call it returns is also demanded, so we have to enter it.
We can use tail call elimination to avoid needing a call stack frame for this too. But what if the code for this function makes a pattern match (or uses seq, or strictness analysis decided to demand its arguments early, etc.)? If the thing it's matching on is already evaluated to a data constructor then that's fine, it can run the pattern match now. But if the thing it's matching on is itself a thunk, that means we have to enter some random other function and run it far enough to produce its outermost data constructor. Now we need a stack frame to remember where to come back to when that other function is done.
So stack consumption happens in Haskell not directly from call depth, but from "pattern match depth" or "demand depth"; the depth of thunks we have to enter without finding the outermost data constructor.
So br is totally fine for this sort of stack consumption; all of its branches immediately return a pair constructor (,). The recursive case has thunks for another call to br in its fields, but as we've seen that does not cause stack growth.
In the case of breakAt (or rather breakAtAcc), the return value in the recursive case is another function call we have to enter. We only get to a point where we can stop (a data constructor) after running all the way to the split point. So we lose laziness and productivity, but it still won't cause a stack overflow because of tail call elimination.
The problem with foldr (+) 0 [1..3000] is it returns 0 + <thunk>. That's not a data constructor, it's a function call to +, so it has to be entered. But + is strict in both arguments so before it returns it's going to pattern match on the thunk, requiring us to run it (and thus add a stack frame). That thunk will evaluate foldr (+) 1 [2..3000] to 1 + <thunk>, and entering + again will force that thunk to 2 + thunk, and so on, eventually exhausting the stack. But the call depth of foldr technically does no harm, rather it's the nested + thunks that foldr generates that consume the stack. If you could write a similar giant chain of additions literally (and GHC evaluated that naively without rewriting anything), the same stack overflow would happen with no call depth at all. And if you use foldr with a different function you can process infinite lists to an unbounded depth with no stack consumption at all.
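To see that last claim directly: if the function passed to foldr returns a data constructor immediately, even an infinite list is fine. A small sketch:

-- (:) is a constructor, so each step of the fold is productive:
firstFiveEvens :: [Int]
firstFiveEvens = take 5 (foldr (\x acc -> if even x then x : acc else acc) [] [1 ..])
-- [2,4,6,8,10], despite the infinite input and with no deep thunk chain to force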
TLDR
You can have a tail recursive break, but it's worse than the version in base. Deep recursion is a problem for strict languages that use a call stack, but not for Haskell. Deep pattern matching is the analogous problem, but it takes more than counting recursion depth to spot that. So trying to make all your recursion be tail recursion in Haskell will frequently make your code worse.

Python - PyDash example of reduce with accumulator

Looking for a code example using Python’s PyDash module of reduce with the accumulator arg, and how to output that with previous and current.
pydash.reduce_ is not very different from the built-in functools.reduce.
A good example for using the accumulator (or the initial parameter in functools' case) is to use it as a "neutral element":
import pydash

def factorial(n):
    return pydash.reduce_(range(1, n + 1), lambda total, x: total * x, accumulator=1)
In this case, 1 is used as the initial value and doesn't affect the result (since 1*x == x), but more importantly: it will be used as the return value if there are no elements in range(1, n + 1).
And indeed, factorial(0) == factorial(1) == 1 is the required result.
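The same neutral-element idea appears in any fold that takes an initial value; for instance, a sketch in Haskell (not part of pydash, just the same pattern):

import Data.List (foldl')

factorial :: Integer -> Integer
factorial n = foldl' (*) 1 [1 .. n]
-- the initial 1 is neutral for (*) and is the result for the empty range,
-- so factorial 0 == factorial 1 == 1, just as above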

TypeError when applying sum to a list of strings [duplicate]

Python has a built in function sum, which is effectively equivalent to:
import operator
from functools import reduce  # reduce is a builtin in Python 2

def sum2(iterable, start=0):
    return start + reduce(operator.add, iterable)
for all types of parameters except strings. It works for numbers and lists, for example:
sum([1,2,3], 0) = sum2([1,2,3],0) = 6 #Note: 0 is the default value for start, but I include it for clarity
sum({888:1}, 0) = sum2({888:1},0) = 888
Why were strings specially left out?
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.
Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, and no banning was there for, say, lists.
Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
Python tries to discourage you from "summing" strings. You're supposed to join them:
"".join(list_of_strings)
It's a lot faster, and uses much less memory.
A quick benchmark:
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.
BTW, this restriction has been in place "forever", i.e., since sum was added as a built-in function (rev. 32347).
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway..
>>> class ZeroObject(object):
...     def __add__(self, other):
...         return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup
In the builtin_sum function we have this bit of code:
/* reject string values for 'start' parameter */
if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
    PyErr_SetString(PyExc_TypeError,
        "sum() can't sum strings [use ''.join(seq) instead]");
    Py_DECREF(iter);
    return NULL;
}
Py_INCREF(result);
So.. that's your answer.
It's explicitly checked in the code and rejected.
From the docs:
The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence).
By making sum refuse to operate on strings, Python has encouraged you to use the correct method.
Short answer: Efficiency.
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
doubles are always the same size, which makes sum's running time O(1)×N = O(N).
int (formerly known as long) is arbitrary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).
Thus, summing strings would be much slower than summing numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
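The same asymptotics show up in any language with immutable strings. A Haskell sketch of the two shapes, with hypothetical names:

-- Quadratic in the total length: each (++) re-copies everything accumulated so far.
concatSlow :: [String] -> String
concatSlow = foldl (++) ""

-- Linear in the total length: one traversal, analogous to ''.join.
concatFast :: [String] -> String
concatFast = concat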
The Reason Why
@dan04 has an excellent explanation of the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, while not many use sum for lists and tuples and other containers with the same O(n**2) behaviour. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.
Edit: Moved the parts about immutability to history.
Basically, it's a question of preallocation. When you use a statement such as
sum(["a", "b", "c", ..., ])
and expect it to work similar to a reduce statement, the code generated looks something like
v1 = "" + "a"       # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b"       # must allocate v2 and set its size to len("a") + len("b")
...
res = v10000 + "$"  # must allocate res and set its size to len(v10000) + len("$")
In each of these steps a new string is created, which for one thing incurs some copying overhead as the strings get longer and longer. But that's maybe not the point here. What's more important is that every new string on each line must be allocated at its specific size. (I don't know whether it must allocate in every iteration of the reduce statement; there might be some obvious heuristics to use, and Python might allocate a bit more here and there for reuse, but at several points the new string will be large enough that this won't help anymore and Python must allocate again, which is rather expensive.)
A dedicated method like join, however, has the job of figuring out the real size of the string before it starts, and would therefore in theory only allocate once, at the beginning, and then just fill that new string, which is much cheaper than the other solution.
I don't know why, but this works!
import operator
from functools import reduce  # needed on Python 3

def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)

Thrust custom binary predicate parallelization

I am implementing a custom binary predicate used by the thrust::max_element search algorithm. It also keeps track of a value of interest which the algorithm alone cannot yield. It looks something like this:
struct cmp_class {
    cmp_class(int *val1) {
        this->val1 = val1;
    }
    bool operator()(int i, int j) {
        if (j < *val1) { *val1 = j; }
        return i < j;
    }
    int *val1;
};

int val1 = ARRAY[0];
std::max_element(ARRAY, ARRAY + length, cmp_class(&val1));
// .... use val1
My actual binary predicate is quite a bit more complex, but the example above captures the gist of it: I am passing a pointer to an integer to the binary predicate, which then writes to that integer to keep track of some value (here the running minimum). Since max_element first calls the predicate on (ARRAY[0], ARRAY[1]) and then on (running maximum, ARRAY[i]), I only look at value j inside the predicate and therefore initialize val1 = ARRAY[0] to ensure ARRAY[0] is taken into account.
If I do the above on the host, using std::max_element for example, this works fine. The value of val1 is what I expect given known data. However, using Thrust to execute this on the GPU, its value is off. I suspect this is due to the parallelization of thrust::max_element, which is applied recursively on sub-arrays, the results of which form another array on which thrust::max_element is run, etc. Does this hold water?
In general, the binary predicates used for thrust reductions are expected to be commutative. I'm using "commutative" here to mean that the predicate result should be valid regardless of the order in which the arguments are presented.
At the initial stage of a thrust parallel reduction, the arguments are likely to be presented in an order you might expect (i.e. in the order of the vectors passed to the reduce function, or in the order that the values occur in a single vector). However, later on in the parallel reduction process, the origin of the arguments may get mixed up during the parallel sweeping. Any binary comparison functor that assumes an ordering to the arguments is probably broken for thrust parallel reductions.
In your case, the boolean result of your comparison functor should be valid regardless of the order of arguments presented, and in that respect it appears to be properly commutative.
However, regarding the custom storage functionality you have built-in around val1, it seems pretty clear that the results in val1 could be different depending on the order in which arguments are passed to the functor. Consider this simple max-finding sequence amongst a set values passed to the functor as (i,j) (assume val1 starts out at a large value):
values: 10 5 3 7 12
comparison pairs: (10,5) (10,3) (10,7) (10,12)
comparison result: 10 10 10 12
val1 storage: 5 3 3 3
Now suppose that we simply reverse the order that arguments are presented to the functor:
values: 10 5 3 7 12
comparison pairs: (5,10) (3,10) (7,10) (12,10)
comparison result: 10 10 10 12
val1 storage: 10 10 10 10
Another issue is that you have no atomic protection on val1 that I can see:
if (j < *val1) { *val1 = j; }
The above line of code may be OK in a serial realization. In a parallel multi-threaded algorithm, you have the possibility for multiple threads to be accessing (reading and writing) *val1 simultaneously, which will have undefined results.
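To picture why the argument order gets "mixed up", here is the general shape of a tree reduction, sketched in Haskell (this is not Thrust's actual implementation, just the idea):

-- Combines elements pairwise, level by level; the pairing order is an
-- implementation detail, so f must not care which operand is "the running result".
-- Assumes a non-empty list.
treeReduce :: (a -> a -> a) -> [a] -> a
treeReduce _ [x] = x
treeReduce f xs  = treeReduce f (pairUp xs)
  where
    pairUp (a : b : rest) = f a b : pairUp rest
    pairUp rest           = rest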

What is call-by-need?

I want to know what call-by-need is.
I searched Wikipedia and found it here: http://en.wikipedia.org/wiki/Evaluation_strategy,
but I could not understand it properly.
If anyone can explain with an example and point out the difference from call-by-value, it would be a great help.
Suppose we have the function
square(x) = x * x
and we want to evaluate square(1+2).
In call-by-value, we do
square(1+2)
square(3)
3*3
9
In call-by-name, we do
square(1+2)
(1+2)*(1+2)
3*(1+2)
3*3
9
Notice that since we use the argument twice, we evaluate it twice. That would be wasteful if the argument evaluation took a long time. That's the issue that call-by-need fixes.
In call-by-need, we do something like the following:
square(1+2)
let x = 1+2 in x*x
let x = 3 in x*x
3*3
9
In step 2, instead of copying the argument (like in call-by-name), we give it a name. Then in step 3, when we notice that we need the value of x, we evaluate the expression for x. Only then do we substitute.
BTW, if the argument expression produced something more complicated, like a closure, there might be more shuffling of lets around to eliminate the possibility of copying. The formal rules are somewhat complicated to write down.
Notice that we "need" values for the arguments to primitive operations like + and *, but for other functions we take the "name, wait, and see" approach. We would say that the primitive arithmetic operations are "strict". It depends on the language, but usually most primitive operations are strict.
Notice also that "evaluation" still means to reduce to a value. A function call always returns a value, not an expression. (One of the other answers got this wrong.) OTOH, lazy languages usually have lazy data constructors, which can have components that are evaluated on-need, i.e., when extracted. That's how you can have an "infinite" list: the value you return is a lazy data structure. But call-by-need vs call-by-value is a separate issue from lazy vs strict data structures. Scheme has lazy data constructors (streams), although since Scheme is call-by-value, the constructors are syntactic forms, not ordinary functions. And Haskell is call-by-need, but it has ways of defining strict data types.
If it helps to think about implementations, then one implementation of call-by-name is to wrap every argument in a thunk; when the argument is needed, you call the thunk and use the value. One implementation of call-by-need is similar, but the thunk is memoizing; it only runs the computation once, then it saves it and just returns the saved answer after that.
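A sketch of that memoizing thunk, written out with an explicit mutable cell (an IORef) in Haskell; this is only for illustration, since Haskell itself already gives you this behaviour for free:

import Data.IORef

-- newThunk wraps a computation so that it runs at most once;
-- subsequent forces return the cached result.
newThunk :: IO a -> IO (IO a)
newThunk compute = do
    cache <- newIORef Nothing
    return $ do
        cached <- readIORef cache
        case cached of
            Just v  -> return v
            Nothing -> do
                v <- compute
                writeIORef cache (Just v)
                return v

main :: IO ()
main = do
    force <- newThunk (putStrLn "evaluating" >> return (1 + 2))
    a <- force      -- prints "evaluating", computes 3
    b <- force      -- cached: prints nothing
    print (a + b)   -- 6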
Imagine a function:
fun add(a, b) {
    return a + b
}
And then we call it:
add(3 * 2, 4 / 2)
In a call-by-value language this will be evaluated so:
a = 3 * 2 = 6
b = 4 / 2 = 2
return a + b = 6 + 2 = 8
The function will return the value 8.
In a call-by-need language (also called a lazy language) this is evaluated like so:
a = 3 * 2
b = 4 / 2
return a + b = 3 * 2 + 4 / 2
The function will return the expression 3 * 2 + 4 / 2 (as an unevaluated thunk). So far almost no computational resources have been spent. The whole expression will be computed only if its value is needed - say, if we wanted to print the result.
Why is this useful? Two reasons. First, if you accidentally include dead code, it doesn't weigh your program down, and thus the program can be a lot more efficient. Second, it allows you to do very cool things like efficiently calculating with infinite lists:
fun takeFirstThree(list) {
    return [list[0], list[1], list[2]]
}
takeFirstThree([0 ... infinity])
A call-by-value language would hang there, trying to create the infinite list from 0 to infinity. A lazy language will simply return [0,1,2].
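In a real lazy language this is a one-liner; in Haskell, for instance:

firstThree :: [Integer]
firstThree = take 3 [0 ..]   -- [0,1,2]; only the demanded prefix is ever built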
A simple, yet illustrative example:
function choose(cond, arg1, arg2) {
    if (cond)
        do_something(arg1);
    else
        do_something(arg2);
}

choose(true, 7*0, 7/0);
Now let's say we're using the eager evaluation strategy: then it would calculate both 7*0 and 7/0 eagerly. If it is a lazily evaluated strategy (call-by-need), then it would just send the expressions 7*0 and 7/0 through to the function without evaluating them.
The difference? You would expect to execute do_something(0), because the first argument gets used, although it actually depends on the evaluation strategy:
If the language evaluates eagerly, then it will, as stated, evaluate 7*0 and 7/0 first, and what's 7/0? Divide-by-zero error.
But if the evaluation strategy is lazy, it will see that it doesn't need to calculate the division, it will call do_something(0) as we were expecting, with no errors.
In this example, the lazy evaluation strategy can save the execution from producing errors. In a similar manner, it can save the execution from performing unnecessary evaluation that it won't use (the same way it didn't use 7/0 here).
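The same example runs safely in a lazy language. A Haskell sketch, using div, whose division-by-zero error only fires if the thunk is actually forced:

choose :: Bool -> a -> a -> a
choose cond a b = if cond then a else b

demo :: Int
demo = choose True (7 * 0) (7 `div` 0)
-- demo is 0; the failing division is never evaluated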
Here's a concrete example for a bunch of different evaluation strategies written in C. I'll specifically go over the difference between call-by-name, call-by-value, and call-by-need, which is kind of a combination of the previous two, as suggested by Ryan's answer.
#include <stdio.h>

int x = 1;
int y[3] = {1, 2, 3};
int i = 0;
int k = 0;
int j = 0;

int foo(int a, int b, int c) {
    i = i + 1;
    // 2 for call-by-name
    // 1 for call-by-value, call-by-value-result, and call-by-reference
    // unsure what call-by-need will do here; will likely be 2, but could have evaluated earlier than needed
    printf("a is %i\n", a);
    b = 2;
    // 1 for call-by-value and call-by-value-result
    // 2 for call-by-reference, call-by-need, and call-by-name
    printf("x is %i\n", x);
    // this triggers multiple increments of k for call-by-name
    j = c + c;
    // we don't actually care what j is, we just don't want it to be optimized out by the compiler
    printf("j is %i\n", j);
    // 2 for call-by-name
    // 1 for call-by-need, call-by-value, call-by-value-result, and call-by-reference
    printf("k is %i\n", k);
    return 0; // give the int-returning function a return value
}

int main() {
    int ans = foo(y[i], x, k++);
    // 2 for call-by-value-result, call-by-name, call-by-reference, and call-by-need
    // 1 for call-by-value
    printf("x is %i\n", x);
    return 0;
}
The part we're most interested in is the fact that foo is called with k++ as the actual parameter for the formal parameter c.
Note that how the ++ postfix operator works is that k++ returns k at first, and then increments k by 1. That is, the result of k++ is just k. (But, then after that result is returned, k will be incremented by 1.)
We can ignore all of the code inside foo up until the line j = c + c (the second section).
Here's what happens for this line under call-by-value:
When the function is first called, before it encounters the line j = c + c, because we're doing call-by-value, c will have the value of evaluating k++. Since evaluating k++ returns k, and k is 0 (from the top of the program), c will be 0. However, we did evaluate k++ once, which will set k to 1.
The line becomes j = 0 + 0, which behaves exactly like how you'd expect, by setting j to 0 and leaving c at 0.
Then, when we run printf("k is %i\n", k); we get that k is 1, because we evaluated k++ once.
Here's what happens for the line under call-by-name:
Since the line contains c and we're using call-by-name, we replace the text c with the text of the actual argument, k++. Thus, the line becomes j = (k++) + (k++).
We then run j = (k++) + (k++). One of the (k++)s will be evaluated first, returning 0 and setting k to 1. Then, the second (k++) will be evaluated, returning 1 (because k was set to 1 by the first evaluation of k++), and setting k to 2. Thus, we end up with j = 0 + 1 and k set to 2.
Then, when we run printf("k is %i\n", k);, we get that k is 2 because we evaluated k++ twice.
Finally, here's what happens for the line under call-by-need:
When we encounter j = c + c; we recognize that this is the first time the parameter c is evaluated. Thus we need to evaluate its actual argument (once) and store that value to be the evaluation of c. Thus, we evaluate the actual argument k++, which will return k, which is 0, and therefore the evaluation of c will be 0. Then, since we evaluated k++, k will be set to 1. We then use this stored evaluation as the evaluation for the second c. That is, unlike call-by-name, we do not re-evaluate k++. Instead, we reuse the previously evaluated initial value for c, which is 0. Thus, we get j = 0 + 0; just as if c was pass-by-value. And, since we only evaluated k++ once, k is 1.
As explained in the previous step, j = c + c is j = 0 + 0 under call-by-need, and it runs exactly as you'd expect.
When we run printf("k is %i\n", k);, we get that k is 1 because we only evaluated k++ once.
Hopefully this helps to differentiate how call-by-value, call-by-name, and call-by-need work. If it would be helpful to differentiate call-by-value and call-by-need more clearly, let me know in a comment and I'll explain the code earlier on in foo and why it works the way it does.
I think this line from Wikipedia sums things up nicely:
Call by need is a memoized variant of call by name, where, if the function argument is evaluated, that value is stored for subsequent use. If the argument is pure (i.e., free of side effects), this produces the same results as call by name, saving the cost of recomputing the argument.
