IF statements in R - Always nested? - excel

I'm just now starting to dive into IF statements in R. From what I see in the CRAN documentation on IF statements, it looks like all IF statements must be nested.
Is this true? If it is, this IF/THEN structure is more like Excel's and, I think, not as straightforward as Ruby's or Python's IF/THEN logic. Am I interpreting this correctly?
In EXCEL (the gui, not VBA), you must run a formula like this:
#IF Statement 1
=IF(A1<20, A1*1,
#IF Statement 2
IF(A1<50, A1*2,
#IF Statement 3
IF(A1<100, A1*3, A1*4)
) #Closes IF Statement 2
) #Closes IF Statement 1
Nested IF/THEN formulas are complicated because you have to ensure you close the functions properly.
This next part - I'm not 100% sure on, as I am a beginner in both languages, but... In Ruby or Python, you can explicitly write an IF function in a more structured manner:
This is much simpler and explicit.
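For illustration, a chained conditional with the same thresholds as the Excel formula might look like this in Python. This is only a sketch; the function name scale is made up here and is not from any snippet in the question.

```python
def scale(a1):
    """Multiply a1 by a factor chosen from the same thresholds as the Excel formula."""
    if a1 < 20:
        return a1 * 1
    elif a1 < 50:
        return a1 * 2
    elif a1 < 100:
        return a1 * 3
    else:
        return a1 * 4

print(scale(10), scale(30), scale(75), scale(150))  # 10 60 225 600
```

Each branch is closed by the next elif or else, so there are no trailing parentheses to balance.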
Am I missing a proper way to run this in R, or is it that complicated? Is there a good resource that I have not found yet on IF/THEN/Loop for R?

There are actually two forms of if-else flow-control logic available in R.
The if statement is, to a first approximation, much like C, C++, or Java's if. Just like in those languages, you can chain ifs in sequence.
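if (test) {
  result1    # runs when test is TRUE
} else if (test2) {
  result2    # runs when test is FALSE and test2 is TRUE
} else result3    # runs when both are FALSE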
R also has the ifelse function, which is indeed much like Excel's =IF. The rough equivalent of the if-elseif-else above would be
ifelse(test, result1, ifelse(test2, result2, result3))
A key difference is that in the second example, test, result1, result2 and result3 are all vectors.
You should use the first if you want to do the same set of operations on your entire dataset, but which set depends on a test. The second is meant for vectorised calculations, where you want to carry out different operations on each element of a vector.
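The element-wise behaviour of the second form can be sketched in Python. The helper ifelse below is hypothetical and only loosely modelled on R's function (it assumes equal-length lists and does no recycling); the sample data is made up.

```python
def ifelse(tests, yes, no):
    """Element-wise choice, loosely modelled on R's ifelse (equal-length lists assumed)."""
    return [y if t else n for t, y, n in zip(tests, yes, no)]

a = [10, 30, 75, 150]

# Nested element-wise choice, analogous to
# ifelse(test, result1, ifelse(test2, result2, result3)).
result = ifelse([v < 20 for v in a], [v * 1 for v in a],
                ifelse([v < 50 for v in a], [v * 2 for v in a],
                       [v * 3 for v in a]))
print(result)  # [10, 60, 225, 450]
```

Here every argument is a whole list, and the choice is made separately for each position, which is the key difference from a plain if.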

Many new users of R are confused about if. It evaluates only a single value and then executes either the expression that follows or the else clause. In R the ifelse function is generally what former SAS, Excel, and SPSS users are going to want and it will support nesting. There is the switch function that might be helpful in some instances, although I do not see how your set of non-exclusive logical conditions would immediately fit into its logic.
In your case, I would think instead about using the findInterval function. This would accomplish the combined operations of logical and mathematical operation in your example (and would return a vector if "A" were a vector) :
A*( 1+ findInterval( A, c(20,50,100) ) ) # OR
A*findInterval( A, c(-Inf, 20, 50, 100) ) # the equivalent using -Inf (no "1+" needed, since every value already falls in interval 1 or higher)
Thinking about it a bit further, the findInterval function could also be used as the first argument to switch if you wanted a function to be applied to "A".
(Further comment: I was assuming that your "A1" expression would get copied down a column or row of cells in an Excel spreadsheet and would in the process have the row or column references incremented in the particular automagical manner that Excel supports, becoming A2, A3, etc. That is a different programming perspective than any of the more general languages you are comparing to. Operations on R vectors are analogous but would not generally need the "1", "2", "3" ... entries and so I omitted them from the code.)
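The same interval-lookup idea can be sketched in Python with the standard bisect module. This is an analogy rather than a translation of findInterval; the names breaks and scale_by_interval are made up for the example.

```python
from bisect import bisect_right

breaks = [20, 50, 100]

def scale_by_interval(values):
    # bisect_right counts how many breakpoints are <= v, which matches
    # findInterval's default left-closed intervals for sorted breakpoints.
    return [v * (1 + bisect_right(breaks, v)) for v in values]

print(scale_by_interval([10, 20, 75, 150]))  # [10, 40, 225, 600]
```

The lookup replaces the whole chain of comparisons, so adding a threshold only means adding one number to breaks.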

I am not sure I understand the question, but a natural R equivalent of your Excel code would be
if (a1 < 20)
  a1 * 1
else if (a1 < 50)
  a1 * 2
else if (a1 < 100)
  a1 * 3
else
  a1 * 4
And you could put curly braces around the a1 * n expressions if you wanted. However, if a1 is a vector rather than a scalar, you probably want to evaluate the comparisons in parallel for all vector elements, which is done with ifelse, which does nest like your Excel construct:
ifelse(a1 < 20, a1 * 1,
ifelse(a1 < 50, a1 * 2,
ifelse(a1 < 100, a1 * 3,
a1 * 4)))
A third way to write it, for vector a1, takes advantage of logical indexing:
a2 <- a1 # take a copy
a2[a1 >= 20 & a1 < 50] <- a1[a1 >= 20 & a1 < 50] * 2
a2[a1 >= 50 & a1 < 100] <- a1[a1 >= 50 & a1 < 100] * 3
a2[a1 >= 100 ] <- a1[a1 >= 100 ] * 4


How to use conditions while operating on dataframes in julia

I am trying to find the mean of the values in one column of a DataFrame, taken over the rows where either of two conditions is true. For example:
Using Statistics
df = DataFrame(value, xi, xj)
resulted_mean = []
for i in range(ncol(df))
push!(resulted_mean, mean(df[:value], (:xi == i | :xj == i)))
Here, I check whether either xi or xj equals i, and then take the mean of all the corresponding values stored in the :value column. This mean is then pushed to the array resulted_mean.
However, this code is not producing the desired output.
Please suggest the optimal approach to fix this code snippet.
Thanks in advance.
I agree with Bogumił's comment, you should really consult the Julia documentation to get a basic understanding of the language, and then run through the DataFrames tutorials. I will however annotate your code to point out some of the issues so you might be able to target your learning a bit better:
Using Statistics
Julia (like most other languages) is case sensitive, so writing Using is not the same as the reserved keyword using, which is used to bring package definitions into your namespace. The relevant docs entry is here
Note also that you are using the DataFrames package, so to make your code reproducible you would have had to do using DataFrames, Statistics.
df = DataFrame(value, xi, xj)
It's unclear what this line is supposed to do as the arguments passed to the constructor are undefined, but assuming value, xi and xj are vectors of numbers, this isn't a correct way to construct a DataFrame:
julia> value = rand(10); xi = repeat(1:2, 5); xj = rand(1:2, 10);
julia> df = DataFrame(value, xi, xj)
ERROR: MethodError: no method matching DataFrame(::Vector{Float64}, ::Vector{Int64}, ::Vector{Int64})
You can read about constructors in the docs here, the most common approach for a DataFrame with only few columns like here would probably be:
julia> df = DataFrame(value = value, xi = xi, xj = xj)
10×3 DataFrame
 Row │ value     xi     xj
     │ Float64   Int64  Int64
─────┼────────────────────────
   1 │ 0.539533      1      2
   2 │ 0.652752      2      1
   3 │ 0.481461      1      2
   ⋮
Then you have
resulted_mean = []
I would say in this case the overall approach of preallocating a vector and pushing to it in a loop isn't ideal as it adds a lot of verbosity for no reason (see below), but as a general remark you should avoid untyped arrays in Julia:
julia> resulted_mean = []
Any[]
Here the Any means that the array can hold values of any type (floating point numbers, integers, strings, probability distributions...), which means the compiler cannot anticipate what the actual content will be from looking at the code, leading to suboptimal machine code being generated. In doing so, you negate the main advantage that Julia has over e.g. base Python: the rich type system combined with a lot of compiler optimizations allow generation of highly efficient machine code while keeping the language dynamic. In this case, you know that you want to push the results of the mean function to the results vector, which will be a floating point number, so you should use:
julia> resulted_mean = Float64[]
That said, I wouldn't recommend pushing in a loop here at all (see below).
Your loop is:
for i in range(ncol(df))
A few issues with this:
Loops in Julia require an end, unlike in Python where their end is determined based on code indentation
range is a different function in Julia than in Python:
julia> range(5)
ERROR: ArgumentError: At least one of `length` or `stop` must be specified
You can learn about functions using the REPL help mode (type ? at the REPL prompt to access it):
help?> range
search: range LinRange UnitRange StepRange StepRangeLen trailing_zeros AbstractRange trailing_ones OrdinalRange AbstractUnitRange AbstractString
range(start[, stop]; length, stop, step=1)
Given a starting value, construct a range either by length or from start to stop, optionally with a given step (defaults to 1, a UnitRange). One of length or stop is required. If length, stop, and step are all specified, they must
So you'd need to do something like
julia> range(1, 5, step = 1)
That said, for simple ranges like this you can use the colon operator: 1:5 is the same as range(1, 5, step = 1).
You then iterate over integers from 1 to ncol(df) - you might want to check whether this is what you're actually after, as it seems unusual to me that the values in the xi and xj columns (on which you filter in the loop) would be related to the number of columns in your DataFrame (which is 3).
In the loop, you do
push!(resulted_mean, mean(df[:value], (:xi == i | :xj == i)))
which again has a few problems: first of all you are passing the subsetting condition for your DataFrame to the mean function, which doesn't work:
julia> mean(rand(10), rand(Bool, 10))
ERROR: MethodError: objects of type Vector{Float64} are not callable
The subsetting condition itself has two issues as well: when you write :xi, there is no way for Julia to know that you are referring to the DataFrame column xi, so all you're doing is comparing the Symbol :xi to the value of i, which will always return false:
julia> :xi == 2
false
Furthermore, note that | has a higher precedence than ==, so if you want to combine two equality checks with or you need brackets:
julia> 1 == 1 | 2 == 2
false
julia> (1 == 1) | (2 == 2)
true
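Python happens to share this precedence rule, so the same pitfall can be reproduced there. The sketch below is only a side-by-side illustration; the variable names are made up.

```python
# In Python, as in Julia, | binds more tightly than ==.
unbracketed = 1 == 1 | 2 == 2      # parsed as 1 == (1 | 2) == 2, i.e. 1 == 3 == 2
bracketed = (1 == 1) | (2 == 2)    # True | True

print(unbracketed, bracketed)  # False True
```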
More things could be said about your code snippet, but I hope this gives you an idea of where your gaps in understanding are and how you might go about closing them.
For completeness, here's how I would approach your problem - I'm interpreting your code to mean "calculate the mean of the value column, grouped by each value of xi and xj, but only where xi equals xj":
julia> combine(groupby(df[df.xi .== df.xj, :], [:xi, :xj], sort = true), :value => mean => :resulted_mean)
2×3 DataFrame
 Row │ xi     xj     resulted_mean
     │ Int64  Int64  Float64
─────┼─────────────────────────────
   1 │     1      1       0.356811
   2 │     2      2       0.977041
This is probably the most common analysis pattern for DataFrames, and is explained in the tutorial that Bogumił mentioned as well as in the DataFrames docs here.
As I said up front, if you want to use Julia productively, I recommend that you spend some time reading the documentation both for the language itself as well as for any of the key packages you're using. While Julia has some similarities to Python, and some bits in the DataFrames package have an API that resemble things you might have seen in R, it is a language in its own right that is fundamentally different from both Python and R (or any other language for that matter), and there's no way around familiarizing yourself with how it actually works.
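For comparison, the condition as literally stated in the question (the mean of value over the rows where either xi or xj equals i) can be sketched in plain Python. This is not the grouped interpretation used above; the sample rows and the helper name mean_where_either are made up for the example.

```python
from statistics import mean

# Made-up sample data: one dict per row.
rows = [
    {"value": 1.0, "xi": 1, "xj": 2},
    {"value": 2.0, "xi": 2, "xj": 1},
    {"value": 3.0, "xi": 2, "xj": 2},
    {"value": 6.0, "xi": 1, "xj": 1},
]

def mean_where_either(rows, i):
    """Mean of 'value' over the rows where xi == i or xj == i."""
    selected = [r["value"] for r in rows if r["xi"] == i or r["xj"] == i]
    return mean(selected)

resulted_mean = {i: mean_where_either(rows, i) for i in (1, 2)}
print(resulted_mean)  # {1: 3.0, 2: 2.0}
```

Note that the loop runs over the candidate values of i, not over the number of columns.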

On a dataset made up of dictionaries, how do I multiply the elements of each dictionary with Python?

I started coding in Python 4 days ago, so I'm a complete newbie. I have a dataset that comprises an undefined number of dictionaries. Each dictionary is the x and y of a point in the coordinates.
I'm trying to compute the sum of the x*y products by nesting the loop that multiplies x and y within the loop that sums the products.
However I haven't been able to figure out how to multiply the values for the two keys in each dictionary (so far I only got to multiply all the x*y)
So far I've got this:
If my data set were to be d= [{'x':0, 'y':0}, {'x':1, 'y':1}, {'x':2, 'y':3}]
I've got the code for the function that calculates the product of each pair of x and y:
def product_xy (product_x_per_y):
    prod_xy =[]
    n = 0
    for i in range (len(d)):
        result = d[n]['x']*d[n]['y']
        n + 1
    return prod_xy
I also have the function to add up the elements of a list (like prod_xy):
def total_xy_prod (sum_prod):
    all = 0
    for s in sum_prod:
        all+= s
    return all
I've been trying to find a way to nest these two functions so that I can iterate through the multiplication of each x*y and then add up all the products.
Make sure your code works as expected
First, your functions have a few mistakes. For example, in product_xy, you assign n=0, and later do n + 1; you probably meant to do n += 1 instead of n + 1. But n is also completely unnecessary; you can simply use the i from the range iteration to replace n like so: result = d[i]['x']*d[i]['y']
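A corrected version might look like the sketch below. Two choices here are assumptions rather than anything from the question: the list is passed in as the parameter points instead of being read from the global d, and each product is appended to the result list so that the function returns something useful.

```python
def product_xy(points):
    """Return the list of x*y products, one per dictionary in points."""
    prod_xy = []
    for i in range(len(points)):
        # Index with the loop variable directly; no separate counter is needed.
        prod_xy.append(points[i]['x'] * points[i]['y'])
    return prod_xy

d = [{'x': 0, 'y': 0}, {'x': 1, 'y': 1}, {'x': 2, 'y': 3}]
print(product_xy(d))  # [0, 1, 6]
```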
Nesting these two functions: part 1
To answer your question, it's fairly straightforward to get the sum of the products of the elements from your current code:
coord_sum = total_xy_prod(product_xy(d))
Nesting these two functions: part 2
However, there is a much shorter and more efficient way to tackle this problem. For one, Python provides the built-in function sum() to sum the elements of a list (and other iterables), so there's no need to create total_xy_prod. Our code could at this point read as follows:
coord_sum = sum(product_xy(d))
But product_xy is also unnecessarily long and inefficient, and we could also replace it entirely with a shorter expression. In this case, the shortening comes from generator expressions, which are basically compact for-loops. The Python docs give some of the basic details of how the syntax works at list comprehensions, which are distinct, but closely related to generator expressions. For the purposes of answering this question, I will simply present the final, most simplified form of your desired result:
coord_sum = sum(e['x'] * e['y'] for e in d)
Here, the generator expression iterates through every element in d (using for e in d), multiplies the numbers stored in the dictionary keys 'x' and 'y' of each element (using e['x'] * e['y']), and then sums each of those products from the entire sequence.
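To make the relationship between the two forms concrete, here is a small sketch using the sample data from the question. The variable name products is made up; the rest follows the expression above.

```python
d = [{'x': 0, 'y': 0}, {'x': 1, 'y': 1}, {'x': 2, 'y': 3}]

# List comprehension: builds the whole list of products in memory.
products = [e['x'] * e['y'] for e in d]

# Generator expression: hands sum() one product at a time, no intermediate list.
coord_sum = sum(e['x'] * e['y'] for e in d)

print(products, coord_sum)  # [0, 1, 6] 7
```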
There is also some documentation on generator expressions, but it's a bit technical, so it's probably not approachable for the Python beginner.

Calculate (If A=B, then "" else A) in Excel without evaluating A twice

I'm intending to conduct a formula of the type:
where VOL is a function I'm using through an Add-In. One limitation of this Add-In is that it is prohibited to call two Add-In functions inside a single formula, i.e. the formula I've written above is invalid and will result in an error.
Is there a way of achieving the following:
=IF(LHS=RHS;"Value if True";LHS) (2)
where LHS is Left hand side, RHS right hand side and the expression therefore checks if LHS is equal to RHS, and if so prints a corresponding value, else LHS, without having Excel evaluate LHS twice?
I haven't found any solution to this except putting the formula in one cell and referring to that cell as the value to print if the logical expression in the IF statement is false, but this will become quite extensive "double work". A solution like (2) would also be more readable, especially when LHS is of the type "'C:\pathtofile[filename]SheetName!'Cell".
I hope someone has a clever solution to this.
Here is one (rather ugly) way, just using formulas:
This makes use of the IFERROR function, which kind of does what you want but only tests for errors. Division by zero results in an error, so the inner IFERROR returns zero if VOL is zero, and 1/VOL otherwise. Now we need to take the reciprocal again to return the original value, so we repeat the trick, this time returning "" if there is an error.
If you want to test for another value (e.g. 3), just use something like:
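The logic of the double-reciprocal trick can be sketched outside Excel. The Python below is only an illustration of the idea, not a worksheet formula: iferror, blank_if_equal and vol are hypothetical stand-ins, and the target parameter covers the "test for another value" variant by shifting the value before taking reciprocals.

```python
def iferror(compute, fallback):
    """Loosely mimic Excel's IFERROR: return compute(), or fallback on division by zero."""
    try:
        return compute()
    except ZeroDivisionError:
        return fallback

def blank_if_equal(value_fn, target=0):
    """Return "" when value_fn() equals target, else the value; value_fn is called once."""
    inner = iferror(lambda: 1 / (value_fn() - target), 0)   # 0 when the value equals target
    return iferror(lambda: 1 / inner + target, "")          # "" when inner is 0

calls = []

def vol():
    calls.append(1)   # record each evaluation of the stand-in add-in function
    return 4

print(blank_if_equal(vol), len(calls))
print(repr(blank_if_equal(lambda: 0)))
print(blank_if_equal(lambda: 5, target=3), repr(blank_if_equal(lambda: 3, target=3)))
```

Because the value goes through two floating-point reciprocals, the result may differ from the original in the last digits for some inputs.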
A much neater way would be to create a function in VBA which wraps the VOL function and does what you want:
Public Function MyVol(varSite As Variant, varDate As Variant) As Variant
    MyVol = vol(varSite, varDate)
    If MyVol = 0 Then MyVol = ""
End Function
Assuming you can call VOL from VBA.

Excel VBA failure of repeated Evaluate method

I have written a little tool in VBA that charts a function you pass it as a string (e.g. "1/(1+x)" or "exp(-x^2)"). I use the built-in Evaluate method to parse the formula. The nub of it is this function, which evaluates a function of some variable at a given value:
Function eval(func As String, variable As String, value As Double) As Double
eval = Evaluate(Replace(func, variable, value))
End Function
This works fine, e.g. eval("x^2", "x", 2) = 4. I apply it element-wise down an array of x values to generate the graph of the function.
Now I want to enable my tool to chart the definite integral of a function. I have created an integrate function which takes an input formula string and uses Evaluate to evaluate it at various points and approximate the integral. My actual integrate function uses the trapezoidal rule, but for simplicity's sake let's suppose it is this:
Function integrate(func As String, variable As String, value As Double) As Double
integrate = value * (eval(func, variable, 0) + eval(func, variable, value)) / 2
End Function
This also works as expected, e.g. integrate("t", "t", 2) = 2 for the area of the triangle under the identity function.
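For reference, the intended computation (evaluate an expression string at a point, then apply the trapezoidal rule) can be sketched in Python. This does not reproduce or explain the Excel behaviour; the names eval_at and steps are made up, expressions must use Python syntax (** rather than ^), and calling eval on untrusted strings is unsafe.

```python
import math

def eval_at(func, variable, value):
    """Evaluate the expression string func with variable bound to value."""
    names = {variable: value, "exp": math.exp}
    return eval(func, {"__builtins__": {}}, names)

def integrate(func, variable, upper, steps=100):
    """Composite trapezoidal rule for func on the interval [0, upper]."""
    h = upper / steps
    total = 0.5 * (eval_at(func, variable, 0.0) + eval_at(func, variable, upper))
    for k in range(1, steps):
        total += eval_at(func, variable, k * h)
    return total * h

print(eval_at("x**2", "x", 2.0))   # 4.0
print(integrate("t", "t", 2.0))    # approximately 2.0
```

Binding the variable through a namespace, instead of replacing text in the formula, avoids accidentally rewriting letters inside function names such as exp.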
The problem arises when I try to run integrate through the charting routine. When VBA encounters a line like this
eval("integrate(""t"",""t"",x)", "x", 2)
then it will stop with no error warning when Evaluate is called inside the eval function. (The internal quotes have to be doubled up to read the formula properly.) I expect to get the value 2 since Evaluate appears to try and evaluate integrate("t", "t", 2)
I suspect the problem is with the second call to Evaluate inside integrate, but I've been going round in circles trying to figure it out. I know Evaluate is finicky and poorly documented http://fastexcel.wordpress.com/2011/11/02/evaluate-functions-and-formulas-fun-how-to-make-excels-evaluate-method-twice-as-fast but can anyone think of a way round this?
Excel 2010 V14, VBA 7.0
Thanks Chris, your Debug.Print suggestion got me thinking and I narrowed the problem down a bit more. It does seem like Evaluate gets called twice, as this example shows:
Function g() As Variant
Debug.Print "g"
g = 1
End Function
Run from the Immediate Window:
I found this http://www.decisionmodels.com/calcsecretsh.htm which shows a way round this by using Worksheet.Evaluate (a bare Evaluate defaults to Application.Evaluate):
However this still doesn't solve the problem with Evaluate calling itself. Define
Function f() As Variant
Debug.Print "f"
f = ActiveSheet.Evaluate("g()+0")
End Function
Then in the Immediate Window:
Error 2015
The solution I found was to define a different function for the second formula evaluation:
Function eval2(formula As String) As Variant
[A1] = "=" & formula
eval2 = [A1]
End Function
This still uses Excel's internal evaluation mechanism, but via a worksheet cell calculation. Then I get what I want:
It's slower due to the repeated worksheet hits, but that's the best I can do. So in my original example, I use eval to calculate the integral and eval2 to chart it. Still interested if anyone has any other suggestions.

What is call-by-need?

I want to know what call-by-need is.
I searched Wikipedia and found this: http://en.wikipedia.org/wiki/Evaluation_strategy,
but I could not understand it properly.
If anyone can explain it with an example and point out the difference from call-by-value, it would be a great help.
Suppose we have the function
square(x) = x * x
and we want to evaluate square(1+2).
In call-by-value, we do
square(1+2)
square(3)
3*3
9
In call-by-name, we do
square(1+2)
(1+2)*(1+2)
3*(1+2)
3*3
9
Notice that since we use the argument twice, we evaluate it twice. That would be wasteful if the argument evaluation took a long time. That's the issue that call-by-need fixes.
In call-by-need, we do something like the following:
square(1+2)
let x = 1+2 in x*x
let x = 3 in x*x
3*3
9
In step 2, instead of copying the argument (like in call-by-name), we give it a name. Then in step 3, when we notice that we need the value of x, we evaluate the expression for x. Only then do we substitute.
BTW, if the argument expression produced something more complicated, like a closure, there might be more shuffling of lets around to eliminate the possibility of copying. The formal rules are somewhat complicated to write down.
Notice that we "need" values for the arguments to primitive operations like + and *, but for other functions we take the "name, wait, and see" approach. We would say that the primitive arithmetic operations are "strict". It depends on the language, but usually most primitive operations are strict.
Notice also that "evaluation" still means to reduce to a value. A function call always returns a value, not an expression. (One of the other answers got this wrong.) OTOH, lazy languages usually have lazy data constructors, which can have components that are evaluated on need, i.e., when extracted. That's how you can have an "infinite" list: the value you return is a lazy data structure. But call-by-need vs call-by-value is a separate issue from lazy vs strict data structures. Scheme has lazy data constructors (streams), although since Scheme is call-by-value, the constructors are syntactic forms, not ordinary functions. And Haskell is non-strict (call-by-need in typical implementations), but it has ways of defining strict data types.
If it helps to think about implementations, then one implementation of call-by-name is to wrap every argument in a thunk; when the argument is needed, you call the thunk and use the value. One implementation of call-by-need is similar, but the thunk is memoizing; it only runs the computation once, then it saves it and just returns the saved answer after that.
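The two thunk flavours can be sketched in Python. This is a toy model under stated assumptions, not how any particular language implements them: the class names NameThunk and NeedThunk, the force method, and the evaluations log are all made up for the illustration.

```python
class NameThunk:
    """Call-by-name: re-run the computation every time the value is requested."""
    def __init__(self, compute):
        self.compute = compute

    def force(self):
        return self.compute()

class NeedThunk:
    """Call-by-need: run the computation at most once and cache the result."""
    def __init__(self, compute):
        self.compute = compute
        self.done = False
        self.value = None

    def force(self):
        if not self.done:
            self.value = self.compute()
            self.done = True
        return self.value

def square(x):
    return x.force() * x.force()   # the argument is used twice

evaluations = []

def argument():
    evaluations.append("1+2")      # log each time the argument is actually computed
    return 1 + 2

by_name = square(NameThunk(argument))
name_count = len(evaluations)
evaluations.clear()
by_need = square(NeedThunk(argument))
need_count = len(evaluations)

print(by_name, name_count, by_need, need_count)  # 9 2 9 1
```

Both calls return 9; the difference is only in how many times 1+2 was computed.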
Imagine a function:
fun add(a, b) {
  return a + b
}
And then we call it:
add(3 * 2, 4 / 2)
In a call-by-value (eager) language this is evaluated like so:
a = 3 * 2 = 6
b = 4 / 2 = 2
return a + b = 6 + 2 = 8
The function will return the value 8.
In a call-by-need language (also called a lazy language) this is evaluated like so:
a = 3 * 2
b = 4 / 2
return a + b = 3 * 2 + 4 / 2
The function will return the expression 3 * 2 + 4 / 2. So far almost no computational resources have been spent. The whole expression will be computed only if its value is needed - say we wanted to print the result.
Why is this useful? Two reasons. First, if you accidentally include dead code, it doesn't weigh your program down, so the program can be a lot more efficient. Second, it allows you to do very cool things like efficiently calculating with infinite lists:
fun takeFirstThree(list) {
  return [list[0], list[1], list[2]]
}
takeFirstThree([0 ... infinity])
An eager (call-by-value) language would hang there trying to create a list from 0 to infinity. A lazy language will simply return [0,1,2].
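Python itself is eager, but its iterators are produced on demand, which is enough to sketch the infinite-list idea. The function name take_first_three mirrors the pseudocode above; itertools.count and itertools.islice are standard-library functions.

```python
from itertools import count, islice

def take_first_three(items):
    # islice pulls only three elements, so the infinite source is never exhausted.
    return list(islice(items, 3))

print(take_first_three(count(0)))  # [0, 1, 2]
```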
A simple, yet illustrative example:
function choose(cond, arg1, arg2) {
  if (cond) do_something(arg1);
  else do_something(arg2);
}
choose(true, 7*0, 7/0);
Now let's say we're using the eager evaluation strategy; then it would calculate both 7*0 and 7/0 eagerly. If it is a lazy evaluation strategy (call-by-need), then it would just send the expressions 7*0 and 7/0 through to the function without evaluating them.
The difference? You would expect do_something(0) to be executed because the first argument gets used, but what happens actually depends on the evaluation strategy:
If the language evaluates eagerly, then it will, as stated, evaluate 7*0 and 7/0 first, and what's 7/0? Divide-by-zero error.
But if the evaluation strategy is lazy, it will see that it doesn't need to calculate the division, it will call do_something(0) as we were expecting, with no errors.
In this example, the lazy evaluation strategy can save the execution from producing errors. In a similar manner, it can save the execution from performing unnecessary evaluation that it won't use (the same way it didn't use 7/0 here).
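In an eager language, laziness can be simulated by passing zero-argument functions instead of values. The Python sketch below assumes that do_something simply returns its argument, which the original example does not specify.

```python
def do_something(x):
    return x   # placeholder body; an assumption made for this sketch

def choose(cond, arg1, arg2):
    # arg1 and arg2 are zero-argument functions; only the chosen one is called.
    return do_something(arg1() if cond else arg2())

lazy_result = choose(True, lambda: 7 * 0, lambda: 7 / 0)

try:
    eager_result = (7 * 0, 7 / 0)   # eager: both arguments are computed up front
except ZeroDivisionError:
    eager_result = "ZeroDivisionError"

print(lazy_result, eager_result)  # 0 ZeroDivisionError
```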
Here's a concrete example for a bunch of different evaluation strategies written in C. I'll specifically go over the difference between call-by-name, call-by-value, and call-by-need, which is kind of a combination of the previous two, as suggested by Ryan's answer.
#include <stdio.h>
int x = 1;
int y[3] = {1, 2, 3};
int i = 0;
int k = 0;
int j = 0;
int foo(int a, int b, int c) {
    i = i + 1;
    // 2 for call-by-name
    // 1 for call-by-value, call-by-value-result, and call-by-reference
    // unsure what call-by-need will do here; will likely be 2, but could have evaluated earlier than needed
    printf("a is %i\n", a);
    b = 2;
    // 1 for call-by-value and call-by-value-result
    // 2 for call-by-reference, call-by-need, and call-by-name
    printf("x is %i\n", x);
    // this triggers multiple increments of k for call-by-name
    j = c + c;
    // we don't actually care what j is, we just don't want it to be optimized out by the compiler
    printf("j is %i\n", j);
    // 2 for call-by-name
    // 1 for call-by-need, call-by-value, call-by-value-result, and call-by-reference
    printf("k is %i\n", k);
    return j; // returned only so that foo yields a value
}
int main() {
    int ans = foo(y[i], x, k++);
    // 2 for call-by-value-result, call-by-name, call-by-reference, and call-by-need
    // 1 for call-by-value
    printf("x is %i\n", x);
    return 0;
}
The part we're most interested in is the fact that foo is called with k++ as the actual parameter for the formal parameter c.
Note that how the ++ postfix operator works is that k++ returns k at first, and then increments k by 1. That is, the result of k++ is just k. (But, then after that result is returned, k will be incremented by 1.)
We can ignore all of the code inside foo up until the line j = c + c (the second section).
Here's what happens for this line under call-by-value:
When the function is first called, before it encounters the line j = c + c, because we're doing call-by-value, c will have the value of evaluating k++. Since evaluating k++ returns k, and k is 0 (from the top of the program), c will be 0. However, we did evaluate k++ once, which will set k to 1.
The line becomes j = 0 + 0, which behaves exactly like how you'd expect, by setting j to 0 and leaving c at 0.
Then, when we run printf("k is %i\n", k); we get that k is 1, because we evaluated k++ once.
Here's what happens for the line under call-by-name:
Since the line contains c and we're using call-by-name, we replace the text c with the text of the actual argument, k++. Thus, the line becomes j = (k++) + (k++).
We then run j = (k++) + (k++). One of the (k++)s will be evaluated first, returning 0 and setting k to 1. Then, the second (k++) will be evaluated, returning 1 (because k was set to 1 by the first evaluation of k++), and setting k to 2. Thus, we end up with j = 0 + 1 and k set to 2.
Then, when we run printf("k is %i\n", k);, we get that k is 2 because we evaluated k++ twice.
Finally, here's what happens for the line under call-by-need:
When we encounter j = c + c; we recognize that this is the first time the parameter c is evaluated. Thus we need to evaluate its actual argument (once) and store that value to be the evaluation of c. Thus, we evaluate the actual argument k++, which will return k, which is 0, and therefore the evaluation of c will be 0. Then, since we evaluated k++, k will be set to 1. We then use this stored evaluation as the evaluation for the second c. That is, unlike call-by-name, we do not re-evaluate k++. Instead, we reuse the previously evaluated initial value for c, which is 0. Thus, we get j = 0 + 0; just as if c was pass-by-value. And, since we only evaluated k++ once, k is 1.
As explained in the previous step, j = c + c is j = 0 + 0 under call-by-need, and it runs exactly as you'd expect.
When we run printf("k is %i\n", k);, we get that k is 1 because we only evaluated k++ once.
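Since C itself only supports call-by-value, the three walkthroughs above can be checked with a small simulation. The Python sketch below models only the line j = c + c with k++ as the argument; the function simulate and its strategy labels are made up for the illustration.

```python
def simulate(strategy):
    """Return (j, k) after running j = c + c with k++ as the argument for c."""
    state = {"k": 0}

    def k_postincrement():
        old = state["k"]
        state["k"] += 1
        return old                    # like k++, yields the old value

    if strategy == "value":
        c_value = k_postincrement()   # evaluated once, before the body runs
        c = lambda: c_value
    elif strategy == "name":
        c = k_postincrement           # re-evaluated at every use of c
    elif strategy == "need":
        cache = []
        def c():
            if not cache:
                cache.append(k_postincrement())   # evaluated at first use only
            return cache[0]
    else:
        raise ValueError(strategy)

    j = c() + c()
    return j, state["k"]

for s in ("value", "name", "need"):
    print(s, simulate(s))
# value (0, 1)
# name (1, 2)
# need (0, 1)
```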
Hopefully this helps to differentiate how call-by-value, call-by-name, and call-by-need work. If it would be helpful to differentiate call-by-value and call-by-need more clearly, let me know in a comment and I'll explain the code earlier on in foo and why it works the way it does.
I think this line from Wikipedia sums things up nicely:
Call by need is a memoized variant of call by name, where, if the function argument is evaluated, that value is stored for subsequent use. If the argument is pure (i.e., free of side effects), this produces the same results as call by name, saving the cost of recomputing the argument.
