R - Difference in subsetting with `subset()` vs. brackets [ ]? - subset

Why might I come up with very different answers for these two lines of code:
nrow(aql[(aql$`Land Use`=="RESIDENTIAL" & aql$`Location Setting`=="SUBURBAN"),])
[1] 4514
...and...
nrow(subset(aql, (`Location Setting`=="SUBURBAN" & `Land Use`=="RESIDENTIAL")))
[1] 3527

Its hard to say without a more reproducible example, but it is likely the influence of NA values in the Location Setting or Land Use variables.
The subset function explicitly strips them out, while [ does not.
R does some unintuitive things when comparing NA and TRUE/FALSE values, so be careful for something like this:
NA & FALSE
[1] FALSE
NA & TRUE
[1] NA

Related

How to use conditions while operating on dataframes in julia

I am trying to find the mean value of the dataframe's elements in corresponding to particular column when either of the condition is true. For example:
Using Statistics
df = DataFrame(value, xi, xj)
resulted_mean = []
for i in range(ncol(df))
push!(resulted_mean, mean(df[:value], (:xi == i | :xj == i)))
Here, I am checking when either xi or xj is equal to i then find the mean of the all the corresponding values stored in [:value] column. This mean will later be pushed to the array -> resulted_mean
However, this code is not producing the desired output.
Please suggest the optimal approach to fix this code snippet.
Thanks in advance.
I agree with Bogumił's comment, you should really consult the Julia documentation to get a basic understanding of the language, and then run through the DataFrames tutorials. I will however annotate your code to point out some of the issues so you might be able to target your learning a bit better:
Using Statistics
Julia (like most other languages) is case sensitive, so writing Usingis not the same as the reserved keyword using which is used to bring package definitions into your namespace. The relevant docs entry is here
Note also that you are using the DataFrames package, so to make your code reproducible you would have had to do using DataFrames, Statistics.
df = DataFrame(value, xi, xj)
It's unclear what this line is supposed to do as the arguments passed to the constructor are undefined, but assuming value, xi and xj are vectors of numbers, this isn't a correct way to construct a DataFrame:
julia> value = rand(10); xi = repeat(1:2, 5); xj = rand(1:2, 10);
julia> df = DataFrame(value, xi, xj)
ERROR: MethodError: no method matching DataFrame(::Vector{Float64}, ::Vector{Int64}, ::Vector{Int64})
You can read about constructors in the docs here, the most common approach for a DataFrame with only few columns like here would probably be:
julia> df = DataFrame(value = value, xi = xi, xj = xj)
10×3 DataFrame
Row │ value xi xj
│ Float64 Int64 Int64
─────┼────────────────────────
1 │ 0.539533 1 2
2 │ 0.652752 2 1
3 │ 0.481461 1 2
...
Then you have
resulted_mean = []
I would say in this case the overall approach of preallocating a vector and pushing to it in a loop isn't ideal as it adds a lot of verbosity for no reason (see below), but as a general remark you should avoid untyped arrays in Julia:
julia> resulted_mean = []
Any[]
Here the Any means that the array can hold values of any type (floating point numbers, integers, strings, probability distributions...), which means the compiler cannot anticipate what the actual content will be from looking at the code, leading to suboptimal machine code being generated. In doing so, you negate the main advantage that Julia has over e.g. base Python: the rich type system combined with a lot of compiler optimizations allow generation of highly efficient machine code while keeping the language dynamic. In this case, you know that you want to push the results of the mean function to the results vector, which will be a floating point number, so you should use:
julia> resulted_mean = Float64[]
Float64[]
That said, I wouldn't recommend pushing in a loop here at all (see below).
Your loop is:
for i in range(ncol(df))
...
A few issues with this:
Loops in Julia require an end, unlike in Python where their end is determined based on code indentation
range is a different function in Julia than in Python:
julia> range(5)
ERROR: ArgumentError: At least one of `length` or `stop` must be specified
You can learn about functions using the REPL help mode (type ? at the REPL prompt to access it):
help?> range
search: range LinRange UnitRange StepRange StepRangeLen trailing_zeros AbstractRange trailing_ones OrdinalRange AbstractUnitRange AbstractString
range(start[, stop]; length, stop, step=1)
Given a starting value, construct a range either by length or from start to stop, optionally with a given step (defaults to 1, a UnitRange). One of length or stop is required. If length, stop, and step are all specified, they must
agree.
...
So you'd need to do something like
julia> range(1, 5, step = 1)
1:1:5
That said, for simple ranges like this you can use the colon operator: 1:5 is the same as `range(1, 5, step = 1).
You then iterate over integers from 1 to ncol(df) - you might want to check whether this is what you're actually after, as it seems unusual to me that the values in the xi and xj columns (on which you filter in the loop) would be related to the number of columns in your DataFrame (which is 3).
In the loop, you do
push!(resulted_mean, mean(df[:value], (:xi == i | :xj == i)))
which again has a few problems: first of all you are passing the subsetting condition for your DataFrame to the mean function, which doesn't work:
julia> mean(rand(10), rand(Bool, 10))
ERROR: MethodError: objects of type Vector{Float64} are not callable
The subsetting condition itself has two issues as well: when you write :xi, there is no way for Julia to know that you are referring to the DataFrame column xi, so all you're doing is comparing the Symbol :xi to the value of i, which will always return false:
julia> :xi == 2
false
Furthermore, note that | has a higher precedence than ==, so if you want to combine two equality checks with or you need brackets:
julia> 1 == 1 | 2 == 2
false
julia> (1 == 1) | (2 == 2)
true
More things could be said about your code snippet, but I hope this gives you an idea of where your gaps in understanding are and how you might go about closing them.
For completeness, here's how I would approach your problem - I'm interpreting your code to mean "calculate the mean of the value column, grouped by each value of xi and xj, but only where xi equals xj":
julia> combine(groupby(df[df.xi .== df.xj, :], [:xi, :xj], sort = true), :value => mean => :resulted_mean)
2×3 DataFrame
Row │ xi xj resulted_mean
│ Int64 Int64 Float64
─────┼─────────────────────────────
1 │ 1 1 0.356811
2 │ 2 2 0.977041
This is probably the most common analysis pattern for DataFrames, and is explained in the tutorial that Bogumił mentioned as well as in the DataFrames docs here.
As I said up front, if you want to use Julia productively, I recommend that you spend some time reading the documentation both for the language itself as well as for any of the key packages you're using. While Julia has some similarities to Python, and some bits in the DataFrames package have an API that resemble things you might have seen in R, it is a language in its own right that is fundamentally different from both Python and R (or any other language for that matter), and there's no way around familiarizing yourself with how it actually works.

Expressing rules in B-Method

I'm writing some specifications of a system in B-method. I have the following variables which are subsets of a general set:
First Notation:
a :={x,y,z,v}
b :={x,y,z}
I want to state a rule that whenever something exists in set "b", it also exists in set "a" which helps writing the above specifications as the following:
second Notation:
a :={v}
b :={x,y,z}
Explanation of second notation: I want the machine to infer that a :={x,y,z,v} from a :={v}, b :={x,y,z}, and the rule.
How can I express the rule so I avoid the first notation and instead write the second notation?
I tried the following but it didn't work
INITIALISATION
a :={v} &
b :={x,y,z}
ASSERTIONS
!x.(x:b => x:a)
First of all, the predicate !x.(x:b => x:a) can be more easily expressed just by b<:a.
It's not clear to me what exactly you want to express: Should b always be a subset of a or just in the initialisation?
If always, the INVARIANT would be the correct location for that. ASSERTIONS are similar but should be an implication by the other invariants.
But then you must explicitly ensure that in your initialisation.
Another point which is unclear to me is what you mean by "infer". Do you just not want to specify the details?
An initialisation where you assign a set with one element more than b could look like the following (assuming that a and b contain elements of S):
INITIALISATION
ANY v,s
WHERE
v:S
& s<:S
THEN
a := s\/{v}
|| b := s
END
(Disclaimer: I haven't actually tested it.)
If a should always be larger than b, you could add something like v/:s.
Your description does not make it clear what exactly you want to achieve.
Another approach would use the "becomes such substitution" (but in my opinion it is less readable):
INITIALISATION
a,b :( a:S & b:S &
b={x,y,z} &
#x.( x:S & a=b\/{x} ))
First, and foremost, the B machine does not infer anything by itself. It provides a language where the user can express properties and a mechanism that generates proof obligations that must be successfully processed by the prover (automatically or with human assistance) to guarantee that the properties hold.
In your example, if you want to express that every element of set bb is always an element of set aa, then as observed by danielp, just write
bb <: aa
Next, if you want to write that aa apossesses an element that is not in bb, then you can express it as
aa /= bb & not(aa = {})
or as
#(elem).(elem : S & elem : bb & not(elem : aa))
If you rather want to express that the specific value vv is in aa but not in bb, then the following applies
vv : bb & not(vv : aa)
These expressions may be used at several locations of the B machine, depending whether you want to assert properties on parameters, constants or variables.
For instance, say you have a machine with two variables va and vb, where both are sets of elements of a given set s1, and that you want they are initialized in such a way that every element of vb is also an element of va, and that there exists an element of va that is not in vb. Observe first that this means that vb is a strict subset of va.
INITIALISATION
ANY inia, inib WHERE
inia <: s1 & inib <: s2 & inib <<: inia
THEN
va := inia || vb := inib
END

Python failure to find all duplicates

This is related to random sampling. I am using random.sample(number,5) to return a list of random numbers from within a range of numbers contained in numbers. I am using while i < 100 to return one hundred sets of five numbers. To check for duplicates, I am using :
if len(numbers) != len(set(numbers)):
to identify sets with duplicates and following this with random.sample(number,5) to try to do another randomisation to replace the set with duplicates. I seem to get about 8% getting re-randomised ( using a print statement to say which number was duplicated), but about 5% seem to be missed. What am I doing incorrectly? The actual code is as follows:
while i < 100:
set1 = random.sample(numbers1,5)
if len(set1) != len(set(set1))
print('duplicate(s) found, random selection repeated')
set1 = random.sample(numbers1,5)
In another routine I am trying to do the same as above, but searching for duplicates in two sets by adding the same, substituting set2 for set1. This gives the same sorts of failures. The set2 routine is indented and placed immediately below the above routine. While i < 100: is not repeated for set2.
I hope that I have explained my problem clearly!!
There is nothing in your code to stop the second sample from having duplicates. What if you did something like a second while loop?
while i<100:
i+=1
set1 = random.sample(numbers1,5)
while len(set1) != len(set(set1)):
print('duplicate(s) found, random selection repeated')
set1 = random.sample(numbers1,5)
Of course you're still missing the part of the code that does something... beyond the above it's difficult to tell what you might need to change without a full code sample.
EDIT: here is a working version of the code sample from the comments:
def choose_random(list1,n):
import random
i = 0
set_list=[]
major_numbers=range(1,50) + list1
print(major_numbers)
while i <n:
set1 =random.sample(major_numbers,5)
set2 =random.sample(major_numbers,2)
while len(set(set1)) != len(set1):
print("Duplicate found at %i"%i)
print set1
print("Changing to:")
set1 =random.sample(major_numbers,5)
print set1
set_list.append([set1,set2])
i +=1
return set_list
The code you give obviously has some gaps in it and cannot work as it is there, so I cannot pinpoint where exactly your error is, but running set1 = random.sample(numbers1,5) after the end of the while loop (which is infinite if written as in your question) undoes everything you did before, because it overwrites whatever you managed to set set1 to.
Anyway, random.sample should give you a sample without replacement. If you have any repetitions in random.sample(numbers1, 5) that means that you already have repetitions in numbers1. If that is not supposed to be the case, you should check the content of numbers1 and maybe force it to contain everything uniquely, for example by using set(numbers1) instead.
If the reason is that you want some elements from numbers1 with higher probability, you might want to put this as
set1 = random.sample(numbers1, 5)
while len(set1) != len(set(set1)):
set1 = random.sample(numbers1, 5)
This is a possibly infinite loop, but if numbers1 contains at least 5 different elements, it will exit the loop at some point. If you don't like the theoretical possibility of this loop never exiting, you should probably use a weighted sample instead of random.sample, (there are a few examples of how to do that here on stackoverflow) and remove the numbers you have already chosen from the weights table.

whats another way to write python3 zip [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
Ive been working on a code that reads lines in a file document and then the code organizes them. However, i got stuck at one point and my friend told me what i could use. the code works but it seems that i dont know what he is doing at line 7 and 8 FROM THE BOTTOM. I used #### so you guys know which lines it is.
So, essentially how can you re-write those 2 lines of codes and why do they work? I seem to not understand dictionaries
from sys import argv
filename = input("Please enter the name of a file: ")
file_in=(open(filename, "r"))
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
animaldictionary = dict()
for line in file_in:
if '\n' == line[-1]:
line = line[:-1]
(a, b, c) = line.split(':')
ac = (a,c)
if ac not in animaldictionary:
animaldictionary[ac] = 0
animaldictionary[ac] += 1
alla = []
for key, value in animaldictionary:
if key not in alla:
alla.append(key)
print ("alla:",alla)
allc = []
for key, value in animaldictionary:
if value not in allc:
allc.append(value)
print("allc", allc)
for a in sorted(alla):
print('%9s'%a,end=' '*13)
for c in sorted(allc):
ac = (a,c)
valc = 0
if ac in animaldictionary:
valc = animaldictionary[ac]
print('%4d'%valc,end=' '*19)
print()
print("="*60)
print("Animals that visited both stations at least 3 times: ")
for a in sorted(alla):
x = 'false'
for c in sorted(allc):
ac = (a,c)
count = 0
if ac in animaldictionary:
count = animaldictionary[ac]
if count >= 3:
x = 'true'
if x is 'true':
print('%6s'%a, end=' ')
print("")
print("="*60)
print("Average of the number visits in each month for each station:")
#(alla, allc) =
#for s in zip(*animaldictionary.keys()):
# (alla,allc).append(s)
#print(alla, allc)
(alla,allc,) = (set(s) for s in zip(*animaldictionary.keys())) ##### how else can you write this
##### how else can you rewrite the next code
print('\n'.join(['\t'.join((c,str(sum(animaldictionary.get(ac,0) for a in alla for ac in ((a,c,),))//12)))for c in sorted(allc)]))
print("="*60)
print("Month with the maximum number of visits for each station:")
print("Station Month Number")
print("1")
print("2")
The two lines you indicated are indeed rather confusing. I'll try to explain them as best I can, and suggest alternative implementations.
The first one computes values for alla and allc:
(alla,allc,) = (set(s) for s in zip(*animaldictionary.keys()))
This is nearly equivalent to the loops you've already done above to build your alla and allc lists. You can skip it completely if you want. However, lets unpack what it's doing, so you can actually understand it.
The innermost part is animaldictionary.keys(). This returns an iterable object that contains all the keys of your dictionary. Since the keys in animaldictionary are two-valued tuples, that's what you'll get from the iterable. It's actually not necessary to call keys when dealing with a dictionary in most cases, since operations on the keys view are usually identical to doing the same operation on the dictionary directly.
Moving on, the keys gets wrapped up by a call to the zip function using zip(*keys). There's two things happening here. First, the * syntax unpacks the iterable from above into separate arguments. So if animaldictionary's keys were ("a1", "c1), ("a2", "c2"), ("a3", "c3") this would call zip with those three tuples as separate arguments. Now, what zip does is turn several iterable arguments into a single iterable, yielding a tuple with the first value from each, then a tuple with the second value from each, and so on. So zip(("a1", "c1"), ("a2", "c2"), ("a3", "c3")) would return a generator yielding ("a1", "a2", "a3") followed by ("c1", "c2", "c3").
The next part is a generator expression that passes each value from the zip expression into the set constructor. This serves to eliminate any duplicates. set instances can also be useful in other ways (e.g. finding intersections) but that's not needed here.
Finally, the two sets of a and c values get assigned to variables alla and allc. They replace the lists you already had with those names (and the same contents!).
You've already got an alternative to this, where you calculate alla and allc as lists. Using sets may be slightly more efficient, but it probably doesn't matter too much for small amounts of data. Another, more clear, way to do it would be:
alla = set()
allc = set()
for key in animaldict: # note, iterating over a dict yields the keys!
a, c = key # unpack the tuple key
alla.add(a)
allc.add(c)
The second line you were asking about does some averaging and combines the results into a giant string which it prints out. It is really bad programming style to cram so much into one line. And in fact, it does some needless stuff which makes it even more confusing. Here it is, with a couple of line breaks added to make it all fit on the screen at once.
print('\n'.join(['\t'.join((c,str(sum(animaldictionary.get(ac,0)
for a in alla for ac in ((a,c,),))//12)
)) for c in sorted(allc)]))
The innermost piece of this is for ac in ((a,c,),). This is silly, since it's a loop over a 1-element tuple. It's a way of renaming the tuple (a,c) to ac, but it is very confusing and unnecessary.
If we replace the one use of ac with the tuple explicitly written out, the new innermost piece is animaldictionary.get((a,c),0). This is a special way of writing animaldictionary[(a, c)] but without running the risk of causing a KeyError to be raised if (a, c) is not in the dictionary. Instead, the default value of 0 (passed in to get) will be returned for non-existant keys.
That get call is wrapped up in this: (getcall for a in alla). This is a generator expression that gets all the values from the dictionary with a given c value in the key
(with a default of zero if the value is not present).
The next step is taking the average of the values in the previous generator expression: sum(genexp)//12. This is pretty straightforward, though you should note that using // for division always rounds down to the next integer. If you want a more precise floating point value, use just /.
The next part is a call to '\t'.join, with an argument that is a single (c, avg) tuple. This is an awkward construction that could be more clearly written as c+"\t"+str(avg) or "{}\t{}".format(c, avg). All of these result in a string containing the c value, a tab character and the string form of the average calcualted above.
The next step is a list comprehension, [joinedstr for c in sorted(allc)] (where joinedstr is the join call in the previous step). Using a list comprehension here is a bit odd, since there's no need for a list (a generator expression would do just as well).
Finally, the list comprehension is joined with newlines and printed: print("\n".join(listcomp)). This is straightforward.
Anyway, this whole mess can be rewritten in a much clearer way, by using a few variables and printing each line separately in a loop:
for c in sorted(allc):
total_values = sum(animaldictionary.get((a,c),0) for a in alla)
average = total_values // 12
print("{}\t{}".format(c, average))
To finish, I have some general suggestions.
First, your data structure may not be optimal for the uses you are making of you data. Rather than having animaldict be a dictionary with (a,c) keys, it might make more sense to have a nested structure, where you index each level separately. That is, animaldict[a][c]. It might even make sense to have a second dictionaries containing the same values indexed in the reverse order (e.g. one is indexed [a][c] while another is indexed [c][a]). With this approach you might not need the alla and allc lists for iterating (you'd just loop over the contents of the main dictionary directly).
My second suggestion is about code style. Many of your variables are named poorly, either because their names don't have any meaning (e.g. c) or where the names imply a meaning that is incorrect. The most glaring issue is your key and value variables, which in fact unpack two pieces of the key (AKA a and c). In other situations you can get keys and values together, but only when you are iterating over a dictionary's items() view rather than on the dictionary directly.

Repeat each element in a string a certain number of times

I'm using the rep() function to repeat each element in a string a number of times. Each character I have contains information for a state, and I need the first three elements of the character vector repeated three times, and the fourth element repeated five times.
So lets say I have the following character vectors.
al <- c("AlabamaCity", "AlabamaCityST", "AlabamaCityState", "AlabamaZipCode")
ak <- c("AlaskaCity", "AlaskaCityST", "AlaskaCityState", "AlaskaZipCode")
az <- c("ArizonaCity", "ArizonaCityST", "ArizonaCityState", "ArizonaZipCode")
ar <- c("ArkansasCity", "ArkansasCityST", "ArkansasCityState", "ArkansasZipCode")
I want to end up having the following output.
AlabamaCity
AlabamaCity
AlabamaCity
AlabamaCityST
AlabamaCityST
AlabamaCityST
AlabamaCityState
AlabamaCityState
AlabamaCityState
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
...
I was able to get the desired output with the following command, but it's a little inconvenient when I'm running through all fifty states. Plus, I might have another column with 237 cities in Alabama, and I'll inevitably run into problems matching up the names in the first column with the values in the second column.
dat = data.frame(name=c(rep(al[1:3],each=3), rep(al[4],each=6),
rep(ak[1:3],each=3), rep(ak[4],each=6)))
dat
dat2 = data.frame(name=c(rep(al[1:3],each=3), rep(al[4],each=6),
rep(ak[1:3],each=3), rep(ak[4],each=6)),
city=c(rep("x",each=15), rep("y",each=15)))
dat2
Of course, in real life, the 'x' and 'y' won't be single values.
So my question concerns if there is a more efficient way of performing this task. And closely related to the question, when does it become important to ditch procedural programming in favor of OOP in R. (not a programmer, so the second part may be a really stupid question) More importantly, is this a task where I should look for a oop related solution.
According to ?rep, times= can be a vector. So, how about this:
dat <- data.frame(name=rep(al, times=c(3,3,3,6)))
It would also be more convenient if your "state" data were in a list.
stateData <- list(al,ak,az,ar)
Data <- lapply(stateData, function(x) data.frame(name=rep(x, times=c(3,3,3,6))))
Data <- do.call(rbind, Data)
I think you can combine the times() argument of rep to work through a list with sapply(). So first, we need to make our list object:
vars <- list(al, ak, az, ar)
# Iterate through each object in vars. By default, this returns a column for each list item.
# Convert to vector and then to data.frame...This is probably not that efficient.
as.data.frame(as.vector(sapply(vars, function(x) rep(x, times = c(3,3,3,6)))))
1 AlabamaCity
2 AlabamaCity
3 AlabamaCity
4 AlabamaCityST
....snip....
....snip....
57 ArkansasZipCode
58 ArkansasZipCode
59 ArkansasZipCode
60 ArkansasZipCode
You might consider using expand.grid, then paste on the results from that.

Resources