I am working with a very long list of commodity names (var1). I would like to extract information from this list by creating a second variable (var2) that is equal to 1 if var1 contains certain keywords.
I was using the following code:
g soy = strpos(productsproduced, "Soybeans, ") | strpos(productsproduced, "Soybean, ") | strpos(productsproduced, "soybeans, ")| strpos(productsproduced, "soybean, ") | productsproduced == "Soybeans"
The list is much longer, given that the data was not properly coded, and each name appears in many different ways (as the excerpt in the code sample shows).
I believe that it would be much easier to work with a list (easier to look through the list certainly, and see if I am missing anything, etc.)
Unfortunately, it has been a while since I have worked with loops, but I was thinking something of the sort:
local mylist Soybean soybean Soybeans soybeans Soybeans, soybeans,
forval i = mylist {
g soy = strpos(var1, "`i'")
}
This doesn't quite work, but I am not sure how to code it. One definite issue is that Stata would not know in this case whether I would like it to use the or operator (yes, I would) or the and operator.
The spirit is evident; the details need various fixes.
local mywords Soybean soybean Soybeans soybeans Soybeans, soybeans,
gen soy = 0
foreach w of local mywords {
replace soy = soy | strpos(var1, "`w'")
}
What's crucial is that you need replace inside the loop; otherwise the loop will fail second time round on a generate as the variable already exists.
In fact this example reduces to
gen soy = strpos(var1, "oybean") > 0
on the assumption that oybean wouldn't match anything not wanted.
Standardising to lower case is often helpful, since it reduces the number of spellings you need to list:
local mywords soybean soybeans soybeans,
gen soy = 0
foreach w of local mywords {
replace soy = soy | strpos(lower(var1), "`w'")
}
I have an assignment in Haskell, and in one of the tasks I have to translate an input word according to a dictionary that is already in the file.
This is an example of a line from the dictionary:
dictionary = [
("doubleplusgood",["excellent", "fabulous", "fantastic", "best"]),
]
If the input is "excellent" the task says that my translator function should return output "doubleplusgood":
translate "excellent" = "doubleplusgood"
I have been trying to solve this task for hours now and I think my thoughts are just going in circles, so I was wondering if anyone has any advice on where I should begin in order to solve the task. I am, by the way, not allowed to import any packages other than Prelude.
Well you can start with what you already know to be true, defining
translate "excellent" = "doubleplusgood"
(suspend your disbelief for a moment, will you).
Yes this is a valid definition. But it doesn't refer to dictionary, so we can address that by defining
translate "excellent" = matchup "excellent" dictionary
matchup "excellent" dict = "doubleplusgood"
Except that it is too specific of course. So we generalize a little bit as
translate excellent = matchup excellent dictionary
matchup excellent dict = "doubleplusgood"
Now the matchup is just cheating. We can try to make it do some actual work as
matchup "excellent"
[("doubleplusgood",[ "excellent", "fabulous", "fantastic", "best"])]
=
"doubleplusgood"
All these variations so far are just writing down what you already had.
But we had it generalized before, so
matchup excellent
[("doubleplusgood",[ excellent, "fabulous", "fantastic", "best"])]
=
"doubleplusgood"
and that's an error now. We can't have two occurrences of the same variable in our arguments. It's forbidden in Haskell. So it must be
matchup excellent1
[("doubleplusgood",[ excellent2, "fabulous", "fantastic", "best"])]
=
"doubleplusgood"
We are definitely going somewhere with this. But wait, we didn't use the two variables at all. They are supposed to have the same value, aren't they. So let's write that down:
matchup excellent1
[("doubleplusgood",[ excellent2, "fabulous", "fantastic", "best"])]
| excellent1 == excellent2
=
"doubleplusgood"
Well but what's with all these other entries in the synonyms list? And their "translation"?
matchup excellent1
[(doubleplusgood,[ excellent2, fabulous, fantastic, best])]
| excellent1 == excellent2
=
doubleplusgood
Now this is a bona fide Haskell definition. Almost. Why should the list of synonyms have this fixed length? Why would there be just one entry in the dictionary?
We proceed by wishful thinking (Thank You) and write
matchup excellent1 (
(doubleplusgood, synonyms)
:
more )
| present excellent1 synonyms
=
doubleplusgood
And now we must also define this present. But first, what should we do with more? Under what condition?
matchup excellent1 (
(doubleplusgood, synonyms)
:
more )
| present excellent1 synonyms
=
doubleplusgood
| otherwise
=
somethingelse excellent1 more
But what should somethingelse do? Isn't it exactly what matchup is doing?
etc. etc. etc.
I think You can continue from here.
The title may sound a bit weird, but the situation is not so much.
I have some lists.
Here they are initialized (global variables):
sensor0, sensor1, sensor2, sensor3 = ([] for i in range(4))
Now we have a variable that indicates which list I can insert data into. Let's say it's selector = 3.
This means I should append() my value to the sensor3 list.
What is the most practical way to do this in Python?
If it were a C-style language, I would use a switch-case.
But there is no switch-case syntax in Python. Of course I could use multiple ifs, but that does not seem like the best way to do it.
Since only one character of the list name changes each time, perhaps there is a better way to select the proper list to append() to, based on the selector variable.
For your specific case, you could use eval:
sensor0, sensor1, sensor2, sensor3 = ([] for i in range(4))
selector = 3
# Look up the list by its constructed name, then append to it
eval("sensor" + str(selector)).append(20)
print(sensor3)
But using a list may be a better option.
Using a list:
sensor = [[] for i in range(4)]
selector = 3
# Append to the selected inner list instead of replacing it
sensor[selector].append(20)
print(sensor[selector])
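Since the question asks for append() rather than assignment, here is a minimal sketch of the list-of-lists idea with several values routed to different sensors. The helper name record and the sample readings are made up for illustration; a dict keyed by sensor number would work in much the same way.

```python
# One inner list per sensor; the list index plays the role of the digit
# in the original sensor0..sensor3 names.
sensors = [[] for _ in range(4)]


def record(selector, value):
    """Append value to the list chosen by selector (hypothetical helper)."""
    sensors[selector].append(value)


record(3, 20)
record(3, 25)
record(0, 7)

print(sensors)  # [[7], [], [], [20, 25]]
```

Because the selector is just an index, no eval and no chain of ifs is needed, and adding a fifth sensor only means changing the range.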
I am populating a new variable of a dataframe, based on string conditions from another variable. I receive the following error msg:
Error in Source == "httpWWW.BGDAILYNEWS.COM" | Source == :
operations are possible only for numeric, logical or complex types
My code is as follows:
County <- ifelse(Source == 'httpWWW.BGDAILYNEWS.COM' | 'WWW.BGDAILYNEWS.COM', 'Warren',
ifelse(Source == 'httpWWW.HCLOCAL.COM' | 'WWW.HCLOCAL.COM', 'Henry',
ifelse(Source == 'httpWWW.KENTUCKY.COM' | 'WWW.KENTUCKY.COM', 'Fayette',
ifelse(Source == 'httpWWW.KENTUCKYNEWERA.COM' | 'WWW.KENTUCKYNEWERA.COM', 'Christian')
)))
I suggest you break down that deeply nested ifelse statement into more manageable chunks.
But the error is telling you that you cannot use | like that. 'a' | 'b' doesn't make sense, since | is a logical operator and cannot be applied to character strings. Instead use %in%:
Source %in% c('httpWWW.BGDAILYNEWS.COM', 'WWW.BGDAILYNEWS.COM')
I think... If I understand what you're doing, you will be much better off using multiple assignments:
County = vector(mode='character', length=length(Source))
County[Source %in% c('httpWWW.BGDAILYNEWS.COM', 'WWW.BGDAILYNEWS.COM')] <- 'Warren'
etc.
You can also use a switch statement for this type of thing:
myfun <- function(x) {
switch(x,
'httpWWW.BGDAILYNEWS.COM'='Warren',
'httpWWW.HCLOCAL.COM'='Henry',
etc...)
}
Then you want to do a simple apply (sapply) passing each element in Source to myfun:
County = sapply(Source, myfun)
Or finally, you can use factors and levels, but I'll leave that as an exercise to the reader...
A different approach:
county <- c("Warren","Henry","Fayette","Christian")
sites <- c("WWW.BGDAILYNEWS.COM","WWW.HCLOCAL.COM","WWW.KENTUCKY.COM","WWW.KENTUCKYNEWERA.COM")
County <- county[match(gsub("^http","",Source), sites)]
This will return NA for strings that do not match any of the given inputs.
Using Hadley's suggestion (lookup-tables-character-subsetting):
lookup <- c(WWW.BGDAILYNEWS.COM="Warren", WWW.HCLOCAL.COM="Henry", WWW.KENTUCKY.COM="Fayette", WWW.KENTUCKYNEWERA.COM="Christian")
County <- unname(lookup[gsub("^http","",Source)])
I've been working on code that reads lines from a file and then organizes them. However, I got stuck at one point and my friend told me what I could use. The code works, but I don't understand what he is doing at lines 7 and 8 from the bottom. I marked them with ##### so you know which lines they are.
So, essentially, how can you rewrite those two lines of code, and why do they work? I don't seem to understand dictionaries.
from sys import argv
filename = input("Please enter the name of a file: ")
file_in=(open(filename, "r"))
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
animaldictionary = dict()
for line in file_in:
    if '\n' == line[-1]:
        line = line[:-1]
    (a, b, c) = line.split(':')
    ac = (a,c)
    if ac not in animaldictionary:
        animaldictionary[ac] = 0
    animaldictionary[ac] += 1
alla = []
for key, value in animaldictionary:
    if key not in alla:
        alla.append(key)
print ("alla:",alla)
allc = []
for key, value in animaldictionary:
    if value not in allc:
        allc.append(value)
print("allc", allc)
for a in sorted(alla):
    print('%9s'%a,end=' '*13)
    for c in sorted(allc):
        ac = (a,c)
        valc = 0
        if ac in animaldictionary:
            valc = animaldictionary[ac]
        print('%4d'%valc,end=' '*19)
    print()
print("="*60)
print("Animals that visited both stations at least 3 times: ")
for a in sorted(alla):
    x = 'false'
    for c in sorted(allc):
        ac = (a,c)
        count = 0
        if ac in animaldictionary:
            count = animaldictionary[ac]
        if count >= 3:
            x = 'true'
    if x is 'true':
        print('%6s'%a, end=' ')
print("")
print("="*60)
print("Average of the number visits in each month for each station:")
#(alla, allc) =
#for s in zip(*animaldictionary.keys()):
# (alla,allc).append(s)
#print(alla, allc)
(alla,allc,) = (set(s) for s in zip(*animaldictionary.keys())) ##### how else can you write this
##### how else can you rewrite the next code
print('\n'.join(['\t'.join((c,str(sum(animaldictionary.get(ac,0) for a in alla for ac in ((a,c,),))//12)))for c in sorted(allc)]))
print("="*60)
print("Month with the maximum number of visits for each station:")
print("Station Month Number")
print("1")
print("2")
The two lines you indicated are indeed rather confusing. I'll try to explain them as best I can, and suggest alternative implementations.
The first one computes values for alla and allc:
(alla,allc,) = (set(s) for s in zip(*animaldictionary.keys()))
This is nearly equivalent to the loops you've already written above to build your alla and allc lists. You can skip it completely if you want. However, let's unpack what it's doing, so you can actually understand it.
The innermost part is animaldictionary.keys(). This returns an iterable object that contains all the keys of your dictionary. Since the keys in animaldictionary are two-valued tuples, that's what you'll get from the iterable. It's actually not necessary to call keys when dealing with a dictionary in most cases, since operations on the keys view are usually identical to doing the same operation on the dictionary directly.
Moving on, the keys get wrapped up by a call to the zip function using zip(*keys). There are two things happening here. First, the * syntax unpacks the iterable from above into separate arguments. So if animaldictionary's keys were ("a1", "c1"), ("a2", "c2"), ("a3", "c3") this would call zip with those three tuples as separate arguments. Now, what zip does is turn several iterable arguments into a single iterable, yielding a tuple with the first value from each, then a tuple with the second value from each, and so on. So zip(("a1", "c1"), ("a2", "c2"), ("a3", "c3")) would return an iterator yielding ("a1", "a2", "a3") followed by ("c1", "c2", "c3").
The next part is a generator expression that passes each value from the zip expression into the set constructor. This serves to eliminate any duplicates. set instances can also be useful in other ways (e.g. finding intersections) but that's not needed here.
Finally, the two sets of a and c values get assigned to variables alla and allc. They replace the lists you already had with those names (and the same contents!).
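To see the steps described above one at a time, here is a small sketch that runs them on a made-up dictionary. The keys and counts are hypothetical sample data, not taken from the asker's file.

```python
# Hypothetical sample data: keys are (animal, station) tuples, values are counts.
animaldictionary = {("a1", "s1"): 4, ("a1", "s2"): 1, ("a2", "s1"): 3}

# zip(*keys) regroups the key tuples position by position.
columns = list(zip(*animaldictionary.keys()))
print(columns)  # [('a1', 'a1', 'a2'), ('s1', 's2', 's1')]

# Wrapping each column in set() drops the duplicates.
alla, allc = (set(column) for column in columns)
print(sorted(alla))  # ['a1', 'a2']
print(sorted(allc))  # ['s1', 's2']
```

The sets are sorted before printing only because set ordering is not guaranteed.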
You've already got an alternative to this, where you calculate alla and allc as lists. Using sets may be slightly more efficient, but it probably doesn't matter too much for small amounts of data. Another, more clear, way to do it would be:
alla = set()
allc = set()
for key in animaldictionary: # note, iterating over a dict yields the keys!
    a, c = key # unpack the tuple key
    alla.add(a)
    allc.add(c)
The second line you were asking about does some averaging and combines the results into a giant string which it prints out. It is really bad programming style to cram so much into one line. And in fact, it does some needless stuff which makes it even more confusing. Here it is, with a couple of line breaks added to make it all fit on the screen at once.
print('\n'.join(['\t'.join((c,str(sum(animaldictionary.get(ac,0)
for a in alla for ac in ((a,c,),))//12)
)) for c in sorted(allc)]))
The innermost piece of this is for ac in ((a,c,),). This is silly, since it's a loop over a 1-element tuple. It's a way of renaming the tuple (a,c) to ac, but it is very confusing and unnecessary.
If we replace the one use of ac with the tuple explicitly written out, the new innermost piece is animaldictionary.get((a,c),0). This is a special way of writing animaldictionary[(a, c)] but without running the risk of causing a KeyError to be raised if (a, c) is not in the dictionary. Instead, the default value of 0 (passed in to get) will be returned for non-existent keys.
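The difference between get with a default and plain indexing can be seen in a short sketch. The dictionary contents here are hypothetical sample data.

```python
counts = {("a1", "s1"): 4}  # hypothetical sample data

# A key that exists behaves the same either way.
print(counts[("a1", "s1")])         # 4
print(counts.get(("a1", "s1"), 0))  # 4

# A missing key: get() falls back to the default ...
print(counts.get(("a2", "s1"), 0))  # 0

# ... while plain indexing raises KeyError.
try:
    counts[("a2", "s1")]
except KeyError as err:
    missing = err.args[0]
    print("KeyError for", missing)  # KeyError for ('a2', 's1')
```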
That get call is wrapped up in this: (getcall for a in alla). This is a generator expression that gets all the values from the dictionary with a given c value in the key (with a default of zero if the value is not present).
The next step is taking the average of the values in the previous generator expression: sum(genexp)//12. This is pretty straightforward, though you should note that using // for division always rounds down to the next integer. If you want a more precise floating point value, use just /.
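As a quick illustration of the two division operators, with a made-up total standing in for the sum of visits:

```python
total = 41  # hypothetical sum of visits over a year

floored = total // 12  # floor division: the result stays an int
exact = total / 12     # true division: the result is a float

print(floored)    # 3
print(exact)      # a float a little above 3.4
print(-41 // 12)  # -4, since // rounds toward negative infinity, not toward zero
```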
The next part is a call to '\t'.join, with an argument that is a single (c, avg) tuple. This is an awkward construction that could be more clearly written as c+"\t"+str(avg) or "{}\t{}".format(c, avg). All of these result in a string containing the c value, a tab character and the string form of the average calculated above.
The next step is a list comprehension, [joinedstr for c in sorted(allc)] (where joinedstr is the join call in the previous step). Using a list comprehension here is a bit odd, since there's no need for a list (a generator expression would do just as well).
Finally, the list comprehension is joined with newlines and printed: print("\n".join(listcomp)). This is straightforward.
Anyway, this whole mess can be rewritten in a much clearer way, by using a few variables and printing each line separately in a loop:
for c in sorted(allc):
    total_values = sum(animaldictionary.get((a,c),0) for a in alla)
    average = total_values // 12
    print("{}\t{}".format(c, average))
To finish, I have some general suggestions.
First, your data structure may not be optimal for the uses you are making of your data. Rather than having animaldictionary be a dictionary with (a,c) keys, it might make more sense to have a nested structure, where you index each level separately. That is, animaldictionary[a][c]. It might even make sense to have a second dictionary containing the same values indexed in the reverse order (e.g. one is indexed [a][c] while the other is indexed [c][a]). With this approach you might not need the alla and allc lists for iterating (you'd just loop over the contents of the main dictionary directly).
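A nested structure along these lines could be sketched with collections.defaultdict. The input lines and the name visits are made up for illustration, and the middle field is only assumed to be something like a date, since the question does not say what it holds.

```python
from collections import defaultdict

# Hypothetical input lines, split on ':' into three fields as in the question.
lines = ["a1:01-02:s1", "a1:01-03:s1", "a2:01-03:s2", "a1:02-01:s2"]

# visits[animal][station] -> count; missing entries start at 0.
visits = defaultdict(lambda: defaultdict(int))
for line in lines:
    animal, _middle, station = line.split(":")
    visits[animal][station] += 1

# No separate alla/allc lists are needed to walk the data.
for animal in sorted(visits):
    for station in sorted(visits[animal]):
        print(animal, station, visits[animal][station])
```

One thing to keep in mind with defaultdict is that merely reading a missing key creates it, so use visits.get(animal) or an in test when you only want to check for presence.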
My second suggestion is about code style. Many of your variables are named poorly, either because their names don't have any meaning (e.g. c) or where the names imply a meaning that is incorrect. The most glaring issue is your key and value variables, which in fact unpack two pieces of the key (AKA a and c). In other situations you can get keys and values together, but only when you are iterating over a dictionary's items() view rather than on the dictionary directly.
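The difference between iterating over the dictionary and over its items() view can be sketched like this, again with hypothetical sample data:

```python
counts = {("a1", "s1"): 4, ("a2", "s2"): 1}  # hypothetical sample data

# Iterating over the dict yields keys only. Each key is a 2-tuple here,
# so two loop variables receive the two halves of the key, not key and value.
key_parts = [(animal, station) for animal, station in counts]
print(key_parts)  # [('a1', 's1'), ('a2', 's2')]

# items() yields (key, value) pairs, so the count is available as well.
rows = [(animal, station, count) for (animal, station), count in counts.items()]
print(rows)  # [('a1', 's1', 4), ('a2', 's2', 1)]
```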
I'm using the rep() function to repeat each element in a character vector a number of times. Each character vector I have contains information for a state, and I need the first three elements of the vector repeated three times, and the fourth element repeated six times.
So lets say I have the following character vectors.
al <- c("AlabamaCity", "AlabamaCityST", "AlabamaCityState", "AlabamaZipCode")
ak <- c("AlaskaCity", "AlaskaCityST", "AlaskaCityState", "AlaskaZipCode")
az <- c("ArizonaCity", "ArizonaCityST", "ArizonaCityState", "ArizonaZipCode")
ar <- c("ArkansasCity", "ArkansasCityST", "ArkansasCityState", "ArkansasZipCode")
I want to end up having the following output.
AlabamaCity
AlabamaCity
AlabamaCity
AlabamaCityST
AlabamaCityST
AlabamaCityST
AlabamaCityState
AlabamaCityState
AlabamaCityState
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
...
I was able to get the desired output with the following command, but it's a little inconvenient when I'm running through all fifty states. Plus, I might have another column with 237 cities in Alabama, and I'll inevitably run into problems matching up the names in the first column with the values in the second column.
dat = data.frame(name=c(rep(al[1:3],each=3), rep(al[4],each=6),
rep(ak[1:3],each=3), rep(ak[4],each=6)))
dat
dat2 = data.frame(name=c(rep(al[1:3],each=3), rep(al[4],each=6),
rep(ak[1:3],each=3), rep(ak[4],each=6)),
city=c(rep("x",each=15), rep("y",each=15)))
dat2
Of course, in real life, the 'x' and 'y' won't be single values.
So my question is whether there is a more efficient way of performing this task. Closely related: when does it become important to ditch procedural programming in favor of OOP in R? (I'm not a programmer, so the second part may be a really stupid question.) More importantly, is this a task where I should look for an OOP-related solution?
According to ?rep, times= can be a vector. So, how about this:
dat <- data.frame(name=rep(al, times=c(3,3,3,6)))
It would also be more convenient if your "state" data were in a list.
stateData <- list(al,ak,az,ar)
Data <- lapply(stateData, function(x) data.frame(name=rep(x, times=c(3,3,3,6))))
Data <- do.call(rbind, Data)
I think you can combine the times argument of rep with sapply() to work through a list. So first, we need to make our list object:
vars <- list(al, ak, az, ar)
# Iterate through each object in vars. By default, this returns a column for each list item.
# Convert to vector and then to data.frame...This is probably not that efficient.
as.data.frame(as.vector(sapply(vars, function(x) rep(x, times = c(3,3,3,6)))))
1 AlabamaCity
2 AlabamaCity
3 AlabamaCity
4 AlabamaCityST
....snip....
....snip....
57 ArkansasZipCode
58 ArkansasZipCode
59 ArkansasZipCode
60 ArkansasZipCode
You might consider using expand.grid, then paste on the results from that.