Repeat each element in a string a certain number of times - string

I'm using the rep() function to repeat each element in a string a number of times. Each character vector I have contains information for a state, and I need the first three elements of the character vector repeated three times, and the fourth element repeated six times.
So let's say I have the following character vectors.
al <- c("AlabamaCity", "AlabamaCityST", "AlabamaCityState", "AlabamaZipCode")
ak <- c("AlaskaCity", "AlaskaCityST", "AlaskaCityState", "AlaskaZipCode")
az <- c("ArizonaCity", "ArizonaCityST", "ArizonaCityState", "ArizonaZipCode")
ar <- c("ArkansasCity", "ArkansasCityST", "ArkansasCityState", "ArkansasZipCode")
I want to end up having the following output.
AlabamaCity
AlabamaCity
AlabamaCity
AlabamaCityST
AlabamaCityST
AlabamaCityST
AlabamaCityState
AlabamaCityState
AlabamaCityState
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
...
I was able to get the desired output with the following command, but it's a little inconvenient when I'm running through all fifty states. Plus, I might have another column with 237 cities in Alabama, and I'll inevitably run into problems matching up the names in the first column with the values in the second column.
dat = data.frame(name=c(rep(al[1:3],each=3), rep(al[4],each=6),
rep(ak[1:3],each=3), rep(ak[4],each=6)))
dat
dat2 = data.frame(name=c(rep(al[1:3],each=3), rep(al[4],each=6),
rep(ak[1:3],each=3), rep(ak[4],each=6)),
city=c(rep("x",each=15), rep("y",each=15)))
dat2
Of course, in real life, the 'x' and 'y' won't be single values.
So my question is whether there is a more efficient way of performing this task. A closely related question: when does it become important to ditch procedural programming in favor of OOP in R? (I'm not a programmer, so the second part may be a really stupid question.) More importantly, is this a task where I should look for an OOP-based solution?

According to ?rep, times= can be a vector. So, how about this:
dat <- data.frame(name=rep(al, times=c(3,3,3,6)))
It would also be more convenient if your "state" data were in a list.
stateData <- list(al,ak,az,ar)
Data <- lapply(stateData, function(x) data.frame(name=rep(x, times=c(3,3,3,6))))
Data <- do.call(rbind, Data)

I think you can combine the times argument of rep with sapply() to work through a list. So first, we need to make our list object:
vars <- list(al, ak, az, ar)
# Iterate through each object in vars. By default, this returns a column for each list item.
# Convert to vector and then to data.frame...This is probably not that efficient.
as.data.frame(as.vector(sapply(vars, function(x) rep(x, times = c(3,3,3,6)))))
1 AlabamaCity
2 AlabamaCity
3 AlabamaCity
4 AlabamaCityST
....snip....
....snip....
57 ArkansasZipCode
58 ArkansasZipCode
59 ArkansasZipCode
60 ArkansasZipCode

You might consider using expand.grid, then paste on the results from that.


Most practical way to add a value to a particular list, based on a variable

The title may sound a bit weird, but the situation is not so much.
I have some lists.
Here they are initialized (global variables):
sensor0, sensor1, sensor2, sensor3 = ([] for i in range(4))
Now we have a variable that indicates the list that I can insert data into. Let's say it's selector = 3.
This means I should append() my value to the sensor3 list.
What is the most practical way to do this in Python?
If it were a C-style language, I would use a switch-case.
But there is no switch-case syntax in Python. Of course I could use multiple ifs, but that does not seem like the best way to do it.
Since only one character of the list name changes each time, I wonder whether there is a better way to select the proper list to append() to, based on the selector variable.
For your specific case, use eval
sensor0, sensor1, sensor2, sensor3 = ([] for i in range(4))
sensor3.append(20)  # append to the list rather than rebinding the name
selector = 3
result = eval("sensor" + str(selector))  # looks up the variable named sensor3
print(result)
But using a list may be a better option.
Using List
sensor = [[] for i in range(4)]
selector = 3
sensor[selector].append(20)  # append to the selected inner list
print(sensor[selector])
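
As a minimal sketch of the list-based approach, the selector can index straight into a list of lists, or into a dictionary when the sensors are better identified by name. The names `sensors`, `record`, and `sensors_by_name` are made up for illustration and do not come from the question.

```python
# One list per sensor, held in a single outer list instead of four separate names.
sensors = [[] for _ in range(4)]

def record(selector, value):
    """Append value to the sensor list chosen by selector (0 to 3)."""
    sensors[selector].append(value)

record(3, 20)
record(3, 21)
record(0, 5)
print(sensors)  # [[5], [], [], [20, 21]]

# A dictionary works the same way when the keys are names rather than indices.
sensors_by_name = {"sensor%d" % i: [] for i in range(4)}
sensors_by_name["sensor3"].append(20)
print(sensors_by_name["sensor3"])  # [20]
```

Either form avoids both the chain of ifs and the eval lookup, because the selector is used as data rather than as part of a variable name.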

Scala string manipulation

I have the following Scala code :
val res = for {
i <- 0 to 3
j <- 0 to 3
if (similarity(z(i),z(j)) < threshold) && (i<=j)
} yield z(j)
z here represents Array[String] and similarity(z(i),z(j)) calculates similarity between two strings.
The problem works like this: similarity is calculated between the 1st string and all the other strings, then between the 2nd string and all other strings except the first, then for the 3rd string, and so on.
My requirement is that if the 1st string matches the 3rd, 4th and 8th strings, then all these 3 strings shouldn't participate in further loops, and the loop should jump to the 2nd string, then the 5th string, the 6th string and so on.
I am stuck at this step and don't know how to proceed further.
I am presuming that your intent is to keep the first String of two similar Strings (eg. if 1st String is too similar to 3rd, 4th, and 8th Strings, keep only the 1st String [out of these similar strings]).
I have a couple of ways to do this. They both work, in a sense, in reverse: for each String, if it is too similar to any later Strings, then that current String is filtered out (not the later Strings). If you first reverse the input data before applying this process, you will find that the desired outcome is produced (although in the first solution below the resulting list is itself reversed - so you can just reverse it again, if order is important):
1st way (likely easier to understand):
def filterStrings(z: Array[String]) = {
  val revz = z.reverse
  val filtered = for {
    // 'until' excludes revz.length itself; 'to' would index past the end of the array
    i <- 0 until revz.length if !revz.drop(i+1).exists(zz => similarity(zz, revz(i)) < threshold)
  } yield revz(i)
  filtered.reverse // re-reverses output if order is important
}
The 'drop' call is to ensure that each String is only checked against later Strings.
2nd option (fully functional, but harder to follow):
val filtered = z.reverse.foldLeft((List.empty[String],z.reverse)) { case ((acc, zt), zz) =>
(if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) acc else zz :: acc, zt.tail)
}._1
I'll try to explain what is going on here (in case you - or any readers - aren't used to following folds):
This uses a fold over the reversed input data, starting from the empty List (to accumulate results) and the (reverse of the) remaining input data (to compare against - I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry)
If there is a match, just the existing accumulator (labelled acc) will be allowed through, otherwise, add the current entry (zz) to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a reducing set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.
If I understand correctly, you want to loop through the elements of the array, comparing each element to later elements, and removing ones that are too similar as you go.
You can't (easily) do this within a simple loop. You'd need to keep track of which items had been filtered out, which would require another array of booleans, which you update and test against as you go. It's not a bad approach and is efficient, but it's not pretty or functional.
So you need to use a recursive function, and this kind of thing is best done using an immutable data structure, so let's stick to List.
def removeSimilar(xs: List[String]): List[String] = xs match {
  case Nil => Nil
  // assumes, as in the question, that similarity(...) < threshold means "too similar"
  case y :: ys => y :: removeSimilar(ys filterNot {x => similarity(y, x) < threshold})
}
It's a simple recursive function. Not much to explain: if xs is empty, it returns the empty list, else it adds the head of the list to the function applied to the filtered tail.

Can this be done without Quasi Quoter?

I have a tiny DSL that actually works quite well. When I say
import language.CWMWL
main = runCWMWL $ do
out (matrixMult, A, 1, row, 1 3 44 6 7)
then runCWMWL is a function that is exported by language.CWMWL. This parses the expression and takes some action.
What I want to achieve is some way to repeat this e.g. 1000 times and have the third element of the tuple take the values 1 to 1000. My own DSL is not complete enough to do this. Eventually I want to change the string in the last element as well.
Is there any possibility to do this without Quasi Quotes? Are Quasi Quotes the best tool for this?
What binops / primitives would my DSL need to contain or need to wrap in order to allow this in an elegant way?
Unless I'm misunderstanding, I don't think quasiquotation will get you something much nicer than
main = runCWMWL $
sequence [ out (matrixMult, A, n, row, 1 3 44 6 7) | n <- [1..1000] ]
You might also look into MonadComprehensions as well as RebindableSyntax for other ideas.

Python failure to find all duplicates

This is related to random sampling. I am using random.sample(number,5) to return a list of random numbers from within a range of numbers contained in numbers. I am using while i < 100 to return one hundred sets of five numbers. To check for duplicates, I am using :
if len(numbers) != len(set(numbers)):
to identify sets with duplicates and following this with random.sample(number,5) to try to do another randomisation to replace the set with duplicates. I seem to get about 8% getting re-randomised ( using a print statement to say which number was duplicated), but about 5% seem to be missed. What am I doing incorrectly? The actual code is as follows:
while i < 100:
set1 = random.sample(numbers1,5)
if len(set1) != len(set(set1))
print('duplicate(s) found, random selection repeated')
set1 = random.sample(numbers1,5)
In another routine I am trying to do the same as above, but searching for duplicates in two sets by adding the same, substituting set2 for set1. This gives the same sorts of failures. The set2 routine is indented and placed immediately below the above routine. While i < 100: is not repeated for set2.
I hope that I have explained my problem clearly!!
There is nothing in your code to stop the second sample from having duplicates. What if you did something like a second while loop?
while i<100:
    i+=1
    set1 = random.sample(numbers1,5)
    while len(set1) != len(set(set1)):
        print('duplicate(s) found, random selection repeated')
        set1 = random.sample(numbers1,5)
Of course you're still missing the part of the code that does something... beyond the above it's difficult to tell what you might need to change without a full code sample.
EDIT: here is a working version of the code sample from the comments:
def choose_random(list1,n):
    import random
    i = 0
    set_list=[]
    # range() must be converted to a list before concatenating in Python 3
    major_numbers=list(range(1,50)) + list1
    print(major_numbers)
    while i <n:
        set1 =random.sample(major_numbers,5)
        set2 =random.sample(major_numbers,2)
        # keep redrawing until all five values are distinct
        while len(set(set1)) != len(set1):
            print("Duplicate found at %i"%i)
            print(set1)
            print("Changing to:")
            set1 =random.sample(major_numbers,5)
            print(set1)
        set_list.append([set1,set2])
        i +=1
    return set_list
The code you give obviously has some gaps in it and cannot work as it is there, so I cannot pinpoint where exactly your error is, but running set1 = random.sample(numbers1,5) after the end of the while loop (which is infinite if written as in your question) undoes everything you did before, because it overwrites whatever you managed to set set1 to.
Anyway, random.sample should give you a sample without replacement. If you have any repetitions in random.sample(numbers1, 5) that means that you already have repetitions in numbers1. If that is not supposed to be the case, you should check the content of numbers1 and maybe force it to contain everything uniquely, for example by sampling from list(set(numbers1)) instead.
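
A small sketch of that point, using made-up populations (`unique_population` and `repeated_population` are illustrative names, not from the question): sampling from distinct values never repeats, so any repeat must come from the population itself. Recent Python versions require a sequence rather than a set as the population, which is why the deduplicated values are turned back into a list here.

```python
import random

unique_population = list(range(1, 50))            # every value appears once
repeated_population = unique_population + [7, 7]  # 7 now appears three times

# Sampling from a population of distinct values can never produce a repeat.
for _ in range(1000):
    draw = random.sample(unique_population, 5)
    assert len(draw) == len(set(draw))

# Removing duplicates from the population first restores that guarantee.
deduplicated = sorted(set(repeated_population))
draw = random.sample(deduplicated, 5)
print(draw, len(draw) == len(set(draw)))
```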
If the reason is that you want some elements from numbers1 with higher probability, you might want to put this as
set1 = random.sample(numbers1, 5)
while len(set1) != len(set(set1)):
    set1 = random.sample(numbers1, 5)
This is a possibly infinite loop, but if numbers1 contains at least 5 different elements, it will exit the loop at some point. If you don't like the theoretical possibility of this loop never exiting, you should probably use a weighted sample instead of random.sample, (there are a few examples of how to do that here on stackoverflow) and remove the numbers you have already chosen from the weights table.
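
One possible shape for such a weighted sample is sketched below. It is only an illustration under assumptions: the helper name `weighted_sample_without_replacement` and the weights table are hypothetical, and k is assumed not to exceed the number of items. It draws one item at a time with `random.choices` and deletes each pick from the table so it cannot be drawn again.

```python
import random

def weighted_sample_without_replacement(weights, k):
    """Pick k distinct keys from weights (a dict of item -> positive weight)."""
    remaining = dict(weights)  # copy so the caller's table is left untouched
    chosen = []
    for _ in range(k):
        items = list(remaining)
        pick = random.choices(items, weights=[remaining[i] for i in items])[0]
        chosen.append(pick)
        del remaining[pick]  # a chosen item can no longer be drawn
    return chosen

weights = {n: 1 for n in range(1, 50)}
weights[7] = 3  # hypothetical: 7 should be three times as likely as the others
result = weighted_sample_without_replacement(weights, 5)
print(result)
```

Unlike the retry loop, this always finishes after exactly k draws.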

whats another way to write python3 zip [closed]

I've been working on code that reads lines from a file and then organizes them. However, I got stuck at one point and my friend told me what I could use. The code works, but I don't understand what he is doing at lines 7 and 8 FROM THE BOTTOM. I marked them with #### so you know which lines they are.
So, essentially, how can you rewrite those two lines of code, and why do they work? I don't seem to understand dictionaries.
from sys import argv
filename = input("Please enter the name of a file: ")
file_in=(open(filename, "r"))
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
animaldictionary = dict()
for line in file_in:
    if '\n' == line[-1]:
        line = line[:-1]
    (a, b, c) = line.split(':')
    ac = (a,c)
    if ac not in animaldictionary:
        animaldictionary[ac] = 0
    animaldictionary[ac] += 1
alla = []
for key, value in animaldictionary:
    if key not in alla:
        alla.append(key)
print ("alla:",alla)
allc = []
for key, value in animaldictionary:
    if value not in allc:
        allc.append(value)
print("allc", allc)
for a in sorted(alla):
    print('%9s'%a,end=' '*13)
    for c in sorted(allc):
        ac = (a,c)
        valc = 0
        if ac in animaldictionary:
            valc = animaldictionary[ac]
        print('%4d'%valc,end=' '*19)
    print()
print("="*60)
print("Animals that visited both stations at least 3 times: ")
for a in sorted(alla):
    x = 'false'
    for c in sorted(allc):
        ac = (a,c)
        count = 0
        if ac in animaldictionary:
            count = animaldictionary[ac]
        if count >= 3:
            x = 'true'
    if x is 'true':
        print('%6s'%a, end=' ')
print("")
print("="*60)
print("Average of the number visits in each month for each station:")
#(alla, allc) =
#for s in zip(*animaldictionary.keys()):
# (alla,allc).append(s)
#print(alla, allc)
(alla,allc,) = (set(s) for s in zip(*animaldictionary.keys())) ##### how else can you write this
##### how else can you rewrite the next code
print('\n'.join(['\t'.join((c,str(sum(animaldictionary.get(ac,0) for a in alla for ac in ((a,c,),))//12)))for c in sorted(allc)]))
print("="*60)
print("Month with the maximum number of visits for each station:")
print("Station Month Number")
print("1")
print("2")
The two lines you indicated are indeed rather confusing. I'll try to explain them as best I can, and suggest alternative implementations.
The first one computes values for alla and allc:
(alla,allc,) = (set(s) for s in zip(*animaldictionary.keys()))
This is nearly equivalent to the loops you've already done above to build your alla and allc lists. You can skip it completely if you want. However, let's unpack what it's doing, so you can actually understand it.
The innermost part is animaldictionary.keys(). This returns an iterable object that contains all the keys of your dictionary. Since the keys in animaldictionary are two-valued tuples, that's what you'll get from the iterable. It's actually not necessary to call keys when dealing with a dictionary in most cases, since operations on the keys view are usually identical to doing the same operation on the dictionary directly.
Moving on, the keys get wrapped up by a call to the zip function using zip(*keys). There are two things happening here. First, the * syntax unpacks the iterable from above into separate arguments. So if animaldictionary's keys were ("a1", "c1"), ("a2", "c2"), ("a3", "c3") this would call zip with those three tuples as separate arguments. Now, what zip does is turn several iterable arguments into a single iterable, yielding a tuple with the first value from each, then a tuple with the second value from each, and so on. So zip(("a1", "c1"), ("a2", "c2"), ("a3", "c3")) would return an iterator yielding ("a1", "a2", "a3") followed by ("c1", "c2", "c3").
The next part is a generator expression that passes each value from the zip expression into the set constructor. This serves to eliminate any duplicates. set instances can also be useful in other ways (e.g. finding intersections) but that's not needed here.
Finally, the two sets of a and c values get assigned to variables alla and allc. They replace the lists you already had with those names (and the same contents!).
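
The steps just described can be run one at a time. This is a sketch with made-up dictionary contents; the real dictionary is built from the input file, and the `keys` and `columns` names are only there to expose the intermediate values.

```python
# Hypothetical contents standing in for the dictionary built from the file.
animaldictionary = {("a1", "s1"): 4, ("a1", "s2"): 1, ("a2", "s1"): 2}

keys = list(animaldictionary.keys())
print(keys)          # [('a1', 's1'), ('a1', 's2'), ('a2', 's1')]

# zip(*keys) regroups the pairs: all first parts, then all second parts.
columns = list(zip(*keys))
print(columns)       # [('a1', 'a1', 'a2'), ('s1', 's2', 's1')]

# Wrapping each column in set() drops the duplicates.
alla, allc = (set(column) for column in columns)
print(sorted(alla))  # ['a1', 'a2']
print(sorted(allc))  # ['s1', 's2']
```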
You've already got an alternative to this, where you calculate alla and allc as lists. Using sets may be slightly more efficient, but it probably doesn't matter too much for small amounts of data. Another, more clear, way to do it would be:
alla = set()
allc = set()
for key in animaldictionary: # note, iterating over a dict yields the keys!
    a, c = key # unpack the tuple key
    alla.add(a)
    allc.add(c)
The second line you were asking about does some averaging and combines the results into a giant string which it prints out. It is really bad programming style to cram so much into one line. And in fact, it does some needless stuff which makes it even more confusing. Here it is, with a couple of line breaks added to make it all fit on the screen at once.
print('\n'.join(['\t'.join((c,str(sum(animaldictionary.get(ac,0)
for a in alla for ac in ((a,c,),))//12)
)) for c in sorted(allc)]))
The innermost piece of this is for ac in ((a,c,),). This is silly, since it's a loop over a 1-element tuple. It's a way of renaming the tuple (a,c) to ac, but it is very confusing and unnecessary.
If we replace the one use of ac with the tuple explicitly written out, the new innermost piece is animaldictionary.get((a,c),0). This is a special way of writing animaldictionary[(a, c)] but without running the risk of causing a KeyError to be raised if (a, c) is not in the dictionary. Instead, the default value of 0 (passed in to get) will be returned for nonexistent keys.
That get call is wrapped up in this: (getcall for a in alla). This is a generator expression that gets all the values from the dictionary with a given c value in the key (with a default of zero if the value is not present).
The next step is taking the average of the values in the previous generator expression: sum(genexp)//12. This is pretty straightforward, though you should note that using // for division always rounds down to the nearest integer (floor division). If you want a more precise floating point value, use just /.
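
The get default and the two kinds of division can be checked in isolation. The `counts` dictionary and its values below are invented for the illustration.

```python
counts = {("a1", "s1"): 30, ("a2", "s1"): 11}  # hypothetical data

print(counts.get(("a1", "s1"), 0))  # 30, the key exists
print(counts.get(("a3", "s1"), 0))  # 0, a missing key falls back to the default

# Sum over every animal for station "s1"; "a3" contributes the default 0.
total = sum(counts.get((a, "s1"), 0) for a in ["a1", "a2", "a3"])
print(total)        # 41
print(total // 12)  # 3, floor division discards the remainder
print(total / 12)   # true division gives a float, roughly 3.4167
```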
The next part is a call to '\t'.join, with an argument that is a single (c, avg) tuple. This is an awkward construction that could be more clearly written as c+"\t"+str(avg) or "{}\t{}".format(c, avg). All of these result in a string containing the c value, a tab character and the string form of the average calculated above.
The next step is a list comprehension, [joinedstr for c in sorted(allc)] (where joinedstr is the join call in the previous step). Using a list comprehension here is a bit odd, since there's no need for a list (a generator expression would do just as well).
Finally, the list comprehension is joined with newlines and printed: print("\n".join(listcomp)). This is straightforward.
Anyway, this whole mess can be rewritten in a much clearer way, by using a few variables and printing each line separately in a loop:
for c in sorted(allc):
    total_values = sum(animaldictionary.get((a,c),0) for a in alla)
    average = total_values // 12
    print("{}\t{}".format(c, average))
To finish, I have some general suggestions.
First, your data structure may not be optimal for the uses you are making of your data. Rather than having animaldict be a dictionary with (a,c) keys, it might make more sense to have a nested structure, where you index each level separately. That is, animaldict[a][c]. It might even make sense to have a second dictionary containing the same values indexed in the reverse order (e.g. one is indexed [a][c] while another is indexed [c][a]). With this approach you might not need the alla and allc lists for iterating (you'd just loop over the contents of the main dictionary directly).
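
A rough sketch of that nested layout, assuming input lines with three colon-separated fields as in the question's split(':') call. The sample lines and the names `visits` and `by_station` are hypothetical, and the middle field is ignored because its meaning is not stated.

```python
from collections import defaultdict

# Hypothetical input lines: animal id, an unused middle field, station.
lines = ["a1:x:s1", "a1:y:s1", "a2:y:s2"]

visits = defaultdict(lambda: defaultdict(int))      # visits[animal][station]
by_station = defaultdict(lambda: defaultdict(int))  # by_station[station][animal]

for line in lines:
    a, _, c = line.split(":")
    visits[a][c] += 1
    by_station[c][a] += 1

# No separate alla/allc lists are needed; the outer keys play that role.
for a in sorted(visits):
    print(a, dict(visits[a]))
# a1 {'s1': 2}
# a2 {'s2': 1}

print(sum(by_station["s1"].values()))  # 2
```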
My second suggestion is about code style. Many of your variables are named poorly, either because their names don't have any meaning (e.g. c) or because the names imply a meaning that is incorrect. The most glaring issue is your key and value variables, which in fact unpack the two pieces of the key (AKA a and c). In other situations you can get keys and values together, but only when you are iterating over a dictionary's items() view rather than over the dictionary directly.
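
The difference between the two kinds of iteration can be seen in a short sketch; the dictionary contents and the names `pairs` and `triples` are invented for the example.

```python
animaldictionary = {("a1", "s1"): 4, ("a2", "s2"): 1}  # hypothetical data

# Iterating the dictionary yields keys only; unpacking splits each tuple key,
# so the two loop variables are the animal and the station, not key and value.
pairs = [(animal, station) for animal, station in animaldictionary]
print(pairs)    # [('a1', 's1'), ('a2', 's2')]

# items() yields (key, value) pairs, so the count is available as well.
triples = [(animal, station, count)
           for (animal, station), count in animaldictionary.items()]
print(triples)  # [('a1', 's1', 4), ('a2', 's2', 1)]
```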
