Python failure to find all duplicates

Python failure to find all duplicates - python-3.x

This is related to random sampling. I am using random.sample(numbers, 5) to return a list of random numbers drawn from the range of values contained in numbers. I am using while i < 100 to return one hundred sets of five numbers. To check for duplicates, I am using:
if len(numbers) != len(set(numbers)):
to identify sets with duplicates, followed by random.sample(numbers, 5) to re-randomise and replace the set that has duplicates. About 8% of the sets seem to get re-randomised (I use a print statement to report the duplication), but about 5% seem to be missed. What am I doing incorrectly? The actual code is as follows:
while i < 100:
    set1 = random.sample(numbers1,5)
    if len(set1) != len(set(set1)):
        print('duplicate(s) found, random selection repeated')
        set1 = random.sample(numbers1,5)
In another routine I am trying to do the same as above, but searching for duplicates in two sets by adding the same, substituting set2 for set1. This gives the same sorts of failures. The set2 routine is indented and placed immediately below the above routine. While i < 100: is not repeated for set2.
I hope that I have explained my problem clearly!!

There is nothing in your code to stop the second sample from having duplicates. What if you did something like a second while loop?
while i<100:
    i+=1
    set1 = random.sample(numbers1,5)
    while len(set1) != len(set(set1)):
        print('duplicate(s) found, random selection repeated')
        set1 = random.sample(numbers1,5)
Of course you're still missing the part of the code that does something... beyond the above it's difficult to tell what you might need to change without a full code sample.
EDIT: here is a working version of the code sample from the comments:
import random

def choose_random(list1, n):
    i = 0
    set_list = []
    # list() is needed on Python 3, where range() no longer returns a list
    major_numbers = list(range(1, 50)) + list1
    print(major_numbers)
    while i < n:
        set1 = random.sample(major_numbers, 5)
        set2 = random.sample(major_numbers, 2)
        # keep redrawing until the five numbers are all distinct
        while len(set(set1)) != len(set1):
            print("Duplicate found at %i" % i)
            print(set1)
            print("Changing to:")
            set1 = random.sample(major_numbers, 5)
            print(set1)
        set_list.append([set1, set2])
        i += 1
    return set_list

The code you give obviously has some gaps in it and cannot work as it is there, so I cannot pinpoint where exactly your error is, but running set1 = random.sample(numbers1,5) after the end of the while loop (which is infinite if written as in your question) undoes everything you did before, because it overwrites whatever you managed to set set1 to.
Anyway, random.sample should give you a sample without replacement. If you have any repetitions in random.sample(numbers1, 5) that means that you already have repetitions in numbers1. If that is not supposed to be the case, you should check the content of numbers1 and maybe force it to contain everything uniquely, for example by using set(numbers1) instead.
If the reason is that you want some elements from numbers1 with higher probability, you might want to put this as
set1 = random.sample(numbers1, 5)
while len(set1) != len(set(set1)):
    set1 = random.sample(numbers1, 5)
This is a potentially infinite loop, but if numbers1 contains at least 5 different elements, it will almost surely exit at some point. If you don't like the theoretical possibility of this loop never exiting, you should probably use a weighted sample instead of random.sample (there are a few examples of how to do that here on Stack Overflow) and remove the numbers you have already chosen from the weights table.
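As a rough sketch of that last idea (not code from the question): the function name weighted_unique_sample and the sample data are made up for illustration, and the sketch assumes that the number of times a value appears in numbers1 is meant to be its weight.

```python
import random
from collections import Counter


def weighted_unique_sample(population, k, rng=random):
    """Pick k distinct values; values repeated in population are more likely."""
    # Assumption: the multiplicity of each value acts as its weight.
    weights = Counter(population)
    if k > len(weights):
        raise ValueError("population has fewer than k distinct values")
    chosen = []
    for _ in range(k):
        values = list(weights)
        counts = [weights[v] for v in values]
        # random.choices draws one value with probability proportional to its count.
        pick = rng.choices(values, weights=counts, k=1)[0]
        chosen.append(pick)
        # Removing the value from the weights table rules out a repeat.
        del weights[pick]
    return chosen


numbers1 = list(range(1, 50)) + [7, 7, 13]  # hypothetical data with repeats
set1 = weighted_unique_sample(numbers1, 5)
print(set1)
```

This always finishes after exactly k draws, so there is no retry loop at all.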

Related

DC3/Skew Suffix Array Algorithm doesn't work for specific cases

When applying the DC3/Skew algorithm to the string yabadabado, I can't quite get it to sort correctly. This issue happens in other cases, but this is a short example to show it.
This first table is for reference:
These are the triples of R12
We have a tie between i = 1 and i = 5, since both of their triples are aba.
We now need to get the suffix array of the ranks R' through recursing, but we can quickly break this tie since i_1 = [1,3] > i_5 = [1,2] which implies that the suffix starting at i = 5 should come before i = 1. Recursing returns the same result with R'5 < R'1.
So applying these results puts the relative order of those two suffixes as:
S[5,] == abado // Comes first (smaller)
S[1,] == abadabado
But abadabado < abado. I've been looking at this for a while, and can't seem to figure out where I stray from the algorithm.
I'm hoping someone with more experience using the algorithm can point me in the right direction.
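One way to get a reference result to compare each step against is a brute-force suffix sort. This is only a checking aid, not DC3 itself, and the helper name naive_suffix_array is made up here:

```python
def naive_suffix_array(text):
    """Return suffix start positions in lexicographic order (brute force)."""
    # Sorting by the suffix itself is slow for long strings, but it is
    # easy to trust on a short example.
    return sorted(range(len(text)), key=lambda i: text[i:])


s = "yabadabado"
order = naive_suffix_array(s)
for i in order:
    print(i, s[i:])
```

In this reference ordering the suffix starting at i = 1 comes before the one starting at i = 5, so any intermediate ranking that puts 5 ahead of 1 is where the hand computation goes wrong.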

How can I remove incorrect results from memory?

import itertools
printable = 'abcdefghijklmnopqrstuvwxyz'
all_possibilities = [''.join(i) for i in itertools.product(printable, repeat = 3)]
comparison = ['zd']
if comparison in all_possibilities:
    print("match")
This is a snippet of my code. My intention is to generate every single combination of the alphabet. The snippet here has a limit of three characters. With the limit too large, Python raises a MemoryError. My question is:
Is there a way to remove from memory the combinations that did not match, so that the only limitation is time instead of memory? Say, if the character limit was 5? Any further reading on this would be helpful too.

The main fault is that you're first creating the full list and then trying to filter it out. That's an issue because you bring the full list in memory.
You'd be better off adding your condition inside the list comprehension, keeping only the elements you'll actually need:
all_posibilities = ["".join(i) for i in itertools.product(printable, repeat = 5) if 'a' in i or 'b' in i]
## place your condition in the trailing if clause above
print(len(all_posibilities))
Alternatively, if you're looking to iterate through all_posibilities in the end, it makes sense to create a generator to further limit the memory footprint:
all_posibilities = ("".join(i) for i in itertools.product(printable, repeat = 5) if 'a' in i or 'b' in i)
for i in all_posibilities:
    pass  # do things
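If all you need is a yes/no answer for one target string, a minimal sketch along the same lines could look like this. The function name has_match and the choice of string.ascii_lowercase as the alphabet are assumptions for illustration, not part of the question's code:

```python
import itertools
import string


def has_match(target, alphabet=string.ascii_lowercase):
    """Lazily scan all strings of len(target) over alphabet for target."""
    candidates = itertools.product(alphabet, repeat=len(target))
    # any() stops at the first hit, and each rejected candidate is discarded
    # before the next one is built, so memory use stays flat.
    return any("".join(c) == target for c in candidates)


print(has_match("zdq"))
```

With this shape the only limit is time: nothing that failed to match is kept around.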

I made the following binary search algorithm, and I am trying to improve it. Suggestions?

I am doing a homework assignment where I have to check large lists of numbers for a given number. The list length is <= 20000, and I can be searching for just as many numbers. If the number we are searching for is in the list, return the index of that number; otherwise, return -1. Here is what I did.
I wrote the following code, which outputs the correct answer but does not do it fast enough. It has to finish in less than 1 second.
Here is my binary search code. I am looking for suggestions to make it faster.
def binary_search(list1, target):
    p = list1
    upper = len(list1)
    lower = 0
    found = False
    check = int((upper+lower)//2)
    while found == False:
        upper = len(list1)
        lower = 0
        check = int(len(list1)//2)
        if list1[check] > target:
            list1 = list1[lower:check]
            check= int((len(list1))//2)
        if list1[check] < target:
            list1 = list1[check:upper]
            check = int((len(list1))//2)
        if list1[check] == target:
            found = True
            return p.index(target)
        if len(list1)==1:
            if target not in list1:
                return -1
Grateful for any help.

The core problem is that this is not a correctly written Binary Search (see the algorithm there).
There is no need for index(target) to find the solution; a binary search is O(lg n), so the very presence of index-of, being O(n), violates this! As written, the entire "binary search" function could be replaced with list1.index(target), as it is "finding" the value's index simply to "find" the value's index.
The code is also slicing the lists, which is not needed [1]; simply move the upper/lower indices to "home in" on the value. The index of the found value is where the upper and lower bounds eventually meet.
Also, make sure that list1 is really a list, so that item access is O(1).
(And int is not needed with //.)
[1] With the slice, the number of iterations is still O(lg n), but each slice copies the items it keeps, so the total work becomes O(n). It is also a non-idiomatic binary search and adds the overhead of creating new lists. It also doesn't allow the correct generation of the found item's index, at least without similar variable maintenance as found in a traditional implementation.
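For illustration, a conventional index-based version might look like the sketch below. It assumes the list is already sorted in ascending order (the question does not say so explicitly), and the parameter name sorted_list is my own:

```python
def binary_search(sorted_list, target):
    """Return an index of target in sorted_list, or -1 if it is absent."""
    lower = 0
    upper = len(sorted_list) - 1
    # Narrow the [lower, upper] window instead of slicing the list.
    while lower <= upper:
        check = (lower + upper) // 2
        if sorted_list[check] < target:
            lower = check + 1
        elif sorted_list[check] > target:
            upper = check - 1
        else:
            return check
    return -1


print(binary_search([1, 3, 5, 7, 9], 7))
```

No slices and no index() call are involved, so each lookup touches only O(lg n) items. The standard library's bisect.bisect_left does the same narrowing if you would rather not write the loop yourself.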

Try using elif (else if). For example, if the value that's being checked is greater than the target, you don't also need to check whether it's smaller.

Doesn't accept the list index?

I have this piece of code:
n = int (input ('Enter the Number of Players: '))
m = [[j] for j in range (0, n)]
all_names= []
i = 0
while n > 1:
    m[i] = input('Player {0}: '.format (i+1))
    all_names.extend ([m[i]])
    if m[i][0] != m[i-1][-1]:
        b= m.pop (i)
        n = n-1
    if all_names.count (m[i]) == 2:
        n = n-1
        b= m.pop (i)
    i = i+1
It says the index is out of range (at the second if clause), but I don't get it. Why?

I hate to not answer your question directly, but what you're trying to do seems... really confusing. Python has a sort of rule that there's supposed to be a really clear, clean way of doing things, so if a piece of code looks really funky (especially for such a simple function), it's probably not using the right approach.
If you just want to create a container of names, there are numerous simpler ways of doing it:
players=int(input("How many players?\n"))
player_names=set()
while len(player_names)<players:
    player_names.add(input("What is player {}'s name?\n".format(len(player_names)+1)))
... will give you a set of unique player names, although it won't be ordered. Order might matter (your implementation kept order, so maybe it does), in which case you could still use a list and add a small check to make sure you are adding a new name rather than a repeat:
players=int(input("How many players?\n"))
player_names=list()
while len(player_names)<players:
    playname=input("What is player {}'s name?\n".format(len(player_names)+1))
    if playname not in player_names:
        player_names.append(playname)
I'm open to someone haranguing me about dodging the question, particularly if there's a purpose/reason for the approach the questioner took.

The length of m decreases every time the code enters the first if clause. However, you increment i in every iteration. So, at the midpoint of the length of m (if the first clause is always entered) or a little later, i will be larger than the last valid index of m, and you will get an index out of range error.
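A stripped-down sketch of that effect, with the input handling removed (the function name first_bad_index is made up for this illustration): it pops on every pass while i keeps growing, and reports where the index first falls off the end.

```python
def first_bad_index(length):
    """Pop items[i] while incrementing i; report where it first fails."""
    items = list(range(length))
    i = 0
    while True:
        try:
            items.pop(i)  # the list shrinks by one on every pass
        except IndexError:
            # i has caught up with the shrinking list
            return i, len(items)
        i += 1


print(first_bad_index(6))
```

For a list that starts with six items, the failure happens once i reaches 3, which is the midpoint of the original length.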

What's another way to write python3 zip [closed]

Closed 10 years ago.
I've been working on code that reads lines from a file and then organizes them. However, I got stuck at one point, and my friend told me what I could use. The code works, but I don't understand what he is doing in lines 7 and 8 from the bottom. I marked them with ##### so you can see which lines they are.
So, essentially, how can you rewrite those two lines of code, and why do they work? I don't seem to understand dictionaries.
from sys import argv
filename = input("Please enter the name of a file: ")
file_in=(open(filename, "r"))
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
animaldictionary = dict()
for line in file_in:
    if '\n' == line[-1]:
        line = line[:-1]
    (a, b, c) = line.split(':')
    ac = (a,c)
    if ac not in animaldictionary:
        animaldictionary[ac] = 0
    animaldictionary[ac] += 1
alla = []
for key, value in animaldictionary:
    if key not in alla:
        alla.append(key)
print ("alla:",alla)
allc = []
for key, value in animaldictionary:
    if value not in allc:
        allc.append(value)
print("allc", allc)
for a in sorted(alla):
    print('%9s'%a,end=' '*13)
    for c in sorted(allc):
        ac = (a,c)
        valc = 0
        if ac in animaldictionary:
            valc = animaldictionary[ac]
        print('%4d'%valc,end=' '*19)
    print()
print("="*60)
print("Animals that visited both stations at least 3 times: ")
for a in sorted(alla):
    x = 'false'
    for c in sorted(allc):
        ac = (a,c)
        count = 0
        if ac in animaldictionary:
            count = animaldictionary[ac]
        if count >= 3:
            x = 'true'
    if x is 'true':
        print('%6s'%a, end=' ')
print("")
print("="*60)
print("Average of the number visits in each month for each station:")
#(alla, allc) =
#for s in zip(*animaldictionary.keys()):
#    (alla,allc).append(s)
#print(alla, allc)
(alla,allc,) = (set(s) for s in zip(*animaldictionary.keys())) ##### how else can you write this
##### how else can you rewrite the next code
print('\n'.join(['\t'.join((c,str(sum(animaldictionary.get(ac,0) for a in alla for ac in ((a,c,),))//12)))for c in sorted(allc)]))
print("="*60)
print("Month with the maximum number of visits for each station:")
print("Station Month Number")
print("1")
print("2")

The two lines you indicated are indeed rather confusing. I'll try to explain them as best I can, and suggest alternative implementations.
The first one computes values for alla and allc:
(alla,allc,) = (set(s) for s in zip(*animaldictionary.keys()))
This is nearly equivalent to the loops you've already done above to build your alla and allc lists. You can skip it completely if you want. However, let's unpack what it's doing, so you can actually understand it.
The innermost part is animaldictionary.keys(). This returns an iterable object that contains all the keys of your dictionary. Since the keys in animaldictionary are two-valued tuples, that's what you'll get from the iterable. It's actually not necessary to call keys when dealing with a dictionary in most cases, since operations on the keys view are usually identical to doing the same operation on the dictionary directly.
Moving on, the keys get wrapped up by a call to the zip function using zip(*keys). There are two things happening here. First, the * syntax unpacks the iterable from above into separate arguments. So if animaldictionary's keys were ("a1", "c1"), ("a2", "c2"), ("a3", "c3"), this would call zip with those three tuples as separate arguments. Now, what zip does is turn several iterable arguments into a single iterable, yielding a tuple with the first value from each, then a tuple with the second value from each, and so on. So zip(("a1", "c1"), ("a2", "c2"), ("a3", "c3")) would return an iterator yielding ("a1", "a2", "a3") followed by ("c1", "c2", "c3").
The next part is a generator expression that passes each value from the zip expression into the set constructor. This serves to eliminate any duplicates. set instances can also be useful in other ways (e.g. finding intersections) but that's not needed here.
Finally, the two sets of a and c values get assigned to variables alla and allc. They replace the lists you already had with those names (and the same contents!).
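To see those steps in isolation, here is a small sketch with made-up keys (the values in keys are hypothetical stand-ins for your real (a, c) tuples):

```python
# Hypothetical keys, shaped like the (a, c) tuples in animaldictionary.
keys = [("a1", "c1"), ("a2", "c2"), ("a3", "c1")]

# zip(*keys) is the same call as zip(("a1", "c1"), ("a2", "c2"), ("a3", "c1")):
# it groups the first items together and the second items together.
columns = list(zip(*keys))
print(columns)

# set() then drops the repeated "c1".
alla, allc = (set(column) for column in columns)
print(sorted(alla), sorted(allc))
```

The unpacking on the left-hand side only works because every key has exactly two parts, so zip yields exactly two columns.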
You've already got an alternative to this, where you calculate alla and allc as lists. Using sets may be slightly more efficient, but it probably doesn't matter too much for small amounts of data. Another, more clear, way to do it would be:
alla = set()
allc = set()
for key in animaldictionary: # note, iterating over a dict yields the keys!
    a, c = key # unpack the tuple key
    alla.add(a)
    allc.add(c)
The second line you were asking about does some averaging and combines the results into a giant string which it prints out. It is really bad programming style to cram so much into one line. And in fact, it does some needless stuff which makes it even more confusing. Here it is, with a couple of line breaks added to make it all fit on the screen at once.
print('\n'.join(['\t'.join((c,str(sum(animaldictionary.get(ac,0)
                                      for a in alla for ac in ((a,c,),))//12)
                            )) for c in sorted(allc)]))
The innermost piece of this is for ac in ((a,c,),). This is silly, since it's a loop over a 1-element tuple. It's a way of renaming the tuple (a,c) to ac, but it is very confusing and unnecessary.
If we replace the one use of ac with the tuple explicitly written out, the new innermost piece is animaldictionary.get((a,c),0). This is a special way of writing animaldictionary[(a, c)] but without running the risk of causing a KeyError to be raised if (a, c) is not in the dictionary. Instead, the default value of 0 (passed in to get) will be returned for nonexistent keys.
That get call is wrapped up in this: (getcall for a in alla). This is a generator expression that gets all the values from the dictionary with a given c value in the key (with a default of zero if the value is not present).
The next step is taking the average of the values in the previous generator expression: sum(genexp)//12. This is pretty straightforward, though you should note that using // for division always rounds down to the next integer. If you want a more precise floating point value, use just /.
The next part is a call to '\t'.join, with an argument that is a single (c, avg) tuple. This is an awkward construction that could be more clearly written as c+"\t"+str(avg) or "{}\t{}".format(c, avg). All of these result in a string containing the c value, a tab character and the string form of the average calculated above.
The next step is a list comprehension, [joinedstr for c in sorted(allc)] (where joinedstr is the join call in the previous step). Using a list comprehension here is a bit odd, since there's no need for a list (a generator expression would do just as well).
Finally, the list comprehension is joined with newlines and printed: print("\n".join(listcomp)). This is straightforward.
Anyway, this whole mess can be rewritten in a much clearer way, by using a few variables and printing each line separately in a loop:
for c in sorted(allc):
    total_values = sum(animaldictionary.get((a,c),0) for a in alla)
    average = total_values // 12
    print("{}\t{}".format(c, average))
To finish, I have some general suggestions.
First, your data structure may not be optimal for the uses you are making of your data. Rather than having animaldictionary be a dictionary with (a,c) keys, it might make more sense to have a nested structure, where you index each level separately. That is, animaldictionary[a][c]. It might even make sense to have a second dictionary containing the same values indexed in the reverse order (e.g. one is indexed [a][c] while another is indexed [c][a]). With this approach you might not need the alla and allc lists for iterating (you'd just loop over the contents of the main dictionary directly).
My second suggestion is about code style. Many of your variables are named poorly, either because their names don't have any meaning (e.g. c) or where the names imply a meaning that is incorrect. The most glaring issue is your key and value variables, which in fact unpack two pieces of the key (AKA a and c). In other situations you can get keys and values together, but only when you are iterating over a dictionary's items() view rather than on the dictionary directly.
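Here is a rough sketch of both suggestions together. The sample lines, the field meanings (animal, date, station) and the names visits, animal and station are my guesses for illustration; your file format may differ:

```python
from collections import defaultdict

# Hypothetical input lines in an "animal:date:station" shape.
lines = [
    "a01:01-24-2011:s1",
    "a01:01-24-2011:s2",
    "a02:01-25-2011:s1",
    "a01:02-02-2011:s1",
]

# Nested structure: visits[animal][station] is a count that starts at 0.
visits = defaultdict(lambda: defaultdict(int))
for line in lines:
    animal, _date, station = line.split(":")
    visits[animal][station] += 1

# items() yields real (key, value) pairs, so the names match their meaning.
for animal, per_station in sorted(visits.items()):
    for station, count in sorted(per_station.items()):
        print(animal, station, count)
```

With this layout there is no need to build separate lists of animals and stations before looping.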
