How can I remove incorrect results from memory? - python-3.x

import itertools

printable = 'abcdefghijklmnopqrstuvwxyz'
all_possibilities = [''.join(i) for i in itertools.product(printable, repeat=3)]
comparison = 'zd'
if comparison in all_possibilities:
    print("match")
This is a snippet of my code. My intention is to generate every single combination of the alphabet; the snippet here has a limit of three characters. With a limit that is too large, Python raises a MemoryError. My question is:
Is there a way to remove the non-matching combinations from memory, so that the only limitation is time instead of memory, say if the character limit were 5? Any further reading on this would be helpful too.

The main fault is that you first create the full list and then try to filter it, which means the entire list must fit in memory.
You'd be better off adding your condition inside the list comprehension, keeping only the elements you'll actually need:
all_possibilities = ["".join(i) for i in itertools.product(printable, repeat=5) if 'a' in i or 'b' in i]
##                                                                              ^^ place your condition here
print(len(all_possibilities))
Alternatively, if you're only going to iterate through all_possibilities in the end, it makes sense to create a generator instead, to further limit the memory footprint:

all_possibilities = ("".join(i) for i in itertools.product(printable, repeat=5) if 'a' in i or 'b' in i)
for i in all_possibilities:
    ...  # do things with each string
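If the goal is only to check whether a single candidate appears, note that no list (and no filtering) is needed at all: a lazy any() keeps memory constant regardless of the repeat length. A minimal sketch, where target is a hypothetical search string not taken from the original post:

import itertools

printable = 'abcdefghijklmnopqrstuvwxyz'
target = 'zdaaa'  # hypothetical 5-character candidate

# product() yields one tuple at a time; nothing is ever stored.
found = any(''.join(i) == target
            for i in itertools.product(printable, repeat=len(target)))
print("match" if found else "no match")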

Related

TypeError when applying sum to a list of strings [duplicate]

Python has a built-in function sum, which is effectively equivalent to:

import operator
from functools import reduce

def sum2(iterable, start=0):
    return start + reduce(operator.add, iterable)
for all types of parameters except strings. It works for numbers and lists, for example:
sum([1,2,3], 0) = sum2([1,2,3],0) = 6 #Note: 0 is the default value for start, but I include it for clarity
sum({888:1}, 0) = sum2({888:1},0) = 888
Why were strings specially left out?
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions on the Python mailing list about the reason, so an explanation or a link to a thread explaining it would be fine.
Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, while there is no such ban for, say, lists.
Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
Python tries to discourage you from "summing" strings. You're supposed to join them:
"".join(list_of_strings)
It's a lot faster, and uses much less memory.
A quick benchmark:
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.
BTW, this restriction has been in place "forever", i.e., since sum was added as a built-in function (rev. 32347).
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway.
>>> class ZeroObject(object):
...     def __add__(self, other):
...         return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup
In the builtin_sum function we have this bit of code:
    /* reject string values for 'start' parameter */
    if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
        PyErr_SetString(PyExc_TypeError,
            "sum() can't sum strings [use ''.join(seq) instead]");
        Py_DECREF(iter);
        return NULL;
    }
    Py_INCREF(result);
So.. that's your answer.
It's explicitly checked in the code and rejected.
From the docs:
The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence).
By making sum refuse to operate on strings, Python has encouraged you to use the correct method.
Short answer: Efficiency.
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
doubles are always the same size, which makes sum's running time O(1)×N = O(N).
int (formerly known as long) is arbitrary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).
Thus, summing strings would be much slower than summing numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
The Reason Why
@dan04 has an excellent explanation of the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, while not many use sum for lists and tuples, where repeated concatenation is also O(n**2). The trap is that sum works just fine for short lists of strings, but then gets put into production where the lists can be huge, and performance slows to a crawl. This was such a common trap that the decision was made to ignore duck typing in this instance and disallow strings with sum.
Edit: Moved the parts about immutability to history.
Basically, it's a question of preallocation. When you use a statement such as
sum(["a", "b", "c", ..., ])
and expect it to work similar to a reduce statement, the code generated looks something like
v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len("a") + len("b")
...
res = v9999 + "$" # must allocate res and set its size to len(v9999) + len("$")
In each of these steps a new string is created, which incurs some copying overhead as the strings get longer and longer. But that's perhaps not the point here. What's more important is that every new string on each line must be allocated at its specific size. I don't know whether it must allocate on every iteration of the reduce statement; there may be some obvious heuristics to use, and Python might allocate a bit more here and there for reuse, but at several points the new string will be large enough that this won't help any more and Python must allocate again, which is rather expensive.
A dedicated method like join, however, has the job of figuring out the real size of the string before it starts, and would therefore in theory only allocate once, at the beginning, and then just fill that new string, which is much cheaper than the other solution.
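A minimal sketch of that one-allocation idea (join_sketch is a hypothetical illustration, not how CPython actually implements str.join):

def join_sketch(strings):
    # Encode once so the exact byte sizes are known up front.
    parts = [s.encode() for s in strings]
    # Allocate the final buffer a single time...
    buf = bytearray(sum(len(p) for p in parts))
    pos = 0
    for p in parts:
        buf[pos:pos + len(p)] = p  # ...then copy each piece into place.
        pos += len(p)
    return buf.decode()

print(join_sketch(["foo", "bar", "baz"]))  # foobarbaz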
I don't know why, but this works! (Presumably because the explicit type check shown above lives inside sum itself, so reduce never goes through it.)

import operator
from functools import reduce  # reduce is not a builtin in Python 3

def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)

Finding item at index i of an iterable (syntactic sugar)?

I know that map, range, filter, etc. in Python 3 return iterables, and only calculate values when required. Suppose that there is a map M. I want to print the i-th element of M.
One way would be to iterate up to the i-th value and print it:

for _ in range(i):
    next(M)
print(next(M))

The above takes O(i) time to find the i-th value.
Another way is to convert it to a list and print the i-th value:

print(list(M)[i])

This, however, takes O(n) time and O(n) space (where n is the size of the list from which the map M is created), although it suits the so-called "Pythonic way of writing one-liners."
I was wondering if there is syntactic sugar to minimise the writing in the first way (i.e., a way which takes O(i) time, no extra space, and is more suited to the "Pythonic way of writing").
You can use islice:

from itertools import islice

i = 3
print(next(islice(M, i, i + 1)))

With i = 3, this prints the element of M at index 3. It actually doesn't matter what you use as the stop argument, as long as you call next once.
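To make the time/space claim concrete, a small sketch with a deliberately huge lazy map (the names follow the question):

from itertools import islice

M = map(lambda x: x * x, range(10**9))  # lazy: nothing is computed up front
i = 5
print(next(islice(M, i, i + 1)))        # 25 - O(i) time, O(1) extra space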
Thanks to @DeepSpace for the reference to the official docs, I found the following:

from more_itertools import nth

print(nth(M, i))

It prints the element at the i-th index of the iterable.
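For reference, nth is essentially the nth recipe from the itertools documentation, so there is no magic beyond islice:

from itertools import islice

def nth(iterable, n, default=None):
    "Returns the nth item or a default value."
    return next(islice(iterable, n, None), default)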

Simple adding two arrays using numpy in python?

This might be a simple question. However, I wanted to get some clarification of how the following code works.

a = np.arange(1, 8)
a
array([1, 2, 3, 4, 5, 6, 7])

b = a[0:-1] + a[1:]/2.0

In this expression, I want to draw your attention to the plus sign between a[0:-1] and a[1:]. How does that work? What does that look like?
For instance, is the plus sign (addition) adding the first index of each array (e.g. 1+2), or adding everything together (e.g. 1+2+2+3+3+4+4+5+5+6+6+7)?
Then, I assume /2.0 is just dividing it by 2...
A NumPy array uses vector algebra: you can only add two arrays if they have the same dimensions, because the addition happens element by element.

import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 1, 1])
a + b  # raises ValueError: the shapes do not match

whilst

a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 1, 1, 1, 1])
a + b  # OK: array([2, 3, 4, 5, 6])

The division is also element by element.
Now to your question about the indexing:

a = np.array([1, 2, 3, 4, 5])
a[0:-1]  # array([1, 2, 3, 4])
a[1:]    # array([2, 3, 4, 5])

More generally, a[index_start:index_end] is inclusive of index_start but exclusive of index_end, while a[index_start:] includes everything from index_start up to and including the last element.
My final tip is just to try and play around with the structures - there is no harm in trying different things; the computer will not explode from a wrong value here or there (unless you are trying to make that happen, of course). See the sketch below for a worked example.
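Putting the slicing and the element-wise addition together on the question's own array (a small sketch; note the parentheses - as written in the question, a[0:-1] + a[1:]/2.0 divides only a[1:] by 2 because of operator precedence):

import numpy as np

a = np.arange(1, 8)             # array([1, 2, 3, 4, 5, 6, 7])
print(a[0:-1] + a[1:])          # neighbour sums:      [ 3  5  7  9 11 13]
print((a[0:-1] + a[1:]) / 2.0)  # neighbour midpoints: [1.5 2.5 3.5 4.5 5.5 6.5]
print(a[0:-1] + a[1:] / 2.0)    # as written in the question: [2.  3.5 5.  6.5 8.  9.5]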
If the arrays have identical shapes, they can be added:

new_array = first_array.__add__(second_array)

This simple operation adds each value in first_array to the corresponding value in second_array and puts the result into new_array; it is what the + operator calls under the hood.
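A quick demonstration that the dunder method and the + operator agree (the array values are made up for illustration):

import numpy as np

first_array = np.array([1, 2, 3])
second_array = np.array([10, 20, 30])
print(first_array + second_array)         # array([11, 22, 33]) - the usual spelling
print(first_array.__add__(second_array))  # array([11, 22, 33]) - same thing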

I made the following binary search algorithm, and I am trying to improve it. Suggestions?

I am doing a homework assignment where I have to check large lists of numbers for a given number. The list length is <= 20000, and I can be searching for just as many numbers. If the number we are searching for is in the list, return the index of that number; otherwise, return -1. Here is what I did.
I wrote the following code, which outputs the correct answer, but does not do it fast enough: it has to finish in less than 1 second.
Here is my binary search code; I am looking for suggestions to make it faster.
def binary_search(list1, target):
    p = list1
    upper = len(list1)
    lower = 0
    found = False
    check = int((upper + lower)//2)
    while found == False:
        upper = len(list1)
        lower = 0
        check = int(len(list1)//2)
        if list1[check] > target:
            list1 = list1[lower:check]
            check = int((len(list1))//2)
        if list1[check] < target:
            list1 = list1[check:upper]
            check = int((len(list1))//2)
        if list1[check] == target:
            found = True
            return p.index(target)
        if len(list1) == 1:
            if target not in list1:
                return -1
Grateful for any help!
The core problem is that this is not a correctly written Binary Search (see the algorithm there).
There is no need for index(target) to find the solution; a binary search is O(lg n), and so the very presence of index(), which is O(n), violates this! As written, the entire "binary search" function could be replaced by list1.index(target), since it is "finding" the value's index simply in order to "find" the value's index.
The code also slices the lists, which is not needed1; simply move the upper/lower indices to "hone in" on the value, as in the sketch after this answer. The index of the found value is where the upper and lower bounds eventually meet.
Also, make sure that list1 is really a list such that the item access is O(1).
(And int is not needed with //.)
1 I believe that the complexity is still O(lg n) with the slicing, but it is a non-idiomatic binary search and adds the overhead of creating new lists and copying the relevant items. It also doesn't allow the correct computation of the found item's index - at least not without bookkeeping similar to that found in a traditional implementation.
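For concreteness, here is a sketch of the index-based binary search described above - the textbook bound-moving algorithm, not a fix of the original code in place:

def binary_search(list1, target):
    # Move the bounds instead of slicing: no copies, and the index is free.
    lower, upper = 0, len(list1) - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        if list1[mid] < target:
            lower = mid + 1
        elif list1[mid] > target:
            upper = mid - 1
        else:
            return mid
    return -1

The standard library's bisect module (e.g. bisect.bisect_left) implements the same bound-moving idea in C and is usually faster still.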
Try using elif: for example, if the value being checked is greater, you don't also need to check whether it is smaller.

Python failure to find all duplicates

This is related to random sampling. I am using random.sample(numbers, 5) to return a list of random numbers from within a range of numbers contained in numbers. I am using while i < 100 to return one hundred sets of five numbers. To check for duplicates, I am using:

if len(numbers) != len(set(numbers)):

to identify sets with duplicates, and following this with random.sample(numbers, 5) to do another randomisation to replace the set with duplicates. I seem to get about 8% being re-randomised (using a print statement to say which number was duplicated), but about 5% seem to be missed. What am I doing incorrectly? The actual code is as follows:
while i < 100:
    set1 = random.sample(numbers1, 5)
    if len(set1) != len(set(set1)):
        print('duplicate(s) found, random selection repeated')
        set1 = random.sample(numbers1, 5)
In another routine I am trying to do the same as above, but searching for duplicates in two sets, by adding the same code with set2 substituted for set1. This gives the same sort of failures. The set2 routine is indented and placed immediately below the routine above; while i < 100: is not repeated for set2.
I hope that I have explained my problem clearly!
There is nothing in your code to stop the second sample from having duplicates. What if you used a second while loop?
while i < 100:
    i += 1
    set1 = random.sample(numbers1, 5)
    while len(set1) != len(set(set1)):
        print('duplicate(s) found, random selection repeated')
        set1 = random.sample(numbers1, 5)
Of course you're still missing the part of the code that does something... beyond the above it's difficult to tell what you might need to change without a full code sample.
EDIT: here is a working version of the code sample from the comments:

import random

def choose_random(list1, n):
    i = 0
    set_list = []
    major_numbers = list(range(1, 50)) + list1
    print(major_numbers)
    while i < n:
        set1 = random.sample(major_numbers, 5)
        set2 = random.sample(major_numbers, 2)
        while len(set(set1)) != len(set1):
            print("Duplicate found at %i" % i)
            print(set1)
            print("Changing to:")
            set1 = random.sample(major_numbers, 5)
            print(set1)
        set_list.append([set1, set2])
        i += 1
    return set_list
The code you give obviously has some gaps in it and cannot work as it is (note that the while loop never increments i, so it is infinite as written), so I cannot pinpoint exactly where your error is. But the single retry set1 = random.sample(numbers1, 5) after a duplicate is found is itself never checked again, so a second sample that also contains duplicates slips straight through.
Anyway, random.sample gives you a sample without replacement. If you have any repetitions in random.sample(numbers1, 5), that means you already have repetitions in numbers1. If that is not supposed to be the case, check the contents of numbers1 and perhaps force each value to appear only once, for example by using list(set(numbers1)) instead, as sketched below.
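A minimal sketch of that last suggestion (numbers1 here is a made-up example list):

import random

numbers1 = [1, 2, 2, 3, 4, 5, 5, 6, 7]   # hypothetical input containing repeats
unique_numbers = list(set(numbers1))      # collapse repeated values first
set1 = random.sample(unique_numbers, 5)   # a sample of distinct values has no duplicates
print(set1)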
If the reason is that you want some elements of numbers1 to appear with higher probability, you might want to write this as
set1 = random.sample(numbers1, 5)
while len(set1) != len(set(set1)):
set1 = random.sample(numbers1, 5)
This is a potentially infinite loop, but if numbers1 contains at least 5 different elements, it will exit at some point. If you don't like the theoretical possibility of this loop never exiting, you should probably use a weighted sample instead of random.sample (there are a few examples of how to do that here on Stack Overflow) and remove the numbers you have already chosen from the weights table.
