Taking first N items from two combined iterators? - python-3.x

I've got two iterables (i1 and i2). Each one is producing items sorted in order of a key, with both iterables using the same key. I want to get the first N items, still sorted by the key, from the combined iterations. If I was willing to completely consume both iterables, I could do:
l = list(i1) + list(i2)
l.sort()
l[:n]
but I know I'm only going to need a small fraction of that. Is there some neat way to do this using just itertools?

See Tim Peters's suggestion to use heapq.merge().
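A minimal sketch of that suggestion, assuming both inputs are already sorted in the same order: heapq.merge() is lazy, so wrapping it in itertools.islice() pulls only as many items as needed. The helper name first_n_merged is made up for illustration.

```python
from heapq import merge
from itertools import count, islice

def first_n_merged(n, *iterables, key=None):
    """Return the first n items of the sorted merge of already-sorted iterables."""
    # merge() yields lazily, so islice() stops after n items without
    # consuming the rest of either input.
    return list(islice(merge(*iterables, key=key), n))

i1 = iter([1, 4, 7, 10])
i2 = iter([2, 3, 8])
smallest = first_n_merged(4, i1, i2)

# Works even on infinite sorted iterators, since nothing is materialized.
evens_and_odds = first_n_merged(5, count(0, 2), count(1, 2))
```

If the items are sorted by a key rather than by their natural order, the same key function can be passed through the key argument (available since Python 3.5).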

Related

Python: Sort 2-tuple sometimes according to first and sometimes according to second element

I have a list of tuples of integers [(2,10), [...] (4,11),(3,9)].
Tuples are added to and deleted from the list regularly. It will contain up to ~5000 elements.
In my code I sometimes need this list sorted by the first tuple element and sometimes by the second. Hence the ordering of the list will change drastically, and re-sorting might take place at any time.
Python's Timsort is only fast when the list is already mostly sorted, so this general approach of frequent re-sorting might be inefficient. A better approach would be to use two naturally sorted data structures like the SortedList. But here I would need two lists (one for the first tuple element, and one for the second) as well as a dictionary to map between the elements of the above tuples.
What is the Pythonic way to solve this?
In Java I would do it like this:
TreeSet<Integer> leftTupleEntry = new TreeSet<Integer>();
TreeSet<Integer> rightTupleEntry = new TreeSet<Integer>();
HashMap<Integer, Integer> tupleMap = new HashMap<Integer, Integer>();
That gives both sort orders in the best runtime complexity class, as well as the necessary connection between the two numbers.
When I need to sort according to the first element, I need to access the whole list (as I need to calculate a cumulative sum and perform other operations).
When I need to sort according to the second element, I'm only interested in the smallest elements, which is usually followed by the deletion of the respective tuples.
Typically, after any insertion a new sort according to the first element is requested.
first_element_list = sorted([i[0] for i in list_tuple])
second_element_list = sorted([i[1] for i in list_tuple])
What I did:
I use a SortedKeyList sorted by the first tuple element. Inserting into this list is O(log(n)). Reading from it is O(log(n)) too.
from operator import itemgetter
from sortedcontainers import SortedKeyList
self.list = SortedKeyList(key=itemgetter(0))
self.list.add((1,4))
self.list.add((2,6))
When I need the argmin according to the second tuple element I used
np.argmin(self.list, axis=0)[1]
Which is O(n). Not optimal.
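One possible direction, sketched here with only the standard library rather than sortedcontainers: keep the tuples in a list ordered by the first element and, alongside it, a min-heap ordered by the second element with lazy deletion. The class name TwoKeyIndex and its method names are hypothetical, chosen for illustration. Note that bisect.insort still shifts list items on insert, so this is a sketch of the idea under those assumptions, not a tuned implementation.

```python
import bisect
import heapq
from collections import Counter

class TwoKeyIndex:
    """Tuples readable in first-element order, poppable by smallest second element."""

    def __init__(self):
        self._by_first = []     # tuples kept sorted by (first, second)
        self._heap = []         # (second, first) entries; may hold stale ones
        self._live = Counter()  # copies of each tuple that are still present

    def add(self, item):
        bisect.insort(self._by_first, item)
        heapq.heappush(self._heap, (item[1], item[0]))
        self._live[item] += 1

    def _drop_from_sorted(self, item):
        # bisect_left finds the leftmost copy of an item known to be present.
        del self._by_first[bisect.bisect_left(self._by_first, item)]

    def remove(self, item):
        if self._live[item] == 0:
            raise KeyError(item)
        self._live[item] -= 1
        self._drop_from_sorted(item)
        # The matching heap entry is left behind and skipped later.

    def pop_min_second(self):
        """Remove and return the tuple with the smallest second element."""
        while self._heap:
            second, first = heapq.heappop(self._heap)
            item = (first, second)
            if self._live[item] > 0:  # skip entries whose tuple was removed
                self._live[item] -= 1
                self._drop_from_sorted(item)
                return item
        raise IndexError("pop from an empty index")

    def sorted_by_first(self):
        return list(self._by_first)

idx = TwoKeyIndex()
for t in [(2, 10), (4, 11), (3, 9)]:
    idx.add(t)
```

Reading the full list in first-element order needs no re-sort, and popping the minimum by second element costs O(log n) for the heap operation plus the list deletion.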

Generating all permutations excluding cyclic rotations in Python

I need to create a (practice) program for currency arbitrage that detects profitable "loops" given a series of exchange rates. So there might be different values for USD->JPY, JPY->USD, USD->EUR, and so on. In order to detect profitability, however, I first need to enumerate all possible loops -- USD->JPY->EUR->USD is one example, but USD->EUR->JPY->USD is a distinct example using the same currencies since it may hit different exchange rates.
If I ignore the last part of the loop, which will always be the same as the origin, it seems to be the case that every currency can only exist at most once in the "best" loop, as if a currency exists more than once it would actually be two different loops (at least one of which would still be profitable).
Similarly, I can ignore loops that are just translations of already tested loops: USD->JPY->ASD is the same as JPY->ASD->USD.
So, given input like [USD,JPY,EUR,ASD] I need something that would return:
(USD,JPY,EUR,ASD)
(USD,JPY,ASD,EUR)
(USD,EUR,ASD,JPY)
(USD,EUR,JPY,ASD)
(USD,ASD,EUR,JPY)
(USD,ASD,JPY,EUR)
This solution uses the yield from syntax introduced in Python 3.3. Like the built-in itertools.permutations(), this:
Is a generator and does not require storing anything
Will yield an empty tuple if passed length 0
Assumes every item in the permuted object is itself unique
from itertools import permutations

def unique_cyclic_permutations(thing, length):
    if length == 0:
        yield ()
        return
    for x in permutations(thing[1:], length - 1):
        yield (thing[0],) + x
    if length < len(thing):
        yield from unique_cyclic_permutations(thing[1:], length)
The algorithm works by choosing a pivot, fixing it at the beginning, and then permuting the rest of the objects. In the case of a non-full length permutation, there will also be some permutations that don't include the pivot object at all. In this case, the generator recursively calls itself while excluding the original pivot.
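As a usage sketch, the generator can be run on the currency list from the question. The function is repeated so the example is self-contained; canonical_rotation is a hypothetical helper added here only to check that no two results are rotations of each other.

```python
from itertools import permutations

def unique_cyclic_permutations(thing, length):
    if length == 0:
        yield ()
        return
    # Fix the first item as the pivot and permute the rest.
    for x in permutations(thing[1:], length - 1):
        yield (thing[0],) + x
    # Shorter loops may skip the pivot entirely.
    if length < len(thing):
        yield from unique_cyclic_permutations(thing[1:], length)

def canonical_rotation(loop):
    """Rotate a loop so that its smallest item comes first."""
    i = loop.index(min(loop))
    return loop[i:] + loop[:i]

currencies = ["USD", "JPY", "EUR", "ASD"]
full_loops = list(unique_cyclic_permutations(currencies, 4))
short_loops = list(unique_cyclic_permutations(currencies, 3))

for loop in full_loops:
    print(" -> ".join(loop + loop[:1]))  # append the origin to close the loop
```

With four currencies this gives 6 full-length loops, matching the list in the question, and 8 loops of length 3.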

Looking for ideas: lexicographically sorted suffix array of many different strings compute efficiently an LCP array

I don't want a direct solution to the problem that's the source of this question but it's this one link:
So I take in the strings and add them to a suffix array, which is implemented internally as a sorted set. What I obtain is a lexicographically sorted list of the suffixes of the given strings.
S1 = "banana"
S2 = "panama"
SuffixArray.add S1, S2
To make searching for the k-th smallest substring efficient, I preprocess this sorted set to add information about the longest common prefix between a suffix and its predecessor, and I keep track of a cumulative substring count. So I know that a given k greater than the cumulative substring count of the last item is an invalid query.
This works really well for small inputs as well as random large inputs within the constraints given in the problem definition, which are at most 50 strings of length 2000. I am able to pass 4 out of 7 cases and was pretty surprised I didn't get them all.
So I went searching for the bottleneck and it hit me. Given a large number of inputs like these
anananananananana.....ananana
bkbkbkbkbkbkbkbkb.....bkbkbkb
The queries for the k-th smallest substring are still as fast as expected, but the way I preprocess the sorted set is not. The way I calculate the longest common prefix between the elements of the set is inefficient: it is linear, O(m), per pair. I did the most naïve thing, expecting it to be good enough:
m = anananan
n = anananana
Start at 0 and find the point where `m[i] != n[i]`
It is like this because a suffix and its predecessor might not be related (i.e. they may come from different input strings), so I thought I had no choice but to use brute force.
Here is the question, then, and what I ended up reducing the problem to. Given a list of lexicographically sorted suffixes built in the manner I described above (made up of multiple strings):
What is an efficient way of computing the longest common prefix array?
The subquestion would then be: am I completely off the mark in my approach? Please propose further avenues of investigation if that's the case.
Footnote: I do not want to be shown an implemented algorithm, and I don't mind being told to go read such-and-such book or resource on the subject, as that is what I do anyway while attempting these challenges.
The accepted answer will be something that guides me onto the right path or, failing that, something that teaches me how to solve these types of problems in a broader sense, such as a book.
READING
I would recommend this tutorial pdf from Stanford.
This tutorial explains a simple O(n log^2 n) algorithm with O(n log n) space to compute the suffix array and a matrix of intermediate results. The matrix of intermediate results can be used to compute the longest common prefix between two suffixes in O(log n).
HINTS
If you wish to try to develop the algorithm yourself, the key is to sort the strings based on their 2^k long prefixes.
From the tutorial:
Let's denote by A(i,k) be the subsequence of A of length 2^k starting at position i.
The position of A(i,k) in the sorted array of A(j,k) subsequences (j=1,n) is kept in P(k,i).
and
Using matrix P, one can iterate descending from the biggest k down to 0 and check whether A(i,k) = A(j,k). If the two prefixes are equal, a common prefix of length 2^k had been found. We only have left to update i and j, increasing them both by 2^k and check again if there are any more common prefixes.
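For readers who do want to see the hint in code, here is a rough sketch of the prefix-doubling idea for a single string. It is not the tutorial's code: the function names, the 0-based indexing and the use of -1 as padding past the end of the string are assumptions made for this example. To handle several input strings, one common approach is to join them with distinct separator characters first; that step is left out here.

```python
def build_rank_matrix(s):
    """Return P where P[k][i] ranks the 2**k-long prefix of the suffix at i."""
    n = len(s)
    rank = [ord(c) for c in s]  # k = 0: compare single characters
    P = [rank]
    length = 1
    while length < n:
        # A prefix of size 2*length is described by the ranks of its two halves;
        # -1 stands for "runs past the end" and sorts before any real rank.
        keys = [(rank[i], rank[i + length] if i + length < n else -1)
                for i in range(n)]
        order = sorted(range(n), key=keys.__getitem__)
        new_rank = [0] * n
        for pos in range(1, n):
            prev, cur = order[pos - 1], order[pos]
            new_rank[cur] = new_rank[prev] + (keys[cur] != keys[prev])
        rank = new_rank
        P.append(rank)
        length *= 2
    return P

def lcp(P, n, i, j):
    """Longest common prefix of the suffixes starting at i and j, in O(log n)."""
    if i == j:
        return n - i
    total = 0
    for k in range(len(P) - 1, -1, -1):  # from the biggest k down to 0
        if i < n and j < n and P[k][i] == P[k][j]:
            total += 1 << k  # a common block of 2**k characters was found
            i += 1 << k
            j += 1 << k
    return total

text = "banana"
P = build_rank_matrix(text)
common = lcp(P, len(text), 1, 3)  # suffixes "anana" and "ana"
```

Each doubling round sorts n keys, which gives the O(n log^2 n) total mentioned above, and the matrix P takes O(n log n) space.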

constraint to avoid generating duplicates in this search task

I have to solve the following optimization problem:
Given a set of elements (E1,E2,E3,E4,E5,E6) create an arbitrary set of sequences e.g.
seq1:E1,E4,E3
seq2:E2
seq3:E6,E5
and given a function f that gives a value for every pair of elements e.g.
f(E1,E4) = 5
f(E4,E3) = 2
f(E6,E5) = 3
...
in addition it also gives a value for the pair of an element combined with some special element T, e.g.
f(T,E2) = 10
f(E2,T) = 3
f(E5,T) = 1
f(T,E6) = 2
f(T,E1) = 4
f(E3,T) = 2
...
The utility function that must be optimized is the following:
The utility of a sequence set is the sum of the utility of all sequences.
The utility of a sequence A1,A2,A3,...,AN is equal to
f(T,A1)+f(A1,A2)+f(A2,A3)+...+f(AN,T)
for our example set of sequences above this leads to
seq1: f(T,E1)+f(E1,E4)+f(E4,E3)+f(E3,T) = 4+5+2+2=13
seq2: f(T,E2)+f(E2,T) =10+3=13
seq3: f(T,E6)+f(E6,E5)+f(E5,T) =2+3+1=6
Utility(set) = 13+13+6=32
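The arithmetic above can be reproduced with a short sketch. Storing f as a dictionary keyed by pairs, and the function names sequence_utility and set_utility, are assumptions made for illustration; only the pair values given in the example are filled in.

```python
# Pair values taken from the example; "T" is the special element.
f = {
    ("T", "E1"): 4, ("E1", "E4"): 5, ("E4", "E3"): 2, ("E3", "T"): 2,
    ("T", "E2"): 10, ("E2", "T"): 3,
    ("T", "E6"): 2, ("E6", "E5"): 3, ("E5", "T"): 1,
}

def sequence_utility(seq):
    """f(T,A1) + f(A1,A2) + ... + f(AN,T) for one sequence."""
    path = ["T"] + list(seq) + ["T"]
    return sum(f[(a, b)] for a, b in zip(path, path[1:]))

def set_utility(sequences):
    """Sum of the utilities of all sequences in the set."""
    return sum(sequence_utility(seq) for seq in sequences)

sequences = [["E1", "E4", "E3"], ["E2"], ["E6", "E5"]]
total = set_utility(sequences)
```

Because set_utility is a plain sum, it does not depend on the order in which the sequences are listed.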
I am trying to solve a larger version of this problem (around 1000 elements rather than 6) using A* and some heuristic, starting from zero sequences and adding elements step by step, either to existing sequences or as a new sequence, until we obtain a set of sequences containing all elements.
The problem I run into is that while generating possible solutions I end up with duplicates. For example, in the above example all the following combinations are generated:
seq1:E1,E4,E3
seq2:E2
seq3:E6,E5
+
seq1:E1,E4,E3
seq2:E6,E5
seq3:E2
+
seq1:E2
seq2:E1,E4,E3
seq3:E6,E5
+
seq1:E2
seq2:E6,E5
seq3:E1,E4,E3
+
seq1:E6,E5
seq2:E2
seq3:E1,E4,E3
+
seq1:E6,E5
seq2:E1,E4,E3
seq3:E2
which all have equal utility, since the order of the sequences does not matter.
These are all permutations of the 3 sequences. Since the number of sequences is arbitrary, there can be as many sequences as elements, and a factorial (!) number of duplicates is generated...
One way to solve such a problem is to keep track of already visited states and not revisit them. However, since storing all visited states requires a huge amount of memory, and comparing two states can be quite an expensive operation, I was wondering whether there is a way to avoid generating them in the first place.
THE QUESTION:
Is there a way to construct all these sequences step by step, constraining the addition of elements so that only combinations of sequences are generated rather than all variations of sequences (or at least so that the number of duplicates is limited)?
As an example, I have only found a way to limit the number of 'duplicates' generated, by stating that an element Ei should always be in a seqj with j<=i. Therefore, if you had two elements E1,E2, only
seq1:E1
seq2:E2
would be generated, and not
seq1:E2
seq2:E1
I was wondering whether there was any such constraint that would prevent duplicates from being generated at all, without failing to generate all combinations of sets.
Well, it is simple. Only allow the generation of sequence sets that are sorted by the first member of each sequence. That is, from the above example, only
seq1:E1,E4,E3
seq2:E2
seq3:E6,E5
would be correct. This is very easy to guard: never allow an additional sequence whose first member is less than the first member of its predecessor.
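A small sketch of how this guard could look in a step-by-step generator. Beyond the rule above, it makes one extra assumption that is not part of the answer: elements are only ever appended to the most recent sequence, so that each set of sequences is reachable along exactly one path. The function name is hypothetical and integers stand in for E1, E2, and so on.

```python
def canonical_sequence_sets(elements):
    """Yield each set of sequences once, ordered by the sequences' first members."""

    def extend(state, unused):
        if not unused:
            yield tuple(tuple(seq) for seq in state)
            return
        for e in sorted(unused):
            rest = unused - {e}
            if state:
                # Grow the most recent sequence (extra assumption, see lead-in).
                yield from extend(state[:-1] + [state[-1] + [e]], rest)
            if not state or e > state[-1][0]:
                # Open a new sequence only if its first member is larger
                # than the first member of its predecessor.
                yield from extend(state + [[e]], rest)

    yield from extend([], frozenset(elements))

two = list(canonical_sequence_sets([1, 2]))
three = list(canonical_sequence_sets([1, 2, 3]))
```

For two elements this yields three sets (one of them being seq1:E1, seq2:E2) and never the swapped duplicate seq1:E2, seq2:E1. Branches that cannot place a small leftover element simply produce nothing, so a real search would want to prune those earlier.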

Best way to sort a long list of strings

I would like to know the best way to sort a long list of strings with respect to time and space efficiency. I prefer time efficiency over space efficiency.
The strings can be numeric, alphabetic, alphanumeric, etc. I am not interested in the sort behavior (alphanumeric sort vs. alphabetic sort), just the sort itself.
Some ways I can think of are below.
Using code, e.g. the .NET Framework's Array.Sort() function. I think the way this works is that the hash codes for the strings are calculated and each string is inserted at the proper position using a binary search.
Using the database (e.g. MS SQL). I have not done this, and I do not know how efficient it would be.
Using a prefix tree data structure like a trie. Sorting requires traversing all the nodes of the trie using DFS (depth-first search), which takes O(|V| + |E|) time. (Searching takes O(l) time, where l is the length of the string to compare.)
Any other ways or data structures?
You say that you have a database, and presumably the strings are stored in the database. Then you should get the database to do the work for you. It may be able to take advantage of an index and therefore not need to actually sort the list, but just read it from the index in sorted order.
If there is no index, the database might still be able to help you if you only fetch the first k rows for some small constant k, for example 100. When you use ORDER BY with a TOP clause (LIMIT in some other databases), SQL Server can use a special optimization called TOP N SORT, which runs in linear time instead of O(n log(n)) time.
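The same top-k idea can be tried outside the database. As a rough Python analogue (an illustration only, not how SQL Server implements TOP N SORT), heapq.nsmallest() keeps only k candidates while scanning the list once, so it avoids sorting everything. The sample data is made up.

```python
import heapq

names = ["mango", "apple", "kiwi", "banana", "cherry", "fig"]
k = 3

# Scans the list once while holding at most k candidates: O(n log k).
first_k = heapq.nsmallest(k, names)
```

For a small constant k this behaves like a linear scan, and the result is the same as sorting the whole list and taking the first k entries.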
If your strings are not in the database already then you should use the features provided by .NET instead. I think it is unlikely you will be able to write custom code that will be much faster than the default sort.
I found this paper that uses trie data structure to efficiently sort large sets of strings. I have not looked into it in detail though.
Radix sort could also be a good option if the strings are not very long, e.g. a list of names.
Suppose you have a large list of strings and that the length of the list is n.
Using a comparison-based sorting algorithm like MergeSort, HeapSort or Quicksort will give you an O(d n log n) running time,
where n is the size of the list and d is the maximum length of all strings in the list.
We can try to use radix sort in this case. Let b be the base and let d be the length of the longest string; then we can show that the running time using radix sort is O(d (n + b)).
Furthermore, if the strings consist of, say, lowercase English letters, the running time is O(d (n + 26)) = O(d n).
Source: MIT OpenCourseWare algorithms lecture by Prof. Erik Demaine.
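A minimal sketch of such a radix sort, assuming the strings contain only the lowercase letters a to z. It is a least-significant-digit variant written for this example rather than code from the lecture; the extra bucket 0 for "past the end of the string" is an assumption that makes shorter strings sort before their extensions.

```python
def radix_sort_strings(strings):
    """LSD radix sort for strings made of lowercase letters a-z."""
    if not strings:
        return []
    d = max(len(s) for s in strings)  # length of the longest string
    b = 27                            # 26 letters + 1 "past the end" bucket
    items = list(strings)
    # Process character positions from last to first; each pass is stable,
    # so earlier (more significant) positions end up dominating the order.
    for pos in range(d - 1, -1, -1):
        buckets = [[] for _ in range(b)]
        for s in items:
            idx = ord(s[pos]) - ord("a") + 1 if pos < len(s) else 0
            buckets[idx].append(s)
        items = [s for bucket in buckets for s in bucket]
    return items

words = ["banana", "apple", "app", "cherry", "a", "band", "apple"]
ordered = radix_sort_strings(words)
```

Each of the d passes touches all n strings and b buckets, which is where the O(d (n + b)) bound comes from.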
