Most efficient way to print an AVL tree of strings?

I'm thinking that an in-order traversal will run in O(n) time. The only thing better would be something running in O(log n) time, but I don't see how that could work, considering we have to visit all n nodes.
Is O(n) the fastest we can do here?

Converting and expanding @C.B.'s comment into an answer:
If you have an AVL tree with n strings in it and you want to print all of them, then you have to do at least Θ(n) total work simply because you have to print out each of the n strings. You can often lower-bound the amount of work required to produce a list or otherwise output a sequence of values simply by counting up how many items are going to be in the list.
We can be even more precise here. Suppose the combined length of all the strings in the tree is L. The time required to print out all the strings in the tree has to be at least Θ(L), since it costs some computational effort to output each individual character. Therefore, we can say that we have to do at least Θ(n + L) work to print out all the strings in the tree.
The bound given here just says that any correct algorithm has to do at least this much work, not that there actually is an algorithm that does this much work. But if you look closely at any of the major tree traversals - inorder, preorder, postorder, level-order - you'll find that they all match this time bound.
Now, one area where you can look for savings is in space complexity. A level-order traversal of the tree might require Ω(n) total space if the tree is perfectly balanced (since it holds a whole layer of the tree in memory and the bottommost layer can have Θ(n) nodes in it), while an inorder, preorder, or postorder traversal would only require O(log n) memory because you only need to store the current access path, which has logarithmic height in an AVL tree.
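For concreteness, here's a minimal sketch in Haskell (a plain binary tree of strings; an AVL tree has the same shape plus balance information, which printing doesn't need):

data Tree = Leaf | Node Tree String Tree

-- In-order traversal: Θ(n) node visits plus Θ(L) characters printed,
-- i.e. Θ(n + L) total work. The recursion stack holds only the current
-- root-to-node path, which is O(log n) deep in an AVL tree.
printInOrder :: Tree -> IO ()
printInOrder Leaf = pure ()
printInOrder (Node l s r) = do
  printInOrder l
  putStrLn s
  printInOrder r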

Related

fast, semi-accurate sort in linux

I'm going through a huge list of files in Linux, the output of a "find" (directory walk). I want to sort the list by filename, but I'd like to begin processing the files as soon as possible.
I don't need the sort to be 100% correct.
How can I do a "partial sort", that might be off some of the time but will output quickly?
This is StackOverflow, not SuperUser, so an algorithm answer should be enough for you.
Try implementing HeapSort. But instead of sorting the full list of names, do the following.
Pick a constant M. The smaller it is, the more "off" it will be and the "faster" the algorithm will start printing the results. In the limiting case where M is equal to the number of all names, it will be an exact sorting algorithm.
Load the first M elements, heapify() them.
Take the lowest element from the heap and print it. Put the next unsorted name into its place, then do siftDown().
Repeat until you run out of unsorted names. Do a standard HeapSort on the elements left in the heap.
This algorithm is linear in the number of names and starts printing them as soon as the first M have been read. Step 2 is O(M) = O(1), since M is a constant. Step 3 is O(log M) = O(1) per element and is repeated O(N) times, hence the total is O(N).
The algorithm tries to keep the large elements in the heap as long as possible while pushing the lowest elements out of the heap as quickly as possible, so the output looks almost sorted.
IIRC, a variant of this algorithm is actually what GNU sort does before switching to on-disk MergeSort: it keeps sorted runs of data as long as possible to minimize the number of on-disk merges.
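Here is a minimal sketch of the scheme in Haskell (my own function names; a hand-rolled skew heap is used so nothing outside base is needed). approxSort m emits elements lazily in roughly sorted order, holding at most m names in memory; with m at least the length of the input it becomes an exact heapsort.

data Heap a = Empty | Node (Heap a) a (Heap a)

merge :: Ord a => Heap a -> Heap a -> Heap a
merge Empty h = h
merge h Empty = h
merge h1@(Node l1 x1 r1) h2@(Node _ x2 _)
  | x1 <= x2  = Node (merge r1 h2) x1 l1    -- swap children: skew-heap trick
  | otherwise = merge h2 h1

insert :: Ord a => a -> Heap a -> Heap a
insert x = merge (Node Empty x Empty)

popMin :: Ord a => Heap a -> Maybe (a, Heap a)
popMin Empty        = Nothing
popMin (Node l x r) = Just (x, merge l r)

approxSort :: Ord a => Int -> [a] -> [a]
approxSort m xs = go (foldr insert Empty front) rest
  where
    (front, rest) = splitAt m xs              -- step 2: heap of the first M
    go h (y:ys) = case popMin h of
      Just (z, h') -> z : go (insert y h') ys -- step 3: emit min, admit next
      Nothing      -> y : go h ys
    go h []     = drain h                     -- step 4: flush what's left
    drain = maybe [] (\(z, h') -> z : drain h') . popMin

For example, main = interact (unlines . approxSort 1024 . lines) approximately sorts stdin line by line while starting to print early.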

How can natural numbers be represented to offer constant time addition?

Cirdec's answer to a largely unrelated question made me wonder how best to represent natural numbers with constant-time addition, subtraction by one, and testing for zero.
Why Peano arithmetic isn't good enough:
Suppose we use
data Nat = Z | S Nat
Then we can write
Z + n = n
S m + n = S(m+n)
We can calculate m+n in O(1) time by placing m-r debits (for some constant r), one on each S constructor added onto n. To get O(1) isZero, we need to be sure to have at most p debits per S constructor, for some constant p. This works great if we calculate a + (b + (c+...)), but it falls apart if we calculate ((...+b)+c)+d. The trouble is that the debits stack up on the front end.
One option
The easy way out is to just use catenable lists, such as the ones Okasaki describes, directly. There are two problems:
O(n) space is not really ideal.
It's not entirely clear (at least to me) that the complexity of bootstrapped queues is necessary when we don't care about order the way we would for lists.
As far as I know, Idris (a dependently typed purely functional language which is very close to Haskell) deals with this in a quite straightforward way. The compiler is aware of Nat and Fin (upper-bounded Nat) and replaces them with machine integer types and operations whenever possible, so the resulting code is pretty efficient. However, that's not true for custom types (even isomorphic ones), nor for the compilation stage (there were some code samples using Nats for type checking which resulted in exponential growth in compile time; I can provide them if needed).
In the case of Haskell, I think a similar compiler extension could be implemented. Another possibility is to write Template Haskell macros that transform the code. Of course, neither option is easy.
My understanding is that, in basic computer programming terminology, the underlying problem is that you want to concatenate lists in constant time. The lists don't have cheats like forward references, so you can't jump to the end in O(1) time, for example.
You can use rings instead, which you can merge in O(1) time, regardless of whether a+(b+(c+...)) or ((...+c)+b)+a association is used. The nodes in the rings don't need to be doubly linked; a link to the next node is enough.
Subtraction is the removal of any node, O(1), and testing for zero (or one) is trivial. Testing equality with any particular n > 1 is O(n), however.
If you want to reduce space, then at each operation you can merge the nodes at the insertion or deletion points and weight the remaining ones higher. The more operations you do, the more compact the representation becomes! I think the worst case will still be O(n), however.
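A hedged sketch of the ring idea (my own names; mutable by necessity, hence IORef): a natural number n is a circular singly linked list of n nodes, with zero represented as Nothing at the API level. Addition splices two rings together in O(1) by swapping the next pointers of one node from each ring.

import Data.IORef

newtype Node = Node (IORef Node)  -- a ring node: a pointer to the next node

-- one: a single node pointing at itself, representing 1
one :: IO Node
one = do
  r <- newIORef (error "knot not tied yet")
  writeIORef r (Node r)
  pure (Node r)

-- add: merge two rings in O(1) by swapping two next-pointers
add :: Node -> Node -> IO Node
add a@(Node ra) (Node rb) = do
  na <- readIORef ra
  nb <- readIORef rb
  writeIORef ra nb
  writeIORef rb na
  pure a

-- subOne: unlink the successor of the given node in O(1);
-- Nothing means the ring had a single node, i.e. the result is zero
subOne :: Node -> IO (Maybe Node)
subOne a@(Node ra) = do
  Node rb <- readIORef ra
  if rb == ra                     -- IORef equality is pointer equality
    then pure Nothing
    else do
      c <- readIORef rb
      writeIORef ra c
      pure (Just a)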
We know that there are two "extremal" solutions for efficient addition of natural numbers:
Memory-efficient: the standard binary representation of natural numbers, which uses O(log n) memory and requires O(log n) time for addition; a sketch follows this list. (See also the chapter "Binary Representations" in Okasaki's book.)
CPU-efficient: one which uses just O(1) time per addition. (See the chapter "Structural Abstraction" in the book.) However, this solution uses O(n) memory, as we'd represent the natural number n as a list of n copies of ().
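For reference, here is the memory-efficient option as a minimal sketch (the names are mine): bits are stored least significant first, so addition walks the bit lists once, in O(log n) time.

data Bin = End | O Bin | I Bin   -- little-endian bits: 6 = O (I (I End))

inc :: Bin -> Bin
inc End   = I End
inc (O b) = I b
inc (I b) = O (inc b)            -- carry propagates

add :: Bin -> Bin -> Bin
add End b = b
add a End = a
add (O a) (O b) = O (add a b)
add (O a) (I b) = I (add a b)
add (I a) (O b) = I (add a b)
add (I a) (I b) = O (inc (add a b))  -- 1 + 1 = 0, carry 1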
I haven't done the actual calculations, but I believe for the O(1) numerical addition we won't need the full power of O(1) FIFO queues, it'd be enough to bootstrap standard list [] (LIFO) in the same way. If you're interested, I could try to elaborate on that.
The problem with the CPU-efficient solution is that we need to add some redundancy to the memory representation so that we can save enough CPU time. In some cases, adding such redundancy can be accomplished without compromising the memory size (as for the O(1) increment/decrement operation). But if we allow arbitrary tree shapes, as in the CPU-efficient solution with bootstrapped lists, there are simply too many tree shapes to distinguish them all in O(log n) memory.
So the question is: Can we find just the right amount of redundancy so that sub-linear amount of memory is enough and with which we could achieve O(1) addition? I believe the answer is no:
Suppose we have a representation and algorithm with O(1)-time addition. Consider a number of magnitude m bits, which we compute as a sum of 2^k numbers, each of magnitude (m-k) bits. To represent each of those summands we need (regardless of the representation) at least m-k bits of memory, so at the beginning we start with at least (m-k)·2^k bits. At each of the 2^k additions we are allowed to perform only a constant number C of operations, so across all of them we can process (and ideally remove) a total of at most C·2^k bits. Therefore, at the end, the number of bits we need to represent the outcome is at least (m-k-C)·2^k. Since k can be chosen arbitrarily, our adversary can set k = m-C-1, which means the total sum will be represented with at least 2^(m-C-1) = 2^m/2^(C+1) ∈ Ω(2^m) bits. So a natural number n will always need Ω(n) bits of memory!

Find Median of AVL tree

I've searched a bit and found a related post: Get median from AVL tree?
but I'm not too satisfied with the response.
My thoughts on solving this problem:
If the balance factor is 0, return root
else keep removing the root until the tree is completely balanced, and calculate the median of the roots you just removed
Assuming the AVL tree will keep its balance (by definition?).
I've seen some answers suggesting an in-order traversal to find the median, but I think that would require more space and time.
Can someone confirm or correct my ideas? Thanks!
There are two problems in your suggested approach:
You destroy your tree in the process (or take up twice as much memory for a "backup" copy)
In the worst case, you need quite a lot of root removals to get a completely balanced tree (I think it would be close to 2^(h-1) - 1 removals for a tree of height h, i.e. on the order of half the nodes)... and you'd still need to calculate the median of the removed roots.
The answer in your linked question is right and optimal. The usual way to solve this is to build an order statistic tree (storing, for each node, the number of elements in its left and right subtrees). Note that you have to adjust these counts whenever an AVL rotation happens.
See IVlad's answer here. Since an AVL tree guarantees an O(log n) Search operation and IVlad's algorithm is essentially a Search operation, you can find the k-th smallest element in O(log n) time and O(1) space (not counting the space for the tree itself).
Assuming your tree is indexed from 0 and has n elements, find the median in the following way:
if n is odd: Find the (n-1)/2-th element and return it
if n is even: Find the (n/2)-th and (n/2 - 1)-th elements and return their average
Also, if changing the tree (left/right element counts) is not an option, see the second part of the answer you linked to.
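A hedged sketch of the rank-descent part in Haskell (hypothetical types; the AVL rebalancing and count maintenance are omitted): each node caches its subtree size, so the k-th smallest element is found along a single root-to-leaf path.

data Tree a = Leaf | Node Int (Tree a) a (Tree a)  -- Int caches subtree size

size :: Tree a -> Int
size Leaf           = 0
size (Node n _ _ _) = n

-- k-th smallest element, 0-indexed: one descent, O(log n) in an AVL tree
kth :: Int -> Tree a -> Maybe a
kth _ Leaf = Nothing
kth k (Node _ l x r)
  | k < sl    = kth k l
  | k == sl   = Just x
  | otherwise = kth (k - sl - 1) r
  where sl = size l

median :: Tree Double -> Maybe Double
median t
  | n == 0    = Nothing
  | odd n     = kth (n `div` 2) t               -- the (n-1)/2-th element
  | otherwise = do a <- kth (n `div` 2 - 1) t   -- average the two middles
                   b <- kth (n `div` 2) t
                   pure ((a + b) / 2)
  where n = size t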

What data structure supports fast insertion, deletion and search

I feel like this has to exist, but I just can't think of it. Is there a data structure that can hold a sorted list of values, be searched quickly (maybe O(log N) time, like a sorted array), and also support insertion and removal of elements in O(log N) or constant time?
This is pretty much the description of a balanced binary search tree, which stores elements in sorted order, allows for O(log n) insertions, deletions, and lookups, and allows for O(n) traversal of all elements.
There are many ways to build a balanced BST - there are red/black trees, AVL trees, scapegoat trees, splay trees, AA trees, treaps, (a, b)-trees, etc. Any of these would solve your problem. Of them, splay trees are probably the easiest to code up, followed by AA-trees and AVL trees.
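For example, in Haskell, Data.Set from the containers package is exactly such a structure (implemented as a size-balanced binary search tree):

import qualified Data.Set as Set

main :: IO ()
main = do
  let s0 = Set.fromList [5, 1, 9, 3 :: Int]
      s1 = Set.insert 7 s0        -- O(log n) insertion
      s2 = Set.delete 1 s1        -- O(log n) deletion
  print (Set.member 3 s2)         -- O(log n) lookup: True
  print (Set.toAscList s2)        -- O(n) sorted traversal: [3,5,7,9]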
Hope this helps!

data structure for shift strings

We're interested in a data structure for binary strings. Let S = s_1 s_2 ... s_m be a binary string of size m. Shift(S, i) is a cyclic shift of the string S by i places to the left; that is, Shift(S, i) = s_i s_{i+1} s_{i+2} ... s_m s_1 ... s_{i-1}. Suggest an efficient data structure that supports:
Init() of an empty DS in O(1)
Insert(s), which inserts a binary string into the DS in O(|s|^2)
Search_cyclic(s), which checks whether some stored string S satisfies s = Shift(S, i) for ANY i, in O(|s|)
Space complexity: O(|S_1| + |S_2| + ... + |S_m|), where S_i is one of the m strings we've inserted so far
If I had to implement Search_cyclic(s, i) for some given i, this would be quite simple using a suffix tree: just traverse it in O(|s|). But here in Search_cyclic(s) we don't have a given i, so I don't know how to meet the given complexity. On the other hand, Insert(s) generally takes O(|s|) to insert into a suffix tree, while here we are allowed O(|s|^2).
So here is a solution I can propose to you. The complexities are even lower than the ones asked of you, but it may seem a bit complicated.
The data structure in which you keep all the strings will be a trie or even a Patricia tree. For each string you want to insert, store its minimum cyclic shift (i.e., the lexicographically smallest of all its possible cyclic shifts) in this tree. You can compute the minimum cyclic shift of a string in linear time; I will give one possible solution to that a bit later. For the moment, let's assume you can do it. Here is how the required operations will be implemented:
Init() - initialization of both a trie and a Patricia tree is constant time - no problem here
Insert(s) - compute the minimum cyclic shift s' of s in O(|s|) and then insert it into either of the data structures in O(|s'|) = O(|s|). This is even better than the required complexity
Search_cyclic(s) - again, compute the minimum cyclic shift of s in O(|s|) and then check in the Patricia tree or trie whether that string is present, which again is done in O(|s|)
The memory complexity is also as required, and may be even lower if you build a Patricia tree.
So all that is left is to explain how to find the minimum cyclic shift. Since you mention suffix trees, I hope you know how to construct one in linear time. So the trick is: append the string s to itself (i.e., double it) and then construct a suffix tree for the doubled string. This is still linear in |s|, so no problem there. After that, all you have to do is find the minimum among the suffixes of length at least |s| in this tree. This is not hard at all, I believe: start from the root and always follow the edge out of the current node whose label is lexicographically smallest, until you accumulate length at least |s|. Because of the doubling of the string, you will always be able to follow minimal edges until you accumulate length at least |s|.
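If you'd rather not build a suffix tree at all, the minimum cyclic shift can also be computed directly. Below is a sketch of the classic two-pointer "minimum expression" algorithm (an alternative to the suffix-tree method above, also O(|s|); the names are mine):

import Data.Array

-- Index at which the lexicographically minimal rotation of s starts, O(|s|).
minRotIndex :: String -> Int
minRotIndex s = go 0 1 0
  where
    n    = length s
    a    = listArray (0, n - 1) s
    at t = a ! (t `mod` n)             -- read s as if it were doubled
    go i j k                           -- i, j: candidate starts; k: matched
      | i >= n || j >= n || k >= n = min i j
      | at (i + k) == at (j + k)   = go i j (k + 1)
      | at (i + k) >  at (j + k)   = step (i + k + 1) j
      | otherwise                  = step i (j + k + 1)
    step i j
      | i == j    = go i (j + 1) 0     -- candidates collided: separate them
      | otherwise = go i j 0

-- Canonical form to store in the trie: the minimal rotation itself.
canonical :: String -> String
canonical s = take (length s) (drop (minRotIndex s) (s ++ s))

Insert(s) then stores canonical s, and Search_cyclic(s) simply looks up canonical s.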
Hope this answer helps.
