How to use a WholeRowIterator as the source of another iterator? - accumulo

I am trying to filter out columns after using a WholeRowIterator to filter rows. This is to remove columns that were useful in determining which row to keep, but not useful in the data returned by the scan.
The WholeRowIterator does not appear to play nice as the source of another iterator such as a RegExFilter. I know the keys/values are encoded by the WholeRowIterator.
Are there any possible solutions to get this iterator stack to work?
Thanks.

Usually, the WholeRowIterator is the last iterator in the "stack" as it involves serializing the row (many key-values) into a single key-value. You probably don't want to do it more than once. But, let's assume you want to do that:
You would want to write an iterator that deserializes each Key-Value into a SortedMap using WholeRowIterator.decodeRow(Key, Value), modifies the SortedMap, reserializes it into a single Key-Value, and returns it. This iterator would need a higher priority number than the WholeRowIterator, so that it sits above the WholeRowIterator in the stack and receives its output.
Alternatively, you could extend the WholeRowIterator and override the encodeRow(List<Key>,List<Value>) method to not serialize your unwanted columns in the first place. This would save the extra serialization and deserialization the first approach has.

Related

What is the most idiomatic way to write iterators that map an uncertain number of input items to other objects in Rust?

I'm trying to implement a Lexer. Since lexers emit tokens, I suppose that we can perceive a Lexer as a special iterator that maps certain chunks of chars to Tokens. I therefore expect Lexer to store an Iterator<Item=char> and manipulate that iterator instead of a &str to enable maximum flexibility.
struct Lexer<T: Iterator<Item = char>> {
    source: T,
}
Yet I find it hard to manipulate the iterator, since almost all iterator adaptors take ownership, and with generics I cannot change the type of T at runtime, unless I switch to Box.
self.source.take_while(|x| x.is_whitespace())
A possible workaround is to require that the iterator implement Clone, use a clone every time I want to transform it, remember how many characters I have seen, and call next that many times. I believe that it is too clumsy.
I wonder if there is an idiomatic way to write iterators that map an uncertain number of input items (in this case, chars) into another object (in this case, Tokens)?
The most elegant way I can come up with so far is to use while let etc. which are not so fluent-style-like. I inspected the implementation of GroupBy in itertools and found that they use the while let approach too.

How to get values from an IMap (Hazelcast) given a list of keys?

Problem we are trying to solve:
Given a list of keys, what is the best way to get the values from an IMap when the number of entries is around 500K?
Also we need to filter the values based on fields.
Here is the example map we are trying to read from.
Given IMap[String, Object]
We are using protobuf to serialize the object
Object can be say
message test { required mac_address eth_mac = 1; … // size can be around 300 bytes }
You can use IMap.getAll(keySet) if you know the keys beforehand. It is much better than single gets, since a bulk operation needs far fewer network trips.
For filtering, you can pass a predicate to IMap.values(predicate), IMap.entrySet(predicate) or IMap.keySet(predicate), depending on what you want returned.
See more: http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#distributed-query

Why LinkedList as a bucket implementation for HashMap?

I understand how HashMap works (collisions and everything) and am trying to understand the deeper mechanics: why is a linked list chosen for the entry bucket and not, say, an array (making the table a two-dimensional matrix)? Searching either one is an O(n) operation. I assumed one factor in favor of the linked list is its O(1) insertion. Is that a correct assumption?
In Java, the bucket is a linked list: each slot in the table holds a linked list of entry elements. Each node holds a reference to the next node in the list, until you reach the end, where the next reference is null.
Also check:
How does Java's HashMap work internally?
An array has a predetermined size. So if you use an array for each bucket, every bucket in the hash table has a predetermined size. Every possible slot is allocated up front, and your hash table becomes big. This makes sense if you have a very large amount of memory; if not, use a linked list and walk through the list to find the match.
It's because you do not add at the end of the linked list but at the head, which makes the insertion O(1).
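To make the head-insertion point concrete, here is a minimal Python sketch of a hash table with linked-list buckets. It is only an illustration under stated assumptions: the names Node, ChainedMap, and chain_keys are hypothetical, the table never resizes, and it is not how java.util.HashMap is actually written.

```python
class Node:
    """One entry in a bucket's singly linked list (hypothetical name)."""

    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


class ChainedMap:
    """Toy hash table with linked-list buckets; assumes no resizing."""

    def __init__(self, capacity=4):
        self.buckets = [None] * capacity

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        i = self._index(key)
        node = self.buckets[i]
        while node is not None:  # update in place if the key already exists
            if node.key == key:
                node.value = value
                return
            node = node.next
        # Linking at the head is O(1): no walk to the tail, no array copy.
        self.buckets[i] = Node(key, value, self.buckets[i])

    def get(self, key, default=None):
        node = self.buckets[self._index(key)]
        while node is not None:
            if node.key == key:
                return node.value
            node = node.next
        return default

    def chain_keys(self, index):
        """Return the keys of one bucket, head first."""
        keys = []
        node = self.buckets[index]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys


table = ChainedMap(capacity=4)
for k in (1, 5, 9):  # in CPython small ints hash to themselves: all land in bucket 1
    table.put(k, str(k))
```

Note that put in this sketch still walks the chain to check for an existing key, so only the linking step itself is constant time; the newest key ends up at the head of its bucket.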
[Answering the question below: no, I am not looking for a solution :) I just wanted to understand why the choice was made for buckets to be an internal implementation of a linked list and not an array-based list (I didn't mean a fixed-size array). Linked lists offer a node-based implementation that certainly helps when there are a large number of inserts into the data structure, but when you are adding to the end of the list, shouldn't any growable array-based implementation of the bucket suffice?]

Data Structure to use instead of hash_map

I want to make an array containing three wide character arrays such that one of them is the key.
Something like "LPWCH,LPWCH,LPWCH" (I was not able to use the greater-than/less-than symbols, since the site thinks it is a tag).
hash_map only lets me use a pair: wKey and the element associated with it. Is there another data structure that lets me do this?
This set will be updated by different threads almost simultaneously, and that's the reason why I don't want to use a class or another struct to define the remaining two wide character arrays.
You can use LPWCH as a key and std::pair<LPWCH, LPWCH> as an element.
Using any of the LP-typedefs as a key is not good. You would only be comparing the pointers, and not the strings.
LPWCH is nothing but a WCHAR*, which is just a pointer. When you compare two pointers, you are comparing where they point, and not what they point to.
You either need to attach another comparer to your map/hash_map, or use an actual string datatype (like std::wstring or CString).

Why do some programming languages restrict you from editing the array you're looping through?

Pseudo-code:
for each x in someArray {
    // possibly add an element to someArray
}
I forget the name of the exception this throws in some languages.
I'm curious to know why some languages prohibit this use case, whereas other languages allow it. Are the allowing languages unsafe -- open to some pitfall? Or are the prohibiting languages simply being overly cautious, or perhaps lazy (they could have implemented the language to gracefully handle this case, but simply didn't bother).
Thanks!
What would you want the behavior to be?
list = [1,2,3,4]
foreach x in list:
    print x
    if x == 2: list.remove(1)
Possible behaviors:
list is some linked-list type iterator, where deletions don't affect your current iterator:
    [1,2,3,4]
list is some array, where your iterator iterates via pointer increment:
    [1,2,4]
same as before, only the system tries to cache the iteration count:
    [1,2,4,<segfault>]
The problem is that different collections implementing this enumerable/sequence interface that allows for foreach-looping have different behaviors.
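As one concrete data point, CPython shows two of these behaviors in the same language. The sketch below is illustrative only (the variable names are made up): a list iterator tracks an index and silently skips an element, while a dict detects the change and raises.

```python
# A list iterator tracks an index, so removing an earlier element shifts
# the remaining items left and one of them is never visited.
items = [1, 2, 3, 4]
seen = []
for x in items:
    seen.append(x)
    if x == 2:
        items.remove(1)

# A dict refuses instead: changing its size mid-iteration raises RuntimeError.
counts = {"a": 1}
error = None
try:
    for key in counts:
        counts[key + "x"] = 0
except RuntimeError as exc:
    error = exc
```

Here seen ends up as [1, 2, 4], matching the array case above, and error holds the RuntimeError raised by the dict.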
Depending on the language (or platform, such as .NET), iteration may be implemented differently.
Typically a foreach creates an Iterator or Enumerator object on the array, which internally keeps its state about the iteration details. If you modify the array (by adding or deleting an element), the iterator state would be inconsistent in regard to the new state of the array.
Platforms such as .NET allow you to define your own enumerators, which may not be susceptible to adding/removing elements of the underlying array.
A generic solution to the problem of adding/removing elements while iterating is to collect the elements in a new list/collection/array, and add/remove the collected elements after the enumeration has completed.
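A minimal Python sketch of that collect-then-apply approach, with made-up data and rules (even numbers schedule an addition, odd numbers are scheduled for removal):

```python
numbers = [1, 2, 3, 4]
to_add = []
to_remove = []

# Pass 1: only read the list, and record the changes we want.
for n in numbers:
    if n % 2 == 0:
        to_add.append(n * 10)
    else:
        to_remove.append(n)

# Pass 2: the enumeration is over, so mutating the list is safe.
for n in to_remove:
    numbers.remove(n)
numbers.extend(to_add)
```

The cost is the extra bookkeeping lists, but the loop body never sees a half-modified collection.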
Suppose your array has 10 elements. You get to the 7th element, and decide there that you need to add a new element earlier in the array. Uh-oh! That element doesn't get iterated on! for each has the semantics, to me at least, of operating on each and every element of the array, once and only once.
If an element is added on every pass, your pseudo example code would lead to an infinite loop. For each element you look at, you add one to the collection, so if you have at least 1 element to start with, there is always at least one more element left to visit and the loop never reaches the end.
Arrays typically have a fixed number of elements. You get flexible sizes through wrapper objects (such as List) that manage the resizing for you. I suspect a language would run into trouble if the mechanism it used for the edit created a whole new array while the loop was still walking the old one.
Many compiled languages implement "for" loops with the assumption that the number of iterations will be calculated once at loop startup (or better yet, compile time). This means that if you change the value of the "to" variable inside the "for i = 1 to x" loop, it won't change the number of iterations. Doing this allows a legion of loop optimizations, which are very important in speeding up number-crunching applications.
If you don't like those semantics, the idea is that you should use the language's "while" construct instead.
Note that in this view of the world, C and C++ don't have proper "for" loops, just fancy "while" loops.
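Python's for loop over range behaves like the "computed once" case, which makes the contrast with while easy to sketch. The function names below are hypothetical and only serve the illustration:

```python
def count_for(limit, new_limit):
    """Count passes of a for loop whose bound is changed inside the body."""
    iterations = 0
    for _ in range(limit):  # range(limit) is built once, before the first pass
        limit = new_limit   # rebinding the name does not touch that range
        iterations += 1
    return iterations


def count_while(limit, new_limit):
    """Same loop written with while: the condition is re-evaluated every pass."""
    iterations = 0
    i = 0
    while i < limit:
        limit = new_limit
        iterations += 1
        i += 1
    return iterations
```

With a starting bound of 3 and a new bound of 5, count_for runs 3 times while count_while runs 5 times.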
Implementing the lists and enumerators to handle this would mean a lot of overhead. This overhead would always be there, and it would only be useful in a small minority of cases.
Also, whichever implementation was chosen would not always make sense. Take the simple case of inserting an item into the list while enumerating it: should the new item always be included in the enumeration, always be excluded, or should that depend on where in the list the item was added? If I insert the item at the current position, would that change the value of the Current property of the enumerator, and should it skip the previously current item, which is now the next item?
This only happens within foreach blocks. Use a for loop with an index value and you'll be allowed to. Just make sure to iterate backwards so that you can delete items without causing issues.
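A short Python sketch of why the direction matters when deleting by index. The helper names and the sample data are made up for illustration:

```python
def delete_forward(values, limit):
    """Naive forward pass: deleting shifts the next item into slot i, which is then skipped."""
    out = list(values)
    i = 0
    while i < len(out):
        if out[i] > limit:
            del out[i]
        i += 1  # bug on purpose: advances even after a deletion
    return out


def delete_backward(values, limit):
    """Backward pass: deletions only shift items that were already visited."""
    out = list(values)
    for i in range(len(out) - 1, -1, -1):
        if out[i] > limit:
            del out[i]
    return out


data = [5, 12, 20, 7, 3]
forward_result = delete_forward(data, 10)
backward_result = delete_backward(data, 10)
```

The forward pass leaves 20 behind because it slid into the slot of the deleted 12, while the backward pass removes both values above the limit.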
Off the top of my head, there are two ways to implement iteration over a collection:
the iterator iterates over the collection for which it was created
the iterator iterates over a copy of the collection for which it was created
When changes are made to the collection on the fly, the first option should either update its iteration sequence (which could be very hard or even impossible to do reliably) or just deny the possibility (throw an exception). The latter is obviously the safe option.
With the second option, changes can be made to the original collection without disturbing the iteration sequence. But those changes will not be seen in the iteration, which might be confusing for users (a leaky abstraction).
I could imagine languages/libraries implementing any of these possibilities with equal merit.
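The second option can be sketched in Python by iterating over an explicit copy. The data here is made up, and list(queue) stands in for whatever snapshot mechanism a library might use:

```python
queue = ["a", "b", "c"]
visited = []

for item in list(queue):  # iterate over a snapshot, not the live list
    visited.append(item)
    if item == "a":
        queue.append("d")  # safe: the snapshot is unaffected
```

After the loop, queue contains the new element "d", but visited does not, which is exactly the invisible-change behavior described above.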

Resources