HashMap in OpenCL?

Is it possible to create a simple HashMap in OpenCL? E.g. one where all keys have type long and all values type int, and that never has to be modified (i.e. is passed read-only to the kernel).
Construction of the HashMap can take time (it is done once on the CPU and never has to be modified again), but read access will be frequent, so get(long key, *hashmap H) should be cheap.
Are there any known implementations for this in OpenCL? I failed to find them. In case I'd have to write one from scratch, which HashMap implementation would be most suitable for this use?

I think that a simple hash table implementation using open addressing could fulfill your requirements here:
By its nature it is stored in a single buffer, and is thus trivial to transfer to the kernels.
It's then easy to write the getter logic in the kernel, especially when you don't need any synchronization (read-only).
So, pass a buffer of long2 or a buffer of struct { long key; int val; }, where the first item is the key and the second the value, and also pass the buffer size; then write a regular open-addressing getter.
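To make this concrete, here is a minimal sketch of such a table with linear probing, written as plain C-style C++ so the getter loop ports almost verbatim to OpenCL C. The entry struct, the EMPTY_KEY sentinel and the power-of-two capacity are my assumptions, not something given in the question:

#include <cstddef>
#include <cstdint>
#include <vector>

// One slot of the open-addressed table; key == EMPTY_KEY marks a free slot.
struct entry {
    int64_t key;
    int32_t val;
};

const int64_t EMPTY_KEY = INT64_MIN; // assumed sentinel; must never occur as a real key

// Host side, done once: capacity should be a power of two and comfortably
// larger than the number of keys (e.g. 2x), so probe chains stay short.
std::vector<entry> make_table(size_t capacity) {
    return std::vector<entry>(capacity, entry{EMPTY_KEY, 0});
}

void put(std::vector<entry> &table, int64_t key, int32_t val) {
    size_t mask = table.size() - 1;
    size_t i = (size_t)key & mask;      // cheap hash; use a better mix if keys cluster
    while (table[i].key != EMPTY_KEY)   // linear probing, assumes key is not present yet
        i = (i + 1) & mask;
    table[i].key = key;
    table[i].val = val;
}

// Getter: this is the loop you would write inside the kernel, with 'table'
// being the read-only buffer and 'size' passed alongside it.
int32_t get(const entry *table, size_t size, int64_t key, int32_t not_found) {
    size_t mask = size - 1;
    size_t i = (size_t)key & mask;
    while (table[i].key != EMPTY_KEY) {
        if (table[i].key == key)
            return table[i].val;
        i = (i + 1) & mask;
    }
    return not_found;
}

Inside the kernel the same get loop runs unchanged against the read-only __global buffer, and since the table is never modified after construction, no synchronization is needed.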

Related

How to implement thread-safe map of maps in golang?

I am working on a multi-threaded module and need to implement a map of maps in golang - map[outer]map[inner]*some_struct. The outer key (map[outer]) will be accessed by multiple threads (goroutines) to add keys to the inner map. My doubt is whether multiple threads can concurrently add keys to the inner map for a common outer key - map[outer]. Is that thread safe, and is sync.Map a better option?
Also, the outer key - map[outer] - and the total number of outer keys are only known at runtime, so I can't define locks beforehand.
To better understand the problem statement, take the example of adding information about different cities, grouped by state. Each thread represents a city. To add info about a city, a thread first needs to check the outer key - the state (map[state]) - and then simply add the info: map[state][city] = &some_struct{x:y, y:z}.
I have read a few articles and found that sync.Map is suitable for concurrent map operations and that these operations are performed atomically. But the documentation mentions as one of its use-cases: when multiple goroutines read, write, and overwrite entries for disjoint sets of keys.
It would be helpful if someone could suggest a thread-safe approach for this problem statement.
You must think in OO terms:
What do you want to represent as a map of maps?
A map of state to city makes some sense, but what kind of operations do you want to do?
Concurrent writes and reads? Why?
Do you want to iterate over all cities? Do you need to delete cities/states?
Imagine the following interface
type DB interface {
    Exists(state, city string) bool
    Get(state, city string) *some_struct
    Set(state, city string, data *some_struct)
    Delete(state, city string)
    DeleteState(state string)
    ForeachCitiesInState(state string, func(city string, data *some_struct) bool)
    Foreach(func(state, city…))
}
With this interface we can consider:
1. Use a struct with a Mutex and a map of maps to control access on each read/write/delete.
2. Same as 1, but with a read-write mutex (RWMutex) if you have more reads than writes.
3. If you don't need to loop over the cities of a particular state, perhaps you can create a single map with a composite key, like state:city, to simplify.
4. If you load the data from somewhere else at a constant time interval, perhaps you should use an atomic.Value to store the big map; an update is then just a substitution of the whole map with a more recent one.
5. Perhaps you can combine several RW locks, for instance one for the states and another for the cities. You could split like this:
type states struct {
    sync.Mutex
    byName map[string]*state // stateName -> state
}

type state struct {
    sync.Mutex
    byLetter map[byte]*cities // cityFirstLetter -> cities
}

type cities struct {
    sync.Mutex
    byName map[string]*some_struct // cityName -> data
}
Ideas:
Define the interface
Define (or measure) the real scenario of usage
Write benchmarks
Be careful when returning a pointer to the data: the caller can change your internal state. Consider returning a copy or an interface.

HashMap vs ConcurrentHashMap for a threadsafe value type

HashMap vs ConcurrentHashMap: when the value is an AtomicInteger or a LongAdder, is there any harm in using HashMap in a multithreaded environment?
Yes, there is.
An object being of type AtomicInteger or LongAdder just means that the object itself is safe under concurrent modification (i.e. if two threads try to modify it, they will do so one after the other). However, if the map containing the objects is itself a HashMap, then concurrent modification operations on the map are not safe. For instance, if you want to add a key-value pair only if the key doesn't already exist in the map, you cannot safely use the putIfAbsent() operation anymore, because it is not synchronized/thread-safe in HashMap. And if you do use it, it is possible that two threads will call this method at the same time, both of them concluding that the map doesn't have the key, and then both of them adding a key-value pair, with one of them overwriting the other's value.
You cannot use a HashMap in a multithreaded environment. The reason is as follows:
If multiple threads operate on a plain HashMap they can damage its internal structure, which is an array of linked lists. Links can go missing or form cycles, and the map becomes totally unusable and corrupt. This is why you should always use a ConcurrentHashMap in a multithreaded environment, regardless of what values you want to store in the map itself.
Now, in a ConcurrentHashMap of, say, type Map<String, 'any number'>, 'any number' could be a LongAdder or an AtomicLong, etc. Remember that not all compound operations on a ConcurrentHashMap are thread-safe by default. However, if you use, say, a LongAdder as the value, you can write the following without any additional synchronization:
map.putIfAbsent("abc", new LongAdder());
map.get("abc").increment();

Design of a high-performance sorted data structure read by many threads and written by few

I have an interesting data structure design problem that is beyond my current expertise. I'm seeking data structure or algorithm answers about tackling this problem.
The requirements:
- Store a reasonable number of (pointer address, size) pairs (effectively two numbers; the first is useful as a sorting key) in one location.
- In a highly threaded application, many threads will look up values to see if a specific pointer falls within one of the (address, size) pairs - that is, treating each pair as a memory range, whether the pointer lies within any range in the list. Threads will much more rarely add or remove entries from this list.
- Reading or searching for values must be as fast as possible, happening hundreds of thousands to millions of times a second.
- Adding or removing values, i.e. mutating the list, happens much more rarely; performance is not as important.
- It is acceptable, but not ideal, for the list contents to be out of date, i.e. for a thread's lookup code to not find an entry that should exist, so long as at some point the entry will exist.
I am keen to avoid a naive implementation such as having a critical section to serialize access to a sorted list or tree. What data structures or algorithms might be suitable for this task?
Tagged with Delphi since I am using that language for this task. Language-agnostic answers are very welcome. However, I probably cannot use any of the standard libraries in any language without a lot of care, because memory access (allocation, freeing, etc. of objects and their internal memory, e.g. tree nodes) is strictly controlled and must go through my own functions. My current code elsewhere in the same program uses red/black trees and a bit trie, which I've written myself, and object and node allocation runs through custom memory allocation routines. This is beyond the scope of the question, but is mentioned here to avoid an answer like 'use STL structure foo.' I'm keen for an algorithmic or structure answer that, given the right references or textbooks, I can implement myself.
I would use a TDictionary<Pointer, Integer> (from Generics.Collections) combined with a TMREWSync (from SysUtils) for multi-read exclusive-write access. TMREWSync allows multiple readers simultaneous access to the dictionary, as long as no writer is active. The dictionary itself provides O(1) lookup of pointers.
If you don't want to use the RTL classes the answer becomes: use a hash map combined with a multi-read exclusive-write synchronization object.
EDIT: I just realized that your pairs really represent memory ranges, so a hash map does not work. In this case you could use a sorted list (sorted by memory address) and then use binary search to quickly find a matching range. That makes the lookup O(log n) instead of O(1), though.
Exploring a bit the replication idea ...
From the correctness point of view, reader/writer locks will do the work. However, in practice, while readers may be able to access the structure concurrently and in parallel, they will create huge contention on the lock, for the obvious reason that locking even for read access involves writing to the lock itself. This will kill the performance in a multi-core system and even more in a multi-socket system.
The reason for the low performance is the cache-line invalidation/transfer traffic between cores/sockets. (As a side note, here's a very recent and very interesting study on the subject: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask.)
Naturally, we can avoid the inter-core cache transfers triggered by readers by making a copy of the structure on each core and restricting reader threads to accessing only the copy local to the core they are currently executing on. This requires some mechanism for a thread to obtain its current core id. It also relies on the operating system scheduler not to move threads gratuitously across cores, i.e. to maintain core affinity to some extent. AFAICT, most current operating systems do this.
As for the writers, their job would be to update all the existing replicas, obtaining each lock for writing. Updating one tree (apparently the structure should be some tree) at a time does mean a temporary inconsistency between replicas; from the problem description this seems acceptable. When a writer works, it will block readers on a single core, but not all readers. The drawback is that a writer has to perform the same work many times - as many times as there are cores or sockets in the system.
PS.
Maybe, just maybe, another alternative is some RCU-like approach, but I don't know
this well, so I'll just stop after mentioning it :)
With replication you could have:
- one copy of your data structure (a list with binary search, the interval tree mentioned, ...) - say, the "original" one - that is used only for lookup (read access);
- a second copy, the "update" one, created when the data is to be altered (write access), so the write is made to the update copy.
Once writing completes, a "current" pointer is switched from the "original" to the "update" version. With an access counter on the "original" copy, it can be destroyed once the counter has decremented back to zero readers.
In pseudo-code:
// read:
data = get4Read();
... do the lookup
release4Read(data);

// write:
data = get4Write();
... alter the data
release4Write(data);

// implementation:
// current is the data structure + a 'readers' counter, initially set to 0

get4Read() {
    lock(current_lock) {            // exclusive access to current
        current.readers++;          // one more reader
        return current;
    }
}

release4Read(copy) {
    lock(current_lock) {            // exclusive access to current
        if(0 == --copy.readers) {   // last reader
            if(copy != current) {   // it was the old, "original" one
                delete(copy);       // destroy it
            }
        }
    }
}

get4Write() {
    acquire_writelock(update_lock); // blocks concurrent writers!
    var copy_from = get4Read();
    var copy_to = deep_copy(copy_from);
    copy_to.readers = 0;
    return copy_to;
}

release4Write(data) {
    lock(current_lock) {            // exclusive access to current
        var copy_from = current;
        current = data;
    }
    release4Read(copy_from);
    release_writelock(update_lock); // next write can come
}
To complete the answer regarding the actual data structure to use: given the fixed and quite small size of the data entries (a tuple of two integers), I would use an array for storage and binary search for the lookup. (An alternative would be the balanced tree mentioned in the comment.)
Talking about performance: as I understand it, 'address' and 'size' define ranges, so a lookup for a given address within such a range would involve computing 'address' + 'size' (to compare the queried address with the range's upper bound) over and over again. It may be more performant to store the start and end address explicitly, instead of the start address and size, to avoid this repeated addition.
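As a concrete illustration of that last point, here is a small sketch (the names and the flat-array layout are my own assumptions, and the array itself would of course be allocated through your own memory routines): the ranges are kept sorted by start address and stored as explicit [start, end) pairs, and a lookup binary-searches for the last range that starts at or below the queried address.

#include <cstddef>
#include <cstdint>

// One entry: a half-open memory range [start, end).
struct Range {
    uintptr_t start;
    uintptr_t end;
};

// 'ranges' must be sorted by 'start' and non-overlapping.
// Returns true if p lies inside one of the ranges.
bool contains(const Range *ranges, size_t count, uintptr_t p)
{
    size_t lo = 0, hi = count;              // search window [lo, hi)
    while (lo < hi) {                       // find the first range with start > p
        size_t mid = lo + (hi - lo) / 2;
        if (ranges[mid].start <= p)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return false;                       // p lies below the first range
    return p < ranges[lo - 1].end;          // ranges[lo - 1].start <= p holds by construction
}

Storing start/end instead of start/size avoids the repeated addition mentioned above, and the whole structure is one contiguous array, which also combines nicely with the per-core replication idea.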
Read the LMDB design papers at http://symas.com/mdb/ . An MVCC B+tree with lockless reads and copy-on-write writes. Reads are always zero-copy, writes may optionally be zero-copy as well. Can easily handle millions of reads per second in the C implementation. I believe you should be able to use this in your Delphi program without modification, since readers also do no memory allocation. (Writers may do a few allocations, but it's possible to avoid most of them.)
As a side note, here's a good read about memory barriers: Memory Barriers: a Hardware View for Software Hackers
This is just to answer a comment by @fast; the comment space is not big enough ...
@chill: Where do you see the need to place any 'memory barriers'?
Everywhere, where you access shared storage from two different cores.
For example, a writer comes, makes a copy of the data and then calls release4Write. Inside release4Write, the writer does the assignment current = data to update the shared pointer with the location of the new data, decrements the counter of the old copy to zero and proceeds with deleting it.
Now a reader intervenes and calls get4Read, and inside get4Read it does copy = current. Since there is no memory barrier, this happens to read the old value of current. For all we know, the write may be reordered after the delete call, or the new value of current may still reside in the writer's store queue, or the reader may not yet have seen and processed a corresponding cache invalidation request, and whatnot ...
Now the reader happily proceeds to search in that copy of the data
that the writer is deleting or has just deleted. Oops!
But, wait, there's more! :D
With proper use of the get..() and release..() functions, where do you see the problem of accessing deleted data or of multiple deletion?
See the following interleaving of reader and write operations.
Reader                      Shared data     Writer
======                      ===========     ======
                            current = A:0
data = get4Read()
  var copy = A:0
  copy.readers++;
                            current = A:1
  return A:1
data = A:1
... do the lookup
release4Read(copy == A:1):
  --copy.readers
                            current = A:0
  0 == copy.readers -> true
                                            data = get4Write():
                                              acquire_writelock(update_lock)
                                              var copy_from = get4Read():
                                                var copy = A:0
                                                copy.readers++;
                            current = A:1
                                                return A:1
                                              copy_from == A:1
                                              var copy_to = deep_copy(A:1);
                                              copy_to == B:1
                                              return B:1
                                            data == B:1
                                            ... alter the data
                                            release4Write(data = B:1)
                                              var copy_from = current;
                                              copy_from == A:1
                                              current = B:1
                            current = B:1
  A:1 != B:1 -> true
  delete A:1
                                              !!! release4Read(A:1) !!!
And the writer accesses deleted data and then tries to delete it again. Double oops!

Assign string to zmq::message_t without copying

I need to do some high-performance C++ work, and that is why I need to avoid copying data whenever possible.
Therefore I want to assign a string buffer directly to a zmq::message_t object without copying it. But there seems to be some deallocation of the string which prevents successful sending.
Here is the piece of code:
for (pair<int, string> msg : l) {
    comm_out.send_int(msg.first);
    comm_out.send_int(t_id);
    int size = msg.second.size();
    zmq::message_t m((void *) std::move(msg.second).data(), size, NULL, NULL);
    comm_out.send_frame_msg(m, false); // some zmq-wrapper class
}
How can I avoid the string being deallocated before the message is sent out? And when exactly is the string deallocated?
Regards
I think that zmq::message_t m((void *) std::move(msg.second).data()... is probably undefined behaviour, but it is certainly the cause of your problem. In this instance, std::move isn't doing what I suspect you think it does.
std::move is only a cast to an rvalue reference; nothing is actually moved, and .data() simply returns a pointer into msg.second's buffer. Since msg is a by-value loop variable, that buffer is destroyed at the end of each loop iteration, while 0MQ sends asynchronously and assumes the pointer remains valid until it has finished with the frame - so the data can disappear before the send actually happens.
Zero-copy is a complicated matter in 0MQ (see the 0MQ Guide for more details); you have to ensure that the data that hasn't been copied remains valid until 0MQ tells you explicitly that it is finished with it.
Using C++ strings in this situation is hard and requires a lot of thought. Your question about how to "avoid that the string is deallocated..." goes right to the heart of the issue. The only answer to that is "with great care".
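If zero-copy really is required, one way to satisfy that contract is to hand 0MQ ownership of a heap-allocated string together with a free function that it calls once it is done with the buffer. Below is a rough sketch along the lines of the question's loop (comm_out, send_int, send_frame_msg, l and t_id are the asker's wrapper and variables; it also assumes send_frame_msg eventually passes the message to zmq::socket_t::send without copying it):

#include <string>
#include <utility>
#include <zmq.hpp>

// Called by 0MQ once it has finished with the buffer (i.e. after the frame is sent).
static void free_string(void *data, void *hint)
{
    delete static_cast<std::string *>(hint);
}

// ... inside the same sending loop as in the question:
for (std::pair<int, std::string> &msg : l) {
    comm_out.send_int(msg.first);
    comm_out.send_int(t_id);
    // Move the string onto the heap so it outlives this loop iteration;
    // 0MQ releases it via free_string when it no longer needs the data.
    std::string *payload = new std::string(std::move(msg.second));
    zmq::message_t m((void *) payload->data(), payload->size(),
                     free_string, payload);
    comm_out.send_frame_msg(m, false);
}

The price is one heap allocation per message; whether that actually beats letting zmq::message_t copy the bytes is something only a measurement can tell, which is why the closing question below matters.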
In short, are you sure you need zero-copy at all?

Golang: Best way to read from a hashmap w/ mutex

This is a continuation from here: Golang: Shared communication in async http server
Assuming I have a hashmap w/ locking:
//create async hashmap for inter request communication
type state struct {
    *sync.Mutex                      // inherits locking methods
    AsyncResponses map[string]string // map ids to values
}

var State = &state{&sync.Mutex{}, map[string]string{}}
Functions that write to this will take the lock. My question is: what is the best / fastest way to have another function check for a value without blocking writes to the hashmap? I'd like to know the instant a value is present in it.
MyVal = State.AsyncResponses[MyId]
Reading a shared map without blocking writers is the very definition of a data race. Actually, it is semantically a data race even when the writers are blocked during the read: as soon as you finish reading the value and unblock the writers, the value may no longer exist in the map.
Anyway, it's not very likely that proper synchronization would be a bottleneck in many programs. An uncontended lock of a {RW,}Mutex probably costs on the order of < 20 ns even on mid-range CPUs. I suggest postponing optimization not only until the program is correct, but also until you have measured where the major part of the time is being spent.

Resources