Following this question and other questions I asked, I got some suggestions.
tl;dr:
I'm trying to run a "foreach" loop asynchronously. Each iteration updates a few hashes independently. The problem is that each thread keeps its own copy of the memory, and I don't know how to merge it all back together.
I got a few suggestions, but I ran into problems with almost every one:
When I tried threads/fork, shared memory was the problem: I had to mark everything as shared, only shared values may be assigned to those hashes, and it turned into a big mess... (If there is a way to share everything, even variables defined later, that might be a solution.)
When I tried writing all the hashes to files (as JSON), all the blessing was lost, and re-blessing everything from the top is a big mess too...
Any ideas how I can do this more easily or faster?
In some problems a common data structure must indeed be shared between different threads.
When that isn't necessary, things are greatly simplified: each thread simply returns, at join time, a reference to a data structure built in the thread during its run. The main thread can then process those results (or merge them first if needed). Here is a simple demo.
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd); # or use core Data::Dumper
use threads;
# Start threads. Like threads->create(...)
my @thr = map { async { proc_thr($_) } } 1..3;

# Wait for threads to complete. If they return, that happens here
my @res = map { $_->join } @thr;

# Process results (just print in this case)
dd $_ for @res;

sub proc_thr {
    my ($num) = @_;

    # A convoluted example, to return a complex data structure
    my %ds = map { 'k'.$_ => [ $_*10 .. $_*10 + 2 ] } 10*$num .. 10*$num+2;

    return \%ds;
}
This prints
{ k10 => [100, 101, 102], k11 => [110, 111, 112], k12 => [120, 121, 122] }
{ k20 => [200, 201, 202], k21 => [210, 211, 212], k22 => [220, 221, 222] }
{ k30 => [300, 301, 302], k31 => [310, 311, 312], k32 => [320, 321, 322] }
Now manipulate these returned data structures as suitable; work with them as they stand or merge them. I can't discuss that because we aren't told what kind of data needs to be passed around. This roughly provides what was asked for, as far as I can tell.
Important notes
Lots of threads? Large data structures to merge? Then this may not be a good approach.
The word "bless" was mentioned, tantalizingly. If what you'd pass around are objects, then they need to be serialized for that, in such a way that the main thread can reconstruct the object.
Or, pass the object's data, either as a reference (as above) or by serializing it and passing the string; then the main thread can populate its own object from that data.
Returning (join-ing) an object itself (so a reference, as above, an object being a reference) doesn't seem to fully "protect your rights"; I find that at least some operator overloading is lost (even though all methods seem to work and the data is accessible and workable).
This is a whole other question, of passing objects around.†
If the work to be done is I/O-bound (lots of work with the filesystem), then this whole approach (with threads) needs to be carefully reconsidered. It may even slow things down.
Altogether: we need more of a description, and far more detail.
† This has been addressed on Stackoverflow. A couple of directly related pages that readily come to mind for me are here and here.
In short, objects can be serialized and restored using for example Storable. Pure JSON cannot do objects, while extensions can. On the other hand, pure JSON is excellent for serializing data in an object, which can then be used on the other end to populate an identical object of that class.
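The pattern itself is language-neutral; here is a small Java sketch of the same idea, just for illustration (all class names are invented for this example): the worker returns plain data, and the receiving side populates a full object of the right class from it.

import java.util.concurrent.Callable;

// Plain data computed inside a worker; no behavior attached (like unblessed data).
class PointData {
    final double x, y;
    PointData(double x, double y) { this.x = x; this.y = y; }
}

// The class with behavior (the "blessed" one, in Perl terms) lives on the
// receiving side and is repopulated from the plain data.
class Point {
    private final double x, y;
    Point(PointData d) { this.x = d.x; this.y = d.y; }
    double norm() { return Math.hypot(x, y); }
}

// The worker hands back only data; the main thread does: new Point(future.get())
class Worker implements Callable<PointData> {
    public PointData call() { return new PointData(3.0, 4.0); }
}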
Is it undefined behavior for many threads to read the same file at once? An example is seen below. It seems like this might be UB by definition, but it remains unclear to me.
use std::fs;
use std::thread;

fn main() {
    let mut threads = Vec::new();
    for _ in 0..100 {
        let thread = thread::spawn(move || {
            fs::read_to_string("./config/init.json")
                .unwrap()
                .trim()
                .to_string()
        });
        threads.push(thread);
    }
    for handler in threads {
        handler.join().unwrap();
    }
}
On most operating systems only individual read operations are guaranteed to be atomic. read_to_string may perform multiple distinct reads, which means that it's not guaranteed to be atomic between multiple threads/processes. If another process is modifying this file concurrently, read_to_string could return a mixture of data from before and after the modification. In other words, each read_to_string operation is not guaranteed to return an identical result, and some may even fail while others succeed if another process deletes the file while the program is running.
However, none of this behavior is classified as "undefined." Absent hardware problems, you are guaranteed to get back a std::io::Result<String> in a valid state, which is something you can reason about. Once UB is invoked, you can no longer reason about the state of the program.
By way of analogy, consider a choose your own adventure book. At the end of each segment you'll have some instructions like "If you choose to go into the cave, go to page 53. If you choose to take the path by the river, go to page 20." Then you turn to the appropriate page and keep reading. This is a bit like Result -- if you have an Ok you do one thing, but if you have an Err you do another thing.
Once undefined behavior is invoked, this kind of choice no longer makes sense because the program is in a state where the rules of the language no longer apply. The program could do anything, including deleting random files from your hard drive. In the book analogy, the book caught fire. Trying to follow the rules of the book no longer makes any sense, and you hope the book doesn't burn your house down with it.
In Rust you're not supposed to be able to invoke UB without using the unsafe keyword, so if you don't see that keyword anywhere then UB isn't on the table.
I have an interesting data structure design problem that is beyond my current expertise. I'm seeking data structure or algorithm suggestions for tackling it.
The requirements:
Store a reasonable number of (pointer address, size) pairs (effectively two numbers; the first is useful as a sorting key) in one location
In a highly threaded application, many threads will look up values to see whether a specific pointer falls within one of the (address, size) pairs - that is, treating each pair as a memory range, whether the pointer lies within any range in the list. Threads will much more rarely add or remove entries from this list.
Reading or searching for values must be as fast as possible, happening hundreds of thousands to millions of times a second
Adding or removing values, i.e. mutating the list, happens much more rarely; its performance is not as important.
It is acceptable, but not ideal, for the list contents to be out of date, i.e. for a thread's lookup to miss an entry that should exist, so long as the entry will exist at some point.
I am keen to avoid a naive implementation such as having a critical section to serialize access to a sorted list or tree. What data structures or algorithms might be suitable for this task?
Tagged with Delphi since I am using that language for this task; language-agnostic answers are very welcome. However, I probably cannot use any of the standard libraries in any language without a lot of care. The reason is that memory access (allocation, freeing, etc. of objects and their internal memory, e.g. tree nodes) is strictly controlled and must go through my own functions. My current code elsewhere in the same program uses red/black trees and a bit trie, and I've written these myself. Object and node allocation runs through custom memory allocation routines. This is beyond the scope of the question, but is mentioned here to avoid an answer like 'use STL structure foo.' I'm keen for an algorithmic or structure answer that, so long as I have the right references or textbooks, I can implement myself.
I would use a TDictionary<Pointer, Integer> (from Generics.Collections) combined with a TMREWSync (from SysUtils) for the multi-read exclusive-write access. TMREWSync allows multiple readers simultaneous access to the dictionary, as long as no writer is active. The dictionary itself provides O(1) lookup of pointers.
If you don't want to use the RTL classes the answer becomes: use a hash map combined with a multi-read exclusive-write synchronization object.
EDIT: Just realized that your pairs really represent memory ranges, so a hash map does not work. In this case you could use a sorted list (sorted by memory address) and then use binary search to quickly find a matching range. That makes the lookup O(log n) instead of O(1), though.
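Just to make that lookup concrete, here is a sketch in Java (the question invites language-agnostic answers; all names here are invented, and the ranges are assumed sorted, non-overlapping, and non-empty):

import java.util.Arrays;

class RangeLookup {
    private final long[] starts; // range start addresses, sorted ascending
    private final long[] sizes;  // sizes[i] belongs to starts[i]

    RangeLookup(long[] starts, long[] sizes) {
        this.starts = starts;
        this.sizes = sizes;
    }

    // True if addr falls inside any (start, size) range.
    boolean contains(long addr) {
        int i = Arrays.binarySearch(starts, addr);
        if (i >= 0) return true;  // addr is exactly some range's start
        int prev = -i - 2;        // index of the greatest start below addr
        return prev >= 0 && addr < starts[prev] + sizes[prev];
    }
}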
Exploring the replication idea a bit ...
From the correctness point of view, reader/writer locks will do the job. In practice, however, while readers may be able to proceed concurrently and in parallel with accessing the structure, they will create huge contention on the lock, for the obvious reason that locking even for read access involves writing to the lock itself. This will kill performance on a multi-core system, and even more so on a multi-socket system.
The reason for the low performance is the cache-line invalidation/transfer traffic between cores/sockets. (As a side note, here's a very recent and very interesting study on the subject: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask.)
Naturally, we can avoid the inter-core cache transfers triggered by readers by making a copy of the structure on each core and restricting reader threads to accessing only the copy local to the core they are currently executing on. This requires some mechanism for a thread to obtain its current core id. It also relies on the operating system scheduler not to gratuitously move threads across cores, i.e. to maintain core affinity to some extent. AFAICT, most current operating systems do.
As for the writers, their job would be to update all the existing replicas, obtaining each lock for writing. Updating one tree at a time (apparently the structure should be some kind of tree) does mean a temporary inconsistency between replicas; from the problem description this seems acceptable. While a writer works, it blocks readers on a single core, but not all readers. The drawback is that the writer has to perform the same work many times, as many times as there are cores or sockets in the system.
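To sketch the shape of this scheme, here is an illustrative Java version (everything here is invented for the example; Java has no portable way to ask for the current core id, so this stripes replicas across threads instead, which only approximates core-locality):

import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// One replica of the structure (and one lock) per slot. A reader only ever
// touches its own slot, so readers in different slots do not contend on the
// same lock (and thus not on the same cache lines).
class Replicated<T> {
    private final int slots = Runtime.getRuntime().availableProcessors();
    private final Object[] replicas = new Object[slots];
    private final ReentrantReadWriteLock[] locks = new ReentrantReadWriteLock[slots];

    Replicated(T initial) {
        for (int i = 0; i < slots; i++) {
            replicas[i] = initial; // safe to share if T is treated as immutable
            locks[i] = new ReentrantReadWriteLock();
        }
    }

    private int mySlot() {
        // Stand-in for "current core id": stripe by thread instead.
        return (int) (Thread.currentThread().getId() % slots);
    }

    @SuppressWarnings("unchecked")
    <R> R read(Function<T, R> lookup) {
        int s = mySlot();
        locks[s].readLock().lock();
        try {
            return lookup.apply((T) replicas[s]);
        } finally {
            locks[s].readLock().unlock();
        }
    }

    // A writer updates every replica in turn, so replicas are briefly
    // inconsistent with each other (acceptable per the problem statement).
    @SuppressWarnings("unchecked")
    void write(UnaryOperator<T> update) {
        for (int s = 0; s < slots; s++) {
            locks[s].writeLock().lock();
            try {
                replicas[s] = update.apply((T) replicas[s]);
            } finally {
                locks[s].writeLock().unlock();
            }
        }
    }
}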
PS. Maybe, just maybe, another alternative is some RCU-like approach, but I don't know it well, so I'll just stop after mentioning it :)
With replication you could have:
- one copy of your data structure (a list with binary search, the interval tree mentioned, ...) (say, the "original" one) that is used only for lookups (read access);
- a second copy, the "update" one, created when the data is to be altered (write access), so that the write is made to the update copy.
Once writing completes, switch some "current" pointer from the "original" to the "update" version. With an access counter on the "original" copy, it can be destroyed once the counter has decremented back to zero readers.
In pseudo-code:

// read:
data = get4Read();
... do the lookup
release4Read(data);

// write:
data = get4Write();
... alter the data
release4Write(data);

// implementation:
// current is the data structure + a 'readers' counter, initially set to 0

get4Read() {
    lock(current_lock) {    // exclusive access to current
        current.readers++;  // one more reader
        return current;
    }
}

release4Read(copy) {
    lock(current_lock) {    // exclusive access to current
        if (0 == --copy.readers) {  // last reader
            if (copy != current) {  // it was the old, "original" one
                delete(copy);       // destroy it
            }
        }
    }
}

get4Write() {
    acquire_writelock(update_lock);  // blocks concurrent writers!
    var copy_from = get4Read();
    var copy_to = deep_copy(copy_from);
    copy_to.readers = 0;
    return copy_to;
}

release4Write(data) {
    lock(current_lock) {    // exclusive access to current
        var copy_from = current;
        current = data;
    }
    release4Read(copy_from);
    release_writelock(update_lock);  // next write can come
}
To complete the answer regarding the actual data structure to use:
Given the fixed and quite small size of the data entries (a tuple of two integers), I would use an array for storage and binary search for the lookup. (An alternative would be the balanced tree mentioned in the comment.)
Talking about performance: as I understand it, 'address' and 'size' define ranges, so checking whether a given address lies within such a range involves the addition 'address' + 'size' (to compare the queried address against the range's upper bound) over and over again. It may be more performant to store the start and end addresses explicitly, instead of the start address and size, to avoid this repeated addition.
Read the LMDB design papers at http://symas.com/mdb/ . An MVCC B+tree with lockless reads and copy-on-write writes. Reads are always zero-copy, writes may optionally be zero-copy as well. Can easily handle millions of reads per second in the C implementation. I believe you should be able to use this in your Delphi program without modification, since readers also do no memory allocation. (Writers may do a few allocations, but it's possible to avoid most of them.)
As a side note, here's a good read about memory barriers: Memory Barriers: a Hardware View for Software Hackers
This is just to answer a comment by @fast; the comment space is not big enough ...
@chill: Where do you see the need to place any 'memory barriers'?
Everywhere you access shared storage from two different cores.
For example, a writer comes along, makes a copy of the data, and then calls release4Write. Inside release4Write, the writer does the assignment current = data to update the shared pointer with the location of the new data, decrements the counter of the old copy to zero, and proceeds to delete it. Now a reader intervenes and calls get4Read, and inside get4Read it does copy = current. Since there is no memory barrier, this happens to read the old value of current. For all we know, the write may be reordered after the delete call, or the new value of current may still reside in the writer's store queue, or the reader may not yet have seen and processed a corresponding cache invalidation request, and whatnot ...
Now the reader happily proceeds to search in that copy of the data that the writer is deleting or has just deleted. Oops!
But, wait, there's more! :D
"With proper use of the get..() and release..() functions, where do you see the problem of accessing deleted data or multiple deletion?"
See the following interleaving of reader and writer operations.
Reader                             Shared data     Writer
======                             ===========     ======
                                   current = A:0
data = get4Read()
  var copy = A:0
  copy.readers++                   current = A:1
  return A:1
data == A:1
... do the lookup
release4Read(copy == A:1):
  --copy.readers                   current = A:0
  0 == copy.readers -> true
                                                   data = get4Write():
                                                     acquire_writelock(update_lock)
                                                     var copy_from = get4Read():
                                                       var copy = A:0
                                                       copy.readers++
                                   current = A:1
                                                       return A:1
                                                     copy_from == A:1
                                                     var copy_to = deep_copy(A:1)
                                                     copy_to == B:1
                                                     return B:1
                                                   data == B:1
                                                   ... alter the data
                                                   release4Write(data == B:1):
                                                     var copy_from = current
                                                     copy_from == A:1
                                                     current = B:1
                                   current = B:1
  A:1 != B:1 -> true
  delete A:1
                                                     !!! release4Read(A:1) !!!
And the writer accesses deleted data and then tries to delete it again. Double oops!
I have several very large tables in MySQL (millions of rows) that I need to load into my Perl script.
We then do some custom processing of the data and aggregate it into a hash. Unfortunately, that custom processing can't be implemented in MySQL.
Here's some quick pseudocode.
my @data;
for my $table_num (@table_numbers) {
    my $sth = $dbh->prepare(...);
    $sth->execute();
    $sth->bind_columns(\my ($a, $b, $c, ...));
    while ($sth->fetch()) {
        $data[$table_num]{ black_box($a) }{ secret_func($b) } += $c;
    }
}
my $x = $#data + 1;
for my $num (@table_numbers) {
    for my $a (keys %{ $data[$num] }) {
        for my $b (keys %{ $data[$num]{$a} }) {
            $data[$x]{$a}{$b} += $data[$num]{$a}{$b};
        }
    }
}
Now, the first loop can take several minutes per iteration to run, so I am thinking of ways to run its iterations in parallel. I have looked at Perl threads before, but they seem to just run several Perl interpreters at once; my script already uses a lot of memory, and merging the data seems like it would be problematic. Also, at this stage, the script is not using a lot of CPU.
I have been looking at possibly using Coro threads, but it seems like there would be a learning curve, plus a fairly complex integration with my current code. What I would like to know is whether I am likely to see any gains by going this route. Are there better ways of multithreading code like this? I cannot afford to use any more memory than my code already uses. Is there something else I can do here?
Unfortunately, doing the aggregation in MySQL is not an option, and rewriting the code in a different language would be too time-consuming. I am aware that using arrays instead of hashes would likely make my code faster and use less memory, but again that would require a major rewrite of a large script.
Edit: The above is pseudocode; the actual logic is a lot more complex. The bucketing is based on several DB tables and many more inputs than just $a and $b. Precomputing them is not practical, as there are trillions of possible combinations. The main goal is to make the Perl script run faster, not to fix the SQL part of things; that would require changes to how the data is stored and indexed on the actual server, which would affect a lot of other code. Other people are working on those optimizations. My current goal is to make the code faster without changing any SQL.
You could do it in mysql simply by making black_box and secret_func tables (temporary tables, if necessary) prepopulated with the results for every existing value of the relevant columns.
Short of that, measure how much time is spent in the calls to black_box and secret_func vs. execute/fetch. If a lot is in the former, you could memoize the results:
my %black_box;
my %secret_func;
for my $table_num...
...
$data[$table_num]{ $black_box{$a} //= black_box($a) }{ $secret_func{$b} //= secret_func($b) } += $c;
If you have memory concerns, using forks instead of threads may help. They use much less memory than the standard perl threads. There is going to be somewhat of a memory penalty for multi-threading, and YMMV as far as performance goes, but you might want to try something like:
use forks;
use Thread::Queue;

my $inQueue  = Thread::Queue->new;
my $outQueue = Thread::Queue->new;

$inQueue->enqueue(@table_numbers);

# create the worker threads
my $numThreads = 4;
for (1 .. $numThreads) {
    threads->create(\&doMagic);
}

# wait for the threads to finish
$_->join for threads->list;

# collect the data
my @data;
while (my $result = $outQueue->dequeue_nb) {
    # merge $result into @data
}

sub doMagic {
    while (my $table_num = $inQueue->dequeue_nb) {
        my @data;
        # your first loop goes here
        $outQueue->enqueue(\@data);
    }
    return;
}
I've got a computation (CTR encryption) that requires results in a precise order.
For this I created a multithreaded design that calculates said results; in this case each result is a ByteBuffer. The calculation itself of course runs asynchronously, so the results may become available at any time and in any order. The "user" is a single-threaded application that uses the results by calling a method, after which the ByteBuffers are returned to the pool of resources by said method - the management of resources is already handled (using a thread-safe stack).
Now the question: I need something that aggregates the results and makes them available in the right order. If the next result is not available, the method that the user called should block until it is. Does anyone know a good strategy or class in java.util.concurrent that can return asynchronously calculated results in order?
The solution must be thread safe. I would like to avoid third-party libraries, Thread.sleep() / Thread.wait(), and threading-related keywords other than "synchronized". Furthermore, the tasks may be given to e.g. an Executor in the correct order if that is required. This is for research, so feel free to use Java 1.6 or even 1.7 constructs.
Note: I've tagged these questions [jre] as I want to keep within the classes defined in the JRE, and [encryption] as somebody may already have had to deal with this, but the question itself is purely about Java and multithreading.
Use the executors framework:
ExecutorService executorService = Executors.newFixedThreadPool(5);
List<Future<ByteBuffer>> futures = executorService.invokeAll(listOfCallables);
for (Future<ByteBuffer> future : futures) {
    // do something with future.get();
}
executorService.shutdown();
The listOfCallables will be a List<Callable<ByteBuffer>> that you have constructed to operate on the data. For example:
list.add(new SubTaskCalculator(1, 20));
list.add(new SubTaskCalculator(21, 40));
list.add(new SubTaskCalculator(41, 60));
(arbitrary ranges of numbers, adjust that to your task at hand)
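For completeness, here is one minimal shape such a Callable could take (SubTaskCalculator and its range arguments are the invented names from the example above, not an existing class; the encryption step itself is elided):

import java.nio.ByteBuffer;
import java.util.concurrent.Callable;

class SubTaskCalculator implements Callable<ByteBuffer> {
    private final int from, to;

    SubTaskCalculator(int from, int to) {
        this.from = from;
        this.to = to;
    }

    public ByteBuffer call() {
        ByteBuffer result = ByteBuffer.allocate(4096);
        // ...encrypt blocks from..to into result...
        result.flip(); // make the buffer readable for the consumer
        return result;
    }
}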
.get() blocks until the result is complete, but other tasks are running at the same time, so by the time you reach them, their .get() will be ready. Note that invokeAll returns the futures in the same order as the submitted callables, which is what yields the results in order.
Returning results in the right order is trivial. As each result arrives, store it in an ArrayList, and once you have ALL the results, just sort the list. You could use a PriorityQueue to keep the results sorted at all times as they arrive, but there is no point in doing so, since you will not be making any use of the results before all of them have arrived anyway.
So, what you could do is this:
Declare a "WorkItem" class which contains one of your bytearrays and its ordinal number, so that they can be sorted by ordinal number.
In your work threads, do something like this:
// ...do work and produce a workItem...
synchronized (lockObject) {
    resultList.add(workItem);
    numberOfResults++;
    lockObject.notifyAll();
}
In your main thread, do something like this:
synchronized (lockObject) {
    while (numberOfResults != numberOfItems)
        lockObject.wait(); // may throw InterruptedException
}
Collections.sort(resultList);
// ...go ahead and use the results...
My new answer after gaining a better understanding of what you want to do:
Declare a "WorkItem" class which contains one of your bytearrays and its ordinal number, so that they can be sorted by ordinal number.
Make use of a java.util.PriorityQueue which is kept sorted by ordinal number. Essentially, all we care about is that the first item in the priority queue at any given time is the next item to process.
Each work thread stores its result in the PriorityQueue and issues a notifyAll on some locking object.
The main thread waits on the locking object; when it wakes, if there are items in the queue and the ordinal of the first item (peeked, not dequeued) equals the number of items processed so far, it dequeues that item and processes it. If not, it keeps waiting. Once all of the items have been produced and processed, it is done.
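A minimal sketch of that design (WorkItem, OrderedResults and the field names are invented for illustration; only JRE classes and "synchronized"/wait/notifyAll are used, per the question's constraints):

import java.nio.ByteBuffer;
import java.util.PriorityQueue;

// Pairs a result buffer with its position in the required output order.
class WorkItem implements Comparable<WorkItem> {
    final int ordinal;
    final ByteBuffer buffer;

    WorkItem(int ordinal, ByteBuffer buffer) {
        this.ordinal = ordinal;
        this.buffer = buffer;
    }

    public int compareTo(WorkItem other) {
        return Integer.compare(ordinal, other.ordinal);
    }
}

class OrderedResults {
    private final PriorityQueue<WorkItem> queue = new PriorityQueue<WorkItem>();
    private int nextOrdinal = 0; // number of items handed out so far

    // Called by worker threads, in any order, as results become available.
    public synchronized void put(WorkItem item) {
        queue.add(item);
        notifyAll();
    }

    // Called by the single consumer; blocks until the next item in order exists.
    public synchronized WorkItem take() throws InterruptedException {
        while (queue.isEmpty() || queue.peek().ordinal != nextOrdinal) {
            wait();
        }
        nextOrdinal++;
        return queue.poll();
    }
}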
Problem
I have code like this:
var ls = src.iter.toList
src.iter = ls.iterator
(this is part of the copy constructor of my iterator-wrapper) which reads the source iterator and, on the next line, sets it back. The problem is that those two lines have to be atomic (especially if you consider that I change the source inside the copy constructor -- I don't like it, but well...).
I've read about Actors, but I don't see how they fit here -- they look more like a mechanism for asynchronous execution. I've read about Java solutions and using them in Scala, for example: http://naedyr.blogspot.com/2011/03/atomic-scala.html
My question is: what is the most Scala way to make some operations atomic? I don't want to use heavy artillery for this, and I would also rather not use external resources. In other words -- something that looks and feels "right".
I rather like the solution presented in the link above, because exchanging references is exactly what I do. And if I understand correctly, I would guard only those two lines, and no other code would have to be altered! But I will wait for a definitive answer.
Background
Because every Nth question, instead of an answer, gets a "but why do you use..." response, here:
How to copy iterator in Scala? :-)
I need to copy an iterator (make a fork), and that solution is the most "right" one I have read about. The problem is, it destroys the original iterator.
Solutions
Locks
For example here:
http://www.ibm.com/developerworks/java/library/j-scala02049/index.html
The only problem I see here is that I have to put a lock around those two lines, and around every other usage of iter. It is a minor thing now, but as I add more code it will be easy to forget the additional lock.
I am not saying "no", but I have no experience here, so I would like an answer from someone familiar with Scala, to point out a direction -- which solution is best for such a task in the long run.
Immutable iterator
While I appreciate the explanation by Paradigmatic, I don't see how that approach fits my problem. The thing is, the IteratorWrapper class has to wrap an iterator -- i.e. the raw iterator should be hidden within the class (usually by making it private). Methods such as hasNext() and next() should be wrapped as well. Normally next() alters the state of the object (the iterator), so in the case of an immutable IteratorWrapper it would have to return both a new IteratorWrapper and the status of next() (successful or not). Another option would be returning null when the raw next() fails; either way, this makes such an IteratorWrapper not very handy to use.
Worse, there is still no easy way to copy such an IteratorWrapper.
So either I am missing something, or the classic approach of making a piece of code atomic is actually cleaner, because all the burden is contained inside the class, and the user does not have to pay the price of the way IteratorWrapper handles its data (the raw iterator, in this case).
The Scala approach is to favor immutability whenever it is possible (and it is very often possible). Then you no longer need copy constructors, locks, mutexes, etc.
For example, you can convert the iterator to a List at object construction. Since lists are immutable, you can safely share them without having to lock:
class IteratorWrapper[A](iter: Iterator[A]) {
  val list = iter.toList
  def iteratorCopy = list.iterator
}
Here, the IteratorWrapper is also immutable. You can safely pass it around. But if you really need to change the wrapped iterator, you will need more demanding approaches. For instance you could:
Use locks
Transform the wrapper into an Actor
Use STM (akka or other implementations).
Clarifications: I lack information about your problem constraints, but here is how I understand it.
Several threads must traverse an Iterator simultaneously. A possible approach is to copy it before passing the reference to the threads. However, Scala practice aims at sharing immutable objects that do not need to be copied.
With the copy strategy, you would write something like:
//A single iterator producer
class Producer {
  val iterator: Iterator[Foo] = produceIterator(...)
}

//Several consumers, living on different threads
class Consumer(p: Producer) {
  def consumeIterator = {
    val iteratorCopy = copy(p.iterator) //BROKEN !!!
    while (iteratorCopy.hasNext) {
      doSomething(iteratorCopy.next)
    }
  }
}
However, it is difficult (or slow) to implement a copy method that is thread-safe. A possible solution using immutability would be:
class Producer {
  val list: List[Foo] = produceIterator(...).toList
  def iteratorCopy = list.iterator
}

class Consumer(p: Producer) {
  def consumeIterator = {
    val iteratorCopy = p.iteratorCopy
    while (iteratorCopy.hasNext) {
      doSomething(iteratorCopy.next)
    }
  }
}
The producer will call produceIterator once, at construction. It is immutable because its state is only a list, which is also immutable. The iteratorCopy is also thread-safe, because the list is not modified when creating the copy (so several threads can traverse it simultaneously without having to lock).
Note that calling list.iterator does not traverse the list, so it will not decrease performance in any way (as opposed to really copying the iterator each time).