Acumatica perfomance with threads - multithreading

Let's say in my cloud there are 18 cores, and each core can execute 2 threads, that means that in total I can proceed 36 threads. Let's say I created 36 threads.
I want to create 36 ARInvoices via graph ARInvoiceEntry. What is better/faster: share ARInvoiceEntry graph among 36 threads, or create 36 ARInvoiceEntry graphs.

The instance of the graph is created each time the client posts data to the server and is destroyed after the request has been processed. The data views retrieve the top data record, based on Order by. So for processing, you will have to make sure the record that you need is retrieved in your View. My choice would be to create 36 graphs. Feel free to object...this is an interesting question.

Finally I've found the way that properly speed up insertion into Acumatica. The biggest part to keep in mind is that persistence to database should be done in one thread. That is easy to achieve with locking of C#. Here are some fragments of my working solution:
Lock object:
private Object thisLock = new Object();
For each logical core I've created thread and split data for each thread separately:
int numberOfLogicalCores = Environment.ProcessorCount;
List<Thread> threads = new List<Thread>(numberOfLogicalCores);
int sizeOfOneChunk = (customers.Count / numberOfLogicalCores) + 1;
for (int i = 0; i < numberOfLogicalCores; i++)
{
int a = i;
var thr = new Thread(
() =>
{
var portions = customers.Skip(a * sizeOfOneChunk).Take(sizeOfOneChunk).ToList();
InsertCustomersFromList(portionsCustomers);
}
);
thr.Name = $"thr{i}";
threads.Add(thr);
}
foreach (var thread in threads)
{
thread.Start();
}
foreach (var thread in threads)
{
thread.Join();
}
and then part that should be executed in single thread I've locked it like this:
lock (thisLock)
{
custMaint.Actions.PressSave(); // custMaint is the name of created graph
}
In my tests improvement boost difference was three times. 110 records were inserted in 1 minute in comparison with three minutes in single threaded mode. And also resources of server were used with bigger efficiency.

Related

How can I solve the problem of script blocking?

I want to give users the ability to customize the behavior of game objects, but I found that unity is actually a single threaded program. If the user writes a script with circular statements in the game object, the main thread of unity will block, just like the game is stuck. How to make the update function of object seem to be executed on a separate thread?
De facto execution order
The logical execution sequence I want to implement
You can implement threading, but the UnityAPI is NOT thread safe, so anything you do outside of the main thread cannot use the UnityAPI. This means that you can do a calculation in another thread and get a result returned to the main thread, but you cannot manipulate GameObjects from the thread.
You do have other options though, for tasks which can take several frames, you can use a coroutine. This will also allow the method to wait without halting the main thread. It sounds like your best option is the C# Jobs System. This system essentially lets you use multithreading and manages the threads for you.
Example from the Unity Manual:
public struct MyJob : IJob
{
public float a;
public float b;
public NativeArray<float> result;
public void Execute()
{
result[0] = a + b;
}
}
// Create a native array of a single float to store the result. This example waits for the job to complete for illustration purposes
NativeArray<float> result = new NativeArray<float>(1, Allocator.TempJob);
// Set up the job data
MyJob jobData = new MyJob();
jobData.a = 10;
jobData.b = 10;
jobData.result = result;
// Schedule the job
JobHandle handle = jobData.Schedule();
// Wait for the job to complete
handle.Complete();
float aPlusB = result[0];
// Free the memory allocated by the result array
result.Dispose();

How to join all threads before deleting the ThreadPool

I am using a MultiThreading class which creates the required number of threads in its own threadpool and deletes itself after use.
std::thread *m_pool; //number of threads according to available cores
std::mutex m_locker;
std::condition_variable m_condition;
std::atomic<bool> m_exit;
int m_processors
m_pool = new std::thread[m_processors + 1]
void func()
{
//code
}
for (int i = 0; i < m_processors; i++)
{
m_pool[i] = std::thread(func);
}
void reset(void)
{
{
std::lock_guard<std::mutex> lock(m_locker);
m_exit = true;
}
m_condition.notify_all();
for(int i = 0; i <= m_processors; i++)
m_pool[i].join();
delete[] m_pool;
}
After running through all tasks, the for-loop is supposed to join all running threads before delete[] is being executed.
But there seems to be one last thread still running, while the m_pool does not exist anymore.
This leads to the problem, that I can't close my program anymore.
Is there any way to check if all threads are joined or wait for all threads to be joined before deleting the threadpool?
Simple typo bug I think.
Your loop that has the condition i <= m_processors is a bug and will actually process one extra entry past the end of the array. This is an off-by-one bug. Suppose m_processors is 2. You'll have an array that contains 2 elements with indices [0] and [1]. Yet, you'll be reading past the end of the array, attempting to join with the item at index [2]. m_pool[2] is undefined memory and you're likely going to either crash or block forever there.
You likely intended i < m_processors.
The real source of the problem is addressed by Wick's answer. I will extend it with some tips that also solve your problem while improving other aspects of your code.
If you use C++11 for std::thread, then you shouldn't create your thread handles using operator new[]. There are better ways of doing that with other C++ constructs, which will make everything simpler and exception safe (you don't leak memory if an unexpected exception is thrown).
Store your thread objects in a std::vector. It will manage the memory allocation and deallocation for you (no more new and delete). You can use other more flexible containers such as std::list if you insert/delete threads dynamically.
Fill the vector in place with std::generate or similar
std::vector<std::thread> m_pool;
m_pool.reserve(n_processors);
// Fill the vector
std::generate_n( std::back_inserter(m_pool), m_processors,
[](){ return std::thread(func); } );
Join all the elements using range-for loop and delete handles using container's functions.
for( std::thread& t: m_pool ) {
t.join();
}
m_pool.clear();

ways to express concurrency without thread

I am wondering about how concurrency can be expressed without an explicit thread object, not the implementation, which probably would use threads or thread pools, but the language design related issues.
Q1: I wonder what would be lost if there was no thread object, what couldn't be done in such a language?
Q2: I also wonder about how this would be expressed, what ways were proposed or implemented as alternatives or complements to threads?
one possibility is the MPI-programm-model (GPU as well)
lets say you have the following code
for(int i=0; i < 100; i++) {
work(i);
}
the "normal" thread-based way would be the separation of the iteration-range into multiple subsets. So something like this
Thread-1:
for(int i=0; i < 50; i++) {
work(i);
}
Thread-2:
for(int i=50; i < 100; i++) {
work(i);
}
however in MPI/GPU you do something different.
the idea is, that every core execute the same(GPU) or at least
a similar (MPI) programm. the difference is, that each core uses
a different ID, which changes the behavior of the code.
mpi-style: (not exactly the MPI-syntax)
int rank = get_core_id();
int size = get_num_core();
int subset = 100 / size;
for (int i = rank * subset;i < (rand+1)*subset; i+) {
//each core will use a different range for i
work(i);
}
the next big thing is communication. Normally you need to use all of the synchronization-stuff manually. MPI is message-based, meaning that its not perfectly suited for classical shared-memory modells (every core has access to the same memory), but in a cluster system (many cores combined with a network) it works excellent. This is not only limited to supercomputers (they use basically only mpi-style stuff), but in the recent years a new type of core-architecture (manycores) was developed. They have a local so called Network-On-Chip, so each core can send/receive messages without having the problem with synchronization.
MPI contains not only simple messages, but higher constructs to automatically scatter and gather data to every core.
Example: (again not MPI-syntax)
int rank = get_core_id();
int size = get_num_core();
int data[100];
int result;
int results[size];
if (rank == 0) { //master-core only
fill_with_stuff(data);
}
scatter(0, data); //core-0 will send the data-content to all other cores
result = work(rank, data); // every core works on the same data
gather(0,result,results); //get all local results and store them in
//the results-array of core-0
an other solutions is the openMP-libary
here you declare parallel-blocks. the whole thread-part is done by the libary itself
example:
//this will split the for-loop automatically in 4 threads
#pragma omp parallel for num_threads(4)
for(int i=0; i < 100; i++) {
work(i);
}
the big advantage is, that its fast to write. thats it
you may get better performance with writing the threads on your own,
but it takes a lot more time and knowledge about synchronization

Design pattern for asynchronous while loop

I have a function that boils down to:
while(doWork)
{
config = generateConfigurationForTesting();
result = executeWork(config);
doWork = isDone(result);
}
How can I rewrite this for efficient asynchronous execution, assuming all functions are thread safe, independent of previous iterations, and probably require more iterations than the maximum number of allowable threads ?
The problem here is we don't know how many iterations are required in advance so we can't make a dispatch_group or use dispatch_apply.
This is my first attempt, but it looks a bit ugly to me because of arbitrarily chosen values and sleeping;
int thread_count = 0;
bool doWork = true;
int max_threads = 20; // arbitrarily chosen number
dispatch_queue_t queue =
dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
while(doWork)
{
if(thread_count < max_threads)
{
dispatch_async(queue, ^{ Config myconfig = generateConfigurationForTesting();
Result myresult = executeWork();
dispatch_async(queue, checkResult(myresult)); });
thread_count++;
}
else
usleep(100); // don't consume too much CPU
}
void checkResult(Result value)
{
if(value == good) doWork = false;
thread_count--;
}
Based on your description, it looks like generateConfigurationForTesting is some kind of randomization technique or otherwise a generator which can make a near-infinite number of configuration (hence your comment that you don't know ahead of time how many iterations you will need). With that as an assumption, you are basically stuck with the model that you've created, since your executor needs to be limited by some reasonable assumptions about the queue and you don't want to over-generate, as that would just extend the length of the run after you have succeeded in finding value ==good measurements.
I would suggest you consider using a queue (or OSAtomicIncrement* and OSAtomicDecrement*) to protect access to thread_count and doWork. As it stands, the thread_count increment and decrement will happen in two different queues (main_queue for the main thread and the default queue for the background task) and thus could simultaneously increment and decrement the thread count. This could lead to an undercount (which would cause more threads to be created than you expect) or an overcount (which would cause you to never complete your task).
Another option to making this look a little nicer would be to have checkResult add new elements into the queue if value!=good. This way, you load up the initial elements of the queue using dispatch_apply( 20, queue, ^{ ... }) and you don't need the thread_count at all. The first 20 will be added using dispatch_apply (or an amount that dispatch_apply feels is appropriate for your configuration) and then each time checkResult is called you can either set doWork=false or add another operation to queue.
dispatch_apply() works for this, just pass ncpu as the number of iterations (apply never uses more than ncpu worker threads) and keep each instance of your worker block running for as long as there is more work to do (i.e. loop back to generateConfigurationForTesting() unless !doWork).

Why am i use Parallel.For get the diffrent result?

using System.Threading.Tasks;
const int _Total = 1000000;
[ThreadStatic]
static long count = 0;
static void Main(string[] args)
{
Parallel.For(0, _Total, (i) =>
{
count++;
});
Console.WriteLine(count);
}
I get different result every time, can anybody help me and tell me why?
Most likely your "count" variable isn't atomic in any form, so you are getting concurrent modifications that aren't synchronized. Thus, the following sequence of events is possible:
Thread 1 reads "count"
Thread 2 reads "count"
Thread 1 stores value+1
Thread 2 stores value+1
Thus, the "for" loop has done 2 iterations, but the value has only increased by 1. As thread ordering is "random", so will be the result.
Things can get a lot worse, of course:
Thread 1 reads count
Thread 2 increases count 100 times
Thread 1 stores value+1
In that case, all those 100 increases done by thread 2 are undone. Although that can really only happen if the "++" is actually split into at least 2 machine instructions, so it can be interrupted in the middle of the operation. In the one-instruction case, you're only dealing with interleaved hardware threads.
It's a typical race condition scenario.
So, most likely, ThreadStatic is not working here. In this concrete sample use System.Threading.Interlocked:
void Main()
{
int total = 1000000;
int count = 0;
System.Threading.Tasks.Parallel.For(0, _Total, (i) =>
{
System.Threading.Interlocked.Increment(ref count);
});
Console.WriteLine(count);
}
Similar question
C# ThreadStatic + volatile members not working as expected

Resources