Determining no of processors and cores

Determining no of processors and cores - multithreading

I have a Dell Inspiron 620. The System control panel says 1 Intel i3-2100. Intel says (http://ark.intel.com/products/53422/) it has 2 cores and four threads. Three questions: In system environment variables the no_of_processors=4; so is that 4 threads?
Regarding the cores, when I run this code:
// get a list of all processor devices
deviceList = SetupDiGetClassDevs(ref processorGuid, "ACPI", IntPtr.Zero, (int)DIGCF.PRESENT);
// attempt to process each item in the list
for (int deviceNumber = 0; ; deviceNumber++)
{
SP_DEVINFO_DATA deviceInfo = new SP_DEVINFO_DATA();
deviceInfo.cbSize = Marshal.SizeOf(deviceInfo);
// attempt to read the device info from the list, if this fails, we're at the end of the list
if (!SetupDiEnumDeviceInfo(deviceList, deviceNumber, ref deviceInfo))
{
deviceCount = deviceNumber - 1;
break;
}
}
I get 3 as the number of cores, not 2.
Also, in terms of the number of threads this system will support adequately, is that cores x processors?
Thanks.

This is pure supposition but could it be:
2 Core with 4 threads implies hyper threading
System properties are referring 4 processors => its reading the number of hyper threads available as 'logical single thread processors'
Your loop starts at 0 so you're still seeing 4.

Related

AMD SMT or Intel HT performance

I don't really understand why processors with doubled logical processors are much more expensive then single logical processors. As far as I noticed there is no difference with running code on 6 or 12 threads for 6 cores/12 threads CPU.
As monkeys asked, here is C# example emulating heavy load on each thread:
static void Main(string[] args)
{
if (IntPtr.Size != 8)
throw new Exception("use only x64 code, 2020 is coming...");
//6 for physical cores, 12 for logical cores
const int limit_threads = 12;
const int limit_actions = 256;
const int limit_loop = 1000 * 1000 * 10;
const double power = 1.0 / 17.0;
long result = 0;
var action = new Action(() =>
{
long value = 0;
for (int i = 0; i < limit_loop; i++)
value += (long)Math.Pow(i, power);
Interlocked.Add(ref result, value);
});
var actions = Enumerable.Range(0, limit_actions).Select(x => action).ToArray();
var sw = Stopwatch.StartNew();
Parallel.Invoke(new ParallelOptions()
{
MaxDegreeOfParallelism = limit_threads
}, actions);
Console.WriteLine($"done in {sw.Elapsed.TotalSeconds}s\nresult={result}\nlimit_threads={limit_threads}\nlimit_actions={limit_actions}\nlimit_loop={limit_loop}");
}
Results for 6 threads (AMD Ryzen 2600):
done in 13,7074543s
result=5086445312
limit_threads=6
limit_actions=256
limit_loop=10000000
Results for 12 threads (AMD Ryzen 2600):
done in 11,3992756s
result=5086445312
limit_threads=12
limit_actions=256
limit_loop=10000000
It's about 10% performance boost with using all logical cores instead of only physical, which is almost null. What you can say now?
Can someone provide sample code which will be valuable faster with using processor multi-threading (AMD SMT or Intel HT) comparing to using only physical cores?

TLDR: SMT/HT is a technology that exists to offset the cost of massive multithreading as opposed to speeding up your computation with more cores.
You have misunderstood what SMT/HT does.
"As far as I noticed there is no difference with running code on 6 or 12 threads for 6cores-12threads CPU".
If this is true, then SMT/HT is working.
To understand why, you need to understand modern OS kernels and Kernel Threads. Today's Operating Systems use what is called Preemptive Threading.
The OS Kernel divides up each core into time-slices called "Quantum", and using interrupts schedules the various processes in a complicated round robin fashion.
The part we want to look at is the interrupt. When a CPU core is scheduled to switch run another thread, we call this process a "Context Switch". Context Switches are expensive, slow processes, as the entire state and flow of the highly pipelined CPU must be stopped, saved and swapped out for another state (as well as other caches, registers, lookup tables etc). According to this answer, Context Switch times are measured in microseconds (thousands of clock-cycles); and will only get worse as CPUs become more complicated.
The point of SMT/HT is to cheat, by having each CPU core being able to store two states at the same time (imagine having two monitors instead of one, you still only use one at time, but you are more productive because you don't need to rearrange your windows each time you switch tasks). So SMT/HT processors can Context Switch must faster than non-SMT/HT processors.
So back to your example. If you turned off SMT on your Ryzen 2600, then ran the same workload with 12 threads, you will find that it performs significantly slower than with 6 threads.
Also, note, more threads does not make things faster.

I think that varying the price of the processors depending on the availability of the SMT/HT technology is just a matter of marketing strategy.
The hardware is probably the same in every case but the feature is disabled by the manufacturer on some of them to offer cheap models.
This technology relies on the fact that some micro-operations in a single
instruction have to wait for something to be executed; so instead of just waiting,
the same core uses its circuits to make some progress on the micro-operations
from another thread.
On a coarse point of view, we can perceive the execution of two (or more on
certain models) sequences of micro-operations from two different threads executed
on a single piece of hardware (except some redundant parts, like registers...)
The efficiency of this technology depends on the problem.
After various tests I noticed that if the problem is compute bound, ie the
limiting factor is the time needed to compute (add, multiply...), but not
memory bound (the data are already available, no need to wait for the memory),
then this technology does not provide any benefit.
This is due to the fact that there is no gap to fill in the two sequences of
micro-operations, thus the intertwined execution of two threads is not better
than two independent serial executions.
In the exact opposite case, when the problem is memory bound but not
compute bound, there is no more benefit because both threads have to wait
for the data coming from memory.
I only noticed an improvement in performances when the problem is mixed between
data access and computation; in this case when one thread is waiting for data, the
same core can make some progress in the computations of the other thread and
vice-versa.
Edit
Below is given an example to illustrate these situations, and I obtain the
following results (quite stable when run many times,
dual Xeon E5-2697 v2, Linux 5.3.13).
In this memory bound situation HT does not help.
$ ./prog_ht mem
24 threads running memory_task()
result: 1e+17
duration: 13.0383 seconds
$ ./prog_ht mem ht
48 threads (ht) running memory_task()
result: 1e+17
duration: 13.1096 seconds
In this compute bound situation HT helps (almost 30% gain)
(I don't know exactly the details of what is implied in the hardware
when computing cos, but there must be some latencies which are not due
to memory access)
$ ./prog_ht
24 threads running compute_task()
result: -260.782
duration: 9.76226 seconds
$ ./prog_ht ht
48 threads (ht) running compute_task()
result: -260.782
duration: 7.58181 seconds
In this mixed situation HT helps much more (around 70% gain)
$ ./prog_ht mix
24 threads running mixed_task()
result: -260.782
duration: 60.1602 seconds
$ ./prog_ht mix ht
48 threads (ht) running mixed_task()
result: -260.782
duration: 35.121 seconds
Here is the source code (in C++, I'm not confortable with C#)
/*
g++ -std=c++17 -o prog_ht prog_ht.cpp \
-pedantic -Wall -Wextra -Wconversion \
-Wno-missing-braces -Wno-sign-conversion \
-O3 -ffast-math -march=native -fomit-frame-pointer -DNDEBUG \
-pthread
*/
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <thread>
#include <chrono>
#include <cstdint>
#include <random>
#include <cmath>
#include <pthread.h>
bool // success
bind_current_thread_to_cpu(int cpu_id)
{
/* !!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!
I checked the numbering of the CPUs according to the packages and cores
on my computer/system (dual Xeon E5-2697 v2, Linux 5.3.13)
0 to 11 --> different cores of package 1
12 to 23 --> different cores of package 2
24 to 35 --> different cores of package 1
36 to 47 --> different cores of package 2
Thus using cpu_id from 0 to 23 does not bind more than one thread
to each single core (no HT).
Of course using cpu_id from 0 to 47 binds two threads to each single
core (HT is used).
This numbering is absolutely NOT guaranteed on any other computer/system,
thus the relation between thread numbers and cpu_id should be adapted
accordingly.
*/
cpu_set_t cpu_set;
CPU_ZERO(&cpu_set);
CPU_SET(cpu_id, &cpu_set);
return !pthread_setaffinity_np(pthread_self(), sizeof(cpu_set), &cpu_set);
}
inline
double // seconds since 1970/01/01 00:00:00 UTC
system_time()
{
const auto now=std::chrono::system_clock::now().time_since_epoch();
return 1e-6*double(std::chrono::duration_cast
<std::chrono::microseconds>(now).count());
}
constexpr auto count=std::int64_t{20'000'000};
constexpr auto repeat=500;
void
compute_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
(void)indices; // not used here
(void)inputs; // not used here
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
result+=std::cos(double(i));
}
}
results[thread_id]+=result;
}
void
mixed_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
const auto index=indices[i];
result+=std::cos(inputs[index]);
}
}
results[thread_id]+=result;
}
void
memory_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
const auto index=indices[i];
result+=inputs[index];
}
}
results[thread_id]+=result;
}
int
main(int argc,
char **argv)
{
//~~~~ analyse command line arguments ~~~~
const auto args=std::vector<std::string>{argv, argv+argc};
const auto has_arg=
[&](const auto &a)
{
return std::find(cbegin(args)+1, cend(args), a)!=cend(args);
};
const auto use_ht=has_arg("ht");
const auto thread_count=int(std::thread::hardware_concurrency())
/(use_ht ? 1 : 2);
const auto use_mix=has_arg("mix");
const auto use_mem=has_arg("mem");
const auto task=use_mem ? memory_task
: use_mix ? mixed_task
: compute_task;
const auto task_name=use_mem ? "memory_task"
: use_mix ? "mixed_task"
: "compute_task";
//~~~~ prepare input/output data ~~~~
auto results=std::vector<double>(thread_count);
auto indices=std::vector<int>(count);
auto inputs=std::vector<double>(count);
std::generate(begin(indices), end(indices),
[i=0]() mutable { return i++; });
std::copy(cbegin(indices), cend(indices), begin(inputs));
std::shuffle(begin(indices), end(indices), // fight the prefetcher!
std::default_random_engine{std::random_device{}()});
//~~~~ launch threads ~~~~
std::cout << thread_count << " threads"<< (use_ht ? " (ht)" : "")
<< " running " << task_name << "()\n";
auto threads=std::vector<std::thread>(thread_count);
const auto t0=system_time();
for(auto i=0; i<thread_count; ++i)
{
threads[i]=std::thread{task, i, thread_count,
data(indices), data(inputs), data(results)};
}
//~~~~ wait for threads ~~~~
auto result=0.0;
for(auto i=0; i<thread_count; ++i)
{
threads[i].join();
result+=results[i];
}
const auto duration=system_time()-t0;
std::cout << "result: " << result << '\n';
std::cout << "duration: " << duration << " seconds\n";
return 0;
}

Restrict the number of cores available to an Akka application

I am trying to run some experiments to study the scaling properties of an Akka application that I have written. As a baseline I would like to force the application to run using only a single thread on a single core.
I am currently running the simulation on my quad-core laptop with the following in my application.conf file...
akka {
actor {
default-dispatcher {
fork-join-executor {
parallelism-min = 1
parallelism-factor = 0.25
parallelism-max = 1
}
}
}
}
Is this the correct best way to force my application to run as a single threaded application on one core? The idea is that once I have this baseline, I will then increase the number of available cores (and threads).

Yes that should work. I would just add that the declaration of parallelism-max would be already enough in your case. The parallelism-factor is just used in the following formula: available processors * factor. Akka first uses the formula to determine the number of threads that should be used. Next it makes sure you are within the min & max. So a number below 1 for the factor makes no sense. I think the best thing for you should be the following:
akka {
actor {
default-dispatcher {
fork-join-executor {
parallelism-max = X // set it to the number of cores you want to allow
parallelism-factor = 1
}
}
}
}
You can read more about it here.

cudaDeviceSynchronize() waits to finish only in current CUDA context or in all contexts?

I use CUDA 6.5 and 4 x GPUs Kepler.
I use multithreading, CUDA runtime API and access to the CUDA contexts from different CPU threads (by using OpenMP - but it does not really matter).
When I call cudaDeviceSynchronize(); will it wait for kernel(s) to finish only in current CUDA context which selected by the latest call cudaSetDevice(), or in all CUDA contexts?
If it will wait for kernel(s) to finish in all CUDA contexts, then it will wait in all CUDA contexts which used in current CPU thread (in example CPU thread_0 will wait GPUs: 0 and 1) or generally all CUDA contexts (CPU thread_0 will wait GPUs: 0, 1, 2 and 3)?
Following code:
// For using OpenMP requires to set:
// MSVS option: -Xcompiler "/openmp"
// GCC option: –Xcompiler –fopenmp
#include <omp.h>
int main() {
// execute two threads with different: omp_get_thread_num() = 0 and 1
#pragma omp parallel num_threads(2)
{
int omp_threadId = omp_get_thread_num();
// CPU thread 0
if(omp_threadId == 0) {
cudaSetDevice(0);
kernel_0<<<...>>>(...);
cudaSetDevice(1);
kernel_1<<<...>>>(...);
cudaDeviceSynchronize(); // what kernel<>() will wait?
// CPU thread 1
} else if(omp_threadId == 1) {
cudaSetDevice(2);
kernel_2<<<...>>>(...);
cudaSetDevice(3);
kernel_3<<<...>>>(...);
cudaDeviceSynchronize(); // what kernel<>() will wait?
}
}
return 0;
}

When I call cudaDeviceSynchronize(); will it wait for kernel(s) to
finish only in current CUDA context which selected by the latest call
cudaSetDevice(), or in all CUDA contexts?
cudaDeviceSynchronize() syncs all streams in the current CUDA context only.
Note: cudaDeviceSynchronize() will only synchronize host with the currently set GPU, if multiple GPUs are in use and all need to be synchronized, cudaDeviceSynchronize() has to be called separately for each one.
Here is a minimal example:
cudaSetDevice(0); cudaDeviceSynchronize();
cudaSetDevice(1); cudaDeviceSynchronize();
...
Source: Pawel Pomorski, slides of "CUDA on multiple GPUs". Linked here.

Why FFTW on Windows is faster than on Linux?

I wrote two identical programs in Linux and Windows using the fftw libraries (fftw3.a, fftw3.lib), and compute the duration of the fftwf_execute(m_wfpFFTplan) statement (16-fft).
For 10000 runs:
On Linux: average time is 0.9
On Windows: average time is 0.12
I am confused as to why this is nine times faster on Windows than on Linux.
Processor: Intel(R) Core(TM) i7 CPU 870 # 2.93GHz
Each OS (Windows XP 32 bit and Linux OpenSUSE 11.4 32 bit) are installed on same machines.
I downloaded the fftw.lib (for Windows) from internet and don't know that configurations. Once I build FFTW with this config:
/configure --enable-float --enable-threads --with-combined-threads --disable-fortran --with-slow-timer --enable-sse --enable-sse2 --enable-avx
in Linux and it results in a lib that is four times faster than the default configs (0.4 ms).

16 FFT is very small. What you will find is FFTs smaller than say 64 will be hard coded assembler with no loops to get the highest possible performance. This means they can be highly susceptible to variations in instruction sets, compiler optimisations, even 64 or 32bit words.
What happens when you run a test of FFT sizes from 16 -> 1048576 in powers of 2? I say this as a particular hard-coded asm routine on Linux might not be the best optimized for your machine, whereas you might have been lucky on the Windows implementation for that particular size. A comparison of all sizes in this range will give you a better indication of the Linux vs. Windows performance.
Have you calibrated FFTW? When first run FFTW guesses the fastest implementation per machine, however if you have special instruction sets, or a particular sized cache or other processor features then these can have a dramatic effect on execution speed. As a result performing a calibration will test the speed of various FFT routines and choose the fastest per size for your specific hardware. Calibration involves repeatedly computing the plans and saving the FFTW "Wisdom" file generated. The saved calibration data (this is a lengthy process) can then be re-used. I suggest doing it once when your software starts up and re-using the file each time. I have noticed 4-10x performance improvements for certain sizes after calibrating!
Below is a snippet of code I have used to calibrate FFTW for certain sizes. Please note this code is pasted verbatim from a DSP library I worked on so some function calls are specific to my library. I hope the FFTW specific calls are helpful.
// Calibration FFTW
void DSP::forceCalibration(void)
{
// Try to import FFTw Wisdom for fast plan creation
FILE *fftw_wisdom = fopen("DSPDLL.ftw", "r");
// If wisdom does not exist, ask user to calibrate
if (fftw_wisdom == 0)
{
int iStatus2 = AfxMessageBox("FFTw not calibrated on this machine."\
"Would you like to perform a one-time calibration?\n\n"\
"Note:\tMay take 40 minutes (on P4 3GHz), but speeds all subsequent FFT-based filtering & convolution by up to 100%.\n"\
"\tResults are saved to disk (DSPDLL.ftw) and need only be performed once per machine.\n\n"\
"\tMAKE SURE YOU REALLY WANT TO DO THIS, THERE IS NO WAY TO CANCEL CALIBRATION PART-WAY!",
MB_YESNO | MB_ICONSTOP, 0);
if (iStatus2 == IDYES)
{
// Perform calibration for all powers of 2 from 8 to 4194304
// (most heavily used FFTs - for signal processing)
AfxMessageBox("About to perform calibration.\n"\
"Close all programs, turn off your screensaver and do not move the mouse in this time!\n"\
"Note:\tThis program will appear to be unresponsive until the calibration ends.\n\n"
"\tA MESSAGEBOX WILL BE SHOWN ONCE THE CALIBRATION IS COMPLETE.\n");
startTimer();
// Create a whole load of FFTw Plans (wisdom accumulates automatically)
for (int i = 8; i <= 4194304; i *= 2)
{
// Create new buffers and fill
DSP::cFFTin = new fftw_complex[i];
DSP::cFFTout = new fftw_complex[i];
DSP::fconv_FULL_Real_FFT_rdat = new double[i];
DSP::fconv_FULL_Real_FFT_cdat = new fftw_complex[(i/2)+1];
for(int j = 0; j < i; j++)
{
DSP::fconv_FULL_Real_FFT_rdat[j] = j;
DSP::cFFTin[j][0] = j;
DSP::cFFTin[j][1] = j;
DSP::cFFTout[j][0] = 0.0;
DSP::cFFTout[j][1] = 0.0;
}
// Create a plan for complex FFT.
// Use the measure flag to get the best possible FFT for this size
// FFTw "remembers" which FFTs were the fastest during this test.
// at the end of the test, the results are saved to disk and re-used
// upon every initialisation of the DSP Library
DSP::pCF = fftw_plan_dft_1d
(i, DSP::cFFTin, DSP::cFFTout, FFTW_FORWARD, FFTW_MEASURE);
// Destroy the plan
fftw_destroy_plan(DSP::pCF);
// Create a plan for real forward FFT
DSP::pCF = fftw_plan_dft_r2c_1d
(i, fconv_FULL_Real_FFT_rdat, fconv_FULL_Real_FFT_cdat, FFTW_MEASURE);
// Destroy the plan
fftw_destroy_plan(DSP::pCF);
// Create a plan for real inverse FFT
DSP::pCF = fftw_plan_dft_c2r_1d
(i, fconv_FULL_Real_FFT_cdat, fconv_FULL_Real_FFT_rdat, FFTW_MEASURE);
// Destroy the plan
fftw_destroy_plan(DSP::pCF);
// Destroy the buffers. Repeat for each size
delete [] DSP::cFFTin;
delete [] DSP::cFFTout;
delete [] DSP::fconv_FULL_Real_FFT_rdat;
delete [] DSP::fconv_FULL_Real_FFT_cdat;
}
double time = stopTimer();
char * strOutput;
strOutput = (char*) malloc (100);
sprintf(strOutput, "DSP.DLL Calibration complete in %d minutes, %d seconds\n"\
"Please keep a copy of the DSPDLL.ftw file in the root directory of your application\n"\
"to avoid re-calibration in the future\n", (int)time/(int)60, (int)time%(int)60);
AfxMessageBox(strOutput);
isCalibrated = 1;
// Save accumulated wisdom
char * strWisdom = fftw_export_wisdom_to_string();
FILE *fftw_wisdomsave = fopen("DSPDLL.ftw", "w");
fprintf(fftw_wisdomsave, "%s", strWisdom);
fclose(fftw_wisdomsave);
DSP::pCF = NULL;
DSP::cFFTin = NULL;
DSP::cFFTout = NULL;
fconv_FULL_Real_FFT_cdat = NULL;
fconv_FULL_Real_FFT_rdat = NULL;
free(strOutput);
}
}
else
{
// obtain file size.
fseek (fftw_wisdom , 0 , SEEK_END);
long lSize = ftell (fftw_wisdom);
rewind (fftw_wisdom);
// allocate memory to contain the whole file.
char * strWisdom = (char*) malloc (lSize);
// copy the file into the buffer.
fread (strWisdom,1,lSize,fftw_wisdom);
// import the buffer to fftw wisdom
fftw_import_wisdom_from_string(strWisdom);
fclose(fftw_wisdom);
free(strWisdom);
isCalibrated = 1;
return;
}
}
The secret sauce is to create the plan using the FFTW_MEASURE flag, which specifically measures hundreds of routines to find the fastest for your particular type of FFT (real, complex, 1D, 2D) and size:
DSP::pCF = fftw_plan_dft_1d (i, DSP::cFFTin, DSP::cFFTout,
FFTW_FORWARD, FFTW_MEASURE);
Finally, all benchmark tests should also be performed with a single FFT Plan stage outside of execute, called from code that is compiled in release mode with optimizations on and detached from the debugger. Benchmarks should be performed in a loop with many thousands (or even millions) of iterations and then take the average run time to compute the result. As you probably know the planning stage takes a significant amount of time and the execute is designed to be performed multiple times with a single plan.

Seeking help with a MT design pattern

I have a queue of 1000 work items and a n-proc machine (assume n =
4).The main thread spawns n (=4) worker threads at a time ( 25 outer
iterations) and waits for all threads to complete before processing
the next n (=4) items until the entire queue is processed
for(i= 0 to queue.Length / numprocs)
for(j= 0 to numprocs)
{
CreateThread(WorkerThread,WorkItem)
}
WaitForMultipleObjects(threadHandle[])
The work done by each (worker) thread is not homogeneous.Therefore in
1 batch (of n) if thread 1 spends 1000 s doing work and rest of the 3
threads only 1 s , above design is inefficient,becaue after 1 sec
other 3 processors are idling. Besides there is no pooling - 1000
distinct threads are being created
How do I use the NT thread pool (I am not familiar enough- hence the
long winded question) and QueueUserWorkitem to achieve the above. The
following constraints should hold
The main thread requires that all worker items are processed before
it can proceed.So I would think that a waitall like construct above
is required
I want to create as many threads as processors (ie not 1000 threads
at a time)
Also I dont want to create 1000 distinct events, pass to the worker
thread, and wait on all events using the QueueUserWorkitem API or
otherwise
Exisitng code is in C++.Prefer C++ because I dont know c#
I suspect that the above is a very common pattern and was looking for
input from you folks.

I'm not a C++ programmer, so I'll give you some half-way pseudo code for it
tcount = 0
maxproc = 4
while queue_item = queue.get_next() # depends on implementation of queue
# may well be:
# for i=0; i<queue.length; i++
while tcount == maxproc
wait 0.1 seconds # or some other interval that isn't as cpu intensive
# as continously running the loop
tcount += 1 # must be atomic (reading the value and writing the new
# one must happen consecutively without interruption from
# other threads). I think ++tcount would handle that in cpp.
new thread(worker, queue_item)
function worker(item)
# ...do stuff with item here...
tcount -= 1 # must be atomic

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string