Setting a thread priority to high C

Setting a thread priority to high C - multithreading

I am writing a program that will create two threads, one of them has to have a high pripoity and the other is default. I am using pthread_create() to create the thread and would like to initiate the thread priority from the same command. The way I am doing that is as follow:
pthread_create(&threads[lastThreadIndex], NULL, &solution, (void *)(&threadParam));
where,
threads: is an array of type pthread_t that has all my threads in.
lastThreadIndex: is a counter
solution: is my function
threadParam: is a struct that has all variables needed by the solution function.
I have read many articles and most of them suggest to replace NULL with the priority level; However, I never found the level key word or the exact way of doing it.
Please help...
Thanks

In POSIX, that second parameter is the pthread attribute, and NULL just means to use the default.
However, you can create your own attribute and set its properties, including driving up the priority with something like:
#include <pthread.h>
#include <sched.h>
int rc;
pthread_attr_t attr;
struct sched_param param;
rc = pthread_attr_init (&attr);
rc = pthread_attr_getschedparam (&attr, &param);
(param.sched_priority)++;
rc = pthread_attr_setschedparam (&attr, &param);
rc = pthread_create (&threads[lastThreadIndex], &attr,
&solution, (void *)(&threadParam));
// Should really be checking rc for errors.
Details on POSIX threading, including scheduling, can be found starting at this page.

Related

Where can PTHRED_MUTEX_ADAPTIVE_NP be specified and how does it work?

I found that there's a macro called PTHRED_MUTEX_ADAPTIVE_NP which is somehow given as a value to a mutex so that the mutex does an adaptive spinning, meaning that it spins in the magnitude of an immediate wakeup through the kernel would last. But how do I utilize this configuration-macro to a thread ?
And as I've developed an improved shared readers-writer lock (it needs only one atomic operation at best in contrast to the three operations given in the Wikipedia-solution) with relative writer-priority (further readers are stalled when there's a writer and the readers before are allowed to proceed) which could also make use of adaptive spinning: how is the number of spinning-cycles calculated ?

I found that there's a macro called PTHRED_MUTEX_ADAPTIVE_NP
Some pthreads implementations provide a macro PTHREAD_MUTEX_ADAPTIVE_NP (note spelling) that is one of the possible values of the kind_np mutex attribute, but neither that attribute nor the macro are standard. It looks like at least BSD and AIX have them, or at least did at one time, but this is not something you should be using in new code.
But how do I utilize this configuration-macro to a thread ?
You don't. Even if you are using a pthreads implementation that supports it, this is the value of a mutex attribute, not a thread attribute. You obtain a mutex with that attribute value by explicitly requesting it when you initialize the mutex. It would look something like this:
pthread_mutexattr_t attr;
pthread_mutex_t mutex;
int rval;
// Return-value checks omitted for brevity and clarity
rval = pthread_mutexattr_init(&attr);
rval = pthread_mutexattr_setkind_np(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
rval = pthread_mutex_init(&mutex, &attr);
There are other mutex attributes that you can set in analogous ways, which is one of the reasons I wrote this answer. Although you should not be using the kind_np attribute, you can follow this general model for other mutex attributes. There are also thread attributes, which work similarly.

I found the code in the glibc:
That's the "adaptive" mutex locking code of pthread_mutex_lock
in the glibc 2.31:
else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
== PTHREAD_MUTEX_ADAPTIVE_NP, 1))
{
if (! __is_smp)
goto simple;
if (LLL_MUTEX_TRYLOCK (mutex) != 0)
{
int cnt = 0;
int max_cnt = MIN (max_adaptive_count (),
mutex->__data.__spins * 2 + 10);
do
{
if (cnt++ >= max_cnt)
{
LLL_MUTEX_LOCK (mutex);
break;
}
atomic_spin_nop ();
}
while (LLL_MUTEX_TRYLOCK (mutex) != 0);
mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
}
assert (mutex->__data.__owner == 0);
}
So the spin count is doubled up to a maximum plus 10 first (system configurable or 1000 if thre's no configuration) and after the locking the difference between the actual spins and the predefined spins divided by 8 is added to the next spin-count.

Identifying bug in linux kernel module

I am marking Michael's as he was the first. Thank you to osgx and employee of the month for additional information and assistance.
I am attempting to identify a bug in a consumer/produce kernel module. This is a problem being given to me for a course in university. My teaching assistant was not able to figure it out, and my professor said it was okay if I uploaded online (he doesn't think Stack can figure it out!).
I have included the module, the makefile, and the Kbuild.
Running the program does not guarantee the bug will present itself.
I thought the issue was on line 30 since it is possible for a thread to rush to line 36, and starve the other threads. My professor said that is not what he is looking for.
Unrelated question: What is the purpose of line 40? It seems out of place to me, but my professor said it serves a purporse.
My professor said the bug is very subtle. The bug is not deadlock.
My approach was to identify critical sections and shared variables, but I'm stumped. I am not familiar with tracing (as a method of debugging), and was told that while it may help it is not necessary to identify the issue.
File: final.c
#include <linux/completion.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/module.h>
static int actor_kthread(void *);
static int writer_kthread(void *);
static DECLARE_COMPLETION(episode_cv);
static DEFINE_SPINLOCK(lock);
static int episodes_written;
static const int MAX_EPISODES = 21;
static bool show_over;
static struct task_info {
struct task_struct *task;
const char *name;
int (*threadfn) (void *);
} task_info[] = {
{.name = "Liz", .threadfn = writer_kthread},
{.name = "Tracy", .threadfn = actor_kthread},
{.name = "Jenna", .threadfn = actor_kthread},
{.name = "Josh", .threadfn = actor_kthread},
};
static int actor_kthread(void *data) {
struct task_info *actor_info = (struct task_info *)data;
spin_lock(&lock);
while (!show_over) {
spin_unlock(&lock);
wait_for_completion_interruptible(&episode_cv); //Line 30
spin_lock(&lock);
while (episodes_written) {
pr_info("%s is in a skit\n", actor_info->name);
episodes_written--;
}
reinit_completion(&episode_cv); // Line 36
}
pr_info("%s is done for the season\n", actor_info->name);
complete(&episode_cv); //Why do we need this line?
actor_info->task = NULL;
spin_unlock(&lock);
return 0;
}
static int writer_kthread(void *data) {
struct task_info *writer_info = (struct task_info *)data;
size_t ep_num;
spin_lock(&lock);
for (ep_num = 0; ep_num < MAX_EPISODES && !show_over; ep_num++) {
spin_unlock(&lock);
/* spend some time writing the next episode */
schedule_timeout_interruptible(2 * HZ);
spin_lock(&lock);
episodes_written++;
complete_all(&episode_cv);
}
pr_info("%s wrote the last episode for the season\n", writer_info->name);
show_over = true;
complete_all(&episode_cv);
writer_info->task = NULL;
spin_unlock(&lock);
return 0;
}
static int __init tgs_init(void) {
size_t i;
for (i = 0; i < ARRAY_SIZE(task_info); i++) {
struct task_info *info = &task_info[i];
info->task = kthread_run(info->threadfn, info, info->name);
}
return 0;
}
static void __exit tgs_exit(void) {
size_t i;
spin_lock(&lock);
show_over = true;
spin_unlock(&lock);
for (i = 0; i < ARRAY_SIZE(task_info); i++)
if (task_info[i].task)
kthread_stop(task_info[i].task);
}
module_init(tgs_init);
module_exit(tgs_exit);
MODULE_DESCRIPTION("CS421 Final");
MODULE_LICENSE("GPL");
File: kbuild
Kobj-m := final.o
File: Makefile
# Basic Makefile to pull in kernel's KBuild to build an out-of-tree
# kernel module
KDIR ?= /lib/modules/$(shell uname -r)/build
all: modules
clean modules:

When cleaning up in tgs_exit() the function executes the following without holding the spinlock:
if (task_info[i].task)
kthread_stop(task_info[i].task);
It's possible for a thread that's ending to set it's task_info[i].task to NULL between the check and call to kthread_stop().

I'm quite confused here.
You claim this is a question from an upcoming exam and it was released by the person delivering the course. Why would they do that? Then you say that TA failed to solve the problem. If TA can't do it, who can expect students to pass?
(professor) doesn't think Stack can figure it out
If the claim is that the level on this website is bad I definitely agree. But still, claiming it is below a level to be expected from a random university is a stretch. If there is no claim of the sort, I once more ask how are students expected to do it. What if the problem gets solved?
The code itself is imho unsuitable for teaching as it deviates too much from common idioms.
Another answer here noted one side effect of the actual problem. Namely, it was stated that the loop in tgs_exit can race with threads exiting on their own and test the ->task pointer to be non-NULL, while it becomes NULL just afterwards. The discussion whether this can result in a kthread_stop(NULL) call is not really relevant.
Either a kernel thread exiting on its own will clear everything up OR kthread_stop (and maybe something else) is necessary to do it.
If the former is true, the code suffers from a possible use-after-free. After tgs_exit tests that the pointer, the target thread could have exited. Maybe prior to kthread_stop call or maybe just as it was executed. Either way, it is possible that the passed pointer is stale as the area was already freed by the thread which was exiting.
If the latter is true, the code suffers from resource leaks due to insufficient cleanup - there are no kthread_stop calls if tgs_exit is executed after all threads exit.
The kthread_* api allows threads to just exit, hence effects are as described in the first variant.
For the sake of argument let's say the code is compiled in into the kernel (as opposed to being loaded as a module). Say the exit func is called on shutdown.
There is a design problem that there are 2 exit mechanisms and it transforms into a bug as they are not coordinated. A possible solution for this case would set a flag for writers to stop and would wait for a writer counter to drop to 0.
The fact that the code is in a module makes the problem more acute: unless you kthread_stop, you can't tell if the target thread is gone. In particular "actor" threads do:
actor_info->task = NULL;
So the thread is skipped in the exit handler, which can now finish and let the kernel unload the module itself...
spin_unlock(&lock);
return 0;
... but this code (located in the module!) possibly was not executed yet.
This would not have happened if the code followed an idiom if always using kthread_stop.
Other issue is that writers wake everyone up (so-called "thundering herd problem"), as opposed to at most one actor.
Perhaps the bug one is supposed to find is that each episode has at most one actor? Maybe that the module can exit when there are episodes written but not acted out yet?
The code is extremely weird and if you were shown a reasonable implementation of a thread-safe queue in userspace, you should see how what's presented here does not fit. For instance, why does it block instantly without checking for episodes?
Also a fun fact that locking around the write to show_over plays no role in correctness.
There are more issues and it is quite likely I missed some. As it is, I think the question is of poor quality. It does not look like anything real-world.

perf_event_open Overflow Signal

I want to count the (more or less) exact amount of instructions for some piece of code. Additionally, I want to receive a Signal after a specific amount of instructions passed.
For this purpose, I use the overflow signal behaviour provided by
perf_event_open.
I'm using the second way the manpage proposes to achieve overflow signals:
Signal overflow
Events can be set to deliver a signal when a threshold
is crossed. The signal handler is set up using the poll(2), select(2),
epoll(2) and fcntl(2), system calls.
[...]
The other way is by use of the PERF_EVENT_IOC_REFRESH ioctl. This
ioctl adds to a counter that decrements each time the event overflows.
When nonzero, a POLL_IN signal is sent on overflow, but once the value
reaches 0, a signal is sent of type POLL_HUP and the underlying event
is disabled.
Further explanation of PERF_EVENT_IOC_REFRESH ioctl:
PERF_EVENT_IOC_REFRESH
Non-inherited overflow counters can use this to enable a
counter for a number of overflows specified by the argument,
after which it is disabled. Subsequent calls of this ioctl
add the argument value to the current count. A signal with
POLL_IN set will happen on each overflow until the count
reaches 0; when that happens a signal with POLL_HUP set is
sent and the event is disabled. Using an argument of 0 is
considered undefined behavior.
A very minimal example would look like this:
#define _GNU_SOURCE 1
#include <asm/unistd.h>
#include <fcntl.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
long perf_event_open(struct perf_event_attr* event_attr, pid_t pid, int cpu, int group_fd, unsigned long flags)
{
return syscall(__NR_perf_event_open, event_attr, pid, cpu, group_fd, flags);
}
static void perf_event_handler(int signum, siginfo_t* info, void* ucontext) {
if(info->si_code != POLL_HUP) {
// Only POLL_HUP should happen.
exit(EXIT_FAILURE);
}
ioctl(info->si_fd, PERF_EVENT_IOC_REFRESH, 1);
}
int main(int argc, char** argv)
{
// Configure signal handler
struct sigaction sa;
memset(&sa, 0, sizeof(struct sigaction));
sa.sa_sigaction = perf_event_handler;
sa.sa_flags = SA_SIGINFO;
// Setup signal handler
if (sigaction(SIGIO, &sa, NULL) < 0) {
fprintf(stderr,"Error setting up signal handler\n");
perror("sigaction");
exit(EXIT_FAILURE);
}
// Configure perf_event_attr struct
struct perf_event_attr pe;
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_INSTRUCTIONS; // Count retired hardware instructions
pe.disabled = 1; // Event is initially disabled
pe.sample_type = PERF_SAMPLE_IP;
pe.sample_period = 1000;
pe.exclude_kernel = 1; // excluding events that happen in the kernel-space
pe.exclude_hv = 1; // excluding events that happen in the hypervisor
pid_t pid = 0; // measure the current process/thread
int cpu = -1; // measure on any cpu
int group_fd = -1;
unsigned long flags = 0;
int fd = perf_event_open(&pe, pid, cpu, group_fd, flags);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
perror("perf_event_open");
exit(EXIT_FAILURE);
}
// Setup event handler for overflow signals
fcntl(fd, F_SETFL, O_NONBLOCK|O_ASYNC);
fcntl(fd, F_SETSIG, SIGIO);
fcntl(fd, F_SETOWN, getpid());
ioctl(fd, PERF_EVENT_IOC_RESET, 0); // Reset event counter to 0
ioctl(fd, PERF_EVENT_IOC_REFRESH, 1); //
// Start monitoring
long loopCount = 1000000;
long c = 0;
long i = 0;
// Some sample payload.
for(i = 0; i < loopCount; i++) {
c += 1;
}
// End monitoring
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0); // Disable event
long long counter;
read(fd, &counter, sizeof(long long)); // Read event counter value
printf("Used %lld instructions\n", counter);
close(fd);
}
So basically I'm doing the following:
Set up a signal handler for SIGIO signals
Create a new performance counter with perf_event_open (returns a file descriptor)
Use fcntl to add signal sending behavior to the file descriptor.
Run a payload loop to execute many instructions.
When executing the payload loop, at some point 1000 instructions (the sample_interval) will have been executed. According to the perf_event_open manpage this triggers an overflow which will then decrement an internal counter.
Once this counter reaches zero, "a signal is sent of type POLL_HUP and the underlying event is disabled."
When a signal is sent, the control flow of the current process/thread is stopped, and the signal handler is executed. Scenario:
1000 instructions have been executed.
Event is automatically disabled and a signal is sent.
Signal is immediately delivered, control flow of the process is stopped and the signal handler is executed.
This scenario would mean two things:
The final amount of counted instructions would always be equal to an example which does not use signals at all.
The instruction pointer which has been saved for the signal handler (and can be accessed through ucontext) would directly point to the instruction which caused the overflow.
Basically you could say, the signal behavior can be seen as synchronous.
This is the perfect semantic for what I want to achieve.
However, as far as I'm concerned, the signal I configured is generally rather asynchronous and some time may pass until it is eventually delivered and the signal handler is executed. This may pose a problem for me.
For example, consider the following scenario:
1000 instructions have been executed.
Event is automatically disabled and a signal is sent.
Some more instructions pass
Signal is delivered, control flow of the process is stopped and the signal handler is executed.
This scenario would mean two things:
The final amount of counted instructions would be less than an example which does not use signals at all.
The instruction pointer which has been saved for the signal handler would point to the instructions which caused the overflow or to any one after it.
So far, I've tested above example a lot and did not experience missed instructions which would support the first scenario.
However, I'd really like to know, whether I can rely on this assumption or not.
What happens in the kernel?

I want to count the (more or less) exact amount of instructions for some piece of code. Additionally, I want to receive a Signal after a specific amount of instructions passed.
You have two task which may conflict with each other. When you want to get counting (exact amounts of some hardware event), just use performance monitoring unit of your CPU in counting mode (don't set sample_period/sample_freq of perf_event_attr structure used) and place the measurement code in your target program (as it was done in your example). In this mode according to the man page of perf_event_open no overflows will be generated (CPU's PMU are usually 64-bit wide and don't overflow when not set to small negative value when sampling mode is used):
Overflows are generated only by sampling events (sample_period must a nonzero value).
To count part of program, use ioctls of perf_event_open returned fd as described in man page
perf_event ioctl calls - Various ioctls act on perf_event_open() file descriptors: PERF_EVENT_IOC_ENABLE ... PERF_EVENT_IOC_DISABLE ... PERF_EVENT_IOC_RESET
You can read current value with rdpmc (on x86) or by read syscall on the fd like in the short example from the man page:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_INSTRUCTIONS;
pe.disabled = 1;
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
printf("Measuring instruction count for this printf\n");
/* Place target code here instead of printf */
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("Used %lld instructions\n", count);
close(fd);
}
Additionally, I want to receive a Signal after a specific amount of instructions passed.
Do you really want to get signal or you just need instruction pointers at every 1000 instructions executed? If you want to collect pointers, use perf_even_open with sampling mode, but do it from other program to disable measuring of the event collection code. Also, it will have less negative effect on your target program, if you will use not signals for every overflow (with huge amount of kernel-tracer interactions and switching from/to kernel), but instead use capabilities of perf_events to collect several overflow events into single mmap buffer and poll on this buffer. On overflow interrupt from PMU perf interrupt handler will be called to save the instruction pointer into buffer and then counting will be reset and program will return to execution. In your example, perf interrupt handler will woke your program, it will do several syscalls, return to kernel and then kernel will restart target code (so overhead per sample is greater than using mmap and parsing it). With precise_ip flag you may activate advanced sampling of your PMU (if it has such mode, like PEBS and PREC_DIST in intel x86/em64t for some counters like INST_RETIRED, UOPS_RETIRED, BR_INST_RETIRED, BR_MISP_RETIRED, MEM_UOPS_RETIRED, MEM_LOAD_UOPS_RETIRED, MEM_LOAD_UOPS_LLC_HIT_RETIRED and with simple hack to cycles too; or like IBS of AMD x86/amd64; paper about PEBS and IBS), when instruction address is saved directly by hardware with low skid. Some very advanced PMUs has ability to do sampling in hardware, storing overflow information of several events in row with automatic reset of counter without software interrupts (some descriptions on precise_ip are in the same paper).
I don't know if it is possible in perf_events subsystem and in your CPU to have two perf_event tasks active at same time: both count events in the target process and in the same time have sampling from other process. With advanced PMU this can be possible in the hardware and perf_events in modern kernel may allow it. But you give no details on your kernel version and your CPU vendor and family, so we can't answer this part.
You also may try other APIs to access PMU like PAPI or likwid (https://github.com/RRZE-HPC/likwid). Some of them may directly read PMU registers (sometimes MSR) and may allow sampling at the same time when counting is enabled.

ptrace one thread from another

Experimenting with the ptrace() system call, I am trying to trace another thread of the same process. According to the man page, both the tracer and the tracee are specific threads (not processes), so I don't see a reason why it should not work. So far, I have tried the following:
use PTRACE_TRACEME from the clone()d child: the call succeeds, but does not do what I want, probably because the parent of the to-be-traced thread is not the thread that called clone()
use PTRACE_ATTACH or PTRACE_SEIZE from the parent thread: this always fails with EPERM, even if the process runs as root and with prctl(PR_SET_DUMPABLE, 1)
In all cases, waitpid(-1, &status, __WALL) fails with ECHILD (same when passing the child pid explicitly).
What should I do to make it work?
If it is not possible at all, is it by desing or a bug in the kernel (I am using version 3.8.0). In the former case, could you point me to the right bit of the documentation?

As #mic_e pointed out, this is a known fact about the kernel - not quite a bug, but not quite correct either. See the kernel mailing list thread about it. To provide an excerpt from Linus Torvalds:
That "new" (last November) check isn't likely going away. It solved
so many problems (both security and stability), and considering that
(a) in a year, only two people have ever even noticed
(b) there's a work-around as per above that isn't horribly invasive
I have to say that in order to actually go back to the old behaviour,
we'd have to have somebody who cares deeply, go back and check every
single special case, deadlock, and race.
The solution is to actually start the process that is being traced in a subprocess - you'll need to make the ptracing process be the parent of the other.
Here's an outline of doing this based on another answer that I wrote:
// this number is arbitrary - find a better one.
#define STACK_SIZE (1024 * 1024)
int main_thread(void *ptr) {
// do work for main thread
}
int main(int argc, char *argv[]) {
void *vstack = malloc(STACK_SIZE);
pid_t v;
if (clone(main_thread, vstack + STACK_SIZE, CLONE_PARENT_SETTID | CLONE_FILES | CLONE_FS | CLONE_IO, NULL, &v) == -1) { // you'll want to check these flags
perror("failed to spawn child task");
return 3;
}
long ptv = ptrace(PTRACE_SEIZE, v, NULL, NULL);
if (ptv == -1) {
perror("failed monitor sieze");
return 1;
}
// do actual ptrace work
}

Can I assign a per-thread index, using pthreads?

I'm optimizing some instrumentation for my project (Linux,ICC,pthreads), and would like some feedback on this technique to assign a unique index to a thread, so I can use it to index into an array of per-thread data.
The old technique uses a std::map based on pthread id, but I'd like to avoid locks and a map lookup if possible (it is creating a significant amount of overhead).
Here is my new technique:
static PerThreadInfo info[MAX_THREADS]; // shared, each index is per thread
// Allow each thread a unique sequential index, used for indexing into per
// thread data.
1:static size_t GetThreadIndex()
2:{
3: static size_t threadCount = 0;
4: __thread static size_t myThreadIndex = threadCount++;
5: return myThreadIndex;
6:}
later in the code:
// add some info per thread, so it can be aggregated globally
info[ GetThreadIndex() ] = MyNewInfo();
So:
1) It looks like line 4 could be a race condition if two threads where created at exactly the same time. If so - how can I avoid this (preferably without locks)? I can't see how an atomic increment would help here.
2) Is there a better way to create a per-thread index somehow? Maybe by pre-generating the TLS index on thread creation somehow?

1) An atomic increment would help here actually, as the possible race is two threads reading and assigning the same ID to themselves, so making sure the increment (read number, add 1, store number) happens atomically fixes that race condition. On Intel a "lock; inc" would do the trick, or whatever your platform offers (like InterlockedIncrement() for Windows for example).
2) Well, you could actually make the whole info thread-local ("__thread static PerThreadInfo info;"), provided your only aim is to be able to access the data per-thread easily and under a common name. If you actually want it to be a globally accessible array, then saving the index as you do using TLS is a very straightforward and efficient way to do this. You could also pre-compute the indexes and pass them along as arguments at thread creation, as Kromey noted in his post.

Why so averse to using locks? Solving race conditions is exactly what they're designed for...
In any rate, you can use the 4th argument in pthread_create() to pass an argument to your threads' start routine; in this way, you could use your master process to generate an incrementing counter as it launches the threads, and pass this counter into each thread as it is created, giving you your unique index for each thread.

I know you tagged this [pthreads], but you also mentioned the "old technique" of using std::map. This leads me to believe that you're programming in C++. In C++11 you have std::thread, and you can pass out unique indexes (id's) to your threads at thread creation time through an ordinary function parameter.
Below is an example HelloWorld that creates N threads, assigning each an index of 0 through N-1. Each thread does nothing but say "hi" and give it's index:
#include <iostream>
#include <thread>
#include <mutex>
#include <vector>
inline void sub_print() {}
template <class A0, class ...Args>
void
sub_print(const A0& a0, const Args& ...args)
{
std::cout << a0;
sub_print(args...);
}
std::mutex&
cout_mut()
{
static std::mutex m;
return m;
}
template <class ...Args>
void
print(const Args& ...args)
{
std::lock_guard<std::mutex> _(cout_mut());
sub_print(args...);
}
void f(int id)
{
print("This is thread ", id, "\n");
}
int main()
{
const int N = 10;
std::vector<std::thread> threads;
for (int i = 0; i < N; ++i)
threads.push_back(std::thread(f, i));
for (auto i = threads.begin(), e = threads.end(); i != e; ++i)
i->join();
}
My output:
This is thread 0
This is thread 1
This is thread 4
This is thread 3
This is thread 5
This is thread 7
This is thread 6
This is thread 2
This is thread 9
This is thread 8

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string