Is there a way to check whether the processor cache has been flushed recently?

Is there a way to check whether the processor cache has been flushed recently? - linux

On i386 linux. Preferably in c/(c/posix std libs)/proc if possible. If not is there any piece of assembly or third party library that can do this?
Edit: I'm trying to develop test whether a kernel module clear a cache line or the whole proccesor(with wbinvd()). Program runs as root but I'd prefer to stay in user space if possible.

Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.
This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.
Here is the output:
took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks
You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.
If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.
Source code:
#include <stdio.h>
#include <stdint.h>
inline void
clflush(volatile void *p)
{
asm volatile ("clflush (%0)" :: "r"(p));
}
inline uint64_t
rdtsc()
{
unsigned long a, d;
asm volatile ("rdtsc" : "=a" (a), "=d" (d));
return a | ((uint64_t)d << 32);
}
volatile int i;
inline void
test()
{
uint64_t start, end;
volatile int j;
start = rdtsc();
j = i;
end = rdtsc();
printf("took %lu ticks\n", end - start);
}
int
main(int ac, char **av)
{
test();
test();
printf("flush: ");
clflush(&i);
test();
test();
return 0;
}

I dont know of any generic command to get the the cache state, but there are ways:
I guess this is the easiest: If you got your kernel module, just disassemble it and look for cache invalidation / flushing commands (atm. just 3 came to my mind: WBINDVD, CLFLUSH, INVD).
You just said it is for i386, but I guess you dont mean a 80386. The problem is that there are many different with different extension and features. E.g. the newest Intel series has some performance/profiling registers for the cache system included, which you can use to evalute cache misses/hits/number of transfers and similar.
Similar to 2, very depending on the system you got. But when you have a multiprocessor configuration you could watch the first cache coherence protocol (MESI) with the 2nd.
You mentioned WBINVD - afaik that will always flush complete, i.e. all, cache lines

It may not be an answer to your specific question, but have you tried using a cache profiler such as Cachegrind? It can only be used to profile userspace code, but you might be able to use it nonetheless, by e.g. moving the code of your function to userspace if it does not depend on any kernel-specific interfaces.
It might actually be more effective than trying to ask the processor for information that may or may not exist and that will be probably affected by your mere asking about it - yes, Heisenberg was way before his time :-)

Related

DMA memcpy operation in Linux

I want to dma using dma_async_memcpy_buf_to_buf function which is in dmaengine.c file (linux/drivers/dma). For this, I add a function in dmatest.c file (linux/drivers/dma) as following:
void foo ()
{
int index = 0;
dma_cookie_t cookie;
size_t len = 0x20000;
ktime_t start, end, end1, end2, end3;
s64 actual_time;
u16* dest;
u16* src;
dest = kmalloc(len, GFP_KERNEL);
src = kmalloc(len, GFP_KERNEL);
for (index = 0; index < len/2; index++)
{
dest[index] = 0xAA55;
src[index] = 0xDEAD;
}
start = ktime_get();
cookie = dma_async_memcpy_buf_to_buf(chan, dest, src, len);
while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
{
dma_sync_wait(chan, cookie);
}
end = ktime_get();
actual_time = ktime_to_ns(ktime_sub(end, start));
printk("Time taken for function() execution dma: %lld\n",(long long)actual_time);
memset(dest, 0 , len);
start = ktime_get();
memcpy(dest, src, len);
end = ktime_get();
actual_time = ktime_to_ns(ktime_sub(end, start));
printk("Time taken for function() execution non-dma: %lld\n",(long long)actual_time);
}
There are some issues with DMA:
Interestingly, memcpy function execution time is less than dma_async_memcpy_buf_to_buf function. Maybe, its related with ktime_get() function problem.
My method with foo function is correct or incorrect to perform DMA operation? I'm not sure about this.
How can I measure tick counts of memcpy and dma_async_memcpy_buf_to_buf functions in terms of cpu usage
Finally, Is DMA operation possible at application level? Up to now I used in kernel level as you can see above(dmatest.c is inserted kernel module)

There are multiple issues in your question, which make it kind of hard to answer exactly what you're questioning:
Yes, your general DMA operation invocation algorithm is correct.
The fundamental difference in using plain memcpy and DMA operations for copying memory is not getting direct performance gains, but (a) performance gains due to sustaining CPU cache / prefetcher state when using DMA operation (which is likely would be garbled when using plain old memcpy, executed on CPU itself), and (b) true background operation which leaves CPU available to do other stuff.
Given (a), it's kind of pointless to use DMA operations on anything less than CPU cache size, i.e. dozens of megabytes. Typically it's done for purposes of fast off-CPU stream processing, i.e. moving data that would be anyhow produced/consumed by external devices, such as fast networking cards, video streaming / capturing / encoding hardware, etc.
Comparing async and sync operations in terms of wall clock elapsed time is wrong. There might be hundreds of threads / processes running and no one guarantees you that you'd get scheduled next tick and not several thousands ticks later.
Using ktime_get for benchmarking purposes is wrong - it's fairly imprecise, especially for given such short jobs. Profiling kernel code in fact is a pretty hard and complex task which is well beyond the scope of this question. A quick recommendation here would be to refrain at all from such micro-benchmarks and profile a much bigger and more complete job - similar to what you're ultimately trying to achieve.
Measuring "ticks" for modern CPUs is also kind of pointless, although you can use CPU vendor-specific tools, such as Intel's VTune.
Using DMA copy operations on application level is fairly pointless - at least I can't come with a single viable scenario from top of my head when it would be worth the trouble. It's not innately faster, and, what's more important, I seriously doubt that your application performance's bottleneck is memory copying. For this to be the case, you generally should be doing everything else faster than regular memory copying, and I can't really think of anything on application level that would be faster than memcpy. And if we're talking about communication with some other, off-CPU processing device, then it's automatically not application level.
Generally, memory copy performance is usually limited by memory speed, i.e. clock freq and timings. You aren't going to get any miracle boosts over regular memcpy in direct performance, just because memcpy executed on CPU is fast enough, as CPU usually works with 3x-5x-10x faster clock frequencies than memory.

In general, on ucLinux, is ioctl faster than writing to /sys filesystem?

I have an embedded system I'm working with, and it currently uses the sysfs to control certain features.
However, there is function that we would like to speed up, if possible.
I discovered that this subsystem also supports and ioctl interface, but before rewriting the code, I decided to search to see which is a faster interface (on ucLinux) in general: sysfs or ioctl.
Does anybody understand both implementations well enough to give me a rough idea of the difference in overhead for each? I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls". Or "they are roughly the same because sysfs has a very simple interface".
Update 10/24/2013:
The specific case I'm currently doing is as follows:
int fd = open("/sys/power/state",O_WRONLY);
write( fd, "standby", 7 );
close( fd );
In kernel/power/main.c, the code that handles this write looks like:
static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t n)
{
#ifdef CONFIG_SUSPEND
suspend_state_t state = PM_SUSPEND_STANDBY;
const char * const *s;
#endif
char *p;
int len;
int error = -EINVAL;
p = memchr(buf, '\n', n);
len = p ? p - buf : n;
/* First, check if we are requested to hibernate */
if (len == 7 && !strncmp(buf, "standby", len)) {
error = enter_standby();
goto Exit;
((( snip )))
Can this be sped up by moving to a custom ioctl() where the code to handle the ioctl call looks something like:
case SNAPSHOT_STANDBY:
if (!data->frozen) {
error = -EPERM;
break;
}
error = enter_standby();
break;
(so the ioctl() calls the same low-level function that the sysfs function did).

If by sysfs you mean the sysfs() library call, notice this in man 2 sysfs:
NOTES
This System-V derived system call is obsolete; don't use it. On systems with /proc, the same information can be obtained via
/proc/filesystems; use that interface instead.
I can't recall noticing stuff that had an ioctl() and a sysfs interface, but probably they exist. I'd use the proc or sys handle anyway, since that tends to be less cryptic and more flexible.
If by sysfs you mean accessing files in /sys, that's the preferred method.
I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls".
Accessing procfs or sysfs files does not entail an I/O bottleneck because they are not real files -- they are kernel interfaces. So no, accessing this stuff through "the file layer" does not affect performance. This is a not uncommon misconception in linux systems programming, I think. Programmers can be squeamish about system calls that aren't well, system calls, and paranoid that opening a file will be somehow slower. Of course, file I/O in the ABI is just system calls anyway. What makes a normal (disk) file read slow is not the calls to open, read, write, whatever, it's the hardware bottleneck.
I always use low level descriptor based functions (open(), read()) instead of high level streams when doing this because at some point some experience led me to believe they were more reliable for this specifically (reading from /proc). I can't say whether that's definitively true.

So, the question was interesting, I built a couple of modules, one for ioctl and one for sysfs, the ioctl implementing only a 4 bytes copy_from_user and nothing more, and the sysfs having nothing in its write interface.
Then a couple of userspace test up to 1 million iterations, here the results:
time ./sysfs /sys/kernel/kobject_example/bar
real 0m0.427s
user 0m0.056s
sys 0m0.368s
time ./ioctl /run/temp
real 0m0.236s
user 0m0.060s
sys 0m0.172s
edit
I agree with #goldilocks answer, HW is the real bottleneck, in a Linux environment with a well written driver choosing ioctl or sysfs doesn't make a big difference, but if you are using uClinux probably in your HW even few cpu cycles can make a difference.
The test I've done is for Linux not uClinux and it never wanted to be an absolute reference profiling the two interfaces, my point is that you can write a book about how fast is one or another but only testing will let you know, took me few minutes to setup the thing.

independent searches on GPU -- how to synchronize its finish?

Assume I have some algorithm generateRandomNumbersAndTestThem() which returns true with probability p and false with probability 1-p. Typically p is very small, e.g. p=0.000001.
I'm trying to build a program in JOCL that estimates p as follows: generateRandomNumbersAndTestThem() is executed in parallel on all available shader cores (preferrably of multiple GPUs), until at least 100 trues are found. Then the estimate for p is 100/n, where n is the total number of times that generateRandomNumbersAndTestThem() was executed.
For p = 0.0000001, this means roughly 10^9 independent attempts, which should make it obvious why I'm looking to do this on GPUs. But I'm struggling a bit how to implement the stop condition properly. My idea was to have something along these lines as the kernel:
__kernel void sampleKernel(all_the_input, __global unsigned long *totAttempts) {
int gid = get_global_id(0);
//here code that localizes all_the_input for faster access
while (lessThan100truesFound) {
totAttempts[gid]++;
if (generateRandomNumbersAndTestThem())
reportTrue();
}
}
How should I implement this without severe performance loss, given that
triggering of the "if" will be a very rare event and so it is not a problem if all threads have to wait while reportTrue() is executed
lessThan100truesFound has to be modified only once (from true to false) when reportTrue() is called for the 100th time (so I don't even know if a boolean is the right way)
the plan is to buy brand-new GPU hardware for this, so you can assume a recent GPU, e.g. multiple ATI Radeon HD7970s. But it would be nice if I could test it on my current HD5450.
I assume that something can be done similar to Java's "synchronized" modifier, but I fail to find the exact way to do it. What is the "right" way to do this, i.e. any way that works without severe performance loss?

I'd suggest not using global flag to stop kernel, but rather run kernel to do certain amount of attempts, check on host if you have accumulated enough 'successes', and repeat if necessary. Using cycle of undefined length in kernel is bad since GPU driver could be killed by watch-dog timer. Besides, checking some global variable at each iteration would certainly screw kernel performance.
This way, reportTrue could be implemented as atomic_inc to some counter residing in global memory.
__kernel void sampleKernel(all_the_input, __global unsigned long *successes) {
int gid = get_global_id(0);
//here code that localizes all_the_input for faster access
for (int i = 0; i < ATT_PER_THREAD; ++i) {
if (generateRandomNumbersAndTestThem())
atomic_inc(successes);
}
}
ATT_PER_THREAD is to be adjusted depending on how long it takes to execute generateRandomNumbersAndTestThem(). Kernel launch overhead is pretty small, so there usually is no need to make your kernel run more than 0.1--1 second

How to Verify Atomic Writes?

I have searched diligently (both within the S[O|F|U] network and elsewhere) and believe this to be an uncommon question. I am working with an Atmel AT91SAM9263-EK development board (ARM926EJ-S core, ARMv5 instruction set) running Debian Linux 2.6.28-4. I am writing using (I believe) the tty driver to talk to an RS-485 serial controller. I need to ensure that writes and reads are atomic. Several lines of source code (listed below the end of this post relative to the kernel source installation directory) either imply or implicitly state this.
Is there any way I can verify that writing/reading to/from this device is actually an atomic operation? Or, is the /dev/ttyXX device considered a FIFO and the argument ends there? It seems not enough to simply trust that the code is enforcing this claim it makes - as recently as February of this year freebsd was demonstrated to lack atomic writes for small lines. Yes I realize that freebsd is not exactly the same as Linux, but my point is that it doesn't hurt to be carefully sure. All I can think of is to keep sending data and look for a permutation - I was hoping for something a little more scientific and, ideally, deterministic. Unfortunately, I remember precisely nothing from my concurrent programming classes in the college days of yore. I would thoroughly appreciate a slap or a shove in the right direction. Thank you in advance should you choose to reply.
Kind regards,
Jayce
drivers/char/tty_io.c:1087
void tty_write_message(struct tty_struct *tty, char *msg)
{
lock_kernel();
if (tty) {
mutex_lock(&tty->atomic_write_lock);
if (tty->ops->write && !test_bit(TTY_CLOSING, &tty->flags))
tty->ops->write(tty, msg, strlen(msg));
tty_write_unlock(tty);
}
unlock_kernel();
return;
}
arch/arm/include/asm/bitops.h:37
static inline void ____atomic_set_bit(unsigned int bit, volatile unsigned long *p)
{
unsigned long flags;
unsigned long mask = 1UL << (bit & 31);
p += bit >> 5;
raw_local_irq_save(flags);
*p |= mask;
raw_local_irq_restore(flags);
}
drivers/serial/serial_core.c:2376
static int
uart_write(struct tty_struct *tty, const unsigned char *buf, int count)
{
struct uart_state *state = tty->driver_data;
struct uart_port *port;
struct circ_buf *circ;
unsigned long flags;
int c, ret = 0;
/*
* This means you called this function _after_ the port was
* closed. No cookie for you.
*/
if (!state || !state->info) {
WARN_ON(1);
return -EL3HLT;
}
port = state->port;
circ = &state->info->xmit;
if (!circ->buf)
return 0;
spin_lock_irqsave(&port->lock, flags);
while (1) {
c = CIRC_SPACE_TO_END(circ->head, circ->tail, UART_XMIT_SIZE);
if (count < c)
c = count;
if (c <= 0)
break;
memcpy(circ->buf + circ->head, buf, c);
circ->head = (circ->head + c) & (UART_XMIT_SIZE - 1);
buf += c;
count -= c;
ret += c;
}
spin_unlock_irqrestore(&port->lock, flags);
uart_start(tty);
return ret;
}
Also, from the man write(3) documentation:
An attempt to write to a pipe or FIFO has several major characteristics:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.

I think that, technically, devices are not FIFOs, so it's not at all clear that the guarantees you quote are supposed to apply.
Are you concerned about partial writes and reads within a process, or are you actually reading and/or writing the same device from different processes? Assuming the latter, you might be better off implementing a proxy process of some sort. The proxy owns the device exclusively and performs all reads and writes, thus avoiding the multi-process atomicity problem entirely.
In short, I advise not attempting to verify that "reading/writing from this device is actually an atomic operation". It will be difficult to do with confidence, and leave you with an application that is subject to subtle failures if a later version of linux (or different o/s altogether) fails to implement atomicity the way you need.

I think PIPE_BUF is the right thing. Now, writes of less than PIPE_BUF bytes may not be atomic, but if they aren't it's an OS bug. I suppose you could ask here if an OS has known bugs. But really, if it has a bug like that, it just ought to be immediately fixed.
If you want to write more than PIPE_BUF atomically, I think you're out of luck. I don't think there is any way outside of application coordination and cooperation to make sure that writes of larger sizes happen atomically.
One solution to this problem is to put your own process in front of the device and make sure everybody who wants to write to the device contacts the process and sends the data to it instead. Then you can do whatever makes sense for your application in terms of atomicity guarantees.

x86 equivalent for LWARX and STWCX

I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programing).
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *pval;
do
{
// fetch the pointer to the value
pval = *ptr;
// if its NULL, then just return NULL, the smart pointer
// will then become NULL as well
if(pval == NULL)
return NULL;
// Grab the reference count
val = lwarx(pval);
// make sure the pointer we grabbed the value from
// is still the same one referred to by 'ptr'
if(pval != *ptr)
continue;
// Increment the reference count via 'stwcx' if any other threads
// have done anything that could potentially break then it should
// fail and try again
} while(!stwcx(pval, val + 1));
return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).
Thanks

As Michael mentioned, what you're probably looking for is the cmpxchg instruction.
It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate it by using the x86s double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is one of the two words contains the value of interest, and the other one contains an always incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter to wrap between attempts is so low that it's a reasonable substitute for most purposes.

x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).

You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with a lock instruction to get equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.

if you are on 64 bits and limit yourself to say 1tb of heap, you can pack the counter into the 24 unused top bits. if you have word aligned pointers the bottom 5 bits are also available.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *unpacked;
do
{
val = *ptr;
unpacked = unpack(val);
if(unpacked == NULL)
return NULL;
// pointer is on the bottom
} while(!cas(unpacked, val, val + 1));
return unpacked;
}

Don't know if LWARX and STWCX invalidate the whole cache line, CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer) you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously casrificed 64b, planed their structures around it (packing stuff that won't be subject of contention), kept everything alligned on 64b boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approx 20 to 100 cycles, making it a bigger real perf issue then just lock avoidance.
Also, you'd have to plan different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all it's memory bulk at the end) or datailed allocation management so that one structure under contention never receives memory realesed by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive but it's either that or paying the price for GC.

What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).
The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and stwcx without invalidating the stwcx.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string