Is it possible to track display driver memory accesses over PCIe? - linux

Our team wishes to write a performance measuring tool with focus on GPUGPU
In order to understand if a particular app is compute-bound or memory-bound we would like to track Graphic card's memory accesses without sticking to any compute API
Is it possible to track syscalls like read for this purpose?

Related

Is there a simple way to measure total memory consumption?

I want to report the total memory usage of an Actix Web application from within itself.
Because this is only shown at the bottom of a single page and used very rarely and it does not need to be that accurate (e.g. it is enough if it measures just the heap), I don't want to dedicate too much code to it or use an external crate.
Is there a simple one-liner to measure how much total memory my program is using without using an external dependency or is that not possible?
Is there a simple one-liner in Rust
No.
The best way to get the total memory consumption of a process is via the OS's facilities. Which, of course, depends on what OS you are running on.
There's crates like sysinfo that abstract it all away though and implement cross-platform process metrics.

How To implement swapfile cross operating system

There are use cases where I can't have a lot of ram, and sometimes due to docker based services doesn't always provide more than 512mb/1gb of ram, or if I run multiple rust based gui apps and if each take 100mb of ram normally, how can I implement a swapfile/ virtual ram to exceed allotted ram? Also os level swapfiles don't let users choose which app can use real ram and which swapfile, so it can become a problem too. I want to use swapfile as much as possible, and not even real ram, if possible. Users and hosting services provide with lot of storage usually (more than 10gb normally) so it would be a good way to use the available storage too!
If swapfile or anything like that aren't possible, I would like to know if there is any difference in speed and cpu consumption between "cache data in ram" apps and "cache data in file and read it when required" apps. If the latter is slow normally and not as efficient as swapfiles, I would like to know the possible ways how os manages to make swapfiles that efficient than apps.
An application does not control whether the memory they allocate is allocated on real RAM, on a swap partition, or else. You just ask for memory, and the OS is responsible for finding available memory to give to you.
Besides that, note that using swap (sometimes called swapping) is extremely bad performance-wise. How much depends a lot on your hardware, but it's about three orders of magnitude. This is even amplified if you are interacting with a user: a program that is fetching some resources will not be too bothered if it has to wait one minute to get them instead of a few milliseconds because the system is under heavy load, but a user will generally not be that patient.
Also note that, when swapping, the OS does not chose which application gets the faster RAM and which ones get the swap memory at random. It will try to determine which application should be prioritized, by how much, etc. based on how it was configured (at least for the Linux kernel), so in reality it's the user who, in the end, decides which applications get the most RAM (ahead of time, of course: they are not prompted each time the kernel has to make that decision with a little pop-up...).
Finally, modern OS allow several applications to allocate memory that overlap, as long as each application is not fully using the memory it asked for (which is kind of usual), allowing you to run applications that in theory require more RAM that you actually have.
This was on the OS part: now to the application part. Usually, when you write a program (whose purpose is not specifically RAM-related), you should not really care for memory consumption (up to a certain point), especially in Rust. Not only that is usually handled by the OS in case you used a little bit too much memory, but when it's possible, most people prefer to trade a little more memory usage (even a lot more) for better CPU performance, because RAM is a lot cheaper than CPU.
There are exceptions, of course, in which the memory consumption is so high that you can't really afford not paying attention. In these cases, either you let the user deal with this problem (ie. this application is known to consume a lot of memory because there are no other ways to do this, so if you want to use it, just have a lot of memory), as often video games do, or you rethink your application to reduce the memory usage trading it for some CPU efficiency, as for example is done when you are handling graphs so huge you couldn't even store them on all the hard disks of the world (in which case your application has to be smart enough to be able to work on small parts of the graph at the time), or finally you are working with a big resource but which can be stored on the hard disk, so you just write it on a file and access it chunks-by-chunks, as some database managers do.

How to make the OS schedule disk accesses optimally?

Suppose that a process needs to access the file system in many (1000+) places, and the order is not important to the program logic. However, the order obviously matters for performance if the file system is stored on a (spinning) hard disk.
How can the application programmer communicate to the OS that it should schedule the accesses optimally? Launching 1000+ threads does not seem practical. Does database management software accomplish this, and if so, then how?
Additional details: I had a large (1TB+) mmapped file where I needed to read 1000+ chunks of about 1KB, each time in new, unpredictable places.
In the early days when parameters like Wikipedia: Hard disk drive performance characteristics → Seek time were very expensive and thus very important, database vendors payed attention to the on-disk data representation and layout as can be seen e.g. in Oracle8i: Designing and Tuning for Performance → Tuning I/O.
The important optimization parameters changed with appearance of Solid-state drives (SSD) where the seek time is 0 (or at least constant) as there is nothing to rotate. Some of the new parameters are addressed by Wikipedia: Solid-state drive (SSD) → optimized file systems.
But even those optimization parameters go away with the use of Wikipedia: In-memory databases. The list of vendors is pretty long, all big players on it.
So how to schedule your access optimally depends a lot on the use case (1000 concurrent hits is not sufficient problem description) and buying some RAM is one of the options and "how can the programmer communicate with the OS" will be one of the last (not first) questions
Files and their transactions are cached in various devices in your computer; RAM and the HD cache are the most usual places. The file system driver may also implement IO transaction queues, defragmentation, and error-correction logic that makes things complicated for the developer who wants to control every aspect of file access. This level of complexity is ultimately designed to provide integrity, security, performance, and coordination of file access across all processes of your system.
Optimization efforts should not interfere with the system's own caching and prediction algorithms, not just for IO but for all caches. Trying to second-guess your system is a waste of your time and your processors' time.
Most probably your IO operations and data will stay on caches and later be committed to your storage devices when your OS sees fit.
That said, there's always options like database suites, mmap, readahead mechanisms, and direct IO to your drive. You will need to invest time benchmarking any of your efforts. I advise against multiple IO threads because cache contention will make things even slower than one thread.
The kernel will already reorder the read/write requests (e.g. to fit the spin of a mechanical disk), if they come from various processes or threads. BTW, most of the reads & writes would go to the kernel file system cache, not to the disk.
You might consider using posix_fadvise(2) & perhaps (in a separate thread) readahead(2). If -instead of read(2)-ing- you use mmap(2) to project some file portion to virtual memory, you might use also madvise(2)
Of course, the file system does not usually guarantee that a sequential portion of a file is physically sequentially located on the disk (and even the disk firmware might reorder sectors). See picture in Ext2 wikipage, also relevant for Ext4. Some file systems might be better in that respect, and you could tune their block size (at mkfs time).
I would not recommend having thousands of threads (only at most a few dozens).
At last, it might worth buying some SSD or some more RAM (for file cache). See http://linuxatemyram.com/
Actual performance would depend a lot on the particular system and hardware.
Perhaps using an indexed file library like GDBM or a database library Sqlite (or a real database like PostGreSQL) might be worthwhile! Perhaps have fewer files but bigger ones could help.
BTW, you are mmap-ing, and reading small chunk of 1K (smaller than page size of 4K). You could use madvise (if possible in advance), but you should try to read larger chunks, since every file access will bring at least a whole page.
You really should benchmark!

Tool to identify app's data/code most susceptible to memory performance

Context:
-- embedded platform running Linux with some static RAM which is declared about 3 times faster then the rest of RAM (dynamic). The amount of this fast memory is 512kB and the official name is eSRAM. (Details not important for this post: Galileo board, information on eSRAM and relevant kernel API: https://communities.intel.com/servlet/JiveServlet/previewBody/22488-102-1-26046/Quark_SWDevManLx_330235_001.pdf)
-- eSRAM can be used by an application with some support from the kernel---a simple driver that allocates kernel memory on its behalf, overlays the memory with eSRAM (this is done in physical space) and mmaps it to app's virtual memory space. This was tested and confirmed to work as expected.
Problem:
Identify which sections of app's data (and possibly code) to map into eSRAM to achieve optimum performance gain. A suitable analysis tool is required.
After some search I'm not sure if any existing tool is actually suited to this task. Currently my best bet is to develop a specialized Valgrind tool. But maybe there is already something in the ecosystem to start with. Any advice/information is welcome even if, for instance, a tool is kind of partially suited etc.
P.S.
Full analysis should probably take a lot of factors into account, like:
-- memory access patterns (cache performance)
-- changes over time (one could consider eSRAM paging)
...
I have taken a look at Valgrind Cachegrind. It can collect data about data cache reades and data cache writes. And cg_annotate can report Line-by-line Counts for you program. Can it be useful for you to find variables in your program that cause most operations with data cache and in this way to identify data that can benefit most from moving to quick memory? http://valgrind.org/docs/manual/cg-manual.html#cg-manual.line-by-line
Probably, you are interested in D cache reads (Dr) and D cache writes (Dw), or even (Dr+Dw). In that way you can find a place in your code which does most (Dr+Dw) and try to move this place in your quick memory.

Classifying a program as compute intensive based on performance counters

I'm trying to classify few parallel programs as compute / memory/ data intensive. Can I classify them from values obtained from performance counters like perf. This command gives couple of values like number of page faults that I think can be used to know if a program needs to access memory frequently, else otherwise.
Is this approach correct and possible way. If not can someone guide me in classifying programs into respective categories.
Cheers,
Kris
Yes you should in theroy be able to do that with perf. I don't think page faults events are the one to observe if you want to analyse memory activity. For this purpose, on Intel processors you should use uncore events that allow you to count memory traffic (read/write separately). On my Westmere-EP these counters are UNC_QMC_NORMAL_READS.ANY and UNC_QMC_WRITES_FULL.ANY
The following article deals exactly with your problem (on Intel processors):
http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/ispass-2013_177.pdf

Resources