I'd like to have Grinder's agents log CPU and memory usage from OperatingSystemMXBean.
Ideally this would produce a normal _data.log file that I could feed into Grinder Analyzer.
How could I instrument a test to create such a report? Or could I write a test script that reports CPU load values as the Test time?
(I read Best tool to record CPU and memory usage with Grinder?)
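For the second idea, here is a minimal Jython sketch of how it could be done — a sketch under assumptions, not a tested recipe. It assumes The Grinder 3.7+ (Test.record) and a HotSpot/OpenJDK JVM whose OperatingSystemMXBean exposes getSystemCpuLoad() (com.sun.management, Java 7+); the test number and the 5-second sampling interval are arbitrary. It records a sleep whose length equals the CPU load in percent, so the value shows up as the Test time in the worker's *_data.log.

```python
# Sketch only: report system CPU load as the "Test time" of a dedicated test.
from java.lang import Thread
from java.lang.management import ManagementFactory
from net.grinder.script import Test
from net.grinder.script.Grinder import grinder

osBean = ManagementFactory.getOperatingSystemMXBean()
cpuTest = Test(1, "System CPU load (ms ~ percent)")   # arbitrary test number

def sampleCpu():
    # getSystemCpuLoad() returns a fraction in [0, 1], or a negative value
    # until the first sample is available (com.sun.management MXBean).
    load = osBean.getSystemCpuLoad()
    if load >= 0:
        Thread.sleep(int(load * 100))   # recorded time ~= CPU load in percent

cpuTest.record(sampleCpu)               # instrument so each call is logged

class TestRunner:
    def __call__(self):
        sampleCpu()
        grinder.sleep(5000)             # sample roughly every 5 seconds
```

A second Test could log memory the same way (e.g. derived from getTotalPhysicalMemorySize()/getFreePhysicalMemorySize()), or the numbers could go into Grinder's user statistics instead of abusing a sleep, but the sleep trick is the simplest way to get the values into a _data.log that Grinder Analyzer already understands.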
I have a large transformer model from Hugging Face; it is about 2 GB on disk. When I try to use multiprocessing processes or a pool, the program just freezes, even with only 2 workers/processes.
From what I understand, it freezes because it's trying to pickle the transformer model and copy the environment for both workers.
I've tried loading the model after the multiprocessing starts, but that results in the same problem.
My question is: do I need to increase my RAM? If so, what's the general rule of thumb for how much RAM I need per worker, and how would I calculate it?
How can I get this right? I've tried making the model use a shared memory block, but I've not managed to get it to work. Has anyone done something like this?
You probably have to account for 2 GB (or more) per worker, since each worker likely holds its own copy of your model.
Using shared memory is the only option if you can't increase your memory amount.
I believe an easy rule of thumb for how much RAM you need is something like n_workers * per_worker_mem * 1.1 (e.g. 4 workers × 2.5 GB each × 1.1 ≈ 11 GB). You can measure per_worker_mem with the free or ps commands; the extra 10% accounts for the overhead of synchronization and data exchange between workers.
Your overhead may vary according to the amount of data shared and exchanged between the workers.
On a physical system you may also want to account for an additional 0.5 GB for the OS and, in general, a fair amount of free RAM to be used as file-system cache to keep the system snappy (e.g. if your model needs 6 GB of RAM, I wouldn't go below 16 or 32 GB total).
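To make the shared-memory suggestion concrete, here is a rough sketch (not the asker's code) assuming PyTorch + transformers on Linux; "bert-base-uncased" is just a placeholder checkpoint, and the fork start method is what avoids pickling the 2 GB model for every worker:

```python
# Sketch: load the model once, move its weights to shared memory, and fork
# workers that reuse the same pages instead of each pickling a 2 GB copy.
import torch
import torch.multiprocessing as mp
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"   # placeholder: substitute your checkpoint

def worker(rank, model, tokenizer):
    with torch.no_grad():
        batch = tokenizer(["hello from worker %d" % rank], return_tensors="pt")
        out = model(**batch)
    print("worker", rank, "output shape", tuple(out.last_hidden_state.shape))

if __name__ == "__main__":
    mp.set_start_method("fork")    # fork: children inherit memory, nothing is pickled
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.share_memory()           # move parameters into shared memory
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    procs = [mp.Process(target=worker, args=(i, model, tokenizer)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

With fork, the children inherit the parent's pages, and share_memory() keeps the weight tensors in shared memory so they are not duplicated later; if you cannot share the model at all, the rule of thumb above applies and every extra worker costs roughly another full copy of the model.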
I am developing a function to monitor CPU and memory.
By the way, are there any other useful things to show using just CPU and memory information?
Currently there is one CPU trend graph and one memory trend graph, and I am thinking about adding a Z-score graph.
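As a concrete illustration of the Z-score idea, here is a small sketch (assuming psutil is installed; the sample count, interval and window length are arbitrary choices) that turns raw CPU samples into a rolling Z-score series you could plot alongside the trend graphs:

```python
# Sketch: sample CPU/memory with psutil and compute a rolling Z-score for CPU.
import statistics
import psutil

def sample_z_scores(samples=60, interval=1.0, window=30):
    cpu_hist = []
    for _ in range(samples):
        cpu = psutil.cpu_percent(interval=interval)   # % CPU averaged over the interval
        mem = psutil.virtual_memory().percent         # % of RAM currently in use
        cpu_hist.append(cpu)
        recent = cpu_hist[-window:]                   # rolling window of samples
        sd = statistics.pstdev(recent) if len(recent) > 1 else 0.0
        z = (cpu - statistics.mean(recent)) / sd if sd > 0 else 0.0
        print("cpu=%5.1f%%  mem=%5.1f%%  cpu_z=%+.2f" % (cpu, mem, z))

if __name__ == "__main__":
    sample_z_scores()
```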
I have been looking for a way to log all memory accesses of a process/execution in Linux. I know there have been questions asked on this topic previously, like this one:
Logging memory access footprint of whole system in Linux
But I wanted to know if there is any non-instrumentation tool that performs this activity. I am not looking for QEMU/Valgrind for this purpose, since they would be a bit slow and I want as little overhead as possible.
I looked at perf mem and PEBS events like cpu/mem-loads/pp for this purpose, but I see that they only collect sampled data, and I actually want a trace of all memory accesses without any sampling.
I want to know whether it is possible to collect all memory accesses without paying the overhead of a tool like QEMU. Is it possible to use perf alone, without sampling, so that I get data for every memory access?
Is there any other tool out there that I am missing? Or any other strategy that gives me all memory access data?
It is simply impossible to have both the fastest possible run of SPEC and all memory accesses (or cache misses) traced in that same run (using in-system tracers). Do one run for timing and another run (longer, slower), or even a recompiled binary, for memory access tracing.
You might start with a short and simple program (not the ref inputs of recent SPEC CPU, or the billions of memory accesses in your big programs) and use the Linux perf tool (perf_events) to find an acceptable ratio of recorded memory requests to all memory requests. There is the perf mem tool, or you may try some PEBS-enabled events of the memory subsystem. PEBS is enabled by adding a :p or :pp suffix to the perf event specifier, as in perf record -e event:pp, where event is one of the PEBS events. Also try pmu-tools' ocperf.py for easier Intel event-name encoding and to find PEBS-enabled events.
Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on memory performance tests; a sketch of such a comparison follows the list below. Check the worst case of memory-recording overhead at the left end of the Arithmetic Intensity scale of the [Roofline model](https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/). Typical tests from that part of the scale are STREAM (BLAS1), RandomAccess (GUPS) and memory-latency tests (close to SpMV); many real tasks are usually not so far left on the scale:
STREAM test (linear access to memory),
RandomAccess (GUPS) test
some memory latency test (memlat of 7z, lat_mem_rd of lmbench).
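As a rough way to compare those recording ratios, the overhead measurement mentioned above could be scripted like this — a sketch, not a turnkey tool: it assumes a Linux perf with PEBS support, and ./my_benchmark plus the period values are placeholders to replace with your own workload:

```python
# Sketch: run a workload under perf with a PEBS memory event at different
# sampling periods and compare wall-clock time against an untraced run.
import subprocess
import time

WORKLOAD = ["./my_benchmark"]            # placeholder workload command

def timed(cmd):
    start = time.time()
    subprocess.run(cmd, check=True)      # raises if perf or the workload fails
    return time.time() - start

if __name__ == "__main__":
    base = timed(WORKLOAD)
    print("no tracing: %.2f s" % base)
    for period in (10000, 1000, 100):    # smaller period => more samples recorded
        cmd = ["perf", "record", "-e", "cpu/mem-loads/pp", "-c", str(period),
               "-o", "perf.%d.data" % period, "--"] + WORKLOAD
        t = timed(cmd)
        print("period %d: %.2f s (overhead %.0f%%)" % (period, t, 100 * (t / base - 1)))
```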
Do you want to trace every load/store instruction, or do you only want to record requests that missed all (or some) caches and were sent to the main RAM of the PC (or to L3)?
Why do you expect no overhead with all memory accesses recorded? That is simply impossible, as every memory access generates several bytes of trace (the memory address, and sometimes the instruction address) that must be recorded to that same memory. So enabling memory tracing (tracing more than 10% of memory accesses) will clearly limit the available memory bandwidth, and the program will run slower. Even 1% tracing can be noticed, but its effect (overhead) is smaller.
Your CPU, the E5-2620 v4, is Broadwell-EP (14 nm), so it may also have an early variant of Intel PT: https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt https://github.com/01org/processor-trace and especially Andi Kleen's blog on PT: http://halobates.de/blog/p/410, "Cheat sheet for Intel Processor Trace with Linux perf and gdb", which notes for this generation:
PT support in hardware: Broadwell (5th generation Core, Xeon v4). More overhead. No fine-grained timing.
PS: Researchers who study SPEC CPU memory behaviour work with memory access dumps/traces, and those dumps are generated slowly:
http://www.bu.edu/barc2015/abstracts/Karsli_BARC_2015.pdf - LLC misses recorded for offline analysis; no timing was taken from the tracing runs.
http://users.ece.utexas.edu/~ljohn/teaching/382m-15/reading/gove.pdf - all loads/stores instrumented by writing into an additional huge tracing buffer, with periodic (rare) online aggregation. Such instrumentation is 2x slower or worse, especially for memory bandwidth/latency limited code.
http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf (by Aamer Jaleel of Intel Corporation, VSSAD) - Pin-based instrumentation: program code was modified and instrumented to write memory access metadata into a buffer. Such instrumentation is 2x slower or worse, especially for memory bandwidth/latency limited code. The paper lists and explains the instrumentation overhead and caveats:
Instrumentation Overhead: Instrumentation involves injecting extra code dynamically or statically into the target application. The additional code causes an application to spend extra time in executing the original application ... Additionally, for multi-threaded applications, instrumentation can modify the ordering of instructions executed between different threads of the application. As a result, IDS with multi-threaded applications comes at the lack of some fidelity.
Lack of Speculation: Instrumentation only observes instructions executed on the correct path of execution. As a result, IDS may not be able to support wrong-path ...
User-level Traffic Only: Current binary instrumentation tools only support user-level instrumentation. Thus, applications that are kernel intensive are unsuitable for user-level IDS.
I have to run some performance tests, to see how my programs work when the system runs out of RAM and the system starts thrashing. Ideally, I would be able to change the amount of RAM used by the system.
I have tried to boot my system (running Ubuntu 10.10) in single-user mode with a limited amount of physical memory, but with the parameters I used (max_addr=300M, max_addr=314572800 or mem=300M) the system did not use my swap partition.
Is there a way to limit the amount of RAM used by the total system, while still using swap space?
The point is to measure the total running time of each program as a function of the input size. I am not trying to pinpoint performance problems, I am trying to compare algorithms, which means I need accuracy.
Write a simple C program which:
Allocates a large amount of memory.
Keeps accessing the allocated memory at random (in an infinite loop) to try to keep it resident in main memory.
Now run this program (one or a few processes) so that you allocate enough memory to cause thrashing of the process you are testing; a rough sketch is shown below.
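The answer describes a small C program; a roughly equivalent sketch of the same memory-hog idea in Python (SIZE_MB is a placeholder you would tune to your machine's RAM) looks like this:

```python
# Sketch: allocate a large buffer and touch random pages forever,
# so the buffer stays hot and pushes other processes toward swap.
import random

SIZE_MB = 2048                       # placeholder: size it relative to physical RAM
PAGE = 4096

buf = bytearray(SIZE_MB * 1024 * 1024)
pages = len(buf) // PAGE

print("allocated %d MB, touching random pages (Ctrl-C to stop)" % SIZE_MB)
while True:
    i = random.randrange(pages) * PAGE
    buf[i] = (buf[i] + 1) & 0xFF     # write keeps the page dirty and resident
```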
Is there a way to record a process's memory footprint, in such a way that we still have access to it after the process has finished?
The typical way I check memory footprint is this:
$ cat /proc/PID/status
But this no longer exists once the process has finished.
you can do something like:
watch 'grep VmSize /proc/PID/status >> log'
When the program ends, you'll have a list of memory footprints over time in log.
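If you want timestamps and a log that survives the process, the same idea can be sketched as a small polling script (the VmSize/VmRSS field names come from /proc/PID/status; the log filename and 1-second interval are arbitrary):

```python
# Sketch: poll /proc/<pid>/status once a second and append timestamped
# VmSize/VmRSS lines to a log file that outlives the process.
import sys
import time

def log_footprint(pid, logfile="footprint.log", interval=1.0):
    path = "/proc/%d/status" % pid
    with open(logfile, "a") as log:
        while True:
            try:
                with open(path) as f:
                    fields = dict(line.split(":", 1) for line in f if ":" in line)
            except FileNotFoundError:          # process has exited
                break
            log.write("%s VmSize:%s VmRSS:%s\n" % (
                time.strftime("%H:%M:%S"),
                fields.get("VmSize", "?").strip(),
                fields.get("VmRSS", "?").strip()))
            log.flush()
            time.sleep(interval)

if __name__ == "__main__":
    log_footprint(int(sys.argv[1]))            # usage: python log_footprint.py PID
```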
Valgrind has a memory profiler called Massif that provides detailed information about the memory usage of your program:
Massif is a heap profiler. It performs detailed heap profiling by taking regular snapshots of a program's heap. It produces a graph showing heap usage over time, including information about which parts of the program are responsible for the most memory allocations. The graph is supplemented by a text or HTML file that includes more information for determining where the most memory is being allocated. Massif runs programs about 20x slower than normal.
You can record it using munin + a custom plugin.
This will allow you to monitor and save the needed process information, and graph it, easily.
Here's a related answer I gave at serverfault.com