I am developing a function to monitor CPU and memory.
By the way, are there any other ideas for what could be shown using just CPU and memory information?
Currently, there is one CPU and one memory trend graph, and I am thinking about adding a Z-score graph.
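A minimal sketch of the Z-score idea, for discussion: track a rolling mean and standard deviation of the usage metric and plot how far each new sample deviates. The 60-sample window, the |z| > 3 threshold, and the current_cpu_percent placeholder are assumptions, not requirements.

import statistics
from collections import deque

class ZScoreTracker:
    """Rolling Z-score of a usage metric (CPU % or memory %)."""

    def __init__(self, window=60):
        self.samples = deque(maxlen=window)   # e.g. the last 60 samples

    def update(self, value):
        """Record a new sample and return how many standard deviations it is from the recent mean."""
        self.samples.append(value)
        if len(self.samples) < 2:
            return 0.0
        mean = statistics.fmean(self.samples)
        stdev = statistics.pstdev(self.samples)
        return 0.0 if stdev == 0 else (value - mean) / stdev

# usage sketch:
# tracker = ZScoreTracker(window=60)
# on every sampling tick: z = tracker.update(current_cpu_percent)
# points with abs(z) > 3 could be highlighted on the Z-score graph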
I'm working on feature generation before I train a model in PyTorch. I wish to save my features as PyTorch tensors on disk for later use in training.
One of my features ("Feature A") is calculated on a CPU while another feature ("Feature B") must be calculated from that CPU on a GPU (some linear algebra stuff). I have an unusual limitation: on my university cluster, jobs which don't use GPUs have CPU memory limits of 1TB each while jobs which do use GPUs have CPU memory limits of 4GB with GPU memory limits of 48GB. Feature A and Feature B are each approximately 10GB.
Naturally, I want to first calculate Feature A using CPUs only then save Feature A to disk. In another job (this one with GPU access and thus the 4GB CPU memory limitation), I want to load Feature A directly to GPU, compute Feature B, then save Feature B to disk.
With Feature A computed and saved to disk, I've tried:
import torch

# even with map_location='cuda', the tensor is first deserialized on the CPU (see the docs quote below)
feaB = torch.load(feaAfile, map_location=torch.device('cuda'))
And yet I max out my CPU memory. I've confirmed CUDA is available.
In the PyTorch documentation I see that in loading tensors they "are first deserialized on the CPU..."
I wonder if there is any way to avoid using CPU memory when I only want to load onto the GPU? If the tensor must first pass through the CPU, could I use some sort of 4GB buffer? Thanks so much in advance.
EDIT: per discussion in the comments, I no longer need to do this. But the question itself, of loading a tensor to the GPU without using CPU memory, remains unanswered so I'm leaving this question up.
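Not a definitive answer, but one avenue that may be worth testing, assuming a recent PyTorch (roughly 2.1+) where torch.load accepts mmap=True and the file was saved with the default zipfile serialization; the chunk size below is an arbitrary placeholder.

import torch

# map the saved file instead of reading it all into RAM; pages are faulted in lazily
feaA = torch.load(feaAfile, map_location="cpu", mmap=True)

# copy to the GPU in bounded slices so the resident CPU working set stays small
feaA_gpu = torch.empty(feaA.shape, dtype=feaA.dtype, device="cuda")
chunk = 1_000_000                      # rows per copy; tune so each slice stays well under 4GB
for start in range(0, feaA.shape[0], chunk):
    feaA_gpu[start:start + chunk].copy_(feaA[start:start + chunk])

Whether the memory-mapped pages count against the cluster's job memory limit depends on how that limit is enforced, so this would need testing on the cluster itself.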
I'm trying to set up auto-scale on Azure to scale out when the 5-min average memory usage exceeds 90%.
Here's a 24-hour chart of 1-min memory usage:
I have a scale-out rule (from 1 to 2 instances) on the 5-min average memory, and a scale-in rule when the 5-min average memory is under 80%. Admittedly this is quite tight. However, the scale-in NEVER seems to fire; it's always prevented by 'flapping'. Surely, given the chart above, there would be several places where it could scale in? (I don't even see where a 5-min average would have triggered the scale-out, given the chart is a 1-min average.)
I've also discovered that scaling out based on memory percentage gets tricky with smaller instance sizes; that's because Azure fails to recognize that the greater part of the memory used is actually dedicated to the OS and other infrastructure...
You can find quite an exhaustive explanation here: https://medium.com/#jonfinerty/flapping-and-anti-flapping-dcba5ba92a05.
There seem to be no workarounds, and I'm considering letting it go (and only scaling out and in based on CPU).
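For what it's worth, here is a rough sketch of the anti-flapping check Azure appears to apply (simplified from the article above): before scaling in, it projects the metric onto the smaller instance count and blocks the scale-in if that projection would immediately re-trigger the scale-out rule.

def scale_in_allowed(avg_memory_pct, current_count, scale_out_threshold=90.0):
    # projected load per instance if one instance were removed
    projected = avg_memory_pct * current_count / (current_count - 1)
    return projected < scale_out_threshold

# With a 90% scale-out rule and 2 instances, any average memory above 45%
# projects to over 90% on the remaining instance, so the scale-in is suppressed:
print(scale_in_allowed(60.0, 2))   # False -> blocked as "flapping"
print(scale_in_allowed(40.0, 2))   # True  -> scale-in would be allowed

That would explain why, with baseline memory usage well above 45% (OS and infrastructure included), the scale-in never fires.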
Our team wishes to write a performance-measuring tool with a focus on GPGPU.
In order to understand whether a particular app is compute-bound or memory-bound, we would like to track the graphics card's memory accesses without being tied to any particular compute API.
Is it possible to track syscalls like read for this purpose?
I'm trying to classify a few parallel programs as compute-, memory-, or data-intensive. Can I classify them from the values obtained from performance counters, e.g. with perf? This command gives a couple of values, such as the number of page faults, which I think could be used to tell whether a program accesses memory frequently or not.
Is this approach correct and feasible? If not, can someone guide me in classifying programs into the respective categories?
Cheers,
Kris
Yes, you should in theory be able to do that with perf. However, I don't think page-fault events are the ones to observe if you want to analyse memory activity. For this purpose, on Intel processors you should use uncore events that allow you to count memory traffic (reads and writes separately). On my Westmere-EP, these counters are UNC_QMC_NORMAL_READS.ANY and UNC_QMC_WRITES_FULL.ANY.
The following article deals exactly with your problem (on Intel processors):
http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/ispass-2013_177.pdf
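A hedged sketch of how this could be wired up from a script: run the program under perf stat, sum the uncore read/write counts, and turn them into approximate DRAM traffic (one 64-byte cache line per transaction). The event names are the Westmere-EP ones quoted above; the exact spelling perf expects differs per CPU generation, so check perf list first.

import subprocess

EVENTS = "UNC_QMC_NORMAL_READS.ANY,UNC_QMC_WRITES_FULL.ANY"   # adjust to your CPU's uncore events
CACHE_LINE = 64                                               # bytes per memory transaction (typical)

def memory_traffic_bytes(cmd):
    # -x, makes perf stat print machine-readable CSV (value,unit,event,...) on stderr
    result = subprocess.run(["perf", "stat", "-x,", "-e", EVENTS, "--", *cmd],
                            capture_output=True, text=True)
    total = 0
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if fields and fields[0].strip().isdigit():
            total += int(fields[0])
    return total * CACHE_LINE

# traffic = memory_traffic_bytes(["./my_parallel_program"])
# Comparing that traffic against the program's operation count (roofline-style,
# as in the paper above) indicates whether it is memory- or compute-bound.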
How can I measure the cycles spent accessing a shared remote cache, say L3? I need this cache-access information both system-wide and per thread. Are there any specific tool/hardware requirements? Or can I use some formula to get an approximate value of the cycles spent over a time interval?
To get the average latencies (when a single thread is running) to the various caches present on your machine, you can use memory-profiling tools such as RMMA for Windows (http://cpu.rightmark.org/products/rmma.shtml) and Lmbench for Linux.
You can also write your own benchmarks based on the ideas used by these tools.
See the answers posted on this StackOverflow question:
measuring latencies of memory
Or Google for how the Lmbench benchmark works.
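On the "approximate formula" part of the question, here is a small sketch, assuming Lmbench's lat_mem_rd is installed and prints its usual size-in-MB / latency-in-ns pairs on stderr: convert the reported nanoseconds to cycles with cycles ≈ latency_ns × clock_GHz.

import subprocess

CPU_GHZ = 3.0   # your nominal core clock; turbo/DVFS makes this approximate

def print_latency_in_cycles(max_mb=64, stride=128):
    result = subprocess.run(["lat_mem_rd", str(max_mb), str(stride)],
                            capture_output=True, text=True)
    for line in result.stderr.splitlines():        # lat_mem_rd reports on stderr
        parts = line.split()
        if len(parts) == 2:
            try:
                size_mb, ns = float(parts[0]), float(parts[1])
            except ValueError:
                continue                           # skip non-numeric header lines
            print(f"{size_mb:8.3f} MB  ->  {ns:6.2f} ns  ~ {ns * CPU_GHZ:6.1f} cycles")

# Working sets that fit in L3 plateau at the L3 latency; larger ones show DRAM.
# This is single-threaded; multi-threaded effects on L3 still need your own
# benchmark, as noted below.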
If you want to find the exact latencies for particular memory-access patterns, you will need to use a simulator. This way you can trace a memory access as it flows through the memory system. However, simulators will not model all the effects that are present in a modern processor or memory system.
If you want to learn how multiple threads affect the average latency to L3, I think the best bet would be to write your own benchmark.