What do the fields in the output from `malloc_info` mean?

I'm calling malloc_info(3) and getting XML output.
Everything I've found online says something along the lines of "this is subject to change and we're not gonna bother documenting it".
But that's not terribly useful for someone investigating a potential memory fragmentation issue.
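For reference, the output below came from a call along these lines (a minimal sketch; the throwaway allocation is hypothetical, just there so at least one freed chunk lands in a bin):

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *p = malloc(200);   /* hypothetical allocation so a bin gets populated */
    free(p);
    /* the options argument must be 0; the XML report is written to the stream */
    malloc_info(0, stdout);
    return 0;
}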
Here are some snippets:
<malloc version="1">
  <heap nr="0">
    <sizes>
      <size from="257" to="257" total="257" count="1"/>
    </sizes>
    <total type="fast" count="0" size="0"/>
    <total type="rest" count="1" size="257"/>
    <system type="current" size="303104"/>
    <system type="max" size="303104"/>
    <aspace type="total" size="303104"/>
    <aspace type="mprotect" size="303104"/>
  </heap>
  <heap nr="1">
    <sizes>
      <size from="17" to="32" total="96" count="3"/>
      <!-- etc. -->
      <size from="10609" to="10609" total="10609" count="1"/>
      <unsorted from="145" to="209" total="740" count="4"/>
    </sizes>
    <total type="fast" count="95" size="7328"/>
    <total type="rest" count="2633564" size="112589836"/>
    <system type="current" size="2032623616"/>
    <system type="max" size="2032947200"/>
    <aspace type="total" size="19451904"/>
    <aspace type="mprotect" size="19775488"/>
  </heap>
etc., etc., until...
<heap nr="27">
<sizes>
</sizes>
<total type="fast" count="0" size="0"/>
<total type="rest" count="0" size="0"/>
<system type="current" size="32768"/>
<system type="max" size="32768"/>
<aspace type="total" size="32768"/>
<aspace type="mprotect" size="32768"/>
</heap>
<total type="fast" count="4232" size="293312"/>
<total type="rest" count="22498068" size="1597097332"/>
<system type="current" size="17265770496"/>
<system type="max" size="17271173120"/>
<aspace type="total" size="491339776"/>
<aspace type="mprotect" size="496742400"/>
</malloc>
What does this all actually mean?
I'll put some of my own notes/thoughts in an answer, but I'd appreciate it if someone who knows more than I do would chip in.
ldd --version reports Ubuntu EGLIBC 2.19-0ubuntu6.13, which is the best guess I've got for the glibc version.
I've tagged this with eglibc because it's taken from Ubuntu 14.04, but it's probably relevant to glibc as well.

The diagram at http://core-analyzer.sourceforge.net/index_files/Page335.html is fairly useful.
Things I'm pretty sure of:
Heap 0 is the main arena.
Free blocks are kept in "bins", arranged by size, for quick re-use.
Heap 0 has a single free block, size 257 bytes.
I don't know what the relationship between current, max, total and mprotect is.
Other heaps are numbered starting at 1.
Heap 1:
A bunch of free blocks, of various sizes.
The values in fast and rest track the count and total size of the free blocks: fast.count + rest.count == SUM(size.count) and fast.size + rest.size == SUM(size.total). Heap 0 bears this out: 0 + 1 == 1 and 0 + 257 == 257.
current size = 2032623616 (~2GB)
total (currently used?) = 19451904 (~19MB)
Heap 27:
No free blocks?
32KiB allocated, 32KiB used?
Totals?
~17.3GB allocated?
~491MB used?
That last number doesn't look reliable; if I total up the used memory (through other means), I get ~3GiB.
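A cross-check worth trying (a sketch, not something I've validated): mallinfo(3) reports similar aggregates, but only for the main arena, and its fields are plain int, so the values can wrap around on multi-GiB heaps; that alone can make totals from these interfaces look unreliable.

#include <malloc.h>
#include <stdio.h>

int main(void)
{
    /* main arena only; the int fields can wrap around on large heaps */
    struct mallinfo mi = mallinfo();
    printf("arena    (bytes via sbrk): %d\n", mi.arena);
    printf("hblkhd   (bytes via mmap): %d\n", mi.hblkhd);
    printf("uordblks (bytes in use):   %d\n", mi.uordblks);
    printf("fordblks (bytes free):     %d\n", mi.fordblks);
    return 0;
}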

Related

Is RDMA/DMA performance impacted by 'Region 0' size

I am using the Mellanox ConnectX-6 NIC and the configuration is as shown below:
(lspci -vv)
a1:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
Subsystem: Mellanox Technologies Device 0028
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 182
NUMA node: 1
Region 0: Memory at a0000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at 9d000000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00 ...
I am measuring RDMA throughput between two systems by varying the chunk size for a total transfer of 10GB of data. (Both machines have the same Mellanox NIC.)
The results show that just past a chunk size of 32 MB (i.e. 33, 34, 35 MB, …), the throughput drops drastically, by around 50+ Gbps. (Normal speed for this NIC is 175-185 Gbps, so up to 32 MB I get those speeds, but at a 33 MB chunk size I get somewhere between 85 and 120 Gbps.)
So I would like to know whether the 32 MB prefetchable memory (Region 0) listed in the above configuration has any impact on RDMA throughput.

Load testing - decreased number of concurrent users and low session times

I use Tsung to do a load test. I recorded browser behavior with tsung-recorder and did not add anything to the XML file it saved.
XML:
<session name='rec20200313-1147' probability='100' type='ts_http'>
<request><http url='https://www.example.com/' version='1.1' method='GET'>
<http_header name='Accept' value='text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' />
<http_header name='Accept-Encoding' value='gzip, deflate' />
<http_header name='Accept-Language' value='en-US,en;q=0.5' /></http></request>
<thinktime random='true' value='10'/>
<request><http url='https://www.example.com/arama?aranan=example' version='1.1' method='GET'>
<http_header name='Accept' value='text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' />
<http_header name='Accept-Encoding' value='gzip, deflate' />
<http_header name='Accept-Language' value='en-US,en;q=0.5' /></http></request>
<thinktime random='true' value='17'/>
<request> <http url='/arama?search=loadtest&siralama=1' version='1.1' method='GET'>
<http_header name='Accept' value='text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' />
<http_header name='Accept-Encoding' value='gzip, deflate' />
<http_header name='Accept-Language' value='en-US,en;q=0.5' /></http></request>
</session>
When I tried this test with 300 bots, the number of users never rose above 200,000, and it also suddenly began to decrease.
Example:
I started testing.
20,000 (in 10 seconds)
80,000 (within 40 seconds)
170,000 (1 min)
50,000 (over a minute)
70,000
100,000
60,000
Test finished
I have tried many different scenarios, but I aimed to scale up to 1 million users. That's all I've observed on my side.
The administrators who own the application I tested said:
There is no blocking.
Your session duration does not take more than 5 seconds.
The test is over. Afterwards, when the logs were checked, they said that I got a TCP RST response.
I'm trying to understand where the error lies.
What did I miss in tsung-recorder? How can I emulate a real browser with tsung-recorder? Why did I get a TCP RST response? Is Tsung a suitable tool for what I want to do? What's wrong?
Note: I was also unsuccessful when I tried a plain HTTP GET attack.
A TCP reset occurs when an unexpected TCP packet arrives at the host.
One of the possible reasons is that the machine you're using as the load generator becomes overloaded, so make sure to set up monitoring to check whether Tsung has enough headroom in terms of CPU, RAM, network and disk IO.
If the machine where Tsung is running is overloaded, you will need to add more hosts and run Tsung in clustered mode.
With regards to real browser simulation, you might need to set up and maintain a unique session per virtual user; in the majority of cases this is implemented via cookies. Check out the How to make JMeter behave more like a real browser article for general recommendations; all of them can be translated to Tsung.

Using sar command results in wrong memory statistics on Fedora 22

I'm trying to monitor a few servers by gathering various information with sar. All the systems which should be monitored are currently running Fedora 22. Unfortunately, I'm not able to get correct memory readings.
> free:
total used free shared buff/cache available
Mem: 1017260 34788 150984 68 831488 816204
Swap: 524284 20960 503324
> sar -r 1:
kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
150996 866264 85.16 40 60784 169524 11.00 39572 31068 164
How does sar come up with those numbers? kbmemfree seems alright, and kbmemused also makes sense if you add used and buff/cache from free together. But kbbuffers and kbcached look way off; my assumption was kbmemused - kbbuffers - kbcached = used (as reported by free), but that doesn't add up.
Am I doing something wrong? I've been struggling with this issue for two days now and haven't been able to find any further information.
free from the procps tools adds Slab: from /proc/meminfo to its cached output; see kb_main_cached in proc/sysinfo.c.
So to get the equivalent figure from sar, you have to add together kbcached and kbslab from sar -r ALL 1.
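As a sanity check, here is a minimal C sketch of that arithmetic, assuming the procps behaviour described above; it reads /proc/meminfo and rebuilds free's cached figure as Cached + Slab:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long long val, cached = 0, slab = 0;

    if (!f) { perror("/proc/meminfo"); return 1; }
    while (fgets(line, sizeof line, f)) {
        /* values in /proc/meminfo are reported in kB */
        if (sscanf(line, "Cached: %lld", &val) == 1)
            cached = val;
        else if (sscanf(line, "Slab: %lld", &val) == 1)
            slab = val;
    }
    fclose(f);
    printf("free-style cached: %lld kB (Cached %lld kB + Slab %lld kB)\n",
           cached + slab, cached, slab);
    return 0;
}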

How to force kernel to re-read/re-initialize PCI device IDs?

My machine (running Linux kernel 3.2.38) on boot has wrong subsystem IDs (sub-device and sub-vendor IDs) of a PCI device. If I then physically unplug and re-plug the PCI device while the system is still up (i.e., hot-plug), it gets the correct IDs.
Note that the wrong sub-device and sub-vendor IDs it gets are the same as the device's device and vendor IDs (see the first two lines of the lspci output below).
Following is the output of lspci -vvnn before and after hot-plugging the device:
Before hot-plugging:
0b:0f.0 Bridge [0680]: Device [1a88:4d45] (rev 05)
Subsystem: Device [1a88:4d45]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 32 (250ns min, 63750ns max)
Interrupt: pin A routed to IRQ 10
Region 0: I/O ports at 2100 [size=256]
Region 1: I/O ports at 2000 [size=256]
Region 2: Memory at 92920000 (32-bit, non-prefetchable) [size=64]
After hot-plugging:
0b:0f.0 Bridge [0680]: Device [1a88:4d45] (rev 05)
Subsystem: Device [007d:5a14]
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 10
Region 0: I/O ports at 2100 [disabled] [size=256]
Region 1: I/O ports at 2000 [disabled] [size=256]
Region 2: [virtual] Memory at 92920000 (32-bit, non-prefetchable) [size=64]
My question: is there a way to get the IDs fixed without hot-plugging the device, e.g. by forcing the kernel to re-read the PCI device IDs via a PCI bus rescan/re-enumeration/re-configuration?
Any help would be highly appreciated. Thanks.
PS. Note that the problem isn't really related to the kernel/software, as it exists even if I boot into the UEFI internal shell.
PPS. The PCI device in this case is a MEN F206N, and "my machine" is a MEN F22P.
You can force a PCI rescan with:
# echo 1 > /sys/bus/pci/rescan
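If the stale IDs survive a plain rescan, removing just that device and rescanning should make the kernel re-read its config space (assuming PCI domain 0000 for the 0b:0f.0 address from the question):
# echo 1 > /sys/bus/pci/devices/0000:0b:0f.0/remove
# echo 1 > /sys/bus/pci/rescan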
A closer look at your lspci output before and after hot-plugging the device shows more differences than just the subsystem device/vendor IDs. I'd be surprised if the device functions as expected after hot-plugging.
Besides, forcing full PCI re-enumeration is not possible, primarily because there may be other devices that have already been enumerated correctly and are functioning. How would you expect re-enumeration to deal with those? (And there are other reasons too.)
Prafulla

FPGA resource estimation

I'm estimating my resource usage by counting the number of flip-flops I need for a component. For example, when I estimate it for inst_controldata (a simple counter and some I/O), I count 32 flip-flops. When I look at the detailed map report for this component, Section 13 - Utilization by Hierarchy, I see that my estimate is close to the number of slice registers used for this component. Every slice has 4 LUTs and 8 flip-flops.
Now when I do the same for my finite state machine, inst_xtm640, I estimate my flip-flop usage at around 43 (including the 3 flip-flops needed to encode the 6 states). When I look at the map report, my estimate is more or less correct (+-10% error). But the number of slices needed is much higher than the slice registers and LUTs suggest: it's 40, whereas going by the 88 used LUTs it should only be around 22 (88 LUTs at 4 per slice).
Why are extra slices used for this component? Is it for speed optimization ?
+----------------------------------------------------------------------------------+
| Module | Partition | Slices* | Slice Reg | LUTs |
+----------------------------------------------------------------------------------+
| ++inst_controldata | | 6/6 | 35/35 | 20/20 |
| +++inst_xtm640 | | 40/40 | 57/57 | 88/88 |
+----------------------------------------------------------------------------------+
Edit:
I think I found it myself but other inputs are always welcome:
Not all slices are fully used, so using 88 LUTs and 57 slice registers doesn't mean I end up with only 22 slices. Some slices are only partially occupied, which drives the slice count up, and extra slices are also used to meet timing constraints.
You are indeed correct in your edit.
The synthesizer can be quite unpredictable, and your constraints and what you optimize for are key to the results you get.
If you were to optimize for size (area), your estimates might be closer.
The bigger and more complex your system is, the more this will affect your final place and route. This is why the results are closer for your small example.
This is also a good reason why you shouldn't "fill up" your FPGA.
The more space you have left, the more effectively the optimizer can close timing.
This is of course a "waste" of resources, but it might lead to better results.
One piece of advice I have received regarding FPGA design is not to go into so much detail that the synthesizer has no room to operate. E.g. don't build a flip-flop out of logic gates; use the abstractions of the language instead. As long as you are aware of how the results are reached, let the synthesizer do its job of working out what is necessary and how to optimize it; it usually does a better job.
\Paul
In the Xilinx toolchain, the build step in which LUTs and registers are assigned to specific slices is called "map". By default, and especially if your design occupies only a small amount of the device's resources, map will not try to pack as many LUTs and registers into a slice as it possibly could. This is why you're seeing a higher slice count than expected.
You can force map to pack slices more aggressively by setting the -c (Pack Slices) option to 1 (for example, map -c 1 -o mapped.ncd design.ngd design.pcf in an ISE command-line flow; the file names here are hypothetical):
-c [packfactor]
The packfactor (for non-zero values) is the target slice density percentage.
A packfactor value of 0 specifies that only related logic (logic having signals in common) should be packed into a single slice, and yields the least densely packed design.
A packfactor of 1 results in maximum packing density, as the packer is attempting 1% slice utilization.
A packfactor of 100 means that only enough unrelated packs will occur to fit the device with 100% utilization. This results in minimum packing density.
