Data corruption when memcpy from DMA - Linux

EDIT:
Not sure whether this is a problem, but here is something I've noticed:
The rx_skbuff entries are allocated in two places: once when the driver is initialized, via __netdev_alloc_skb_ip_align with GFP_KERNEL, and again when an rx_skbuff has already been freed, via netdev_alloc_skb_ip_align (which internally uses GFP_ATOMIC).
Shouldn't these skb allocations use GFP_DMA?
==========================================================================
I'm having issues with data corruption in an Ethernet driver (for the ST MAC 10/100/1000) I'm working with. The driver runs on an Allwinner A20 (ARM Cortex-A7).
Some details:
The driver holds a ring of Rx sk_buffs (allocated with __netdev_alloc_skb_ip_align).
The data (rx_skbuff[i]->data) of each sk_buff is mapped for DMA using dma_map_single.
The mapping succeeds (verified with dma_mapping_error); a sketch of this step follows.
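For reference, here is a rough sketch of what that allocation and mapping step looks like. This is my own illustrative fragment (the structures and names such as stmmac_priv, rx_skbuff_dma and buf_sz are assumptions based on the driver), not the driver's exact code:

/* Sketch only: names are assumed, not the driver's exact code. */
static int example_map_rx_buffer(struct stmmac_priv *priv,
                                 struct stmmac_rx_queue *rx_q,
                                 unsigned int i, unsigned int buf_sz)
{
        struct sk_buff *skb;

        skb = __netdev_alloc_skb_ip_align(priv->dev, buf_sz, GFP_KERNEL);
        if (!skb)
                return -ENOMEM;

        rx_q->rx_skbuff[i] = skb;
        rx_q->rx_skbuff_dma[i] = dma_map_single(priv->device, skb->data,
                                                buf_sz, DMA_FROM_DEVICE);
        if (dma_mapping_error(priv->device, rx_q->rx_skbuff_dma[i])) {
                dev_err(priv->device, "Rx DMA map failed\n");
                dev_kfree_skb_any(skb);
                rx_q->rx_skbuff[i] = NULL;
                return -ENOMEM;
        }
        return 0;
}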
The problem:
After a while (minutes, hours, days... very random), the kernel panics due to data corruption.
Debugging (EDITED):
After digging a bit more, I found that sometimes, after a while, one of the sk_buff structures gets corrupted, which can lead the code to do things it should not and thus cause the kernel to panic.
After some more digging, I found that the corruption occurs after skb_copy_to_linear_data (which is essentially a memcpy). Keep in mind that the corruption doesn't occur after every call to skb_copy_to_linear_data, but when it does occur, it is always right after a call to skb_copy_to_linear_data.
When the corruption occurs, it doesn't hit the rx_q->rx_skbuff of the current entry (rx_q->rx_skbuff[entry]). For example, if we perform the skb_copy_to_linear_data on rx_q->rx_skbuff[X], the corrupted sk_buff structure will be rx_q->rx_skbuff[Y] (where X != Y).
It seems that the physical address of the skb->data that was allocated right before the skb_copy_to_linear_data call is the same as the physical address of rx_q->rx_skbuff[Y]->end. My first thought was that maybe the driver doesn't know rx_q->rx_skbuff[Y] has been released, but when this collision occurs I see that rx_q->rx_skbuff[Y]->users is 1.
How could that be? Any ideas?
Code:
Here is part of the code where the corruption occurs.
The full driver code is from Linux kernel mainline 4.19, and it can be found here.
I've pasted here only the part between lines 3451-3474.
Does anyone see incorrect behavior here regarding the use of the DMA API?
skb = netdev_alloc_skb_ip_align(priv->dev, frame_len);
if (unlikely(!skb)) {
        if (net_ratelimit())
                dev_warn(priv->device, "packet dropped\n");
        priv->dev->stats.rx_dropped++;
        continue;
}

dma_sync_single_for_cpu(priv->device, rx_q->rx_skbuff_dma[entry],
                        frame_len, DMA_FROM_DEVICE);

// Here I check if data has been corrupted (the answer is ALWAYS NO).
debug_check_data_corruption();

skb_copy_to_linear_data(skb, rx_q->rx_skbuff[entry]->data, frame_len);

// Here I check again if data has been corrupted (the answer is SOMETIMES YES).
debug_check_data_corruption();

skb_put(skb, frame_len);

dma_sync_single_for_device(priv->device, rx_q->rx_skbuff_dma[entry],
                           frame_len, DMA_FROM_DEVICE);
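For comparison, the usual streaming-DMA lifecycle for a single Rx buffer is sketched below (my own fragment with assumed names, not taken from the driver); the key point is that every sync call must refer to memory that is still mapped with dma_map_single:

/* Expected streaming-DMA lifecycle for one Rx buffer (sketch, assumed names). */

/* After allocating the skb: map it once. */
dma = dma_map_single(priv->device, skb->data, buf_sz, DMA_FROM_DEVICE);

/* Before the CPU touches data the device has written: */
dma_sync_single_for_cpu(priv->device, dma, buf_sz, DMA_FROM_DEVICE);
/* ... skb_copy_to_linear_data() / other CPU reads happen here ... */

/* Before handing the buffer back to the device: */
dma_sync_single_for_device(priv->device, dma, buf_sz, DMA_FROM_DEVICE);

/* When the buffer is retired for good: */
dma_unmap_single(priv->device, dma, buf_sz, DMA_FROM_DEVICE);
dev_kfree_skb(skb);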
Some last notes:
I tried running the kernel with CONFIG_DMA_API_DEBUG enabled. It's not always triggered, but when I catch the corruption myself (with my debug function), sometimes I can see that /sys/kernel/debug/dma-api/num_errors has increased, and sometimes I also get this log: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x000000006879f902] [size=61 bytes]
I've also enabled CONFIG_DEBUG_KMEMLEAK, and right after I catch the data corruption event I get this log: kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak). I still don't understand what clues it gives, although it seems to point to the same part of code I've pasted here (__netdev_alloc_skb is called from __netdev_alloc_skb_ip_align). This is what /sys/kernel/debug/kmemleak displays:
unreferenced object 0xe9ea52c0 (size 192):
comm "softirq", pid 0, jiffies 6171209 (age 32709.360s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 40 4d 2d ea ............#M-.
00 00 00 00 00 00 00 00 d4 83 7c b3 7a 87 7c b3 ..........|.z.|.
backtrace:
[<045ac811>] __netdev_alloc_skb+0x9f/0xdc
[<4f2b009a>] stmmac_napi_poll+0x89b/0xfc4
[<1dd85c70>] net_rx_action+0xd3/0x28c
[<1c60fabb>] __do_softirq+0xd5/0x27c
[<9e007b1d>] irq_exit+0x8f/0xc0
[<beb36a07>] __handle_domain_irq+0x49/0x84
[<67c17c88>] gic_handle_irq+0x39/0x68
[<e8f5dc30>] __irq_svc+0x65/0x94
[<075bc7c7>] down_read+0x8/0x3c
[<075bc7c7>] down_read+0x8/0x3c
[<790c6556>] get_user_pages_unlocked+0x49/0x13c
[<544d56e3>] get_futex_key+0x77/0x2e0
[<1fd5d0e9>] futex_wait_setup+0x3f/0x144
[<8bc86dff>] futex_wait+0xa1/0x198
[<b362fbc0>] do_futex+0xd3/0x9a8
[<46f336be>] sys_futex+0xcd/0x138

Related

Interpret AVRCP packets

After some mucking about, I have a pybluez script that connects to an AVRCP profile on various devices and reads the responses.
Code snippet:
addr="e2:8b:8e:89:6c:07" #S530 white
port=23
if (port>0):
print("Attempting to connect to L2CAP port ",port)
socket=bluetooth.BluetoothSocket(bluetooth.L2CAP);
socket.connect((addr,port))
print("Connected.")
while True:
print("Waiting on read:")
data=socket.recv(1024)
for b in data:
print("%02x"%b,end=" ")
print()
socket.close()
The results I'm getting when I press the button on the earpiece are as follows:
Attempting to connect to L2CAP port 23
Connected.
Waiting on read:
10 11 0e 01 48 00 00 19 58 10 00 00 01 03
Waiting on read:
20 11 0e 00 48 7c 44 00
Waiting on read:
30 11 0e 00 48 7c 46 00
Waiting on read:
40 11 0e 00 48 7c 44 00
After careful reading of the spec, it looks like I'm seeing PASSTHROUGH commands, with 44 being the "PLAY" operation command, and 46 being "PAUSE" (I think)
I don't know what the 10 11 0e means, apart from the fact that the first byte appears to be some sort of sequence number.
My issue is threefold:
1. I don't know where to find a list of valid operation_ids. It's mentioned in the spec but not defined apart from a few random examples.
2. The spec makes reference to subunit type and Id (which would be the 48 in the above example), again without defining them AFAICT.
3. There is no mention of what the leading three bytes are. They may even be part of L2CAP and nothing directly to do with AVRCP; I'm not familiar enough with pybluez to tell.
Any assistance in any of the above points would be helpful.
Edit: For reference, the details of the AVRCP spec appear to be here: https://www.bluetooth.org/docman/handlers/DownloadDoc.ashx?doc_id=119996
The real answer is that the specification document assumes you have read other specification documents.
The three header bytes are part of the AVCTP transport layer:
http://www.cs.bilkent.edu.tr/~korpe/lab/resources/AVCTP%20Spec%20v1_0.pdf
In short:
Byte 0, bits 7..4: incrementing transaction id, 0x01 to 0x0f
Byte 0, bits 3..2: packet type (00 = self-contained packet)
Byte 0, bit 1: 0 = request, 1 = response
Byte 0, bit 0: 0 = PID recognized, 1 = PID error
Bytes 1-2: 2-byte big-endian profile id (in this case 110e, AVRCP)
The rest is described in the AVRCP profile doc, https://www.bluetooth.org/docman/handlers/DownloadDoc.ashx?doc_id=119996
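To make the byte layout concrete, here is a small illustrative C program (my own, not from the answer) that decodes the three AVCTP header bytes of a received packet such as "10 11 0e ...":

#include <stdint.h>
#include <stdio.h>

/* Decode the 3-byte AVCTP header that precedes the AVRCP (AV/C) frame,
 * using the field layout quoted above. */
static void decode_avctp_header(const uint8_t *pkt, size_t len)
{
    if (len < 3)
        return;
    uint8_t transaction = pkt[0] >> 4;            /* bits 7..4 */
    uint8_t packet_type = (pkt[0] >> 2) & 0x3;    /* 00 = self-contained */
    uint8_t is_response = (pkt[0] >> 1) & 0x1;    /* 0 = request, 1 = response */
    uint8_t pid_error   = pkt[0] & 0x1;           /* 1 = PID error */
    uint16_t profile_id = (pkt[1] << 8) | pkt[2]; /* 0x110e = AVRCP */

    printf("txn=%u type=%u %s pid_err=%u profile=0x%04x\n",
           transaction, packet_type,
           is_response ? "response" : "request", pid_error, profile_id);
}

int main(void)
{
    /* First bytes of the first packet shown in the question. */
    const uint8_t pkt[] = { 0x10, 0x11, 0x0e, 0x01, 0x48, 0x00 };
    decode_avctp_header(pkt, sizeof(pkt));
    return 0;
}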
I don't find the documentation to be amazingly clear.
I have provided a sample application which seems to work for most of the AVRCP devices I have been able to test:
https://github.com/rjmatthews62/BtAVRCP

Reading from USB device through HIDAPI on Linux sometimes results in missing data

I am currently porting code that uses a USB device from Windows to Linux.
I've thoroughly tested the original application and I'm pretty sure that the device works well. I implemented the USB interface on Linux using hidapi-libusb and there are times when the returned data from the device is missing at least a byte.
Once it happens, all the returned values are missing that much data. I more or less have to disconnect and reconnect the USB device in order to make it return data correctly. I'm starting to think that maybe the first byte is sometimes returned as 00 and Linux ignores it. It usually occurs on successive reads.
For example:
I send a "get register state" command and expect 10 data packets to be available for the USB read. Byte 5 is the packet number of the data.
Expected:
00 00 01 02 00 08 42 (Data 8)
00 00 01 02 00 09 42 (Data 9)
Actual:
00 00 01 02 00 08 42 (Data 8)
00 00 02 00 09 42 ab (Data 9)
Data 9's packet number becomes wrong because it is missing a byte. I've tried changing to hidapi-hidraw, and it happens significantly less. I've checked the hexdump of the hidraw of the device (/dev/hidraw0), and it is consistent with the data I am getting in my application. I've tried using memory leak detection tools and no leaks/corruption is detected.
Is this a Linux problem (3.2.0-4-amd64) or is it possibly the device?
The pseudo code of my application is just the following (a sketch using the hidapi C API is shown after the list):
1. Initialize HIDAPI and device-related code
2. Connect to the device using HIDAPI
3. Write USB command
4. Read USB response (done multiple times if the write expects multiple data packets)
5. Parse data
6. Repeat 3 and 4 until all commands are performed
7. Free memory and close HIDAPI.
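A minimal sketch of that loop using the hidapi C API; the VID/PID, report sizes, opcode and timeout below are hypothetical placeholders, not the asker's actual values, and the include path may differ per distribution:

#include <stdio.h>
#include <hidapi/hidapi.h>

int main(void)
{
    unsigned char cmd[65] = {0};   /* byte 0 = report ID (0 if unnumbered reports) */
    unsigned char buf[64];
    hid_device *dev;

    if (hid_init())
        return 1;

    dev = hid_open(0x1234, 0x5678, NULL);   /* hypothetical VID/PID */
    if (!dev) {
        fprintf(stderr, "device not found\n");
        return 1;
    }

    cmd[1] = 0x01;                          /* hypothetical "get register state" opcode */
    if (hid_write(dev, cmd, sizeof(cmd)) < 0)
        fprintf(stderr, "write failed\n");

    /* Read the expected number of reports; returns -1 on error, 0 on timeout. */
    for (int i = 0; i < 10; i++) {
        int n = hid_read_timeout(dev, buf, sizeof(buf), 1000 /* ms */);
        if (n <= 0)
            break;
        /* ... parse the n bytes in buf here ... */
    }

    hid_close(dev);
    hid_exit();
    return 0;
}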
Things I've tried:
Ensuring there is no delay between reads and writes
Adding flushing of read data before writing (this sometimes catches stray data)
Adding a really long timeout (five seconds) to the flushing of read data - this significantly reduces the problem, at a big cost

Could I get memory access information through X86_64 machine code?

I want to gather statistics on the memory bytes accessed by programs running on Linux (x86_64 architecture). I use the perf tool to dump a file like this:
: ffffffff81484700 <load2+0x484700>:
2.86 : ffffffff8148473b: 41 8b 57 04 mov 0x4(%r15),%edx
5.71 : ffffffff81484800: 65 8b 3c 25 1c b0 00 mov %gs:0xb01c,%edi
22.86 : ffffffff814848a0: 42 8b b4 39 80 00 00 mov 0x80(%rcx,%r15,1),%esi
25.71 : ffffffff814848d8: 42 8b b4 39 80 00 00 mov 0x80(%rcx,%r15,1),%esi
2.86 : ffffffff81484947: 80 bb b0 00 00 00 00 cmpb $0x0,0xb0(%rbx)
2.86 : ffffffff81484954: 83 bb 88 03 00 00 01 cmpl $0x1,0x388(%rbx)
5.71 : ffffffff81484978: 80 79 40 00 cmpb $0x0,0x40(%rcx)
2.86 : ffffffff8148497e: 48 8b 7c 24 08 mov 0x8(%rsp),%rdi
5.71 : ffffffff8148499b: 8b 71 34 mov 0x34(%rcx),%esi
5.71 : ffffffff814849a4: 0f af 34 24 imul (%rsp),%esi
My current method is to analyze the file and collect all memory-access instructions, such as mov, cmp, etc., and then add up the bytes accessed by each instruction; for example, mov 0x4(%r15),%edx adds 4 bytes.
I want to know whether it is possible to calculate this from the machine code itself, e.g. by analyzing "41 8b 57 04" I could also conclude 4 bytes. Because I am not familiar with x86_64 machine code, could anyone give me any clues? Or is there a better way to do these statistics? Thanks in advance!
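To make the "41 8b 57 04" example concrete, here is a toy sketch (my own) that decodes only this one pattern - an optional REX prefix, opcode 0x8B (mov r32/r64, r/m32/r64) and a disp8 ModRM form - and reports the access size; a real tool would have to cover the full instruction set, as the answer below explains:

#include <stdint.h>
#include <stdio.h>

/* Toy decoder: handles only "REX? 8B /r" with mod=01 (disp8), e.g. 41 8b 57 04.
 * Access size is 8 bytes with REX.W set, otherwise 4 (the 0x66 prefix is ignored). */
static int mov_load_access_size(const uint8_t *code)
{
    int i = 0, rex_w = 0;

    if ((code[i] & 0xF0) == 0x40) {          /* REX prefix, 0x40-0x4F */
        rex_w = (code[i] >> 3) & 1;          /* REX.W selects 64-bit operand */
        i++;
    }
    if (code[i] != 0x8B)                     /* MOV r32/r64, r/m32/r64 */
        return -1;
    i++;
    uint8_t modrm = code[i];
    if ((modrm >> 6) != 0x1)                 /* only the [reg+disp8] form here */
        return -1;
    return rex_w ? 8 : 4;
}

int main(void)
{
    const uint8_t insn[] = { 0x41, 0x8b, 0x57, 0x04 };  /* mov 0x4(%r15),%edx */
    printf("memory bytes accessed: %d\n", mov_load_access_size(insn));
    return 0;
}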
See https://stackoverflow.com/a/20319753/120163 for information about decoding Intel instructions; in fact, you really need to refer to the Intel reference manuals (http://download.intel.com/design/intarch/manuals/24319101.pdf). If you only want to do this manually for a few instructions, you can just look up the data in these manuals.
If you want to automate the computation of instruction total-memory-access, you will need a function that maps instructions to the amount of data accessed. Since the instruction set is complex, the corresponding function will be complex and take you a long time to write from scratch.
My SO answer https://stackoverflow.com/a/23843450/120163 provides C code that maps x86-32 instructions to their length, given a buffer that contains a block of binary code. Such code is necessary if one is to start at some point in the object code buffer and simply enumerate the instructions that are being used. (This code has been used in production; it is pretty solid.) This routine was built basically by reading the Intel reference manual very carefully. For OP, this would have to be extended to x86-64, which shouldn't be very hard; mostly you have to account for the extended-register (REX) prefix byte and some differences from x86-32.
To solve OP's problem, one would also modify this routine to also return the number of byte reads by each individual instruction. This latter data also has to be extracted by careful inspection from the Intel reference manuals.
OP also has to worry about where he gets the object code from; if he doesn't run this routine in the address space of the object code itself, he will need to somehow get the object code from the .exe file. For that, he needs to build or run the equivalent of the Windows loader, and I'll bet that has a bunch of dark corners. Check out the format of object code files.

When a program exits and unloads my .dll, my worker threads terminate before I can free their memory

I have a .dll that is statically linked to the MFC. During normal use of the .dll, it will create a worker thread using AfxBeginThread. Inside the function for that thread I create two arrays:
CByteArray ReadBuffer;
ReadBuffer.SetSize(92);
CByteArray PacketBuffer;
PacketBuffer.SetSize(46);
Those buffers will change size (typically to much larger) during program execution. The problem is, when the program exits, the function for the thread seems to terminate without ever getting a chance to free the memory allocated by those arrays. I have an ExitInstance() in the .dll overloaded to do other cleanup, but by the time it is reached I already get these messages:
The thread 'Win32 Thread' (0x2208) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x1e34) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x1ff8) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x1fc0) has exited with code 0 (0x0).
These show the threads are being terminated in the middle of execution without being given time to call any destructors or do any cleanup.
I tried to create my own CWinThread object with an overloaded ExitInstance() function, but again those threads exit before that function is called.
After the .dll's close function is called and the memory is cleaned up, I get this:
Detected memory leaks!
Dumping objects ->
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\array_b.cpp(110) : {258} normal block at 0x023686E8, 2048 bytes long.
Data: < > C6 F9 1D C0 E2 0C 00 00 00 00 AA 00 C0 11 E0 11
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\array_b.cpp(110) : {201} normal block at 0x02366F00, 64 bytes long.
Data: <b # > 62 F9 1D C0 81 09 00 00 00 00 7F 00 40 11 E0 10
Object dump complete.
This shows leaks caused by the ReadBuffer and PacketBuffer from above. I can't find a way to properly close the threads and clean up their memory before the program exits. I have a way to gracefully terminate the threads that I could use, but I can't find a spot to execute it before the program terminates.
I am not sure that this is even a big issue, since the program is terminating, but I've always thought .dlls should clean up all of their own memory just to be safe.
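For what it's worth, a common Win32 pattern is to signal the worker with an event and wait on the thread handle from the DLL's shutdown path, so the thread function returns normally and its local buffers are destroyed. The sketch below is my own illustration of that generic pattern (plain Win32, not MFC), not the asker's existing mechanism:

#include <windows.h>

static HANDLE g_stop_event;   /* signaled when the worker should exit */
static HANDLE g_thread;

static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;
    /* ... allocate working buffers here ... */
    while (WaitForSingleObject(g_stop_event, 0) == WAIT_TIMEOUT) {
        /* ... do work ... */
        Sleep(10);
    }
    /* Buffers are freed / destructors run here, before the thread returns. */
    return 0;
}

static void start_worker(void)
{
    g_stop_event = CreateEvent(NULL, TRUE, FALSE, NULL);
    g_thread = CreateThread(NULL, 0, worker, NULL, 0, NULL);
}

/* Call this from the DLL's shutdown path, before the process tears threads down. */
static void stop_worker(void)
{
    SetEvent(g_stop_event);
    WaitForSingleObject(g_thread, 5000);   /* give the worker time to clean up */
    CloseHandle(g_thread);
    CloseHandle(g_stop_event);
}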

How does one reclaim zeroed blocks of a sparse file?

Consider a sparse file with 1s written to a portion of the file.
I want to reclaim the actual space on disk for these 1s as I no longer need that portion of the sparse file. The portion of the file containing these 1s should become a "hole" as it was before the 1s were themselves written.
To do this, I cleared the region to 0s. This does not reclaim the blocks on disk.
How do I actually make the sparse file, well, sparse again?
This question is similar to this one but there is no accepted answer for that question.
Consider the following sequence of events run on a stock Linux server:
$ cat /tmp/test.c
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
int main(int argc, char **argv) {
    int fd;
    char c[1024];

    memset(c, argc == 1, 1024);
    fd = open("test", O_CREAT | O_WRONLY, 0777);
    lseek(fd, 10000, SEEK_SET);
    write(fd, c, 1024);
    close(fd);
    return 0;
}
$ gcc -o /tmp/test /tmp/test.c
$ /tmp/test
$ hexdump -C ./test
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00002710 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 |................|
*
00002b10
$ du -B1 test; du -B1 --apparent-size test
4096 test
11024 test
$ /tmp/test clear
$ hexdump -C ./test
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00002b10
$ du -B1 test; du -B1 --apparent-size test
4096 test
11024 test
# NO CHANGE IN SIZE.... HMM....
EDIT -
Let me further qualify that I don't want to rewrite files, copy files, etc. If it is not possible to somehow free previously allocated blocks in situ, so be it, but I'd like to determine whether that is actually possible or not. It seems like "no, it is not" at this point. I suppose I'm looking for a sys_punchhole for Linux (discussions of which I just stumbled upon).
It appears that Linux has added a syscall called fallocate for "punching holes" in files. The implementations in individual filesystems seem to focus on the ability to use this for pre-allocating a larger contiguous run of blocks.
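For illustration, punching a hole over the region written by the test program above would look roughly like this, assuming a kernel and filesystem that support FALLOC_FL_PUNCH_HOLE:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("test", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Deallocate the 1024 bytes written at offset 10000; KEEP_SIZE leaves
     * the apparent file size unchanged. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 10000, 1024) < 0)
        perror("fallocate");
    close(fd);
    return 0;
}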
There is also the posix_fallocate call, which focuses only on the latter and is not usable for hole punching.
Right now it appears that only NTFS supports hole-punching. This has been historically a problem across most filesystems. POSIX as far as I know, does not define an OS interface to punch holes, so none of the standard Linux filesystems have support for it. NetApp supports hole punching through Windows in its WAFL filesystem. There is a nice blog post about this here.
For your problem, as others have indicated, the only solution is to move the file, leaving out blocks containing zeroes. Yeah, it's going to be slow. Or write an extension for your filesystem on Linux that does this and submit a patch to the good folks in the Linux kernel team. ;)
Edit: Looks like XFS supports hole-punching. Check this thread.
Another really twisted option can be to use a filesystem debugger to go and punch holes in all indirect blocks which point to zeroed out blocks in your file (maybe you can script that). Then run fsck which will correct all associated block counts, collect all orphaned blocks (the zeroed out ones) and put them in the lost+found directory (you can delete them to reclaim space) and correct other properties in the filesystem. Scary, huh?
Disclaimer: Do this at your own risk. I am not responsible for any data loss you incur. ;)
Ron Yorston offers several solutions, but they all involve either mounting the FS read-only (or unmounting it) while the sparsifying takes place, or making a new sparse file, copying across those chunks of the original that aren't just 0s, and then replacing the original file with the newly sparsified file.
It really depends on your filesystem though. We've already seen that NTFS handles this. I imagine that any of the other filesystems Wikipedia lists as handling transparent compression would do exactly the same - this is, after all, equivalent to transparently compressing the file.
After you have "zeroed" some region of the file you have to tell to the file system that this new region is intended to be a sparse region. So in case of NTFS you have to call DeviceIoControl() for that region again. At least I do this way in my utility: "sparse_checker"
For me the bigger problem is how to unset the sparse region back :).
Regards
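To the best of my knowledge, the NTFS calls being referred to are FSCTL_SET_SPARSE followed by FSCTL_SET_ZERO_DATA over the region to deallocate; a rough sketch, with the offsets taken from the test file above:

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    DWORD ret;
    HANDLE h = CreateFileA("test", GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* Mark the file as sparse (no-op if it already is). */
    DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &ret, NULL);

    /* Deallocate the range [10000, 11024): NTFS frees the backing clusters. */
    FILE_ZERO_DATA_INFORMATION zd;
    zd.FileOffset.QuadPart = 10000;
    zd.BeyondFinalZero.QuadPart = 11024;
    if (!DeviceIoControl(h, FSCTL_SET_ZERO_DATA, &zd, sizeof(zd),
                         NULL, 0, &ret, NULL))
        fprintf(stderr, "FSCTL_SET_ZERO_DATA failed: %lu\n", GetLastError());

    CloseHandle(h);
    return 0;
}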
This way is cheap, but it works (a sketch follows the steps). :-P
1. Read in all the data past the hole you want, into memory (or another file, or whatever).
2. Truncate the file to the start of the hole (ftruncate is your friend).
3. Seek to the end of the hole.
4. Write the data back in.
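A minimal sketch of those steps, assuming the hole offsets are known and the tail of the file fits in memory:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: remove [hole_start, hole_end) from "test" by truncating and
 * rewriting the tail. Writing past EOF recreates the hole on most filesystems. */
int main(void)
{
    off_t hole_start = 10000, hole_end = 11024;   /* assumed offsets */
    int fd = open("test", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    off_t size = lseek(fd, 0, SEEK_END);
    off_t tail_len = size - hole_end;
    char *tail = NULL;

    if (tail_len > 0) {                            /* 1. read everything past the hole */
        tail = malloc(tail_len);
        if (pread(fd, tail, tail_len, hole_end) != tail_len) { perror("pread"); return 1; }
    }
    ftruncate(fd, hole_start);                     /* 2. truncate to the hole's start */
    if (tail_len > 0) {
        pwrite(fd, tail, tail_len, hole_end);      /* 3+4. write the tail back past the hole */
        free(tail);
    }
    close(fd);
    return 0;
}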
Unmount your filesystem and edit the filesystem directly with something like debugfs or fsck. Usually you need a driver for each filesystem used.
Seems like writing zeros (as in the referenced question) to the part you're done with is a logical thing to try. Here is a link to an MSDN page on NTFS sparse files that does just that to "release" the "unused" part. YMMV.
http://msdn.microsoft.com/en-us/library/ms810500.aspx
