Mapping dmam_alloc_coherent-allocated memory to user space via remap_pfn_range gives a pointer to the wrong area of memory - linux

I am preparing an application running on the ARM core of an Intel Cyclone V SoC.
I need to map the DMA coherent memory buffer to the user space.
The buffer is allocated in the device driver with:
buf_addr = dmam_alloc_coherent(&pdev->dev, size, &dma_addr, GFP_KERNEL);
The allocation is done correctly, and I have verified that the buffer accessed by the hardware via the dma_addr HW address is visible to the kernel via the buf_addr pointer.
Then in the mmap function of the device driver I do:
unsigned long physical = virt_to_phys(buf_addr);
unsigned long vsize = vma->vm_end - vma->vm_start;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, vsize, vma->vm_page_prot);
The application mmaps the buffer with:
buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE, MAP_SHARED, dev_file, 0);
I do not get any error from remap_pfn_range function. Also the application is able to access the mmapped memory, but it is not the buffer allocated with dmam_alloc_coherent.
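For reference, the full user-space sequence looks roughly like this (a minimal sketch; the device node name and the error handling are my assumptions, not part of the original code):
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
static void *map_dma_buffer(const char *dev_fn, size_t buf_size)
{
    int fd = open(dev_fn, O_RDWR);  /* e.g. "/dev/my_dma" (hypothetical node) */
    if (fd < 0) {
        perror("open");
        return NULL;
    }
    /* Note the argument order: the flags (MAP_SHARED) come before the file
     * descriptor, and the offset is the last argument. */
    void *buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return NULL;
    }
    return buf;
}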

I have found the macro dma_mmap_coherent that seems to be dedicated particularly for that purpose.
I have verified that the following modification in the mmap function ensures proper operation:
unsigned long vsize = vma->vm_end - vma->vm_start;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
remap = dma_mmap_coherent(&my_pdev->dev, vma, fdata, dma_addr, vsize); /* the third argument is the CPU address returned by dmam_alloc_coherent() */
Because the pdev pointer is not directly available in the mmap function, it is passed from the probe function via the global variable my_pdev. In the case of a driver supporting multiple devices, it should instead be stored in the device context.
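Putting it together, a minimal sketch of the resulting mmap handler could look as follows (buf_addr, dma_addr, size and my_pdev are the ones introduced above; the size check is my addition, not part of the original code):
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long vsize = vma->vm_end - vma->vm_start;

    if (vsize > size)  /* size = length passed to dmam_alloc_coherent() */
        return -EINVAL;

    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    /* dma_mmap_coherent() knows how to find the pages backing a coherent
     * allocation on every platform, unlike the virt_to_phys() approach above. */
    return dma_mmap_coherent(&my_pdev->dev, vma, buf_addr, dma_addr, vsize);
}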

Related

How to read /proc/<pid>/pagemap in a kernel driver?

I am trying to read /proc/<pid>/pagemap in a kernel driver like this:
uint64_t page;
uint64_t va = 0x7FFD1BF46530;
loff_t pos = va / PAGE_SIZE * sizeof(uint64_t);
struct file * filp = filp_open("/proc/19030/pagemap", O_RDONLY, 0);
ssize_t nread = kernel_read(filp, &page, sizeof(page), &pos);
I get error -22 in nread (EINVAL, invalid argument) and
"kernel read not supported for file /19030/pagemap (pid: 19030 comm: tester)" in dmesg.
0x7FFD1BF46530 is a virtual address in a user space process pid 19030 (tester). I assume that pos is the offset into the file like in lseek64.
Doing precisely the same thing with the same values in a user-space process running as sudo, i.e. reading /proc/19030/pagemap, works fine and produces a correct physical address.
The actual thing I am trying to do here is to find the physical address of a user space virtual address. I need the physical address for a device DMA transfer operation, and a user space app needs to access this memory. This app allocates 1GB of DMA memory with anonymous mmap from THP (Transparent Huge Pages). And I am trying to avoid the need for sudo by reading /proc/<pid>/pagemap in a kernel driver via ioctl instead.
I would be happy to allocate huge page DMA memory in the driver but don't know how to do that. dma_alloc_coherent is limited to max 4MB allocations. Is there a way to get those allocated as contiguous physical memory? I need hundreds of MB or many GB of DMA memory.
The problem with anonymous mmap is that it can only allocate at most a 1GB huge page as physically contiguous memory. Allocating more works, but the memory is not physically contiguous and is therefore unusable for DMA.
Any good ideas or alternative ways of allocating huge pages as DMA memory?
Tried reading the file /proc/<pid>/pagemap in a kernel driver. Expected the same results as when reading the file in a user-space application, which works OK.
"kernel read not supported for file …"
Indeed, as we see in __kernel_read()
if (unlikely(!file->f_op->read_iter || file->f_op->read))
return warn_unsupported(file, "read");
it fails if f_op->read_iter isn't wired up (implemented) or f_op->read is, both of which are the case for a pagemap file.
You could try pagemap_read() instead. – not feasible for reasons in the comments
When I had the problem of getting the physical address for a virtual address in a driver, I included and copied some kernel code (not that I recommend this, but I saw no other solution); here's an extract.
static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
                              unsigned long sz)
{ return NULL; }

void p4d_clear_bad(p4d_t *p4d) { p4d_ERROR(*p4d); p4d_clear(p4d); }

#include "mm/pagewalk.c"

static int pte(pte_t *pte, unsigned long addr,
               unsigned long next, struct mm_walk *walk)
{
    *(pte_t **)walk->private = pte;
    return 1;
}

/* Scan the real Linux page tables and return a PTE pointer for
 * a virtual address in a context.
 * Returns true (1) if PTE was found, zero otherwise. The pointer to
 * the PTE pointer is unmodified if PTE is not found.
 */
int
get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep, pmd_t **pmdp)
{
    struct mm_walk walk = { .pte_entry = pte, .mm = mm, .private = ptep };
    return walk_page_range(addr, addr + PAGE_SIZE, &walk);
}

/* Find physical address for this virtual address. Normally used by
 * I/O functions, but anyone can call it.
 */
static inline unsigned long iopa(unsigned long addr)
{
    unsigned long pa;

    /* I don't know why this won't work on PMacs or CHRP. It
     * appears there is some bug, or there is some implicit
     * mapping done not properly represented by BATs or in page
     * tables.......I am actively working on resolving this, but
     * can't hold up other stuff. -- Dan
     */
    pte_t *pte;
    struct mm_struct *mm;

#if 0
    /* Check the BATs */
    phys_addr_t v_mapped_by_bats(unsigned long va);

    pa = v_mapped_by_bats(addr);
    if (pa)
        return pa;
#endif
    /* Allow mapping of user addresses (within the thread)
     * for DMA if necessary.
     */
    if (addr < TASK_SIZE)
        mm = current->mm;
    else
        mm = &init_mm;
    /* ATTENTION: I needed the current address space.
     * You'd have to use mm = file->private_data instead. */
    pa = 0;
    if (get_pteptr(mm, addr, &pte, NULL))
        pa = (pte_val(*pte) & PAGE_MASK) | (addr & ~PAGE_MASK);

    return pa;
}
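As an aside (not part of the answer above), for the common case of translating a single user-space address of the current process inside a driver, a much shorter sketch based on get_user_pages_fast() is possible; the helper name is hypothetical:
/* Translate one user-space virtual address of the current process into a
 * physical address by pinning the backing page (sketch only). */
static int user_va_to_phys(unsigned long uaddr, phys_addr_t *pa)
{
    struct page *page;
    int ret;

    ret = get_user_pages_fast(uaddr & PAGE_MASK, 1, 0, &page);
    if (ret != 1)
        return ret < 0 ? ret : -EFAULT;

    *pa = page_to_phys(page) | (uaddr & ~PAGE_MASK);

    /* For real DMA the page must stay pinned until the device is done with it;
     * here the reference is dropped immediately for brevity. */
    put_page(page);
    return 0;
}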

How to pass efficiently the hugepages-backed buffer to the BM DMA device in Linux?

I need to provide a huge circular buffer (a few GB) for the bus-mastering DMA PCIe device implemented in FPGA.
The buffer should not be reserved at boot time. Therefore, it may not be contiguous.
The device supports scatter-gather (SG) operation, but for performance reasons, the addresses and lengths of consecutive contiguous segments of the buffer are stored inside the FPGA.
Therefore, usage of standard 4KB pages is not acceptable (there would be up to 262144 segments for each 1GB of the buffer).
The right solution should allocate a buffer consisting of 2MB hugepages in user space (reducing the maximum number of segments by a factor of 512).
The virtual address of the buffer should be transferred to the kernel driver via ioctl. Then the addresses and the length of the segments should be calculated and written to the FPGA.
In theory, I could use get_user_pages to create the list of the pages, and then call sg_alloc_table_from_pages to obtain the SG list suitable to program the DMA engine in FPGA.
Unfortunately, in this approach I must prepare the intermediate list of page structures with length of 262144 pages per 1GB of the buffer. This list is stored in RAM, not in the FPGA, so it is less problematic, but anyway it would be good to avoid it.
In fact I don't need to keep the pages mapped for the kernel, as the hugepages are protected against swapping out, and they are mapped for the user space application that will process the received data.
So what I'm looking for is a function sg_alloc_table_from_user_hugepages that could take such a user-space address of a hugepages-based memory buffer and turn it directly into the right scatterlist, without performing an unnecessary and memory-consuming mapping for the kernel.
Of course such a function should verify that the buffer indeed consists of hugepages.
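In other words, the desired API might look roughly like this (the name and signature are purely hypothetical, since such a function does not exist):
/* Hypothetical: pin a hugepage-backed user buffer and build a scatterlist
 * with one entry per hugepage, without mapping the buffer into the kernel. */
int sg_alloc_table_from_user_hugepages(struct sg_table *sgt,
                                       const void __user *buf, size_t count,
                                       gfp_t gfp_mask);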
I have found and read these posts: (A), (B), but couldn't find a good answer.
Is there any official method to do it in the current Linux kernel?
At the moment I have a very inefficient solution based on get_user_pages_fast:
int sgt_prepare(const char __user *buf, size_t count,
                struct sg_table *sgt, struct page ***a_pages,
                int *a_n_pages)
{
    int res = 0;
    int n_pages;
    struct page **pages = NULL;
    const unsigned long offset = ((unsigned long)buf) & (PAGE_SIZE - 1);

    //Calculate number of pages
    n_pages = (offset + count + PAGE_SIZE - 1) >> PAGE_SHIFT;
    printk(KERN_ALERT "n_pages: %d", n_pages);

    //Allocate the table for pages
    pages = vzalloc(sizeof(*pages) * n_pages);
    printk(KERN_ALERT "pages: %p", pages);
    if (pages == NULL) {
        res = -ENOMEM;
        goto sglm_err1;
    }

    //Now pin the pages
    res = get_user_pages_fast(((unsigned long)buf & PAGE_MASK), n_pages, 0, pages);
    printk(KERN_ALERT "gupf: %d", res);
    if (res < n_pages) {
        int i;
        for (i = 0; i < res; i++)
            put_page(pages[i]);
        res = -ENOMEM;
        goto sglm_err1;
    }

    //Now create the sg-list
    res = sg_alloc_table_from_pages(sgt, pages, n_pages, offset, count, GFP_KERNEL);
    printk(KERN_ALERT "satf: %d", res);
    if (res < 0)
        goto sglm_err2;

    *a_pages = pages;
    *a_n_pages = n_pages;
    return res;

sglm_err2:
    //Here we jump if we know that the pages are pinned
    {
        int i;
        for (i = 0; i < n_pages; i++)
            put_page(pages[i]);
    }
sglm_err1:
    if (sgt)
        sg_free_table(sgt);
    if (pages)
        vfree(pages); /* pages were allocated with vzalloc(), so free with vfree() */
    *a_pages = NULL;
    *a_n_pages = 0;
    return res;
}

void sgt_destroy(struct sg_table *sgt, struct page **pages, int n_pages)
{
    int i;

    //Free the sg list
    if (sgt->sgl)
        sg_free_table(sgt);

    //Unpin pages
    for (i = 0; i < n_pages; i++) {
        set_page_dirty(pages[i]);
        put_page(pages[i]);
    }
}
The sgt_prepare function builds the sg_table sgt structure that I can use to create the DMA mapping. I have verified that it contains a number of entries equal to the number of hugepages used.
Unfortunately, it requires that the list of the pages is created (allocated and returned via the a_pages pointer argument), and kept as long as the buffer is used.
Therefore, I really dislike that solution. Right now I have 256 2MB hugepages used as a DMA buffer. That means I have to create and keep 128*1024 unnecessary page structures. I also waste 512 MB of kernel address space for an unnecessary kernel mapping.
The interesting question is whether a_pages may be kept only temporarily (until the sg-list is created). In theory it should be possible, as the pages are still locked...
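For context, here is a sketch of how these helpers might be driven from an ioctl handler; the surrounding names and the use of dma_map_sg() are my assumptions, not part of the original post:
static int my_ioctl_prepare(struct device *dev, const char __user *ubuf, size_t count)
{
    struct sg_table sgt = { 0 };
    struct page **pages = NULL;
    int n_pages = 0;
    int nents, res;

    res = sgt_prepare(ubuf, count, &sgt, &pages, &n_pages);
    if (res)
        return res;

    /* Create the bus-address mapping that is programmed into the FPGA. */
    nents = dma_map_sg(dev, sgt.sgl, sgt.orig_nents, DMA_FROM_DEVICE);
    if (!nents) {
        sgt_destroy(&sgt, pages, n_pages);
        vfree(pages);
        return -EIO;
    }

    /* ... write sg_dma_address()/sg_dma_len() of each mapped entry to the FPGA ... */
    /* In a real driver, sgt and pages would be kept in the device context
     * so that dma_unmap_sg() and sgt_destroy() can be called on teardown. */
    return 0;
}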

Map multiple kernel buffers into a contiguous userspace buffer?

I have allocated multiple kernel accessible buffers using dma_alloc_coherent, each 4MiB in size. The goal is to map these buffers into a contiguous userspace virtual memory. The issue is that remap_pfn_range doesn't seem to be working, as the userspace memory sometimes works and sometimes doesn't, or sometimes duplicates the page mappings of the buffers.
// in probe() function
dma_alloc_coherent(&pcie->dev, BUF_SIZE, &bus_addr0, GFP_KERNEL);
dma_alloc_coherent(&pcie->dev, BUF_SIZE, &bus_addr1, GFP_KERNEL);
// ...
// in mmap() function
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
pfn = dma_to_phys(&pcie->dev, bus_addr0) >> PAGE_SHIFT;
remap_pfn_range(vma, vma->vm_start + 0, pfn, BUF_SIZE, vma->vm_page_prot);
pfn = dma_to_phys(&pcie->dev, bus_addr1) >> PAGE_SHIFT;
remap_pfn_range(vma, vma->vm_start + BUF_SIZE, pfn, BUF_SIZE, vma->vm_page_prot);
I'm not really sure of the best way to map multiple kernel buffers to contiguous userspace memory, but I have a feeling I am doing it wrong. Thanks in advance.
I have no idea why there isn't a better interface to map multiple buffers contiguously into user space. In theory you can use multiple calls to remap_pfn_range() but getting the correct pfn for memory allocated by dma_alloc_coherent() is essentially impossible on some platforms (e.g. ARM).
I have come up with a solution to this problem that might not be considered "good" but seems to work well enough in my usage on multiple platforms (x86_64, and various ARM). The solution is to temporarily modify the start and end addresses in the struct vm_area_struct while calling dma_mmap_coherent() multiple times, once for each buffer. As long as you reset the VMA start and end addresses to their original values, everything seems to work okay (see my prior disclaimer).
Here is an example:
static int mmap(struct file *file, struct vm_area_struct *vma)
{
    . . .
    int rc;
    unsigned long vm_start_orig = vma->vm_start;
    unsigned long vm_end_orig = vma->vm_end;

    for (int idx = 0; idx < buffer_list_size; idx++) {
        buffer_entry = &buffer_list[idx];

        /* Temporarily modify VMA start and end addresses */
        if (idx > 0) {
            vma->vm_start = vma->vm_end;
        }
        vma->vm_end = vma->vm_start + buffer_entry->size;

        rc = dma_mmap_coherent(dev, vma,
                               buffer_entry->virt_address,
                               buffer_entry->phys_addr,
                               buffer_entry->size);
        if (rc != 0) {
            pr_err("dma_mmap_coherent: %d (IDX = %d)\n", rc, idx);
            return -EAGAIN;
        }
    }

    /* Restore VMA addresses */
    vma->vm_start = vm_start_orig;
    vma->vm_end = vm_end_orig;

    return rc;
}
Unfortunately, the only currently supported method for mmap()ing DMA coherent memory is the macro dma_mmap_coherent() or the function dma_mmap_attrs() (which is called by dma_mmap_coherent()). Neither supports splitting a single VMA across multiple, individually allocated blocks of DMA coherent memory.
(I wish there was a supported way to split the mmap()ing of a VMA across multiple allocations of DMA coherent memory because it affects the buffer allocation in a kernel subsystem that I help maintain. I had to change it to allocate the buffer as a single block of DMA coherent memory instead of many page-sized blocks.)

Mmap DMA memory uncached: "map pfn ram range req uncached-minus got write-back"

I am mapping DMA coherent memory from kernel to user space. At user level I use mmap() and in kernel driver I use dma_alloc_coherent() and afterwards remap_pfn_range() to remap the pages. This basically works as I can write data to the mapped area in my app and verify it in my kernel driver.
However, despite using dma_alloc_coherent (which should alloc uncached memory) and pgprot_noncached() the kernel informs me with this dmesg output:
map pfn ram range req uncached-minus for [mem 0xABC-0xCBA], got write-back
In my understanding, write-back is cached memory. But I need uncached memory for the DMA operation.
The Code (only showing the important parts):
User App
fd = open(dev_fn, O_RDWR | O_SYNC);
if (fd > 0)
{
mem = mmap ( NULL
, mmap_len
, PROT_READ | PROT_WRITE
, MAP_SHARED
, fd
, 0
);
}
For testing purposes I used mmap_len = getpagesize(), which is 4096.
Kernel Driver
typedef struct
{
    size_t mem_size;
    dma_addr_t dma_addr;
    void *cpu_addr;
} Dma_Priv;

fops_mmap()
{
    dma_priv->mem_size = vma->vm_end - vma->vm_start;
    dma_priv->cpu_addr = dma_alloc_coherent ( &gen_dev
                                            , dma_priv->mem_size
                                            , &dma_priv->dma_addr
                                            , GFP_KERNEL
                                            );
    if (dma_priv->cpu_addr != NULL)
    {
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        remap_pfn_range ( vma
                        , vma->vm_start
                        , virt_to_phys(dma_priv->cpu_addr) >> PAGE_SHIFT
                        , dma_priv->mem_size
                        , vma->vm_page_prot
                        );
    }
}
Useful information I've found
PATting Linux:
Page 7 --> mmap with O_SYNC (uncached):
Applications can open /dev/mem with the O_SYNC flag and then do mmap
on it. With that, applications will be accessing that address with an
uncached memory type. mmap will succeed only if there is no other
conflicting mappings to the same region.
I used the flag, doesn't help.
Page 7 --> mmap without O_SYNC (uncached-minus):
mmap without O_SYNC, no existing mapping, and not a write-back region:
For an mmap that comes under this category, we use uncached-minus type
mapping. In the absence of any MTRR for this region, the effective
type will be uncached. But in cases where there is an MTRR, making
this region write-combine, then the effective type will be
write-combine.
pgprot_noncached()
In /arch/x86/include/asm/pgtable.h I found this:
#define pgprot_noncached(prot) \
((boot_cpu_data.x86 > 3) \
? (__pgprot(pgprot_val(prot) | \
cachemode2protval(_PAGE_CACHE_MODE_UC_MINUS))) \
: (prot))
Is it possible that x86 always turns a noncached request into UC_MINUS, which, in combination with the MTRRs, results in cached write-back?
I am using Ubuntu 16.04.1, Kernel: 4.10.0-40-generic.
https://www.kernel.org/doc/Documentation/x86/pat.txt
Drivers wanting to export some pages to userspace do it by using mmap
interface and a combination of 1) pgprot_noncached() 2)
io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()
With PAT support, a new API pgprot_writecombine is being added. So,
drivers can continue to use the above sequence, with either
pgprot_noncached() or pgprot_writecombine() in step 1, followed by
step 2.
In addition, step 2 internally tracks the region as UC or WC in
memtype list in order to ensure no conflicting mapping.
Note that this set of APIs only works with IO (non RAM) regions. If
driver wants to export a RAM region, it has to do set_memory_uc() or
set_memory_wc() as step 0 above and also track the usage of those
pages and use set_memory_wb() before the page is freed to free pool.
I added set_memory_uc() before pgprot_noncached() and it did the trick.
if (dma_priv->cpu_addr != NULL)
{
    set_memory_uc((unsigned long)dma_priv->cpu_addr, (dma_priv->mem_size / PAGE_SIZE));
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    remap_pfn_range ( vma
                    , vma->vm_start
                    , virt_to_phys(dma_priv->cpu_addr) >> PAGE_SHIFT
                    , dma_priv->mem_size
                    , vma->vm_page_prot
                    );
}
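Per the pat.txt excerpt above, pages switched to UC with set_memory_uc() should be switched back with set_memory_wb() before the memory is returned to the free pool; a sketch of the matching cleanup (assuming the same dma_priv and gen_dev as above) could be:
/* e.g. in the release()/cleanup path of the driver */
set_memory_wb((unsigned long)dma_priv->cpu_addr,
              dma_priv->mem_size / PAGE_SIZE);
dma_free_coherent(&gen_dev, dma_priv->mem_size,
                  dma_priv->cpu_addr, dma_priv->dma_addr);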
This answer was posted as an edit to the question Mmap DMA memory uncached: "map pfn ram range req uncached-minus got write-back" by the OP Gbo under CC BY-SA 4.0.

Linux device driver to allow an FPGA to DMA directly to CPU RAM

I'm writing a linux device driver to allow an FPGA (currently connected to the PC via PCI express) to DMA data directly into CPU RAM. This needs to happen without any interaction and user space needs to have access to the data. Some details:
- Running 64 bit Fedora 14
- System has 8GB of RAM
- The FPGA (Cyclone IV) is on a PCIe card
In an attempt to accomplish this I performed the following:
- Reserved the upper 2GB of RAM in grub with memmap 6GB$2GB (will not boot if I add mem=2GB). I can see that the upper 2GB of RAM is reserved in /proc/meminfo
- Mapped BAR0 to allow reading and writing to FPGA registers (this works perfectly)
- Implemented an mmap function in my driver with remap_pfn_range()
- Use ioremap to get the virtual address of the buffer
- Added ioctl calls (for testing) to write data to the buffer
- Tested the mmap by making an ioctl call to write data into the buffer and verified the data was in the buffer from user space
The problem I'm facing arises when the FPGA starts to DMA data to the buffer address I provide. I constantly get PTE errors (from DMAR:), or, with the code below, I get the following error:
DMAR: [DMA Write] Request device [01:00.0] fault addr 186dc5000
DMAR: [fault reason 01] Present bit in root entry is clear
DRHD: handling fault status reg 3
The address in the first line increments by 0x1000 each time, based on the DMA from the FPGA.
Here's my init() code:
#define IMG_BUF_OFFSET 0x180000000UL // Location in RAM (6GB)
#define IMG_BUF_SIZE   0x80000000UL  // Size of the Buffer (2GB)

#define pci_dma_h(addr) ((addr >> 16) >> 16)
#define pci_dma_l(addr) (addr & 0xffffffffUL)

if((pdev = pci_get_device(FPGA_VEN_ID, FPGA_DEV_ID, NULL)))
{
    printk("FPGA Found on the PCIe Bus\n");

    // Enable the device
    if(pci_enable_device(pdev))
    {
        printk("Failed to enable PCI device\n");
        return(-1);
    }
    // Enable bus master
    pci_set_master(pdev);

    pci_read_config_word(pdev, PCI_VENDOR_ID, &id);
    printk("Vendor id: %x\n", id);
    pci_read_config_word(pdev, PCI_DEVICE_ID, &id);
    printk("Device id: %x\n", id);
    pci_read_config_word(pdev, PCI_STATUS, &id);
    printk("Device Status: %x\n", id);
    pci_read_config_dword(pdev, PCI_COMMAND, &temp);
    printk("Command Register : : %x\n", temp);
    printk("Resources Allocated :\n");
    pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0, &temp);
    printk("BAR0 : %x\n", temp);

    // Get the starting address of BAR0
    bar0_ptr = (unsigned int*)pcim_iomap(pdev, 0, FPGA_CONFIG_SIZE);
    if(!bar0_ptr)
    {
        printk("Error mapping Bar0\n");
        return -1;
    }
    printk("Remapped BAR0\n");

    // Set DMA Masking
    if(!pci_set_dma_mask(pdev, DMA_BIT_MASK(64)))
    {
        pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
        printk("Device setup for 64bit DMA\n");
    }
    else if(!pci_set_dma_mask(pdev, DMA_BIT_MASK(32)))
    {
        pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
        printk("Device setup for 32bit DMA\n");
    }
    else
    {
        printk(KERN_WARNING"No suitable DMA available.\n");
        return -1;
    }

    // Get a pointer to reserved lower RAM in kernel address space (virtual address)
    virt_addr = ioremap(IMG_BUF_OFFSET, IMG_BUF_SIZE);
    kernel_image_buffer_ptr = (unsigned char*)virt_addr;
    memset(kernel_image_buffer_ptr, 0, IMG_BUF_SIZE);
    printk("Remapped image buffer: 0x%p\n", (void*)virt_addr);
}
Here's my mmap code:
unsigned long image_buffer;
unsigned int low;
unsigned int high;

if(remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                   vma->vm_end - vma->vm_start,
                   vma->vm_page_prot))
{
    return(-EAGAIN);
}

image_buffer = (vma->vm_pgoff << PAGE_SHIFT);

if(0 > check_mem_region(IMG_BUF_OFFSET, IMG_BUF_SIZE))
{
    printk("Failed to check region...memory in use\n");
    return -1;
}

request_mem_region(IMG_BUF_OFFSET, IMG_BUF_SIZE, DRV_NAME);

// Get the bus address from the virtual address above
//dma_page = virt_to_page(addr);
//dma_offset = ((unsigned long)addr & ~PAGE_MASK);
//dma_addr = pci_map_page(pdev, dma_page, dma_offset, IMG_BUF_SIZE, PCI_DMA_FROMDEVICE);
//dma_addr = pci_map_single(pdev, image_buffer, IMG_BUF_SIZE, PCI_DMA_FROMDEVICE);
//dma_addr = IMG_BUF_OFFSET;
//printk("DMA Address: 0x%p\n", (void*)dma_addr);

// Write start or image buffer address to the FPGA
low = pci_dma_l(image_buffer);
low &= 0xfffffffc;
high = pci_dma_h(image_buffer);
if(high != 0)
    low |= 0x00000001;

*(bar0_ptr + (17024/4)) = 0;

//printk("DMA Address LOW : 0x%x\n", cpu_to_le32(low));
//printk("DMA Address HIGH: 0x%x\n", cpu_to_le32(high));

*(bar0_ptr + (4096/4))  = cpu_to_le32(low); //2147483649;
*(bar0_ptr + (4100/4))  = cpu_to_le32(high);
*(bar0_ptr + (17052/4)) = cpu_to_le32(low & 0xfffffffe); //2147483648;

printk("Process Read Command: Addr:0x%x Ret:0x%x\n", 4096, *(bar0_ptr + (4096/4)));
printk("Process Read Command: Addr:0x%x Ret:0x%x\n", 4100, *(bar0_ptr + (4100/4)));
printk("Process Read Command: Addr:0x%x Ret:0x%x\n", 17052, *(bar0_ptr + (17052/4)));

return(0);
Thank you for any help you can provide.
Do you control the RTL code that writes the TLP packets yourself, or can you name the DMA engine and PCIe BFM (bus functional model) you are using? What do your packets look like in the simulator? Most decent BFMs should trap this rather than let you find it post-deploy with a PCIe hardware capture system.
To target the upper 2GB of RAM you will need to be sending 2DW (64-bit) addresses from the device. Are the bits in your Fmt/Type field set to do this? The faulting address looks like a masked 32-bit bus address, so something at this level is likely incorrect. Also bear in mind that, because PCIe is big-endian, you need to take care when writing the target addresses to the PCIe device endpoint. You might have the lower bytes of the target address dropping into the payload if Fmt is incorrect - again, a decent BFM should spot the resulting packet length mismatch.
If you have a recent motherboard/modern CPU, the PCIe endpoint should do PCIe AER (advanced error reporting), so if running a recent CentOS/RHEL 6.3 you should get a dmesg report of endpoint faults. This is very useful, as the report captures the first handful of DWs of the packet in special capture registers, so you can review the TLP as received.
In your kernel driver, I see you set up the DMA mask; that is not sufficient, as you have not programmed the IOMMU to allow writes to the pages from the device. Look at the implementation of pci_alloc_consistent() to see what else you should be calling to achieve this.
If you are still looking for a reason, then it goes like this:
Your kernel has DMA remapping (DMA_REMAPPING) enabled by default, so the IOMMU is throwing the above error because the IOMMU context/domain entries are not programmed for your device.
You can try using intel_iommu=off on the kernel command line or putting the IOMMU in bypass mode for your device.
Regards,
Samir
