I made the following x86-64 program to view where the base address of the Interrupt Descriptor Tables starts:
#include <stdio.h>
#include <inttypes.h>
typedef struct __attribute__((packed)) {
uint16_t limit;
uint64_t base;
}idt_data_t;
static inline void store_idt(idt_data_t *idt_data)
{
asm volatile("sidt %0":"=m" (*idt_data));
}
int main(void)
{
idt_data_t idt_data;
store_idt(&idt_data);
printf("IDT Limit : 0x%X\n", idt_data.limit);
printf("IDT Base : 0x%lX\n", idt_data.base);
return 0;
}
And it prints the following:
IDT Limit : 0xFFF
IDT Base : 0xFFFFFE0000000000
The base address doesn't seem to be correct because the address should always be a physical address, am I right?
Also, I'm not sure but the limit seems to be too high. What am I doing wrong?
It's a linear address, not necessarily a physical address. In other words, it's subject to the page table like most other addresses. It has to be in pages that are never paged to disk--it wouldn't be able to handle page faults if not--but it can be in addresses that differ physically from virtually.
On x86-64, each entry of the IDT is 16 bytes long. There are 256 interrupt vectors. 256 * 16 = 4096 = 0x1000. The IDTR limit is a "less than or equal" check, so it's typical to use 0xFFF.
SIDT is a privileged instruction on newer CPUs if the OS enables a certain feature, so it's advisable not to use it in user mode unless you're writing an exploit PoC or something. It's possible that an OS lies about the answer rather than throwing an exception, but I don't know.
I am trying to access the DMA address in a NIC directly from another PCIe device in Linux. Specifically, I am trying to read that from an NVIDIA GPU to bypass the CPU all together. I have researched for zero-copy networking and DMA to userspace posts, but they either didn't answer the question or involve some copy from Kernel space to User space. I am trying to avoid using any CPU clocks because of the inconsistency with the delay and I have very tight latency requirements.
I got a hold of the NIC driver for the intel card I use (e1000e driver) and I found where the ring buffers are allocated. As I understood from a previous paper I was reading, I would be interested in the descriptor of type dma_addr_t. They also have a member of the rx_ring struct called dma. I pass both the desc and the dma members using an ioctl call but I am unable to get anything in the GPU besides zeros.
The GPU code is as follows:
int *setup_gpu_dma(u64 addr)
{
// Allocate GPU memory
int *gpu_ptr;
cudaMalloc((void **) &gpu_ptr, MEM_SIZE);
// Allocate memory in user space to read the stuff back
int *h_data;
cudaMallocHost((void **)&h_data, MEM_SIZE);
// Present FPGA memory to CUDA as CPU locked pages
int error = cudaHostRegister((void **) &addr, MEM_SIZE,
CU_MEMHOSTALLOC_DEVICEMAP);
cout << "Allocation error = " << error << endl;
// DMA from GPU memory to FPGA memory
cudaMemcpy((void **) &gpu_ptr, (void **)&addr, MEM_SIZE, cudaMemcpyHostToDevice);
cudaMemcpy((void **) &h_data, (void **)&gpu_ptr, MEM_SIZE, cudaMemcpyDeviceToHost);
// Print the data
// Clean up
}
What am I doing wrong?
cudaHostRegister() operates on already-allocated host memory, so you have to pass addr, not &addr.
If addr is not a host pointer, this will not work. If it is a host pointer, your function interface should use void * and then there will be no need for the typecast.
I am using Fedora 17 running on board with integrated graphics.
Given that I am able to manipulate content of physical memory, how can I find out the physical memory offsets that I could write to in order to display something on the screen?
I tried to lookup 0xB8000 and 0xB0000 offsets but they contain all 0xff.
Is there a specific pattern that starts video buffer in memory?
Is there any good source of information about this topic?
The root cause of my problem was that Linux is not using legacy video mode, so the memory at 0xB8000 is restricted (in my case read only). However issuing an interrupt can switch to other modes:
INT 10 - VIDEO - SET VIDEO MODE
AH = 00h
AL = desired video mode (see #00010)
Found on: http://www.delorie.com/djgpp/doc/rbinter/id/74/0.html
Living like it's 1989
#include <linux/fb.h>
#define DEV_MEM "/dev/fb0"
/* Screen parameters (probably via ioctl() and /sys. */
#define YRES 240
#define XRES 320
#define BYTES_PER_PIXEL (sizeof(unsigned short)) /* 16 bit pixels. */
#define MAP_SIZE XRES*YRES*BYTES_PER_PIXEL
unsigned short *map_lbase;
if((fd = open(DEV_MEM, O_RDWR | O_SYNC)) == -1) {
fprintf(stderr, "cannot open %s - are you root?\n", DEV_MEM);
exit(1);
}
// Map that page.
map_lbase = (unsigned short *)mmap(NULL, MAP_SIZE,
PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if((long)map_lbase == -1) {
perror("cannot mmap");
exit(1);
}
Humons - Framebuffer API doc, Framebuffer Doc dir.
Smart Humons - Internals, Deferred I/O doc or how to emulate memory mapped video.
You can not use 0xB8000 and 0xB0000 directly as those are physical addresses. I assume you are in user space and not writing a kernel driver. Under Linux, we normally have the MMU enable; In other words, we have virtual memory. Not all processes/users have access to video memory. However, if you are allowed, you can mmap a framebuffer device to your address space. It is best to let the kernel give you an address and not request a specific one.
Or see how professionals do it.
Man: mmap
Edit: If you aren't root, you can still use Unix permissions on /dev/fb0 (or which ever device) to give group permission to read/write or use some sort of login process that gives a user on the current tty permission.
Maybe you could start around here:
http://www.tldp.org/HOWTO/Framebuffer-HOWTO/
But modern video graphics are in no way as simple as "finding where in the meory is the VRAM" and writing there.
I have 64 bit REHL linux, Linux ipms-sol1 2.6.32-71.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
RAM size = ~38GB
I changed default shared memory limits as follows in /etc/sysctl.conf & loaded changed file in memory as sysctl -p
kernel.shmmni=81474836
kernel.shmmax=32212254720
kernel.shmall=7864320
Just for experimental basis I have changed shmmax size to 32GB and tried allocating 10GB in code using shmget() as given below, but it fails to get 10GB of shared memory in single shot but when I reduce my demand for shared space to 8GB it succeeds any clue as to where am I possibly going wrong?
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#define SHMSZ 10737418240
main()
{
char c;
int shmspaceid;
key_t key;
char *shm, *s;
struct shmid_ds shmid;
key = 5678;
fprintf(stderr,"Changed code\n");
if ((shmspaceid = shmget(key, SHMSZ, IPC_CREAT | 0666)) < 0) {
fprintf(stderr,"ERROR memory allocation failed\n");
return 1;
}
shmctl(shmspaceid, IPC_RMID, &shmid);
return 0;
}
Regards
Himanshu
I'm not sure that this solution is applicable to shared memory as well, but I know this phenomenon from normal malloc() calls.
It's pretty usual that you cannot allocate very large blocks of memory as you try it here. What the functions call means is "Allocate me a block of continuous memory of 10737418240 bytes". Often times, even if the total system memory could theoretically satisfy this need, the implied "a block of continuous memory" forces the limit of allocatable memory to be much lower.
The in-memory program structure, the number of programs loaded can all contribute to blocking certain areas of memory and not allow there to be 10 continuous gigabytes of memory allocatable.
I have found often times that a reboot will change that (as programs get loaded to a different position on the heap). You can try out your maximum allocatable block size with something like this:
int i=1024;
int error=0;
while(!error) {
char *a=(char*)malloc(i);
error=(a==null);
if(!error)
printf("Successfully allocated %i.\n", i);
i*=2;
}
Hope this helps or is applicable here. I found this out while checking why I could not allocate close to maximum system memory to a JVM.
Shot in the dark: you don't have enough swap space. Shared memory, by default, requires reserving space in swap. You can disable this behavior using SHM_NORESERVE:
http://linux.die.net/man/2/shmget
SHM_NORESERVE (since Linux 2.6.15) This flag serves the same purpose
as the mmap(2) MAP_NORESERVE flag. Do not reserve swap space for this
segment. When swap space is reserved, one has the guarantee that it is
possible to modify the segment. When swap space is not reserved one
might get SIGSEGV upon a write if no physical memory is available. See
also the discussion of the file /proc/sys/vm/overcommit_memory in
proc(5).
I was just looking at this and I recommend printing out the exact errno value and description for the problem, rather than just noting that it failed. For example:
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
//#define SHMSZ 10737418240
#define SHMSZ 8589934592
int main()
{
int shmspaceid;
key_t key = 5678;
struct shmid_ds shmid;
if ((shmspaceid = shmget(key, SHMSZ, IPC_CREAT | 0666)) < 0) {
fprintf(stderr,"ERROR with shmget (%d: %s)\n", (int)(errno), strerror(errno));
return 1;
}
shmctl(shmspaceid, IPC_RMID, &shmid);
return 0;
}
I tried to reproduce your problem with an 8 GB block and 8 GB smhmax and shmall on my 16 GB system, but I could not. It worked fine. I do recommend using ipcs -m to look for other shared blocks that might prevent your 10 GB allocation from being honored. And definitely look closely at the exact error code that shmget() is returning through errno.
I want to get data from a DMA enabled, PCIe hardware device into user-space as quickly as possible.
Q: How do I combine "direct I/O to user-space with/and/via a DMA transfer"
Reading through LDD3, it seems that I need to perform a few different types of IO operations!?
dma_alloc_coherent gives me the physical address that I can pass to the hardware device.
But would need to have setup get_user_pages and perform a copy_to_user type call when the transfer completes. This seems a waste, asking the Device to DMA into kernel memory (acting as buffer) then transferring it again to user-space.
LDD3 p453: /* Only now is it safe to access the buffer, copy to user, etc. */
What I ideally want is some memory that:
I can use in user-space (Maybe request driver via a ioctl call to create DMA'able memory/buffer?)
I can get a physical address from to pass to the device so that all user-space has to do is perform a read on the driver
the read method would activate the DMA transfer, block waiting for the DMA complete interrupt and release the user-space read afterwards (user-space is now safe to use/read memory).
Do I need single-page streaming mappings, setup mapping and user-space buffers mapped with get_user_pages dma_map_page?
My code so far sets up get_user_pages at the given address from user-space (I call this the Direct I/O part). Then, dma_map_page with a page from get_user_pages. I give the device the return value from dma_map_page as the DMA physical transfer address.
I am using some kernel modules as reference: drivers_scsi_st.c and drivers-net-sh_eth.c. I would look at infiniband code, but cant find which one is the most basic!
Many thanks in advance.
I'm actually working on exactly the same thing right now and I'm going the ioctl() route. The general idea is for user space to allocate the buffer which will be used for the DMA transfer and an ioctl() will be used to pass the size and address of this buffer to the device driver. The driver will then use scatter-gather lists along with the streaming DMA API to transfer data directly to and from the device and user-space buffer.
The implementation strategy I'm using is that the ioctl() in the driver enters a loop that DMA's the userspace buffer in chunks of 256k (which is the hardware imposed limit for how many scatter/gather entries it can handle). This is isolated inside a function that blocks until each transfer is complete (see below). When all bytes are transfered or the incremental transfer function returns an error the ioctl() exits and returns to userspace
Pseudo code for the ioctl()
/*serialize all DMA transfers to/from the device*/
if (mutex_lock_interruptible( &device_ptr->mtx ) )
return -EINTR;
chunk_data = (unsigned long) user_space_addr;
while( *transferred < total_bytes && !ret ) {
chunk_bytes = total_bytes - *transferred;
if (chunk_bytes > HW_DMA_MAX)
chunk_bytes = HW_DMA_MAX; /* 256kb limit imposed by my device */
ret = transfer_chunk(device_ptr, chunk_data, chunk_bytes, transferred);
chunk_data += chunk_bytes;
chunk_offset += chunk_bytes;
}
mutex_unlock(&device_ptr->mtx);
Pseudo code for incremental transfer function:
/*Assuming the userspace pointer is passed as an unsigned long, */
/*calculate the first,last, and number of pages being transferred via*/
first_page = (udata & PAGE_MASK) >> PAGE_SHIFT;
last_page = ((udata+nbytes-1) & PAGE_MASK) >> PAGE_SHIFT;
first_page_offset = udata & PAGE_MASK;
npages = last_page - first_page + 1;
/* Ensure that all userspace pages are locked in memory for the */
/* duration of the DMA transfer */
down_read(¤t->mm->mmap_sem);
ret = get_user_pages(current,
current->mm,
udata,
npages,
is_writing_to_userspace,
0,
&pages_array,
NULL);
up_read(¤t->mm->mmap_sem);
/* Map a scatter-gather list to point at the userspace pages */
/*first*/
sg_set_page(&sglist[0], pages_array[0], PAGE_SIZE - fp_offset, fp_offset);
/*middle*/
for(i=1; i < npages-1; i++)
sg_set_page(&sglist[i], pages_array[i], PAGE_SIZE, 0);
/*last*/
if (npages > 1) {
sg_set_page(&sglist[npages-1], pages_array[npages-1],
nbytes - (PAGE_SIZE - fp_offset) - ((npages-2)*PAGE_SIZE), 0);
}
/* Do the hardware specific thing to give it the scatter-gather list
and tell it to start the DMA transfer */
/* Wait for the DMA transfer to complete */
ret = wait_event_interruptible_timeout( &device_ptr->dma_wait,
&device_ptr->flag_dma_done, HZ*2 );
if (ret == 0)
/* DMA operation timed out */
else if (ret == -ERESTARTSYS )
/* DMA operation interrupted by signal */
else {
/* DMA success */
*transferred += nbytes;
return 0;
}
The interrupt handler is exceptionally brief:
/* Do hardware specific thing to make the device happy */
/* Wake the thread waiting for this DMA operation to complete */
device_ptr->flag_dma_done = 1;
wake_up_interruptible(device_ptr->dma_wait);
Please note that this is just a general approach, I've been working on this driver for the last few weeks and have yet to actually test it... So please, don't treat this pseudo code as gospel and be sure to double check all logic and parameters ;-).
You basically have the right idea: in 2.1, you can just have userspace allocate any old memory. You do want it page-aligned, so posix_memalign() is a handy API to use.
Then have userspace pass in the userspace virtual address and size of this buffer somehow; ioctl() is a good quick and dirty way to do this. In the kernel, allocate an appropriately sized buffer array of struct page* -- user_buf_size/PAGE_SIZE entries -- and use get_user_pages() to get a list of struct page* for the userspace buffer.
Once you have that, you can allocate an array of struct scatterlist that is the same size as your page array and loop through the list of pages doing sg_set_page(). After the sg list is set up, you do dma_map_sg() on the array of scatterlist and then you can get the sg_dma_address and sg_dma_len for each entry in the scatterlist (note you have to use the return value of dma_map_sg() because you may end up with fewer mapped entries because things might get merged by the DMA mapping code).
That gives you all the bus addresses to pass to your device, and then you can trigger the DMA and wait for it however you want. The read()-based scheme you have is probably fine.
You can refer to drivers/infiniband/core/umem.c, specifically ib_umem_get(), for some code that builds up this mapping, although the generality that that code needs to deal with may make it a bit confusing.
Alternatively, if your device doesn't handle scatter/gather lists too well and you want contiguous memory, you could use get_free_pages() to allocate a physically contiguous buffer and use dma_map_page() on that. To give userspace access to that memory, your driver just needs to implement an mmap method instead of the ioctl as described above.
At some point I wanted to allow user-space application to allocate DMA buffers and get it mapped to user-space and get the physical address to be able to control my device and do DMA transactions (bus mastering) entirely from user-space, totally bypassing the Linux kernel. I have used a little bit different approach though. First I started with a minimal kernel module that was initializing/probing PCIe device and creating a character device. That driver then allowed a user-space application to do two things:
Map PCIe device's I/O bar into user-space using remap_pfn_range() function.
Allocate and free DMA buffers, map them to user space and pass a physical bus address to user-space application.
Basically, it boils down to a custom implementation of mmap() call (though file_operations). One for I/O bar is easy:
struct vm_operations_struct a2gx_bar_vma_ops = {
};
static int a2gx_cdev_mmap_bar2(struct file *filp, struct vm_area_struct *vma)
{
struct a2gx_dev *dev;
size_t size;
size = vma->vm_end - vma->vm_start;
if (size != 134217728)
return -EIO;
dev = filp->private_data;
vma->vm_ops = &a2gx_bar_vma_ops;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_private_data = dev;
if (remap_pfn_range(vma, vma->vm_start,
vmalloc_to_pfn(dev->bar2),
size, vma->vm_page_prot))
{
return -EAGAIN;
}
return 0;
}
And another one that allocates DMA buffers using pci_alloc_consistent() is a little bit more complicated:
static void a2gx_dma_vma_close(struct vm_area_struct *vma)
{
struct a2gx_dma_buf *buf;
struct a2gx_dev *dev;
buf = vma->vm_private_data;
dev = buf->priv_data;
pci_free_consistent(dev->pci_dev, buf->size, buf->cpu_addr, buf->dma_addr);
buf->cpu_addr = NULL; /* Mark this buffer data structure as unused/free */
}
struct vm_operations_struct a2gx_dma_vma_ops = {
.close = a2gx_dma_vma_close
};
static int a2gx_cdev_mmap_dma(struct file *filp, struct vm_area_struct *vma)
{
struct a2gx_dev *dev;
struct a2gx_dma_buf *buf;
size_t size;
unsigned int i;
/* Obtain a pointer to our device structure and calculate the size
of the requested DMA buffer */
dev = filp->private_data;
size = vma->vm_end - vma->vm_start;
if (size < sizeof(unsigned long))
return -EINVAL; /* Something fishy is happening */
/* Find a structure where we can store extra information about this
buffer to be able to release it later. */
for (i = 0; i < A2GX_DMA_BUF_MAX; ++i) {
buf = &dev->dma_buf[i];
if (buf->cpu_addr == NULL)
break;
}
if (buf->cpu_addr != NULL)
return -ENOBUFS; /* Oops, hit the limit of allowed number of
allocated buffers. Change A2GX_DMA_BUF_MAX and
recompile? */
/* Allocate consistent memory that can be used for DMA transactions */
buf->cpu_addr = pci_alloc_consistent(dev->pci_dev, size, &buf->dma_addr);
if (buf->cpu_addr == NULL)
return -ENOMEM; /* Out of juice */
/* There is no way to pass extra information to the user. And I am too lazy
to implement this mmap() call using ioctl(). So we simply tell the user
the bus address of this buffer by copying it to the allocated buffer
itself. Hacks, hacks everywhere. */
memcpy(buf->cpu_addr, &buf->dma_addr, sizeof(buf->dma_addr));
buf->size = size;
buf->priv_data = dev;
vma->vm_ops = &a2gx_dma_vma_ops;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_private_data = buf;
/*
* Map this DMA buffer into user space.
*/
if (remap_pfn_range(vma, vma->vm_start,
vmalloc_to_pfn(buf->cpu_addr),
size, vma->vm_page_prot))
{
/* Out of luck, rollback... */
pci_free_consistent(dev->pci_dev, buf->size, buf->cpu_addr,
buf->dma_addr);
buf->cpu_addr = NULL;
return -EAGAIN;
}
return 0; /* All good! */
}
Once those are in place, user space application can pretty much do everything — control the device by reading/writing from/to I/O registers, allocate and free DMA buffers of arbitrary size, and have the device perform DMA transactions. The only missing part is interrupt-handling. I was doing polling in user space, burning my CPU, and had interrupts disabled.
Hope it helps. Good Luck!
I'm getting confused with the direction to implement. I want to...
Consider the application when designing a driver.
What is the nature of data movement, frequency, size and what else might be going on in the system?
Is the traditional read/write API sufficient?
Is direct mapping the device into user space OK?
Is a reflective (semi-coherent) shared memory desirable?
Manually manipulating data (read/write) is a pretty good option if the data lends itself to being well understood. Using general purpose VM and read/write may be sufficient with an inline copy. Direct mapping non cachable accesses to the peripheral is convenient, but can be clumsy. If the access is the relatively infrequent movement of large blocks, it may make sense to use regular memory, have the drive pin, translate addresses, DMA and release the pages. As an optimization, the pages (maybe huge) can be pre pinned and translated; the drive then can recognize the prepared memory and avoid the complexities of dynamic translation. If there are lots of little I/O operations, having the drive run asynchronously makes sense. If elegance is important, the VM dirty page flag can be used to automatically identify what needs to be moved and a (meta_sync()) call can be used to flush pages. Perhaps a mixture of the above works...
Too often people don't look at the larger problem, before digging into the details. Often the simplest solutions are sufficient. A little effort constructing a behavioral model can help guide what API is preferable.
first_page_offset = udata & PAGE_MASK;
It seems wrong. It should be either:
first_page_offset = udata & ~PAGE_MASK;
or
first_page_offset = udata & (PAGE_SIZE - 1)
It is worth mention that driver with Scatter-Gather DMA support and user space memory allocation is most efficient and has highest performance. However in case we don't need high performance or we want to develop a driver in some simplified conditions we can use some tricks.
Give up zero copy design. It is worth to consider when data throughput is not too big. In such a design data can by copied to user by
copy_to_user(user_buffer, kernel_dma_buffer, count);
user_buffer might be for example buffer argument in character device read() system call implementation. We still need to take care of kernel_dma_buffer allocation. It might by memory obtained from dma_alloc_coherent() call for example.
The another trick is to limit system memory at the boot time and then use it as huge contiguous DMA buffer. It is especially useful during driver and FPGA DMA controller development and rather not recommended in production environments. Lets say PC has 32GB of RAM. If we add mem=20GB to kernel boot parameters list we can use 12GB as huge contiguous dma buffer. To map this memory to user space simply implement mmap() as
remap_pfn_range(vma,
vma->vm_start,
(0x500000000 >> PAGE_SHIFT) + vma->vm_pgoff,
vma->vm_end - vma->vm_start,
vma->vm_page_prot)
Of course this 12GB is completely omitted by OS and can be used only by process which has mapped it into its address space. We can try to avoid it by using Contiguous Memory Allocator (CMA).
Again above tricks will not replace full Scatter-Gather, zero copy DMA driver, but are useful during development time or in some less performance platforms.