Currently I have a requirement to support MSI with 2 vectors on my PCI device. Each vector needs to have a different handler routine. The HW document says the following:
vector 0 is for temperature sensor
vector 1 is for power sensor
Below is the driver flow I am following:
1. First enable two vectors using pci_enable_msi_block(pdev, 2).
2. Next assign interrupt handlers using request_irq() (two different irqs, two different interrupt handlers).
int vecs = 2;
int result;
struct pci_dev *pdev = dev->pci_dev;

result = pci_enable_msi_block(pdev, vecs);
Here result is zero, which means the call succeeded in enabling two vectors.
The questions I have are:
The HW document says vector 0; I hope this is not vector 0 of the OS, right? In any case I can't get vector 0 in the OS.
The difficult problem I am facing: when I do request_irq() for the first irq, how do I tell the OS that I need to map this request to vector 0 of the HW? Correspondingly, for the second irq, how do I map it to vector 1 of the HW?
pci_enable_msi_block:
If 2 MSI messages are requested using this function and if the function call returns 0, then 2 MSI messages are allocated for the device and pdev->irq is updated to the lowest of the interrupts assigned to the device.
So pdev->irq and pdev->irq+1 are the new interrupts assigned to the device. You can now register two interrupt handlers:
request_irq(pdev->irq, handler1, ...)
request_irq(pdev->irq+1, handler2, ...)
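Putting both steps together, a minimal sketch could look like the following (the handler names, the "mydev-*" strings and the priv pointer are placeholders; error handling is reduced to the essentials):

#include <linux/pci.h>
#include <linux/interrupt.h>

static irqreturn_t temp_sensor_isr(int irq, void *data)  { /* HW vector 0 */ return IRQ_HANDLED; }
static irqreturn_t power_sensor_isr(int irq, void *data) { /* HW vector 1 */ return IRQ_HANDLED; }

static int my_dev_setup_irqs(struct pci_dev *pdev, void *priv)
{
    int err;

    /* Ask for 2 MSI vectors; a return value of 0 means both were granted. */
    err = pci_enable_msi_block(pdev, 2);
    if (err)
        return err;

    /* pdev->irq is the lowest of the allocated interrupts, i.e. it
     * corresponds to the device's MSI message 0 (temperature sensor). */
    err = request_irq(pdev->irq, temp_sensor_isr, 0, "mydev-temp", priv);
    if (err)
        goto disable_msi;

    /* pdev->irq + 1 corresponds to MSI message 1 (power sensor). */
    err = request_irq(pdev->irq + 1, power_sensor_isr, 0, "mydev-power", priv);
    if (err)
        goto free_first;

    return 0;

free_first:
    free_irq(pdev->irq, priv);
disable_msi:
    pci_disable_msi(pdev);
    return err;
}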
With MSI and MSI-X, the interrupt number (irq) is a CPU "vector". Message signaled interrupts allow the device to write a small amount of data to a special memory-mapped I/O address; the chipset then delivers the corresponding interrupt to a processor.
Most likely there are two different MSI data values that can be written to the MSI address. It's like your hardware supports 2 MSI messages (one for the temperature sensor and one for the power sensor). So when you issue pci_enable_msi_block(pdev, 2);, an interrupt will be asserted by the chipset to the processor whenever either of the two MSI data values is written to that special memory-mapped I/O address (the MSI address).
After the call to pci_enable_msi_block(pdev, 2); you can request two irqs through request_irq(pdev->irq, handler, flags, ...) and request_irq(pdev->irq + 1, handler, flags, ...). So whenever MSI data is written to the MSI address, pdev->irq or pdev->irq + 1 will be asserted depending on which sensor sent the MSI, and the corresponding handler will be invoked.
These two MSI data values are configured into the hardware's MSI data register.
Message signalled interrupts (MSI) is an optional feature that enables PCI devices to request service by writing a system-specified message to a system-specified address (PCI DWORD memory write transaction). The transaction address specifies the message destination while the transaction data specifies the message. System software is expected to initialize the message destination and message during device configuration, allocating one or more non-shared messages to each MSI capable function.
Can an MSI interrupt be routed to multiple CPUs?
For example: echo F > /proc/irq/msi_irq/smp_affinity
In my opinion, there are two possibilities:
An MSI interrupt can be routed to multiple CPUs: when the interrupt message is received, the destination info is used to route the interrupt to multiple CPUs.
An MSI interrupt cannot be routed to multiple CPUs: the MSI message can only be written to one LAPIC, so it can only trigger an interrupt on one CPU.
Which opinion is right?
Yes, MSIs can be dispatched to multiple logical CPUs.
Chapter 10.11 (Advanced Programmable Interrupt Controller / Message Signalled Interrupts) of the Intel SDM vol 3 describes the format of the address destination of an MSI:
The RH (Redirection Hint) and DM (Destination mode) fields change the meaning of the Destination ID field:
• When RH is 0, the interrupt is directed to the processor listed in the Destination ID field.
• When RH is 1 and the physical destination mode is used, the Destination ID field must not be set to FFH; it must point to a processor that is present and enabled to receive the interrupt.
• When RH is 1 and the logical destination mode is active in a system using a flat addressing model, the Destination ID field must be set so that bits set to 1 identify processors that are present and enabled to receive the interrupt.
• If RH is set to 1 and the logical destination mode is active in a system using the cluster addressing model, then the Destination ID field must not be set to FFH; the processors identified with this field must be present and enabled to receive the interrupt.
If by msi_irq in /proc/irq/msi_irq/smp_affinity you mean an IRQ number that uses MSI then yes, setting that pseudo-file to f will limit said IRQ to the four logical CPUs selected.
In fact, as far as interrupt dispatching goes, only interrupts using the legacy PIC cannot have an affinity. IO-APIC interrupts and MSI can always target a group of processors.
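The same restriction can also be applied from kernel code. A minimal sketch, assuming msi_irq is the Linux IRQ number of the MSI (the function name is a placeholder):

#include <linux/interrupt.h>
#include <linux/cpumask.h>

/* Limit the MSI-based IRQ to logical CPUs 0-3, i.e. the same effect as
 * "echo f > /proc/irq/<msi_irq>/smp_affinity" from user space. */
static int restrict_msi_affinity(unsigned int msi_irq)
{
    struct cpumask mask;
    int cpu;

    cpumask_clear(&mask);
    for (cpu = 0; cpu < 4; cpu++)
        cpumask_set_cpu(cpu, &mask);

    return irq_set_affinity(msi_irq, &mask);
}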
What are irq domains? I read the kernel documentation (https://www.kernel.org/doc/Documentation/IRQ-domain.txt), which says:
The number of interrupt controllers registered as unique irqchips
show a rising tendency: for example subdrivers of different kinds
such as GPIO controllers avoid reimplementing identical callback
mechanisms as the IRQ core system by modeling their interrupt
handlers as irqchips, i.e. in effect cascading interrupt controllers.
How can a GPIO controller be called an interrupt controller?
What are Linux irq domains, and why are they needed?
It's documented perfectly in the first paragraph of Documentation/IRQ-domain.txt, so I will assume that you already know it. If not, please ask what is unclear regarding that documentation. The text below explains how to use the IRQ domain API and how it works.
How can a GPIO controller be called an interrupt controller?
Let me answer this question using the max732x.c driver as a reference (driver code). It's a GPIO driver that also acts as an interrupt controller, so it should be a good example of how the IRQ domain API works.
Physical level
To completely understand the further explanation, let's first look into the MAX732x mechanics. Application circuit from the datasheet (simplified for our example):
When there is a change of voltage level on the P0-P7 pins, the MAX7325 will generate an interrupt on its INT pin. The driver (running on the SoC) can read the status of the P0-P7 pins via I2C (SCL/SDA pins) and generate separate interrupts for each of the P0-P7 pins. This is why this driver acts as an interrupt controller.
Consider the following configuration:
"Some device" changes the level on the P4 pin, prompting the MAX7325 to generate an interrupt. The interrupt from the MAX7325 is connected to the GPIO4 IP core (inside the SoC), and it uses line #29 of that GPIO4 module to notify the CPU about the interrupt. So we can say that the MAX7325 is cascaded to the GPIO4 controller. GPIO4 also acts as an interrupt controller, and it's cascaded to the GIC interrupt controller.
Device tree
Let's declare above configuration in device tree. We can use bindings from Documentation/devicetree/bindings/gpio/gpio-max732x.txt as reference:
expander: max7325@6d {
    compatible = "maxim,max7325";
    reg = <0x6d>;
    gpio-controller;
    #gpio-cells = <2>;
    interrupt-controller;
    #interrupt-cells = <2>;
    interrupt-parent = <&gpio4>;
    interrupts = <29 IRQ_TYPE_EDGE_FALLING>;
};
The meaning of the properties is as follows:
the interrupt-controller property declares that this device generates interrupts; it is needed further on so that this node can be used as the interrupt-parent in the "Some device" node
#interrupt-cells defines the format of the interrupts property; in our case it's 2: one cell for the line number and one cell for the interrupt type
the interrupt-parent and interrupts properties describe the interrupt line connection
Let's say we have a driver for the MAX7325 and a driver for "Some device". Both are running on the CPU, of course. In the "Some device" driver we want to request an interrupt for the event when "Some device" changes the level on the P4 pin of the MAX7325. Let's first declare this in the device tree:
some_device: some_device@1c {
    reg = <0x1c>;
    interrupt-parent = <&expander>;
    interrupts = <4 IRQ_TYPE_EDGE_RISING>;
};
Interrupt propagation
Now we can do something like this (in the "Some device" driver):
devm_request_threaded_irq(core->dev, core->gpio_irq, NULL,
some_device_isr, IRQF_TRIGGER_RISING | IRQF_ONESHOT,
dev_name(core->dev), core);
And some_device_isr() will be called each time the level on the P4 pin of the MAX7325 goes from low to high (rising edge). How does it work? From left to right, if you look at the picture above:
"Some device" changes level on P4 of MAX7325
MAX7325 changes level on its INT pin
GPIO4 module is configured to catch such a change, so it's generates interrupt to GIC
GIC notifies CPU
All those actions happen at the hardware level. Let's see what happens at the software level. It actually goes backwards (from right to left on the picture):
the CPU is now in interrupt context in the GIC interrupt handler. From gic_handle_irq() it calls handle_domain_irq(), which in turn calls generic_handle_irq(). See Documentation/gpio/driver.txt for details. Now we are in the SoC's GPIO controller IRQ handler.
the SoC's GPIO driver also calls generic_handle_irq() to run the handler that is set for each particular pin. See, for example, how it's done in omap_gpio_irq_handler(). Now we are in the MAX7325 IRQ handler.
the MAX7325 IRQ handler (here) calls handle_nested_irq(), so that all IRQ handlers of devices connected to the MAX7325 (the "Some device" IRQ handler, in our case) will be called in the max732x_irq_handler() thread
finally, the IRQ handler of the "Some device" driver is called
IRQ domain API
The GIC driver, the GPIO driver and the MAX7325 driver all use the IRQ domain API to represent themselves as interrupt controllers. Let's take a look at how it's done in the MAX732x driver. It was added in this commit. It's easy to figure out how it works just by reading the IRQ domain documentation and looking at this commit. The most interesting part of that commit is this line (in max732x_irq_handler()):
handle_nested_irq(irq_find_mapping(chip->gpio_chip.irqdomain, level));
irq_find_mapping() will find the Linux IRQ number from the hardware IRQ number (using the IRQ domain mapping function). Then the handle_nested_irq() function will be called, which will run the IRQ handler of the "Some device" driver.
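To make the mapping side more concrete, here is a minimal sketch of how a cascaded interrupt controller driver could create an IRQ domain and establish hwirq-to-Linux-IRQ mappings (my_irq_map, my_domain_ops and my_chip_setup_domain are hypothetical names; real drivers such as max732x.c typically get this via gpiolib helpers and use their own irq_chip instead of dummy_irq_chip):

#include <linux/irq.h>
#include <linux/irqdomain.h>
#include <linux/of.h>

#define MY_NR_PINS 8

/* Called by the IRQ core when a mapping is created for a hwirq. */
static int my_irq_map(struct irq_domain *d, unsigned int virq,
                      irq_hw_number_t hwirq)
{
    irq_set_chip_data(virq, d->host_data);
    irq_set_chip_and_handler(virq, &dummy_irq_chip, handle_simple_irq);
    return 0;
}

static const struct irq_domain_ops my_domain_ops = {
    .map   = my_irq_map,
    .xlate = irq_domain_xlate_twocell,   /* matches #interrupt-cells = <2> */
};

static int my_chip_setup_domain(struct device_node *np, void *chip)
{
    struct irq_domain *domain;
    int hwirq;

    /* One linear domain with MY_NR_PINS hardware IRQ numbers (0..7). */
    domain = irq_domain_add_linear(np, MY_NR_PINS, &my_domain_ops, chip);
    if (!domain)
        return -ENOMEM;

    /* Pre-create the Linux IRQ numbers; irq_find_mapping() can then
     * translate hwirq -> virq inside the cascaded interrupt handler. */
    for (hwirq = 0; hwirq < MY_NR_PINS; hwirq++)
        irq_create_mapping(domain, hwirq);

    return 0;
}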
GPIOLIB_IRQCHIP
Since many GPIO drivers are using IRQ domain in the same way, it was decided to extract that code to GPIOLIB framework, more specifically to GPIOLIB_IRQCHIP. From Documentation/gpio/driver.txt:
To help out in handling the set-up and management of GPIO irqchips and the
associated irqdomain and resource allocation callbacks, the gpiolib has
some helpers that can be enabled by selecting the GPIOLIB_IRQCHIP Kconfig
symbol:
gpiochip_irqchip_add(): adds an irqchip to a gpiochip. It will pass
the struct gpio_chip* for the chip to all IRQ callbacks, so the callbacks
need to embed the gpio_chip in its state container and obtain a pointer
to the container using container_of().
(See Documentation/driver-model/design-patterns.txt)
gpiochip_set_chained_irqchip(): sets up a chained irq handler for a
gpio_chip from a parent IRQ and passes the struct gpio_chip* as handler
data. (Notice handler data, since the irqchip data is likely used by the
parent irqchip!) This is for the chained type of chip. This is also used
to set up a nested irqchip if NULL is passed as handler.
This commit converts the MAX732x driver from the raw IRQ domain API to the GPIOLIB_IRQCHIP API.
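As a rough illustration of those helpers (a sketch only; my_gpio_irq_chip and my_gpio_setup_irqchip are made-up names, and the exact signatures have varied between kernel versions), a GPIO expander driver's probe path could wire things up like this:

#include <linux/gpio/driver.h>
#include <linux/irq.h>

static struct irq_chip my_gpio_irq_chip = {
    .name = "my-gpio-expander",
    /* real drivers fill in .irq_mask/.irq_unmask/.irq_set_type here */
};

static int my_gpio_setup_irqchip(struct gpio_chip *gc, int parent_irq)
{
    int ret;

    /* Let gpiolib create the irqdomain and the per-pin mappings for us. */
    ret = gpiochip_irqchip_add(gc, &my_gpio_irq_chip, 0,
                               handle_simple_irq, IRQ_TYPE_NONE);
    if (ret)
        return ret;

    /* Passing NULL as the handler sets up a nested (threaded) irqchip,
     * which suits a slow-bus (I2C) expander like the MAX732x. */
    gpiochip_set_chained_irqchip(gc, &my_gpio_irq_chip, parent_irq, NULL);

    return 0;
}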
Next questions
Further discussion is here:
part 2
part 3
Here's a comment I found in include/linux/irqdomain.h:
Interrupt controller "domain" data structure. This could be defined as
a irq domain controller. That is, it handles the mapping between
hardware and virtual interrupt numbers for a given interrupt domain.
The actual structure I think it's referring to there is irq_domain.
On the mainboard we have an interrupt controller (IRC) which acts as a multiplexer between the devices that can raise an interrupt and the CPU:
           |-----------|         |--------|
-(0)-------|           |         |        |
-(...)-----|    IRC    |---------|  CPU   |
-(15)------|           |         |        |
           |-----------|         |--------|
Every device is associated with an IRQ (the number on the left). After every instruction the CPU senses the interrupt-request line. If a signal is detected, a state save is performed and the CPU loads an interrupt handler routine, which is found through the interrupt vector table located at a fixed address in memory. As far as I can see, the IRQ number and the vector number in the interrupt vector table are not the same: my network card, for example, is registered to IRQ 8, and on an Intel Pentium processor vector 8 points to a routine used to signal an error condition, so there must be a mapping somewhere which points to the correct handler.
Questions:
1) If I write a device driver and register an IRQ X for it, how does the system know which device should be handled? I can, for example, use request_irq() with IRQ number 10, but how does the system know whether the handler should be used for the mouse, the keyboard, or whatever I am writing the driver for?
2) What does the interrupt vector table look like then? I mean, if I use IRQ 10 for my device, this would overwrite a standard handler used for error handling in the table (the first usable one is 32 according to Silberschatz, Operating System Concepts).
3) Who initially sets the IRQs? The BIOS? The OS?
4) Who is responsible for matching the IRQ to the offset in the interrupt vector table?
5) It is possible to share IRQs. How is that possible? There are hardware lines on the mainboard which connect devices to the interrupt controller. How can two lines be configured to the same interrupt? There must be a table which says, e.g., lines 2 and 3 trigger IRQ 15. Where does this table reside and what is it called?
Answers with respect to the Linux kernel. They should hold for most other OSes as well.
1) If I write a device driver and register an IRQ X for it, how does the system know which device should be handled? I can, for example, use request_irq() with IRQ number 10, but how does the system know whether the handler should be used for the mouse, the keyboard, or whatever I am writing the driver for?
There is no single answer to it. For example, if this is a custom embedded system, the hardware designer will tell the driver writer "I am going to route device x to irq y". For more flexibility, for example for a network card which generally uses the PCI protocol, there is hardware/firmware-level arbitration that assigns an irq number to a new device when it is detected. This is then written to one of the PCI configuration registers. The driver first reads this register and then registers its interrupt handler for that particular irq. There are similar mechanisms for other protocols.
What you can do is look up calls to request_irq() in the kernel code and see how each driver obtained its irq value. It will be different for each kind of driver.
The answer to this question is thus: the system doesn't know. The hardware designer or the hardware protocols provide this information to the driver writer, and the driver writer then registers a handler for that particular irq, telling the system what to do when it sees that irq.
2) What does the interrupt vector table look like then? I mean, if I use IRQ 10 for my device, this would overwrite a standard handler used for error handling in the table (the first usable one is 32 according to Silberschatz, Operating System Concepts).
Good question. There are two parts to it.
a) When you call request_irq(irq, handler), the system really doesn't program entry irq in the IVT or IDT, but entry N + irq, where N is the number of error handlers or general-purpose exceptions supported on that CPU. Details vary from system to system.
b) What happens if you erroneously request an irq which is already used by another driver? You get an error, and the IDT is not programmed with your handler.
Note: IDT is the interrupt descriptor table.
3) Who initially sets the IRQs? The BIOS? The OS?
The BIOS first, and then the OS. There are certain OSes, for example MS-DOS, which don't reprogram the IVT set up by the BIOS. More sophisticated modern OSes like Windows or Linux do not want to rely on particular BIOS functions, so they re-program the IDT. But the BIOS has to do it initially; only then does the OS come into the picture.
4) Who is responsible for matching the IRQ to the offset in the interrupt vector table?
I am really not clear what you mean. The flow is like this: first your device is assigned an irq number, and then you register a handler for it with that irq number. If you use the wrong irq number and then enable interrupts on your device, the system will crash, because the handler is registered for the wrong irq number.
5) It is possible to share IRQs. How is that possible? There are hardware lines on the mainboard which connect devices to the interrupt controller. How can two lines be configured to the same interrupt? There must be a table which says, e.g., lines 2 and 3 trigger IRQ 15. Where does this table reside and what is it called?
This is a very good question. An extra table is not how it is solved in the kernel. Rather, for each shared irq the handlers are kept in a linked list of function pointers. The kernel loops through all the handlers and invokes them one after another until one of the handlers claims the interrupt as its own.
The (pseudo-)code looks like this:
driver1:
    d1_int_handler:
        if (device_interrupted()) {      /* this reads the hardware */
            do_interrupt_handling();
            return MY_INTERRUPT;
        } else {
            return NOT_MY_INTERRUPT;
        }

driver2:
    similar to driver1

kernel:
    do_irq(irq n)
    {
        if (shared_irq(n)) {
            ret = NOT_MY_INTERRUPT;
            irq_chain = get_chain(n);
            while (irq_chain) {
                if ((ret = irq_chain->handler()) == MY_INTERRUPT)
                    break;
                irq_chain = irq_chain->next;
            }
            if (ret != MY_INTERRUPT)
                error "None of the drivers accepted the interrupt";
        }
    }
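In real Linux driver terms the same pattern is expressed with IRQF_SHARED and the IRQ_HANDLED/IRQ_NONE return values. A minimal sketch (struct my_dev, my_dev_pending() and the register offset are assumptions, not real driver code):

#include <linux/interrupt.h>
#include <linux/io.h>

struct my_dev {
    void __iomem *regs;
};

/* Hypothetical check of a device status register. */
static bool my_dev_pending(struct my_dev *dev)
{
    return readl(dev->regs + 0x04) & 0x1;   /* offset 0x04, bit 0: assumed "IRQ pending" */
}

static irqreturn_t my_shared_isr(int irq, void *data)
{
    struct my_dev *dev = data;

    /* Check whether our hardware actually raised this interrupt. */
    if (!my_dev_pending(dev))
        return IRQ_NONE;        /* not ours: the kernel tries the next handler in the chain */

    /* ... acknowledge and handle the interrupt here ... */
    return IRQ_HANDLED;         /* claimed: the kernel stops iterating */
}

static int my_dev_setup(struct my_dev *dev, unsigned int irq)
{
    /* IRQF_SHARED allows several drivers to hook the same irq line;
     * a unique dev pointer is mandatory so free_irq() can tell them apart. */
    return request_irq(irq, my_shared_isr, IRQF_SHARED, "my_dev", dev);
}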
I am writing a device driver to handle interrupts for a PCIe card, which currently works for any interrupt vector raised on the IRQ line.
But it has a few types that can be raised, flagged by the Vector register. So now I need to read the vector information and be a bit cleverer...
So, do I :-
1/ Have separate dev nodes /dev/int1, /dev/int2, etc. for each interrupt type, and just document that int1 is for vector type A, etc.?
1.1/ As each file/char device will have its own minor number, when opened I'll know which is which, I think.
1.2/ LDD3 seems to demo this method.
2/ Have one node /dev/int (as I do now) and have multiple processes hanging off the same read method? Sounds better?!
2.1/ Then only wake the correct process up...?
2.2/ Do I use separate wait_queue_head_t wait_queues? Or different flag/test conditions?
In the read method:-
wait_event_interruptible(wait_queue, flag);
In the handler (not real code!):-
int vector = read_vector();
if vector = A then
wake_up_interruptible(wait_queue, flag)
return IRQ_HANDLED;
else
return IRQ_NONE/IRQ_RETVAL?
EDIT: notes from people's comments:-
1) My user-space code mmap()s all of the PCIe firmware registers.
2) The user-space code has a few threads, each performing a blocking read on the device driver's device nodes, which then returns data from the firmware when an interrupt occurs. I need the correct thread woken up depending on the interrupt type.
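For reference, the user-space side described in the notes above can be as simple as the following sketch (the /dev/int1 path and the 32-bit payload are placeholders):

#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* One of these threads is started per device node / interrupt type. */
static void *int_wait_thread(void *arg)
{
    const char *node = arg;          /* e.g. "/dev/int1" */
    uint32_t data;
    int fd = open(node, O_RDONLY);

    if (fd < 0)
        return NULL;

    for (;;) {
        /* Blocks in the driver's read() until the matching interrupt fires. */
        if (read(fd, &data, sizeof(data)) != sizeof(data))
            break;
        printf("%s: interrupt, data=0x%08x\n", node, data);
    }
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, int_wait_thread, (void *)"/dev/int1");
    pthread_join(t, NULL);
    return 0;
}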
I am not sure I understand correctly what you mean by the Vector register (a pointer to some documentation would help me be more precise for your case).
Anyway, any PCI device gets a unique interrupt number (given by the BIOS, or by some firmware on architectures other than x86). You just need to register this interrupt in your driver:
priv->name = DRV_NAME;
err = request_irq(pdev->irq, your_irqhandler, IRQF_SHARED, priv->name,
                  pdev);
if (err) {
    dev_err(&pdev->dev, "cannot request IRQ\n");
    goto err_out_unmap;
}
One other thing that I do not really understand is why you would export your interrupts as a dev node: interrupts are certainly something that needs to remain in your driver/kernel code. But I guess here you want to export a device that is then accessed from userspace; I just find /dev/int not to be a good name.
For your question about multiple dev nodes: if your different interrupt sources provide access to different hardware resources (even if on the same PCI board), I would go for option 1), with a wait queue for each device. Otherwise, I would go for option 2).
Since your interrupts are coming from the same physical device, whichever option you choose, the interrupt line will have to be shared and you will have to read the vector in your interrupt handler to determine which hardware resource raised the interrupt.
For option 1), it would be something like this:
static irqreturn_t pex_irqhandler(int irq, void *dev)
{
    struct pci_dev *pdev = dev;
    u8 myirq;
    int result;

    result = pci_read_config_byte(pdev, PCI_INTERRUPT_LINE, &myirq);
    if (result == PCIBIOS_SUCCESSFUL) {
        /* read_vector(), set_flagA()/set_flagB(), A, B, flag and
         * wait_queue are device-specific placeholders. */
        int vector = read_vector();

        if (vector == A)
            set_flagA(flag);
        else if (vector == B)
            set_flagB(flag);

        wake_up_interruptible(&wait_queue);
        return IRQ_HANDLED;
    } else {
        return IRQ_NONE;
    }
}
For option 2, it would be similar, but you would have only one if clause (for the respective vector value) in every different interrupt handler that you would request for every node.
If you have different channels you can read() from, then you should definitely use different minor numbers. Imagine you had a card with four serial ports; you would definitely want four /dev/ttySx nodes.
But does your device fit with this model?
First, I assume you're not trying to get your code into the mainline kernel. If you are, expect a vigorous discussion about the best way to do this. If you're writing a simple interrupt handling driver for a card which is mostly driven by mmap from user-space, there are a lot of ways to solve this problem.
If you use multiple device nodes (option 1), you can also implement poll so that a single application can open multiple device nodes and wait for a selection of interrupts. The minor number will be sufficient to tell them apart. If you have a wait queue for each vector, you can wake only the relevant listeners. You'll need to latch the vector after a successful poll to be sure that the read succeeds.
If you use a single device node (option 2), you'll need to add some extra magic so that the threads can register their interest in particular interrupt vectors. You could do this with an ioctl, or have the threads write the interrupt vectors to the device. Each thread should open the device node to get its own file descriptor. You can then associate the list of requested vectors with each open file descriptor. As a bonus, you can let the application read the interrupt vector from the device, so it knows which one happened.
You'll need to think about how the interrupt gets cleared. The interrupt handler will need to remove the interrupt, then store the result so it can be passed to user-space. You might find a kfifo useful for this rather than a wait queue. If you have a fifo for each open file descriptor, you can distribute the interrupt notifications to each listening application.
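As a rough sketch of the read side for option 1 (one wait queue and one pending flag per vector; struct my_vec and the single latched value passed back to user space are assumptions, and locking is omitted for brevity):

#include <linux/fs.h>
#include <linux/types.h>
#include <linux/uaccess.h>
#include <linux/wait.h>

struct my_vec {
    wait_queue_head_t waitq;   /* one per interrupt vector / minor */
    int pending;               /* set by the interrupt handler */
    u32 latched;               /* value latched by the handler */
};

static ssize_t my_int_read(struct file *file, char __user *buf,
                           size_t count, loff_t *ppos)
{
    struct my_vec *vec = file->private_data;   /* set in open() per minor */
    u32 value;

    /* Sleep until the handler for this vector sets pending and wakes us. */
    if (wait_event_interruptible(vec->waitq, vec->pending))
        return -ERESTARTSYS;

    vec->pending = 0;
    value = vec->latched;

    if (count < sizeof(value))
        return -EINVAL;
    if (copy_to_user(buf, &value, sizeof(value)))
        return -EFAULT;

    return sizeof(value);
}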
As I understood after reading the chapter on the Linux Device Model in Linux Device Drivers, 3rd Edition, when a new device is configured, the kernel (2.6) follows more or less this sequence:
The device is registered in the driver core (device_register(), which includes device initialization)
A kobject is registered in the device model
This creates an entry in sysfs and triggers a hotplug event
The bus's drivers are checked to see which one matches the device
Probe
The device is bound to the driver
My main question is: in step 1, when is device_register() called, and which fields should already be set in the device struct?
Is it called by the bus to which the device is connected? Any example in the code?
Have I misunderstood anything? :)
The PCI hotplug code calls pci_do_scan_bus() to go through all slots, see if we find a device/bridge, and add them to our device tree:
unsigned int __devinit pci_do_scan_bus(struct pci_bus *bus)
{
    unsigned int max;

    max = pci_scan_child_bus(bus);   /* scan the bus for all slots and the devices in them */
    pci_bus_add_devices(bus);        /* add what we find */
    ...
    return max;
}
The fields in struct device are actually filled in as part of the call to pci_scan_child_bus(). Here's the call graph (sort of :)):
pci_scan_child_bus > pci_scan_slot (scan for slots on the bus) > pci_scan_single_device > pci_device_add > device_initialize.
Note that device_initialize() is the first part of device_register(). You will see that the fields of struct device are filled in in pci_device_add() after the call to device_initialize(). You can find it in drivers/pci/probe.c in the kernel sources. The struct pci_dev will also be filled in, and will later be used by the device-specific driver.
The actual addition of the kobject to the device hierarchy happens in pci_bus_add_devices(). Here's the call graph:
pci_bus_add_devices > pci_bus_add_device > device_add.
As you can see, this call flow completes the second part of the device_register() function.
So, in short, device_register() consists of: 1. initialize the device, and 2. add the device.
pci_device_add() does step 1 and pci_bus_add_device() does step 2.
Files of interest: drivers/pci/{pci.c, bus.c, probe.c}
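This split mirrors what device_register() itself does in the kernel sources; simplified, it is essentially these two calls:

/* drivers/base/core.c (simplified) */
int device_register(struct device *dev)
{
    device_initialize(dev);   /* part 1: set up the kobject, lists and locks */
    return device_add(dev);   /* part 2: add to the hierarchy, create sysfs entries, emit uevent */
}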
In struct bus_type there is a pointer to a match function, whose job is to match a driver with a device. So when a device is associated with a bus, as soon as the device is added to the bus, it is the responsibility of the bus to find a matching driver for it.
Please correct me if that is not the case.
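For illustration, here is a minimal sketch of what such a match callback can look like for a hypothetical bus (my_bus and the name-comparison matching policy are made up; real buses like PCI match against vendor/device ID tables instead):

#include <linux/device.h>
#include <linux/string.h>

/* Hypothetical bus: a driver matches a device when their names are equal. */
static int my_bus_match(struct device *dev, struct device_driver *drv)
{
    return strcmp(dev_name(dev), drv->name) == 0;
}

static struct bus_type my_bus_type = {
    .name  = "my_bus",
    .match = my_bus_match,
};

/* The driver core calls my_bus_match() for every (device, driver) pair on this
 * bus; when it returns nonzero, the core goes on to call the driver's probe(). */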