Can an MSI interrupt be routed to multiple CPUs? - linux

Message signalled interrupts (MSI) is an optional feature that enables PCI devices to request service by writing a system-specified message to a system-specified address (PCI DWORD memory write transaction). The transaction address specifies the message destination while the transaction data specifies the message. System software is expected to initialize the message destination and message during device configuration, allocating one or more non-shared messages to each MSI capable function.
Can an MSI interrupt be routed to multiple CPUs?
For example: echo F > /proc/irq/msi_irq/smp_affinity
I can see two opinions:
An MSI interrupt can be routed to multiple CPUs: when a CPU receives the interrupt message, the destination info in it is used to route the interrupt to multiple CPUs.
An MSI interrupt cannot be routed to multiple CPUs: the MSI message can only be written to one LAPIC, so it can only trigger an interrupt on one CPU.
Which opinion is right?

Yes, MSIs can be dispatched to multiple logical CPUs.
Chapter 10.11 (Advanced Programmable Interrupt Controller / Message Signalled Interrupts) of the Intel SDM vol 3 describes the format of the address destination of an MSI:
The RH (Redirection Hint) and DM (Destination mode) fields change the meaning of the Destination ID field:
• When RH is 0, the interrupt is directed to the processor listed in the Destination ID field.
• When RH is 1 and the physical destination mode is used, the Destination ID field must not be set to FFH;
it must point to a processor that is present and enabled to receive the interrupt.
• When RH is 1 and the logical destination mode is active in a system using a flat addressing model, the Destination ID field must be set so that bits set to 1 identify processors that are present and enabled to
receive the interrupt.
• If RH is set to 1 and the logical destination mode is active in a system using cluster addressing model,
then Destination ID field must not be set to FFH; the processors identified with this field must be
present and enabled to receive the interrupt.
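The MSI message address layout quoted above from the SDM can be sketched as a small bit-packing helper. This is only an illustration of the field positions (bits [31:20] fixed at 0xFEE, Destination ID in [19:12], RH at bit 3, DM at bit 2), not real driver code:

```python
# Sketch of the MSI Message Address layout from the Intel SDM (vol 3):
# bits [31:20] = 0xFEE, [19:12] = Destination ID, bit 3 = RH
# (Redirection Hint), bit 2 = DM (Destination Mode).

def msi_address(dest_id: int, rh: int, dm: int) -> int:
    """Compose a 32-bit MSI message address."""
    assert 0 <= dest_id <= 0xFF
    return (0xFEE << 20) | (dest_id << 12) | (rh << 3) | (dm << 2)

def decode_msi_address(addr: int):
    """Split an MSI message address back into its fields."""
    return {
        "fixed": addr >> 20,             # must be 0xFEE
        "dest_id": (addr >> 12) & 0xFF,
        "rh": (addr >> 3) & 1,
        "dm": (addr >> 2) & 1,
    }

# RH=1, DM=1 (logical mode, flat model): each set bit in dest_id
# selects one logical CPU, so a single MSI can target a group.
addr = msi_address(dest_id=0b00001111, rh=1, dm=1)
print(hex(addr))  # 0xfee0f00c
```

With RH=1/DM=1 in the flat model, the dest_id above names four logical CPUs at once, which is exactly why the first opinion in the question is the correct one.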
If by msi_irq in /proc/irq/msi_irq/smp_affinity you mean an IRQ number that uses MSI then yes, setting that pseudo-file to f will limit said IRQ to the four logical CPUs selected.
In fact, as far as interrupt dispatching goes, only interrupts using the legacy PIC cannot have an affinity. IO-APIC interrupts and MSI can always target a group of processors.
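The smp_affinity pseudo-file takes a hex bitmask with one bit per logical CPU. A trivial helper makes the f-means-CPUs-0-to-3 correspondence explicit (the IRQ path in the comment is only illustrative):

```python
# smp_affinity is a hex bitmask of logical CPUs: bit n set = CPU n allowed.

def affinity_mask(cpus):
    """Bitmask with one bit set per logical CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

print(format(affinity_mask([0, 1, 2, 3]), 'x'))  # f -> CPUs 0-3
# echo f > /proc/irq/<irq>/smp_affinity would then apply it (as root).
```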

Related

How are MMIO, IO and PCI configuration request routed and handled by the OS in a NUMA system?

TL;DR
How are MMIO, IO and PCI configuration requests routed to the right node in a NUMA system?
Each node has a "routing table" but I'm under the impression that the OS is supposed to be unaware of it.
How can an OS remap devices if it cannot change the "routing table"?
For a proper introduction to the "routing table", i.e. the Source Address Decoders (SADs), refer to Physical Address Decoding in Intel Xeon v3/v4 CPUs: A Supplemental Datasheet.
I'll first try to recap what I've gathered from papers and barely documented datasheets. Unfortunately this lengthens the question, and I may not have all the pieces.
When a request egresses the LLC1, the uncore needs to know where to route it.
On workstation CPUs the targets are either the DRAM, one of the PCIe root ports/integrated devices, or the DMI interface.
The uncore can easily tell if a memory request belongs to the DRAM thanks to the iMC registers2, or to one of the PCIe root ports3/integrated devices, and will eventually fall back to DMI.
This, of course, also includes MMIO and is almost identical for port-mapped IO (which only skips the DRAM check).
PCI configuration requests (CFG) are routed as per specification with the only caveat that CFG requests on bus 0, not targeting the integrated devices, are sent down the DMI interface4.
On a server CPU the target of a physical address can be off-socket.
A table is used to look up a node id (NID)5. This table is called the SAD.
Actually, the SAD is made of two decoders: the DRAM decoder (which uses a table) and the IO decoder6 (which is made of mostly fixed ranges and enable bits, but should also contain tables).
The IO decoder overrides the DRAM decoder if needed.
The DRAM decoder works with a list of ranges, each associated with a list of target NIDs7.
If a memory, MMIO or PCIe memory-mapped configuration request (MMCFG) matches a range, the uncore will send the request along the QPI/UPI path to the selected target (it's unclear if the SAD can target the requestor node itself).
The IO decoder works either with enable bits for fixed ranges with fixed targets (e.g. reclaiming the legacy BIOS windows to the "BIOS NID") or with variable ranges where part of the address is used to index a list of targets.
The port mapped IO destinations are looked up in a table depicted as IO[yyy].
The MMCFG IO destinations are looked up in a table named PCI[zzz].
The CFG destinations reuse the PCI[zzz] table.
Here yyy and zzz denote the index function, i.e. a part of the request address ( zzz is the bus number for CFG).
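The index functions can be sketched concretely. The bit positions below are assumptions drawn from the examples later in this answer (IO_Addr[15:13] for the IO[yyy] table, addr[27:25], i.e. the top bus-number bits, for PCI[zzz]), so treat them as illustrative, not as documented behaviour:

```python
# Hypothetical index extraction for the IO[yyy] and PCI[zzz] target lists.
# yyy = IO_Addr[15:13] for port IO; zzz = the top 3 bus-number bits of an
# MMCFG offset (bits [27:25]). Both bit positions are assumptions.

def io_target_index(io_port: int) -> int:
    return (io_port >> 13) & 0b111       # IO[yyy]

def mmcfg_target_index(phys_addr: int, mmcfg_base: int) -> int:
    offset = phys_addr - mmcfg_base      # offset into the 256MiB MMCFG window
    return (offset >> 25) & 0b111        # PCI[zzz]: bus[7:5]

# Port 0x2020 -> index 1, i.e. "socket 1's" 8KiB IO window in this model.
print(io_target_index(0x2020))  # 1
```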
All this makes sense to me, but these tables are not documented in the datasheets, so a symbol like PCI[zzz] may actually mean something entirely different.
While there is little to no documentation on these decoders, it's enough for a primitive mental model.
It is still not clear to me if the SADs are used even for requests that target local resources, or only for egressing requests.
This will be important later.
Let's say that when a request leaves the LLC the SADs are used to route it to a component (possibly in the same socket) and then it is handled similarly to the workstation case8.
As long as the hardware configuration isn't changed, the SADs can be configured by the firmware and the OS can be totally agnostic of them.
But what happens if the OS remaps a PCIe device that is behind a node's local PCIe link?
For a MMIO request to reach such a device it must first reach the device's node (since it is the only link with the rest of the system) but this can only happen if the SADs are properly reconfigured.
Reconfiguring of the SADs may be required even for requests originating from the same node as the device9.
But the OS is supposed to be unaware of the SAD, or isn't it?
I have a few possible answers:
The SADs are not used for accessing local resources, and the OS restricts access to local IO to processes (or the kernel) running in the parent node. This avoids the need to reconfigure the SADs.
The firmware configures the SADs so that each node has a portion of the address space (say 2^k / N, where k is the size of the physical address space in bits and N is the number of nodes) and reports that in the SRAT ACPI table. (But isn't the SRAT optional?)10 The OS then allocates MMIO resources only within each node's memory portion. This may lead to suboptimal memory use.
The SADs are an optimisation: if the uncore doesn't know where to route a request, it will pass it to the QPI/UPI link(s) until it is sunk by a node.
Everything above is wrong.
1 e.g. due to a miss or due to being UC.
2 e.g. TOUM for the maximum physical address reclaimable by the DRAM, though it's not a contiguous block.
3 Which are PCI-to-PCI (P2P) bridges and have the registers to set the IO, prefetchable and non-prefetchable memory windows.
4 That's why PCH devices appear on bus 0. Server CPUs have two "internal" busses and their numbers can be changed.
5 A node id is made of a socket number and an uncore component id. The uncore components I'm aware of that can be targets are (after name translation from the "box" nomenclature): the first or second Home agent (iMC), the System agent and the DMI link.
6 The IO decoder table is split into IOS (IO Small decoder) and IOL (IO Large decoder). This reflects the hardware capability of the two tables, with the IOS being almost fixed and the IOL being a CAM. Both are consulted in parallel with the DRAM table, the IOS overrides the IOL if both matches.
7 The range is automatically interleaved (i.e. sub-partitioned) between all of the eight targets. To use fewer than eight targets, duplicate entries can be used (e.g. all entries set to the same target is the same as no interleaving).
8 I wonder what happens if the SADs route a request to an uncore component (say the iMC) but outside its reclaimed range? I guess it is discarded.
9 See bold part above, I don't know how the SAD work with requests targeting local resources.
10 Linux fakes a NUMA node on UMA machines. In my box the quantity of memory allocated to the node includes MMIO (almost 10GiB allocated vs 8GiB of DRAM). It seems the whole range returned by the E820 is used.
This answer uses elements from the Xeon 5500 and 7500 datasheets (which use a ring and an Rbox to connect to the rest of the system), but it speaks about a system whose ring bus architecture resembles the IvB-EX Xeon e7 v2 (which uses 2 rings); it also applies to Haswell-EX (which uses 4 rings with an interconnect between the 2), SnB and Broadwell. I presume each ring is still split into 4 32-byte bidirectional rings: snoop/invalidate, data/block, request/address, acknowledge.
Xeon e7 v2:
On the falling edge of the reset signal, the package instantly determines the NodeID and adjacent NodeIDs, x2APIC cluster ID and logical processor APIC IDs (before SnB on desktop CPUs it used to sample MCH signals on input pins to do this; I'm not sure about multisocket since SnB). After this, the caches and TLBs are flushed, cores perform BISTs and then a multisocket MP initialisation algorithm occurs. Xeon D-1500 vol2 shows a CPUNODEID register; I'm not sure what sort of behaviour arises when changing that post-reset regarding in-flight transactions. Are SAD target lists changed via automatically generated writes when CPUNODEID changes on other cores, or does it trigger SMI handler or is this the task of the programmer? IDK. Presumably, all agents would have to be Quiesced to stop them accessing the Cbo before the SAD config can be changed.
Caching Agent
Each socket has a caching agent and a home agent (or 2 CAs and 2 HAs on MCC/HCC). The functionality of the caching agent is address-hashed across Cbo slices (before SnB, the caching agent was the combination of the Sbox and the 50% of Cboxes that were associated with it, where the Sbox was the QPI interface that picked up Cbox messages from the ring and converted them to QPI messages directly to the Rbox, and the Cbox was the interface between the LLC slice and the ring bus; after SnB, the Cbo implements the functionality of the Sbox and the Cbox and sends QPI/IDI messages on one ring interface). The mapping is set at boot time because that's when the snooping mode is set. On Haswell-EX, the snoop mode can be toggled in the BIOS between no snoop, home snoop, early snoop, source snoop, HS w. Directory + OSB + HitME cache or COD, depending on the CPU; although I am yet to see any datasheet with PCIe configuration registers related to snoop modes. I'm not sure what behaviour arises for in-flight transactions when this is changed post-reset. When a core is disabled by BIOS, it does not impact the Cbos, which continue to function. The Caching Agent was a separate component on Nehalem-EX (see the 7500 datasheet), but on later architectures it is used to refer to the Cbos as a whole. COD (Cluster on Die), introduced on Haswell-EP, splits the caching agent into 2 per socket.
CBo
Within each cache slice there is a cache box which is a controller and an agent between the LLC and the rest of the system, basically the LLC controller. The Cbo contains the Table Of Requests that holds all pending transactions. The Cbo supports three types of transactions: 1. Core/IIO initiated requests 2. Intel QPI external snoops 3. LLC capacity eviction. Each transaction has an associated entry in the TOR. The TOR entry holds information required by the Cbo to uniquely identify the request (e.g. address and transaction type) and state elements required to track the current status of the transaction.
Core/PCIe requests are address hashed to select a Cbo to translate and place the request onto the ring. The core knows what direction on the ring to send the request for smallest latency so it would be configured with the same address map and it waits for a vacancy on the ring to accommodate the transaction.
Core originated requests use the IDI packet interface. IDI (Intra-Die-Interconnect) is shown on Broadwell-EX. The diagram implies cores only use IDI packets and CBos use QPI and IDI. Xeon e7 v2 opcodes are shown on Table 2-18 and QPI opcodes on Table 2-218. This also prevents something like IIO claiming an address before it is decoded, because it only accepts QPI opcode. On some diagrams, Ubox is pictured to share a stop with IIO; on others they are separate.
When the vacancy arrives, the bidirectional stop logic it shares with the Cbo will place the 32 byte flit into the request ring and if the Cbo needs to send to the same direction at the same time then the transaction with a target that will be in correct polarity at the target is selected, otherwise some other form of arbitration is used. I presume the core can write to both directions in the cycle and the Cbo on the stop can read from both, or the Cbo and Core can write one and read one depending on the arbitration. I would imagine the link layer header of the IDI flit contains source, dest IDs allowing for a higher layer transaction (transport layer) to be split up and arrive non-contiguously as and when free gaps on the ring appear and so it knows whether to continue buffering the transaction and remove the flit from the ring. (Ring IDI/QPI is different to QPI which doesn't have link layer source/dest, because QPI is point to point; also QPI only has a 80 bit flit which is easy to find a diagram of; not the case for ring bus headers). The Cbo implements the QPI link layer credit/debit scheme per destination to stop buffer overflows. This request probably will only need 1 flit but cache line fetches would need 3. Cbo decodes when it sees the relevant opcode and address in the address range it has been assigned. Now it has to decode the address to see what to do with it.
Within the Cbo, requests go through the SAD at the same time that they are allocating into the TOR and are sent to the LLC. Non-LLC message class types go through the SAD when they are allocating into the TOR as well. The SAD receives the address, the address space, opcode, and a few other transaction details. To help reduce the number of decoder entries required for the varying DRAM capacities, the address decoder has several interleave options and on the Xeon 7500 is divided into 3 decoders with priority highest to lowest: I/O small (IOS) decoder, I/O large (IOL) decoder, DRAM decoder. On the Xeon e7 v2, there are 4 decoders: DRAM, MMIO, Interleave, legacy.
Each of these decoders is presented with an address and accessed in parallel. Addresses that match an entry in one of the decoders cause the decoder to lookup and generate the QPI memory attribute and NodeID. Only one match is allowed per decoder. Address matches in a higher-priority decoder will override simultaneous matches in lower-priority decoders.
I/O Decoders
There are 8 target lists presented here: PCI[], MIOU[], MIOL[], FWH[], APIC[], IOH[], CPU[], IO[]. These, along with some I/O decoder entries, are configurable via CSRs under the Cbo device, and each Cbo has its own SAD. The Local Cfg registers, which are just a fixed MMIO region that maps the local PCIe configuration registers, only have one target, the local socket, and don't need any target list (the request is always going to be routed to the Ubox, so the NodeID will be that of the local Ubox on the 7500).
Intel Xeon processor E7 v2 product family implements 4-bits of NodeID (NID). Intel Xeon processor E7 v2 product family can support up to 2 HAs in each socket. The HAs in the same socket will be differentiated by NID[2]. In cases where the target is only 3 bits, NID[2] is assumed to be zero. Therefore, the socket ID will be NID[3,1:0]. The Intel® Xeon® processor 7500 series implements 5 bits and can support up to four sockets (chosen by NID[3:2] when NID[4] is zero). Within each socket are four devices (NID[1:0]): Intel® 7500 chipset (00), B0/S0 (01), Ubox (10), B1/S1 (11). B0/S0 and B1/S1 are two sets of HAs (Bboxes) and CAs (Sboxes).
E7 v2 has 4 bits of NodeID, meaning it could support up to 8 sockets and 2 HAs per socket. If a socket has 2 HAs (MCC, HCC) it will still have 2 separate NodeIDs for each regardless of whether it is in COD mode or not, and cache and DRAM locality can still be exploited by NUMA-aware applications using the hemisphere bit, as CBos are always associated with the HA in their socket half (hemisphere) by a hash algorithm. The hash function for instance might cause addr[6] to select a CBo hemisphere in the core and, depending on how many (n) CBos are in the hemisphere, addr[x:7] mod n selects the CBo. Each CBo handles a particular range of LLC sets, where the socket's LLC cache covers the entire address range. COD instead creates 2 cache affinity domains (where the CBos in either hemisphere now each cover the full address range instead of half of the address range and therefore become their own official NUMA nodes with a fully spanning cache (rather than 2 NUCA caches) like it's its own socket -- core cache accesses will now never travel across to the other half of the socket, reducing latency, but reducing hit rate as the cache is now half the size -- you can still get this same level of cache latency and more capacity with COD off if you access only the memory associated with the NUCA/NUMA node). The HA that is associated with the CBo hemisphere is also the HA that the CBo uses for DRAM accesses if the DRAM access is local to the socket, because addr[6] is used to select a HA in the DRAM decoder. When there are multiple sockets, addr[8:6] selects a HA, which may be on another socket, so it will use the CBos in the half of addr[6], but the CBo will send the LLC miss to a home agent on another socket if addr[8:7] of the home agent it is associated with isn't used, i.e. if the NUMA scheduler cannot 'affinitise' the process and the memory properly.
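The illustrative CBo-selection hash from the paragraph above can be sketched as follows. The bit positions (addr[6] for the hemisphere, higher bits mod n for the CBo) are the text's example, not a documented hash:

```python
# Sketch of the illustrative CBo-selection hash: addr[6] picks the
# hemisphere, higher address bits mod n pick a CBo within it.
# This is a hypothetical hash for illustration only.

def select_cbo(addr: int, cbos_per_hemisphere: int):
    hemisphere = (addr >> 6) & 1
    cbo = (addr >> 7) % cbos_per_hemisphere
    return hemisphere, cbo

# Two cache lines 64 bytes apart land in different hemispheres:
print(select_cbo(0x0040, 4))  # (1, 0)
print(select_cbo(0x0080, 4))  # (0, 1)
```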
This appears to be a paraphrase of the I/O decoders' list. The table includes the IOS and the IOL decoder entries. The attribute, e.g. CFG, is the output opcode class from which an opcode will be selected.
DRAM Decoder
On the Xeon 7500, the DRAM decoder is composed of two arrays, a CAM array and a payload array. The DRAM decoder has 20 entries in each array for 20 different regions of minimum 256MiB with 8 sub-regions (interleaves). DRAM decoder entries can also be used to define coherent, MMIO, MMCFG and NXM spaces. If more than one PCI segment is required, DRAM decoder entries are used to configure it above the top of DRAM, but unlike the normal PCIe entry below 4G, these entries should not be interleaved, so multiple targets will require multiple SAD entries, and the same SAD entries would need to be put into every socket. The CAM array has special compare logic to compute whether an address is less than or equal to the region limit of each entry, to allow for any region size that is a multiple of 256MiB. The region limit address bits [43:28] are stored in the CAM array of the entry (meaning 256MiB granularity). The payload array (tgtlist–attr) has 42 bits per entry.
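The limit-compare behaviour described above can be sketched as a priority search over entries sorted by ascending limit; the entry values here are made up for illustration:

```python
# Sketch of the Xeon 7500 DRAM-decoder CAM behaviour: each entry stores
# a region limit at 256MiB granularity (addr bits [43:28]) and the first
# entry whose limit the address compares <= to wins, so entries must be
# sorted by ascending limit. Entry contents below are illustrative.

GRANULE = 28  # 2**28 = 256MiB

def dram_decode(addr: int, entries):
    """entries: list of (limit_bits_43_28, target_list), ascending by limit."""
    key = (addr >> GRANULE) & 0xFFFF       # addr[43:28]
    for limit, targets in entries:
        if key <= limit:
            return targets
    return None                            # no match -> NXM / Ubox

# Two regions: 0..4GiB -> node 0, 4GiB..8GiB -> node 1 (made-up config).
entries = [(15, [0]), (31, [1])]
print(dram_decode(0x0_8000_0000, entries))  # 2GiB -> [0]
print(dram_decode(0x1_8000_0000, entries))  # 6GiB -> [1]
```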
You'd expect to see 20 64 bit (6 bits reserved) configuration registers for these SAD DRAM decoder entries (and there appears to be only one set of SAD rules that all the SADs on the socket use), although this is for 5500, and I think there is still only one set of registers for all the CBo SADs on newer CPUs.
On 5500, you set the PA[39:26] for the rule in [19:6] and the method of index generation in SAD_DRAM_RULE_0 and then in SAD_INTERLEAVE_LIST_0 you put the target list -- this sets the decoder fields and presumably the others are set. There must be a register to set the hemisphere bit for the entry (but it's not shown on the datasheet), which tells it to xor the first bit of the NID in the list with the CboxID[2] (what HA it belongs to -- same as NID[1] of current Cbox's HA i.e. addr[6]) of the current Cbox. The idea is that NID1 is always made to be 0 in the target list, such that the xor with the current Cbox's CboxID[2] causes it to select the home agent that belongs to the Cbox (which is why you don't need SAD entries for each individual CBox).
The target list entries are 2 bits on the Xeon 5500 Nehalem-EP (Gainestown), which is of the DP (Dual Processor) Server line, so max is only 2 sockets (only has 2 QPI links, one for IOH one for inter-CPU), hence a max of 4 HAs. It will be 3 bits on the 7500 (4x QPI) (4 sockets with 2HAs each: 8HAs), which expand to 4 bits to include a redundant NID[4]. SAD and DRAM decode entry configuration would surely have to be set identically across all sockets.
Interleaving for 8 nodes (HAs) is done on addr[8:6] of the address and these are removed (excluded) from the address sent to the target HA as it knows its own ID. The above table is for 7500, where addr[6] identifies the HA (NodeID[1]) on the socket.
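The addr[8:6] interleave and bit removal described above can be sketched as follows; the target-list contents are illustrative:

```python
# Sketch of the 8-way HA interleave: addr[8:6] indexes an 8-entry target
# list, and those 3 bits are squeezed out of the address forwarded to
# the home agent (it knows its own ID).

def interleave(addr: int, target_list):
    idx = (addr >> 6) & 0b111
    target = target_list[idx]
    # remove bits [8:6]: keep [5:0], shift everything above bit 8 down by 3
    fwd_addr = (addr & 0x3F) | ((addr >> 3) & ~0x3F)
    return target, fwd_addr

# Duplicated entries give 2-way interleaving across NIDs 0 and 1,
# as footnote 7 describes:
targets = [0, 1] * 4
print(interleave(0x1C0, targets))  # (1, 0)
```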
There are 2 types of requests:
Non Coherent requests
Non-coherent requests are either data accesses that map to non-coherent address space (like MMIO) or they are non-memory requests like IO read/write, interrupts/events, etc. When NC requests access NC memory, they are sent to the Cbo according to the hash of the address just like coherent requests. The NC requests which do not target memory are sent to the Cbo that is attached to the core which generates the request.
MMIO
A transaction relating to a specific address is received by the Cbo and put through the I/O and DRAM decoders, stored in the TOR but not sent to the LLC slice if the IDI opcode is uncacheable (e.g. PRd), which the core sends depending on the memory type read from PAT/MTRR on the L1 access, except for WCiL, which invalidates aliases in other cores and caching agents I think. It will match the I/O decoder entry for one of the MMIO regions. There will be 8 IOAPICs on some CPUs (not on Xeon 7500 because it talks about ICH/IOH and it's Beckton (Nehalem-EX), so this is the very first implementation of the ringbus; but when the PCH was introduced on Nehalem Ibex Peak, it got integrated on chip in the IIO module). The yyy in FECX_(yyyx)XXXh (so bits [15:13]) will be used to index into the APIC[] table for the NodeID. This means there is 8KiB per IOAPIC in a 64KiB contiguous region. The SAD outputs the NodeID and the opcode to use (NcRd). The Xeon 3400 manual talks about a 'node Id' being assigned to the IIO; maybe this is the 'chipset/IOH' node ID, but the E7 v2 doesn't have this, so perhaps there is a hidden extension to select between IIO and Ubox, or maybe there's a separate ID appended to the QPI request and the Cbos also have an ID. The IIO will then route the MMIO access based on the integrated IIO and PCI bridge registers, and subtractively decodes to DMI.
Also, there are 2 regions, MMIOL and MMIOH, for MMIO BARs. There is 256MiB of MMIOLL interleaved for a maximum of the last 7 sockets and 256MiB of MMIOLU for a maximum of the first 7 sockets, depending on the value of PCIEXBAR (if PCIEXBAR is at 2GiB then MMIOLL won't exist). The 7500 datasheet says the MMIOH requires separate I/O decoder entries per socket.
The general server CPU memory layout is like this:
There is a SCA_PCIE subregion that occupies the first 8MiB of the 256MiB configuration space, which must be in a 256MiB region below 4GiB. When the address in the transaction is in the local clump address range (1MiB clump per socket), the normal PCIe decoding is overridden and the NodeID is taken as PhysicalAddr[22:20]. Notice that 1MiB is Bus 0 (32 dev * 8 func) -- bus 0 of all sockets is visible to all other sockets. The NodeID will then be put in a QPI transaction with an NcCfgRd opcode and NodeID destination and the flits are placed on the request ring; this will then be absorbed by the QPI link that recognises that NID range and, once on the target socket, it will be routed to the Ubox (on SnB, IIO appears to handle NcCfg requests according to Xeon 3400 vol2 5.8.1) which will perform the configuration read; the Ubox has a unique NodeID on the 7500, so it will be routed to it seamlessly (on e7 v2 we assume that there is a hidden ID for these inserted in the packet). In the QPI packet the Cbo perhaps inserts a requestor NodeID, a requestor CBo ID and a transaction ID; the Ubox buffers and tracks the transaction (I assume it sends an acknowledge back on the acknowledge ring) and will send back the results (success/failure) to the source NodeID in a QPI packet on the data ring once the processing has finished, and it will travel back over the QPI link. The CBo completes the transaction and returns an IDI packet to the Core on the data ring.
If the address falls outside of the 8MiB region but is in the 256MiB PCIe configuration space, then it will match the top entry in the I/O decoders list. The zzz bits (clearly bits 27-25) are used to index into the target list to select a node; they are clearly the top 3 bits of the bus number. This suggests that, assuming index = socket number in the interleave list, socket 0's allocation would start at bus 8 (with its bus 0 as bus 0), socket 1 would start at bus 32 (with its bus 0 as bus 1), socket 2 at bus 64 (with its bus 0 as bus 2), socket 3 at bus 96 (with its bus 0 as bus 3) (and in the request sent to socket 3, I assume it removes the NodeID bits such that it would see bus 1 and not 96 (or perhaps bus 0 if it aliases rather than adding 1)). Accessing from this PCIEXBAR config space limits the number of buses in a segment to 32, so full 256MiB aliases dedicated to each segment need to be assigned (in the DRAM decoder you can set an MMCFG entry targeting a particular socket, presumably by setting the target list to contain only a NodeID on that socket) and any access to this range is directed to that socket. PCI[zzz] seems to mean: index into the PCI target list to select a NodeID and give the request an opcode of the attribute class (NcCfgRd/Wr based on TOR details); the Cbo also translates the address to a bus/device/function number. The BIOS would have to read this configuration register in order to enter the correct subordinate bus numbers for the PCI-to-PCI bridges on the PCIe ports.
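The SCA_PCIE "local clump" decode described above can be sketched as follows; the MMCFG base address is made up, and the field split follows the 1MiB-per-socket, bus-0-layout reading of the datasheet:

```python
# Sketch of the SCA_PCIE local-clump decode: within the first 8MiB of
# the MMCFG window, addr[22:20] is taken directly as the NodeID (one
# 1MiB bus-0 clump per socket), and the low 20 bits are the usual
# 32-dev x 8-func x 4KiB bus-0 layout. Base address is illustrative.

def sca_pcie_decode(phys_addr: int, mmcfg_base: int):
    offset = phys_addr - mmcfg_base
    if offset >= 8 << 20:
        return None                     # not in the SCA_PCIE subregion
    node_id = (offset >> 20) & 0b111
    dev = (offset >> 15) & 0x1F
    func = (offset >> 12) & 0b111
    return node_id, dev, func

# Device 2, function 0 on socket 3's bus 0:
base = 0xE000_0000
print(sca_pcie_decode(base + (3 << 20) + (2 << 15), base))  # (3, 2, 0)
```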
On the I/O decoder there are also the entries LclCSRs (Local Config), Glb IOH CSRs (IOH Cfg), Glbl CPU CSRs (CPU Cfg), lcl clump CSRs (Local clump CPU Config). These are fixed MMIO aliases for accessing the PCIe Configuration Space Registers (CSRs) without using PCIEXBAR.
IO
All sockets have IO ports that are fixed and ones that can be set in BARs with Memory Space Indicator set to IO space.
The IO request is received by the Cbo and looked up in the IO table. There doesn't seem to be a separate IO opcode in the IDI list, nor a UC write counterpart of PRd; there might be an undocumented one, or it may be encoded in WCiL.
The IO request can either be IO-PCIE, IO-SCA or Non-PCIe-IO. If the address matches the pattern in the bottom row, an IO bit is set and the address is CF8/CFC, then 4 zeros are inserted into the address at [11:8] (the core can only issue 8 bits of register index), making the CONFIG_ADDRESS function number LSB start at bit 12 rather than 8 and putting the bus number MSB at bit 27, and it is now compared to the 3rd bottom row. If the top 5 bus bits match, then the NodeID is determined by [22:20], and an NcCfgRd/Wr is sent to that NodeID. If it is not CF8/CFC, then the destination NodeID is determined by looking up a target list entry using IO_Addr[15:13] – so the 3 most significant bits of the IO port. This means that each socket has 8KiB of I/O space, and variable I/O will have to be set in the range for the socket. For fixed I/O accesses, if you make the top 3 bits those of the socket, then theoretically they could be removed by the Cbo on the match; so in order to access 20h on socket 1, you'd use 2020h.
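The CF8/CFC expansion just described can be sketched in a few lines; the field layout follows the description above (4 zero bits inserted at [11:8], NodeID at bits [22:20] of the expanded value):

```python
# Sketch of the CF8/CFC address expansion: 4 zero bits are inserted at
# [11:8] of the legacy CONFIG_ADDRESS value, shifting function/device/bus
# up so the bus field ends at bit 27; the NodeID is then bits [22:20],
# i.e. the low 3 bits of the bus number.

def expand_cf8(config_address: int) -> int:
    low = config_address & 0xFF            # register index [7:0] stays put
    high = (config_address >> 8) << 12     # func/dev/bus shifted up by 4
    return high | low

def node_id(expanded: int) -> int:
    return (expanded >> 20) & 0b111

# Bus 5, device 0, function 0, register 0x10 (illustrative values):
cf8 = (5 << 16) | 0x10
exp = expand_cf8(cf8)
print(hex(exp), node_id(exp))  # 0x500010 5
```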
Other
Core-issued requests which can match decoder entries by opcode are: IntA, Lock, SplitLock, Unlock, SpCyc, DbgWr, IntPriUp, IntLog, IntPhy, EOI, FERR, Quiesce. They are allocated in the TOR but not sent to the LLC. SplitLock is used when a hardware lock operation crosses 2 cache lines. Lock used to be directed by the Cbo to the Ubox to quiesce all agents to perform an atomic operation, but that is unnecessary overhead as it can be implemented in the cache coherence protocol. So presumably the Lock RFO is now picked up by the Cbo, which invalidates other copies and responds, and the core can't read the line until then (presumably there is an indicator in the LLC that the line is locked in the core, indicated in the snoop filter).
Coherent requests
Coherent requests are requests to access a memory address that is mapped to the coherent address space. They are usually used to transfer data at a cache line granularity and/or to change the state of a cache line. The most common coherent requests are Data and Code reads, RFOs, ItoMs and Writeback Evictions (WbMtoI) / Writebacks (WbMtoE) to LLC (inclusive cache only). The Coherent requests are serviced by the Cbo that holds the LLC slice for the specified address, determined by the hashing function.
If the address decodes to a DRAM region and the last level cache slice attached to that Cbo indicates that a core within the socket owns the line (for a coherent read), the Cbo snoops the request to that local core (this is obviously talking about an inclusive L3 with snoop filter bits in the LLC slice set by accesses from other cores; note that on the SKL server mesh interconnect (not desktop, as it still uses the ring), the L3 is non-inclusive; in an inclusive cache, if it is valid in 2 cores then the LLC copy is valid).
If the line isn't in the cache (assume COD/HS mode w. Directory (ecc) + OSB + HitME cache is being used), then the caching agent that requests a cache line does not broadcast snoops. Instead it forwards the request in QPI to the HA NodeID that owns the address based on the SAD decoder rules, which then sends snoops to other caching agents which might own the line. The forwarding to the home node adds latency. However, the HA can implement a directory in the ECC bits of DRAM to improve performance. The directory is used to filter snoops to remote sockets or node controllers; on successful snoop response it will allocate an entry in the directory cache. The directory encodes 3 states: not present in any node, modified in 1 node (writeback and downgrade required, and the bit probably indicates which half of the nodes it is in for a targeted broadcast), shared in multiple nodes (meaning that snoops aren't required); If a directory is not supported, I assume it has to snoop all sockets on every DRAM access, while the access is taking place. Haswell-EP also includes directory caches (HitMe) to accelerate the directory lookup. However, with only 14 KiB per HA these caches are very small. Therefore, only cache lines that are frequently transferred between nodes are stored in the HitMe cache. The directory cache stores 8-bit vectors that indicate which of 8 nodes (there are up to 3*2 logical COD nodes on Haswell-EP) have copies of a cache line. When a request arrives at the home node, the directory cache is checked. If it contains an entry for the requested line, snoops are sent as indicated by the directory cache. If the request misses in the directory cache, the HA fetches from the address in memory and reads the directory bits and sends snoops accordingly if it is modified in some socket, or directly sends the data it fetched from memory if no snoops are required and doesn't allocate an entry. [3]
The E7 v2 HA implements a 128-entry In-flight Memory Buffer (Home Tracker) and a 512-entry Backup Tracker. All home channel messages incoming to the HA will go through the BT (if the BT is enabled). The BT maintains a set of FIFOs to order the requests that are waiting to get into the HT, such that the HT entry is allocated to the oldest waiting request when it becomes available. It also has a Home Agent Data Buffer (a set of data buffers for transferring the data between the ring and the memory controller; it has 128 entries, one for each Home Tracker entry). The HA needs to decode the appropriate DRAM channel from the address. This process is called "Target Address Decode". The TAD rules for each HA would be configurable via CSRs; the rules consist of a memory range to match and then an interleave mode, i.e. addr[8:6] of the truncated address to index into a list of DRAM channels. Because the TAD contains an address range to match, it allows memory reclaim for I/O-decoder-stolen ranges: configure the SAD DRAM decoder to decode a certain reclaim range and configure it in a TAD entry. (Workstation CPUs use TOLUD and REMAPBASE etc.; the setting of these registers would change the Cbo decoders which read from them.) After the target channel is decoded, the request is forwarded to the corresponding BGF (bubble generator FIFO). On a read, the data will be returned from the iMC and forwarded to the HADB. On a write, the data will be downloaded from the ring to the HADB and forwarded to the iMC. The iMC on the server processors connects to 2 SMI2 ports per socket for riser cards with usually 12 DIMMs and a scalable memory buffer (e.g. Jordan Creek). The HA sends the data back to the requesting CBo ID + NodeID.
No Match
If an address is not matched by any range, or matches an address hole (non-existent memory attribute), the target defaults to the socket's Ubox, and the message is encoded as an NcRdPtl or NcWrPtl opcode with zero length or all-zero byte enables, respectively. The Ubox treats the message as an error; errors can be configured to return an error response or a normal completion (returning data of all ones for reads, or preventing the data update for writes), with or without an MCA. This mechanism is used to prevent any illegal access from being routed externally during some initialisation sequences. The 7500 series caching agent will not issue an MCA on access to a SAD address region marked as Non-Existent Memory (NXM), unless the access is a store with the write-back (WB) memory attribute.
The bus numbering on classic PCIe switch diagrams is wrong. My GPU is in fact on bus 1 (as it's on a PCIe port). It appears that the integrated PCIe switches are invisible to software: each port appears as a single PCIe controller / PCI-to-PCI bridge on bus 0, and the device connected point-to-point to the port appears on a new bus number (the Ethernet controller on bus 2 and the WLAN controller on bus 3).
In reality, PCI-to-PCI bridge 1 can't exist as drawn, because it would require a subordinate bus number that isn't present on my system. Either it doesn't exist on my system, and the PCH has 2 integrated PCI-to-PCI bridges like this:
Or it does exist, but has registers invisible to the programmer that change based on the contents of the subordinate PCI-to-PCI bridges, and its subordinate bus number would have to be 0 (rather than 1 as in the diagram). This would not mess up routing by having 2 identical bus numbers, because the interface on the opposite side would not be bus 0 but directly connected to PCH logic, so a transaction would always appear to the devices to originate from PCI-to-PCI bridge 1, as if it were the root complex with nothing beyond it. That diagram presents a very misleading PCH; the PCIe interface is not as pure as that, and is broken up and interfaced by the chipset where necessary (it would be too expensive to have all the integrated PCH bridges and controllers as separate point-to-point PCIe devices on ports).
The CPU and the PCH function as single physical PCIe devices with logical subdevices. The Ubox intercepts NcCfgRd/NcCfgWr. IIO+PCH (CPUBUSNO(0)) devices can actually have a separate bus number from the cores (CPUBUSNO(1)). If the bus number is CPUBUSNO(1), or is CPUBUSNO(0) with a device number below a specific value, then the request is handled directly. If it is on CPUBUSNO(0) and the device number is above that specific value, then a Type 0 configuration TLP is routed to the DMI interface, where any device other than the logical PCI bridges responds to its function/device number (DMI acts like a logical bridge whose subordinate number is that bus number, despite being on the same bus). If the bus number is greater than CPUBUSNO(0), then a Type 1 configuration TLP is generated on DMI, which will be absorbed by the logical bridge with that subordinate bus number.

What are linux irq domains, why are they needed?

What are IRQ domains? I read the kernel documentation (https://www.kernel.org/doc/Documentation/IRQ-domain.txt), which says:
The number of interrupt controllers registered as unique irqchips
show a rising tendency: for example subdrivers of different kinds
such as GPIO controllers avoid reimplementing identical callback
mechanisms as the IRQ core system by modeling their interrupt
handlers as irqchips, i.e. in effect cascading interrupt controllers.
How can a GPIO controller be called an interrupt controller?
What are linux irq domains, why are they needed?
It's documented perfectly in the first paragraph of Documentation/IRQ-domain.txt, so I will assume that you already know it. If not, please ask what is unclear regarding that documentation. The text below explains how to use the IRQ domain API and how it works.
How can a GPIO controller be called an interrupt controller?
Let me answer this question using the max732x.c driver as a reference (driver code). It's a GPIO driver that also acts as an interrupt controller, so it should be a good example of how the IRQ domain API works.
Physical level
To fully understand the further explanation, let's first look at the MAX732x mechanics. Here is the application circuit from the datasheet (simplified for our example):
When there is a change of voltage level on the P0-P7 pins, the MAX7325 will generate an interrupt on its INT pin. The driver (running on the SoC) can read the status of the P0-P7 pins via I2C (SCL/SDA pins) and generate a separate interrupt for each of the P0-P7 pins. This is why this driver acts as an interrupt controller.
Consider the following configuration:
"Some device" changes the level on the P4 pin, prompting the MAX7325 to generate an interrupt. The interrupt from the MAX7325 is connected to the GPIO4 IP core (inside the SoC), which uses line #29 of that GPIO4 module to notify the CPU about the interrupt. So we can say that the MAX7325 is cascaded to the GPIO4 controller. GPIO4 also acts as an interrupt controller, and it's cascaded to the GIC interrupt controller.
Device tree
Let's declare above configuration in device tree. We can use bindings from Documentation/devicetree/bindings/gpio/gpio-max732x.txt as reference:
expander: max7325@6d {
	compatible = "maxim,max7325";
	reg = <0x6d>;
	gpio-controller;
	#gpio-cells = <2>;
	interrupt-controller;
	#interrupt-cells = <2>;
	interrupt-parent = <&gpio4>;
	interrupts = <29 IRQ_TYPE_EDGE_FALLING>;
};
The meaning of properties is as follows:
The interrupt-controller property declares that the device generates interrupts; it will be needed later to use this node as the interrupt-parent in the "Some device" node.
#interrupt-cells defines the format of the interrupts property; in our case it's 2: 1 cell for the line number and 1 cell for the interrupt type.
The interrupt-parent and interrupts properties describe the interrupt line connection.
Let's say we have a driver for the MAX7325 and a driver for "Some device". Both run on the CPU, of course. In the "Some device" driver we want to request an interrupt for the event when "Some device" changes the level on the P4 pin of the MAX7325. Let's first declare this in the device tree:
some_device: some_device@1c {
	reg = <0x1c>;
	interrupt-parent = <&expander>;
	interrupts = <4 IRQ_TYPE_EDGE_RISING>;
};
Interrupt propagation
Now we can do something like this (in "Some device" driver):
devm_request_threaded_irq(core->dev, core->gpio_irq, NULL,
			  some_device_isr, IRQF_TRIGGER_RISING | IRQF_ONESHOT,
			  dev_name(core->dev), core);
And some_device_isr() will be called each time the level on the P4 pin of the MAX7325 goes from low to high (rising edge). How does it work? From left to right, if you look at the picture above:
"Some device" changes the level on the P4 pin of the MAX7325
The MAX7325 changes the level on its INT pin
The GPIO4 module is configured to catch such a change, so it generates an interrupt to the GIC
The GIC notifies the CPU
All those actions happen at the hardware level. Let's see what happens at the software level. It actually goes backwards (from right to left in the picture):
The CPU is now in interrupt context in the GIC interrupt handler. From gic_handle_irq() it calls handle_domain_irq(), which in turn calls generic_handle_irq(). See Documentation/gpio/driver.txt for details. Now we are in the SoC's GPIO controller IRQ handler.
The SoC's GPIO driver also calls generic_handle_irq() to run the handler that is set for each particular pin. See for example how it's done in omap_gpio_irq_handler(). Now we are in the MAX7325 IRQ handler.
The MAX7325 IRQ handler (here) calls handle_nested_irq(), so that all IRQ handlers of devices connected to the MAX7325 (the "Some device" IRQ handler, in our case) are called in the max732x_irq_handler() thread.
Finally, the IRQ handler of the "Some device" driver is called.
IRQ domain API
The GIC driver, the GPIO driver and the MAX7325 driver all use the IRQ domain API to present themselves as interrupt controllers. Let's take a look at how it's done in the MAX732x driver. It was added in this commit. It's easy to figure out how it works just by reading the IRQ domain documentation and looking at that commit. The most interesting part of the commit is this line (in max732x_irq_handler()):
handle_nested_irq(irq_find_mapping(chip->gpio_chip.irqdomain, level));
irq_find_mapping() will find the Linux IRQ number from the hardware IRQ number (using the IRQ domain mapping function). Then the handle_nested_irq() function will be called, which runs the IRQ handler of the "Some device" driver.
GPIOLIB_IRQCHIP
Since many GPIO drivers use the IRQ domain in the same way, it was decided to extract that code into the GPIOLIB framework, more specifically into GPIOLIB_IRQCHIP. From Documentation/gpio/driver.txt:
To help out in handling the set-up and management of GPIO irqchips and the
associated irqdomain and resource allocation callbacks, the gpiolib has
some helpers that can be enabled by selecting the GPIOLIB_IRQCHIP Kconfig
symbol:
gpiochip_irqchip_add(): adds an irqchip to a gpiochip. It will pass
the struct gpio_chip* for the chip to all IRQ callbacks, so the callbacks
need to embed the gpio_chip in its state container and obtain a pointer
to the container using container_of().
(See Documentation/driver-model/design-patterns.txt)
gpiochip_set_chained_irqchip(): sets up a chained irq handler for a
gpio_chip from a parent IRQ and passes the struct gpio_chip* as handler
data. (Notice handler data, since the irqchip data is likely used by the
parent irqchip!) This is for the chained type of chip. This is also used
to set up a nested irqchip if NULL is passed as handler.
This commit converts the MAX732x driver from the plain IRQ domain API to the GPIOLIB_IRQCHIP API.
Next questions
Further discussion is here:
part 2
part 3
Here's a comment I found in include/linux/irqdomain.h:
Interrupt controller "domain" data structure. This could be defined as
a irq domain controller. That is, it handles the mapping between
hardware and virtual interrupt numbers for a given interrupt domain.
The actual structure I think it's referring to there is struct irq_domain.

Linux DMA API: specifying address increment behavior?

I am writing a driver for the Altera SoC Development Kit and need to support two modes of data transfer to/from an FPGA:
FIFO transfers: When writing to (or reading from) an FPGA FIFO, the destination (or source) address must not be incremented by the DMA controller.
non-FIFO transfers: These are normal (RAM-like) transfers where both the source and destination addresses require an increment for each word transferred.
The particular DMA controller I am using is the CoreLink DMA-330 DMA Controller, and its Linux driver is pl330.c (drivers/dma/pl330.c). This DMA controller provides a mechanism to switch between "fixed-address bursts" and "incrementing-address bursts" (these are synonymous with my "FIFO transfers" and "non-FIFO transfers"). The pl330 driver specifies which behavior it wants by setting the appropriate bits in the CCRn register:
#define CC_SRCINC (1 << 0)
#define CC_DSTINC (1 << 14)
My question: it is not at all clear to me how clients of the pl330 (my driver, for example) should specify the address-incrementing behavior.
The DMA engine client API says nothing about how to specify this while the DMA engine provider API simply states:
Addresses pointing to RAM are typically incremented (or decremented)
after each transfer. In case of a ring buffer, they may loop
(DMA_CYCLIC). Addresses pointing to a device's register (e.g. a FIFO)
are typically fixed.
without giving any detail as to how the address types are communicated to providers (in my case the pl330 driver).
Then, in the pl330_prep_slave_sg method, it does:
if (direction == DMA_MEM_TO_DEV) {
	desc->rqcfg.src_inc = 1;
	desc->rqcfg.dst_inc = 0;
	desc->req.rqtype = MEMTODEV;
	fill_px(&desc->px,
		addr, sg_dma_address(sg), sg_dma_len(sg));
} else {
	desc->rqcfg.src_inc = 0;
	desc->rqcfg.dst_inc = 1;
	desc->req.rqtype = DEVTOMEM;
	fill_px(&desc->px,
		sg_dma_address(sg), addr, sg_dma_len(sg));
}
where later the desc->rqcfg.src_inc and desc->rqcfg.dst_inc values are used by the driver to specify the address-increment behavior.
This implies the following:
Specifying direction = DMA_MEM_TO_DEV means the client wishes to push data from RAM into a FIFO. And presumably DMA_DEV_TO_MEM means the client wishes to pull data from a FIFO into RAM.
Scatter-gather DMA operations (for the pl330 at least) are restricted to cases where either the source or the destination end point is a FIFO. What if I wanted to do a scatter-gather operation from system RAM into FPGA (non-FIFO) memory?
Am I misunderstanding and/or overlooking something? Does the DMA engine already provide a (undocumented) mechanism to specify address-increment behavior?
Look at this
pd->device_prep_dma_memcpy = pl330_prep_dma_memcpy;
pd->device_prep_dma_cyclic = pl330_prep_dma_cyclic;
pd->device_prep_slave_sg = pl330_prep_slave_sg;
It means you have different approaches, as you read in the documentation. RAM-like transfers could be done, I suspect, via device_prep_dma_memcpy().
It appears to me (after looking at various drivers in the kernel) that the only DMA transfer styles which allow you to (indirectly) control auto-increment behavior are the ones which take an enum dma_transfer_direction in their corresponding device_prep_... function.
And this parameter is declared only for device_prep_slave_sg and device_prep_dma_cyclic, according to include/linux/dmaengine.h.
Another option would be to use struct dma_interleaved_template, which allows you to specify the increment behaviour directly. But support for this method is limited (only the i.MX DMA driver supports it in the 3.8 kernel, for example, and even that support seems to be limited).
So I think we are stuck with the device_prep_slave_sg case, with all its sg-related complexities, for a while.
That is what I am doing at the moment (although it is for accessing some EBI-connected device on an Atmel SAM9 SoC).
Another thing to consider is the device's bus width. The memcpy variant can perform transfers of different bus widths, depending on the source and target addresses and sizes, and this may not match the size of a FIFO element.

Enabling multiple MSI in PCI driver with different IRQ handlers

Currently I have a requirement to support MSI with 2 vectors on my PCI device. Each vector needs to have a different handler routine. The HW document says the following:
vector 0 is for temperature sensor
vector 1 is for power sensor
Below is the driver code I am following:
1. First enable two vectors using pci_enable_msi_block(pdev, 2).
2. Next assign interrupt handlers using request_irq() (two different IRQs, two different interrupt handlers).
int vecs = 2;
struct pci_dev *pdev = dev->pci_dev;
result = pci_enable_msi_block(pdev, vecs);
Here result is zero, which says the call succeeded in enabling two vectors.
The questions I have:
The HW document says vector 0; I hope this is not vector 0 of the OS, right? In any case I can't get vector 0 in the OS.
The difficult problem I am facing is: when I do request_irq() for the first IRQ, how do I tell the OS to map this request to vector 0 of the HW? Correspondingly for the second IRQ, how do I map it to vector 1 of the HW?
pci_enable_msi_block:
If 2 MSI messages are requested using this function and if the function call returns 0, then 2 MSI messages are allocated for the device and pdev->irq is updated to the lowest of the interrupts assigned to the device.
So pdev->irq and pdev->irq+1 are the new interrupts assigned to the device. You can now register two interrupt handlers:
request_irq(pdev->irq, handler1, ...)
request_irq(pdev->irq+1, handler2, ...)
With MSI and MSI-X, the interrupt number(irq) is a CPU "vector". Message signaled interrupts allow the device to write a small amount of data to a special memory-mapped I/O address; the chipset then delivers the corresponding interrupt to a processor.
There may be two different MSI data values that can be written to the MSI address. It's like your hardware supports 2 MSIs (one for the temperature sensor and one for the power sensor). So when you issue pci_enable_msi_block(pdev, 2);, an interrupt will be asserted by the chipset to the processor whenever either of the two MSI data values is written to that special memory-mapped I/O address (the MSI address).
After the call to pci_enable_msi_block(pdev, 2);, you can request two IRQs through request_irq(pdev->irq, handler1, flags, ...) and request_irq(pdev->irq + 1, handler2, flags, ...). So whenever MSI data is written to the MSI address, pdev->irq or pdev->irq + 1 will be asserted depending on which sensor sent the MSI, and the corresponding handler will be invoked.
These two MSI data values can be configured in the hardware's MSI data register.

Interrupt handling (Linux/General)

On the mainboard we have an interrupt controller (IRC) which acts as a multiplexer between the devices that can raise an interrupt and the CPU:
          |-----------|        |--------|
-(0)------|           |        |        |
-(...)----|    IRC    |________|  CPU   |
-(15)-----|           |        |        |
          |-----------|        |--------|
Every device is associated with an IRQ (the number on the left). After every instruction execution the CPU senses the interrupt-request line. If a signal is detected, a state save is performed and the CPU loads an interrupt handler routine, which is found via the interrupt vector located at a fixed address in memory. As far as I can see, the number of the IRQ and the vector number in the interrupt vector are not the same, because I have, for example, my network card registered to IRQ 8. On an Intel Pentium processor that vector would point to a routine used to signal an error condition, so there must be a mapping somewhere which points to the correct handler.
Questions:
1) If I write a device driver and register an IRQ X for it, how does the system know which device should be handled? I can for example use request_irq() with IRQ number 10, but how does the system know that the handler should be used for the mouse, the keyboard, or whatever I write the driver for?
2) What does the interrupt vector look like then? I mean, if I use IRQ 10 for my device, wouldn't this overwrite a standard handler which is used for error handling in the table (the first usable one is 32, according to Silberschatz, Operating System Concepts)?
3) Who initially sets the IRQs? The BIOS? The OS?
4) Who is responsible for matching the IRQ to the offset in the interrupt vector?
5) It is possible to share IRQs. How is that possible? There are hardware lanes on the mainboard which connect devices to the interrupt controller. How can two lanes be configured to the same interrupt? There must be a table which says, e.g., that lanes 2 and 3 handle IRQ 15. Where does this table reside and what is it called?
Answers with respect to the Linux kernel; they should hold for most other OSes too.
1) If I write a device driver and register an IRQ X for it, how does the system know which device should be handled? I can for example use request_irq() with IRQ number 10, but how does the system know that the handler should be used for the mouse, the keyboard, or whatever I write the driver for?
There is no single answer to it. For example, if this is a custom embedded system, the hardware designer will tell the driver writer "I am going to route device x to irq y". For more flexibility, for example for a network card, which generally uses the PCI protocol, there is hardware/firmware-level arbitration to assign an IRQ number to a new device when it is detected. This is then written to one of the PCI configuration registers. The driver first reads this device register and then registers its interrupt handler for that particular IRQ. There are similar mechanisms for other protocols.
What you can do is look up calls to request_irq in the kernel code and see how each driver obtained its irq value. It will be different for each kind of driver.
The answer to this question is thus: the system doesn't know. The hardware designer or the hardware protocols provide this information to the driver writer, and the driver writer then registers the handler for that particular IRQ, telling the system what to do when it sees that IRQ.
2) What does the interrupt vector look like then? I mean, if I use IRQ 10 for my device, wouldn't this overwrite a standard handler which is used for error handling in the table (the first usable one is 32, according to Silberschatz, Operating System Concepts)?
Good question. There are two parts to it.
a) When you call request_irq(irq, handler), the system doesn't really program entry 0 in the IVT or IDT, but entry N + irq, where N is the number of error handlers or general-purpose exceptions supported on that CPU. Details vary from system to system.
b) What happens if you erroneously request an irq which is used by another driver? You get an error, and the IDT is not programmed with your handler.
Note: IDT is interrupt descriptor table.
3) Who initially sets the IRQs? The BIOS? The OS?
The BIOS first, and then the OS. But there are certain OSes, for example MS-DOS, which don't reprogram the IVT set up by the BIOS. More sophisticated modern OSes like Windows or Linux do not want to rely on particular BIOS functions, so they reprogram the IDT. But the BIOS has to do it initially; only then does the OS come into the picture.
4) Who is responsible for matching the IRQ to the offset in the interrupt vector?
I am not really clear on what you mean. The flow is like this: first your device is assigned an IRQ number, and then you register a handler for it with that IRQ number. If you use the wrong IRQ number and then enable interrupts on your device, the system will crash, because the handler is registered for the wrong IRQ number.
5) It is possible to share IRQs. How is that possible? There are hardware lanes on the mainboard which connect devices to the interrupt controller. How can two lanes be configured to the same interrupt? There must be a table which says, e.g., that lanes 2 and 3 handle IRQ 15. Where does this table reside and what is it called?
This is a very good question. An extra table is not how it is solved in the kernel. Rather, for each shared IRQ the handlers are kept in a linked list of function pointers. The kernel loops through all the handlers and invokes them one after another until one of the handlers claims the interrupt as its own.
The code looks like this:
driver1:
    d1_int_handler:
        if (device_interrupted())   <------------- this reads the hardware
        {
            do_interrupt_handling();
            return MY_INTERRUPT;
        } else {
            return NOT_MY_INTERRUPT;
        }

driver2:
    similar to driver1

kernel:
    do_irq(irq n)
    {
        if (shared_irq(n))
        {
            irq_chain = get_chain(n);
            while (irq_chain)
            {
                if ((ret = irq_chain->handler()) == MY_INTERRUPT)
                    break;
                irq_chain = irq_chain->next;
            }
            if (ret != MY_INTERRUPT)
                error "None of the drivers accepted the interrupt";
        }
    }

Resources