Reusing the same host-visible buffer on different queue families - graphics

Considering host-visible buffers (mostly related to streaming buffers, i.e. buffers backed by VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT memory), let's imagine the following usage pattern:
1. Write new data to the mapped address on the host (after the proper synchronization).
2. Read the buffer with contents written in step 1 on queue family A.
3. Write new data to the mapped address on the host (after the proper synchronization).
4. Read the buffer with contents written in step 3 on queue family B.
Now, if I omit the queue family ownership transfer (QFOT), is the data written in step 3 inaccessible to queue family B in step 4?
The data written on the host becomes visible to the device when I submit the command(s) of step 4 using vkQueueSubmit, due to the implicit memory dependency of the host write ordering guarantee.
How does it play with different queue families?

OK, so we have a CPU modifiable buffer. And for some reason, this buffer was created in exclusive mode. And you want to do the following:
1. Write data to the buffer.
2. Copy the data using queue family A.
3. Write data to the buffer.
4. Copy the data using queue family B.
In order for step 4 to work, you are required to do an ownership transfer. The standard spells this out right before what you quoted:
If memory dependencies are correctly expressed between uses of such a resource between two queues in different families, but no ownership transfer is defined, the contents of that resource are undefined for any read accesses performed by the second queue family.
You do have dependencies correctly expressed (I assume). But copying data is a "read access". And it is being performed by queue family B, which is different from queue family A. Therefore, step 4 (a "read access") triggers this clause: "the contents of that resource are undefined".
"Contents" means all of the contents. The ones you wrote in step 1 and step 3. All of them are undefined for step 4, unless you do a queue family ownership transfer.

Related

FORTH writing to a "forth disk"

You have a word in forth called USE which will create a file.
USE xxx ( -- )
Designate OS text file xxx as the "Forth disk."
However, it's not clear how you can write to that FORTH disk from within an interactive session. There are words such as FLUSH and UPDATE, but neither of them seems to do anything. I'm using gforth. I'm creating words in the session, and using them. I do not understand how a FORTH disk works in this context. It sounds like R's save.image(), except I can't get anything to save. Could you supply a sequence of commands that results in something being written to the argument of USE?
FORTH was originally designed around the idea of a low-level system with a raw persistent storage system (a 'disk') and NO filesystem -- so no concept of files or folders or anything like that. Instead, you read and write fixed size blocks on the disk, by block number.
Modern FORTH systems (like gforth) have support for filesystems, but ALSO still have support for the low-level raw 'disk' that is accessed by block number. Since gforth usually runs on an OS with a filesystem and no low-level disk access (without superuser permissions), to use the low-level disk block words, you need to give a file[1] to use as the underlying storage for the raw disk blocks -- and that is what the USE word does.
If you want to understand how to use the low-level block I/O words in FORTH, you need to read a forth book about it, but basically, you use BLOCK to read a block into a buffer, UPDATE to mark a buffer as modified, and FLUSH to flush modified buffers to the disk. From the ANSI forth spec, you find:
7.6.1.0800 BLOCK ( u -- a-addr )
a-addr is the address of the first character of the block buffer assigned to mass-storage block u.
An ambiguous condition exists if u is not an available block number.
If block u is already in a block buffer, a-addr is the address of that block buffer.
If block u is not already in memory and there is an unassigned block buffer, transfer block u
from mass storage to an unassigned block buffer. a-addr is the address of that block buffer.
If block u is not already in memory and there are no unassigned block buffers, unassign a block
buffer. If the block in that buffer has been UPDATEd, transfer the block to mass storage and
transfer block u from mass storage into that buffer. a-addr is the address of that block buffer.
At the conclusion of the operation, the block buffer pointed to by a-addr is the current block
buffer and is assigned to u.
7.6.1.2400 UPDATE ( -- )
Mark the current block buffer as modified. An ambiguous condition exists if there is no
current block buffer.
UPDATE does not immediately cause I/O.
See: 7.6.1.0800 BLOCK, 7.6.1.0820 BUFFER, 7.6.1.1559 FLUSH, 7.6.1.2180 SAVE-BUFFERS.
[1] On a system like Linux with appropriate permissions, you can use USE with a raw disk device to get something like the original intent.
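To make the BLOCK / UPDATE / FLUSH sequence concrete, here is a toy Python model of a Forth disk with a single block buffer. It is only a sketch of the semantics quoted above, not how gforth implements them: the class ForthDisk, its method names and the file name tmp.fb are made up for illustration, and the block size assumes the 1024-character blocks of the standard.

```python
import os
import tempfile

BLOCK_SIZE = 1024  # assumed: a standard Forth block holds 1024 characters


class ForthDisk:
    """Toy model of USE/BLOCK/UPDATE/FLUSH with one block buffer (hypothetical)."""

    def __init__(self, path):
        # Like USE: designate an ordinary file as the backing store.
        self.path = path
        if not os.path.exists(path):
            open(path, "wb").close()
        self.block_no = None   # block currently assigned to the buffer
        self.buffer = None     # the single block buffer
        self.updated = False   # set by update(), cleared when written out

    def _save(self):
        # Write the buffer back only if it was marked as modified.
        if self.updated:
            with open(self.path, "r+b") as f:
                f.seek(self.block_no * BLOCK_SIZE)
                f.write(self.buffer)
            self.updated = False

    def block(self, u):
        # Like BLOCK: return the buffer for block u, reading it if needed.
        if self.block_no != u:
            self._save()
            with open(self.path, "rb") as f:
                f.seek(u * BLOCK_SIZE)
                data = f.read(BLOCK_SIZE)
            # Blocks beyond the end of the file read as blanks in this model.
            self.buffer = bytearray(data.ljust(BLOCK_SIZE, b" "))
            self.block_no = u
        return self.buffer

    def update(self):
        # Like UPDATE: mark the current buffer as modified, no I/O yet.
        if self.buffer is None:
            raise RuntimeError("no current block buffer")
        self.updated = True

    def flush(self):
        # Like FLUSH: save modified buffers, then unassign them.
        self._save()
        self.block_no = None
        self.buffer = None


def demo():
    path = os.path.join(tempfile.mkdtemp(), "tmp.fb")
    disk = ForthDisk(path)
    text = b": one-plus-two 1 2 + . ;"
    buf = disk.block(0)          # fetch block 0 into the buffer
    buf[:len(text)] = text       # edit the buffer in memory
    disk.update()                # mark it modified
    size_before_flush = os.path.getsize(path)
    disk.flush()                 # only now does the file change
    with open(path, "rb") as f:
        on_disk = f.read()
    return size_before_flush, on_disk
```

In this model, skipping update() before flush() leaves the file untouched, which matches the symptom of FLUSH appearing to do nothing when no buffer has been marked as modified.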
You can write your own words to manipulate blocks. But to get started you can use the simple block editor that comes with gforth (https://github.com/forthy42/gforth/blob/master/blocked.fb). I use it in the following way. First you need to load it:
use blocked.fb - use the file blocked.fb as the Forth disk;
1 load - load the vocabulary;
editor - switch to the newly created vocabulary.
Now you can modify a file with the words defined in the editor vocabulary. Here is an example:
use tmp
0 l
0 t : one-plus-two 1 2 + . ;
flush
Brief explanation of some words (from blocked.fb):
a - goes to marked position
c - moves cursor by n chars
t - goes to line n and inserts
i - inserts
d - deletes marked area
r - replaces marked area
f - search and mark
il - insert a line
dl - delete a line
qx - gives a quick index
nx - gives next index
bx - gives previous index
n - goes to next screen
b - goes to previous screen
l - goes to screen n
v - goes to current screen
s - searches until screen n
y - yank deleted string

FUA, flush and ordering

A FUA command usually means the data needs to be committed to the NVM before signalling completion.
From my understanding, there is no requirement to flush the data before the FUA command to the NVM.
1. If we have LBA0 (in cache), LBA1 (in cache), LBA0 with FUA:
Can LBA0 with FUA complete first, and then LBA0 in cache complete?
Is there an ordering requirement between the two?
2. Again, if we have the commands:
Write LBA0 with X
Write LBA1 with Y
Write LBA0 with Z and FUA set
Read LBA0
Can the read of LBA0 return X or Z?
3. Another question is about FLUSH. If a flush comes with data, the data in the write cache must be committed before the BIO that is flagged with flush gets executed.
If a FS issues:
WRITE LBA0 with X (1)
FLUSH
WRITE LBA0 with Y (2)
If (2) is issued after FLUSH completes, the ordering is guaranteed.
However, if (2) is issued before FLUSH completes, so both commands are in flight, can (2) complete before FLUSH?
My understanding is that the current state of Linux kernel handling of block I/O barriers is described by https://lwn.net/Articles/400541/.
That is, queued requests (whether in software or hardware (NCQ, TCQ)) are generally unordered, and the kernel or hardware is free to process them in any order. If there is a need for ordering, the higher levels (e.g. the filesystem) are responsible for waiting for completion.
Say, if a file system wants writes A, B, C to be on disk before writing a journal commit record X, it must do something like
submit A, B, C
wait for A, B, C to complete
Issue FLUSH (which makes sure that A, B, C go from the disk cache to persistent storage)
Write X with FUA bit set
Alternatively, if the HW doesn't support FUA, this can be emulated by
Write X
Wait for X to complete
Issue FLUSH
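From user space, the same wait-then-flush pattern can be approximated with fsync(). The following Python sketch assumes that fsync() on the file makes the kernel wait for the earlier writes and flush the device cache, which depends on the filesystem and mount options. The function name commit_with_barrier and the block layout are hypothetical.

```python
import os
import tempfile


def commit_with_barrier(path, data_blocks, commit_record, block_size=512):
    """Write data blocks, make them durable, then write the commit record."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Submit A, B, C; no ordering among them is assumed.
        for i, blk in enumerate(data_blocks):
            os.pwrite(fd, blk, i * block_size)
        # Wait for completion and flush, so A, B, C should now be durable.
        os.fsync(fd)
        # Write X only after the barrier above has returned.
        commit_offset = len(data_blocks) * block_size
        os.pwrite(fd, commit_record, commit_offset)
        # Emulated FUA: write X, wait for it, then flush again.
        os.fsync(fd)
        return commit_offset
    finally:
        os.close(fd)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "journal.bin")
    offset = commit_with_barrier(
        demo_path, [b"A" * 512, b"B" * 512, b"C" * 512], b"X" * 512
    )
    print("commit record written at offset", offset)
```

The point of the sketch is the position of the two fsync() calls: the commit record is never submitted while the data writes it depends on may still be in flight.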
Now, for your questions:
1. No, there is no ordering requirement.
2. Unspecified.
3. I'm not sure, but I believe FLUSH is unordered, just like other commands.

AHCI specification

I have a question regarding the AHCI spec:
Is the variable pDmaXferCnt in the port used when the transfer is a DMA write or read?
The description in the spec seems to indicate that it isn't, but the PRDs are used instead.
But how does the HBA know how much data is to be sent or received to/from a SATA device?
This information will be available in the sector count of a H2D FIS, but unless I have overlooked it there doesn't seem to be a register or variable that holds this value.
The DX:transmit state also seems to indicate that pDmaXferCnt will have a set value, yet I can’t see where it would be set for a DMA read/write operation.
Thanks
From the spec:
"Implementation Note:
HBA state variables are used to describe the required externally visible behavior. Implementations are not required to have internal state values that directly correspond to these variables." - meaning you (maybe) won't find the pDmaXferCnt externally in a register.
There is another way to track the count though.
Under the HBA Memory Space Usage part of the AHCI spec, there are the data structures of the command list (list of command headers) and the command table (pointed to by the command header, each command table is a command to be sent). These are both accessible to the HBA.
In DW0 of the command header is the PRDTL, which is the count of how many PRDs are used in the transfer.
The command table that the command header points to contains the actual PRDs, and each PRD has its own DBC, or data byte count (the amount of data in bytes to be DMA'd at the location specified in DBA). The field is 0-based, so a PRD describes DBC + 1 bytes. If you add up the byte counts of all the PRDs, you get the amount of data to be transferred.
Alternatively, DW1 of the command header holds the PRDBC, which is the count of bytes transferred, so you could check that after the command.
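As a rough illustration of adding up the PRD byte counts, here is a Python sketch that walks a PRD table held in a bytes object. It assumes the entry layout from the AHCI specification (four little-endian dwords per entry, with DBC in bits 21:0 of the last dword and stored as the byte count minus one). The helper names make_prd and prd_total_bytes are made up for this example.

```python
import struct

PRD_ENTRY_SIZE = 16      # four dwords: DBA, DBAU, reserved, DBC/I
DBC_MASK = 0x3FFFFF      # bits 21:0 of the last dword
IRQ_BIT = 0x80000000     # bit 31: interrupt on completion


def make_prd(addr, nbytes, irq=False):
    """Build one PRD entry describing nbytes at physical address addr."""
    dw3 = ((nbytes - 1) & DBC_MASK) | (IRQ_BIT if irq else 0)
    return struct.pack("<IIII", addr & 0xFFFFFFFF, addr >> 32, 0, dw3)


def prd_total_bytes(prdt, prdtl):
    """Sum the byte counts of the first prdtl entries of a PRD table."""
    total = 0
    for i in range(prdtl):
        _dba, _dbau, _rsvd, dw3 = struct.unpack_from(
            "<IIII", prdt, i * PRD_ENTRY_SIZE
        )
        total += (dw3 & DBC_MASK) + 1   # DBC is 0-based
    return total


if __name__ == "__main__":
    table = make_prd(0x1000, 4096) + make_prd(0x200000, 512, irq=True)
    print(prd_total_bytes(table, 2))
```

For a table with one 4096-byte entry and one 512-byte entry, the sum is 4608 bytes, which is what the device would be expected to transfer for that command.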
HBA - Host Bus Adapter
PRDTL - Physical Region Descriptor Table Length
PRD - Physical Region Descriptor (describes where in physical memory, and how many bytes, are to be transferred)
DBC - data byte count (inside a PRD)
DBA - data base address (physical address inside a PRD)
PRDBC - Physical Region Descriptor Byte Count
DMA - direct memory access
For more reading: http://www.intel.com/content/www/us/en/io/serial-ata/serial-ata-ahci-spec-rev1-3-1.html

How should one use Disruptor (Disruptor Pattern) to build real-world message systems?

As the RingBuffer up-front allocates objects of a given type, how can you use a single ring buffer to process messages of various different types?
You can't create new object instances to insert into the ring buffer, as that would defeat the purpose of up-front allocation.
So you could have 3 messages in an async messaging pattern:
NewOrderRequest
NewOrderCreated
NewOrderRejected
So my question is how are you meant to use the Disruptor pattern for real-world messaging systems?
Thanks
Links:
http://code.google.com/p/disruptor-net/wiki/CodeExamples
http://code.google.com/p/disruptor-net
http://code.google.com/p/disruptor
One approach (our most common pattern) is to store the message in its marshalled form, i.e. as a byte array. Incoming requests, e.g. FIX messages or binary messages, are quickly pulled off the network and placed in the ring buffer. The unmarshalling and dispatch of different types of messages are handled by EventProcessors (Consumers) on that ring buffer. For outbound requests, the message is serialised into the preallocated byte array that forms the entry in the ring buffer.
If you are using some fixed size byte array as the preallocated entry, some additional logic is required to handle overflow for larger messages. I.e. pick a reasonable default size and if it is exceeded allocate a temporary array that is bigger. Then discard it when the entry is reused or consumed (depending on your use case) reverting back to the original preallocated byte array.
If you have different consumers for different message types you could quickly identify if your consumer is interested in the specific message either by knowing an offset into the byte array that carries the type information or by passing a discriminator value through on the entry.
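The byte-array approach with a type discriminator can be sketched in a few lines of Python. This is a single-threaded illustration only, with no sequence barriers or wait strategies, and it is not the Disruptor API: the class ByteRing, its methods and the numeric message type codes are assumptions made for the example. Overflow handling for oversized messages is left out.

```python
NEW_ORDER_REQUEST = 1    # hypothetical type codes stored in byte 0
NEW_ORDER_CREATED = 2
NEW_ORDER_REJECTED = 3


class ByteRing:
    """Ring of preallocated byte arrays; byte 0 of each entry is the type."""

    def __init__(self, size, entry_bytes):
        if size & (size - 1):
            raise ValueError("size must be a power of two")
        self.mask = size - 1
        # All entries are allocated once, up front, and then reused.
        self.entries = [bytearray(entry_bytes) for _ in range(size)]
        self.lengths = [0] * size
        self.next_seq = 0

    def publish(self, msg_type, payload):
        seq = self.next_seq
        slot = seq & self.mask
        entry = self.entries[slot]
        if 1 + len(payload) > len(entry):
            raise ValueError("payload too large for preallocated entry")
        entry[0] = msg_type                      # discriminator
        entry[1:1 + len(payload)] = payload      # marshalled message
        self.lengths[slot] = 1 + len(payload)
        self.next_seq += 1
        return seq

    def peek_type(self, seq):
        # A consumer can check the type without unmarshalling the payload.
        return self.entries[seq & self.mask][0]

    def read(self, seq):
        slot = seq & self.mask
        entry = self.entries[slot]
        return entry[0], bytes(entry[1:self.lengths[slot]])


if __name__ == "__main__":
    ring = ByteRing(4, 64)
    seq = ring.publish(NEW_ORDER_REQUEST, b"AAPL,100")
    if ring.peek_type(seq) == NEW_ORDER_REQUEST:
        print(ring.read(seq))
```

A consumer that only handles NewOrderCreated messages would call peek_type() and skip every entry whose first byte is not its own code, so all three message types can share one ring of identical entries.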
Also there is no rule against creating object instances and passing references (we do this in a couple of places too). You do lose the benefits of object preallocation, however one of the design goals of the disruptor was to allow the user the choice of the most appropriate form of storage.
There is a library called Javolution (http://javolution.org/) that lets you define objects as structs with fixed-length fields like string[40] etc. that rely on byte buffers internally instead of variable-size objects. That allows the ring buffer to be initialized with fixed-size objects and thus (hopefully) contiguous blocks of memory that allow the cache to work more efficiently.
We are using that for passing events / messages and use standard strings etc. for our business-logic.
Back to object pools.
The following is an hypothesis.
If you will have 3 types of messages (A,B,C), you can make 3 arrays of those pre-allocated. That will create 3 memory zones A, B, C.
It's not like there is only one cache line, there are many and they don't have to be contiguous. Some cache lines will refer to something in zone A, other B and other C.
So the ring buffer entry can have 1 reference to a common ancestor or interface of A & B & C.
The problem is to select the instance in the pools; the simplest is to have the same array length as the ring buffer length. This implies a lot of wasted pooled objects since only one of the 3 is ever used at any entry, ex: ring buffer entry 1234 might be using message B[1234] but A[1234] and C[1234] are not used and unusable by anyone.
You could also make a super-entry with all 3 A+B+C instances inlined and indicate the type with some byte or enum. Just as wasteful on memory size, but it looks a bit worse because of the fatness of the entry. For example, a reader only working on C messages will have less cache locality.
I hope I'm not too wrong with this hypothesis.

How to create a large file on a VFAT partition efficiently in embedded Linux

I'm trying to create a large empty file on a VFAT partition by using the `dd' command in an embedded linux box:
dd if=/dev/zero of=/mnt/flash/file bs=1M count=1 seek=1023
The intention was to skip the first 1023 blocks and write only 1 block at the end of the file, which should be very quick on a native EXT3 partition, and it indeed is. However, this operation turned out to be quite slow on a VFAT partition, along with the following message:
lowmem_shrink:: nr_to_scan=128, gfp_mask=d0, other_free=6971, min_adj=16
// ... more `lowmem_shrink' messages
Another attempt was to fopen() a file on the VFAT partition and then fseek() to the end to write the data, which has also proved slow, along with the same messages from the kernel.
So basically, is there a quick way to create the file on the VFAT partition (without traversing the first 1023 blocks)?
Thanks.
Why are VFAT "skipping" writes so slow ?
Unless the VFAT filesystem driver were made to "cheat" in this respect, creating large files on FAT-type filesystems will always take a long time. The driver, to comply with FAT specification, will have to allocate all data blocks and zero-initialize them, even if you "skip" the writes. That's because of the "cluster chaining" FAT does.
The reason for that behaviour is FAT's inability to support either:
UN*X-style "holes" in files (aka "sparse files")
that's what you're creating on ext3 with your testcase: a file with no data blocks allocated to the first 1GB-1MB of it, and a single 1MB chunk of actually committed, zero-initialized blocks at the end.
NTFS-style "valid data length" information.
On NTFS, a file can have uninitialized blocks allocated to it, but the file's metadata will keep two size fields - one for the total size of the file, another for the number of bytes actually written to it (from the beginning of the file).
Without a specification supporting either technique, the filesystem would always have to allocate and zerofill all "intermediate" data blocks if you skip a range.
Also remember that on ext3, the technique you used does not actually allocate blocks to the file (apart from the last 1MB). If you require the blocks preallocated (not just the size of the file set large), you'll have to perform a full write there as well.
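The difference between the file size and the blocks actually allocated can be observed from user space. The Python sketch below repeats the seek-then-write trick and reports both numbers. It assumes a Unix-like system where st_blocks is reported in 512-byte units, and the function name create_by_seek is made up for the example.

```python
import os
import tempfile


def create_by_seek(path, size, tail=b"\0"):
    """Create a file of the given size by writing only its last bytes."""
    with open(path, "wb") as f:
        f.seek(size - len(tail))   # skip everything before the tail
        f.write(tail)
    st = os.stat(path)
    # st_size is the logical size; st_blocks counts allocated 512-byte units.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "file")
    logical, allocated = create_by_seek(demo_path, 8 * 1024 * 1024)
    print("size:", logical, "allocated:", allocated)
```

On a filesystem with sparse file support the allocated figure is expected to stay far below the logical size, while on a filesystem without holes the two should be roughly equal.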
How could the VFAT driver be modified to deal with this ?
At the moment, the driver uses the Linux kernel function cont_write_begin() to start even an asynchronous write to a file; this function looks like:
/*
 * For moronic filesystems that do not allow holes in file.
 * We may have to extend the file.
 */
int cont_write_begin(struct file *file, struct address_space *mapping,
                     loff_t pos, unsigned len, unsigned flags,
                     struct page **pagep, void **fsdata,
                     get_block_t *get_block, loff_t *bytes)
{
    struct inode *inode = mapping->host;
    unsigned blocksize = 1 << inode->i_blkbits;
    unsigned zerofrom;
    int err;
    err = cont_expand_zero(file, mapping, pos, bytes);
    if (err)
        return err;
    zerofrom = *bytes & ~PAGE_CACHE_MASK;
    if (pos+len > *bytes && zerofrom & (blocksize-1)) {
        *bytes |= (blocksize-1);
        (*bytes)++;
    }
    return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
That is a simple strategy but also a pagecache trasher (your log messages are a consequence of the call to cont_expand_zero() which does all the work, and is not asynchronous). If the filesystem were to split the two operations - one task to do the "real" write, and another one to do the zero filling, it'd appear snappier.
The way this could be achieved while still using the default Linux filesystem utility interfaces would be by internally creating two "virtual" files: one for the to-be-zerofilled area, and another for the actually-to-be-written data. The real file's directory entry and FAT cluster chain would only be updated once the background task is actually complete, by linking its last cluster with the first one of the "zerofill file" and the last cluster of that one with the first one of the "actual write file". One would also want to go for a directio write to do the zerofilling, in order to avoid trashing the pagecache.
Note: While all this is technically possible for sure, the question is how worthwhile would it be to do such a change ? Who needs this operation all the time ? What would side effects be ?
The existing (simple) code is perfectly acceptable for smaller skipping writes, you won't really notice its presence if you create a 1MB file and write a single byte at the end. It'll bite you only if you go for filesizes on the order of the limits of what the FAT filesystem allows you to do.
Other options ...
In some situations, the task at hand involves two (or more) steps:
freshly format (e.g.) a SD card with FAT
put one or more big files onto it to "pre-fill" the card
(app-dependent, optional)
pre-populate the files, or
put a loopback filesystem image into them
In one of the cases I've worked on, we folded the first two, i.e. modified mkdosfs to pre-allocate / pre-create files when making the (FAT32) filesystem. That's pretty simple: when writing the FAT tables, just create allocated cluster chains instead of clusters filled with the "free" marker. It's also got the advantage that the data blocks are guaranteed to be contiguous, in case your app benefits from this. And you can decide to make mkdosfs not clear the previous contents of the data blocks. If you know, for example, that one of your preparation steps involves writing the entire data anyway or doing ext3-in-file-on-FAT (pretty common thing: Linux appliance, SD card for data exchange with Windows app/GUI), then there's no need to zero out anything / double-write (once with zeroes, once with whatever-else). If your usecase fits this (i.e. formatting the card is a useful / normal step of the "initialize it for use" process anyway) then try it out; a suitably-modified mkdosfs is part of TomTom's dosfsutils sources, see mkdosfs.c and search for the -N command line option handling.
When talking about preallocation, as mentioned, there's also posix_fallocate(). Currently on Linux when using FAT, this will do essentially the same as a manual dd ..., i.e. wait for the zerofill. But the specification of the function doesn't mandate it being synchronous. The block allocation (FAT cluster chain generation) would have to be done synchronously, but the VFAT on-disk dirent size update and the data block zerofills could be backgrounded / delayed (i.e. either done at low-prio in background or only done if explicitly requested via fdsync() / sync() so that the app can e.g. alloc blocks, write the contents with non-zeroes itself ...). That's technique / design though; I'm not aware of anyone having done that kernel modification yet, if only for experimenting.
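For reference, calling posix_fallocate() from Python looks like the sketch below. It assumes a Unix-like system where os.posix_fallocate is available and a filesystem that accepts the call. How long it takes, and whether it zero-fills synchronously, depends on the filesystem as described above. The function name preallocate is made up for the example.

```python
import os
import tempfile


def preallocate(path, size):
    """Ask the filesystem to allocate size bytes for the file at path."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Allocate the range [0, size); the file grows to at least size bytes.
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
    st = os.stat(path)
    # Return the logical size and the space reported as allocated.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "prealloc")
    logical, allocated = preallocate(demo_path, 1024 * 1024)
    print("size:", logical, "allocated:", allocated)
```

Unlike the seek-and-write approach, this asks for the blocks themselves to be reserved, so a later write into the range should not fail for lack of space.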
