Vulkan command execution order - graphics

Quoting the Vulkan 1.0 specification, Chapter 5 (Command Buffers), fourth paragraph:
"Unless otherwise specified, and without explicit synchronization, the various commands submitted to a queue via command buffers may execute in arbitrary order relative to each other, and/or concurrently"
The first paragraph of Chapter 2.1.1 (Queue Operation) also states:
"... Command buffers submitted to a single queue are played back in the order they were submitted, and commands within each buffer are played back in the order they were recorded"
Does "arbitrary order" in chapter 5 mean even out-of-order? Then isn't it a conflict to statement of chapter 2.1.1 "played back in the order they were submitted"? Or are commands just "PLAYED BACK" in-order but "EXECUTED" out-of-order?

Chapter 2.1.1 defines the API ordering of commands: the order of command buffers submitted to the queue, and the order of commands within each command buffer (and the order of primitives within a command).
Chapter 5 is saying that the API ordering implies nothing about execution ordering unless some text specifically says otherwise. Chapter 2.1.1 includes specific exceptions to this (blending, depth test), and Chapter 6 introduces execution dependencies which can enforce execution ordering, based on the API ordering defined in 2.1.1. But outside of cases that explicitly declare ordering, the API ordering does not prohibit the queue from doing whatever else it wants with regard to execution.
That being said, the spec explains that poorly.
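To make that concrete, here is a minimal sketch of the Chapter 6 mechanism: a vkCmdPipelineBarrier recorded between two dispatches (compute is used only to keep the sketch short, and cmd is an assumed command buffer in the recording state). Without the barrier, the API order alone would not stop the queue from overlapping or reordering the two dispatches.

    // Hypothetical sketch: dispatch A writes a buffer, dispatch B reads it.
    // The barrier creates an execution (and memory) dependency based on the
    // API order defined in 2.1.1; without it, A and B may overlap.
    VkMemoryBarrier barrier{};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; // A's writes...
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;  // ...made visible to B's reads

    vkCmdDispatch(cmd, 64, 1, 1);                       // dispatch A (producer)

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,           // wait for A's shader stage
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,           // before B's shader stage
        0,                                              // no dependency flags
        1, &barrier,                                    // one global memory barrier
        0, nullptr,                                     // no buffer barriers
        0, nullptr);                                    // no image barriers

    vkCmdDispatch(cmd, 64, 1, 1);                       // dispatch B (consumer)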

How can an ExpansionRegion of mode stream be terminated when all items are processed?

Consider the simplest ExpansionRegion of mode stream. According to the UML documentation (16.12 Expansion Regions):
If the value [of mode] is stream, there is exactly one expansion execution, and element values are offered to this execution in a stream from each collection. That is, each element of a collection on an input ExpansionNode is offered separately as a token, one by one, on all outgoing ActivityEdges from the ExpansionNode
But this ExpansionRegion will never end! As soon as all tokens from the input ExpansionNode are processed, Do something will wait indefinitely for a token from the input, which will never come! How do I terminate this ExpansionRegion?
Update: it seems the only solution I could find is the following (but I'm not sure, see below):
When there are no more tokens available from the input, the control token from Do something is not accepted back by Do something through the C.3-C.4-C.2 path, since according to 16.2.3.4 (Actions and Pins in Activities):
Executing an Action in an Activity requires all of its InputPins to be offered all necessary tokens, as specified by their minimum multiplicity
and according to 15.2.3.2 (Activity Nodes):
When an ActivityNode begins execution, tokens are accepted from some or all of its incoming ActivityEdges and a token is placed on the node.
so it seems reasonable to conclude from the above that the Action (i.e. Do something) will not accept a control token if it is not able to execute, so the DecisionNode will pass the token to control flow C.5, since it has the "else" guard, and according to 15.3.3.6 (Decision Nodes):
For use only with DecisionNodes, a predefined guard “else” (represented as an Expression with “else” as its operator and no operands) may be used for at most one outgoing edge. This guard evaluates to true only if the token is not accepted by any other outgoing edge from the DecisionNode.
Update 2: Is the loop (C.1-C.2-C.3) required? It seems to me the answer is "yes", because without it Do something would process just one object token! That is, Do something would receive a single control token at the ExpansionRegion's invocation, according to 15.2.3.6 (Activity Execution):
When an Activity is first invoked, none of its nodes other than input ActivityParameterNodes will initially hold any tokens. However, nodes that do not have incoming edges and require no input data to execute are immediately enabled. A single control token is placed on each enabled node and they begin executing concurrently. Such nodes include ExecutableNodes (see sub clause 15.5) with no incoming ControlFlows and no mandatory input data and InitialNodes (see sub clause 15.3).
and according to 15.5.3.1 (Executable Nodes):
When an ExecutableNode completes an execution, the control token representing that execution is removed from the ExecutableNode and control tokens are offered on all outgoing ControlFlows of the ExecutableNode.
Is there any clarification in the UML documentation saying that the control token could "stay" on Do something (without the loop) and re-enable its execution to process the next object token?
You seem to think that you modeled a deadlock. Actually, UML Activities cannot have deadlocks by definition. The execution of all action containers (Activitys and StructuredActivityNodes with their subtypes) ends when none of the contained actions is enabled.
StructuredActivityNode: A StructuredActivityNode completes execution according to the same rules as for the completion of the execution of an Activity, including terminating execution due to an ActivityFinalNode.
Activity: The execution of an Activity with no streaming Parameters completes when it has no nodes executing and no nodes enabled for execution, or when it is explicitly terminated using an ActivityFinalNode.
After processing the last element in the input collection, no action is enabled anymore and therefore, the expansion region ends and offers the output collection to the outgoing object flow. Therefore, the initial node and all the control flows are not needed.
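Operationally, the stream-mode semantics described above amount to something like the following sketch (C++, purely illustrative; doSomething stands in for the Do something action): each element is offered one by one, and the region completes by itself when the input collection is exhausted.

    #include <string>
    #include <vector>

    // Hypothetical stand-in for the "Do something" action.
    std::string doSomething(int token) { return std::to_string(token * 2); }

    // Stream-mode ExpansionRegion, operationally: a single expansion execution,
    // elements offered one by one; no control token or loop is needed.
    std::vector<std::string> expansionRegionStream(const std::vector<int>& input) {
        std::vector<std::string> output;           // output ExpansionNode collection
        for (int token : input)                    // each element offered as a token
            output.push_back(doSomething(token));  // object token alone enables the action
        return output;                             // no action enabled anymore: region ends,
    }                                              // collection offered to the outgoing flow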
Having said that, it is possible that you need control flows because you have additional actions. Let's say you need to initialize the system before the first execution and Do something else after each execution of Do something. Your first example works well for this. Just place initialize on C.1 and Do something else on C.3.
Your second solution could be used if you have to do some cleanup before leaving the expansion region. Just place it on C.5. I was not aware that this would work, but after rereading the specification text you cited, I agree that it works.
It seems like you missed the point that object tokens are sufficient to trigger an action - without any need for a control token.
UML 2.5 p. 374:
15.2.3.4 Object Flows
Object tokens pass over ObjectFlows, carrying data through an Activity via their values, or carrying no data (null tokens). A null token can still be passed along an ObjectFlow and used like any other token. For example, an Action can output a null token to explicitly indicate that it did not produce an optional value, and a downstream DecisionNode (see sub clause 15.3) can test for this and branch accordingly.
So once you get rid of the start node and the superfluous control structures you will get the desired behavior: the action starts with receiving an object and ends when emitting the resulting object.
I could not find a full-fledged example on the fly, but that picture illustrates it well enough.
Longer explanation
P. 478 of UML 2.5: (please look into the specs for more details)
16.12 Expansion Regions
16.12.1 Summary
An ExpansionRegion is a StructuredActivityNode that executes its contained elements multiple times corresponding to elements of an input collection.
[...]
16.12.3 Semantics
An ExpansionRegion is a StructuredActivityNode that takes as input one or more collections of values and executes its contained ActivityNodes and ActivityEdges on each value in those collections. If the computation produces results, these may be collected into output collections. The number of output collections can differ from the number of input collections.
[...]
16.12.4 Notation
An ExpansionRegion is shown as a dashed rounded box with one of the keywords «parallel», «iterative» or «stream» in the upper left corner (see Figure 16.48). Input and output ExpansionNodes are drawn as small rectangles divided by vertical bars into small compartments.
[...]
As you can see, the expansion region is a "simple" action that takes an object and returns another. As such it's like any simple action with an input pin, as shown above in my answer. That means it will start upon receipt of an object and emit a (possibly empty) object when it's done.

How does [branch] get executed in HLSL?

The [branch] attribute can mark an if statement in HLSL to make it execute only one branch, instead of executing all branches and discarding the unused results as when using [flatten].
My question is how this can actually work when a branch diverges within a warp/wavefront. As far as I know, in this case all threads must execute all branches taken by any of the threads in the warp (like when using [flatten]), which is a consequence of the fact that they are all within the same SIMD block and must execute the same instruction.
Since the GeForce 6 series, GPUs do actually support branching, though in limited form and with a performance cost. The [branch] and [flatten] tags are just hints to the compiler to prefer one or the other where supported and possible. It ultimately depends on the hardware and on the driver, so different hardware or different driver versions might in the end produce a different execution from what you specified with the tag.

Memory visibility for the OpenCL host using clEnqueueMapBuffer and 'querying whether the command has finished'

The OpenCL 1.1 standard says (5.2.3):
If blocking_map is CL_FALSE i.e. map operation is non-blocking, the pointer to the mapped region returned by clEnqueueMapBuffer cannot be used until the map command has completed. The event argument returns an event object which can be used to query the execution status of the map command. When the map command is completed, the application can access the contents of the mapped region using the pointer returned by clEnqueueMapBuffer.
But in (5.9, immediately after Table 5.15) there is the following statement:
Using clGetEventInfo to determine if a command identified by event has finished execution (i.e. CL_EVENT_COMMAND_EXECUTION_STATUS returns CL_COMPLETE) is not a synchronization point. There are no guarantees that the memory objects being modified by command associated with event will be visible to other enqueued commands.
Q1: So, I'm wondering whether there is some other way to "query the execution status of the map command", and whether memory consistency is guaranteed (for the host, in this case) when a query has returned CL_COMPLETE?
Q2: Am I missing something?
Q3: What are the typical OpenCL idioms for that situation?
1- Use an enqueued barrier and get your event from that command to have both visibility and fine-grained synchronization with the host. Waiting by querying in a while loop uses more CPU, but at least it is fine-grained.
2- Use events for fine-grained control, and waits and barriers for visibility. For example, clWaitForEvents gives you both and uses less CPU, but it is coarser-grained than querying (see the sketch below). On the device side, just use an event network to build dependency graphs between queues.
3- There isn't any single typical idiom. Choose whichever suits your problem.
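As a concrete illustration of the clWaitForEvents option, here is a sketch of a non-blocking map followed by a host-side wait (queue, buf, and size are assumed to exist already):

    #include <CL/cl.h>

    cl_int err = CL_SUCCESS;
    cl_event map_evt;

    // Non-blocking map: the returned pointer must not be dereferenced yet.
    void* ptr = clEnqueueMapBuffer(queue, buf, CL_FALSE, CL_MAP_READ,
                                   0, size, 0, nullptr, &map_evt, &err);

    // Unlike polling CL_EVENT_COMMAND_EXECUTION_STATUS with clGetEventInfo,
    // clWaitForEvents is a synchronization point: when it returns, the mapped
    // region is guaranteed to be visible to the host.
    clWaitForEvents(1, &map_evt);

    // ... safe to read through ptr here ...

    clEnqueueUnmapMemObject(queue, buf, ptr, 0, nullptr, nullptr);
    clReleaseEvent(map_evt);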

EventSourcing race condition

Here is a nice article which describes what ES is and how to deal with it. Everything is fine there, but one image is bothering me.
I understand that in distributed event-based systems we are able to achieve eventual consistency only. Anyway ... How do we ensure that we don't book more seats than available? This is especially a problem if there are many concurrent requests.
It may happen that n aggregates are populated with the same amount of reserved seats, and all of these aggregate instances allow reservations.
All events are private to the command running them until the book of record acknowledges a successful write. So we don't share the events at all, and we don't report back to the caller until we know that our version of "what happened next" was accepted by the book of record.
The write of events is analogous to a compare-and-swap of the tail pointer in the aggregate history. If another command has changed the tail pointer while we were running, our swap fails, and we have to mitigate/retry/fail.
In practice, this is usually implemented by having the write command to the book of record include an expected position for the write. (Example: ES-ExpectedVersion in GES).
The book of record is expected to reject the write if the expected position is in the wrong place. Think of the position as a unique key in a table in a RDBMS, and you have the right idea.
This means, effectively, that the writes to the event stream are actually consistent -- the book of record only permits the write if the position you write to is correct, which means that the position hasn't changed since the copy of the history you loaded was written.
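As a sketch of that compare-and-swap (an in-memory book of record with invented names, not any particular store's API):

    #include <stdexcept>
    #include <string>
    #include <vector>

    // Hypothetical book of record for a single aggregate's history.
    struct EventStream {
        std::vector<std::string> events;

        // The append succeeds only if the caller's expected position is still
        // the tail; otherwise another command moved the tail first, and this
        // command must reload the history, re-validate, and retry (or fail).
        void append(std::size_t expectedVersion, const std::string& event) {
            if (events.size() != expectedVersion)
                throw std::runtime_error("expected version mismatch: concurrent write won");
            events.push_back(event);
        }
    };

Of two reservation commands that both loaded the history at version n, only one append(n, ...) can succeed; the loser re-reads the stream, sees the seats already taken, and rejects the booking.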
It's typical for commands to read event streams directly from the book of record, rather than the eventually consistent read models.
It may happen that n AggregateRoots will be populated with the same amount of reserved seats; that means having validation in the reserve method won't help. Then n AggregateRoots will emit the event of a successful reservation.
Every bit of state needs to be supervised by a single aggregate root. You can have n different copies of that root running, all competing to write to the same history, but the compare and swap operation will only permit one winner, which ensures that "the" aggregate has a single internally consistent history.
There are going to be a couple of ways to deal with such a scenario.
First off, an event stream would have the current version as the version of the last event added. This means that you would not, or should not, be able to persist the event stream if it is no longer at the version it had when loaded. Since the very first write would cause the version of the event stream to be increased, the second write would not be permitted. Since events are not emitted, per se, but rather are a result of the event sourcing, we would not have the type of race condition in your example.
Well, if your commands are processed behind a queue, any failures should be retried. Should it not be possible to process the request, you would enter the normal "I'm sorry, Dave. I'm afraid I can't do that" scenario by letting the user know that they should try something else.
Another option is to start the processing by issuing an update against some table row to serialize any calls to the aggregate. Probably not the most elegant but it does cause a system-wide block on the processing.
I guess, to a large extent, one cannot really trust the read store when it comes to transactional processing.
Hope that helps :)

resequencer with holes in the sequence

we have an ETL scenario where we use the resequencer.
Messages arrive at the flow with a sequence number that the resequencer uses to send messages in order, but sometimes messages are discarded earlier (because of data validation) and never reach the resequencer. This produces holes in the sequence, and with the default release strategy the resequencer stops sending messages. To avoid this, we developed a new SequenceTimeoutReleaseStrategy that is a mix between the default strategy and TimeoutCountSequenceSizeReleaseStrategy from SI. When a message arrives, it checks the timeout and releases it if necessary.
All this worked well except for the last messages, which arrive before the timeout and have holes. These messages aren't released by the strategy. We could use a reaper, but the sequence may have more than one hole, so when the resequencer releases the messages it will stop at the first sequence break and remove the group, losing the rest of the messages. So the question is: is there a way to use the resequencer when there can be holes in the sequence?
One solution we have, and want to avoid, is a scheduled task that removes the messages directly from the message store, but this could be a problem with concurrency and so on, so we would prefer other solutions.
Any help is appreciated here
Regards
Guzman
There are two components involved; the release strategy says "something" can be released; the actual decision as to what is released is performed by the MessageGroupProcessor. In this case, a ResequencingMessageGroupProcessor.
You would need to customize that class to "skip" the hole(s).
You can't wire in a customized MGP using the <resequencer/> namespace; you would have to wire it up using <bean/>s - a ResequencingMessageHandler and a ConsumerEndpointFactoryBean.
Or use a BeanFactoryPostProcessor to change the constructor argument to your custom class.
