Why does the Apple A7 (ARMv8a) have 2 branch units (in addition to the indirect branch unit)?

The Apple A7 microarchitecture has 2 branch units and an indirect branch unit (per LLVM's AArch64SchedCyclone.td scheduling model). Since the A7 is a modern superscalar out-of-order CPU with a reasonably deep pipeline (read that as a significant penalty for speculation failure), it makes sense that it has branch prediction for conditional branches. Moreover, since indirect branches are a completely different animal from conditional branches, it makes sense that there would be a separate indirect branch unit.
But why the 2 branch units? It makes sense that there would be 2 types of branch units. But it doesn't make sense (to me at least) that the A7 could process 2 branches at the same time. Does not compute. So is there a further specialization of conditional branches?
Does Apple provide any guidance like ARM's Software Optimization Guide for its ARMv8 CPUs which would answer this question? Nope, and the LLVM .td files aren't a substitute for that sort of guide.

Related

How do I keep RISC-V compliance?

I just had a discussion with a colleague about what RISC-V compliance actually means. We discussed the following topics in detail:
As far as I understood the idea, a processor is RISC-V compliant as long as it implements a RISC-V base instruction set and, optionally, one or more of the standard extensions, entirely and not just partly. One might even define and implement one's own instructions (as brownfield or greenfield extensions) as long as they do not touch the base instruction set or any of the standard extensions. As long as this is guaranteed, the machine code generated by any RISC-V compliant compiler would run on my machine. That's the whole point of it, right?
The RISC-V ISA does not use delayed branches. My understanding is that whether branches are delayed or not is already defined by the ISA and is not a matter of implementation. Is this correct?
Assume that one wants to use RISC-V with delayed branches. Whether this is a good idea or not, let's just focus on the compliance question. In my opinion it would no longer be RISC-V compliant to define and implement some of the existing branch/jump instructions of the base instruction set as delayed branches. Code produced by a RISC-V compliant compiler would no longer work on such a machine. One would be free to define one's own delayed branch instructions instead. Of course, as with any self-written extension, it cannot be expected that an arbitrary compiler would use such an instruction. Am I right?
According to the RISC-V specification, "the program counter pc holds the address of the current instruction." My interpretation of this sentence is that any jump/branch instruction refers to the address at which it is stored, again independent of the implementation. Example: assume an implementation where the jump/branch instruction is executed a few cycles after it has been fetched. This would mean that the PC has potentially already advanced. It is therefore the implementation's task to somehow store the address of the jump/branch instruction. It is not the compiler's task to know about this delay and compensate for it by modifying the immediate that is to be added to the PC. Am I summarizing this correctly?
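To make that interpretation concrete, here is a minimal Python sketch of the semantic model (purely illustrative, assuming only the standard B-type encoding): the branch target is computed from the address of the branch instruction itself, so however deep the pipeline is, the implementation has to carry that address along rather than use wherever the fetch PC has advanced to.

def sign_extend(value, bits):
    # Interpret the low `bits` bits of value as a two's-complement number.
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def branch_target(branch_pc, offset):
    # RV32I conditional branches encode a 13-bit signed, 2-byte-aligned offset
    # that is added to the address of the branch instruction itself.
    return branch_pc + sign_extend(offset, 13)

# The target depends only on where the branch is stored, not on how far the
# pipeline's fetch PC has moved by the time the branch actually executes.
assert branch_target(0x1000, 0x10) == 0x1010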
So, in a nutshell the short version of my questions:
Does RISC-V compliance mean that base integer instruction set and standard extension must neither be changed nor stripped?
Is the information whether a branch is delayed or not already part of the ISA?
Is the PC of RISC-V considered agnostic to any pipeline delay?
I consider an ISA in general to be agnostic to any implementation specifics. The counter-argument to what I claim is that one would have to tell the compiler about implementation specifics (delayed branches, PC behavior etc.) and that this could still be considered compliant with the ISA.
I am not an expert, but I have implemented a few cores during the last 20 years. The key concepts in your three part-questions are completeness and user visibility. To claim completeness means, in my opinion, that no part of a standard can be changed or stripped. However, it is a rare standard indeed that has no dubious points and sections that may be interpreted differently by different people. In the specific case of RISC-V I would like to point to an aid for indicating compliance, if you have not seen it already.
It would be good to have some real experts answering this question.
Does RISC-V compliance mean that base integer instruction set and standard extension must neither be changed nor stripped?
I have the same understanding as you. It does not make sense to claim behaviour as defined in a standard and then not honour that standard.
Is the information whether a branch is delayed or not already part of the ISA?
Again I concur with you. Delayed branches are a feature exposed to users of a processor, hence an ISA must specify whether such branches exist. Indeed, from page 15 of riscv-spec-v2.2.pdf:
"Control transfer instructions in RV32I do not have architecturally visible delay slots."
Notice the wording: as long as your implementation does not expose any delay slot to a user, you can do as you want. And with a non-standard extension you are perfectly free to design instructions that have delay slots; you may even put RV32I instructions in those slots.
Is the PC of RISC-V considered agnostic to any pipeline delay?
Yes.

What is the effectiveness of multithreaded alpha-beta pruning?

What would the effectiveness be of multithreading with alpha-beta pruning if:
The multithreading was used iteratively. For example, thread one would look at the first branch, the second thread would look at the second branch, etc. I believe this should only be done at the first depth (the next move the AI would make), since the other depths could be cut off.
One thread started at the first "move" generated and searched forward through half the move set, while the second thread started at the last "move" generated and searched backward through the other half. Here, I think there could be increased speedup, because the last move could turn out to be the best move, and as a result the second thread could cause cutoffs the first thread couldn't.
The multithreading was used to think on the opponent's time. For example, say the opponent took some time to think and make a move. The AI could iteratively deepen its search and find results while the opponent is thinking. I'd imagine this wouldn't necessarily cause a speedup, but it would give more time for minimax analysis.
There may be other optimizations, I'd imagine, but these were the few that came to mind. I don't know if they actually will improve anything, though.
If I understand your idea correctly, you plan to search the moves in the root position in parallel. In comparison to a strictly sequential algorithm it should be better, but I would not expect it to scale well with multiple CPUs.
For comparison, here is a summary of existing parallelization strategies in chess:
https://www.chessprogramming.org/Parallel_Search
Since alpha-beta is a sequential algorithm, all parallelization strategies are speculative, so you want to avoid spending time searching parts of the tree that will eventually be cut off by other moves. One relatively simple strategy to avoid searching irrelevant subtrees is called the Young Brothers Wait Concept.
There are also algorithms with improved scalability but at the cost of being more difficult to understand and implement. For instance, supporting work-stealing should improve scalability.
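To illustrate the root-splitting idea from the question, here is a minimal Python sketch (the generate_moves, make_move and evaluate helpers are hypothetical placeholders, not from any particular engine, and in CPython you would need processes rather than threads for real CPU parallelism). Each worker starts with the full window, so cutoffs found in one subtree do not help the others, which is the main reason plain root splitting scales poorly compared with a well-ordered sequential search:

from concurrent.futures import ThreadPoolExecutor

NEG_INF = float("-inf")

def alphabeta(position, depth, alpha, beta):
    # Plain sequential negamax-style alpha-beta over hypothetical helpers.
    if depth == 0:
        return evaluate(position)
    best = NEG_INF
    for move in generate_moves(position):
        score = -alphabeta(make_move(position, move), depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # cutoff: the rest of this node's moves are irrelevant
    return best

def root_split_search(position, depth, workers=4):
    # Search each root move in its own task, each with the full (-inf, +inf)
    # window, so a cutoff found in one subtree does not help the others.
    moves = list(generate_moves(position))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(
            lambda m: -alphabeta(make_move(position, m), depth - 1, NEG_INF, float("inf")),
            moves))
    best_score, best_move = max(zip(scores, moves), key=lambda pair: pair[0])
    return best_move, best_score

Schemes like Young Brothers Wait improve on this by searching the first (eldest) move of a node sequentially to establish a bound before the remaining moves are handed out in parallel.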

Are there functional programming languages that run on the GPU?

Using the traditional, sequential reduction approach, the following graph is reduced as:
(+ (+ 1 2) (+ 3 4)) ->
(+ 3 (+ 3 4)) ->
(+ 3 7) ->
10
Graph reductions are, though, inherently parallel. One could, instead, reduce it as:
(+ (+ 1 2) (+ 3 4)) ->
(+ 3 7) ->
10
As far as I know, every functional programming language uses the first approach. I believe this is mostly because, on the CPU, the cost of scheduling threads outweighs the benefit of doing parallel reductions. Recently, though, we've been starting to use the GPU more than the CPU for parallel applications. If a language ran entirely on the GPU, those communication costs would vanish.
Are there functional languages making use of that idea?
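(For concreteness, here is a minimal Python sketch of the level-wise idea above; it only illustrates the reduction order and makes no claim about how any GPU language actually works.)

from concurrent.futures import ThreadPoolExecutor
import operator

# An expression is either a number or a tuple (operator, left, right).
expr = (operator.add, (operator.add, 1, 2), (operator.add, 3, 4))

def reduce_parallel(e, pool):
    # Reduce both sub-expressions of a node at the same time, mirroring
    # the (+ (+ 1 2) (+ 3 4)) -> (+ 3 7) -> 10 trace above.
    if not isinstance(e, tuple):
        return e
    op, left, right = e
    left_future = pool.submit(reduce_parallel, left, pool)
    right_future = pool.submit(reduce_parallel, right, pool)
    # Fine for a tiny tree; a real reducer needs a proper scheduler so that
    # blocked parent tasks cannot exhaust the worker pool.
    return op(left_future.result(), right_future.result())

with ThreadPoolExecutor() as pool:
    print(reduce_parallel(expr, pool))  # 10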
What makes you think that, on the GPU, scheduling would not outweigh the benefits?
In fact, the kind of parallelism used in GPUs is far harder to schedule: it's SIMD parallelism, i.e. a whole batch of stream processors all do essentially the same thing at a time, except each one crunches a different bunch of numbers. So, not only would you need to schedule the subtasks, you would also need to keep them synchronised. Doing that automatically for general computations is virtually impossible.
Doing it for specific tasks works out quite well and has been embedded into functional languages; check out the Accelerate project.
SPOC provides some GPGPU access from OCaml.
on the CPU, the cost of scheduling threads outweighs the benefit of doing parallel reductions
Thread scheduling is very effective in modern OSes. Thread initialization and termination may be a matter of concern, but there are plenty of techniques to eliminate those costs.
Graph reductions are, though, inherently parallel
As was mentioned in the other answer, GPUs are very special devices. One can't simply take an arbitrary algorithm and make it 100 times faster just by rewriting it in CUDA. Speaking of CUDA, it is not exactly SIMD (Single Instruction, Multiple Data) but SIMT (Single Instruction, Multiple Threads). This is something far more complex, but let's think of it as a mere vector processing language. As the name suggests, vector processors are designed to handle dense vectors and matrices, i.e. simple linear data structures. So any branching within a warp reduces the efficiency of the parallelism, and performance can drop toward zero. Modern architectures (Fermi and later) are capable of processing even some trees, but this is rather tricky and the performance isn't that impressive. So you simply can't accelerate arbitrary graph reduction.
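As a rough CPU-side analogy for that branching cost (NumPy standing in for the GPU here, so this is only an illustration, not CUDA semantics): in a data-parallel formulation a branch is typically handled by evaluating both sides for every element and then selecting, so lanes on the "wrong" side of a divergent branch do work that is thrown away.

import numpy as np

x = np.arange(8)

# Scalar code would pick one branch per element; the data-parallel version
# computes both branches for every element and selects afterwards.
result = np.where(x % 2 == 0, x * 10, x - 1)
print(result)  # [ 0  0 20  2 40  4 60  6]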
What about functional languages for GPGPU? I don't believe they can be taken seriously. Most valuable CUDA code exists inside heavily optimized libraries written by PhDs, and it is aimed squarely at performance. Readability, declarativeness, clarity and even the safety of functional languages don't matter much there.
The language Obsidian is a domain specific language embedded in Haskell which targets GPGPU computations. It's rather more low-level than what you're asking for but I thought I'd mention it anyway.
https://github.com/BenjaminTrapani/FunGPU provides a Racket-like functional language that runs entirely on GPUs and other devices that can be targeted with SYCL. The runtime automatically schedules the independent sub-trees in a way that efficiently utilizes the GPU (multiple evaluations of the same instructions with different data are evaluated concurrently). It is still in early stages but could be worth experimenting with. It is already outperforming the Racket VM.

How to calculate Amdahl's Law for threading effectiveness

It's easy to find and understand the function definition for Amdahl's Law, but all of the working examples I was able to find were either too vague or too academic/cerebral for my tiny pea brain to understand.
Amdahl's Law takes two parameters: F, the percentage of a task that cannot be improved via multi-threading, and N, the number of threads to use.
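(For reference, writing F as the serial fraction exactly as defined above, the law itself is just Speedup = 1 / (F + (1 - F) / N). A tiny Python helper, only for illustration:)

def amdahl_speedup(serial_fraction, n_threads):
    # F is the fraction of the task that cannot be parallelized,
    # N is the number of threads working on the rest.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

print(amdahl_speedup(0.10, 8))  # ~4.7x with 10% serial work and 8 threads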
How does one calculate F with any degree of accuracy?
How do you look at a piece of code and determine whether that will be improved by multi-threading?
It's relatively easy to say which parts of your code certainly won't benefit from multi-threading: the sequential parts. If you have to carry out a series of small steps in order, multi-threading won't help because you always need to wait for one step to be done before starting the next. Many common tasks aren't (necessarily) sequential in this sense: for example, searching a list for a number of items. If you want to extract every red item from a list, you can share parts of the list among several threads and collect all the red items from each part into a final result list. The difficulty in concurrent programming lies in finding efficient ways of doing this for real problems.
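A minimal sketch of that "extract every red item" example (assuming the items are strings; in CPython you would want processes rather than threads for a real speedup, but the structure is the point):

from concurrent.futures import ThreadPoolExecutor

items = ["red ball", "blue cube", "red hat", "green pear", "red car"]

def red_items(chunk):
    # Each worker filters its own slice of the list independently.
    return [x for x in chunk if x.startswith("red")]

def parallel_filter(items, workers=3):
    size = (len(items) + workers - 1) // workers
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(red_items, chunks)
    # Collect the per-chunk results into one final list.
    return [x for part in partial_results for x in part]

print(parallel_filter(items))  # ['red ball', 'red hat', 'red car']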
At a lower level you can talk about data dependency: a particular instruction or block depends on a previous block if it uses the results of that block's calculations in its own. So (pseudocode):
Block one:
load r1 into r2
add r1 to r3 into r4
Block two:
load r4 into r1
add 3 to r4 into r4
Block two depends on block one: they must be executed in order. Here:
Block one:
load r1 into r2
add r1 to r3 into r4
Block two:
load r1 into r3
add 3 to r1 into r1
that isn't the case. This isn't directly useful for concurrency, but hopefully it illustrates the point more concretely. It also illustrates another problem in handling concurrency: as abstract blocks of functionality these two can be run in parallel, but in the concrete example given here they're reading/writing some of the same registers, so a compiler/pipeliner/whatever would have to do more work to make them run together. This is all very complex, but it is covered beautifully in Hennessy and Patterson's Computer Architecture: A Quantitative Approach (http://www.amazon.com/Computer-Architecture-Quantitative-Approach-Edition/dp/1558605967).
Which other parts don't benefit from multi-threading depends on your programming environment and machine architecture.
As for how to get a percentage, there's probably some hand-waving involved in a practical case - I doubt you'll ever get a precise number. If you divide your code up into functional units and profile the execution time in each, that would give you a roughly appropriate weighting. Then if one part that takes up 90% of the execution time can be improved with multi-threading, you say that 90% of your 'task' can be so improved.
You should look at the algorithm, not at the code, if you want to see what can be improved by multithreading.
Typically, parallel algorithms should be designed as parallel from the ground up. It is much harder to "parallelize" code instead of the algorithm itself.
Look at the dependencies in memory access (spatial dependencies), look at the sequence of operations (temporal dependencies), know your computer architecture in detail, and you will know how to build a proper algorithm for your task.
As for the formula itself, Wikipedia has a very good explanation: http://en.wikipedia.org/wiki/Amdahl's_law
Amdahl divides all work into two groups: perfectly parallelizable and not at all parallelizable.
Think of the latter as the piece of work that you can't ever get rid of. You can get rid of the former perfectly by adding resources.
If you read a text file and process it line by line, you can never get rid of reading the file sequentially to parse the lines. You can, however, parallelize the processing of the individual lines. If reading the file takes 1 s, your code will never run any faster than that.
For real code, you can hook up a profiler to see how much time is spent in each part. Hopefully you can classify each part into one of the two categories. Once you have done that, you can easily estimate the speed-up, and it will be pretty accurate.
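A hedged worked example in the spirit of the file-reading case above (the 1 s / 9 s split is invented for illustration): if profiling shows 1 s of unavoidable sequential reading and 9 s of per-line processing, then F = 0.1 and Amdahl's Law caps the speedup at 10x no matter how many threads you add.

serial, parallel = 1.0, 9.0        # seconds, from a hypothetical profile
F = serial / (serial + parallel)   # fraction that cannot be parallelized = 0.1

for n in (2, 4, 8, 1000):
    speedup = 1.0 / (F + (1.0 - F) / n)
    print(n, round(speedup, 2))    # 2 -> 1.82, 4 -> 3.08, 8 -> 4.71, 1000 -> 9.91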

Implementing a Release Schedule

The company I work for is trying to implement a release schedule and I want to get some constructive feedback from people that work in better structured environments than I am in.
We have one product that is finished and being used by several customers, but we have 4 additional products in the works - and actively being marketed as if they are finished. (Imagine that!)
We are a very small company working very quickly (and yes, sloppy at times) and with tight deadlines and tight budgets, so we don't have the luxury of written requirements, systematic QA process, etc. Basically the owners of the company come to the developers (3 of us) with ideas and we implement them. Then the subject matter experts test the features to make sure the app does what it is supposed to do.
I know that last paragraph opens me up to all kinds of "you can't do it this way" types of feedback, but I don't need that. I understand how wrong this approach is. At one point I was able to convince the owners to hire a project manager and a QA person, but after a short time both were laid off due to loss of revenue. We are where we are and there's no changing the culture at this point.
What I'm trying to do is manage expectations. We have a list of requested features a mile long and here's what I have proposed.
We will do quarterly releases to production of our finished products. The first release will be in October. Rather than trying to manage what will be done between now and then based on High/Medium/Low priorities, we will manage features based on what can and cannot be finished between now and September. At that point we will stop all feature development and focus on testing and fixing defects in order to get the product ready for release the following month. We'll repeat this process each quarter. Basically the steps will be like this:
1) Place all outstanding features into a future release based on how critical it is.
2) Work on these features during the quarter.
3) As new features are requested, place them into a "queue" for a particular release cycle.
4) If the feature has to go into the current release, then move other features to the next release.
5) At certain points during the cycle, evaluate which features may not get into the current release and adjust accordingly.
6) End development of features at least 30 days before scheduled push to production and focus on testing and bug fixing.
7) Push something to production on the scheduled date and then take the heat for not having everything finished that we agreed to in the beginning (hey, I'm being realistic...the people I work for aren't.)
Oh, also, if you plan to tell me to "get a new job" then don't bother answering. That's not an option at the moment.
If you have any advice regarding this proposed approach, or any links to resources that might help me better understand how to structure this process, I would greatly appreciate it.
Thanks in advance for your help.
Darvis
Given the lack of project management, organization, etc., I think you're going to run into problems trying to force yourself into a quarterly (or any fixed-date) release cycle. This will be especially true if you have any "large" features that will take more than the two months of development time you're allowing.
My suggestion would be to do a feature-based release cycle. Just make your queue of features, decide which ones you think you can reasonably do in a specific time frame. Implement those features, give yourself one month (or whatever) for testing, then release. Move to the next group of features. If it takes 2 months instead of 3, great. If it takes 4, that's fine too.
That being said, I'd focus on trying to make this shorter, not longer. Having more, smaller releases will actually help you in this case, especially since you don't have the formal procedures (and personnel) to handle QA, etc. Staying flexible will help you fix problems that will get into your releases...
Quite simply, given the lack of defined process, there's not much chance of successfully implementing a solid release schedule. This isn't just pessimism, although I'll readily admit that it partly is.
Much like success being based largely upon hard work and a little luck, solid, repeatable release schedules are based upon process; having things such as functional specifications and design specifications is really critical to being able to release on a consistent, solid schedule. (You know there's a reason why people have such specification things, right?) That's not to say that you can't hit your schedule and release expectations without them; you very possibly can. But having such a process in place massively increases your chances of being able to meet expectations, at least partially because it ensures that expectations don't drift into "unreasonable" territory while you're still implementing.
Again, this doesn't mean you can't achieve what you need by doing what you described above; to be honest, if you're in an environment that is actively hostile to having such a process in place, you're probably working in the best way to achieve what you need. And I don't mean to be completely pessimistic; it sounds like you've got a good grasp on how to attempt to get this stuff done, and for what you've got to work with, you seem to have a reasonable process in place.
But I can virtually guarantee that you'll end up better off if you can just get two things: 1) a functional specification from management that describes what they want the software to do, and 2) a design specification from engineering describing to management how you're going to make the software do what they asked for in the functional specification. I'd think you could possibly even fit this into your process. Functional specifications don't need to be complicated; the key thing about them is that they are written down, which prevents bickering about what management asked for; it's right there. And design specifications don't need to take a lot of time either; a quick one-pager to let management know how (at a high level) you're going to implement what they need provides them with reassurance that you've heard them, and can help them understand the complexity involved (but don't go too deep into it, otherwise you run the risk of boring them and losing their attention).
Release early and often.
I often find that users don't know what they want until we show them. In our facility, we have a light-weight QA/test system. Let users try new things. As users approve of ideas, we move them to production. This lets us always be working on new things, testing interfaces, adding database tables and columns, without breaking production code.
The only "trick" is constantly reminding the customer that the test platform is not the production platform.
What are you using for source control?
We use SVN and have the flexibility to create, if necessary, a variety of different branches depending on what's to be deployed in the next release. If changes a1, a2, and a3 are set to be released, we can merge them into a branch off production. If a2 becomes less of a priority, we can roll back only a2's changes, or if there's overlap, just create a new branch and merge only a1 and a3, leaving a2 for some later release.
How likely are the owners to rewrite a specification prior to a given release? Where I work this happens frequently, making the ability to shift gears and rollback / re-merge if necessary very helpful.
My company also gets bogged down with feature requests.
What we did is go through all the features and give an estimate of the amount of time they will take to implement. Then, we leave it up to our "change committee" (our CEO, team leaders, etc.) to give us the features we will be completing during the next sprint.
This way, we aren't being given unreasonably large workloads, and the end user has some say in what's completed.
