Is it possible to extract instruction-specific energy consumption in a program? [closed] - linux

What I mean is: given a source-code file, is it possible to extract energy-consumption levels for a particular code block, or even a single instruction, using a tool like perf?

Use jRAPL, a framework for profiling Java programs on CPUs that expose Intel's RAPL (Running Average Power Limit) energy counters.
For example, the following snippet measures the energy consumed by a code block as the difference between readings taken before and after it:
double beginning = EnergyCheck.statCheck(); // energy reading before the block
doWork();                                   // the code block being measured
double end = EnergyCheck.statCheck();       // energy reading after the block
System.out.println(end - beginning);        // energy consumed by doWork()
The paper describing this framework, "Data-Oriented Characterization of Application-Level Energy Optimization", is available at http://gustavopinto.org/lost+found/fase2015.pdf

There are tools for measuring power consumption (see @jww's comments for links), but they don't even try to attribute consumption to specific instructions the way perf record can statistically sample event-to-instruction correlations.
You can get an idea by running a whole block of the same instruction, like you'd do when microbenchmarking the throughput or latency of an instruction, then dividing the energy consumed by the number of instructions executed.
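A minimal sketch of that idea on Linux, assuming a CPU and kernel that expose the package energy counter through the powercap sysfs file /sys/class/powercap/intel-rapl:0/energy_uj (reading it may need root, and the counter wraps, which this sketch ignores); the AND loop is just an illustrative payload:

/* Minimal sketch: read the RAPL package energy counter, run a long block
 * of one cheap instruction, read it again, divide by the instruction count. */
#include <stdio.h>

static unsigned long long read_energy_uj(void)
{
    unsigned long long uj = 0;
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (f) {
        if (fscanf(f, "%llu", &uj) != 1)
            uj = 0;
        fclose(f);
    }
    return uj;
}

int main(void)
{
    const long long iters = 1000000000LL;          /* ~1e9 ANDs */
    unsigned long x = 1, y = 2;

    unsigned long long before = read_energy_uj();
    for (long long i = 0; i < iters; i++) {
        /* one dependent AND per iteration; a serious microbenchmark would
           unroll this, pin the thread, and subtract the loop overhead */
        asm volatile("and %1, %0" : "+r"(x) : "r"(y));
    }
    unsigned long long after = read_energy_uj();   /* counter wrap ignored */

    double joules = (after - before) * 1e-6;       /* energy_uj is in microjoules */
    printf("~%.3f nJ per AND, loop overhead included\n",
           joules * 1e9 / (double)iters);
    return 0;
}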
But a significant fraction of CPU power consumption is outside the execution units, especially for out-of-order CPUs running relatively cheap instructions (like scalar ADD / AND), or with different memory-subsystem behaviour triggered by different access patterns (like hardware prefetching).
Different patterns of data dependencies and latencies might matter. (Or maybe not, maybe out-of-order schedulers tend to be constant power regardless of how many instructions are waiting for their inputs to be ready, and setting up bypass forwarding vs. reading from the register file might not be significant.)
So a power or energy-per-instruction number is not directly meaningful on its own; it's mostly useful relative to a baseline such as a long block of dependent AND instructions. (AND should be one of the lowest-power instructions, probably with fewer transistors flipping inside the ALU than ADD.) That's a good baseline for power microbenchmarks that run 1 instruction or uop per clock, but maybe not a good baseline for power microbenchmarks where the front-end is doing more or less work.
You might want to investigate how dependent AND vs. independent NOP or AND instructions affect energy per unit time or energy per instruction (i.e. how power outside the execution units scales with instructions per clock and/or with register reads / write-back).
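For instance, keeping the measurement loop from the sketch above, you could swap in one of these two loop bodies (GNU inline asm on x86; assumes unsigned long variables x, y and a, b, c, d are declared) and compare the resulting energy-per-instruction numbers:

/* Dependent chain: each AND needs the previous result, so at best one
   completes per cycle and little of the core is busy in parallel. */
asm volatile("and %1, %0\n\t"
             "and %1, %0\n\t"
             "and %1, %0\n\t"
             "and %1, %0"
             : "+r"(x) : "r"(y));

/* Independent ANDs: no dependency between them, so a wide out-of-order
   core can run several per cycle, exercising more of the scheduler,
   register file and bypass network per unit time. */
asm volatile("and %4, %0\n\t"
             "and %4, %1\n\t"
             "and %4, %2\n\t"
             "and %4, %3"
             : "+r"(a), "+r"(b), "+r"(c), "+r"(d) : "r"(y));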

Related

How can a system like Tesla's AutoPilot keep up with constantly changing requests from multiple processes? [closed]

As a software developer, I am trying to understand how a system could possibly work fast and efficiently enough, and operate consistently and flawlessly with such precision, for all the ongoing actions it needs to account for, in a system such as Tesla's AutoPilot (a self-driving car system)...
In a car driving 65 MPH, if a deer runs out in front of the car, it immediately makes adjustments to protect the vehicle from a crash, while having to keep up with all the other sensor requests constantly firing at the same time for possible actions, on a micro/millisecond scale, without skipping a beat.
How is all of that accomplished in sync? And how does processing report back to it so quickly that it is able to respond almost instantaneously (without getting backed up with requests)?
I don't know anything about Tesla code, but I have read other real-time code and analysed time slips in it. One basic idea is that if you check something every millisecond, you will always respond to change within a millisecond. The simplest possible real-time system has a "cyclic executive" built around a repeating schedule that tells it what to do when, worked out so that in all possible cases everything that has to be dealt with is dealt with within its deadline. Traditionally you are worrying about CPU time here, but not necessarily. The system I looked at was most affected by the schedule for a serial bus called a 1553 (https://en.wikipedia.org/wiki/MIL-STD-1553) - there almost wasn't enough time to get everything transmitted and received on time.
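A minimal sketch of a cyclic executive, assuming a POSIX system with clock_nanosleep; the 1 ms frame, the rates, and the task functions (read_sensors, control_step, telemetry) are made-up placeholders, not anything from a real autopilot:

/* Cyclic-executive sketch: a fixed 1 ms minor frame, with tasks statically
 * assigned to slots so every input is looked at within a known bound. */
#include <time.h>

#define FRAME_NS 1000000L                 /* 1 ms minor frame */

static void read_sensors(void) { /* poll inputs every frame (1 ms)    */ }
static void control_step(void) { /* compute actuator commands (2 ms)  */ }
static void telemetry(void)    { /* low-rate reporting (10 ms)        */ }

int main(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (unsigned frame = 0; ; frame++) {
        read_sensors();                   /* must fit inside its slot */
        if (frame % 2 == 0)  control_step();
        if (frame % 10 == 0) telemetry();

        /* wake at the start of the next frame; absolute time, so no drift */
        next.tv_nsec += FRAME_NS;
        if (next.tv_nsec >= 1000000000L) { next.tv_nsec -= 1000000000L; next.tv_sec++; }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}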
This is a bit too simple because it doesn't cope with rare events which have to be dealt with really quickly, such as responses to interrupts. Clever schemes for interrupt handling don't have as much of an advantage as you would expect, because there is often a rare worst case that makes the clever scheme underperform a cyclic executive, and real-time code has to work in the worst case. In practice, though, you do need something with interrupt handlers and high-priority processes that must be run on demand, plus low-priority processes that can be ignored when other stuff needs to make deadlines but will be run otherwise. There are various schemes and methodologies for arguing that these more complex systems will always make their deadlines. One of the best known is https://en.wikipedia.org/wiki/Rate-monotonic_scheduling. See also https://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling.
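Rate-monotonic scheduling comes with a simple sufficient schedulability test (the Liu & Layland bound); here is a small sketch of it with a made-up task set:

/* Liu & Layland test for rate-monotonic scheduling: n periodic tasks are
 * schedulable if total utilisation sum(C_i/T_i) <= n*(2^(1/n) - 1).
 * The task set below is invented purely for illustration. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* worst-case execution time C and period T, in the same time unit */
    double C[] = { 1.0, 2.0, 3.0 };
    double T[] = { 10.0, 20.0, 50.0 };
    int n = 3;

    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += C[i] / T[i];

    double bound = n * (pow(2.0, 1.0 / n) - 1.0);
    printf("utilisation %.3f, RMS bound %.3f -> %s\n",
           u, bound, u <= bound ? "schedulable" : "needs a finer analysis");
    return 0;
}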
An open source real time operating system that has seen real life use is https://en.wikipedia.org/wiki/RTEMS.

OpenCV and haartraining - how to reduce time for calculation?

Is it possible to improve the performance of the haartraining application? As far as I can tell it uses only one thread. Does the nature of its algorithm rule out multithreading? I'm looking for ways to speed up training the classifier, so my question is: is the only way to decrease the calculation time to use a faster processor, i.e. does the number of cores not matter?

How is machine language run? [closed]

This question does not really relate to any specific programming language; it relates, I think, to EVERY programming language out there.
So, the developer enters code into an IDE or something of the sort. The IDE turns that, directly or indirectly (maybe there are many steps involved: A turns it into B, B turns it into C, C turns it into D, etc.), into machine language (which is just a bunch of numbers). How is machine language interpreted and run? I mean, doesn't code have to come down to some mechanical thing in the end, or how would it be run? If chips run the code, what runs the chips? And what runs that? And what runs that? On and on and on.
There is nothing really mechanical about it - the way a computer works is electrical.
This is not a complete description - that would take a book. But it is the basis of how it works.
The basis of the whole thing is the diode and the transistor. A diode or transistor is made from a piece of silicon with some impurities that can be made to conduct electricity sometimes. A diode only allows electricity to flow in one direction, and a transistor only allows electricity to flow in one direction, in an amount proportional to the electricity provided at the "base". So a transistor acts like a switch, but it is turned on and off using electricity instead of something mechanical.
So when a computer loads a byte from memory, it does so by turning on individual wires for each bit of the address, and the memory chip turns on the wires for each data bit depending on the value stored in the location designated by those address wires.
When a computer loads bytes containing an instruction, it then decodes the instruction by turning on individual wires that control other parts of the CPU:
If the instruction is arithmetic, then one wire may determine which registers are connected to the arithmetic logic unit (ALU), while other wires determine whether the ALU adds or subtracts, and another may determine whether it shifts left or does not shift left.
If the instruction is a store, then the wires that get turned on are the address lines, the wires that determine which register is attached to the data lines, and the line that tells the memory to store the value.
The way these individual wires are turned on and off is via this huge collection of diodes and transistors, but to make designing circuits manageable these groups of diodes and transistors are clumped into groups that are standardized components: logic gates like AND, OR and NOT gates. These logic gates have one or two wires coming in and one coming out with a bunch of diodes and transistors inside. Here is an electrical schematic for how all the diodes and transistors can be wired up to make an OR gate: http://www.interfacebus.com/exclusive-or-gate-internal-schematic.png
Then when you have the abstraction level of logic gates it is a much more manageable job to design a CPU. Here is an example of someone who built a CPU using just a bunch of logic gate chips: http://cpuville.com
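To make the gate-level abstraction a little more concrete, here is a tiny sketch (in C, purely illustrative) that models AND/OR/XOR gates as functions and wires them into a 1-bit full adder - the building block a ripple-carry adder inside an ALU is made of:

/* Gates modelled as functions on single bits, then wired into a 1-bit
 * full adder, and four of those chained into a 4-bit ripple-carry add. */
#include <stdio.h>

static int AND(int a, int b) { return a & b; }
static int OR (int a, int b) { return a | b; }
static int XOR(int a, int b) { return a ^ b; }

/* one bit of an adder: sum and carry-out from a, b and carry-in */
static void full_adder(int a, int b, int cin, int *sum, int *cout)
{
    int s1 = XOR(a, b);
    *sum  = XOR(s1, cin);
    *cout = OR(AND(a, b), AND(s1, cin));
}

int main(void)
{
    int a = 5, b = 3, carry = 0, result = 0;
    for (int i = 0; i < 4; i++) {             /* ripple the carry across 4 bits */
        int s;
        full_adder((a >> i) & 1, (b >> i) & 1, carry, &s, &carry);
        result |= s << i;
    }
    printf("%d + %d = %d (carry %d)\n", a, b, result, carry);
    return 0;
}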
Turns out there is already a book! I just found a book (and accompanying website with videos and course materials) for how to make a computer from scratch. Have a look at this: http://nand2tetris.org/

When machine code is generated from a program, how does it translate to hardware-level operations? [closed]

Like, say the instruction is something like 100010101 1010101 01010101 011101010101. Now how does this translate into an actual job of deleting something from memory? Memory consists of actual physical transistors that HOLD data. Is it some external signal that causes them to lose that data?
I want to know how that signal is generated - how some binary numbers change the state of a physical transistor. Is there a level beyond machine code that isn't explicitly visible to a programmer? I have heard of microcode that handles code at the hardware level, even below assembly language. But I still pretty much don't understand. Thanks!
I recommend reading the Petzold book "Code". It explains these things as best as possible without the physics/electronics knowledge.
Each bit in the memory, at a functional level, HOLDs either a zero or a one (let's not get into the exceptions, not relevant to the discussion). You cannot delete memory; you can only set it to zeros or ones or a combination. The arbitrary definition of deleted or erased is just that, a definition: the software that erases memory is simply telling the memory to HOLD the value for erased.
There are two basic types of RAM, static and dynamic, and they are as their names imply: so long as you don't remove power, static RAM will hold its value until changed. Dynamic memory is more like a rechargeable battery, and there is a lot of logic that you don't see with assembler or microcode or any software (usually) that keeps the charged batteries charged and the empty ones empty.
Think about a bunch of water glasses, each one a bit. With static memory the glasses hold the water until emptied - no evaporation, nothing. Glasses with water, let's say, are ones and glasses without are zeros (an arbitrary definition). When your software wants to write a byte there is a lot of logic that interprets that instruction and commands the memory to write; in this case there is a little helper that fills up or empties the glasses when commanded, or reads the values in the glasses when commanded.
In the case of dynamic memory, the glasses have little holes in the bottom and are constantly but slowly letting the water drain out. So glasses that are holding a one have to be filled back up; the helper logic not only responds to the read and write commands but also walks down the row of glasses periodically and refills the ones.
Why would you bother with unreliable memory like that? It takes twice (four times?) as many transistors for an SRAM cell as for a DRAM cell. Twice the heat/power, twice the size, twice the price - even with the added refresh logic it is still cheaper all the way around to use DRAM for bulk memory. The bits in your processor used, say, for the registers and other things are SRAM-based, static. Bulk memory, the gigabytes of system memory, is usually DRAM, dynamic.
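The leaky-glasses analogy can be put in code; here is a toy sketch (all the numbers are made up) in which cells lose charge every tick and a periodic refresh pass tops up every cell that still reads as a 1:

/* Toy model of DRAM refresh: cells leak every tick, a refresh pass
 * periodically restores every cell still above the sense threshold. */
#include <stdio.h>

#define CELLS     8
#define FULL      100     /* freshly written 1 */
#define THRESHOLD 50      /* sense amp reads >= this as 1 */
#define LEAK      3       /* charge lost per tick */
#define REFRESH   16      /* refresh every N ticks */

int main(void)
{
    int charge[CELLS] = { FULL, 0, FULL, FULL, 0, FULL, 0, FULL }; /* 10110101 */

    for (int tick = 1; tick <= 64; tick++) {
        for (int i = 0; i < CELLS; i++)            /* every cell leaks */
            if (charge[i] > 0) charge[i] -= LEAK;

        if (tick % REFRESH == 0)                   /* periodic refresh pass */
            for (int i = 0; i < CELLS; i++)
                if (charge[i] >= THRESHOLD) charge[i] = FULL;
    }

    for (int i = 0; i < CELLS; i++)                /* read the bits back */
        putchar(charge[i] >= THRESHOLD ? '1' : '0');
    putchar('\n');
    return 0;
}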
The bulk of the work done in the processor/computer is done by electronics that the instruction set drives, or microcode in the rare case of a microcoded design (x86 families are/were microcoded, but when you look at all processor types, the microcontrollers that drive most of the everyday items you touch are generally not microcoded, so most processors are not microcoded). In the same way that you need some worker to help you turn C into assembler, and assembler into machine code, there is logic to turn that machine code into commands to the various parts of the chip and peripherals outside the chip. Download either the llvm or gcc source code to get an idea of how the size of the program being compiled compares to the amount of software it takes to do that compiling. You will get an idea of how many transistors are needed to turn your 8 or 16 or 32 bit instruction into some sort of command to some hardware.
Again I recommend the Petzold book, he does an excellent job of teaching how computers work.
I also recommend writing an emulator. You have done assembler, so you understand the processor at that level; in the same assembler reference for the processor, the machine code is usually defined as well, so you can write a program that reads the bits and bytes of the machine code and actually performs the function. For an instruction mov r0,#11 you would have some variable in your emulator program for register 0, and when you see that instruction you put the value 11 in that variable and continue on. I would avoid x86; go with something simpler: PIC12, MSP430, 6502, HC11, or even the thumb instruction set I used. My code isn't necessarily pretty in any way, closer to brute force (and still buggy, no doubt). If everyone reading this were to take the same instruction set definition and write an emulator, you would probably have as many different implementations as there are people writing emulators. Likewise for hardware: what you get depends on the team or individual implementing the design. So not only is there a lot of logic involved in parsing through and executing the machine code, that logic can and does vary from implementation to implementation. One x86 to the next might be similar to refactoring software. Or, for various reasons, the team may choose a do-over and start from scratch with a different implementation. Realistically it is somewhere in the middle: chunks of old logic reused, tied to new logic.
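A sketch of that emulator idea, for a made-up two-byte instruction format (not a real ISA - the opcodes and encoding here are invented just to show the shape of the fetch/decode/execute loop):

/* Toy emulator: opcode in the high nibble, destination register in the low
 * nibble, immediate or second register in the next byte. */
#include <stdio.h>
#include <stdint.h>

enum { OP_MOVI = 0x1, OP_ADD = 0x2, OP_HALT = 0xF };

int main(void)
{
    uint8_t reg[16] = { 0 };
    /* mov r0,#11 ; mov r1,#4 ; add r0,r1 ; halt */
    uint8_t mem[] = { 0x10, 11,  0x11, 4,  0x20, 1,  0xF0, 0 };
    unsigned pc = 0;

    for (;;) {
        uint8_t op  = mem[pc] >> 4;       /* fetch + decode */
        uint8_t rd  = mem[pc] & 0x0F;
        uint8_t arg = mem[pc + 1];
        pc += 2;

        if (op == OP_MOVI)      reg[rd] = arg;           /* rd = immediate  */
        else if (op == OP_ADD)  reg[rd] += reg[arg];     /* rd += r[arg]    */
        else if (op == OP_HALT) break;
        else { printf("bad opcode %x\n", op); break; }
    }
    printf("r0 = %d\n", reg[0]);          /* prints 15 */
    return 0;
}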
Microcoding is like a hybrid car. Microcode is just another instruction set, machine code, and requires lots of logic to implement/execute. What it buys you in large processors is that the microcode can be modified in the field. Not unlike a compiler: your C program may be fine but the compiler+computer as a whole may be buggy; by putting a fix in the compiler, which is soft, you don't have to replace the computer, the hardware. If a bug can be fixed in microcode then they will patch it in such a way that the BIOS, on boot, will reprogram the microcode in the chip and now your programs will run fine. No transistors were created or destroyed nor wires added; just the programmable parts changed. Microcode is essentially an emulator, but an emulator that is a very, very good fit for the instruction set. Google Transmeta and the work that was going on there when Linus was working there; the microcode was a little more visible on that processor.
I think the best way to answer your question, barring how transistors work, is to look at the amount of software/source in a compiler that takes a relatively simple programming language and converts it to assembler, or at an emulator like qemu and how much software it takes to implement a virtual machine capable of running your program. The amount of hardware in the chips and on the motherboard is on par with this: not counting the transistors in the memories, millions to many millions of transistors are needed to implement what is usually a few hundred different instructions or fewer. If you write a PIC12 emulator and get a feel for the task, then ponder what a 6502 would take, then a Z80, then a 486, then think about what a quad-core Intel 64-bit might involve. The number of transistors for a processor/chip is often advertised/bragged about, so you can also get a feel from that for how much is there that you cannot see from assembler.
It may help if you start with an understanding of electronics, and work up from there (rather than from complex code down).
Let's simplify this for a moment. Imagine an electric circuit with a power source, switch and a light bulb. If you complete the circuit by closing the switch, the bulb comes on. You can think of the state of the circuit as a 1 or a 0 depending on whether it is completed (closed) or not (open).
Greatly simplified, if you replace the switch with a transistor, you can now control the state of the bulb with an electric signal from a separate circuit. The transistor accepts a 1 or a 0 and will complete or open the first circuit. If you group these kinds of simple circuits together, you can begin to create gates and start to perform logic functions.
Memory is based on similar principles.
In essence, the power coming in the back of your computer is being broken into billions of tiny pieces by the components of the computer. The behavior and activity of such is directed by the designs and plans of the engineers who came up with the microprocessors and circuits, but ultimately it is all orchestrated by you, the programmer (or user).
Heh, good question! Kind of involved for SO though!
Actually, main memory consists of arrays of capacitors, not transistors, although cache memories may be implemented with transistor-based SRAM.
At the low level, the CPU implements one or more state machines that process the ISA, or the Instruction Set Architecture.
Look up the following circuits:
Flip-flop
Decoder
ALU
Logic gates
A series of FFs can hold the current instruction. A decoder can select a memory or register to modify, and the state machine can then generate signals (using the gates) that change the state of a FF at some address.
Now, modern memories use a decoder to select an entire line of capacitors, and then another decoder is used when reading to select one bit out of them, and the write happens by using a state machine to change one of those bits, then the entire line is written back.
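As a rough illustration of the decoder idea (not any particular memory chip), here is a sketch of a 2-to-4 decoder built from the same AND/NOT primitives, turning two address bits into exactly one active word line:

/* 2-to-4 decoder sketch: two address bits select exactly one of four lines. */
#include <stdio.h>

static int NOT(int a)        { return !a; }
static int AND(int a, int b) { return a & b; }

int main(void)
{
    for (int addr = 0; addr < 4; addr++) {
        int a1 = (addr >> 1) & 1, a0 = addr & 1;
        int line[4] = {
            AND(NOT(a1), NOT(a0)),   /* 00 -> line 0 */
            AND(NOT(a1), a0),        /* 01 -> line 1 */
            AND(a1, NOT(a0)),        /* 10 -> line 2 */
            AND(a1, a0),             /* 11 -> line 3 */
        };
        printf("addr %d%d selects lines %d%d%d%d\n",
               a1, a0, line[3], line[2], line[1], line[0]);
    }
    return 0;
}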
It's possible to implement a CPU in a modern programmable logic device. If you start with simple circuits you can design and implement your own CPU for fun these days.
That's one big topic you are asking about :-) The topic is generally called "Computer Organization" or "Microarchitecture". You can follow this Wikipedia link to get started if you want to learn.
I don't have any knowledge beyond a very basic level about either electronics or computer science, but I have a simple theory that could answer your question; most probably the actual processes involved are very complex manifestations of my answer.
You could imagine the logic gates getting their electric signals from the keystrokes or mouse strokes you make.
A series or pattern of keys you press may trigger particular voltage or current signals in these logic gates.
Now, which values of current or voltage are produced in which logic gates when you press a particular pattern of keys is determined by the very design of these gates and circuits.
For example, if you have a programming language in which the "print(var)" command prints "var",
the sequence of keys "p-r-i-n-t" would trigger a particular set of signals in a particular set of logic gates that would result in displaying "var" on your screen.
Again, which gates are activated by your keystrokes depends on their design.
Also, typing "print(var)" on your desktop or anywhere else apart from the IDE will not yield the same results, because the software behind that IDE activates a particular set of transistors or gates which respond in the appropriate way.
This is what I think happens at the fundamental level, and the rest is all built layer upon layer.

Looking for audio tapes/cassettes containing programs for the Sinclair ZX80 PC? [closed]

OK, so back before the ice age, I recall having a Sinclair ZX80 PC (with a TV as the display, and a cassette tape player as the storage device).
Obviously, the programs on cassette tapes made a very distinct sound (er... noise) when playing the tape... I was wondering if someone still had those tapes?
The reason (and the reason this Q is programming related) is that IIRC different languages made somewhat different pitched noises, but I would like to run the tape and listen myself to confirm if that was really the case...
I have the tapes but they've been stored in the garage at my parents' house and the last thirty years hasn't been kind to them.
You can get images here though: http://www.zx81.nl/dload if that's any use. Perhaps there is a tool out there for converting from the bytes back to the audio ;)
Edit: Perhaps here: http://ldesoras.free.fr/prod.html#src_ay3hacking
On the ZX80, ZX81 and ZX Spectrum, tape output is achieved by the CPU toggling the output line level between a high state and a low state. Input is achieved by having the CPU watch an input line level. The very low level of operation was one of Sir Clive's cost-saving measures; rival machines like the BBC Micro had dedicated hardware for serialisation and deserialisation of data, so the CPU would just say "output 0xfe" and then the hardware would make the relevant noises and raise an interrupt when it was ready for the next byte. The BBC Micro specifically implements the Kansas City Standard, whereas the Sinclair machines in every instance use whatever ad hoc format best fitted the constraints of the machine.
The effect of that is that while almost every other machine that uses tape has tape output that sounds much the same from one program to the next by necessity, programs on a Sinclair machine could choose to use whatever encoding they wanted, which is the principle around which a thousand speed loaders were written. It's therefore not impossible that different programs would output distinctively different sounds. Some even used the symmetry between the tape input and output to do crude digital sampling, editing and playback, though they were never more than novelties for obvious reasons.
That being said, the base units of the ZX80 and ZX81 contained just 1 KB of RAM, so it's quite likely that programmers would just use the ROM routines for reading and writing data, due to space constraints if nothing else. Then the sound differences would just be on account of the characteristic data, as suggested by slugster.
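If you want to hear what byte data sounds like without the original tapes, here is a rough sketch that turns bytes back into tape-style audio in the spirit of that bit-banged output: a burst of pulses per bit, more pulses for a 1 than for a 0. The pulse counts and timings are approximations, not a byte-exact Sinclair encoding, and it writes raw unsigned 8-bit mono PCM (which should be playable with something like aplay -f U8 -r 44100 tape.raw):

/* Rough sketch: encode each bit as a burst of square-wave pulses with a
 * silent gap between bits, written as raw unsigned 8-bit mono samples. */
#include <stdio.h>

#define RATE 44100                                  /* samples per second */

static void pulse(FILE *out)
{
    for (int i = 0; i < RATE * 150 / 1000000; i++) fputc(0xFF, out);  /* ~150 us high */
    for (int i = 0; i < RATE * 150 / 1000000; i++) fputc(0x00, out);  /* ~150 us low  */
}

static void silence(FILE *out, long us)
{
    for (long i = 0; i < (long)(RATE / 1000000.0 * us); i++) fputc(0x80, out);
}

static void write_byte(FILE *out, unsigned char b)
{
    for (int bit = 7; bit >= 0; bit--) {            /* most significant bit first */
        int pulses = ((b >> bit) & 1) ? 9 : 4;      /* long burst = 1, short burst = 0 */
        for (int p = 0; p < pulses; p++) pulse(out);
        silence(out, 1300);                         /* gap between bits */
    }
}

int main(void)
{
    FILE *out = fopen("tape.raw", "wb");
    if (!out) return 1;

    unsigned char data[] = { 0x00, 0x2A, 0xFF };    /* whatever bytes you like */
    silence(out, 500000);                           /* half-second lead-in */
    for (unsigned i = 0; i < sizeof data; i++)
        write_byte(out, data[i]);

    fclose(out);
    return 0;
}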
I know these come up on auction sites like Ebay quite frequently - if you want to buy them yourself. If you get someone else who owns one to listen then you are going to get their subjective opinion :)
In any case, the language used to save it would be a secondary cause of the pitch changes - it will be related to the data. In other words, you could probably create a straight binary data file that sounded very similar to a BASIC program (the BASIC would have been saved as text, as it is interpreted).
I know the thread's old but... I was playing about with something similar last night and I've got a WAV of an old ZX81 game if you're still interested? PM me and I'll post it somewhere.
You can use something like http://www.wintzx.fr/ or pick something from http://www.worldofspectrum.org/utilities.html#tzxtools to convert an emulator file to an audio file and then you can just play it on your PC. Some tools also allow you to play the file directly. Emulator files can be found at http://www.zx81.nl/files.html and many other places.

Resources