Why is slither.io CPU intensive? - resources

From a programmer's point of view, how can a snakes-and-dots game consume far more resources than many other, more advanced games?

Lazy programming.
Considering you could run the original Quake, a fully 3D game with texture mapping, on a 66 MHz Pentium with 16 MB of RAM and no 3D accelerator card, there is no reason for a silly 2D browser snake game to stall on a modern multi-GHz CPU other than programmer incompetence.

All the calculations are done by the CPU, whereas other computation-intensive games tend to make use of the GPU. You can observe this by looking at your CPU monitor/task manager, and your graphics driver configuration interface (if you have a fancy card).
When I load the slither game, my GPU temperature doesn't increase at all, whereas the CPU temperature climbs steadily, and the game immediately swallows half of my cores. CPU use increases further in "busy" periods when many other players are interacting in a small area.

I suspect Slither.io is mining a blockchain while we play the game. In that case the game would run smoother on an old graphics card, where their secondary intention is not possible. The more powerful the GPU, the more of the game's time is diverted to their nefarious purposes.

Related

AXI4 throughput on FPGA

I'm working on a 7-series FPGA and am planning to use the MIG memory controller to interface with DDR3, with an AXI4 interface between the memory controller and the other modules inside the FPGA. What sort of throughput efficiency will I get, say if I run it at some clock X with 64-bit data? Clearly it would be illogical to assume the full 64·X. What fraction of it is lost to handshaking in burst mode versus non-burst mode? I'm just looking for rough values, not exact ones. Something in the ballpark.
Thank you.
Per Xilinx's XAPP792, 70% efficiency is a reasonable number. That figure is for video, which generally has highly burstable, DDR-SDRAM-friendly access patterns. Random memory access will probably achieve far less.
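As a rough sanity check on those numbers, here is a small back-of-the-envelope sketch; the 200 MHz clock is an assumed value for illustration only, not something from the question:

    #include <cstdio>

    int main() {
        const double clock_hz   = 200e6;  // assumed AXI clock "X"; substitute your own
        const double width_bits = 64.0;   // AXI4 data width under discussion
        const double efficiency = 0.70;   // ballpark figure from XAPP792 (video-style bursts)

        const double peak      = clock_hz * width_bits / 8.0;   // bytes per second
        const double sustained = peak * efficiency;

        std::printf("peak:      %.2f GB/s\n", peak / 1e9);       // 1.60 GB/s
        std::printf("sustained: %.2f GB/s\n", sustained / 1e9);  // ~1.12 GB/s
        return 0;
    }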

What is the smallest (manageable) number of samples I can give to a PCM buffer?

Some APIs, like this one, can create a PCM buffer from an array of samples (each represented by a number).
Say I want to generate and play some audio in (near) real time. I could generate a PCM buffer with 100 samples and send them off to the sound card, using my magic API functions. As those 100 samples are playing, 100 more samples are generated, and then the buffers are switched. Finally, I can repeat the writing / playing / switching process to create a constant stream of audio.
Now, for my question. What is the smallest buffer size (in samples) I can use with the write / play / switch approach without a perceivable pause occurring in the audio stream? I understand the answer here will depend on sample rate, processor speed, and transfer time to the sound card, so please provide a "rule of thumb" answer if that's more appropriate!
(I'm a bit new to audio stuff, so please feel free to point out any misconceptions I might have!)
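To make the write / play / switch scheme above concrete, here is a minimal sketch; write_to_sound_card() and generate_samples() are hypothetical stand-ins (playback is simulated with a sleep), not a real audio API:

    #include <array>
    #include <chrono>
    #include <cstdint>
    #include <thread>

    constexpr std::size_t kBufferSamples = 100;   // buffer size under discussion
    constexpr unsigned kSampleRate = 44100;       // assumed sample rate (Hz)

    // Stand-in for a real audio API call: block for the buffer's playback duration.
    void write_to_sound_card(const int16_t*, std::size_t count) {
        std::this_thread::sleep_for(std::chrono::microseconds(count * 1000000ULL / kSampleRate));
    }

    // Stand-in sample generator: silence.
    void generate_samples(int16_t* out, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) out[i] = 0;
    }

    int main() {
        std::array<std::array<int16_t, kBufferSamples>, 2> buffers{};
        int current = 0;
        generate_samples(buffers[current].data(), kBufferSamples);          // prime the first buffer
        for (int i = 0; i < 1000; ++i) {                                    // ~2.3 s of audio
            write_to_sound_card(buffers[current].data(), kBufferSamples);   // play the current buffer
            current ^= 1;                                                   // switch buffers
            generate_samples(buffers[current].data(), kBufferSamples);      // refill the other one
        }
    }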
TL;DR: 1 ms buffers are easily achievable on desktop operating systems if care is taken; they might not be desirable from a performance and energy-usage perspective.
The lower limit on buffer size (and thus output latency) is set by the worst-case scheduling latency of your operating system.
The sequence of events is:
1. The audio hardware progressively outputs samples from its buffer.
2. At some point it reaches a low-water mark and generates an interrupt, signalling that the buffer needs replenishing with more samples.
3. The operating system services the interrupt and marks the thread as ready to run.
4. The operating system schedules the thread to run on a CPU.
5. The thread computes, or otherwise obtains, samples and writes them into the output buffer.
The scheduling latency is the time between step 2 and step 4 above, and it is dictated largely by the design of the host operating system. If using a hard RTOS such as VxWorks or eCos with pre-emptive priority scheduling, the worst case can be on the order of fractions of a microsecond.
General-purpose desktop operating systems are generally less slick. macOS supports real-time user-space scheduling and is easily capable of servicing 1 ms buffers. The Linux kernel can be configured for pre-emptive real-time threads, with bottom-half interrupt handlers handled by kernel threads. You ought to be able to achieve 1 ms buffer sizes there too. I can't comment on the capabilities of recent versions of the NT kernel.
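As a concrete illustration of the Linux real-time configuration mentioned above, a minimal sketch that requests SCHED_FIFO scheduling for the calling thread (the priority value is an assumption, and appropriate rtprio privileges are required):

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Request pre-emptive real-time (SCHED_FIFO) scheduling for the calling thread.
    bool make_thread_realtime(int priority) {
        sched_param param{};
        param.sched_priority = priority;   // e.g. somewhere in [1, 99] on Linux
        return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param) == 0;
    }

    int main() {
        if (!make_thread_realtime(80))
            std::fprintf(stderr, "could not switch to SCHED_FIFO (insufficient privileges?)\n");
        // ... the low-latency audio loop would run here ...
    }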
It's also possible to take a (usually bad) latency hit in step 5, when your process fills the buffer, if it takes a page fault. Usual practice is to allocate all of the heap and stack memory you require up front and mlock() it, along with your program code and data, into physical memory.
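A minimal sketch of that locking step on Linux/POSIX, using mlockall() to pin current and future pages (this may require a raised RLIMIT_MEMLOCK):

    #include <sys/mman.h>
    #include <cstdio>

    int main() {
        // Pin all current and future pages of the process into physical RAM so the
        // audio thread never takes a page fault on its working set.
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            std::perror("mlockall");
            return 1;
        }
        // ... pre-allocate buffers, then start the real-time audio thread ...
        return 0;
    }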
Absolutely forget about achieving low latency in an interpreted or JITed language run-time. You have far too little control over what the language run-time is doing, and no realistic prospect of preventing page faults (e.g. from memory allocation). I suspect 10 ms is pushing your luck in these cases.
It's worth noting that rendering short buffers has a significant impact on system performance (and energy consumption) due to the high rate of interrupts and context switches. These destroy L1 cache locality in a way that's disproportionate to the work they actually do.
While 1 ms audio buffers are possible, they are not necessarily desirable. The scheduler tick interval on modern Windows, for example, is between 10 ms and 30 ms. That means that, usually at the audio driver end, you need to keep a ring buffer of a bunch of those 1 ms packets to deal with buffer-starvation conditions, in case the CPU gets pulled out from under you by some other thread.
All modern audio engines use powers of 2 for their audio buffer sizes. Start with 256 samples per frame and see how that works for you. Every piece of hardware is different, and you can't depend on how a Mac or PC gives you time slices. The user might be calculating pi on some other process while you are running your audio program. It's safer to leave the buffer size large, like 2048 samples per frame, and let the user turn it down if the latency bothers them.
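For a feel of what those buffer sizes mean in milliseconds, a small sketch (the sample rates are just common illustrative values):

    #include <cstdio>

    int main() {
        const int buffer_sizes[] = {100, 256, 2048};  // samples per buffer, as discussed above
        const int sample_rates[] = {44100, 48000};    // illustrative sample rates (Hz)
        for (int rate : sample_rates)
            for (int n : buffer_sizes)
                std::printf("%5d samples @ %d Hz = %6.2f ms\n", n, rate, 1000.0 * n / rate);
    }

At 48 kHz, for example, 256 samples is about 5.3 ms of audio and 2048 samples is about 42.7 ms.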

OpenGL off-screen rendering in Linux is slow

My OpenGL application runs at about 110 fps. The moment I add off-screen rendering, it slows down to 15 fps. I am using framebuffer objects and glReadPixels to render off-screen. I have searched the net and found that GPU-to-CPU memory transfers are slow, whereas transfers in the other direction are fast.
I have an ATI Mobility Radeon X2300 with 128 MB of video memory.
So my questions are:
1) Is there a way to increase the VRAM-to-CPU-RAM transfer speed?
2) Are there any GPUs on the market optimized for better readback speed?
The problem is not the transfer speed but the serialization between the CPU and GPU. When you call glReadPixels that way, the CPU stops and waits for the GPU to finish all rendering, which is quite inefficient, as you have already noticed.
The solution is to use PBOs (pixel buffer objects). You can have N PBOs, and on every frame you bind PBO X (where 0 <= X < N) and glReadPixels into it. Before that, you map PBO X and read back the pixels of a past frame. N is not a magic number; for most implementations a delay of 3 frames is typical, so N = 3 is a good starting point.
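A sketch of that PBO ring, assuming an existing OpenGL context and an extension loader such as GLEW; the function names and RGBA8 frame layout are illustrative, with N = 3 as suggested:

    #include <GL/glew.h>   // assumed extension loader; any loader providing GL 2.1+ works
    #include <cstring>

    constexpr int kNumPbos = 3;          // N = 3, as suggested above
    GLuint pbos[kNumPbos];

    // Call once after the GL context is created.
    void init_readback(int width, int height) {
        glGenBuffers(kNumPbos, pbos);
        for (int i = 0; i < kNumPbos; ++i) {
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[i]);
            glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }

    // Call once per frame after rendering; 'frame' is a running frame counter.
    // Starts an asynchronous readback of the current frame and maps the PBO that was
    // filled kNumPbos frames ago, so the CPU never waits on the frame just rendered.
    void read_frame_async(long frame, int width, int height, unsigned char* dst) {
        const int write_idx = frame % kNumPbos;
        const int read_idx  = (frame + 1) % kNumPbos;   // oldest PBO in the ring

        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[write_idx]);
        glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr); // async copy into PBO

        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[read_idx]);
        if (void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)) {
            std::memcpy(dst, src, static_cast<std::size_t>(width) * height * 4); // pixels from an older frame
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }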

Thermal-aware scheduler in Linux

Currently I'm working on a temperature-aware version of Linux for my university project. Right now I have to create a temperature-aware scheduler that takes processor temperature into account when making scheduling decisions. Is there any generalized way to get the temperature of the processor cores, or can I integrate the coretemp driver with the Linux kernel in some way? (I didn't find a way to do so on the internet.)
lm-sensors simply reads device files exported by the kernel for CPU temperature; you can read the same files (or the kernel variables backing them) to get the temperature information. As for the scheduler, I would not write one from scratch; start with the kernel's CFS implementation and, in your case, modify the load-balancer check to include temperature (currently it uses a metric based on the calculated cost of moving a task from one core to another in terms of cache effects, etc.; I'm not sure whether you want to keep that or not).
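A minimal user-space sketch of reading those sysfs temperature files; the exact path varies per machine, and /sys/class/thermal/thermal_zone0/temp is used here only as an assumed example:

    #include <fstream>
    #include <iostream>

    int main() {
        // The kernel exposes temperatures in millidegrees Celsius through sysfs;
        // the path below is one common location and may differ on your system.
        std::ifstream f("/sys/class/thermal/thermal_zone0/temp");
        long millideg = 0;
        if (f >> millideg)
            std::cout << "CPU temperature: " << millideg / 1000.0 << " C\n";
        else
            std::cerr << "could not read temperature sysfs file\n";
    }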
Temperature control is very difficult. The difficulty lies in thermal capacity and conductance. It is quite easy to read a temperature; how you control it will depend on the system model. A Kalman filter or some other higher-order filter will be helpful. You don't know:
Sources of heat.
Distance from sensors.
Number of sensors.
Control elements, like a fan.
If you only measure at the CPU itself, the hard drive could have overheated 10 minutes ago and the heat is only arriving at the CPU now. Throttling the CPU at that instant is not going to help. Only by having a good thermal model of the system can you control the heat. Yet you say you don't really know anything about the system; I don't see how a scheduler by itself can do this.
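As a hedged illustration of the sort of filtering hinted at above (a crude stand-in for a Kalman filter, not a thermal model), a first-order low-pass estimate with an assumed smoothing factor and purely illustrative readings:

    #include <cstdio>

    // First-order (exponential) low-pass filter: a crude stand-in for the
    // Kalman-style estimator suggested above. 'alpha' is an assumed smoothing
    // factor that would have to be tuned to the system's thermal lag.
    struct TempFilter {
        double estimate;
        double alpha;
        double update(double reading) {
            estimate += alpha * (reading - estimate);
            return estimate;
        }
    };

    int main() {
        TempFilter filter{45.0, 0.1};                         // start at 45 C, gentle smoothing
        const double readings[] = {45, 47, 52, 51, 58, 60};   // purely illustrative samples
        for (double r : readings)
            std::printf("raw %.1f C -> filtered %.1f C\n", r, filter.update(r));
    }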
I have worked on a mobile freezer application where operators would load pallets of ice cream, etc., from a freezer onto a truck. Very small distances between sensors and control elements can create havoc with a control system. Also, you want your ambient temperature to be read instantly if possible. There is a lot of lag in temperature control; a small distance could delay a reading by 5-15 minutes (i.e., it can take 5-15 minutes for heat to travel 1 cm).
I don't see the utility of what you are proposing. If you want this for a PC, then video cards, hard drives, power supplies, sound cards, etc. can create as much heat as the CPU. You cannot generically model a PC; maybe you could with an Apple product. I don't think you will have a lot of success, but you will learn a lot from trying!

How to quantify the processing tradeoffs of CUDA devices for C kernels?

I recently upgraded from a GTX 480 to a GTX 680 in the hope that the tripled number of cores would manifest as significant performance gains in my CUDA code. To my horror, I've discovered that my memory-intensive CUDA kernels run 30-50% slower on the GTX 680.
I realize that this is not strictly a programming question, but it does directly impact the performance of CUDA kernels on different devices. Can anyone provide some insight into the specifications of CUDA devices and how they can be used to deduce their performance on CUDA C kernels?
Not exactly an answer to your question, but some information that might be of help in understanding the performance of the GK104 (Kepler, GTX680) vs. the GF110 (Fermi, GTX580):
On Fermi, the cores run at double the frequency of the rest of the logic. On Kepler, they run at the same frequency. That effectively halves the number of cores on Kepler if one wants a more apples-to-apples comparison to Fermi. So that leaves the GK104 (Kepler) with 1536 / 2 = 768 "Fermi-equivalent cores", which is only 50% more than the 512 cores on the GF110 (Fermi).
Looking at the transistor counts, the GF110 has 3 billion transistors while the GK104 has 3.5 billion. So, even though Kepler has 3 times as many cores, it has only slightly more transistors. So now, not only does Kepler have only 50% more "Fermi-equivalent cores" than Fermi, but each of those cores must be much simpler than those of Fermi.
So, those two issues probably explain why many projects see a slowdown when porting to Kepler.
Further, the GK104, being a version of Kepler made for graphics cards, has been tuned in such a way that cooperation between threads is slower than on Fermi (as such cooperation is not as important for graphics). Any potential performance gain, after taking the above facts into account, may be negated by this.
There is also the issue of double precision floating point performance. The version of GF110 used in Tesla cards can do double precision floating point at 1/2 the performance of single precision. When the chip is used in graphics cards, the double precision performance is artificially limited to 1/8 of single precision performance, but this is still much better than the 1/24 double precision performance of GK104.
One of the advances of the new Kepler architecture is its 1536 cores, grouped into 8 SMXes of 192 cores each, but at the same time this number of cores is a big problem, because shared memory is still limited to 48 KB. So if your application needs a lot of SMX resources, you can't execute 4 warps in parallel on a single SMX. You can profile your code to find the real occupancy of your GPU. Possible ways to improve your application (a rough resource-limit sketch follows the list):
use warp vote functions instead of shared memory communications;
increase the number of thread blocks and decrease the number of threads per block;
optimize global loads/stores; Kepler has 32 load/store units per SMX (twice as many as Fermi).
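The rough resource-limit sketch mentioned above, with assumed per-block figures (real numbers come from the Occupancy Calculator or the profiler):

    #include <algorithm>
    #include <cstdio>

    int main() {
        // Device limits for a compute capability 3.0 SMX (GTX 680).
        const int max_threads_per_sm = 2048;
        const int max_blocks_per_sm  = 16;
        const int shared_mem_per_sm  = 48 * 1024;      // bytes

        // Assumed kernel configuration; substitute your own numbers.
        const int threads_per_block    = 256;
        const int shared_mem_per_block = 12 * 1024;    // bytes

        const int by_threads = max_threads_per_sm / threads_per_block;   // 8
        const int by_smem    = shared_mem_per_sm / shared_mem_per_block; // 4
        const int resident   = std::min({by_threads, by_smem, max_blocks_per_sm});

        std::printf("resident blocks per SMX: %d (limited by %s)\n",
                    resident, resident == by_smem ? "shared memory" : "threads/blocks");
    }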
I am installing nvieuw and I use coolbits 2.0 to unlock the shader cores from default to max performance. Also, you must have both connectors of your device attached to one display, which can be enabled in the nVidia control panel as screen 1/2 and screen 2/2. Now you must clone this screen with the other, and in the Windows resolution settings set the screen mode to extended desktop.
With nVidia Inspector 1.9 (BIOS-level drivers), you can activate this mode by setting up a profile for the application (you need to add the application's exe file to the profile). Now you have almost double the performance (keep an eye on the temperature).
DX11 also features tessellation, so you want to override that and scale your native resolution.
Your native resolution can be achieved by rendering at a lower one, like 960x540, and letting the 3D pipelines scale it up to full HD (in the NV control panel under desktop size and position). Now scale the lower resolution to full screen with the display, and you have full HD with double the amount of texture rendering on the fly, and everything should be good for rendering 3D textures with extreme LOD bias (level of detail). Your display needs to be on auto zoom!
Also, you can beat SLI-configured computers. This way I get higher scores than 3-way SLI in TessMark. High AA settings like 32x mixed sample make it all look like HD in AAA quality (in the TessMark and Heaven benchmarks). There are no resolution settings in the final score, which shows that rendering at your native resolution is not important!
This should give you some real results, so please read thoughtfully, not literally.
I think the problem may lie in the number of Streaming Multiprocessors: The GTX 480 has 15 SMs, the GTX 680 only 8.
The number of SMs is important, since at most 8/16 blocks or 1536/2048 threads (compute capability 2.0/3.0) can reside on a single SM. The resources they share, e.g. shared memory and registers, can further limit the number of blocks per SM. Also, the higher number of cores per SM on the GTX 680 can only reasonably be exploited using instruction-level parallelism, i.e. by pipelining several independent operations.
To find out the number of blocks you can run concurrently per SM, you can use nVidia's CUDA Occupancy Calculator spreadsheet. To see the amount of shared memory and registers required by your kernel, add -Xptxas -v to the nvcc command line when compiling.
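For example (the file name here is just illustrative):

    nvcc -arch=sm_30 -Xptxas -v -c mykernel.cu

ptxas then prints the per-kernel register count and shared memory usage, which you can feed into the Occupancy Calculator.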
