Is it expensive to do triplanar texture mapping in a shader? - graphics

Imagine a big procedural world that amounts to more than 300k triangles (about 5-10k per 8x8x8 m chunk). Since it's procedural, I need to create all the data in code. I've managed to make a good normal-smoothing algorithm, and now I'm going to use textures (everyone needs textures; you don't want to walk around a flat-colored world, do you?). The problem is that I don't know where it's best to calculate the UV sets (I'm using triplanar texture mapping).
I have three approaches:
1. Do the calculations on the CPU, then upload to the GPU (the mesh is not modified every frame - 90% of the time it stays static, and the calculation is done per chunk when a chunk changes);
2. Do the calculations on the GPU in the vertex shader (the calculation is done for every triangle every frame, and the meshes are kinda big - is it expensive to do so every frame?);
3. Move the algorithm to OpenCL (the calculation is done per chunk when a chunk changes; I already use OpenCL to do the meshing of my data) and call the kernel when the mesh changes (inexpensive, but all my OpenCL experience comes from modifying existing code; still, I have some C background, so it may just take a while before I get it to work).
Which approach is better, considering that my little experiment is already a bit heavy for mid-range hardware?
I'm using C# (.NET 4.0) with SlimDX and Cloo.
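For reference, "calculating the UV sets" for triplanar mapping boils down to a few multiplies and adds per vertex. Here is a minimal sketch in C++ for illustration (the project above is C#, and the Vertex layout here is only an assumption, not an actual engine format):

#include <cmath>

// Sketch of a CPU-side (approach 1) triplanar pass; names and layout are illustrative.
struct Vertex {
    float px, py, pz;   // position
    float nx, ny, nz;   // smoothed normal
    float wx, wy, wz;   // blend weights for the X/Y/Z projections
    float ux0, uv0;     // UVs of the projection along X (uses Y/Z)
    float ux1, uv1;     // UVs of the projection along Y (uses X/Z)
    float ux2, uv2;     // UVs of the projection along Z (uses X/Y)
};

void ComputeTriplanar(Vertex& v, float textureScale)
{
    // Blend weights come from the absolute normal, normalized to sum to 1.
    float ax = std::fabs(v.nx), ay = std::fabs(v.ny), az = std::fabs(v.nz);
    float sum = ax + ay + az + 1e-6f;
    v.wx = ax / sum;  v.wy = ay / sum;  v.wz = az / sum;

    // Each planar projection just reuses two of the three position components.
    v.ux0 = v.py * textureScale;  v.uv0 = v.pz * textureScale;
    v.ux1 = v.px * textureScale;  v.uv1 = v.pz * textureScale;
    v.ux2 = v.px * textureScale;  v.uv2 = v.py * textureScale;
}

Because this is so little arithmetic, approach 2 (doing it per vertex or per pixel in the shader) is usually cheap as well; the per-frame cost of triplanar mapping tends to be dominated by the extra texture samples rather than by the UV math.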

Related

Is it possible to make the nodes and trees of MCTS work GPU-only with PyTorch?

I've seen some discussion about MCTS and GPUs. It's said that there's no advantage to using the GPU, as MCTS doesn't involve many matrix multiplications. But there is a drawback to using the CPU, as transferring data between devices really takes time.
Here I mean that the nodes and the tree should live on the GPU. Then they can process the data on the GPU, without copying the data from the CPU. If I just create Node and Tree classes, their methods will run on the CPU.
So I wonder whether I can move the search part to the GPU. Is there any example?

Texture Memory Usage

I am trying to find out how much texture memory is consumed by my application. These are the texture types and my calculations:
RGB textures -> textureWidth * textureHeight * 3 (memory usage)
RGBA textures -> textureWidth * textureHeight * 4 (memory usage)
So I am wondering: can the graphics driver allocate much more memory than what I calculated above?
A few simple answers:
To the best of my knowledge, it's been around two decades since (the majority of) hardware devices supported packed 24-bit RGB data. On modern hardware this is usually represented in an "XRGB" (or equivalent) format, with one padding byte per pixel. It is painful for hardware to efficiently handle pixels that straddle cache lines, etc. Further, since many applications (read "games") use texture compression, support for fully packed 24-bit data seems a bit redundant.
Texture dimensions: if a texture's dimensions are not 'nice' for the particular hardware (e.g., the row width is not a multiple of, say, 16 bytes, or the texture doesn't align to 4x4 or 8x8 blocks), then the driver may pad the physical size of the texture.
Finally, if you have MIP mapping (and you do want this for performance as well as quality reasons), it will expand the texture size by around 33%.
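As a rough illustration of the two points above, the sketch below estimates the footprint assuming 4 bytes per pixel (RGBA, or RGB padded to XRGB), a full mip chain, and a hypothetical row-pitch alignment; the alignment value is an assumption and real drivers may round differently:

#include <cstddef>

// Rough driver-side size estimate for a mip-mapped texture. The 64-byte
// row alignment is a made-up example; actual padding is hardware-specific.
size_t EstimateTextureBytes(size_t width, size_t height,
                            size_t bytesPerPixel = 4,
                            size_t rowAlignment  = 64,
                            bool   mipmapped     = true)
{
    size_t total = 0;
    size_t w = width, h = height;
    for (;;) {
        // Pad each row up to the assumed alignment.
        size_t rowPitch = ((w * bytesPerPixel + rowAlignment - 1) / rowAlignment) * rowAlignment;
        total += rowPitch * h;
        if (!mipmapped || (w == 1 && h == 1))
            break;
        w = (w > 1) ? w / 2 : 1;   // next mip level
        h = (h > 1) ? h / 2 : 1;
    }
    return total;
}

For a 1024x1024 texture this comes to 4 MiB for the top level plus roughly a third more for the mips, noticeably above the 3 MiB that the width * height * 3 formula would suggest for an "RGB" texture.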
In addition to the answer from Simon F, it's also worth noting that badly written applications can force the driver to allocate memory for multiple copies of the same texture. This can occur if it attempts to modify the texture while it is still referenced by an in-flight rendering operation. This is commonly known as "resource copy-on-write" or "resource ghosting".
This blog post explains it in more detail:
https://community.arm.com/developer/tools-software/graphics/b/blog/posts/mali-performance-6-efficiently-updating-dynamic-resources
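The blog above is written around Mali/OpenGL ES, but the same idea applies in Direct3D 11 (which the other questions on this page use): tell the driver you don't need the old contents, so it can hand back a fresh allocation instead of stalling on, or ghosting, the in-flight copy. A minimal sketch, with the buffer assumed to have been created with dynamic usage and CPU write access:

#include <d3d11.h>
#include <cstring>

// Sketch: update a dynamic buffer without forcing the driver to ghost it.
// Assumes the buffer was created with Usage = D3D11_USAGE_DYNAMIC and
// CPUAccessFlags = D3D11_CPU_ACCESS_WRITE.
void UpdateDynamicBuffer(ID3D11DeviceContext* context, ID3D11Buffer* buffer,
                         const void* newData, size_t byteCount)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    // WRITE_DISCARD means "the old contents are garbage to me", so the driver
    // is free to return new memory rather than wait for pending draws.
    if (SUCCEEDED(context->Map(buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        std::memcpy(mapped.pData, newData, byteCount);
        context->Unmap(buffer, 0);
    }
}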

Using GPU for number-crunching and rendering at the same time in parallel

Can a rendering job and a number-crunching job (e.g. in OpenCL) be effectively shared on the same single GPU? For example,
thread A runs an OpenCL task to generate an image;
then, when the image is ready, thread A notifies another thread B (the image is ready) and continues with the next image calculation;
thread B starts some pre-display work on the given image (like overlay calculation with GDI), combines the final image and renders it to the display.
Can this kind of GPU resource sharing yield a performance improvement or, on the contrary, will it cause an overall slowdown of the compute and rendering tasks?
Thanks
There are many factors involved here, but generally, you shouldn't see a slowdown.
Problems with directly answering your question:
OpenCL could be using your CPU as well, depending on how you set it up.
Your graphics work could be done mostly on the CPU or on a different part of the GPU, depending on what you display; for example, many GDI implementations render using the CPU and only use very simple 2D acceleration on the GPU, mostly to blit the final composed image.
It might depend on the GPU, the GPU driver, the graphics stack, etc. that you use.
As is usually the case, you will get the best answer by trying it out, or at least by benchmarking the different parts. After all, you won't really get much of a benefit if your computations or the image-rendering part are too simple.
You might also try going further and rendering the result with shaders or the like; in that case you would avoid having to move the data back from GPU memory to main memory, which could, depending on your circumstances, also give you a speed boost.
If the data/crunching ratio is big and you also have to send data from the CPU to the GPU, pipeline it as crunch ---> crunch ---> render:

    GPU th0 :    crunch for (t-1)     crunch for (t)        rendering
    CPU th1 :    send data for t      send data for t+1     send data for t+2
    CPU th2 :    get data of t-2      get data of t-1       get data of t
    CPU th3-th7: other things independent of crunching or rendering

At the same time: crunching & communication, crunching & communication, rendering & communication, each overlapped with the other things.

If the data/crunching ratio is big and you don't have to send data from the CPU to the GPU: use the interoperability features of CL (example: CL-GL interop).

If the data/crunching ratio is small: you should not see any slowdown.

Medium data/crunching ratio, as crunch --> render ---> crunch ---> render:

    GPU th0 :    crunch for (t)       rendering             crunch for (t+1)     render again, and keep cycling like this
    CPU th1 :    get data of (t-1)    send data for t+1     get data of (t)
    CPU th2-th7: other things independent of crunching or rendering

At the same time: crunching & getting, rendering & sending, crunching & getting, each overlapped with the other things.
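Both answers point at the same escape hatch for the "big data" case: keep the image on the GPU and hand it to the renderer through interop instead of copying it back over the bus. A minimal sketch of the CL-GL interop path is below (standard OpenCL 1.2 calls; it assumes the cl_context was created with GL sharing enabled and that the queue, kernel and GL texture already exist - the question itself uses Cloo and GDI, so this only shows the general shape):

#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/gl.h>   // for GL_TEXTURE_2D; on Windows include <windows.h> first

// Sketch: let an OpenCL kernel write straight into an OpenGL texture so the
// image never leaves GPU memory.
void RunKernelIntoGLTexture(cl_context ctx, cl_command_queue queue,
                            cl_kernel kernel, unsigned glTexture,
                            size_t width, size_t height)
{
    cl_int err = CL_SUCCESS;
    cl_mem image = clCreateFromGLTexture(ctx, CL_MEM_WRITE_ONLY,
                                         GL_TEXTURE_2D, 0, glTexture, &err);

    // GL must be done with the texture before CL touches it.
    clEnqueueAcquireGLObjects(queue, 1, &image, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &image);
    size_t global[2] = { width, height };
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);

    // Hand the texture back to GL; the render thread can now sample or blit it.
    clEnqueueReleaseGLObjects(queue, 1, &image, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(image);
}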

Simulate a simple Graphic Card

OK, I can find simulation designs for simple architectures (edit: definitely not something like x86). For example, use an int as the program counter, use a byte array as the memory, and so on. But how can I simulate the functionality of a graphic card (the simplest graphic card imaginable)?
For example, use an array to represent the pixels and "paint" each pixel one by one.
But when should it paint - synchronized with the CPU, or asynchronously? Who stores the graphic data in that array? Is there an instruction for storing a pixel and for painting a pixel?
Please consider that all the question marks ('?') don't mean "you are asking a lot of questions" but describe the problem itself - how to simulate a graphic card?
Edit : LINK to a basic implementation design for CPU+Memory simulation
Graphic cards typically carry a number of KBs or MBs of memory that stores colors of individual pixels that are then displayed on the screen. The card scans this memory a number of times per second turning the numeric representation of pixel colors into video signals (analog or digital) that the display understands and visualizes.
The CPU has access to this memory, and whenever it changes it, the card eventually translates the new color data into the appropriate video signals and the display shows the updated picture. The card does all the processing asynchronously and doesn't need much help from the CPU. From the CPU's point of view it's pretty much "write the new pixel color into the graphic card's memory at the location corresponding to the pixel's coordinates and forget about it". It may be a little more complex in reality (due to poor synchronization artifacts such as tearing, snow and the like), but that's the gist of it.
When you simulate a graphic card, you need to somehow mirror the memory of the simulated card into the physical graphic card's memory. If the OS gives you direct access to the physical graphic card's memory, it's an easy task. Simply implement writes to the memory of your emulated computer with something like this:
void MemoryWrite(unsigned long Address, unsigned char Value)
{
    /* Mirror writes that land inside the simulated card's video memory
       straight into the physical card's framebuffer. */
    if ((Address >= SimulatedCardVideoMemoryStart) &&
        (Address - SimulatedCardVideoMemoryStart < SimulatedCardVideoMemorySize))
    {
        PhysicalCard[Address - SimulatedCardVideoMemoryStart] = Value;
    }

    /* The write always goes into the emulated computer's RAM as well. */
    EmulatedComputerMemory[Address] = Value;
}
The above, of course, assumes that the simulated card has exactly the same resolution (say, 1024x768) and pixel representation (say, 3 bytes per pixel, first byte for red, second for green and third for blue) as the physical card. In real life things can be slightly more complex, but again, that's the general idea.
You can access the physical card's memory directly in MSDOS or on a bare x86 PC without any OS if you make your code bootable by the PC BIOS and limit it to using only the BIOS service functions (interrupts) and direct hardware access for all the other PC devices.
Btw, it will probably be very easy to implement your emulator as a DOS program and run it either directly in Windows XP (Vista and 7 have extremely limited support for DOS apps in 32-bit editions and none in 64-bit editions; you may, however, install XP Mode, which is XP in a VM in 7) or better yet in something like DOSBox, which appears to be available for multiple OSes.
If you implement the thing as a Windows program, you will have to use either GDI or DirectX in order to draw something on the screen. Unless I'm mistaken, neither of these two options lets you access the physical card's memory directly such that changes in it would be automatically displayed.
Drawing individual pixels on the screen using GDI or DirectX may be expensive if there's a lot of rendering. Redrawing all simulated card's pixels every time when one of them gets changed amounts to the same performance problem. The best solution is probably to update the screen 25-50 times a second and update only the parts that have changed since the last redraw. Subdivide the simulated card's buffer into smaller buffers representing rectangular areas of, say, 64x64 pixels, mark these buffers as "dirty" whenever the emulator writes to them and mark them as "clean" when they've been drawn on the screen. You may set up a periodic timer driving screen redraws and do them in a separate thread. You should be able to do something similar to this in Linux, but I don't know much about graphics programming there.
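A minimal sketch of that dirty-tile bookkeeping, assuming 3 bytes per pixel and 64x64 tiles; DrawTileToScreen is a placeholder for whatever GDI/DirectX blit you end up using:

#include <vector>
#include <cstdint>

// Simulated framebuffer with per-tile dirty flags: writes mark their tile
// dirty, and the periodic redraw only pushes dirty tiles to the screen.
struct SimulatedFramebuffer {
    static constexpr int TileSize = 64;
    int width, height;                 // in pixels
    std::vector<uint8_t> pixels;       // 3 bytes per pixel (R, G, B)
    std::vector<bool> dirty;           // one flag per 64x64 tile

    SimulatedFramebuffer(int w, int h)
        : width(w), height(h), pixels(size_t(w) * h * 3),
          dirty(size_t((w + TileSize - 1) / TileSize) * ((h + TileSize - 1) / TileSize), true) {}

    int TilesPerRow() const { return (width + TileSize - 1) / TileSize; }

    // Called by the emulator whenever it writes a pixel.
    void WritePixel(int x, int y, uint8_t r, uint8_t g, uint8_t b) {
        uint8_t* p = &pixels[(size_t(y) * width + x) * 3];
        p[0] = r; p[1] = g; p[2] = b;
        dirty[(y / TileSize) * TilesPerRow() + (x / TileSize)] = true;
    }

    // Called 25-50 times per second from a timer or a separate thread.
    void Present() {
        for (size_t t = 0; t < dirty.size(); ++t) {
            if (!dirty[t]) continue;
            int px = int(t) % TilesPerRow() * TileSize;
            int py = int(t) / TilesPerRow() * TileSize;
            int tw = (px + TileSize > width)  ? width  - px : TileSize;
            int th = (py + TileSize > height) ? height - py : TileSize;
            DrawTileToScreen(px, py, tw, th);
            dirty[t] = false;
        }
    }

    // Placeholder: copy the given rectangle of 'pixels' to the real screen
    // with GDI (e.g. SetDIBitsToDevice) or a DirectX texture upload.
    void DrawTileToScreen(int /*x*/, int /*y*/, int /*w*/, int /*h*/) {}
};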

In DirectX, if I reuse a slot, does the GPU keep the previous resource in its memory? Also, can the original CPU-side resources be safely altered?

I was writing this question about directx and the following questions were part of it, but I realized I needed to separate them out.
If something isn't in a "slot" (register) on the GPU, will it have to be retransferred to the GPU to be used again? I.e. if I put texture A in register t0, then later put texture B in register t0, is texture A no longer available on the GPU? Or is it still resident in the GPU memory, but I will have to place a call to load it into a texture register to get at it? Or something else entirely?
In a similar vein, do calls to PSSetShader, PSSetShaderResources, IASetVertexBuffers, etc. block and copy data to the GPU before returning, so that after the call one can alter or even free the resources they were based on, because the data is now resident on the GPU?
I guess this is more than one question, but I expect I'll get in trouble if I try asking too many DirectX questions in one day (though I think these are honestly decent questions about which the MSDN documentation remains pretty silent, even if they are all newbie questions).
If I put texture A in register t0, then later put texture B in register t0, is texture A no longer available on the GPU?
It is no longer bound to the texture register so will not get applied to any polygons. You will have to bind it to a texture register again to use it.
Or is it still resident in the GPU memory, but I will have to place a call to load it into a texture register to get at it?
Typically they will stay in video memory until enough other resources have been loaded that the driver needs to reclaim the memory. This was more explicit in DirectX 9, where you had to specify which memory pool to place a resource in. Now everything is effectively in what was the D3DPOOL_MANAGED memory pool in Direct3D 9. When you set a texture register to use the texture, it will be fast as long as the texture is still in video memory.
In a similar vein, do calls to PSSetShader, PSSetShaderResources, IASetVertexBuffers, etc. block and copy data to the GPU before returning, so that after the call one can alter or even free the resources they were based on, because the data is now resident on the GPU?
DirectX manages the resources for you and tries to keep them in video memory as long as it can.
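For concreteness, this is roughly what the rebinding looks like through the raw Direct3D 11 API in C++ (SlimDX wraps the same calls); srvA and srvB stand in for shader resource views created earlier:

#include <d3d11.h>

// Sketch: rebinding pixel-shader slot t0. srvA / srvB are assumed to be
// ID3D11ShaderResourceView* for textures A and B created beforehand.
void RebindSlotExample(ID3D11DeviceContext* context,
                       ID3D11ShaderResourceView* srvA,
                       ID3D11ShaderResourceView* srvB)
{
    // Texture A is now visible to the pixel shader through register t0.
    context->PSSetShaderResources(0, 1, &srvA);
    // ... draws that sample t0 see texture A ...

    // Rebinding slot 0 changes what the shader sees, but texture A's data
    // normally stays resident in video memory; it simply isn't bound anymore.
    context->PSSetShaderResources(0, 1, &srvB);
    // ... draws now sample texture B; binding srvA again later is cheap ...
}

Rebinding only changes which view the shader stage reads from; by itself it neither evicts nor re-uploads the underlying texture.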

Resources