I am trying to find out how much texture memory is consumed by my application. These are the texture types I use and my calculations:
RGB textures -> textureWidth * textureHeight * 3 (memory usage)
RGBA textures -> textureWidth * textureHeight * 4 (memory usage)
As a result, I am wondering: can the graphics driver allocate much more memory than calculated above?
A few simple answers:
To the best of my knowledge, it's been around two decades since (the majority of) hardware devices supported packed 24-bit RGB data. On modern hardware this is usually represented in an "XRGB" (or equivalent) format, with one padding byte per pixel. It is painful for hardware to efficiently handle pixels that straddle cache lines etc. Further, since many applications (read "games") use texture compression, supporting fully packed 24-bit data seems a bit redundant.
Texture dimensions: if a texture's dimensions are not 'nice' for the particular hardware (e.g., maybe a row is not a multiple of 16 bytes, or the hardware works in, say, 4x4 or 8x8 blocks), then the driver may pad the physical size of the texture.
Finally, if you have MIP mapping (and you do want this for performance as well as quality reasons), it will expand the texture size by around 33%.
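To make the padding concrete, here is a minimal sketch of the arithmetic; the XRGB expansion, the 256-byte row pitch, and the one-third MIP overhead are illustrative assumptions, since the real values depend on the GPU and driver.

#include <stddef.h>

/* Rough estimate of what a driver might actually allocate for a texture.
   Assumed (and GPU/driver dependent): RGB is stored as XRGB (4 bytes per
   pixel), each row is padded to a 256-byte pitch, and a full MIP chain
   adds roughly one third on top of the base level. */
size_t EstimateTextureMemory(size_t width, size_t height, int hasMips)
{
    size_t bytesPerPixel = 4;  /* XRGB or RGBA: 4 bytes either way */
    size_t pitch = ((width * bytesPerPixel + 255) / 256) * 256;  /* padded row */
    size_t base = pitch * height;
    return hasMips ? base + base / 3 : base;  /* MIP chain: ~33% extra */
}

Under these assumptions, a 1000x1000 RGB texture comes out at roughly 5.5 MB with MIP maps, rather than the naive 3 MB from the width * height * 3 formula.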
In addition to the answer from Simon F, it's also worth noting that badly written applications can force the driver to allocate memory for multiple copies of the same texture. This can occur if the application attempts to modify a texture while it is still referenced by an in-flight rendering operation. It is commonly known as "resource copy-on-write" or "resource ghosting".
This blog post explains it in more detail:
https://community.arm.com/developer/tools-software/graphics/b/blog/posts/mali-performance-6-efficiently-updating-dynamic-resources
While reading the VideoCoreIV-AG100-R spec of the Broadcom VC4 chip, I came across a paragraph that says:
All rendering by the 3D system is in tiles, requiring separate binning and rendering passes to render a frame. In normal operation the host processor creates a control list in memory defining all the operations and supplying all the data for rendering for a complete frame.
It says that rendering a frame requires a binning pass and a rendering pass. Could anybody explain in detail what roles these two passes play in the graphics pipeline? Thanks a lot.
For a tile-based rendering architecture, the passes are:
Binning pass - generates a stream/map between frame tiles and the geometry that should be rendered into each particular tile (see the sketch below);
Rendering pass - takes the map between tiles and geometry and renders the appropriate pixels per tile.
Mobile GPUs face many limitations compared to desktop GPUs (for example, memory bandwidth, because in mobile devices memory is shared between the GPU and CPU). To decrease overall memory bandwidth consumption, vendors split the work into small pieces - for example via tile-based rendering - to achieve efficient utilization of all available resources and gain acceptable performance.
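As a rough illustration of the binning pass, here is a minimal CPU-side sketch; the types, the 32-pixel tile size, and the bounding-box test are simplifying assumptions (real hardware bins far more cleverly). Each triangle's screen-space bounding box determines which tile bins receive its index; the rendering pass then walks one bin at a time.

#include <stdlib.h>

#define TILE_SIZE 32  /* hypothetical tile dimension in pixels */

typedef struct { float x, y; } Vec2;
typedef struct { Vec2 v[3]; } Triangle;
typedef struct { int *tris; int count, cap; } TileBin;

/* Binning pass: record, for every tile, the indices of the triangles
   whose screen-space bounding box overlaps that tile. */
void BinTriangles(const Triangle *tris, int triCount,
                  TileBin *bins, int tilesX, int tilesY)
{
    for (int i = 0; i < triCount; i++) {
        const Triangle *t = &tris[i];
        float minX = t->v[0].x, maxX = t->v[0].x;
        float minY = t->v[0].y, maxY = t->v[0].y;
        for (int j = 1; j < 3; j++) {  /* bounding box of the triangle */
            if (t->v[j].x < minX) minX = t->v[j].x;
            if (t->v[j].x > maxX) maxX = t->v[j].x;
            if (t->v[j].y < minY) minY = t->v[j].y;
            if (t->v[j].y > maxY) maxY = t->v[j].y;
        }
        for (int ty = (int)(minY / TILE_SIZE); ty <= (int)(maxY / TILE_SIZE); ty++) {
            for (int tx = (int)(minX / TILE_SIZE); tx <= (int)(maxX / TILE_SIZE); tx++) {
                if (tx < 0 || tx >= tilesX || ty < 0 || ty >= tilesY)
                    continue;  /* skip bins outside the frame */
                TileBin *b = &bins[ty * tilesX + tx];
                if (b->count == b->cap) {  /* grow the bin's index list */
                    b->cap = b->cap ? b->cap * 2 : 16;
                    b->tris = realloc(b->tris, b->cap * sizeof(int));
                }
                b->tris[b->count++] = i;
            }
        }
    }
}

The payoff is that the rendering pass can keep a single tile's pixels in fast on-chip memory while shading only the triangles in that tile's bin, instead of streaming the whole framebuffer to and from external memory.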
Details
The tile-based rendering approach is described on many GPU vendors' sites, such as:
A look at the PowerVR graphics architecture: Tile-based rendering
GPU Framebuffer Memory: Understanding Tiling
Imagine a big procedural world that amounts to more than 300k triangles (about 5-10k per 8x8x8 m chunk). Since it's procedural, I need to create all the data in code. I've managed to make a good normal-smoothing algorithm, and now I'm going to use textures (everyone needs textures; you don't want to walk around a flat-colored world, do you?). The problem is that I don't know where it's best to calculate the UV sets (I'm using triplanar texture mapping).
I have three approaches:
1. Do the calculations on the CPU, then upload to the GPU (the mesh is not modified every frame - 90% of the time it stays static, and the calculation is done per chunk when a chunk changes);
2. Do the calculations on the GPU in a vertex shader (the calculation is done for every triangle in every frame, and the meshes are kinda big - is it expensive to do this every frame?);
3. Move the algorithm to OpenCL (the calculation is done per chunk when a chunk changes; I already use OpenCL to do the meshing of my data) and call the kernel when the mesh changes (inexpensive, but all my OpenCL experience is based on modifying existing code; still, I have some C background, so it may take a long time before I get it to work).
Which approach is better, given that my little experiment is already a little heavy for mid-range hardware?
I'm using C# (.NET 4.0) with SlimDX and Cloo.
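For reference, the per-vertex work in option 1 is small. Below is a minimal C sketch of one common triplanar variant (the names and the texScale parameter are mine): it picks the dominant axis of the smoothed normal and projects the position onto the perpendicular plane. Note that full triplanar mapping usually blends all three projections in the pixel shader, so this per-vertex pick is a cheaper approximation.

#include <math.h>

typedef struct { float x, y, z; } Vec3;
typedef struct { float u, v; } UV;

/* Per-vertex triplanar projection: choose the dominant normal axis and
   use the other two position components as texture coordinates. */
UV TriplanarUV(Vec3 pos, Vec3 normal, float texScale)
{
    UV uv;
    float ax = fabsf(normal.x), ay = fabsf(normal.y), az = fabsf(normal.z);
    if (ax >= ay && ax >= az) {        /* X-dominant: project onto YZ plane */
        uv.u = pos.z * texScale;
        uv.v = pos.y * texScale;
    } else if (ay >= az) {             /* Y-dominant: project onto XZ plane */
        uv.u = pos.x * texScale;
        uv.v = pos.z * texScale;
    } else {                           /* Z-dominant: project onto XY plane */
        uv.u = pos.x * texScale;
        uv.v = pos.y * texScale;
    }
    return uv;
}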
OK. I can find simulation designs for simple architectures (edit: definitely not x86-like). For example, use an int as the program counter, use a byte array as the memory, and so on. But how can I simulate the functionality of a graphics card (the simplest graphics card imaginable)?
Something like using an array to represent the pixels and "painting" each pixel one by one.
But when should painting happen - synchronized with the CPU, or asynchronously? Who stores the graphics data in that array? Is there an instruction for storing a pixel and for painting a pixel?
Please note that all the question marks ('?') don't mean "you are asking a lot of questions"; they spell out the problem itself - how to simulate a graphics card?
Edit : LINK to a basic implementation design for CPU+Memory simulation
Graphics cards typically carry a number of KBs or MBs of memory that stores the colors of the individual pixels that are then displayed on the screen. The card scans this memory a number of times per second, turning the numeric representation of pixel colors into video signals (analog or digital) that the display understands and visualizes.
The CPU has access to this memory, and whenever it changes it, the card eventually translates the new color data into the appropriate video signals and the display shows the updated picture. The card does all the processing asynchronously and doesn't need much help from the CPU. From the CPU's point of view it's pretty much "write the new pixel color into the graphics card's memory at the location corresponding to the coordinates of the pixel, and forget about it". It may be a little more complex in reality (due to poor synchronization artifacts such as tearing, snow and the like), but that's the gist of it.
When you simulate a graphics card, you need to somehow mirror the memory of the simulated card in the physical graphics card's memory. If the OS gives you direct access to the physical graphics card's memory, it's an easy task. Simply implement writes to the memory of your emulated computer something like this:
/* Write one byte into the emulated computer's address space.  If the
   address falls inside the simulated card's framebuffer window, mirror
   the write into the physical card's memory so it appears on screen. */
void MemoryWrite(unsigned long Address, unsigned char Value)
{
    if ((Address >= SimulatedCardVideoMemoryStart) &&
        (Address - SimulatedCardVideoMemoryStart < SimulatedCardVideoMemorySize))
    {
        PhysicalCard[Address - SimulatedCardVideoMemoryStart] = Value;
    }
    EmulatedComputerMemory[Address] = Value;
}
The above, of course, assumes that the simulated card has exactly the same resolution (say, 1024x768) and pixel representation (say, 3 bytes per pixel: the first byte for red, the second for green and the third for blue) as the physical card. In real life things can be slightly more complex, but again, that's the general idea.
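Under those same assumptions (1024x768, 3 bytes per pixel, rows stored top to bottom, no row padding), the byte offset of a pixel inside the framebuffer works out as below; the helper name is mine.

/* Byte offset of pixel (x, y) in a 1024x768, 3-bytes-per-pixel framebuffer. */
unsigned long PixelOffset(unsigned int x, unsigned int y)
{
    return ((unsigned long)y * 1024 + x) * 3;
}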
You can access the physical card's memory directly in MSDOS or on a bare x86 PC without any OS if you make your code bootable by the PC BIOS and limit it to using only the BIOS service functions (interrupts) and direct hardware access for all the other PC devices.
Btw, it will probably be very easy to implement your emulator as a DOS program and run it either directly in Windows XP (Vista and 7 have extremely limited support for DOS apps in 32-bit editions and none in 64-bit editions; you may, however, install XP Mode, which is XP in a VM in 7) or better yet in something like DOSBox, which appears to be available for multiple OSes.
If you implement the thing as a Windows program, you will have to use either GDI or DirectX in order to draw something on the screen. Unless I'm mistaken, neither of these two options lets you access the physical card's memory directly such that changes in it would be automatically displayed.
Drawing individual pixels on the screen using GDI or DirectX may be expensive if there's a lot of rendering. Redrawing all of the simulated card's pixels every time one of them changes amounts to the same performance problem. The best solution is probably to update the screen 25-50 times a second, and update only the parts that have changed since the last redraw: subdivide the simulated card's buffer into smaller buffers representing rectangular areas of, say, 64x64 pixels, mark these buffers as "dirty" whenever the emulator writes to them, and mark them as "clean" when they've been drawn on the screen (a sketch follows). You may set up a periodic timer driving screen redraws and do them in a separate thread. You should be able to do something similar in Linux, but I don't know much about graphics programming there.
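Here is a minimal sketch of that dirty-tile scheme; the names are mine, and DrawTile() stands in for whatever GDI/DirectX blit you end up using.

#define SIM_WIDTH  1024
#define SIM_HEIGHT 768
#define TILE       64
#define TILES_X    (SIM_WIDTH / TILE)
#define TILES_Y    (SIM_HEIGHT / TILE)

static unsigned char Dirty[TILES_Y][TILES_X];

extern void DrawTile(int x, int y, int w, int h);  /* hypothetical blit */

/* Call from MemoryWrite whenever a write lands in the simulated
   framebuffer: mark the 64x64 tile containing pixel (x, y) as dirty. */
void MarkDirty(unsigned int x, unsigned int y)
{
    Dirty[y / TILE][x / TILE] = 1;
}

/* Called 25-50 times a second from a timer: redraw only dirty tiles. */
void RedrawDirtyTiles(void)
{
    for (int ty = 0; ty < TILES_Y; ty++)
        for (int tx = 0; tx < TILES_X; tx++)
            if (Dirty[ty][tx]) {
                DrawTile(tx * TILE, ty * TILE, TILE, TILE);
                Dirty[ty][tx] = 0;
            }
}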
I was writing this question about directx and the following questions were part of it, but I realized I needed to separate them out.
If something isn't in a "slot" (register) on the GPU, will it have to be retransferred to the GPU to be used again? I.e., if I put texture A in register t0, then later put texture B in register t0, is texture A no longer available on the GPU? Or is it still resident in GPU memory, but I will have to place a call to load it into a texture register to get at it? Or something else entirely?
In a similar vein, do calls to PSSetShaders, PSSetShaderResource, IASetVertexBuffers, etc. block and copy data to the GPU before returning, so that after the call one can alter or even free the resources they were based on, because the data is now resident on the GPU?
I guess this is more than one question, but I expect I'll get in trouble if I try asking too many DirectX questions in one day (though I think these are honestly decent questions about which the MSDN documentation remains pretty silent, even if they are all newbie questions).
if I put texture A in register t0, then later put texture B in register t0, is texture A no longer available on the GPU?
It is no longer bound to the texture register, so it will not get applied to any polygons. You will have to bind it to a texture register again to use it.
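For illustration, a minimal D3D11 sketch (C-style COM; the function and variable names are mine) of rebinding slot t0. The bind itself is cheap and does not move the texel data.

#include <d3d11.h>

/* Swap what pixel-shader slot t0 samples.  srvA and srvB are shader
   resource views of textures A and B, created earlier. */
void SwapSlot0(ID3D11DeviceContext *ctx,
               ID3D11ShaderResourceView *srvA,
               ID3D11ShaderResourceView *srvB)
{
    ctx->lpVtbl->PSSetShaderResources(ctx, 0, 1, &srvA);  /* t0 -> texture A */
    /* ... draw calls that sample texture A ... */
    ctx->lpVtbl->PSSetShaderResources(ctx, 0, 1, &srvB);  /* t0 -> texture B */
    /* Texture A is now unbound but typically still resident in video
       memory; rebinding srvA later reuses it without re-uploading. */
}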
Or is it still resident in the GPU memory, but I will have to place a call to load it into a texture register to get at it?
Typically they will stay in video memory until enough other resources have been loaded that the memory needs to be reclaimed. This was more obvious in DirectX 9, when you had to specify which memory pool to place a resource in. Now everything is effectively in what was the D3DPOOL_MANAGED memory pool in Direct3D 9. When you set the texture register to use the texture, it will be fast as long as the texture is still in video memory.
In a similar vein, do calls to PSSetShaders, PSSetShaderResource, IASetVertexBuffers, etc. block and copy data to the GPU before returning, so that after the call one can alter or even free the resources they were based on, because the data is now resident on the GPU?
DirectX manages the resources for you and tries to keep them in video memory as long as it can.
Jonathan Leffler's comment in the question "How can I find the Size of some specified files?" is thought-provoking. I will break it into parts for analysis.
files are stored on pages; you normally end up with more space being used than that calculation gives because a 1-byte file (often) occupies one page (of maybe 512 bytes). The exact values vary - it was easier in the days of the 7th Edition Unix file system (though not trivial even then if you wanted to take account of indirect blocks referenced by the inode as well as the raw data blocks).
Questions about the parts
What is the definition of "page"?
Why is the word "maybe" in the after-thought "one page (of maybe 512 bytes)"?
Why was it easier to measure exact sizes in the "7th Edition Unix file system"?
What is the definition of "indirect block"?
How can you have references by two things: "the inode" and "the raw data blocks"?
Historical questions that emerged
I. What is the historical context Leffler is speaking about?
II. Have the definitions changed over time?
I think he means block instead of page, a block being the minimum addressable unit on the filesystem.
Block sizes can vary.
Not sure why, but perhaps the filesystem interface exposed APIs allowing a more exact measurement.
An indirect block is a block that contains pointers to other blocks, rather than file data.
The inode occupies space (blocks) just as the raw data does. This is what the author meant.
As usual for Wikipedia pages, Block (data storage) is informative despite being far too exuberant about linking all keywords.
In computing (specifically data transmission and data storage), a block is a sequence of bytes or bits, having a nominal length (a block size). Data thus structured is said to be blocked. The process of putting data into blocks is called blocking. Blocking is used to facilitate the handling of the data-stream by the computer program receiving the data. Blocked data is normally read a whole block at a time. Blocking is almost universally employed when storing data to 9-track magnetic tape, to rotating media such as floppy disks, hard disks, optical discs and to NAND flash memory.
Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data, though the block size in file systems may be a multiple of the physical block size. In classical file systems, a single block may only contain a part of a single file. This leads to space inefficiency due to internal fragmentation, since file lengths are often not multiples of block size, and thus the last block of files will remain partially empty. This will create slack space, which averages half a block per file. Some newer file systems attempt to solve this through techniques called block suballocation and tail merging.
There's also a reasonable overview of the classical Unix File System.
Traditionally, hard disk geometry (the layout of blocks on the disk itself) has been CHS.
Head: the magnetic reader/writer on each (side of a) platter; it can move in and out to access different cylinders
Cylinder: the set of tracks (one per head) that pass under the heads at a given arm position as the platters rotate
Sector: a constant-sized amount of data stored contiguously on a portion of a track; the smallest unit of data that the drive can deal with
CHS isn't used much these days, as
Hard disks no longer use a constant number of sectors per cylinder. More data is squeezed onto a platter by using a constant arclength per sector rather than a constant rotational angle, so there are more sectors on the outer cylinders than there are on the inner cylinders.
By the ATA specification, a drive may have no more than 2^16 cylinders, 2^4 heads, and 2^8 sectors per track; with 512 B sectors, that is 2^16 * 2^4 * 2^8 * 512 B = 2^37 B = 128 GiB. Through BIOS INT 13h, it is not possible to access anything beyond 7.88 GiB through CHS anyway.
For backwards-compatibility, larger drives still claim to have a CHS geometry (otherwise DOS wouldn't be able to boot), but getting to any of the higher data requires using LBA addressing.
CHS doesn't even make sense on RAID or non-rotational media.
but for historical reasons, this has affected block sizes: because sector sizes were almost always 512B, filesystem block sizes have always been multiples of 512B. (There is a movement afoot to introduce drives with 1kB and 4kB sector sizes, but compatibility looks rather painful.)
Generally speaking, smaller filesystem block sizes result in less wasted space when storing many small files (unless advanced techniques like tail merging are in use), while larger block sizes reduce external fragmentation and have lower overhead on large disks. The filesystem block size is usually a power of 2, is limited below by the block device's sector size, and is often limited above by the OS's page size.
The page size varies by OS and platform (and, in the case of Linux, can vary by configuration as well). As with blocks, smaller page sizes reduce internal fragmentation but require more administrative overhead. A 4 kB page size is common on 32-bit platforms.
Now, on to indirect blocks. In the UFS design:
An inode describes a file.
The number of pointers to data blocks that an inode can hold is very limited (fewer than 16). The specific number varies between derived implementations.
For small files, the pointers can directly point to the data blocks that compose a file.
For larger files, there must be indirect pointers, which point to a block which only contains more pointers to blocks. These may be direct pointers to data blocks belonging to the file, or if the file is very large, they may be even more indirect pointers.
Thus the amount of storage required for a file may be greater than just the blocks containing its data, when indirect pointers are in use.
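To make that concrete, here is a worked example under hypothetical UFS-like parameters (12 direct pointers, one single-indirect and one double-indirect pointer, 4 kB blocks, 4-byte block pointers); real filesystems differ in all of these numbers.

#include <stdint.h>

#define BLOCK_SIZE   4096u
#define DIRECT_PTRS  12u
#define PTRS_PER_BLK (BLOCK_SIZE / 4u)  /* 1024 pointers per indirect block */

/* Total blocks allocated for a file: data blocks plus the indirect
   (metadata) blocks needed to reference them. */
uint64_t BlocksAllocated(uint64_t fileSize)
{
    uint64_t data = (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    uint64_t meta = 0;
    if (data > DIRECT_PTRS) {
        uint64_t rest = data - DIRECT_PTRS;
        meta += 1;                       /* the single-indirect block */
        if (rest > PTRS_PER_BLK) {
            uint64_t rest2 = rest - PTRS_PER_BLK;
            /* double indirect: one top-level block plus one second-level
               block per 1024 remaining data blocks */
            meta += 1 + (rest2 + PTRS_PER_BLK - 1) / PTRS_PER_BLK;
        }
    }
    return data + meta;
}

With these numbers, a 10 MB file occupies 2560 data blocks plus 4 indirect blocks, so the space used is slightly more than the file's raw size suggests.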
Not all filesystems use this method for keeping track of the data blocks belonging to a file. FAT simply uses a single file allocation table, which is effectively a gigantic series of linked lists, and many modern filesystems use extents.