I want to import an image using Haxe. My image is large, but it is 8-bit and only 89 KB on disk. The problem occurs when I import it: memory usage grows by about 35 MB. I suppose it is reserving memory for a 32-bit image.
Any idea how to import an 8-bit image without consuming so much memory?
OpenFL currently stores images at 32 bits per pixel at runtime, regardless of the original file format. I believe there is an enhancement task open right now to allow for 24-bit, 16-bit and other color formats.
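For scale: the image dimensions are not given, but assuming something on the order of 3000 x 3000 pixels, a 32-bit decode is 3000 * 3000 * 4 = 36,000,000 bytes, roughly 34 MB, which matches the observed ~35 MB growth.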
I am currently trying to use the VGG16 model from the Keras library, but whenever I create a VGG16 model object by doing
from keras.applications.vgg16 import VGG16
model = VGG16()
I get the following message 3 times.
tensorflow/core/framework/allocator.cc:124] Allocation of 449576960 exceeds 10% of system memory
Following this, my computer freezes. I am using a 64-bit machine with 4 GB of RAM running Linux Mint 18, and I have no access to a GPU.
Does this problem have something to do with my RAM?
As a temporary solution I am running my Python scripts from the command line, because my computer freezes less there than in any IDE. Also, this does not happen when I use an alternative model such as InceptionV3.
I have tried the solution provided here, but it didn't work.
Any help is appreciated.
You are most likely running out of memory (RAM).
Try running top (or htop) in parallel and see your memory utilization.
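If it is easier to check from inside Python, here is a small sketch using psutil (psutil is an extra dependency, not something the answer assumes):

import psutil

# Print current system memory usage in GB.
mem = psutil.virtual_memory()
print("used: %.1f GB / total: %.1f GB" % (mem.used / 1e9, mem.total / 1e9))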
In general, VGG models are rather big and require a decent amount of RAM. That said, the actual requirement depends on batch size. Smaller batch means smaller activation layer.
For example, a 6-image batch would consume about a gig of RAM (reference). As a test, you could lower your batch size to 1 and see if that fits in your memory.
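For instance, a minimal sketch of predicting one image at a time (the file name is a placeholder; it assumes the standard keras.applications API):

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image

model = VGG16()  # loading the weights alone already takes a sizeable chunk of RAM

# Placeholder image file; any 224x224 RGB input works here.
img = image.load_img("example.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# batch_size=1 keeps the activation memory as small as possible
preds = model.predict(x, batch_size=1)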
When I run my Python code on the most powerful AWS GPU instances (with 1 or 8 x Tesla V100 16 GB, i.e. p3.2xlarge or p3.16xlarge), why are they both only 2-3 times faster than my Dell XPS laptop with a GeForce 1050 Ti?
I'm using Windows, Keras, CUDA 9, TensorFlow 1.12 and the newest Nvidia drivers.
When I check the GPU load via GPU-Z, the GPU peaks at only 43% load for a very short period each time, while the controller runs at 100%...
The dataset I use consists of matrices in JSON format, and the files are located on a 10 TB Nitro drive with a maximum of 64,000 IOPS. Whether the folder contains 10 TB, 1 TB or 100 MB, training is still very slow per iteration.
All advice is more than welcome!
UPDATE 1:
From the Tensorflow docs:
"To start an input pipeline, you must define a source. For example, to construct a Dataset from some tensors in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data are on disk in the recommended TFRecord format, you can construct a tf.data.TFRecordDataset."
Previously I had the matrices stored in JSON format (made by Node). My TF code runs in Python.
I will now save only the coordinates in Node, still in JSON format.
The question now is: in Python, what is the best way to load the data? Can TF use the coordinates alone, or do I have to convert the coordinates back into matrices first?
The performance of any machine learning model depends on many things, including but not limited to: how much pre-processing you do, how much data you copy from CPU to GPU, op bottlenecks, and more. Check out the TensorFlow performance guide as a first step. There are also a few videos from the TensorFlow Dev Summit 2018 that talk about performance; the ones on how to properly use tf.data and how to debug performance are two that I recommend.
The only thing I can say for sure is that JSON is a bad format for this purpose. You should switch to the TFRecord format, which uses protobuf (better than JSON).
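A minimal sketch of that switch, assuming each example is a flat float32 matrix (the file name, shape and feature key are placeholders) and the TF 1.x API from the question:

import numpy as np
import tensorflow as tf

# Writing: serialize each matrix as one tf.train.Example in a TFRecord file.
matrix = np.random.rand(224, 224).astype(np.float32)        # placeholder data
with tf.python_io.TFRecordWriter("data.tfrecord") as writer:
    example = tf.train.Example(features=tf.train.Features(feature={
        "values": tf.train.Feature(
            float_list=tf.train.FloatList(value=matrix.ravel().tolist())),
    }))
    writer.write(example.SerializeToString())

# Reading: parse records straight into a tf.data pipeline.
def parse(record):
    parsed = tf.parse_single_example(
        record, {"values": tf.FixedLenFeature([224 * 224], tf.float32)})
    return tf.reshape(parsed["values"], [224, 224])

dataset = tf.data.TFRecordDataset("data.tfrecord").map(parse)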
Unfortunately performance and optimisation of any system takes a lot of effort and time, and can be a rabbit hole that just keeps going down.
First off, you should have a really good reason to accept the increased computational overhead of a Windows-based AMI.
If your CPU is at ~100% while the GPU is <100%, then your CPU is likely the bottleneck. If you are in the cloud, consider moving to instances with a larger CPU count (CPU is cheap, GPU is scarce). If you can't increase the CPU count, moving some parts of your graph to the GPU is an option. However, a tf.data-based input pipeline runs entirely on the CPU (but is highly scalable due to its C++ implementation). Prefetching to GPUs might also help here, but the cost of spawning another background thread to populate the buffer for downstream consumers might dampen this effect. Another option is to do some or all pre-processing steps offline (i.e. prior to training).
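As a hedged tf.data sketch (the file list and tuning numbers are placeholders, and parse() is a TFRecord parsing function like the one sketched in the previous answer):

# Parallel CPU-side parsing plus prefetching to overlap input work with GPU compute.
filenames = ["data.tfrecord"]                 # placeholder file list
dataset = (tf.data.TFRecordDataset(filenames)
           .map(parse, num_parallel_calls=4)  # parallelise parsing across CPU cores
           .shuffle(1024)
           .batch(32)
           .prefetch(1))                      # prepare the next batch while the GPU is busy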
A word of caution on using Keras as the input pipeline. Keras relies on Python's multithreading (and optionally multiprocessing) libraries, which may lack both performance (when doing heavy I/O or on-the-fly augmentation) and scalability (when running on multiple CPUs) compared to GIL-free implementations. Consider performing preprocessing offline, pre-loading input data, or using alternative input pipelines (such as the native tf.data mentioned above, or third-party ones like Tensorpack).
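As a sketch of the offline-preprocessing route (the JSON layout, file names and matrix shape are assumptions about the data described in the question):

import json
import numpy as np

# Convert a JSON coordinate file into a dense matrix once, offline,
# instead of re-parsing JSON inside the training loop.
with open("coords.json") as f:
    coords = json.load(f)                      # assumed to be a list of [row, col] pairs

matrix = np.zeros((224, 224), dtype=np.float32)
for r, c in coords:
    matrix[r, c] = 1.0

np.save("matrix_0000.npy", matrix)             # fast to reload (or memory-map) at training time
train_x = np.load("matrix_0000.npy", mmap_mode="r")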
Specifically, in node-opencv, OpenCV Matrix objects are represented as a JavaScript object wrapping a C++ OpenCV Matrix.
However, if you don't .release() them manually, the V8 engine does not seem to know how big they are, and the Node.js memory footprint can grow far beyond any limit you try to set on the command line; i.e. it only seems to run the GC when it approaches the configured memory limit, but because it does not see the objects as large, this does not happen until it's too late.
Is there something we can add to the objects which will allow V8 to see them as large objects?
To illustrate this: you can create and 'forget' large 1 MB Buffers all day on a Node.js process whose memory is limited to 256 MB.
But if you do the same with 1 MB OpenCV Matrices, Node.js will quickly use much more than the 256 MB limit, unless you either run the GC manually or release the Matrices manually.
(Caveat: a C++ OpenCV matrix is a reference to memory, i.e. more than one Matrix object can point to the same data, but it would be a start to have V8 see ALL references to the same memory as being the size of that memory for the purposes of GC; that is safer than seeing them all as very small.)
Circumstances: on an RPi3 we have a limited memory footprint, and processing live video (which allocates about 4 MB of Mat objects per frame) can soon exhaust all memory.
Also, the environment I'm working in (a Node-Red node) is designed for 'public' use, so it is difficult to ensure that all users completely understand the need to manually .release() images; hence this question is about how to bring this large data under the GC's control.
You can inform V8 about your external memory usage with AdjustAmountOfExternalAllocatedMemory(int64_t delta). There are wrappers for this function in N-API and NAN.
By the way, "large objects" has a special meaning in V8: objects large enough to be created in large object space and never moved. External memory is off-heap memory, which is, I think, what you're referring to.
According to the Wikipedia article about initrd:
"Many Linux distributions ship a single, generic kernel image - one that the distribution's developers intend will boot on as wide a variety of hardware as possible. The device drivers for this generic kernel image are included as loadable modules because statically compiling many drivers into one kernel causes the kernel image to be much larger, perhaps too large to boot on computers with limited memory. This then raises the problem of detecting and loading the modules necessary to mount the root file system at boot time, or for that matter, deducing where or what the root file system is.
To avoid having to hardcode handling for so many special cases into the kernel, an initial boot stage with a temporary root file-system — now dubbed early user space — is used. This root file-system can contain user-space helpers which do the hardware detection, module loading and device discovery necessary to get the real root file-system mounted. "
My question is: if we put the modules needed to mount the actual filesystem in the initrd rather than in the kernel image itself, in order to save space, what do we gain in the case of a bootpImage, where kernel and initrd are combined to form a single image? The size of that combined image would increase even when using an initrd.
Can someone clarify?
Define "the size of the kernel".
Yes, if you have a minimal kernel image plus an initrd full of hundreds of modules, it will probably take up more total storage space than the equivalent kernel with everything compiled in, what with all the module headers and such. However, once it's booted, determined what hardware it's on, loaded a few modules and thrown all the rest away (the init in initrd), it will take up considerably less memory. The all-built-in kernel image on the other hand, once booted, is still just as big in memory as on disk, wasting space with all that unneeded driver code.
Storage is almost always considerably cheaper and more abundant than RAM, so optimising for storage space at the cost of reducing available memory once the system is running would generally be a bit silly. Even for network booting, sacrificing runtime capability for total image size for the sake of booting slightly faster makes little sense. The few kinds of systems where such considerations might have any merit almost certainly wouldn't be using generic multiplatform kernels in the first place.
There are several aspects to size, and this may be confusing:
Binary size on disk/network
Boot time size
Run time size
tl;dr: Using an initrd with modules gives a generic image a minimal run-time memory footprint with current (3.17) Linux kernel code.
"My question is: if we put the modules needed to mount the actual filesystem in the initrd rather than in the kernel image itself, in order to save space, what do we gain in the case of a bootpImage, where kernel and initrd are combined to form a single image? The size of that combined image would increase even when using an initrd."
You are correct in that the same amount of data will be transferred no matter which mechanism you choose. In fact, the initrd with module loading will be bigger than a fully statically linked kernel, and boot time will be slower. Sounds bad.
A customized kernel which is built specifically for the device, with no extra hardware drivers and no module support, is always the best. The Debian handbook on kernel compilation gives two reasons why a user may want to build a custom kernel:
Limit the risk of security problems via feature minimization.
Optimize memory consumption.
The second reason is often the most critical: minimizing the amount of memory that a running kernel consumes. The initrd (or initramfs) is a binary disk image that is loaded as a RAM disk. It is all user-space code, with the single task of probing the devices and using module loading to get the correct drivers for the system. After this job is done, it mounts the real boot device or the normal root file system. When this happens, the initrd image is discarded.
So the initrd does not consume run-time memory: you get both a generic image and one that has a fairly minimal run-time footprint.
I will say that the efforts made by distro people have on occasion created performance issues. Typically, ARM drivers used to be compiled for only one SoC; although the source supported a whole SoC family, only one SoC could be selected through build-time conditionals. In more recent kernels the ARM drivers always support the whole SoC family. The memory overhead is minimal, but using a function pointer for a low-level driver transfer function can limit the bandwidth of the controller.
The cache-flush routines have an option for supporting multiple cache types. The function pointers cause the compiler to spill registers; if you instead compile for a specific cache type, the compiler can inline the functions, which often generates much better and smaller code. Most drivers do not have this type of infrastructure, but you will get better run-time behavior if you compile a monolithic kernel that is tuned for your CPU, since several critical kernel functions will use inlined functions.
Drivers will not usually be faster when compiled into the kernel, though. Many systems support hot-plug via USB, PCMCIA, SDIO, etc., and these systems have a memory advantage with module loading as well.
Using DirectX 9 I want to capture what is on the screen and display a smaller version of it in my program.
To capture it, I found and am using GetFrontBufferData. However, it works by writing to a surface allocated in system memory (D3DPOOL_SYSTEMMEM). This means I then have to transfer the screenshot back into video memory before I can draw it.
As you can imagine, this needless round trip (video memory -> system memory -> video memory) causes quite a stutter in my program.
Is there a way I can get the image stored in the front buffer and put it onto a surface in video memory?
This question is a spin-off of my recent question: Capture and Draw a ScreenShot using DirectX