Are they the same thing: Linux's framebuffer and the GPU's memory?

From my understanding, they are different.
The Linux framebuffer is a software object, while the GPU's memory is physical memory mapped to the GPU device.
My questions are the following:
1) Is my understanding correct?
2) If so, it seems that merging the two into one could improve performance (I guess there are many more technical details explaining why this is not possible, and so on...)
3) If not, could you explain how the Linux framebuffer and the GPU work together?

The Linux framebuffer device is a virtual device that wraps the data it receives for display. So, generally, the answer is no: it is not GPU memory. In theory a driver can map GPU memory into fbdev, but it is unlikely anyone does this. The main problem is that there may be many virtual consoles but, say, only one monitor, and fbdev must handle this. Another issue is that GPU memory only fairly recently became virtualised (directly accessible); on older GPUs you could not simply write whatever you liked into GPU memory.
Aside from that, fbdev provides a unified interface, while direct access to GPU memory would require hardware-specific data formats. When the formats differ, the fbdev driver performs the conversion.
As for performance: it is already very good. There is probably not much benefit in raising it further.
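For illustration, here is a minimal user-space sketch of what "virtual device" means in practice: you mmap() /dev/fb0 and write pixels, and the driver decides what memory actually backs the mapping. This assumes a 32-bpp mode and omits error handling; a robust program would also query FBIOGET_FSCREENINFO for the real line length.

```c
/* Minimal fbdev sketch: map /dev/fb0 and draw a red line on the top row.
   Assumes a 32-bpp mode; error handling trimmed for brevity. */
#include <fcntl.h>
#include <linux/fb.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/fb0", O_RDWR);
    struct fb_var_screeninfo vinfo;
    ioctl(fd, FBIOGET_VSCREENINFO, &vinfo);   /* query resolution and depth */

    size_t len = (size_t)vinfo.yres * vinfo.xres * (vinfo.bits_per_pixel / 8);
    uint32_t *fb = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Writes land in whatever memory the driver exposes: usually system RAM,
       possibly GPU memory if the driver chose to map it. */
    for (size_t x = 0; x < vinfo.xres; x++)
        fb[x] = 0x00FF0000;

    munmap(fb, len);
    close(fd);
    return 0;
}
```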

Related

Where is disk scheduling implemented?

I'm currently learning the disk-scheduling part of operating systems, and I understand the various algorithms for it, like FCFS, LIFO, SSTF, SCAN and so on. But I was wondering where these algorithms are implemented.
I don't think the operating system is the answer, because the OS can't know the details of the I/O devices. So are they implemented on the devices themselves? Could anyone clarify this for me? Any related literature or links would be appreciated.
The simple answer is that these days, this all takes place in the drive controller.
In ye olde days, operating systems usually implemented disk I/O in two layers. At the top was a drive-independent logical layer, which viewed the drive as an array of blocks. Below this was a physical layer, which viewed disks as platters, tracks, and sectors. Because the physical details varied among drives, the physical layer was usually implemented in a device driver specific to the disk (or class of disks).
In these dark times, you often had to wait for your drive vendor to create a new device driver before you could upgrade your operating system.
In the mid-1980s it started to become common for disk drives to provide a logical I/O interface. The device driver no longer saw disks/platters/sectors; instead, it just saw an array of logical blocks. The drive took care of physical locations and of redirecting bad blocks (tasks the operating system used to handle). This allowed a single device driver to manage multiple types of devices that shared the same interface and differed only in the number of logical blocks.
These days, you'd be hard pressed to find a disk drive that does not provide a logical interface.
All the scheduling algorithms that involve physical locations have to take place within the disk drive.
Unless you are doing disk drive engineering, such scheduling algorithms are quite meaningless. If you are learning hard drive engineering, expect that occupation to disappear soon.
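For intuition only, here is a toy sketch of the SCAN ("elevator") idea from the question: sort the pending block numbers, sweep upward from the current head position, then sweep back down for the rest. Real drives do this internally over physical geometry; the request numbers here are made up.

```c
/* Toy SCAN ("elevator") pass over pending requests. Purely didactic:
   modern drives reorder requests inside the controller. */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(void)
{
    int head = 50;                            /* current head position */
    int req[] = { 95, 12, 58, 33, 70, 5 };    /* pending block numbers */
    int n = sizeof req / sizeof req[0];

    qsort(req, n, sizeof req[0], cmp);

    for (int i = 0; i < n; i++)               /* sweep up from the head... */
        if (req[i] >= head)
            printf("service %d\n", req[i]);

    for (int i = n - 1; i >= 0; i--)          /* ...then back down */
        if (req[i] < head)
            printf("service %d\n", req[i]);

    return 0;
}
```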
In practice, disk scheduling (in the sense of, e.g., reordering the pending disk reads to minimize rotational delay) is less important today than it was in the 20th century:
- hard disks are probably less used, in favor of SSDs, and even those are slow relative to fast RAM access times;
- the disk sectors as seen by the kernel have been reorganized by the disk controller itself, so CHS addressing (as seen by the OS kernel) does not correspond to the geometrical reality;
- the hard disk drive is smarter today, and its internal controller has significant memory and computing capabilities; the SATA protocol has some "higher-level" requests (e.g. TRIM). Read about SMART and hybrid drives.
However, application code can give hints to the operating system about its access patterns. Look, for example, at posix_fadvise(2).
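A minimal sketch of such a hint (the file name is hypothetical; the call is advisory and the kernel may ignore it):

```c
/* Hint that we will read the whole file sequentially, so the kernel
   can read ahead aggressively. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);      /* hypothetical file */
    if (fd < 0)
        return 1;

    /* offset 0, len 0 means "the whole file" */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ... read the file ... */
    close(fd);
    return 0;
}
```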
Read also Operating Systems: Three Easy Pieces

Using memory mapped file with OpenCL

I access a file on disk using memory-mapped I/O (the mmap call on Linux).
Is it possible to pass this virtual memory buffer to OpenCL using CL_MEM_USE_HOST_PTR (for reading only)? And could this result in performance gains?
I want to avoid copying an entire file into host memory, and instead let the OpenCL kernel control which parts of the file get loaded/buffered by the operating system.
I think this should work: you shouldn't end up with errors, crashes, or incorrect results. Whether or not it brings performance gains probably depends on the hardware, the driver/CL implementation, and your access patterns. I would not be surprised if it didn't make much of a difference in many cases. I could imagine the GPU driver prefaulting and wiring down all the pages in order to map them into the GPU's address space.
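As a sketch of what that might look like, assuming you already have a cl_context (error handling trimmed; whether zero-copy actually happens is implementation-defined):

```c
/* Wrap an mmap'ed file in an OpenCL buffer via CL_MEM_USE_HOST_PTR. */
#include <CL/cl.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

cl_mem wrap_mapped_file(cl_context ctx, const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    *len = (size_t)st.st_size;

    void *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                /* the mapping stays valid */

    cl_int err;
    /* CL_MEM_USE_HOST_PTR asks the implementation to use our pages,
       but it is still allowed to make an internal copy. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                *len, p, &err);
    return err == CL_SUCCESS ? buf : NULL;
}
```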

Large physically contiguous memory area

For my M.Sc. thesis, I have to reverse-engineer the hash function Intel uses inside its CPUs to spread data among Last Level Cache slices in Sandy Bridge and newer generations. To this end, I am developing an application in Linux which needs a physically contiguous memory area in order to run my tests. The idea is to read data from this area so that it gets cached, probe whether older data have been evicted (through delay measurements or LLC miss counters) in order to find colliding memory addresses, and finally discover the hash function by comparing these colliding addresses.
The same procedure has already been used in Windows by a researcher, and proved to work.
To do this, I need to allocate an area that must be large (64 MB or more) and fully cacheable, i.e. without DMA-friendly uncached page attributes. How can I perform this allocation?
To have full control over the allocation (i.e., for it to be really physically contiguous), my idea was to write a Linux kernel module, export a device, and mmap() it from userspace, but I do not know how to allocate that much contiguous memory inside the kernel.
I have heard about the Linux Contiguous Memory Allocator (CMA), but I don't know how it works.
Applications don't see physical memory; a process has some address space in virtual memory. Read about the MMU (what is contiguous in virtual space might not really be physically contiguous, and vice versa).
You might perhaps want to lock some memory using mlock(2).
But your application will be scheduled, and other processes (or scheduled tasks) could dirty your CPU cache. See also sched_setaffinity(2).
(And even kernel code might perhaps be preempted.)
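A minimal sketch combining those two hints, mlock(2) to keep the buffer resident and sched_setaffinity(2) to stay on one core (Linux-specific; the 64 MB figure is just the size from the question):

```c
#define _GNU_SOURCE                   /* for sched_setaffinity */
#include <sched.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * 1024 * 1024;
    void *buf = malloc(len);

    /* Keep the pages resident (needs privilege or a high RLIMIT_MEMLOCK).
       Note this does NOT make them physically contiguous. */
    mlock(buf, len);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                 /* pin ourselves to CPU 0 so other
                                         cores disturb our cache less */
    sched_setaffinity(0, sizeof(set), &set);

    /* ... run the cache-probing experiments on buf ... */
    munlock(buf, len);
    free(buf);
    return 0;
}
```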
This page on Kernel Newbies has some ideas about memory allocation. But the max for get_free_pages looks like 8 MiB (perhaps that's a compile-time constraint?).
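For what it's worth, a kernel-module allocation along those lines might look like the sketch below, using alloc_pages() up to the page-order limit; a larger area would need CMA (e.g. via dma_alloc_coherent()), which is not shown here:

```c
/* Sketch of a physically contiguous kernel allocation with alloc_pages().
   The order is limited by MAX_ORDER; anything bigger needs CMA. */
#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *pages;
static const unsigned int order = 10;       /* 2^10 pages * 4 KiB = 4 MiB */

static int __init contig_init(void)
{
    pages = alloc_pages(GFP_KERNEL, order); /* physically contiguous */
    if (!pages)
        return -ENOMEM;
    pr_info("contig: pfn of first page = %lx\n", page_to_pfn(pages));
    return 0;
}

static void __exit contig_exit(void)
{
    __free_pages(pages, order);
}

module_init(contig_init);
module_exit(contig_exit);
MODULE_LICENSE("GPL");
```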
Since this would be all-custom, you could explore the mem= boot parameter of the Linux kernel. This limits the amount of memory the kernel uses, and you can party all over the remaining memory without anyone knowing. Heck, if you boot up a busybox system, you could probably do mem=32M, but even mem=256M should work if you're not booting a GUI.
You will also want to look into the Offline Scheduler. It "unplugs" a CPU from Linux so you can have full control over ALL code running on it. (Some parts of this are already in the mainline kernel, and maybe all of it is.)

Direct CPU Threads or OpenCL

I have searched the various questions (and the web) but did not find any satisfactory answer.
I am curious about whether to use threads to directly load the cores of the CPU, or to use an OpenCL implementation. Is OpenCL just there to make multi-core code more portable, meaning it can be ported to either the GPU or the CPU, or is OpenCL faster and more efficient? I am aware that GPUs have more processing units, but that is not the question. Is it better to multithread directly in code or to use OpenCL?
Sorry, I have another question...
If the IGP shares PCIe lanes with the discrete graphics card and its drivers cannot be loaded under Windows 7, I have to assume that it will not be available, even if you only want to use the processing cores of the integrated GPU. Is this correct, or is there a way to access the IGP without drivers?
EDIT: As @Yann Vernier points out in the comment section, I haven't been strict enough with the terms I used. So in this post I use the term "thread" as a synonym for "work-item"; I'm not referring to CPU threads.
I can't really compare OCL with other technologies that allow using the different cores of a CPU, as I have only used OCL so far.
However, I might bring some input about OCL, especially since I don't really agree with ScottD.
First of all, even though an OCL kernel developed to run on a GPU will also run on a CPU, that doesn't mean it will be efficient. The reason is simply that OCL doesn't work the same way on a CPU as on a GPU. To get a good understanding of how they differ, see chapter 6 of "Heterogeneous Computing with OpenCL". To summarize: while the GPU launches a bunch of threads within a given work-group at the same time, the CPU executes the threads of a work-group one after another on a core. See as well section 3.4 of the standard about the two different types of programming models supported by OCL. This can explain why an OCL kernel may be less efficient on a CPU than "classic" code: because it was designed for a GPU. Whether a developer should target the CPU or the GPU is not a question of "serious work" but simply depends on which programming model best suits your needs. Also, the fact that OCL supports the CPU as well is nice, since code can degrade gracefully on a computer not equipped with a proper GPU (though it must be hard to find such a computer).
Regarding the AMD platform, I've noticed some problems with the CPU as well, on a laptop with an ATI GPU. I observed low performance on some of my code, and crashes as well. The reason was that the processor was an Intel one: the AMD platform will declare a CPU device as available even if it is an Intel CPU, but it won't be able to use it as efficiently as it should. When I ran the exact same code targeting the CPU after installing (and using) the Intel platform, all the issues were gone. That's another possible reason for poor performance.
Regarding the iGPU: it does not share PCIe lanes, it is on the CPU die (at least for Intel), and yes, you need the driver to use it. I assume that you tried to install the driver and got a message like "your computer does not meet the minimum requirements..." or something similar. I guess it depends on the computer, but in my case I have a desktop equipped with an NVIDIA card and an i7 CPU (which has an HD4000 GPU). In order to use the iGPU, I first had to enable it in the BIOS, which then allowed me to install the driver. Of course only one of the two GPUs drives the display at a time (depending on the BIOS setting), but I can access both with OCL.
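If it helps, here is a small sketch that enumerates platforms and their devices, which makes situations like the AMD-platform-exposing-an-Intel-CPU case visible (error handling trimmed):

```c
/* List OpenCL platforms and their CPU/GPU devices. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id plats[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, plats, &nplat);

    for (cl_uint i = 0; i < nplat; i++) {
        char name[256];
        clGetPlatformInfo(plats[i], CL_PLATFORM_NAME, sizeof name, name, NULL);

        cl_device_id devs[8];
        cl_uint ndev = 0;
        clGetDeviceIDs(plats[i], CL_DEVICE_TYPE_ALL, 8, devs, &ndev);
        printf("platform %s: %u device(s)\n", name, ndev);

        for (cl_uint j = 0; j < ndev; j++) {
            char dname[256];
            cl_device_type type;
            clGetDeviceInfo(devs[j], CL_DEVICE_NAME, sizeof dname, dname, NULL);
            clGetDeviceInfo(devs[j], CL_DEVICE_TYPE, sizeof type, &type, NULL);
            printf("  %s (%s)\n", dname,
                   (type & CL_DEVICE_TYPE_GPU) ? "GPU" : "CPU/other");
        }
    }
    return 0;
}
```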
In recent experiments using the Intel OpenCL tools, we found that OpenCL performance was very similar to that of CUDA and of intrinsics-based AVX code under gcc and icc -- way better than in earlier experiments (some years ago) where we saw OpenCL perform worse.

Single common address space for all tasks

How can a single common address space be given to all tasks? If all tasks share a common address space, can we avoid virtual-to-physical memory mapping?
There are a few modern (research) OSes that do this, like Singularity, and there are performance benefits, primarily because it no longer needs to do context switches and the file/symbol loader no longer needs to do address translation for global caches and kernel functions.
You do need to be a bit more specific about what you're looking for, though. You tagged your post OSX and Linux, both of which require virtual memory. When running on systems without an MMU (and thus no virtual memory), the OS emulates it, which I'm fairly certain you can't circumvent. I'm not an expert by any means.
uClinux is an implementation of Linux that runs on processors lacking an MMU (such as the ARM7), and so by definition it must have a single address space for all tasks.
So one answer to "how" is "use uClinux".
You tagged this VxWorks, and there is another answer: VxWorks supports a flat memory model. In fact, when I last used it, MMU protection was an (expensive) add-on. Many other RTOSes designed for microcontrollers similarly do not support an MMU, such as eCos and FreeRTOS.
Of the RTOSes that do support an MMU, QNX is probably among the most robust and mature, while still maintaining high performance.
I'm not sure why you would want to disable virtual memory mapping: it's a built-in function of the CPU, and pretty much essential when running an OS to properly isolate processes from each other.
Most operating systems allow you to disable virtual memory, so that your memory capacity is limited by physical memory. However, a process's address space is still virtual, and virtual-to-physical mapping is still happening.
A way to get what you want is to run an operating system that executes in real mode, such as DOS or Windows 3.0, or to write your own.
The advantages of virtual memory far outweigh the disadvantages. Why do you want to avoid it?
This is how some older operating systems, and even some modern operating systems that lack VM, still work. It has many disadvantages for things like desktop and server applications, but it can be useful in an embedded and/or real-time context, or where you have minimal hardware.
VxWorks AE (Advanced Edition) deviates from the concept of a common address space for all tasks, so it can effectively be used both in systems with an MMU and in systems without one. The common address space for all tasks is called the flat memory model, and separate address spaces for different tasks are called the overlapped (or segmented) memory model. You should not confuse the memory model with the memory layout seen in object files, which divides data into a code segment, data segment, BSS, etc. The two are entirely different things :).
This Stack Overflow question may help:
Difference between flat memory model and protected memory model?
