Is there a way to independently task and use heterogenous multi gpus in a windows 7 system? - graphics

Can I have two mixed chipset/generation AMD gpus in my desktop; a 6950 and 4870, and dedicate one gpu (4870) for opencl/gpgpu purposes only, eliminating the device from video output or display driving consideration by the OS, allowing the 4870 to essentially remain in a deep sleep or appear ejected/disabled until it's stream processors are called upon?
Compared to the 4870, the 6950 is a heavyweight in opencl calculations; enough so that it can crunch numbers and still allow an active user session, and even web browsing. HOWEVER, as soon as I navigate to a webpage with embedded flash video, forget what I have running and open media player or media center- basically any gpu-accelerated video task that requires the 6950 to initialize UVD, the display system hangs.
I'm looking for a way to plug my 4870 in an open pcie slot, have it sit in a dormant state with near-0 heat production and power consumption (essentially only maintain the interface signalling, like an ethernet card in a powered-off desktop maintaining the line and waiting for a WOL command), and attain a D0 state (I don't even care if the latency of this wake event is on the scale of seconds) to then run opencl calculations ON ITS OWN. I do not wish to achieve a non-CF heterogeneous gpu teaming setup! In my example of a UVD hung situation I would see manually stopping the opencl calculations on the 6950, beginning those calculations then on the 4870 to free up the 6950 for multimedia usage/gaming as my desire outcome (granted, with a hit to the calculation rate). Even better if the two gpus could independently run similar calculations while no one is using the desktop. I don't even mind if I have to initiate the power-state transitions of the 4870 from/into an 'OFF' state (say, by a shortcut on the desktop), as long as it doesn't require a system restart, ending the user session and logging off... and the manual ON/OFF 'switch' for the 4870 is something any proficient windows end-user could do- like click a shortcut to run a script, or even go into device manage and toggle enable/disable. As long as the 4870 isn't wastefully idling by for 1 sole use that may occur sporadically.
I couldn't think of a solution to facilitate this function besides writing a new ini for the 4870 to override the typical power management characteristics written for usage of the device as a typical graphics card (say to drop in/out of powered state w/o relinquishing irq or other allocated resources to 'hold the door open' on interface availability and addressing). But that is an endeavor that is both well above my abilities, and I easily see an additional involvement of licensing being necessitated to achieve.

Windows 7 (and maybe windows 10) doesn't define a "selected device". It's softwares' own responsibility to pick the right device. For example, google chrome's add-on software(for video decode) will pick whatever gpu comes as first target defined in it. If it is written to pick first-indexed device, then it needs a pci-e re-plug of both cards to switch their roles.
This OS written to fit for majority(%99) of users, not for multi-gpu users(%1 ?). It simply picks one of gpus or software has explicit control over devices and simply benchmarks all gpus and picks fastest. So you should look for software's abilities instead of OS.
Same thing goes for games too! When I play dota-2 on vulkan api, it uses HD7870 for compute(of textures, particles, etc..) but uses R7-240 for graphics! But I prefer the opposite because R7-240 can't draw fast. Because this game is written for majority of people who don't have more than 1 gpu.
Money controls development I'm sorry for this. Then, market-penetration is needed for money. %99 market penetration needs writing code for public, not scientific guys or rich ones. Public has simply 1 gpu and that is a cheap one.
I wish this:
select 1 gpu for: unzipping files, wathing videos, compressing internet uploads and caching for file system(up to 2GB)
select another gpu for: gaming, opencl applications, mining, ..
select all gpus for: games, benchmarks, seen as single device by my applications,..
but is not guaranteed to become true because money still talks.
If I were a driver developer, I would add a "rename" option(and become poor in return) to give N virtual devices to OS, so OS and other software will be able to gain only 1/N 'th power of whole system or N/N by just using those renames or main devices. A rename could be a single compute unit of a gpu. When OS tells drivers "give me %25 of all cores that share same memory" so it pick a device and gives %25 of total cores of system. So even users could create renames for their own work.
I even sent a message to microsoft for "file system cache on my 2nd graphics card" but they did not return!

Related

What causes dma_map_page/dma_unmap_page to take longer time on some hardware?

I've been programming a Linux kernel module for several years for a PCIe device. One of the main feature is to transfer data from the PCIe card to the host memory using DMA.
I'm using streaming DMA, i.e. it's the user program that allocates the memory, and my kernel module has to do the job of locking the pages and creating the scatter gather structure. It works correctly.
However, when used on some more recent hardware with Intel processors, the function calls dma_map_page and dma_unmap_page are taking much longer time to execute.
I've tried to use dma_map_sg and dma_unmap_sg, it takes approximately the same longer-time.
I've tried to split the dma_unmap_sg into a first call to dma_sync_sg_for_cpu, followed by the call to dma_unmap_sg_attrs with attribute DMA_ATTR_SKIP_CPU_SYNC. It works correctly. And I can see the additional time is spend on the unmap operation, not on the sync.
I've tried to play with the linux command line parameters relating to the iommu (on, force, strict=0), and also intel_iommu, with no change in the behavior.
Some other hardware show a decent transfer rate, i.e. more than 6GB/s on PCIe3x8 (max 8GB/s).
The issue on some recent hardware is limiting transfer rate to ~3GB/s (I've checked that the card is correctly configured for PCIe3x8, and the programmer of the Windows device driver manages to achieve the 6GB/s on the same system. Things are more behind the curtains in Windows and I cannot get much information from him.)
On some hardware, the behavior is either normal or slowed, depending on the Linux distribution (and the Linux kernel version I guess). On some other hardware, the roles are reversed, i.e. the slow one becomes the fast one and vice-versa.
I cannot figure out the cause of this. Any clue?
The trouble was the bounce buffers. Didn't know about this.

measure power consumption of microcontroller (low power mode) in order to optimize my code

I'm developing code for a microcontroller (specifically MSP432P401R LaunchPad).
I'd like to measure the power consumption of the microcontroller while running my code, in order to optimize it, expecially in low power mode.
The dev board is connected through USB to my computer.
Is it possible? Do I need some special instruments? I have an oscilloscope. I've read about current probes for oscilloscopes, but they seems very expensive.
Is there any other way?
The mcu I bought has 80 µA/MHz current consumption in low power mode. Is there a way to measure such low level of current?
Thank you.
The MSP432P401R LaunchPad has built in EnergyTrace capabilities. These can be accessed via TI's Code Composer Studio IDE. Check the user guide for the launchpad board (SLAU597B) for details of how to enable this. You can get rather detailed information about energy use including relating usage back to code. I haven't explored these features deeply, but it has the distinct advantage of not needing any additional equipment.
Otherwise, yes you can measure such low current consumption with conventional instruments, but it must be done "very delicately". As a software guy, it's beyond my abilities and when I have done it, it was always with a good EE at my side on a custom board that was specifically designed to be able to measure such things and electrically isolate different parts of the system.
Also, one minor point. The 80 uA/MHz current consumption is for active mode processing. In low power sleep modes, standby current consumption can drop to just a few microAmps and in the very lowest power modes even to the nanoAmp range. The data sheet is a wealth of confusing information about this.

What bluetooth to use (2.1 or 4.0) and how?

The title seems to be too general (I can't think of a good title). I'll try to be specific in the question description.
I was required to make an industrial control box that collects data periodically (maybe 10-20 bytes of data per 5 seconds). The operator will use a laptop or mobile phone to collect the data (without opening the box) via Bluetooth, weekly or monthly or probably at even longer period.
I will be in charge of selecting proper modules/chips, doing PCB, and doing the embedded software too. Because the box is not produced in high volume, so I have freedom (modules/chips to use, prices, capabilities, etc.) in designing different components.
The whole application requires an USART port to read in data when available (maybe every 5-10 seconds), a SPI port for data storage (SD Card reader/writer), several GPIO pins for LED indicator or maybe buttons (whether we need buttons and how many is up to my design).
I am new to Bluetooth. I read through wiki and some googling pages. Knowing about the pairing, knowing about the class 1 and class 2 differences, knowing about the 2.1 and 4.0 differences.
But I still have quite several places not clear to decide what Bluetooth module/chip to use.
A friend of mine mentioned TI CC2540 to me. I checked and it only supports 4.0 BLE mode. And from Google, BT4.0 has payload of at most 20 bytes. Is BT4.0 suitable for my application when bulk data will need to be collected every month or several months? Or it's better to use BT2.1 with EDR for this application? BT4.0 BLE mode seems to have faster pairing speed but lower throughput?
I read through CC2540, and found that it is not a BT only chip, it has several GPIO pins and uart pins (I am not sure about SPI). Can I say that CC2540 itself is powerful enough to hold the whole application? Including bluetooth, data receiving via UART, and SD card reading/writing?
My original design was to use an ARM cortex-M/AVR32 MCU. The program is just a loop to serve each task/events in rounds (or I can even install Linux). There will be a Bluetooth module. The module will automatically take care of pairing. I will only need to send the module what data to send to the other end. Of course there might be some other controlling, such as to turn the module to low power mode because the Bluetooth will only be used once per month or something like that. But after some study of Bluetooth, I am not sure whether such BT module exists or not. Is programming chips like CC2540 a must?
In my understanding, my designed device will be a BT slave, the laptop/phone will be the master. My device will periodically probe (maybe with longer period to save power) the existence of master and pair with it. Once it's paired, it will start sending data. Is my understanding on the procedure correct? Is there any difference in pairing/data sending for 2.1 and 4.0?
How should authentication be designed? I of course want unlimited phones/laptops to pair with the device, but only if they can prove they are the operator.
It's a bit messy. I will be appreciated if you have read through the above questions. The following is the summary,
2.1 or 4.0 to use?
Which one is the better choice? Meaning that suitable for the application, and easy to develop.
ARM/avr32 + CC2540 (or the like)
CC2540 only or the like (if possible)
ARM/avr32 + some BT modules ( such as Bluegiga https://www.bluegiga.com/en-US/products/ )
Should Linux be used?
How the pairing and data sending should be like for power saving? Are buttons useful to facilitate the sleep mode and active pairing and data sending mode for power saving?
How the authentication should be done? Only operators are allowed but he can use any laptops/phones.
Thanks for reading.
My two cents (I've done a fair amount of Bluetooth work, and I've designed consumer products currently in the field)... Please note that I'll be glossing over A LOT of design decisions to keep this 'concise'.
2.1 or 4.0 to use?
Looking over your estimates, it sounds like you're looking at about 2MB of data per week, maybe 8MB per month. The question of what technology to use here comes down to how long people are willing to wait to collect data.
For BLE (BT 4.0) - assume your data transfer is in the 2-4kB/s range. For 2.1, assume it's in the 15-30kB range. This depends on a ton of factors, but has generally been my experience between Android/iOS and BLE devices.
At 2MB, either one of those is going to take a long time to transfer. Is it possible to get data more frequently? Or maybe have a wifi-connected base station somewhere that collects data frequently? (I'm not sure of the application)
Which one is the better choice? Meaning that suitable for the
application, and easy to develop. ARM/avr32 + CC2540 (or the like)
CC2540 only or the like (if possible) ARM/avr32 + some BT modules (
such as Bluegiga https://www.bluegiga.com/en-US/products/ ) Should
Linux be used?
This is a tricky question to answer without knowing more about the complexity of the system, the sensors, the data storage capacity, BOM costs, etc...
In my experience, a Linux OS is HUGE overkill for a simple GPIO/UART/I2C based system (unless you're super comfortable with Linux). Chips that can run Linux, and add the additional RAM are usually costly (e.g. a cheap ARM Cortex M0 is like 50 cents in decent volume and sounds like all you need to use).
The question usually comes down to 'external MCU or not' - as in, do you try to get an all-in-one BT module, which has application space to program on. This is a size and cost savings to using it, but it adds risks and unknowns vs a braindead BT module + external MCU.
One thing I would like to clarify is that you mention the TI CC2540 a few times (actually, CC2541 is the newer version). Anyways, that chip is an IC level component. Unless you want to do the antenna design and put it through FCC intentional radiator certification (the certs are between 1k-10k usually - and I'm assuming you're in the US when I say FCC).
What I think you're looking for is a ready-made, certified module. For example, the BLE113 from Bluegiga is an FCC certified module which contains the CC2541 internally (plus some bells and whistles). That one specifically also has an interpreted language called BGScript to speed up development. It's great for very simple applications - and has nice, baked in low-power modes.
So, the BLE113 is an example of a module that can be used with/without an external MCU depending on application complexity.
If you're willing to go through FCC intentional radiator certification, then the TI CC2541 is common, as well as the Nordic NRF51822 (this chip has a built in ARM core that you can program on as well, so you don't need an external MCU).
An example of a BLE module which requires an external MCU would be the Bobcats (AMS001) from AckMe. They have a BLE stack running on the chip which communicates to an external MCU using UART.
As with a comment above, if you need iOS compatibility, using a Bluetooth 2.1 (BT Classic) is a huge pain due to the MFI program (which I have gone through - pure misery). So, unless REALLY necessary, I would stick with BT Classic and Android/PC.
Some sample BT classic chips might be the Roving Networks RN42, the AmpedRF BT 33 (or 43 or 53). If you're interested, I did a throughput test on iOS devices with a Bluetooth classic device (https://stackoverflow.com/a/22441949/992509)
How the pairing and data sending should be like for power saving? Are
buttons useful to facilitate the sleep mode and active pairing and
data sending mode for power saving?
If the radio is only on every week or month when the operator is pulling data down, there's very little to do other than to put the BT module into reset mode to ensure there is no power used. BT Classic tends to use more power during transfer, but if you're streaming data constantly, the differences can be minimal if you choose the right modules (e.g. lower throughput on BLE for a longer period of time, vs higher throughput on BT 2.1 for less time - it works itself out in the wash).
The one thing I would do here is to have a button trigger the ability to pair to the BT modules, because that way they're not always on and advertising, and are just asleep or in reset.
How the authentication should be done? Only operators are allowed but
he can use any laptops/phones.
Again, not knowing the environment, this can be tricky. If it's in a secure environment, that should be enough (e.g. behind locked doors that you need to be inside to press a button to wake up the BT module).
Otherwise, at a bare minimum, have the standard BT pairing code enabled, rather than allowing anyone to pair. Additionally, you could bake extra authentication and security in (e.g. you can't download data without a passcode or something).
If you wanted to get really crazy, you could encrypt the data (this might involve using Linux though) and make sure only trusted people had decryption keys.
We use both protocols on different products.
Bluetooth 4.0 BLE aka Smart
low battery consumption
low data rates (I came up to 20 bytes each 40 ms. As I remember apple's minimum interval is 18 ms and other handset makers adapted that interval)
you have to use Bluetooth's characteristics mechanism
you have to implement chaining if your data packages are longer
great distances 20-100m
new technology with a lot of awful premature implementations. Getting better slowly.
we used a chip from Bluegiga that allowed a script language for programming. But still many limitations and bugs are build in.
we had a greater learning curve to implement BLE than using 2.1
Bluetooth 2.1
good for high data rates depending on used baud rate. The bottle neck here was the buffer in the controller.
weak distances 2-10 m
Its much easier to stream data
Did not notice a big time difference in pairing and connecting with both technologies.
Here are two examples of devices which clearly demand either 2.1 or BLE. Maybe your use case is closer to one of those examples:
Humidity sensors attached to trees in a Forrest. Each week the ranger walks through the Forrest and collects the data.
Wireless stereo headsets

Can a high frequency hardware input be accurately read under a windows or windows CE environment?

This question straddles the blurry line between software and hardware so I hope the community feels it belongs on stackoverflow otherwise I would appreciate some advice as to where to move it.
My team is currently in the process of developing a robotic controller and although most of our I/O requirements are pretty tame we have one specific input that occasionally triggers at a 2 microsecond interval (Either a digital rise or fall). We need to be able to accurately timestamp that input and record it. We do not need to act on it in any sort of immediate way. We feel as if we should be able to take advantage of a high speed USB or PCI-E bus in order to facilitate this kind of high speed input but we lack expertise in this area and do not know what a solution to this problem would look like or if it even can be achieved without expensive/very invovled proprietary solutions. In addition there is a desire in our group to move away from Windows CE to a Windows Embedded (probably XP) x86 based platform however if this is something that could only be achieved on a CE system that would be important to know.
In summary:
Is there a way to read and timestamp accurately a 2 microsecond (500 KHz) input under a windows or windows CE environment?
Bonus points for implementation ideas/visions/methods!
Thank you!
I wouldnt try to do it directly in windows? What GPIO do you have anyway on your windows machine? For around or under $20 there are a long list of solutions that are either serial or usb based (of the shelf microcontroller boards) which you could program to have their only job to be to watch that I/O and time stamp it and then casually send this data over usb or serial (or other) to the host.
I am thinking along the lines of the msp430 launchpad, or stellaris lauchpad, or stm32f0 discovery or stm32f4 discovery, the family of arduinos, mbed, leaf labs maple mini, and on and on.
To do it properly under windows you would need to add some hardware anyway...

Graphics development on ARM

I am planning to make a small OS and run a Tetris clone on it using an ARM Cortex-M3. Unfortunately, I am not able to buy any development boards as of now, so I will have to use simulators.
I have actually looked into QEMU which has LM3S6965EVB support, which contains an ARM Cortex-M3 processor. But apparently newer revisions of the board are not compatible with the model in QEMU as none of the examples I have downloaded from TI seem to work. Even the OLED display is different.
Another problem is to do graphics development as the OLED display for LM3S6965EVB has a really low resolution. I was able to get it up to 640x480 by editing the QEMU source but as I could not get any examples to work, so I don't know if it works either. Using the debug parameters for SSD0323, all I can see is that it accepts some of the data that is sent to initialize the device, then hangs...
I have considered choosing another board in QEMU but that would mean redoing many things from scratch when I get my hands on a real device, as the other ones are too powerful for something as simple as this.
What should I do? Are there any other simulators out there that can help me accomplish what I am trying to do? I want to develop a small OS and some small games.
Thanks in advance. I have been searching for a solution for days and I am really stuck.
How much you have to redo, in part, has to do with your software/system engineering, you can abstract where needed and only have to re-write the abstraction layer not the entire package. Actually you can do much of your software design/testing on your host system and never cross compile, only later cross compile to a simulator or real hardware.
For example, I assume you would construct the next video screen somewhere in ram then depending on the hardware change some bits in a register and page flip or have to do a copy from this frame buffer to the video/lcd in whatever form it wants. Using thumbulator you could build your screen in ram somewhere (in the simulated memory space) then add to the simulator when the simulation writes to such and such register take these bytes from ram and display them on the host computer (running the simulation) basically simulate some hardware. use sdl or basic X calls or whatever you prefer. I normally take snapshots to .bmp files (very easy to write) then look at them later.
Later, on hardware your abstracted update_screen() function would have hardware specific code to display that screen.
thumbulator only runs thumb instructions not ARM and not Thumb2, thumb being the common denominator between the arm processors (ARMv4T and newer except for cortex-m) and those that support thumb2 extensions (cortex-m). other than startup code the compiling and programming experience is the same across the arm family. the code (other than startup code and of course hardware specific accesses) will run across the arm family as well as the simulator. If you go to a cortex-m then adding an architecture specification to the command line will change the build from thumb only to thumb+thumb2 instructions giving you some performance boost. if you surf around my other projects on github you will find this idea reapeated over and over again, I have many simple cortex-m examples where I use gcc and llvm and build the same .C code with thumb instructions and thumb+thumb2.
Another completely different answer is get a GBA (Nintendo Game Boy Advance). You can get a GBA SP (has a backlit display, makes the whole experience better) for about $30 or so on ebay. You can buy flash cartridges that take sd cards for about the same amount. It has an ARM7TDMI, it runs thumb code much faster than ARM code, giving you that thumb experience in preparation for other/newer cores like the cortex-m. For another $30 you can get a game link cable, chop it up, attach a rs232 level shifter (I can talk you through all of this), and make a gba serial cable. My preferred setup is to have a flash cartridge that I have pre-programmed with a serial bootloader, I download the program over serial into ram then run from ram. This avoids having to yank the flash cartridge and/or sd card every time you re-compile the program. doable, and a cheaper solution but gets tiring fast.
If you have a Nintendo DS for $12 to $15 you can get an sd based flash cartridge that you can likewise use for development. I recommend learning the gba first, which you can do on the NDS if you buy a gba side memory cartridge (need a ds lite not an ndsi nor 3d) supported by the software on the cartridge. (the ez flash 3 in 1 gba size for example is a good one, as well as memory you can flash that one with the nds and carry it over to the gba (this is how I put my serial bootloader on it)). these loaders will let you put your .gba file on the nds cartridge sd card then load it into the gba cartridge and it switches the nds into gba mode and runs as a gba.
there are lots of other solutions, sparkfun.com likely has a number of arm based boards that can drive lcds and/or come with lcds. You can go to earthlcd and get one of the serial based lcd panels that make for rapid development, later of course a cheaper solution is desired. Along the same lines you could instead simulate an earthlcd like thing using your host computer have the embedded microcontroller send screen updates over serial to the host and the host displays the graphics. Later replace that screen update with something else.
This latter solution, for about $20 you can get a stm32f4 discovery board, has a cortex-m4, runs up to 168MHz, has a number of serial ports of which at least two have pins not being used by something else you could easily have one port for debug messages and the other for this virtual serial screen. In the stm32f4 directory in my stm32vld repo on github I have a number of getting started examples for using that board (as well as the stm32vld which is a few bucks cheaper but not as powerful as this stm32f4). Likewise your host application can take keystrokes and turn them into user control/game control commands back into the game software on the microcontroller.
There is of course the beagleboard or hawkboard or raspberri pi when it comes out, or open-rd (I dont like the plug computer but do like the open rd) which have video processing and video output direct to a monitor and/or tv using composite or whatever. About $150 to $200 and it just works run with it. You definitely dont need to run linux on these platforms, you can make your own os or whatever you like and run that, very simple.
There are more solutions than you probably have time and/or money to pursue you need to find one that fits within your comfort or happyness zone for how you like to do development and try that path.

Resources