Windows Advanced Rasterization Platform (WARP) supports different feature levels depending on the version of the DirectX API that is installed:
feature levels 9_1, 9_2, 9_3, 10_0, and 10_1 when Direct3D 11 is installed
all above feature levels plus 11_0 when Direct3D 11.1 is installed on Windows 7
all above feature levels plus 11_1 when Direct3D 11.1 is installed on Windows 8
How can I easily determine what feature level is available via WARP? I know for the hardware device I can run ID3D11Device::GetFeatureLevel, but I don't see an equivalent for WARP.
Use the code from Anatomy of Direct3D 11 Create Device but use the WARP device type instead.
D3D_FEATURE_LEVEL lvl[] = {
    D3D_FEATURE_LEVEL_11_1, D3D_FEATURE_LEVEL_11_0,
    D3D_FEATURE_LEVEL_10_1, D3D_FEATURE_LEVEL_10_0 };

DWORD createDeviceFlags = 0;
#ifdef _DEBUG
createDeviceFlags |= D3D11_CREATE_DEVICE_DEBUG;
#endif

ID3D11Device* pDevice = nullptr;
ID3D11DeviceContext* pContext = nullptr;
D3D_FEATURE_LEVEL fl;
HRESULT hr = D3D11CreateDevice( nullptr, D3D_DRIVER_TYPE_WARP, nullptr,
                                createDeviceFlags, lvl, _countof(lvl),
                                D3D11_SDK_VERSION, &pDevice, &fl, &pContext );
if ( hr == E_INVALIDARG )
{
    // A Direct3D 11.0 runtime doesn't recognize D3D_FEATURE_LEVEL_11_1,
    // so retry without it
    hr = D3D11CreateDevice( nullptr, D3D_DRIVER_TYPE_WARP, nullptr,
                            createDeviceFlags, &lvl[1], _countof(lvl)-1,
                            D3D11_SDK_VERSION, &pDevice, &fl, &pContext );
}
if ( FAILED(hr) )
{
    // error handling
}
Then check fl to see whether it is 10.1, 11.0, or 11.1. We don't need to list the 9.1, 9.2, or 9.3 feature levels in lvl, since WARP supports at least 10.1 on Windows desktop PCs; for robustness, 10.0 is included in the array above as well.
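For example, a minimal sketch of checking the returned level (the comments just restate the feature-level list above):

if ( fl >= D3D_FEATURE_LEVEL_11_1 )
{
    // 11_1: Direct3D 11.1 installed on Windows 8
}
else if ( fl >= D3D_FEATURE_LEVEL_11_0 )
{
    // 11_0: Direct3D 11.1 installed on Windows 7
}
else
{
    // 10_1 or 10_0: Direct3D 11 WARP device
}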
I am working on a project using OpenCL. In order to improve the performance of my algorithm, is it possible to pipeline a single kernel? If a kernel consists of several steps, let's say A, B, and C, I want A to accept new data as soon as it finishes its part and passes the result to B. I can create channels between them, but I don't know how to do this in detail.
Can I write A, B, and C (3 kernels) in one .cl file? And how would I enqueueNDRange them?
I am using the Altera SDK for FPGA HPC development.
Thanks.
A pipeline can be achieved by using several kernels connected with channels. With all the kernels running concurrently, data is transferred from one to the next (see the pipeline example figure in the Intel FPGA OpenCL SDK Programming Guide).
A very basic example of such a pipeline would be:
// Enable the Altera channels extension (required before declaring channels)
#pragma OPENCL EXTENSION cl_altera_channels : enable

channel int foo_bar_channel;
channel float bar_baz_channel;

__kernel void foo(__global int* in) {
    for (int i = 0; i < 1024; ++i) {
        int value = in[i];
        value = clamp(value, 0, 255);                  // do some work
        write_channel_altera(foo_bar_channel, value);  // send data to the next kernel
    }
}

__kernel void bar() {
    for (int i = 0; i < 1024; ++i) {
        int value = read_channel_altera(foo_bar_channel);  // take data from foo
        float fvalue = (float) value;                      // do some work
        write_channel_altera(bar_baz_channel, fvalue);     // send data to the next kernel
    }
}

__kernel void baz(__global float* out) {
    for (int i = 0; i < 1024; ++i) {
        float value = read_channel_altera(bar_baz_channel);  // take data from bar
        float s = sin(value);
        out[i] = s;  // write the result at the end of the pipeline
    }
}
You can write all the kernels in a single .cl file, or put them in different files and then #include them into a main .cl file.
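For instance (just a sketch; the file names are made up):

// main.cl
#include "foo.cl"
#include "bar.cl"
#include "baz.cl"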
We want all our kernels to run concurrently so that they can accept data from each other. Since only in-order command queues are supported, we have to use a different queue for each kernel:
cl_command_queue foo_queue = clCreateCommandQueue(...);
cl_command_queue bar_queue = clCreateCommandQueue(...);
cl_command_queue baz_queue = clCreateCommandQueue(...);

clEnqueueTask(foo_queue, foo_kernel, 0, NULL, NULL);
clEnqueueTask(bar_queue, bar_kernel, 0, NULL, NULL);
clEnqueueTask(baz_queue, baz_kernel, 0, NULL, NULL);

clFinish(baz_queue); // wait for the last kernel in our pipeline
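For completeness, the kernel objects used above would have been created from the built program first; a minimal sketch (assuming program and err already exist on the host side):

cl_kernel foo_kernel = clCreateKernel(program, "foo", &err);
cl_kernel bar_kernel = clCreateKernel(program, "bar", &err);
cl_kernel baz_kernel = clCreateKernel(program, "baz", &err);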
Unlike OpenCL programming for GPUs, we rely on data pipelining here, so NDRange kernels would not give us any benefit. Single work-item kernels are used instead of NDRange kernels, which is why we enqueue them with the clEnqueueTask function. An additional kernel attribute (reqd_work_group_size) can be used to mark a kernel as single work-item, which gives the compiler some room for optimizations, as shown below.
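For example, a kernel can be declared like this (a sketch using the standard attribute syntax; the kernel name and body are just illustrative):

__attribute__((reqd_work_group_size(1, 1, 1)))
__kernel void foo(__global int* in) {
    // single work-item kernel body, as above
}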
Check the Intel FPGA SDK for OpenCL Programming Guide for more information about channels and kernel attributes (specifically, section 1.6.4 Implementing the Intel FPGA SDK for OpenCL Channels Extension):
https://www.altera.com/en_US/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf
How do I implement the operation below efficiently with the MSVC compiler?
uint32x4_t temp = { 1, 2, 3, 4 };
I have to load 4 different values into a NEON register as efficiently as possible, since I am optimizing for performance. The statement above works with Android Clang but fails with the MSVC compiler, since uint32x4_t is typedef'd to __n128.
Following is the structure of __n128:
typedef union __declspec(intrin_type) _ADVSIMD_ALIGN(8) __n128
{
    unsigned __int64 n128_u64[2];
    unsigned __int32 n128_u32[4];
    unsigned __int16 n128_u16[8];
    unsigned __int8  n128_u8[16];
    __int64 n128_i64[2];
    __int32 n128_i32[4];
    __int16 n128_i16[8];
    __int8  n128_i8[16];
    float   n128_f32[4];
    struct
    {
        __n64 low64;
        __n64 high64;
    } DUMMYNEONSTRUCT;
} __n128;
In C99, when initializing a union with an initializer list, you can use a designated initializer to specify which member is initialized, as follows:
uint32x4_t temp = { .n128_u32 = {1,2,3,4} };
However, this C99 syntax is only supported in Visual Studio 2013 and higher. Visual Studio 2012 and below do not support this feature, so you can only initialize the union with a static initializer of its first member (n128_u64). You can come up with an initializer that packs your uint32 data into uint64 constants. Since they are constants, this takes no additional execution time, but it looks really ugly:
uint32x4_t temp = { { 2ULL << 32 | 1, 4ULL << 32 | 3 } }; // lanes 0-3 packed into n128_u64[0..1], assuming little-endian lane order
If this code needs to be portable between compilers, a better option would be to create a preprocessor macro that handles the packing of the constants.
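A minimal sketch of such a macro might look like the following (the macro name is made up, and the MSVC branch assumes the little-endian lane packing described above):

#if defined(_MSC_VER) && !defined(__clang__)
// MSVC: initialize via the first union member (two 64-bit lanes)
#define UINT32X4_INIT(a, b, c, d) \
    { { ((unsigned __int64)(b) << 32) | (a), ((unsigned __int64)(d) << 32) | (c) } }
#else
// Clang/GCC: NEON vector types accept a plain brace initializer
#define UINT32X4_INIT(a, b, c, d) { (a), (b), (c), (d) }
#endif

uint32x4_t temp = UINT32X4_INIT(1, 2, 3, 4);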
I want to bind the threads in my code to individual physical cores.
With GCC I have successfully done this using sched_setaffinity, so I no longer have to set export OMP_PROC_BIND=true. I want to do the same thing on Windows with MSVC. Windows and Linux use different thread topologies: Linux scatters the threads while Windows uses a compact form. In other words, on Linux with four cores and eight hyper-threads I only need to bind the threads to the first four processing units, whereas on Windows I have to bind them to every other processing unit.
I have successfully done this using SetProcessAffinityMask. I can see in Windows Task Manager, when I right-click the process and choose "Set Affinity", that every other CPU is set (0, 2, 4, 6 on my eight-hyper-thread system). The problem is that the efficiency of my code is unstable from run to run. Sometimes it's nearly constant, but most of the time it varies a lot. I changed the priority to high, but it makes no difference. On Linux the efficiency is stable. Maybe Windows is still migrating the threads? Is there something else I need to do to bind the threads on Windows?
Here is the code I'm using
#ifdef _WIN32
    HANDLE process;
    DWORD_PTR processAffinityMask = 0;
    //Windows uses a compact thread topology. Set the mask to every other logical processor
    for(int i=0; i<ncores; i++) processAffinityMask |= (DWORD_PTR)1 << (2*i);
    //processAffinityMask = 0x55;
    process = GetCurrentProcess();
    SetProcessAffinityMask(process, processAffinityMask);
#else
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for(int i=0; i<ncores; i++) CPU_SET(i, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
#endif
Edit: here is the code I'm using now, which seems to give stable results on both Linux and Windows:
#ifdef _WIN32
    HANDLE process;
    DWORD_PTR processAffinityMask = 0;
    //Windows uses a compact thread topology. Set the mask to every other logical processor
    for(int i=0; i<ncores; i++) processAffinityMask |= (DWORD_PTR)1 << (2*i);
    process = GetCurrentProcess();
    SetProcessAffinityMask(process, processAffinityMask);
    #pragma omp parallel
    {
        HANDLE thread = GetCurrentThread();
        DWORD_PTR threadAffinityMask = (DWORD_PTR)1 << (2*omp_get_thread_num());
        SetThreadAffinityMask(thread, threadAffinityMask);
    }
#else
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for(int i=0; i<ncores; i++) CPU_SET(i, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
    #pragma omp parallel
    {
        cpu_set_t thread_mask;
        CPU_ZERO(&thread_mask);
        CPU_SET(omp_get_thread_num(), &thread_mask);
        pthread_setaffinity_np(pthread_self(), sizeof(thread_mask), &thread_mask);
    }
#endif
You should use the SetThreadAffinityMask function (see the MSDN reference); the code above sets the process's affinity mask, not the individual threads'.
You can obtain a thread ID in OpenMP with this code:
int tid = omp_get_thread_num();
However, the code above returns OpenMP's internal thread number, not the system thread ID. This article explains more on the subject:
http://msdn.microsoft.com/en-us/magazine/cc163717.aspx
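For example, inside a parallel region you can capture both IDs; a minimal sketch (variable names are just illustrative):

#pragma omp parallel
{
    int tid = omp_get_thread_num();       // OpenMP's logical thread number
    DWORD sysTid = GetCurrentThreadId();  // the Windows thread ID of the same thread
    // use GetCurrentThread()/sysTid when calling the Win32 affinity APIs for this thread
}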
If you need to work with those threads explicitly, use the explicit affinity type as explained in this Intel documentation:
https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/optaps/common/optaps_openmp_thread_affinity.htm
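For instance, if you happen to be using the Intel OpenMP runtime (an assumption; the stock MSVC OpenMP runtime does not read this variable), binding to every other logical processor with the explicit type would look something like:

set KMP_AFFINITY=proclist=[0,2,4,6],explicit,verbose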
So, I'm working with Visual Studio 2008, developing native C++ code for a Windows CE 6.0 platform. Consider the following multithreaded application:
#include "stdafx.h"
void IncrementCounter(int& counter)
{
if (++counter >= 1000)
{
counter = 0;
}
}
unsigned long ThreadFunction(void* arguments)
{
int threadCounter = 0;
while (true)
{
Sleep(20);
IncrementCounter(threadCounter);
}
return 0;
}
int _tmain(int argc, _TCHAR* argv[])
{
CreateThread(
NULL,
0,
(LPTHREAD_START_ROUTINE)ThreadFunction,
NULL,
0,
NULL
);
int mainCounter = 0;
while (true)
{
Sleep(20);
IncrementCounter(mainCounter);
}
return 0;
}
When I build this to run on my Windows 7 dev machine and run a debug session from Visual Studio with a breakpoint on the counter = 0; statement, execution eventually breaks and two threads are displayed in the "Threads" debug window. I can switch back and forth between the two threads using either a double-click or right-click -> "Switch to Thread", and for both threads I can see a call stack, browse the source, and inspect symbols (for the call-stack frames within my application code). However, when I do the same on Windows CE, connecting via ActiveSync/WMDC (I have tried both our custom CE 6.0 hardware with an in-house OS and SDK, and an old Windows Mobile 5.0 PDA with the stock MS SDK), I can see a call stack and browse the source for the thread in which the break has taken place (where the current execution point is within my application code), but I don't get anything useful for the other thread, which is currently blocked in kernel space waiting for its sleep timeout.
Does anyone know whether it's possible to get this working better on Windows CE? I'm guessing it might have something to do with the debugger not knowing where to find the .pdb symbol files for the Windows CE kernel components, or perhaps I need to be running a debug OS?
"Windows CE 6 remote debugging. No call stack when pause program" describes the same issue, but doesn't really provide a solution.
Thanks,
Richard
It's probably because of the missing .pdb file for coredll.dll. If you are creating the image for your device you will have access to this file; otherwise, I'm afraid it's platform dependent.
You can see here that this issue is considered to be by design in VS2005, so maybe it's the same for VS2008:
http://connect.microsoft.com/VisualStudio/feedback/details/190785/unable-to-debug-windows-mobile-application-that-is-in-a-system-call
At the following link you can find some instructions for finding the call stack of a "Thread That Is Not Running" using Platform Builder:
https://distrinet.cs.kuleuven.be/projects/SEESCOA/internal/workpackages/workpackage6/Task6dot2/ESCE/classes/331.pdf
Since I am only using VS 2005, I cannot confirm whether it's of any help.
If logging is not sufficient (as was suggested in the SO link you provided), then to find the call stack of a thread like the one in your example I suggest using the GetThreadCallStack function. Here is a step-by-step procedure:
1 - Add the following code to your project:
typedef struct _CallSnapshotEx {
    DWORD dwReturnAddr;
    DWORD dwFramePtr;
    DWORD dwCurProc;
    DWORD dwParams[4];
} CallSnapshotEx;

#define STACKSNAP_EXTENDED_INFO 2

DWORD dwGUIThread;

void DumpGUIThreadCallStack() {
    HINSTANCE hCore = LoadLibrary(_T("coredll.dll"));
    typedef ULONG (*GETTHREADCALLSTACK)(HANDLE hThrd, ULONG dwMaxFrames, LPVOID lpFrames[], DWORD dwFlags, DWORD dwSkip);
    GETTHREADCALLSTACK pGetThreadCallStack = (GETTHREADCALLSTACK)GetProcAddress(hCore, _T("GetThreadCallStack"));
    if ( !pGetThreadCallStack )
        return;

#define MAX_FRAMES 40
    CallSnapshotEx lpFrames[MAX_FRAMES];
    DWORD dwCnt = pGetThreadCallStack((HANDLE)dwGUIThread, MAX_FRAMES, (void**)lpFrames, STACKSNAP_EXTENDED_INFO, 0);

    TCHAR szBuff[64];
    for ( DWORD i = 0; i < dwCnt; ++i ) {
        wsprintf(szBuff, L"[%d] %p\n", i, lpFrames[i].dwReturnAddr);
        OutputDebugString(szBuff);
    }
}
It will dump the call frames' return addresses to the Output window (sample output is in point 3).
2 - Initialize dwGUIThread in WinMain:
dwGUIThread = GetCurrentThreadId();
3 - Execute DumpGUIThreadCallStack(); before the actual breakpoint inside ThreadFunction. It will write to the Output window text similar to this:
[0] 8C04D2C4
[1] 8C04D34C
[2] 40026D48
[3] 000111F4 <--- 1
[4] 00011BAC <--- 2
[5] 4003C2DC
1 and 2 are the return addresses that you are interested in; you want to find the symbols nearest to them.
4 - While inside the debugger, switch to disassembly mode (right-click in the source file and choose 'Go To Disassembly'). In this mode, at the top of the window, you will see an Address: line. Paste into it the addresses from the Output window; in my case 000111F4 directs me to the following lines:
while (true)
{
Sleep(20);
000111F0 mov r0, #0x14
000111F4 bl 0001193C // <--- 1
IncrementCounter(mainCounter);
which shows you what your GUI thread is actually doing.
The Visual Studio debugger allows you to execute functions from the Immediate window, but I was unable to call DumpGUIThreadCallStack that way; I always get 'Error: function evaluation not supported'.
To find the nearest symbols for the frame return addresses, you can also use .map files together with .cod files (/FAcs compiled sources); there are some good tutorials on that on Google.
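As a rough sketch of how those files are typically produced (exact project settings may differ; MyFile/MyApp are placeholder names):

cl /FAcs /c MyFile.cpp        // produces MyFile.cod with interleaved source, assembly, and machine code
link /MAP:MyApp.map ...       // produces MyApp.map listing symbol addresses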
The example above was tested with VS 2005 and Standard SDK 5.0, on a WCE 6.0 (end user) device.
Is there a way to find the status of a display monitor in a Linux environment?
Pointers to any standard C libraries / Unix calls would be helpful. I found many interesting articles on how this can be achieved on Win32, but none of them point to a solution for a Linux environment.
I tried using xrandr, but it fails to detect the status dynamically.
Any pointers?
Here is a simple program using the Linux Real Mode Interface (LRMI):
#include "lrmi.h"
int main(void)
{
struct LRMI_regs r = {0};
r.eax = 0x4F10;
r.ebx = 0x02;
ioperm( 0, 1024, 1 );
iopl( 3 );
if( !LRMI_init() || !LRMI_int( 0x10, &r ) )
{
return -1;
}
return (r.ebx >> 8) & 0xFF;
}
Possible exit codes: 0 (on), 1 (standby), 2 (suspend), 4 (off), 8 (reduced on).