Memory layout mismatching between CPU and GPU code with CUDA

Memory layout mismatching between CPU and GPU code with CUDA - visual-c++

I'm experiencing a very weird situation. I have this template structures:
#ifdef __CUDACC__
#define __HOSTDEVICE __host__ __device__
#else
#define __HOSTDEVICE
#endif
template <typename T>
struct matrix
{
T* ptr;
int col_size, row_size;
int stride;
// some host & device methods
};
struct dummy1 {};
struct dummy2 : dummy1 {};
template <typename T>
struct a_functor : dummy2
{
matriz<T> help_m;
matrix<T> x, y;
T *x_ptr, *y_ptr;
int bsx, ind_thr;
__HOSTDEVICE void operator()(T* __x, T* __y)
{
// functor code
}
};
I've structured my code to separate cpp and cu files, so a_functor object is created in cpp file and used in a kernel function. The problem is that, executing operator() inside a kernel, I found some random behaviour I couldn't explain only looking at code. It was like my structs were sort of corrupted. So, calling a sizeof() on an a_functor object, I found:
CPU code (.cpp and .cu outside kernel): 64 bytes
GPU code (inside kernel): 68 bytes
There was obviously some kind of mismatching that ruined the whole stuff. Going further, I tracked the distance between struct parameter pointers and struct itself - to try to inspect the produced memory layout - and here's what I found:
a_functor foo;
// CPU
(char*)(&foo.help_m) - (char*)(&foo) = 0
(char*)(&foo.x) - (char*)(&foo) = 16
(char*)(&foo.y) - (char*)(&foo) = 32
(char*)(&foo.x_ptr) - (char*)(&foo) = 48
(char*)(&foo.y_ptr) - (char*)(&foo) = 52
(char*)(&foo.bsx) - (char*)(&foo) = 56
(char*)(&foo.ind_thr) - (char*)(&foo) = 60
// GPU - inside a_functor::operator(), in-kernel
(char*)(&this->help_m) - (char*)(this) = 4
(char*)(&this->x) - (char*)(this) = 20
(char*)(&this->y) - (char*)(this) = 36
(char*)(&this->x_ptr) - (char*)(this) = 52
(char*)(&this->y_ptr) - (char*)(this) = 56
(char*)(&this->bsx) - (char*)(this) = 60
(char*)(&this->ind_thr) - (char*)(this) = 64
I really can't understand why nvcc generated this memory layout for my struct (what are that 4 bytes supposed to be/do!?!). I thought it could be an alignment problem and I tryed to explicitly align a_functor, but I can't because it is passed by value in kernel
template <typename T, typename Str>
__global__ void mykernel(Str foo, T* src, T*dst);
and when I try compile I get
error: cannot pass a parameter with a too large explicit alignment to a global routine on win32 platforms
So, to solve this strange situation (...and I do think that's an nvcc bug), what should I do? The only thing I can think of is playing with alignment and passing my struct to kernel by pointer to avoid the aforementioned error. However, I'm really wondering: why that memory layout mismatching?! It really makes no sense...
Further information: I'm using Visual Studio 2008, compiling with MSVC on Windows XP 32bit platform. I installed the latest CUDA Toolkit 5.0.35. My card is a GeForce GTX 570 (compute capability 2.0).

From the comments it appears there may be differences between the code you're actually running and the code you've posted, so it's difficult to give more than vague answers without someone being able to reproduce the problem. That said, on Windows there are cases where the layout and size of a struct can differ between the CPU and the GPU, these are documented in the programming guide:
On Windows, the CUDA compiler may produce a different memory layout,
compared to the host Microsoft compiler, for a C++ object of class
type T that satisfies any of the following conditions:
T has virtual functions or derives from a direct or indirect base class that has virtual functions;
T has a direct or indirect virtual base class;
T has multiple inheritance with more than one direct or indirect empty base class.
The size for such an object may also be
different in host and device code. As long as type T is used
exclusively in host or device code, the program should work correctly.
Do not pass objects of type T between host and device code (e.g., as
arguments to global functions or through cudaMemcpy*() calls).
The third case may apply in your case where you have an empty base class, do you have multiple inheritance in the real code?

Related

NtQueryObject returns wrong insufficient required size via WOW64, why?

I am using the NT native API NtQueryObject()/ZwQueryObject() from user mode (and I am aware of the risks in general and I have written kernel mode drivers for Windows in the past in my professional capacity).
Generally when one uses the typical "query information" function (of which there are a few) the protocol is first to ask with a too small buffer to retrieve the required size with STATUS_INFO_LENGTH_MISMATCH, then allocate a buffer of said size and query again -- this time using the buffer and previously returned size.
In order to get the list of object types (67 on my build) on the system I am doing just that:
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
And in Size I get 8280 (WOW64) and 8968 (x64). I then proceed to allocate the buffer with calloc() and query again:
ULONG Size2 = 0;
BYTE* Buf = (BYTE*)::calloc(1, Size);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Size, &Size2);
NB: ObjectTypesInformation is 3. It isn't declared in winternl.h, but Nebbett (as ObjectAllTypesInformation) and others describe it. Since I am not querying for a particular object's traits but the system-wide list of object types, I pass NULL for the object handle.
Curiously on WOW64, i.e. 32-bit, the value in Size2 upon return from the second query is 16 Bytes (= 8296) bigger than the previously returned required size.
As far as alignment is concerned, I'd expect at most 8 Bytes for this sort of thing and indeed neither 8280 nor 8296 are at a 16 Byte alignment boundary, but on an 8 Byte one.
Certainly I can add some slack space on top of the returned required size (e.g. ALIGN_UP to the next 32 Byte alignment boundary), but this seems highly irregular to be honest. And I'd rather want to understand what's going on than to implement a workaround that breaks, because I miss something crucial.
The practical issue for the code is that in Debug configurations it tells me there's a corrupted heap somewhere, upon freeing Buf. Which suggests that NtQueryObject() was indeed writing these extra 16 Bytes beyond the buffer I provided.
Question: Any idea why it is doing that?
As usual for NT native API the sources of information are scarce. The x64 version of the exact same code returns the exact number of bytes required. So my thinking here is that WOW64 is the issue. A somewhat cursory look into wow64.dll with IDA didn't reveal any immediate points for suspicion regarding what goes wrong in translating the results to 32-bit here.
PS: Windows 10 (10.0.19043, ntdll.dll "timestamp" 77755782)
PPS: this may be related: https://wj32.org/wp/2012/11/30/obquerytypeinfo-and-ntqueryobject-buffer-overrun-in-windows-8/ Tested it, by checking that OBJECT_TYPE_INFORMATION::TypeName.Length + sizeof(WCHAR) == OBJECT_TYPE_INFORMATION::TypeName.MaximumLength in all returned items, which was the case.

The only part of ObjectTypesInformation that's public is the first field defined in winternl.h header in the Windows SDK:
typedef struct __PUBLIC_OBJECT_TYPE_INFORMATION {
UNICODE_STRING TypeName;
ULONG Reserved [22]; // reserved for internal use
} PUBLIC_OBJECT_TYPE_INFORMATION, *PPUBLIC_OBJECT_TYPE_INFORMATION;
For x86 this is 96 bytes, and for x64 this is 104 bytes (assuming you have the right packing mode enabled). The difference is the pointer in UNICODE_STRING which changes the alignment in x64.
Any additional memory space should be related to the TypeName buffer.
UNICODE_STRING accounts for 8 bytes of the difference between 8280 and 8296. The function uses the sizeof(ULONG_PTR) for alignment of the returned string plus an extra WCHAR, so that could easily account for the remaining 8 bytes.
AFAIK: The public use of NtQueryObject is supposed to be limited to kernel-mode use which of course means it always matches the OS native bitness (x86 code can't run as kernel in x64 native OS), so it's probably just a quirk of using the NT functions via the WOW64 thunk.

Alright, I think I figured out the issue with the help of WinDbg and a thorough look at wow64.dll using IDA.
NB: the wow64.dll I have has the same build number, but differs slightly in data only (checksum, security directory entry, pieces from version resources). The code is identical, which was to be expected, given deterministic builds and how they affect the PE timestamp.
There's an internal function called whNtQueryObject_SpecialQueryCase (according to PDBs), which covers the ObjectTypesInformation class queries.
For the above wow64.dll I used the following points of interest in WinDbg, from a 32 bit program which calls NtQueryObject(NULL, ObjectTypesInformation, ...) (the program itself is irrelevant, though):
0:000> .load wow64exts
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B0E0
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B14E
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B1A7
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B24A
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B252
Explanation of the above points of interest:
+B0E0: computing length required for 64 bit query, based on passed length for 32 bit
+B14E: call to NtQueryObject()
+B1A7: loop body for copying 64 to 32 bit buffer contents, after successful NtQueryObject() call
+B24A: computing written length by subtracting current (last + 1) entry from base buffer address
+B252: downsizing returned (64 bit) required length to 32 bit
The logic of this function in regards to just ObjectTypesInformation is roughly as follows:
Common steps
Take the ObjectInformationLength (32 bit query!) argument and size it up to fit the 64 bit info
Align the retrieved size up to the next 16 byte boundary
If necessary allocate the resulting amount from some PEB::ProcessHeap and store in TLS slot 3; otherwise using this as a scratch space
Call NtQueryObject() passing the buffer and length from the two previous steps
The length passed to NtQueryObject() is the one from step 1, not the one aligned to a 16 byte boundary. There seems to be some sort of header to this scratch space, so perhaps that's where the 16 byte alignment comes from?
Case 1: buffer size too small (here: 4), just querying required length
The up-sized length in this case equals 4, which is too small and consequently NtQueryObject() returns STATUS_INFO_LENGTH_MISMATCH. Required size is reported as 8968.
Down-size from the 64 bit required length to 32 bit and end up 16 bytes too short
Return the status from NtQueryObject() and the down-sized required length form the previous step
Case 2: buffer size supposedly (!) sufficient
Copy OBJECT_TYPES_INFORMATION::NumberOfTypes from queried buffer to 32 bit one
Step to the first entry (OBJECT_TYPE_INFORMATION) of source (64 bit) and target (32 bit) buffer, 8 and 4 byte aligned respectively
For for each entry up to OBJECT_TYPES_INFORMATION::NumberOfTypes:
Copy UNICODE_STRING::Length and UNICODE_STRING::MaximumLength for TypeName member
memcpy() UNICODE_STRING::Length bytes from source to target UNICODE_STRING::Buffer (target entry + sizeof(OBJECT_TYPE_INFORMATION32)
Add terminating zero (WCHAR) past the memcpy'd string
Copy the individual members past the TypeName from 64 to 32 bit struct
Compute pointer of next entry by aligning UNICODE_STRING::MaximumLength up to an 8 byte boundary (i.e. the ULONG_PTR alignment mentioned in the other answer) + sizeof(OBJECT_TYPE_INFORMATION64) (already 8 byte aligned!)
The next target entry (32 bit) gets 4 byte aligned instead
At the end compute required (32 bit) length by subtracting the value we arrived at for the "next" entry (i.e. one past the last) from the base address of the buffer passed by the WOW64 program (32 bit) to NtQueryObject()
In my debugged scenario these were: 0x008ce050 - 0x008cbfe8 = 0x00002068 (= 8296), which is 16 bytes larger than the buffer length we were told during case 1 (8280)!
The issue
That crucial last step differs between merely querying and actually getting the buffer filled. There is no further bounds checking in that loop I described for case 2.
And this means it will just overrun the passed buffer and return a written length bigger than the buffer length passed to it.
Possible solutions and workarounds
I'll have to approach this mathematically after some sleep, the workaround is obviously to top up the required length returned from case 1 in order to avoid the buffer overrun. The easiest method is to use my up_size_from_32bit() from the example below and use that on the returned required size. This way you are allocating enough for the 64 bit buffer, while querying the 32 bit one. This should never overrun during the copy loop.
However, the fix in wow64.dll is a little more involved, I guess. While adding bounds checking to the loop would help avert the overrun, it would mean that the caller would have to query for the required size twice, because the first time around it lies to us.
Which means the query-only case (1) would have to allocate that internal buffer after querying the required length for 64 bit, then get it filled and then walk the entries (just like the copy loop), skipping over the last entry to compute the required length the same as it is now done after the copy loop.
Example program demonstrating the "static" computation by wow64.dll
Build for x64, just the way wow64.dll was!
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <cstdio>
typedef struct
{
ULONG JustPretending[24];
} OBJECT_TYPE_INFORMATION32;
typedef struct
{
ULONG JustPretending[26];
} OBJECT_TYPE_INFORMATION64;
constexpr ULONG size_delta_3264 = sizeof(OBJECT_TYPE_INFORMATION64) - sizeof(OBJECT_TYPE_INFORMATION32);
constexpr ULONG down_size_to_32bit(ULONG len)
{
return len - size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION64));
}
constexpr ULONG up_size_from_32bit(ULONG len)
{
return len + size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION32));
}
// Trying to mimic the wdm.h macro
constexpr size_t align_up_by(size_t address, size_t alignment)
{
return (address + (alignment - 1)) & ~(alignment - 1);
}
constexpr auto u32 = 8280UL;
constexpr auto u64 = 8968UL;
constexpr auto from_64 = down_size_to_32bit(u64);
constexpr auto from_32 = up_size_from_32bit(u32);
constexpr auto from_32_16_byte_aligned = (ULONG)align_up_by(from_32, 16);
int wmain()
{
wprintf(L"32 to 64 bit: %u -> %u -(16-byte-align)-> %u\n", u32, from_32, from_32_16_byte_aligned);
wprintf(L"64 to 32 bit: %u -> %u\n", u64, from_64);
return 0;
}
static_assert(sizeof(OBJECT_TYPE_INFORMATION32) == 96, "Size for 64 bit struct does not match.");
static_assert(sizeof(OBJECT_TYPE_INFORMATION64) == 104, "Size for 64 bit struct does not match.");
static_assert(u32 == from_64, "Must match (from 64 to 32 bit)");
static_assert(u64 == from_32, "Must match (from 32 to 64 bit)");
static_assert(from_32_16_byte_aligned % 16 == 0, "16 byte alignment failed");
static_assert(from_32_16_byte_aligned > from_32, "We're aligning up");
This does not mimic the computation that happens in case 2, though.

Is CGAL 2D Regularized Boolean Set-Operations lib thread safe?

I am currently using the library mentioned in the title, see
CGAL 2D-reg-bool-set-op-pol
The library provides types for polygons and polygon sets which are internally represented as so called arrangements.
My question is: How far is this library thread safe, that is, fit for parallel computation on its objects?
There could be several levels in which thread safety is guaranteed:
1) If I take an object from a library like an arrangement
Polygon_set_2 S;
I might be able to execute
Polygon_2 P;
S.join(P);
and
Polygon_2 Q;
S.join(Q);
in two different concurrent execution units/threads in parallel without harm and get the right result, as if I had done everything sequentially. That would be the highest degree of thread safety/possible parallelism.
2) In fact for me a much lesser degree would be enough. In that case S and P would be members of a class C so that two class instances have different S and P instances. Then I would like to compute (say) S.join(P) in parallel for a list of instances of the class C, say, by calling a suitable member function of C with std::async
Just to be complete, I insert here a bit of actual code from my project which gives more flesh to these terse descriptions.
// the following typedefs are more or less standard from the
// CGAL library examples.
typedef CGAL::Exact_predicates_exact_constructions_kernel Kernel;
typedef Kernel::Point_2 Point_2;
typedef Kernel::Circle_2 Circle_2;
typedef Kernel::Line_2 Line_2;
typedef CGAL::Gps_circle_segment_traits_2<Kernel> Traits_2;
typedef CGAL::General_polygon_set_2<Traits_2> Polygon_set_2;
typedef Traits_2::General_polygon_2 Polygon_2;
typedef Traits_2::General_polygon_with_holes_2 Polygon_with_holes_2;
typedef Traits_2::Curve_2 Curve_2;
typedef Traits_2::X_monotone_curve_2 X_monotone_curve_2;
typedef Traits_2::Point_2 Point_2t;
typedef Traits_2::CoordNT coordnt;
typedef CGAL::Arrangement_2<Traits_2> Arrangement_2;
typedef Arrangement_2::Face_handle Face_handle;
// the following type is not copied from the CGAL library example code but
// introduced by me
typedef std::vector<Polygon_with_holes_2> pwh_vec_t;
// the following is an excerpt of my full GerberLayer class,
// that retains only data members which are used in the join()
// member function. These data is therefore local to the class instance.
class GerberLayer
{
public:
GerberLayer();
~GerberLayer();
void join();
pwh_vec_t raw_poly_lis;
pwh_vec_t joined_poly_lis;
Polygon_set_2 Saux;
annotate_vec_t annotate_lis;
polar_vec_t polar_lis;
};
//
// it is not necessary to understand the working of the function
// I deleted all debug and timing output etc. It is just to "showcase" some typical
// operations from the CGAL reg set boolean ops for polygons library from
// Efi Fogel et.al.
//
void GerberLayer::join()
{
Saux.clear();
auto it_annbase = annotate_lis.begin();
annotate_vec_t::iterator itann = annotate_lis.begin();
bool first_block = true;
int cnt = 0;
while (itann != annotate_lis.end()) {
gpolarity akt_polar = itann->polar;
auto itnext = std::find_if(itann, annotate_lis.end(),
[=](auto a) {return a.polar != akt_polar;});
Polygon_set_2 Sblock;
if (first_block) {
if (akt_polar == Dark) {
Saux.join(raw_poly_lis.begin() + (itann - it_annbase),
raw_poly_lis.begin() + (itnext - it_annbase));
}
first_block = false;
} else {
if (akt_polar == Dark) {
Saux.join(raw_poly_lis.begin() + (itann - it_annbase),
raw_poly_lis.begin() + (itnext - it_annbase));
} else {
Polygon_set_2 Saux1;
Saux1.join(raw_poly_lis.begin() + (itann - it_annbase),
raw_poly_lis.begin() + (itnext - it_annbase));
Saux.complement();
pwh_vec_t auxlis;
Saux1.polygons_with_holes(std::back_inserter(auxlis));
Saux.join(auxlis.begin(), auxlis.end());
Saux.complement();
}
}
itann = itnext;
}
ende:
joined_poly_lis.clear();
annotate_lis.clear();
Saux.polygons_with_holes (std::back_inserter (joined_poly_lis));
}
int join_wrapper(GerberLayer* p_layer)
{
p_layer->join();
return 0;
}
// here the parallelism (of the "embarassing kind") occurs:
// for every GerberLayer a dedicated task is started, which calls
// the above GerberLayer::join() function
void Window::do_unify()
{
std::vector<std::future<int>> fivec;
for(int i = 0; i < gerber_layer_manager.num_layers(); ++i) {
GerberLayer* p_layer = gerber_layer_manager.at(i);
fivec.push_back(std::async(join_wrapper, p_layer));
}
int sz = wait_for_all(fivec); // written by me, not shown
}
One might think, that 2) must be possible trivially as only "different" instances of polygons and arrangements are in the play. But: It is imaginable, as the library works with arbitrary precision points (Point_2t in my code above) that, for some implementation reason or other, all the points are inserted in a list static to the class Point_2t, so that identical points are represented only once in this list. So there would be nothing like "independent instances of Point_2t" and as a consequence also not for "Polygon_2" or "Polygon_set_2" and one could say farewell to thread safety.
I tried to resolve this question by googling (not by analyzing the library code, I have to admit) and would hope for an authoritative answer (hopefully positive as this primitive parallelism would greatly speed up my code).
Addendum:
1)
I implemented this already and made a test run with nothing exceptional occurring and visually plausible results, but of course this proves nothing.
2) The same question for the CGAL 2D-Arrangement-package from the same authors.
Thanks in advance!
P.S.: I am using CGAL 4.7 from the packages supplied with Ubuntu 16.04 (Xenial). A newer version on Ubuntu 18.04 gave me errors so I decided to stay with 4.7. Should a version newer than 4.7 be thread-safe, but not 4.7, of course I will try to use that newer version.
Incidentally I could not find out if the libcgal***.so libraries as supplied by Ubuntu 16.04 are thread safe as described in the documentation. Especially I found no reference to the Macro-Variable CGAL_HAS_THREADS that is mentioned in the "thread-safety" part of the docs, when I looked through the build-logs of the Xenial cgal package on launchpad.

Indeed there are several level of thread safety.
The 2D Regularized Boolean operation package depends of the 2D Arrangement package, and both packages depend on a kernel. For most operations the EPEC kernel is required.
Both packages are thread-safe, except for the rational-arc traits (Arr_rational_function_traits_2).
However, the EPEC kernel is not thread-safe yet when sharing number-type objects among threads. So, if you, for example, construct different arrangements in different threads, from different input sets of curves, respectively, you are safe.

Misaligned pointer use with std::shared_ptr<NSDate> dereference

I am working in a legacy codebase with a large amount of Objective-C++ written using manual retain/release. Memory is managed using lots of C++ std::shared_ptr<NSMyCoolObjectiveCPointer>, with a suitable deleter passed in on construction that calls release on the contained object. This seems to work great; however, when enabling UBSan, it complains about misaligned pointers, usually when dereferencing the shared_ptrs to do some work.
I've searched for clues and/or solutions, but it's difficult to find technical discussion of the ins and outs of Objective-C object pointers, and even more difficult to find any discussion about Objective-C++, so here I am.
Here is a full Objective-C++ program that demonstrates my problem. When I run this on my Macbook with UBSan, I get a misaligned pointer issue in shared_ptr::operator*:
#import <Foundation/Foundation.h>
#import <memory>
class DateImpl {
public:
DateImpl(NSDate* date) : _date{[date retain], [](NSDate* date) { [date release]; }} {}
NSString* description() const { return [&*_date description]; }
private:
std::shared_ptr<NSDate> _date;
};
int main(int argc, const char * argv[]) {
#autoreleasepool {
DateImpl date{[NSDate distantPast]};
NSLog(#"%#", date.description());
return 0;
}
}
I get this in the call to DateImpl::description:
runtime error: reference binding to misaligned address 0xe2b7fda734fc266f for type 'std::__1::shared_ptr<NSDate>::element_type' (aka 'NSDate'), which requires 8 byte alignment
0xe2b7fda734fc266f: note: pointer points here
<memory cannot be printed>
I suspect that there is something awry with the usage of &* to "cast" the shared_ptr<NSDate> to an NSDate*. I think I could probably work around this issue by using .get() on the shared_ptr instead, but I am genuinely curious about what is going on. Thanks for any feedback or hints!

There were some red herrings here: shared_ptr, manual retain/release, etc. But I ended up discovering that even this very simple code (with ARC enabled) causes the ubsan hit:
#import <Foundation/Foundation.h>
int main(int argc, const char * argv[]) {
#autoreleasepool {
NSDate& d = *[NSDate distantPast];
NSLog(#"%#", &d);
}
return 0;
}
It seems to simply be an issue with [NSDate distantPast] (and, incidentally, [NSDate distantFuture], but not, for instance, [NSDate date]). I conclude that these must be singleton objects allocated sketchily/misaligned-ly somewhere in the depths of Foundation, and when you dereference them it causes a misaligned pointer read.
(Note it does not happen when the code is simply NSLog(#"%#", &*[NSDate distantPast]). I assume this is because the compiler simply collapses &* on a raw pointer into a no-op. It doesn't for the shared_ptr case in the original question because shared_ptr overloads operator*. Given this, I believe there is no easy way to make this happen in pure Objective-C, since you can't separate the & operation from the * operation, like you can when C++ references are involved [by storing the temporary result of * in an NSDate&].)

You are not supposed to ever use a "bare" NSDate type. Objective-C objects should always be used with a pointer-to-object type (e.g. NSDate *), and you are never supposed to get the "type behind the pointer".
In particular, on 64-bit platforms, Objective-C object pointers can sometimes not be valid pointers, but rather be "tagged pointers" which store the "value" of the object in certain bits of the pointer, rather than as an actual allocated object. You must always let the Objective-C runtime machinery deal with Objective-C object pointers. Dereferencing it as a regular C/C++ pointer can lead to undefined behavior.

Is there any kill_proc() replacement for proprietary Linux kernel drivers?

I'm in the process of porting 4 proprietary (read: non-GPL) Linux kernel drivers (that I didn't write) from RHEL 5.x to RHEL 6.x (2.6.32 kernel). The drivers all use kill_proc() for signalling the user-space "session", but this function has been removed from the more recent kernels (somewhere between 2.6.18 and 2.6.32). I've seen this question asked many times here and elsewhere and I've searched fairly extensively, but of the many suggested solutions, none work due to either the functions no longer being exported, or requrieing a GPL-only function (see below). Does anyone know of a solution that could work for a proprietary driver?
given: kill_proc(pid, sig, 1);
The simplest solution I found was to use: kill_proc_info(sig, SEND_SIG_PRIV, pid); however kill_proc_info is no longer exported so it can't be used.
kill_pid_info() has been suggested (this is called by kill_proc_info() after setting an rcu_read_lock(). kill_pid_info() requires a struct pid* so I could use: kill_pid_info(sig, SEND_SIG_PRIV, find_vpid(pid)); however find_vpid() is exported for GPL use only and this is a proprietary driver. Is there another way to get the struct pid*?
kill_pid_info() also sets up an rcu_read_lock() and then calls group_send_sig_info(). Unfortunately, group_send_siginfo() is not exported, and also it requires a struct task_struct*, but the required find_task_by_vpid() function is not exported either.
Another suggestion was kill_pid(), but this also requires a struct pid*, and again, the function find_vpid() is only exported for GPL.
There were also suggestions for send_sig() and send_sig_info(), but these also require a struct task_struct*, and again, find_task_by_pid() is not exported, and pid_task() requires that (GPLd) find_vpid() to get a struct pid*. Also, these function don't set an rcu_read_lock() and they also pass a FALSE value for the group flag (whereas kill_proc ended up using a TRUE value) - so there could be some subtle differences.
That's all that I could find. Does anyone have a suggestion that will work for my case? Thanks in advance.

Since there have been no responses to my question, I've been
reading much of the kernel code and I think I've found a
solution.
It seems that the only exported function that provides the
same semantics as kill_proc() is kill_pid(). We can't use
the GPL find_vpid() function to get the needed struct pid*,
but if we can get the struct task_struct*, then we can get
the struct pid* from there as:
task->pids[PIDTYPE_PID].pid
Since find_task_by_vpid() is no longer exported, it seems
the only way to find the task is to go through the entire
task list looking for it. So, the proposed solution is:
int my_kill_proc(pid_t pid, int sig) {
int error = -ESRCH; /* default return value */
struct task_struct* p;
struct task_struct* t = NULL;
struct pid* pspid;
rcu_read_lock();
p = &init_task; /* start at init */
do {
if (p->pid == pid) { /* does the pid (not tgid) match? */
t = p;
break;
}
p = next_task(p); /* "this isn't the task you're looking for" */
} while (p != &init_task); /* stop when we get back to init */
if (t != NULL) {
pspid = t->pids[PIDTYPE_PID].pid;
if (pspid != NULL) error = kill_pid(pspid,sig,1);
}
rcu_read_unlock();
return error;
}
I know it will take a lot more time to search the whole task list rather
than using the hash tables, but it's all I've got. Some concerns/questions
that I have:
Is the rcu_read_lock() sufficient for this? Would
it be better to use something like preempt_disable() instead?
Can the struct task_struct ever NOT have a PIDTYPE_PID entry
in the pids array? And if so, is checking for NULL sufficient?
I'm new to working with the kernel, are there any other
suggestions to make this better?

Variant type storage and alignment issues

I've made a variant type to use instead of boost::variant. Mine works storing an index of the current type on a list of the possible types, and storing data in a byte array with enough space to store the biggest type.
unsigned char data[my_types::max_size];
int type;
Now, when I write a value to this variant type comes the trouble. I use the following:
template<typename T>
void set(T a) {
int t = type_index(T);
if (t != -1) {
type = t;
puts("writing atom data");
*((T *) data) = a; //THIS PART CRASHES!!!!
puts("did it!");
} else {
throw atom_bad_assignment;
}
}
The line that crashes is the one that stores data to the internal buffer. As you can see, I just cast the byte array directly to a pointer of the desired type. This gives me bad address signals and bus errors when trying to write some values.
I'm using GCC on a 64-bit system. How do I set the alignment for the byte array to make sure the address of the array is 64-bit aligned? (or properly aligned for any architecture I might port this project to).
EDIT: Thank you all, but the mistake was somewhere else. Apparently, Intel doesn't really care about alignment. Aligned stuff is faster but not mandatory, and the program works fine this way. My problem was I didn't clear the data buffer before writing stuff and this caused trouble with the constructors of some types. I will not, however, mark the question as answered, so more people can give me tips on alignment ;)

See http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Variable-Attributes.html
unsigned char data[my_types::max_size] __attribute__ ((aligned));
int type;

I believe
#pragma pack(64)
will work on all modern compilers; it definitely works on GCC.
A more correct solution (that doesn't mess with packing globally) would be:
#pragma pack(push, 64)
// define union here
#pragma pack(pop)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Memory layout mismatching between CPU and GPU code with CUDA - visual-c++

Related

NtQueryObject returns wrong insufficient required size via WOW64, why?

Is CGAL 2D Regularized Boolean Set-Operations lib thread safe?

Misaligned pointer use with std::shared_ptr<NSDate> dereference

Is there any kill_proc() replacement for proprietary Linux kernel drivers?

Variant type storage and alignment issues

Categories

Resources