How to declare a 16-bit pointer to a string in the GCC C compiler for an ARM processor

I tried to declare an array of short pointers to strings (16 bits instead of the default 32 bits) with the GNU GCC C compiler for an ARM Cortex-M0 processor, to reduce flash consumption. I have about 200 strings in two languages, so shrinking each pointer from 32 bits to 16 bits could save 800 bytes of flash. It should be possible because the flash size is less than 64 kB, so the high half-word (16 bits) of every pointer into flash is constant and equal to 0x0800:
const unsigned char str1[] ="First string";
const unsigned char str2[] ="Second string";
const unsigned short ptrs[] = {&str1, &str2}; //this line generate error
but I got an error on the third line:
"error: initializer element is not computable at load time"
Then I tried:
const unsigned short ptr1 = (&str1 & 0xFFFF);
and I got:
"error: invalid operands to binary & (have 'const unsigned char (*)[11]' and 'int')"
After many attempts I ended up in assembly:
.section .rodata.strings
.align 2
ptr0:
ptr3: .short (str3-str0)
ptr4: .short (str4-str0)
str0:
str3: .asciz "3-th string"
str4: .asciz "4-th string"
Compilation passes, but now I have a problem referencing the pointers ptr4 and ptr0 from C code. When I try to pass "ptr4-ptr0" as an 8-bit argument to a C function:
ptr = getStringFromTable (ptr4-ptr0)
declared as:
const unsigned char* getStringFromTable (unsigned char stringIndex)
I get wrong code like this:
ldr r3, [pc, #28] ; (0x8000a78 <main+164>)
ldrb r1, [r3, #0]
ldr r3, [pc, #28] ; (0x8000a7c <main+168>)
ldrb r3, [r3, #0]
subs r1, r1, r3
uxtb r1, r1
bl 0x8000692 <getStringFromTable>
instead of something like this:
movs r0, #2
bl 0x8000692 <getStringFromTable>
I would be grateful for any suggestion.
.....after a few days.....
Following the advice of @TonyK and @old_timer, I finally solved the problem in the following way.
In assembly I wrote:
.global str0, ptr0
.section .rodata.strings
.align 2
ptr0: .short (str3-str0)
.short (str4-str0)
str0:
str3: .asciz "3-th string"
str4: .asciz "4-th string"
Then I declared in C:
extern unsigned short ptr0[];
extern const unsigned char str0[] ;
enum ptrs {ptr3, ptr4}; //automatically: ptr3=0, ptr4=1
const unsigned char* getStringFromTable (enum ptrs index)
{
return &str0[ptr0[index]] ;
}
and now this text:
ptr = getStringFromTable (ptr4)
is compiled to the correct code:
08000988: 0x00000120 movs r0, #1
0800098a: 0xfff745ff bl 0x8000818 <getStringFromTable>
I just have to remember to keep the order of enum ptrs in sync each time I add a string to the assembly and a new item to enum ptrs.

Declare ptr0 and str0 as .global in your assembly language file. Then in C:
extern unsigned short ptr0[] ;
extern const char str0[] ;
const char* getStringFromTable (unsigned char index)
{
return &str0[ptr0[index]] ;
}
This works as long as the total size of the str0 table is less than 64K.

A pointer is an address, and addresses on ARM cannot be 16 bits; that makes no sense. Other than the Acorn-based ARMs (24 bit, if I remember right), addresses are a minimum of 32 bits (for ARM), larger on aarch64, but never smaller.
This
ptr3: .short (str3-str0)
does not produce an address (so it can't be a pointer); it produces an offset that is only usable when you add it to the base address str0.
You cannot generate 16-bit addresses (in a debugged/usable ARM compiler), but since everything appears to be static here (const/rodata), that makes it even easier to solve. It is solvable at run time as well, but even simpler pre-computed, based on the information provided thus far.
const unsigned char str1[] ="First string";
const unsigned char str2[] ="Second string";
const unsigned char str3[] ="Third string";
Brute force takes about 30 lines of code to produce the header file below, much less if you try to compact it, although ad-hoc programs don't need to be pretty.
This output is intentionally long to demonstrate the solution (and to make it possible to visually check the tool). The compiler doesn't care, so it is best to make it long and verbose for readability and validation purposes:
mystrings.h
const unsigned char strs[40]=
{
0x46, // 0 F
0x69, // 1 i
0x72, // 2 r
0x73, // 3 s
0x74, // 4 t
0x20, // 5
0x73, // 6 s
0x74, // 7 t
0x72, // 8 r
0x69, // 9 i
0x6E, // 10 n
0x67, // 11 g
0x00, // 12
0x53, // 13 S
0x65, // 14 e
0x63, // 15 c
0x6F, // 16 o
0x6E, // 17 n
0x64, // 18 d
0x20, // 19
0x73, // 20 s
0x74, // 21 t
0x72, // 22 r
0x69, // 23 i
0x6E, // 24 n
0x67, // 25 g
0x00, // 26
0x54, // 27 T
0x68, // 28 h
0x69, // 29 i
0x72, // 30 r
0x64, // 31 d
0x20, // 32
0x73, // 33 s
0x74, // 34 t
0x72, // 35 r
0x69, // 36 i
0x6E, // 37 n
0x67, // 38 g
0x00, // 39
};
const unsigned short ptrs[3]=
{
0x0000, // 0 0
0x000D, // 1 13
0x001B, // 2 27
};
The compiler then handles all of the address generation when you use it:
&strs[ptrs[n]]
Depending on how you write your tool, it can even emit things like
#define FIRST_STRING 0
#define SECOND_STRING 1
and so on, so that your code can find the string with
&strs[ptrs[SECOND_STRING]]
making the program that much more readable. All of it is auto-generated by an ad-hoc tool that does the offset work for you.
The main() part of the tool could look like
add_string(FIRST_STRING,"First string");
add_string(SECOND_STRING,"Second string");
add_string(THIRD_STRING,"Third string");
plus that function and some more code to dump the result.
Then you simply include the generated output and use the
&strs[ptrs[THIRD_STRING]]
type syntax in the real application.
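To make the ad-hoc tool idea concrete, here is a minimal sketch of what add_string and the dump step might look like. Only the name add_string comes from the text above; the buffer names, the limits, and dump_header are assumptions made for illustration, not the original tool.

```c
#include <stdio.h>
#include <string.h>

#define MAX_STRINGS 256     /* assumed limit, adjust as needed */
#define MAX_POOL    65536   /* every offset must fit in an unsigned short */

static unsigned char  pool[MAX_POOL];       /* all strings, back to back */
static unsigned short offsets[MAX_STRINGS]; /* start offset of each string */
static size_t pool_len;
static size_t string_count;

/* Append one string (with its terminating NUL) and record its offset.
   Returns 0 on success, -1 if either table would overflow. */
static int add_string(size_t index, const char *s)
{
    size_t n = strlen(s) + 1;
    if (index >= MAX_STRINGS || pool_len + n > MAX_POOL)
        return -1;
    offsets[index] = (unsigned short)pool_len;
    memcpy(&pool[pool_len], s, n);
    pool_len += n;
    if (index + 1 > string_count)
        string_count = index + 1;
    return 0;
}

/* Write both tables as C source, one verbose line per element. */
static void dump_header(FILE *out)
{
    size_t i;
    fprintf(out, "const unsigned char strs[%zu]=\n{\n", pool_len);
    for (i = 0; i < pool_len; i++) {
        int printable = pool[i] > 0x20 && pool[i] < 0x7F;
        fprintf(out, "0x%02X, // %zu %c\n",
                (unsigned)pool[i], i, printable ? pool[i] : ' ');
    }
    fprintf(out, "};\n");
    fprintf(out, "const unsigned short ptrs[%zu]=\n{\n", string_count);
    for (i = 0; i < string_count; i++)
        fprintf(out, "0x%04X, // %zu %u\n",
                (unsigned)offsets[i], i, (unsigned)offsets[i]);
    fprintf(out, "};\n");
}
```

A main() would call add_string once per string and then dump_header(stdout), redirecting the output into the generated header.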
You can also continue down the path you started, if that is what you prefer (it looks like more work but is still pretty quick to code):
ptr0:
ptr3: .short (str3-str0)
ptr4: .short (str4-str0)
str0:
str3: .asciz "3-th string"
str4: .asciz "4-th string"
Then you need to export str0 and ptr3, ptr4 (as needed, depending on your assembler's assembly language) and access the string as str0 plus the offset stored at ptr3:
extern const unsigned char str0[];
extern const unsigned short ptr3;
...
... (const unsigned char *)(str0 + ptr3)
fixing whatever syntax mistakes I intentionally or unintentionally added to that pseudo code.
That would work as well and you would have the one base address then the hundreds of 16 bit offsets to that address.
could even do some flavor of
const unsigned short ptrs[]={ptr0,ptr1,ptr2,ptr3};
...
(unsigned char *)(str0+ptrs[n])
using some flavor of C syntax to create that array but probably not worth that extra effort...
The solution a few of us have mentioned thus far (one example is demonstrated above), 16-bit offsets which are NOT addresses and therefore NOT pointers, is much easier to code, maintain, and use, and maybe to read, depending on your implementation. However it is implemented, it requires a full-sized base address plus offsets. It might be possible to code this in C without an ad-hoc tool, but the ad-hoc tool literally only takes a few minutes to write.
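One way the pure-C variant might look, as a sketch rather than a tested recipe for any particular toolchain: put every string into one struct and let offsetof produce the 16-bit offsets as compile-time constants. The names string_pool, offs, string_id, and get_string are made up for this illustration.

```c
#include <stddef.h>

/* All strings live in one object, so each offset is a constant expression.
   char arrays have alignment 1, so no padding appears between members. */
struct string_pool {
    char first[sizeof "First string"];
    char second[sizeof "Second string"];
    char third[sizeof "Third string"];
};

static const struct string_pool pool = {
    "First string", "Second string", "Third string"
};

enum string_id { FIRST_STRING, SECOND_STRING, THIRD_STRING };

/* 16-bit offsets from the start of the pool, not addresses. */
static const unsigned short offs[] = {
    offsetof(struct string_pool, first),
    offsetof(struct string_pool, second),
    offsetof(struct string_pool, third),
};

/* One full-sized base address plus a 16-bit offset gives the string. */
static const char *get_string(enum string_id id)
{
    return (const char *)&pool + offs[id];
}
```

The cost is that each string has to be listed three times (member, initializer, offset), which is the kind of bookkeeping a generator or an X-macro list would normally take over.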
I write programs to write programs, or programs to compress/manipulate data, almost daily; why not? Compression is a good example of this. Want to embed a black and white image into your resource-limited MCU flash? Don't put all the pixels in the binary; start with run-length encoding and go from there. That means a third-party tool, written by you or not, that converts the real data into a structure that fits. Same thing here: a third-party tool that prepares/compresses the data for the application. This problem is really just another compression algorithm, since you are trying to reduce the amount of data without losing any.
Also note that, depending on what these strings are, if duplicates or shared fragments are possible, the tool could be even smarter:
const unsigned char str1[] ="First string";
const unsigned char str2[] ="Second string";
const unsigned char str3[] ="Third string";
const unsigned char str4[] ="string";
const unsigned char str5[] ="Third string";
creating
const unsigned char strs[40]=
{
0x46, // 0 F
0x69, // 1 i
0x72, // 2 r
0x73, // 3 s
0x74, // 4 t
0x20, // 5
0x73, // 6 s
0x74, // 7 t
0x72, // 8 r
0x69, // 9 i
0x6E, // 10 n
0x67, // 11 g
0x00, // 12
0x53, // 13 S
0x65, // 14 e
0x63, // 15 c
0x6F, // 16 o
0x6E, // 17 n
0x64, // 18 d
0x20, // 19
0x73, // 20 s
0x74, // 21 t
0x72, // 22 r
0x69, // 23 i
0x6E, // 24 n
0x67, // 25 g
0x00, // 26
0x54, // 27 T
0x68, // 28 h
0x69, // 29 i
0x72, // 30 r
0x64, // 31 d
0x20, // 32
0x73, // 33 s
0x74, // 34 t
0x72, // 35 r
0x69, // 36 i
0x6E, // 37 n
0x67, // 38 g
0x00, // 39
};
const unsigned short ptrs[5]=
{
0x0000, // 0 0
0x000D, // 1 13
0x001B, // 2 27
0x0006, // 3 6
0x001B, // 4 27
};
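The "smarter" step could be as simple as searching the pool for an existing copy of the string, terminator included, before appending it. The sketch below is one possible way to do that; pool_add and the buffer names are hypothetical, and a linear memcmp scan is assumed to be fast enough for a few hundred strings.

```c
#include <stddef.h>
#include <string.h>

#define POOL_MAX 65536  /* offsets must fit in 16 bits */

static unsigned char pool[POOL_MAX];
static size_t pool_len;

/* Return the offset of s inside the pool. An existing occurrence of the
   string together with its NUL is reused, which covers exact duplicates
   and strings that are the tail of an earlier one. Returns -1 when full. */
static long pool_add(const char *s)
{
    size_t n = strlen(s) + 1;
    size_t i;

    for (i = 0; i + n <= pool_len; i++)
        if (memcmp(&pool[i], s, n) == 0)
            return (long)i;

    if (pool_len + n > POOL_MAX)
        return -1;
    memcpy(&pool[pool_len], s, n);
    pool_len += n;
    return (long)(pool_len - n);
}
```

Fed the five strings above in order, this reuses the tail of "First string" for "string" and the earlier copy for the second "Third string", so the pool stays at the size needed for the first three.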

Related

CRC Bluetooth Low Energy 4.2

In the core bluetooth 4.2 documentation here it talks about a CRC check for data integrity (P2456). This details the below:
With an example below:
4e 01 02 03 04 05 06 07 08 09
Producing CRC: 6d d2
I have tried a number of different methods but can't seem to reproduce the example. Can anyone provide some sample code that produces the CRC above?
You left out a key part of the example in the document, which is that the UAP used in the example is 0x47. The CRC needs to be initialized with the UAP. (Oddly, with the bits reversed and in the high byte, relative to the data bits coming in.)
The code below computes the example. The result is d26d. The CRC is transmitted least significant bit first, so it is sent 6d d2. On the receive side the same CRC is computed on the whole thing with the CRC, and the result is zero, which is how the receive side is supposed to check what was sent.
#include <stdio.h>
static unsigned crc_blue(unsigned char *payload, size_t len) {
unsigned crc = 0xe200; // UAP == 0x47
while (len--) {
crc ^= *payload++;
for (int k = 0; k < 8; k++)
crc = crc & 1 ? (crc >> 1) ^ 0x8408 : crc >> 1;
}
return crc;
}
int main(void) {
unsigned char payload[] = {
0x4e, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
printf("%04x\n", crc_blue(payload, sizeof(payload)));
unsigned char recvd[] = {
0x4e, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x6d, 0xd2};
printf("%04x\n", crc_blue(recvd, sizeof(recvd)));
return 0;
}
Your code would need to initialize the UAP appropriately for that device.
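As a sketch of what "initialize the UAP appropriately" could mean in code: the answer describes the initial value as the UAP with its bits reversed, placed in the high byte. The helper names crc_init_from_uap and crc_blue_uap below are made up for this example; the CRC loop itself is the one from the code above.

```c
#include <stddef.h>

/* Bit-reverse the 8-bit UAP and place it in the high byte of the
   16-bit CRC register, as described in the answer. */
static unsigned crc_init_from_uap(unsigned char uap)
{
    unsigned rev = 0;
    for (int k = 0; k < 8; k++) {
        rev = (rev << 1) | (uap & 1u);
        uap >>= 1;
    }
    return rev << 8;
}

/* Same CRC loop as crc_blue, with the initial value derived from the UAP. */
static unsigned crc_blue_uap(unsigned char uap,
                             const unsigned char *payload, size_t len)
{
    unsigned crc = crc_init_from_uap(uap);
    while (len--) {
        crc ^= *payload++;
        for (int k = 0; k < 8; k++)
            crc = crc & 1 ? (crc >> 1) ^ 0x8408 : crc >> 1;
    }
    return crc;
}
```

For the UAP 0x47 from the example, crc_init_from_uap returns the 0xe200 that is hard-coded above.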

Scatterlist in linux crypto api

I am starting to learn how to work with the Crypto API in Linux. It asks you to use scatterlist structures to pass plaintext to a block cipher function. A scatterlist refers to the plaintext by storing its location within a memory page. A simplified definition of struct scatterlist is:
struct scatterlist {
unsigned long page_link; //number of virtual page in kernel space where data buffer is stored
unsigned int offset; //offset from page start address to data buffer start address
unsigned int length; //data buffer length
dma_addr_t dma_address; //i don't know the purpose of this variable at the moment
};
To get a scatterlist that refers to a plaintext buffer we use the function void sg_init_one(struct scatterlist *, const void *, unsigned int);. To get the buffer start address back from a scatterlist we use the function void *sg_virt(struct scatterlist *sg).
For example:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/crypto.h>
#include <linux/scatterlist.h>
u8 plaintext_global[16]={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
static int __init simple_init (void){
u8 *ptr_to_local, *ptr_to_global;
u8 plaintext_local[16]={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
struct scatterlist sg[2];
sg_init_one(&sg[0], plaintext_local, 16);
sg_init_one(&sg[1], plaintext_global, 16);
printk("sg[0].page_link=%u\n", sg[0].page_link);
printk("sg[0].offset=%u\n", sg[0].offset);
printk("sg[0].length=%u\n", sg[0].length);
printk("sg[1].page_link=%u\n", sg[1].page_link);
printk("sg[1].offset=%u\n", sg[1].offset);
printk("sg[1].length=%u\n", sg[1].length);
ptr_to_local=sg_virt(&sg[0]);
ptr_to_global=sg_virt(&sg[1]);
printk("plaintext_local start address:%p\n", plaintext_local);
printk("sg_virt(&sg[0]):%p\n", ptr_to_local);
printk("plaintext_global start address:%p\n", plaintext_global);
printk("sg_virt(&sg[1]):%p\n", ptr_to_global);
}
And the output in dmesg after running insmod on this module:
sg[0].page_link=31209922
sg[0].offset=3168
sg[0].length=16
sg[1].page_link=16853378
sg[1].offset=0
sg[1].length=16
plaintext_local start address:ffff8800770e7c60
sg_virt(&sg[0]):ffff8800770e7c60
plaintext_global start address:ffffffffc04a6000
sg_virt(&sg[1]):ffff8800404a6000
My first question: why does sg_virt return the same value as the buffer address for the local plaintext buffer, while for the global plaintext buffer the value returned by sg_virt has a different prefix than the buffer address?
Next, I use the crypto API:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/crypto.h>
#include <linux/scatterlist.h>
u8 aes_in[]={0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff};
u8 aes_key[]={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
u8 aes_out[]={0x69, 0xc4, 0xe0, 0xd8, 0x6a, 0x7b, 0x04, 0x30, 0xd8, 0xcd, 0xb7, 0x80, 0x70, 0xb4, 0xc5, 0x5a};
static int __init simple_init (void){
struct crypto_blkcipher *blk;
struct blkcipher_desc desc;
struct scatterlist sg[3];
u8 encrypted[100];
u8 decrypted[100];
blk=crypto_alloc_blkcipher("ecb(aes)",0,0);
crypto_blkcipher_setkey(blk, aes_key, 16);
sg_init_one(&sg[0], aes_in, 16);
sg_init_one(&sg[1], encrypted, 16);
sg_init_one(&sg[2], decrypted, 16);
desc.tfm=blk;
desc.flags=0;
sg_copy_from_buffer(&sg[0],1,aes_in, 16);
crypto_blkcipher_encrypt(&desc, &sg[1], &sg[0], 16);
crypto_blkcipher_decrypt(&desc, &sg[2], &sg[1], 16);
crypto_free_blkcipher(blk);
}
Encrypted data: 69 c4 e0 d8 6a 7b 04 30 d8 cd b7 80 70 b4 c5 5a
Decrypted data: 00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff
Next question: what exactly does the sg_copy_from_buffer function do? Without it the encrypted data is not right:
Encrypted data without sg_copy_from_buffer : 03 07 23 fc 20 11 42 c6 60 b3 36 07 eb c8 c9 62
Encrypted data without sg_copy_from_buffer : 00 00 00 00 00 00 00 00 58 51 02 a0 f7 7f 00 00
For your first question, the scatterlist stores the buffer you give it as a struct page internally (the "page_link" is actually a pointer to a struct page), which you can think of as a physical address (not exactly, but a struct page does represent a unique physical page).
That means sg_init_one first converts the buffer's virtual address to the corresponding physical address, which ends up in the macro __pa. When you call sg_virt, it converts the physical address stored in the scatterlist back to a virtual address through another macro, __va.
__pa converts a virtual address within the linear mapping address range or the kernel image address range to its corresponding physical address. __va converts a physical address to its corresponding virtual address within the linear mapping address range. They will probably give a wrong result when converting an address outside those ranges.
However, the global buffer you give to the scatterlist is in the kernel module address range, which lies above the kernel image address range, and the local buffer is on the kernel stack, which lies below the kernel image address range. Neither is guaranteed to be inside the linear mapping address range, so after the round trip through __pa and __va the result can be wrong.
In your test the local buffer address is right but the global buffer address is wrong. This is because the local buffer address is below the kernel image address range and the global buffer address is above it, so they are converted differently in __pa but the same way in __va. You can see it in the following snippet from the Linux kernel sources arch/x86/include/asm/page.h and arch/x86/include/asm/page_64.h.
// This function is the implementation of __pa on x86-64
static inline unsigned long __phys_addr_nodebug(unsigned long x)
{
// __START_KERNEL_map is the start address of the kernel image address range.
unsigned long y = x - __START_KERNEL_map;
// You can see that this function behaves differently depending on x and __START_KERNEL_map
/* use the carry flag to determine if x was < __START_KERNEL_map */
// phys_base is the start of system's physical address
// PAGE_OFFSET is the start of linear mapping address range
x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
return x;
}
// This is the implementation of __va on x86-64
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
For a virtual address below the kernel image address range, __va just adds an offset and __pa just subtracts the same offset, so the local buffer address comes back unchanged. For a virtual address above the start of the kernel image range, __va does the same work but __pa behaves differently, so the global buffer address comes back wrong.
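The arithmetic can be replayed in user space with the two addresses from the dmesg output. This is only a sketch: the three constants are assumptions (typical values for an older x86-64 kernel without address randomisation, with phys_base taken as 0), sim_pa and sim_va are stand-ins for __pa and __va, and the real scatterlist path goes through struct page rather than calling these macros directly.

```c
#include <stdint.h>

/* Assumed layout constants, not read from any running kernel. */
#define START_KERNEL_MAP 0xffffffff80000000ULL  /* __START_KERNEL_map */
#define PAGE_OFFSET_SIM  0xffff880000000000ULL  /* PAGE_OFFSET */
#define PHYS_BASE_SIM    0ULL                   /* phys_base */

/* Mirrors __phys_addr_nodebug from the snippet above. */
static uint64_t sim_pa(uint64_t x)
{
    uint64_t y = x - START_KERNEL_MAP;
    return y + ((x > y) ? PHYS_BASE_SIM
                        : (START_KERNEL_MAP - PAGE_OFFSET_SIM));
}

/* Mirrors __va: physical address plus the linear-mapping base. */
static uint64_t sim_va(uint64_t pa)
{
    return pa + PAGE_OFFSET_SIM;
}
```

With these constants, sim_va(sim_pa(0xffff8800770e7c60)) gives back the same address, while sim_va(sim_pa(0xffffffffc04a6000)) gives 0xffff8800404a6000, which is the pair of values printed in the question.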
For x86-64 linux kernel memory layout, please refer to linux source code at Documentation/x86/x86_64/mm.rst
Note that only buffers allocated by kmalloc are sure to stay in the kernel linear mapping address range, so you should always use kmalloc to allocate memory for Linux kernel crypto operations.
For your second question, a scatterlist can be made of a single struct scatterlist. However, it is designed for managing a list of memory chunks, each represented by a struct scatterlist. With sg_copy_from_buffer you can copy the data stored in a contiguous buffer to a list of memory chunks managed by several struct scatterlists. In short, sg_copy_from_buffer has nothing to do with encryption.
For more details, please refer to the following kernel source code files.
/include/linux/scatterlist.h
/lib/scatterlist.c
/arch/x86/include/asm/page.h
/arch/x86/include/asm/page_64.h

How to align variables in a struct for PIC24 chips?

I have this struct, but knowing that every 4th byte is not used in memory, I need to align the struct correctly in memory. I'm not exactly sure how to do this, though I know that I'm supposed to and I also know where it needs to happen.
typedef struct _vol_meta {
uint16_t crc; // 2 bytes
uint8_t ver_major; // 1 byte
char pad1; // need to align here - 1 byte
uint16_t size; // 2 bytes
uint8_t ver_minor; // 1 byte
char pad2; // need to align here - 1 byte
uint8_t pagenum; // 1 byte
uint8_t rownum; // 1 byte
char pad3[2]; // align here - 2 bytes
uint8_t name[15]; // 15 bytes
// not sure how I'm supposed to align the array of uint8_t vars?
} VOL_META;
Is there some kind of c data type like
align 2
That tells the compiler to skip the next 2 bytes or something? Kind of lost here.
You can use the (surprise) 'aligned' attribute, like this:
__attribute__((aligned(2))) //word alignment
Section 8.12 of the XC16 user guide is your friend.
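Applied to the struct from the question, the attribute might be used as sketched below. This assumes a GCC-compatible compiler (XC16 is GCC-based) and drops the manual pad members so the compiler inserts the padding itself; whether that matches the memory layout the asker actually needs on the PIC24 is not something this sketch can confirm.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct vol_meta {
    uint16_t crc;        /* offset 0 */
    uint8_t  ver_major;  /* offset 2, followed by 1 byte of padding */
    uint16_t size;       /* offset 4 */
    uint8_t  ver_minor;  /* offset 6 */
    uint8_t  pagenum;    /* offset 7 */
    uint8_t  rownum;     /* offset 8 */
    uint8_t  name[15];   /* offsets 9..23 */
} __attribute__((aligned(2))) VOL_META;

/* Compile-time checks that the 16-bit member and the whole struct
   land on word boundaries. */
_Static_assert(offsetof(VOL_META, size) % 2 == 0,
               "size must sit on a word boundary");
_Static_assert(sizeof(VOL_META) % 2 == 0,
               "struct size must be a whole number of words");
```

The static assertions make the build fail if a later change to the struct breaks the word alignment.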

Linux sys_call_table rip relative addressing x86_64

I am trying to get the offset of sys_call_table on Linux x86_64.
First of all I read the pointer to the system_call entry from MSR_LSTAR, and it is correct:
static unsigned long read_msr(unsigned int msr)
{
unsigned low, high;
asm volatile("rdmsr" : "=a" (low), "=d" (high) : "c" (msr));
return ((low) | ((u64)(high) << 32));
}
Then I parse it to find the opcode of the call instruction, and that is also correct:
#define CALL_OP 0xFF
#define CALL_MODRM 0x14
static unsigned long find_syscall_table(unsigned char *ptr)
{
//correct
for (; (*ptr != CALL_OP) || (*(ptr+1) != CALL_MODRM); ptr++);
//not correct
ptr += *(unsigned int*)(ptr + 3);
pr_info("%lx", (unsigned long)ptr);
return ptr;
}
But I failed to get the address after the call opcode. The first byte at ptr is the opcode, then the ModRM byte, then the SIB byte, and then a 32-bit displacement, so I add 3 to ptr, dereference it as an integer value, and add that to ptr, because ptr is %RIP and the address is RIP-relative. But the resulting value is wrong; it doesn't match the value I see in gdb. Where am I wrong?
It's not 0x7e9fed00 but rather -0x7e9fed00, a negative displacement.
That is the signed reading of the 32-bit two's complement value 0x81601300,
which a little-endian processor stores as "00 13 60 81".
I have no idea whether you will find sys_call_table at the resulting address, however. As an alternative, some people find it by searching memory for the known pointers to functions that should be listed in it.
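A sketch of reading the displacement as a signed value follows. It assumes the instruction has the form call *disp32(,%rax,8), that is, opcode FF, ModRM 14, and a SIB byte whose base field is 101; the question does not show the SIB byte, so 0xC5 in the usage note is an assumption. In that encoding the operand is the sign-extended disp32 plus the scaled index, not a RIP-relative address, so nothing is added to ptr. The function name is made up for the example.

```c
#include <stdint.h>

/* insn points at the FF 14 <SIB> bytes; the 32-bit displacement follows
   at offset 3, least significant byte first. The result is the
   displacement sign-extended to 64 bits. */
static uint64_t call_table_address(const unsigned char *insn)
{
    uint32_t raw = (uint32_t)insn[3]
                 | ((uint32_t)insn[4] << 8)
                 | ((uint32_t)insn[5] << 16)
                 | ((uint32_t)insn[6] << 24);
    int64_t disp = (raw & 0x80000000u) ? (int64_t)raw - 0x100000000LL
                                       : (int64_t)raw;
    return (uint64_t)disp;
}
```

For the bytes FF 14 C5 00 13 60 81 this yields 0xffffffff81601300, which is -0x7e9fed00 viewed as a 64-bit address.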

128-bit division intrinsic in Visual C++

I'm wondering if there really is no 128-bit division intrinsic function in Visual C++?
There is a 64x64=128 bit multiplication intrinsic function called _umul128(), which nicely matches the MUL x64 assembler instruction.
Naturally, I assumed there would be a 128/64=64 bit division intrinsic as well (modelling the DIV instruction), but to my amazement neither Visual C++ nor Intel C++ seem to have it, at least it's not listed in intrin.h.
Can someone confirm that? I tried grep'ing for the function names in the compiler executable files, but couldn't find _umul128 in the first place, so I guess I looked in the wrong spot.
Update: at least I have now found the pattern umul128 (without the leading underscore) in c1.dll of Visual C++ 2010. All the other intrinsics are listed around it, but unfortunately no "udiv128" or the like :( So it seems they really have "forgotten" to implement it.
To clarify: I'm not only looking for a 128-bit data type, but a way to divide a 128-bit scalar int by a 64-bit int in C++. Either an intrinsic function or native 128-bit integer support would solve my problem.
Edit: The answer is no, there is no _udiv128 intrinsic in Visual Studio 2010 up to 2017, but it is available in Visual Studio 2019 RTM
If you don't mind little hacks, this may help (64-bit mode only, not tested):
#include <windows.h>
#include <stdio.h>
unsigned char udiv128Data[] =
{
0x48, 0x89, 0xD0, // mov rax,rdx
0x48, 0x89, 0xCA, // mov rdx,rcx
0x49, 0xF7, 0xF0, // div r8
0x49, 0x89, 0x11, // mov [r9],rdx
0xC3 // ret
};
unsigned char sdiv128Data[] =
{
0x48, 0x89, 0xD0, // mov rax,rdx
0x48, 0x89, 0xCA, // mov rdx,rcx
0x49, 0xF7, 0xF8, // idiv r8
0x49, 0x89, 0x11, // mov [r9],rdx
0xC3 // ret
};
unsigned __int64 (__fastcall *udiv128)(unsigned __int64 numhi,
unsigned __int64 numlo,
unsigned __int64 den,
unsigned __int64* rem) =
(unsigned __int64 (__fastcall *)(unsigned __int64,
unsigned __int64,
unsigned __int64,
unsigned __int64*))udiv128Data;
__int64 (__fastcall *sdiv128)(__int64 numhi,
__int64 numlo,
__int64 den,
__int64* rem) =
(__int64 (__fastcall *)(__int64,
__int64,
__int64,
__int64*))sdiv128Data;
int main(void)
{
DWORD dummy;
unsigned __int64 ur;
__int64 sr;
VirtualProtect(udiv128Data, sizeof(udiv128Data), PAGE_EXECUTE_READWRITE, &dummy);
VirtualProtect(sdiv128Data, sizeof(sdiv128Data), PAGE_EXECUTE_READWRITE, &dummy);
printf("0x00000123456789ABCDEF000000000000 / 0x0001000000000000 = 0x%llX\n",
udiv128(0x00000123456789AB, 0xCDEF000000000000, 0x0001000000000000, &ur));
printf("-6 / -2 = %lld\n",
sdiv128(-1, -6, -2, &sr));
return 0;
}
A small improvement - one less instruction
extern "C" digit64 udiv128(digit64 low, digit64 hi, digit64 divisor, digit64 *remainder);
; Arguments
; RCX Low Digit
; RDX High Digit
; R8 Divisor
; R9 *Remainder
; RAX Quotient upon return
.code
udiv128 proc
mov rax, rcx ; Put the low digit in place (hi is already there)
div r8 ; 128 bit divide rdx-rax/r8 = rdx remainder, rax quotient
mov [r9], rdx ; Save the remainder
ret ; Return the quotient
udiv128 endp
end
It's available now. You can use _div128 and _udiv128.
The _div128 intrinsic divides a 128-bit integer by a 64-bit integer. The return value holds the quotient, and the intrinsic returns the remainder through a pointer parameter. _div128 is Microsoft specific.
Last year it was said to be available from "Dev16", but I'm not sure which version that is. I guess it's VS 16.0, a.k.a. VS2019, but the documentation on MSDN shows it going back to VS2015.
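A thin wrapper around the intrinsic could look like the sketch below. The name div128by64 is made up, the _MSC_VER >= 1920 check is my assumption for "VS2019 or later", and the unsigned __int128 branch is only a fallback for GCC/Clang so the same call can be compiled elsewhere. As with the DIV instruction, the caller has to make sure the high half is smaller than the divisor, otherwise the quotient does not fit in 64 bits.

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64) && _MSC_VER >= 1920
#include <immintrin.h>
#define HAVE_UDIV128_INTRINSIC 1
#endif

/* Divide the 128-bit value hi:lo by den; returns the quotient and stores
   the remainder in *rem. Requires hi < den. */
static uint64_t div128by64(uint64_t hi, uint64_t lo, uint64_t den,
                           uint64_t *rem)
{
#ifdef HAVE_UDIV128_INTRINSIC
    return _udiv128(hi, lo, den, rem);
#else
    /* Fallback for compilers that provide a native 128-bit integer type. */
    unsigned __int128 num = ((unsigned __int128)hi << 64) | lo;
    *rem = (uint64_t)(num % den);
    return (uint64_t)(num / den);
#endif
}
```

Called with the numbers from the first test above, div128by64(0x00000123456789AB, 0xCDEF000000000000, 0x0001000000000000, &r) returns 0x0123456789ABCDEF with a remainder of 0.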
I am no expert, but I dug this up:
http://research.swtch.com/2008/01/division-via-multiplication.html
Interesting stuff. Hope it helps.
EDIT: This is insightful too: http://www.gamedev.net/topic/508197-x64-div-intrinsic/
Thanks @alexey-frunze, it worked with a little tweak for VS2017; checked with the same parameters on VS2019:
#include <iostream>
#include <string.h>
#include <math.h>
#include <immintrin.h>
#define no_init_all
#include <windows.h>
unsigned char udiv128Data[] =
{
0x48, 0x89, 0xD0, // mov rax,rdx
0x48, 0x89, 0xCA, // mov rdx,rcx
0x49, 0xF7, 0xF0, // div r8
0x49, 0x89, 0x11, // mov [r9],rdx
0xC3 // ret
};
unsigned char sdiv128Data[] =
{
0x48, 0x89, 0xD0, // mov rax,rdx
0x48, 0x89, 0xCA, // mov rdx,rcx
0x49, 0xF7, 0xF8, // idiv r8
0x49, 0x89, 0x11, // mov [r9],rdx
0xC3 // ret
};
unsigned __int64(__fastcall* udiv128)(
unsigned __int64 numhi,
unsigned __int64 numlo,
unsigned __int64 den,
unsigned __int64* rem) =
(unsigned __int64(__fastcall*)(
unsigned __int64,
unsigned __int64,
unsigned __int64,
unsigned __int64*))
((unsigned __int64*)udiv128Data);
__int64(__fastcall *sdiv128)(
__int64 numhi,
__int64 numlo,
__int64 den,
__int64* rem) =
(__int64(__fastcall *)(
__int64,
__int64,
__int64,
__int64*))
((__int64*)sdiv128Data);
void test1()
{
unsigned __int64 a = 0x3c95ba9e6a637e7;
unsigned __int64 b = 0x37e739d13a6d036;
unsigned __int64 c = 0xa6d036507ecc7a7;
unsigned __int64 d = 0x7ecc37a70c26e68;
unsigned __int64 e = 0x6e68ac7e5f15726;
DWORD dummy;
VirtualProtect(udiv128Data, sizeof(udiv128Data), PAGE_EXECUTE_READWRITE, &dummy);
e = udiv128(a, b, c, &d);
printf("d = %llx, e = %llx\n", d, e); // d = 1ed37bdf861c50, e = 5cf9ffa49b0ec9aa
}
void test2()
{
__int64 a = 0x3c95ba9e6a637e7;
__int64 b = 0x37e739d13a6d036;
__int64 c = 0xa6d036507ecc7a7;
__int64 d = 0x7ecc37a70c26e68;
__int64 e = 0x6e68ac7e5f15726;
DWORD dummy;
VirtualProtect(sdiv128Data, sizeof(sdiv128Data), PAGE_EXECUTE_READWRITE, &dummy);
e = sdiv128(a, b, c, &d);
printf("d = %llx, e = %llx\n", d, e); // d = 1ed37bdf861c50, e = 5cf9ffa49b0ec9aa
}
int main()
{
test1();
test2();
return 0;
}
