Since FPU operations are very costly, I would like to use them as little as possible. In the meantime, I'm wondering which operations on float variables count as operations that involve the FPU. For example, would the following code involve the FPU?
struct my_float_struct {
    int f;
} g;

void func(float a)
{
    g.f = a;
}
Would calling func cause a lazy FPU context switch?
In the kernel you shouldn't use float at all (use lookup tables, and multiply by precomputed coefficients instead of dividing, to preserve precision). In user space, any operation with a float on the RHS will lead to FPU instruction usage. If you have 2 integers on the RHS but assign the result to a float on the LHS, you will end up with integer operations, but also with a convert-to-float instruction (like cvtsi2ss).
It's actually very easy to answer your question by just looking at the disassembly of the corresponding C code.
C code (built with gcc -Wall -O0 -g main.c):
int main(void)
{
    volatile float f1, f2;
    volatile int i;

    f1 = i * 356;
    f2 = f1 * f2;

    return 0;
}
Disassembly (using objdump -DS a.out):
int main(void)
{
4004b6: 55 push %rbp
4004b7: 48 89 e5 mov %rsp,%rbp
volatile float f1, f2;
volatile int i;
f1 = i * 356;
4004ba: 8b 45 f4 mov -0xc(%rbp),%eax
4004bd: 69 c0 64 01 00 00 imul $0x164,%eax,%eax
4004c3: 66 0f ef d2 pxor %xmm2,%xmm2
4004c7: f3 0f 2a d0 cvtsi2ss %eax,%xmm2
4004cb: 66 0f 7e d0 movd %xmm2,%eax
4004cf: 89 45 fc mov %eax,-0x4(%rbp)
f2 = f1 * f2;
4004d2: f3 0f 10 4d fc movss -0x4(%rbp),%xmm1
4004d7: f3 0f 10 45 f8 movss -0x8(%rbp),%xmm0
4004dc: f3 0f 59 c8 mulss %xmm0,%xmm1
4004e0: 66 0f 7e c8 movd %xmm1,%eax
4004e4: 89 45 f8 mov %eax,-0x8(%rbp)
return 0;
4004e7: b8 00 00 00 00 mov $0x0,%eax
}
From here you can see that:
when multiplying an integer by an integer but assigning the result to a float, you get an imul instruction (integer multiply), but also a cvtsi2ss (convert-to-float) instruction.
when multiplying 2 floats, you get a mulss instruction, which is a float multiply.
Basically, you can see immediately whether the FPU is involved: the instructions operate on FPU registers, like %xmm0, %xmm1, etc.
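To illustrate the lookup-table/coefficient advice at the top of this answer: a division by a constant can be replaced by an integer multiply and a shift. A minimal sketch (the 2^10 scale factor and the divide-by-1.8 example are my own choices, not from the question):
#include <stdio.h>

#define SCALE_SHIFT 10                 /* fixed-point scale: 2^10 = 1024 */
#define RECIP_1_8   569                /* round((1 / 1.8) * 1024)        */

/* Approximate x / 1.8 using only integer instructions (no %xmm registers). */
static int div_by_1_8(int x)
{
    return (x * RECIP_1_8) >> SCALE_SHIFT;
}

int main(void)
{
    printf("%d\n", div_by_1_8(180));   /* prints 100 */
    return 0;
}
Compile it and apply the same objdump trick as above, and you will see no %xmm registers in the output.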
RFC Reference
I am working on a project which involves sockets programming and interpreting the output from DIG DNS queries.
I'm using RFC 1035 as my reference. Although this is quite old now (1987), as far as I can tell from later RFCs (for example, RFC 8490) the DNS headers are still the same.
https://www.rfc-editor.org/rfc/rfc1035
Code Overview: IPv6 TCP query
I have written a short program in C which reads from an IPv6 TCP socket. I send data to this socket using dig. (My program simply reads all data it sees on the socket, and prints it to stdout.)
Note that there are two unusual things here:
Firstly the use of IPv6
Secondly the use of TCP (DNS messages are often UDP)
Here is the command used:
dig @::1 -p 8053 duckduckgo.com +tcp
I am running dig version DiG 9.16.13-Debian, on Debian Testing. (circa 2021-May)
Output, Discussion and Question
Here is the hexadecimal and printable character output which is read from the socket:
Hex:
00 37 61 78 01 20 00 01 00 00 00 00 00 01 0A 64 75 63 6B 64 75 63 6B 67 6F 03 63 6F 6D 00 00 01 00 01 00 00 29 10 00 00 00 00 00 00 0C 00 0A 00 08 00 7A 4* 48 2C 16 0* 33
Char:
00 7 61 x 01 20 00 01 00 00 00 00 00 01 0A d u c k d u c k g o 03 c o m 00 00 01 00 01 00 00 ) 10 00 00 00 00 00 00 0C 00 0A 00 08 00 z 4* H , 16 0* 33
If non-printable characters are encountered, the hex value is printed instead.
Although this is a fairly long stream of data, the question relates to the length of the header.
According to RFC 1035, the length of the header should be 12 bytes.
4.1.1. Header section format
The header contains the following fields:
1 1 1 1 1 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| ID |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR| Opcode |AA|TC|RD|RA| Z | RCODE |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| QDCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| ANCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| NSCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| ARCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The header is followed by a QUESTION SECTION. The question section begins with a single byte which specifies the length.
Inspecting the data stream above, we see that the byte at offset 12 has a value of 0. I repeat it below with offset numbers to make it clear. The data is in the middle row; the rows above and below are byte offsets.
0  1  2  3  4  5  6  7  8  9  10 11                       <- byte 12
00 37 61 78 01 20 00 01 00 00 00 00 00 01 0A 64 75 63 6B ...
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14              <- byte 15
This clearly doesn't make any sense.
Looking again at the stream, we can see that "duckduckgo" is preceded by the byte 0A. This is 10 in decimal and corresponds to the 10 characters of "duckduckgo". This string is followed by the byte 03, which corresponds to the 3 bytes of "com".
The offset of the byte 0A is 15. Not 12.
I must have misunderstood the RFC specification. But what have I misunderstood? Does the header itself start at a different offset to what I think it is? (Byte zero.) Or is there perhaps some padding between the end of the header and the beginning of the first question section?
Existing Question on this site:
Comments: The link below states that there is no padding. This is the only answer on that question. The question is about DNS responses rather than queries, and does not ask about the header section of a query. (Information from one should presumably apply to the other, but possibly does not.)
Do DNS messages pad names to an even number of bytes?
Comments: The link below asks about the best way to build a data structure to handle DNS data. Additionally, the answer notes that one has to be careful about network byte order versus machine byte order. I am already aware of this, and I use ntohs() to convert from network byte order to x86_64 byte order before printing information to stdout. This is not the problem, and it does not explain why I see information about the DNS query starting at byte 15 instead of 12, when the header should be a fixed size of 12 bytes.
Implementing a DNS Query in c++ according to RFC 1035
Thanks to @SteffenUllrich who prompted the solution for this in the comments.
RFC 1035 4.2.2 states
4.2.2. TCP usage
Messages sent over TCP connections use server port 53 (decimal). The
message is prefixed with a two byte length field which gives the message
length, excluding the two byte length field. This length field allows
the low-level processing to assemble a complete message before beginning
to parse it.
I had removed the 2-byte field at the start of my struct at some point.
This is what the structure looks like with the 2 byte length field re-enabled.
struct __attribute__((__packed__)) dns_header
{
    unsigned short ID;
    union
    {
        unsigned short FLAGS;
        struct
        {
            unsigned short QR : 1;
            unsigned short OPCODE : 4;
            unsigned short AA : 1;
            unsigned short TC : 1;
            unsigned short RD : 1;
            unsigned short RA : 1;
            unsigned short Z : 3;
            unsigned short RCODE : 4;
        };
    };
    unsigned short QDCOUNT;
    unsigned short ANCOUNT;
    unsigned short NSCOUNT;
    unsigned short ARCOUNT;
};

struct __attribute__((__packed__)) dns_struct_tcp
{
    unsigned short length; // length excluding 2 bytes for length field
    struct dns_header header;
};
For example: I received a TCP packet of length 53 bytes. The value of length is set to 51.
To read data into this struct:
memcpy(&dnsdata, buf, sizeof(struct dns_struct_tcp));
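Note that buf must already contain the whole message; the 2-byte prefix tells you how much to read from the socket in the first place. A sketch, where read_exact() is a hypothetical helper that loops on recv() until n bytes have arrived:
uint16_t msg_len;
read_exact(sockfd, &msg_len, sizeof(msg_len)); /* 2-byte length prefix        */
msg_len = ntohs(msg_len);                      /* network -> host byte order  */
read_exact(sockfd, buf, msg_len);              /* then the DNS message itself */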
To interpret this data (since it is stored in network byte order):
void dns_header_print(FILE *file, const struct dns_header *header)
{
    fprintf(file, "ID: %u\n", ntohs(header->ID));
    char str_FLAGS[8 * sizeof(unsigned short) + 1];
    str_FLAGS[8 * sizeof(unsigned short)] = '\0';
    print_binary_16_fixed_width(str_FLAGS, header->FLAGS);
    fprintf(file, "FLAGS: %s\n", str_FLAGS);
    fprintf(file, "FLAGS: QOP ATRRZZZR \n");
    fprintf(file, " RCODEACDA CODE\n");
    fprintf(file, "QDCOUNT: %u\n", ntohs(header->QDCOUNT));
    fprintf(file, "ANCOUNT: %u\n", ntohs(header->ANCOUNT));
    fprintf(file, "NSCOUNT: %u\n", ntohs(header->NSCOUNT));
    fprintf(file, "ARCOUNT: %u\n", ntohs(header->ARCOUNT));
}
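(print_binary_16_fixed_width is not shown above; a minimal sketch of what it could look like, writing the 16 bits MSB-first:)
static void print_binary_16_fixed_width(char *out, unsigned short v)
{
    /* Write the 16 bits of v into out as '0'/'1' characters, most
       significant bit first; the caller supplies the buffer and the
       terminating '\0'. */
    for (int i = 0; i < 16; i++)
        out[i] = (v & (1u << (15 - i))) ? '1' : '0';
}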
Note that the flags are passed through unchanged, since each flag field is less than 8 bits long. However, on x86_64 systems unsigned short is stored in little-endian format, hence ntohs() is used to convert data in big-endian (network) byte order to little-endian (host) byte order.
I have learned that MIFARE Classic authentication has a weakness related to the parity bits. But I wonder how the reader sends the parity bits to the tag.
For example, here is the trace of a failed authentication attempt:
reader:26
tag: 02 00
reader:93 20
tag: c1 08 41 6a e2
reader:93 70 c1 08 41 6a e2 e4 7c
tag: 18 37 cd
reader:60 00 f5 7b
tag: ab cd 19 49
reader:59 d5 92 0f 15 b9 d5 53
tag: a //error code: 0x5
I know that after the anti-collision, the tag sends NT (32 bits) as a challenge to the reader, and the reader responds with the challenge {NR} (32 bits) and {AR} (32 bits). But I don't know where the parity bits are in the above example. Which are the parity bits?
The example trace that you posted in your question either does not contain information about parity bits or all parity bits were valid (according to ISO/IEC 14443-3).
E.g. when the communication trace shows that the reader sends 60 00 f5 7b, the actual data sent over the RF interface would be (P is the parity bit):
b1 ... b8 P b1 ... b8 P b1 ... b8 P b1 ... b8 P
S 0000 0110 1 0000 0000 1 1010 1111 1 1101 1110 1 E
A parity bit is sent after every 8 data bits (i.e. after each octet) and ensures that those 9 bits together contain an odd number of binary ones (odd parity). It therefore forms a 1-bit checksum over that byte. In your trace, only the bytes (but not the parity bits between them) are shown.
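(In C, the odd-parity bit for one byte could be computed like this; a minimal sketch:)
#include <stdint.h>

/* ISO/IEC 14443-3 odd parity: return the parity bit that follows the 8
   data bits, chosen so that all 9 bits contain an odd number of ones. */
static uint8_t odd_parity(uint8_t b)
{
    b ^= b >> 4;          /* fold the byte's bits down onto bit 0 */
    b ^= b >> 2;
    b ^= b >> 1;
    return (b & 1) ^ 1;   /* 1 when the byte itself has an even number of ones */
}
For the frame above: 0x60 and 0x00 each contain an even number of ones, so each is followed by a parity bit of 1, matching the bit-level expansion shown.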
The vulnerability regarding parity bits in MIFARE Classic is that parity bits are encrypted together with the actual data (cf. de Koning Gans, Hoepman,
and Garcia (2008): A Practical Attack on the MIFARE Classic, in CARDIS 2008, LNCS 5189, pp. 267-282, Springer).
Consequently, when you look at the communication trace without considering encryption, there may be parity errors according to the ISO/IEC 14443-3 parity calculation rule since the encrypted parity bit might not match the parity bit for the raw data stream. Tools like the Proxmark III would indicate such observed parity errors as exclamation marks ("!") after the corresponding bytes in the communication trace.
I'm trying to follow section A.1.2 of RFC 6979 and am having some difficulty.
So h1 is as follows:
h1
AF 2B DB E1 AA 9B 6E C1 E2 AD E1 D6 94 F4 1F C7
1A 83 1D 02 68 E9 89 15 62 11 3D 8A 62 AD D1 BF
If that is run through bits2octets(h1) you're supposed to get this:
01 79 5E DF 0D 54 DB 76 0F 15 6D 0D AC 04 C0 32
2B 3A 20 42 24
I don't understand how.
Here's bits2octets defined in Java (from the RFC):
private byte[] bits2octets(byte[] in)
{
    BigInteger z1 = bits2int(in);
    BigInteger z2 = z1.subtract(q);
    return int2octets(z2.signum() < 0 ? z1 : z2);
}
Here's bits2int:
private BigInteger bits2int(byte[] in)
{
    BigInteger v = new BigInteger(1, in);
    int vlen = in.length * 8;
    if (vlen > qlen) {
        v = v.shiftRight(vlen - qlen);
    }
    return v;
}
Here's q:
q = 0x4000000000000000000020108A2E0CC0D99F8A5EF
h1 is 32 bytes long. q is 21 bytes long.
So bits2int returns the first 21 bytes of h1, i.e.
af2bdbe1aa9b6ec1e2ade1d694f41fc71a831d0268
Convert that to an integer and then subtract q and you get this:
af2bdbe1aa9b6ec1e2ade1d694f41fc71a831d0268
- 04000000000000000000020108A2E0CC0D99F8A5EF
------------------------------------------
ab2bdbe1aa9b6ec1e2addfd58c513efb0ce9245c79
The result is positive, so it (z2) is kept.
Then int2octets() is called.
private byte[] int2octets(BigInteger v)
{
    byte[] out = v.toByteArray();
    if (out.length < rolen) {
        byte[] out2 = new byte[rolen];
        System.arraycopy(out, 0, out2, rolen - out.length, out.length);
        return out2;
    } else if (out.length > rolen) {
        byte[] out2 = new byte[rolen];
        System.arraycopy(out, out.length - rolen, out2, 0, rolen);
        return out2;
    } else {
        return out;
    }
}
q and v are the same size so ab2bdbe1aa9b6ec1e2addfd58c513efb0ce9245c79
is returned. But that's not what the test vector says:
bits2octets(h1)
01 79 5E DF 0D 54 DB 76 0F 15 6D 0D AC 04 C0 32
2B 3A 20 42 24
I don't get it. Did I mess up in my analysis somewhere?
The output is obtained as (0xaf2b...d1bf >> (256 - 163)) mod q = 0x0179...4224. Your mistake was assuming that bits2int shifts whole bytes; it shifts bits. h1 is 256 bits long and qlen is 163, so bits2int returns h1 >> 93, not the first 21 bytes of h1 (which would be h1 >> 88).
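Worked through with the correct shift:
z1 = bits2int(h1) = h1 >> 93
   = 0x05795EDF0D54DB760F156F0EB4A7A0FE38D418E813
z2 = z1 - q
   = 0x01795EDF0D54DB760F156D0DAC04C0322B3A204224
z2 is non-negative and already rolen (21) bytes long, so int2octets(z2) returns exactly the 01 79 5E DF ... 42 24 of the test vector.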
I'm developing a device driver for a Xilinx Virtex 6 PCIe custom board.
When doing a DMA write (from host to device), here is what happens:
user space app:
a. fill buffer with the following byte pattern (tested up to 16kB)
00 00 .. 00 (64 bytes)
01 01 .. 01 (64 bytes)
...
ff ff .. ff (64 bytes)
00 00 .. 00 (64 bytes)
01 01 .. 01 (64 bytes)
etc.
b. call custom ioctl to pass pointer to buffer and size
kernel space:
a. retrieve buffer (bufp) with
copy_from_user(ptdev->kbuf, bufp, cnt)
b. setup and start DMA
b1. //setup physical address
iowrite32(cpu_to_be32((u32) ptdev->kbuf_dma_addr),
ptdev->region0 + TDO_DMA_HOST_ADDR);
b2. //setup transfer size
iowrite32(cpu_to_be32( ((cnt+3)/4)*4 ),
ptdev->region0 + TDO_DMA_BYTELEN);
b3. // memory barrier to make sure kbuf is in memory
mb();
//start dma
b4. iowrite32(cpu_to_be32(TDO_DMA_H2A | TDO_DMA_BURST_FIXED | TDO_DMA_START),
ptdev->region0 + TDO_DMA_CTL_STAT);
c. put process to sleep
wait_res = wait_event_interruptible_timeout(ptdev->dma_queue,
!(tdo_dma_busy(ptdev, &dma_stat)),
timeout);
d. check wait_res result and dma status register and return
Note that the kernel buffer is allocated once at device probe with:
ptdev->kbuf = pci_alloc_consistent(dev, ptdev->kbuf_size, /* 512kB */
                                   &ptdev->kbuf_dma_addr);
device PCIe TLP dump (obtained through a logic analyzer after the Xilinx core):
a. TLP received (by the device)
a1. 40000001 0000000F F7C04808 37900000 (MWr corresponds to b1 above)
a2. 40000001 0000000F F7C0480C 00000FF8 (MWr corresponds to b2 above)
a3. 40000001 0000000F F7C04800 00010011 (MWr corresponds to b4 above)
b. TLP sent (by the device)
b1. 00000080 010000FF 37900000 (MRd 80h DW # addr 37900000h)
b2. 00000080 010000FF 37900200 (MRd 80h DW # addr 37900200h)
b3. 00000080 010000FF 37900400 (MRd 80h DW # addr 37900400h)
b4. 00000080 010000FF 37900600 (MRd 80h DW # addr 37900600h)
...
c. TLP received (by the device)
c1. 4A000020 00000080 01000000 00 00 .. 00 01 01 .. 01 CplD 128B
c2. 4A000020 00000080 01000000 02 02 .. 02 03 03 .. 03 CplD 128B
c3. 4A000020 00000080 01000000 04 04 .. 04 05 05 .. 05 CplD 128B
c4. 4A000020 00000080 01000000 06 06 .. 0A 0A 0A .. 0A CplD 128B <=
c5. 4A000010 00000040 01000040 07 07 .. 07 CplD 64B <=
c6. 4A000010 00000040 01000040 0B 0B .. 0B CplD 64B <=
c7. 4A000020 00000080 01000000 08 08 .. 08 09 09 .. 09 CplD 128B <=
c8. 4A000020 00000080 01000000 0C 0C .. 0C 0D 0D .. 0D CplD 128B
.. the remaining bytes are transferred correctly and the total number of bytes (FF8h) matches the requested size
signal interrupt
Now, this apparent memory ordering error happens with high probability (0.8 < p < 1), and the ordering mismatch happens at different random points in the transfer.
EDIT: Note that point c4 above indicates that the memory is not filled in the right order by the kernel driver (I suppose the memory controller fills TLPs with contiguous memory). Since 64B is the cache line size, maybe this has something to do with cache operations.
When I disable caching of the kernel buffer with
echo "base=0xaf180000 size=0x00008000 type=uncachable" > /proc/mtrr
the error still happens, but much more rarely (p < 0.1, and it depends on the transfer size).
This only happens on i7-4770 (Haswell) based machines (tested on 3 identical machines, with 3 boards).
I tried kernel 2.6.32 (RH6.5), stock 3.10.28, and stock 3.13.1 with the same results.
I tried the code and device in an i7-610 QM57 based machine and a Xeon 5400 based machine without any issues.
Any ideas/suggestions are welcome.
Best regards
Claudio
I know this is an old thread, but the reason for the "errors" is completion reordering. Multiple outstanding read requests don't have to be answered in order. Completions are only in order for the same request.
On top of that, the same tag is assigned to all of the requests, which is illegal while those requests are active at the same time.
In the example provided, all the MemRd TLPs have the same TAG. You can't reuse a TAG while you haven't yet received the last corresponding CplD carrying that TAG. So if you send a MemRd, wait until you get the CplD with this tag, and only then fire the next MemRd, all your data will arrive in order (but in this case bus utilization will be low and you can't get high bandwidth occupation).
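In pseudo-C, that safe-but-slow policy would look roughly like this (tlp_send_mrd() and tlp_wait_cpld() are hypothetical stand-ins for the device core's logic, not a real API):
#include <stdint.h>

/* One outstanding read at a time: never reuse a tag before its last CplD. */
static void dma_read_serialized(uint64_t host_addr, uint32_t total_dw, uint8_t tag)
{
    uint32_t done = 0;
    while (done < total_dw) {
        uint32_t chunk = total_dw - done;
        if (chunk > 0x80)
            chunk = 0x80;               /* 80h DW per request, as in the trace */
        tlp_send_mrd(host_addr + done * 4, chunk, tag);
        tlp_wait_cpld(tag);             /* block until the last completion for this tag */
        done += chunk;
    }
}
A faster core would instead allocate a distinct tag per outstanding request and reassemble the completions by tag.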
Also read this: pci_alloc_consistent uncached memory. It doesn't look like a cache issue on your platform. I would rather debug the device core.
QM57 supports PCIe 2.0
http://www.intel.com/Products/Notebook/Chipsets/QM57/qm57-overview.htm
whereas I imagine the mobo of the i7-4770 machine supports PCIe 3.0
http://ark.intel.com/products/75122
I suspect there might be some kind of negotiation failure between the PCIe 3.0 mobo and your V6 device (which is PCIe 2.0, too).
I have a binary file; the definition of its content is below (all data is stored in little endian, i.e. least significant byte first). The example numbers below are HEX:
11 63 39 46 --- Time, UTC in seconds since 1 Jan 1970.
01 00 --- 0001 = No Fix, 0002 = SPS
97 85 ff e0 7b db 4c 40 --- Latitude, as double
a1 d5 ce 56 8d 26 28 40 --- Longitude, as double
f0 37 e1 42 --- Height in meters, as float
fe 2b f0 3a --- Speed in km/h, as float
00 00 00 00 --- Heading (degrees ?), as float
01 00 --- RCR, log reason. 0001=Time, 0004=Distance
59 20 6a f3 4a 26 e3 3f --- Distance in meters, as double,
2a --- ? Don't know
a8 --- Checksum, xor of all bytes above not including 0x2a
The data from the binary file, in HEX, is as below:
"F25D39460200269652F5032445401F4228D79BCC54C09A3A2743B4ADE73F2A83"
I would appreciate your help translating this data line based on the definition above.
Probably wrong, but here's a shot at it using Ruby:
hex = "F25D39460200269652F5032445401F4228D79BCC54C09A3A2743B4ADE73F2A83"
ints = hex.scan(/../).map{ |s| s.to_i(16) }
raw = ints.pack('C*')
fields = raw.unpack( 'VvEEVVVvE')
p fields
#=> [1178164722, 2, 42.2813707974677, -83.1970117467067, 1126644378, 1072147892, nil, 33578, nil]
p Time.at( fields.first )
#=> 2007-05-02 21:58:42 -0600
I'd appreciate it if someone well-versed in #pack and #unpack would show me a better way to accomplish the first three lines.
My Cygnus Hex Editor could load such a file and, using structure templates, display the data in its native formats.
Beyond that, it's just a matter of going through each value and working out the translation for each byte.
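For reference, the record definition from the question maps naturally onto a packed C struct (a sketch; the field names are mine, and note the 32-byte sample above appears to stop after the speed field):
#include <stdint.h>

/* All fields are little-endian, as the file format specifies, so on a
   little-endian host the raw record bytes can be memcpy'd straight in. */
struct __attribute__((__packed__)) gps_record {
    uint32_t time_utc;    /* seconds since 1 Jan 1970                 */
    uint16_t fix;         /* 0001 = No Fix, 0002 = SPS                */
    double   latitude;
    double   longitude;
    float    height_m;
    float    speed_kmh;
    float    heading_deg;
    uint16_t rcr;         /* log reason: 0001 = Time, 0004 = Distance */
    double   distance_m;
    uint8_t  unknown;     /* the 0x2A byte                            */
    uint8_t  checksum;    /* XOR of all preceding bytes               */
};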