Why is string manipulation more expensive? - string

I've heard this so many times, that I have taken it for granted. But thinking back on it, can someone help me realize why string manipulation, say comparison etc, is more expensive than say an integer, or some other primitive?

8bit example:
1 bit can be 1 or 0. With 2 bits you can represent 0, 1, 2, and 3. And so on.
With a byte you have 2^8 possibilities, from 0 to 255.
In a string a single letter is stored in a byte, so "Hello world" is 11 bytes.
If I want to do 100 + 100, 100 is stored in 1 byte of memory, I need only two bytes to sum two numbers. The result will need again 1 byte.
Now let's try with strings, "100" + "100", this is 3 bytes plus 3 bytes and the result, "100100" needs 6 bytes to be stored.
This is over-simplified, but more or less it works in this way.

The int data type in C# was carefully selected to be a good match with processor design. Which can store an int in a cpu register, a storage location that's an easy factor of 3 faster than memory. And a single cpu instruction to compare values of type int. The CMP instruction runs in less than a single cpu cycle, a fraction of a nano-second.
That doesn't work nearly as well for a string, it is a variable length data type and every single char in the string must be compared to test for equality. So it is automatically proportionally slower by the size of the string. Furthermore, string comparison is afflicted by culture dependent comparison rules. The kind that make "ss" and "ß" equal in German and "Aa" and "Å" equal in Danish. Nothing subtle to deal with, taken care of by highly optimized table-driven code inside the CLR. It can't beat CMP.

I've always thought it was because of the immutability of strings. That is, every time you make a change to the string, it requires allocating memory for a whole new string (rather than modifying the original in place).
Probably a woefully naive understanding but perhaps someone else can expound further.

There are several things to consider when looking at the "cost" of manipulating strings.
There is the cost in terms of memory usage, there is the cost in terms of CPU cycles used, and there is a cost associated with the complexity of the code involved.
Integer manipulation (Add, Subtract, Multipy, Divide, Compare) is most often done by the CPU at the hardware level, in few (or even 1) instruction. When the manipulation is done, the answer fits back in the same size chunk of memory.
Strings are stored in blocks of memory, which have to be manipulated a byte or word at a time. Comparing two 100 character long strings may require 100 separate comparison operations.
Any manipulation that makes a string longer will require, either moving the string to a bigger block of memory, or moving other stuff around in memory to allow growing the existing block.
Any manipulation that leaves the string the same, or smaller, could be done in place, if the language allows for it. If not, then again, a new block of memory has to be allocated and contents moved.

Related

what is the max size of a string type in terraform

I'm trying to locate a definitive answer to, "What is the max size of a Terraform value type of 'string'"?
Been searching and googling and can't seem to find it defined anywhere. Anyone have any reference they could point me to?
Tia
Bill W
The length of strings in Terraform is constrained in two main ways:
Terraform internally tracks the length of the string, which is stored as an integer type which has a limited range.
Strings need to exist in system memory as a consecutive sequence of bytes.
The first of these is directly answerable: Terraform tracks the length of a string using an integer type large enough to represent all pointers on the host platform. From a practical standpoint then, that means a 64-bit integer when you're using a 64-bit build, and a 32-bit number when you're using a 32-bit build.
That means that there's a hard upper limit imposed by the maximum value of that integer. Terraform is internally tracking the length of the UTF-8 representation of the string in bytes, and so this upper limit is measured in bytes rather than in characters:
32-bit systems: 4,294,967,295 bytes
64-bit systems: 18,446,744,073,709,551,615 bytes
Terraform stores strings in memory using Unicode NFC normal form, UTF-8 encoded, and so the number of characters will vary depending on how many bytes each character takes up in the UTF-8 encoding form. ASCII characters take only one byte, but other characters can require up to four bytes.
A string of the maximum representable length would take up the entire address space of the Terraform process, which is impossible (there needs to be room for the Terraform application code, libraries, and kernel space too!), and so in practice the available memory on your system is the more relevant limit. That limit varies based on various characteristics of the system where you're running Terraform, and so isn't answerable in a general sense.

Quickjob MOVZON X'FF' to OFA1

what does MOVZON X'FF' do in quickjob. I believe it just moves input to output. Please let me know, if I am wrong.
The smallest unit of information is the bit. Processors usually don‘t work on single bits when accessing memory; they work on bytes. A byte consists of 8 consecutive bits (for most architectures).
To describe how different processor instructions work with bytes, bytes are sometimes subdivided into two 4-bit groups, called nibbles. Counting left to right, bits 0-3 are called „left nibble“, „high order nibble“, or „zone nibble“. Bits 4-7, the right half, are called „right nibble“, „low order nibble“, or „number nibble“.
There are instructions that work on the whole byte, e.g. MOVE. And there are instructions that work on nibbles. MOVEZONE (MOVZON) works on zone nibbles and leaves the number nibbles alone; MOVENUM (MOVNUM) works on number nibbles, and leaves the zone nibbles alone.
This kind of instructions are usually used with bytes that contain numeric values, coded as either zoned decimal, or packed decimal. They are rather exotic when working on text data.
This reference is used.
Given the instruction:
MOVZON X'FF' to OFA1
The receiving field OFA1 refers to the first record position (the 1) of the output file ( the OF) designated as A. The instruction will set the high-order bits (0-3 or "zone bits") of the first position to ones, matching bits 0-3 of the X'FF'.
However, it appears, as a matter of style, the instruction should have been written as MOVZON X'F0' TO OAF1 since the low-order bits (4-7) are not used.

What does the IS_ALIGNED macro in the linux kernel do?

I've been trying to read the implementation of a kernel module, and I'm stumbling on this piece of code.
unsigned long addr = (unsigned long) buf;
if (!IS_ALIGNED(addr, 1 << 9)) {
DMCRIT("#%s in %s is not sector-aligned. I/O buffer must be sector-aligned.", name, caller);
BUG();
}
The IS_ALIGNED macro is defined in the kernel source as follows:
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
I understand that data has to be aligned along the size of a datatype to work, but I still don't understand what the code does.
It left-shifts 1 by 9, then subtracts by 1, which gives 111111111. Then 111111111 does bitwise-and with x.
Why does this code work? How is this checking for byte alignment?
In systems programming it is common to need a memory address to be aligned to a certain number of bytes -- that is, several lowest-order bits are zero.
Basically, !IS_ALIGNED(addr, 1 << 9) checks whether addr is on a 512-byte (2^9) boundary (the last 9 bits are zero). This is a common requirement when erasing flash locations because flash memory is split into large blocks which must be erased or written as a single unit.
Another application of this I ran into. I was working with a certain DMA controller which has a modulo feature. Basically, that means you can allow it to change only the last several bits of an address (destination address in this case). This is useful for protecting memory from mistakes in the way you use a DMA controller. Problem it, I initially forgot to tell the compiler to align the DMA destination buffer to the modulo value. This caused some incredibly interesting bugs (random variables that have nothing to do with the thing using the DMA controller being overwritten... sometimes).
As far as "how does the macro code work?", if you subtract 1 from a number that ends with all zeroes, you will get a number that ends with all ones. For example, 0b00010000 - 0b1 = 0b00001111. This is a way of creating a binary mask from the integer number of required-alignment bytes. This mask has ones only in the bits we are interested in checking for zero-value. After we AND the address with the mask containing ones in the lowest-order bits we get a 0 if any only if the lowest 9 (in this case) bits are zero.
"Why does it need to be aligned?": This comes down to the internal makeup of flash memory. Erasing and writing flash is a much less straightforward process then reading it, and typically it requires higher-than-logic-level voltages to be supplied to the memory cells. The circuitry required to make write and erase operations possible with a one-byte granularity would waste a great deal of silicon real estate only to be used rarely. Basically, designing a flash chip is a statistics and tradeoff game (like anything else in engineering) and the statistics work out such that writing and erasing in groups gives the best bang for the buck.
At no extra charge, I will tell you that you will be seeing a lot of this type of this type of thing if you are reading driver and kernel code. It may be helpful to familiarize yourself with the contents of this article (or at least keep it around as a reference): https://graphics.stanford.edu/~seander/bithacks.html

CRC16 collision (2 CRC values of blocks of different size)

The Problem
I have a textfile which contains one string per line (linebreak \r\n). This file is secured using CRC16 in two different ways.
CRC16 of blocks of 4096 bytes
CRC16 of blocks of 32768 bytes
Now I have to modify any of these 4096 byte blocks, so it (the block)
contains a specific string
does not change the size of the textfile
has the same CRC value as the original block (same for the 32k block, that contains this 4k block)
Depart of that limitations I may do any modifications to the block that are required to fullfill it as long as the file itself does not break its format. I think it is the best to use any of the completly filled 4k blocks, not the last block, that could be really short.
The Question
How should I start to solve that problem? The first thing I would come up is some kind of bruteforce but wouldn't it take extremly long to find the changes that will result in both CRC values stay the same? Is there probably a mathematical way to solve that?
It should be done in seconds or max. few minutes.
There are math ways to solve this but I don't know them. I'm proposing a brute-force solution:
A block looks like this:
SSSSSSSMMMMEEEEEEE
Each character represents a byte. S = start bytes, M = bytes you can modify, E = end bytes.
After every byte added to the CRC it has a new internal state. You can reuse the checksum state up to that position that you modify. You only need to recalculate the checksum for the modified bytes and all following bytes. So calculate the CRC for the S-part only once.
You don't need to recompute the following bytes either. You just need to check whether the CRC state is the same or different after the modification you made. If it is the same, the entire block will also be the same. If it is different, the entire block is likely to be different (not guaranteed, but you should abort the trial). So you compute the CRC of just the S+M' part (M' being the modified bytes). If it equals the state of CRC(S+M) you won.
That way you have much less data to go through and a recent desktop or server can do the 2^32 trials required in a few minutes. Use parallelism.
Take a look at spoof.c. That will directly solve your problem for the CRC of the 4K block. However you will need to modify the code to solve the problem simultaneously for both the CRC of the 4K block and the CRC of the enclosing 32K block. It is simply a matter of adding more equations to solve. The code is extremely fast, running in O(log(n)) time, where n is the length of the message.
The basic idea is that you will need to solve 32 linear equations over GF(2) in 32 or more unknowns, where each unknown is a bit location that you are permitting to be changed. It is important to provide more than 32 unknowns with which to solve the problem, since if you pick exactly 32, it is not at all unlikely that you will end up with a singular matrix and no solution. The spoof code will automatically find non-singular choices of 32 unknown bit locations out of the > 32 that you provide.

Variable substitution faster than in-line integer in Vic-20 basic?

The following two (functionally equivalent) programs are taken from an old issue of Compute's Gazette. The primary difference is that program 1 puts the target base memory locations (7680 and 38400) in-line, whereas program 2 assigns them to a variable first.
Program 1 runs about 50% slower than program 2. Why? I would think that the extra variable retrieval would add time, not subtract it!
10 PRINT"[CLR]":A=0:TI$="000000"
20 POKE 7680+A,81:POKE 38400+A,6:IF A=505 THEN GOTO 40
30 A=A+1:GOTO 20
40 PRINT TI/60:END
Program 1
10 PRINT "[CLR]":A=0:B=7600:C=38400:TI$="000000"
20 POKE B+A,81:POKE C+A,6:IF A=505 THEN GOTO 40
30 A=A+1:GOTO 20
40 PRINT TI/60:END
Program 2
The reason is that BASIC is fully interpreted here, so the strings "7680" and "38400" need to be converted to binary integers EVERY TIME line 20 is reached (506 times in this program). In program 2, they're converted once and stored in B. So as long as the search-for-and-fetch of B is faster than convert-string-to-binary, program 2 will be faster.
If you were to use a BASIC compiler (not sure if one exists for VIC-20, but it would be a cool retro-programming project), then the programs would likely be the same speed, or perhaps 1 might be slightly faster, depending on what optimizations the compiler did.
It's from page 76 of this issue: http://www.scribd.com/doc/33728028/Compute-Gazette-Issue-01-1983-Jul
I used to love this magazine. It actually says a 30% improvement. Look at what's happening in program 2 and it becomes clear, because you are looping a lot using variables the program is doing all the memory allocation upfront to calculate memory addresses. When you do the slower approach each iteration has to allocate memory for the highlighted below as part of calculating out the memory address:
POKE 7680+A,81:POKE 38400+A
This is just the nature of the BASIC Interpreter on the VIC.
Accessing the first defined variable will be fast; the second will be a little slower, etc. Parsing multi-digit constants requires the interpreter to perform repeated multiplication by ten. I don't know what the exact tradeoffs are between variables and constants, but short variable names use less space than multi-digit constants. Incidentally, the constant zero may be parsed more quickly if written as a single decimal point (with no digits) than written as a digit zero.

Resources