What is the max size of a string type in Terraform?

I'm trying to locate a definitive answer to the question, "What is the max size of a Terraform value of type 'string'?"
Been searching and googling and can't seem to find it defined anywhere. Anyone have any reference they could point me to?
Tia
Bill W

The length of strings in Terraform is constrained in two main ways:
Terraform internally tracks the length of the string, which is stored in an integer type with a limited range.
Strings need to exist in system memory as a consecutive sequence of bytes.
The first of these is directly answerable: Terraform tracks the length of a string using an integer type large enough to represent all pointers on the host platform. From a practical standpoint then, that means a 64-bit integer when you're using a 64-bit build, and a 32-bit number when you're using a 32-bit build.
That means that there's a hard upper limit imposed by the maximum value of that integer. Terraform is internally tracking the length of the UTF-8 representation of the string in bytes, and so this upper limit is measured in bytes rather than in characters:
32-bit systems: 4,294,967,295 bytes
64-bit systems: 18,446,744,073,709,551,615 bytes
Terraform stores strings in memory in Unicode NFC normal form, UTF-8 encoded, so the number of characters that fit in a given number of bytes varies depending on how many bytes each character takes up in the UTF-8 encoding. ASCII characters take only one byte, but other characters can require up to four bytes.
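As a rough illustration of the byte-versus-character distinction (in Python rather than Terraform, purely to show the encoding arithmetic; the string literal is just an example):

import unicodedata

s = unicodedata.normalize("NFC", "café")   # NFC normal form, as Terraform stores strings
print(len(s))                              # 4 characters
print(len(s.encode("utf-8")))              # 5 bytes: "é" needs two bytes in UTF-8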
A string of the maximum representable length would take up the entire address space of the Terraform process, which is impossible (there needs to be room for the Terraform application code, libraries, and kernel space too!), and so in practice the available memory on your system is the more relevant limit. That limit varies based on various characteristics of the system where you're running Terraform, and so isn't answerable in a general sense.

Related

Quickjob MOVZON X'FF' to OFA1

What does MOVZON X'FF' do in Quickjob? I believe it just moves input to output. Please let me know if I am wrong.
The smallest unit of information is the bit. Processors usually don't work on single bits when accessing memory; they work on bytes. A byte consists of 8 consecutive bits (on most architectures).
To describe how different processor instructions work with bytes, bytes are sometimes subdivided into two 4-bit groups, called nibbles. Counting left to right, bits 0-3 are called the "left nibble", "high-order nibble", or "zone nibble". Bits 4-7, the right half, are called the "right nibble", "low-order nibble", or "number nibble".
There are instructions that work on the whole byte, e.g. MOVE. And there are instructions that work on nibbles. MOVEZONE (MOVZON) works on the zone nibbles and leaves the number nibbles alone; MOVENUM (MOVNUM) works on the number nibbles and leaves the zone nibbles alone.
These kinds of instructions are usually used with bytes that contain numeric values, coded as either zoned decimal or packed decimal. They are rather exotic when working on text data.
This reference is used.
Given the instruction:
MOVZON X'FF' to OFA1
The receiving field OFA1 refers to the first record position (the 1) of the output file (the OF) designated as A. The instruction will set the high-order bits (0-3, the "zone bits") of the first position to ones, matching bits 0-3 of X'FF'.
However, it appears that, as a matter of style, the instruction should have been written as MOVZON X'F0' TO OFA1, since the low-order bits (4-7) are not used.
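As a rough bit-level illustration (in Python, not Quickjob) of what moving only the zone nibble means, using the convention above that bits 0-3 are the high-order half of the byte:

def movzon(source: int, target: int) -> int:
    # Copy the zone nibble (bits 0-3, the high-order half) of source into target,
    # leaving the number nibble (bits 4-7, the low-order half) untouched.
    return (source & 0xF0) | (target & 0x0F)

print(hex(movzon(0xFF, 0x42)))  # 0xf2: zone nibble forced to all ones, number nibble kept
print(hex(movzon(0xF0, 0x42)))  # 0xf2: same result, since the low-order bits of the source are ignored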

Will a referential structure typically use 64 bits or 32 bits?

I learned that a reference takes 64 bits:
That is, a referential structure will typically use 64-bits for the memory address stored in the array, on top of whatever number of bits are used to represent the object that is considered the element.
How could I see it in action?
In [75]: patients = ["trump", "Trump", "trumP"]
In [76]: id(patients[1])
Out[76]: 4529777048
In [77]: math.log2(4529777048)
Out[77]: 32.076792897710234
It's 2**32 rather than 2**64.
With math.log2(id(obj)) you are asking "2 raised to what power gives the address of obj in memory?".
This is not how id() works. id() gives you a constant and unique value for every object. In CPython this is the address of the object in the memory.
On 64-bit systems it makes sense to store this address in a 64-bit variable, since you would not be able to cover the full address space with a 32-bit variable.
However, a 64-bit reference does not mean that every object has an address near 2**64. As of 2018 this would not even be possible, since x86_64 PCs have just a 48-bit address space. That the id of your patient string was near 2**32 is (mostly) coincidence.
id will return the address in memory. So, this is not what you are looking for.
Typically a way to get the size in memory of something in Python is using sys.getsizeof(). However, this will return the size of the object. You are interested in the size of the reference to that object.
You can, however, still calculate this more or less as follows: 8 * struct.calcsize("P"). This basically reveals whether you are on a 32-bit or 64-bit system, and from that you know what the size of a reference is. But whether it is possible to calculate it by inspecting an actual reference, I don't know.
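A minimal runnable sketch of that approach, showing that what a list stores per element is one pointer-sized reference, regardless of how big the referenced object is:

import struct
import sys

# Pointer size of this interpreter build: 8 bytes on a 64-bit build, 4 on a 32-bit build.
print(struct.calcsize("P") * 8, "bit references")

# A list stores references, not the objects themselves, so two three-element
# lists occupy the same amount of list memory even if one holds huge strings.
short_strings = ["a", "b", "c"]
long_strings = ["a" * 10_000, "b" * 10_000, "c" * 10_000]
print(sys.getsizeof(short_strings) == sys.getsizeof(long_strings))  # True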

What do xfrm_replay_state_esn fields mean?

I'm trying to understand a little bit more about Linux kernel IPSec networking by looking at the kernel source. I understand conceptually that IPSec prevents replay attacks with a sequence number and a replay window, i.e. if a recipient receives a packet with a sequence number that is not within the replay window, or one it has received before, then it drops that packet and increments the replay counter.
I'm trying to correlate this to the structure xfrm_replay_state_esn which is defined as such:
struct xfrm_replay_state_esn {
    unsigned int bmp_len;
    __u32        oseq;
    __u32        seq;
    __u32        oseq_hi;
    __u32        seq_hi;
    __u32        replay_window;
    __u32        bmp[0];
};
I've tried searching for documentation, but it's scant and I haven't been able to find a man page for the various functions and structures, so I don't understand what the individual fields relate to.
XFRM is an IPSec implementation for the Linux kernel. The name XFRM stands for "transform" referencing the transformation of IP packets as per the IPSec protocol.
The following RFCs are relevant for IPSec:
RFC4301: Definition of the IPSec protocol.
RFC4302: Definition of the Authentication Header (AH) sub-protocol for ensuring authenticity of IP packets.
RFC4303: Definition of the Encapsulating Security Payload (ESP) sub-protocol for ensuring authenticity and secrecy of IP packets.
The IPSec protocol allows for sequence numbers of size 32 bits or 64 bits. The 64 bit sequence numbers are referred to as Extended Sequence Numbers (ESN).
The anti-replay mechanism is defined in the RFCs for both AH and ESP. The mechanism keeps a window of acceptable sequence numbers of incoming packets. The window extends back from the highest sequence number received so far, defining a lower bound for the acceptable sequence numbers. When receiving a sequence number below that bound, it is rejected. When receiving a sequence number higher than the current highest sequence number, the window is shifted forward. When receiving a sequence number within the window, the mechanism will mark this sequence number in a checklist for ensuring that each sequence number in the window is only received once. If the sequence number has already been marked, it is rejected.
This checklist can be implemented as a bitmap, where each sequence number in the window is represented by a single bit, with 0 meaning this sequence number has not been received yet, and 1 meaning it has already been received.
Based on this information, the meaning of the fields in the xfrm_replay_state_esn struct can be given as follows.
The struct holds the state of the anti-replay mechanism with extended sequence numbers (64 bits):
The highest sequence number received so far is represented by seq and seq_hi. Each is a 32-bit integer, so together they can represent a 64-bit number, with seq holding the lower 32 bits and seq_hi holding the upper 32 bits. The reason for splitting the 64-bit value into two 32-bit variables, instead of representing it as a single 64-bit variable, is that the IPSec protocol mandates an optimization where only the lower 32 bits of the sequence number are included in the packet. For this reason, it is more convenient to have the lower 32 bits as a separate variable in the struct, so that they can be accessed directly without resorting to bit operations.
The sequence number counter for outgoing packets is tracked in oseq and oseq_hi. As before, the 64-bit number is represented by two 32-bit variables.
The size of the window is represented by replay_window. The smallest acceptable sequence number is given by the sequence number expressed by seq and seq_hi, minus replay_window, plus one.
The bitmap for checking off received sequence numbers within the window is represented by bmp. It is defined as a zero-sized array, but when the memory for the struct is allocated, additional memory is reserved after the struct, which can then be accessed e.g. with bmp[i] (which is of course just syntactic sugar for *(bmp+i)). The size of the bitmap is held in bmp_len. It is of course related to the window size, i.e. window size divided by 8*sizeof(u32), rounded up. I would speculate that it is stored explicitly to avoid having to recalculate this value frequently.
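A simplified Python sketch of the window logic described above (not the kernel code: it uses one Python integer as the bitmap and ignores the seq/seq_hi split, but the accept/reject decisions follow the same rules):

class ReplayWindow:
    def __init__(self, window_size: int):
        self.window_size = window_size   # plays the role of replay_window
        self.highest = 0                 # highest sequence number seen (seq/seq_hi combined)
        self.bitmap = 0                  # bit i set => (highest - i) already received (bmp)

    def accept(self, seq: int) -> bool:
        if seq > self.highest:
            # Shift the window forward and mark the new highest number as received.
            self.bitmap = (self.bitmap << (seq - self.highest)) | 1
            self.bitmap &= (1 << self.window_size) - 1
            self.highest = seq
            return True
        offset = self.highest - seq
        if offset >= self.window_size:
            return False                 # below the window: reject
        if self.bitmap & (1 << offset):
            return False                 # already received: replay, reject
        self.bitmap |= 1 << offset       # mark as received
        return True

w = ReplayWindow(32)
print(w.accept(5), w.accept(5), w.accept(100), w.accept(5))  # True False True False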

Why is string manipulation more expensive?

I've heard this so many times that I have taken it for granted. But thinking back on it, can someone help me understand why string manipulation, say comparison, is more expensive than the same operation on an integer or some other primitive?
An 8-bit example:
1 bit can be 1 or 0. With 2 bits you can represent 0, 1, 2, and 3. And so on.
With a byte you have 2^8 possibilities, from 0 to 255.
In a string a single letter is stored in a byte, so "Hello world" is 11 bytes.
If I want to do 100 + 100, each 100 is stored in 1 byte of memory, so I need only two bytes to sum the two numbers. The result again needs 1 byte.
Now let's try it with strings: "100" + "100" is 3 bytes plus 3 bytes, and the result, "100100", needs 6 bytes to be stored.
This is over-simplified, but more or less it works in this way.
The int data type in C# was carefully selected to be a good match for processor design: the processor can store an int in a CPU register, a storage location that is easily a factor of three faster than memory, and compare two int values with a single CPU instruction. The CMP instruction runs in less than a single CPU cycle, a fraction of a nanosecond.
That doesn't work nearly as well for a string: it is a variable-length data type, and every single char in the string must be compared to test for equality. So it is automatically slower in proportion to the size of the string. Furthermore, string comparison is affected by culture-dependent comparison rules, the kind that make "ss" and "ß" equal in German and "Aa" and "Å" equal in Danish. That is nothing subtle to deal with; it is taken care of by highly optimized, table-driven code inside the CLR, but it can't beat CMP.
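A sketch of why even a plain equality test scales with the string length (ordinal comparison only; culture-aware rules add more work on top of this):

def str_equal(a: str, b: str) -> bool:
    # An int comparison is one fixed-size operation; this has to look at
    # up to len(a) characters before it can answer.
    if len(a) != len(b):
        return False
    for ca, cb in zip(a, b):
        if ca != cb:
            return False
    return True

print(str_equal("x" * 1000, "x" * 1000))  # True, after 1000 character comparisons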
I've always thought it was because of the immutability of strings. That is, every time you make a change to the string, it requires allocating memory for a whole new string (rather than modifying the original in place).
Probably a woefully naive understanding but perhaps someone else can expound further.
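That intuition is easy to demonstrate in a language with immutable strings (Python here): "changing" a string builds a whole new one and leaves the original untouched.

s = "abc"
t = s            # t refers to the same string object as s
s += "d"         # allocates a new string; nothing is modified in place
print(s, t)      # abcd abc
print(s is t)    # False: s now refers to a different object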
There are several things to consider when looking at the "cost" of manipulating strings.
There is the cost in terms of memory usage, there is the cost in terms of CPU cycles used, and there is a cost associated with the complexity of the code involved.
Integer manipulation (Add, Subtract, Multiply, Divide, Compare) is most often done by the CPU at the hardware level, in a few (or even one) instructions. When the manipulation is done, the answer fits back into the same size chunk of memory.
Strings are stored in blocks of memory, which have to be manipulated a byte or word at a time. Comparing two 100 character long strings may require 100 separate comparison operations.
Any manipulation that makes a string longer will require either moving the string to a bigger block of memory or moving other stuff around in memory to allow the existing block to grow.
Any manipulation that leaves the string the same size, or smaller, could be done in place, if the language allows for it. If not, then again, a new block of memory has to be allocated and the contents moved.

Linux socket programming with consideration of the real size of char

I'm writing a client and server program with Linux socket programming. I'm confused about something. Although sizeof(char) is guaranteed to be 1, I know the real size of a char may be different on different computers. It may be 8 bits, 16 bits, or some other size. The problem is: what if the client and server have different char sizes? For example, the client's char size is 8 bits and the server's is 16 bits. The client calls write(socket_fd, c, sizeof(char)) and the server calls read(socket_fd, c, sizeof(char)). Does the client send 8 bits while the server wants to receive 16 bits? If that is true, what will happen?
Another question: is it a good idea for me to pass text between client and server, so that I don't need to consider the big-endian and little-endian problem?
Thanks in advance.
What system are you communicating with that has 16 bits in a byte? In any case, if you want to know exactly how many bits you have, use int8_t instead.
@Basile is right. A char is always eight bits in Linux. I found this in the book Linux Kernel Development. This book also states some other rules:
Although there is no rule that the int type be 32 bits, it is in Linux on all currently supported architectures.
The same goes for the short type, which is 16 bits on all current architectures, although no rule explicitly decrees that.
Never assume the size of a pointer or a long, which can be either 32 or 64 bits on the currently supported machines in Linux.
Because the size of a long varies on different architectures, never assume that sizeof(int) is equal to sizeof(long).
Likewise, do not assume that a pointer and an int are the same size.
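A quick way to check those rules on the machine at hand (an illustrative sketch using Python's ctypes, which reports the sizes of the underlying C types):

import ctypes

print(ctypes.sizeof(ctypes.c_short))   # 2 on all currently supported architectures
print(ctypes.sizeof(ctypes.c_int))     # 4 on all currently supported architectures
print(ctypes.sizeof(ctypes.c_long))    # 4 or 8, depending on the architecture
print(ctypes.sizeof(ctypes.c_void_p))  # 4 or 8, the pointer size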
For the choice of passing binary data or text data through the network, the book UNIX Network Programming, Volume 1 gives two solutions:
Pass all numeric data as text strings.
Explicitly define the binary formats of the supported datatypes (number of bits, big- or little-endian) and pass all data between the client and server in this format. RPC packages normally use this technique. RFC 1832 [Srinivasan 1995] describes the External Data Representation (XDR) standard that is used with the Sun RPC package.
The C definition of char as the size of a memory cell is different from the definition used in Unicode.
A Unicode code point can, depending on the encoding used, require up to 4 bytes of storage (older UTF-8 definitions went up to 6).
This is a slightly different problem than byte order and word size differences between different architectures, etc.
If you wish to express complex structures (containing Unicode text), it's probably a good idea to implement a message protocol that encodes messages to a byte array, which can be sent over any communication channel.
A simple client/server mechanism is to send a fixed-size header containing the length of the following message. It's a nice exercise to build something like this in C... :-)
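A minimal Python sketch of such a protocol (the helper names are made up for illustration): a fixed 4-byte, network-byte-order length header followed by a UTF-8 payload, which also sidesteps the endianness question from above.

import socket
import struct

def send_msg(sock: socket.socket, text: str) -> None:
    payload = text.encode("utf-8")
    # 4-byte big-endian (network byte order) length header, then the payload.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data

def recv_msg(sock: socket.socket) -> str:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length).decode("utf-8")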
Depending on what you are trying to do, it may be worthwhile to look at existing technologies for the message interface; look at Etch, Thrift, SWIG, *-rpc, ASN.1, SOAP, XML, JSON, CORBA, etc.
