Perl: string length limitations in real life

While perldata, for example, documents that scalar strings in Perl are limited only by available memory, I strongly suspect that in real life there are other limits.
I'm considering the following ideas:
I'm not sure how strings are implemented in Perl — is there some sort of byte/character counter? If there is, then probably it's implemented as a platform-dependent integer (i.e. 32-bit or 64-bit), so effectively it would limit strings to something like 2 ** 31, 2 ** 32, 2 ** 63 or 2 ** 64 bytes.
If Perl doesn't use a counter and instead uses some byte to terminate the string (which would be strange, as it's perfectly ok to have a string like "foo\0bar" in Perl), then all operations would inevitably get much slower as string length increases.
Most of Perl's string functions, length for example, return a normal scalar integer, and I strongly suspect that this would be a platform-limited integer too.
So, what would be the other factors that limit Perl string length in real life? What should be considered an okay string length for practical purposes?

Perl keeps track of the size of the buffer and the number of bytes used in it.
$ perl -MDevel::Peek -e'$x="abcdefghij"; Dump($x);'
SV = PV(0x9222b00) at 0x9222678
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x9238220 "abcdefghij"\0
CUR = 10 <-- 10 bytes used
LEN = 12 <-- 12 bytes allocated
On a 32-bit build of Perl, it uses a 32-bit unsigned integer for these values. This is (exactly) large enough to create a string that uses up your process's entire 4 GiB address space.
On a 64-bit build of Perl, it uses a 64-bit unsigned integer for these values. This is (exactly) large enough to create a string that uses up your process's entire 16 EiB address space.
The docs are correct. The size of the string is limited only by available memory.

Related

what is the max size of a string type in terraform

I'm trying to locate a definitive answer to, "What is the max size of a Terraform value type of 'string'"?
Been searching and googling and can't seem to find it defined anywhere. Anyone have any reference they could point me to?
Tia
Bill W
The length of strings in Terraform is constrained in two main ways:
Terraform internally tracks the length of the string, which is stored as an integer type which has a limited range.
Strings need to exist in system memory as a consecutive sequence of bytes.
The first of these is directly answerable: Terraform tracks the length of a string using an integer type large enough to represent all pointers on the host platform. From a practical standpoint then, that means a 64-bit integer when you're using a 64-bit build, and a 32-bit number when you're using a 32-bit build.
That means that there's a hard upper limit imposed by the maximum value of that integer. Terraform is internally tracking the length of the UTF-8 representation of the string in bytes, and so this upper limit is measured in bytes rather than in characters:
32-bit systems: 4,294,967,295 bytes
64-bit systems: 18,446,744,073,709,551,615 bytes
Terraform stores strings in memory using Unicode NFC normal form, UTF-8 encoded, and so the number of characters will vary depending on how many bytes each character takes up in the UTF-8 encoding form. ASCII characters take only one byte, but other characters can require up to four bytes.
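As a rough illustration of how byte count and character count diverge under UTF-8 (shown in Python rather than Terraform, since Terraform doesn't expose its internal representation; the sample strings are arbitrary):

# Character count vs. UTF-8 byte count for a few sample strings.
for s in ["abcd", "héllo", "日本語", "🙂"]:
    print(f"{s!r}: {len(s)} characters, {len(s.encode('utf-8'))} bytes")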
A string of the maximum representable length would take up the entire address space of the Terraform process, which is impossible (there needs to be room for the Terraform application code, libraries, and kernel space too!), and so in practice the available memory on your system is the more relevant limit. That limit varies based on various characteristics of the system where you're running Terraform, and so isn't answerable in a general sense.

Replace same character from python string but different replacement values [duplicate]

This question already has answers here:
Python Number Limit [duplicate]
(4 answers)
Closed 2 years ago.
Let's say I have this python string
>>> s = 'dog /superdog/ | cat /thundercat/'
Is there a way to replace the first / with [ and the second / with ]?
I was thinking of something like this.
Output:
'dog [superdog] | cat [thundercat]'
I tried this, but it did not quite work out:
>>> s = 'dog /superdog/ | cat /thundercat/'
>>> s.replace('/','[')
'dog [superdog[ | cat [thundercat['
I'd like to know the best and most Pythonic way to do this. Thank you!
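One sketch of a way to get that output, using the standard re module and assuming the slashes always come in simple, non-nested pairs (the pattern below is my assumption about the input, not something stated in the question):

import re

s = 'dog /superdog/ | cat /thundercat/'
# Each /.../ pair is matched as a unit; the captured text between the
# slashes is re-emitted wrapped in square brackets.
result = re.sub(r'/([^/]*)/', r'[\1]', s)
print(result)  # dog [superdog] | cat [thundercat]

Treating the two slashes as one pair avoids the problem in the str.replace attempt above, which can only map every / to the same replacement.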
Python can handle arbitrarily large numbers because it has built-in arbitrary-precision integers. The limit is related to the amount of RAM the Python process can access. This built-in long-integer arithmetic is implemented as an integer object that initially uses 32 bits for speed and then allocates more memory on demand.
Integers are commonly stored using a word of memory, which is 4 bytes or 32 bits, so integers from 0 up to 4,294,967,295 (2^32 − 1) can be stored in a single word.
But if your system has 1 GB available to a Python process, it has 8,589,934,592 bits to represent numbers, so you can use numbers like 2^8589934592 − 1.
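A quick demonstration of that arbitrary precision (the object sizes are CPython implementation details and vary between versions and builds):

import sys

n = 2 ** 100                      # far beyond any 32- or 64-bit integer
print(n)                          # 1267650600228229401496703205376
print(n.bit_length())             # 101
print(sys.getsizeof(n))           # a few dozen bytes on a 64-bit build
print(sys.getsizeof(2 ** 10000))  # grows with the magnitude of the number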
Computers can only handle numbers up to a certain size, but this is to be taken with some caveats.
−2,147,483,648 through 2,147,483,647 are the limits of 32-bit signed numbers.
Most of today's computers can handle 64-bit numbers, i.e. numbers from −9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, or from −(2^63) to 2^63 − 1.
It is possible to create software that can handle arbitrarily large numbers, as long as RAM or storage suffice. Such solutions are rather slow, but SSL encryption, for example, is based on numbers thousands of digits long.
As a side note, you are doubling your initial million in every iteration, not adding a million.

Problems with SHA 2 Hashing and Java

I am working on implementing the SHA-2 cryptographic hash functions as described at https://en.wikipedia.org/wiki/SHA-2.
I am examining the lines that say:
begin with the original message of length L bits; append a single '1' bit;
append K '0' bits, where K is the minimum number >= 0 such that L + 1 + K + 64 is a multiple of 512
append L as a 64-bit big-endian integer, making the total post-processed length a multiple of 512 bits.
I do not understand the last two lines. If my string is short, can its length after adding K '0' bits be 512? How should I implement this in Java code?
First of all, it should be made clear that the "string" being talked about is not a Java String but a bit string. These algorithms are binary/bit-based. The implementation will generally not handle bits but bytes, so there is a translation phase where you should think in bytes instead of bits.
SHA-2 operates on blocks of 512 bits (SHA-224/256) or 1024 bits (SHA-384/512). So basically you have a 64- or 128-byte buffer that you are filling before operating on it. You could also directly cache the data in 32-bit int fields (SHA-224/256) or 64-bit long fields (SHA-384/512), as that is the word size that is operated on.
Now the padding is a relatively simple procedure, called bit padding. As it is used in big-endian mode (SHA-2 fortunately uses this instead of the brain-dead little-endian mode in SHA-3), the padding consists of a single set bit in the highest-order position of a byte, with the rest filled by zeros. That makes for a value of (byte) 0x80, which must be put in the buffer.
If you cannot add this padding byte because the buffer is full, then you will have to process the previous block first and then set the first byte of the now-available buffer to (byte) 0x80. In newer Java you can also use (byte) 0b1_0000000, by the way, which is more explicit.
Now you simply add zeros until you have 8 or 16 bytes left in the block, again depending on the hash output size used. If there aren't enough bytes left, then fill to the end of the block, process the block, and restart filling with zero bytes until you have 8 or 16 bytes left again.
Finally you have to encode the number of bits of the original message into those 8 or 16 bytes you have left. So multiply your input length in bytes by eight, and make sure you encode those bytes big-endian, as you'd expect in Java, with the least significant bits as far to the right as possible. You might want to use https://docs.oracle.com/javase/8/docs/api/java/nio/ByteBuffer.html#putLong-long- for this if you don't want to program it yourself. You can probably forget about anything over 2^56 bytes anyway, so if you have SHA-384/SHA-512 then simply set the first eight bytes to zero.
And that's it, except that you still need to process that last block and then use as many bytes from the left as required for your particular output size.
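To make those padding steps concrete, here is a sketch of the rule (written in Python rather than Java purely to keep the byte arithmetic short; the block size and the width of the length field are the only things that change between the SHA-2 variants):

def pad_message(message, block_size=64):
    # block_size 64 with an 8-byte length field corresponds to SHA-224/256;
    # block_size 128 with a 16-byte length field corresponds to SHA-384/512.
    length_field = 8 if block_size == 64 else 16
    bit_length = len(message) * 8            # L, the original length in bits

    padded = message + b'\x80'               # the single '1' bit, then seven '0' bits
    while len(padded) % block_size != block_size - length_field:
        padded += b'\x00'                    # K '0' bits, added a byte at a time
    padded += bit_length.to_bytes(length_field, 'big')  # L as a big-endian integer
    return padded

print(len(pad_message(b'abc')))              # 64: a short message still fills one whole block

In a Java implementation you would do the same thing in place on the 64- or 128-byte buffer described above rather than building a new byte array.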

python3 - why is size of string bigger than encode

In Python 3, calling 'test'.__sizeof__() returns 73. However, if I encode it as UTF-8, 'test'.encode().__sizeof__() returns 37.
Why is the size of the string significantly larger than the size of its UTF-8 encoding?
In CPython, up to and including 3.2, unicode characters (which became str characters in 3.x) were stored as either 16- or 32-bit unsigned ints, depending on whether you had a 'narrow' or 'wide' build (always narrow on Windows; both were used on Linux). In 3.3 and later, CPython switched to the flexible string representation (FSR), using 1, 2, or 4 bytes (8, 16, or 32 bits) per char, depending on the width needed for the 'widest' char in the string. See PEP 393.
For 64-bit 3.4.3, 'test'.__sizeof__() == 53, while b'test'.__sizeof__() == 37. Since both use 1 byte per char, the extra 16 bytes are extra overhead in the string object. Part of that is the hidden specification of whether the string uses 1, 2, or 4 bytes per char. For comparison, 'tes\u1111'.__sizeof__() == 82 and 'tes\U00011111'.__sizeof__() == 92.
(No, I do not know why the jump to 82. One would probably have to check the code to be sure.)
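The comparison is easy to reproduce with __sizeof__ directly; the exact figures will differ slightly between Python versions and builds:

for s in ['test', b'test', 'tes\u1111', 'tes\U00011111']:
    # 1-byte-per-char str, bytes, 2-byte-per-char str, 4-byte-per-char str
    print(repr(s), s.__sizeof__())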
A str in Python 3 may be stored using 16-bit (or wider) integers instead of single bytes, unlike the encoded bytes object, which makes the string itself larger. Some extra per-object metadata is also present, inflating the object further.

Why is string manipulation more expensive?

I've heard this so many times that I have taken it for granted. But thinking back on it, can someone help me understand why string manipulation, say comparison, is more expensive than the same operation on an integer or some other primitive?
8bit example:
1 bit can be 1 or 0. With 2 bits you can represent 0, 1, 2, and 3. And so on.
With a byte you have 2^8 possibilities, from 0 to 255.
In a string a single letter is stored in a byte, so "Hello world" is 11 bytes.
If I want to compute 100 + 100, each 100 is stored in 1 byte of memory, so I need only two bytes to sum the two numbers, and the result again needs 1 byte.
Now let's try it with strings: "100" + "100" is 3 bytes plus 3 bytes, and the result, "100100", needs 6 bytes to be stored.
This is over-simplified, but more or less it works in this way.
The int data type in C# was carefully selected to be a good match for processor design, which can store an int in a CPU register, a storage location that's easily a factor of 3 faster than memory, and can compare values of type int with a single CPU instruction. The CMP instruction runs in less than a single CPU cycle, a fraction of a nanosecond.
That doesn't work nearly as well for a string: it is a variable-length data type, and every single char in the string must be compared to test for equality, so it is automatically slower in proportion to the size of the string. Furthermore, string comparison is afflicted by culture-dependent comparison rules, the kind that make "ss" and "ß" equal in German and "Aa" and "Å" equal in Danish. Nothing subtle to deal with; it's taken care of by highly optimized table-driven code inside the CLR, but it can't beat CMP.
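A crude way to see that proportionality (sketched in Python rather than C#; the absolute timings are machine-dependent and the variable names are arbitrary, but the scaling with string length is the point):

import timeit

a, b = 10 ** 6, 10 ** 6
n_short, n_long = 10, 1_000_000
short1, short2 = 'x' * n_short, 'x' * n_short   # equal contents, distinct objects
long1, long2 = 'x' * n_long, 'x' * n_long       # equal contents, distinct objects

print(timeit.timeit(lambda: a == b, number=10_000))            # integer comparison
print(timeit.timeit(lambda: short1 == short2, number=10_000))  # 10-character strings
print(timeit.timeit(lambda: long1 == long2, number=10_000))    # 1,000,000-character strings

The last timing is the one that grows as the strings get longer, because every character has to be inspected before the strings can be declared equal.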
I've always thought it was because of the immutability of strings. That is, every time you make a change to the string, it requires allocating memory for a whole new string (rather than modifying the original in place).
Probably a woefully naive understanding but perhaps someone else can expound further.
There are several things to consider when looking at the "cost" of manipulating strings.
There is the cost in terms of memory usage, there is the cost in terms of CPU cycles used, and there is a cost associated with the complexity of the code involved.
Integer manipulation (Add, Subtract, Multiply, Divide, Compare) is most often done by the CPU at the hardware level, in a few (or even one) instructions. When the manipulation is done, the answer fits back into the same size chunk of memory.
Strings are stored in blocks of memory, which have to be manipulated a byte or word at a time. Comparing two 100 character long strings may require 100 separate comparison operations.
Any manipulation that makes a string longer will require either moving the string to a bigger block of memory or moving other things around in memory to allow the existing block to grow.
Any manipulation that leaves the string the same, or smaller, could be done in place, if the language allows for it. If not, then again, a new block of memory has to be allocated and contents moved.
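As a naive sketch of the comparison loop described above (real runtimes use optimized memcmp-style routines, but the amount of work still grows with the length):

def compare_strings(a, b):
    # The length check is cheap because the length is stored alongside the data.
    if len(a) != len(b):
        return False
    # Worst case (equal strings): one comparison per character position.
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            return False
    return True

print(compare_strings('x' * 100, 'x' * 100))       # True, after 100 comparisons
print(compare_strings('x' * 100, 'x' * 99 + 'y'))  # False, discovered only at the last position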
