What are the rules for CPython's string interning?

In Python 3.5, is it possible to predict when we will get an interned string and when we will get a copy? After reading a few Stack Overflow answers on this issue, I found this one the most helpful, but still not comprehensive. Then I looked at the Python docs, but interning is not guaranteed by default:
Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.
So, my question is about the inner workings of intern(), i.e. its decision-making (whether to intern a string literal or not): why does the same piece of code work on one system and not on another, and what rules did the author of the answer in the linked topic mean when saying
the rules for when this happens are quite convoluted

You think there are rules?
The only rule for interning is that the return value of intern is interned. Everything else is up to the whims of whoever decided some piece of code should or shouldn't do interning. For example, "left" gets interned by PyCodeNew:
/* Intern selected string constants */
for (i = PyTuple_GET_SIZE(consts); --i >= 0; ) {
    PyObject *v = PyTuple_GetItem(consts, i);
    if (!all_name_chars(v))
        continue;
    PyUnicode_InternInPlace(&PyTuple_GET_ITEM(consts, i));
}
The "rule" here is that a string object in the co_consts of a Python code object gets interned if it consists purely of ASCII characters that are legal in a Python identifier. "left" gets interned, but "as,df" wouldn't be, and "1234" would be interned even though an identifier can't start with a digit. While identifiers can contain non-ASCII characters, such characters are still rejected by this check. Actual identifiers don't ever pass through this code; they get unconditionally interned a few lines up, ASCII or not. This code is subject to change, and there's plenty of other code that does interning or interning-like things.
Asking us for the "rules" for string interning is like asking a meteorologist what the rules are for whether it rains on your wedding. We can tell you quite a lot about how it works, but it won't be much use to you, and you'll always get surprises.

From what I understood from the post you linked:
When you use a == b, you are checking whether the value of a equals the value of b, whereas when you use a is b, you are checking whether a and b are the same object (i.e., share the same spot in memory).
Now, CPython interns many constant strings (string literals like "blabla").
So:
>>> a = "abcdef"
>>> a is "abcdef"
True
But when you do:
>>> a = "".join([chr(i) for i in range(ord('a'), ord('g'))])
>>> a
'abcdef'
>>> a is "abcdef"
False
In the C programming language, a string literal written with "" becomes a const char *, and compilers may pool identical literals so they share storage. I think something similar is happening here.

Related

What is the difference between Option::None in Rust and null in other languages?

I want to know what are the main differences between Rust's Option::None and "null" in other programming languages. And why is None considered to be better?
A brief history lesson in how to not have a value!
In the beginning, there was C. And in C, we had these things called pointers. Oftentimes, it was useful for a pointer to be initialized later, or to be optional, so the C community decided that there would be a special place in memory, usually memory address zero, which we would all just agree meant "there's no value here". Functions which returned T* could return NULL (written in all-caps in C, since it's a macro) to indicate failure, or lack of value. In a low-level language like C, in that day and age, this worked fairly well. We were only just rising out of the primordial ooze that is assembly language and into the realm of typed (and comparatively safe) languages.
C++ and later Java basically aped this approach. In C++, every pointer could be NULL (later, nullptr was added, which is not a macro and is a slightly more type-safe NULL). In Java, the problem becomes more clear: Every non-primitive value (which means, basically, every value that isn't a number or character) could be null. If my function takes a String as argument, then I always have to be ready to handle the case where some naive young programmer passes me a null. If I get a String as a result from a function, then I have to check the docs to see if it might not actually exist.
That gave us NullPointerException, which even today crowds this very website with dozens of questions a day from young programmers falling into its traps.
It is clear, in the modern day, that "every value might be null" is not a sustainable approach in a statically typed language. (Dynamically typed languages tend to be more prepared to deal with failure at every point anyway, by their very nature, so the existence of, say, nil in Ruby or None in Python is somewhat less egregious.)
Kotlin, which is often lauded as a "better Java", approaches this problem by introducing nullable types. Now, not every type is nullable. If I have a String, then I actually have a String. If I intend that my String be nullable, then I can opt-in to nullability with String?, which is a type that's either a string or null. Crucially, this is type-safe. If I have a value of type String?, then I can't call String methods on it until I do a null check or make an assertion. So if x has type String?, then I can't do x.toLowerCase() unless I first do one of the following.
1. Put it inside an if (x != null), to make sure that x is not null (or use some other form of control flow that proves my case).
2. Use the ?. null-safe call operator to do x?.toLowerCase(). This will compile to an if (x != null) check and will return a String?, which is null if the original string was null.
3. Use !! to assert that the value is not null. The assertion is checked and will throw an exception if I turn out to be wrong.
Note that (3) is what Java does by default at every turn. The difference is that, in Kotlin, the "I'm asserting that I know better than the type checker" case is opt-in, and you have to go out of your way to get into the situation where you can get a null pointer exception. (I'm glossing over platform types, which are a convenient hack in the type system to interoperate with Java. They're not really germane here.)
Nullable types are how Kotlin solves the problem, and they're how TypeScript (with --strict mode) and Scala 3 (with null checks turned on) handle it as well. However, it's not really viable in Rust without significant changes to the compiler. That's because nullable types require the language to support subtyping. In a language like Java or Kotlin, which is built using object-oriented principles first, it's easy to introduce new subtypes. String is a subtype of String?, and also of Any (essentially java.lang.Object), and also of Any? (any value and also null). So a string like "A" has a principal type of String, but it also has all of those other types, by subtyping.
Rust doesn't really have this concept. Rust has a bit of type coercion available with trait objects, but that's not really subtyping. It's not correct to say that String is a subtype of dyn Display, only that a String (in an unsized context) can be coerced automatically into a dyn Display.
So we need to go back to the drawing board. Nullable types don't really work here. However, fortunately, there's another way to handle the problem of "this value may or may not exist", and it's called optional types. I wouldn't hazard a guess as to what language first tried this idea, but it was definitely popularized by Haskell and is common in more functional languages.
In functional languages, we often have a notion of principal types, similar to the one in OOP languages. That is, a value x has a "best" type T. In OOP languages with subtyping, x might have other types which are supertypes of T. However, in functional languages without subtyping, x truly only has one type. There are other types that can unify with T, such as (written in Haskell's notation) forall a. a. But it's not really correct to say that the type of x is forall a. a.
The whole nullable type trick in Kotlin relied on the fact that "abc" was both a String and a String?, while null was only a String?. Since we don't have subtyping, we'll need two separate values for the String and String? case.
If we have a structure like this in Rust
struct Foo(i32);
Then Foo(0) is a value of type Foo. Period. End of story. It's not a Foo?, or an optional Foo, or anything like that. It has one type.
However, there's this related value called Some(Foo(0)) which is an Option<Foo>. Note that Foo(0) and Some(Foo(0)) are not the same value; they just happen to be related in a fairly natural way. The difference is that while a Foo must exist, an Option<Foo> could be Some(Foo(0)) or it could be None, which is kind of like our null in Kotlin. We still have to check whether or not the value is None before we can do anything. In Rust, we generally do that with pattern matching, or by using one of the several built-in functions that do the pattern matching for us. It's the same idea, just implemented using techniques from functional programming.
So if we want to get the value out of an Option, or a default if it doesn't exist, we can write
my_option.unwrap_or(0)
If we want to do something to the inside if it exists, or null out if it doesn't, then we write
my_option.and_then(|inner_value| ...)
This is basically what ?. does in Kotlin. If we want to assert that a value is present and panic otherwise, we can write
my_option.unwrap()
Finally, our general purpose Swiss army knife for dealing with this is pattern matching.
match my_option {
    None => {
        ...
    }
    Some(value) => {
        ...
    }
}
So we have two different approaches to dealing with this problem: nullable types based on subtyping, and optional types based on composition. "Which is better" is getting into opinion a bit, but I'll try to summarize the arguments I see on both sides.
Advocates of nullable types tend to focus on ergonomics. It's very convenient to be able to do a quick "null check" and then use the same value, rather than having to jump through hoops of unwrapping and wrapping values constantly. It's also nice being able to pass literal values to functions expecting String? or Int? without worrying about bundling them or constantly checking whether they're in a Some or not.
On the other hand, optional types have the benefit of being less "magic". If nullable types didn't exist in Kotlin, we'd be somewhat out of luck. But if Option didn't exist in Rust, someone could write it themselves in a few lines of code. It's not special syntax; it's just a type that exists and has some (ordinary) functions defined on it. It's also built by composition, which means (naturally) that it composes better. That is, (T?)? is equivalent to T? (the former still only has one null value), while Option<Option<T>> is distinct from Option<T>. I don't recommend writing functions that return Option<Option<T>> directly, but you can end up getting bitten by this when writing generic functions (i.e. your function returns S? and the caller happens to instantiate S with Int?).
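Here's a minimal sketch of that composition difference (my own illustration, not from the linked post; Option::flatten is the standard-library way to collapse the layers):

fn main() {
    // Imagine a generic lookup returning Option<S>, where the caller happened
    // to pick S = Option<i32>: the two kinds of "missing" stay distinguishable.
    let present_but_empty: Option<Option<i32>> = Some(None); // entry found, holds nothing
    let absent: Option<Option<i32>> = None;                  // no entry at all

    assert_ne!(present_but_empty, absent);         // Option<Option<T>> keeps them apart
    assert_eq!(present_but_empty.flatten(), None); // flatten() erases the distinction
}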
I go a bit more into the differences in this post, where someone asked basically the same question but in Elm rather than Rust.
There are two implicit assumptions in the question: first, that other languages' "nulls" (and nils and undefs and Nones) are all the same. They aren't.
The second assumption is that "null" and "none" provide similar functionality. There are many different uses of null: the value is unknown (SQL trinary logic), the value is a sentinel (C's null pointer and null byte), there was an error (Perl's undef), and to indicate no value (Rust's Option::None).
Null itself comes in at least three different flavors: an invalid value, a keyword, and special objects or types.
Keyword as null
Many languages opt for a specific keyword to indicate null. Perl has undef, Go and Ruby have nil, Python has None. These are useful in that they are distinct from false or 0.
Invalid value as null
Unlike a keyword, these are ordinary values within the language which mean something specific, but are used as null anyway. The best examples are C's null pointer and null bytes.
Special objects and types as null
Increasingly, languages will use special objects and types for null. These have all the advantages of a keyword, but rather than a generic "something went wrong", they can have a very specific meaning per domain. They can also be user-defined, offering flexibility beyond what the language designer intended. And, if you use objects, they can carry additional information.
For example, std::ptr::null in Rust indicates a null raw pointer. Go has the error interface. Rust has Result::Err and Option::None.
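As a small illustration of the Rust side (a sketch of my own, not from the original answer):

fn main() {
    // Raw pointers may be null, but dereferencing one requires unsafe code.
    let p: *const i32 = std::ptr::null();
    assert!(p.is_null());

    // Ordinary safe Rust expresses "maybe absent" with Option instead.
    let q: Option<&i32> = None;
    assert!(q.is_none());
}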
Null as unknown
Most programming languages use two-valued or binary logic: true or false. But some, particularly SQL, use three-valued or ternary logic: true, false, and unknown. Here null means "we do not know what the value is". It's not true, it's not false, it's not an error, it's not empty, it's not zero: the value is unknown. This changes how the logic works.
If you compare unknown to anything, even itself, the result is unknown. This allows you to draw logical conclusions based on incomplete data. For example, Alice is 7'4" tall, we don't know how tall Bob is. Is Alice taller than Bob? Unknown.
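A rough sketch of that trinary comparison in Rust terms, using Option<bool> to stand in for unknown (my own analogy; the taller function is made up, and real SQL engines do this natively):

// Three-valued "taller than": None plays the role of SQL's unknown.
fn taller(a: Option<u32>, b: Option<u32>) -> Option<bool> {
    match (a, b) {
        (Some(a), Some(b)) => Some(a > b), // both heights known: ordinary comparison
        _ => None,                         // any unknown input makes the answer unknown
    }
}

fn main() {
    let alice = Some(88); // Alice's height in inches (7'4")
    let bob = None;       // Bob's height is unknown
    assert_eq!(taller(alice, bob), None);            // unknown, not false
    assert_eq!(taller(alice, Some(60)), Some(true)); // known inputs still compare normally
}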
Null as uninitialized
When you declare a variable, it has to contain some value. Even in C, which famously does not initialize variables for you, it contains whatever value happened to be sitting in memory. Modern languages will initialize variables for you, but they have to initialize them to something. Often that something is null.
Null as sentinel
When you have a list of things and you need to know when the list is done, you can use a sentinel value. This is anything which is not a valid value for the list. For example, if you have a list of positive numbers you can use -1.
The classic example is C. In C, you don't know how long an array is. You need something to tell you to stop. You can pass around the size of the array, or you can use a sentinel value. You read the array until you see the sentinel. A string in C is just an array of characters which ends with a "null byte", that's just a 0 which is an invalid character. If you have an array of pointers, you can end it with a null pointer. The disadvantages are 1) there's not always a truly invalid value and 2) if you forget the sentinel you walk off the end of the array and bad stuff happens.
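Here is the classic sentinel scan, sketched in Rust for illustration (c_strlen is a made-up name; C would do the same with a pointer loop, and where Rust panics on a missing sentinel, C walks off into undefined behavior):

// Walk a C-style buffer until the 0-byte sentinel.
fn c_strlen(buf: &[u8]) -> usize {
    let mut n = 0;
    while buf[n] != 0 { // a missing sentinel panics here; in C it reads past the end
        n += 1;
    }
    n
}

fn main() {
    let s = b"text\0junk"; // bytes after the sentinel are invisible to the scan
    assert_eq!(c_strlen(s), 4);
}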
A more modern example is knowing when to stop iterating. In Go, for thing := range things { ... } over a channel exits when the channel is closed; the underlying single-value receive yields the element type's zero value once the channel is exhausted. Relying on such a zero value has the same problem as the classic sentinel: you need a value which will never appear in the list. What if nil is a valid value?
Languages such as Python and Ruby choose to solve this problem by raising a special exception when the iteration is done. Both will raise StopIteration, which their loops will catch to exit. This avoids the problem of choosing a sentinel value: Python and Ruby iterators can return anything.
Null as error
While many languages use exceptions for error handling, some languages do not. Instead, they return a special value to indicate an error, often null. C, Go, Perl, and Rust are good examples.
This has the same problem as a sentinel: you need to use a value which is not a valid return value. Sometimes this is not possible. For example, functions in C can only return a single value of a given type. If a function returns a pointer, it can return a null pointer to indicate error. But if it returns a number, you have to pick an otherwise valid number as the error value. Conflating the error and return values like this is a problem.
Go works around this by allowing functions to return multiple values, typically the return value and an error. Then the two values can be checked independently. Rust can only return a single value, so it works around this by returning the special type Result. This contains either an Ok with the return value or an Err with an error code.
In both Rust and Go, these are not just values: they can have methods called on them, expanding their functionality. Rust's Result::Err has the additional advantage of being a distinct type; you can't accidentally use a Result::Err as anything else.
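A minimal sketch of the Rust side (parse_port is an illustrative name of mine; str::parse and ParseIntError are real standard-library items):

use std::num::ParseIntError;

// The error travels in the type rather than in a magic in-band value like -1.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.parse() // Ok(port) or Err(parse error)
}

fn main() {
    assert_eq!(parse_port("8080"), Ok(8080));
    assert!(parse_port("not a port").is_err());

    // Result is an ordinary value with methods, e.g. substituting a default:
    assert_eq!(parse_port("oops").unwrap_or(80), 80);
}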
Null as no value
Finally, we have "none of the given options". Quite simply, there's a set of valid options and the result is none of them. This is distinct from "unknown" in that we know the value is not in the valid set of values. For example, if I asked "which fruit is this car" the result would be null meaning "the car is not any fruit".
When asking for the value of a key in a collection, and that key does not exist, you will get "no value". For example, Rust's HashMap get will return None if the key does not exist.
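For example (a minimal sketch):

use std::collections::HashMap;

fn main() {
    let mut ages = HashMap::new();
    ages.insert("alice", 30);

    // get returns Option<&V>: Some(&value) if the key exists, None otherwise.
    assert_eq!(ages.get("alice"), Some(&30));
    assert_eq!(ages.get("bob"), None); // "no value", not an error
}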
This is not an error, though the two often get confused. For example, Ruby will raise ArgumentError if you pass nonsense into a function: array.first(-2) asks for the first -2 values, which is nonsense, and will raise an ArgumentError.
Option vs Result
Which finally brings us back to Option::None. It is the "special type" version of null which has many advantages.
Rust uses Option for many things: to indicate an uninitialized value, to indicate simple errors, as no value, and for Rust-specific things. The documentation gives numerous examples.
Initial values
Return values for functions that are not defined over their entire input range (partial functions)
Return value for otherwise reporting simple errors, where None is returned on error
Optional struct fields
Struct fields that can be loaned or “taken”
Optional function arguments
Nullable pointers
Swapping things out of difficult situations
To use it in so many places dilutes its value as a special type. It also overlaps with Result which is what you're supposed to use to return results from functions.
The best advice I've seen is to use Result<Option<T>, E> to separate the three possibilities: a valid result, no result, and an error. Jeremy Fitzhardinge gives a good example of what to return from querying a key/value store.
...I'm a big advocate of returning Result<Option<T>, E> where Ok(Some(value)) means "here's the thing you asked for", Ok(None) means "the thing you asked for definitely doesn't exist", and Err(...) means "I can't say whether the thing you asked for exists or not because something went wrong".
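A hypothetical sketch of that shape (the store layout and the StoreError type are made up purely for illustration):

use std::collections::HashMap;

// Hypothetical stand-in for a real connection or I/O error type.
#[derive(Debug, PartialEq)]
struct StoreError;

// Keeps the three outcomes separate: found, definitely absent, or can't say.
fn query(store: Option<&HashMap<String, i32>>, key: &str) -> Result<Option<i32>, StoreError> {
    match store {
        None => Err(StoreError),                // backend unreachable: a real error
        Some(map) => Ok(map.get(key).copied()), // Some(v) if found, None if definitely absent
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert("hits".to_string(), 42);

    assert_eq!(query(Some(&map), "hits"), Ok(Some(42)));
    assert_eq!(query(Some(&map), "misses"), Ok(None));
    assert_eq!(query(None, "hits"), Err(StoreError));
}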

Is an explicit NUL-byte necessary at the end of a bytearray for cython to be able to convert it to a null-terminated C-string

When converting a bytearray object (or a bytes object, for that matter) to a C-string, the Cython documentation recommends the following:
cdef char * cstr = py_bytearray
There is no overhead, as cstr points to the buffer of the bytearray object.
However, C-strings are null-terminated, so in order to pass cstr to a C function it must be null-terminated as well. The Cython documentation doesn't say whether the resulting C-strings are null-terminated.
It is possible to add a NUL-byte explicitly to the bytearray object, e.g. by using b'text\x00' instead of just b'text'. Yet this is cumbersome, easy to forget, and there is at least experimental evidence that the explicit NUL-byte is not needed:
%%cython
from libc.stdio cimport printf

def printit(py_bytearray):
    cdef char *ptr = py_bytearray
    printf("%s\n", ptr)
And now
printit(bytearray(b'text'))
prints the desired "text" to stdout (which, in the case of an IPython notebook, is obviously not the output shown in the browser).
But is this a lucky coincidence, or is there a guarantee that the buffer of a bytearray object (or a bytes object) is null-terminated?
I think it's safe (at least in Python 3); however, I'd be a bit wary.
Cython uses the C-API function PyByteArray_AsString. The Python 3 documentation for it says "The returned array always has an extra null byte appended." The Python 2 version does not have that note, so it's difficult to be sure it's safe there.
Practically speaking, I think Python deals with this by always over-allocating bytearrays by one and NULL terminating them (see source code for one example of where this is done).
The only reason to be a bit cautious is that it's perfectly acceptable for bytearrays (and Python strings for that matter) to contain a 0 byte within the string, so it isn't a good indicator of where the end is. Therefore, you should really be using their len anyway. (This is a weak argument though, especially since you're probably the one initializing them, so you know if this should be true)

Why is the keyword `string` used to verify a variable type

For example, suppose we have a variable named i and set to 10. To check if it is an integer, in Tcl one types: string is integer $i.
Why is there the keyword string? Does it mean the same as in Python and C++? How do you check whether a Tcl string (in the sense of a sequence of characters) is a string? string is string $myString does not work, because string is not a class in Tcl.
Tcl doesn't have types. Or rather it does, but they're all serializable to strings and that happens magically behind the scenes; it looks like it doesn't have types, and you're not supposed to talk about them. Tcl does have classes, but they're not used for types of atomic values; something like 1.3 is not an instance of an object, it's just a value (often of floating point type, but it could also be a string or a singleton list or version identifier, or even a command name or variable name if you really want). Tcl's classes define objects that are commands, and those are (deliberately!) heavyweight entities.
The string is family of tests check whether a value meets the requirements for being interpreted as a particular kind of value. There's quite a few kinds of value, some of which make no sense as types at all (e.g., an all-uppercase string). There's nothing for string is string because everything you can ask that about would automatically pass; all values are already strings, or may be transparently converted to them.
There's exactly one way to probe what the type of a value currently is, and that is the command ::tcl::unsupported::representation (8.6 only). That reports the current type of a value as part of its output, and you're not supposed to rely on it (there's quite a few types under the hood, many of which are pretty obscure unless you know a lot about Tcl's implementation).
% set s 1.3
1.3
% ::tcl::unsupported::representation $s
value is a pure string with a refcount of 4, object pointer at 0x100836ca0, string representation "1.3"
% expr {$s + 3}
4.3
% ::tcl::unsupported::representation $s
value is a double with a refcount of 4, object pointer at 0x100836ca0, internal representation 0x3ff4cccccccccccd:0x0, string representation "1.3"
As you can see, types are pretty flexible. You're supposed to ignore them. We mean it. Make your code demand the types it needs, and throw an error if it can't get them. That's what Tcl's C API does for you.

caesar cipher check in ocaml

I want to implement a check function that, given two strings s1 and s2, will check whether s2 is the Caesar cipher of s1 or not. The interface needs to look like string -> string -> bool.
The problem is that I am not allowed to use any string functions other than String.length, so how can I solve it? I am not permitted to use any lists, arrays, or iteration; only recursion and pattern matching.
Please help me. Also, can you tell me how I can write a substring function in OCaml, other than the module function, under the above restrictions?
My guess is that you are probably allowed to use s.[i] to get the ith character of string s. This is the same as String.get, but the instructor may not think of it in those terms. Without some way of getting the individual characters of the string, I believe that this is impossible. You should probably double-check with your instructor to be sure, but I would be surprised if they meant for you to be unable to separate a string into characters (which is something that you cannot do with pattern matching alone in OCaml).
Once you can get individual characters, the way to do it should be pretty clear (you do not need substring to traverse each string recursively).
If you still want to write substring, creating it would be complex, since you don't have access to String.create or other similar functions. But you can write your own version of String.create using recursion, one-character string literals (like "x"), the ability to set a character in a string (like s.[0] <- c), and string concatenation (s1 ^ s2). Again, of course, all of this is assuming that those operators are allowed to be used.

Why is it illegal for variables to start with numbers?

Why is it illegal for variables to start with numbers? I know it's a convention, but what's the reason?
Edit:
I mean variables like "1foo" or "23bar", not only numbers like "3".
Because the lexer in most languages will assume you are trying to specify a numeric literal. And then you could declare variables that are indistinguishable from numeric literals, creating a huge bombshell of ambiguity.
Pop quiz: in a hypothetical language that permits a variable to begin with a number, what is this?
0xDEADBEEF
In C (and related languages) this can only be a hexadecimal number. If a language allows a variable name to begin with a digit, that could be a variable or a hexadecimal number. That's one quick example of potentially millions.
Numbers are interpreted as-is, without any surrounding syntax, whereas strings and characters are mostly written with quotes.
So the compiler can tell the difference between a variable name made of characters and a string of characters, but it cannot do the same with numerals.
One reason, probably the most obvious one, is that it would make your life more difficult, without bringing anything reasonably useful to the table. For example, in C, you wouldn't be able to tell whether a string of digits is an identifier or a numeric literal.
int 10 = 15;
int 15 = 10 + 5;
In the second line, is 10 a variable holding the numeric literal 15 or is it the numeric literal 10?
Another reason is that allowing a variable name to begin with a digit makes error checking during compilation a lot more complicated, again, without bringing anything reasonably useful to the table.
In languages such as Prolog, Erlang, and some early versions of Fortran, you very nearly got to do this, for completely different reasons.
Prolog and Erlang don't have variable assignment; they have unification. IIRC, if X is a variable, then the code following 2 = X or X = 2 is processed only if X can have the value 2. So if X is already unified with a value, then that value must be 2, and if not, X becomes 2 from then on. So writing 3 = 3 is fine (it should become a no-op), and 2 = 3 always fails: either a non-match in Prolog or (I think) a runtime error in Erlang. Numbers behave like variables which have already been unified with the value the numbers represent.
In early Fortran (apologies for not having used Fortran in twenty years and forgetting its syntax), all function arguments were passed by reference, so if you had a function equivalent to void foo(int &x) { x = 3; } and called it with a number, the compiler would store the number in a static variable and pass that. So calling foo(2) would set that statically stored value of 2 to 3. If the compiler happened to use the same static variable for the literal 2 somewhere else, such as calling another function with the literal 2, then the value passed to the second function would be 3 instead.
So you can have variables which are syntactically identical to numbers, as long as they are automatically initialised to the value of the literal. But if you allow them to be mutable rather than pure variables, weirdness abounds.
