How to split a `cstring` in Nim

Currently, I am converting the cstring to a string using the $ operator. There is a split for string in the strutils module.
Is there a way to split a cstring directly, e.g.
let a: cstring = "hello world"
let b = a.split("o")             # error: there is no `split` for cstring
# let b = ($a).split("o")        # works, because `$a` is a string

A C string is just a pointer to a block of characters terminated with char(0), \x0.
In Nim, C strings are mostly used when we call C libs which expect C strings as arguments. Nim strings are an opaque object with a length field and a data buffer; that buffer is zero terminated and compatible with C strings, so we can pass it to C libs. So from a Nim string we get the cstring for free, but getting a Nim string from a C string means allocating a new entity and copying the content.
The C language itself has few operations for working with strings; we would have to use libraries, GLib from GTK for example. Splitting a C string would mean allocating two new strings, or at least allocating one new string and inserting a \x0 into the old one to mark the end of the first part. The general problem with C strings is that when we allocate them we have to take care to deallocate them again to avoid memory leaks.
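For illustration, here is a minimal C sketch of that second, allocation-free approach: find the separator and write a \x0 into the existing buffer. It splits only at the first 'o', assumes the buffer is writable, and is not how you would do it in Nim; it just shows what the paragraph above describes.

#include <stdio.h>
#include <string.h>

int main(void) {
    char buf[] = "hello world";   /* must be a writable buffer, not a string literal */
    char *p = strchr(buf, 'o');   /* find the first separator */
    if (p != NULL) {
        *p = '\0';                /* terminate the first part in place */
        printf("first:  %s\n", buf);      /* "hell" */
        printf("second: %s\n", p + 1);    /* " world" */
    }
    return 0;
}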
So the recommendation is: use Nim strings from the beginning. You can pass them to C libs directly; the Nim compiler is smart and passes the actual buffer. C libs create copies of the string in most cases. If a C lib instead keeps and uses the actual C string internally, you have to make sure the Nim string is not freed too early by the Nim GC, maybe by calling GC_ref() on it. There may be rare cases where you really want to use C strings in Nim, maybe to save memory. You may do that when you know that you will not do operations on that string, like appending, splitting, substitution and all the other things Nim strings provide. There is a section about Nim strings and C strings in the Nim beginner book, see http://ssalewski.de/nimprogramming.html#_strings

Related

Why are string literals &str instead of String in Rust?

I'm just asking why Rust decided to use &str for string literals instead of String. Isn't it possible for Rust to just automatically convert a string literal to a String and put it on the heap instead of on the stack?
To understand the reasoning, consider that Rust wants to be a systems programming language. In general, this means that it needs to be (among other things) (a) as efficient as possible and (b) give the programmer full control over allocations and deallocations of heap memory. One use case for Rust is for embedded programming where memory is very limited.
Therefore, Rust does not want to allocate heap memory where this is not strictly necessary. String literals are known at compile time and can be written into the ro.data section of an executable/library, so they don't consume stack or heap space.
Now, given that Rust does not want to allocate the values on the heap, it is basically forced to treat string literals as &str: Strings own their values and can be moved and dropped, but how do you drop a value that is in ro.data? You can't really do that, so &str is the perfect fit.
Furthermore, treating string literals as &str (or, more accurately &'static str) has all the advantages and none of the disadvantages. They can be used in multiple places, can be shared without worrying about using heap memory and never have to be deleted. Also, they can be converted to owned Strings at will, so having them available as String is always possible, but you only pay the cost when you need to.
To create a String, you have to:
reserve a place on the heap (allocate), and
copy the desired content from a read-only location to the freshly allocated area.
If a string literal like "foo" did both, every string would effectively be allocated twice: once inside the executable as the read-only string, and the other time on the heap. You simply couldn't just refer to the original read-only data stored in the executable.
&str literals give you access to the most efficient string data: the one present in the executable image on startup, put there by the compiler along with the instructions that make up the program. The data it points to is not stored on the stack, what is stack-allocated is just the pointer/size pair, as is the case with any Rust slice.
Making "foo" desugar into what is now spelled "foo".to_owned() would make it slower and less space-efficient, and would likely require another syntax to get a non-allocating &str. After all, you don't want x == "foo" to allocate a string just to throw it away immediately. Languages like Python alleviate this by making their strings immutable, which allows them to cache strings mentioned in the source code. In Rust mutating String is often the whole point of creating it, so that strategy wouldn't work.
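To make the &str/String distinction a bit more tangible, here is a loose analogy in C rather than Rust (my addition, not part of the original answer; C has no &str or String types): a pointer to a string literal refers to read-only data that the compiler placed in the binary, while an owned copy has to be heap-allocated, filled, and later freed, which is roughly the cost that "foo".to_owned() represents.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* analogous to a &'static str: points at literal data in the read-only segment,
       no allocation happens here */
    const char *literal = "foo";

    /* analogous to "foo".to_owned(): reserve heap space and copy the read-only bytes */
    char *owned = malloc(strlen(literal) + 1);
    if (owned == NULL)
        return 1;
    strcpy(owned, literal);

    printf("%s %s\n", literal, owned);
    free(owned);   /* in Rust, dropping the String does this automatically */
    return 0;
}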

meaning of `STRING_ELT` in Rcpp

I searched the source code of Rcpp but could not find the definition of STRING_ELT. Could someone point to a reference where I can find the definitions of all the macro-like things in Rcpp?
This is part of R's internals accessed via:
#include <R.h>
#include <Rinternals.h>
See section 5.9.7, "Handling character data", of Writing R Extensions:
R character vectors are stored as STRSXPs, a vector type like VECSXP
where every element is of type CHARSXP. The CHARSXP elements of
STRSXPs are accessed using STRING_ELT and SET_STRING_ELT.
CHARSXPs are read-only objects and must never be modified. In
particular, the C-style string contained in a CHARSXP should be
treated as read-only and for this reason the CHAR function used to
access the character data of a CHARSXP returns (const char *) (this
also allows compilers to issue warnings about improper use). Since
CHARSXPs are immutable, the same CHARSXP can be shared by any STRSXP
needing an element representing the same string. R maintains a global
cache of CHARSXPs so that there is only ever one CHARSXP representing
a given string in memory.
You can obtain a CHARSXP by calling mkChar and providing a
nul-terminated C-style string. This function will return a
pre-existing CHARSXP if one with a matching string already exists,
otherwise it will create a new one and add it to the cache before
returning it to you. The variant mkCharLen can be used to create a
CHARSXP from part of a buffer and will ensure null-termination.
Note that R character strings are restricted to 2^31 - 1 bytes, and
hence so should the input to mkChar be (C allows longer strings on
64-bit platforms).

What are the rules for cpython's string interning?

In Python 3.5, is it possible to predict when we will get an interned string and when we will get a copy? After reading a few Stack Overflow answers on this issue I've found this one the most helpful, but it is still not comprehensive. Then I looked at the Python docs, but interning is not guaranteed by default:
Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.
So, my question is about the internal intern() conditions, i.e. the decision-making (whether to intern a string literal or not): why does the same piece of code work on one system and not on another, and what rules did the author of the answer in the topic mentioned above mean when saying
the rules for when this happens are quite convoluted
You think there are rules?
The only rule for interning is that the return value of intern is interned. Everything else is up to the whims of whoever decided some piece of code should or shouldn't do interning. For example, "left" gets interned by PyCodeNew:
/* Intern selected string constants */
for (i = PyTuple_GET_SIZE(consts); --i >= 0; ) {
    PyObject *v = PyTuple_GetItem(consts, i);
    if (!all_name_chars(v))
        continue;
    PyUnicode_InternInPlace(&PyTuple_GET_ITEM(consts, i));
}
The "rule" here is that a string object in the co_consts of a Python code object gets interned if it consists purely of ASCII characters that are legal in a Python identifier. "left" gets interned, but "as,df" wouldn't be, and "1234" would be interned even though an identifier can't start with a digit. While identifiers can contain non-ASCII characters, such characters are still rejected by this check. Actual identifiers don't ever pass through this code; they get unconditionally interned a few lines up, ASCII or not. This code is subject to change, and there's plenty of other code that does interning or interning-like things.
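To make that check concrete, here is a small standalone sketch, not the actual CPython source, of the kind of test all_name_chars performs:

#include <stdio.h>

/* sketch: a constant passes only if every character is an ASCII letter, digit or '_' */
static int all_name_chars_sketch(const char *s)
{
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
              (c >= '0' && c <= '9') || c == '_'))
            return 0;
    }
    return 1;
}

int main(void)
{
    printf("%d %d %d\n",
           all_name_chars_sketch("left"),    /* 1: gets interned */
           all_name_chars_sketch("1234"),    /* 1: interned, despite the leading digit */
           all_name_chars_sketch("as,df"));  /* 0: the comma disqualifies it */
    return 0;
}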
Asking us for the "rules" for string interning is like asking a meteorologist what the rules are for whether it rains on your wedding. We can tell you quite a lot about how it works, but it won't be much use to you, and you'll always get surprises.
From what I understood from the post you linked:
When you use if a == b, you are checking whether the value of a equals the value of b, whereas when you use if a is b, you are checking whether a and b are the same object (i.e. share the same spot in memory).
Now, CPython interns some constant strings (ones written as literals, like "blabla").
So:
>>> a = "abcdef"
>>> a is "abcdef"
True
But when you do:
>>> a = "".join([chr(i) for i in range(ord('a'), ord('g'))])
>>> a
'abcdef'
>>> a is "abcdef"
False
In the C programming language, a string written with "" is a string literal, commonly handled through a const char *. I think something similar is happening here.
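For what it's worth, C leaves it to the compiler whether two identical string literals share storage (string literal pooling), much like interning is an implementation detail in CPython. A small sketch (the first result is implementation-defined, so it may print 0 or 1):

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *a = "abcdef";   /* string literal, typically in read-only storage */
    const char *b = "abcdef";   /* many compilers reuse the same literal object */
    char c[] = "abcdef";        /* a writable copy with its own storage, never pooled */

    printf("%d\n", a == b);                /* often 1, but implementation-defined */
    printf("%d\n", (const char *)c == a);  /* 0: distinct objects */
    printf("%d\n", strcmp(a, c) == 0);     /* 1: the contents are equal either way */
    return 0;
}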

Why is the keyword `string` used to verify a variable type

For example, suppose we have a variable named i set to 10. To check whether it is an integer, in Tcl one types: string is integer $i.
Why is the keyword string there? Does it mean the same as in Python and C++? How can one check whether a Tcl string (in the sense of a sequence of characters) is a string? string is string $myString does not work, because string is not a class in Tcl.
Tcl doesn't have types. Or rather it does, but they're all serializable to strings, and that happens magically behind the scenes; it looks like it doesn't have types, and you're not supposed to talk about them. Tcl does have classes, but they're not used for the types of atomic values; something like 1.3 is not an instance of an object, it's just a value (often of floating-point type, but it could also be a string, a singleton list, a version identifier, or even a command name or variable name if you really want). Tcl's classes define objects that are commands, and those are (deliberately!) heavyweight entities.
The string is family of tests checks whether a value meets the requirements for being interpreted as a particular kind of value. There are quite a few kinds of value, some of which make no sense as types at all (e.g., an all-uppercase string). There's nothing for string is string because everything you could ask that about would automatically pass; all values are already strings, or may be transparently converted to them.
There's exactly one way to probe what the type of a value currently is, and that is the command ::tcl::unsupported::representation (8.6 only). That reports the current type of a value as part of its output, and you're not supposed to rely on it (there's quite a few types under the hood, many of which are pretty obscure unless you know a lot about Tcl's implementation).
% set s 1.3
1.3
% ::tcl::unsupported::representation $s
value is a pure string with a refcount of 4, object pointer at 0x100836ca0, string representation "1.3"
% expr {$s + 3}
4.3
% ::tcl::unsupported::representation $s
value is a double with a refcount of 4, object pointer at 0x100836ca0, internal representation 0x3ff4cccccccccccd:0x0, string representation "1.3"
As you can see, types are pretty flexible. You're supposed to ignore them. We mean it. Make your code demand the types it needs, and throw an error if it can't get them. That's what Tcl's C API does for you.

Dart: how many string objects are allocated when using the * operator in an assignment?

In the following code, I would expect 3 total string allocations to be made:
String str = "abc";
String str2 = str*2; //"abcabc"
1 when creating str
another when creating a copy of str to concatenate with itself
a third to hold the concatenation of str with itself (str2)
Are there fewer or more allocations made in this example? I know that strings are immutable in Dart but I'm unsure how these operations work under the hood because of this property.
I have no knowledge about the inner workings of the Dart VM but I would say:
"abc" creates one String object.
String str = "abc"; makes str reference the one created String object ("abc").
str * 2 creates a second String object, "abcabc", which str2 refers to after the second statement.
All in all two String objects.
With optimising compilers it's difficult to know for sure. If you want to know more you can look at the generated native code with irhydra.
In general a good approach is to write code to be as readable as possible, and then use tools to find the bottlenecks in your code and optimise those.
For example, Observatory can show you which objects are using up the most memory and which methods are running the most.
