meaning of `STRING_ELT` in Rcpp - rcpp

I searched the Rcpp source code but could not find the definition of STRING_ELT. Could someone point me to a reference where I can find the definitions of all the macro-like things used in Rcpp?

This is part of R's internals accessed via:
#include <R.h>
#include <Rinternals.h>
See 5.9.7 Handling character data of Writing R Extensions:
R character vectors are stored as STRSXPs, a vector type like VECSXP
where every element is of type CHARSXP. The CHARSXP elements of
STRSXPs are accessed using STRING_ELT and SET_STRING_ELT.
CHARSXPs are read-only objects and must never be modified. In
particular, the C-style string contained in a CHARSXP should be
treated as read-only and for this reason the CHAR function used to
access the character data of a CHARSXP returns (const char *) (this
also allows compilers to issue warnings about improper use). Since
CHARSXPs are immutable, the same CHARSXP can be shared by any STRSXP
needing an element representing the same string. R maintains a global
cache of CHARSXPs so that there is only ever one CHARSXP representing
a given string in memory.
You can obtain a CHARSXP by calling mkChar and providing a
nul-terminated C-style string. This function will return a
pre-existing CHARSXP if one with a matching string already exists,
otherwise it will create a new one and add it to the cache before
returning it to you. The variant mkCharLen can be used to create a
CHARSXP from part of a buffer and will ensure null-termination.
Note that R character strings are restricted to 2^31 - 1 bytes, and
hence so should the input to mkChar be (C allows longer strings on
64-bit platforms).
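The caching behaviour described in the quoted passage is essentially string interning. The following is a minimal C++ sketch of that idea only, not R's actual implementation: `InternPool` and `intern` are hypothetical names invented for illustration, whereas in R the cache lives inside the interpreter and is reached through `mkChar`.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for R's global CHARSXP cache (illustration only).
class InternPool {
public:
    // Returns a pointer to the single shared copy of `s`,
    // creating that copy the first time the string is seen.
    const std::string* intern(const std::string& s) {
        // insert() leaves the set unchanged if an equal string is already
        // stored, and returns an iterator to the stored element either way.
        return &*pool_.insert(s).first;
    }

    std::size_t size() const { return pool_.size(); }

private:
    // Element addresses in an unordered_set stay valid across rehashing.
    std::unordered_set<std::string> pool_;
};
```

Interning the same text twice yields the same pointer, which is the property that lets several STRSXPs share one CHARSXP.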

Related

Is an explicit NUL-byte necessary at the end of a bytearray for cython to be able to convert it to a null-terminated C-string

When converting a bytearray-object (or a bytes-object for that matter) to a C-string, the cython-documentation recommends to use the following:
cdef char * cstr = py_bytearray
There is no overhead, as cstr points to the buffer of the bytearray object.
However, C strings are null-terminated, so in order to pass cstr to a C function it must also be null-terminated. The Cython documentation does not say whether the resulting C strings are null-terminated.
It is possible to add a NUL byte explicitly to the bytearray object, e.g. by using b'text\x00' instead of just b'text'. Yet this is cumbersome and easy to forget, and there is at least experimental evidence that the explicit NUL byte is not needed:
%%cython
from libc.stdio cimport printf
def printit(py_bytearray):
    cdef char *ptr = py_bytearray
    printf("%s\n", ptr)
And now
printit(bytearray(b'text'))
prints the desired "text" to stdout (which, in the case of an IPython notebook, is obviously not the output shown in the browser).
But is this a lucky coincidence or is there a guarantee, that the buffer of a bytearray-object (or a bytes-object) is null-terminated?
I think it's safe (at least in Python 3), however I'd be a bit wary.
Cython uses the C-API function PyByteArray_AsString. The Python3 documentation for it says "The returned array always has an extra null byte appended." The Python2 version does not have that note so it's difficult to be sure if it's safe.
Practically speaking, I think Python deals with this by always over-allocating bytearrays by one byte and NUL-terminating them (see source code for one example of where this is done).
The only reason to be a bit cautious is that it's perfectly acceptable for bytearrays (and Python strings for that matter) to contain a 0 byte within the string, so it isn't a good indicator of where the end is. Therefore, you should really be using their len anyway. (This is a weak argument though, especially since you're probably the one initializing them, so you know if this should be true)
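The embedded-zero caveat can be seen in a small sketch. It is written in C++ rather than Cython, and a `std::string` stands in for the bytearray's buffer, which is an assumption made purely for illustration; the helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// A 5-byte buffer with a zero byte in the middle: 'a', 'b', 0, 'c', 'd'.
std::string make_buffer_with_embedded_nul() {
    return std::string("ab\0cd", 5);
}

// Length as seen by a C function that stops at the first NUL byte.
std::size_t c_string_length(const std::string& buf) {
    return std::strlen(buf.c_str());
}
```

Here the stored length is 5 bytes, while a C function that scans for the terminator sees only the first 2, which is why passing the length explicitly is the safer habit.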
(My initial version of this answer had something about _PyByteArray_empty_string. @ead pointed out in the comments that I was mistaken about this and hence it's edited out...)

Other target types provided by protobuf and Serialized Array/String/Ostream

I read the tutorial in the protobuf C++ programming guide. The generated .h file seems to provide SerializeWithCachedSizesToArray, and I can also call SerializeToString() and SerializeToOstream().
I wish to know:
(1) Does protobuf provide other default serialize/deserialize functions for C++ code?
(2) How to use the generated function of
void SerializeWithCachedSizes(
::google::protobuf::io::CodedOutputStream* output
I searched Google but could not work out when and where I should use CodedOutputStream.
Any explanations? Thanks.
1) Serialization involves three main operations: a) calculating the total size, b) encoding, and c) writing the result out. For example, SerializeWithCachedSizesToArray means a) use the cached size and c) write to a char array.
Depending on how and where those operations are done, there are many variants of the Serialize functions, and you can mix and match library-provided utilities with your own to target other types. The most common functions are SerializeToString and SerializeToOstream, as you have seen. Targets include string, char array, ostream, and zlib stream, to name a few.
2) CodedOutputStream is a utility class for encoding a tagged stream. The tag carries the field number, that is, the number you put after '=' in the .proto file. You instantiate it with an output target, such as a stream or a char array.
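To make "tagged stream" concrete, here is a hand-written sketch of how a single integer field is laid out in the protobuf wire format: a tag built from the field number and the wire type, followed by the value, both as base-128 varints. It deliberately does not use the protobuf library; `write_varint` and `encode_varint_field` are hypothetical helper names, and in real code CodedOutputStream does this kind of encoding for you.

```cpp
#include <cstdint>
#include <vector>

// Appends `value` as a base-128 varint: 7 payload bits per byte, least
// significant group first, high bit set on every byte except the last.
void write_varint(std::vector<std::uint8_t>& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Encodes one varint-typed field: the tag first, then the value.
std::vector<std::uint8_t> encode_varint_field(std::uint32_t field_number,
                                              std::uint64_t value) {
    std::vector<std::uint8_t> out;
    const std::uint64_t wire_type_varint = 0;  // wire type 0 means "varint"
    write_varint(out, (static_cast<std::uint64_t>(field_number) << 3) | wire_type_varint);
    write_varint(out, value);
    return out;
}
```

For field number 1 holding the value 150, this produces the three bytes 0x08 0x96 0x01.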

Conversion of list to string - TCL

I encountered the following problem in Tcl. In my application, I read very large text files (some hundreds of MB) into a Tcl list. The list is then returned by the function to the main context and checked for emptiness. Here is a snippet of the code:
set merged_trace_list [merge_trace_files $exclude_trace_file $trace_filenames ]
if {$merged_trace_list == ""} {
...
I get a crash at the "if" line. The crash seems to be related to memory overflow. I thought that the comparison with "" forces Tcl to convert the list to a string, and that since the string is too long, this causes the crash. I then replaced the "if" line above with another one:
if {[lempty $merged_trace_list]} {
and the crash indeed disappeared. In light of the above, I have several questions:
What is the maximum allowed string length in Tcl?
What is the difference between a string and a list in Tcl in terms of memory allocation? Why can I have a very long list, but not the corresponding string?
When the list is first returned by the function into the main scope (the first line), is it not converted to a string first? And if so, why don't I get a crash on that line?
Thanks,
I hope the descriptions and the questions are clear.
Konstantin
The current maximum size of an individual memory object (e.g., a string) is 2GB. This is a known bug (of long standing) on 64-bit platforms, but fixing it requires a significant ABI and API breaking change, so it won't appear until Tcl 9.0.
The difference between strings and lists is that strings are stored in a single block of memory, whereas lists are stored in an array of pointers to elements. You can probably get around 256M elements in a list no problem (2GB divided by 8-byte pointers), but after that you might run into problems as the array reaches the 2GB limit.
Tcl's value objects may be simultaneously both lists and strings; the dictum about Tcl that "everything is a string" is not actually true, it's just that everything may be serialized to a string. Returning a list does not force it to be converted to a string (that's actually a fairly slow operation), but comparing the value for equality with a string does force the generation of the string. The lempty command must instead be getting the length of the list (you can use llength to do the same thing) and comparing that to zero.
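The behaviour described here can be modelled in a few lines. This is a hypothetical, much-simplified C++ model and not Tcl's implementation: `ListValue` and its methods are invented names, and the string form is built by plain space-joining, whereas real Tcl also quotes elements that need it.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a value that holds a list representation and
// builds its string representation only when something asks for it.
class ListValue {
public:
    explicit ListValue(std::vector<std::string> elements)
        : elements_(std::move(elements)) {}

    // Like llength: reads the element count, never builds the string form.
    std::size_t length() const { return elements_.size(); }

    // Like comparing with a string: forces the (potentially huge) string form.
    bool equals_string(const std::string& other) {
        return string_rep() == other;
    }

    bool has_string_rep() const { return has_string_; }

private:
    const std::string& string_rep() {
        if (!has_string_) {
            // Simplification: join with spaces, no quoting of elements.
            for (std::size_t i = 0; i < elements_.size(); ++i) {
                if (i != 0) string_ += ' ';
                string_ += elements_[i];
            }
            has_string_ = true;
        }
        return string_;
    }

    std::vector<std::string> elements_;
    std::string string_;
    bool has_string_ = false;
};
```

In this model, checking `length()` against zero costs nothing extra, while `equals_string("")` allocates the whole string first, which mirrors why the comparison with "" was the line that ran out of memory.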
Can you adjust your program to not need to hold all that data in memory at once? It's living a little dangerously given the bug mentioned above.
This is not really an answer, but it's slightly too much for a comment.
If you want to check if a list is empty, the best option is llength. If the list length is 0, your list has no content. The low-level lookup for this is very cheap.
If you still want to determine if a list is empty by comparing it to the empty string you will have to face the cost of resolving the string representation of the list. In this case, $myLongList eq {} is preferable to $myLongList == {}, since the latter comparison also forces the interpreter to check if the operands are numeric (at least it used to be like that, it might have changed).

Why is the keyword `string` used to verify a variable type

For example, suppose we have a variable named i that is set to 10. To check whether it is an integer, in Tcl one types: string is integer $i.
Why is the keyword string there? Does it mean the same as in Python and C++? How do I check whether a Tcl string (in the sense of a sequence of characters) is a string? string is string $myString does not work because string is not a class in Tcl.
Tcl doesn't have types. Or rather it does, but they're all serializable to strings and that happens magically behind the scenes; it looks like it doesn't have types, and you're not supposed to talk about them. Tcl does have classes, but they're not used for types of atomic values; something like 1.3 is not an instance of an object, it's just a value (often of floating point type, but it could also be a string or a singleton list or version identifier, or even a command name or variable name if you really want). Tcl's classes define objects that are commands, and those are (deliberately!) heavyweight entities.
The string is family of tests checks whether a value meets the requirements for being interpreted as a particular kind of value. There are quite a few kinds of value, some of which make no sense as types at all (e.g., an all-uppercase string). There's nothing for string is string because everything you can ask that about would automatically pass; all values are already strings, or may be transparently converted to them.
There's exactly one way to probe what the type of a value currently is, and that is the command ::tcl::unsupported::representation (8.6 only). That reports the current type of a value as part of its output, and you're not supposed to rely on it (there's quite a few types under the hood, many of which are pretty obscure unless you know a lot about Tcl's implementation).
% set s 1.3
1.3
% ::tcl::unsupported::representation $s
value is a pure string with a refcount of 4, object pointer at 0x100836ca0, string representation "1.3"
% expr {$s + 3}
4.3
% ::tcl::unsupported::representation $s
value is a double with a refcount of 4, object pointer at 0x100836ca0, internal representation 0x3ff4cccccccccccd:0x0, string representation "1.3"
As you can see, types are pretty flexible. You're supposed to ignore them. We mean it. Make your code demand the types it needs, and throw an error if it can't get them. That's what Tcl's C API does for you.

What's going on in the 'offsetof' macro?

The Visual C++ 2008 C runtime offers an operator 'offsetof', which is actually a macro defined like this:
#define offsetof(s,m) (size_t)&reinterpret_cast<const volatile char&>((((s *)0)->m))
This allows you to calculate the offset of the member variable m within the class s.
What I don't understand in this declaration is:
Why are we casting m to anything at all and then dereferencing it? Wouldn't this have worked just as well:
&(((s*)0)->m)
?
What's the reason for choosing char reference (char&) as the cast target?
Why use volatile? Is there a danger of the compiler optimizing the loading of m? If so, in what exact way could that happen?
An offset is in bytes. So to get a number expressed in bytes, you have to cast the addresses to char, because that is the same size as a byte (on this platform).
The use of volatile is perhaps a cautious step to ensure that no compiler optimisations (either that exist now or may be added in the future) will change the precise meaning of the cast.
Update:
If we look at the macro definition:
(size_t)&reinterpret_cast<const volatile char&>((((s *)0)->m))
With the cast-to-char removed it would be:
(size_t)&((((s *)0)->m))
In other words, get the address of member m in an object at address zero, which does look okay at first glance. So there must be some way that this would potentially cause a problem.
One thing that springs to mind is that the operator & may be overloaded on whatever type m happens to be. If so, this macro would be executing arbitrary code on an "artificial" object that is somewhere quite close to address zero. This would probably cause an access violation.
This kind of abuse may be outside the applicability of offsetof, which is supposed to only be used with POD types. Perhaps the idea is that it is better to return a junk value instead of crashing.
(Update 2: As Steve pointed out in the comments, there would be no similar problem with operator ->)
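As a point of comparison, the same offset can be computed from a real object instead of a null pointer. The sketch below assumes a plain standard-layout struct; `Packet` and `length_offset_from_object` are hypothetical names. It casts both addresses to `char` pointers so that the difference is measured in bytes, then checks the result against the standard `offsetof` from `<cstddef>`.

```cpp
#include <cstddef>
#include <cstdint>

// A plain standard-layout struct, the case offsetof is specified for.
struct Packet {
    std::uint8_t  kind;
    std::uint32_t length;
    std::uint16_t checksum;
};

// Offset of `length` computed from a real object: the distance in bytes
// between the member's address and the address of the object itself.
std::size_t length_offset_from_object() {
    Packet p{};
    const char* base   = reinterpret_cast<const char*>(&p);
    const char* member = reinterpret_cast<const char*>(&p.length);
    return static_cast<std::size_t>(member - base);
}
```

The exact numbers depend on the platform's padding rules, so the only portable claim is that this value agrees with `offsetof(Packet, length)`.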
offsetof is something to be very careful with in C++. It's a relic from C. These days we are supposed to use member pointers. That said, I believe that member pointers to data members are overdesigned and broken - I actually prefer offsetof.
Even so, offsetof is full of nasty surprises.
First, for your specific questions, I suspect the real issue is that they've adapted the traditional C macro (which I thought was mandated in the C++ standard). They probably use reinterpret_cast for "it's C++!" reasons (so why the (size_t) cast?), and a char& rather than a char* to try to simplify the expression a little.
Casting to char looks redundant in this form, but probably isn't. (size_t) is not equivalent to reinterpret_cast, and if you try to cast pointers to other types into integers, you run into problems. I don't think the compiler even allows it, but to be honest, I'm suffering memory failure ATM.
The fact that char is a single byte type has some relevance in the traditional form, but that may only be why the cast is correct again. To be honest, I seem to remember casting to void*, then char*.
Incidentally, having gone to the trouble of using C++-specific stuff, they really should be using std::ptrdiff_t for the final cast.
Anyway, coming back to the nasty surprises...
VC++ and GCC probably won't use that macro. IIRC, they have a compiler intrinsic, depending on options.
The reason is to do what offsetof is intended to do, rather than what the macro does, which is reliable in C but not in C++. To understand this, consider what would happen if your struct uses multiple or virtual inheritance. In the macro, when you dereference a null pointer, you end up trying to access a virtual table pointer that isn't there at address zero, meaning that your app probably crashes.
For this reason, some compilers have an intrinsic that just uses the specified struct's layout instead of trying to deduce a run-time type. But the C++ standard doesn't mandate or even suggest this - it's only there for C compatibility reasons. And you still have to be careful if you're working with class hierarchies, because as soon as you use multiple or virtual inheritance, you cannot assume that the layout of the derived class matches the layout of the base class - you have to ensure that the offset is valid for the exact run-time type, not just a particular base.
If you're working on a data structure library, maybe using single inheritance for nodes, but apps cannot see or use your nodes directly, offsetof works well. But strictly speaking, even then, there's a gotcha. If your data structure is in a template, the nodes may have fields with types from template parameters (the contained data type). If that isn't POD, technically your structs aren't POD either. And all the standard demands for offsetof is that it works for POD. In practice, it will work - your type hasn't gained a virtual table or anything just because it has a non-POD member - but you have no guarantees.
If you know the exact run-time type when you dereference using a field offset, you should be OK even with multiple and virtual inheritance, but ONLY if the compiler provides an intrinsic implementation of offsetof to derive that offset in the first place. My advice - don't do it.
Why use inheritance in a data structure library? Well, how about...
class node_base { ... };
class leaf_node : public node_base { ... };
class branch_node : public node_base { ... };
The fields in the node_base are automatically shared (with identical layout) in both the leaf and branch, avoiding a common error in C with accidentally different node layouts.
BTW - offsetof is avoidable with this kind of stuff. Even if you are using offsetof for some jobs, node_base can still have virtual methods and therefore a virtual table, so long as it isn't needed to dereference member variables. Therefore, node_base can have pure virtual getters, setters and other methods. Normally, that's exactly what you should do. Using offsetof (or member pointers) is a complication, and should only be used as an optimisation if you know you need it. If your data structure is in a disk file, for instance, you definitely don't need it - a few virtual call overheads will be insignificant compared with the disk access overheads, so any optimisation efforts should go into minimising disk accesses.
Hmmm - went off on a bit of a tangent there. Whoops.
char is guaranteed to be the smallest unit the architecture can address (a byte).
All pointers are actually numbers, so address 0 is cast to a pointer to that type because it is the beginning.
Take the address of the member starting from 0 (resulting in 0 + location_of_m).
Cast that back to size_t.
1) I also do not know why it is done in this way.
2) The char type is special in two ways.
No other type has weaker alignment restrictions than the char type. This is important for reinterpret cast between pointers and between expression and reference.
It is also the only type (together with its unsigned variant) for which the specification defines behavior in case the char is used to access stored value of variables of different type. I do not know if this applies to this specific situation.
3) I think that the volatile modifier is used to ensure that no compiler optimization will result in attempt to read the memory.
2. What's the reason for choosing char reference (char&) as the cast target?
If the type of the member m overloads operator&, then applying & to the member directly would call that overload instead of yielding its address. So the macro first uses reinterpret_cast to view the member as the primitive type char, which cannot have an overloaded operator&, and takes the address of that. In C, where operators cannot be overloaded, the reinterpret_cast is not required.
3. Why use volatile? Is there a danger of the compiler optimizing the loading of m? If so, in what exact way could that happen?
Here volatile has nothing to do with compiler optimization. If the member is const-qualified, volatile-qualified, or both, reinterpret_cast cannot cast it to char&, because reinterpret_cast cannot remove cv-qualifiers. Casting to const volatile char& therefore works for any combination of qualifiers.
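The operator& argument can be demonstrated with a small sketch. The types `Sneaky` and `Holder` and both helper functions are hypothetical, made up to show the effect; this is one plausible reading of why the macro is written that way, not a statement of the library authors' intent.

```cpp
#include <cstddef>

// A member type that overloads unary & and lies about its address.
struct Sneaky {
    int payload;
    Sneaky* operator&() { return nullptr; }
};

struct Holder {
    int    first;
    Sneaky member;
};

// Finds the true address of `h.member` without calling Sneaky::operator&:
// the member is reinterpreted as a char reference first, and the built-in
// & is then applied to that char.
const volatile char* real_address_of_member(Holder& h) {
    return &reinterpret_cast<const volatile char&>(h.member);
}

// Byte offset of `member` inside `h`, computed from the true addresses.
std::size_t member_offset(Holder& h) {
    const volatile char* base = &reinterpret_cast<const volatile char&>(h);
    return static_cast<std::size_t>(real_address_of_member(h) - base);
}
```

With these definitions, the plain expression `&h.member` returns a null pointer because it calls the overload, while the cast-based version still yields the real address. Because the target type is `const volatile char&`, the same cast would also compile if the member were declared const or volatile.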
