String (constants) Literals pointer - string

I'm relatively new to C++ and I searched for an answer to my question, however I got more confused. As I understand, string literals must be pointed by "const" pointers, since are considered to be readable only. I also understand the pointer itself is not constant (and could be changed), but actually it is pointing to a string constant.I also understand that the string itself cannot be modified. So in this example:
const char* cstr="string";
*cstr = 'a';
I get an error: "assignment of read-only location."
Now, if I define my C-string as following, and define a pointer to it, I'll be able to change the string:
char str[7]="string";
char* cstr = str;
*cstr = 'a';
cout << cstr <<endl;
the string will be modified (output --> a), means the first element of the string is changes. My two questions are:
1- why in the second example I am able to modify the C-string but in the first case I cannot make any changes to the string? 2- In both cases I am using pointers, but in the first case I should Use constant char pointer?

When you use the syntax
const char* cstr="string";
C++ defines:
An array of 7 character in the read-only section of memory, with the contents string\0 in it.
pointer on the stack (or in the writable global section of memory), with the address of that array.
However, when you use the syntax:
char str[7]="string";
C++ defines:
An array of 7 character on the stack (or in the writable global section of memory), with the contents "string\0" in it.
In the first case, the actual values are in read-only memory, so you can't change them. In the second case, they are in writable memory (stack or global).
C++ tries to enforce this semantic, so if the definition is read-only memory, you should use a const pointer.
Note that not all architectures have read-only memory, but because most of them do, and C++ might want to use the read-only memory feature (for better correctness), then C++ programmers should assume (for the purpose of pointer types) that constants are going to be placed in read-only memory.

Related

How to input a string value of unknown length from console in Fortran? [duplicate]

I would like to use deferred-length character strings in a "simple" manner to read user input. The reason that I want to do this is that I do not want to have to declare the size of a character string before knowing how large the user input will be. I know that there are "complicated" ways to do this. For example, the iso_varying_string module can be used: https://www.fortran.com/iso_varying_string.f95. Also, there is a solution here: Fortran Character Input at Undefined Length. However, I was hoping for something as simple, or almost as simple, as the following:
program main
character(len = :), allocatable :: my_string
read(*, '(a)') my_string
write(*,'(a)') my_string
print *, allocated(my_string), len(my_string)
end program
When I run this program, the output is:
./a.out
here is the user input
F 32765
Notice that there is no output from write(*,'(a)') my_string. Why?
Also, my_string has not been allocated. Why?
Why isn't this a simple feature of Fortran? Do other languages have this simple feature? Am I lacking some basic understanding about this issue in general?
vincentjs's answer isn't quite right.
Modern (2003+) Fortran does allow automatic allocation and re-allocation of strings on assignment, so a sequence of statements such as this
character(len=:), allocatable :: string
...
string = 'Hello'
write(*,*)
string = 'my friend'
write(*,*)
string = 'Hello '//string
write(*,*)
is correct and will work as expected and write out 3 strings of different lengths. At least one compiler in widespread use, the Intel Fortran compiler, does not engage 2003 semantics by default so may raise an error on trying to compile this. Refer to the documentation for the setting to use Fortran 2003.
However, this feature is not available when reading a string so you have to resort to the tried and tested (aka old-fashioned if you prefer) approach of declaring a buffer of sufficient size for any input and of then assigning the allocatable variable. Like this:
character(len=long) :: buffer
character(len=:), allocatable :: string
...
read(*,*) buffer
string = trim(buffer)
No, I don't know why the language standard forbids automatic allocation on read, just that it does.
Deferred length character is a Fortran 2003 feature. Note that many of the complicated methods linked to are written against earlier language versions.
With Fortran 2003 support, reading a complete record into a character variable is relatively straight forward. A simple example with very minimal error handling below. Such a procedure only needs to be written once, and can be customized to suit a user's particular requirements.
PROGRAM main
USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: INPUT_UNIT
IMPLICIT NONE
CHARACTER(:), ALLOCATABLE :: my_string
CALL read_line(input_unit, my_string)
WRITE (*, "(A)") my_string
PRINT *, ALLOCATED(my_string), LEN(my_string)
CONTAINS
SUBROUTINE read_line(unit, line)
! The unit, connected for formatted input, to read the record from.
INTEGER, INTENT(IN) :: unit
! The contents of the record.
CHARACTER(:), INTENT(OUT), ALLOCATABLE :: line
INTEGER :: stat ! IO statement IOSTAT result.
CHARACTER(256) :: buffer ! Buffer to read a piece of the record.
INTEGER :: size ! Number of characters read from the file.
!***
line = ''
DO
READ (unit, "(A)", ADVANCE='NO', IOSTAT=stat, SIZE=size) buffer
IF (stat > 0) STOP 'Error reading file.'
line = line // buffer(:size)
! An end of record condition or end of file condition stops the loop.
IF (stat < 0) RETURN
END DO
END SUBROUTINE read_line
END PROGRAM main
Deferred length arrays are just that: deferred length. You still need to allocate the size of the array using the allocate statement before you can assign values to it. Once you allocate it, you can't change the size of the array unless you deallocate and then reallocate with a new size. That's why you're getting a debug error.
Fortran does not provide a way to dynamically resize character arrays like the std::string class does in C++, for example. In C++, you could initialize std::string var = "temp", then redefine it to var = "temporary" without any extra work, and this would be valid. This is only possible because the resizing is done behind the scenes by the functions in the std::string class (it doubles the size if the buffer limit is exceeded, which is functionally equivalent to reallocateing with a 2x bigger array).
Practically speaking, the easiest way I've found when dealing with strings in Fortran is to allocate a reasonably large character array that will fit most expected inputs. If the size of the input exceeds the buffer, then simply increase the size of your array by reallocateing with a larger size. Removing trailing white space can be done using trim.
You know that there are "complicated" ways of doing what you want. Rather than address those, I'll answer your first two "why?"s.
Unlike intrinsic assignment a read statement does not have the target variable first allocated to the correct size and type parameters for the thing coming in (if it isn't already like that). Indeed, it is a requirement that the items in an input list be allocated. Fortran 2008, 9.6.3, clearly states:
If an input item or an output item is allocatable, it shall be allocated.
This is the case whether the allocatable variable is a character with deferred length, a variable with other deferred length-type parameters, or an array.
There is another way to declare a character with deferred length: giving it the pointer attribute. This doesn't help you, though, as we also see
If an input item is a pointer, it shall be associated with a definable target ...
Why you have no output from your write statement is related to why you see that the character variable isn't allocated: you haven't followed the requirements of Fortran and so you can't expect the behaviour that isn't specified.
I'll speculate as to why this restriction is here. I see two obvious ways to relax the restriction
allow automatic allocation generally;
allow allocation of a deferred length character.
The second case would be easy:
If an input item or an output item is allocatable, it shall be allocated unless it is a scalar character variable with deferred length.
This, though, is clumsy and such special cases seem against the ethos of the standard as a whole. We'd also need a carefully thought out rule about alloction for this special case.
If we go for the general case for allocation, we'd presumably require that the unallocated effective item is the final effective item in the list:
integer, allocatable :: a(:), b(:)
character(7) :: ifile = '1 2 3 4'
read(ifile,*) a, b
and then we have to worry about
type aaargh(len)
integer, len :: len
integer, dimension(len) :: a, b
end type
type(aaargh), allocatable :: a(:)
character(9) :: ifile = '1 2 3 4 5'
read(ifile,*) a
It gets quite messy very quickly. Which seems like a lot of problems to resolve where there are ways, of varying difficulty, of solving the read problem.
Finally, I'll also note that allocation is possible during a data transfer statement. Although a variable must be allocated (as the rules are now) when appearing in input list components of an allocated variable of derived type needn't be if that effective item is processed by defined input.

Does std::ptr::write transfer the "uninitialized-ness" of the bytes it writes?

I'm working on a library that help transact types that fit in a pointer-size int over FFI boundaries. Suppose I have a struct like this:
use std::mem::{size_of, align_of};
struct PaddingDemo {
data: u8,
force_pad: [usize; 0]
}
assert_eq!(size_of::<PaddingDemo>(), size_of::<usize>());
assert_eq!(align_of::<PaddingDemo>(), align_of::<usize>());
This struct has 1 data byte and 7 padding bytes. I want to pack an instance of this struct into a usize and then unpack it on the other side of an FFI boundary. Because this library is generic, I'm using MaybeUninit and ptr::write:
use std::ptr;
use std::mem::MaybeUninit;
let data = PaddingDemo { data: 12, force_pad: [] };
// In order to ensure all the bytes are initialized,
// zero-initialize the buffer
let mut packed: MaybeUninit<usize> = MaybeUninit::zeroed();
let ptr = packed.as_mut_ptr() as *mut PaddingDemo;
let packed_int = unsafe {
std::ptr::write(ptr, data);
packed.assume_init()
};
// Attempt to trigger UB in Miri by reading the
// possibly uninitialized bytes
let copied = unsafe { ptr::read(&packed_int) };
Does that assume_init call triggered undefined behavior? In other words, when the ptr::write copies the struct into the buffer, does it copy the uninitialized-ness of the padding bytes, overwriting the initialized state as zero bytes?
Currently, when this or similar code is run in Miri, it doesn't detect any Undefined Behavior. However, per the discussion about this issue on github, ptr::write is supposedly allowed to copy those padding bytes, and furthermore to copy their uninitialized-ness. Is that true? The docs for ptr::write don't talk about this at all, nor does the nomicon section on uninitialized memory.
Does that assume_init call triggered undefined behavior?
Yes. "Uninitialized" is just another value that a byte in the Rust Abstract Machine can have, next to the usual 0x00 - 0xFF. Let us write this special byte as 0xUU. (See this blog post for a bit more background on this subject.) 0xUU is preserved by copies just like any other possible value a byte can have is preserved by copies.
But the details are a bit more complicated.
There are two ways to copy data around in memory in Rust.
Unfortunately, the details for this are also not explicitly specified by the Rust language team, so what follows is my personal interpretation. I think what I am saying is uncontroversial unless marked otherwise, but of course that could be a wrong impression.
Untyped / byte-wise copy
In general, when a range of bytes is being copied, the source range just overwrites the target range -- so if the source range was "0x00 0xUU 0xUU 0xUU", then after the copy the target range will have that exact list of bytes.
This is what memcpy/memmove in C behave like (in my interpretation of the standard, which is not very clear here unfortunately). In Rust, ptr::copy{,_nonoverlapping} probably performs a byte-wise copy, but it's not actually precisely specified right now and some people might want to say it is typed as well. This was discussed a bit in this issue.
Typed copy
The alternative is a "typed copy", which is what happens on every normal assignment (=) and when passing values to/from a function. A typed copy interprets the source memory at some type T, and then "re-serializes" that value of type T into the target memory.
The key difference to a byte-wise copy is that information which is not relevant at the type T is lost. This is basically a complicated way of saying that a typed copy "forgets" padding, and effectively resets it to uninitialized. Compared to an untyped copy, a typed copy loses more information. Untyped copies preserve the underlying representation, typed copies just preserve the represented value.
So even when you transmute 0usize to PaddingDemo, a typed copy of that value can reset this to "0x00 0xUU 0xUU 0xUU" (or any other possible bytes for the padding) -- assuming data sits at offset 0, which is not guaranteed (add #[repr(C)] if you want that guarantee).
In your case, ptr::write takes an argument of type PaddingDemo, and the argument is passed via a typed copy. So already at that point, the padding bytes may change arbitrarily, in particular they may become 0xUU.
Uninitialized usize
Whether your code has UB then depends on yet another factor, namely whether having an uninitialized byte in a usize is UB. The question is, does a (partially) uninitialized range of memory represent some integer? Currently, it does not and thus there is UB. However, whether that should be the case is heavily debated and it seems likely that we will eventually permit it.
Many other details are still unclear, though -- for example, transmuting "0x00 0xUU 0xUU 0xUU" to an integer may well result in a fully uninitialized integer, i.e., integers may not be able to preserve "partial initialization". To preserve partially initialized bytes in integers we would have to basically say that an integer has no abstract "value", it is just a sequence of (possibly uninitialized) bytes. This does not reflect how integers get used in operations like /. (Some of this also depends on LLVM decisions around poison and freeze; LLVM might decide that when doing a load at integer type, the result is fully poison if any input byte is poison.) So even if the code is not UB because we permit uninitialized integers, it may not behave as expected because the data you want to transfer is being lost.
If you want to transfer raw bytes around, I suggest to use a type suited for that, such as MaybeUninit. If you use an integer type, the goal should be to transfer integer values -- i.e., numbers.

Function for unaligned memory access on ARM

I am working on a project where data is read from memory. Some of this data are integers, and there was a problem accessing them at unaligned addresses. My idea would be to use memcpy for that, i.e.
uint32_t readU32(const void* ptr)
{
uint32_t n;
memcpy(&n, ptr, sizeof(n));
return n;
}
The solution from the project source I found is similar to this code:
uint32_t readU32(const uint32_t* ptr)
{
union {
uint32_t n;
char data[4];
} tmp;
const char* cp=(const char*)ptr;
tmp.data[0] = *cp++;
tmp.data[1] = *cp++;
tmp.data[2] = *cp++;
tmp.data[3] = *cp;
return tmp.n;
}
So my questions:
Isn't the second version undefined behaviour? The C standard says in 6.2.3.2 Pointers, at 7:
A pointer to an object or incomplete type may be converted to a pointer to a different
object or incomplete type. If the resulting pointer is not correctly aligned 57) for the
pointed-to type, the behavior is undefined.
As the calling code has, at some point, used a char* to handle the memory, there must be some conversion from char* to uint32_t*. Isn't the result of that undefined behaviour, then, if the uint32_t* is not corrently aligned? And if it is, there is no point for the function as you could write *(uint32_t*) to fetch the memory. Additionally, I think I read somewhere that the compiler may expect an int* to be aligned correctly and any unaligned int* would mean undefined behaviour as well, so the generated code for this function might make some shortcuts because it may expect the function argument to be aligned properly.
The original code has volatile on the argument and all variables because the memory contents could change (it's a data buffer (no registers) inside a driver). Maybe that's why it does not use memcpy since it won't work on volatile data. But, in which world would that make sense? If the underlying data can change at any time, all bets are off. The data could even change between those byte copy operations. So you would have to have some kind of mutex to synchronize access to this data. But if you have such a synchronization, why would you need volatile?
Is there a canonical/accepted/better solution to this memory access problem? After some searching I come to the conclusion that you need a mutex and do not need volatile and can use memcpy.
P.S.:
# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 10 (v7l)
BogoMIPS : 1581.05
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10
This code
uint32_t readU32(const uint32_t* ptr)
{
union {
uint32_t n;
char data[4];
} tmp;
const char* cp=(const char*)ptr;
tmp.data[0] = *cp++;
tmp.data[1] = *cp++;
tmp.data[2] = *cp++;
tmp.data[3] = *cp;
return tmp.n;
}
passes the pointer as a uint32_t *. If it's not actually a uint32_t, that's UB. The argument should probably be a const void *.
The use of a const char * in the conversion itself is not undefined behavior. Per 6.3.2.3 Pointers, paragraph 7 of the C Standard (emphasis mine):
A pointer to an object type may be converted to a pointer to a
different object type. If the resulting pointer is not correctly
aligned for the referenced type, the behavior is undefined.
Otherwise, when converted back again, the result shall compare
equal to the original pointer. When a pointer to an object is
converted to a pointer to a character type, the result points to the
lowest addressed byte of the object. Successive increments of the
result, up to the size of the object, yield pointers to the remaining
bytes of the object.
The use of volatile with respect to the correct way to access memory/registers directly on your particular hardware would have no canonical/accepted/best solution. Any solution for that would be specific to your system and beyond the scope of standard C.
Implementations are allowed to define behaviors in cases where the Standard does not, and some implementations may specify that all pointer types have the same representation and may be freely cast among each other regardless of alignment, provided that pointers which are actually used to access things are suitably aligned.
Unfortunately, because some obtuse compilers compel the use of "memcpy" as an
escape valve for aliasing issues even when pointers are known to be aligned,
the only way compilers can efficiently process code which needs to make
type-agnostic accesses to aligned storage is to assume that any pointer of a type requiring alignment will always be aligned suitably for such type. As a result, your instinct that approach using uint32_t* is dangerous is spot on. It may be desirable to have compile-time checking to ensure that a function is either passed a void* or a uint32_t*, and not something like a uint16_t* or a double*, but there's no way to declare a function that way without allowing a compiler to "optimize" the function by consolidating the byte accesses into a 32-bit load that will fail if the pointer isn't aligned.

Fastest Way to Copy Buffer or C-String into a std::string

Let's say I have char buffer[64] and uint32_t length, and buffer might or might not be null terminated. If it is null terminated, the rest of the buffer will be filled with nulls. the length variable holds the length of buffer.
I would like to copy it into a std::string without extra nulls at the end of the string object.
Originally, I tried:
std::string s(buffer, length);
which copies the extra nulls when buffer is filled with nulls at the end.
I can think of:
char buffer2[128];
strncpy(buffer2, buffer, 128);
const std::sring s(buffer2);
But it is kind of wasteful because it copies twice.
I wonder whether there is a faster way. I know I need to benchmark to tell exactly which way is faster...but I would like to look at some other solutions and then benchmark...
Thanks in advance.
If you can, I'd simply add a '\0' at the end of your buffer and
then use the c-string version of the string constructor.
If you can't, you need to determine if there's a '\0' in your
buffer and while you're at it, you might as well count the number of
characters you encounter before the '\0'. You can then use that
count with the (buffer,length) form of the string constructor:
#include <string.h>
//...
std::string s(buffer, strnlen(buffer, length));
If you can't do 1. and don't want to iterate over buffer twice (once to determine the length, once in the string constructor), you could do:
char last_char = buffer[length-1];
buffer[length-1] = '\0';
std::string s(buffer); //the c-string form since we're sure there's a '\0' in the buffer now
if(last_char!='\0' && s.length()==(length-1)) {
//With good buffer sizes, this might not need to cause reallocation of the strings internal buffer
s.push_back(last_char);
}
I leave the benchmarking to you. It is possible that the c-string version of the constructor uses something like strlen internally anyway to avoid reallocations so there might not be much to gain from using the c-string version of the string constructor.
You can use all the canonical way to do this.
Faster way is surely implement by yourself smartpointer (or use anything already done as std::shared_ptr
).
Each smartpointer (sp) point to first char of array and contain.
Each time you do array.copy you don't do a true copy, but you simply add a reference do that array.
So, a "copy" take O(1) instead of O(N)

Writing a custom malloc which stores informations in the pointer

I have recently been reading about a family of automatic memory management techniques that rely on storing information in the pointer returned by the allocator, i.e. few bits of header e.g. to differentiate between pointers or to store thread-related information (note that I'm not talking about limited-field reference counting here, only immutable information).
I'd like to toy with these techniques. Now, to implement them, I need to be able to return pointers with a specific shape from my allocator. I suppose I could play with the least weight bits but this would require padding that looks extremely memory consuming, so I believe that I should play with the heaviest bits. However, I have no good idea on how to do this. Is there a way for me to, call malloc or malloc_create_zone or some related function and request a pointer that always starts with the given bits?
Thanks everyone!
The amount of information you can actually store in a pointer is pretty limited (typically one or two bits per pointer). And every attempt to dereference the pointer has to first mask out the magic information. The technique is often called tagging, BTW.
#define TAG_MASK 0x3
#define CONS_TAG 0x1
#define STRING_TAG 0x2
#define NUMBER_TAG 0x3
typedef uintptr_t value_t;
typedef struct cons {
value_t car;
value_t cdr;
} cons_t;
value_t
create_cons(value_t t1, value_t t2)
{
cons_t* pair = malloc(sizeof(cons_t));
value_t addr = (value_t)pair;
pair->car = t1;
pair->cdr = t2;
return addr | CONS_TAG;
}
value_t
car_of_cons(value_t v)
{
if ((v % TAG_MASK) != CONS_TAG) error("wrong type of argument");
return ((cons_t*) (v & ~TAG_MASK))->car;
}
One advantage of this technique is, that you can directly infer the type of the object from the pointer itself. You don't need to dereference it (say, in order to read a special type field or similar). Many language implementations using this scheme also have a special tag combination for "immediate" numbers and other small values, which can be represented direcly using the "pointer".
The disadvatage is, that the amount of information, which can be stored, is pretty limited. Also, as the example code shows, you have to be aware of the tagging in every access to the object, and need to "untag" the pointer before actually using it.
The use of the least significant bits for tagging stemms from the observation, that on most platforms, all pointer to malloced memory is actually aligned on a non-byte boundary (usually 8 bytes), so the least significant bits are always zero.

Resources