If I write data to a file as follows:
program main
   implicit none
   integer :: err
   real (kind=4), dimension(3) :: buffer

   buffer(1) = 1.2
   buffer(2) = 3.7
   buffer(3) = 0.1

   open(unit=36, file='test.dat', iostat=err, form='unformatted', action='write', status='new')
   write(36) buffer
   close(36)
end program
I would expect the file to be 12 bytes since the size of the real data type is 4 and I am inserting 3 real values in the file (4x3=12). However, if I type the following in my shell:
$ ls -lh test.dat
it says the file is 20 bytes.
Fortran unformatted files are not plain "binary" files; they still have a record structure. Thus there will typically be extra data, over and above what you have written, that stores information about the record: most often the length of the record, or possibly an end-of-record marker. That is why your file is bigger than the raw data.
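The extra 8 bytes can be illustrated outside Fortran. A minimal sketch in Python, assuming a compiler such as gfortran that brackets each sequential record with 4-byte little-endian length markers (other compilers may use different marker layouts):

```python
import struct

# Simulate the contents of test.dat: one record holding three 4-byte reals.
payload = struct.pack("<3f", 1.2, 3.7, 0.1)   # the 12 bytes of actual data
marker = struct.pack("<i", len(payload))      # 4-byte record-length marker
record = marker + payload + marker            # marker + data + marker

print(len(record))  # 20 bytes, matching what ls -lh reports

# Reading it back: the leading marker tells us the record length.
n = struct.unpack_from("<i", record, 0)[0]
values = struct.unpack_from("<%df" % (n // 4), record, 4)
print(n, [round(v, 1) for v in values])  # 12 [1.2, 3.7, 0.1]
```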
Also, don't use explicit constants for kind numbers; I can show you compilers where real(4) will fail to compile. Instead use selected_real_kind or similar, or use the constants in the intrinsic module iso_fortran_env, or possibly those in iso_c_binding.
I am reading and writing huge files, using unformatted form and stream access.
During the run, I open the same file multiple times, and I read only the portions of the file that I need at that moment. I use huge files in order to avoid writing too many smaller files to the hard disk. I don't read these huge files all at once because they are too large and I would have memory problems.
In order to read only portions of the files, I do the following. Let's say that I have written the array A(1:10) to a file "data.dat", and that I need to read it twice into an array B(1:5). This is what I do:
real, dimension(5) :: B
integer :: fu, myposition

open(newunit=fu, file="data.dat", status="old", form='unformatted', access='stream')
read(fu, POS=1) B
inquire(unit=fu, POS=myposition)
close(fu)

[....]

open(newunit=fu, file="data.dat", status="old", form='unformatted', access='stream')
read(fu, POS=myposition) B
inquire(unit=fu, POS=myposition)
close(fu)
[...]
My questions are:
Is this approach correct?
When the files are too big, the inquire(fu,POS=myposition) goes wrong,
because the position is too big for a default integer (indeed, I get negative values).
Should I simply declare the integer myposition with a larger kind?
Or is there a better way to do what I am trying to do?
In other words, is needing such huge integers a sign that I am using a very clumsy approach?
P.S.
To be more quantitative, this is the order of magnitude: I have thousands of files of around 10 gigabytes each.
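Not from the thread, but the arithmetic behind the overflow is easy to check: a default Fortran integer is typically 32 bits, so file positions beyond 2**31 - 1 bytes (about 2 GiB) wrap around to negative values, while a 64-bit kind (e.g. integer(int64) from iso_fortran_env) comfortably covers a 10 GB file:

```python
INT32_MAX = 2**31 - 1     # largest value of a typical default Fortran integer
INT64_MAX = 2**63 - 1     # largest value of a 64-bit integer
file_size = 10 * 10**9    # ~10 GB, the file size described in the question

print(file_size > INT32_MAX)  # True: a POS this large overflows a 32-bit integer
print(file_size < INT64_MAX)  # True: a 64-bit kind is more than enough
```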
I would like to use deferred-length character strings in a "simple" manner to read user input. The reason that I want to do this is that I do not want to have to declare the size of a character string before knowing how large the user input will be. I know that there are "complicated" ways to do this. For example, the iso_varying_string module can be used: https://www.fortran.com/iso_varying_string.f95. Also, there is a solution here: Fortran Character Input at Undefined Length. However, I was hoping for something as simple, or almost as simple, as the following:
program main
   character(len = :), allocatable :: my_string
   read(*, '(a)') my_string
   write(*, '(a)') my_string
   print *, allocated(my_string), len(my_string)
end program
When I run this program, the output is:
./a.out
here is the user input
F 32765
Notice that there is no output from write(*,'(a)') my_string. Why?
Also, my_string has not been allocated. Why?
Why isn't this a simple feature of Fortran? Do other languages have this simple feature? Am I lacking some basic understanding about this issue in general?
vincentjs's answer isn't quite right.
Modern (2003+) Fortran does allow automatic allocation and re-allocation of strings on assignment, so a sequence of statements such as this
character(len=:), allocatable :: string
...
string = 'Hello'
write(*,*) string
string = 'my friend'
write(*,*) string
string = 'Hello '//string
write(*,*) string
is correct and will work as expected, writing out three strings of different lengths. At least one compiler in widespread use, the Intel Fortran compiler, does not enable Fortran 2003 semantics by default, so it may raise an error when compiling this; refer to the documentation for the option that enables Fortran 2003.
However, this feature is not available when reading a string, so you have to resort to the tried and tested (old-fashioned, if you prefer) approach of declaring a buffer of sufficient size for any input and then assigning it to the allocatable variable. Like this:
character(len=long) :: buffer   ! long is some suitably large named constant
character(len=:), allocatable :: string
...
read(*,*) buffer
string = trim(buffer)
No, I don't know why the language standard forbids automatic allocation on read, just that it does.
Deferred length character is a Fortran 2003 feature. Note that many of the complicated methods linked to are written against earlier language versions.
With Fortran 2003 support, reading a complete record into a character variable is relatively straight forward. A simple example with very minimal error handling below. Such a procedure only needs to be written once, and can be customized to suit a user's particular requirements.
PROGRAM main
  USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: INPUT_UNIT
  IMPLICIT NONE
  CHARACTER(:), ALLOCATABLE :: my_string

  CALL read_line(input_unit, my_string)
  WRITE (*, "(A)") my_string
  PRINT *, ALLOCATED(my_string), LEN(my_string)
CONTAINS
  SUBROUTINE read_line(unit, line)
    ! The unit, connected for formatted input, to read the record from.
    INTEGER, INTENT(IN) :: unit
    ! The contents of the record.
    CHARACTER(:), INTENT(OUT), ALLOCATABLE :: line

    INTEGER :: stat            ! IO statement IOSTAT result.
    CHARACTER(256) :: buffer   ! Buffer to read a piece of the record.
    INTEGER :: size            ! Number of characters read from the file.
    !***
    line = ''
    DO
      READ (unit, "(A)", ADVANCE='NO', IOSTAT=stat, SIZE=size) buffer
      IF (stat > 0) STOP 'Error reading file.'
      line = line // buffer(:size)
      ! An end of record condition or end of file condition stops the loop.
      IF (stat < 0) RETURN
    END DO
  END SUBROUTINE read_line
END PROGRAM main
Deferred length arrays are just that: deferred length. You still need to allocate the size of the array using the allocate statement before you can assign values to it. Once you allocate it, you can't change the size of the array unless you deallocate and then reallocate with a new size. That's why you're getting a debug error.
Fortran does not provide a way to dynamically resize character arrays like the std::string class does in C++, for example. In C++, you could initialize std::string var = "temp", then redefine it to var = "temporary" without any extra work, and this would be valid. This is only possible because the resizing is done behind the scenes by the functions in the std::string class (it doubles the size if the buffer limit is exceeded, which is functionally equivalent to reallocating with a 2x bigger array).
Practically speaking, the easiest way I've found when dealing with strings in Fortran is to allocate a reasonably large character array that will fit most expected inputs. If the size of the input exceeds the buffer, then simply increase the size of your array by reallocating with a larger size. Removing trailing white space can be done using trim.
You know that there are "complicated" ways of doing what you want. Rather than address those, I'll answer your first two "why?"s.
Unlike intrinsic assignment, a read statement does not first allocate the target variable to the correct size and type parameters for the thing coming in (if it isn't already like that). Indeed, it is a requirement that the items in an input list be allocated. Fortran 2008, 9.6.3, clearly states:
If an input item or an output item is allocatable, it shall be allocated.
This is the case whether the allocatable variable is a character with deferred length, a variable with other deferred length-type parameters, or an array.
There is another way to declare a character with deferred length: giving it the pointer attribute. This doesn't help you, though, as we also see
If an input item is a pointer, it shall be associated with a definable target ...
Why you have no output from your write statement is related to why you see that the character variable isn't allocated: you haven't followed the requirements of Fortran and so you can't expect the behaviour that isn't specified.
I'll speculate as to why this restriction is here. I see two obvious ways to relax the restriction
allow automatic allocation generally;
allow allocation of a deferred length character.
The second case would be easy:
If an input item or an output item is allocatable, it shall be allocated unless it is a scalar character variable with deferred length.
This, though, is clumsy, and such special cases seem against the ethos of the standard as a whole. We'd also need a carefully thought-out rule about allocation for this special case.
If we go for the general case for allocation, we'd presumably require that the unallocated effective item is the final effective item in the list:
integer, allocatable :: a(:), b(:)
character(7) :: ifile = '1 2 3 4'
read(ifile,*) a, b
and then we have to worry about
type aaargh(len)
integer, len :: len
integer, dimension(len) :: a, b
end type
type(aaargh), allocatable :: a(:)
character(9) :: ifile = '1 2 3 4 5'
read(ifile,*) a
It gets quite messy very quickly. Which seems like a lot of problems to resolve where there are ways, of varying difficulty, of solving the read problem.
Finally, I'll also note that allocation is possible during a data transfer statement. Although a variable appearing in an input list must be allocated (as the rules stand now), components of an allocated variable of derived type needn't be, if that effective item is processed by defined input.
Let's say I have this python string
>>> s = 'dog /superdog/ | cat /thundercat/'
Is there a way to replace the first / with [ and the second / with ]?
I was thinking of something like this.
Output:
'dog [superdog] | cat [thundercat]'
I tried this, but it did not quite work:
>>> s = 'dog /superdog/ | cat /thundercat/'
>>> s.replace('/','[')
'dog [superdog[ | cat [thundercat['
I would like to know the best and most Pythonic way to do this. Thank you!
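No answer to this question appears in the thread; one possible approach (a sketch, not from the thread) is a regular expression that rewrites each /.../ pair in a single pass, so the opening and closing slashes get different replacements:

```python
import re

s = 'dog /superdog/ | cat /thundercat/'
# Capture the text between each pair of slashes and wrap it
# in square brackets instead.
result = re.sub(r'/([^/]*)/', r'[\1]', s)
print(result)  # dog [superdog] | cat [thundercat]
```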
Python can handle arbitrarily large numbers because it has built-in arbitrary-precision integers. The limit is the amount of RAM available to the Python process. Arithmetic on these built-in long integers is implemented on an integer object that starts out at a machine-word size for speed and then allocates more memory on demand.
Integers are commonly stored using a word of memory, which is 4 bytes or 32 bits, so unsigned integers from 0 up to 4,294,967,295 (2^32 - 1) can be stored in one word.
But if your system makes 1 GB available to a Python process, it has 8,589,934,592 bits to represent numbers, and you could in principle use numbers as large as about 2^8589934592 - 1.
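A quick sketch showing that Python integers grow past any fixed word size without losing exactness:

```python
# A value far beyond 64 bits: Python computes it exactly.
big = 2**100
print(big)            # 1267650600228229401496703205376
print(big + 1 - big)  # 1: no rounding, unlike floating point
print(2**32 - 1)      # 4294967295, the 32-bit unsigned limit mentioned above
```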
Computers can only handle numbers up to a certain size, but this is to be taken with some caveats.
-2,147,483,648 through 2,147,483,647 are the limits of signed 32-bit numbers.
Most of today's computers can handle 64-bit numbers, i.e. numbers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, or from -(2^63) to 2^63 - 1.
It is possible to write software that handles arbitrarily large numbers, as long as RAM or storage suffice. Such solutions are rather slow, but, for example, SSL encryption is based on numbers hundreds or thousands of digits long.
As a side note, you are doubling your initial million in every iteration, not adding a million.
Recently I was working on a problem that required me to read many many lines of numbers (around 500,000).
Early on, I found that using input() was way too slow. Using stdin.readline() was much better. However, it still was not fast enough. I found that using the following code:
import io, os
input = io.BytesIO(os.read(0,os.fstat(0).st_size)).readline
and calling input() in this manner improved the runtime. However, I don't actually understand how this code works. Reading the documentation for os.read, the 0 in os.read(0, os.fstat(0).st_size) describes the file we are reading from. What file does 0 describe? Also, fstat describes the status of the file we are reading from, but apparently that second argument denotes the maximum number of bytes to read?
The code works but I want to understand what it is doing and why it is faster. Any help is appreciated.
0 is the file descriptor for standard input. os.fstat(0).st_size will tell Python how many bytes are currently waiting in the standard input buffer. Then os.read(0, ...) will read that many bytes in bulk, again from standard input, producing a bytestring.
(As an additional note, 1 is the file descriptor of standard output, and 2 is standard error.)
Here's a demo:
echo "five" | python3 -c "import os; print(os.stat(0).st_size)"
# => 5
Python found four single-byte characters and a newline in the standard input buffer, and reported five bytes waiting to be read.
Bytestrings are not very convenient to work with if you want text — for one thing, they don't really understand the concept of "lines" — so BytesIO fakes an input stream with the passed bytestring, allowing you to readline from it. I am not 100% sure why this is faster, but my guesses are:
Normal read is likely done character-wise, so that one can detect a line break and stop without reading too much; bulk read is more efficient (and finding newlines post-facto in memory is pretty fast)
There is no encoding processing done this way
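The BytesIO part can be seen in isolation: it wraps an in-memory bytestring in a file-like object so that readline works on it (a small sketch, independent of stdin):

```python
import io

buf = io.BytesIO(b"first line\nsecond line\n")
readline = buf.readline  # bound method, called the same way as input()

print(readline())  # b'first line\n'
print(readline())  # b'second line\n'
print(readline())  # b'' (empty bytestring once the buffer is exhausted)
```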
os.read has the signature (fd, size). Setting size to the number of bytes left in fd causes everything to come rushing at you at once, like a tsunami. There are also standard file descriptors: 0 = stdin, 1 = stdout, 2 = stderr.
Code deconstruction:
import io, os                    # utilities

input = io.BytesIO(              # create an in-memory file (replaces the input built-in)
    os.read(                     # read data from a file descriptor
        0,                       # 0 = stdin
        os.fstat(0).st_size      # number of bytes left in the file
    )
).readline                       # when called, returns one line of the file
Background
I am reading buffers using the Node.js buffer native API. This API has two functions called readUIntBE and readUIntLE for Big Endian and Little Endian respectively.
https://nodejs.org/api/buffer.html#buffer_buf_readuintbe_offset_bytelength_noassert
Problem
By reading the docs, I stumbled upon the following lines:
byteLength Number of bytes to read. Must satisfy: 0 < byteLength <= 6.
If I understand correctly, this means that I can only read 6 bytes at a time using this function, which makes it useless for my use case, as I need to read a timestamp comprised of 8 bytes.
Questions
Is this a documentation typo?
If not, what is the reason for such an arbitrary limitation?
How do I read 8 bytes in a row (or, more generally, sequences longer than 6 bytes)?
Answer
After asking in the official Node.js repo, I got the following response from one of the members:
No it is not a typo
The byteLength corresponds to e.g. 8bit, 16bit, 24bit, 32bit, 40bit and 48bit. More is not possible since JS numbers are only safe up to Number.MAX_SAFE_INTEGER.
If you want to read 8 bytes, you can read multiple entries by adding the offset.
Source: https://github.com/nodejs/node/issues/20249#issuecomment-383899009
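The "read multiple entries by adding the offset" suggestion amounts to combining two 32-bit reads into one 64-bit value. A sketch of the arithmetic in Python, whose integers are arbitrary precision (in JavaScript the combined value can exceed Number.MAX_SAFE_INTEGER, which is why newer Node releases added readBigUInt64BE, returning a BigInt):

```python
buf = bytes.fromhex("0102030405060708")  # an 8-byte big-endian timestamp

# Two 4-byte big-endian reads, then shift the high half into place.
hi = int.from_bytes(buf[0:4], "big")   # like buf.readUIntBE(0, 4)
lo = int.from_bytes(buf[4:8], "big")   # like buf.readUIntBE(4, 4)
value = (hi << 32) | lo

print(hex(value))         # 0x102030405060708, the full 8-byte value
print(value > 2**53 - 1)  # True: too big for a JS double to hold exactly
```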