Inquire POS returns very large integers in Fortran - io

I am reading and writing huge files, using unformatted form and stream access.
While the program runs, I open the same file multiple times and read only the portions I need at that moment. I use huge files in order to avoid writing too many smaller files to the hard disk. I don't read these huge files all at once because they are too large and I would run into memory problems.
In order to read only portions of the files, I do the following. Let's say that I have written the array A(1:10) to a file "data.dat", and that I need to read it twice into an array B(1:5). This is what I do:
real, dimension(5) :: B
integer :: fu, myposition
open(newunit=fu, file="data.dat", status="old", form="unformatted", access="stream")
read(fu, POS=1) B
inquire(unit=fu, POS=myposition)
close(fu)
[....]
open(newunit=fu, file="data.dat", status="old", form="unformatted", access="stream")
read(fu, POS=myposition) B
inquire(unit=fu, POS=myposition)
close(fu)
[...]
My questions are:
Is this approach correct?
When the files are too big, the inquire(fu,POS=myposition) goes wrong because the integer is too big (indeed, I get negative values).
Should I simply declare the integer myposition with a larger integer kind, as in the sketch below?
Or is there a better way to do what I am trying to do?
In other words, are such huge integers a sign that I am using a very clumsy approach?
P.S.
To be more quantitative, this is the order of magnitude: I have thousands of files of around 10 GB each.
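For what it's worth, here is a minimal sketch of the "larger integer kind" idea, assuming the code otherwise stays as above: with stream access, POS= counts file storage units (normally bytes), so in a ~10 GB file the position exceeds the range of a default 32-bit integer, which is where the negative values come from. Declaring myposition with the int64 kind from the intrinsic iso_fortran_env module avoids the overflow:
use, intrinsic :: iso_fortran_env, only: int64
real, dimension(5) :: B
integer :: fu
integer(int64) :: myposition   ! 64-bit kind, large enough for multi-GB byte offsets

open(newunit=fu, file="data.dat", status="old", form="unformatted", access="stream")
read(fu, POS=1) B
inquire(unit=fu, POS=myposition)   ! the POS= variable may be of any integer kind
close(fu)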

Related

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in python here, I suspected that the comparator itself was the issue. In most untyped languages memory copies are not very obvious, despite being a killer for performance.
In this case, as Oded R. established, reading into a buffer and comparing the result with a previously prepared null-filled one is much more efficient.
size = 512
data = bytearray(size)   # reusable read buffer
cmp = bytearray(size)    # reference buffer that stays filled with null bytes
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)         # fills data in place instead of allocating a new string
if data != cmp:          # direct bytewise comparison, no casting to str
    print 'found non-zero data'
Two things that need to be taken into account are:
The size of the compared buffers should be equal, but comparing bigger buffers should be faster up to some point (I would expect memory fragmentation to be the main limit).
The last buffer may not be the same size; reading the file into the prepared buffer will keep the trailing zeroes where we want them.
Here the comparison of the two buffers will be quick, there will be no attempt to cast the bytes to strings (which we don't need), and since we reuse the same memory all the time, the garbage collector won't have much work either... :)

Cluster nodes need to read different sections of an input file - how do I organize it?

I am trying to read an input file in a cluster environment. Different nodes will read different parts of it. However the parts are not clearly separated, but interleaved in a "grid".
For example, a file with 16 elements (assume integers):
0 1 2 3
4 5 6 7
8 9 A B
C D E F
If I use four nodes, the first node will read the top left 2x2 square (0,1,4,5), the second node will read the top right 2x2 square and so on.
How should I handle this? I can use MPI or OpenMP. I have two ideas but I don't know which would work better:
Each node will open the file and have its own handle to it. Each node would read the file independently, using only the part of the file it needs and skipping over the rest of it. In this case, what would be the difference between using fopen or MPI_File_open? Which one would be better?
Use one node to read the whole file and send each part of the input to the node that needs it.
Regarding your question,
I would not suggest the second option you mentioned, that is, using one node to read and then distribute the parts. The reason is that this is slow, especially if the file is large. You have twice the overhead: first the other processes wait, and then the data that was read has to be sent. So it is clearly a no-go for me.
Regarding your first option, there is no big difference between using fopen and MPI_File_open. But here I would still suggest MPI_File_open, to take advantage of facilities like non-blocking I/O operations and shared file pointers (they make life easy).
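As a rough illustration of the first option, here is a minimal MPI-IO sketch in Fortran. The file name, counts and offsets are made up for illustration, and the interleaved 2x2 blocks from the question would in practice need either several such reads per rank or a subarray file view set up with MPI_File_set_view:
program read_portion
   use mpi
   implicit none
   integer :: ierr, rank, fh
   integer :: status(MPI_STATUS_SIZE)
   integer(kind=MPI_OFFSET_KIND) :: offset
   integer :: chunk(4)                      ! each rank reads 4 integers

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   ! every rank opens the same file; no single reader/distributor is needed
   call MPI_File_open(MPI_COMM_WORLD, 'grid.dat', MPI_MODE_RDONLY, &
                      MPI_INFO_NULL, fh, ierr)

   ! hypothetical layout: rank r reads 4 contiguous 4-byte integers at its own offset
   offset = int(rank, MPI_OFFSET_KIND) * 4 * 4
   call MPI_File_read_at(fh, offset, chunk, 4, MPI_INTEGER, status, ierr)

   call MPI_File_close(fh, ierr)
   call MPI_Finalize(ierr)
end program read_portion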

How to avoid programs in status D

I wrote a program that reads and writes data (it opens one input file and one output file, reads part of the input file, processes it, writes to the output file, and that cycle repeats), with a total I/O rate of about 200 MB/s. However, for most of the running time it is in status D, which means waiting for I/O (as shown in the figure). I used dd to check the write speed on my system; it is about 1.8 GB/s.
Are my programs inefficient?
Or does my hard disk have problems?
How can I deal with it?
If using ifort, you must explicitly enable buffered I/O. Either compile with the -assume buffered_io flag or set buffered='yes' in the open statement.
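For example, the buffering can be requested per unit directly in the open statement (BUFFERED= is an Intel Fortran extension; the unit number, file name and other specifiers below are only placeholders):
! Intel Fortran extension: enable buffered I/O on this unit
open(unit=20, file='outfile.dat', form='unformatted', access='stream', &
     action='write', status='replace', buffered='yes')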
If you are using gfortran this is the default, so then there must be some other problem.
Edit
I can add that, depending on how you read and write the data, most of the time can be spent parsing it, i.e. decoding ASCII characters like "123" and converting from base 10 to base 2 until it is machine-readable data, then doing the opposite when writing. This is the case if you construct your code like this:
real :: vector1(10)
do
   read(5,*) vector1   ! line has 10 values
   write(6,*) vector1
enddo
If you instead do the following, it will be much faster:
character(1000) :: line1   ! use enough characters so the whole line fits
do
   read(5,'(A)') line1
   write(6,'(A)') line1
enddo
Now you are just pumping ASCII through the program without even knowing if it's digits or maybe "ääåö(=)&/&%/(¤%/&Rhgksbks---31". With these modifications I think you should reach the maximum of your disk speed.
Notice also that there is a write cache in most drives, which is faster than the disk read/write speeds, meaning that you might first be throttled by the read speed, and after filling up the write cache, be throttled by the write speed, which is usually lower than the read speed.

How to convert fixed size dimension to unlimited in a netcdf file

I'm downloading daily 600MB netcdf-4 files that have this structure:
netcdf myfile {
dimensions:
time_counter = 18 ;
depth = 50 ;
latitude = 361 ;
longitude = 601 ;
variables:
salinity
temp, etc
I'm looking for a better way to convert the time_counter dimension from a fixed size (18) to an unlimited dimension.
I found a way of doing it with the netcdf commands and sed. Like this:
ncdump myfile.nc | sed -e "s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#" | ncgen -o myfileunlimited.nc
which worked for me for small files, but when dumping a 600 MB netcdf file it takes too much memory and time.
Does somebody know another method for accomplishing this?
Your answers are very insightful. I'm not really looking for a way to improve this ncdump-sed-ncgen method; I know that dumping a 600 MB netcdf file uses almost 5 times more space in a text file (CDL representation). Modifying some header text and then generating the netcdf file again doesn't feel very efficient.
I read the latest NCO commands documentation and found an option specific to ncks, "--mk_rec_dmn". Ncks mainly extracts and writes or appends data to a new netcdf file, so this seems the better approach: extract all the data of myfile.nc and write it with a new record (unlimited) dimension, which "--mk_rec_dmn" does, then replace the old file.
ncks --mk_rec_dmn time_counter myfile.nc -o myfileunlimited.nc ; mv myfileunlimited.nc myfile.nc
The opposite operation (record dimension to fixed size) would be:
ncks --fix_rec_dmn time_counter myfile.nc -o myfilefixedsize.nc ; mv myfilefixedsize.nc myfile.nc
The shell pipeline can only be marginally improved by making the sed step only modify the beginning of the file and pass everything else through, but the expression you have is very cheap to process and will not make a dent in the time spent.
The core problem is likely that you're spending a lot of time in ncdump formatting the file information into textual data, and in ncgen parsing textual data into a NetCDF file format again.
As the route through dump+gen is about as slow as it is shown, that leaves using NetCDF functionality to do the conversion of your data files.
If you're lucky, there may be tools that operate directly on your data files to do changes or conversions. If not, you may have to write them yourself with the NetCDF libraries.
If you're extremely unlucky, you may have to go below the NetCDF level: NetCDF-4 files are HDF5 files with some extra metadata. In particular, the length of the dimensions is stored in the _netcdf_dim_info dataset in group _netCDF (or so the documentation tells me).
It may be possible to modify the information there to turn the current length of the time_counter dimension into the value for UNLIMITED (which is the number 0), but if you do this, you really need to verify the integrity of the resulting file, as the documentation neatly puts it:
"Note that modifying these files with HDF5 will almost certainly make them unreadable to netCDF-4."
As a side note, if this process is important to your group, it may be worth looking into what hardware could do the task faster. On my Bulldozer system, the process of converting a 78 megabyte file takes 20 seconds, using around 500 MB memory for ncgen working set (1 GB virtual) and 12 MB memory for ncdump working set (111 MB virtual), each task taking up the better part of a core.
Any decent disk should read/sink your files in 10 seconds or so, memory doesn't matter as long as you don't swap, so CPU is probably your primary concern if you take the dump+gen route.
If concurrent memory use is a big concern, you can trade disk space for memory by saving the intermediary result from sed onto disk, which will likely take up to 1.5 gigabytes or so.
You can use the xarray python package's to_netcdf() method and then optimise memory usage via Dask.
You just need to pass names of the dimensions to make unlimited to the unlimited_dims argument and use the chunks to split the data. For instance:
import xarray as xr
ds = xr.open_dataset('myfile.nc', chunks={'time_counter': 18})
ds.to_netcdf('myfileunlimited.nc', unlimited_dims={'time_counter':True})
There is a nice summary of combining Dask and xarray linked here.

Joining two files with regular expression in Unix (ideally with perl)

I have the following two files, disconnect.txt and answered.txt:
disconnect.txt
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 40397400012 to:40397400032
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 4035350012 to:40677400032
answered.txt
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 40397643433 to:403###34**
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 3455334459 to:1222
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I would like to create a join on these files based on the from: and to: fields and the output should be matching field from answered.txt. For example, in the above two files, the output would be:
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I'm currently doing it by comparing each line in file 1 with each line in file 2, but want to know if an efficient way exists (these files will be in tens of gigabytes).
Thank you
Sounds like you have hundreds of millions of lines?
Unless the files are sorted in such a way that you can expect the order of the from: and to: to at least vaguely correlate, this is a job for a database.
If the files are large the quadratic algorithm will take a lifetime.
Here is a Ruby script that uses just a single hash table lookup per line in answered.txt:
def key s
  s.split('from:')[1].split('to:').map(&:strip).join('.')
end

h = {}

open 'disconnect.txt', 'r' do |f|
  while s = f.gets
    h[key(s)] = true
  end
end

open 'answered.txt', 'r' do |f|
  while a = f.gets
    puts a if h[key(a)]
  end
end
Like ysth says, it all depends on the number of lines in disconnect.txt. If that's a really big[1] number, then you will probably not be able to fit all the keys in memory and you will need a database.
[1] The number of lines in disconnect.txt multiplied by (roughly) 64 should be less than the amount of memory in your machine.
First, sort the files on the from/to timestamps if they are not already sorted that way. (Yes, I know the from/to appear to be stored as epoch seconds, but that's still a timestamp.)
Then take the sorted files and compare the first lines of each.
If the timestamps are the same, you have a match. Hooray! Advance a line in one or both files (depending on your rules for duplicate timestamps in each) and compare again.
If not, grab the next line in whichever file has the earlier timestamp and compare again.
This is the fastest way to compare two (or more) sorted files and it guarantees that no line will be read from disk more than once.
If your files aren't appropriately sorted, then the initial sorting operation may be somewhat expensive on files in the "tens of gigabytes each" size range, but:
You can split the files into arbitrarily-sized chunks (ideally small enough for each chunk to fit into memory), sort each chunk independently, and then generalize the above algorithm from two files to as many as are necessary.
Even if you don't do that and you deal with the disk thrashing involved with sorting files larger than the available memory, sorting and then doing a single pass over each file will still be a lot faster than any solution involving a cartesian join.
Or you could just use a database as mentioned in previous answers. The above method will be more efficient in most, if not all, cases, but a database-based solution would be easier to write and would also provide a lot of flexibility for analyzing your data in other ways without needing to do a complete scan through each file every time you need to access anything in it.
