How to convert fixed size dimension to unlimited in a netcdf file - linux

I'm downloading daily 600MB netcdf-4 files that have this structure:
netcdf myfile {
dimensions:
        time_counter = 18 ;
        depth = 50 ;
        latitude = 361 ;
        longitude = 601 ;
variables:
        salinity
        temp, etc
I'm looking for a better way to convert the time_counter dimension from a fixed size (18) to an unlimited dimension.
I found a way of doing it with the netcdf commands and sed. Like this:
ncdump myfile.nc | sed -e "s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#" | ncgen -o myfileunlimited.nc
which worked for me on small files, but dumping a 600 MB netcdf file takes too much memory and time.
Does anybody know another method for accomplishing this?

Your answers are very insightful. I'm not really looking for a way to improve this ncdump-sed-ncgen method; I know that dumping a 600 MB netcdf file takes almost 5 times more space as a text file (CDL representation). Modifying some header text and then generating the netcdf file again doesn't feel very efficient.
I read the latest NCO commands documentation and found an option specific to ncks, "--mk_rec_dmn". Ncks mainly extracts and writes or appends data to a new netcdf file, so this seems like the better approach: extract all the data of myfile.nc and write it with a new record dimension (unlimited dimension), which is what "--mk_rec_dmn" does, then replace the old file.
ncks --mk_rec_dmn time_counter myfile.nc -o myfileunlimited.nc ; mv myfileunlimited.nc myfile.nc
The opposite operation (record dimension to fixed-size) would be:
ncks --fix_rec_dmn time_counter myfile.nc -o myfilefixedsize.nc ; mv myfilefixedsize.nc myfile.nc

The shell pipeline can only be marginally improved by making the sed step only modify the beginning of the file and pass everything else through, but the expression you have is very cheap to process and will not make a dent in the time spent.
The core problem is likely that you're spending a lot of time in ncdump formatting the file information into textual data, and in ncgen parsing textual data into a NetCDF file format again.
Since the route through dump+gen is about as fast as it is going to get, that leaves using NetCDF functionality to do the conversion of your data files.
If you're lucky, there may be tools that operate directly on your data files to do changes or conversions. If not, you may have to write them yourself with the NetCDF libraries.
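For the write-it-yourself route, a minimal sketch with the netCDF4 Python library might look like the following, assuming netCDF4-python is installed and every variable fits in memory when read in one piece:
from netCDF4 import Dataset

with Dataset('myfile.nc') as src, Dataset('myfileunlimited.nc', 'w', format='NETCDF4') as dst:
    # Copy global attributes
    dst.setncatts({k: src.getncattr(k) for k in src.ncattrs()})
    # Copy dimensions, making time_counter the unlimited (record) dimension
    for name, dim in src.dimensions.items():
        dst.createDimension(name, None if name == 'time_counter' else len(dim))
    # Copy each variable's definition, attributes and data
    for name, var in src.variables.items():
        fill = var.getncattr('_FillValue') if '_FillValue' in var.ncattrs() else None
        out = dst.createVariable(name, var.dtype, var.dimensions, fill_value=fill)
        out.setncatts({k: var.getncattr(k) for k in var.ncattrs() if k != '_FillValue'})
        out[:] = var[:]  # loads the whole variable into memory; fine for ~600 MB files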
If you're extremely unlucky, you'll have to fall back on the fact that NetCDF-4 files are HDF5 files with some extra metadata. In particular, the length of the dimensions is stored in the _netcdf_dim_info dataset in group _netCDF (or so the documentation tells me).
It may be possible to modify the information there to turn the current length of the time_counter dimension into the value for UNLIMITED (which is the number 0), but if you do this, you really need to verify the integrity of the resulting file, as the documentation neatly puts it:
"Note that modifying these files with HDF5 will almost certainly make them unreadable to netCDF-4."
As a side note, if this process is important to your group, it may be worth looking into what hardware could do the task faster. On my Bulldozer system, the process of converting a 78 megabyte file takes 20 seconds, using around 500 MB memory for ncgen working set (1 GB virtual) and 12 MB memory for ncdump working set (111 MB virtual), each task taking up the better part of a core.
Any decent disk should read/sink your files in 10 seconds or so, and memory doesn't matter as long as you don't swap, so CPU is probably your primary concern if you take the dump+gen route.
If concurrent memory use is a big concern, you can trade some speed and disk space for memory by saving the intermediary result from sed onto disk, which will likely take up to 1.5 gigabytes or so.

You can use the xarray Python package's to_netcdf() method, and optimise memory usage by using Dask.
You just need to pass the names of the dimensions to make unlimited to the unlimited_dims argument, and use chunks to split the data. For instance:
import xarray as xr
ds = xr.open_dataset('myfile.nc', chunks={'time_counter': 18})
ds.to_netcdf('myfileunlimited.nc', unlimited_dims={'time_counter':True})
There is a nice summary of combining Dask and xarray linked here.

Related

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but I cannot think of one.
Looking at a guide about memory buffers in Python, I suspected that the comparison itself was the issue. In most non-typed languages memory copies are not very obvious, despite being a killer for performance.
In this case, as Oded R. established, creating a buffer from the read and comparing the result with a previously prepared nul-filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)
Two things that need to be taken into account are:
The size of the compared buffers should be equal, but comparing bigger buffers should be faster up to some point (I would expect memory fragmentation to be the main limit).
The last buffer may not be the same size; reading the file into the prepared buffer will keep the trailing zeroes where we want them.
Here the comparison of the two buffers will be quick, there will be no attempt to cast the bytes to a string (which we don't need), and since we reuse the same memory all the time, the garbage collector won't have much work either... :)
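Putting those pieces together, a minimal sketch of the whole check might look like this; the 64 MB buffer size, file name and start offset are illustrative choices:
BUF_SIZE = 64 * 1048576
FILENAME = 'sdcard.img'     # hypothetical image file
START_OFFSET = 0            # last position known to contain data

zeros = bytearray(BUF_SIZE)   # reference buffer full of nul bytes
data = bytearray(BUF_SIZE)    # reused read buffer, also starts as all zeroes

with open(FILENAME, 'rb') as f:
    f.seek(START_OFFSET)
    block = 0
    while f.readinto(data):
        # Comparing whole buffers is safe: any unread tail still holds zeroes
        # from the previous (all-zero) block or from initialisation.
        if data != zeros:
            print('non-zero data found in block %d' % block)
            break
        block += 1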

How does the bittorrent assemble the missing pieces?

I use BitTorrent and sometimes encounter files that do not have seeds (missing pieces).
At those times, we sometimes force the file transfer to end and try to open the incomplete files (for example, an image file).
If we are lucky, we may be able to see the downloaded image even if some parts are lost.
I would like to artificially reproduce this situation, and here's how I tried:
1) splitting a bmp image file of about 1 megabyte into 16-kilobyte pieces with the Linux split command,
2) then making just one of the split files 0 kilobytes,
3) and after that, rejoining all the files with the cat command.
However, in this case, unlike the torrent's "lost pieces" situation, the file becomes completely corrupt and cannot be read.
Theoretically it does not seem like anything special, but what's wrong? And how can I achieve what I want?
I would appreciate your help.
Use dd:
dd if=/dev/zero of=image.jpg bs=1 conv=notrunc seek=X count=Y
where X is the offset in the file you want to erase and Y is the number of bytes.
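If you prefer to stay in Python, an equivalent sketch is to seek to the position and overwrite it with nul bytes in place; the piece size and offset below are hypothetical values:
PIECE_SIZE = 16 * 1024    # hypothetical 16 KiB piece
OFFSET = 512 * 1024       # hypothetical offset of the piece to destroy

with open('image.bmp', 'r+b') as f:   # read/write, without truncating the file
    f.seek(OFFSET)
    f.write(b'\x00' * PIECE_SIZE)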
About the corruption, it depends on the type of file, the piece you are losing and the program you are using to read it.
For instance, JPG files use a variable bit-length encoding, meaning that losing just one bit may corrupt the whole file from that point on. For exactly that reason, there can be resynchronization points where the bitstream is reset, so from such a point on the file will look OK. But those resync points are optional when writing the file, and not every reader honors them in case of corruption...
And anyway, losing part of the headers will make the file totally unreadable.

Why does Python find different file sizes to Windows?

I'm creating a basic GUI as a college project. It scans a user-selected hard drive on their PC and gives them information about it, such as the number of files on it, etc...
There's a part of my scanning function that, for each file on the drive, takes the size of said file in bytes, and adds it to a running total. At the end of this, after comparing the number to the Windows total, I always find that my Python script finds less data than Windows says is on the drive.
Below is the code...
import os

overall_space_used = 0

def Scan(drive):
    global overall_space_used
    for path, subdirs, files in os.walk(r"" + drive + "\\"):
        for file in files:
            overall_space_used = overall_space_used + os.path.getsize(os.path.join(path, file))
    print(overall_space_used)
When this is executed on one of my HDDs, Python says there are 23,328,445,304 bytes of data in total (21.7 GB). However, when I go into the drive in Windows, it says that there are 23,536,922,624 bytes of data (21.9 GB). Why is there this difference?
I calculated it by hand, and using the same formula that Windows uses to convert from bytes to gibibytes (gibibytes = bytes / 1024**3), I still arrived 0.2 GB short. Why is Python finding less data?
With os.path.getsize(...) you get the actual size of the file.
But NTFS, FAT32, ... filesystems use clusters to store data, so the clusters aren't always filled up completely.
You can see this difference when you go to the properties of a file: there is a difference between 'size' and 'size on disk'. When you check the used space of the drive, it gives you the size of the used-up clusters, not the sizes of the files added up.
Here is some more detailed information:
Why is There a Big Difference Between ‘Size’ and ‘Size on Disk’?
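If you want the script's total to come closer to what Windows reports, a rough sketch is to round each file's size up to a whole number of clusters. The 4096-byte cluster size below is an assumption (a common NTFS default); check the actual value for your volume, and note that compressed or very small resident files will still throw the estimate off:
import os

CLUSTER_SIZE = 4096  # assumed cluster size; check your volume's real value

def size_on_disk(path, cluster=CLUSTER_SIZE):
    # Round the logical file size up to a whole number of clusters
    size = os.path.getsize(path)
    return ((size + cluster - 1) // cluster) * cluster

def scan(drive):
    logical = allocated = 0
    for path, subdirs, files in os.walk(drive + "\\"):
        for name in files:
            full = os.path.join(path, name)
            logical += os.path.getsize(full)
            allocated += size_on_disk(full)
    return logical, allocated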

How to extract interval/range of rows from compressed file?

How do I return an interval of rows from a *.gz file with 100 million rows?
Let's say I need 5 million rows, starting from row 15 million up to row 20 million.
Is this the best-performing option?
zcat myfile.gz|head -20000000|tail -500
real 0m43.106s
user 0m43.154s
sys 0m9.259s
That's a perfectly reasonable option; since you don't know how long a line will be, you basically have to decompress and iterate the lines to figure out where the line separators are. All three tools are fairly heavily optimized, so I/O and decompression time will likely dominate regardless.
In theory, rolling your own solution that combines all three tools in a single executable might save a little (by reducing the costs of IPC a bit), but the savings would likely be negligible.
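If you would rather do it from Python, the same idea (decompress, then skip lines) looks like the sketch below; the row numbers mirror the question's example, and islice counts lines from zero:
import gzip
from itertools import islice

START = 15000000   # first row to keep (0-based)
STOP = 20000000    # stop before this row

with gzip.open('myfile.gz', 'rt') as f:
    for line in islice(f, START, STOP):
        print(line, end='')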

iozone what is record size/record lengh?

We have built a Windows file server and I want to run IOzone to test disk I/O performance, so what kind of test should I run, and how do I know how much I/O I will get at X size of file? Also, what is record size or record length? I came across this term many times while googling.
I am running the following test right now, but I don't know how to read the stats and what the results mean.
iozone -R -r 1M -s 100m
                                                         random   random     bkwd   record   stride
      KB  reclen    write  rewrite     read   reread      read    write     read  rewrite     read   fwrite frewrite      fre
  102400    1024  1438781  1833689  1647187  1731045   1770870  1881794  1933970  4323897  1973719  1954304  1743602    10781
Well, IOzone benchmarks a file system by breaking up a file of a given size into records.
These records are written (or read) in a different way, according to the given test, until the file size is reached.
For example, your command (iozone -R -r 1M -s 100m) asks IOzone to execute all its tests (e.g. read, re-read, write, re-write, etc.) over a file of 100 MB. Read/write operations are split into records of 1 MB, which means that 100 operations over 1 MB records are performed to complete each test.
Have a look at the results. The first number is the size of the file. The second is the record length. The following numbers are the throughput recorded for the different tests. Some tests are done several times (e.g. read, write, etc.): the first time sequentially, the second time by accessing random locations.
The IOzone documentation explains it in detail. Have a look at the description of the tests to understand their meanings.
