In Julia v1.1, suppose I have a very large text file (30 GB) and I want to read it line by line using multiple threads. How can I do that?
This code is my attempt after reading Julia's documentation on multi-threading, but it's not working at all:
open("pathtofile", "r") do file
    # Get the file size in bytes
    seekend(file)
    fileSize = position(file)
    seekstart(file)
    # Skip the first nseekchars characters of the file
    seek(file, nseekchars)
    # Progress bar, because it's a HUGE file
    p = Progress(fileSize, 1, "Reading file...", 40)
    Threads.@threads for ln in eachline(file)
        # do something on ln
        u, v = map(x -> parse(UInt32, x), split(ln))
        # ... other interesting things
        update!(p, position(file))
    end
end
Note 1: you need using ProgressMeter (I want my code to show a progress bar while reading the file in parallel)
Note 2: nseekchars is an Int, the number of characters I want to skip at the beginning of my file
Note 3: the code works, but does no parallelism, without the Threads.@threads macro next to the for loop
For the maximum I/O performance:
Parallelize the hardware: use disk arrays rather than a single drive. Try searching for RAID performance for many excellent explanations (or ask a separate question)
Use the Julia memory mapping mechanism
using Mmap
s = open("my_file.txt", "r")
a = Mmap.mmap(s)
Once having the memory mapping, do the processing in parallel. Beware of false sharing for threads (depends on your actual scenario).
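To make that second point concrete, here is a minimal sketch of the map-once, split-into-ranges pattern, written in Python for brevity (in Julia the same shape would be Mmap.mmap plus Threads.@threads over precomputed ranges). The function names, worker count, and the trivial per-range scan are all illustrative:

```python
import mmap
from concurrent.futures import ThreadPoolExecutor

def scan_range(mm, start, end):
    # Stand-in for real per-range work; counting newlines is cheap and
    # gives the same total no matter where the range boundaries fall.
    return mm[start:end].count(b"\n")

def parallel_line_count(path, nworkers=4):
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        size = len(mm)
        step = -(-size // nworkers)  # ceiling division
        ranges = [(i * step, min((i + 1) * step, size)) for i in range(nworkers)]
        with ThreadPoolExecutor(nworkers) as ex:
            # Each worker reads a disjoint slice of the mapping; nobody
            # writes to shared memory, so false sharing is not an issue here.
            return sum(ex.map(lambda r: scan_range(mm, *r), ranges))
    finally:
        mm.close()
```

For CPU-heavy parsing in Python you would switch to processes, since the GIL serializes pure-Python byte work; the point of the sketch is only the structure of splitting a mapped file into disjoint ranges.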
I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in Python here, I suspected that the comparison itself was the issue. In most dynamically typed languages, memory copies are not very obvious, despite being a killer for performance.
In this case, as Oded R. established, creating a buffer from read and comparing the result with a previously prepared nul filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)
Two things need to be taken into account:
The compared buffers should be equal in size, though comparing bigger buffers should be faster up to a point (I would expect memory fragmentation to be the main limit)
The last read may be shorter than the buffer; reading the file into the prepared buffer will keep the trailing zeroes where we want them
Here the comparison of the two buffers will be quick, there will be no attempt to cast the bytes to a string (which we don't need), and since we reuse the same memory all the time, the garbage collector won't have much work either... :)
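Putting those pieces together, a minimal sketch of the whole loop might look like this (the chunk size and the chunk-index return convention are illustrative, not the exact code from the answer above):

```python
def find_first_data_chunk(path, start=0, chunk_size=64 * 1048576):
    """Return the index of the first chunk after `start` that contains a
    non-null byte, or None if the rest of the file is all nulls."""
    data = bytearray(chunk_size)       # reused read buffer
    zeros = bytes(chunk_size)          # prepared all-null reference buffer
    with open(path, "rb") as f:
        f.seek(start)
        i = 0
        while True:
            n = f.readinto(data)       # fills data in place, no new objects
            if n == 0:
                return None            # EOF: everything was null
            # Compare only the bytes actually read, so a short final read
            # cannot be polluted by leftovers from the previous iteration.
            if data[:n] != zeros[:n]:
                return i
            i += 1
```

The readinto/compare pair allocates nothing per iteration, which is exactly the property that makes this approach fast.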
I wrote a program that reads and writes data (it opens one infile and one outfile, reads part of the infile, processes it, writes to the outfile, and repeats the cycle), with total I/O of about 200 MB/s. However, for most of the running time the processes are in status D, which means waiting for I/O (as shown in the figure). I used dd to check the write speed on my system; it is about 1.8 GB/s.
Are my programs inefficient?
Or does my hard disk have problems?
How can I deal with it?
If using ifort, you must explicitly use buffered I/O. Flag with -assume buffered_io when compiling or set buffered='yes' in the open statement.
If you are using gfortran this is the default, so then there must be some other problem.
Edit
I can add that, depending on how you read and write the data, most time can be spent parsing it, i.e. decoding the ASCII characters and converting between base 10 and base 2 until it is machine-readable data, then doing the opposite when writing. This is the case if you construct your code like this:
real :: vector1(10)
do
    read(5,*) vector1 ! line has 10 values
    write(6,*) vector1
enddo
If you instead do the following, it will be much faster:
character(1000) :: line1 ! use enough characters so the whole line fits
do
    read(5,'(A)') line1
    write(6,'(A)') line1
enddo
Now you are just pumping ASCII through the program without even knowing whether it's digits or maybe "ääåö(=)&/&%/(¤%/&Rhgksbks---31". With these modifications I think you should reach the maximum of your disk speed.
Notice also that there is a write cache in most drives, which is faster than the disk read/write speeds, meaning that you might first be throttled by the read speed, and after filling up the write cache, be throttled by the write speed, which is usually lower than the read speed.
I'm downloading daily 600MB netcdf-4 files that have this structure:
netcdf myfile {
dimensions:
time_counter = 18 ;
depth = 50 ;
latitude = 361 ;
longitude = 601 ;
variables:
salinity
temp, etc
I'm looking for a better way to convert the time_counter dimension from a fixed size (18) to an unlimited dimension.
I found a way of doing it with the netcdf commands and sed. Like this:
ncdump myfile.nc | sed -e "s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#" | ncgen -o myfileunlimited.nc
which worked for me for small files, but when dumping a 600 MB netCDF file it takes too much memory and time.
Does somebody know another method for accomplishing this?
Your answers are very insightful. I'm not really looking for a way to improve this ncdump-sed-ncgen method; I know that dumping a 600 MB netCDF file uses almost 5 times more space in a text file (CDL representation). Modifying some header text and then generating the netCDF file again doesn't feel very efficient.
I read the latest NCO commands documentation and found an option specific to ncks, "--mk_rec_dmn". ncks mainly extracts and writes or appends data to a new netCDF file, so this seems the better approach: extract all the data of myfile.nc and write it with a new record dimension (unlimited dimension), which "--mk_rec_dmn" does, then replace the old file.
ncks --mk_rec_dmn time_counter myfile.nc -o myfileunlimited.nc ; mv myfileunlimited.nc myfile.nc
To do the opposite operation (record dimension to fixed-size) would be.
ncks --fix_rec_dmn time_counter myfile.nc -o myfilefixedsize.nc ; mv myfilefixedsize.nc myfile.nc
The shell pipeline can only be marginally improved by making the sed step only modify the beginning of the file and pass everything else through, but the expression you have is very cheap to process and will not make a dent in the time spent.
The core problem is likely that you're spending a lot of time in ncdump formatting the file information into textual data, and in ncgen parsing textual data into a NetCDF file format again.
Since the dump+gen route is about as slow as shown, that leaves using NetCDF functionality to do the conversion of your data files.
If you're lucky, there may be tools that operate directly on your data files to do changes or conversions. If not, you may have to write them yourself with the NetCDF libraries.
As a last resort, NetCDF-4 files are HDF5 files with some extra metadata, so you could try editing the HDF5 structures directly. In particular, the length of the dimensions is stored in the _netcdf_dim_info dataset in group _netCDF (or so the documentation tells me).
It may be possible to modify the information there to turn the current length of the time_counter dimension into the value for UNLIMITED (which is the number 0), but if you do this, you really need to verify the integrity of the resulting file, as the documentation neatly puts it:
"Note that modifying these files with HDF5 will almost certainly make them unreadable to netCDF-4."
As a side note, if this process is important to your group, it may be worth looking into what hardware could do the task faster. On my Bulldozer system, the process of converting a 78 megabyte file takes 20 seconds, using around 500 MB memory for ncgen working set (1 GB virtual) and 12 MB memory for ncdump working set (111 MB virtual), each task taking up the better part of a core.
Any decent disk should read/sink your files in 10 seconds or so, memory doesn't matter as long as you don't swap, so CPU is probably your primary concern if you take the dump+gen route.
If concurrent memory use is a big concern, you can trade some disk space for memory by saving the intermediate result from sed onto disk, which will likely take up to 1.5 gigabytes or so.
You can use the xarray Python package's to_netcdf() method, then optimise memory usage via Dask.
You just need to pass the names of the dimensions to make unlimited in the unlimited_dims argument and use chunks to split the data. For instance:
import xarray as xr
ds = xr.open_dataset('myfile.nc', chunks={'time_counter': 18})
ds.to_netcdf('myfileunlimited.nc', unlimited_dims={'time_counter':True})
There is a nice summary of combining Dask and xarray linked here.
I basically want the equivalent of
... | sort -arg1 -arg2 -... | head -n $k
but, my understanding is that sort will go O(n log n) over the whole input. In my case I'm dealing with lots of data, so runtime matters to me - and also I have a habit of overflowing my tmp/ folder with sort temporary files.
I'd rather have it go O(n log k) using e.g. a heap, which would presumably go faster, and which also reduces the working set memory to k as well.
Is there some combination of standard command-line tools that can do this efficiently, without me having to code something myself? Ideally it would support the full expressive sort power of the sort command. sort (on ubuntu at least) appears to have no man-page-documented switch to pull it off...
Based on the above, and some more poking, I'd say the official answer to my question is "there is no solution." You can use specialized tools, or you can use the tools you've got with their current performance, or you can write your own tool.
I'm debating tracking down the sort source code and offering a patch. In the meantime, in case this quick hack helps anybody doing something similar to what I was doing, here's what I wrote for myself. Not the best Python, and a very shady benchmark; I offer it to anybody who cares to provide a more rigorous one:
256 files, of about 1.6 Gigs total size, all sitting on an ssd, lines
separated by \n, lines of format [^\t]*\t[0-9]+
Ubuntu 10.4, 6 cores, 8 gigs of ram, /tmp on ssd as well.
$ time sort -t^v<tab> -k2,2n foo* | tail -10000
real 7m26.444s
user 7m19.790s
sys 0m17.530s
$ time python test.py 10000 foo*
real 1m29.935s
user 1m28.640s
sys 0m1.220s
Using diff to analyze, the two methods differ on tie-breaking, but otherwise the sort order is the same.
test.py:
#!/usr/bin/env python
# test.py
from sys import argv
import heapq
from itertools import chain

# parse N - the size of the heap, and confirm we can open all input files
N = int(argv[1])
streams = [open(f, "r") for f in argv[2:]]

def line_iterator_to_tuple_iterator(line_i):
    for line in line_i:
        s, c = line.split("\t")
        c = int(c)
        yield (c, s)

# use heap to process inputs
rez = heapq.nlargest(N,
                     line_iterator_to_tuple_iterator(chain(*streams)),
                     key=lambda x: x[0])

for r in rez:
    print "%s\t%s" % (r[1], r[0])

for s in streams:
    s.close()
UNIX/Linux provides a generalist toolset. For large datasets it does loads of I/O. It will do everything you could want, but slowly. If we had an idea of the input data it would help immensely.
IMO, you have some choices, none of which you will really like.
Do a multipart "radix" pre-sort - for example, have awk write all of the lines whose keys start with 'A' to one file, 'B' to another, etc. Or if you only want 'P', 'D' & 'Q', have awk just pull out what you want. Then do a full sort on the small subset. This creates up to 26 files named A, B ... Z:
awk '{print $0 > substr($0,1,1)}' bigfile; sort [options here] P D Q > result
Spend $$: (Example) Buy CoSort from iri.com or any other commercial sort software. These sorts use all kinds of optimizations, but they are not free like bash. You could also buy an SSD, which speeds up sorting on disk by several orders of magnitude (say from 5000 IOPS to 75000 IOPS). Use the TMPDIR variable to put your tmp files on the SSD, and read and write only to the SSD, but keep using your existing UNIX toolset.
Use some software like R or Stata, or preferably a database; all of these are meant for large datasets.
Do what you are doing now, but watch youtube while the UNIX sort runs.
IMO, you are using the wrong tools for large datasets when you want quick results.
Here's a crude partial solution:
#!/usr/bin/perl

use strict;
use warnings;

my @lines = ();
while (<>) {
    push @lines, $_;
    @lines = sort @lines;
    if (scalar @lines > 10) {
        pop @lines;
    }
}
print @lines;
It reads the input data only once, continuously maintaining a sorted array of the top 10 lines.
Sorting the whole array every time is inefficient, of course, but I'll guess that for a gigabyte input it will still be substantially faster than sort huge-file | head.
Adding an option to vary the number of lines printed would be easy enough. Adding options to control how the sorting is done would be a bit more difficult, though I wouldn't be surprised if there's something in CPAN that would help with that.
More abstractly, one approach to getting just the first N sorted elements from a large array is to use a partial Quicksort, where you don't bother sorting the right partition unless you need to. That requires holding the entire array in memory, which is probably impractical in your case.
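A minimal sketch of that partial quicksort idea: partition as usual, always recurse into the left side, but descend into the right side only when it still overlaps the first N slots.

```python
import random

def partial_quicksort(a, n, lo=0, hi=None):
    """Rearrange a in place so that a[:n] holds the n smallest items, sorted."""
    if hi is None:
        hi = len(a)
    if hi - lo <= 1:
        return
    pivot = a[random.randrange(lo, hi)]
    # Three-way partition: [lo, lt) < pivot, [lt, gt) == pivot, [gt, hi) > pivot
    lt, i, gt = lo, lo, hi
    while i < gt:
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > pivot:
            gt -= 1
            a[gt], a[i] = a[i], a[gt]
        else:
            i += 1
    partial_quicksort(a, n, lo, lt)
    if gt < n:  # the right partition matters only if it reaches into a[:n]
        partial_quicksort(a, n, gt, hi)
```

As the paragraph notes, this still needs the whole array in memory; it only saves the comparisons a full sort would spend ordering elements beyond position n.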
You could split the input into medium-sized chunks, apply some clever algorithm to get the top N lines of each chunk, concatenate the chunks together, then apply the same algorithm to the result. Depending on the sizes of the chunks, sort ... | head might be sufficiently clever. It shouldn't be difficult to throw together a shell script using split -l ... to do this.
(Insert more hand-waving as needed.)
Disclaimer: I just tried this on a much smaller file than what you're working with (about 1.7 million lines), and my method was slower than sort ... | head.
I'm trying to search rather big files for a certain string and return its offset. I'm new to lua and my current approach would look like this:
linenumber = 0
for line in io.lines(filepath) do
    result = string.find(line, "ABC", 1)
    linenumber = linenumber + 1
    if result ~= nil then
        offset = linenumber * 4096 + result
        io.close()
    end
end
I realize that this way is rather primitive and certainly slow. How could I do this more efficiently?
Thanks in advance.
If the file is not too big and you can spare the memory, it's faster to just slurp in the whole file and use string.find on it. If not, you can search the file by blocks.
Your approach isn't all that bad. I'd suggest loading the file in overlapping blocks though. The overlap avoids having the pattern split just between the blocks and going unnoticed like:
".... ...A BC.. ...."
My implementation goes like this:
size = 4096 -- note: size should be bigger than the length of pat to work
pat = "ABC"
overlap = #pat
fh = io.open(filepath, 'rb') -- On Windows, do NOT forget the b
block = fh:read(size + overlap)
n = 0
while block do
    block_offset = block:find(pat)
    if block_offset then
        print(block_offset)
        offset = block_offset + size * n
        break
    end
    fh:seek('cur', -overlap)
    cur = fh:seek'cur'
    block = fh:read(size + overlap)
    n = n + 1
end
if offset then
    print('found pattern at', offset, 'after reading', n, 'blocks')
else
    print('did not find pattern')
end
If your file really has lines, you can also use the trick explained here. This section in the Programming in Lua book explains some performance considerations reading files.
Unless your lines all have the same length (4096), I don't see how your code can work.
Instead of using io.lines, read blocks with io.read(4096). The rest of your code can be used as is, except that you need to handle the case where your string is not fully inside a block. If the file is composed of lines, then a trick mentioned in Programming in Lua is to do io.read(4096, "*l"), to read blocks that end at line boundaries. Then you don't have to worry about strings not being fully inside a block, but you need to adjust the offset calculation to use the actual length of each block, not just 4096.
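The block-plus-line-boundary idea can be sketched in Python (the Lua equivalent of the extension step is io.read(4096, "*l")): read a fixed block, extend it to the next newline so a pattern inside a line can never straddle two blocks, and advance the base offset by the real block length. The function name and block size are illustrative:

```python
def find_offset(path, pattern, block_size=4096):
    """Return the 0-based byte offset of the first occurrence of pattern,
    or None if not found. Assumes the pattern contains no newline."""
    with open(path, "rb") as f:
        base = 0
        while True:
            block = f.read(block_size)
            if not block:
                return None
            block += f.readline()      # extend the block to a line boundary
            pos = block.find(pattern)
            if pos != -1:
                return base + pos
            base += len(block)         # real block length, not just block_size
```

Because every block ends on a newline, no overlap bookkeeping is needed, at the cost of slightly variable block sizes.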