Joining two files with regular expressions in Unix (ideally with Perl) - linux

I have the following two files, disconnect.txt and answered.txt:
disconnect.txt
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 40397400012 to:40397400032
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 4035350012 to:40677400032
answered.txt
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 40397643433 to:403###34**
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 3455334459 to:1222
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I would like to create a join on these files based on the from: and to: fields, and the output should be the matching lines from answered.txt. For example, given the two files above, the output would be:
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I'm currently doing it by comparing each line in file 1 with each line in file 2, but I want to know if a more efficient way exists (these files will be tens of gigabytes each).
Thank you

Sounds like you have hundreds of millions of lines?
Unless the files are sorted in such a way that you can expect the order of the from: and to: to at least vaguely correlate, this is a job for a database.

If the files are large, the quadratic algorithm will take a lifetime.
Here is a Ruby script that uses just a single hash table lookup per line in answered.txt:
def key(s)
  # Join the from: and to: values into a single lookup key,
  # e.g. "4035350012.40677400032".
  s.split('from:')[1].split('to:').map(&:strip).join('.')
end

h = {}

# Remember the key of every line in disconnect.txt.
open 'disconnect.txt', 'r' do |f|
  while s = f.gets
    h[key(s)] = true
  end
end

# Print every line of answered.txt whose key was seen in disconnect.txt.
open 'answered.txt', 'r' do |f|
  while a = f.gets
    puts a if h[key(a)]
  end
end
Like ysth says, it all depends on the number of lines in disconnect.txt. If that's a really big number (see the note below), then you will probably not be able to fit all the keys in memory and you will need a database.
Note: the number of lines in disconnect.txt multiplied by (roughly) 64 bytes should be less than the amount of memory in your machine.
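For instance, if disconnect.txt held 100 million lines, the hash would need on the order of 100,000,000 × 64 bytes ≈ 6.4 GB, so it would only fit comfortably on a machine with noticeably more RAM than that.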

First, sort both files on the from:/to: values if they are not already sorted that way. (The values look more like phone numbers than timestamps, but either way they make a perfectly good sort key.)
Then take the sorted files and compare the first lines of each.
If the keys are the same, you have a match. Hooray! Advance a line in one or both files (depending on your rules for duplicate keys in each) and compare again.
If not, grab the next line in whichever file has the smaller key and compare again.
This is the fastest way to compare two (or more) sorted files, and it guarantees that no line will be read from disk more than once.
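To make that merge step concrete, here is a rough sketch in Go rather than Perl (the logic carries over directly); the .sorted file names and the key regexp are assumptions based on the sample lines, and both inputs are assumed to have been pre-sorted lexicographically on the same key (e.g. with LC_ALL=C sort):

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// keyRe pulls out the from:/to: pair the two files are joined on.
var keyRe = regexp.MustCompile(`from: *(\S+) +to:(\S+)`)

func key(line string) string {
	m := keyRe.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	return m[1] + " " + m[2]
}

func main() {
	d, err := os.Open("disconnect.sorted") // assumed pre-sorted copies
	if err != nil {
		panic(err)
	}
	a, err := os.Open("answered.sorted")
	if err != nil {
		panic(err)
	}
	defer d.Close()
	defer a.Close()

	ds, as := bufio.NewScanner(d), bufio.NewScanner(a)
	dok, aok := ds.Scan(), as.Scan()
	for dok && aok {
		dk, ak := key(ds.Text()), key(as.Text())
		switch {
		case dk == ak:
			fmt.Println(as.Text()) // matching line from answered.txt
			aok = as.Scan()
		case dk < ak:
			dok = ds.Scan()
		default:
			aok = as.Scan()
		}
	}
}

Each file is read exactly once, and only one line from each file needs to be in memory at a time.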
If your files aren't appropriately sorted, then the initial sorting operation may be somewhat expensive on files in the "tens of gigabytes each" size range, but:
You can split the files into arbitrarily-sized chunks (ideally small enough for each chunk to fit into memory), sort each chunk independently, and then generalize the above algorithm from two files to as many as are necessary.
Even if you don't do that and you deal with the disk thrashing involved with sorting files larger than the available memory, sorting and then doing a single pass over each file will still be a lot faster than any solution involving a cartesian join.
Or you could just use a database as mentioned in previous answers. The above method will be more efficient in most, if not all, cases, but a database-based solution would be easier to write and would also provide a lot of flexibility for analyzing your data in other ways without needing to do a complete scan through each file every time you need to access anything in it.

Related

Inquire POS returns very large integers in Fortran

I am reading and writing huge files, using unformatted form and stream access.
While the program is running, I open the same file multiple times and read only the portions of the file that I need at that moment. I use huge files in order to avoid writing too many smaller files to the hard disk. I don't read these huge files all at once because they are too large and I would have memory problems.
In order to read only portions of the files, I do this. Let's say that I have written the array A(1:10) to a file "data.dat", and let's say that I need to read it twice into an array B(1:5). This is what I do:
real, dimension(5) :: B
integer :: fu, myposition

! First read: take B from the start of the file and remember where it ended.
open(newunit=fu, file="data.dat", status="old", form='unformatted', access='stream')
read(fu, POS=1) B
inquire(unit=fu, POS=myposition)
close(fu)
[....]
! Second read: reopen and resume from the remembered position.
open(newunit=fu, file="data.dat", status="old", form='unformatted', access='stream')
read(fu, POS=myposition) B
inquire(unit=fu, POS=myposition)
close(fu)
[...]
My questions are:
Is this approach correct?
When the files are too big, the inquire(fu,POS=myposition) goes wrong because the integer is too big (indeed, I get negative values).
Should I simply declare the integer myposition with a larger integer kind?
Or is there a better way to do what I am trying to do?
In other words, is getting such huge integers related to the fact that I am using a very clumsy approach?
P.S.
To be more quantitative, this is the order of magnitude: I have thousands of files of around 10 GB each.
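(For scale: a default integer is typically 4 bytes, so positions beyond 2**31 - 1 = 2,147,483,647, about 2 GiB, cannot be represented, which would explain the negative values I see with 10 GB files.)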

How do I seek for holes and data in a sparse file in golang [duplicate]

I want to copy files from one place to another and the problem is I deal with a lot of sparse files.
Is there any (easy) way of copying sparse files without them becoming huge at the destination?
My basic code:
out, err := os.Create(bricks[0] + "/" + fileName)
in, err := os.Open(event.Name)
io.Copy(out, in)
Some background theory
Note that io.Copy() pipes raw bytes – which is sort of understandable once you consider that it pipes data from an io.Reader to an io.Writer, which provide Read([]byte) and Write([]byte), respectively.
As such, io.Copy() is able to deal with absolutely any source providing
bytes and absolutely any sink consuming them.
On the other hand, the location of the holes in a file is a "side-channel" information which "classic" syscalls such as read(2) hide from their users.
io.Copy() is not able to convey such side-channel information in any way.
In other words, file sparseness was originally just an idea for storing data efficiently behind the user's back.
So, no, there's no way io.Copy() could deal with sparse files in itself.
What to do about it
You'd need to go one level deeper and implement all this using the syscall package and some manual tinkering.
To work with holes, you should use the SEEK_HOLE and SEEK_DATA special values for the lseek(2) syscall which, while formally non-standard, are supported by all major platforms.
Unfortunately, support for those "whence" values is present neither in the stock syscall package (as of Go 1.8.1) nor in the golang.org/x/sys tree.
But fear not, there are two easy steps:
First, the stock syscall.Seek() is actually mapped to lseek(2)
on the relevant platforms.
Next, you'd need to figure out the correct values for SEEK_HOLE and
SEEK_DATA for the platforms you need to support.
Note that they are free to differ between platforms!
Say, on my Linux system I can run a simple
$ grep -E 'SEEK_(HOLE|DATA)' </usr/include/unistd.h
# define SEEK_DATA 3 /* Seek to next data. */
# define SEEK_HOLE 4 /* Seek to next hole. */
…to figure out the values for these symbols.
Now, say, you create a Linux-specific file in your package
containing something like
// +build linux

const (
	SEEK_DATA = 3
	SEEK_HOLE = 4
)
and then use these values with the syscall.Seek().
The file descriptor to pass to syscall.Seek() and friends
can be obtained from an opened file using the Fd() method
of os.File values.
The pattern to use when reading is to detect regions containing data, and read the data from them – see this for one example.
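For illustration, here is a minimal sketch of that pattern using the Linux values from above; the file name is a placeholder, and the loop simply stops on the error (typically ENXIO) that SEEK_DATA returns once no data remains:

package main

import (
	"fmt"
	"os"
	"syscall"
)

// Linux-specific lseek(2) "whence" values, as discussed above.
const (
	SEEK_DATA = 3
	SEEK_HOLE = 4
)

func main() {
	f, err := os.Open("sparse.dat") // placeholder file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fd := int(f.Fd())
	var off int64
	for {
		// Find the start of the next data region at or after off.
		start, err := syscall.Seek(fd, off, SEEK_DATA)
		if err != nil {
			break // no more data, or the filesystem doesn't support SEEK_DATA
		}
		// The data region ends where the next hole begins.
		end, err := syscall.Seek(fd, start, SEEK_HOLE)
		if err != nil {
			break
		}
		fmt.Printf("data region: bytes %d..%d\n", start, end)
		// A real copier would now read exactly the range [start, end) from f.
		off = end
	}
}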
Note that this deals with reading sparse files; but if you want to actually transfer them as sparse – that is, preserving that property – the situation is more complicated: it appears to be even less portable, so some research and experimentation is called for.
On Linux, it appears you could use fallocate(2) with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE to try to punch a hole at the end of the file you're writing to; if that legitimately fails (with syscall.EOPNOTSUPP), you just shovel as many zeroed blocks into the destination file as are covered by the hole you're reading – in the hope that the OS will do the right thing and convert them to a hole by itself.
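As a sketch of that punch-or-zero fallback (the flag values are hard-coded from the Linux headers, so verify them for your platform; the package and function names are made up for the example):

// +build linux

package sparsecopy // hypothetical package name

import (
	"os"
	"syscall"
)

// Linux fallocate(2) mode flags, from <linux/falloc.h>.
const (
	FALLOC_FL_KEEP_SIZE  = 0x1
	FALLOC_FL_PUNCH_HOLE = 0x2
)

// punchOrZero tries to turn [off, off+length) of dst into a hole; if the
// filesystem refuses, it falls back to writing explicit zeros (a real
// implementation would write them in chunks rather than one allocation).
func punchOrZero(dst *os.File, off, length int64) error {
	err := syscall.Fallocate(int(dst.Fd()),
		FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, off, length)
	if err == syscall.EOPNOTSUPP {
		_, err = dst.WriteAt(make([]byte, length), off)
	}
	return err
}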
Note that some filesystems do not support holes at all – as a concept.
One example is the filesystems in the FAT family.
What I'm leading you to is that the inability to create a sparse file might actually be a property of the target filesystem in your case.
You might find Go issue #13548 "archive/tar: add support for writing tar containing sparse files" to be of interest.
One more note: you might also consider checking whether the destination directory resides on the same filesystem as the source file, and if it does, use syscall.Rename() (on POSIX systems) or os.Rename() to just move the file between directories without actually copying its data.
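If you want to make that check concrete, a hedged sketch (Linux-only here; sameFS and its path arguments are invented for the example) compares the device IDs reported by stat(2):

// +build linux

package sparsecopy // hypothetical package name

import "syscall"

// sameFS reports whether the two paths live on the same filesystem,
// i.e. whether a rename could replace a copy.
func sameFS(a, b string) (bool, error) {
	var sa, sb syscall.Stat_t
	if err := syscall.Stat(a, &sa); err != nil {
		return false, err
	}
	if err := syscall.Stat(b, &sb); err != nil {
		return false, err
	}
	return sa.Dev == sb.Dev, nil
}

If sameFS(srcPath, dstDir) reports true, os.Rename() is all you need; otherwise you fall back to the copying logic above.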
You don't need to resort to syscalls.
package main

import "os"

func main() {
	f, _ := os.Create("/tmp/sparse.dat")
	f.Write([]byte("start"))
	f.Seek(1024*1024*10, 0) // 0 == io.SeekStart: jump 10 MB ahead, leaving a hole
	f.Write([]byte("end"))
}
Then you'll see:
$ ls -l /tmp/sparse.dat
-rw-rw-r-- 1 soren soren 10485763 Jun 25 14:29 /tmp/sparse.dat
$ du /tmp/sparse.dat
8 /tmp/sparse.dat
It's true you can't use io.Copy as is. Instead you need to implement an alternative to io.Copy which reads a chunk from the src, checks if it's all '\0'. If it is, just dst.Seek(len(chunk), os.SEEK_CUR) to skip past that part in dst. That particular implementation is left as an exercise to the reader :)
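Since that exercise comes up often, here is one rough, untested take on it; it is only a sketch (the 64 KiB chunk size is arbitrary and the file names are placeholders), not a drop-in replacement for io.Copy:

package main

import (
	"bytes"
	"io"
	"os"
)

// sparseCopy copies src to dst, seeking over chunks that are entirely zero
// so that the destination can keep them as holes.
func sparseCopy(dst, src *os.File) error {
	buf := make([]byte, 64*1024)
	zero := make([]byte, 64*1024)
	for {
		n, err := src.Read(buf)
		if n > 0 {
			if bytes.Equal(buf[:n], zero[:n]) {
				// All zeros: skip forward in dst instead of writing.
				if _, serr := dst.Seek(int64(n), io.SeekCurrent); serr != nil {
					return serr
				}
			} else if _, werr := dst.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if err == io.EOF {
			// Make sure dst ends at the right offset even if the tail was a hole.
			off, serr := dst.Seek(0, io.SeekCurrent)
			if serr != nil {
				return serr
			}
			return dst.Truncate(off)
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	src, err := os.Open("in.dat") // placeholder file names
	if err != nil {
		panic(err)
	}
	defer src.Close()
	dst, err := os.Create("out.dat")
	if err != nil {
		panic(err)
	}
	defer dst.Close()
	if err := sparseCopy(dst, src); err != nil {
		panic(err)
	}
}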

How to extract interval/range of rows from compressed file?

How do I return an interval of rows from a *.gz file with 100 million rows?
Let's say I need 5 million rows, starting from row 15 million up to row 20 million.
Is this the best-performing option?
zcat myfile.gz|head -20000000|tail -500
real 0m43.106s
user 0m43.154s
sys 0m9.259s
That's a perfectly reasonable option; since you don't know how long a line will be, you basically have to decompress and iterate the lines to figure out where the line separators are. All three tools are fairly heavily optimized, so I/O and decompression time will likely dominate regardless.
In theory, rolling your own solution that combines all three tools in a single executable might save a little (by reducing the costs of IPC a bit), but the savings would likely be negligible.
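If you did want to try a single-process version, a sketch along these lines would do it (the file name and the 1 MiB line-length cap are assumptions; line numbers are 1-based to match the question):

package main

import (
	"bufio"
	"compress/gzip"
	"os"
)

func main() {
	f, err := os.Open("myfile.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	zr, err := gzip.NewReader(f)
	if err != nil {
		panic(err)
	}
	defer zr.Close()

	const from, to = 15000001, 20000000 // 1-based, inclusive

	sc := bufio.NewScanner(zr)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow lines up to ~1 MiB
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	for n := 1; n <= to && sc.Scan(); n++ {
		if n >= from {
			out.WriteString(sc.Text())
			out.WriteByte('\n')
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}

It still has to decompress and scan everything up to line 20,000,000, so don't expect it to beat the pipeline by much.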

Changing the head of a large Fortran binary file without dealing with the whole body

I have a large binary file (~GB size) generated from a Fortran 90 program. I want to modify something in the head part of the file. The structure of the file is very complicated and contains many different variables, which I want to avoid going into. After reading and re-writing the head, is it possible to "copy and paste" the remainder of the file without knowing its detailed structure? Or, even better, can I avoid re-writing the whole file altogether and just make changes to the original file? (Not sure if it matters, but the length of the header will be changed.)
Since you are changing the length of the header, I think that you have to write a new, revised file. You could avoid having to "understand" the records after the header by opening the file with stream access and just reading bytes (or perhaps four-byte words, if the file is a multiple of four bytes) until you reach EOF and copying them to the new file. But if the file was originally created as sequential access and you want to access it that way in the future, you will have to handle the record-length information for the header record(s), including altering the value(s) to be consistent with the changed length of the record(s). This record-length information is typically a four-byte integer at the beginning and end of each record, but it depends on the compiler.

Splitting long input into multiple text files

I have some code which will generate a practically unbounded number of lines of output, so I can't store them all in a single output file.
Instead, I split the output into multiple files, currently according to index numbers. My problem is that I don't know in advance how many lines there will be. So, is it possible to split the output into different files without giving an index? For example:
first 100,000 lines in m.txt
lines 100,001 to 200,000 in n.txt
If you don't need to be able to find a particular line based on the file name, you can split the output based on file size: write lines to m1.txt until the next line would make it larger than 1 MB, then move on to the next file, m2.txt.
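A rough sketch of that size-based approach, reading the generator's output from standard input (the m%d.txt naming and the 1 MB threshold are just placeholders):

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	const maxBytes = 1 << 20 // roll over to a new file after ~1 MB

	in := bufio.NewScanner(os.Stdin)
	part, written := 1, 0

	out, err := os.Create(fmt.Sprintf("m%d.txt", part))
	if err != nil {
		panic(err)
	}
	w := bufio.NewWriter(out)

	for in.Scan() {
		line := in.Text() + "\n"
		if written > 0 && written+len(line) > maxBytes {
			// This line would push the current file past the limit:
			// close it and start the next one.
			w.Flush()
			out.Close()
			part++
			written = 0
			if out, err = os.Create(fmt.Sprintf("m%d.txt", part)); err != nil {
				panic(err)
			}
			w = bufio.NewWriter(out)
		}
		w.WriteString(line)
		written += len(line)
	}
	w.Flush()
	out.Close()
}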
split(1) appears to be exactly the tool for your job.
Generate files with a running index. Start by opening e.g. m_000001.txt. Write a fixed number of lines to that file. Close it. Open the next file, e.g. m_000002.txt, and continue.
Making sure that you don't overflow the disk is a housekeeping task to be done separately. Here one can think of backups, compression, file rotation and so on.
You may want to use logrotate for this purpose. It has a lot of options: check out the man page.
Here's the introduction of the man page:
"logrotate is designed to ease administration of systems that generate
large numbers of log files. It allows automatic rotation, compression,
removal, and mailing of log files. Each log file may be handled daily,
weekly, monthly, or when it grows too large."
There are four ways to split while writing:
A) Fixed number of characters (size)
B) Fixed number of lines
C) Fixed interval of time before writing
D) Fixed count of calls to a function before writing
Based on those splittings, you can name the output files.
