Splitting long input into multiple text files - linux

I have some code which will generate an infinite number of lines in output. So, I can't store those values in a single output file.
Instead, I split the output file into more files. I am splitting the file according to the index numbers. Now my doubt is I don't know how many numbers my file will be having. So is it possible to split the file into different output without giving index? For example:
first 100,000 lines in m.txt
from 100,001 to next 200,000 in n.txt

If you don't need to be able to find a particular line based on the file name, you can split the output based on the file size. Write lines to m1.txt until the next line will make it >1MB; then move to the next file - m2.txt.

split(1) appears to be exactly the tool for your job.

Generate files with a running index. Start with opening e.g. m_000001.txt. Write a fixed nuber of lines to that file. Close file. Open next file, e.g. m_000002.txt, and continue.
Making sure that you don't overflow the disk is an housekeepting task to be done separately. Here one can think of backups, compression, file rotation and so on.

You may want to use logrotate for this purpose. It has a lot of options: check out the man page.
Here's the introduction of the man page:
"logrotate is designed to ease administration of systems that generate
large numbers of log files. It allows automatic rotation, compression,
removal, and mailing of log files. Each log file may be handled daily,
weekly, monthly, or when it grows too large."

4 ways to split while writing:
A) Fixed no of characters (Size)
B) Fixed no of lines
C) Fixed Interval of time before writing
D) Fixed Counter of a function before calling a write
Based on those splitings, You can name the output file.

Related

Read a huge log file line by line and random access in node?

I'm working on a log file reader which will parse files and display fields from lines in a nicely formatted table in a node/electron app.
If the files were small, I could just read them line-by-line, parse them, store fields extracted from each line in a data structure and allow clients to scroll back and forth throughout the whole the file.
Since these files can be several gigabytes long, I need to do something more complex.
My current thinking is to either:
Read the whole file via the readline package.
Keep track of line ending offsets
Once the file is read, parse the bottom (hence most recent) 50 or so lines so I can extract relevant data and display visually
If client wants to scroll beyond my 50 lines, use the offsets to go to previous lines (via fs.read(..)).
Another method:
Use fs.read() to go straight to the end
Work backwards until newline characters are found
If client wants to scroll around the file, figure out the line offsets on demand
This doesn't even take into account building tail -f style functionality.
I'll have to account for at least ascii and utf8 encodings, as well as windows vs linux style line endings.
This is a LOT of low level work.
Is there a library which already provides functionality?
If I do this myself, any major caveats I haven't mentioned here? I haven't done low level, random access, programming for 20 years.

Changing the head of a large Fortran binary file without dealing with the whole body

I have a large binary file (~ GB size) generated from a Fortran 90 program. I want to modify something in the head part of the file. The structure of the file is very complicated and contains many different variables, which I want to avoid going into. After reading and re-writing the head, is it possible to "copy and paste" the reminder of the file without knowing its detailed structure? Or even better, can I avoid re-writing the whole file altogether and just make changes on the original file? (Not sure if it matters, but the length of the header will be changed.)
Since you are changing the length of the header, I think that you have to write a new, revised file. You could avoid having to "understand" the records after the header by opening the file with stream access and just reading bytes (or perhaps four byte words if the file is a multiple of four bytes) until you reach EOF and copying them to the new file. But if the file was originally created as sequential access and you want to access it that way in the future, you will have to handle the record length information for the header record(s), including altering the value(s) to be consistent with the changed the length of the record(s). This record length information is typically a four-byte integer at beginning and end of each record, but it depends on the compiler.

Zipping a folder into equal size parts

I've been using 7Zip for a few years now and always liked that I could zip a folder into several parts of a specific size. For example, the website BOX only allows uploads under 100MB so anything I wanted to put into BOX, I just split the zip file into 95MB files. However, recently I've needed to do something similar except instead of breaking into a certain size, I need to split them up into a specific number of files but all equaling the same size. Right now, 7zip breaks them into the max size you allow and the last file is any remaining data ranging from 1KB up to the limit specified.
For example, say I have a 826MB file, I want it to zip up 5 files that are all the same size. Is there any program out there that will do this?
Thanks in advanced!
I don't know of any program that does this, but if this is something that you're doing regularly, you could write a script that:
Finds out the size of the file
Calculates the maximum piece size to use if you want to split it into n pieces.
Constructs a corresponding 7zip command

Joining two files with regular expression in Unix (ideally with perl)

I have following two files disconnect.txt and answered.txt:
disconnect.txt
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 40397400012 to:40397400032
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 4035350012 to:40677400032
answered.txt
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 40397643433 to:403###34**
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 3455334459 to:1222
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I would like to create a join on these files based on the from: and to: fields and the output should be matching field from answered.txt. For example, in the above two files, the output would be:
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I'm currently doing it by comparing each line in file 1 with each line in file 2, but want to know if an efficient way exists (these files will be in tens of gigabytes).
Thank you
Sounds like you have hundreds of millions of lines?
Unless the files are sorted in such a way that you can expect the order of the from: and to: to at least vaguely correlate, this is a job for a database.
If the files are large the quadratic algorithm will take a lifetime.
Here is a Ruby script that uses just a single hash table lookup per line in answered.txt:
def key s
s.split('from:')[1].split('to:').map(&:strip).join('.')
end
h = {}
open 'disconnect.txt', 'r' do |f|
while s = f.gets
h[key(s)] = true
end
end
open 'answered.txt', 'r' do |f|
while a = f.gets
puts a if h[key(a)]
end
end
Like ysth says, it all depends on the number of lines in disconnect.txt. If that's a really big1 number, then you will probably not be able to fit all the keys in memory and you will need a database.
1. The number of lines in disconnect.txt multiplied by (roughly) 64 should be less than the amount of memory in your machine.
First, sort the files on the from/to timestamps if they are not already sorted that way. (Yes, I know the from/to appear to be stored as epoch seconds, but that's still a timestamp.)
Then take the sorted files and compare the first lines of each.
If the timestamps are the same, you have a match. Hooray! Advance a line in one or both files (depending on your rules for duplicate timestamps in each) and compare again.
If not, grab the next line in whichever file has the earlier timestamp and compare again.
This is the fastest way to compare two (or more) sorted files and it guarantees that no line will be read from disk more than once.
If your files aren't appropriately sorted, then the initial sorting operation may be somewhat expensive on files in the "tens of gigabytes each" size range, but:
You can split the files into arbitrarily-sized chunks (ideally small enough for each chunk to fit into memory), sort each chunk independently, and then generalize the above algorithm from two files to as many as are necessary.
Even if you don't do that and you deal with the disk thrashing involved with sorting files larger than the available memory, sorting and then doing a single pass over each file will still be a lot faster than any solution involving a cartesian join.
Or you could just use a database as mentioned in previous answers. The above method will be more efficient in most, if not all, cases, but a database-based solution would be easier to write and would also provide a lot of flexibility for analyzing your data in other ways without needing to do a complete scan through each file every time you need to access anything in it.

How can I compare two zip format(.tar,.gz,.Z) files in Unix

I have two gz files. I want to compare those files without extracting. for example:
first file is number.txt.gz - inside that file:
1111,589,3698,
2222,598,4589,
3333,478,2695,
4444,258,3694,
second file - xxx.txt.gz:
1111,589,3698,
2222,598,4589,
I want to compare any column between those files. If column1 in first file is equal to the 1st column of second file means I want output like this:
1111,589,3698,
2222,598,4589,
You can't do this.
You can compare all content from archive by comparing archives but not part of data in compressed files.
You can compare selected files in archive too without unpacking because archive has metadata with CRC32 control sum and you must compare this sum to know this without unpacking.
If you need to check and compare your data after it's written to those huge files, and you have time and space constraints preventing you from doing this, then you're using the wrong storage format. If your data storage format doesn't support your process then that's what you need to change.
My suggestion would be to throw your data into a database rather than writing it to compressed files. With sensible keys, comparison of subsets of that data can be accomplished with a simple query, and deleting no longer needed data becomes similarly simple.
Transactionality and strict SQL compliance are probably not priorities here, so I'd go with MySQL (with the MyISAM driver) as a simple, fast DB.
EDIT: Alternatively, Blorgbeard's suggestion is perfectly reasonable and feasible. In any programming language that has access to (de)compression libraries, you can read your way sequentially through the compressed file without writing the expanded text to disk; and if you do this side-by-side for two input files, you can implement your comparison with no space problem at all.
As for the time problem, you will find that reading and uncompressing the file (but not writing it to disk) is much faster than writing to disk. I recently wrote a similar program that takes a .ZIPped file as input and creates a .ZIPped file as output without ever writing uncompressed data to file; and it runs much more quickly than an earlier version that unpacked, processed and re-packed the data.
You cannot compare the files while they remain compressed using different techniques.
You must first decompress the files, and then find the difference between the results.
Decompression can be done with gunzip, tar, and uncompress (or zcat).
Finding the difference can be done with the diff command.
I'm not 100% sure whether it's meant match columns/fields or entire rows, but in the case of rows, something along these lines should work:
comm -12 <(zcat number.txt.gz) <(zcat xxx.txt.gz)
or if the shell doesn't support that, perhaps:
zcat number.txt.gz | { zcat xxx.txt.gz | comm -12 /dev/fd/3 - ; } 3<&0
exact answer i want is this only
nawk -F"," 'NR==FNR {a[$1];next} ($3 in a)' <(gzcat file1.txt.gz) <(gzcat file2.txt.gz)
. instead of awk, nawk works perfectly and it's gzip file so use gzcat

Resources