Log Splitting based on Recurring Pattern and Number of Lines - linux

I have a system that is generating very large text logs (in excess of 1GB each). The utility into which I am feeding them requires that each file be less than 500MB. I cannot simply use the split command because this runs the risk of splitting a log entry in half, which would cause errors in the utility to which they are being fed.
I have done some research into split, csplit, and awk. So far I have had the most luck with the following:
awk '/REG_EX/{if(NR%X >= (X-Y) || NR%X <= Y)x="split"++i;}{print > x;}' logFile.txt
In the above example, X represents the number of lines I want each split file to contain. In practice, this ends up being about 10 million. Y represents a "plus or minus." So if I want "10 million plus or minus 50", Y allows for that.
The actual regular expression I use is not important, because that part works. The goal is for the file to be split every X lines, but only at an occurrence of REG_EX. This is where the if() clause comes in. I attempted to allow some "wiggle room" of plus or minus Y lines, because there is no guarantee that REG_EX will occur at exactly NR%X. My problem is that if I set Y too small, I end up with files containing two or three times the number of lines I am aiming for. If I set Y too large, I end up with some files containing anywhere between 1 and X lines (it is possible for REG_EX to occur several times in immediate succession).
Short of writing my own program that traverses the file line by line with a line counter, how can I solve this elegantly? I have a script that a co-worker created, but it easily takes over an hour to complete. My awk command completes in less than 60 seconds on a 1.5GB file with an X value of 10 million, but it is not a 100% solution.
== EDIT ==
Solution found. Thank you to everyone who took the time to read my question, understand it, and provide a suggested solution. Most of them were very helpful, but the one I marked as the solution provided the greatest assistance. My problem was using modular arithmetic as the cutoff point; I needed a way to keep track of lines and reset the counter each time I split a file. Being new to awk, I wasn't sure how to use the BEGIN { ... } block. Allow me to summarize the problem and then list the command that solved it.
PROBLEM:
-- System produces text logs > 1.5GB
-- System into which logs are fed requires logs <= 500MB.
-- Every log entry begins with a standardized line
-- using the split command risks a new file beginning WITHOUT the standard line
REQUIREMENTS:
-- split files at Xth line, BUT
-- IFF Xth line is in the standard log entry format
NOTE:
-- log entries vary in length, with some being entirely empty
SOLUTION:
awk 'BEGIN {min_line=10000000; curr_line=1; new_file="split1"; suff=1;} \
/REG_EX/ \
{if(curr_line >= min_line){new_file="split"++suff; curr_line=1;}} \
{++curr_line; print > new_file;}' logFile.txt
The command can be typed on one line; I broke it up here for readability. Ten million lines works out to between 450MB and 500MB. I realized that, given how frequently the standard log entry line occurs, I didn't need to set an upper line limit so long as I picked a lower limit with room to spare. Each time REG_EX is matched, the command checks whether the current number of lines is greater than my limit, and if it is, starts a new file and resets my counter.
Thanks again to everyone. I hope that anyone else who runs into this or a similar problem finds this useful.

If you want to create split files based on exact n-count of pattern occurrences, you could do this:
awk '/^MYREGEX/ {++i; if(i%3==1){++j}} {print > "splitfilename"j}' logfile.log
Where:
^MYREGEX is your desired pattern.
3 is the count of pattern occurrences you want in each file.
splitfilename is the prefix of the filenames to be created.
logfile.log is your input log file.
i is a counter which is incremented for each occurrence of the pattern.
j is a counter which is incremented for each n-th occurrence of the pattern.
Example:
$ cat test.log
MY
123
ksdjfkdjk
MY
234
23
MY
345
MY
MY
456
MY
MY
xyz
xyz
MY
something
$ awk '/^MY/ {++i; if(i%3==1){++j}} {print > "file"j}' test.log
$ ls
file1 file2 file3 test.log
$ head file*
==> file1 <==
MY
123
ksdjfkdjk
MY
234
23
MY
345
==> file2 <==
MY
MY
456
MY
==> file3 <==
MY
xyz
xyz
MY
something

If splitting based on the regex is not important, one option would be to create new files line by line while keeping track of the number of characters you are adding to the current output file. If the number of characters would exceed a certain threshold, you can start outputting to the next file. An example command-line script is:
awk 'BEGIN {sum=0; suff=1; new_file="tmp1"} {len=length($0)+1; if ((sum + len) > 500000000) {++suff; new_file="tmp" suff; sum=0} sum += len; print $0 > new_file}' logfile.txt
In this script, sum keeps track of the number of characters (including each line's newline, hence the +1) written to the current output file. While sum stays within 500 MB, output goes to tmp1. Once adding the next line would exceed that limit, output switches to tmp2, and so on.
This script will not create files larger than the size limit, and it never splits in the middle of a line. However, since it ignores the record pattern, it may still split between the lines of a multi-line log entry.
Please note that this script doesn't make use of the pattern matching you used in your script.

You could potentially split the log file by 10 million lines.
Then if the 2nd split file does not start with desired line, go find the last desired line in 1st split file, delete that line and subsequent lines from there, then prepend those lines to 2nd file.
Repeat for each subsequent split file.
This would produce files with very similar count of your regex matches.
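A rough bash sketch of that repair pass (REG_EX, the 10-million-line chunk size, and the chunk. prefix are placeholders; GNU split and sed are assumed):
#!/bin/bash
# 1) plain line-count split, ignoring record boundaries for now
split -d -a 4 -l 10000000 logFile.txt chunk.

# 2) for each adjacent pair of chunks, move the trailing partial record
#    (from the last REG_EX match to the end of the chunk) into the next chunk
chunks=(chunk.*)
for ((n = 0; n < ${#chunks[@]} - 1; n++)); do
    cur=${chunks[n]}
    nxt=${chunks[n+1]}
    # line number of the last REG_EX match in the current chunk
    last=$(grep -n 'REG_EX' "$cur" | tail -1 | cut -d: -f1)
    # only repair if the next chunk does not already start with a REG_EX line
    if [ -n "$last" ] && ! head -1 "$nxt" | grep -q 'REG_EX'; then
        tail -n +"$last" "$cur" > tmp.move   # lines to carry over
        sed -i "${last},\$d" "$cur"          # remove them from the current chunk
        cat "$nxt" >> tmp.move               # prepend them to the next chunk
        mv tmp.move "$nxt"
    fi
done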
In order to improve performance and not have to actually write out intermediary split files and edit them, you could use a tool such as pt-fifo-split for "virtually" splitting your original log file.

Adjust the input filename (f) and the size limit in bytes (slimit) to your needs:
#!/bin/bash
# big log filename
f="test.txt"
fout="$(mktemp -p . f_XXXXX)"
fsize=0
slimit=2500
while IFS= read -r line; do
    if [ "$fsize" -le "$slimit" ]; then
        # append to the current output file and get the line size at the same time ;-)
        lsize=$(echo "$line" | tee -a "$fout" | wc -c)
        # add it to the running file size
        fsize=$(( fsize + lsize ))
    else
        echo "size of last file $fout: $fsize"
        # create a new log file
        fout="$(mktemp -p . f_XXXXX)"
        # write the current line to it and restart the size counter
        fsize=$(echo "$line" | tee -a "$fout" | wc -c)
    fi
done < <(grep 'YOUR_REGEXP' "$f")
Sample output:
size of last file ./f_GrWgD: 2537
size of last file ./f_E0n7E: 2547
size of last file ./f_do2AM: 2586
size of last file ./f_lwwhI: 2548
size of last file ./f_4D09V: 2575
size of last file ./f_ZuNBE: 2546

Related

How to count number of lines with only 1 character?

I'm trying to print just the counted number of lines which have only 1 character.
I have a file with 200k lines; some of the lines have only one character (any type of character).
Since I have no experience, I have googled a lot, scraped documentation, and come up with this mixed solution from different sources:
awk -F^\w$ '{print NF-1}' myfile.log
I was expecting that this would filter lines with a single char, and it seems to work:
^\w$
However, I'm not getting the number of lines containing a single character. Instead I get something like this:
If a non-awk solution is OK:
grep -c '^.$'
You could try the following:
awk '/^.$/{++c}END{print c}' file
The variable c is incremented for every line containing only 1 character (any character).
When the parsing of the file is finished, the variable is printed.
In awk, rules like your {print NF-1} are executed for each line. To print only one thing for the whole file you have to use END { print ... }. There you can print a counter which you increment each time you see a line with one character.
However, I'd use grep instead because it is easier to write and faster to execute:
grep -xc . yourFile
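A quick sanity check on a made-up sample file (the contents below are invented purely for illustration):
$ printf 'a\nhello\nb\n\nxyz\nc\n' > sample.txt
$ awk '/^.$/{++c}END{print c}' sample.txt
3
$ grep -xc . sample.txt
3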

print specified amount of lines in file until hitting the end

A brief description of the problem I face, for which I could not find a fitting solution:
I have a file of 8000 log rules and I need to print 100 lines each time, and netcat this to a destination. After hitting the end of the file, I want it to loop back to the beginning of the file and start the process again until I manually stop the looping through the file. This is to test our environment for events per second over a period of time.
I've seen some examples in sed to print a certain number of lines from a file, but not how to continue on to the next 100 lines and the next 100 lines after that (and so on).
Does anyone have a fitting solution? It must be easier than I'm thinking.
To print a file in batches of 100 lines you could use sed:
#!/bin/bash
# line increment
inc=100
# total number of lines in the file
total=$(wc -l < input_file)
for ((i=1; i<=total; i=i+inc)); do
    sed -n "$i,$((i+inc-1))p" input_file
done
As mentioned in a comment, you could pipe the output to netcat, e.g. { for ((i=1;i<=total;i=i+inc)); do ... ; done; } | netcat [-options], and then wrap all of that in a continual loop to repeat.
An alternative is to use split.
inc=100
split --filter "netcat [options]" -l$inc input_file
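To get the "repeat until I manually stop it" behaviour, either variant can be wrapped in an outer loop. A minimal sketch using the split approach (HOST and PORT are placeholders for the real netcat destination; GNU split is assumed for --filter):
#!/bin/bash
# send input_file forever, one netcat connection per 100-line batch
while true; do
    split -l 100 --filter='nc HOST PORT' input_file
done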

How can I remove lines that contain more than N words

Is there a good one-liner in bash to remove lines containing more than N words from a file?
example input:
I want this, not that, but thank you it is very nice of you to offer.
The very long sentence finding form ordering system always and redundantly requires an initial, albeit annoying and sometimes nonsensical use of commas, completion of the form A-1 followed, after this has been processed by the finance department and is legal, by a positive approval that allows for the form B-1 to be completed after the affirmative response to the form A-1 is received.
example output:
I want this, not that, but thank you it is very nice of you to offer.
In Python I would code something like this:
if len(line.split()) < 40:
    print line
To only show lines containing less than 40 words, you can use awk:
awk 'NF < 40' file
Using the default field separator, each word is treated as a field. Lines with less than 40 fields are printed.
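Applied to the question's sample input (saved here as input.txt), only the first line has fewer than 40 fields:
$ awk 'NF < 40' input.txt
I want this, not that, but thank you it is very nice of you to offer.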
Note: the following addresses a different reading of the question, namely how to print lines shorter than a given number of characters (rather than words).
Use awk with length():
awk 'length($0)<40' file
You can even give the length as a parameter:
awk -v maxsize=40 'length($0) < maxsize' file
A test with 10 characters:
$ cat a
hello
how are you
i am fine but
i would like
to do other
things
$ awk 'length($0)<10' a
hello
things
If you feel like using sed for this, you can say:
sed -rn '/^.{,39}$/p' file
This checks if the line contains less than 40 characters. If so, it prints it.
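Using the same 10-character test file as above (GNU sed is assumed for the {,N} interval syntax):
$ sed -rn '/^.{,9}$/p' a
hello
things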

Print previous line if condition is met

I would like to grep a word and then find the second column in the line and check if it is bigger than a value. If yes, I want to print the previous line.
Ex:
Input file
AAAAAAAAAAAAA
BB 2
CCCCCCCCCCCCC
BB 0.1
Output
AAAAAAAAAAAAA
Now, I want to search for BB and if the second column (2 or 0.1) in that line is bigger than 1, I want to print the previous line.
Can somebody help me with grep and awk? Any other suggestions are also welcome. Thanks.
This can be a way:
$ awk '$1=="BB" && $2>1 {print f} {f=$1}' file
AAAAAAAAAAAAA
Explanation
$1=="BB" && $2>1 {print f} if the 1st field is exactly BB and 2nd field is bigger than 1, then print f, a stored value.
{f=$1} store the current line in f, so that it is accessible when reading the next line.
Another option: reverse the file and print the next line if the condition matches:
tac file | awk '$1 == "BB" && $2 > 1 {getline; print}' | tac
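On the sample input this gives the same result:
$ tac file | awk '$1 == "BB" && $2 > 1 {getline; print}' | tac
AAAAAAAAAAAAA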
Concerning generality
I think it needs to be mentioned that the most general solution to this class of problem involves two passes:
the first pass to add a decimal row number ($REC) to the front of each line, effectively grouping lines into records by $REC
the second pass to trigger on the first instance of each new value of $REC as a record boundary (resetting $CURREC), thereafter rolling along in the native AWK idiom concerning the records to follow matching $CURREC.
In the intermediate file, some sequence of decimal digits followed by a separator (for human reasons, typically an added tab or space) is parsed (aka conceptually snipped off) as out-of-band with respect to the baseline file.
Command line paste monster
Even confined to the command line, it's an easy matter to ensure that the intermediate file never hits disk. You just need to use an advanced shell such as ZSH (my own favourite) which supports process substitution:
paste <( <input.txt awk "BEGIN { R=0; N=0; } /Header pattern/ { N=1; } { R=R+N; N=0; print R; }" ) input.txt | awk -f yourscript.awk
Let's render that one-liner more suitable for exposition:
P="/Header pattern/"
X="BEGIN { R=0; N=0; } $P { N=1; } { R=R+N; N=0; print R; }"
paste <( <input.txt awk "$X" ) input.txt | awk -f yourscript.awk
This starts three processes: the trivial inline AWK script, paste, and the AWK script you really wanted to run in the first place.
Behind the scenes, the <() command line construct creates a named pipe and passes the pipe name to paste as the name of its first input file. For paste's second input file, we give it the name of our original input file (this file is thus read sequentially, in parallel, by two different processes, which will consume between them at most one read from disk, if the input file is cold).
The magic named pipe in the middle is an in-memory FIFO that ancient Unix probably managed at about 16 kB of average size (intermittently pausing the paste process if the yourscript.awk process is sluggish in draining this FIFO back down).
Perhaps modern Unix throws a bigger buffer in there because it can, but it's certainly not a scarce resource you should be concerned about, until you write your first truly advanced command line with process redirection involving these by the hundreds or thousands :-)
Additional performance considerations
On modern CPUs, all three of these processes could easily find themselves running on separate cores.
The first two of these processes border on the truly trivial: an AWK script with a single pattern match and some minor bookkeeping, paste called with two arguments. yourscript.awk will be hard pressed to run faster than these.
What, your development machine has no lightly loaded cores to render this master shell-master solution pattern almost free in the execution domain?
Ring, ring.
Hello?
Hey, it's for you. 2018 just called, and wants its problem back.
2020 is officially the reprieve of MTV: That's the way we like it, magic pipes for nothing and cores for free. Not to name out loud any particular TLA chip vendor who is rocking the space these days.
As a final performance consideration, if you don't want the overhead of parsing actual record numbers:
X="BEGIN { N=0; } $P { N=1; } { print N; N=0; }"
Now your in-FIFO intermediate file is annotated with just two additional characters prepended to each line ('0' or '1', plus the default separator character added by paste), with '1' marking the first line of each record.
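A toy transcript of what that flag-annotated stream looks like (the input file and the /^HDR/ header pattern are invented for illustration):
$ cat input.txt
HDR alpha
data 1
HDR beta
data 2
data 3
$ paste <( <input.txt awk 'BEGIN { N=0; } /^HDR/ { N=1; } { print N; N=0; }' ) input.txt
1	HDR alpha
0	data 1
1	HDR beta
0	data 2
0	data 3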
Named FIFOs
Under the hood, these are no different than the magic FIFOs instantiated by Unix when you write any normal pipe command:
cat file | proc1 | proc2 | proc3
Three unnamed pipes (and a whole process devoted to cat you didn't even need).
It's almost unfortunate that the truly exceptional convenience of the default stdin/stdout streams as premanaged by the shell obscures the reality that paste $magictemppipe1 $magictemppipe2 bears no additional performance considerations worth thinking about, in 99% of all cases.
"Use the <() Y-joint, Luke."
Your instinctive reflex toward natural semantic decomposition in the problem domain will herewith benefit immensely.
If anyone had had the wits to name the shell construct <() as the YODA operator in the first place, I suspect it would have been pressed into universal service at least a solid decade ago.
Combining sed & awk you get this:
sed 'N;s/\n/ /' < file | awk '$3>1{print $1}'
sed 'N;s/\n/ /': combines each pair of lines, replacing the newline character between them with a space
awk '$3>1{print $1}': prints $1 (the 1st column) if $3 (the 3rd column) is greater than 1
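For the question's sample input, the intermediate sed output and the final result look like this (note the pairing trick assumes the file strictly alternates a header line with a BB line, as the sample does):
$ sed 'N;s/\n/ /' < file
AAAAAAAAAAAAA BB 2
CCCCCCCCCCCCC BB 0.1
$ sed 'N;s/\n/ /' < file | awk '$3>1{print $1}'
AAAAAAAAAAAAA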

How to edit 300 GB text file (genomics data)?

I have a 300 GB text file that contains genomics data with over 250k records. There are some records with bad data and our genomics program 'Popoolution' allows us to comment out the "bad" records with an asterisk. Our problem is that we cannot find a text editor that will load the data so that we can comment out the bad records. Any suggestions? We have both Windows and Linux boxes.
UPDATE: More information
The program Popoolution (https://code.google.com/p/popoolation/) crashes when it reaches a "bad" record giving us the line number that we can then comment out. Specifically, we get a message from Perl that says "F#€%& Scaffolding". The manual suggests we can just use an asterisk to comment out the bad line. Sadly, we will have to repeat this process many times...
One more thought... Is there an approach that would allow us to add the asterisk to the line without opening the entire text file at once. This could be very useful given that we will have to repeat the process an unknown number of times.
Based on your update:
One more thought... Is there an approach that would allow us to add
the asterisk to the line without opening the entire text file at once.
This could be very useful given that we will have to repeat the
process an unknown number of times.
Here you have an approach: if you know the line number, you can add an asterisk at the beginning of that line by saying:
sed 'LINE_NUMBER s/^/*/' file
See an example:
$ cat file
aa
bb
cc
dd
ee
$ sed '3 s/^/*/' file
aa
bb
*cc
dd
ee
If you add -i, the file will be updated:
$ sed -i '3 s/^/*/' file
$ cat file
aa
bb
*cc
dd
ee
That said, I always think it's better to redirect to another file,
sed '3 s/^/*/' file > new_file
so that you keep your original file intact and save the updated version in new_file.
If you are required to have a person mark these records manually with a text editor, for whatever reason, you should probably use split to split the file up into manageable pieces.
split -a4 -d -l100000 hugefile.txt part.
This will split the file up into pieces with 100000 lines each. The names of the files will be part.0000, part.0001, etc. Then, after all the files have been edited, you can combine them back together with cat:
cat part.* > new_hugefile.txt
The simplest solution is to use a stream-oriented editor such as sed. All you need is to be able to write one or more regular expression(s) that will identify all (and only) the bad records. Since you haven't provided any details on how to identify the bad records, this is the only possible answer.
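If such a pattern can be written, the whole job becomes a single streaming pass (BAD_PATTERN below is a placeholder for whatever expression identifies a bad record):
sed '/BAD_PATTERN/ s/^/*/' hugefile.txt > fixed_hugefile.txt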
A basic pattern in R is to read the data in chunks, edit, and write out
fin = file("fin.txt", "r")
fout = file("fout.txt", "w")
while (length(txt <- readLines(fin, n=1000000))) {
    ## txt now holds up to 1000000 lines; add an asterisk to problem lines
    ## bad = <create logical vector indicating bad lines here>
    ## txt[bad] = paste0("*", txt[bad])
    writeLines(txt, fout)
}
close(fin); close(fout)
While not ideal, this works on Windows (implied by the mention of Notepad++) and in a language you are presumably familiar with (R). Using sed (definitely the appropriate tool in the long run) would require installing additional software and coming up to speed with sed.
