print specified number of lines in file until hitting the end - linux

A brief description of the problem I face, for which I could not find a fitting solution:
I have a file of 8000 log rules and I need to print 100 lines each time, and netcat this to a destination. After hitting the end of the file, I want it to loop back to the beginning of the file and start the process again until I manually stop the looping through the file. This is to test our environment for events per second over a period of time.
I've seen some examples in 'sed' to print a certain number of lines from a file, but not how it continues on to the next 100 lines and the next 100 lines after that (and so on).
Does anyone have a fitting solution? It must be way easier than I'm thinking.

To print a file in chunks of 100 lines you could use sed:
#!/bin/bash
# line increment
inc=100
# total number of lines in the file
total=$(wc -l < input_file)
for ((i = 1; i <= total; i += inc)); do
    sed -n "$i,$((i + inc - 1))p" input_file
done
As mentioned in a comment, you could pipe the output to netcat, e.g. { for ((i=1;i<=total;i+=inc)); do ... done; } | netcat [options]. Then wrap all that in a continual loop to repeat.
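Putting the pieces together, a minimal sketch of that continual loop might look like this (the destination host, port and nc invocation are placeholders; adjust them to your netcat variant):
#!/bin/bash
inc=100                         # lines to send per burst
total=$(wc -l < input_file)     # total number of lines in the file

while true; do                  # repeat until stopped manually (Ctrl-C)
    for ((i = 1; i <= total; i += inc)); do
        # print the next chunk of $inc lines
        sed -n "$i,$((i + inc - 1))p" input_file
        sleep 1                 # optional: pace the bursts to control events per second
    done
done | nc examplehost 9999      # placeholder destination host/port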
An alternative is to use split.
inc=100
split --filter "netcat [options]" -l$inc input_file

Related

How to count number of lines with only 1 character?

I'm trying to print the count of lines which have only 1 character.
I have a file with 200k lines, and some of the lines have only one character (any type of character).
Since I have no experience, I have googled a lot, scraped documentation, and come up with this mixed solution from different sources:
awk -F^\w$ '{print NF-1}' myfile.log
I was expecting that this would filter lines with a single char, and the pattern itself seems to work:
^\w$
However, I'm not getting the number of lines containing a single character; instead I get something else entirely.
If a non-awk solution is OK:
grep -c '^.$' file
You could try the following:
awk '/^.$/{++c}END{print c}' file
The variable c is incremented for every line containing only 1 character (any character).
When the parsing of the file is finished, the variable is printed.
In awk, rules like your {print NF-1} are executed for each line. To print only one thing for the whole file you have to use END { print ... }. There you can print a counter which you increment each time you see a line with one character.
However, I'd use grep instead because it is easier to write and faster to execute:
grep -xc . yourFile
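As a quick sanity check, both approaches agree on a small made-up sample:
$ printf '%s\n' a bb c '' > sample.txt
$ grep -xc . sample.txt
2
$ awk '/^.$/{++c}END{print c}' sample.txt
2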

Set line maximum of log file

Currently I write a simple logger to log messages from my bash script. The logger works fine and I simply write the date plus the message in the log file. Since the log file will increase, I would like to set the limit of the logger to for example 1000 lines. After reaching 1000 lines, it doesn't delete or totally clear the log file. It should truncate the first line and replace it with the new log line. So the file keeps 1000 lines and doesn't increase further. The latest line should always be at the top of the file. Is there any built in method? Or how could I solve this?
Why would you want to replace the first line with the new message, thereby causing a jump in the order of messages in your log file, instead of just deleting the first line and appending the new message? E.g., simplistically:
log() {
    tail -n 999 logfile > tmp &&
    { cat tmp && printf '%s\n' "$*"; } > logfile
}
log "new message"
You don't even need a tmp file if your log file always consists of small lines: just save the output of the tail in a variable and printf that too.
Note that unlike a sed -i solution, the above will not change the inode, hardlinks, permissions or anything else for logfile - it's the same file as you started with just with updated content, it's not getting replaced with a new file.
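A minimal sketch of that tmp-file-free variant (assuming logfile already exists and both it and the messages are small enough to hold in a shell variable) might be:
log() {
    local prev
    prev=$(tail -n 999 logfile)                 # keep the most recent 999 lines
    {
        [ -n "$prev" ] && printf '%s\n' "$prev" # skip if the log was empty
        printf '%s\n' "$*"                      # append the new message
    } > logfile
}
As with the version above, the redirection rewrites the file in place, so the inode and permissions are preserved.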
Your chosen example may not be the best. As the comments have already pointed out, logrotate is the best tool to keep log file sizes at bay; furthermore, a line is not the best unit to measure size. Those commenters are both right.
However, I take your question at face value and answer it.
You can achieve what you want with shell builtins, but it is much faster and simpler to use an external tool like sed. (awk is another option, but it lacks the -i switch, which simplifies your life in this case.)
So, suppose your file exists already and is named script.log then
maxlines=1000
log_msg='Whatever the log message is'
sed -i -e1i"\\$log_msg" -e$((maxlines))',$d' script.log
does what you want.
-i means modify the given file in place.
-e1i"\\$log_msg" means insert $log_msg before the first (1) line.
-e$((maxlines))',$d' means delete each line from line number $((maxlines)) to the last one ($).
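To call this from a script, a hypothetical wrapper function might look like the sketch below. It assumes GNU sed and that the message contains no newlines or characters special to sed; note also that sed produces no output for an empty file, so the very first message is written directly:
#!/bin/bash
logfile="script.log"
maxlines=1000

log() {
    if [ -s "$logfile" ]; then
        # prepend the message, then drop everything past line $maxlines
        sed -i -e1i"\\$*" -e"$maxlines"',$d' "$logfile"
    else
        # sed never triggers the 1i command on an empty file, so seed it directly
        printf '%s\n' "$*" > "$logfile"
    fi
}

log "$(date): service started"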

Log Splitting based on Recurring Pattern and Number of Lines

I have a system that is generating very large text logs (in excess of 1GB each). The utility into which I am feeding them requires that each file be less than 500MB. I cannot simply use the split command because this runs the risk of splitting a log entry in half, which would cause errors in the utility to which they are being fed.
I have done some research into split, csplit, and awk. So far I have had the most luck with the following:
awk '/REG_EX/{if(NR%X >= (X-Y) || NR%2000 <= Y)x="split"++i;}{print > x;}' logFile.txt
In the above example, X represents the number of lines I want each split file to contain. In practice, this ends up being about 10 million. Y represents a "plus or minus." So if I want "10 million plus or minus 50", Y allows for that.
The actual regular expression I use is not important, because that part works. The goal is that the file be split every X lines, but only if it is an occurrence of REG_EX. This is where the if() clause comes in. I attempted to have some "wiggle room" of plus or minus Y lines, because there is no guarantee that REG_EX will exist at exactly NR%X. My problem is that if I set Y too small, then I end up with files with two or three times the number of lines I am aiming for. If I set Y too large, then I end up with some files containing anywhere between 1 and X lines (it is possible for REG_EX to occur several times in immediate succession).
Short of writing my own program that traverses the file line by line with a line counter, how can I go about elegantly solving this problem? I have a script that a co-worker created, but it takes easily over an hour to complete. My awk command completes in less than 60 seconds on a 1.5GB file with an X value of 10 million, but is not a 100% solution.
== EDIT ==
Solution found. Thank you to everyone who took the time to read my question, understand it, and provide a suggested solution. Most of them were very helpful, but the one I marked as the solution provided the greatest assistance. My problem was with my modular math being the cutoff point. I needed a way to keep track of lines and reset the counter each time I split a file. Being new to awk, I wasn't sure how to utilize the BEGIN{ ... } feature. Allow me to summarize the problem set and then list the command that solved the problem.
PROBLEM:
-- System produces text logs > 1.5GB
-- System into which logs are fed requires logs <= 500MB.
-- Every log entry begins with a standardized line
-- using the split command risks a new file beginning WITHOUT the standard line
REQUIREMENTS:
-- split files at Xth line, BUT
-- IFF Xth line is in the standard log entry format
NOTE:
-- log entries vary in length, with some being entirely empty
SOLUTION:
awk 'BEGIN {min_line=10000000; curr_line=1; new_file="split1"; suff=1;} \
/REG_EX/ \
{if(curr_line >= min_line){new_file="split"++suff; curr_line=1;}} \
{++curr_line; print > new_file;}' logFile.txt
The command can be typed on one line; I broke it up here for readability. Ten million lines works out to between 450MB and 500MB. I realized that, given how frequently the standard log entry line occurs, I didn't need to set an upper line limit as long as I picked a lower limit with room to spare. Each time REG_EX is matched, it checks whether the current number of lines is greater than my limit; if it is, it starts a new file and resets my counter.
Thanks again to everyone. I hope that anyone else who runs into this or a similar problem finds this useful.
If you want to create split files based on exact n-count of pattern occurrences, you could do this:
awk '/^MYREGEX/ {++i; if(i%3==1){++j}} {print > "splitfilename"j}' logfile.log
Where:
^MYREGEX is your desired pattern.
3 is the count of pattern occurrences you want in each file.
splitfilename is the prefix of the filenames to be created.
logfile.log is your input log file.
i is a counter which is incremented for each occurrence of the pattern.
j is a counter which is incremented for each n-th occurrence of the pattern.
Example:
$ cat test.log
MY
123
ksdjfkdjk
MY
234
23
MY
345
MY
MY
456
MY
MY
xyz
xyz
MY
something
$ awk '/^MY/ {++i; if(i%3==1){++j}} {print > "file"j}' test.log
$ ls
file1 file2 file3 test.log
$ head file*
==> file1 <==
MY
123
ksdjfkdjk
MY
234
23
MY
345
==> file2 <==
MY
MY
456
MY
==> file3 <==
MY
xyz
xyz
MY
something
If splitting based on the regex is not important, one option would be to create new files line-by-line keeping track of the number of characters you are adding to an output file. If the number of characters are greater than a certain threshold, you can start outputting to the next file. An example command-line script is:
awk 'BEGIN{sum=0; suff=1; new_file="tmp1"} {len=length($0); if ((sum + len) > 500000000) { ++suff; new_file = "tmp"suff; sum = 0 } sum += len; print $0 > new_file}' logfile.txt
In this script, sum keeps track of the number of characters we have parsed from the given log file. If sum is within 500 MB, it keeps outputting to tmp1. Once sum is about to exceed that limit, it will start outputting to tmp2, and so on.
This script will not create files that are greater than the size limit. It will also not break a log entry.
Please note that this script doesn't make use of any pattern matching that you used in your script.
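If you do want to honour the entry boundaries as well, a rough sketch in the same spirit (REG_EX stands for whatever pattern starts each log entry, and 500000000 is the byte threshold) would only rotate to a new file on a matching line once the threshold has been passed, so each output file can exceed the limit by at most one entry:
awk 'BEGIN { suff = 1; new_file = "tmp1"; sum = 0 }
     /REG_EX/ { if (sum > 500000000) { ++suff; new_file = "tmp" suff; sum = 0 } }
     { sum += length($0) + 1; print > new_file }' logfile.txt
The + 1 accounts for the newline that length($0) does not count.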
You could potentially split the log file by 10 million lines.
Then, if the 2nd split file does not start with the desired line, find the last desired line in the 1st split file, delete that line and the subsequent lines from there, then prepend those lines to the 2nd file.
Repeat for each subsequent split file.
This would produce files with very similar counts of your regex matches.
In order to improve performance and not have to actually write out intermediary split files and edit them, you could use a tool such as pt-fifo-split for "virtually" splitting your original log file.
Replace the fout and slimit values according to your needs:
#!/bin/bash
# big log filename
f="test.txt"
fout="$(mktemp -p . f_XXXXX)"
fsize=0
slimit=2500
while IFS= read -r line; do
    if [ "$fsize" -le "$slimit" ]; then
        # append to log file and get line size at the same time ;-)
        lsize=$(echo "$line" | tee -a "$fout" | wc -c)
        # add to file size
        fsize=$(( fsize + lsize ))
    else
        echo "size of last file $fout: $fsize"
        # create a new log file
        fout="$(mktemp -p . f_XXXXX)"
        # write the current line to the new file and reset the size counter
        lsize=$(echo "$line" | tee -a "$fout" | wc -c)
        fsize=$lsize
    fi
done < <(grep 'YOUR_REGEXP' "$f")
size of last file ./f_GrWgD: 2537
size of last file ./f_E0n7E: 2547
size of last file ./f_do2AM: 2586
size of last file ./f_lwwhI: 2548
size of last file ./f_4D09V: 2575
size of last file ./f_ZuNBE: 2546

adding an adapter sequence to the end of a fastq file

I have a large fastq file and I want to add the sequence "TTAAGG" to the end of each sequence in my file (the 2nd line then every 4th line after), while still maintaining the fastq file format. For example:
this is the first line I start with:
#HWI-D00449:41:C2H8BACXX:5:1101:1219:2053 1:N:0:
GCAATATCCTTCAACTA
+
FFFHFHGFHAGGIIIII
and I want it to print out:
#HWI-D00449:41:C2H8BACXX:5:1101:1219:2053 1:N:0:
GCAATATCCTTCAACTATTAAGG
+
FFFHFHGFHAGGIIIII
I imagine sed or awk would be good for this, but I haven't been able to find a solution that allows me to keep the fastq format.
I tried:
awk 'NR%4==2 { print $0 "TTAAGG"}' < file_in.fastq > fileout_fastq
which added the TTAAGG to the second line and then every fourth line, but it also deleted the other three lines.
Does anyone have any suggestions for command lines I can use? Or if you know of a package currently available that can do this, please let me know!
Try this with GNU sed:
sed '2~4s/$/TTAAGG/' file
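If your sed is not GNU sed (the 2~4 step address is a GNU extension), an awk sketch very close to your original attempt also works; the only change is printing every line, not just the modified ones:
awk 'NR%4==2 { $0 = $0 "TTAAGG" } { print }' file_in.fastq > fileout.fastq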

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//' > dest.txt
The command writes correctly when I specify a small maximum parameter, -m 100, after source.txt. The problem occurs if I don't specify it, or if I specify a huge number. However, I could previously write to files with grep (not egrep) using huge numbers similar to what I'm trying now, and that was successful. I also checked the last-modified date and time while the command was executing, and it seems no modification is happening to the destination file. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and that sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10 million line chunks called split_source.aa, split_source.ab, split_source.ac, etc. And you can run the entire command on each one of those files (maybe changing the redirect at the end to append: >> dest.txt).
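For example, a rough sketch of the per-chunk loop (using the pattern from your command; adjust the options as needed):
for f in split_source.*; do
    egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' "$f" | sort | uniq | sed -e 's/www.//' >> dest.txt
done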
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run the egrep <params> | less so you can see what egrep is doing, and eliminate any problem from sort, uniq, or sed (my bet's on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed, otherwise you could end up with duplicates in your result set, AND an unsorted result set (maybe that's what you want); see the sketch after these points.
Consider wrapping your regular expressions with "^...$", if it's appropriate to establish beginning of line (^) and end of line ($) anchors. Otherwise you'll be matching portions in the middle of a line.
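A rough sketch of the reordered pipeline, with the pattern and filenames from the question and the www. substitution anchored to the start of the line:
egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sed -e 's/^www\.//' | sort | uniq > dest.txt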
