How to crop(cut) text files based on starting and ending line-numbers in cygwin? - linux

I have few log files around 100MBs each.
Personally I find it cumbersome to deal with such big files. I know that log lines that are interesting to me are only between 200 to 400 lines or so.
What would be a good way to extract relavant log lines from these files ie I just want to pipe the range of line numbers to another file.
For example, the inputs are:
filename: MyHugeLogFile.log
Starting line number: 38438
Ending line number: 39276
Is there a command that I can run in cygwin to cat out only that range in that file? I know that if I can somehow display that range in stdout then I can also pipe to an output file.
Note: Adding Linux tag for more visibility, but I need a solution that might work in cygwin. (Usually linux commands do work in cygwin).

Sounds like a job for sed:
sed -n '8,12p' yourfile
...will send lines 8 through 12 of yourfile to standard out.
If you want to prepend the line number, you may wish to use cat -n first:
cat -n yourfile | sed -n '8,12p'

You can use wc -l to figure out the total # of lines.
You can then combine head and tail to get at the range you want. Let's assume the log is 40,000 lines, you want the last 1562 lines, then of those you want the first 838. So:
tail -1562 MyHugeLogFile.log | head -838 | ....
Or there's probably an easier way using sed or awk.

I saw this thread when I was trying to split a file in files with 100 000 lines. A better solution than sed for that is:
split -l 100000 database.sql database-
It will give files like:
database-aaa
database-aab
database-aac
...

And if you simply want to cut part of a file - say from line 26 to 142 - and input it to a newfile :
cat file-to-cut.txt | sed -n '26,142p' >> new-file.txt

How about this:
$ seq 1 100000 | tail -n +10000 | head -n 10
10000
10001
10002
10003
10004
10005
10006
10007
10008
10009
It uses tail to output from the 10,000th line and onwards and then head to only keep 10 lines.
The same (almost) result with sed:
$ seq 1 100000 | sed -n '10000,10010p'
10000
10001
10002
10003
10004
10005
10006
10007
10008
10009
10010
This one has the advantage of allowing you to input the line range directly.

If you are interested only in the last X lines, you can use the "tail" command like this.
$ tail -n XXXXX yourlogfile.log >> mycroppedfile.txt
This will save the last XXXXX lines of your log file to a new file called "mycroppedfile.txt"

This is an old thread but I was surprised nobody mentioned grep. The -A option allows specifying a number of lines to print after a search match and the -B option includes lines before a match. The following command would output 10 lines before and 10 lines after occurrences of "my search string" in the file "mylogfile.log":
grep -A 10 -B 10 "my search string" mylogfile.log
If there are multiple matches within a large file the output can rapidly get unwieldy. Two helpful options are -n which tells grep to include line numbers and --color which highlights the matched text in the output.
If there is more than file to be searched grep allows multiple files to be listed separated by spaces. Wildcards can also be used. Putting it all together:
grep -A 10 -B 10 -n --color "my search string" *.log someOtherFile.txt

Related

Retrieve last 100 lines logs

I need to retrieve last 100 lines of logs from the log file.
I tried the sed command
sed -n -e '100,$p' logfilename
Please let me know how can I change this command to specifically retrieve the last 100 lines.
You can use tail command as follows:
tail -100 <log file> > newLogfile
Now last 100 lines will be present in newLogfile
EDIT:
More recent versions of tail as mentioned by twalberg use command:
tail -n 100 <log file> > newLogfile
"tail" is command to display the last part of a file, using proper available switches helps us to get more specific output. the most used switch for me is -n and -f
SYNOPSIS
tail [-F | -f | -r] [-q] [-b number | -c number | -n number] [file ...]
Here
-n number :
The location is number lines.
-f : The -f option causes tail to not stop when end of file is
reached, but rather to wait for additional data to be appended to the
input. The -f option is ignored if the
standard input is a pipe, but not if it is a FIFO.
Retrieve last 100 lines logs
To get last static 100 lines
tail -n 100 <file path>
To get real time last 100 lines
tail -f -n 100 <file path>
You can simply use the following command:-
tail -NUMBER_OF_LINES FILE_NAME
e.g tail -100 test.log
will fetch the last 100 lines from test.log
In case, if you want the output of the above in a separate file then you can pipes as follows:-
tail -NUMBER_OF_LINES FILE_NAME > OUTPUT_FILE_NAME
e.g tail -100 test.log > output.log
will fetch the last 100 lines from test.log and store them into a new file output.log)
Look, the sed script that prints the 100 last lines you can find in the documentation for sed (https://www.gnu.org/software/sed/manual/sed.html#tail):
$ cat sed.cmd
1! {; H; g; }
1,100 !s/[^\n]*\n//
$p
$ sed -nf sed.cmd logfilename
For me it is way more difficult than your script so
tail -n 100 logfilename
is much much simpler. And it is quite efficient, it will not read all file if it is not necessary. See my answer with strace report for tail ./huge-file: https://unix.stackexchange.com/questions/102905/does-tail-read-the-whole-file/102910#102910
I know this is very old, but, for whoever it may helps.
less +F my_log_file.log
that's just basic, with less you can do lot more powerful things. once you start seeing logs you can do search, go to line number, search for pattern, much more plus it is faster for large files.
its like vim for logs[totally my opinion]
original less's documentation : https://linux.die.net/man/1/less
less cheatsheet : https://gist.github.com/glnds/8862214
len=`cat filename | wc -l`
len=$(( $len + 1 ))
l=$(( $len - 99 ))
sed -n "${l},${len}p" filename
first line takes the length (Total lines) of file
then +1 in the total lines
after that we have to fatch 100 records so, -99 from total length
then just put the variables in the sed command to fetch the last 100 lines from file
I hope this will help you.

How to grep within a grep

I have a bunch of massive text files, about 100MB each.
I want to grep to find entries that have 'INDIANA JONES' in it:
$ grep -ir 'INDIANA JONES' ./
Then, I would like to find the entries where there is the word PORTUGAL within 5,000 characters of the INDIANA JONES term. How would I do this?
# in pseudocode
grep -ir 'INDIANA JONES' ./ | grep 'PORTUGAL' within 5000 char
Use grep's -o flag to output the 5000 characters surround the match, then search those characters for the second string. For example:
grep -ioE ".{5000}INDIANA JONES.{5000}" file.txt | grep "PORTUGAL"
If you need the original match, add the -n flag to the second grep and pipe into:
cut -f1 -d: > line_numbers.txt
then you could use awk to print those lines:
awk 'FNR==NR { a[$0]; next } FNR in a' line_numbers.txt file.txt
To avoid the temporary file, this could be written like:
awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -ioE ".{50000}INDIANA JONES.{50000}" file.txt | grep -n "PORTUGAL" | cut -f1 -d:) file.txt
For multiple files, use find and a bash loop:
for i in $(find . -type f); do
awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -ioE ".{50000}INDIANA JONES.{50000}" "$i" | grep -n "PORTUGAL" | cut -f1 -d:) "$i"
done
One way to deal with this is with gawk. You could set the record separator to either INDIANA JONES or PORTUGAL and then perform a length check on the record (after stripping newlines, assuming newlines do not count towards the limit of 5000). You may have to resort to find to run this recursively within a directory
awk -v RS='INDIANA JONES|PORTUGAL' '{a = $0;
gsub("\n", "", a)};
((RT ~ /IND/ && prevRT ~/POR/) || (RT ~ /POR/ && prevRT ~/IND/)) && length(a) < 5000{found=1};
{prevRT=RT};
END{if (found) print FILENAME}' file.txt
Consider installing ack-grep.
sudo apt-get install ack-grep
ack-grep is a more powerful version of grep.
There's no trivial solution to your question (that i can think of) outside of a full batch script, but you can use the -A and -B flags on ack-grep to specify a number of trailing or leading lines to output, resp.
This may not be a number of chars, but is a step further in that direction.
While this may not be a solution, it might give you some idea as to how to do this. Lookup filters like ack, awk, sed, etc. and see if you can find one with a flag for this kind of behaviour.
The ack-grep manual:
http://manpages.ubuntu.com/manpages/hardy/man1/ack-grep.1p.html
EDIT:
I think the sad news is, what you might think you're looking for is something like:
grep "\(INDIANA JONES\).\{1,5000\}PORTUGAL" filename
The problem is, even on a small file, querying this is going to be impossible time-wise.
I got this one to work with a different number. it's a size problem.
For such a large set of files, you'll need to do this in more than one step.
A Solution:
The only solution I know of is the leading and trailing output from ack-grep.
Step 1: how long are your lines?
If you knew how many lines out you had to go
(and you could estimate/calculate this a few ways) then you'd be able to grep the output of the first grep. Depending on what's in your file, you should be able to get a decent upper bound as to how many lines is 5000 chars (if a line has 100 chars average, 50+ lines should cover you, but if it has 10 chars, you'll need 500+).
You've got to determine the maximum number of lines that could be 5000 chars. You could guess or pick a high range if you like, but that'll be up to you. It's your data.
With that, call: (if you needed 100 lines for 5000 chars)
ack-grep -ira "PORTUGAL" -A 100 -B 100 filename
and
ack-grep -ira "INDIANA JONES" -A 100 -B 100 filename
replace the 100s with what you need.
Step 2: parse the output
you'll need to take the matches that ack-grep returns and parse them, looking for any matches again, within these sub-ranges.
Look for INDIANA JONES in the first PORTUGAL ack-grep match output, and look for PORTUGAL in the second set of matches.
This should take a bit more work, likely involving a bash script (I might see if I can get one working this week), but it solves your massive-data problem, by breaking it down into more manageable chunks.
grep 'INDIANA JONES' . -iR -l | while read filename; do head -c 5000 "$filename" | grep -n PORTUGAL -H --label="$filename" ; done
This works as follows:
grep 'INDIANA JONES' . -iR -l. Search for all files in or below the current directory. Case insensitive (-i). And only print the names of the files that match (-l), don't print any content.
| while read filename; do ...|...|...; done for each line of input, store it in variable $filename and execute the pipeline.
Now, for each file that matched 'INDIANA JONES', we do
head -c 5000 "$filename" - extract the first 5000 characters
grep ... - search for PORTUGAL. Print the filename (-H), but where we tell us the 'filename' we want to use with --label="$filename". Print line numbers too, -n.

Grep only a part of a text file

How can I apply the following command to only a part of a text file? For example from the beginning to the line 5000.
grep "^ A : 11 B : 10" filename | wc -l
I cannot use head and then apply the above command since the text file is huge.
You could try using the sed command, which I believe does better for large files, from this question and pipe to grep.
sed -n 1,5000p file | grep ...
You can try combination of -n (prefixing each line of output with line number) and -m (limiting number of matching lines). Something like this:
grep -n -m 5000 pattern file.txt | grep -B 5000 "^5000:" | wc -l
First grep search for pattern, add line numbers and limit output to first 5000 matching lines (worst case scenario: all lines from range match pattern). Second grep match line number 5000, and print all lines before this line.
I don't know if it is more efficient solution

How to make grep to stop searching in each file after N lines?

It's best to describe the use by a hypothetical example:
Searching for some useful header info in a big collection of email storage (each email in a separate file). e.g. doing stats of top mail client apps used.
Normally if you do grep you can specify -m to stop at first match but let's say an email does not contact X-Mailer or whatever it is we are looking for in a header? It will scan through the whole email. Since most headers are <50 lines performance could be increased by telling grep to search only 50 lines on any file. I could not find a way to do that.
I don't know if it would be faster but you could do this with awk:
awk '/match me/{print;exit}FNR>50{exit}' *.mail
will print the first line matching match me if it appears in the first 50 lines. (If you wanted to print the filename as well, grep style, change print; to print FILENAME ":" $0;)
awk doesn't have any equivalent to grep's -r flag, but if you need to recursively scan directories, you can use find with -exec:
find /base/dir -iname '*.mail' \
-exec awk '/match me/{print FILENAME ":" $0;exit}FNR>50{exit}' {} +
You could solve this problem by piping head -n50 through grep but that would undoubtedly be slower since you'd have to start two new processes (one head and one grep) for each file. You could do it with just one head and one grep but then you'd lose the ability to stop matching a file as soon as you find the magic line, and it would be awkward to label the lines with the filename.
you can do something like this
head -50 <mailfile>| grep <your keyword>
Try this command:
for i in *
do
head -n 50 $i | grep -H --label=$i pattern
done
output:
1.txt: aaaaaaaa pattern aaaaaaaa
2.txt: bbbb pattern bbbbb
ls *.txt | xargs head -<N lines>| grep 'your_string'

Find line number in a text file - without opening the file

In a very large file I need to find the position (line number) of a string, then extract the 2 lines above and below that string.
To do this right now - I launch vi, find the string, note it's line number, exit vi, then use sed to extract the lines surrounding that string.
Is there a way to streamline this process... ideally without having to run vi at all.
Maybe using grep like this:
grep -n -2 your_searched_for_string your_large_text_file
Will give you almost what you expect
-n : tells grep to print the line number
-2 : print 2 additional lines (and the wanted string, of course)
You can do
grep -C 2 yourSearch yourFile
To send it in a file, do
grep -C 2 yourSearch yourFile > result.txt
Use grep -n string file to find the line number without opening the file.
you can use cat -n to display the line numbers and then use awk to get the line number after a grep in order to extract line number:
cat -n FILE | grep WORD | awk '{print $1;}'
although grep already does what you mention if you give -C 2 (above/below 2 lines):
grep -C 2 WORD FILE
You can do it with grep -A and -B options, like this:
grep -B 2 -A 2 "searchstring" | sed 3d
grep will find the line and show two lines of context before and after, later remove the third one with sed.
If you want to automate this, simple you can do a Shell Script. You may try the following:
#!/bin/bash
VAL="your_search_keyword"
NUM1=`grep -n "$VAL" file.txt | cut -f1 -d ':'`
echo $NUM1 #show the line number of the matched keyword
MYNUMUP=$["NUM1"-1] #get above keyword
MYNUMDOWN=$["NUM1"+1] #get below keyword
sed -n "$MYNUMUP"p file.txt #display above keyword
sed -n "$MYNUMDOWN"p file.txt #display below keyword
The plus point of the script is you can change the keyword in VAL variable as you like and execute to get the needed output.

Resources