I come to you with a problem that has me stumped. I'm attempting to find the number of lines in a file (in this case, the HTML of a certain site) longer than x (which, in this case, is 80).
For example: google.com (by checking with wc -l) has 7 lines, two of which are longer than 80 (checking with awk '{print NF}'). I'm trying to find a way to check how many lines are longer than 80, and then output that number.
My command so far looks like this:
wget -qO - google.com | awk '{print NF}' | sort -g
I was thinking of just counting which lines have values larger than 80, but I can't figure out the syntax for that. Perhaps 'awk'? Maybe I'm going about this the clumsiest way possible and have hit a wall for a reason.
Thanks for the help!
Edit: The unit of measurement is characters. The command should be able to find the number of lines with more than 80 characters in them.
If you want the number of lines that are longer than 80 characters (your question is missing the units), grep is a good candidate:
grep -c '.\{80\}'
So:
wget -qO - google.com | grep -c '.\{80\}'
outputs 6.
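Strictly speaking, '.\{80\}' counts lines with at least 80 characters; if the threshold is meant to be strictly more than 80, the pattern needs one extra character:
grep -c '.\{81\}'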
Blue Moon's answer (in its original version) will print the number of fields, not the length of the line. Since the default field separator in awk is ' ' (space), you will get a word count, not the length of the line.
Try this:
wget -qO - google.com | awk '{ if (length($0) > 80) count++; } END{print count}'
Using awk:
wget -qO - google.com | awk 'NF>80{count++} END{print count}'
This gives 2 as output as there are two lines with more than 80 fields.
If you mean number of characters (I presumed fields based on what you have in the question) then:
wget -qO - google.com | awk 'length($0)>80{c++} END{print c}'
which gives 6.
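A small variation on the same test (not in the original answer) also shows which lines are long and by how much:
wget -qO - google.com | awk 'length($0) > 80 {print NR": "length($0)" characters"}'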
Related
I have a 20GB log file that contains lots of fields; column 2 contains numbers. I use the command below to print only column 2:
zcat /path to file location/$date*/logfile_*.dat.zip | awk '/Read:ROP/' | nawk -F "=" '{print $2}'
The result of this command is:
"93711994166", Key
Since I want only the number, I append the command below to my original command to clean the output:
| awk -F, '{print $1}' | sed 's/"//g'
The result is:
93711994166
My final purpose is to print only the numbers whose length is not 11 digits; therefore, I append the following to my final command:
-vE '^.{11}$'
So my final command is:
zcat /path to file location/$date*/logfile_*.dat.zip | awk '/Read:ROP/' | nawk -F "=" '{print $2}' | awk -F, '{print $1}' | sed 's/"//g' | grep -vE '^.{11}$' >/tmp/$file
This command takes a long time to execute and causes high CPU usage. I want to achieve the following:
print all numbers with length not equal to 11 digits.
print all numbers that do not start with 93 (regardless of their length)
a clean, effective command that is not CPU or memory costly
I have another requirement, which is to also print the numbers that do not start with 93.
Note:
The log file contains lots of different lines, but I use awk '/Read:ROP/' to work on the output below and extract the numbers:
Read:ROP (CustomerId="93700001865", Key=1, ActiveEndDate=2025-01-19 20:12:22, FirstCallDate=2018-01-08 12:30:30, IsFirstCallPassed=true, IsLocked=false, LTH={Data=["1|MOC|07.07.2020 09:18:58|48000.0|119||OnPeakAccountID|480|19250||", "1|RECHARGE|04.07.2020 10:18:32|-4500.0|0|0", "1|RECHARGE|04.07.2020 10:18:59|-4500.0|0|0"], Index=0}, LanguageID=2, LastKnownPeriod="Active", LastRechargeAmount=4500, LastRechargeDate=2020-07-04 10:18:59, VoucherRchFraudCounter=0, c_BlockPAYG=true, s_PackageKeyCounter=13, s_OfferId="xyz", OnPeakAccountID_FU={Balance=18850});
20GB log file [...] zcat
Using zcat on 20GB log files is quite expensive. Check top when running your command line above.
It might be worth keeping the data from the first filtering step:
zcat /path to file location/$date*/logfile_*.dat.zip | awk '/Read:ROP/' > filter_data.out
and work with the filtered data. I assume here that this awk step can remove the majority of the data.
Bonus points: This step can be parallelized by running the zcat [...] |awk [...] pipe file-by-file, and you only need to do this once for each file.
The other steps don't look particularly expensive unless there are a lot of data lines left even after filtering.
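Beyond that, the five-stage awk | nawk | awk | sed | grep chain can usually be collapsed into a single awk program. A sketch, assuming (as in the sample line above) that the CustomerId value is the first double-quoted field on each Read:ROP line, and folding in both requirements (length not equal to 11, or not starting with 93):
zcat /path to file location/$date*/logfile_*.dat.zip |
awk -F'"' '/Read:ROP/ {
    n = $2                               # first quoted value, e.g. 93711994166
    if (length(n) != 11 || n !~ /^93/)   # keep anything that is not an 11-digit number starting with 93
        print n
}' > /tmp/$file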
sed '/.*Read:ROP.*([^=]*="\([^"]*\)".*/!d; s//\1/'
/.../ - match regex
.*Read:ROP.* - match Read:ROP followed by anything with anything in front, ie. awk '/Read:ROP/'
([^=]*=" - match a (, followed by anything except =, then a =, then a ", ie. nawk -F "=" '{print $2}'
\([^"]*\) - match everythjing inside qoutes. I guess [0-9] would be fine also
".* - delete rest of line
! - if the line doesn't match the regex
d - remove the line
s - substitute
// - reuse the regex in /.../
\1 - substitute for first backreference, ie. for \([^"]*\)
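Putting it together, extraction plus filtering could then look like this (a sketch; the final grep -v drops exactly the 11-digit numbers starting with 93, so everything with the wrong length or the wrong prefix is kept):
zcat /path to file location/$date*/logfile_*.dat.zip |
  sed '/.*Read:ROP.*([^=]*="\([^"]*\)".*/!d; s//\1/' |
  grep -vE '^93[0-9]{9}$' > /tmp/$file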
1
1
1
1
Text
11
1
1
1
The text above can be anywhere, therefore grep -C won't help.
I have tried that using awk, but I want to do it using grep.
zgrep -C25 "Text" engine.*.log.gz
doesn't work, as the text may appear anywhere
I know the awk option, but it doesn't work on a .gz file; I would have to convert it first, which is a lengthy process:
awk -v RS= '/Text/' engine.*.log.gz
Just do it the trivial, obvious, robust, portable way:
zcat engine.*.log.gz | awk -v RS= '/Text/{print; exit}'
That should also be pretty efficient since the SIGPIPE that zcat gets when awk exits on finding the first Text should terminate the zcat.
Or if Text can appear multiple times in the input and you want all associated records output:
zcat engine.*.log.gz | awk -v RS= -v ORS='\n\n' '/Text/'
Thank you so much for the help, but I figured out the solution
zgrep -C34 "text" engine.*.log.gz |awk '/20yy-mm-dd hh:mm:ss/,/SENT MESSAGES (asynchronous)/'
Search the "text" along side 34 lines before and after it to get average out all the lines required
All of my logs session always begins with date and time so is used AWK for that
All of my logs session always ends with "SENT MESSAGES (asynchronous):" So AWK did that trick for me.
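For reference, since each session starts with a timestamp and ends with the "SENT MESSAGES (asynchronous):" line, the fixed -C34 window can be avoided by collecting whole sessions in awk. A sketch, assuming timestamps of the form 20yy-mm-dd hh:mm:ss at the start of a line and "text" as the search term:
zcat engine.*.log.gz | awk '
  /^20[0-9][0-9]-[0-9][0-9]-[0-9][0-9] / { rec = "" }                          # a new session starts at a timestamp line
  { rec = rec $0 "\n" }                                                        # collect every line of the current session
  /SENT MESSAGES \(asynchronous\):/ { if (rec ~ /text/) printf "%s\n", rec }   # session over: print it if it mentions "text"
'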
In a Unix environment, I occasionally have some fixed-width files for which I'd like to check the record lengths. For each file I'd like to catch any records that are not the appropriate length, along with their line numbers, for further investigation; the appropriate length is known a priori.
If I want to check if all record lengths are the same, I simply run
zcat <gzipped file> | awk '{print length}' | sort -u
If there is more than one record length in the above command, then I run
zcat <gzipped file> | awk '{print length}' | nl -n rz -s "," > recordLenghts.csv
which stores a record length for each row in the original file.
What: Is this an efficient method, or is there a better way of checking record length for a file?
Why: The reason I ask is that some of these files can be a few GB in size while gzipped. So this process can take a while.
With pure awk:
zcat <gzipped file> | awk '{printf "%0.6d,%s\n", NR, length}' > recordLenghts.csv
This way you will save one extra subprocess.
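Since the appropriate length is known a priori, another option is to skip the post-processing and have awk emit only the offending line numbers. A minimal sketch, assuming an expected record width of 80 (badRecords.csv is just a placeholder name):
zcat <gzipped file> | awk -v want=80 'length($0) != want { printf "%0.6d,%d\n", NR, length($0) }' > badRecords.csv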
So, I've been googling around, and also searching in more detail on stack overflow, but I just can't seem to find an easy way of doing exactly this:
I want to know in what way two strings (without whitespace) differ, and simply print what that exact difference is.
E.g.:
Input 1 > "Chocolatecakeflavour"
Input 2 > "Chocolateflavour"
Output: "cake"
I've tried doing this with diff and dwdiff, cmp, and other known bash commands that popped into mind, but I just couldn't get this exact result.
Any ideas?
You can use diff with fold and awk like this:
s="Chocolatecakeflavour"
r="Chocolateflavour"
diff <(fold -w1 <<< "$s") <(fold -w1 <<< "$r") | awk '/[<>]/{printf $2}'
cake
fold -w1 is to split the input string character by character (one per line)
diff is to get the difference between both lists (1 char in each line)
awk '/[<>]/{printf $2}' is to pick the difference lines (those marked < or >) from diff's output, drop the markers, and print everything on the same line
EDIT: As per OP's comments below, if the strings are on different lines of a file (here lines 2 and 3, extracted with sed '2q;d' and sed '3q;d'), then use:
f=file
diff <(fold -w1 <(sed '2q;d' $f)) <(fold -w1 <(sed '3q;d' $f)) | awk '/[<>]/{printf $2}'
cake
I have a bunch of massive text files, about 100MB each.
I want to grep to find entries that have 'INDIANA JONES' in it:
$ grep -ir 'INDIANA JONES' ./
Then, I would like to find the entries where there is the word PORTUGAL within 5,000 characters of the INDIANA JONES term. How would I do this?
# in pseudocode
grep -ir 'INDIANA JONES' ./ | grep 'PORTUGAL' within 5000 char
Use grep's -o flag to output up to 5000 characters surrounding the match, then search those characters for the second string. For example:
grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep "PORTUGAL"
If you need the original match, add the -n flag to the second grep and pipe into:
cut -f1 -d: > line_numbers.txt
then you could use awk to print those lines:
awk 'FNR==NR { a[$0]; next } FNR in a' line_numbers.txt file.txt
To avoid the temporary file, this could be written like:
awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep -n "PORTUGAL" | cut -f1 -d:) file.txt
For multiple files, use find and a bash loop:
for i in $(find . -type f); do
awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" "$i" | grep -n "PORTUGAL" | cut -f1 -d:) "$i"
done
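If any of the file names contain spaces or newlines, a null-delimited loop is a safer variant of the same idea (a sketch):
find . -type f -print0 | while IFS= read -r -d '' i; do
    awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" "$i" | grep -n "PORTUGAL" | cut -f1 -d:) "$i"
done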
One way to deal with this is with gawk. You could set the record separator to either INDIANA JONES or PORTUGAL and then perform a length check on the record (after stripping newlines, assuming newlines do not count towards the limit of 5000). You may have to resort to find to run this recursively within a directory (a sketch of that follows the code below).
awk -v RS='INDIANA JONES|PORTUGAL' '{a = $0;
gsub("\n", "", a)};
((RT ~ /IND/ && prevRT ~/POR/) || (RT ~ /POR/ && prevRT ~/IND/)) && length(a) < 5000{found=1};
{prevRT=RT};
END{if (found) print FILENAME}' file.txt
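For the recursive case the answer mentions, one possibility (a sketch, relying on gawk's BEGINFILE/ENDFILE so that find can hand several files to a single invocation) is:
find . -type f -exec gawk -v RS='INDIANA JONES|PORTUGAL' '
    BEGINFILE { prevRT = ""; found = 0 }                      # reset per-file state
    { a = $0; gsub("\n", "", a) }                             # text between two keywords, newlines stripped
    ((RT ~ /IND/ && prevRT ~ /POR/) || (RT ~ /POR/ && prevRT ~ /IND/)) && length(a) < 5000 { found = 1 }
    { prevRT = RT }
    ENDFILE { if (found) print FILENAME }' {} +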
Consider installing ack-grep.
sudo apt-get install ack-grep
ack-grep is a more powerful version of grep.
There's no trivial solution to your question (that I can think of) outside of a full batch script, but you can use the -A and -B flags on ack-grep to specify a number of trailing or leading lines to output, respectively.
This may not be a number of chars, but is a step further in that direction.
While this may not be a solution, it might give you some idea as to how to do this. Look up filters like ack, awk, sed, etc. and see if you can find one with a flag for this kind of behaviour.
The ack-grep manual:
http://manpages.ubuntu.com/manpages/hardy/man1/ack-grep.1p.html
EDIT:
I think the sad news is, what you might think you're looking for is something like:
grep "\(INDIANA JONES\).\{1,5000\}PORTUGAL" filename
The problem is, even on a small file, querying this is going to be impossible time-wise.
I got this one to work with a different number. It's a size problem.
For such a large set of files, you'll need to do this in more than one step.
A Solution:
The only solution I know of is the leading and trailing output from ack-grep.
Step 1: how long are your lines?
If you knew how many lines out you had to go (and you could estimate or calculate this a few ways), then you'd be able to grep the output of the first grep. Depending on what's in your file, you should be able to get a decent upper bound on how many lines make up 5000 chars (if a line has 100 chars on average, 50+ lines should cover you, but if it has 10 chars, you'll need 500+).
You've got to determine the maximum number of lines that could be 5000 chars. You could guess or pick a high range if you like, but that'll be up to you. It's your data.
With that, call (if you needed 100 lines for 5000 chars):
ack-grep -ira "PORTUGAL" -A 100 -B 100 filename
and
ack-grep -ira "INDIANA JONES" -A 100 -B 100 filename
replace the 100s with what you need.
Step 2: parse the output
You'll need to take the matches that ack-grep returns and parse them, looking for any matches again within these sub-ranges.
Look for INDIANA JONES in the first PORTUGAL ack-grep match output, and look for PORTUGAL in the second set of matches.
This should take a bit more work, likely involving a bash script (I might see if I can get one working this week), but it solves your massive-data problem, by breaking it down into more manageable chunks.
grep 'INDIANA JONES' . -iR -l | while read filename; do head -c 5000 "$filename" | grep -n PORTUGAL -H --label="$filename" ; done
This works as follows:
grep 'INDIANA JONES' . -iR -l. Search all files in or below the current directory, case-insensitively (-i), and only print the names of the files that match (-l); don't print any content.
| while read filename; do ...|...|...; done for each line of input, store it in variable $filename and execute the pipeline.
Now, for each file that matched 'INDIANA JONES', we do
head -c 5000 "$filename" - extract the first 5000 characters
grep ... - search for PORTUGAL. Print the filename (-H), telling grep which filename to display with --label="$filename" (since the input comes from a pipe). Print line numbers too (-n).