Length of shortest line? - linux

On Linux, the command wc -L prints the length of the longest line of a text file.
How do I find the length of the shortest line of a text file?

Try this:
awk '{print length}' <your_file> | sort -n | head -n1
This command gets the lengths of all lines, sorts them numerically and, finally, prints the smallest number to the console.

Pure awk solution:
awk '(NR==1||length<shortest){shortest=length} END {print shortest}' file
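A quick sanity check on a throwaway file (file name and contents invented for illustration):
$ printf 'ABC\nD\nEF\n' > sample.txt
$ awk '(NR==1||length<shortest){shortest=length} END {print shortest}' sample.txt
1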

I turned the awk command into a function (for bash):
function shortest() { awk '(NR==1||length<shortest){shortest=length} END {print shortest}' "$1" ;} ## report the length of the shortest line in a file
I added this to my .bashrc (and then ran "source .bashrc"),
then called it like this: shortest "yourFileNameHere"
[~]$ shortest .history
2
It can be assigned to a variable (note that the backticks, i.e. command substitution, are required):
[~]$ var1=`shortest .history`
[~]$ echo $var1
2
For csh:
alias shortest "awk '(NR==1||length<shortest){shortest=length} END {print shortest}' \!:1 "

Neither awk solution above handles '\r' the way wc -L does.
For a single-line input file, the shortest length should never exceed the maximal line length reported by wc -L, yet both awk versions can exceed it (see the samples below).
Here is a sed-based solution (I was not able to make it shorter while keeping it correct):
echo $((`sed 'y/\r/\n/' file|sed 's/./#/g'|sort|head -1|wc --bytes`-1))
It converts every '\r' into a newline, replaces each remaining character with '#', sorts so the shortest all-'#' line comes first, and counts its bytes minus the trailing newline.
Here are some samples backing the '\r' claim and demonstrating the sed solution:
$ echo -ne "\rABC\r\n" > file
$ wc -L file
3 file
$ awk '{print length}' file|sort -n|head -n1
5
$ awk '(NR==1||length<shortest){shortest=length} END {print shortest}' file
5
$ echo $((`sed 'y/\r/\n/' file|sed 's/./#/g'|sort|head -1|wc --bytes`-1))
0
$
$ echo -ne "\r\r\n" > file
$ wc -L file
0 file
$ echo $((`sed 'y/\r/\n/' file|sed 's/./#/g'|sort|head -1|wc --bytes`-1))
0
$
$ echo -ne "ABC\nD\nEF\n" > file
$ echo $((`sed 'y/\r/\n/' file|sed 's/./#/g'|sort|head -1|wc --bytes`-1))
1
$

Related

Print second last line from variable in bash

VAR="1\n2\n3"
I'm trying to print out the second last line. One liner in bash!
I've gotten so far: printf -- "$VAR" | head -2
It however prints out too much.
I can do this with a file no problem: tail -2 ~/file | head -1
You almost solved this task by yourself. Try
VAR="1\n2\n3"; printf -- "$VAR"|tail -2|head -1
Here is one pure bash way of doing this:
readarray -t arr < <(printf -- "$VAR") && echo "${arr[-2]}"
2
You may also use this awk as a single command:
VAR="1\n2\n3"
awk -F '\\\\n' '{print $(NF-1)}' <<< "$VAR"
2
It may be more efficient to use a temporary variable and parameter expansions:
var=$'1\n2\n3' ; tmpvar=${var%$'\n'*} ; echo "${tmpvar##*$'\n'}"
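Broken down step by step (intermediate values shown in the comments):
var=$'1\n2\n3'
tmpvar=${var%$'\n'*}        # strip the last line; tmpvar is now $'1\n2'
echo "${tmpvar##*$'\n'}"    # strip everything up to the last remaining newline
2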
Use echo -e for backslash interpretation, translating \n into real newlines, then print the desired line number using NR.
$ echo -e "${VAR}" | awk 'NR==2'
2
With multiple lines, tail and head can be combined to print any particular line number.
$ echo -e "$VAR" | tail -2 | head -1
2
or do a fancy sed, where you keep the previous line in the hold space (x) to print, and keep deleting until the last line:
$ echo -e "$VAR" | sed 'x;$!d'
2

Need to reduce the execution time

We are trying to execute the script below to find the occurrences of particular words in a log file.
Need suggestions to optimize the script.
Test.log size - approximately 500 to 600 MB
$ wc -l Test.log
16609852 Test.log
po_numbers - 11k to 12k PO numbers to search
$ more po_numbers
xxx1335
AB1085
SSS6205
UY3347
OP9111
....and so on
Current Execution Time - 2.45 hrs
while IFS= read -r po
do
    check=$(grep -c "PO_NUMBER=$po" Test.log)
    echo $po "-->" $check >>list3
    if [ "$check" = "0" ]
    then
        echo $po >>po_to_server
    #else break
    fi
done < po_numbers
You are reading your big file too many times when you execute
grep -c "PO_NUMBER=$po" Test.log
You can try to split your big file into smaller ones, or write your patterns to a file and make grep use it:
echo "PO_NUMBER=$po" >> patterns.txt
then
grep -f patterns.txt Test.log
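A minimal sketch of that idea, building the whole pattern file first and then scanning the log once (file names taken from the question):
while IFS= read -r po; do
    printf 'PO_NUMBER=%s\n' "$po"
done < po_numbers > patterns.txt
grep -Ff patterns.txt Test.log
-F treats the patterns as fixed strings, which is faster and safe here since the PO numbers contain no regex metacharacters.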
$ grep -Fwf <(sed 's/.*/PO_NUMBER=&/' po_numbers) Test.log
This creates the lookup patterns from po_numbers (via process substitution) and checks for literal word matches in the log file. It assumes the searched PO_NUMBER=xxx is a separate word; if not, remove -w. It also assumes the PO numbers are literal strings rather than regexes; if not, remove -F. Either change will slow down the search.
Using grep:
sed -e 's|^|PO_NUMBER=|' po_numbers | grep -o -F -f - Test.log | sed -e 's|^PO_NUMBER=||' | sort | uniq -c > list3
grep -o -F -f po_numbers list3 | grep -v -o -F -f - po_numbers > po_to_server
Using awk:
This awk program might work faster
awk '(NR==FNR){ po[$0]=0; next }
{
    for (key in po) {
        str = $0
        po[key] += gsub("PO_NUMBER=" key, "", str)
    }
}
END {
    for (key in po) {
        if (po[key] == 0) { print key >> "po_to_server" }
        else { print key "-->" po[key] >> "list3" }
    }
}' po_numbers Test.log
This does the following:
The first line loads the po keys from the file po_numbers.
The second block parses the log file for occurrences of PO_NUMBER=key on each line (gsub is a function which performs a substitution and returns the substitution count).
In the end we print the requested output to the requested files.
The assumption here is that multiple patterns might occur, possibly several times each, on a single line of Test.log.
Comment: the original order of po_numbers will not be preserved.
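To sanity-check the program on toy data (contents invented for illustration):
$ printf 'AB1\nCD2\n' > po_numbers
$ printf 'foo PO_NUMBER=AB1 bar PO_NUMBER=AB1\nbaz\n' > Test.log
$ awk '...' po_numbers Test.log    # the program above
$ cat list3
AB1-->2
$ cat po_to_server
CD2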
"finding out the occurrence"
Not sure if you mean to count the number of occurrences for each searched word or to output the lines in the log that contain at least one of the searched words. This is how you could solve it in the latter case:
(cat po_numbers; echo GO; cat Test.log) | \
perl -nle'$r?/$r/&&print:/GO/?($r=qr/@{[join"|",@s]}/):push@s,$_'

Linux: Extract string from a line including delimiter character using sed command [duplicate]

For example
echo "abc-1234a :" | grep <do-something>
to print only abc-1234a
I think these are closer to what you're getting at, but without knowing what you're really trying to achieve, it's hard to say.
echo "abc-1234a :" | egrep -o '^[^:]+'
... though this will also match lines that have no colon. If you only want lines with colons, and you must use only grep, this might work:
echo "abc-1234a :" | grep : | egrep -o '^[^:]+'
Of course, this only makes sense if your echo "abc-1234a :" is an example that would be replaced with possibly multiple lines of input.
The smallest tool you could use is probably cut:
echo "abc-1234a :" | cut -d: -f1
And sed is always available...
echo "abc-1234a :" | sed 's/ *:.*//'
For this last one, if you only want to print lines that include a colon, change it to:
echo "abc-1234a :" | sed -ne 's/ *:.*//p'
Heck, you could even do this in pure bash:
while read line; do
    field="${line%%:*}"
    # do stuff with $field
done <<<"abc-1234a :"
For information on the %% bit, you can man bash and search for "Parameter Expansion".
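A quick illustration of %% with the sample input (brackets added to make the trailing space visible):
$ line="abc-1234a :"
$ printf '[%s]\n' "${line%%:*}"
[abc-1234a ]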
UPDATE:
You said:
It's the characters in the first line of input before the colon. The
input could have multiple line though.
The solutions with grep probably aren't your best choice, then, since they'll also print data from subsequent lines that might include colons. Of course, there are many ways to solve this requirement as well. We'll start with sample input:
$ function sample { printf "abc-1234a:foo\nbar baz:\nNarf\n"; }
$ sample
abc-1234a:foo
bar baz:
Narf
You could use multiple pipes, for example:
$ sample | head -1 | grep -Eo '^[^:]*'
abc-1234a
$ sample | head -1 | cut -d: -f1
abc-1234a
Or you could use sed to process only the first line:
$ sample | sed -ne '1s/:.*//p'
abc-1234a
Or tell sed to exit after printing the first line (which is faster than reading the whole file):
$ sample | sed 's/:.*//;q'
abc-1234a
Or do the same thing but only show output if a colon was found (for safety):
$ sample | sed -ne 's/:.*//p;q'
abc-1234a
Or have awk do the same thing (as the last 3 examples, respectively):
$ sample | awk '{sub(/:.*/,"")} NR==1'
abc-1234a
$ sample | awk 'NR>1{nextfile} {sub(/:.*/,"")} 1'
abc-1234a
$ sample | awk 'NR>1{nextfile} sub(/:.*/,"")'
abc-1234a
Or in bash, with no pipes at all:
$ read line < <(sample)
$ printf '%s\n' "${line%%:*}"
abc-1234a
It is possible to do what you want with only sed.
Here is an example:
#!/bin/sh
filename=$1
pattern=yourpattern
# flag -n disables printing every line (the default behavior)
sed -n "
1,/$pattern/ {
    /$pattern/d # skip the line containing the pattern
    p           # print lines ranging from line 1 until the pattern
}
" "$filename"
exit 0
This works at least with GNU sed. It should work with other sed implementations too, except
for the comments (some implementations of sed don't support comments).
Source: https://www.grymoire.com/Unix/Sed.html
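For example, after setting pattern=STOP and saving the script as cut-at-pattern.sh (names invented for the demo):
$ printf 'one\ntwo\nSTOP\nthree\n' > demo.txt
$ sh cut-at-pattern.sh demo.txt
one
two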

How to keep blank lines at the end of a file when I use the cat command in a shell script

The file a.txt has two blank lines at the end:
[yaxin@oishi tmp]$ cat -n a.txt
1 jhasdfj
2
3 sdfjalskdf
4
5
and my script is:
[yaxin@oishi tmp]$ cat t.sh
#!/bin/sh
a=`cat a.txt`
a_length=`echo "$a" | awk 'END {print NR}'`
echo "$a"
echo $a_length
[yaxin@oishi tmp]$ sh t.sh
jhasdfj
sdfjalskdf
3
Running it with debugging enabled:
[yaxin@oishi tmp]$ sh -x t.sh
++ cat a.txt
+ a='jhasdfj
sdfjalskdf'
++ echo 'jhasdfj
sdfjalskdf'
++ awk 'END {print NR}'
+ a_length=3
+ echo 'jhasdfj
sdfjalskdf'
jhasdfj
sdfjalskdf
+ echo 3
3
The cat command steals the blank lines at the end of the file. How do I solve this problem?
The cat command does not steal anything. It is the command substitution that does. man bash says:
Bash performs the expansion by executing command and replacing the command substitution with the standard output of the command, with any trailing newlines deleted. Embedded newlines are not deleted
If you want to store the output of a command in a variable and keep the trailing newlines, you might add && echo . after the command, store the output, and remove the final . afterwards.
Also, the canonical way to count the number of lines in a file is to run wc -l:
wc -l < a.txt
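With the a.txt from the question, this counts all five lines, including the trailing blank ones:
$ wc -l < a.txt
5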
You don't need the cat command here; use awk directly, like this:
awk 'END {print NR}' a.txt
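Again with the question's a.txt:
$ awk 'END {print NR}' a.txt
5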
Your problem is in storing cat's output in a shell variable. Even this will give the right output (though it's a case of UUOC, a useless use of cat):
cat a.txt | awk 'END {print NR}'
Update: When you try to do this:
a=`cat a.txt`
OR else:
a=$(cat a.txt)
The pitfall is that command substitution, i.e. a command inside backquotes like you have, or inside $(), strips trailing newlines.
You can do this trick to get trailing newlines stored in a shell variable:
a=`cat a.txt; echo ';'`
a="${a%;}"
Test the variable value:
echo "$a"
printf "%q" "$a"
Then output will show newlines as well:
jhasdfj
sdfjalskdf
$'jhasdfj\n\nsdfjalskdf\n\n\n'

print a line which has a digit repeated n times in the third field

I have a file with contents:
20120619112139,3,22222288100597,01,503352786544597,,W,ROAMER,,,,0,mme2
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120611171517,3,22222288100620,,503352786544620,11917676228846,B,ROAMER,8,2505,,U,
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120611171003,3,22222288100618,02,503352786544618,,W,ROAMER,8,2505,,0,
20120611171046,3,00000000000000,02,503352786544618,11917676228846,W,ROAMER,8,2505,,0,
20120611171101,3,22222288100618,02,503352786544618,11917676228846,W,ROAMER,8,2505,,0,
20120611171101,3,22222222222222,02,503352786544618,11917676228846,W,ROAMER,8,2505,,0,
I need to check if the third field of any line has one digit repeated all through its 14 characters, like 00000000000000, and print such lines to another file.
I tried this code:
awk '$3 ~ /[0-9]{14}/' myfile > output.txt
But this prints lines having "22222288100618" such values as well.
Also i tried:
for i in `cat myfile`
do
if [ `echo $i | cut -d"," -f 3 | egrep "^[0-9]{14}$"` ];
then echo $i >> output.txt;
fi
done
This doesn't help either; it also prints all the lines.
But I only need these lines in the output file.
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120611171046,3,00000000000000,02,503352786544618,11917676228846,W,ROAMER,8,2505,,0,
20120611171101,3,22222222222222,02,503352786544618,11917676228846,W,ROAMER,8,2505,,0,
Thanks in advance for any immediate help
Don't know if this can be done with awk but this should work:
perl -aF, -nle '$F[2]=~/(\d)\1{13}/ && print' myfile
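For the record, it can also be done in awk without back-references, by comparing every character of the third field against its first one (a sketch):
awk -F, 'length($3) == 14 {
    ok = 1
    for (i = 2; i <= 14; i++)
        if (substr($3, i, 1) != substr($3, 1, 1)) { ok = 0; break }
    if (ok) print
}' myfile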
You can use an expression like 0{14}|1{14}.... Try this:
$ for i in 0 1 2 3 4 5 6 7 8 9; do re=$re${re:+|}$i{14}; done
$ awk -F, --posix \$3~/$re/ myfile
(Older versions of gawk required --posix to recognize the interval expression {14}; this may not be necessary with all awk implementations.)
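After the loop, $re holds the full alternation:
$ echo "$re"
0{14}|1{14}|2{14}|3{14}|4{14}|5{14}|6{14}|7{14}|8{14}|9{14}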
Using grep (GNU grep supports back-references with -E; the anchors ensure exactly 14 repeated digits in the third field):
grep -E "^[0-9]+,[0-9]+,([0-9])\1{13}," myfile
sed -En '/^[^,]+,[^,]+,([0-9])\1{13}/p' input_file
