I would like to write a little shell script that checks whether all the lines in a file have the same number of ;
I have a file in the following format:
$ cat filename.txt
34567890;098765456789;098765567;9876;9876;EXTG;687J;
4567800987987;09876789;9667876YH;9876;098765;098765;09876;
SLKL987H;09876LKJ;POIUYT;PÖIUYT;88765K;POIUYTY;LKJHGFDF;
TYUIO;09876LKJ;POIUYT;LKJHG;88765K;POIUYTY;OIUYT;
...
...
...
SDFGHJK;RTYUIO9876;4567890LKJHGFD;POIUYTRF56789;POIUY;POIUYT;9876;
I use the following command to determine the number of ; on each line:
awk -F';' 'NF{print (NF-1)}' filename.txt
I have the following output :
7
7
7
7
...
...
...
7
This is because the number of ; on each line of this file is 7.
Now, I want to write a script that verifies that all the lines in the file have 7 semicolons. If so, it tells me that the file is correct. Otherwise, if even a single line contains a different number of semicolons, it tells me that the file is not correct.
Rather than printing output, return a value, e.g.
awk -F';' 'NR==1{count = NF} NF!=count{status=1} END{exit status}' filename.txt
If there are no lines, or if all lines contain the same number of semicolons, this exits with status 0. Otherwise, it exits with status 1 to indicate failure.
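For example, that exit status can drive a shell test directly (a sketch using the filename from the question):
if awk -F';' 'NR==1{count = NF} NF!=count{status=1} END{exit status}' filename.txt; then
    echo "Good file"
else
    echo "Bad file"
fi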
Count the number of unique lines and verify that the count is 1.
if (($(awk -F';' 'NF{print (NF-1)}' filename.txt | uniq | wc -l) == 1)); then
echo good
else
echo bad
fi
Just pipe the result through sort -u | wc -l. If all lines have the same number of fields, this will produce one line of output.
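Concretely, reusing the field-count command from the question (a sketch):
awk -F';' 'NF{print (NF-1)}' filename.txt | sort -u | wc -l
If this prints 1, every non-empty line has the same number of semicolons.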
Alternatively, just look for a line in awk that doesn't have the same number of fields as the first line.
awk -F';' 'NR==1 {linecount=NF}
linecount != NF { print "Bad line " $0; exit 1}
' filename.txt && echo "Good file"
You can also adapt the old trick used to output only the first of duplicate lines.
awk -F';' '{a[NF]=1}; length(a) > 1 {exit 1}' filename.txt
Each line records its field count as a key in the array a; exit with status 1 as soon as a has more than one entry. Basically, a acts as a set of all field counts seen so far.
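Note that length() on an array is a gawk extension, so a sketch of this trick wired into a shell check names gawk explicitly:
gawk -F';' '{a[NF]=1}; length(a) > 1 {exit 1}' filename.txt && echo "Good file" || echo "Bad file"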
Based on all the information you have given me, I ended up doing the following, and it works for me.
# number of semicolons on the first line (NF-1, since each line ends with ;)
nbCol=$(awk -F';' 'NR==1{print NF-1}' "$1")
val=7
awk -F';' 'NR==1{count = NF} NF != count { exit 1 }' "$1"
result=$?
if [ "$result" -eq 0 ] && [ "$nbCol" -eq "$val" ]; then
    echo "Good Format"
else
    echo "Bad Format"
fi
I have a .txt file with 20 lines, and I would like to get the last two letters of each line. If they equal AA on every line, then print GOOD; if not, print BAD.
line11111111111111111 AA
line22222222222222222 AA
line33333333333333333 AA
.....................
line20202020202020202 AA
This is GOOD.
===========================
line11111111111111111 AB
line22222222222222222 AC
line33333333333333333 WD
.....................
line20202020202020202 ZZ
This is BAD.
I tried this, but it needs improvement: sed 's/^.*\(.\{2\}\)/\1/'
Based on your file layout:
$ awk '$NF!="AA"{f=1; exit} END{print (f?"BAD":"GOOD")}' file
Note that you don't have to check the rest of the file after the first non-"AA" token.
You may use a single awk command:
awk 'substr($0, length()-1) != "AA"{exit 1}' file && echo "GOOD" || echo "BAD"
substr($0, length()-1) extracts the last 2 characters of every line. The awk command exits with 1 as soon as it finds a line that doesn't end in AA.
Use a grep invert-match to identify lines not ending with "AA":
if egrep -q -v 'AA$' input.txt; then echo "bad"; else echo "good"; fi
This script should work with awk. The name of the txt file for me is .test; you can change it to your file name.
if [ "$(awk '{ print $2 }' .test | uniq)" = "AA" ]; then
echo "This is GOOD";
else echo "This is BAD";
fi
How it works:
First, awk is used to get the second column with awk '{ print $2 }', and uniq collapses repeated adjacent entries. If every line's second column is AA, uniq reduces the output to a single line. Finally, we check whether that result is exactly "AA" (a one-line string with two As).
Solution with grep and wc:
if [ "$(grep -v 'AA$' your-file-here | wc -l)" -eq 0 ]; then echo 'GOOD'; else echo 'BAD'; fi
The grep command finds all lines not ending with AA, and wc counts how many there are.
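grep can also do the counting itself with -c, which makes wc unnecessary (a sketch):
if [ "$(grep -vc 'AA$' your-file-here)" -eq 0 ]; then echo 'GOOD'; else echo 'BAD'; fi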
I have a file with contents:
20120619112139,3,22222288100597,01,503352786544597,,W,ROAMER,,,,0,mme2
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120611171517,3,22222288100620,,503352786544620,11917676228846,B,ROAMER,8,2505,,U,
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120611171003,3,22222288100618,02,503352786544618,,W,ROAMER,8,2505,,0,
20120611171046,3,00000000000000,02,503352786544618,11917676228846,W,ROAMER,8,2505,,0,
20120611171101,3,22222288100618,02,503352786544618,11917676228846,W,ROAMER,8,2505,,0,
20120611171101,3,22222222222222,02,503352786544618,11917676228846,W,ROAMER,8,2505,,0,
I need to check whether the third field of any line consists of a single digit repeated 14 times, like 00000000000000, and print such lines to another file.
I tried this code:
awk '$3 ~ /[0-9]{14}/' myfile > output.txt
But this prints lines with values such as "22222288100618" as well.
Also i tried:
for i in `cat myfile`
do
if [ `echo $i | cut -d"," -f 3 | egrep "^[0-9]{14}$"` ];
then echo $i >> output.txt;
fi
done
This doesn't help either; it also prints all the lines.
But I only need these lines in the output file.
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120703112557,3,00000000000000,,503352786544021,,B,,8,2505,,U,
20120611171046,3,00000000000000,02,503352786544618,11917676228846,W,ROAMER,8,2505,,0,
20120611171101,3,22222222222222,02,503352786544618,11917676228846,W,ROAMER,8,2505,,0,
Thanks in advance for any immediate help
I don't know if this can be done with awk, but this should work:
perl -aF, -nle '$F[2]=~/(\d)\1{13}/&& print'
You can use an expression like 0{14}|1{14}.... Try this:
$ for i in 0 1 2 3 4 5 6 7 8 9; do re=$re${re:+|}$i{14}; done
$ awk -F, --posix \$3~/$re/ myfile
(Older gawk releases require --posix or --re-interval to recognize the interval expression {14}; recent versions support intervals by default. This may not be necessary with every awk.)
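Written out, the loop builds the ERE 0{14}|1{14}|2{14}|...|9{14}, so with an awk that understands interval expressions the command is equivalent to this sketch (anchored here so the whole field must be the run):
awk -F, '$3 ~ /^(0{14}|1{14}|2{14}|3{14}|4{14}|5{14}|6{14}|7{14}|8{14}|9{14})$/' myfile > output.txt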
Using grep:
grep -E "^[0-9]+,[0-9]+,([0-9])\1{13}," myfile
(The leading ^ and trailing comma anchor the match to the third field; note that the back-reference \1 with -E is a GNU grep extension.)
sed -E -n '/^[^,]+,[^,]+,([0-9])\1{13}/p' input_file
(-E is required here; without it, sed treats +, (), and {13} as literal BRE characters and the pattern never matches.)
I'm looking at files that all have a different version number that starts at column 18 of line 7.
What's the best way with Bash to read (into a $variable) the string on line 7, from column, i.e. "character," 18 to the end of the line? What about to the 5th to last character of the line?
sed way:
variable=$(sed -n '7s/^.\{17\}//p' file)
EDIT (thanks to commenters): If by columns you mean fields (separated with tabs or spaces), the command can be changed to
variable=$(sed -n '7s/^\s*\(\S\+\s\+\)\{17\}//p' file)
You have a number of different ways you can go about this, depending on the utilities you want to use. One of your options is to make use of Bash's substring expansion in any of the following ways:
sed
line=1
string=$(sed -n "${line}p" /etc/passwd)
echo "${string:17}"
awk
line=1
string=$(awk -v line="$line" 'NR == line { print; exit }' /etc/passwd)
echo "${string:17}"
coreutils
line=1
string=$({ head -n "$line" | tail -n 1; } < /etc/passwd)
echo "${string:17}"
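The second part of the question (stopping at the 5th-to-last character) can be handled by the same substring expansion: since Bash 4.2, a negative length counts back from the end of the string. A sketch, assuming Bash 4.2 or later:
line=7
string=$(sed -n "${line}p" file)
# offset 17 skips the first 17 characters; length -5 drops the last 5
echo "${string:17:-5}"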
Use
var=$(head -n 7 filename | tail -n 1 | cut -f 18-)
or
var=$(awk 'NR == 7 { delim = ""; for (i = 18; i <= NF; i++) { printf "%s%s", delim, $i; delim = OFS }; printf "\n"; exit }' filename)
If you mean "characters" instead of "fields":
var=$(head -n 7 filename | tail -n 1 | cut -c 18-)
or
var=$(awk 'NR == 7 { print substr($0, 18); exit }' filename)
If by 'columns' you mean 'fields':
a=$( awk 'NR==7{ print $18 }' file )
If you really want the 18th byte through the end of line 7, do:
a=$( sed -n 7p file | cut -b 18- )
I am able to find the number of times a word occurs in a text file; for example, in Linux we can use:
cat filename | grep -c tom
My question is, how do I find the count of multiple words, like "tom" and "joe", in a text file?
Since you have a couple of names, regular expressions are the way to go on this one. At first I thought it was as simple as a grep count on the regular expression of joe or tom, but found that this did not account for the scenario where tom and joe are on the same line (or tom and tom, for that matter).
test.txt:
tom is really really cool! joe for the win!
tom is actually lame.
$ grep -c '\<\(tom\|joe\)\>' test.txt
2
As you can see from the test.txt file, 2 is the wrong answer, so we needed to account for names being on the same line.
I then used grep -o to show only the part of a matching line that matches the pattern, which gave the correct pattern matches of tom or joe in the file. I then piped the results into wc -l for the line count.
$ grep -o '\(joe\|tom\)' test.txt|wc -l
3
3...the correct answer! Hope this helps
Ok, so first split the file into words, then sort and uniq:
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c
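To restrict the frequency list to the names you care about, you can filter it afterwards (a sketch):
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c | grep -wE 'tom|joe'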
You can use uniq:
sort filename | uniq -c
Use awk (save the following as w.awk):
{for (i=1;i<=NF;i++)
count[$i]++
}
END {
for (i in count)
print count[i], i
}
This will produce a complete word frequency count for the input.
Pipe the output to grep to get the desired fields:
awk -f w.awk input | grep -E 'tom|joe'
BTW, you do not need cat in your example; most programs that act as filters can take the filename as a parameter, so it's better to use
grep -c tom filename
Otherwise, there is a strong possibility that people will start throwing the Useless Use of Cat Award at you ;-)
The sample you gave does not search for the word "tom"; it will also count "atom" and "bottom" and many more.
grep searches for regular expressions. A regular expression that matches the word "tom" or "joe" is
\<\(tom\|joe\)\>
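Combined with -o and wc -l as shown earlier, it counts occurrences rather than matching lines (a sketch):
grep -o '\<\(tom\|joe\)\>' filename | wc -l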
You could use a regexp:
cat filename | tr ' ' '\n' | grep -c -e "\(joe\|tom\)"
Here is one:
cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c
UPDATE
A shell script solution:
#!/bin/bash
file_name="$2"
string="$1"
if [ $# -ne 2 ]
then
echo "Usage: $0 <pattern to search> <file_name>"
exit 1
fi
if [ ! -f "$file_name" ]
then
echo "file \"$file_name\" does not exist, or is not a regular file"
exit 2
fi
line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurrence=0
# line_no_list stores, at index k, a line number and, at index k+1, the
# number of times the string occurs on that line
while read line
do
flag=0
while [[ "$line" == *$string* ]]
do
flag=1
line_no_list[line_no_indx]=$curr_line_indx
line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
total_occurrence=$((total_occurrence+1))
# replace one occurrence of "$string" with nothing and recheck
line=${line/"$string"/}
done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
if (( flag == 1 ))
then
line_no_indx=$((line_no_indx+2))
fi
curr_line_indx=$((curr_line_indx+1))
done < "$file_name"
echo -e "\nThe string \"$string\" occurs \"$total_occurrence\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurrence # : Line number : Number of occurrences on this line]: "
for ((i=0; i<line_no_indx; i=i+2))
do
echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done
echo
I completely forgot about grep -f:
cat filename | grep -cf names
AWK solution:
Assuming the names are in a file called names:
awk 'NR==FNR { h[NR] = $1; ct[NR] = 0; cnt = NR; next }
     { for (i = 1; i <= cnt; ++i) if (match($0, h[i])) ++ct[i] }
     END { for (i = 1; i <= cnt; ++i) print h[i], ct[i] }' names filename
Note that your original grep doesn't search for words. e.g.
$ echo tomorrow | grep -c tom
1
You need grep -w
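With -w the partial match disappears:
$ echo tomorrow | grep -cw tom
0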
gawk -vRS='[^[:alpha:]]+' '{print}' | grep -Ec '^(tom|joe|bob|sue)$'
The gawk program sets the record separator to anything non-alphabetic, so every word will end up on a separate line. Then grep counts lines that match one of the words you want exactly.
We use gawk because the POSIX awk doesn't allow regex record separator.
For brevity, you can replace '{print}' with 1 - either way, it's an Awk program that simply prints out all input records ("is 1 true? it is? then do the default action, which is {print}.")
To find all hits in all lines
echo "tom is really really cool! joe for the win!
tom is actually lame." | awk '{i+=gsub(/tom|joe/,"")} END {print i}'
3
This will count "tomtom" as 2 hits.
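If you want a separate count per name instead of a single total, the same gsub trick can be split per pattern (a sketch):
echo "tom is really really cool! joe for the win!
tom is actually lame." | awk '{ t += gsub(/tom/,""); j += gsub(/joe/,"") } END { print "tom:", t, "joe:", j }'
tom: 2 joe: 1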