Display whole word from line - linux

I want to display only unique words as output. How do I define the grep expression?
strings file.txt |grep (filter to display only whole words) | unique

Sounds like you need to translate "whitespace" into newlines:
strings file.txt | tr '[:blank:]' '\n' | sort -u
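For example, assuming file.txt contains the single line "foo bar baz foo" (made-up contents, just to show the effect), this would give:
$ strings file.txt | tr '[:blank:]' '\n' | sort -u
bar
baz
foo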

This should work:
s="sample word sample word samples"
echo "$s" |grep -oE "\b\w+\b"|sort -u
Output:
sample
samples
word

cat file.txt | sed -e 's/\s\+/\n/g' | sort -u

Related

Want to display output column-wise in a Linux shell script

I have printed my output in the below format.
last -w -F | awk '{print $1","$3","$5$6$7$8","$11$12$13$14","$15}' | tac
Now I want to display the same output column-wise. Can someone help me out here?
Add this to the end: | tr ',' '\t', like this:
last -w -F | awk '{print $1","$3","$5$6$7$8","$11$12$13$14","$15}' | tac | tr ',' '\t'
This will pipe your comma-delimited output to the tr utility and tell it to translate commas to tabs.
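For example, on a single made-up comma-delimited line (the field values here are invented, not real last output):
$ echo 'user1,pts/0,MonJan1,10:00,host1' | tr ',' '\t'
user1   pts/0   MonJan1 10:00   host1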

force grep to show only unmatched part of a word in bash

After several greps I am able to get a list of some "words" like this.
Everything starts as
cat \path\verilargestructured.txt | grep option1 -B50 | grep option2 -A30 | grep option3 -A20 | grep "=host"
which results in a list with this structure
part1.part2.part3.part4=host
part1.part2.part3.part4=host
...
part1.part2.part3.part4=host
I want to use sed or any other option in bash to trim that out to
part1.part2.part3.part4
or
part2.part3.part4
assuming partN is only alphanumeric (no special characters)
Thanks
With awk, you can specify multiple delimiters with the -F parameter and output field separator with OFS option.
For example, awk -F '[.=]' '{print $2,$3,$4}' OFS=. will print only the second, third and fourth fields of your output, separated with dots.
cat \path\verilargestructured.txt | grep option1 -B50 | grep option2 -A30 | grep option3 -A20 | grep "=host" | awk -F '[.=]' '{print $2,$3,$4}' OFS=.
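For instance, on one of the lines from the list above:
$ echo 'part1.part2.part3.part4=host' | awk -F '[.=]' '{print $2,$3,$4}' OFS=.
part2.part3.part4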
If I understand you correctly, then pipe it through
sed 's/=.*//'
...that will cut off the first = in each line and everything that comes after it. So, all in all,
cat \path\verilargestructured.txt | grep option1 -B50 | grep option2 -A30 | grep option3 -A20 | grep "=host" | sed 's/=.*//'
Alternatively, you could use cut:
cut -d = -f 1
Addendum: Going the cut route, to isolate all but part1, you could pipe it through yet another cut call
cut -d . -f 2-
As in
echo 'part1.part2.part3.part4=host' | cut -d = -f 1 | cut -d . -f 2-
Here -f 2- means "from the second field to the last." If you only wanted parts 2 and 3, you could use -f 2-3, and so forth. See man cut for details.
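For example, to keep only parts 2 and 3 of one of the lines above:
$ echo 'part1.part2.part3.part4=host' | cut -d = -f 1 | cut -d . -f 2-3
part2.part3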

Need to remove the count from the output when using "uniq -c" command

I am trying to read a file and sort it by the number of occurrences of a particular field. Suppose I want to find the most repeated date in a log file; then I use the uniq -c option and sort in descending order, something like this:
uniq -c | sort -nr
This will produce some output like this -
809 23/Dec/2008:19:20
The first field, which is actually the count, is the problem for me. I want to get only the date from the above output but am not able to. I tried to use the cut command and did this:
uniq -c | sort -nr | cut -d' ' -f2
but this just prints a blank space. Please can someone help me get only the date and chop off the count? I want only
23/Dec/2008:19:20
Thanks
The count from uniq is preceded by spaces unless there are more than 7 digits in the count, so you need to do something like:
uniq -c | sort -nr | cut -c 9-
to get columns (character positions) 9 upwards. Or you can use sed:
uniq -c | sort -nr | sed 's/^.\{8\}//'
or:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
This second option is robust in the face of a repeat count of 10,000,000 or more; if you think that might be a problem, it is probably better than the cut alternative. And there are undoubtedly other options available too.
Caveat: the counts were determined by experimentation on Mac OS X 10.7.3 but using GNU uniq from coreutils 8.3. The BSD uniq -c produced 3 leading spaces before a single digit count. The POSIX spec says the output from uniq -c shall be formatted as if with:
printf("%d %s", repeat_count, line);
which would not have any leading blanks. Given this possible variance in output formats, the sed script with the [0-9] regex is the most reliable way of dealing with the variability in observed and theoretical output from uniq -c:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
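A quick way to see that this regex copes with both layouts (the two sample lines below just mimic a padded and an unpadded uniq -c count):
$ printf '    809 23/Dec/2008:19:20\n809 23/Dec/2008:19:20\n' | sed 's/^ *[0-9]* //'
23/Dec/2008:19:20
23/Dec/2008:19:20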
Instead of cut -d' ' -f2, try
awk '{$1="";print}'
Maybe you need to remove one more blank in the beginning:
awk '{$1="";print}' | sed 's/^.//'
or completely with sed, preserving the original whitespace:
sed -r 's/^[^0-9]*[0-9]+//'
The following awk may help you here:
awk '{a[$0]++} END{for(i in a){print a[i],i | "sort -k2"}}' Input_file
Second solution: in case you want the order of the output to be the same as the input rather than sorted.
awk '!a[$0]++{b[++count]=$0} {c[$0]++} END{for(i=1;i<=count;i++){print c[b[i]],b[i]}}' Input_file
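A quick check of this second solution on made-up input (fed via printf here instead of Input_file):
$ printf 'a\nb\na\nc\n' | awk '!a[$0]++{b[++count]=$0} {c[$0]++} END{for(i=1;i<=count;i++){print c[b[i]],b[i]}}'
2 a
1 b
1 c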
An alternative solution is this:
uniq -c | sort -nr | awk '{print $1, $2}'
Also, you may easily print a single field.
Use (since you used -f2 with cut in your question):
cat file |sort |uniq -c | awk '{ print $2; }'
If you want to work with the count field downstream, the following command will reformat it into a 'pipe-friendly', tab-delimited format without the left padding:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/'
For the original task it is a bit of overkill, but after reformatting, cut can be used to remove the field, as the OP intended:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/' | cut -d $'\t' -f2-
Add tr -s to the pipe chain to "squeeze" multiple spaces into one space delimiter:
uniq -c | tr -s ' ' | cut -d ' ' -f3
tr is very useful in some obscure places. Unfortunately it doesn't get rid of the first leading space, hence the -f3
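For example, with the sample line from the question (the leading spaces mimic the uniq -c padding):
$ printf '    809 23/Dec/2008:19:20\n' | tr -s ' ' | cut -d ' ' -f3
23/Dec/2008:19:20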
You could make use of sed to strip both the leading spaces and the numbers printed by uniq -c
sort file | uniq -c | sed 's/^ *[0-9]* //'
I will illustrate this with an example. Consider a file:
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
The command
sort file | uniq -c | sed 's/^ *[0-9]* //'
would return
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
First solution
Just use sort when input repetition does not need to be taken into consideration; sort has the unique option -u:
sort -u file
sort -u < file
Ex.:
$ cat > file
a
b
c
a
a
g
d
d
$ sort -u file
a
b
c
d
g
Second solution
If sorting based on repetition count is important:
sort txt | uniq -c | sort -k1 -nr | sed 's/^ \+[0-9]\+ //g'
sort txt | uniq -c | sort -k1 -nr | perl -lpe 's/^ +[\d]+ +//g'
which has this output:
a
d
g
c
b

finding unique values in a data file

I can do this in Python, but I was wondering if I could do this in Linux.
I have a file like this
name1 text text 123432re text
name2 text text 12344qp text
name3 text text 134234ts text
I want to find all the different types of values in the 3rd column for a particular username, let's say name1.
grep name1 filename gives me all the lines, but there must be some way to just list all the different types of values. (I don't want to display duplicate values for the same username.)
grep name1 filename | cut -d ' ' -f 4 | sort -u
This will find all lines that have name1, then get just the fourth column of data and show only unique values.
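Against the three sample lines from the question (assuming they are saved as filename), this gives:
$ grep name1 filename | cut -d ' ' -f 4 | sort -u
123432re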
I tried using cat. The file contains the following (here the file is foo.sh; you can use any file name):
$ cat foo.sh
tar
world
class
zip
zip
zip
python
jin
jin
doo
doo
uniq will print each word only once:
$ cat foo.sh | sort | uniq
class
doo
jin
python
tar
world
zip
uniq -u will print only the words that appear exactly once in the file:
$ cat foo.sh | sort | uniq -u
class
python
tar
world
uniq -d will print only the duplicated words, each printed once:
$ cat foo.sh | sort | uniq -d
doo
jin
zip
You can let sort look only at the 4th key, and then ask only for records with unique keys:
grep name1 | sort -k4 -u
As an all-in-one awk solution:
awk '$1 == "name1" && ! seen[$1" "$4]++ {print $4}' filename
IMHO Michał Šrajer got the best answer, but a filename is needed after grep name1.
And I've got this fancy solution using arrays:
user=name1
IFSOLD=$IFS; IFS=$'\n'; test=( $(grep "$user" test) ); IFS=$IFSOLD   # "test" at the end is the input file name
declare -A index
for item in "${test[@]}"; {
    sub=( $item )            # split the line into words
    name=${sub[3]}           # fourth field (0-based index 3)
    index[$name]=$item       # keep one line per distinct value
}
for item in "${index[@]}"; { echo "$item"; }
In my opinion, you need to select the field from which you need the unique values. I was trying to retrieve unique source IPs from an iptables log.
cat /var/log/iptables.log | grep "May 5" | awk '{print $11}' | sort -u
Here is the output of the above command:
SRC=192.168.10.225
SRC=192.168.10.29
SRC=192.168.20.125
SRC=192.168.20.147
SRC=192.168.20.155
SRC=192.168.20.183
SRC=192.168.20.194
So, the best idea is to select the field first and then filter out the unique data.
The following command worked for me.
sudo cat AirtelFeb.txt | awk '{print $3}' | sort -u
Here it prints the 3rd column with unique values.
I think you meant fourth column.
You can try using cat Filename.txt | awk '{print $4}' | sort | uniq

Bash script to find the frequency of every letter in a file

I am trying to find out the frequency of appearance of every letter of the English alphabet in an input file. How can I do this in a bash script?
My solution using grep, sort and uniq.
grep -o . file | sort | uniq -c
Ignore case:
grep -o . file | sort -f | uniq -ic
Just one awk command
awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' file
If you want it case-insensitive, add tolower():
awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' file
And if you want only alphabetic characters:
awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' file
And if you want only digits, change /[a-zA-Z]/ to /[0-9]/.
If you do not want to show Unicode characters, do export LC_ALL=C first.
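A quick sanity check on a made-up word (the order of a for (i in w) loop is unspecified in awk, so the output is piped through sort here to make it deterministic):
$ echo hello | awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' | sort
e 1
h 1
l 2
o 1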
A solution with sed, sort and uniq:
sed 's/\(.\)/\1\n/g' file | sort | uniq -c
This counts all characters, not only letters. You can filter out non-letters with:
sed 's/\(.\)/\1\n/g' file | grep '[A-Za-z]' | sort | uniq -c
If you want to consider uppercase and lowercase as the same, just add a translation:
sed 's/\(.\)/\1\n/g' file | tr '[:upper:]' '[:lower:]' | grep '[a-z]' | sort | uniq -c
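For example, on a made-up word (GNU sed is assumed, since \n in the replacement is a GNU extension; the exact padding of the uniq -c counts may differ between implementations):
$ echo 'Hello' | sed 's/\(.\)/\1\n/g' | tr '[:upper:]' '[:lower:]' | grep '[a-z]' | sort | uniq -c
      1 e
      1 h
      2 l
      1 o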
Here is a suggestion:
while read -n 1 c
do
echo "$c"
done < "$INPUT_FILE" | grep '[[:alpha:]]' | sort | uniq -c | sort -nr
Similar to mouviciel's answer above, but more portable: for the Bourne and Korn shells used on BSD systems, where you don't have GNU sed (which supports \n in a replacement), you can backslash-escape a literal newline:
sed -e's/./&\
/g' file | sort | uniq -c | sort -nr
or, to avoid the visual split on the screen, insert a literal newline by typing CTRL+V CTRL+J:
sed -e's/./&\^J/g' file | sort | uniq -c | sort -nr
