linux command for removing duplicate values

I have two files.
file1 contains:
1,2,3,4,5
file2 contains:
4,5,6,7,8
The output should go to a new file, say file3, which should contain:
1,2,3,4,5,6,7,8,
The contents of file1 and file2 can also change later, for example:
file1 new contents:
10,11,12,13,14
file2 new contents:
13,14,15,16,17,18
After merging again, file3 should contain the values below:
1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18
I have tried several commands such as sort, uniq, and cat, but they haven't worked.

Commands like sort and uniq work on lines, not on comma-separated values.
All you have to do is convert the commas to newlines, run sort -u (or sort followed by uniq), and then convert the newlines back to commas, e.g.
$ cat a
1,2,3,4,5
$ cat b
4,5,6,7,8
$ cat a b | tr ',' '\n' | sort -u | tr '\n' ','
1,2,3,4,5,6,7,8,
You may find Set Operations in the Unix Shell helpful.
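The pipeline above sorts lexically and leaves a trailing comma. As a minimal sketch, assuming the values are plain integers, the variant below sorts numerically and joins the result without the trailing comma. The function name merge_values is hypothetical and chosen only for illustration.

```shell
# Hypothetical helper: merge comma-separated integers from the named files.
merge_values() {
  cat "$@" |
    tr ',' '\n' |       # one value per line
    sed '/^$/d' |       # drop empty lines left by trailing commas
    sort -n -u |        # numeric sort, duplicates removed
    paste -s -d ',' -   # join the lines with commas, no trailing comma
}
```

For example, merge_values file1 file2 > file3 would write the merged list to file3.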

If you want to append the merged result to file3:
cat file1 file2 | sed 's/,/\n/g' | sort -u | tr '\n' ',' >> file3

Related

Want to display output in column wise in Linux Shell script

I print my output in the format below:
last -w -F | awk '{print $1","$3","$5$6$7$8","$11$12$13$14","$15}' | tac
Now I want to display the same output in columns. Can someone help me out here?

Add this to the end: | tr ',' '\t', like this:
last -w -F | awk '{print $1","$3","$5$6$7$8","$11$12$13$14","$15}' | tac | tr ',' '\t'
This will pipe your comma-delimited output to the tr utility and tell it to translate commas to tabs.

How to find frequencies of pairs of strings within a Unix terminal

How can I compute the following from within the Unix terminal and then store the results in a file?
Input:
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
49FB-A855-3EED46E0BF2E,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
Desired output:
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115, 2
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss, 1
49FB-A855-3EED46E0BF2E,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss, 1
EDIT:
OK, I think I got it:
cat lol | cut -f 1,2 -d ',' | sort | uniq -c > lol2
My only remaining problem is that the first column of the output file (the count) should be at the end, and the output file should be CSV compatible. Any ideas?

Would it be a problem to simply count unique lines instead? If not, the uniq command is your friend (see its manpage), but be sure to sort the list first so that identical lines end up next to each other:
sort myfile.txt | uniq -c
For your example data, this returns:
2 4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115
1 4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
1 49FB-A855-3EED46E0BF2E,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
To redirect into a file, append > outfile.txt:
sort myfile.txt | uniq -c > outfile.txt
If you need output similar to that in your question, you can use awk to move the count to the end and sed to change the delimiter:
sort myfile.txt | uniq -c | awk '{ print $2 " " $1 }' | sed 's/ /,/'
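The reordering can also be done in a single sed substitution, which keeps working if a line itself contains spaces. This is a sketch only: the function name pair_counts_csv is hypothetical, and the output uses "," rather than the ", " separator shown in the question.

```shell
# Hypothetical helper: count identical lines and emit "line,count" (CSV friendly).
pair_counts_csv() {
  sort "$@" |
    uniq -c |
    # Capture the count and the rest of the line, then swap them.
    sed -E 's/^ *([0-9]+) (.*)$/\2,\1/'
}
```

Usage would look like pair_counts_csv myfile.txt > outfile.csv.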

Awk: Word frequency from one text file, how to output into myFile.txt?

Given a .txt file with space-separated words such as:
But where is Esope the holly Bastard
But where is
And the awk pipeline:
cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2"#"$1}'
I get the following output in my console:
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
How do I get this printed into myFile.txt?
I actually have 300,000 lines and nearly 2 million words, so it is better to write the result to a file.
EDIT: Used answer (by @Sudo_O):
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt

Your pipeline isn't very efficient; you should do the whole thing in awk instead:
awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file > myfile
If you want the output in sorted order:
awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort > myfile
The actual output given by your pipeline is:
$ tr ' ' '\n' < file | sort | uniq -c | awk '{print $2"#"$1}'
Bastard#1
But#2
Esope#1
holly#1
is#2
the#1
where#2
Note: using cat is unnecessary here; we can redirect the input with < instead. The awk script doesn't help either: it just swaps each word and its count and separates them with a #. If we drop the awk script, the output is closer to the desired output (note, however, the leading spaces, and that it is sorted by word rather than by count):
$ tr ' ' '\n' < file | sort | uniq -c
      1 Bastard
      2 But
      1 Esope
      1 holly
      2 is
      1 the
      2 where
We could sort again and remove the leading spaces with sed:
$ tr ' ' '\n' < file | sort | uniq -c | sort | sed 's/^\s*//'
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
But, as I mentioned at the start, let awk handle it:
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
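A regular expression in RS, as used above, is an extension supported by gawk and mawk rather than part of POSIX awk. As a sketch that avoids it, the version below loops over the fields of each line and sorts numerically on the count, so that 10 sorts after 2. The function name word_freq is hypothetical, and LC_ALL=C is an assumption made here to get byte-order ties.

```shell
# Hypothetical helper: print "count word" for every whitespace-separated word.
word_freq() {
  awk '
    { for (i = 1; i <= NF; i++) count[$i]++ }        # tally each word on the line
    END { for (w in count) print count[w], w }       # awk emits these in arbitrary order
  ' "$@" |
    LC_ALL=C sort -k1,1n -k2                         # by count (numeric), then by word
}
```

To store the result, redirect it: word_freq file > myfile.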

Just redirect the output to a file:
cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | \
awk '{print $2"#"$1}' > myFile.txt

Just use shell redirection:
echo "test" > overwrite-file.txt
echo "test" >> append-to-file.txt
Tips
A useful command is tee, which lets you write to a file and still see the output:
echo "test" | tee overwrite-file.txt
echo "test" | tee -a append-file.txt
Sorting and locale
I see you are working with Asian script. You need to be careful with the locale used by your system, as the resulting sort order might not be what you expect. From the sort man page:
*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
And have a look at the output of:
locale
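As a small illustration of the warning above, the sketch below forces byte-value ordering for a single sort invocation without changing the rest of the environment. The function name byte_sort is hypothetical.

```shell
# Hypothetical helper: sort by native byte values, ignoring the environment's locale.
byte_sort() {
  LC_ALL=C sort "$@"
}

# In byte order every uppercase ASCII letter sorts before every lowercase one.
printf 'b\nA\na\nB\n' | byte_sort
```

With this input the byte-order result is A, B, a, b; a locale-aware sort will typically interleave the cases instead.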

Need to remove the count from the output when using "uniq -c" command

I am trying to read a file and sort it by the number of occurrences of a particular field. Suppose I want to find the most repeated date in a log file: I use uniq -c and sort the result in descending order, something like this:
uniq -c | sort -nr
This produces output like this:
809 23/Dec/2008:19:20
The first field, which is the count, is the problem for me. I want to get only the date from the above output, but I am not able to. I tried the cut command like this:
uniq -c | sort -nr | cut -d' ' -f2
but this just prints blank space. Can someone please help me get only the date and chop off the count? I want only:
23/Dec/2008:19:20
Thanks

The count from uniq is preceded by spaces unless there are more than 7 digits in the count, so you need to do something like:
uniq -c | sort -nr | cut -c 9-
to get columns (character positions) 9 upwards. Or you can use sed:
uniq -c | sort -nr | sed 's/^.\{8\}//'
or:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
This second sed option is robust in the face of a repeat count of 10,000,000 or more; if you think that might be a problem, it is probably better than the cut alternative. And there are undoubtedly other options available too.
Caveat: the counts were determined by experimentation on Mac OS X 10.7.3 but using GNU uniq from coreutils 8.3. The BSD uniq -c produced 3 leading spaces before a single digit count. The POSIX spec says the output from uniq -c shall be formatted as if with:
printf("%d %s", repeat_count, line);
which would not have any leading blanks. Given this possible variance in output formats, the sed script with the [0-9] regex is the most reliable way of dealing with the variability in observed and theoretical output from uniq -c:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
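Putting the pieces together, a complete pipeline might look like the sketch below. It assumes unsorted input, so it sorts before uniq -c; the function name most_frequent is hypothetical.

```shell
# Hypothetical helper: print distinct lines, most frequent first, without the counts.
most_frequent() {
  sort "$@" |                   # group identical lines together
    uniq -c |                   # prefix each distinct line with its count
    sort -nr |                  # highest count first
    sed 's/^ *[0-9]* //'        # strip optional padding, the count, and one space
}
```

Piping the result through head -n 1 would then give only the most repeated value.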

Instead of cut -d' ' -f2, try
awk '{$1="";print}'
Maybe you need to remove one more blank in the beginning:
awk '{$1="";print}' | sed 's/^.//'
or do it completely with sed, preserving the original whitespace:
sed -r 's/^[^0-9]*[0-9]+//'

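The following awk command may help you here; it counts the lines itself and prints each count and line, sorted by line:
awk '{a[$0]++} END{for(i in a){print a[i],i | "sort -k2"}}' Input_file
Second solution: use this if you want the output in the same order as the input rather than sorted: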
awk '!a[$0]++{b[++count]=$0} {c[$0]++} END{for(i=1;i<=count;i++){print c[b[i]],b[i]}}' Input_file
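The second one-liner is dense, so here is the same idea spelled out with longer names. This is only an illustrative sketch; the function name count_in_input_order and the array names are hypothetical.

```shell
# Hypothetical helper: print "count line" for each distinct line, in first-seen order.
count_in_input_order() {
  awk '
    !($0 in count) { order[++n] = $0 }   # first time we see this line: remember its position
    { count[$0]++ }                      # tally every occurrence
    END { for (i = 1; i <= n; i++) print count[order[i]], order[i] }
  ' "$@"
}
```

Because no sort is involved, the output order follows the first appearance of each line in the input.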

An alternative solution is to let awk split the fields:
uniq -c | sort -nr | awk '{print $1, $2}'
This prints both the count and the date; to print only the date, print a single field with awk '{print $2}'.

Use the following (since you used -f2 with cut in your question):
sort file | uniq -c | awk '{ print $2 }'

If you want to work with the count field downstream, the following command will reformat it to a 'pipe friendly' tab-delimited format without the left padding:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/'
For the original task it is overkill, but after reformatting, cut can be used to remove the field, as the OP intended:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/' | cut -d $'\t' -f2-

Add tr -s to the pipe chain to "squeeze" multiple spaces into one space delimiter:
uniq -c | tr -s ' ' | cut -d ' ' -f3
tr is very useful in some obscure places. Unfortunately it doesn't get rid of the first leading space, hence the -f3

You could make use of sed to strip both the leading spaces and the numbers printed by uniq -c
sort file | uniq -c | sed 's/^ *[0-9]* //'
To illustrate this with an example, consider a file containing:
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
The command
sort file | uniq -c | sed 's/^ *[0-9]* //'
would return
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~

First solution
If the number of repetitions does not matter, sort alone is enough: its -u option prints each distinct line once.
sort -u file
sort -u < file
Example:
$ cat > file
a
b
c
a
a
g
d
d
$ sort -u file
a
b
c
d
g
Second solution
If the output should be ordered by number of repetitions, use either of:
sort file | uniq -c | sort -k1 -nr | sed 's/^ \+[0-9]\+ //g'
sort file | uniq -c | sort -k1 -nr | perl -lpe 's/^ +[\d]+ +//g'
For the same example file, this gives:
a
d
g
c
b

shellscript get column data

I have a file that is the output of the comm command. It has 2 columns, and I want to separate them into two different files. How do I do that?
The file looks like:
a
b
g
f
c
d

Depending on the column separator, you can do something like:
cut -f1 orig_file >file1
cut -f2 orig_file >file2
Here the column separator is assumed to be a TAB. If it is another character, you can pass it to cut with the -d char option.
If you also want to remove empty lines, as per your request, you can add a sed command to each line:
cut -f1 orig_file | sed -e '/^$/d' >file1
cut -f2 orig_file | sed -e '/^$/d' >file2

You can cut the relevant parts based on character positions (which cut numbers from 1):
# assuming constant 5 chars for col1, 5 chars for col2
cut -c1-5 file | sed '/^\s*$/ {d}' > col1
cut -c6-10 file | sed '/^\s*$/ {d}' > col2
The sed command removes empty lines (those containing only whitespace). They can also be removed with grep -v '^[[:space:]]*$'.

Using cut requires a separate command for each column.
With awk you can do it in a single command:
awk '{for (i=1;i<=NF;i++) print $i > (i ".txt")}' your_file
By default, awk splits fields on runs of whitespace (spaces and tabs).
If the field separator is something else, pass it with the -F option, like this:
awk -F"<field separator>" '{....
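comm separates its output columns with tabs, so a line that belongs only to the second column starts with a tab and has an empty first field. The sketch below sets the separator to a tab explicitly and skips empty fields. The function name split_columns and the col1.txt, col2.txt output names are hypothetical.

```shell
# Hypothetical helper: write the N-th tab-separated column of a file to colN.txt.
split_columns() {
  awk -F'\t' '
    {
      for (i = 1; i <= NF; i++)
        if ($i != "")                        # skip the empty cells of other columns
          print $i > ("col" i ".txt")        # parenthesized so the file name is unambiguous
    }
  ' "$1"
}
```

Running split_columns orig_file in an empty directory would leave one file per column, without blank lines.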
