I have a .csv file like this:
stack2#domain.example,2009-11-27 01:05:47.893000000,domain.example,127.0.0.1
overflow#domain2.example,2009-11-27 00:58:29.793000000,domain2.example,255.255.255.0
overflow#domain2.example,2009-11-27 00:58:29.646465785,domain2.example,256.255.255.0
...
I have to remove duplicate e-mails (i.e. the entire line) from the file, e.g. one of the lines containing overflow#domain2.example in the above example. How do I use uniq on only field 1 (separated by commas)? According to the man page, uniq doesn't have options for columns.
I tried something with sort | uniq but it doesn't work.
sort -u -t, -k1,1 file
-u for unique
-t, so comma is the delimiter
-k1,1 for the key field 1
Test result:
overflow#domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
awk -F"," '!_[$1]++' file
-F sets the field separator.
$1 is the first field.
_[val] looks up val in _, an associative array (an ordinary variable name used as an array).
++ post-increments and returns the old value, so the expression is 0 (false) the first time a key is seen.
! is logical not, so the pattern is true only for the first occurrence of each key.
there is an implicit print of the line whenever the pattern is true.
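For the sample file above, this keeps the first line seen for each address and preserves the original order, so the output would be expected to look like:
stack2#domain.example,2009-11-27 01:05:47.893000000,domain.example,127.0.0.1
overflow#domain2.example,2009-11-27 00:58:29.793000000,domain2.example,255.255.255.0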
To consider multiple columns:
Sort and give unique list based on column 1 and column 3:
sort -u -t : -k 1,1 -k 3,3 test.txt
-t : colon is separator
-k 1,1 -k 3,3 based on column 1 and column 3
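If you'd rather keep the first occurrence in the original file order, a hedged awk equivalent for multiple key columns (using a composite array key) would be:
awk -F: '!seen[$1,$3]++' test.txt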
If you want to use uniq:
<mycvs.cvs tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2
gives:
1 01:05:47.893000000 2009-11-27 tack2#domain.example
2 00:58:29.793000000 2009-11-27 overflow#domain2.example
1
If you want to retain the last of the duplicates, you could use
tac a.csv | sort -u -t, -r -k1,1 | tac
which was my requirement here.
tac reverses the file line by line.
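An awk-based sketch that likewise keeps the last occurrence of each key, but leaves the surviving lines in their original order rather than sorted order:
tac a.csv | awk -F, '!seen[$1]++' | tac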
Here is a very nifty way.
First format the content such that the column to be compared for uniqueness is a fixed width. One way of doing this is to use awk printf with a field/column width specifier ("%15s").
Now the -f and -w options of uniq can be used to skip the preceding fields/columns and to limit the comparison to that fixed width.
Here are three examples.
In the first example...
1) Temporarily make the column of interest a fixed width greater than or equal to the field's max width.
2) Use uniq's -f option to skip the prior columns, and its -w option to limit the comparison to the tmp_fixed_width.
3) Remove trailing spaces from the column to "restore" its width (assuming there were no trailing spaces beforehand).
printf "%s" "$str" \
| awk '{ tmp_fixed_width=15; uniq_col=8; w=tmp_fixed_width-length($uniq_col); for (i=0;i<w;i++) { $uniq_col=$uniq_col" "}; printf "%s\n", $0 }' \
| uniq -f 7 -w 15 \
| awk '{ uniq_col=8; gsub(/ */, "", $uniq_col); printf "%s\n", $0 }'
In the second example...
Prepend the key as a new fixed-width column 1, then remove it after the uniq filter has been applied.
printf "%s" "$str" \
| awk '{ uniq_col_1=4; printf "%15s %s\n", $uniq_col_1, $0 }' \
| uniq -f 0 -w 15 \
| awk '{ $1=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'
The third example is the same as the second, but for multiple columns.
printf "%s" "$str" \
| awk '{ uniq_col_1=4; uniq_col_2=8; printf "%5s %15s %s\n", $uniq_col_1, $uniq_col_2, $0 }' \
| uniq -f 0 -w 5 \
| uniq -f 1 -w 15 \
| awk '{ $1=$2=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'
Well, simpler than isolating the column with awk: if you need to remove every line with a certain value in a given field, why not just use grep -v?
E.g. to delete every line with the value "col2" in the second position
(line: col1,col2,col3,col4):
grep -v ',col2,' file > file_minus_offending_lines
If this isn't good enough, because some lines may be improperly stripped when the matching value happens to show up in a different column, you can do something like this:
Use awk to isolate the offending column, e.g.:
awk -F, '{print $2 "|" $0}' file
The -F sets the field delimiter to ",", $2 means column 2, followed by a custom delimiter and then the entire line ($0). You can then filter by removing lines that begin with the offending value:
awk -F, '{print $2 "|" $0}' file | grep -v '^BAD_VALUE'
and then strip out the stuff before the delimiter:
awk -F, '{print $2 "|" $0}' file | grep -v '^BAD_VALUE' | sed 's/.*|//g'
(Note: the sed command is sloppy because it doesn't handle escaping, and the pattern should really be something like "[^|]*|", i.e. anything up to the first delimiter. But hopefully this is clear enough.)
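If BAD_VALUE is a literal field value rather than a pattern, a simpler awk-only sketch (assuming, as above, that field 2 is the column of interest) avoids the extra grep and sed steps:
awk -F, '$2 != "BAD_VALUE"' file > file_minus_offending_lines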
By sorting the file with sort first, you can then apply uniq.
It seems to sort the file just fine:
$ cat test.csv
overflow#domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow#domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
$ sort test.csv
overflow#domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow#domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
$ sort test.csv | uniq
overflow#domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow#domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
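For the record, sort -u test.csv produces the same result as sort test.csv | uniq in a single step.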
You could also do some AWK magic:
$ awk -F, '{ lines[$1] = $0 } END { for (l in lines) print lines[l] }' test.csv
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow#domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
Related
Using bash, I want to print a number followed by the sizes of two paths, all on one line, i.e. the output of three commands on one line.
All three items should be separated by ":".
echo -n "10001:"; du -sch /abc/def/* | grep 'total' | awk '{ print $1 }'; du -sch /ghi/jkl/* | grep 'total' | awk '{ print $1 }'
I am getting the output as -
10001:61M
:101M
But I want the output as -
10001:61M:101M
This should work for you. The two key elements added are
tr -d '\n'
which strips the newline character from the end of each command's output, and an echo ":" to insert the extra colon between the two sizes.
Hope this helps! Here's a link to the docs for the tr command:
https://ss64.com/bash/tr.html
echo -n "10001:"; du -sch /abc/def/* | grep 'total' | awk '{ print $1 }' | tr -d '\n'; echo ":" | tr -d '\n'; du -sch /ghi/jkl/* | grep 'total' | awk '{ print $1 }'
Save your values to variables, and then use printf:
printf '%s:%s:%s\n' "$first" "$second" "$third"
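A sketch of how those variables might be filled, reusing the du pipelines and paths from the question (adjust the paths and the 'total' filter as needed):
first=10001
second=$(du -sch /abc/def/* | grep 'total' | awk '{ print $1 }')
third=$(du -sch /ghi/jkl/* | grep 'total' | awk '{ print $1 }')
printf '%s:%s:%s\n' "$first" "$second" "$third"   # expected to print something like 10001:61M:101M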
Input
119764469|14100733//1,k1=v1,k2=v2,STREET:1:1=NY
119764469|14100733//1,k1=v1,k2=v2,k3=v3
119764469|14100733//1,k1=v1,k4=v4,abc.xyz:1:1=nmb,abc,po.foo:1:1=yu
k1 could be any alphanumeric name, possibly with the special characters . and :, e.g. abc.nm.1:1.
Expected output (all unique keys; sorting is not required/necessary, and it should be very fast):
k1,k2,STREET:1:1,k3,k4,abc.xyz:1:1
My current approach/solution is
awk -F',' '{for (i=0; i<=NR; i++) {for(j=1; j<=NF; j++){split($j,a,"="); print a[1];}}}' file.txt | awk '!x[$1]++' | grep -v '|' | sed -e :a -e '$!N; s/\n/ | /; ta'
It works fine, but it is too slow for huge files (which could be MBs or even GBs in size).
NOTE: This is needed for a data migration and should use basic Unix shell commands, since production may not allow third-party utilities.
Not sure about the speed, but give it a try:
$ cut -d, -f2- file | # select the key/value pairs
tr ',' '\n' | # split each k=v to its own line
cut -d= -f1 | # select only keys
sort -u | # filter uniques
paste -sd, # serialize back to single csv line
abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
I expect it to be faster than grep since no regex is involved.
Use grep -o to grep only the parts you need:
grep -o -e '[^=,]\+=[^,]\+' file.txt |awk -F'=' '{print $1}' |sort |uniq |tr '\n' ',' |sed 's/,$/\n/'
>>> abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
(sort is needed here because uniq only removes adjacent duplicate lines)
If you don't really need the output all on one line:
$ awk -F'[,=]' '{for (i=2;i<=NF;i+=2) print $i}' file | sort -u
abc.xyz:1:1
k1
k2
k3
k4
STREET:1:1
If you do:
$ awk -F'[,=]' '{for (i=2;i<=NF;i+=2) print $i}' file | sort -u |
awk -v ORS= '{print sep $0; sep=","} END{print RS}'
abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
You could do it all in one awk script but I'm not sure it'll be as efficient as the above or might run into memory issues if/when the array grows to millions of values:
$ cat tst.awk
BEGIN { FS="[,=]"; ORS="" }
{
for (i=2; i<=NF; i+=2) {
vals[$i]
}
}
END {
for (val in vals) {
print sep val
sep = ","
}
print RS
}
$ awk -f tst.awk file
k1,abc.xyz:1:1,k2,k3,k4,STREET:1:1
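Note that for (val in vals) visits the keys in an unspecified order, which is why the output above isn't sorted. In GNU awk you could force a sorted traversal by adding this line at the top of the END block (a gawk-only feature):
PROCINFO["sorted_in"] = "@ind_str_asc"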
I want to sort a file that has a particular delimiter in every line. I want to sort the lines such that sorting starts from that delimiter and goes only by the numbers.
The file is like this:
adf234sdf:nzzs13245ekeke
zdkfjs:ndsd34352jejs
mkd45fei:znnd11122iens
output should be:
mkd45fei:znnd11122iens
adf234sdf:nzzs13245ekeke
zdkfjs:ndsd34352jejs
Use the -t option to set the delimiter:
$ sort -t: -nk2,2 file
mkdfei:11122iens
adf234sdf:13245ekeke
zdkfjs:34352jejs
This can be an approach, based on this idea:
$ sed -r 's/([^:]*):([a-z]*)([0-9]*)(.*)/\1:\2-\3\4/g' a | sort -t- -k2,2 | tr -d '-'
mkdfei:aa11122iens
adf234sdf:tt13245ekeke
zdkfjs:aa34352jejs
By pieces:
$ sed -r 's/([^:]*):([a-z]*)([0-9]*)(.*)/\1:\2-\3\4/g' a
adf234sdf:tt-13245ekeke
zdkfjs:aa-34352jejs
mkdfei:aa-11122iens
$ sed -r 's/([^:]*):([a-z]*)([0-9]*)(.*)/\1:\2-\3\4/g' a | sort -t- -k2,2
mkdfei:aa-11122iens
adf234sdf:tt-13245ekeke
zdkfjs:aa-34352jejs
$ sed -r 's/([^:]*):([a-z]*)([0-9]*)(.*)/\1:\2-\3\4/g' a | sort -t- -k2,2 | tr -d '-'
mkdfei:aa11122iens
adf234sdf:tt13245ekeke
zdkfjs:aa34352jejs
So what we do is add a - character before the first number, sort using that character as the delimiter, and finally delete the - again (tr -d '-').
In gawk there is an asort function, and you could use:
gawk -f sort.awk data.txt
where data.txt is your input file, and sort.awk is
{
    line[NR] = $0                        # remember each whole line
    match($0, /:[^0-9]*([0-9]*)/, a)     # capture the first number after the ":"
    nn[NR] = a[1] " " NR                 # store "number lineno" for sorting
}
END {
    N = asort(nn)                        # sort the "number lineno" strings
    for (i = 1; i <= N; i++) {
        split(nn[i], c, " ")
        ind = c[2]                       # recover the original line number
        print line[ind]
    }
}
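For the sample data.txt above, this should print the lines in the requested order:
mkd45fei:znnd11122iens
adf234sdf:nzzs13245ekeke
zdkfjs:ndsd34352jejs
(Caveat: asort compares the "number lineno" strings lexically, so numbers with different digit counts may not order as true numbers.)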
I'm looking at files that all have a different version number that starts at column 18 of line 7.
What's the best way with Bash to read (into a $variable) the string on line 7, from column, i.e. "character," 18 to the end of the line? What about to the 5th to last character of the line?
sed way:
variable=$(sed -n '7s/^.\{17\}//p' file)
EDIT (thanks to commenters): If by columns you mean fields (separated with tabs or spaces), the command can be changed to
variable=$(sed -n '7s/^\(\s\+\S\+\)\{17\}//p' file)
You have a number of different ways you can go about this, depending on the utilities you want to use. One of your options is to make use of Bash's substring expansion in any of the following ways:
sed
line=1
string=$(sed -n "${line}p" /etc/passwd)
echo "${string:17}"
awk
line=1
string=$(awk "NR==${line} {print}; {next}" /etc/passwd)
echo "${string:17}"
coreutils
line=1
string=`{ head -n $line | tail -n1; } < /etc/passwd`
echo "${string:17}"
Use
var=$(head -n 17 filename | tail -n 1 | cut -f 18-)
or
var=$(awk 'NR == 17 {delim = ""; for (i = 18; i <= NF; i++) {printf "%s%s", delim, $i; delim = OFS}; printf "\n"}' filename)
If you mean "characters" instead of "fields":
var=$(head -n 17 filename | tail -n 1 | cut -c 18-)
or
var=$(awk 'NR == 17 {print substr($0, 18)}' filename)
If by 'columns' you mean 'fields':
a=$( awk 'NR==7{ print $18 }' file )
If you really want the 18th byte through the end of line 7, do:
a=$( sed -n 7p file | cut -b 18- )
I'd like to be able to sort a file, but only from a certain line down. From the manual, sort isn't able to do this on its own, so I'll need a second utility (read? or awk, possibly?). Here's the file I'd like to sort:
tar --exclude-from=$EXCLUDE_FILE --exclude=$BACKDEST/$PC-* \
-cvpzf $BACKDEST/$BACKUPNAME.tar.gz \
/etc/X11/xorg.conf \
/etc/X11/xorg.conf.1 \
/etc/fonts/conf.avail.1 \
/etc/fonts/conf.avail/60-liberation.conf \
So in this case, I'd like to begin sorting at line three. I'm thinking I'll have to write a function to do this, something like
cat backup.sh | while read LINE; do echo $LINE | sort; done
Pretty new to this and the script looks like it's missing something. Also, not sure how to begin at a certain line number.
Any ideas?
Something like this?
(head -n 2 backup.sh; tail -n +3 backup.sh | sort) > backup-sorted.sh
You may have to fix up the last line of the input... it probably doesn't have the trailing \ for line continuation, so you might end up with a broken 'backup-sorted.sh' if you just do the above.
You might want to consider using tar's --files-from (or -T) option, and having the sorted list of files in a data file instead of the script itself.
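A hedged sketch of that approach, reusing the variables from the question and assuming the file list lives in a separate backup-files.list (a hypothetical name):
sort backup-files.list > backup-files.sorted
tar --exclude-from=$EXCLUDE_FILE --exclude=$BACKDEST/$PC-* \
    -cvpzf $BACKDEST/$BACKUPNAME.tar.gz \
    --files-from=backup-files.sorted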
A clumsy way:
len=$(cat FILE | wc -l)
sortable_len=$((len-3))
head -3 FILE > OUT
tail -$sortable_len FILE | sort >> OUT
I'm sure someone will post an elegant 1-liner shortly.
Sort the lines excluding the (2-line) header, just for viewing:
cat file.txt | awk '{if (NR < 3) print $0 > "/dev/stderr"; else print $0}' | sort
Sort the lines excluding the (2-line) header and send the output to another file.
Method #1:
cat file.txt | awk '{if (NR < 3) print $0 > "/dev/stderr"; else print $0}' 2> file_sorted.txt | sort >> file_sorted.txt
Method #2:
cat file.txt | awk '{if (NR < 3) print $0 > "file_sorted.txt"; else print $0}' | sort >> file_sorted.txt
You could try this:
(read line; echo "$line"; sort) < file.txt
It reads one line and echoes it, then sorts the rest. You can also do:
cat file.txt | (read line; echo "$line"; sort)
For two lines, just repeat the read and echo:
(read line; echo "$line"; read line; echo "$line"; sort) < file.txt
Using awk:
awk '{ if ( NR > 2 ) { print $0 } }' file.txt | sort
NR is a built-in awk variable and contains the current record/line number. It starts at 1.
Extending Vigneswaran R's answer using awk:
Use tty to get your current terminal's device file, print the header lines (NR < 3) directly to your terminal from within awk (no, it won't run the input), and pipe the rest to sort.
tty
>/dev/pts/3
cat file.txt | awk '{if (NR < 3) print $0 > "/dev/pts/3"; else print $0}' | sort