Input
119764469|14100733//1,k1=v1,k2=v2,STREET:1:1=NY
119764469|14100733//1,k1=v1,k2=v2,k3=v3
119764469|14100733//1,k1=v1,k4=v4,abc.xyz:1:1=nmb,abc,po.foo:1:1=yu
k1 could be any alphanumeric name that may also contain the special characters . and :, e.g. abc.nm.1:1.
Expected output (all unique column names; sorting is not required, but it should be very fast):
k1,k2,STREET:1:1,k3,k4,abc.xyz:1:1
My current approach/solution is
awk -F',' '{for (i=0; i<=NR; i++) {for(j=1; j<=NF; j++){split($j,a,"="); print a[1];}}}' file.txt | awk '!x[$1]++' | grep -v '|' | sed -e :a -e '$!N; s/\n/ | /; ta'
It works fine, but it is too slow for huge files (which could be MBs or even GBs in size).
NOTE: This is needed for a data migration and should use basic Unix shell commands, as production may not allow third-party utilities.
Not sure about the speed, but give this a try:
$ cut -d, -f2- file | # select the key/value pairs
tr ',' '\n' | # split each k=v to its own line
cut -d= -f1 | # select only keys
sort -u | # filter uniques
paste -sd, # serialize back to single csv line
abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
I expect it to be faster than grep since no regex is involved.
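For multi-GB inputs most of the time will be in the sort step; forcing the C locale there is a common speed-up. A sketch, assuming plain byte-order collation of the key names is acceptable:
cut -d, -f2- file |        # select the key/value pairs
tr ',' '\n' |              # split each k=v to its own line
cut -d= -f1 |              # select only keys
LC_ALL=C sort -u |         # unique keys, byte-order collation
paste -sd,                 # serialize back to single csv line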
Use grep -o to grep only the parts you need:
grep -o -e '[^=,]\+=[^,]\+' file.txt | awk -F'=' '{print $1}' | sort | uniq | tr '\n' ',' | sed 's/,$/\n/'
>>> abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
(sort is needed here because uniq only removes adjacent duplicates)
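sort -u does the sorting and the de-duplication in one step, so the uniq stage can be dropped; a minimal variant of the same pipeline:
grep -o -e '[^=,]\+=[^,]\+' file.txt | awk -F'=' '{print $1}' | sort -u | tr '\n' ',' | sed 's/,$/\n/'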
If you don't really need the output all on one line:
$ awk -F'[,=]' '{for (i=2;i<=NF;i+=2) print $i}' file | sort -u
abc.xyz:1:1
k1
k2
k3
k4
STREET:1:1
If you do:
$ awk -F'[,=]' '{for (i=2;i<=NF;i+=2) print $i}' file | sort -u |
awk -v ORS= '{print sep $0; sep=","} END{print RS}'
abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
You could do it all in one awk script, but I'm not sure it'd be as efficient as the above, and it might run into memory issues if/when the array grows to millions of values:
$ cat tst.awk
BEGIN { FS="[,=]"; ORS="" }
{
for (i=2; i<=NF; i+=2) {
vals[$i]
}
}
END {
for (val in vals) {
print sep val
sep = ","
}
print RS
}
$ awk -f tst.awk file
k1,abc.xyz:1:1,k2,k3,k4,STREET:1:1
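If the END loop or the output ordering is a concern, one possible single-pass variant (a sketch) prints each key the first time it is seen; note it still keeps every unique key in memory, so the same caveat applies:
awk -F'[,=]' '{ for (i=2; i<=NF; i+=2) if (!seen[$i]++) { printf "%s%s", sep, $i; sep="," } }
              END { print "" }' file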
I have this file (file.txt):
unknown#mail.com||unknown#mail.com||
unknown#mail2.com||unknown#mail2.com||
unknown#mail3.com||unknown#mail3.com||
unknown#mail4.com||unknown#mail4.com||
unknownpass
unknownpass2
unknownpass3
unknownpass4
How can I use the sed command to obtain this:
unknown#mail.com|unknownpass|unknown#mail.com|unknownpass|
unknown#mail2.com|unknownpass2|unknown#mail2.com|unknownpass2|
unknown#mail3.com|unknownpass3|unknown#mail3.com|unknownpass3|
unknown#mail4.com|unknownpass4|unknown#mail4.com|unknownpass4|
This might work for you (GNU sed):
sed ':a;N;/\n[^|\n]*$/!ba;s/||\([^|]*\)||\(\n.*\)*\n\(.*\)$/|\3|\1|\3|\2/;P;D' file
Slurp the first section of the file plus the first replacement line into the pattern space, substitute the replacement into the empty fields of the first line, then print and delete that line and repeat.
Well, this does use sed anyway:
{ sed -n 5,\$p file.txt; sed 4q file.txt; } | awk 'NR<5{a[NR]=$0; next}
{$2=a[NR-4]; $4=a[NR-4]} 1' FS=\| OFS=\|
awk to the rescue!
awk 'BEGIN {FS=OFS="|"}
NR==FNR {if(NF==1) a[++c]=$1; next}
NF>4 {$2=a[FNR]; $4=$2; print}' file{,}
A two-pass algorithm: it caches the single-field entries (the passwords) in the first pass and inserts them into the empty fields in the second pass; it assumes the numbers of items match.
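file{,} is shell brace expansion that simply passes the same filename twice, so awk reads the file once per pass; spelled out (assuming the file is named file.txt) it is equivalent to:
awk 'BEGIN {FS=OFS="|"}
     NR==FNR {if(NF==1) a[++c]=$1; next}
     NF>4 {$2=a[FNR]; $4=$2; print}' file.txt file.txt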
Here is another approach, a single pass with awk wrapped in tac:
tac file |
awk 'BEGIN {FS=OFS="|"}
NF==1 {a[++c]=$1}
NF>4 {$2=a[c--]; $4=$2; print}' |
tac
I would combine the related lines with paste and reshuffle the elements with awk (I assume the related lines are exactly half a file away):
n=$(wc -l < file.txt)
paste -d'|' <(head -n $((n/2)) file.txt) <(tail -n $((n/2)) file.txt) |
awk '{ print $1, $6, $3, $6, "" }' FS='|' OFS='|'
Output:
unknown#mail.com|unknownpass|unknown#mail.com|unknownpass|
unknown#mail2.com|unknownpass2|unknown#mail2.com|unknownpass2|
unknown#mail3.com|unknownpass3|unknown#mail3.com|unknownpass3|
unknown#mail4.com|unknownpass4|unknown#mail4.com|unknownpass4|
I need to find the difference between two files in Unix,
File 1:
1,column1
2,column2
3,column3
File 2:
1,column1
2,column3
3,column5
I need to find the position of each common column of file 2 in file 1.
If there is no matching column in file 1, a default index value and the column name should be returned.
Output:
1,column1
3,column3
-1,column5
Can anyone help me do this in a Unix script?
Thanks,
William R
awk:
awk -F, 'NR==FNR{a[$2]=1; next;} ($2 in a)' file2 file1
grep+process substitution:
grep -f <(cut -d, -f2 file2) file1
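grep here treats each extracted column name as a regular expression and also matches substrings; if the names can ever contain regex metacharacters, passing them as fixed strings with -F is a little safer (substring matches are still possible). A sketch:
grep -F -f <(cut -d, -f2 file2) file1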
EDIT for updated question:
awk:
awk -F, 'NR==FNR{a[$2]=$1;next} {if ($2 in a) print a[$2]","$2; else print "-1," $2}' file1 file2
# if match found in file1, print the index, else print -1
# (Also note that the input file order is reversed in this command, compared to earlier awk.)
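For the sample file1 and file2 above, this prints the requested output:
1,column1
3,column3
-1,column5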
grep:
cp file1 tmpfile #get original file
grep -v -f <(cut -d, -f2 file1) file2 | sed 's/.*,/-1,/' >> tmpfile #append entries not present in file1
grep -f <(cut -d, -f2 file2) tmpfile # grep in this tmpfile
This is my data, in a file named example.txt:
id name lastname point
1234;emanuel;emenike;2855
1357;christian;baroni;398789
1390;alex;souza;23143
8766;moussa;sow;5443
I want to see, for these ids (1234 and 1390), the name and point columns, like this:
emanuel 2855
alex 23143
How can I do this on the Linux command line with awk and egrep?
You can try this:
awk -F\; '$1=="1234" || $1=="1390" {print $2,$4}' file
Using grep and cut:
grep '^\(1234\|1390\);' input | cut -d\; --output-delimiter=' ' -f2,4
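--output-delimiter is a GNU cut extension; where it isn't available, piping through tr gives the same result. A sketch:
grep -E '^(1234|1390);' input | cut -d\; -f2,4 | tr ';' ' '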
A variation in awk:
awk -F\; '$1~/^(1234|1390)$/ {print $2,$4}' file
emanuel 2855
alex 23143
Through awk:
awk -F';' '$1~/^1234$/ || $1~/^1390$/ {print $2,$4}' file
Example:
$ cat ccc
id name lastname point
1234;emanuel;emenike;2855
1357;christian;baroni;398789
1390;alex;souza;23143
8766;moussa;sow;5443
$ awk -F';' '$1~/^1234$/ || $1~/^1390$/ {print $2,$4}' ccc
emanuel 2855
alex 23143
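If the list of ids grows, one possible sketch reads them from a separate file instead of hard-coding them (ids.txt here is a hypothetical file with one id per line):
awk -F';' 'NR==FNR {want[$1]; next} $1 in want {print $2,$4}' ids.txt example.txt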
Use the GNU version of awk (= gawk) in a two-step approach to make your solution very flexible:
Step 1:
Parse your data file (e.g., example.txt) to generate a gawk lookup-function (here called "function_library.awk"):
$ /PATH/TO/generate_awk_function.sh /PATH/TO/example.txt
"generate_awk_function.sh" is just an gawk script for printing:
#! /bin/bash -
gawk 'BEGIN {
FS=";"
OFS="\t"
print "#### gawk function library \"function_library.awk\""
print "function lookup_value(key, value_for_key) {"
}
{
if (NR > 1 ) print "\tvalue_for_key["$1"] = \"" $2 OFS $4 "\""
}
END {
print " print value_for_key[key]"
print "}"
}' "$1" > function_library.awk
You have generated this lookup function:
$ cat function_library.awk
#### gawk function library "function_library.awk"
function lookup_value(key, value_for_key) {
value_for_key[1234] = "emanuel 2855"
value_for_key[1357] = "christian 398789"
value_for_key[1390] = "alex 23143"
value_for_key[8766] = "moussa 5443"
print value_for_key[key]
}
Adapt "generate_awk_function.sh" for your needs:
a) FS=";" is setting the field separator in your input file (here a semicolon)
b) OFS="\t" is setting the output field separator (here a TAB)
You only have to generate this gawk "lookup-function" anew when your "example.txt" has changed.
Step 2:
Read your IDs to look up your results:
$ cat id.txt
1234
1390
$ gawk -i function_library.awk '{lookup_value($1)}' id.txt
emanuel 2855
alex 23143
You can also use this approach in a pipe like this:
$ cat id.txt | gawk -i function_library.awk '{lookup_value($1)}'
or like this:
$ echo 1234 | gawk -i function_library.awk '{lookup_value($1)}'
You can adapt this approach if your lookup string (1234) or file (id.txt) contains some additional unwanted data ("noise"), using simple awk means:
a) Here, too, you can define a field separator, e.g., by setting it to a colon (:)
$ gawk -F":" -i function_library.awk '{lookup_value($5)}' id.txt
b) You can use the nth field of your lookup string, e.g., switching from the 1st field to the 5th field just by changing the argument of lookup_value from $1 to $5:
$ gawk -i function_library.awk '{lookup_value($5)}' id.txt
Please be aware that the '-i' command-line option is only supported by the GNU version of awk (= gawk).
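If your gawk predates 4.1 and lacks -i, the same effect can be had with two -f options, which plain POSIX awk also supports (main.awk here is a hypothetical one-line program file):
printf '{ lookup_value($1) }\n' > main.awk
gawk -f function_library.awk -f main.awk id.txt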
HTH
bernie
I have a .csv file like this:
stack2#domain.example,2009-11-27 01:05:47.893000000,domain.example,127.0.0.1
overflow#domain2.example,2009-11-27 00:58:29.793000000,domain2.example,255.255.255.0
overflow#domain2.example,2009-11-27 00:58:29.646465785,domain2.example,256.255.255.0
...
I have to remove duplicate e-mails (the entire line) from the file (i.e. one of the lines containing overflow#domain2.example in the above example). How do I use uniq on only field 1 (separated by commas)? According to man, uniq doesn't have options for columns.
I tried something with sort | uniq but it doesn't work.
sort -u -t, -k1,1 file
-u for unique
-t, so comma is the delimiter
-k1,1 for the key field 1
Test result:
overflow#domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
awk -F"," '!_[$1]++' file
-F sets the field separator.
$1 is the first field.
_[val] looks up val in the hash _ (a regular variable).
++ increment, and return old value.
! returns logical not.
there is an implicit print at the end.
To consider multiple columns, sort and give a unique list based on column 1 and column 3:
sort -u -t : -k 1,1 -k 3,3 test.txt
-t : colon is separator
-k 1,1 -k 3,3 based on column 1 and column 3
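The earlier awk idiom (!_[$1]++) extends to several key columns in the same way; a sketch keyed on columns 1 and 3 of a comma-separated file:
awk -F, '!seen[$1,$3]++' file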
If you want to use uniq:
<mycvs.cvs tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2
gives:
1 01:05:47.893000000 2009-11-27 stack2#domain.example
2 00:58:29.793000000 2009-11-27 overflow#domain2.example
1
If you want to retain the last one of the duplicates, you could use
tac a.csv | sort -u -t, -r -k1,1 | tac
which was my requirement.
tac will reverse the file line by line.
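If you also want to keep the surviving lines in their original order rather than sorted, the same keep-the-last idea works with awk between the two tac calls. A sketch:
tac a.csv | awk -F, '!seen[$1]++' | tac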
Here is a very nifty way.
First format the content such that the column to be compared for uniqueness is a fixed width. One way of doing this is to use awk printf with a field/column width specifier ("%15s").
Now the -f and -w options of uniq can be used to skip preceding fields/columns and to specify the comparison width (column(s) width).
Here are three examples.
In the first example...
1) Temporarily make the column of interest a fixed width greater than or equal to the field's max width.
2) Use the -f uniq option to skip the prior columns, and use the -w uniq option to limit the width to the tmp_fixed_width.
3) Remove trailing spaces from the column to "restore" its width (assuming there were no trailing spaces beforehand).
printf "%s" "$str" \
| awk '{ tmp_fixed_width=15; uniq_col=8; w=tmp_fixed_width-length($uniq_col); for (i=0;i<w;i++) { $uniq_col=$uniq_col" "}; printf "%s\n", $0 }' \
| uniq -f 7 -w 15 \
| awk '{ uniq_col=8; gsub(/ */, "", $uniq_col); printf "%s\n", $0 }'
In the second example...
Create a new column 1 for uniq to compare, then remove it after the uniq filter has been applied.
printf "%s" "$str" \
| awk '{ uniq_col_1=4; printf "%15s %s\n", uniq_col_1, $0 }' \
| uniq -f 0 -w 15 \
| awk '{ $1=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'
The third example is the same as the second, but for multiple columns.
printf "%s" "$str" \
| awk '{ uniq_col_1=4; uniq_col_2=8; printf "%5s %15s %s\n", uniq_col_1, uniq_col_2, $0 }' \
| uniq -f 0 -w 5 \
| uniq -f 1 -w 15 \
| awk '{ $1=$2=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'
Well, simpler than isolating the column with awk: if you just need to remove every line containing a certain value from a given file, why not use grep -v? E.g., to delete every line with the value "col2" in the second position of a line like
col1,col2,col3,col4
grep -v ',col2,' file > file_minus_offending_lines
If this isn't good enough, because some lines may be improperly stripped when the matching value shows up in a different column, you can do something like this:
awk to isolate the offending column:
e.g.
awk -F, '{print $2 "|" $line}'
the -F sets the field delimited to ",", $2 means column 2, followed by some custom delimiter and then the entire line. You can then filter by removing lines that begin with the offending value:
awk -F, '{print $2 "|" $line}' | grep -v ^BAD_VALUE
and then strip out the stuff before the delimiter:
awk -F, '{print $2 "|" $line}' | grep -v ^BAD_VALUE | sed 's/.*|//g'
(Note: the sed command is sloppy because it doesn't escape special characters, and the pattern should really be something like "[^|]+" (i.e. anything that is not the delimiter), but hopefully this is clear enough.)
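Following that note, a slightly tightened sketch of the same pipeline, with the grep anchored to the delimiter and the sed stripping only up to the first | (BAD_VALUE is still a placeholder):
awk -F, '{print $2 "|" $0}' file | grep -v '^BAD_VALUE|' | sed 's/^[^|]*|//'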
By sorting the file with sort first, you can then apply uniq.
It seems to sort the file just fine:
$ cat test.csv
overflow#domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow#domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
$ sort test.csv
overflow#domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow#domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
$ sort test.csv | uniq
overflow#domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow#domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
You could also do some AWK magic (this keeps the last line seen for each e-mail address; the output order is arbitrary):
$ awk -F, '{ lines[$1] = $0 } END { for (l in lines) print lines[l] }' test.csv
stack2#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3#domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow#domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0