Even after `sort`, `uniq` is still repeating some values - linux

Reference file: http://snap.stanford.edu/data/wiki-Vote.txt.gz
(It is a gzip-compressed archive that contains a file called Wiki-Vote.txt.)
The first few lines of the file, as shown by head -n 10 Wiki-Vote.txt:
# Directed graph (each unordered pair of nodes is saved once): Wiki-Vote.txt
# Wikipedia voting on promotion to administratorship (till January 2008).
# Directed edge A->B means user A voted on B becoming Wikipedia administrator.
# Nodes: 7115 Edges: 103689
# FromNodeId ToNodeId
30 1412
30 3352
30 5254
30 5543
30 7478
3 28
I want to find the number of nodes in the graph (although it is already given in the # Nodes: header line). I ran the following command,
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort | uniq | wc -l
Explanation:
/^#/ matches all lines that start with #, and !/^#/ matches those that don't.
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt prints the first and second columns of every matched line, each on its own line.
| sort sorts the output so that duplicates become adjacent.
| uniq should collapse those adjacent duplicate values, but it doesn't.
| wc -l counts the resulting lines, and the count is wrong.
The result of the above command is 8491, which is not 7115 (the count given in the # Nodes: header line). I don't know why uniq repeats values. I can tell that it does, because awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort -i | uniq | tail returns:
992
993
993
994
994
995
996
998
999
999
These contain repeated values. Could someone please run the command, confirm that I am not the only one getting the wrong answer, and help me figure out why this happens?

The file has DOS line endings: each line ends with a CR (\r) character before the newline. You can inspect your tail output with hexdump -C, for example (lines starting with # added by me):
$ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | sort | uniq | tail | hexdump -C
00000000 39 39 32 0a 39 39 33 0a 39 39 33 0d 0a 39 39 34 |992.993.993..994|
# ^^ HERE
00000010 0a 39 39 34 0d 0a 39 39 35 0d 0a 39 39 36 0a 39 |.994..995..996.9|
# ^^ ^^
00000020 39 38 0a 39 39 39 0a 39 39 39 0d 0a |98.999.999..|
# ^^
0000002c
Because uniq sees distinct lines, one with a trailing CR and one without, the apparent duplicates are not removed. (The mix arises because the CR rides along as part of $2, the last field on each line, while $1 is always clean.) Remove the CR character before piping. Note also that sort | uniq is better written as sort -u.
$ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | tr -d '\r' | sort -u | wc -l
7115
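If you prefer to handle the CR inside awk itself, stripping it from each record before the fields are printed also works; a minimal sketch, assuming any POSIX awk (sub() on $0 re-splits the fields, so $2 comes out clean):
$ awk '{ sub(/\r$/, "") } !/^#/ { print $1; print $2 }' ./wiki-Vote.txt | sort -u | wc -l
7115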

Related

Remove \r character from string pattern matched in AWK

I'm quite new to AWK, so apologies for the basic question. I've found many references for removing Windows line endings from files, but none that first match a regular expression and then remove the Windows line endings from the matched text.
I have a file named infile.txt that contains a line like so:
...
DATAFILE data5v.dat
...
Within a shell script I want to capture the filename argument data5v.dat from this infile.txt and remove the carriage return character, \r, IF present. The carriage return may not always be present, so I have to match a word and then remove the \r afterwards.
I have tried the following, but it is not working as I expect:
FILENAME=$(awk '/DATAFILE/ { print gsub("\r", "", $2) }' $INFILE)
Can I store the string matched by my regex /DATAFILE/ in a variable within my AWK statement, so that I can then apply gsub to it?
File names can contain spaces, including \rs, blanks, and tabs, so to do this robustly you can't remove all \rs with gsub() and you can't rely on any single field, e.g. $2, containing the whole file name.
If your input fields are tab-separated you need:
awk '/DATAFILE/ { sub(/[^\t]+\t/,""); sub(/\r$/,""); print }' file
or this otherwise:
awk '/DATAFILE/ { sub(/[^[:space:]]+[[:space:]]+/,""); sub(/\r$/,""); print }' file
The above assumes your file names don't start with spaces and don't contain newlines.
To test any solution for robustness try:
printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '...' | cat -TEv
and make sure that the output looks like it does below:
$ printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '/DATAFILE/ { sub(/[^\t]+\t/,""); sub(/\r$/,""); print }' | cat -TEv
foo ^M^Ibar$
$ printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '/DATAFILE/ { sub(/[^[:space:]]+[[:space:]]+/,""); sub(/\r$/,""); print }' | cat -TEv
foo ^M^Ibar$
Note the blank, ^M (CR), and ^I (tab) in the middle of the file name as they should be but no ^M at the end of the line.
If your version of cat doesn't support -T or -E then do whatever you normally do to look for non-printing chars, e.g. od -c or vi the output.
With GNU awk, would you please try the following:
FILENAME=$(awk -v RS='\r?\n' '/DATAFILE/ {print $2}' "$INFILE")
echo "$FILENAME"
It assigns the record separator RS to a sequence of zero or one \r followed by \n.
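As a quick check that this strips the CR (a sketch assuming GNU awk, and cat -A to make line endings visible):
$ printf 'DATAFILE data5v.dat\r\n' | awk -v RS='\r?\n' '/DATAFILE/ {print $2}' | cat -A
data5v.dat$
A stray CR would show up as ^M before the $.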
As a side note, it is not recommended to use all-uppercase names for user-defined shell variables, because they may conflict with reserved system variable names.
Awk simply applies each rule in the script to each input line. You can easily remove the carriage return first and then apply other logic to the input line. For example,
FILENAME=$(awk '/\r/ { sub(/\r/, "") }
/DATAFILE/ { print $2 }' "$INFILE")
See also When to wrap quotes around a shell variable.
Who says you need gnu-awk:
gecho -ne "test\r\nabc\n\rdef\n" |
  mawk NF=NF FS='\r' OFS='' | odview
0000000 1953719668 1667391754 1717920778 10
t e s t \n a b c \n d e f \n
164 145 163 164 012 141 142 143 012 144 145 146 012
t e s t nl a b c nl d e f nl
116 101 115 116 10 97 98 99 10 100 101 102 10
74 65 73 74 0a 61 62 63 0a 64 65 66 0a
0000015
gawk in POSIX mode (-P) is also fine with it:
gecho -ne "test\r\nabc\n\rdef\n" |
  gawk -Pe NF=NF FS='\r' OFS='' | odview
0000000 1953719668 1667391754 1717920778 10
t e s t \n a b c \n d e f \n
164 145 163 164 012 141 142 143 012 144 145 146 012
t e s t nl a b c nl d e f nl
116 101 115 116 10 97 98 99 10 100 101 102 10
74 65 73 74 0a 61 62 63 0a 64 65 66 0a
0000015
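In both commands the trick is the same: FS='\r' turns each CR into a field separator, OFS='' rejoins the fields with nothing between them, and the assignment NF=NF forces awk to rebuild $0, dropping every CR. A minimal sketch of the idiom, substituting plain od -c for the author's odview helper:
printf 'a\rb\r\n' | mawk NF=NF FS='\r' OFS='' | od -c
Only a, b, and the trailing \n should remain in the dump; no \r.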

Display Second Result of Duplicate Lines in Text File

I am trying to display the second result of duplicate lines in a text file.
Data:
60 60 61 64 63 78 78
Duplicate Lines:
60 60 78 78
Attempted Code:
echo "60 60 61 64 63 78 78" | sed 's/ /\n/g' | uniq -D | tail -1
Current Result:
78
Expected Result:
60 78
You may try this GNU awk solution:
s='60 60 61 64 63 78 78'
awk -v RS='[[:space:]]+' '++fq[$0] == 2' <<< "$s"
60
78
To keep the values on a single line, separated as they were in the input, print the matched record terminator RT as well:
awk -v RS='[[:space:]]+' '++fq[$0] == 2 {printf "%s", $0 RT}' <<< "$s"
60 78
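A regular expression in RS is a gawk extension, so if you don't have GNU awk, a portable sketch is to break the line into words with tr first (assuming the values are separated by spaces or tabs):
$ echo "$s" | tr -s ' \t' '\n' | awk '++fq[$0] == 2'
60
78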
If your Input_file could have multiple lines, then you could try the following.
awk '
{
  delete value
  num=split($0,arr," ")
  for(i=1;i<=num;i++){
    value[arr[i]]++
  }
  for(i=1;i<=num;i++){
    if(value[arr[i]]>1){
      print arr[i]
      delete value[arr[i]]
    }
  }
}
' Input_file
Explanation: a detailed breakdown of the above.
awk '                         ##Starting the awk program here.
{
  delete value                ##Clearing the value array for each new line.
  num=split($0,arr," ")       ##Splitting the current line into array arr.
  for(i=1;i<=num;i++){        ##Traversing through all fields here.
    value[arr[i]]++           ##Counting occurrences of each field in the value array.
  }
  for(i=1;i<=num;i++){        ##Traversing through all fields again.
    if(value[arr[i]]>1){      ##Checking whether a field occurs more than once.
      print arr[i]            ##Printing the duplicated value.
      delete value[arr[i]]    ##Deleting its entry to avoid printing it again.
    }
  }
}
' Input_file                  ##Mentioning the Input_file name here.
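For example, with a made-up two-line Input_file:
$ printf '60 60 61 64 63 78 78\n10 20 20 30 30\n' > Input_file
$ awk '{delete value; num=split($0,arr," "); for(i=1;i<=num;i++) value[arr[i]]++; for(i=1;i<=num;i++) if(value[arr[i]]>1){print arr[i]; delete value[arr[i]]}}' Input_file
60
78
20
30
Each line is processed independently, so counts do not leak from one line to the next.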
So if you do not want only the last value, do not use tail -1. And instead of printing every occurrence (uniq -D), print each duplicated value once (uniq -d):
echo "60 60 61 64 63 78 78" | sed 's/ /\n/g' | sort | uniq -d
Note - if the input is not sorted (or duplicate lines are not adjacent), you need to sort it first.
$ printf '%s\n' 60 60 61 64 63 78 78 | uniq -d
60
78
The above assumes a hard-coded list of numbers works for you since, per the example in your question, you thought echo "60 60 61 64 63 78 78" | sed 's/ /\n/g' | uniq -D | tail -1 would work. If you need to store the values in a variable first, then make it an array variable:
$ vals=(60 60 60 61 64 63 78 78)
$ printf '%s\n' "${vals[@]}" | uniq -d
60
78

Compare two files with different numbers of columns and print matching rows to a new file

I have two files, each with more than 10000 rows. File1 has 1 column; File2 has 4 columns:
File1    File2
23       23 88 90 0
34       43 74 58 5
43       54 87 52 3
54       73 52 35 4
.        .
.        .
I want to compare each value in file-1 with the first column of file-2. If it exists there, print that value along with the other three values of that row in file-2. In this example the output will be:
23 88 90 0
43 74 58 5
54 87 52 3
.
.
I have written the following script, but it is taking too much time to execute.
s1=1; s2=$(wc -l < File1.txt)
while [ $s1 -le $s2 ]
do
    n=$(awk 'NR=="$s1" {print $1}' File1.txt)
    p1=1; p2=$(wc -l < File2.txt)
    while [ $p1 -le $p2 ]
    do
        awk '{if ($1==$n) printf ("%s %s %s %s\n", $1, $2, $3, $4);}' > ofile.txt
        (( p1++ ))
    done
    (( s1++ ))
done
Is there any short/easy way to do it?
You can do it very concisely using awk:
awk 'FNR==NR{found[$1]++; next} $1 in found'
Test
>>> cat file1
23
34
43
54
>>> cat file2
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
>>> awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2
23 88 90 0
43 74 58 5
54 87 52 3
How it works:
FNR==NR checks whether FNR, the record number within the current file, is equal to NR, the record number across all input so far. This is true only while the first file, file1, is being read, because FNR is reset to 1 when awk opens a new file while NR keeps counting.
{found[$1]++; next} If the check is true, this records $1, the first (and only) column of file1, as a key in the associative array found, then skips to the next record.
$1 in found This check is only reached for the second file, file2. If the column-1 value $1 is a key in the associative array found, the entire line is printed (printing is awk's default action, so no action block needs to be written).
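Since the question asks for the result in a new file, add a shell redirection to the same command (ofile.txt is the name from the question's own script):
awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2 > ofile.txt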

Cut a file between two line numbers using awk

Say I have a file with 100 lines (not including header). I want to cut that file down, only keeping the content between line 51 and 70 (inclusive), as well as the header so that the resulting file is 20+1 lines.
So far, I have this code:
awk 'NR==1 {h=$0; next} (NR-1)>50 && (NR-1)<71 {filename = "file20.csv"; print h >> filename} {print >> filename}' file100.csv
But it's giving me an error:
fatal: expression for `>>' redirection has null string value
Can somebody help me understand where my syntax is wrong?
The error occurs because the final {print >> filename} block is unconditional: it runs for every input line, including lines 2 through 50, where filename has not been assigned yet, so the redirection target is a null string. But you do not need any of that bookkeeping; you can directly use:
awk 'NR==1 || (NR>=51 && NR<=70)'
Note that awk evaluates the condition on NR and, when it is true, performs its default action, {print $0}, so you do not have to spell it out.
Then you can redirect to another file:
awk 'NR==1 || (NR>=51 && NR<=70)' file > new_file
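If the boundaries change between runs, you can pass them in with -v rather than hard-coding them; a small variation on the same idea (first and last are illustrative variable names):
awk -v first=51 -v last=70 'NR==1 || (NR>=first && NR<=last)' file > new_file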
Test
$ seq 100 | awk 'NR==1 || (NR>=51 && NR<=70)'
1
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
It returns 21 lines:
$ seq 100 | awk 'NR==1 || (NR>=51 && NR<=70)' | wc -l
21
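For comparison, the same selection can also be written with sed, if you prefer (-n suppresses the default printing, so only the addressed lines are emitted):
$ seq 100 | sed -n '1p; 51,70p' | wc -l
21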

Linux - Change line between two awk scripts

I have two awk scripts to run in Linux. The output of each one is in one line.
How can I separate the two outputs onto two lines?
For example:
awk '{printf $1}' f.txt >> a.txt
awk '{printf $3}' f.txt >> a.txt
The output of the first script is:
35 56 40 28 57
And the second output is:
29 48 73 26
If I run them one after another, the output will become:
35 56 40 28 57 29 48 73 26
Is there any way to get the result to:
35 56 40 28 57
29 48 73 26
Thank you!~
Although I don't understand how you manage to get the spaces between fields the way you do it, you can add an END statement to the first script:
awk '{printf $1} END{printf "\n"}'
(Using printf rather than print in the END block matters: print "\n" would emit the newline plus ORS, leaving a blank line between the two outputs.)
You can also do this with a single awk command:
awk -v ORS=" " 'BEGIN{ARGV[ARGC++] = ARGV[1]; i = 1 }
NR!=FNR && FNR==1 { printf "\n"; i=3 }
{ print $i }
END { printf "\n" }' f.txt
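Here the BEGIN block appends a second copy of f.txt to ARGV, so awk reads the same file twice. During the first pass it prints column 1; when FNR resets to 1 at the start of the second pass (NR!=FNR && FNR==1), it emits a newline and switches i to column 3. ORS=" " keeps the values space-separated within each output line, and the END block supplies the final newline. Redirect the output (... f.txt > a.txt) to collect both lines in a.txt, as in the original commands.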
