Awk pattern always matches last record?

I'm in the process of switching from zsh to bash, and I need to produce a bash script that can remove duplicate entries in $PATH without reordering the entries (thus no sort -d magic). zsh has some nice array handling shortcuts that made it easy to do this efficiently, but I'm not aware of such shortcuts in bash. I came across this answer which has gotten me 90% of the way there, but there is a small problem that I would like to understand better. It appears that when I run that awk command, the last record processed incorrectly matches the pattern.
$ awk 'BEGIN{RS=ORS=":"}!a[$0]++' <<<"aa:bb:cc:aa:bb:cc"
aa:bb:cc:cc
$ awk 'BEGIN{RS=ORS=":"}!a[$0]++' <<<"aa:bb:cc:aa:bb"
aa:bb:cc:bb
$ awk 'BEGIN{RS=ORS=":"}!a[$0]++' <<<"aa:bb:cc:aa:bb:cc:" # note trailing colon
aa:bb:cc:
I don't understand awk well enough to know why it behaves this way, but I have managed to work around the issue by using an intermediate array like so.
array=($(awk 'BEGIN{RS=":";ORS=" "}!a[$0]++' <<<"aa:bb:cc:aa:bb:cc:"))
# Use a subshell to avoid modifying $IFS in current context
echo $(export IFS=":"; echo "${array[*]}")
aa:bb:cc
This seems like a sub-optimal solution however, so my question is: did I do something wrong in the awk command that is causing false positive matches on the final record processed?

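The last record in your original string is cc\n, which is different from cc: the here-string (<<<) appends a newline to the input, and because RS is ":" that newline is just part of the final record. When unsure what's happening in any program in any language, adding some print statements is step 1 to debugging/investigating: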
$ awk 'BEGIN{RS=ORS=":"} {print "<"$0">"}' <<<"aa:bb:cc:aa:bb:cc"
<aa>:<bb>:<cc>:<aa>:<bb>:<cc
>:$
If you want the RS to be : or \n then just state that (with GNU awk at least):
$ awk 'BEGIN{RS="[:\n]"; ORS=":"} !a[$0]++' <<<"aa:bb:cc:aa:bb:cc"
aa:bb:cc:$
The $ in all of the above is my prompt.
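Building on the explanation above, another way to avoid the stray newline is to not feed awk a here-string at all. This is only a sketch: dedupe_path is a made-up helper name, and it assumes printf '%s' is acceptable for passing the value in, since it writes no trailing newline. The separator is emitted before every entry except the first, so no trailing colon is produced either.

```shell
# dedupe_path is a hypothetical helper name, not part of any standard tool.
dedupe_path() {
    # printf '%s' writes the value with no trailing newline, so the final
    # record is "cc" rather than "cc\n".
    printf '%s' "$1" |
        awk 'BEGIN { RS = ":" }
             # Print each entry the first time it is seen; sep is empty for
             # the first entry and ":" afterwards.
             !seen[$0]++ { printf "%s%s", sep, $0; sep = ":" }'
}

dedupe_path "aa:bb:cc:aa:bb:cc"; echo
```

It would then be used along the lines of PATH=$(dedupe_path "$PATH").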

Another possible workaround instead of your bash array solution
$ echo "aa:bb:cc:aa:bb:cc" | tr ':' '\n' | awk '!a[$0]++' | paste -sd:
aa:bb:cc

Related

Can we do this in perl?

I want the below awk one liner to be translated to perl. is it possible??
awk '{ for(i=1;i<=NF;i++){if(i==NF){printf("%s\n",$NF);}else {printf("%s\t",$i)}}}' file.txt | awk 'NR > 1'
The first awk command removes the leading empty column and the next one removes the first line.
Below is the head of file.txt
#FILEOUTPUT
1 137442 2324
2326 139767 4169
6491 143936 94
The output i get from those commands is below
1 137442 2324
2326 139767 4169
6491 143936 94
Thanks,
Karthic
@Alex got the usage of $. correctly - which is not a very common perl idiom (though a useful one as we see), but they didn't handle the extra spaces correctly.
Awk is all about understanding what the fields are and then manipulating the fields, and as part of that it does a lot of whitespace canonicalization.
Perl, OTOH, usually doesn't involve itself in field separation and a lot of users like to do that themselves - but it does support this Awk behavior using the -a flag.
So a simple implementation of the above Awk line noise might look like this:
perl -anle 'print join("\t",@F) if $. > 1' file.txt
Explanation:
-a: trigger field separation using the default field separator (which works well in this case) or whatever -F says (like Awk).
-n : iterate over the input lines (same as what the outermost {} do in Awk). A common alternative is -p which would mean to iterate over the input lines and then print out whatever the line buffer has after running the code.
-l : When printing, add a new line at the end of the text (makes things like this slightly easier to work with)
-e : here's a script.
Then we just take the separated field array (@F) and join it. Often devs like to just address certain fields with $F[<index>], but here we don't need to loop - we can just take the list as is and pass it to join().
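For comparison, the same whitespace canonicalization can be had from a single awk call instead of the original two-stage pipeline. This is a sketch rather than a drop-in replacement: clean_columns is a made-up name, and it assumes the default whitespace field splitting is what you want. Assigning $1 to itself makes awk rebuild the record with OFS between the fields, which drops the leading empty column.

```shell
# clean_columns is a hypothetical helper name.
clean_columns() {
    # NR > 1 skips the header line; $1 = $1 forces awk to rebuild the record
    # from its fields, joined with OFS (a tab here).
    awk -v OFS='\t' 'NR > 1 { $1 = $1; print }' "$@"
}

printf '#FILEOUTPUT\n\t1\t137442\t2324\n\t2326\t139767\t4169\n' | clean_columns
```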

How to remove everything before a specific string

I need to get clean output from wireshark, but recently I noticed that I've been getting outputs where my cut command doesn't work anymore
Data from wireshark
C.'¢¢#)ù¨.fEo³#(#¹³³÷÷P*,§ýý {P¹'GET /2015/dec/alltracks/playlist.m3u8
I used to use cut -f2 -d" " but now I notice some entries come with multiple spaces, so my command fails.
How would I get rid of everything before GET, including the word GET? The goal is to get only /2015/dec/alltracks/playlist.m3u8
Use GNU grep in PCRE-mode enabled by -P flag and -o flag to print only the matching word.
grep -Po ".*GET \K(.*)" input-file
/2015/dec/alltracks/playlist.m3u8
Using a perl regEx
perl -nle 'print "$1" if /.*GET (.*)/' input-file
/2015/dec/alltracks/playlist.m3u8
You can do this by using the following regular expression
.*GET 
Where .* matches any run of characters (the * means zero or more) and GET matches the literal text, including the trailing space, so what remains is the output you want
sed "s/.*GET //g"
We replace what we find with nothing (//).
s is for substitute, while g is for global, meaning every match on a line is replaced rather than only the first. Because .* is greedy there is at most one match per line here, so g is not strictly necessary.
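A slightly more defensive variant of the sed approach, as a sketch only. strip_to_get is a made-up name. The LC_ALL=C setting is an assumption on my part: in a UTF-8 locale the binary junk before GET may contain byte sequences that . refuses to match, and the C locale treats input as plain bytes. The -n option with the p flag prints only lines where GET was actually found, and the bracket expression tolerates more than one space after GET.

```shell
# strip_to_get is a hypothetical helper name.
strip_to_get() {
    # LC_ALL=C: treat the input as bytes (assumption: the capture contains
    # non-text data that could otherwise stop .* from matching).
    # -n with the p flag: print only lines where the substitution happened.
    LC_ALL=C sed -n 's/.*GET[[:space:]]\{1,\}//p' "$@"
}

printf '%s\n' 'junk  bytes GET /2015/dec/alltracks/playlist.m3u8' 'no request here' |
    strip_to_get
```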

Line numbering in Grep

I have command in Grep:
cat nastava.html | grep '<td>[A-Z a-z]*</td><td>[0-9/]*</td>' | sed 's/[ \t]*<td>\([A-Z a-z]*\)<\/td><td>\([0-9]\{1,3\}\)\/[0-9]\{2\}\([0-9]\{2\}\)<\/td>.*/\1 mi\3\2 /' \
|sort|grep -n ".*" | sed -r 's/(.*):(.*)/\1. \2/' >studenti.txt
I don't understand the second line. The sort is fine, and grep -n numbers the sorted list, but why do we use ".*" here? It won't work without it, and I don't understand why.
The grep is used purely for the side effect of the line numbering with the -n option here, so the main thing is really to use a regular expression which matches all the input lines. As such, .* is not very elegant -- ^ would work without scanning the whole line, and $ trivially matches every line as well. Since you know the input lines are not empty, thus contain at least one character, the simple regular expression . would work perfectly, too.
However, as the end goal is to perform line numbering, a better solution is to use a dedicated tool for this purpose.
... | sort | nl -ba -s '. '
The -ba option specifies to number all lines (the default is to only add a line number to non-empty lines; we know there are no empty lines, so it's not strictly necessary here, but it's good to know) and the -s option specifies the separator string to put after the number.
A possible minor complication is that the line number format is whitespace-padded, so in the end, this solution may not work for you if you specifically want unpadded numbers. (But a sed postprocessor to fix that up is a lot simpler than the postprocessor for grep you have now -- just sed 's/^ *//' will remove leading whitespace).
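If unpadded numbers are what you want, awk can do the numbering directly and needs no postprocessor. A minimal sketch, where number_lines is a made-up name and the ". " separator is taken from your original sed:

```shell
# number_lines is a hypothetical helper name.
number_lines() {
    # NR is the current line number; %d prints it without padding.
    awk '{ printf "%d. %s\n", NR, $0 }' "$@"
}

printf '%s\n' pear apple | sort | number_lines
```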
... As an aside, the ugly cat | grep | sed pipeline can be abbreviated to just
sed -n 's%[ \t]*<td>\([A-Z a-z]*\)</td><td>\([0-9]\{1,3\}\)/[0-9]\{2\}\([0-9]\{2\}\)</td>.*%\1 mi\3\2 %p' nastava.html
The cat was never necessary in the first place, and the sed script can easily be refactored to only print when a substitution was performed (your grep regular expression was not exactly equivalent to the one you have in the sed script but I assume that was the intent). Also, using a different separator avoids having to backslash the slashes.
... And of course, if nastava.html is your own web page, the whole process is umop apisdn. You should have the students' results in a machine-readable form, and generate a web page from that, rather than the other way around.
grep needs a regular expression to match. You can't run grep with no expression at all. If you want to number all the lines, just specify an expression that matches anything. I'd probably use ^ instead of .*.

How to do something like grep -B to select only one line?

Everything is in the title. Basically, let's say I have this pattern
some text lalala
another line
much funny wow grep
I grep funny and I want my output to be "lalala"
Thank you
One possible answer is to use either ed or ex to do this (it is trivial in them):
ed - yourfile <<< 'g/funny/.-2p'
(Or replace ed with ex. You might have red, the restricted editor, too; it can't modify files.) This looks for the pattern /funny/ globally, and whenever it is found, prints the line 2 before the matching line (that's the .-2p part). Or, if you want the most recent line containing 'lalala' before the line matching 'funny':
ed - yourfile <<< 'g/funny/?lalala?p'
The only problem is if you're trying to process standard input rather than a file; then you have to save the standard input to a file and process that file, which spoils the concurrency.
You can't do negative offsets in sed (though GNU sed allows you to do positive offsets, so you could use sed -n '/lalala/,+2p' file to get the 'lalala' to 'funny' lines (which isn't quite what you want) based on finding 'lalala', but you cannot find the 'lalala' lines based on finding 'funny'). Standard sed does not allow offsets at all.
If you need to print just the IP address found on a line 8 lines before the pattern-matching line, you need a slightly more involved ed script, but it is still doable:
ed - yourfile <<< 'g/funny/.-8s/.* //p'
This uses the same basic mechanism to find the right line, then runs a substitute command to remove everything up to the last space on the line and print the modified version. Since there isn't a w command, it doesn't actually modify the file.
Since grep -B prints whole lines, and all of the requested context lines before the match, you'll have to pipe the output into something like grep or Awk to narrow it down.
grep -B 2 "funny" file|awk 'NR==1{print $NF; exit}'
You could also just use Awk.
awk -v s="funny" '/[[:space:]]lalala$/{n=NR+2; o=$NF}NR==n && $0~s{print o}' file
For the specific example of an IP address 8 lines before the match as mentioned in your comment:
awk -v s="funny" '
/[[:space:]][0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/ {
n=NR+8
ip=$NF
}
NR==n && $0~s {
print ip
}' file
These Awk solutions first find the output field you might want, then print the output only if the word you want exists in the nth following line.
Here's an attempt at a slightly generalized Awk solution. It maintains a circular queue of the last q lines and prints the line at the head of the queue when it sees a match.
#!/bin/sh
: ${q=8}
e=$1
shift
awk -v q="$q" -v e="$e" '{ m[(NR%q)+1] = $0 }
$0 ~ e { print m[((NR+1)%q)+1] }' "${@--}"
Adapting to a different default (I set it to 8) or proper option handling (currently, you'd run it like q=3 ./qgrep regex file) as well as remembering (and hence printing) the entire line should be easy enough.
(I also didn't bother to make it work correctly if you see a match in the first q-1 lines. It will just print an empty line then.)
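To see how the queue indexing plays out, here is the same logic wrapped in a function and run on the three-line sample from the question. This is a sketch: qgrep as a function taking the queue length as its first argument is my own variation for illustration, not the script above verbatim. With q=3 the slot about to be overwritten holds the line two before the match.

```shell
# qgrep here is a hypothetical function form: qgrep QUEUE_LENGTH REGEX [FILE...]
# The body runs in a subshell so q and e do not leak into the caller.
qgrep() (
    q=$1
    e=$2
    shift 2
    awk -v q="$q" -v e="$e" '
        # Store the current line in its slot of the circular queue.
        { m[(NR % q) + 1] = $0 }
        # The slot that the next line would overwrite holds the oldest
        # remembered line, i.e. the one q-1 lines before the match.
        $0 ~ e { print m[((NR + 1) % q) + 1] }
    ' "$@"
)

printf '%s\n' 'some text lalala' 'another line' 'much funny wow grep' | qgrep 3 funny
```

Piping the result through awk '{ print $NF }' would then leave just the last word.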

Replacing a line in a csv file?

I have a set of 10 CSV files, which normally have entries of this kind
a,b,c,d
d,e,f,g
Now, due to some error, the entries in these files have become like this
a,b,c,d
d,e,f,g
,,,
h,i,j,k
Now I want to remove the line with only commas in all the files. These files are on a Linux filesystem.
Can you recommend a command that replaces the erroneous lines in all the files?
It depends on what you mean by replace. If you mean 'remove', then a trivial variant on @wnoise's solution is:
grep -v '^,,,$' old-file.csv > new-file.csv
Note that this deletes just those lines with exactly three commas. If you want to delete mal-formed lines with any number of commas (including zero) - and no other characters on the line, then:
grep -v '^,*$' ...
There are endless other variations on the regex that would deal with other scenarios. Dealing with full CSV data with commas inside quotes starts to need something other than a regex machine. It can be done, within broad limits, especially in more complex regex systems such as PCRE or Perl. But it requires more work.
Check out Mastering Regular Expressions.
sed 's/,,,/replacement/' < old-file.csv > new-file.csv
optionally followed by
mv new-file.csv old-file.csv
Replace or remove, your post is not clear... For replacement see wnoise's answer. For removing, you could use
awk '$0 !~ /,,,/ {print}' <old-file.csv > new-file.csv
What about trying to keep only lines which are matching the desired format instead of handling one exception ?
If the provided input is what you really want to match:
grep -E '[a-z],[a-z],[a-z],[a-z]' < oldfile.csv > newfile.csv
If the input is different, provide it, the regular expression should not be too hard to write.
Do you want to replace them with something, or delete them entirely? Either way, it can be done with sed. To delete:
sed -i -e '/^,\+$/ D' yourfile1.csv yourfile2.csv ...
To replace: well, see wnoise's answer, or if you don't want to create new files with the output,
sed -i -e '/^,\+$/ s//replacement/' yourfile1.csv yourfile2.csv ...
or
sed -i -e '/^,\+$/ c\
replacement' yourfile1.csv yourfile2.csv ...
(that should be entered exactly as is, including the line break). Of course, you can also do this with awk or perl or, if you're only deleting lines, even grep:
egrep -v '^,+$' < oldfile.csv > newfile.csv
I tested these to make sure they work, but I'd advise you to do the same before using them (just in case). You can omit the -i option from sed, in which case it'll print out the results (rather than writing them back to the file), or omit the output redirection >newfile.csv from grep.
EDIT: It was pointed out in a comment that some features of these sed commands only work on GNU sed. As far as I can tell, these are the -i option (which can be replaced with shell redirection, sed ... <infile >outfile ) and the \+ modifier (which can be replaced with \{1,\} ).
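Putting those two substitutions together, a version that avoids both GNU extensions might look like the sketch below. strip_comma_lines is a made-up name, and it assumes mktemp is available (it is not in POSIX, though most systems have it). The result is copied back with cat rather than mv so the original file keeps its permissions.

```shell
# strip_comma_lines is a hypothetical helper name.
strip_comma_lines() {
    for f in "$@"; do
        # Assumption: mktemp exists on this system.
        tmp=$(mktemp) || return 1
        # \{1,\} is the portable spelling of \+; d deletes matching lines.
        if sed '/^,\{1,\}$/d' "$f" > "$tmp"; then
            # Overwrite in place so ownership and mode are preserved.
            cat "$tmp" > "$f"
        fi
        rm -f "$tmp"
    done
}
```

Usage would be strip_comma_lines yourfile1.csv yourfile2.csv, and as above, try it on copies first.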
Most simply:
$ grep -v ,,, oldfile > newfile
$ mv newfile oldfile
Yes, awk or grep are very good options if you are working on a Linux platform. On other platforms you can use a Perl regex instead, together with join and split.
