Can we do this in perl? - linux

I want the awk one-liner below translated to Perl. Is that possible?
awk '{ for(i=1;i<=NF;i++){if(i==NF){printf("%s\n",$NF);}else {printf("%s\t",$i)}}}' file.txt | awk 'NR > 1'
The first awk command removes the leading empty column and the next one removes the first line.
Below is the head of file.txt
#FILEOUTPUT
1 137442 2324
2326 139767 4169
6491 143936 94
The output I get from those commands is below:
1 137442 2324
2326 139767 4169
6491 143936 94
Thanks,
Karthic

@Alex got the usage of $. right (it is not a very common Perl idiom, though a useful one as we see), but they didn't handle the extra spaces correctly.
Awk is all about understanding what the fields are and then manipulating them, and as part of that it does a lot of whitespace canonicalization.
Perl, OTOH, usually doesn't involve itself in field separation, and a lot of users like to do that themselves, but it does support this Awk behavior via the -a flag.
So a simple implementation of the above Awk line noise might look like this:
perl -anle 'print join("\t",@F) if $. > 1' file.txt
Explanation:
-a: trigger field separation using the default field separator (which works well in this case) or whatever -F says (like Awk).
-n : iterate over the input lines (same as what the outermost {} do in Awk). A common alternative is -p which would mean to iterate over the input lines and then print out whatever the line buffer has after running the code.
-l : When printing, add a new line at the end of the text (makes things like this slightly easier to work with)
-e : here's a script.
Then we just take the separated field array (@F) and join it. Often devs like to address individual fields with $F[<index>], but here we don't need to loop: we can just pass the whole list to join().

Related

How to truncate rest of the text in a file after finding a specific text pattern, in unix?

I have an HTML page which I downloaded with wget. I need to remove all of the text after the words "Check list", and then grep some data from what remains. I can't think of a way to remove the text after a keyword. If I do
s/Check list.*//g
it just removes the rest of that line; I want everything below it to be gone as well. How do I do this?
The other solutions you have so far require non-POSIX-mandatory tools (GNU sed, GNU awk, or perl), so YMMV with their availability, and they will read the whole file into memory at once.
These will work in any awk in any shell on every Unix box and only read 1 line at a time into memory:
awk -F 'Check list' '{print $1} NF>1{exit}' file
or:
awk 'sub(/Check list.*/,""){f=1} {print} f{exit}' file
With GNU awk for multi-char RS you could do:
awk -v RS='Check list' '{print; exit}' file
but that would still read all of the text before Check list into memory at once.
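As a rough sketch of the same line-at-a-time idea, the keyword can also be matched literally with index() instead of a regex, which avoids having to escape regex metacharacters in the keyword. The function name truncate_at and the sample text are assumptions for illustration only.

```shell
# Hypothetical helper: print the input up to (not including) the first
# occurrence of the literal keyword given in $1, then stop reading.
truncate_at() {
    kw=$1
    shift
    awk -v kw="$kw" '
        (i = index($0, kw)) { print substr($0, 1, i - 1); exit }  # keyword found: print the part before it and quit
        { print }                                                  # no keyword yet: pass the line through
    ' "$@"
}

# Made-up sample input.
printf 'keep 1\nkeep 2 Check list drop\ndrop 3\n' | truncate_at 'Check list'
```

Note that awk -v interprets backslash escapes in the value, so a keyword containing backslashes would need extra care.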
Depending on which sed version you have, maybe
sed -z 's/Check list.*//'
The /g flag is useless as you only want to replace everything once.
If your sed does not have the -z option (which says to use the ASCII null character as line terminator instead of newline; this hinges on your file not containing any actual nulls, but that should trivially be true for any text file), try Perl:
perl -0777 -pe 's/Check list.*//s'
Unlike sed -z, this explicitly says to slurp the entire file into memory (the argument to -0 is the octal character code of a terminator character, but 777 is not a valid terminator character at all, so it always reads the entire file as a single "line") so this works even if there are spurious nulls in your file. The final s flag says to include newline in what . matches (otherwise s/.*// would still only substitute on the matching physical line).
I assume you are aware that removing everything will violate the integrity of the HTML file; it needs there to be a closing tag for every start tag near the beginning of the document (so if it starts with <html><body> you should keep </body></html> just before the end of the file, for example).
With GNU awk you could set RS to "^$" so that the whole file is read as a single record, set the field separator to a regex matching the keyword with word boundaries, and then print the first field:
awk -v RS="^$" -v FS='\\<check_list\\>' '{print $1}' Input_file
You might use q to instruct GNU sed to quit, thus ending processing. Consider the following simple example: let the content of file.txt be
123
456
789
and say you want to jettison everything beyond 5, then you could do
sed '/5/{s/5.*//;q}' file.txt
which gives output
123
4
Explanation: for the line containing 5, substitute 5 and everything after it with the empty string (i.e. delete it), then q. Note that lowercase q is used so that the altered line is printed before quitting.
(tested in GNU sed 4.7)

Awk pattern always matches last record?

I'm in the process of switching from zsh to bash, and I need to produce a bash script that can remove duplicate entries in $PATH without reordering the entries (thus no sort -d magic). zsh has some nice array handling shortcuts that made it easy to do this efficiently, but I'm not aware of such shortcuts in bash. I came across this answer which has gotten me 90% of the way there, but there is a small problem that I would like to understand better. It appears that when I run that awk command, the last record processed incorrectly matches the pattern.
$ awk 'BEGIN{RS=ORS=":"}!a[$0]++' <<<"aa:bb:cc:aa:bb:cc"
aa:bb:cc:cc
$ awk 'BEGIN{RS=ORS=":"}!a[$0]++' <<<"aa:bb:cc:aa:bb"
aa:bb:cc:bb
$ awk 'BEGIN{RS=ORS=":"}!a[$0]++' <<<"aa:bb:cc:aa:bb:cc:" # note trailing colon
aa:bb:cc:
I don't understand awk well enough to know why it behaves this way, but I have managed to work around the issue by using an intermediate array like so.
array=($(awk 'BEGIN{RS=":";ORS=" "}!a[$0]++' <<<"aa:bb:cc:aa:bb:cc:"))
# Use a subshell to avoid modifying $IFS in current context
echo $(export IFS=":"; echo "${array[*]}")
aa:bb:cc
This seems like a sub-optimal solution however, so my question is: did I do something wrong in the awk command that is causing false positive matches on the final record processed?
The last record in your original string is cc\n (the here-string appends a newline), which is different from cc. When unsure what's happening in any program in any language, adding some print statements is step 1 to debugging/investigating:
$ awk 'BEGIN{RS=ORS=":"} {print "<"$0">"}' <<<"aa:bb:cc:aa:bb:cc"
<aa>:<bb>:<cc>:<aa>:<bb>:<cc
>:$
If you want the RS to be : or \n then just state that (with GNU awk at least):
$ awk 'BEGIN{RS="[:\n]"; ORS=":"} !a[$0]++' <<<"aa:bb:cc:aa:bb:cc"
aa:bb:cc:$
The $ in all of the above is my prompt.
Another possible workaround instead of your bash array solution
$ echo "aa:bb:cc:aa:bb:cc" | tr ':' '\n' | awk '!a[$0]++' | paste -sd:
aa:bb:cc
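For the original goal of cleaning up $PATH, the pipeline above could be wrapped in a small function. This is only a sketch; the name dedupe_path is an assumption, and it has not been checked against entries that themselves contain newlines.

```shell
# Hypothetical helper: remove duplicate entries from a colon-separated
# list, keeping the first occurrence of each in its original position.
dedupe_path() {
    # tr puts each entry on its own line, so the trailing newline is only
    # a line terminator and never becomes part of the last entry.
    printf '%s\n' "$1" | tr ':' '\n' | awk '!seen[$0]++' | paste -sd: -
}

dedupe_path "aa:bb:cc:aa:bb:cc"
```

It would then be used as PATH=$(dedupe_path "$PATH").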

Cut number from string

I want to cut several numbers from a .txt file so I can add them up later. Here is an excerpt from the .txt file:
anonuser pts/25 127.0.0.1 Mon Nov 16 17:24 - crash (10+23:07)
I want to get the "10" before the "+" and I only want the number, nothing else. This number should be written to another .txt file. I used this code, but it only works if the number has one digit:
awk ' /^'anonuser' / {split($NF,k,"[(+0:)][0-9][0-9]");print k[1]} ' log2.txt > log3.txt
With GNU grep:
grep -Po '\(\K[^+]*' file > new_file
Output to new_file:
10
See: PCRE Regex Spotlight: \K
What if you use the match() function in awk (the three-argument form needs GNU awk)?
$ awk '/^anonuser/ && match($NF,/^\(([0-9]*)/,a) {print a[1]}' file
10
How does this work?
/^anonuser/ && match() {print a[1]} if the line starts with anonuser and the pattern is found, print it.
match($NF,/^\(([0-9]*)/,a) in the last field ((10+23:07)), look for the string ( + digits and capture these in the array a[].
Note also that this approach allows you to store the values you capture, so that you can then sum them as you indicate in the question.
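The summing mentioned above could be sketched as follows, using only POSIX awk features (sub() rather than the three-argument match()). The function name sum_days is an assumption, and every sample line after the first is made up for illustration.

```shell
# Hypothetical helper: add up the day counts, i.e. the number between
# "(" and "+" in the last field, for lines starting with "anonuser ".
sum_days() {
    awk '
        /^anonuser / && $NF ~ /^\([0-9]+\+/ {
            days = $NF               # e.g. "(10+23:07)"
            sub(/^\(/, "", days)     # drop the leading "("
            sub(/\+.*/, "", days)    # drop "+" and everything after it
            total += days
        }
        END { print total + 0 }      # "+ 0" prints 0 when nothing matched
    ' "$@"
}

printf '%s\n' \
    'anonuser pts/25 127.0.0.1 Mon Nov 16 17:24 - crash (10+23:07)' \
    'anonuser pts/3 127.0.0.1 Tue Nov 17 09:00 - 10:00 (3+01:00)' \
    'other pts/1 127.0.0.1 Tue Nov 17 09:00 - 10:00 (5+01:00)' \
    'anonuser pts/4 127.0.0.1 Tue Nov 17 09:00 - 10:00 (01:00)' | sum_days
```

Lines whose last field has no day count, such as (01:00), are skipped rather than having their hours mistaken for days.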
The following uses the same approach as the OP, and has a couple of advantages, e.g. it does not require anything special, and it is quite robust (with respect to assumptions about the input) and maintainable:
awk '/^anonuser/ {split($NF,k,/\+/); gsub(/[^0-9]/,"",k[1]); print k[1]}'
For anything more complex use awk, but for a simple task sed is easy enough:
sed -rn '/^anonuser/s/.*\(([0-9]+)\+.*/\1/p'
This finds the number between a ( and a + sign; -n together with the p flag prints only the lines where the substitution succeeded.
I am not sure about the format in the file.
Can you use simple cut commands?
cut -d"(" -f2 log2.txt| cut -d"+" -f1 > log3.txt

How to do something like grep -B to select only one line?

Everything is in the title. Basically, let's say I have this pattern
some text lalala
another line
much funny wow grep
I grep funny and I want my output to be "lalala"
Thank you
One possible answer is to use either ed or ex to do this (it is trivial in them):
ed - yourfile <<< 'g/funny/.-2p'
(Or replace ed with ex. You might have red, the restricted editor, too; it can't modify files.) This looks for the pattern /funny/ globally, and whenever it is found, prints the line 2 before the matching line (that's the .-2p part). Or, if you want the most recent line containing 'lalala' before the line matching 'funny':
ed - yourfile <<< 'g/funny/?lalala?p'
The only problem is if you're trying to process standard input rather than a file; then you have to save the standard input to a file and process that file, which spoils the concurrency.
You can't do negative offsets in sed (though GNU sed allows you to do positive offsets, so you could use sed -n '/lalala/,+2p' file to get the 'lalala' to 'funny' lines (which isn't quite what you want) based on finding 'lalala', but you cannot find the 'lalala' lines based on finding 'funny'). Standard sed does not allow offsets at all.
If you need to print just the IP address found on a line 8 lines before the pattern-matching line, you need a slightly more involved ed script, but it is still doable:
ed - yourfile <<< 'g/funny/.-8s/.* //p'
This uses the same basic mechanism to find the right line, then runs a substitute command to remove everything up to the last space on the line and print the modified version. Since there isn't a w command, it doesn't actually modify the file.
Since grep -B always prints every one of the requested lines before the match, you'll have to pipe the output into something like grep or Awk.
grep -B 2 "funny" file|awk 'NR==1{print $NF; exit}'
You could also just use Awk.
awk -v s="funny" '/[[:space:]]lalala$/{n=NR+2; o=$NF}NR==n && $0~s{print o}' file
For the specific example of an IP address 8 lines before the match as mentioned in your comment:
awk -v s="funny" '
/[[:space:]][0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/ {
n=NR+8
ip=$NF
}
NR==n && $0~s {
print ip
}' file
These Awk solutions first find the output field you might want, then print the output only if the word you want exists in the nth following line.
Here's an attempt at a slightly generalized Awk solution. It maintains a circular queue of the last q lines and prints the line at the head of the queue when it sees a match.
#!/bin/sh
: ${q=8}
e=$1
shift
awk -v q="$q" -v e="$e" '{ m[(NR%q)+1] = $0 }
$0 ~ e { print m[((NR+1)%q)+1] }' "${@--}"
Adapting to a different default (I set it to 8) or proper option handling (currently, you'd run it like q=3 ./qgrep regex file) as well as remembering (and hence printing) the entire line should be easy enough.
(I also didn't bother to make it work correctly if you see a match in the first q-1 lines. It will just print an empty line then.)
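A variant of the same circular-buffer idea, sketched under the assumption that the offset is passed as an argument and is at least 1. The function name line_before is hypothetical. It checks for the match before storing the current line, so the slot about to be overwritten still holds the line exactly n lines back, and matches in the first n lines are silently skipped.

```shell
# Hypothetical helper: for every line matching regex $1, print the line
# that appeared exactly $2 lines earlier. Remaining arguments are files.
line_before() {
    re=$1
    n=$2
    shift 2
    awk -v re="$re" -v n="$n" '
        $0 ~ re && NR > n { print buf[(NR - n) % n] }  # slot still holds line NR-n
        { buf[NR % n] = $0 }                           # then overwrite it with the current line
    ' "$@"
}

printf 'some text lalala\nanother line\nmuch funny wow grep\n' | line_before funny 2
```

Piping the result through awk '{print $NF}' would then reduce the line to its last word.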

How to display the first word of each line in my file using the linux commands?

I have a file containing many lines, and I want to display only the first word of each line with the Linux commands.
How can I do that?
You can use awk:
awk '{print $1}' your_file
This will "print" the first column ($1) in your_file.
Try doing this using grep:
grep -Eo '^[^ ]+' file
Try doing this with coreutils cut:
cut -d' ' -f1 file
I see there are already answers. But you can also do this with sed:
sed 's/ .*//' fileName
The above solutions seem to fit your specific case. For a more general application of your question, consider that words are generally defined as being separated by whitespace, but not necessarily space characters specifically. Columns in your file may be tab-separated, for example, or even separated by a mixture of tabs and spaces.
The previous examples are all useful for finding space-separated words, while only the awk example also finds words separated by other whitespace characters (and in fact this turns out to be rather difficult to do uniformly across various sed/grep versions). You may also want to explicitly skip empty lines, by amending the awk statement thus:
awk '{if ($1 !="") print $1}' your_file
If you are also concerned about the possibility of empty fields, i.e., lines that begin with whitespace, then a more robust solution would be in order. I'm not adept enough with awk to produce a one-liner for such cases, but a short python script that does the trick might look like:
>>> import re
>>> for line in open('your_file'):
...     words = re.split(r'\s', line)
...     if words and words[0]:
...         print(words[0])
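For what it's worth, awk's default field splitting skips leading blanks and treats any run of spaces and tabs as a single separator, so a short awk sketch seems to cover the leading-whitespace case as well. The function name first_words and the sample input are assumptions for illustration.

```shell
# Hypothetical helper: print the first whitespace-delimited word of each
# line, skipping lines that are empty or contain only whitespace (NF == 0).
first_words() {
    awk 'NF { print $1 }' "$@"
}

# Made-up sample: space-separated, tab-indented, empty, and space-indented lines.
printf 'alpha beta\n\tgamma\tdelta\n\n   epsilon zeta\n' | first_words
```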
...or on Windows (if you have GnuWin32 grep) :
grep -Eo "^[^ ]+" file
