Print word before matching a pattern - linux
I am trying to print only the word that comes just before a matching pattern, but I am not succeeding. Let me explain my requirement.
Input file
$>cat abc.txt
source: hrs1bdapoc2:21002
1571426725 secs (436507.42 hrs) behind the primary
Desired Output:-
echo $delay_time
1571426725
What I have tried with awk so far:
$>delay_time=`awk -F'secs' '{print $1}' abc.txt`
$>echo $delay_time
source: hrs1bdapoc2:21002 1571426725
Can you let me know what I am doing wrong?
Could you please try the following. Though I suspect an issue with your Input_file, because the code you showed ought to work.
awk 'match($0,/[0-9]+ secs/){print substr($0,RSTART,RLENGTH-5)}' Input_file
Also check whether your Input_file has control-M characters by running cat -v Input_file. If it does, you can remove them with tr -d '\r' < Input_file > temp && mv temp Input_file
Also, to save the result in a variable, do something like var=$(above command)
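Putting the pieces together, here is a minimal end-to-end sketch, assuming the file is named abc.txt as in the question (the tr step is only needed if cat -v showed ^M characters):
tr -d '\r' < abc.txt > temp && mv temp abc.txt   # only if the file has DOS line endings
delay_time=$(awk 'match($0,/[0-9]+ secs/){print substr($0,RSTART,RLENGTH-5)}' abc.txt)
echo "$delay_time"   # 1571426725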
Print the first field of the line that contains secs:
delay_time=$(awk '/secs/{ print $1 }' abc.txt)
Using backticks (`...`) for command substitution is discouraged; use $( ... ) instead.
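One practical reason for preferring $( ... ): it nests without escaping. A small illustrative sketch (the path here is made up, not from the question):
dir=$(basename "$(dirname /tmp/some/file)")      # easy to read: gives 'some'
dir=`basename \`dirname /tmp/some/file\``        # nested backticks must be escaped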
Related
linux command to fetch path of S3
I have an S3 bucket path like s3://my-dev-s3/apyong/output/public/file3.gz.tar. I want to fetch all characters after the first "/" (not "//") and before the last "/". So here the output should be apyong/output/public. I tried awk -F'/' '{print $NF}', but it's not producing the correct result.
Here is a sed solution:
s='s3://my-dev-s3/apyong/output/public/file3.gz.tar'
sed -E 's~.*//[^/]*/|/[^/]*$~~g' <<< "$s"
apyong/output/public
Pure bash solution:
s='s3://my-dev-s3/apyong/output/public/file3.gz.tar'
r="${s#*//*/}"
echo "${r%/*}"
apyong/output/public
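If the parameter expansions look cryptic, here is the same pure-bash idea with each step commented (just a restatement of the code above, using the same sample value):
s='s3://my-dev-s3/apyong/output/public/file3.gz.tar'
r="${s#*//*/}"     # strip the shortest prefix matching *//*/  ->  apyong/output/public/file3.gz.tar
echo "${r%/*}"     # strip the shortest suffix matching /*     ->  apyong/output/public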
1st solution: With GNU grep you could use the following. The -o and -P options make GNU grep print only the matched text and enable PCRE regexes, respectively. The regex ^.*?\/\/.*?\/\K(.*)(?=/) (explained below) fetches the desired outcome.
var="s3://my-dev-s3/apyong/output/public/file3.gz.tar"
grep -oP '^.*?\/\/.*?\/\K(.*)(?=/)' <<<"$var"
apyong/output/public
Explanation: detailed explanation of the regex used.
^.*?\/\/   ##From the start of the value, perform a lazy match up to // here.
.*?\/      ##Again perform a lazy match up to a single / here.
\K         ##\K makes sure the previously matched text is forgotten, so only the needed value is printed.
(.*)(?=/)  ##A greedy match (with a lookahead for /) to get everything else as required.
2nd solution: With GNU awk, please try the following code. It uses GNU awk's capturing-group capability of match() to store the captured value in array arr for later use in the program.
awk 'match($0,/^[^:]*:\/\/[^/]*\/(.*)\//,arr){print arr[1]}' <<<"$var"
3rd solution: An awk match + match solution to get the required output.
awk 'match($0,/^[^:]*:\/\/[^/]*\//) && match(val=substr($0,RSTART+RLENGTH),/^.*\//){
  print substr(val,RSTART,RLENGTH-1)
}
' <<<"$var"
Using awk:
awk -F "/" '{ for(i=4;i<NF;i++) { if (i==4) { printf "%s",$i } else { printf "/%s",$i } }}' <<< "s3://my-dev-s3/apyong/output/public/file3.gz.tar"
Set the field delimiter to "/" and then loop from the 4th field to the second-to-last field. For the fourth field print just the field; for every other field print "/" followed by the field.
Use coreutils cut:
<<<"s3://my-dev-s3/apyong/output/public/file3.gz.tar" \
cut -d/ -f4-6
Output:
apyong/output/public
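Note that -f4-6 assumes the path always has exactly this many components. If the depth can vary, one hedged variation (an assumption, not part of the original answer) is to take field 4 onwards and trim the last component:
cut -d/ -f4- <<<"s3://my-dev-s3/apyong/output/public/file3.gz.tar" | sed 's|/[^/]*$||'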
Select subdomains using print command
cat a.txt
a.b.c.d.e.google.com
x.y.z.google.com
rev a.txt | awk -F. '{print $2,$3}' | rev
This is showing:
e google
x google
But I want this output:
a.b.c.d.e.google
b.c.d.e.google
c.d.e.google
e.google
x.y.z.google
y.z.google
z.google
With your shown samples, please try the following awk code. Written and tested in GNU awk; it should work in any awk.
awk '
BEGIN{ FS=OFS="." }
{
  nf=NF
  for(i=1;i<(nf-1);i++){
    print
    $1=""
    sub(/^[[:space:]]*\./,"")
  }
}
' Input_file
Here is one more awk solution:
awk -F. '{while (!/^[^.]+\.[^.]+$/) {print; sub(/^[^.]+\./, "")}}' file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using sed:
$ sed -En 'p;:a;s/[^.]+\.(.*([^.]+\.){2}[[:alpha:]]+$)/\1/p;ta' input_file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using bash:
IFS=.
while read -ra a; do
  for ((i=${#a[@]}; i>2; i--)); do
    echo "${a[*]: -i}"
  done
done < a.txt
Gives:
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
(I assume the lack of d.e.google.com in your expected output is a typo?)
For a shorter and arguably simpler solution, you could use Perl to auto-split the line on the dot character into the @F array, and then print the range you want:
perl -F'\.' -le 'print join(".", @F[0..$#F-1])' a.txt
-F'\.' will auto-split each input line into the @F array. It splits on the given regular expression, so the dot needs to be escaped to be taken literally. $#F is the index of the last element of the array, so @F[0..$#F-1] is the range of elements from the first one ($F[0]) to the penultimate one. If you wanted to leave out both "google" and "com", you would use @F[0..$#F-2], etc.
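Note that this prints one shortened line per input line. If you also want every shorter suffix (down to three labels, as the other answers here produce), a hedged sketch along the same Perl lines, with -ane spelled out so the auto-split is explicit:
perl -F'\.' -lane 'print join(".", @F[$_..$#F]) for 0..$#F-2' a.txt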
Replacing multiple command calls
Able to trim and transpose the below data with sed, but it takes considerable time. Hopefully it would be faster with AWK. Any suggestions are welcome.
Input Sample Data:
[INX_8_60L ] :9:Y
[INX_8_60L ] :9:N
[INX_8_60L ] :9:Y
[INX_8_60Z ] :9:Y
[INX_8_60Z ] :9:Y
Required Output:
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
Just use awk, e.g.
awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
which will be orders of magnitude faster. It just picks out the substring (e.g. "INX_8_60L") using substr and match. n is simply used as a false/true (0/1) flag to prevent outputting a "!" before the first string.
Example Use/Output
With your data in file you would get:
$ awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
INX_8_60L!INX_8_60L!INX_8_60L!INX_8_60Z!INX_8_60Z
which appears to be what you are after. (Note: I'm not sure what your separator character is, so just change the above as needed.) If not, let me know and I'm happy to help further.
Edit per changes
Including the '?' isn't difficult, and I just copied the character, so you would now have:
awk -v n=0 '{s=substr($0,2,match($0,/[ \t]+/)-2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1} END {print ""}' file
Example Output
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
And to simplify, operating only on the first field as in @JamesBrown's answer, that reduces to:
awk -v n=0 '{s=substr($1,2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1} END {print ""}' file
Let me know if that needs more changes.
Don't start so many sed processes; separate the sed operations with semicolons instead (a sketch follows).
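As an illustration only (the exact substitutions are assumptions based on the sample data, not the asker's original script):
# instead of chaining several sed processes:
#   sed 's/^\[//' file | sed 's/ .*//' | sed 's/INX/INX?/'
# run one sed with the operations separated by semicolons:
sed 's/^\[//; s/ .*//; s/INX/INX?/' file
This still prints one value per line; joining them onto a single delimited line is handled separately (e.g. with tr, as in the answer below).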
Try to process the data in a single job and avoid regex. Below, the static-sized first block is read with substr() and the ? is inserted while outputting.
$ awk '{ b=b (b==""?"":";") substr($1,2,3) "?" substr($1,5) } END { print b }' file
Output:
INX?_8_60L;INX?_8_60L;INX?_8_60L;INX?_8_60Z;INX?_8_60Z
If the fields are not that static in size:
$ awk '
BEGIN {
    FS="[[_ ]"                                   # split field with regex
}
{
    printf "%s%s?_%s_%s",(i++?";":""), $2,$3,$4  # output semicolons and fields
}
END {
    print ""
}' file
Performance of the solutions for 20 M records:
Former:
real 0m8.017s
user 0m7.856s
sys 0m0.160s
Latter:
real 0m24.731s
user 0m24.620s
sys 0m0.112s
sed can be very fast when used gingerly, so for simplicity and speed you might wish to consider:
sed -e 's/ .*//' -e 's/\[INX/INX?/' | tr '\n' '|' | sed -e '$s/|$//'
The second call to sed is there to satisfy the requirement that there is no trailing |.
Another solution using GNU awk:
awk -F'[[ ]+' '
{printf "%s%s",(o?"¦":""),gensub(/INX/,"INX?",1,$2);o=1}
END{print ""}
' file
The field separator is set (with the -F option) so that it matches the wanted parameter. The main statement prints the modified parameter with the ? character. The variable o keeps track of the delimiter ¦.
filter out unrecognised fields using awk
I have a CSV file where I expect some values such as Y or N. Folks are adding comments or arbitrary entries such as NA? that I want to remove:
Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,
I can use gsub to remove things that I am anticipating, such as:
$ cat test.csv | awk '{gsub("NA\\?", ""); gsub("NA \\?",""); gsub("TBD", ""); print}'
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Yet that will break if someone adds a new comment. I am looking for a regex to generalise the match as "not Y". I tried some negative lookarounds but couldn't get them to work on the awk that I have, which is GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.1, GNU MP 6.1.2). Thanks in advance!
awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if ($i !~ /^(y|Y|n|N)$/) $i="";print}' test.CSV Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,, Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,, Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,, Accepting only Y/N (case-insensitive).
awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}' This seems to do the trick. Loops through the 3rd through the last field, and if the field isn't Y, it's replaced with nothing. Since we're modifying fields we need to set OFS as well. $ cat file.txt Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,, Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,, Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,, $ awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}' Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,, Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,, Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,, If you wanted to accept "N" too, /^[YN]$/ would work.
cat test.CSV | awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if($i != "Y") $i=""; print}'
Output:
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Update: There is no need to use regex if you simply want to determine whether a field is "Y" or not. However, if you do want regex, zzevannn's answer and tink's answer already give good regex conditions, so I'll give a batch replace by regex instead. To be exact, and to increase the challenge, I created some boundary conditions:
$ cat test.CSV
Create,20055776,Y,,Y,Y,,Y,,YNA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,YN.Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,NANN,,,,,Y,,,NA ?Y,,,Y,,,,,,TYBD,,,,,,,,,
And the batch replace is:
$ awk 'BEGIN{FS=OFS=","}{fst=$1;sub($1 FS,"");print fst,gensub("(,)[^,]*[^Y,]+[^,]*","\\1","g",$0);}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
"(,)[^,]*[^Y,]+[^,]*" matches anything between two commas other than a single Y. Note that I saved $1, deleted $1 and the comma after it first, and print it back later.
sed solution:
# POSIX
sed -e ':a' -e 's/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1/;t a' test.csv
# GNU
sed ':a;s/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1/;ta' test.csv
awk on the same concept (this avoids the problem that sed's basic regex lacks OR/alternation):
awk -F ',' '{ Idx=$2;gsub(/,[[:blank:]]*[^YN,][^,]*/, "");sub( /,/, "," Idx);print}'
search for a string and after getting result cut that word and store result in variable
I have a file named abc.lst and have stored it in a variable. It contains a string of 3 words; I want to grep the second word, and within it, cut out the part from expdp to .dmp and store that in a variable.
Example:
REFLIST_OP=/tmp/abc.lst
cat $REFLIST_OP
34 /data/abc/GOon/expdp_TEST_P119_*_18112017.dmp 12-JAN-18 04.27.00 AM
Desired Output:
expdp_TEST_P119_*_18112017.dmp
I have tried the command below:
FULL_DMP_NAME=`cat $REFLIST_OP|grep /orabackup|awk '{print $2}'`
echo $FULL_DMP_NAME
/data/abc/GOon/expdp_TEST_P119_*_18112017.dmp
REFLIST_OP=/tmp/abc.lst
awk '{n=split($2,arr,/\//); print arr[n]}' "$REFLIST_OP"
Test results:
$ REFLIST_OP=/tmp/abc.lst
$ cat "$REFLIST_OP"
34 /data/abc/GOon/expdp_TEST_P119_*_18112017.dmp 12-JAN-18 04.27.00 AM
$ awk '{n=split($2,arr,/\//); print arr[n]}' "$REFLIST_OP"
expdp_TEST_P119_*_18112017.dmp
To save in a variable:
myvar=$( awk '{n=split($2,arr,/\//); print arr[n]}' "$REFLIST_OP" )
The following awk may help you with the same:
awk -F'/| ' '{print $6}' Input_file
OR
awk -F'/| ' '{print $6}' "$REFLIST_OP"
Explanation: simply set space and / as field separators (as per your shown Input_file) and print the 6th field of the line, which is what is required. To see each field number and its value, you could use the following command too:
awk -F'/| ' '{for(i=1;i<=NF;i++){print i,$i}}' "$REFLIST_OP"
Using sed with one of these regexes:
sed -e 's/.*\/\([^[:space:]]*\).*/\1/' abc.lst
Capture the non-space characters after the last /, printing only the captured part.
sed -re 's|.*/([^[:space:]]*).*|\1|' abc.lst
Same as above, but using a different separator, thus avoiding escaping the /; -r allows using unescaped ( ).
sed -e 's|.*/||' -e 's|[[:space:]].*||' abc.lst
In two steps: remove up to the last /, then remove from the space to the end. (Maybe the easiest to read/understand.)
myvar=$(<abc.lst); myvar=${myvar##*/}; myvar=${myvar%% *}; echo $myvar
If you want to avoid external commands (sed).
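For clarity, the same parameter-expansion idea with each step commented (assuming abc.lst holds the single line shown in the question):
myvar=$(<abc.lst)     # 34 /data/abc/GOon/expdp_TEST_P119_*_18112017.dmp 12-JAN-18 04.27.00 AM
myvar=${myvar##*/}    # drop everything up to and including the last '/'
myvar=${myvar%% *}    # drop everything from the first space onward
echo "$myvar"         # expdp_TEST_P119_*_18112017.dmp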