linux command to fetch path of S3 - linux

I have S3 bucket path like
s3://my-dev-s3/apyong/output/public/file3.gz.tar
I want to fetch all characters after first "/" (not //) and before last "/" .
So here output should be
apyong/output/public
I tried awk -F'/' '{print $NF}' .But its not producing correct results.

Here is sed solution:
s='s3://my-dev-s3/apyong/output/public/file3.gz.tar'
sed -E 's~.*//[^/]*/|/[^/]*$~~g' <<< "$s"
apyong/output/public
Pure bash solution:
s='s3://my-dev-s3/apyong/output/public/file3.gz.tar'
r="${s#*//*/}"
echo "${r%/*}"
apyong/output/public

1st solution: With GNU grep you could use following solution. Using -oP options with GNU grep will make sure it prints only matched values and enables PCRE regex respectively. Then in main grep code using regex ^.*?\/\/.*?\/\K(.*)(?=/)(explained following) to fetch the desired outcome.
var="s3://my-dev-s3/apyong/output/public/file3.gz.tar"
grep -oP '^.*?\/\/.*?\/\K(.*)(?=/)' <<<"$var"
apyong/output/public
Explanation: Adding detailed explanation for used regex.
^.*?\/\/ ##From starting of value performing a lazy match till // here.
.*?\/ ##again performing lazy match till single / here.
\K ##\K will make sure previous values are forgotten in order to print only needed values.
(.*)(?=/) ##This is greedy match to get everything else as per requirement.
2nd solution: Using GNU awk please try following code. Using capturing group capability of GNU awk here to store values captured by regex in array arr to be used later on in program.
awk 'match($0,/^[^:]*:\/\/[^/]*\/(.*)\//,arr){print arr[1]}' <<<"$var"
3rd solution: awk's match + match solution here to get the required output.
awk 'match($0,/^[^:]*:\/\/[^/]*\//) && match(val=substr($0,RSTART+RLENGTH),/^.*\//){
print substr(val,RSTART,RLENGTH-1)
}
' <<<"$var"

Using awk:
awk -F "/" '{ for(i=4;i<=NF;i++) { if (i==4) { printf "%s",$i } else { printf "/%s",$i } }}' <<< "s3://my-dev-s3/apyong/output/public/file3.gz.tar"
Set the field delimiter to "/" and the look through the 4th to the last field. For the fourth field print the field and for all other field print "/" then the field.

Use coreutils cut:
<<<"s3://my-dev-s3/apyong/output/public/file3.gz.tar" \
cut -d/ -f4-6
Output:
apyong/output/public

Related

Print word before matching a pattern

I am trying to print only first word after a matching pattern but not getting success, let me explain to you what is my requirement
Input file
$>cat abc.txt
source: hrs1bdapoc2:21002
1571426725 secs (436507.42 hrs) behind the primary
Desired Output:-
echo $delay_time
1571426725
What I have tried using awk command till now:-
$>delay_time=`awk -F'secs' '{print $1}' abc.txt`
$>echo $delay_time
source: hrs1bdapoc2:21002 1571426725
Can you let me know what I am doing wrong
Could you please try following. Though I suspect about your Input_file because code you showed should work out.
awk 'match($0,/[0-9]+ secs/){print substr($0,RSTART+5,RLENGTH-5)}' Input_file
Also check your Input_file if it has control M characters by doing cat -v Input_file if yes then you could remove them by doing tr -d '\r' Input_file > temp && mv temp Input_file
Also to create a variable do something like var=$(above command)
Print the first element on the line that has secs
delay_time=$(awk '/secs/{ print $1 }' abc.txt)
Using backticks ` is discouraged. Use $( ... ) for command substitution instead.

shorten awk command pipes

fairly new to using linux on shell.
I want to reduce the amount of pipes I used to extract the following data.
V 190917135635Z 1005 unknown /C=DE/ST=City/L=City/O=something/OU=Somewhat/CN=someserver.com/emailAddress=test#toast.com
My goal is to put the following values into a separate file
190917135635 someserver.com
The command I use right now is fairly long, piped and looks like this
grep -v '^R' $file | awk '{print $2, $6}' | awk -F'[=|/]' '{print $1, $3}' | awk '{print $1, $3}' | awk -F 'Z ' '{print $1, $2}' > sdata.txt
(The file contains other lines starting with 'R' so I exclude those in my grep)
Is this a legit way of doing it?
Is there a way to get this in a shorter command?
Thanks a lot!
Another awk. Using match to find CN entry and substr to extract it for print to print, if it exists.
$ awk '!/^R/{
print $2,
(match($0,/CN=[^/]+/)?substr($0,RSTART+3,RLENGTH-3):"") # 3==length("CN=")
}' file
Output:
190917135635Z someserver.com
Looks some of your data fields are used as creating SSL certificates, thus many fields might contain SPACES, i.e. City, Organization Name etc. That's why you need many awk lines(???). Here is one way which might help you overcome these issues. So instead of transforming your existing code logic, the target is to find the domain name by searching the substring CN= and fetching its corresponding value.
awk '
!/^R/{
start = index($0, "CN=")+3
end = index(substr($0, start), "/")
domain = end ? substr($0, start, end-1) : substr($0, start)
print $2, domain
}
' file.txt
Where:
we use index() to find the start-position of the substring CN=, +3 will be the starting point of the domain name
then we search the next / to get the end-position of this domain. if it's at the end of the line, there will be no / and thus end will be '0'
then we get the domain name between the substring CN= and the next '/' by using substr($0, start, end-1) or the end of line by using substr($0, start).
A short version:
awk '!/^R/{s=index($0, "CN=")+3; e=index(substr($0, s), "/"); print $2, substr($0, s, e ? e-1 : 253)}' file.txt
where 253 is the longest possible domain name which might be enough to fit your needs.
Update:
Actually, it's much easier just use match(), but the point is the same:
awk '!/^R/{if(match($0, "/CN=([^/]*)")) print $2, substr($0, RSTART+4, RLENGTH-4)}' file.txt
If this:
$ awk -F'[[:space:]/=]+' '!/^R/{print $2+0, $16}' file
190917135635 someserver.com
isn't all you need then updated your question to clarify your requirements and provide more truly representative sample input/output.
Using GNU sed:
sed -E -n '/^R/d; s/^[A-Za-z]\s+([0-9]+)\s+[0-9]+\s+.*\/CN=(.*)\/.*/\1 \2/p' input_file > new_file
EDIT: Strictly considering that OP's Input_file is same as shown samples only. After seeing OP's samples one could try following.
awk -F"[ =/Z]" '!/^R/{print $8,$37}' Input_file
For FUN :) in case one want to try in OP's approach then we could try following.
awk '
!/^R/{
val=$2 OFS $5
split(val,array,"[ /Z]")
val1=array[1] OFS array[9] OFS array[10]
split(val1,array1,"[ =]")
print array1[1],array1[3]
}
' Input_file
You are using $6 in the second awk command, that means your 5th column has potentially spaces inside unlike the sample data you showed, also it is extracting CN= part (CNAME?).
So here's a more compatible and more exact sed way which does not require GNU sed:
sed -n -e '/^R/!{' -e 's|^[^[:space:]]*[[:space:]]*\([^[:space:]Z][^[:space:]Z]*\).*/CN=\([^/][^/]*\).*|\1 \2|p;}'
If you just want digits in the second column and it begins with digit, then you can change to use this:
sed -n -e '/^R/!{' -e 's|^[^[:space:]]*[[:space:]]*\([0-9][0-9]*\).*/CN=\([^/][^/]*\).*|\1 \2|p;}'

filter out unrecognised fields using awk

I have a CVS file where I expect some values such as Y or N. Folks are adding comments or arbitrary entries such as NA? that I want to remove:
Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,
I can use gsub to remove things that I am anticipating such as:
$ cat test.csv | awk '{gsub("NA\\?", ""); gsub("NA \\?",""); gsub("TBD", ""); print}'
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Yet that will break if someone adds a new comment. I am looking for a regex to generalise the match as "not Y".
I tried some negative look arounds but couldn't get it to work on the awk that I have which is GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.1, GNU MP 6.1.2). Thanks in advance!
awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if ($i !~ /^(y|Y|n|N)$/) $i="";print}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Accepting only Y/N (case-insensitive).
awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}'
This seems to do the trick. Loops through the 3rd through the last field, and if the field isn't Y, it's replaced with nothing. Since we're modifying fields we need to set OFS as well.
$ cat file.txt
Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,
$ awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}'
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
If you wanted to accept "N" too, /^[YN]$/ would work.
cat test.CSV | awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if($i != "Y") $i=""; print}'
Output:
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Update: So there's no need to use regex if you simply want to determine it's "Y" or not.
However, if you want to use regex, as zzevannn's answer and tink's answer already gave great ideas of regex condition, so I'll give a batch replace by regex instead:
To be exact, and to increase the challenge, I created some boundary conditions:
$ cat test.CSV
Create,20055776,Y,,Y,Y,,Y,,YNA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,YN.Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,NANN,,,,,Y,,,NA ?Y,,,Y,,,,,,TYBD,,,,,,,,,
And the batch replace is:
$ awk 'BEGIN{FS=OFS=","}{fst=$1;sub($1 FS,"");print fst,gensub("(,)[^,]*[^Y,]+[^,]*","\\1","g",$0);}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
"(,)[^,]*[^Y,]+[^,]*" is to match anything between two commas that other than single Y.
Note I saved $1 and deleted $1 and the comma after it first, and later print it back.
sed solution
# POSIX
sed -e ':a' -e 's/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1/;t a' test.csv
# GNU
sed ':a;s/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1/;ta' test.csv
awk on same concept (avoid some problem of sed that miss the OR regex)
awk -F ',' '{ Idx=$2;gsub(/,[[:blank:]]*[^YN,][^,]*/, "");sub( /,/, "," Idx);print}'

search for a string and after getting result cut that word and store result in variable

I Have a file name abc.lst i ahve stored that in a variable it contain 3 words string among them i want to grep second word and in that i want to cut the word from expdp to .dmp and store that into variable
example:-
REFLIST_OP=/tmp/abc.lst
cat $REFLIST_OP
34 /data/abc/GOon/expdp_TEST_P119_*_18112017.dmp 12-JAN-18 04.27.00 AM
Desired Output:-
expdp_TEST_P119_*_18112017.dmp
I Have tried below command :-
FULL_DMP_NAME=`cat $REFLIST_OP|grep /orabackup|awk '{print $2}'`
echo $FULL_DMP_NAME
/data/abc/GOon/expdp_TEST_P119_*_18112017.dmp
REFLIST_OP=/tmp/abc.lst
awk '{n=split($2,arr,/\//); print arr[n]}' "$REFLIST_OP"
Test Results:
$ REFLIST_OP=/tmp/abc.lst
$ cat "$REFLIST_OP"
34 /data/abc/GOon/expdp_TEST_P119_*_18112017.dmp 12-JAN-18 04.27.00 AM
$ awk '{n=split($2,arr,/\//); print arr[n]}' "$REFLIST_OP"
expdp_TEST_P119_*_18112017.dmp
To save in variable
myvar=$( awk '{n=split($2,arr,/\//); print arr[n]}' "$REFLIST_OP" )
Following awk may help you on same.
awk -F'/| ' '{print $6}' Input_file
OR
awk -F'/| ' '{print $6}' "$REFLIST_OP"
Explanation: Simply making space and / as a field separator(as per your shown Input_file) and then printing 6th field of the line which is required by OP.
To see the field number and field's value you could use following command too:
awk -F'/| ' '{for(i=1;i<=NF;i++){print i,$i}}' "$REFLIST_OP"
Using sed with one of these regex
sed -e 's/.*\/\([^[:space:]]*\).*/\1/' abc.lst capture non space characters after /, printing only the captured part.
sed -re 's|.*/([^[:space:]]*).*|\1|' abc.lst Same as above, but using different separator, thus avoiding to escape the /. -r to use unescaped (
sed -e 's|.*/||' -e 's|[[:space:]].*||' abc.lst in two steps, remove up to last /, remove from space to end. (May be easiest to read/understand)
myvar=$(<abc.lst); myvar=${myvar##*/}; myvar=${myvar%% *}; echo $myvar
If you want to avoid external command (sed)

cut a particular number from the url in linux

I've a file with the below header generated by certain process
Link: <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=2>; rel="next", <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=8>; rel="last"
I want to cut just the number 8 from page=8 in the above content. How to go about it? Appreciate any help.
Try this -
$ cat f
Link: <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=2>; rel="next", <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=8>; rel="last"
$ awk -F'[&=<>]' '{for(i=1;i<=NF;i++) if($i ~ /^page$/) {print $(i+1)}}' f
2
8
If it is getting appended then you will get the last value using below awk :
$ awk -F'[&=<>]' '{for(i=1;i<=NF;i++) if($i ~ /^page$/) {kk=$(i+1)}} END{print kk}' ff
8
Limitation : Currently you have page=2 and page=8 and above command
will print the last page value.
And if you always want to print the 2nd value "8" (Added extra lines to the existing url, considering that it will keep on increasing and you always need the 2nd value then use below) -
$ cat f
Link: <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=2>; rel="next", <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=8>; rel="last"
<https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=8>; rel="last"
$ awk -v k=1 -F'[&=<>]' '{for(i=1;i<=NF;i++) if(($i ~ /^page$/) && (k==2) ) {print $(i+1)} k++}' f
8
Following is an implementation using grep:
grep -Po "&page=[0-9]*" <file_name> | grep -Po "[0-9]*"
Example:
echo 'Link: <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=2>; rel="next", <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=8000>; rel="last"' | grep -Po "&page=[0-9]*" | grep -Po "[0-9]*"
This will produces the result as expected.
echo 'Link: <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=2>; rel="next", <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=12345>; rel="last"' | grep -Po "&page=[0-9]*" |grep -Po "[0-9]*"| awk '2 == NR % $ct'
In awk. reverse the text, remove first [0-9]+=egap, output and rev again:
$ rev foo | awk 'sub(/[0-9]+=egap/,"")||1' |rev
Output:
Link: <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&page=2>; rel="next", <https://rnd.corp.zoom/api/v3/repositories/99/issues?state=all&per_page=100&>; rel="last"
try:
awk '{gsub(/.*page=/,"page=");sub(/>.*/,"");print}' Input_file
Simply substitute the all line with .*page= to page= which is nothing but will go till last page string(as * is a greedy regex match), so then substitute >.*(means starting from > to till end of line) with NULL, then print the line which will be page=8 or last value of the page. Off course I am considering that your Input_file is same as example shown.
awk -F'[= >]' '{print $12}' file
8
awk -F= '{split($8,a,">");print a[1]}' file
8
awk -F= '$8=="8>; rel"{print substr($8,1,1)}' file
8
The fact that a greedy regex is needed here (only the last occurrence of &page= should be matched) enables a simple sed solution:
sed -E 's/^.*&page=([0-9]+).*$/\1/' file
^.*&page= matches everything up to the last occurrence of &page on the line.
([0-9]+) matches one or more digits, and - thanks to enclosure in (...) stores the match in the 1st (and only) capture group, which the replacement string then reference as \1.
.*$ matches any remaining character on the line.
By virtue of the regex having matched the entire line, \1 therefore results in just the captured number as the output.
The above works with both GNU and BSD/macOS sed and takes advantage of modern extended regular expressions (-E), but in case you need a POSIX-compliant solution (which must use basic regular expressions and is therefore more cumbersome):
sed 's/^.*&page=\([0-9]\{1,\}\).*$/\1/' file
With GNU grep (on Linux, as requested), a single-pass grep -Po solution is also possible; like the sed solution, it relies on greedily matching up to the last &page=:
grep -Po "^.*&page=\K[0-9]+" file
-P activates support for PRCEs (Perl-compatible Regular Expressions).
-o only outputs the matching part of the line.
\K drops everything matched so far, so that what [0-9]+ matches - one or more digits - is the only output.

Resources