When I run a command, I get output containing a table:
+---------+------------------------------------------------------+
| Key | Value |
+---------+------------------------------------------------------+
| Address | longstringofcharacters |
+---------+------------------------------------------------------+
| Name | word1-word2-word3 |
+---------+------------------------------------------------------+
I can grep Name to get the line that contains the word Name.
What do I grep to output just the string word1-word2-word3?
I've tried grep '*-*-*' but that doesn't work.
With GNU grep, you can use a PCRE regex based solution like
grep -oP '^\h*\|\h*Name\h*\|\h*\K\S+' file
-o - outputs matches only
-P - enables the PCRE regex engine
^ - start of line
\h*\|\h* - a | char enclosed in optional horizontal whitespace
Name - the word Name
\h*\|\h* - a | char enclosed in optional horizontal whitespace
\K - match reset operator that discards the text matched so far
\S+ - one or more non-whitespace chars.
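For example, with the table above saved in file (the name is illustrative):
$ grep -oP '^\h*\|\h*Name\h*\|\h*\K\S+' file
word1-word2-word3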
With GNU awk:
awk -F'[|[:space:]]+' '$2 == "Name"{print $3}' file
Set the field separator to the [|[:space:]]+ regex, which matches one or more | chars or whitespace characters; since each line starts with a separator, Field 1 is empty, so check whether Field 2 equals Name and grab Field 3.
With any awk (if you need to extract a string like nonwhitespaces(-nonwhitespaces)+):
awk 'match($0, /[^ -]+(-[^ -]+)+/) { print substr($0, RSTART, RLENGTH) }' file
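For example, against the sample table only the Name row matches (the +--- border lines don't, because the regex needs non-dash characters on both sides of each dash):
$ awk 'match($0, /[^ -]+(-[^ -]+)+/) { print substr($0, RSTART, RLENGTH) }' file
word1-word2-word3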
A simple solution using awk is:
awk '/Name/{ print $4 }'
The /Name/ section is how awk "greps". The { print $4 } bit says print the fourth space delimited word.
I am trying to split the following text string on dash, square-bracket and colon delimiters, but keep the colons that are inside the square brackets
Input:
10:100 - [10/09/21:12:23:22]
Desired output:
100, 10/09/21:12:23:22
My current code:
awk -F '[- ":]' '{print $1, $2, $3, $4, $5}'
1st solution: With GNU awk you could try the following code.
awk '
match($0,/:([^[:space:]]+)[[:space:]]+-[[:space:]]+\[([^]]*)\]/,arr){
print arr[1],arr[2]
}
' Input_file
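A quick check with GNU awk (the 3-argument match() is gawk-specific), using echo to stand in for Input_file:
$ echo '10:100 - [10/09/21:12:23:22]' | gawk 'match($0,/:([^[:space:]]+)[[:space:]]+-[[:space:]]+\[([^]]*)\]/,arr){print arr[1],arr[2]}'
100 10/09/21:12:23:22
Add -v OFS=', ' if you want the comma-separated form shown in the desired output.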
2nd solution: Using sed's s (substitution) command along with its capturing-group capability, try the following:
sed -E 's/^[^:]*:([^[:space:]]+)[[:space:]]+-[[:space:]]+\[([^]]*)\]/\1 \2/' Input_file
3rd solution: Using any awk, you could use the following code, applying its sub and gsub functions to the 1st and last fields.
awk '{sub(/.*:/,"",$1);gsub(/^\[|\]$/,"",$NF);print $1,$NF}' Input_file
4th solution: A Perl one-liner using a lazy match .*? along with its substitution operator.
perl -pe 's/^.*?:([^[:space:]]+)[[:space:]]+-[[:space:]]+\[([^]]*)\]/\1 \2/' Input_file
If the string can contain several of these patterns, and the order does not matter, you can make use of awk: match the patterns you are interested in, then remove the surrounding delimiters.
In this case, you can match
\[[^][]+]|:[0-9]+
The pattern matches:
\[[^][]+] Match from [...]
| Or
:[0-9]+ Match : and 1+ digits
The ^[:\[]|\]$ pattern in the gsub matches either : or [ at the start of the string, or ] at the end of the string, and replaces the match with an empty string.
awk '
{
while(match($0,/\[[^][]+]|:[0-9]+/)){
v = substr($0,RSTART,RLENGTH)
gsub(/^[:\[]|\]$/, "", v)
print v
$0=substr($0,RSTART+RLENGTH)
}
}
' file
Output
100
10/09/21:12:23:22
assuming no empty lines within the input data:
echo '10:100 - [10/09/21:12:23:22]' |
nawk 'sub("^[^:]*:",_, $!--NF)' FS='[ -]*[][]' OFS=', '
or
gawk 'NF -= sub("^[^:]*:",_)' FS='[ -]*[][]' OFS=', '
or
mawk 'NF -= sub("^[^:]*:",_)' FS='[][ -]+' OFS=', '
The field separator regex treats each square bracket, together with any adjacent spaces and dashes, as a delimiter; decrementing NF drops the empty field after the closing ] and rebuilds $0 with the ', ' output separator, while the sub() strips the leading 10: and returns 1 (true), so the line is printed:
100, 10/09/21:12:23:22
What I want to achieve:
grep: extract lines with the contig number and length
awk: remove "length:" from column 2
sort: sort by length (in descending order)
Current code
grep "length:" test_reads.fa.contigs.vcake_output | awk -F:'{print $2}' |sort -g -r > contig.txt
Example content of test_reads.fa.contigs.vcake_output:
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
Expected output
>Contig_0 99995
>Contig_11 42
With your shown samples, please try the following awk + sort solution.
awk -F'[: ]' '/^>/{print $1,$3}' Input_file | sort -nrk2
Explanation: run the awk program on Input_file with the field separator set to : or a space; if a line starts with >, print its 1st and 3rd fields, then send the output (as standard input) to the sort command, which sorts numerically in reverse on the 2nd field to get the required output.
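Running it on the sample file as a quick check:
$ awk -F'[: ]' '/^>/{print $1,$3}' test_reads.fa.contigs.vcake_output | sort -nrk2
>Contig_0 99995
>Contig_11 42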
Here is a gnu-awk solution that does it all in a single command without invoking sort:
awk -F '[:[:blank:]]' '
$2 == "length" {arr[$1] = $3}
END {
PROCINFO["sorted_in"] = "#ind_num_asc"
for (i in arr)
print i, arr[i]
}' file
>Contig_0 99995
>Contig_11 42
Perhaps this, which combines the grep and awk steps into a single awk call:
awk -F '[ :]' '$2 == "length" {print $1, $3}' file | sort ...
Assumptions:
if more than one row has the same length then additionally sort the 1st column using 'version' sort
Adding some additional lines to the sample input:
$ cat test_reads.fa.contigs.vcake_output
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_17 length:93
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_837 ignore-this-length:1000000
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_8 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
One sed/sort idea:
$ sed -rn 's/(>[^ ]+) length:(.*)$/\1 \2/p' test_reads.fa.contigs.vcake_output | sort -k2,2nr -k1,1V
Where:
-rn - enable extended regex support and suppress normal printing of input data
(>[^ ]+) - (1st capture group) - > followed by 1 or more non-space characters
 length: - a space followed by length:
(.*) - (2nd capture group) - 0 or more characters (following the colon)
$ - end of line
\1 \2/p - replace with the 1st capture group + <space> + the 2nd capture group, then print
-k2,2nr - sort by 2nd (spaced-delimited) field in reverse numeric order
-k1,1V - sort by 1st (space-delimited) field in Version order
This generates:
>Contig_0 99995
>Contig_17 93
>Contig_8 42
>Contig_11 42
I have made a script for analyzing Windows log message numbers. The counts produced by uniq -c are difficult to predict, because they are padded with varying whitespace depending on the size of the numbers. At the moment I remove the whitespace manually.
This is the command which sorts and counts the messages:
cat nt2.rawlog | awk 'BEGIN {FS=","} {print $3,$4,$6,$7}' | sort | uniq -c | sort -rg >> ~/tempNT2.report
This is my best attempt at an example output:
21340 4624,Windows-Security-Audit-Log,Success Audit,Logon
1209 4658,Windows-Security-Audit-Log,Success Audit,Privileged Logon
My desired output is:
[tab]21340[tab]--[tab]Security Audit Log 4624 (Logon Success Audit)
[tab]1209[tab]--[tab]Security Audit Log 4658 (Privileged Logon Success Audit)
Something like
awk -F , '{ i = split($1, n, / +/);
printf ("\t%d\t--\t%s %d (%s %s)\n", n[i-1], $2, n[i], substr($4, 2), $3) }'
The field separator , does the first level of splitting; then we split the first field on whitespace and collect the pieces in n. The number of elements in n depends on whether the field had leading whitespace, so we count the last two elements from the end. The last field has a pesky leading space, so we take a substring starting at its second character.
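A minimal sketch of the count-from-the-end idea in isolation (the input values are illustrative):
$ echo '   21340 4624' | awk '{ i = split($0, n, / +/); print n[i-1], n[i] }'
21340 4624
With leading whitespace, split() produces an empty first element, so indexing from the end via i yields the same two numbers either way.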
I have a string like:
sometext sometext BASEDIR=/someword/someword/someword/1342.32 sometext sometext.
Could someone tell me how to extract the number 1342.32 from the above string on Linux?
$ echo "sometext BASEDIR=/someword/1342.32 sometext." |
sed "s/[^0-9.]//g"
> 1342.32.
The sed command searches for anything not in the set "0123456789" or ".", and replaces it with nothing (deletes it). It does this in global mode, so it doesn't stop on the first match.
This is enough if you're just trying to read it. If you're trying to feed the number into another command and need a real number, you will need to clean it up:
$ ... | cut -f 1-2 -d "."
> 1342.32
cut splits the input on the delimiter, then selects fields 1 and 2 (numbered from one). So "1.2.3.4" would return "1.2".
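Putting the two steps together on the sample string:
$ echo "sometext BASEDIR=/someword/1342.32 sometext." | sed "s/[^0-9.]//g" | cut -f 1-2 -d "."
> 1342.32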
If sometext is always delimited from the surrounding fields by whitespace, try this
cat log.txt | awk '{for (i=1;i<=NF;i++) {if ($i ~ /BASEDIR/) {print i,$i}}}' |
awk -F/ '{for (i=1;i<=NF;i++) {if ($i ~ /^[0-9][0-9.]*$/) {print $i}}}'
The code snippet above assumes that your data is contained in a file called log.txt and organised in records (in the awk sense).
This also works if digits appear in sometext before BASEDIR, as well as when the input has additional lines:
sed -n 's,.*BASEDIR=\(/\w*\)*/\([0-9.]*\).*,\2,p'
-n do not output lines without BASEDIR…
\(/\w*\)* group of / and someword, repeated
\([0-9.]*\) group of repeated digit or decimal point
\2 replacement of everything matched (the entire line) with the 2nd group
p print the result
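For example:
$ echo 'sometext sometext BASEDIR=/someword/someword/someword/1342.32 sometext sometext.' | sed -n 's,.*BASEDIR=\(/\w*\)*/\([0-9.]*\).*,\2,p'
1342.32
(\w is a GNU sed extension; on other seds replace it with [[:alnum:]_].)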
I have a script that generates two lines of output each time. I'm really just interested in the second line. Moreover, I'm only interested in the text that appears between a pair of #'s on the second line. Additionally, between the hashes, another delimiter is used: ^A. It would be great if I could also break apart each piece of ^A-delimited text. (Note that ^A is the SOH special character and can be typed using Ctrl-A.)
output | sed -n '1p' #prints the 1st line of output
output | sed -n '1,3p' #prints the 1st, 2nd and 3rd line of output
your.program | tail -n +2 | cut -d# -f2
should get you 2/3 of the way.
Improving Grumdrig's answer:
your.program | head -n 2 | tail -n 1 | cut -d# -f2
I'd probably use awk for that.
your_script | awk -F# 'NR == 2 && NF == 3 {
num_tokens=split($2, tokens, "^A")
for (i = 1; i <= num_tokens; ++i) {
print tokens[i]
}
}'
This says
1. Set the field separator to #
2. On the 2nd line, if it also has 3 fields (text#text#text)
3. Split the middle (2nd) field using "^A" as the delimiter into the array named tokens
4. Print each token
Obviously this makes a lot of assumptions. You might need to tweak it if, for example, # or ^A can appear legitimately in the data, without being separators. But something like that should get you started. You might need to use nawk or gawk or something, I'm not entirely sure if plain awk can handle splitting on a control character.
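A quick way to test without the real script is to fake its two-line output with printf, writing the SOH separator as the octal escape \001 (which the common awks accept inside a string):
printf 'first line\nfoo#a\001b\001c#bar\n' | awk -F# 'NR == 2 && NF == 3 {
  n = split($2, tokens, "\001")
  for (i = 1; i <= n; ++i) print tokens[i]
}'
This should print a, b and c on separate lines.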
bash:
read
read line
result="${line#*#}"
result="${result%#*}"
IFS=$'\001' read -r -a result <<< "$result"
result is now an array that contains the elements you're interested in. Just pipe the output of the script to this one.
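The same faked input can be used to check the snippet (\001 in the printf is the SOH byte):
printf 'first line\nfoo#a\001b\001c#bar\n' | {
  read -r
  read -r line
  result="${line#*#}"
  result="${result%#*}"
  IFS=$'\001' read -r -a result <<< "$result"
  printf '%s\n' "${result[@]}"   # prints a, b and c on separate lines
}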
here's a possible awk solution
awk -F"#" 'NR==2{
for(i=2;i<=NF;i+=2){
split($i,a,"\001") # split on SOH
for(o in a ) print o # print the splitted hash
}
}' file