GREP or AWK complex pattern match (linux)

GREP or AWK complex pattern match (linux) - linux

So I have text output bellow from a 'mediinfo VIDEO.mkv':
General
Unique ID : 190778803810831492312123193779943 (0x8F265C1B107A4D595F723237C370C7074FB7)
Complete name : VIDEO.mkv
Format : Matroska
Format version : Version 4 / Version 2
Video
ID : 1
Format : HEVC
Format/Info : High Efficiency Video Coding
Format profile : Main#L3#Main
Codec ID : V_MPEGH/ISO/HEVC
I need to GREP or AWK out the Format: HEVC bellow Video. I wasn't sure how to proceed as I could regex 'Format' but then I get back multiple rows (Matroska and HEVC). I haven't found any handy hints.
Ideas?

If "Matroska" is fixed you can do it by mediinfo VIDEO.mkv | grep "Format " test.fi | grep -v "Matroska"
If output format is fixed then you do it by mediinfo VIDEO.mkv | grep "Format " test.fi | tail -n1
grep -v will ignore matching line, tail will print specified number o lines from the last.

mediinfo VIDEO.mkv | awk -v RS= '/^Video/{print $7}'
HEVC
You can use awk with RS set to blank and print the desired column number.

Obviously many ways to solve this, but sed seems like a natural fit here:
$ sed -n '/Video/,$ { s/Format *: //p }' file
HEVC

Related

linux extract portion of the string that can be second most common pattern

I have several strings(or filenames in a directory) and i need to group them by second most common pattern, then i will iterate over them by each group and process them. in the example below i need 2 from ACCEPT and 2 from BASIC_REGIS, bascially from string beginning to one character after hyphen (-) and it could be any character and not just digit. The first most common pattern are ACCEPT and BASIC_REGIS. I am looking for second most common pattern using grep -Po (Perl and only-matching). AWK solution is working
INPUT
ACCEPT-zABC-0123
ACCEPT-zBAC-0231
ACCEPT-1ABC-0120
ACCEPT-1CBA-0321
BASIC_REGIS-2ABC-9043
BASIC_REGIS-2CBA-8132
BASIC_REGIS-PCCA-6532
BASIC_REGIS-PBBC-3023
OUTPUT
ACCEPT-z
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-P
echo "ACCEPT-0ABC-0123"|grep -Po "\K^A.*-"
Result : ACCEPT-0ABC-
but I need : ACCEPT-0
However awk solution is working
echo "ACCEPT-1ABC-0120"|awk '$0 ~ /^A/{print substr($0,1,index($0,"-")+1)}'
ACCEPT-1

1st solution: With your shown samples please try following awk code.
awk '
match($0,/^(ACCEPT-[0-9]+|BASIC_REGIS-[0-9]+/) && !arr[substr($0,RSTART,RLENGTH)]++
' Input_file
2nd solution: With GNU grep please try following.
grep -oP '^.*?-[0-9]+' Input_file | sort -u

Like this:
$ grep -Eo '^[^-]+-.' file | sort -u
Output
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9
The regular expression matches as follows:
Node
Explanation
^
the beginning of the string
[^-]+
any character except: - (1 or more times (matching the most amount possible))
-
-
.
any character except \n

not too sure what you meant by "2nd most common groupings", but to simply replicate that output :
{gn}awk '!NF || !__[$-_ = sprintf("%.*s", index($-_,$(!_+!_)),$-_)]++' FS='-'
mawk '!NF || !__[$!NF = sprintf("%.*s", index($_, $(!_+!_)),$_) ]++' FS='-'
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9

You don't need -P (PCRE) for that, just a plain, old BRE:
$ grep -o '^[^-]*-.' file | sort -u
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9
Or using GNU awk alone:
$ awk 'match($0,/^[^-]*-./,a) && !seen[a[0]]++{print a[0]}' file
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9
or any awk:
$ awk '!match($0,/^[^-]*-./){next} {$0=substr($0,1,RLENGTH)} !seen[$0]++' file
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9

POSIX-shells have primitive parameter expansion. Meaning using this:
${string#-*} # Remove first ‘-‘ and everything after
In combination with this:
${string#*-} # Remove first ‘-‘ and everything before
Can extract the n’th most common pattern.
For example:
input="ACCEPT-0ABC-0123"
common_pattern_base=${input#-*} # Result → ACCEPT
next_level=${input#*-} # Result → 0ABC-0123
common_pattern_mid=${next_level#-*} # Result → 0ABC
next_level_again=${next_level#*-} # Result → 0123
Now I did this very crudely, but it should serve as an example on how simple and powerful this tool can be. Especially in combination with a loop.
If you need a certain syntax, you can now simply work with individual pieces:
# Result of line below → 0
trim_pattern_mid=“$(echo ${common_pattern_mid} | cut -c1)”
# Result of line below → ACCEPT-0
format=“${common_pattern_base}-${trim_pattern_mid}”
While this answer is longer, it is more flexible and simple than using regular-expressions. Imagine wanting to get the 4th-pattern of a 256 long chain with regex, it’s a nightmare.
This answer is more suited for scripting. If it’s ad-hoc, grep or sed will do the job - at least for small patterns.

A bit more efficient as it's not calling substr:
awk -v{,O}FS='-' '{printf("%s-%c\n",$1,$2)}' file

parsing data from log using awk

I want to extract machineId userId origReqUri,filename,mime,size,checksum as comma-separated from this log pattern. Any awk command to do it?
test1.1/test.log.2020-07-14-20:2020-07-14 20:47:44,239 [http--1594759553405 sessionId:4567 nodeId:node-1 machineId:31656 userId:2540397 origReqUri:/test1/batch] INFO com.test.company - [RETURN INFO - RETURN] - TRACK_PREPROCESSED_DATA_POPULATION: Populated test_doc_version entry for doc version [1130783_1_0] with data from test_doc_metadata. File name: [09014b3080135f44.doc]. Mime type: [application/msword]. Content size: [100352]. MD5 checksum: [7ef30e834107990c95c7e53f7b6f6ee6]. [source:]
I tried
grep machineId:31656 test.1/test.log.2020-07-14-* |grep "Populated test_doc_version entry" | awk machineId |awk origReqUri

I didn't use AWK, but I would resolve your problem using mostly SED and GREP, like this:
sed s/': '/':'/g input | sed s/' '/\\n/g | grep 'machineId\|userId\|origReqUri\|name\|type\|size\|checksum' | sed 's/\[\|\]\|\.//g' | tr '\n' ',' | sed 's/name/filename/g' | sed 's/type/mime/g' | sed 's/.$//'
ps.: "input" is the name of the file where I wrote the input.
The result for the provided input is:
machineId:31656,userId:2540397,origReqUri:/test1/batch,filename:09014b3080135f44doc,mime:application/msword,size:100352,checksum:7ef30e834107990c95c7e53f7b6f6ee6
It is probably not the best solution and we can certainly make it smaller and more beautiful, but I hope it helps you.
There's another solution, simpler and way more readable. You could do like this:
tr -s ' :[]' ' ' < input | cut -d ' ' -f 12,14,16,39,43,47,51
In here, it's not comma-separated. I guess it's better not to use commas since they are in the list of special symbols.
The result for this one is:
31656 2540397 /test1/batch 09014b3080135f44.doc application/msword 100352 7ef30e834107990c95c7e53f7b6f6ee6

how to count occurrence of specific word in group of file by bash/shellscript

i have two text files 'simple' and 'simple1' with following data in them
simple.txt--
hello
hi hi hello
this
is it
simple1.txt--
hello hi
how are you
[]$ tr ' ' '\n' < simple.txt | grep -i -c '\bh\w*'
4
[]$ tr ' ' '\n' < simple1.txt | grep -i -c '\bh\w*'
3
this commands show the number of words that start with "h" for each file but i want to display the total count to be 7 i.e. total of both file. Can i do this in single command/shell script?
P.S.: I had to write two commands as tr does not take two file names.

Try this, the straightforward way :
cat simple.txt simple1.txt | tr ' ' '\n' | grep -i -c '\bh\w*'

This alternative requires no pipelines:
$ awk -v RS='[[:space:]]+' '/^h/{i++} END{print i+0}' simple.txt simple1.txt
7
How it works
-v RS='[[:space:]]+'
This tells awk to treat each word as a record.
/^h/{i++}
For any record (word) that starts with h, we increment variable i by 1.
END{print i+0}
After we have finished reading all the files, we print out the value of i.

It is not the case, that tr accepts only one filename, it does not accept any filename (and always reads from stdin). That's why even in your solution, you didn't provide a filename for tr, but used input redirection.
In your case, I think you can replace tr by fmt, which does accept filenames:
fmt -1 simple.txt simple1.txt | grep -i -c -w 'h.*'
(I also changed the grep a bit, because I personally find it better readable this way, but this is a matter of taste).
Note that both solutions (mine and your original ones) would count a string consisting of letters and one or more non-space characters - for instance the string haaaa.hbbbbbb.hccccc - as a "single block", i.e. it would only add 1 to the count of "h"-words, not 3. Whether or not this is the desired behaviour, it's up to you to decide.

Find HEX value in file and grep the following value

I have a 2GB file in raw format. I want to search for all appearance of a specific HEX value "355A3C2F74696D653E" AND collect the following 28 characters.
Example: 355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135
In this case I want the output: "323031312D30342D32365431343A34373A30322D31343A34373A3135" or better: 2011-04-26T14:47:02-14:47:15
I have tried with
xxd -u InputFile | grep '355A3C2F74696D653E' | cut -c 1-28 > OutputFile.txt
and
xxd -u -ps -c 4000000 InputFile | grep '355A3C2F74696D653E' | cut -b 1-28 > OutputFile.txt
But I can't get it working.
Can anybody give me a hint?

As you are using xxd it seems to me that you want to search the file as if it were binary data. I'd recommend using a more powerful programming language for this; the Unix shell tools assume there are line endings and that the text is mostly 7-bit ASCII. Consider using Python:
#!/usr/bin/python
import mmap
fd = open("file_to_search", "rb")
needle = "\x35\x5A\x3C\x2F\x74\x69\x6D\x65\x3E"
haystack = mmap.mmap(fd.fileno(), length = 0, access = mmap.ACCESS_READ)
i = haystack.find(needle)
while i >= 0:
i += len(needle)
print (haystack[i : i + 28])
i = haystack.find(needle, i)

If your grep supports -P parameter then you could simply use the below command.
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{28}'
323031312D30342D32365431343A
For 56 chars,
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{56}'
323031312D30342D32365431343A34373A30322D31343A34373A3135

Why convert to hex first? See if this awk script works for you. It looks for the string you want to match on, then prints the next 28 characters. Special characters are escaped with a backslash in the pattern.
Adapted from this post: Grep characters before and after match?
I added some blank lines for readability.
VirtualBox:~$ cat data.dat
Thisis a test of somerandom characters before thestringI want5Z</time>2011-04-26T14:47:02-14:47:15plus somemoredata
VirtualBox:~$ cat test.sh
awk '/5Z\<\/time\>/ {
match($0, /5Z\<\/time\>/); print substr($0, RSTART + 9, 28);
}' data.dat
VirtualBox:~$ ./test.sh
2011-04-26T14:47:02-14:47:15
VirtualBox:~$
EDIT: I just realized something. The regular expression will need to be tweaked to be non-greedy, etc and between that and awk need to be tweaked to handle multiple occurrences as you need them. Perhaps some of the folks more up on awk can chime in with improvements as I am real rusty. An approach to consider anyway.

Filtering Linux command output

I need to get a row based on column value just like querying a database. I have a command output like this,
Name ID Mem VCPUs State
Time(s)
Domain-0 0 15485 16 r-----
1779042.1
prime95-01 512 1
-b---- 61.9
Here I need to list only those rows where state is "r". Something like this,
Domain-0 0 15485 16
r----- 1779042.1
I have tried using "grep" and "awk" but still I am not able to succeed.
Any help me is much appreciated
Regards,
Raaj

There is a variaty of tools available for filtering.
If you only want lines with "r-----" grep is more than enough:
command | grep "r-----"
Or
cat filename | grep "r-----"

grep can handle this for you:
yourcommand | grep -- 'r-----'
It's often useful to save the (full) output to a file to analyse later. For this I use tee.
yourcommand | tee somefile | grep 'r-----'
If you want to find the line containing "-b----" a little later on without re-running yourcommand, you can just use:
grep -- '-b----' somefile
No need for cat here!
I recommend putting -- after your call to grep since your patterns contain minus-signs and if the minus-sign is at the beginning of the pattern, this would look like an option argument to grep rather than a part of the pattern.

try:
awk '$5 ~ /^r.*/ { print }'
Like this:
cat file | awk '$5 ~ /^r.*/ { print }'

grep solution:
command | grep -E "^([^ ]+ ){4}r"
What this does (-E switches on extended regexp):
The first caret (^) matches the beginning of the line.
[^ ] matches exactly one occurence of a non-space character, the following modifier (+) allows it to also match more occurences.
Grouped together with the trailing space in ([^ ]+ ), it matches any sequence of non-space characters followed by a single space. The modifyer {4} requires this construct to be matched exactly four times.
The single "r" is then the literal character you are searching for.
In plain words this could be written like "If the line starts <^> with four strings that are followed by a space <([^ ]+ ){4}> and the next character is , then the line matches."
A very good introduction into regular expressions has been written by Jan Goyvaerts (http://www.regular-expressions.info/quickstart.html).

Filtering by awk cmd in linux:-
Firstly find the column for this cmd and store file2 :-
awk '/Domain-0 0 15485 /' file1 >file2
Output:-
Domain-0 0 15485 16
r----- 1779042.1
after that awk cmd in file2:-
awk '{print $1,$2,$3,$4,"\n",$5,$6}' file2
Final Output:-
Domain-0 0 15485 16
r----- 1779042.1

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

GREP or AWK complex pattern match (linux) - linux

If "Matroska" is fixed you can do it by mediinfo VIDEO.mkv | grep "Format " test.fi | grep -v "Matroska" If output format is fixed then you do it by mediinfo VIDEO.mkv | grep "Format " test.fi | tail -n1 grep -v will ignore matching line, tail will print specified number o lines from the last.

mediinfo VIDEO.mkv | awk -v RS= '/^Video/{print $7}' HEVC You can use awk with RS set to blank and print the desired column number.

Obviously many ways to solve this, but sed seems like a natural fit here: $ sed -n '/Video/,$ { s/Format *: //p }' file HEVC

Related

linux extract portion of the string that can be second most common pattern

parsing data from log using awk

how to count occurrence of specific word in group of file by bash/shellscript

Find HEX value in file and grep the following value

Filtering Linux command output

Categories

Resources