parsing data from log using awk - linux

I want to extract machineId, userId, origReqUri, filename, mime, size, and checksum as comma-separated values from this log pattern. Is there an awk command to do it?
test1.1/test.log.2020-07-14-20:2020-07-14 20:47:44,239 [http--1594759553405 sessionId:4567 nodeId:node-1 machineId:31656 userId:2540397 origReqUri:/test1/batch] INFO com.test.company - [RETURN INFO - RETURN] - TRACK_PREPROCESSED_DATA_POPULATION: Populated test_doc_version entry for doc version [1130783_1_0] with data from test_doc_metadata. File name: [09014b3080135f44.doc]. Mime type: [application/msword]. Content size: [100352]. MD5 checksum: [7ef30e834107990c95c7e53f7b6f6ee6]. [source:]
I tried
grep machineId:31656 test.1/test.log.2020-07-14-* |grep "Populated test_doc_version entry" | awk machineId |awk origReqUri

I didn't use awk, but I would solve your problem using mostly sed and grep, like this:
sed 's/: /:/g' input | sed 's/ /\n/g' | grep 'machineId\|userId\|origReqUri\|name\|type\|size\|checksum' | sed 's/\[\|\]\|\.//g' | tr '\n' ',' | sed 's/name/filename/g' | sed 's/type/mime/g' | sed 's/.$//'
P.S.: "input" is the name of the file where I wrote the input.
The result for the provided input is:
machineId:31656,userId:2540397,origReqUri:/test1/batch,filename:09014b3080135f44doc,mime:application/msword,size:100352,checksum:7ef30e834107990c95c7e53f7b6f6ee6
It is probably not the best solution, and we can certainly make it smaller and more beautiful, but I hope it helps you.
There's another solution, simpler and far more readable. You could do it like this:
tr -s ' :[]' ' ' < input | cut -d ' ' -f 12,14,16,39,43,47,51
Here the output is not comma-separated. I guess it's better not to use commas, since they already appear in the log data (for instance in the timestamp).
The result for this one is:
31656 2540397 /test1/batch 09014b3080135f44.doc application/msword 100352 7ef30e834107990c95c7e53f7b6f6ee6
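Since the question asked specifically for awk, here is a minimal sketch in plain POSIX awk, assuming the labels are spelled exactly as in the sample line (adjust the patterns and offsets if your logs differ):

awk '
/Populated test_doc_version entry/ {
    # each match() finds a label; substr() skips past the label text itself
    if (match($0, /machineId:[^ ]+/))        m = substr($0, RSTART + 10, RLENGTH - 10)
    if (match($0, /userId:[^ ]+/))           u = substr($0, RSTART + 7,  RLENGTH - 7)
    if (match($0, /origReqUri:[^ \]]+/))     r = substr($0, RSTART + 11, RLENGTH - 11)
    if (match($0, /File name: \[[^\]]+/))    f = substr($0, RSTART + 12, RLENGTH - 12)
    if (match($0, /Mime type: \[[^\]]+/))    t = substr($0, RSTART + 12, RLENGTH - 12)
    if (match($0, /Content size: \[[^\]]+/)) s = substr($0, RSTART + 15, RLENGTH - 15)
    if (match($0, /MD5 checksum: \[[^\]]+/)) c = substr($0, RSTART + 15, RLENGTH - 15)
    print m "," u "," r "," f "," t "," s "," c
}' test1.1/test.log.2020-07-14-*

For the sample line this prints 31656,2540397,/test1/batch,09014b3080135f44.doc,application/msword,100352,7ef30e834107990c95c7e53f7b6f6ee6.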

Related

linux extract portion of the string that can be second most common pattern

I have several strings (or filenames in a directory) and I need to group them by the second most common pattern; then I will iterate over each group and process it. In the example below I need 2 from ACCEPT and 2 from BASIC_REGIS - basically everything from the beginning of the string to one character after the hyphen (-), and that character could be anything, not just a digit. The first most common patterns are ACCEPT and BASIC_REGIS. I am looking for the second most common pattern using grep -Po (Perl regex, only-matching). An AWK solution is working.
INPUT
ACCEPT-zABC-0123
ACCEPT-zBAC-0231
ACCEPT-1ABC-0120
ACCEPT-1CBA-0321
BASIC_REGIS-2ABC-9043
BASIC_REGIS-2CBA-8132
BASIC_REGIS-PCCA-6532
BASIC_REGIS-PBBC-3023
OUTPUT
ACCEPT-z
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-P
echo "ACCEPT-0ABC-0123"|grep -Po "\K^A.*-"
Result : ACCEPT-0ABC-
but I need : ACCEPT-0
However, this awk solution is working:
echo "ACCEPT-1ABC-0120"|awk '$0 ~ /^A/{print substr($0,1,index($0,"-")+1)}'
ACCEPT-1
1st solution: With your shown samples, please try the following awk code.
awk '
match($0,/^(ACCEPT-[0-9]+|BASIC_REGIS-[0-9]+)/) && !arr[substr($0,RSTART,RLENGTH)]++
' Input_file
2nd solution: With GNU grep, please try the following:
grep -oP '^.*?-[0-9]+' Input_file | sort -u
Like this:
$ grep -Eo '^[^-]+-.' file | sort -u
Output
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9
The regular expression matches as follows:
Node     Explanation
^        the beginning of the string
[^-]+    any character except - (1 or more times, matching the most amount possible)
-        a literal -
.        any character except \n
Not too sure what you meant by "2nd most common groupings", but to simply replicate that output:
{gn}awk '!NF || !__[$-_ = sprintf("%.*s", index($-_,$(!_+!_)),$-_)]++' FS='-'
mawk '!NF || !__[$!NF = sprintf("%.*s", index($_, $(!_+!_)),$_) ]++' FS='-'
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9
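For readers puzzled by the obfuscation above, here is my own reading of those one-liners (an interpretation, so treat it as an assumption): $-_ and $_ both resolve to $0, $(!_+!_) is $2, and the sprintf keeps everything up to and including the first character of the second field. A plainer equivalent:

awk -F'-' 'NF { $0 = substr($0, 1, index($0, $2)) } !seen[$0]++' Input_file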
You don't need -P (PCRE) for that, just a plain, old BRE:
$ grep -o '^[^-]*-.' file | sort -u
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9
Or using GNU awk alone:
$ awk 'match($0,/^[^-]*-./,a) && !seen[a[0]]++{print a[0]}' file
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9
or any awk:
$ awk '!match($0,/^[^-]*-./){next} {$0=substr($0,1,RLENGTH)} !seen[$0]++' file
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9
POSIX shells have primitive parameter expansion built in. Using this:
${string%%-*} # Remove the first '-' and everything after it
In combination with this:
${string#*-} # Remove the first '-' and everything before it
you can extract the n'th most common pattern.
For example:
input="ACCEPT-0ABC-0123"
common_pattern_base=${input#-*} # Result → ACCEPT
next_level=${input#*-} # Result → 0ABC-0123
common_pattern_mid=${next_level#-*} # Result → 0ABC
next_level_again=${next_level#*-} # Result → 0123
Now I did this very crudely, but it should serve as an example of how simple and powerful this tool can be, especially in combination with a loop.
If you need a certain syntax, you can now simply work with the individual pieces:
# Result of line below → 0
trim_pattern_mid="$(echo "${common_pattern_mid}" | cut -c1)"
# Result of line below → ACCEPT-0
format="${common_pattern_base}-${trim_pattern_mid}"
While this answer is longer, it is more flexible and simpler than using regular expressions. Imagine wanting to get the 4th pattern of a 256-element chain with regex; it's a nightmare.
This answer is more suited for scripting. If it’s ad-hoc, grep or sed will do the job - at least for small patterns.
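To make that concrete, here is a minimal sketch of such a loop, assuming the strings live in a file called Input_file (the name is just for illustration) and using only POSIX expansions:

seen=""
while IFS= read -r line; do
    base=${line%%-*}                      # e.g. ACCEPT
    rest=${line#*-}                       # e.g. zABC-0123
    key="$base-$(printf '%.1s' "$rest")"  # prefix plus one character, e.g. ACCEPT-z
    case " $seen " in
        *" $key "*) ;;                    # already printed, skip
        *) printf '%s\n' "$key"; seen="$seen $key" ;;
    esac
done < Input_file

On the sample input this prints ACCEPT-z, ACCEPT-1, BASIC_REGIS-2 and BASIC_REGIS-P.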
A bit more efficient, as it's not calling substr (note that -v{,O}FS='-' relies on shell brace expansion and expands to -vFS='-' -vOFS='-'):
awk -v{,O}FS='-' '{printf("%s-%c\n",$1,$2)}' file

how to count occurrence of specific word in group of file by bash/shellscript

I have two text files, simple.txt and simple1.txt, with the following data in them:
simple.txt--
hello
hi hi hello
this
is it
simple1.txt--
hello hi
how are you
[]$ tr ' ' '\n' < simple.txt | grep -i -c '\bh\w*'
4
[]$ tr ' ' '\n' < simple1.txt | grep -i -c '\bh\w*'
3
These commands show the number of words that start with "h" for each file, but I want the total count to be displayed, i.e. 7, the total of both files. Can I do this in a single command/shell script?
P.S.: I had to write two commands, as tr does not take two file names.
Try this, the straightforward way:
cat simple.txt simple1.txt | tr ' ' '\n' | grep -i -c '\bh\w*'
This alternative requires no pipelines:
$ awk -v RS='[[:space:]]+' '/^h/{i++} END{print i+0}' simple.txt simple1.txt
7
How it works
-v RS='[[:space:]]+'
This tells awk to treat each word as a record.
/^h/{i++}
For any record (word) that starts with h, we increment variable i by 1.
END{print i+0}
After we have finished reading all the files, we print out the value of i.
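One portability note: treating RS as a regular expression is a gawk/mawk feature; a strictly POSIX awk treats RS as a single character. If that matters, a sketch that loops over the default whitespace-split fields works in any awk:

$ awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[hH]/) n++} END {print n + 0}' simple.txt simple1.txt
7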
It is not the case that tr accepts only one filename; it does not accept any filenames at all (it always reads from stdin). That's why even in your solution you didn't provide a filename to tr, but used input redirection.
In your case, I think you can replace tr by fmt, which does accept filenames:
fmt -1 simple.txt simple1.txt | grep -i -c -w 'h.*'
(I also changed the grep a bit, because I personally find it more readable this way, but this is a matter of taste.)
Note that both solutions (mine and your original ones) would count a string consisting of "h"-words joined by non-space characters - for instance haaaa.hbbbbbb.hccccc - as a single block, i.e. it would add only 1 to the count of "h"-words, not 3. Whether or not this is the desired behaviour is up to you to decide.
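If, on the other hand, you want each embedded "h"-run counted separately, grep -o prints every match on its own line, so a single pipeline over both files does the job (a sketch using the same GNU \b and \w the question already relies on; -h suppresses the file-name prefixes):

$ grep -hio '\bh\w*' simple.txt simple1.txt | wc -l
7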

Squeezing spaces between columns in Unix shell

I want the runs of spaces between the two columns to be squeezed.
After running a sql query from shell, I'm getting the output as below:
23554402243 0584940772;2TZ0584940772001U;
23554402272 0423721840;7TT0423721840001B;
23554402303 0110770863;BBTU500248822001Q;
23554402305 02311301;BTB02311301001J;
23554402563 0550503408;PPTU004984208001O;
23554402605 0457553223;Q0T0457553223001I;
23554367602 0454542427;TB8U501674990001V;
23554378584 0383071261;HTHU500374797001Y;
23554404965 059792244;ST3059792244005C;
23554405503 0571632586;QTO0571632586001D;
But the desired output should be like below:
23554400043 0117601738;22TU003719388001V;
23554402883 0823973229;TTT0823973229001C;
23554402950 024071080;MNT024071080001D;
23554405827 0415260614;TL20415260614001R;
23554405828 08119270800;TL2U003010407001G;
23554406553 011306895;VBT011306895001E;
23554406557 054121509;TL2054121509001M;
23554406563 065069209;TL2065069209005M;
23554409085 0803434328;QTO0803434328001B;
23553396219 062004063;G6T062004063001C;
Remember, there should be only one tab between the two columns in the desired output.
Assuming you need to remove the extra spaces between all the columns:
If you need a tab between the first two columns:
sed -r 's/\s+/\t/' inputfile
Add the g flag to apply the change between all the columns. If -r is not available:
sed 's/\s\+/\t/' inputfile
Or if you need a single space in place of every run of spaces:
tr -s ' '
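Incidentally, tr can translate and squeeze in one step, which yields the single tab the question asks for directly:

tr -s ' ' '\t' < inputfile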
Easy to do using this awk (assigning $1 to itself makes awk rebuild the record, joining the fields with OFS):
awk -v OFS='\t' '{$1=$1} 1' file
23554402243 0584940772;2TZ0584940772001U;
23554402272 0423721840;7TT0423721840001B;
23554402303 0110770863;BBTU500248822001Q;
23554402305 02311301;BTB02311301001J;
23554402563 0550503408;PPTU004984208001O;
23554402605 0457553223;Q0T0457553223001I;
23554367602 0454542427;TB8U501674990001V;
23554378584 0383071261;HTHU500374797001Y;
23554404965 059792244;ST3059792244005C;
23554405503 0571632586;QTO0571632586001D;
Alternatively this tr will also work:
tr -s ' ' < file | tr ' ' '\t'
or this sed:
sed -i.bak $'s/ \{1,\}/\t/g' file
What about the following Perl one-liner?
perl -ne '/(.*?)\s+(.*)/; print "$1\t$2\n"' your_input_file

Find HEX value in file and grep the following value

I have a 2GB file in raw format. I want to search for all appearances of a specific HEX value "355A3C2F74696D653E" AND collect the following 28 characters.
Example: 355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135
In this case I want the output: "323031312D30342D32365431343A34373A30322D31343A34373A3135" or better: 2011-04-26T14:47:02-14:47:15
I have tried with
xxd -u InputFile | grep '355A3C2F74696D653E' | cut -c 1-28 > OutputFile.txt
and
xxd -u -ps -c 4000000 InputFile | grep '355A3C2F74696D653E' | cut -b 1-28 > OutputFile.txt
But I can't get it working.
Can anybody give me a hint?
As you are using xxd it seems to me that you want to search the file as if it were binary data. I'd recommend using a more powerful programming language for this; the Unix shell tools assume there are line endings and that the text is mostly 7-bit ASCII. Consider using Python:
#!/usr/bin/python
import mmap

fd = open("file_to_search", "rb")
# the raw bytes of the ASCII marker "5Z</time>"
needle = b"\x35\x5A\x3C\x2F\x74\x69\x6D\x65\x3E"
haystack = mmap.mmap(fd.fileno(), length=0, access=mmap.ACCESS_READ)

i = haystack.find(needle)
while i >= 0:
    i += len(needle)
    # the 28 bytes after the marker are the ASCII timestamp range
    print(haystack[i:i + 28].decode("ascii", "replace"))
    i = haystack.find(needle, i)
If your grep supports the -P parameter, then you could simply use the below command.
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{28}'
323031312D30342D32365431343A
For 56 characters:
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{56}'
323031312D30342D32365431343A34373A30322D31343A34373A3135
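To get the decoded form mentioned in the question (2011-04-26T14:47:02-14:47:15), the extracted hex can be piped through xxd -r -p, which converts a plain hex dump back to bytes:

$ echo '323031312D30342D32365431343A34373A30322D31343A34373A3135' | xxd -r -p
2011-04-26T14:47:02-14:47:15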
Why convert to hex first? See if this awk script works for you. It looks for the string you want to match on, then prints the next 28 characters. Special characters are escaped with a backslash in the pattern.
Adapted from this post: Grep characters before and after match?
I added some blank lines for readability.
VirtualBox:~$ cat data.dat
Thisis a test of somerandom characters before thestringI want5Z</time>2011-04-26T14:47:02-14:47:15plus somemoredata
VirtualBox:~$ cat test.sh
awk '/5Z\<\/time\>/ {
match($0, /5Z\<\/time\>/); print substr($0, RSTART + 9, 28);
}' data.dat
VirtualBox:~$ ./test.sh
2011-04-26T14:47:02-14:47:15
VirtualBox:~$
EDIT: I just realized something. The regular expression will need to be tweaked to be non-greedy, and both it and the awk script will need tweaking to handle multiple occurrences as you need them. Perhaps some of the folks more up on awk can chime in with improvements, as I am really rusty. An approach to consider, anyway.

sed script to remove file name duplicates

I hope the task below will be very easy for sed lovers. I am not a sed guru, but I need to express the following task in sed, as sed is more common on Linux systems.
The input text stream is something which is produced by "make depends" and looks like following:
pgm2asc.o: pgm2asc.c ../include/config.h amiga.h list.h pgm2asc.h pnm.h \
output.h gocr.h unicode.h ocr1.h ocr0.h otsu.h barcode.h progress.h
box.o: box.c gocr.h pnm.h ../include/config.h unicode.h list.h pgm2asc.h \
output.h
database.o: database.c gocr.h pnm.h ../include/config.h unicode.h list.h \
pgm2asc.h output.h
detect.o: detect.c pgm2asc.h pnm.h ../include/config.h output.h gocr.h \
unicode.h list.h
I need to catch only the C++ header files (i.e. those ending with .h), make the list unique, and print it as a space-separated list with src/ prepended as a path prefix. This is achieved by the following perl script:
make libs-depends | perl -e 'while (<>) { while (/ ([\w\.\/]+?\.h)/g) { $a{$1} = 1; } } print join " ", map { "src/$_" } keys %a;'
The output is:
src/unicode.h src/pnm.h src/progress.h src/amiga.h src/ocr0.h src/ocr1.h src/otsu.h src/barcode.h src/gocr.h src/../include/config.h src/list.h src/pgm2asc.h src/output.h
Please, help to express this in sed.
Not sed, but I hope this helps you:
make libs-depends | grep -io --perl-regexp "[\w\.\/]+\.h\b" | sort -u | sed -e 's:^:src/:'
If you really want to do this in pure sed:
make libs-depends | sed 's/ /\n/g' | sed '/\.h$/!d;s/^/src\//' | sed 'G;/^\(.*\)\n.*\1/!h;$!d;${x;s/\n/ /g}'
The first sed command breaks the output up into separate lines, the second filters out everything but *.h and prepends 'src/', the third gloms the lines together without repetition.
Sed probably isn't the best tool here, as it's stream-oriented. You could possibly use it to convert the spaces to newlines, though, pipe that through sort and uniq, then use sed again to convert the newlines back to spaces.
Typing this on my phone, though, so I can't give exact commands :(
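For what it's worth, here is a sketch of the pipeline that answer describes (the line-continuation backslashes become lone tokens that the grep filters out):

make libs-depends | tr -s ' ' '\n' | grep '\.h$' | sort -u | sed 's|^|src/|' | tr '\n' ' '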
