Find HEX value in file and grep the following value - linux

I have a 2GB file in raw format. I want to find every occurrence of a specific HEX value "355A3C2F74696D653E" and collect the 28 characters (56 hex digits) that follow it.
Example: 355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135
In this case I want the output: "323031312D30342D32365431343A34373A30322D31343A34373A3135" or better: 2011-04-26T14:47:02-14:47:15
I have tried with
xxd -u InputFile | grep '355A3C2F74696D653E' | cut -c 1-28 > OutputFile.txt
and
xxd -u -ps -c 4000000 InputFile | grep '355A3C2F74696D653E' | cut -b 1-28 > OutputFile.txt
But I can't get it working.
Can anybody give me a hint?

As you are using xxd, it seems you want to search the file as binary data. I'd recommend using a more powerful programming language for this; the Unix shell tools assume there are line endings and that the text is mostly 7-bit ASCII. Consider using Python:
#!/usr/bin/env python3
import mmap

needle = b"\x35\x5A\x3C\x2F\x74\x69\x6D\x65\x3E"  # b"5Z</time>"
with open("file_to_search", "rb") as fd:
    # Map the file read-only so it is not loaded into memory all at once
    haystack = mmap.mmap(fd.fileno(), length=0, access=mmap.ACCESS_READ)
    i = haystack.find(needle)
    while i >= 0:
        i += len(needle)
        # The 28 bytes after the match, decoded for readable output
        print(haystack[i:i + 28].decode("ascii", errors="replace"))
        i = haystack.find(needle, i)

If your grep supports the -P option (PCRE, as in GNU grep), you could simply use the command below on the hex string.
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{28}'
323031312D30342D32365431343A
That returns 28 hex digits, which is only 14 bytes. For all 28 bytes, ask for 56 hex digits:
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{56}'
323031312D30342D32365431343A34373A30322D31343A34373A3135
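For the binary file itself (rather than a hex dump), the same idea can be sketched in Python. This is only an illustration under assumptions: following_bytes, NEEDLE and sample are names made up here, and the sample is just the example from the question.

```python
import re

# Marker from the question, written as hex digits and converted to raw bytes
NEEDLE = bytes.fromhex("355A3C2F74696D653E")  # b"5Z</time>"


def following_bytes(data, needle=NEEDLE, count=28):
    """Return the `count` bytes that follow every occurrence of `needle`."""
    # re.escape keeps metacharacters in the needle literal;
    # DOTALL lets "." match newline bytes as well.
    pattern = re.compile(re.escape(needle) + b"(.{%d})" % count, re.DOTALL)
    return [m.group(1) for m in pattern.finditer(data)]


# The example from the question, converted from hex to bytes
sample = bytes.fromhex(
    "355A3C2F74696D653E"
    "323031312D30342D32365431343A34373A30322D31343A34373A3135"
)
for chunk in following_bytes(sample):
    print(chunk.decode("ascii", errors="replace"))
```

Matches do not overlap, so a marker that starts inside a captured 28-byte span is skipped, and a marker with fewer than 28 bytes after it is not reported.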

Why convert to hex first? See if this awk script works for you. It looks for the string you want to match on, then prints the next 28 characters. Special characters are escaped with a backslash in the pattern.
Adapted from this post: Grep characters before and after match?
I added some blank lines for readability.
VirtualBox:~$ cat data.dat
Thisis a test of somerandom characters before thestringI want5Z</time>2011-04-26T14:47:02-14:47:15plus somemoredata
VirtualBox:~$ cat test.sh
awk '/5Z<\/time>/ {
    # Only "/" needs escaping; in gawk, \< and \> are word-boundary operators
    match($0, /5Z<\/time>/); print substr($0, RSTART + RLENGTH, 28);
}' data.dat
VirtualBox:~$ ./test.sh
2011-04-26T14:47:02-14:47:15
VirtualBox:~$
EDIT: I just realized something. The regular expression may need tweaking (non-greedy matching, etc.), and the awk script needs to be tweaked to handle multiple occurrences, as you need them. Perhaps some of the folks more up on awk can chime in with improvements, as I am really rusty. An approach to consider, anyway.

Related

count the number of Chinese characters and add it at the end of each line

I have a file with a Chinese word on each line, like this:
王大明
新型传染病
電子雷射
I want to append the number of Chinese characters to the end of each line:
王大明 3
新型传染病 5
電子雷射 4
How can I do this?
I know the commands sed and wc, but I cannot get this to work. I tried many things, but clearly I need help here.
sed -i s/$/{length $0}/ myfile
sed -i s/$/{wc -m}/ myfile
awk '{$2=system(awk 'length') OFS $2} 1' myfile
What will work depends entirely on what exactly your input looks like. If you are dealing with Unicode glyphs, use a Unicode-aware tool such as Python.
bash$ cat uniline
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    line = line.rstrip('\n')
    print(line, len(line))
bash$ chmod +x uniline
bash$ uniline <<\:
> 王大明
> 新型传染病
> 電子雷射
> :
王大明 3
新型传染病 5
電子雷射 4
(I had to trim some whitespace from the ends of the lines in the example you posted.)
For the record, my system encoding is UTF-8, meaning the first line's representation as bytes is
bash$ echo '王大明' | xxd
00000000: e78e 8be5 a4a7 e698 8e0a ..........
Perhaps see also Problematic questions about decoding errors for some relevant background.
If you are lucky, even Awk and wc might be locale-aware on your platform. Your sed attempts really have no chance of working (though if you have GNU sed you could try with the /e option; but really, probably don't). If you have GNU Awk and the en_US.UTF-8 locale defined, this works, too:
bash$ echo $'\xe7\x8e\x8b\xe5\xa4\xa7\xe6\x98\x8e' |
> LC_ALL=en_US.UTF-8 awk '{ print $0, length }'
王大明 3
If you're VERY certain the only multi-byte characters there are Chinese, then do this (with gawk, mawk or mawk2):
awk '{ print $0, gsub(/\342|\343|\344|\345|\346|\347|\350|\351|\357|\360/, "&") }'
This list of leading bytes should correctly account for the 3- and 4-byte code points related to Chinese characters, both simplified and traditional, plus all the special compatibility variants.
Run that in either byte mode or Unicode mode and it'll give you the same result. Your locale settings do NOT matter here (as long as your input is already UTF-8 compliant text).
If you're definitely in byte-mode or LC_ALL=C, then
awk '{ print $0, gsub(/[\342-\351\357\360]/,"&") }'
One of the less-mentioned but excellent use cases for gsub() is counting occurrences without having to do split() or substr().
If you're REALLY pedantic about exactness, the hideous regex I use myself is
function isChinese(str6) { return (str6 ~
/\344|\345|\346|\347|\350|\351|
(\343|\360|\357)(\244|\245|\246|\247|
\250|\251|\252|\253)|(\357\271|
\343(\204|\207))(\200|\201|\202|\203|\204|
\205|\206|\207|\210|
\211|\212|\213|\214|\215|\216|\217)|(\343\206|
\357\270)(\260|\261|\262|\263|\264|\265|\266|\267|
\270|\271|\272|\273|\274|\275|\276|\277)|
(\343|\360)(\240|\241|\242|\243|\254|\255|\256|\257|\260|
\261)|\342(\272|\273|\274|\275|\276|\277(\200|
\210|\211|\212|\213|\214|\215|\216|\217))|
(\342\277|\343(\204|\206|\207))(\220|\221|\222|
\223|\224|\225|\226|\227|\230|\231|\232|\233|
\234|\235|\236|\237)|\343(\200|\210|\211|\212|
\213|\214|\215|\216|\217|\220|\221|\222|\223|
\224|\225|\226|\227|\230|\231|\232|\233|\234|
\235|\236|\237|\262|\263|\264|\265|\266|(\204|
\206|\207)(\240|\241|\242|\243|\244|
\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257))/) };
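If matching on lead bytes feels too loose, counting by code point is another option. The sketch below is only an illustration: count_han and HAN are names made up here, and "Chinese character" is assumed to mean a code point in the four Han blocks listed in the comments, which is a narrower set than the regex above covers.

```python
import re

# Assumption: only these Han blocks count as "Chinese characters"
HAN = re.compile(
    "["
    "\u3400-\u4dbf"          # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"          # CJK Unified Ideographs
    "\uf900-\ufaff"          # CJK Compatibility Ideographs
    "\U00020000-\U0002a6df"  # CJK Unified Ideographs Extension B
    "]"
)


def count_han(line):
    """Count the characters of `line` that fall in the Han blocks above."""
    return len(HAN.findall(line))


for line in ["王大明", "新型传染病", "電子雷射"]:
    print(line, count_han(line))
```

Because Python strings are sequences of code points, the result does not depend on the locale once the input has been decoded.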

extract sequences from multifasta file by ID in file using awk

I would like to extract sequences from the multifasta file that match the IDs given by separate list of IDs.
FASTA file seq.fasta:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
IDs file id.txt:
7P58X:01332:11636
7P58X:01334:11613
I want to get the fasta file with only those sequences matching the IDs in the id.txt file:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:
(1)
awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta
This code works well for multiline sequences, but each ID has to be passed to the command separately.
(2)
awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta
This code can take the IDs from the id.txt file, but it returns only the first line of multiline sequences.
I guess the right thing would be to modify the RS variable in code (2), but all of my attempts have failed so far. Can anybody please help me with that?
$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
The following awk may also help:
awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt fasta_file
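The flag technique used in both awk answers (switch a flag on or off at each header line, print while it is on) can also be sketched in Python. The function name filter_fasta and the inline sample are made up for illustration, and the ID is assumed to be the first token after ">".

```python
def filter_fasta(fasta_lines, wanted_ids):
    """Yield header and sequence lines of records whose ID is wanted."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-separated token
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted_ids
        if keep:
            yield line


# Shortened copy of the example data from the question
fasta = """>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
""".splitlines()
wanted = {"7P58X:01332:11636", "7P58X:01334:11613"}

for line in filter_fasta(fasta, wanted):
    print(line)
```

With real files, the lines could come from open("seq.fasta") and the set from the stripped lines of id.txt.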
I'm facing a similar problem. The size of my multi-fasta file is ~ 25G.
I use sed instead of awk, though my solution is an ugly hack.
First, I extracted the line number of the title of each sequence to a data file.
grep -n ">" multi-fasta.fa > multi-fasta.idx
What I got is something like this:
1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009
Then, I extracted the wanted sequence by its title, e.g. DM_0000000004, using the script below.
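seqnm=$1
# idx0: position of the title in the index; idx1: its line number in the fasta file
idx0_idx1=$(grep -n -- "$seqnm" multi-fasta.idx)
idx0=$(echo "$idx0_idx1" | cut -d ":" -f 1)
idx0plus1=$((idx0 + 1))
idx1=$(echo "$idx0_idx1" | cut -d ":" -f 2)
# idx2: line number of the next title in the fasta file
idx2=$(head -n "$idx0plus1" multi-fasta.idx | tail -1 | cut -d ":" -f 1)
idx2minus1=$((idx2 - 1))
# Keep only the lines from the title up to the line before the next title
sed -n "${idx1},${idx2minus1}p" multi-fasta.fa > "${seqnm}.fasta"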
For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:
7507:42520:>DM_0000016115
7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.
42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.
idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).
Finally, using sed, we can extract the lines from idx1 to idx2 minus 1, which are the title and the sequence.
The advantage of this ugly hack is that it does not require a fixed number of lines for each sequence in the multi-fasta file (if every sequence had the same number of lines, you could simply use grep -A).
What bothers me is that this process is slow. For my 25G multi-fasta file, such an extraction takes tens of seconds. However, it's much faster than using samtools faidx.
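A variation on the same index idea is to store byte offsets instead of line numbers, so that a record can be read with a single seek. This is a rough sketch, not a measured improvement: build_index and fetch are hypothetical names, the sequences in the demo are invented, and the ID is assumed to be the first token after ">".

```python
import os
import tempfile


def build_index(path):
    """Map each sequence ID to the (start, end) byte offsets of its record."""
    index = {}
    current_id, start, pos = None, 0, 0
    with open(path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                if current_id is not None:
                    index[current_id] = (start, pos)
                # Assumption: the ID is the first token after ">"
                fields = line[1:].split()
                current_id = fields[0].decode() if fields else ""
                start = pos
            pos += len(line)
    if current_id is not None:
        index[current_id] = (start, pos)  # last record runs to end of file
    return index


def fetch(path, index, seq_id):
    """Read one record by seeking straight to its byte offsets."""
    start, end = index[seq_id]
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start).decode()


# Demo on a small temporary file with invented sequences
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">DM_0000000004\nACGT\nTTGA\n>DM_0000000005\nGGCC\n")
index = build_index(tmp.name)
print(fetch(tmp.name, index, "DM_0000000005"), end="")
os.remove(tmp.name)
```

Unlike the line-number version, the last record needs no special handling here, because its end offset is simply the end of the file.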

how to count occurrences of specific words in a group of files with bash/shell script

I have two text files, 'simple.txt' and 'simple1.txt', with the following data in them:
simple.txt--
hello
hi hi hello
this
is it
simple1.txt--
hello hi
how are you
[]$ tr ' ' '\n' < simple.txt | grep -i -c '\bh\w*'
4
[]$ tr ' ' '\n' < simple1.txt | grep -i -c '\bh\w*'
3
These commands show the number of words that start with "h" for each file, but I want to display the total count for both files, i.e. 7. Can I do this in a single command or shell script?
P.S.: I had to write two commands as tr does not take two file names.
Try this, the straightforward way :
cat simple.txt simple1.txt | tr ' ' '\n' | grep -i -c '\bh\w*'
This alternative requires no pipelines:
$ awk -v RS='[[:space:]]+' '/^h/{i++} END{print i+0}' simple.txt simple1.txt
7
How it works
-v RS='[[:space:]]+'
This tells awk to treat each word as a record.
/^h/{i++}
For any record (word) that starts with h, we increment variable i by 1.
END{print i+0}
After we have finished reading all the files, we print out the value of i.
It is not the case that tr accepts only one filename; it does not accept any filename at all (it always reads from stdin). That's why, even in your solution, you didn't provide a filename to tr but used input redirection.
In your case, I think you can replace tr by fmt, which does accept filenames:
fmt -1 simple.txt simple1.txt | grep -i -c -w 'h.*'
(I also changed the grep a bit, because I personally find it more readable this way, but this is a matter of taste).
Note that both solutions (mine and your original ones) would count a string consisting of letters and one or more non-space characters - for instance the string haaaa.hbbbbbb.hccccc - as a "single block", i.e. it would only add 1 to the count of "h"-words, not 3. Whether or not this is the desired behaviour, it's up to you to decide.
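The counting rule described in this note (split on whitespace, count each block that starts with "h") can be sketched in Python as well. count_words_with_prefix is a made-up name, and the two strings stand in for the contents of simple.txt and simple1.txt.

```python
def count_words_with_prefix(texts, prefix="h"):
    """Count whitespace-separated words starting with prefix, ignoring case."""
    prefix = prefix.lower()
    return sum(
        1
        for text in texts
        for word in text.split()
        if word.lower().startswith(prefix)
    )


# Stand-ins for the contents of simple.txt and simple1.txt
simple = "hello\nhi hi hello\nthis\nis it\n"
simple1 = "hello hi\nhow are you\n"
print(count_words_with_prefix([simple, simple1]))  # 7
```

As with the shell versions, a block such as haaaa.hbbbbbb.hccccc adds 1 to the count, not 3.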

Filtering Linux command output

I need to get a row based on column value just like querying a database. I have a command output like this,
Name ID Mem VCPUs State Time(s)
Domain-0 0 15485 16 r----- 1779042.1
prime95-01 512 1 -b---- 61.9
Here I need to list only those rows where state is "r". Something like this,
Domain-0 0 15485 16
r----- 1779042.1
I have tried using "grep" and "awk" but still I am not able to succeed.
Any help is much appreciated.
Regards,
Raaj
There is a variety of tools available for filtering.
If you only want lines with "r-----" grep is more than enough:
command | grep "r-----"
Or
cat filename | grep "r-----"
grep can handle this for you:
yourcommand | grep -- 'r-----'
It's often useful to save the (full) output to a file to analyse later. For this I use tee.
yourcommand | tee somefile | grep 'r-----'
If you want to find the line containing "-b----" a little later on without re-running yourcommand, you can just use:
grep -- '-b----' somefile
No need for cat here!
I recommend putting -- after your call to grep since your patterns contain minus-signs and if the minus-sign is at the beginning of the pattern, this would look like an option argument to grep rather than a part of the pattern.
try:
awk '$5 ~ /^r.*/ { print }'
Like this:
cat file | awk '$5 ~ /^r.*/ { print }'
grep solution:
command | grep -E "^([^ ]+ ){4}r"
What this does (-E switches on extended regexp):
The first caret (^) matches the beginning of the line.
[^ ] matches exactly one occurrence of a non-space character; the following modifier (+) allows it to also match more occurrences.
Grouped together with the trailing space in ([^ ]+ ), it matches any sequence of non-space characters followed by a single space. The modifier {4} requires this construct to be matched exactly four times.
The single "r" is then the literal character you are searching for.
In plain words this could be written like "If the line starts <^> with four strings that are each followed by a space <([^ ]+ ){4}> and the next character is <r>, then the line matches."
A very good introduction into regular expressions has been written by Jan Goyvaerts (http://www.regular-expressions.info/quickstart.html).
Filtering with awk on Linux:
First, select the matching row and store it in file2:
awk '/Domain-0 0 15485 /' file1 >file2
Output:-
Domain-0 0 15485 16
r----- 1779042.1
After that, run this awk command on file2:
awk '{print $1,$2,$3,$4,"\n",$5,$6}' file2
Final Output:-
Domain-0 0 15485 16
r----- 1779042.1
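Filtering on a column value, database-style, can also be sketched in Python. This is an illustration under stated assumptions: rows_in_state is a made-up name, every row is assumed to sit on a single line, and State is taken to be the second-to-last column because one sample row has fewer fields than the header.

```python
def rows_in_state(lines, state_prefix="r"):
    """Return data rows whose State column starts with state_prefix."""
    matches = []
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        # Assumption: State is the second-to-last column (Time(s) is last)
        if len(fields) >= 2 and fields[-2].startswith(state_prefix):
            matches.append(line)
    return matches


# Sample output from the question, one row per line
output = [
    "Name ID Mem VCPUs State Time(s)",
    "Domain-0 0 15485 16 r----- 1779042.1",
    "prime95-01 512 1 -b---- 61.9",
]
for row in rows_in_state(output):
    print(row)
```

Counting from the end of the row keeps the check working even when a leading column is empty, which a fixed $5 would not.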

Getting n-th line of text output

I have a script that generates two lines as output each time. I'm really just interested in the second line. Moreover I'm only interested in the text that appears between a pair of #'s on the second line. Additionally, between the hashes, another delimiter is used: ^A. It would be great if I can also break apart each part of text that is ^A-delimited (Note that ^A is SOH special character and can be typed by using Ctrl-A)
output | sed -n '1p' #prints the 1st line of output
output | sed -n '1,3p' #prints the 1st, 2nd and 3rd line of output
your.program | tail -n +2 | cut -d# -f2
should get you 2/3 of the way.
Improving Grumdrig's answer:
your.program | head -n 2| tail -1 | cut -d# -f2
I'd probably use awk for that.
your_script | awk -F# 'NR == 2 && NF == 3 {
    # "\001" is the SOH character (Ctrl-A); a literal "^A" would be read as a regex
    num_tokens = split($2, tokens, "\001")
    for (i = 1; i <= num_tokens; ++i) {
        print tokens[i]
    }
}'
This says
1. Set the field separator to #
2. On lines that are the 2nd line, and also have 3 fields (text#text#text)
3. Split the middle (2nd) field using "^A" as the delimiter into the array named tokens
4. Print each token
Obviously this makes a lot of assumptions. You might need to tweak it if, for example, # or ^A can appear legitimately in the data, without being separators. But something like that should get you started. You might need to use nawk or gawk or something, I'm not entirely sure if plain awk can handle splitting on a control character.
bash:
read
read line
result="${line#*#}"
result="${result%#*}"
IFS=$'\001' read -r -a result <<< "$result"
$result is now an array that contains the elements you're interested in. Just pipe the output of the script to this one.
here's a possible awk solution
awk -F"#" 'NR==2{
    for(i=2;i<=NF;i+=2){
        n=split($i,a,"\001") # split on SOH
        for(j=1;j<=n;j++) print a[j] # print each piece in order
    }
}' file
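The same three steps (take the second line, take the text between the first pair of #'s, split on SOH) can be sketched in Python. parse_second_line and the sample string are made up for illustration; the real script output may need extra handling if # or SOH can appear inside the data.

```python
def parse_second_line(output):
    """Return the SOH-separated parts between the #'s on the second line."""
    lines = output.split("\n")
    if len(lines) < 2:
        return []
    parts = lines[1].split("#")
    # Expect text#payload#text; anything else is treated as "no match"
    if len(parts) < 3:
        return []
    return parts[1].split("\x01")  # "\x01" is SOH (Ctrl-A)


# Hypothetical two-line output of the script
sample_output = "first line\nprefix#alpha\x01beta\x01gamma#suffix\n"
for token in parse_second_line(sample_output):
    print(token)
```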
