How to grep particular lines - linux

I am trying to fetch some IDs from a URL.
In my script I request the URL in a while loop with wget and save the output to a file.
Then, in the same loop, I grep for "XYZ User ID:" plus the 3 lines after it and save the result to another file.
When I open this output file I find the following lines:
< p >XYZ User ID:< /p>
< /td >
< td>
< p>2989288174< /p>
So, using grep or anything else, how can I print the following output?
XYZ User ID:2989288174

Supposing a constant tag pattern:
<p>XYZ User ID:</p>
</td>
<td>
<p>2989288174</p>
grep should be the best way:
grep -oP '(?<=p>)([^>]+?)(?=<\/p)' outputfile|while read user;do
read id
echo "$user $id"
done
Note that look-behind expressions cannot be of variable length. That means you cannot use quantifiers (?, *, +, etc.) or alternation of different-length items inside them.
For tags of variable length (for example with optional spaces inside them), awk is well suited as a one-liner:
awk '/User ID/{print ""}/p *>/{printf $3}' FS='(p *>|<)' outputfile
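Another option for tags with varying spacing, such as the "< p >" form shown in the question, is \K, which discards everything matched so far and therefore sidesteps the fixed-length limit on look-behind. This is a minimal sketch, assuming a GNU grep built with -P (PCRE) support; the temporary file only stands in for the real output file.

```shell
# Sample input using the spaced tags from the question (stand-in for the real file).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
< p >XYZ User ID:< /p>
< /td >
< td>
< p>2989288174< /p>
EOF

# Match an opening <p> tag with optional spaces, forget it with \K,
# then keep everything up to the next "<". tr joins the two results.
result=$(grep -oP '<\s*p\s*>\K[^<]+' "$tmp" | tr -d '\n')
echo "$result"
rm -f "$tmp"
```

Closing tags are skipped because the "/" after "<" prevents the opening-tag pattern from matching.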

This should work (sed with extended regex):
sed -nr 's#<\s*p\s*>([^>]*)<\s*/\s*p\s*>#\1#p' file | tr -d '\n'
Output:
XYZ User ID:2989288174

Related

BASH scripting - unable to split string from grepped output and pass it one by one to a variable

I'm a beginner to bash scripting and been writing a script to check different log files and I'm bit stuck here.
clientlist=/path/to/logfile/which/consists/of/client/names
#i will grep only the client name from the file which has multiple log lines
clients=$(grep --color -i 'list of client assets:' $clientlist | cut -d":" -f1 )
echo "Clients : $clients"
#For example "Clients: Apple
# Samsung
# Nokia"
#number of clients may vary from time to time
assets=("$clients".log)
echo assets: "$assets"
The code above greps the client names from the log file, and I'm trying to use each grepped client name to construct a log file name for that client.
The number of clients is not fixed and may vary from time to time.
The code I have returns the client names as a single string:
assets: Apple
Samsung
Nokia.log
and I'm a bit unsure how to split the string and process the names one by one, so that each client name gets its own .log name. How can I do this?
Apple.log
Samsung.log
Nokia.log
(Apologies if I have misunderstood the task)
Using awk
if your input file (I'll call it clients.txt) is:
Clients: Apple
Samsung
Nokia
The following awk step:
awk '{print $NF".log"}' clients.txt
outputs:
Apple.log
Samsung.log
Nokia.log
(You can pipe straight into awk and omit the file name if the pipe stream is as the file contents in the above example).
It is highly likely that a simple awk procedure can perform the entire task, beginning with the 'clientlist' you process with grep (awk has all the functionality of grep built in), but I'd need to know the structure of the original file to extract the client names.
One awk idea:
assets=( $(awk -F: '/list of client assets:/ {print $2".log"}' "${clientlist}") )
# or
mapfile -t assets < <(awk -F: '/list of client assets:/ {print $2".log"}' "${clientlist}")
Where:
-F: - define input field delimiter as :
/list of client assets:/ - for lines that contain the string list of client assets:, print the 2nd :-delimited field with the string .log appended
One sed idea:
assets=( $(sed 's/.*://; s/$/.log/' "${clientlist}") )
# or
mapfile -t assets < <(sed 's/.*://; s/$/.log/' "${clientlist}")
Where:
s/.*:// - strip off everything up to the :
s/$/.log/ - replace end of line with .log
Both generate:
$ typeset -p assets
declare -a assets=([0]="Apple.log" [1]="Samsung.log" [2]="Nokia.log")
$ echo "${assets[@]}"
Apple.log Samsung.log Nokia.log
$ printf "%s\n" "${assets[@]}"
Apple.log
Samsung.log
Nokia.log
$ for i in "${!assets[@]}"; do echo "assets[$i] = ${assets[$i]}"; done
assets[0] = Apple.log
assets[1] = Samsung.log
assets[2] = Nokia.log
NOTE: the alternative answers using mapfile address the issue referenced in Charles Duffy comment (see bash pitfall #50); readarray is a synonym for mapfile
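If the client names are already sitting in a multi-line variable, as in the question's $clients, the array can also be built with a plain read loop. This is a sketch in bash only (arrays and here-strings are not POSIX sh); the sample value of clients is assumed from the question's example.

```shell
#!/usr/bin/env bash
# Assumed sample: the multi-line value that grep/cut left in $clients.
clients=$'Apple\nSamsung\nNokia'

assets=()
while IFS= read -r client; do
    # Skip empty lines, then store one file name per client.
    if [ -n "$client" ]; then
        assets+=("${client}.log")
    fi
done <<< "$clients"

# Each element can now be handled on its own.
for asset in "${assets[@]}"; do
    echo "$asset"
done
```

Reading line by line avoids the word splitting and glob expansion that an unquoted $(...) inside assets=( ... ) would be subject to.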

Can't input date variable in bash

I have a directory /user/reports that contains many files; one of them is:
report.active_user.30092018.77325.csv
I need the number after the date, i.e. 77325, from the above file name.
I created the command below to extract that value from the file name:
ls /user/reports | awk -F. '/report.active_user.30092018/ {print $(NF-1)}'
Now, I want current date to be passed in above command as variable and get result:
ls /user/reports | awk -F. '/report.active_user.$(date +'%d%m%Y')/ {print $(NF-1)}'
But not getting required output.
Tried bash script:
#!/usr/bin/env bash
_date=`date +%d%m%Y`
active=$(ls /user/reports | awk -F. '/report.active_user.${_date}/ {print $(NF-1)}')
echo $active
But still output is blank.
Please help with proper syntax.
As @cyrus said, you must use double quotes in your variable assignment: inside single quotes the text is taken literally and variables are not expanded.
Bad use case
number=10
string='I m sentence with or without var $number'
echo $string
Correct use case
number=10
string_with_number="I m sentence with var $number"
echo $string_with_number
You can use single quotes, as long as they do not enclose the whole string
number=10
string_with_number='I m sentence with var '$number
echo $string_with_number
Don't parse ls.
You don't need awk for this; the shell's own features are enough:
for file in /user/reports/report.active_user."$(date "+%d%m%Y")".*; do
    tmp=${file%.*}      # remove the extension
    number=${tmp##*.}   # remove the prefix up to and including the last dot
    echo "$number"
done
See https://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion
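If you would rather keep the awk approach from the question, the date can be handed to awk with -v instead of being expanded inside the single-quoted program. This is a sketch under assumptions: the two file names fed in by printf are made up and only stand in for the real directory listing.

```shell
today=$(date +%d%m%Y)

# Made-up file names standing in for the contents of /user/reports.
number=$(printf '%s\n' \
        "report.active_user.${today}.77325.csv" \
        "report.other.${today}.1.csv" |
    awk -F. -v d="$today" '$2 == "active_user" && $3 == d { print $(NF-1) }')

echo "$number"
```

Comparing the dot-separated fields directly also avoids the unescaped dots in the original regex, which match any character.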

how to count occurrence of specific word in group of file by bash/shellscript

I have two text files, simple.txt and simple1.txt, with the following data in them:
simple.txt--
hello
hi hi hello
this
is it
simple1.txt--
hello hi
how are you
[]$ tr ' ' '\n' < simple.txt | grep -i -c '\bh\w*'
4
[]$ tr ' ' '\n' < simple1.txt | grep -i -c '\bh\w*'
3
These commands show the number of words that start with "h" in each file, but I want to display the total for both files, i.e. 7. Can I do this in a single command or shell script?
P.S.: I had to write two commands as tr does not take two file names.
Try this, the straightforward way :
cat simple.txt simple1.txt | tr ' ' '\n' | grep -i -c '\bh\w*'
This alternative requires no pipelines:
$ awk -v RS='[[:space:]]+' '/^h/{i++} END{print i+0}' simple.txt simple1.txt
7
How it works
-v RS='[[:space:]]+'
This tells awk to treat each word as a record.
/^h/{i++}
For any record (word) that starts with h, we increment variable i by 1.
END{print i+0}
After we have finished reading all the files, we print out the value of i.
It is not the case that tr accepts only one filename: it accepts no filename at all and always reads from stdin. That is why, even in your solution, you did not pass a filename to tr but used input redirection.
In your case, I think you can replace tr with fmt, which does accept filenames:
fmt -1 simple.txt simple1.txt | grep -i -c -w 'h.*'
(I also changed the grep a bit, because I personally find it better readable this way, but this is a matter of taste).
Note that both solutions (mine and your original ones) would count a string consisting of letters and one or more non-space characters - for instance the string haaaa.hbbbbbb.hccccc - as a "single block", i.e. it would only add 1 to the count of "h"-words, not 3. Whether or not this is the desired behaviour, it's up to you to decide.
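If each "h"-word should be counted on its own, one possibility is to let grep print every match on a separate line and count the lines. This is a sketch that assumes a grep with -o support (GNU and BSD grep have it; it is not in POSIX); the sample files from the question are recreated in a temporary directory.

```shell
# Sample files from the question, recreated in a temporary directory.
dir=$(mktemp -d)
printf 'hello\nhi hi hello\nthis\nis it\n' > "$dir/simple.txt"
printf 'hello hi\nhow are you\n' > "$dir/simple1.txt"

# -o prints each match on its own line, -w keeps only whole words,
# -i ignores case, -h drops the file-name prefix.
total=$(grep -oiwh 'h[[:alnum:]_]*' "$dir/simple.txt" "$dir/simple1.txt" | wc -l | tr -d ' ')
echo "$total"
rm -rf "$dir"
```

With this approach a string such as haaaa.hbbbbbb.hccccc should count as three words rather than one, because the dot ends each match.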

Find HEX value in file and grep the following value

I have a 2GB file in raw format. I want to search for all occurrences of a specific HEX value "355A3C2F74696D653E" and collect the 28 characters that follow each one.
Example: 355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135
In this case I want the output: "323031312D30342D32365431343A34373A30322D31343A34373A3135" or better: 2011-04-26T14:47:02-14:47:15
I have tried with
xxd -u InputFile | grep '355A3C2F74696D653E' | cut -c 1-28 > OutputFile.txt
and
xxd -u -ps -c 4000000 InputFile | grep '355A3C2F74696D653E' | cut -b 1-28 > OutputFile.txt
But I can't get it working.
Can anybody give me a hint?
As you are using xxd it seems to me that you want to search the file as if it were binary data. I'd recommend using a more powerful programming language for this; the Unix shell tools assume there are line endings and that the text is mostly 7-bit ASCII. Consider using Python:
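#!/usr/bin/env python3
import mmap
# Bytes of "5Z</time>" (hex 35 5A 3C 2F 74 69 6D 65 3E)
needle = b"\x35\x5A\x3C\x2F\x74\x69\x6D\x65\x3E"
with open("file_to_search", "rb") as fd:
    # Map the file read-only so it is not read into memory all at once
    haystack = mmap.mmap(fd.fileno(), length=0, access=mmap.ACCESS_READ)
    i = haystack.find(needle)
    while i >= 0:
        i += len(needle)
        # Print the 28 bytes that follow each match
        print(haystack[i:i + 28].decode("ascii", errors="replace"))
        i = haystack.find(needle, i)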
If your grep supports -P parameter then you could simply use the below command.
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{28}'
323031312D30342D32365431343A
For 56 characters (28 bytes take 56 hex digits),
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{56}'
323031312D30342D32365431343A34373A30322D31343A34373A3135
Why convert to hex first? See if this awk script works for you. It looks for the string you want to match on, then prints the next 28 characters. Special characters are escaped with a backslash in the pattern.
Adapted from this post: Grep characters before and after match?
I added some blank lines for readability.
VirtualBox:~$ cat data.dat
Thisis a test of somerandom characters before thestringI want5Z</time>2011-04-26T14:47:02-14:47:15plus somemoredata
VirtualBox:~$ cat test.sh
awk 'match($0, /5Z<\/time>/) {
    # "<" and ">" stay unescaped: in GNU awk, \< and \> are word-boundary operators
    print substr($0, RSTART + RLENGTH, 28)
}' data.dat
VirtualBox:~$ ./test.sh
2011-04-26T14:47:02-14:47:15
VirtualBox:~$
EDIT: I just realized something. The regular expression will need to be tweaked to be non-greedy, etc and between that and awk need to be tweaked to handle multiple occurrences as you need them. Perhaps some of the folks more up on awk can chime in with improvements as I am real rusty. An approach to consider anyway.
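For multiple occurrences, grep can also search the raw file directly without converting it to hex first. This is a sketch assuming a grep that supports -a and -o (GNU and BSD grep do); the second timestamp in the sample line is made up to show more than one match.

```shell
# Sample line modelled on data.dat above, with a second, made-up occurrence.
tmp=$(mktemp)
printf '%s\n' 'before5Z</time>2011-04-26T14:47:02-14:47:15middle5Z</time>2011-04-27T09:00:00-09:00:30end' > "$tmp"

# -a treats binary input as text, -o prints only the matched part;
# cut then drops the 9-character marker "5Z</time>".
stamps=$(LC_ALL=C grep -ao '5Z</time>.\{28\}' "$tmp" | cut -c10-)
echo "$stamps"
rm -f "$tmp"
```

One caveat: "." does not match a newline, so an occurrence whose following 28 bytes contain a newline byte would be missed, and very long "lines" in a binary file may cost grep a lot of memory.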

Searching for text

I'm trying to write a shell script that searches for text within a file and prints out the text and associated information to a separate file.
From this file containing list of gene IDs:
DDIT3 ENSG00000175197
DNMT1 ENSG00000129757
DYRK1B ENSG00000105204
I want to search for these gene IDs (ENSG*), their RPKM1 and RPKM2 values in a gtf file:
chr16 gencodeV7 gene 88772891 88781784 0.126744 + . gene_id "ENSG00000174177.7"; transcript_ids "ENST00000453996.1,ENST00000312060.4,ENST00000378384.3,"; RPKM1 "1.40735"; RPKM2 "1.61345"; iIDR "0.003";
chr11 gencodeV7 gene 55850277 55851215 0.000000 + . gene_id "ENSG00000225538.1"; transcript_ids "ENST00000425977.1,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";
and print/ write it to a separate output file
Gene_ID RPKM1 RPKM2
ENSG00000108270 7.81399 8.149
ENSG00000101126 12.0082 8.55263
I've done it on the command line for each ID using:
grep -w "ENSGno" rnaseq.gtf| awk '{print $10,$13,$14,$15,$16}' > output.file
but when it comes to writing the shell script, I've tried various combinations of for, while, read, do and changing the variables but without success. Any ideas would be great!
You can do something like:
while read -r name id
do
    # the second column of geneIDs.file holds the ENSG ID
    grep -w "$id" rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' >> output.file
done < geneIDs.file
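The loop above runs grep once per gene ID. A single awk pass over both files is another way to do it, and it can also strip the quotes and semicolons to get the requested three-column output. This is a sketch under assumptions: GENE_A is a placeholder name, paired with an ID from the sample GTF lines so that there is something to match, and the field positions ($10, $14, $16) are taken from the command in the question.

```shell
dir=$(mktemp -d)
# Hypothetical ID list: GENE_A is a placeholder name, the ID comes from the sample GTF line.
printf 'GENE_A ENSG00000174177\n' > "$dir/geneIDs.file"
cat > "$dir/rnaseq.gtf" <<'EOF'
chr16 gencodeV7 gene 88772891 88781784 0.126744 + . gene_id "ENSG00000174177.7"; transcript_ids "ENST00000453996.1,ENST00000312060.4,ENST00000378384.3,"; RPKM1 "1.40735"; RPKM2 "1.61345"; iIDR "0.003";
chr11 gencodeV7 gene 55850277 55851215 0.000000 + . gene_id "ENSG00000225538.1"; transcript_ids "ENST00000425977.1,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";
EOF

result=$(awk '
    BEGIN { print "Gene_ID RPKM1 RPKM2" }
    NR == FNR { want[$2] = 1; next }    # first file: remember the wanted IDs
    {
        id = $10; r1 = $14; r2 = $16
        gsub(/[";]/, "", id)            # drop the quotes and the semicolon
        sub(/\..*/, "", id)             # drop the version suffix such as .7
        gsub(/[";]/, "", r1); gsub(/[";]/, "", r2)
        if (id in want) print id, r1, r2
    }' "$dir/geneIDs.file" "$dir/rnaseq.gtf")
echo "$result"
rm -rf "$dir"
```

With the real files, the two arguments would be geneIDs.file and rnaseq.gtf, and the output could be redirected to output.file.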

Resources