I need help with string processing in a CSH/TCSH script.
I know basic processing, but I need help understanding how to handle more advanced string operations.
I have a log file whose format is something like this:
[START_A]
log info of A
[END_A]
[START_B]
log info of B
[END_B]
[START_C]
log info of C
[END_C]
My requirement is to selectively extract the content between each start and end tag and store it in its own file.
For example, the content between START_A and END_A should be stored in A.log.
This should work for you (it writes A.txt, B.txt, C.txt; change ".txt" to ".log" in the program if you want that extension):
awk -F'_' '/\[START_.\]/{s=1;gsub(/]/,"",$2);f=$2".txt";next;}/\[END_.\]/{s=0} s{print $0 > f}' yourLog
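For readability, here is the same program spread out with comments (a sketch; yourLog stands for your log file name):
awk -F'_' '
    /\[START_.\]/ { s = 1                 # entering a block
                    gsub(/]/, "", $2)     # "A]" -> "A"
                    f = $2".txt"          # output file, e.g. A.txt
                    next }                # do not print the tag line itself
    /\[END_.\]/   { s = 0 }               # leaving a block
    s             { print $0 > f }        # inside a block: write the line
' yourLog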
test:
kent$ cat test
[START_A]
log info of A
[END_A]
[START_B]
log info of B
[END_B]
[START_C]
log info of C
[END_C]
kent$ awk -F'_' '/\[START_.\]/{s=1;gsub(/]/,"",$2);f=$2".txt";next;}/\[END_.\]/{s=0} s{print $0 > f}' test
kent$ head *.txt
==> A.txt <==
log info of A
==> B.txt <==
log info of B
==> C.txt <==
log info of C
Awk can do them one at a time (the brackets must be escaped, otherwise [START_A] is a character class; note this also prints the tag lines themselves):
awk '/\[START_A\]/,/\[END_A\]/' log
This might work for you (GNU sed): the first sed turns each START tag into a range command that writes the corresponding block (including its tag lines) to its own .log file, and the second sed runs that generated script against the file:
sed '/.*\[START_\([^]]*\)\].*/s||/\\[START_\1\\]/,/\\[END_\1\\]/w \1.log|p;d' file |
sed -nf - file
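Since the question mentions csh/tcsh: the awk solutions above run unchanged from a tcsh script. A minimal sketch (the log file name is an assumption):
#!/bin/tcsh
# hypothetical wrapper around the awk one-liner above
set logfile = yourLog
awk -F'_' '/\[START_.\]/{s=1;gsub(/]/,"",$2);f=$2".txt";next;}/\[END_.\]/{s=0} s{print $0 > f}' $logfile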
I'm a beginner at bash scripting and have been writing a script to check different log files, and I'm a bit stuck here.
clientlist=/path/to/logfile/which/consists/of/client/names
# I will grep only the client name from the file, which has multiple log lines
clients=$(grep --color -i 'list of client assets:' $clientlist | cut -d":" -f1 )
echo "Clients : $clients"
#For example "Clients: Apple
# Samsung
# Nokia"
#number of clients may vary from time to time
assets=("$clients".log)
echo assets: "$assets"
The code above greps the client names from the log file, and I'm trying to use each grepped client name to construct a log file name per client.
The number of clients is indefinite and may vary from time to time.
The code I have returns the client names as a whole:
assets: Apple
Samsung
Nokia.log
and I'm a bit unsure how to split the string and process the names one by one to produce a .log asset for each client name. How can I do this?
Apple.log
Samsung.log
Nokia.log
(Apologies if I have misunderstood the task)
Using awk
If your input file (I'll call it clients.txt) is:
Clients: Apple
Samsung
Nokia
The following awk step:
awk '{print $NF".log"}' clients.txt
outputs:
Apple.log
Samsung.log
Nokia.log
(You can pipe straight into awk and omit the file name if the piped stream has the same content as the file in the above example.)
It is highly likely that a simple awk procedure can perform the entire task, beginning with the 'clientlist' you process with grep (awk has all the functionality of grep built in), but I'd need to know the structure of the original file to extract the client names.
One awk idea:
assets=( $(awk -F: '/list of client assets:/ {print $2".log"}' "${clientlist}") )
# or
mapfile -t assets < <(awk -F: '/list of client assets:/ {print $2".log"}' "${clientlist}")
Where:
-F: - define input field delimiter as :
/list of client assets:/ - for lines that contain the string list of client assets: print the 2nd :-delimited field and append the string .log on the end
One sed idea:
assets=( $(sed 's/.*://; s/$/.log/' "${clientlist}") )
# or
mapfile -t assets < <(sed 's/.*://; s/$/.log/' "${clientlist}")
Where:
s/.*:// - strip off everything up to the :
s/$/.log/ - replace end of line with .log
Both generate:
$ typeset -p assets
declare -a assets=([0]="Apple.log" [1]="Samsung.log" [2]="Nokia.log")
$ echo "${assets[@]}"
Apple.log Samsung.log Nokia.log
$ printf "%s\n" "${assets[@]}"
Apple.log
Samsung.log
Nokia.log
$ for i in "${!assets[@]}"; do echo "assets[$i] = ${assets[$i]}"; done
assets[0] = Apple.log
assets[1] = Samsung.log
assets[2] = Nokia.log
NOTE: the alternative answers using mapfile address the issue referenced in Charles Duffy's comment (see bash pitfall #50); readarray is a synonym for mapfile.
I have around 20 files. The first column of each file contains IDs (ID0001, ID0056, ID0165, etc.). I have a list file that contains all possible IDs. I want to find the IDs from the list that are present in all the files. Is there a way to use grep for this? So far, if I use the command:
grep "id_name" file*.txt
it prints the ID even if it is present in only 1 file.
There is a simple grep pipeline that you can do, but it is a bit cumbersome to write down:
cut -f1 file1 | grep -Ff - file2 | cut -f1 | grep -Ff - file3 | cut -f1 | grep -Ff - file4 ...
(the extra cut -f1 stages reduce the surviving lines back to bare IDs before they are used as patterns against the next file)
Another way is using awk:
awk '{a[$1]++}END{for(i in a) if (a[i]==ARGC-1) print i}' file1 file2 file3 ...
The latter assumes that the IDs are unique within each file.
If they are not unique, it is a bit trickier:
awk '(FNR==1){delete b}!($1 in b){a[$1]++;b[$1]}END{for(i in a) if (a[i]==ARGC-1) print i }' file1 file2 file3 ...
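The same program, spread out and annotated (a sketch; note that delete on a whole array is a widely supported awk extension, e.g. in gawk and mawk):
awk '
    FNR == 1   { delete b }           # new file: reset the per-file seen set
    !($1 in b) { a[$1]++; b[$1] }     # count each id at most once per file
    END        { for (i in a)
                     if (a[i] == ARGC-1)   # seen in every input file
                         print i }
' file1 file2 file3 ...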
Say you have a list of all the IDs in a file ids_list.txt, with each ID on its own line, like
id001
id101
id201
...
And all the files you want to search are in the folder data. In this scenario, this little script should be able to help you:
#!/bin/bash
all_ids=""
for i in `cat ids_list.txt`; do
    all_ids="$all_ids|$i"
done
all_ids=`echo $all_ids|sed -e 's/^|//'`
grep -Pir "^($all_ids)[\s,]+" data
Its output would look like:
data/f1:id001, ssd
data/f3:id201, some data
...
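As an aside, the loop that builds the alternation can be avoided: grep reads one pattern per line from a file (or stdin with -f -). A rough equivalent using ERE instead of PCRE (a sketch):
# anchor each id and require trailing whitespace or a comma, then grep recursively
sed 's/.*/^&[[:space:],]+/' ids_list.txt | grep -Eirf - data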
This may be what you're trying to do but without sample input/output it's an untested guess:
awk '
    !seen[FILENAME,$1]++ {                  # first occurrence of this id in this file
        cnt[$1]++                           # so count it once per file
    }
    END {
        for (id in cnt) {
            if ( cnt[id] == (ARGC-1) ) {    # id was seen in every input file
                print id
            }
        }
    }
' list file*
I have a file
#InboxPulse.jmx
request.threads3=10
request.loop=10
duration=300
request.ramp=6
#LaunchPulse.jmx
request.threads1=20
request.loop1=5
duration1=300
request.ramp1=6
#BankRetail.jmx
request.threads2=30
request.loop2=7
duration2=300
request.ramp2=6
I would like to capture the values for
request.threads2
request.threads1
request.threads3
into another file like this:
10
20
30
I tried this
awk '/request.threads[0-9]{1,10}=/{print $NF}' build.properties >> sum.txt
It gives the output as:
request.threads3=10
request.threads1=20
request.threads2=30
How can I get the desired output?
Split on the = sign, match on field 1, print field 2:
awk -F'=' '$1 ~ /request.threads[0-9]+$/ {print $2}' build.properties >> sum.txt
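A slightly stricter variant would anchor the match and escape the dot, which the pattern above treats as "any character" (a sketch):
awk -F'=' '$1 ~ /^request\.threads[0-9]+$/ {print $2}' build.properties >> sum.txt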
1) Extracting values
$ grep -oP 'request.threads\d+=\K\d+' build.properties
10
20
30
Add > sum.txt to the command to save the output to a file.
2) If sum of those values is needed
$ perl -lne '($v)=/request.threads\d+=\K(\d+)/; $s+=$v; END{print $s}' build.properties
60
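If you'd rather stay in awk for the sum, an equivalent sketch:
awk -F'=' '$1 ~ /^request\.threads[0-9]+$/ {s += $2} END {print s}' build.properties
which likewise prints 60 for the sample file.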
I'm trying to write a shell script that searches for text within a file and prints out the text and associated information to a separate file.
From this file containing a list of gene IDs:
DDIT3 ENSG00000175197
DNMT1 ENSG00000129757
DYRK1B ENSG00000105204
I want to search for these gene IDs (ENSG*), their RPKM1 and RPKM2 values in a gtf file:
chr16 gencodeV7 gene 88772891 88781784 0.126744 + . gene_id "ENSG00000174177.7"; transcript_ids "ENST00000453996.1,ENST00000312060.4,ENST00000378384.3,"; RPKM1 "1.40735"; RPKM2 "1.61345"; iIDR "0.003";
chr11 gencodeV7 gene 55850277 55851215 0.000000 + . gene_id "ENSG00000225538.1"; transcript_ids "ENST00000425977.1,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";
and print/write it to a separate output file:
Gene_ID RPKM1 RPKM2
ENSG00000108270 7.81399 8.149
ENSG00000101126 12.0082 8.55263
I've done it on the command line for each ID using:
grep -w "ENSGno" rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' > output.file
but when it comes to writing the shell script, I've tried various combinations of for, while, read, do and changing the variables but without success. Any ideas would be great!
You can do something like:
while read line
do
    var=$(echo $line | awk '{print $2}')
    grep -w "$var" rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' >> output.file
done < geneIDs.file
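If the gtf file is large, re-reading it once per gene ID gets slow. A single-pass awk sketch (assuming the ID is the 2nd column of geneIDs.file and field 10 of the gtf holds the quoted, versioned gene_id):
awk 'NR==FNR { ids[$2]; next }          # first file: remember the wanted IDs
     {
       gid = $10
       gsub(/[";]/, "", gid)            # strip quotes and semicolon
       sub(/\.[0-9]+$/, "", gid)        # drop the version suffix, e.g. .7
       if (gid in ids) print $10, $13, $14, $15, $16
     }' geneIDs.file rnaseq.gtf > output.file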
I am trying to create a shell script that pulls a line from a file and checks another file for an instance of the same. If it finds an entry, it adds it to another file and loops through the first list until it has gone through the whole file. The data in the first file looks like this -
email@address.com;
email2@address.com;
and so on
The other file in which I am looking for a match and placing the match in the blank file looks like this -
12334 email@address.com;
32213 email2@address.com;
I want it to retain the numbers as well as the matching data. I have an idea of how this should work but need to know how to implement it.
My Idea
#!/bin/bash
read -p "enter first file name:" file1
read -p "enter second file name:" file2
FILE_DATA=( $( /bin/cat $file1))
FILE_DATA1=( $( /bin/cat $file2))
for I in $((${#FILE_DATA[@]}))
do
echo $FILE_DATA[$i] | grep $FILE_DATA1[$i] >> output.txt
done
I want the output to look like this but only for addresses that match -
12334 email@address.com;
32213 email2@address.com;
Thank You
I quite like manipulating text using SQL-style joins:
$ cat file1
b@address.com
a@address.com
c@address.com
d@address.com
$ cat file2
10712 e@address.com
11457 b@address.com
19985 f@address.com
22519 d@address.com
$ join -1 1 -2 2 <(sort file1) <(sort -k2 file2) | awk '{print $2,$1}'
11457 b@address.com
22519 d@address.com
make keys sorted (we use emails as keys here)
join on keys (file1.column1, file2.column2)
format output (use awk to reverse the columns)
As you've learned about diff and comm, now it's time to learn about another tool in the unix toolbox, join.
Join does just what the name indicates: it joins two files together. The way you join is based on keys embedded in the file.
The number one constraint on using join is that the data must be sorted in both files on the join column.
file1
a abc
b bcd
c cde
file2
a rec1
b rec2
c rec3
join file1 file2
a abc rec1
b bcd rec2
c cde rec3
You can consult the join man page for how to reduce and reorder the columns of output. For example:
join -o 1.1,2.2 file1 file2
a rec1
b rec2
c rec3
You can use your code for file name input to turn this into a generalizable script.
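For instance, a sketch applying join to the email data above, with the prompted file names (the output file name is an assumption):
#!/bin/bash
read -p "enter first file name:" file1
read -p "enter second file name:" file2
# join field 1 of file1 (emails) against field 2 of file2, print number then email
join -1 1 -2 2 -o 2.1,2.2 <(sort "$file1") <(sort -k2 "$file2") > output.txt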
Your solution using a pipeline inside a for loop will work for small sets of data, but as the size of data grows, the cost of starting a new process for each word you are searching for will drag down the run time.
I hope this helps.
Read file1.txt line by line, assigning each line to the variable ADDR; grep file2.txt for the content of ADDR and append the output to file_result.txt:
(while read ADDR; do grep "${ADDR}" file2.txt >> file_result.txt; done) < file1.txt
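Note that the addresses contain dots, which grep treats as "any character"; adding -F makes the match literal. grep can also take all the patterns in one shot, e.g. a sketch:
grep -Ff file1.txt file2.txt > file_result.txt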
This awk one-liner can help you do that -
awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
NR and FNR are awk's built-in variables that store line numbers. NR does not reset when awk moves on to the second file, but FNR does. So while NR==FNR holds (i.e., while reading the first file) we add every address to the array a. Once the first file is completed, we check the second column of the second file. If a match is present in the array we put the entire line in the file f3.txt. If not, we ignore it.
Using data from Kev's solution:
[jaypal:~/Temp] cat f1.txt
b@address.com
a@address.com
c@address.com
d@address.com
[jaypal:~/Temp] cat f2.txt
10712 e@address.com
11457 b@address.com
19985 f@address.com
22519 d@address.com
[jaypal:~/Temp] awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
[jaypal:~/Temp] cat f3.txt
11457 b@address.com
22519 d@address.com