Extracting lines from file using grep in a for loop, exporting to new file with variable in file name - linux

I am trying to extract all lines from a file that contain a string using a for loop with a file that contains a list of possible strings. I also want to export the results of grep to a new file with the variable in the file name.
Here is what I have:
file="variables.txt"
listofvariables=$(cat ${file})
for variable in ${listofvariables}
do
samtools view sample.bam | \
grep "'${variable}'" \
> sample.${variable}.bam
done
What this code does is simply make a blank file for every variable. Why isn't grep extracting lines that contain that variable and putting it into those files?
For reference, here is what the variables.txt file looks like:
mmu-let-7g-5p
mmu-let-7g-3p
mmu-let-7i-5p
mmu-let-7i-3p
mmu-miR-1a-1-5p
mmu-miR-1a-3p
mmu-miR-15b-5p
mmu-miR-15b-3p
mmu-miR-23b-5p
mmu-miR-23b-3p
And here is what the samtools view output looks like:
7238520-1_CATAAT.mmu-miR-125b-5p 0 chr1 11301523 60 75M * 0 0CAGGTGTTTTCTCAGGCATTTGGATTTCTATAGAATCATAGTATTAAAATTTCAAAGTAATAACATTGCTTTTTA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:75 YT:Z:UU NH:i:1
1422982-2_CCCCGC.mmu-miR-132-3p 0 chr1 11301726 60 97M * 0 0 AAGTCTGTTTTTATGTGAGTGTTCCTGTGAAACTGAGGTCTGATGACTCTTCCTTAAGCAATTACAACTTCATTAGCATACATAAGGTTCAATTAAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:97 YT:Z:UU NH:i:1
5675450-1_CCCCGC.mmu-miR-132-3p 0 chr1 11301726 60 97M * 0 0 AAGTCTGTTTTTATGTGAGTGTTCGTGTGAAACTGAGGTCTGATGACTCTTCCTTAAGCAATTACAACTTC^C
For those who may be unfamiliar, samtools view simply reads out the .bam file. You can think of it like cat.
Thanks in advance!

Since ...
What this code does is simply make a blank file for every variable.
... you know that your variables file is being read correctly, and your for loop is correctly iterating over the results. That the resulting files are empty indicates that grep is not finding any matches to your pattern.
Why not? Because the pattern in your grep command ...
grep "'${variable}'" \
... doesn't mean what you appear to think it means. You have taken some pains to get literal apostrophes (') into the pattern, but these have no special meaning in that context. Your pattern does not match any lines because in the data, there are no apostrophes around the appearances of the target strings.
This would be better:
grep -F -e "${variable}" \
The -F option tells grep to treat the pattern as a fixed string to match, so that nothing within is interpreted as a regex metacharacter. The -e ensures that the pattern is interpreted as such, even if, for example, it begins with a - character. The double quotes remain, as they are required to ensure that the shell does not perform word splitting on the expanded result, and of course the inner apostrophes are gone, since they were causing the main problem.
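Putting the fix together, here is a minimal sketch of the corrected loop. It is not the asker's exact setup: the function view_sample, the file reads.txt and its two sample lines are hypothetical stand-ins for `samtools view sample.bam`, the output files are named .txt because grep writes plain text, and the list is read with `while read` rather than `$(cat ...)`.

```shell
# Work in a scratch directory so the demo leaves nothing behind.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-ins for variables.txt and the samtools output.
printf '%s\n' mmu-miR-132-3p mmu-let-7g-5p > variables.txt
printf '%s\n' \
  '1422982-2_CCCCGC.mmu-miR-132-3p 0 chr1 11301726 60 97M' \
  '7238520-1_CATAAT.mmu-miR-125b-5p 0 chr1 11301523 60 75M' > reads.txt

# Assumption: for real data this would be `samtools view sample.bam`.
view_sample() { cat reads.txt; }

while IFS= read -r variable; do
  # -F: treat the pattern as a fixed string; -e: the next argument is the pattern.
  # grep exits 1 when nothing matches, which is not an error here.
  view_sample | grep -F -e "${variable}" > "sample.${variable}.txt" || true
done < variables.txt

# One file per variable; only the first one has content in this demo.
ls -l sample.*.txt
```

With the apostrophes gone, sample.mmu-miR-132-3p.txt receives the one matching line, while sample.mmu-let-7g-5p.txt stays empty because no read carries that name.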


update the first character of line if line contains two strings

I am looking for a script that checks whether a line contains two strings and, if so, checks whether the first character of that line is a certain character and removes it.
E.g. if a line contains the two strings "abc" and "xyz", it should look at the first character of the line and, if it is #, remove it, and vice versa.
I tried running the command below against my crontab and got this result:
crontab -l | grep az2-er32-cxv-iz| grep aze
Output
#5,10 * * * * /opt/apps/scripts/dsm-rync -q -del -s az2-er32-cxv-iz /opt/apps/sdl/scripts/aze-dsm-rync.app.config
Since it is difficult to update the crontab entry directly, I copied it to tmpfile.
I ran crontab -l > tmpfile and then tried sed 's/^#//' tmpfile, but it removes the leading # from every line instead of only from the line matching both strings.
You may use GNU awk to do this easily:
awk -i inplace '/az2-er32-cxv-iz/ && /aze/{sub(/^#/, "")} 1' crontab
This will remove # from first position if line has az2-er32-cxv-iz and aze in it.
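For an awk without the inplace extension, the same program can write to a second file instead. The sketch below assumes the crontab was dumped to tmpfile as in the question; the second crontab line and the name tmpfile.new are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical crontab dump: one line with both strings, one unrelated entry.
cat > tmpfile <<'EOF'
#5,10 * * * * /opt/apps/scripts/dsm-rync -q -del -s az2-er32-cxv-iz /opt/apps/sdl/scripts/aze-dsm-rync.app.config
#0 1 * * * /opt/apps/scripts/other-job
EOF

# Strip a leading # only on lines containing both strings; the trailing 1 prints every line.
awk '/az2-er32-cxv-iz/ && /aze/ { sub(/^#/, "") } 1' tmpfile > tmpfile.new

cat tmpfile.new
```

Only the first line loses its #. After checking the result, it could be installed with crontab tmpfile.new.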
As mentioned by Shloim, I won't give you the code itself, but here is some pseudo-code to get you started:
To find out whether a line contains at least two words, you might search for a space within that line (grep " " <filename>).
To find out whether a line starts with a certain character, you might search for the beginning-of-line anchor followed by that character (grep "<beginning_of_line>#").
Replacing a character in a line can be done using the sed command.
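One way these hints might come together in sed alone is sketched here: an address selects lines containing the first string, a nested address narrows that to lines also containing the second, and ^ anchors the substitution to the start of the line. The crontab contents and the file names enabled and disabled are assumptions for the demo.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical copy of the crontab, as produced by `crontab -l > tmpfile`.
cat > tmpfile <<'EOF'
#5,10 * * * * /opt/apps/scripts/dsm-rync -q -del -s az2-er32-cxv-iz /opt/apps/sdl/scripts/aze-dsm-rync.app.config
#0 1 * * * /opt/apps/scripts/other-job
EOF

# Lines matching the first string, and among those the second string:
# delete a # at the beginning of the line.
sed '/az2-er32-cxv-iz/{/aze/s/^#//;}' tmpfile > enabled

# The reverse direction: put a # back in front of the same line.
sed '/az2-er32-cxv-iz/{/aze/s/^/#/;}' enabled > disabled

head -n 1 enabled
```

The unrelated second line keeps its # in both files, which is the difference from running sed 's/^#//' on the whole file.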

Grep for specific numbers within a text file and output per number text file

I have a text file chunk_names.txt that looks like this:
chr1_12334_64321
chr1_134435_77474
chr10_463252_74754
chr10_54265_423435
chr13_5464565_547644567
This is an example but all chromosomes are represented (1...22, X and Y). All entries follow the same format: chr{1..22, X or Y}_<string of numbers>_<string of numbers>.
I would like to split these into per chromosome files e.g. all of the chunks starting chr10 to be put into a file called chr10.txt:
In Linux I have tried :
for i in {1..22}
do
grep chr$i chunk_names.txt > chr$i.txt
done
However, the chr1.txt output file now contains all the chromosome chunks with 1 in them (1,10,11,12, etc).
How would I modify this script to separate out the chromosomes?
I also haven't tackled how to include chromosome X or Y within the same script and am currently running that separately
Things I have tried :
grep -o gives me just "chr$i" as an output
grep 'chr$i' gives me blank files
grep "chr$i" has the initial problem
Many thanks for your time.
Your 'for' loop will mean parsing your file N times (where N is the number of chromosomes/contigs in your list). Here's an agnostic approach using awk that will parse the file just once:
awk -F '_' '{ print > $1 ".txt" }' chunk_names.txt
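A quick way to try the single-pass approach on the sample data is sketched below. The chrX line is invented to show that X and Y need no special handling, and the parentheses around the file name expression are my addition, since some awk implementations parse an unparenthesized concatenation after > differently.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample entries from the question plus one hypothetical chrX entry.
printf '%s\n' chr1_12334_64321 chr1_134435_77474 chr10_463252_74754 \
  chr10_54265_423435 chr13_5464565_547644567 chrX_100_200 > chunk_names.txt

# Field 1 (the text before the first _) names the output file.
awk -F '_' '{ print > ($1 ".txt") }' chunk_names.txt

cat chr1.txt
```

Because the file name is the whole first field, chr1 entries and chr10 entries cannot end up in the same file.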
If you include the _ following the number you can distinguish between chr1_ and e.g. chr10_. To include X and Y, simply include these in the loop
for i in {1..22} X Y
do
grep "chr${i}_" chunk_names.txt > chr$i.txt
done
To search at the beginning of the line only you can add a leading ^ to the pattern
grep "^chr${i}_" chunk_names.txt > chr$i.txt
Explanation about your attempts:
grep chr$i searches for the pattern anywhere in the line. The shell replaces $i with the value of the variable i, so you get chr1, chr2 etc.
If you enclose the pattern in double quotes as grep "chr$i" the shell will not do any file name globbing or splitting of the string, but still expand variables. In your case it is the same as without quotes.
If you use single quotes, the shell takes the literal string as is, so you always search for a line that contains chr$i (instead of chr1 etc.) which does not occur in your file.
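The three cases can be compared by counting matches, as in this small sketch. It uses three of the sample lines from the question; the variable names loose, literal and strict are just labels for the demo.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' chr1_12334_64321 chr10_463252_74754 chr13_5464565_547644567 > chunk_names.txt

i=1
# grep -c prints the number of matching lines; `|| true` because grep exits 1 on zero matches.
loose=$(grep -c "chr$i" chunk_names.txt || true)       # chr1, chr10 and chr13 all contain chr1
literal=$(grep -c 'chr$i' chunk_names.txt || true)     # searches for the text chr$i itself
strict=$(grep -c "^chr${i}_" chunk_names.txt || true)  # only chr1_ at the start of a line

printf 'loose=%s literal=%s strict=%s\n' "$loose" "$literal" "$strict"
# prints: loose=3 literal=0 strict=1
```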
Explanation about quotes:
The quotes in my proposed solution are not necessary in your case, but it is a good habit to quote everything. If your pattern contained spaces or characters that are special to the shell, the quoting would make a difference.
Example:
If your file contained chr1* instead of chr1_, the unquoted pattern chr${i}* would be replaced by the shell with the list of matching file names.
When you already created your output files chr1.txt etc., try these commands
$ i=1; echo chr$i*
chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt
$ i=1; echo "chr$i*"
chr1*
In the first case, the grep command
grep chr${i}* chunk_names.txt
would be expanded as
grep chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt chunk_names.txt
which would search for the pattern chr10.txt in files chr11.txt ... chr1.txt and chunk_names.txt.

Replace text in a file in bash with / in the search string

How can I replace the following text in a file in Linux with a different line?
Current :
0 22 * * * /scripts/application_folder_backup.sh >> /var/log/application_folder_backup.log
Replacement line : #line_removed
I tried using sed but my text in the file already has a / which is causing problems. I tried storing the string in a variable too. But it doesn't work
#!/bin/bash
var="0 22 * * * /scripts/application_folder_backup.sh >> /var/log/application_folder_backup.log"
sed -i -e 's/$var/#line_removed/g' /tmp/k1.txt
exit
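The / is not the only problem here: * and every other regex metacharacter in the string will also cause trouble, because sed treats its search pattern as a regex. In addition, $var is never expanded inside the single quotes used in your sed command.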
Better to use this non-regex based awk command:
awk -v var="$var" 'index($0, var) { $0 = "#line_removed" } 1' file
#line_removed
index function in awk uses plain text search instead of a regex based search.
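To see the plain-text match in action, the sketch below builds a two-line file and runs the same awk program on it. The file names k1.txt and k1.new and the first crontab line are assumptions for the demo. One caveat: awk processes backslash escapes in values passed with -v, which does not matter here because the string contains none.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

var='0 22 * * * /scripts/application_folder_backup.sh >> /var/log/application_folder_backup.log'

# Hypothetical file: one unrelated entry followed by the line to replace.
printf '%s\n' '30 1 * * * /scripts/other.sh' "$var" > k1.txt

# index() returns the position of var inside the line, or 0 if it is absent,
# so *, . and / are compared as ordinary characters.
awk -v var="$var" 'index($0, var) { $0 = "#line_removed" } 1' k1.txt > k1.new

cat k1.new
```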
Before feeding the contents of $var to the sed script, I suggest escaping every / as \/, as in:
var_esc=$(echo "$var" | sed 's/\//\\\//g')
But the sed expression gets very complicated (you must use single quotes if you don't want to double the \ chars, as double quotes also interpret backslashes).
Another thing that could simplify the expression is to change the regexp delimiter (by starting the expression with a different char), as in:
var_esc=$(echo "$var" | sed 's:/:\\/:g')
The idea is to escape, before substituting the $var variable, all the characters that could interfere (you don't know a priori whether there will be any, since the pattern comes from a variable) and then use the escaped value in the final command:
sed "s/$var_esc/#line_removed/g"
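Escaping only the / still leaves * and . active as regex operators, so with the string from the question the pattern would not match the literal asterisks. A sketch of a broader escape, assuming basic regular expressions and / as the delimiter of the final command, might look like this (the second input line and the variable name result are made up for the demo):

```shell
var='0 22 * * * /scripts/application_folder_backup.sh >> /var/log/application_folder_backup.log'

# Put a backslash in front of each character that is special in a basic
# regex ( ] [ \ . * ^ $ ) and in front of the / delimiter.
# | is used as the delimiter of this helper command so / needs no escaping here.
var_esc=$(printf '%s\n' "$var" | sed 's|[][\.*^$/]|\\&|g')

# Double quotes let the shell expand $var_esc; the expanded text is passed to sed as is.
result=$(printf '%s\n' "$var" '5 4 * * * /scripts/keep.sh' | sed "s/$var_esc/#line_removed/")

printf '%s\n' "$result"
```

The first input line becomes #line_removed and the second passes through unchanged.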
Finally, if you don't want to substitute the line, but just to erase it, you can do something like this:
sed "/$var_esc/d"
and that will erase all lines that match your $var_esc contents (d is the command that deletes the line from the output). Here the / introduces the address that selects matching lines; to use a different delimiter in an address, it has to be preceded by a backslash, as in \:regex:d.
Another way to delete the line in your file is to call ex directly and pass the editor command as input:
ex file <<EOF <-- edit "file" with the commands that follow until EOF is found.
/$var_esc <-- search for first occurrence of $var_esc
d <-- delete line
w <-- write file
EOF <-- eof marker.

How do I use basic grep commands in Unix?

I need to display all the lines using the grep command that contain 2-6 'x's
Also need to know how to display all lines with 3 consecutive 'x's
I have tried grep x{2,6} example.txt but I keep getting an error saying that x6 is not found in the directory. My example file contains 7 lines increasing in the amount of 'x's by one in each line.
The Bash shell uses Brace Expansion to expand:
grep x{2,6} example.txt
into:
grep x2 x6 example.txt
Unless you have a file called x6 in your directory, you will get an error from grep telling you it can't open it.
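The expansion is easy to observe by letting echo print the arguments grep would receive. This sketch calls bash explicitly, on the assumption that it is installed, because brace expansion is a bash feature that a plain sh may lack.

```shell
# Unquoted: bash expands the braces before the command runs.
bash -c 'echo grep x{2,6} example.txt'
# prints: grep x2 x6 example.txt

# Quoted: the pattern reaches the command intact.
bash -c "echo grep 'x{2,6}' example.txt"
# prints: grep x{2,6} example.txt
```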
Rule 1: enclose regular expressions to grep inside quotes — single quotes whenever possible.
Hence, use:
grep 'x{2,6}' example.txt
This deals with getting a regex to grep. Now we need to consider what it means. By default, this means look for the characters x, {, 2, ,, 6, } on a single line. Adding the -E option uses extended regular expressions, and the command looks for anything from 2 to 6 consecutive x's on a single line in the file:
grep -E 'x{2,6}' example.txt
However, it might be worth noting that this is pretty much the same as selecting 'xx' unless you have colouration on, or are selecting 'only' the matched text (the GNU grep extension -o option).
These are all for 2-6 adjacent x's, which is roughly what your proposed regex wanted.
You ask about three adjacent x's:
grep 'xxx' example.txt
The single quotes aren't 100% necessary, but they do no harm and remind you to use them for the regex in general.
Now we face the dilemma that you probably meant "between 2 and 6 x's on a single line, not necessarily adjacent, and not 0 or 1, nor 7 or more".
Rule 2: describe your required result precisely
Imprecise requirements lead to incorrect, or unintended, results. Meeting that requirement needs a more complex regex:
grep -E '^([^x]*x){2,6}[^x]*$' example.txt
That looks for 2-6 occurrences of zero or more non-x's followed by an x at the start of the line, followed by zero or more non-x's up to the end of line.
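The difference between the patterns shows up on a file like the one described in the question. The sketch below rebuilds it as seven lines of 1 to 7 x's, which is an assumption about the file's exact contents, and counts the matching lines for each pattern.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Assumed example.txt: one more x on each line, from 1 to 7.
printf '%s\n' x xx xxx xxxx xxxxx xxxxxx xxxxxxx > example.txt

# `|| true` because grep exits 1 when the count is zero.
basic=$(grep -c 'x{2,6}' example.txt || true)        # basic regex: literal x{2,6}
adjacent=$(grep -c -E 'x{2,6}' example.txt || true)  # any line containing xx
total=$(grep -c -E '^([^x]*x){2,6}[^x]*$' example.txt || true)  # 2 to 6 x's in the whole line

printf 'basic=%s adjacent=%s total=%s\n' "$basic" "$adjacent" "$total"
# prints: basic=0 adjacent=6 total=5
```

The unanchored pattern also accepts the 7-x line, since it contains a run of six; only the anchored form rejects it.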
I need to display all the lines using GREP command that contain 2-6 'x's
grep -P '^(?:[^x]*x[^x]*){2,6}$' file
Also need to know how to display all lines with 3 consecutive 'x's
grep -P 'xxx' file

Convert string to hexadecimal on command line

I'm trying to convert "Hello" to 48 65 6c 6c 6f in hexadecimal as efficiently as possible using the command line.
I've tried looking at printf and google, but I can't get anywhere.
Any help greatly appreciated.
Many thanks in advance,
echo -n "Hello" | od -A n -t x1
Explanation:
The echo program will provide the string to the next command.
The -n flag tells echo to not generate a new line at the end of the "Hello".
The od program is the "octal dump" program. (We will be providing a flag to tell it to dump it in hexadecimal instead of octal.)
The -A n flag is short for --address-radix=n, with n being short for "none". Without this part, the command would output an ugly numerical address prefix on the left side. This is useful for large dumps, but for a short string it is unnecessary.
The -t x1 flag is short for --format=x1, with the x being short for "hexadecimal" and the 1 meaning 1 byte.
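The effect of suppressing the trailing newline can be checked by dumping the string with and without it. The sketch uses printf instead of echo -n, a substitution on my part because not every shell's echo supports -n; the variable names are made up. The spacing of od's output varies between implementations, so whitespace is collapsed before printing.

```shell
# Without a trailing newline: just the five bytes of Hello.
without_nl=$(printf '%s' 'Hello' | od -A n -t x1 | tr -s ' \n' ' ')

# With a trailing newline: an extra 0a byte appears at the end.
with_nl=$(printf '%s\n' 'Hello' | od -A n -t x1 | tr -s ' \n' ' ')

printf '%s\n' "$without_nl" "$with_nl"
```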
If you want to do this and remove the spaces you need:
echo -n "Hello" | od -A n -t x1 | sed 's/ *//g'
The first two commands in the pipeline are well explained by @TMS in his answer, as edited by @James. The last command differs from @TMS's comment in that it is both correct and has been tested. The explanation is:
sed is a stream editor.
s is the substitute command.
/ opens a regular expression - any character may be used. / is conventional, but inconvenient for processing, say, XML or path names.
/ or the alternate character you chose, closes the regular expression and opens the substitution string.
In / */ the * matches any sequence of the previous character (in this case, a space).
/ or the alternate character you chose, closes the substitution string. In this case, the substitution string // is empty, i.e. the match is deleted.
g is the option to do this substitution globally on each line instead of just once for each line.
The quotes keep the command parser from getting confused - the whole sequence is passed to sed as the first option, namely, a sed script.
@TMS's brain child (sed 's/^ *//') only strips spaces from the beginning of each line (^ matches the beginning of the line - 'pattern space' in sed-speak).
If you additionally want to remove newlines, the easiest way is to append
| tr -d '\n'
to the command pipes. It functions as follows:
| feeds the previously processed stream to this command's standard input.
tr is the translate command.
-d specifies deleting the match characters.
Quotes list your match characters - in this case just newline (\n).
Translate only matches single characters, not sequences.
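Chained together, the steps above give one unbroken hex string. This is a sketch of the full pipeline, again with printf standing in for echo -n as a portability assumption; the variable name hex is made up.

```shell
# od dumps the bytes, sed deletes the spaces, tr deletes the newline.
hex=$(printf '%s' 'Hello' | od -A n -t x1 | sed 's/ *//g' | tr -d '\n')

printf '%s\n' "$hex"
# prints: 48656c6c6f
```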
sed is uniquely awkward when dealing with newlines. This is because sed is one of the oldest unix commands - it was created before people really knew what they were doing. Pervasive legacy software keeps it from being fixed. I know this because I was born before unix was born.
The historical origin of the problem was the idea that a newline was a line separator, not part of the line. It was therefore stripped by line processing utilities and reinserted by output utilities. The trouble is, this makes assumptions about the structure of user data and imposes unnatural restrictions in many settings. sed's inability to easily remove newlines is one of the most common examples of that malformed ideology causing grief.
It is possible to remove newlines with sed - it is just that all solutions I know about make sed process the whole file at once, which chokes for very large files, defeating the purpose of a stream editor. Any solution that retains line processing, if it is possible, would be an unreadable rat's nest of multiple pipes.
If you insist on using sed try:
sed -z 's/\n//g'
-z tells sed to use nulls as line separators.
Internally, a string in C is terminated with a null. The -z option is also a result of legacy, provided as a convenience for C programmers who might like to use a temporary file filled with C-strings and uncluttered by newlines. They can then easily read and process one string at a time. Again, the early assumptions about use cases impose artificial restrictions on user data.
If you omit the g option, this command removes only the first newline. With the -z option sed interprets the entire file as one line (unless there are stray nulls embedded in the file), terminated by a null and so this also chokes on large files.
You might think
sed 's/^/\x00/' | sed -z 's/\n//' | sed 's/\x00//'
might work. The first command puts a null at the front of each line on a line by line basis, resulting in \n\x00 ending every line. The second command removes one newline from each line, now delimited by nulls - there will be only one newline by virtue of the first command. All that is left are the spurious nulls. So far so good. The broken idea here is that the pipe will feed the last command on a line by line basis, since that is how the stream was built. Actually, the last command, as written, will only remove one null since now the entire file has no newlines and is therefore one line.
Simple pipe implementation uses an intermediate temporary file and all input is processed and fed to the file. The next command may be running in another thread, concurrently reading that file, but it just sees the stream as a whole (albeit incomplete) and has no awareness of the chunk boundaries feeding the file. Even if the pipe is a memory buffer, the next command sees the stream as a whole. The defect is inextricably baked into sed.
To make this approach work, you need a g option on the last command, so again, it chokes on large files.
The bottom line is this: don't use sed to process newlines.
echo hello | hexdump -v -e '/1 "%02X "'
Playing around with this further,
A working solution is to remove the "*": it is unnecessary, both for the original requirement of simply removing spaces and when substituting an actual character for each space, as follows:
echo -n "Hello" | od -A n -t x1 | sed 's/ /%/g'
%48%65%6c%6c%6f
So, I consider this as an improvement answering the original Q since the statement now does exactly what is required, not just apparently.
Combining the answers from TMS and i-always-rtfm-and-stfw, the following works under Windows using gnu-utils versions of the programs 'od', 'sed', and 'tr':
echo "Hello"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
or in a CMD file as:
@echo "%1"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
A limitation on my solution is it will remove all double quotes (").
"tr -d '\42'" removes quote marks that the Windows 'echo' will include.
"tr -d '\r'" removes the carriage return, which Windows includes as well as '\n'.
The pipe (|) character must follow immediately after the string or the Windows echo will add that space after the string.
There is no '-n' switch to the Windows echo command.
