Grep for specific numbers within a text file and output per number text file - linux

I have a text file chunk_names.txt that looks like this:
chr1_12334_64321
chr1_134435_77474
chr10_463252_74754
chr10_54265_423435
chr13_5464565_547644567
This is an example, but all chromosomes are represented (1...22, X and Y). All entries follow the same format: chr{1..22, X or Y}_*string of numbers*_*string of numbers*.
I would like to split these into per chromosome files e.g. all of the chunks starting chr10 to be put into a file called chr10.txt:
In Linux I have tried :
for i in {1..22}
do
grep chr$i chunk_names.txt > chr$i.txt
done
However, the chr1.txt output file now contains all the chromosome chunks with 1 in them (1,10,11,12, etc).
How would I modify this script to separate out the chromosomes?
I also haven't tackled how to include chromosome X or Y within the same script and am currently running that separately
Things I have tried :
grep -o gives me just "chr$i" as an output
grep 'chr$i' gives me blank files
grep "chr$i" has the initial problem
Many thanks for your time.

Your 'for' loop will mean parsing your file N times (where N is the number of chromosomes/contigs in your list). Here's an agnostic approach using awk that will parse the file just once:
awk -F '_' '{ print > ($1 ".txt") }' chunk_names.txt
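As a rough illustration of the single-pass idea, the sketch below runs the awk command on the five sample lines from the question inside a scratch directory. The scratch directory and the final wc call are additions for demonstration only.

```shell
# Sketch: run the awk one-liner on the question's sample data in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' \
  chr1_12334_64321 \
  chr1_134435_77474 \
  chr10_463252_74754 \
  chr10_54265_423435 \
  chr13_5464565_547644567 > chunk_names.txt

# -F '_' makes $1 the chromosome label; each line is written to <label>.txt.
# The parentheses keep the redirection target unambiguous across awk versions.
awk -F '_' '{ print > ($1 ".txt") }' chunk_names.txt

# Show how many lines landed in each per-chromosome file.
wc -l chr1.txt chr10.txt chr13.txt
```

Because the file name comes from the first field, X and Y need no special handling in this approach.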

If you include the _ following the number you can distinguish between chr1_ and e.g. chr10_. To include X and Y, simply include these in the loop
for i in {1..22} X Y
do
grep "chr${i}_" chunk_names.txt > chr$i.txt
done
To search at the beginning of the line only you can add a leading ^ to the pattern
grep "^chr${i}_" chunk_names.txt > chr$i.txt
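To see the effect of the anchor and the trailing underscore, this small sketch counts matches for both patterns on the sample lines from the question. The variable names are just for illustration.

```shell
# Sample lines taken from the question.
sample='chr1_12334_64321
chr1_134435_77474
chr10_463252_74754
chr10_54265_423435
chr13_5464565_547644567'

# Unanchored: "chr1" is also a prefix of chr10 and chr13, so every line matches.
loose=$(printf '%s\n' "$sample" | grep -c 'chr1')

# Anchored, with the trailing underscore: only genuine chr1 chunks match.
strict=$(printf '%s\n' "$sample" | grep -c '^chr1_')

echo "loose=$loose strict=$strict"   # prints: loose=5 strict=2
```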
Explanation about your attempts:
grep chr$i searches for the pattern anywhere in the line. The shell replaces $i with the value of the variable i, so you get chr1, chr2 etc.
If you enclose the pattern in double quotes as grep "chr$i" the shell will not do any file name globbing or splitting of the string, but still expand variables. In your case it is the same as without quotes.
If you use single quotes, the shell takes the literal string as is, so you always search for a line that contains chr$i (instead of chr1 etc.) which does not occur in your file.
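The three quoting variants can be compared directly. A minimal sketch, with variable names chosen only for this demonstration:

```shell
i=1

unquoted=chr$i     # the shell expands $i, giving chr1
double="chr$i"     # expansion still happens inside double quotes, giving chr1
single='chr$i'     # single quotes keep the text literal, giving chr$i

printf '%s\n' "$unquoted" "$double" "$single"
```

The third line of output is the literal text chr$i, which is why the single-quoted grep produced empty files.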
Explanation about quotes:
The quotes in my proposed solution are not necessary in your case, but it is a good habit to quote everything. If your pattern contained spaces or characters that are special to the shell, the quoting would make a difference.
Example:
If your file contained chr1* instead of chr1_, the unquoted pattern chr${i}* would be replaced by the list of matching file names.
When you already created your output files chr1.txt etc., try these commands
$ i=1; echo chr$i*
chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt
$ i=1; echo "chr$i*"
chr1*
In the first case, the grep command
grep chr${i}* chunk_names.txt
would be expanded as
grep chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt chunk_names.txt
which would search for the pattern chr10.txt in files chr11.txt ... chr1.txt and chunk_names.txt.

Related

How to add character at the end of specific line in UNIX/LINUX?

Here is my input file. I want to add the character ":" to the end of every line that begins with ">". I tried sed -i 's|$|:|' input.txt, but ":" was added to the end of every line. It is also hard to refer to specific line numbers because the lines containing ">" fall on different line numbers in each of my input files. I want to run this in a loop over multiple files, so hard-coding line numbers is useless.
>Pas_pyrG_2
AAAGTCACAATGGTTAAAATGGATCCTTATATTAATGTCGATCCAGGGACAATGAGCCCA
TTCCAGCATGGTGAAGTTTTTGTTACCGAAGATGGTGCAGAAACAGATCTGGATCTGGGT
>Pas_rpoB_4
CAAACTCACTATGGTCGTGTTTGTCCAATTGAAACTCCTGAAGGTCCAAACATTGGTTTG
ATCAACTCGCTTTCTGTATACGCAAAAGCGAATGACTTCGGTTTCTTGGAAACTCCATAC
CGCAAAGTTGTAGATGGTCGTGTAACTGATGATGTTGAATATTTATCTGCAATTGAAGAA
>Pas_cpn60_2
ATGAACCCAATGGATTTAAAACGCGGTATCGACATTGCAGTAAAAACTGTAGTTGAAAAT
ATCCGTTCTATTGCTAAACCAGCTGATGATTTCAAAGCAATTGAACAAGTAGGTTCAATC
TCTGCTAACTCTGATACTACTGTTGGTAAACTTATTGCTCAAGCAATGGAAAAAGTAGGT
AAAGAAGGCGTAATCACTGTAGAAGAAGGCTCAGGCTTCGAAGACGCATTAGACGTTGTA
Here is the expected output file:
>Pas_pyrG_2:
AAAGTCACAATGGTTAAAATGGATCCTTATATTAATGTCGATCCAGGGACAATGAGCCCA
TTCCAGCATGGTGAAGTTTTTGTTACCGAAGATGGTGCAGAAACAGATCTGGATCTGGGT
>Pas_rpoB_4:
CAAACTCACTATGGTCGTGTTTGTCCAATTGAAACTCCTGAAGGTCCAAACATTGGTTTG
ATCAACTCGCTTTCTGTATACGCAAAAGCGAATGACTTCGGTTTCTTGGAAACTCCATAC
CGCAAAGTTGTAGATGGTCGTGTAACTGATGATGTTGAATATTTATCTGCAATTGAAGAA
>Pas_cpn60_2:
ATGAACCCAATGGATTTAAAACGCGGTATCGACATTGCAGTAAAAACTGTAGTTGAAAAT
ATCCGTTCTATTGCTAAACCAGCTGATGATTTCAAAGCAATTGAACAAGTAGGTTCAATC
TCTGCTAACTCTGATACTACTGTTGGTAAACTTATTGCTCAAGCAATGGAAAAAGTAGGT
AAAGAAGGCGTAATCACTGTAGAAGAAGGCTCAGGCTTCGAAGACGCATTAGACGTTGTA
Does sed have more options for this, or can other commands solve this problem?
sed -i '/^>/ s/$/:/' input.txt
This selects the input lines that match ^> (regex for "starts with the > character") and, on those lines only, substitutes : at the end of the line (you got this part right).
/ is the conventional separator character in sed's s command, but any other single character works as well, so s|$|:| is fine as long as the expression is quoted; an unquoted | would be interpreted by the shell as a pipe. It's simplest to stick with / unless the pattern itself contains slashes, in which case things get unwieldy.
Be careful with sed -i. Make a backup - make sure you know what's changing by using diff to compare the files.
On OSX (BSD sed), -i requires an argument: the backup suffix, e.g. -i.bak, or an empty string (-i '') for no backup.
Using ed to edit the file:
printf "%s\n" 'g/^>/s/$/:/' w | ed -s input.txt
For every line starting with >, add a colon to the end, and then write the changed file back to disk.
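The advice about backups and diff can be tried on a small file. This is only a sketch: the scratch directory and the shortened sequence lines are made up for the demonstration, and the .bak suffix is an arbitrary choice.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened stand-in for the FASTA-style input in the question.
printf '%s\n' '>Pas_pyrG_2' 'AAAGTCACAATGG' '>Pas_rpoB_4' 'CAAACTCACTATGG' > input.txt

# -i.bak edits in place and keeps the original as input.txt.bak
# (attaching the suffix directly works with both GNU and BSD sed).
sed -i.bak '/^>/ s/$/:/' input.txt

# diff exits non-zero when the files differ, so "|| true" keeps a script going.
diff input.txt.bak input.txt || true
```

Only the header lines should show up in the diff; the sequence lines are left untouched.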

Extracting lines from file using grep in a for loop, exporting to new file with variable in file name

I am trying to extract all lines from a file that contain a string using a for loop with a file that contains a list of possible strings. I also want to export the results of grep to a new file with the variable in the file name.
Here is what I have:
file="variables.txt"
listofvariables=$(cat ${file})
for variable in ${listofvariables}
do
samtools view sample.bam | \
grep "'${variable}'" \
> sample.${variable}.bam
done
What this code does is simply make a blank file for every variable. Why isn't grep extracting lines that contain that variable and putting it into those files?
For reference, here is what the variables.txt file looks like:
mmu-let-7g-5p
mmu-let-7g-3p
mmu-let-7i-5p
mmu-let-7i-3p
mmu-miR-1a-1-5p
mmu-miR-1a-3p
mmu-miR-15b-5p
mmu-miR-15b-3p
mmu-miR-23b-5p
mmu-miR-23b-3p
And here is what the samtools view output looks like:
7238520-1_CATAAT.mmu-miR-125b-5p 0 chr1 11301523 60 75M * 0 0CAGGTGTTTTCTCAGGCATTTGGATTTCTATAGAATCATAGTATTAAAATTTCAAAGTAATAACATTGCTTTTTA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:75 YT:Z:UU NH:i:1
1422982-2_CCCCGC.mmu-miR-132-3p 0 chr1 11301726 60 97M * 0 0 AAGTCTGTTTTTATGTGAGTGTTCCTGTGAAACTGAGGTCTGATGACTCTTCCTTAAGCAATTACAACTTCATTAGCATACATAAGGTTCAATTAAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:97 YT:Z:UU NH:i:1
5675450-1_CCCCGC.mmu-miR-132-3p 0 chr1 11301726 60 97M * 0 0 AAGTCTGTTTTTATGTGAGTGTTCGTGTGAAACTGAGGTCTGATGACTCTTCCTTAAGCAATTACAACTTC^C
For those who may be unfamiliar, samtools view simply prints out the contents of the .bam file. You can think of it like cat.
Thanks in advance!
Since ...
What this code does is simply make a blank file for every variable.
... you know that your variables file is being read correctly, and your for loop is correctly iterating over the results. That the resulting files are empty indicates that grep is not finding any matches to your pattern.
Why not? Because the pattern in your grep command ...
grep "'${variable}'" \
... doesn't mean what you appear to think it means. You have taken some pains to get literal apostrophes (') into the pattern, but these have no special meaning in that context. Your pattern does not match any lines because in the data, there are no apostrophes around the appearances of the target strings.
This would be better:
grep -F -e "${variable}" \
The -F option tells grep to treat the pattern as a fixed string to match, so that nothing within is interpreted as a regex metacharacter. The -e ensures that the pattern is interpreted as such, even if, for example, it begins with a - character. The double quotes remain, as they are required to ensure that the shell does not perform word splitting on the expanded result, and of course the inner apostrophes are gone, since they were causing the main problem.
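Put together, the corrected loop might look like the sketch below. To keep it runnable without samtools, a small text file with made-up, shortened records stands in for the output of samtools view sample.bam, and the output files get a .txt extension because what grep writes is plain text rather than BAM. The while read loop is a substitution for the for loop over $(cat ...), chosen here to avoid word splitting.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for variables.txt (two names from the question).
printf '%s\n' mmu-miR-15b-5p mmu-miR-23b-3p > variables.txt

# Hypothetical stand-in for the text printed by `samtools view sample.bam`.
printf '%s\n' \
  'read1.mmu-miR-15b-5p 0 chr1 100' \
  'read2.mmu-miR-23b-3p 0 chr1 200' \
  'read3.mmu-miR-15b-5p 0 chr2 300' > sample.sam.txt

# Read one name per line; -F matches it as a fixed string, -e marks it as the pattern.
while IFS= read -r variable; do
  grep -F -e "$variable" sample.sam.txt > "sample.$variable.txt"
done < variables.txt

wc -l sample.mmu-miR-15b-5p.txt sample.mmu-miR-23b-3p.txt
```

With real data, the grep input would come from the samtools view pipe instead of the stand-in file.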

How to print the longest word in a file by using combination of grep and wc

I am trying to find the longest word in a text file.
I found the number of characters in the longest word in the file
by using the command
wc -L
I need to print the longest word by using this number and the grep command.
If you must use the two commands given, I'd suggest:
grep -E ".{$(wc -L < test.txt)}" test.txt
The command substitution builds the brace expression that matches the line(s) with exactly the given number of characters. Note that wc -L reports the length of the longest line, so this finds the longest word only when the file holds one word per line. -E is needed to enable extended regular expression support; otherwise, the braces need to be escaped: grep ".\{...\}" test.txt.
Using an awk command that makes a single pass through the file may be faster.
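One possible shape for such an awk command is sketched below. The function name longest_word is made up for this example, words are taken to be whitespace-separated fields, and on a tie the first word seen wins.

```shell
longest_word() {
  # Walk over every whitespace-separated field and remember the longest one seen so far.
  awk '{
         for (i = 1; i <= NF; i++)
           if (length($i) > max) { max = length($i); word = $i }
       }
       END { if (max > 0) print word }' "$@"
}

# Reads the named files, or standard input when no file is given.
printf '%s\n' 'the quick brown' 'hippopotamus ran' 'far away' | longest_word
```

Unlike the wc -L approach, this does not require one word per line.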

linux script shell : grep a part of path in a list of path

In my shell script, I have 2 files. The first one contains only file names with part of their path:
list1:
aaa/bbb/file1.ext
ccc/ddd/file2.ext
eee/fff/file3.ext
The second one is a list of every file with the extension ".ext", with the absolute path in front:
list2:
/home/.../aaa/bbb/file1.ext
...
...
...
/home/...ccc/ddd/file2.ext
...
And I am trying to extract the lines of the second file, list2, that contain the lines of the first one, using grep.
So far I have tried:
while read line
do
grep "$line" "list1"
done < list2
But this command doesn't output anything, whereas the command
grep "aaa/bbb/file1.ext" "list1"
gives the output I am expecting
/home/.../aaa/bbb/file1.ext
Does anyone see what I am missing in this script? Thanks
This is one of the cases where the -f option of grep comes in very handy (f1 and f2 stand for your list1 and list2):
grep -f f1 f2
For your given input returns:
/home/.../aaa/bbb/file1.ext
/home/...ccc/ddd/file2.ext
From man grep:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero
patterns, and therefore matches nothing. (-f is specified by
POSIX.)
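A small runnable sketch of the -f approach follows. The /home/user prefix and the extra non-matching line are hypothetical, since the question elides the real paths. Adding -F is a suggestion beyond the original command: it treats each pattern as a fixed string, so the "." in file1.ext is not read as a regex wildcard. The loop in the question appears to search for the long absolute paths inside list1, which is the reverse of what is wanted.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' aaa/bbb/file1.ext ccc/ddd/file2.ext > list1

# Hypothetical absolute paths standing in for the elided ones in the question.
printf '%s\n' \
  /home/user/aaa/bbb/file1.ext \
  /home/user/zzz/other.ext \
  /home/user/ccc/ddd/file2.ext > list2

# -f list1 : read one pattern per line from list1
# -F       : match each pattern as a fixed string rather than a regex
matches=$(grep -F -f list1 list2)
printf '%s\n' "$matches"
```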

How do I use basic grep commands in Unix?

I need to display all the lines using the grep command that contain 2-6 'x's
Also need to know how to display all lines with 3 consecutive 'x's
I have tried grep x{2,6} example.txt but I keep getting an error saying that x6 is not found in the directory. My example file contains 7 lines increasing in the amount of 'x's by one in each line.
The Bash shell uses Brace Expansion to expand:
grep x{2,6} example.txt
into:
grep x2 x6 example.txt
Unless you have a file called x6 in your directory, you will get an error from grep telling you it can't open it.
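The expansion is easy to observe with echo. Since brace expansion is a bash feature, this sketch invokes bash explicitly, assuming it is installed; the variable names are only for illustration.

```shell
# Unquoted braces: bash expands x{2,6} into the two words x2 and x6.
unquoted=$(bash -c 'echo x{2,6}')

# Quoted braces: the text reaches the command unchanged.
quoted=$(bash -c "echo 'x{2,6}'")

printf '%s\n' "$unquoted" "$quoted"
```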
Rule 1: enclose regular expressions to grep inside quotes — single quotes whenever possible.
Hence, use:
grep 'x{2,6}' example.txt
This deals with getting a regex to grep. Now we need to consider what it means. By default, this means look for the characters x, {, 2, ,, 6, } on a single line. Adding the -E option uses extended regular expressions, and the command looks for anything from 2 to 6 consecutive x's on a single line in the file:
grep -E 'x{2,6}' example.txt
However, it might be worth noting that this is pretty much the same as selecting 'xx' unless you have colouration on, or are selecting 'only' the matched text (the GNU grep extension -o option).
These are all for 2-6 adjacent x's, which is roughly what your proposed regex wanted.
You ask about three adjacent x's:
grep 'xxx' example.txt
The single quotes aren't 100% necessary, but they do no harm and remind you to use them for the regex in general.
Now we face the dilemma that you probably meant "between 2 and 6 x's on a single line, not necessarily adjacent, and not 0 or 1, nor 7 or more".
Rule 2: describe your required result precisely
Imprecise requirements lead to incorrect, or unintended, results. Meeting that requirement needs a more complex regex:
grep -E '^([^x]*x){2,6}[^x]*$' example.txt
That looks for 2-6 occurrences of zero or more non-x's followed by an x at the start of the line, followed by zero or more non-x's up to the end of line.
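The difference between the two readings shows up on a file like the one described in the question. In this sketch the seven-line example.txt is recreated in a scratch directory, and grep -c is used to count matching lines.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Seven lines holding 1 to 7 x's, as described in the question.
printf '%s\n' x xx xxx xxxx xxxxx xxxxxx xxxxxxx > example.txt

# A run of at least 2 adjacent x's somewhere in the line: lines 2 to 7 match.
adjacent=$(grep -c -E 'x{2,6}' example.txt)

# Between 2 and 6 x's on the whole line: lines 2 to 6 match.
total=$(grep -c -E '^([^x]*x){2,6}[^x]*$' example.txt)

echo "adjacent=$adjacent total=$total"   # prints: adjacent=6 total=5
```

The line with seven x's matches the first pattern because it contains a run of six, but it fails the anchored pattern, which accounts for every x on the line.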
I need to display all the lines using GREP command that contain 2-6 'x's
grep -P '^(?:[^x]*x[^x]*){2,6}$' file
Also need to know how to display all lines with 3 consecutive 'x's
grep -P 'xxx' file
