I have a text file chunk_names.txt that looks like this:
chr1_12334_64321
chr1_134435_77474
chr10_463252_74754
chr10_54265_423435
chr13_5464565_547644567
This is an example but all chromosomes are represented (1...22, X and Y). All entries follow the same formatchr{1..22, X or Y}_*string of numbers*__*string of numbers*.
I would like to split these into per chromosome files e.g. all of the chunks starting chr10 to be put into a file called chr10.txt:
In Linux I have tried :
for i in {1..22}
do
grep chr$i chunk_names.txt > chr$i.txt
done
However, the chr1.txt output file now contains all the chromosome chunks with 1 in them (1,10,11,12, etc).
How would I modify this script to separate out the chromosomes?
I also haven't tackled how to include chromosome X or Y within the same script and am currently running that separately
Things I have tried :
grep -o gives me just "chr$i" as an output
grep 'chr$i' gives me blank files
grep "chr$i" has the initial problem
Many thanks for your time.
Your 'for' loop will mean parsing your file N times (where N is the number of chromosomes/contigs in your list). Here's an agnostic approach using awk that will parse the file just once:
awk -F '_' '{ print > $1 ".txt" }' chunk_names.txt
If you include the _ following the number you can distinguish between chr1_ and e.g. chr10_. To include X and Y, simply include these in the loop
for i in {1..22} X Y
do
grep "chr${i}_" chunk_names.txt > chr$i.txt
done
To search at the beginning of the line only you can add a leading ^ to the pattern
grep "^chr${i}_" chunk_names.txt > chr$i.txt
Explanation about your attempts:
grep chr$i searches for the pattern anywhere in the line. The shell replaces $i with the value of the variable i, so you get chr1, chr2 etc.
If you enclose the pattern in double quotes as grep "chr$i" the shell will not do any file name globbing or splitting of the string, but still expand variables. In your case it is the same as without quotes.
If you use single quotes, the shell takes the literal string as is, so you always search for a line that contains chr$i (instead of chr1 etc.) which does not occur in your file.
Explanation about quotes:
The quotes in my proposed solution are not necessary in your case, but it is a good habit to quote everything. If your pattern would contain spaces or characters that are special to the shell, the quoting will make a difference.
Example:
If your file would contain a chr1* instead of the chr1_, the pattern chr${i}* would be replaced by the list of matching files.
When you already created your output files chr1.txt etc., try these commands
$ i=1; echo chr$i*
chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt
$ i=1; echo "chr$i*"
chr1*
In the first case, the grepcommand
grep chr${i}* chunk_names.txt
would be expanded as
grep chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt chunk_names.txt
which would search for the pattern chr10.txt in files chr11.txt ... chr1.txt and chunk_names.txt.
I want to cut several numbers from a .txt file to add them later up. Here is an abstract from the .txt file:
anonuser pts/25 127.0.0.1 Mon Nov 16 17:24 - crash (10+23:07)
I want to get the "10" before the "+" and I only want the number, nothing else. This number should be written to another .txt file. I used this code, but it only works if the number has one digit:
awk ' /^'anonuser' / {split($NF,k,"[(+0:)][0-9][0-9]");print k[1]} ' log2.txt > log3.txt
With GNU grep:
grep -Po '\(\K[^+]*' file > new_file
Output to new_file:
10
See: PCRE Regex Spotlight: \K
What if you use the match() function in awk?
$ awk '/^anonuser/ && match($NF,/^\(([0-9]*)/,a) {print a[1]}' file
10
How does this work?
/^anonuser/ && match() {print a[1]} if the line starts with anonuser and the pattern is found, print it.
match($NF,/^\(([0-9]*)/,a) in the last field ((10+23:07)), look for the string ( + digits and capture these in the array a[].
Note also that this approach allows you to store the values you capture, so that you can then sum them as you indicate in the question.
The following uses the same approach as the OP, and has a couple of advantages, e.g. it does not require anything special, and it is quite robust (with respect to assumptions about the input) and maintainable:
awk '/^anonuser/ {split($NF,k,/+/); gsub(/[^0-9]/,"",k[1]); print k[1]}'
for anything more complex use awk but for simple task sed is easy enough
sed -r '/^anonuser/{s/.*\(([0-9]+)\+.*/\1/}'
find the number between a ( and + sign.
I am not sure about the format in the file.
Can you use simple cut commands?
cut -d"(" -f2 log2.txt| cut -d"+" -f1 > log3.txt
I have a shell script which is automatically ran each morning which appends that days results to a text file. The file should have todays date on the first column followed by results separated by commas. I use the command date +%x to get the day in the required format (dd/mm/yy). However on one computer date +%x returns mm/dd/yyyy ( any idea why this is the case?). I then sort the data in the file in date order.
Here is a snippet of such a text file
29/11/12,9654.80,194.32,2.01,7.19,-7.89,7.65,7.57,3.98,9625.27,160.10,1.66,4.90,-4.79,6.83,4.84,3.54
03/12/12,5184.22,104.63,2.02,6.88,-6.49,7.87,6.67,4.10,5169.52,93.81,1.81,5.29,-5.45,7.87,5.37,4.10
04/12/12,5183.65,103.18,1.99,6.49,-6.80,8.40,6.66,4.38,5166.04,95.44,1.85,6.04,-6.49,8.40,6.28,4.38
11/07/2012,5183.65,102.15,1.97,6.78,-6.36,8.92,6.56,4.67,5169.48,96.67,1.87,5.56,-6.10,8.92,5.85,4.67
07/11/2012,5179.39,115.57,2.23,7.64,-6.61,8.83,7.09,4.62,5150.17,103.52,2.01,7.01,-6.08,8.16,6.51,4.26
11/26/2012,5182.66,103.30,1.99,7.07,-5.76,7.38,6.37,3.83,5162.81,95.47,1.85,6.34,-5.40,6.65,5.84,3.44
11/30/2012,5180.82,95.19,1.84,6.51,-5.40,7.91,5.92,4.12,5163.98,91.82,1.78,5.58,-5.07,7.05,5.31,3.65
Is it possible to change the date format for the latter four lines to the correct date format using awk or sed? I only wish to change the date format for those in the form mm/dd/yyyy to dd/mm/yy.
It looks like you're using two different flavors (versions) of date. To check which versions you've got, I think GNU date accepts the --version flag whereas other versions, like BSD/OSX will not accept this flag.
Since you may be using completely different systems, it's probably safest to avoid date completely and use perl to print the current date:
perl -MPOSIX -e 'print POSIX::strftime("%d/%m/%y", localtime) . "\n"'
If you are sure you have GNU awk on both machines, you could use it like this:
awk 'BEGIN { print strftime("%d/%m/%y") }'
To fix the file you've got, here's my take using GNU awk:
awk '{ print gensub(/^(..\/)(..\/)..(..,)/, "\\2\\1\\3", "g"); next }1' file
Or using sed:
sed 's/^\(..\/\)\(..\/\)..\(..,\)/\2\1\3/' file
Results:
29/11/12,9654.80,194.32,2.01,7.19,-7.89,7.65,7.57,3.98,9625.27,160.10,1.66,4.90,-4.79,6.83,4.84,3.54
03/12/12,5184.22,104.63,2.02,6.88,-6.49,7.87,6.67,4.10,5169.52,93.81,1.81,5.29,-5.45,7.87,5.37,4.10
04/12/12,5183.65,103.18,1.99,6.49,-6.80,8.40,6.66,4.38,5166.04,95.44,1.85,6.04,-6.49,8.40,6.28,4.38
07/11/12,5183.65,102.15,1.97,6.78,-6.36,8.92,6.56,4.67,5169.48,96.67,1.87,5.56,-6.10,8.92,5.85,4.67
11/07/12,5179.39,115.57,2.23,7.64,-6.61,8.83,7.09,4.62,5150.17,103.52,2.01,7.01,-6.08,8.16,6.51,4.26
26/11/12,5182.66,103.30,1.99,7.07,-5.76,7.38,6.37,3.83,5162.81,95.47,1.85,6.34,-5.40,6.65,5.84,3.44
30/11/12,5180.82,95.19,1.84,6.51,-5.40,7.91,5.92,4.12,5163.98,91.82,1.78,5.58,-5.07,7.05,5.31,3.65
This should work: sed -re 's/^([0-9][0-9])\/([0-9][0-9])\/[0-9][0-9]([0-9][0-9])(.*)$/\2\/\1\/\3\4/'
It can be made smaller but I made it so it would be more obvious what it does (4 groups, just switching month/day and removing first two chars of the year).
Tip: If you don't want to cat the file you could to the changes in place with sed -i. But be careful if you put a faulty expression in you might end up breaking your source file.
NOTE: This assumes that IF the year is specified with 4 digits, the month/day is reversed.
This below command will do it.
Note:No matter how many number of lines are present in the file.this will just change the last 4 lines.
tail -r your_file| awk -F, 'NR<5{split($1,a,"/");$1=a[2]"/"a[1]"/"a[3];print}1'|tail -r
Well i could figure out some way without using pipes and using a single awk statement and this solution does need a tail command:
awk -F, 'BEGIN{cmd="wc -l your_file";while (cmd|getline tmp);split(tmp,x)}x[1]-NR<=4{split($1,a,"/");$1=a[2]"/"a[1]"/"a[3];print}1' your_file
Another solution:
awk -F/ 'NR<4;NR>3{a=$1;$1=$2;$2=a; print $1"/"$2"/" substr($3,3,2) substr($3,5)}' file
Using awk:
$ awk -F/ 'NR>3{x=$1;$1=$2;$2=x}1' OFS="/" file
By using the / as the delimiter, all you need to do is swap the 1st and 2nd fields which is done here using a temporary variable.