I have a folder with many paired-end files (1.1.fq 1.2.fq 2.1.fq 2.2.fq ...). I want to use a for loop to do the alignment for each pair (*.1.fq *.2.fq) and generate two outputs, *.stats.txt and *.sam.
I wrote the following command:
for x in *.fq ; do
~/Pedro_Dias/Mamão/Single_end/novocraft/novoalign -d cpapaya.novoIndex -f demultiplex-fq/$x *.1.fq demultiplex-fq/$x *.2.fq -x 3 -H -a -o SAM 2> demultiplex-sam/$x *.stats.txt > demultiplex-sam/$x *.sam;
done
The code returns the error:
demultiplex-sam/demultiplex-fq/98.1.fq*.stats.txt: No such file or directory
P.S. My files are in the demultiplex-fq folder and the output must go to the demultiplex-sam folder. I'm working in a folder that contains both the demultiplex-fq and demultiplex-sam folders.
You should loop over just one file in each pair, then replace .1.fq with .2.fq to get the other file in that pair.
The wildcards need to include the directory name, and then you have to replace the directory when generating the output filenames.
for x in demultiplex-fq/*.1.fq
do
    y=${x/.1.fq/.2.fq}
    stats=${x/demultiplex-fq/demultiplex-sam}.stats.txt
    sam=${x/demultiplex-fq/demultiplex-sam}.sam
    ~/Pedro_Dias/Mamão/Single_end/novocraft/novoalign -d cpapaya.novoIndex -f "$x" "$y" -x 3 -H -a -o SAM 2> "$stats" > "$sam"
done
You don't use wildcards in the command; you just use the $x and $y variables.
I used that code and it works:
for x in $( ls -1 dem*/*.fq | rev | cut -d . -f 3- | rev | sort -u ) ; do ~/Pedro_Dias/Mamão/Single_end/novocraft/novoalign -d cpapaya.novoIndex -f $x.1.fq $x.2.fq -x 3 -H -a -o SAM 2> ./bams/$(echo $x | tr / _).stats.txt > ./bams/$(echo $x | tr / _).sam; done
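The same loop, broken out for readability (with the variables quoted, which the one-liner skips):
for x in $(ls -1 dem*/*.fq | rev | cut -d . -f 3- | rev | sort -u); do
    ~/Pedro_Dias/Mamão/Single_end/novocraft/novoalign -d cpapaya.novoIndex \
        -f "$x.1.fq" "$x.2.fq" -x 3 -H -a -o SAM \
        2> "./bams/$(echo "$x" | tr / _).stats.txt" \
        > "./bams/$(echo "$x" | tr / _).sam"
done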
I've created a cURL bash script in which I want to save the response body into a file called output.log, but when I open output.log it doesn't contain the response body I expect.
Here is my bash script:
#!/bin/bash
SECRET_KEY='helloWorld'
FILE_NAME='sma.txt'
function save_log()
{
printf '%s\n' \
"Header Code : $1" \
"Executed at : $(date)" \
"Response Body : $2" \
'==========================================================\n' > output.log
}
while IFS= read -r line;
do
HTTP_RESPONSE=$(curl -I -L -s -w "HTTPSTATUS:%{http_code}\\n" -H "X-Gitlab-Event: Push Hook" -H 'X-Gitlab-Token: '$SECRET_KEY --insecure $line 2>&1)
HTTP_STATUS=$(echo $HTTP_RESPONSE | tr -d '\n' | sed -e 's/.*HTTPSTATUS://')
save_log $HTTP_STATUS $HTTP_RESPONSE
done < $FILE_NAME
Can anyone help me get my desired output in my output.log?
From the Curl documentation:
-I, --head Show document info only
Removing the -I flag or replacing it with -i should solve your problem.
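For example, a minimal sketch of the corrected request from the script above (-i includes the headers alongside the body; drop it entirely if you only want the body):
HTTP_RESPONSE=$(curl -i -L -s -w "HTTPSTATUS:%{http_code}" \
    -H "X-Gitlab-Event: Push Hook" \
    -H "X-Gitlab-Token: $SECRET_KEY" \
    --insecure "$line" 2>&1)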
I have four files in a directory, let's say the following:
Test_File_20170101_20170112_1.txt
Test_File_20170101_20170112_2.txt
Test_File_20170101_20170112_3.txt
Test_File_20170101_20170112_4.txt
and I want to merge them in order, with the final file named
Test_File_20170101_20170112.txt
You can do something like this:
ls *_[1-9].txt \
| sed 's/_[1-9]\.txt//' \
| sort -u \
| xargs -n 1 -I {} sh -c "cat {}_*.txt > {}.txt"
Explaining each step:
ls *_[1-9].txt: list all files ending with _1.txt, _2.txt etc
sed 's/_[1-9]\.txt//': remove the extension and number suffix
sort -u: unique file names (e.g. Test_File_20170101_20170112)
xargs ...: for each file name, catenate each numbered file into a new file
You could extend this to a larger sequence (e.g. _10.txt and so on), but you would need to be aware that the order would not be correct: the final * expands in alphabetical order, e.g. _1, _10, _2... A sketch addressing this follows below; here are some more approaches: cat files in specific order based on number in filename
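As an illustration, a sketch for a single prefix using GNU sort's -V (version sort), which orders _2 before _10:
ls Test_File_20170101_20170112_*.txt | sort -V \
    | xargs cat > Test_File_20170101_20170112.txt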
I completed the following. I was stuck at one point, where @cmbuckley helped, and my code is working as I expected. Thanks for the concern and help. You can still correct my code if anything is not right, but it is working fine.
#!/bin/bash
PDIR=$1
FDIR=$2
# Change the current working directory
cd "$PDIR" || exit;
# Count number of files present in the current working directory
c=$(ls -p | grep -v / | wc -l)
# Count number of iteration it needed for Loop
n=$(expr "$c" / 4)
# Move the first 4 sorted files.
# 1.Header File 2.Column Name File 3.Detail records file 4.Trailer record file
i=1
while [ "$i" -le "$n" ];
# Move first 4 files from source folder to processing folder
do mv `ls -p | grep -v / | head -4` "$PDIR"merge/;
# Change directory to processing
cd "$PDIR"merge/ || exit;
# Look for files ending with "_[1-9].txt", sort them, merge each group into one file, and strip the numeric suffix from the output name
ls *_[1-9].txt | sort -u | sed 's/_[1-9]\.txt//' | xargs -n 1 -I {} sh -c "cat {}_*.txt > {}.txt";
# Remove processed files
rm -f *_[1-9].txt;
# Move output file to Target directory
mv *.txt "$FDIR";
cd "$PDIR" || exit;
i=$(($i + 1));
done
I'm a total Linux noob. I just want to prepend a field as the first column.
Ex. 192.168.0.254 mwd.com
wget -O - "http://mirror1.malwaredomains.com/files/justdomains" | ??? > /var/hosts.md
I was thinking of using sed, but there's no data to substitute.
You can still use sed; just match on the start of the line:
wget -O - "http://mirror1.malwaredomains.com/files/justdomains" | sed 's/^/192.168.0.254 /' >/var/hosts.md
You can substitute with a beginning-of-line or end-of-line marker:
> echo line | sed -e 's/$/ foobar/'
line foobar
> echo line | sed -e 's/^/foobar /'
foobar line
Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).
Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
I am guessing some concoction of split and head will do the trick?
This is robhruska's script cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
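For example, a minimal sketch of the same loop using mktemp (assuming mktemp is available):
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    tmp_file=$(mktemp) || exit 1
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done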
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.
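A sketch of that variant, assuming bash as the shell running the exported function; note the mkdir, since split only sets FILE and will not create the directory for you:
split_filter () { mkdir -p "$FILE"; { head -n 1 file.txt; cat; } > "$FILE/data.dat"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_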
This one-liner will split the big CSV into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows):
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer.
See comments for some tips on installing parallel
You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
You can use [mg]awk:
awk 'NR==1{
    header=$0;
    count=1;
    print header > ("x_" count);
    next
}
!( (NR-1) % 100){
    count++;
    print header > ("x_" count);
}
{
    print $0 > ("x_" count)
}' file
100 is the number of lines of each slice.
It doesn't require temp files and can be put on a single line.
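As a sketch, the same program with the slice length passed in via -v, so the 100 is not hard-coded (the name n is an arbitrary choice here):
awk -v n=100 '
NR==1 { header=$0; count=1; print header > ("x_" count); next }
!((NR-1) % n) { count++; print header > ("x_" count) }
{ print $0 > ("x_" count) }' file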
I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done
This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.
Use GNU Parallel:
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
If you need to run a command on each of the parts, then GNU Parallel can help do that, too:
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal-sized parts):
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
If you want to split into 10 MB blocks:
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
Below is a four-liner that can be used to split bigfile.csv into multiple smaller files while preserving the CSV header. It uses only standard command-line tools (head, split, find, grep, xargs, and sed), which should work on most *nix systems. It should also work on Windows if you install mingw-w64 / git-bash.
csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find . | grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00
Line by line explanation:
Capture the header to a variable named csvheader
Split the bigfile.csv into a number of smaller files with prefix smallfile_
Find all the small files and insert the csvheader into the FIRST line using xargs and sed -i. Note that sed needs "double quotes" around the expression in order to expand the variable.
The first file, smallfile_00, will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We remove the redundant header with the sed -i '1d' command.
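To verify the result, something like this prints the first two lines of every piece:
head -n 2 smallfile_*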
This is a more robust version of Dennis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around after an incomplete run. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyway.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
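A sketch combining both suggestions, so the trap always knows the real temp file name:
tmp_file=$(mktemp) || exit 1
trap 'rm -f split_* "$tmp_file"; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    cp "$tmp_file" "$file"
done
rm -f "$tmp_file"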
I liked marco's awk version and adapted it into a simplified one-liner where you can easily specify the split fraction as granularly as you want:
awk 'NR==1{print $0 > (FILENAME ".split1"); print $0 > (FILENAME ".split2");} NR>1{if (NR % 10 > 5) print $0 >> (FILENAME ".split1"); else print $0 >> (FILENAME ".split2")}' file
I really liked Rob and Dennis' versions, so much so that I wanted to improve them.
Here's my version:
in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done
Differences:
in_file is the file argument you want to split maintaining headers
Use awk instead of tail due to awk having better performance
split into 100,000 line files instead of 4
Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
Use mktemp to safely handle temporary files
Use single head | cat line instead of two lines
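Usage sketch, assuming the script above is saved under the hypothetical name split_with_header.sh:
bash split_with_header.sh bigfile.csv
# produces bigfile.csv_00000, bigfile.csv_00001, ..., each starting with the header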
Inspired by @Arkady's comment on a one-liner.
MYFILE variable simply to reduce boilerplate
split doesn't print the names of the files it creates, but the --additional-suffix option allows us to easily control what to expect
removal of intermediate files via rm $part (assumes no other files with the same suffix)
MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done
Evidence:
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xaafoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xabfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xacfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040110 Jun 1 23:18 mycsv.csv.xadfoo
and of course head -2 *foo to see the header is added.
A simple but maybe not as elegant way: cut off the header beforehand, split the file, and then rejoin the header onto each piece with cat, or with whatever tool reads it in.
So something like:
head -n1 file.txt > header.txt
tail -n +2 file.txt | split -l 4 -
cat header.txt xaa
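And, as a sketch, rejoining the header onto every piece (assuming split's default xa* output names):
for piece in xa*; do
    cat header.txt "$piece" > "$piece.txt" && rm "$piece"
done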
I had a better result using the following code: every split file will have a header, and the generated files will have normalized names.
export F=input.csv && LINES=3 &&\
export PF="${F%.*}_" &&\
split -l $LINES "${F}" "${PF}" &&\
for fn in $PF*
do
mv "${fn}" "${fn}.csv"
done &&\
export FILES=($PF*) && for file in "${FILES[@]:1}"
do
head -n 1 "${F}" > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "${file}"
done
Output:
$ wc -l input*
22 input.csv
3 input_aa.csv
4 input_ab.csv
4 input_ac.csv
4 input_ad.csv
4 input_ae.csv
4 input_af.csv
4 input_ag.csv
2 input_ah.csv
51 total