Concatenate files together that have similar names

Concatenate files together that have similar names - string

I have a number of files that look something like this:
418_S32_L003_R1_001.fastq.gz
418_S32_L003_R2_001.fastq.gz
418_S1_L002_R1_001.fastq.gz
418_S1_L002_R2_001.fastq.gz
419_S32_L003_R1_001.fastq.gz
419_S32_L003_R2_001.fastq.gz
419_S1_L002_R1_001.fastq.gz
419_S1_L002_R2_001.fastq.gz
The first number is different for each set of four files.
Samples that start with the same number should be combined together if they have the same value for *R1* or *R2*.
So, these two samples should be concatenated:
418_S32_L003_R1_001.fastq.gz
418_S1_L002_R1_001.fastq.gz
And these two should be concatenated:
419_S32_L003_R2_001.fastq.gz
419_S1_L002_R2_001.fastq.gz
And this should be repeated for all files within the directory.
Is there a good way to do this in bash other than manually concatenating like this:
cat 418_S32_L003_R1_001.fastq.gz 418_S1_L002_R1_001.fastq.gz > 418_R1.fastq.gz

You can read each file and append it to the target file which you can get from the file name.
for file in *.fastq.gz;
do
IFS='_' read -a array <<< $file;
name="${array[0]}_${array[3]}.fastq.gz";
cat $file >> $name;
done

a = 0
for i in *
do
for j in *
do
if [ $(echo $j |cut -d _ -f 1) = $(echo $i |cut -d _ -f 1) ]
then
cat $i $j > $a
a = a + 1
fi
done
done

This might work for you (GNU parallel):
parallel --dry-run -N4 --plus cat {1} {4} \> {1%_.*}_R1.{1+..} ::: *R[12]*
This will print out the intended cat commands, check the results and if OK remove the --dry-run option.

Related

Bash: Two for loops at once?

I am not sure how to do this at all.
I have two text files, FILE1 and FILE2.
I would like to run a for loop for each file at the same time and display the
contents next to each other.
For example,
for $i in $(cat FILE1); do echo $i; done
for $j in $(cat FILE2); do echo $j; done
I would like to combine these two commands, so I can run both files at the same time and have an output like $i $j

Solution 1
Use the paste command
paste FILE1 FILE2
Details for paste command
Another resource
Solution 2
You can do this if they have the same number of lines.
#!/bin/bash
t=$(cat FILE1 | wc -l)
for i in `seq 1 $t`;
do
cat FILE1|head -n $i|tail -n 1
cat FILE2|head -n $i|tail -n 1
done
You can extend it to what you want for unequal number of lines.

You shouldn't be using for loops at all; see Bash FAQ 001. Instead, use two read commands in a single while loop.
while IFS= read -r line1 && IFS= read -r line2 <&3; do
printf '%s | %s\n' "$line1" "$line2"
done < FILE1 3< FILE2
Each read command reads from a separate file descriptor. In this version, the loop will exit when the shorter of the two files is exhausted.

There are two different questions being asked here. Other answers address the question of how to display the contents of the file in 2 columns. Running two loops simultaneously (which is the wrong way to address the first problem) can be done by running them each asynchronously: for i in ${seqi?}; do ${cmdi?}; done & for j in ${seqj?}; do ${cmdj?}; done & wait
Although you could also implement paste -d ' ' file1 file2 with something like:
while read line_from_file1; p=$?; read line_from_file2 <&3 || test "$p" = 0; do
echo "$line_from_file1" "$line_from_file2"
done < file1 3< file2

Another option, in bash v4+ is to read the two files into 2 arrays, then echo the array elements side-by-side:
# Load each file into its own array
readarray -t f1 < file1
readarray -t f2 < file2
# Print elements of both arrays side-by-side
for ((i=0;i<${#f1[#]};i++)) ; do echo ${f1[i]} ${f2[i]}; done
Or change the echo to printf if you want the columns to line up:
printf "%-20s %-20s\n" ${f1[i]} ${f2[i]}
I'm not suggesting you do this if your files are 100s of megabytes.

find files that have number in file name greater than

If I have 10 files called 01-a.txt, 02-a.txt,...10-a.txt - how can I find the files where the number is greater than 5? I would like a general solution, and I would be putting the contents of all files into one file using something like
cat *.txt > bigfile.txt
I can get files with numbers using
ls *[0-9]*
but can't seem go beyond that.
thanks.

You may use seq for that, but it works only if all files have same naming convention:
seq -f "%02g-a.txt" 6 10
06-a.txt
07-a.txt
08-a.txt
09-a.txt
10-a.txt
I.e.:
cat `seq -f "%02g-a.txt" 6 10` > bigfile.txt

Assuming the folder contains only these files.
This would list all files where the number is > 5
ls [0-9]* | awk -F '-' '{if ($1 > 5) print $0}'

It will cat all files named as "< numeric_value >-< rest >" and having this < numeric_value > greater than $LIM.
Even if they are written with one single digit (like 5), with two digits (like 05), or more...
And even if the < rest > are different among the files.
LIM=5
for file in $(ls);
do
number=$(echo $file | cut -f1 -d'-');
[ $number -gt $LIM ] && cat $file >> bigfile.txt;
done

ls *.txt | perl -n -e '$f = $_; $f =~ s/\D//g; $f > 5 and print'

appending files into multiple files depending on the appended files line count in unix

I have a scenario where i have lot of file in folder and i need to all files in such a format that appended files threshold is 5000, if its more than that it should create a new file...and for the new file also threshold should be 5000. have tried the below logic which works only for the first appended file.
ls -ltr *re.txt | awk '{print $9}' > files.txt
touch file_int.txt
touch file_count.txt
for file in `cat files.txt`
do
cat file_count.txt >> file_int.txt
count=`wc -l < file_int.txt`
count1=`wc -l < $file`
count3=`expr $count + $count1`
file_date_time=file_`date +%H_%M_%S_%N`.txt
if [ $count3 -gt 5000 ]
then
cat $file >> $file_date_time
cat $file_date_time > file_int.txt
else
cat $file >> file_count.txt
fi
done

You might like to consider using the split command:
split -l 5000 bigfile.txt `date +%H_%M_%S`
The filenames might not be exactly what you need, but split will break up your file into 5000 line chunks.
You could rewrite your script like this:
#!/bin/bash
mkdir to_delete
for file in *re.txt
do
split -l 5000 $file ${file}_`date +%H_%M_%S_`
mv $file to_delete/.
done
If it works ok you can then delete the contents of the to_delete directory.

extracting specified line numbers from file using shell script

I have a file with a list of address it looks like this (ADDRESS_FILE)
0xf012134
0xf932193
.
.
0fx12923a
I have another file with a list of numbers it looks like this (NUMBERS_FILE)
20
40
.
.
12
I want to cut the first 20 lines from ADDRESS_FILE and put that into a new file
then cut the next 40 lines from ADDRESS_FILE so on ...
I know that a series of sed commands like the one given below does the job
sed -n 1,20p ADDRESSS_FILE > temp_file_1
sed -n 20,60p ADDRESSS_FILE > temp_file_2
.
.
sed -n somenumber,endofilep. ADDRESS_FILE > temp_file_n
But I want to does this automatically using shell scripting which will change the numbers of lines to cut on each sed execution.
How to do this ???
Also on a general note, which are the text processing commands in linux which are very useful in such cases?

Assuming your line numbers are in a file called lines, sorted etc., try:
#!/bin/sh
j=0
count=1
while read -r i; do
sed -n $j,$i > filename.$count # etc... details of sed/redirection elided
j=$i
count=$(($count+1))
done < lines
Note. The above doesn't assume a consistent number of lines to split on for each iteration.
Since you've additionally asked for a general utility, try split. However this splits on a consistent number of lines, and is perhaps of limited use here.

Here's an alternative that reads directly from the NUMBERS_FILE:
n=0; i=1
while read; do
sed -n ${i},+$(( REPLY - 1 ))p ADDRESS_FILE > temp_file_$(( n++ ))
(( i += REPLY ))
done < NUMBERS_FILE

size=$(wc -l ADDRESSS_FILE)
i=1
n=1
while [ $n -lt $size ]
do
sed -n $n,$((n+19))p ADDRESSS_FILE > temp_file_$i
i=$((i+1))
n=$((n+20))
done
or just
split -l20 ADDRESSS_FILE temp_file_
(thanks Brian Agnew for the idea).

An ugly solution which works with a single sed invocation, can probably be made less horrible.
This generates a tiny sed script to split the file
#!/bin/bash
sum=0
count=0
sed -n -f <(while read -r n ; do
echo $((sum+1),$((sum += n)) "w temp_file_$((count++))" ;
done < NUMBERS_FILE) ADDRESS_FILE

How to split a file and keep the first line in each of the pieces?

Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).
Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
I am guessing some concoction of split and head will do the trick?

This is robhruska's script cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.

This one-liner will split the big csv into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows)
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer.
See comments for some tips on installing parallel

You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'

You can use [mg]awk:
awk 'NR==1{
header=$0;
count=1;
print header > "x_" count;
next
}
!( (NR-1) % 100){
count++;
print header > "x_" count;
}
{
print $0 > "x_" count
}' file
100 is the number of lines of each slice.
It doesn't require temp files and can be put on a single line.

I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done
This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.

Use GNU Parallel:
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
If you need to run a command on each of the parts, then GNU Parallel can help do that, too:
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
If you want to split into 10 MB blocks:
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

Below is a 4 liner that can be used to split a bigfile.csv into multiple smaller files, and preserve the csv header. Uses only built-in Bash commands (head, split, find, grep, xargs, and sed) which should work on most *nix systems. Should also work on Windows if you install mingw-64 / git-bash.
csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find .|grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00
Line by line explanation:
Capture the header to a variable named csvheader
Split the bigfile.csv into a number of smaller files with prefix smallfile_
Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that you need to use sed within "double quotes" in order to use variables.
The first file named smallfile_00 will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with sed -i '1d' command.

This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyways.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat $file >> tmp_file
mv -f tmp_file $file
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyways (as some have already suggested), so go ahead and remove 'tmp_file" from the rm in the trap line. See the signal man page for more signals to catch.

I liked the awk version of marco, adopted from this a simplified one-liner where you can easily specify the split fraction as granular as you want:
awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file

I really liked Rob and Dennis' versions, so much so that I wanted to improve them.
Here's my version:
in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done
Differences:
in_file is the file argument you want to split maintaining headers
Use awk instead of tail due to awk having better performance
split into 100,000 line files instead of 4
Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
Use mktemp to safely handle temporary files
Use single head | cat line instead of two lines

Inspired by #Arkady's comment on a one-liner.
MYFILE variable simply to reduce boilerplate
split doesn't show file name, but the --additional-suffix option allows us to easily control what to expect
removal of intermediate files via rm $part (assumes no files with same suffix)
MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done
Evidence:
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xaafoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xabfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xacfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040110 Jun 1 23:18 mycsv.csv.xadfoo
and of course head -2 *foo to see the header is added.

A simple but maybe not as elegant way: Cut off the header beforehand, split the file, and then rejoin the header on each file with cat, or with whatever file is reading it in.
So something like:
head -n1 file.txt > header.txt
split -l file.txt
cat header.txt f1.txt

I had a better result using the following code, every split file will have a header and the generated files will have a normalized name.
export F=input.csv && LINES=3 &&\
export PF="${F%.*}_" &&\
split -l $LINES "${F}" "${PF}" &&\
for fn in $PF*
do
mv "${fn}" "${fn}.csv"
done &&\
export FILES=($PF*) && for file in "${FILES[#]:1}"
do
head -n 1 "${F}" > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "${file}"
done
output
$ wc -l input*
22 input.csv
3 input_aa.csv
4 input_ab.csv
4 input_ac.csv
4 input_ad.csv
4 input_ae.csv
4 input_af.csv
4 input_ag.csv
2 input_ah.csv
51 total

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Concatenate files together that have similar names - string

You can read each file and append it to the target file which you can get from the file name. for file in *.fastq.gz; do IFS='_' read -a array <<< $file; name="${array[0]}_${array[3]}.fastq.gz"; cat $file >> $name; done

a = 0 for i in * do for j in * do if [ $(echo $j |cut -d _ -f 1) = $(echo $i |cut -d _ -f 1) ] then cat $i $j > $a a = a + 1 fi done done

This might work for you (GNU parallel): parallel --dry-run -N4 --plus cat {1} {4} \> {1%_.}_R1.{1+..} ::: R[12]* This will print out the intended cat commands, check the results and if OK remove the --dry-run option.

Related

Bash: Two for loops at once?

find files that have number in file name greater than

appending files into multiple files depending on the appended files line count in unix

extracting specified line numbers from file using shell script

How to split a file and keep the first line in each of the pieces?

Categories

Resources

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Concatenate files together that have similar names - string

You can read each file and append it to the target file which you can get from the file name. for file in *.fastq.gz; do IFS='_' read -a array <<< $file; name="${array[0]}_${array[3]}.fastq.gz"; cat $file >> $name; done

a = 0 for i in * do for j in * do if [ $(echo $j |cut -d _ -f 1) = $(echo $i |cut -d _ -f 1) ] then cat $i $j > $a a = a + 1 fi done done

This might work for you (GNU parallel): parallel --dry-run -N4 --plus cat {1} {4} \> {1%_.*}_R1.{1+..} ::: *R[12]* This will print out the intended cat commands, check the results and if OK remove the --dry-run option.

Related

Bash: Two for loops at once?

find files that have number in file name greater than

appending files into multiple files depending on the appended files line count in unix

extracting specified line numbers from file using shell script

How to split a file and keep the first line in each of the pieces?

Categories

Resources

This might work for you (GNU parallel): parallel --dry-run -N4 --plus cat {1} {4} \> {1%_.}_R1.{1+..} ::: R[12]* This will print out the intended cat commands, check the results and if OK remove the --dry-run option.