How to append the contents of multiple files into one file - Linux

I want to copy the contents of five files into one file as-is. I tried doing it with cp for each file, but that overwrites the contents copied from the previous file. I also tried
paste -d "\n" 1.txt 0.txt
and it did not work.
I want my script to add a newline at the end of each text file.
e.g. files 1.txt, 2.txt, 3.txt. Put the contents of 1, 2, 3 into 0.txt.
How do I do it?

You need the cat (short for concatenate) command, with shell redirection (>) into your output file:
cat 1.txt 2.txt 3.txt > 0.txt
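cat copies bytes as-is, so if one of the inputs lacks a trailing newline, the next file's first line will be glued onto it. Since the question also asks for a newline at the end of each file, here is a minimal sketch that guards against that (the loop and the tail -c1 test are my own addition, not part of this answer):
for f in 1.txt 2.txt 3.txt; do
  cat "$f"
  # add a newline only if the file's last byte is not already one
  [ -n "$(tail -c1 "$f")" ] && echo
done > 0.txt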

Another option, for those of you who still stumble upon this post like I did, is to use find -exec:
find . -type f -name '*.txt' -exec cat {} + >> output.file
In my case, I needed a more robust option that would look through multiple subdirectories, so I chose to use find. Breaking it down:
find .
Look within the current working directory.
-type f
Only interested in files, not directories, etc.
-name '*.txt'
Whittle down the result set by name.
-exec cat {} +
Execute the cat command on the results. The + tells find to batch the file names so that as few cat processes as possible are spawned, rather than one per file (thanks @gniourf_gniourf); a \; comparison is sketched after this breakdown.
>> output.file
As explained in other answers, append the cat-ed contents to the end of an output file.
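For comparison (my addition), the \; terminator runs one cat process per file; the combined output is the same, it is just slower for large result sets:
find . -type f -name '*.txt' -exec cat {} \; >> output.file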

If your files all have a certain extension, then do something like this:
cat /path/to/files/*.txt >> finalout.txt

If all your files are named similarly, you could simply do:
cat *.log >> output.log

If all your files are in a single directory, you can simply do
cat * > 0.txt
Files 1.txt, 2.txt, ... will go into 0.txt.

for i in {1..3}; do cat "$i.txt" >> 0.txt; done
I found this page because I needed to join 952 files together into one, and this works much better if you have many files. It loops over however many numbers you need and cats each one, using >> to append onto the end of 0.txt.
Edit:
As brought up in the comments, you can skip the loop with brace expansion:
cat {1..3}.txt >> 0.txt
or
cat {0..3}.txt >> all.txt

Another option is sed:
sed r 1.txt 2.txt 3.txt > merge.txt
Or...
sed h 1.txt 2.txt 3.txt > merge.txt
Or...
sed -n p 1.txt 2.txt 3.txt > merge.txt # -n is mandatory here
Or without redirection ...
sed wmerge.txt 1.txt 2.txt 3.txt
Note that the last line also writes to merge.txt (not wmerge.txt!). You can use w"merge.txt" to avoid confusion with the file name, and -n for silent output.
Of course, you can also shorten the file list with wildcards. For instance, in case of numbered files as in the above examples, you can specify the range with braces in this way:
sed -n w"merge.txt" {1..3}.txt

If your files contain headers and you want to remove them in the output file, you can use:
for f in *.txt; do sed '2,$!d' "$f" >> 0.out; done
(sed '2,$!d' deletes every line outside the range 2,$ - that is, the first line of each file.)
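An equivalent that some find more readable uses tail, which starts printing at line 2 (my own variant, not from the original answer):
for f in *.txt; do tail -n +2 "$f" >> 0.out; done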

All of the (text) files into one:
find . | xargs cat > outfile
xargs turns the output lines of find . into the arguments of cat.
find has many options, like -name '*.txt' or -type; check them out if you want to use it in your pipeline.
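Note that the plain find . | xargs form breaks on file names containing spaces or quotes; a more robust sketch (assuming GNU find and xargs) uses NUL-delimited names:
find . -type f -name '*.txt' -print0 | xargs -0 cat > outfile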

If the original file contains non-printable characters, they will be lost when using the cat command. Using 'cat -v', the non-printables will be converted to visible character strings, but the output file would still not contain the actual non-printable characters of the original file. With a small number of files, an alternative might be to open the first file in an editor (e.g. vim) that handles non-printing characters. Then maneuver to the bottom of the file and enter ":r second_file_name". That will pull in the second file, including non-printing characters. The same could be done for additional files. When all files have been read in, enter ":w". The end result is that the first file will now contain what it did originally, plus the content of the files that were read in.
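A rough sketch of the interactive steps described above (the file names are placeholders):
vim first_file_name       (open the first file)
:$                        (move the cursor to the last line)
:r second_file_name       (read the second file in below the cursor)
:r third_file_name        (repeat for any additional files)
:w                        (write the combined result back to first_file_name)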

Send multiple files to one file (textall.txt):
cat *.txt > textall.txt

If you want to append contents of 3 files into one file, then the following command will be a good choice:
cat file1 file2 file3 | tee -a file4 > /dev/null
It will combine the contents of all three files into file4, sending tee's console output to /dev/null.

Related

How to rename fasta header based on filename in multiple files?

I have a directory with multiple fasta files named as follows:
BC-1_bin_1_genes.faa
BC-1_bin_2_genes.faa
BC-1_bin_3_genes.faa
BC-1_bin_4_genes.faa
etc. (about 200 individual files)
The fasta headers look like this:
>BC-1_k127_3926653_6 # 4457 # 5341 # -1 # ID=2_6;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.697
I now want to add the filename to the header since I want to annotate the sequences for each file. I tried the following:
for file in *.faa;
do
sed -i "s/>.*/${file%%.*}/" "$file" ;
done
It worked partially but it removed the ">" from the header which is essential for the fasta file. I tried to modify the "${file%%.*}" part to keep the carrot but it always called me out on bad substitutions.
I also tried this:
awk '/>/{sub(">","&"FILENAME"_");sub(/\.faa/,x)}1' *.faa
This worked in theory but only printed everything on my terminal rather than changing it in the respective files.
Could someone assist with this?
It's not clear whether you want to replace the earlier header, or add to it. Both scenarios are easy to do. Don't replace text you don't want to replace.
for file in ./*.faa;
do
sed -i "s/^>.*/>${file%%.*}/" "$file"
done
will replace the header, but include a leading > in the replacement, effectively preserving it; and
for file in ./*.faa;
do
sed -i "s/^>.*/&${file%%.*}/" "$file"
done
will append the file name at the end of the header (& in the replacement string evaluates to the string we are replacing, again effectively preserving it).
For another variation, try
for file in *.faa;
do
sed -i "/^>/s/\$/ ${file%%.*}/" "$file"
done
which says on lines which match the regex ^>, replace the empty string at the end of the line $ with the file name.
Of course, your Awk script could easily be fixed, too. Standard Awk does not have an option to parallel the -i "in-place" option of sed, but you can easily use a temporary file:
for file in ./*.faa;
do
awk '/>/{ $0 = $0 " " FILENAME; sub(/\.faa/,"") }1' "$file" >"$file.tmp" &&
mv "$file.tmp" "$file"
done
GNU Awk also has an -i inplace extension which you could simply add to the options of your existing script if you have GNU Awk.
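For instance, with GNU Awk the temporary file is unnecessary (a sketch only, assuming gawk 4.1 or later; the script body mirrors the one above):
gawk -i inplace '/>/{ $0 = $0 " " FILENAME; sub(/\.faa/,"") }1' ./*.faa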
Since FASTA files typically contain multiple headers, adding to the header rather than replacing all headers in a file with the same string seems more useful, so I changed your Awk script to do that instead.
For what it's worth, the name of the character ^ is caret (carrot is 🥕). The character > is called greater than or right angle bracket, or right broket or sometimes just wedge.
You just need to detect the pattern to replace and use regex to implement it:
fasta_helper.sh
location=$1
for file in "$location"/*.faa
do
full_filename=${file##*/}
filename="${full_filename%.*}"
# escape special chars so the name is safe inside the sed expression
filename=$(echo "$filename" | sed 's_/_\\/_g')
echo "adding file name: $filename to: $full_filename"
# anchor on > so only header lines are rewritten, not sequence lines
sed -i -E "s/^>[^#]+/>$filename /" "$location/$full_filename"
done
usage:
Just pass the folder with fasta files:
bash fasta_helper.sh /foo/bar
Further reading:
Regex: matching up to the first occurrence of a character
Extract filename and extension in Bash
https://unix.stackexchange.com/questions/78625/using-sed-to-find-and-replace-complex-string-preferrably-with-regex
Locating your files
I suggest first identifying your files with the find command or the ls command.
find . -type f -name "*.faa" -printf "%f\n"
A find command to print only files with the .faa extension, including subdirectories of the current directory.
ls -1 *.faa
An ls command to print files with the .faa extension in the current directory.
Processing your files
Once you have the correct file list, iterate over it and apply the sed command.
for fileName in $(find . -type f -name "*.faa" -printf "%f\n"); do
stripedFileName=${fileName/.*/} # strip extension .faa
sed -i "1s|\$| $stripedFileName|" "$fileName" # append value of stripedFileName at end of line 1
done

pasting many files to a single large file

I have many text files in a directory, like 1.txt 2.txt 3.txt 4.txt ... 2000.txt, and I want to paste them to make one large file.
In this regard I did something like
paste *.txt > largefile.txt
but the above command reads the .txt files in the wrong order, so I need to read the files sequentially and paste them as 1.txt 2.txt 3.txt ... 2000.txt.
Please suggest a better solution for pasting many files.
Thanks, and looking forward to hearing from you.
Sort the file names numerically yourself then.
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
When dealing with many files, you may hit ulimit -n. On my system ulimit -n is 1024, but this is a soft limit and can be raised with e.g. ulimit -n 99999.
Without raising the soft limit, go with a temporary file that accumulates results each "round" of ulimit -n files, like:
touch accumulator.txt
... | xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
paste accumulator.txt "$@" > accumulator.txt.sav;
mv accumulator.txt.sav accumulator.txt
' _
cat accumulator.txt
Instead of using the wildcard * to enumerate all your files in a directory, if your file names are sequentially numbered you can manually list all files in order and concatenate them into a large file. The output order of * enumeration might look different in different environments and may not work as you expect.
Below is a simple example
$ for i in `seq 20`;do echo $i > $i.txt;done
# create 20 test files, 1.txt, 2.txt, ..., 20.txt with number 1 to 20 in each file respectively
$ cat {1..20}.txt
# show the content of all files in order 1.txt, 2.txt, ..., 20.txt
$ cat {1..20}.txt > 1_20.txt
# concatenate them into a large file named 1_20.txt
In bash or any other shell, glob expansions are done in lexicographical order. With numbered files, this sadly means that 11.txt < 1.txt < 2.txt. This weird ordering comes from the fact that, lexicographically, 1 < . (the <dot> character ".").
So here are a couple of ways to operate on your files in order:
rename all your files:
for i in *.txt; do mv "$i" "$(printf "%0.5d.txt" "${i%.*}")"; done
paste *.txt
use brace-expansion:
Brace expansion is a mechanism that allows for the generation of arbitrary strings. For integers you can use {n..m} to generate all numbers from n to m or {n..m..s} to generate all numbers from n to m in steps of s:
paste {1..2000}.txt
The downside here is that it is possible that a file is missing (e.g. 1234.txt). So you can do
shopt -s extglob; paste ?({1..2000}.txt)
The pattern ?(pattern) matches zero or one glob-matches. So this will exclude the missing files but keeps the order.
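One caveat worth noting (my addition, not in the original answer): in bash, a pattern that matches nothing is normally left in place as a literal word, which paste would then fail to open. Setting nullglob as well makes unmatched patterns expand to nothing:
shopt -s extglob nullglob
paste ?({1..2000}.txt)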

Automate and looping through batch script

I'm new to batch scripting. I want to iterate through a list and use the output content to replace a string in another file.
ls -l somefile | grep .txt | awk 'print $4}' | while read file
do
toreplace="/Team/$file"
sed 's/dataFile/"$toreplace"/$file/ file2 > /tmp/test.txt
done
When I run the code I get the error
sed: 1: "s/dataFile/"$torepla ...": bad flag in substitute command: '$'
Example of somefile, which has a list of file paths:
foo/name/xxx/2020-01-01.txt
foo/name/xxx/2020-01-02.txt
foo/name/xxx/2020-01-03.txt
However, my desired output is to use the list of file paths in somefile to replace a string in the content of another file, file2. Something like this:
This is the directory of locations where data from /Team/foo/name/xxx/2020-01-01.txt ............
I'm not sure if I understand your desired outcome, but hopefully this will help you to figure out your problem:
You have three files in a directory:
TEAM/foo/name/xxx/2020-01-02.txt
TEAM/foo/name/xxx/2020-01-03.txt
TEAM/foo/name/xxx/2020-01-01.txt
And you have another file called to_be_changed.txt which contains the text This is the directory of locations where data from TO_BE_REPLACED ............ If you want to grab the filenames of your three files and insert them into to_be_changed.txt, you can do it with:
while read file
do
filename="$file"
sed "s/TO_BE_REPLACED/${filename##*/}/g" to_be_changed.txt >> changed.txt
done < <(find ./TEAM/ -name "*.txt")
And you will then have made a file called changed.txt which contains:
This is the directory of locations where data from 2020-01-02.txt ............
This is the directory of locations where data from 2020-01-03.txt ............
This is the directory of locations where data from 2020-01-01.txt ............
Is this what you're trying to achieve? If you need further clarification I'm happy to edit this answer to provide more details/explanation.
ls -l somefile | grep .txt | awk 'print $4}' | while read file
No. No, no, nono.
ls -l somefile is only going to show somefile unless it's a directory.
(Don't name a directory "somefile".)
If you mean somefile.txt, please clarify in your post.
grep .txt is going to look through the lines presented for the three characters txt preceded by any character (the dot is a regex wildcard). Since you asked for a long listing of somefile it shouldn't find any, so nothing should be passed along.
awk 'print $4}' is a typo; awk will report a syntax error and refuse to run.
Keep it simple. What I suspect you meant was
for file in *.txt
Then in
toreplace="/Team/$file"
sed 's/dataFile/"$toreplace"/$file/ file2 > /tmp/test.txt
it's unclear what you expect $file to be - awk's $4 from an ls -l seems unlikely.
Assuming it's the filenames from the for above, then try
sed "s,dataFile,/Team/$file," file2 > /tmp/test.txt
Does that help? Correct me as needed. Sorry if I seem harsh.
Welcome to SO. ;)

Matching file content with other filenames to extract and merge contents

I have two directories.
In directory_1, I have many .txt files
The contents of these files (for example file1.txt) are lists of names:
file1.txt
--
rer_098
dfrkk9
In directory_2, I have many files; two of them are ‘rer_098’ and ‘dfrkk9’.
The content of these files is as follows:
rer_098
--
>123_nbd
sasert
>456_nbd
ffjko
dfrkk9
--
>789_nbd
figyi
>012_nbd
jjjygk
Now, in a separate output directory (directory_3), for the above example, I want output files like:
file1.txt
--
>123_nbd
sasert
>456_nbd
ffjko
>789_nbd
figyi
>012_nbd
jjjygk
and so on for file2.txt
Thanks!
This might work for you (GNU parallel):
parallel 'cat {} | parallel -I## cat dir_2/## > dir_3/{/}' ::: dir_1/*.txt
Use two invocations of parallel: the first traverses dir_1 and pipes its output into a second parallel. This cats the input files and outputs the result to dir_3, keeping the original name from the first parallel invocation.
N.B. the -I option renames the inner parallel's parameter delimiter from the default {} to ## so it does not clash with the outer invocation.
Pretty easy to do with just shell. Something like
for fullname in directory_1/*.txt; do
file=$(basename "$fullname")
while read -r line; do
cat "directory_2/$line"
done <"$fullname" >"directory_3/$file"
done
for file in directory_1/*.txt; do
awk 'NR==FNR{ARGV[ARGC++]="directory_2/"$0; next} 1' "$file" > "directory_3/${file##*/}"
done
While reading the list file, each line is added to awk's argument list as directory_2/<name>; awk then reads those files in turn and the final 1 prints every line, producing the merged output.

Paste files from list of paths into single output file

I have a file containing a list of filenames and their paths, as in the example below:
$ cat ./filelist.txt
/trunk/data/9.20.txt
/trunk/data/9.30.txt
/trunk/data/50.3.txt
/trunk/data/55.100.txt
...
All of these files, named as X.Y.txt, contain a list of double values. For example:
$ cat ./9.20.txt
1.23
1.0e-6
...
I'm trying to paste all of these X.Y.txt files into a single file, but I'm not sure about how to do it. Here's what I've been able to do so far:
cat ./filelist.txt | xargs paste output.txt >> output.txt
Any ideas on how to do it properly?
You could simply cat-append each file into your output file, as in:
$ cat <list_of_paths> | xargs -I {} cat {} >> output.txt
In the above command, each line from your input file will be taken by xargs, and will be used to replace {}, so that each actual command being run is:
$ cat <X.Y.txt> >> output.txt
If all you're looking to do is to read each line from filelist.txt and append the contents of the file that the line refers to to a single output file, use this:
while read -r file; do
[[ -f "$file" ]] && cat "$file"
done < "filelist.txt" > "output.txt"
Edit: If you know your input file to only contain lines that are file paths (and optionally empty lines) - and no comments, etc. - @Rubens' xargs-based solution is the simplest.
The advantage of the while loop is that you can pre-process each line from the input file, as demonstrated by the -f test above, which ensures that the input line refers to an existing file.
More complex but without argument length limit
Well, the limit here is the available computer memory.
The file buffer.txt must not exist already.
touch buffer.txt
cat filelist.txt | xargs -iXX bash -c 'paste buffer.txt XX > output.txt; mv output.txt buffer.txt';
mv buffer.txt output.txt
What this does, by line:
Create a buffer.txt file which must be initially empty. (paste does not seem to like non-existent files. There does not seem to be a way to make it treat such files as empty.)
Run paste buffer.txt XX > output.txt; mv output.txt buffer.txt. XX is replaced by each file in the filelist.txt file. You can't just do paste buffer.txt XX > buffer.txt because buffer.txt will be truncated before paste processes it. Hence the mv rigmarole.
Move buffer.txt to output.txt so that you get your output with the file name you wanted. Also makes it safe to rerun the whole process.
The previous version forced xargs to issue exactly one paste per file you want to paste but for even better performance, you can do this:
touch buffer.txt;
cat filelist.txt | xargs bash -c 'paste buffer.txt "$@" > output.txt; mv output.txt buffer.txt' FILLER;
mv buffer.txt output.txt
Note the presence of "$@" in the command that bash executes: paste gets the list of files from the arguments given to bash. The FILLER parameter passed to bash is to give it a value for $0. If it were not there, then the first file that xargs gives to bash would be used for $0 and thus paste would skip some files.
This way, xargs can pass hundreds of parameters to paste with each invocation and thus reduce dramatically the number of times paste is invoked.
Simpler but limited way
This method suffers from limitations on the number of arguments that a shell can pass to a command it executes. However, in many cases it is good enough. I can't count the number of times when I was performing spur-of-the-moment operations where using xargs would have been superfluous. (As part of a long-term solution, that's another matter.)
The simpler way is:
paste `cat filelist.txt` > output.txt
It seems you were thinking that xargs would execute paste output.txt >> output.txt multiple times, but that's not how it works. The redirection applies to the entire pipeline cat ./filelist.txt | xargs paste output.txt (as you initially had it). If you want the redirection to apply to the individual commands launched by xargs, you have it launch a shell, like I do above.
#!/usr/bin/env bash
set -x
while read -r file
do
cat "${file}" >> output.txt
done < filelist.txt
OR, to find the files directly:
#!/usr/bin/env bash
set -x
find . -type f -name '*.txt' | while read -r file
do
cat "${file}" >> output.txt
done
A simple while loop should do the trick:
while read -r line; do
cat "${line}" >> output.txt
done < filelist.txt
