Matching file content with other filenames to extract and merge contents - linux

I have two directories.
In directory_1, I have many .txt files
Content of these files (for example file1.txt) is a list of names:
file1.txt
--
rer_098
dfrkk9
In directory_2, I have many files, two of them are ‘rer_098’ and ‘dfrkk9’.
Content of these files are as follows:
rer_098
--
>123_nbd
sasert
>456_nbd
ffjko
dfrkk9
--
>789_nbd
figyi
>012_nbd
jjjygk
Now in a separate output directory (directory_3), for this above example, I want output files like:
file1.txt
--
>123_nbd
sasert
>456_nbd
ffjko
>789_nbd
figyi
>012_nbd
jjjygk
and so on for file2.txt
Thanks!

This might work for you (GNU parallel):
parallel 'cat {} | parallel -I## cat dir_2/## > dir_3/{/}' ::: dir_1/*.txt
Use two invocations of parallel: the first traverses dir_1 and pipes its output into a second parallel. This cats the input files and outputs the result to dir_3, keeping the original name from the first parallel invocation.
N.B. The -I option renames the parameter delimiter from the default {} to ##.
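If you want to sanity-check the generated jobs first, GNU parallel's --dry-run option prints each job instead of running it; {/} is the input with its directory path removed, so dir_1/file1.txt becomes file1.txt in dir_3:
parallel --dry-run 'cat {} | parallel -I## cat dir_2/## > dir_3/{/}' ::: dir_1/*.txt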

Pretty easy to do with just shell. Something like
for fullname in directory_1/*.txt; do
    file=$(basename "$fullname")
    while read -r line; do
        cat "directory_2/$line"
    done <"$fullname" >"directory_3/$file"
done
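A small usage note on the same loop: if directory_3 might not exist yet, or a list file might lack a trailing newline (in which case read would miss the last name), a slightly more defensive variant of the same idea could look like:
mkdir -p directory_3
for fullname in directory_1/*.txt; do
    file=$(basename "$fullname")
    while read -r line || [ -n "$line" ]; do    # also handles a final name without a trailing newline
        cat "directory_2/$line"
    done <"$fullname" >"directory_3/$file"
done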

for file in directory_1/*.txt; do
    awk 'NR==FNR{ARGV[ARGC++]="directory_2/"$0; next} 1' "$file" > "directory_3/${file##*/}"
done
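For anyone unfamiliar with the ARGV trick used here, this is the same one-liner spelled out with comments (a sketch, same behaviour as above):
for file in directory_1/*.txt; do
    awk '
        NR==FNR {                            # while still reading the list file itself...
            ARGV[ARGC++] = "directory_2/"$0  # ...queue directory_2/<name> as an extra input file
            next                             # and do not print the list line
        }
        1                                    # print every line of the queued files
    ' "$file" > "directory_3/${file##*/}"
done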

Related

pasting many files to a single large file

I have many text files in a directory like 1.txt 2.txt 3.txt 4.txt ... 2000.txt and I want to paste them to make one large file.
In this regard I did something like
paste *.txt > largefile.txt
but the above command does not read the .txt files in numerical order, so I need to read the files sequentially and paste them as 1.txt 2.txt 3.txt ... 2000.txt.
Please suggest a better solution for pasting many files.
Thanks and looking forward to hearing from you.
Sort the file names numerically yourself then.
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
When dealing with many files, you may hit ulimit -n. On my system ulimit -n is 1024, but this is a soft limit and can be raised, e.g. with ulimit -n 99999.
Without raising the soft limit, use a temporary file that accumulates the results of each "round" of ulimit -n files, like:
touch accumulator.txt
... | xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
    paste accumulator.txt "$@" > accumulator.txt.sav;
    mv accumulator.txt.sav accumulator.txt
' _
cat accumulator.txt
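Putting the two pieces together (assuming the elided "..." above is the numerically sorted file list), a runnable sketch might look like this; the accumulator is named here so it does not itself match *.txt, and a little headroom is left under the fd limit for the accumulator, the output file and the standard streams:
touch accum.out
printf "%s\n" *.txt | sort -n |
    xargs -d '\n' -n $(($(ulimit -n) - 8)) sh -c '
        paste accum.out "$@" > accum.out.sav;
        mv accum.out.sav accum.out
    ' _
cat accum.out
# note: on the first round the empty accumulator contributes an empty leading column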
Instead of relying on the wildcard * to enumerate all the files in the directory, if your file names follow a sequential pattern you can list them explicitly in order and concatenate them into one large file. The order in which * enumerates files can differ between environments and may not be what you expect.
Below is a simple example
$ for i in `seq 20`;do echo $i > $i.txt;done
# create 20 test files, 1.txt, 2.txt, ..., 20.txt with number 1 to 20 in each file respectively
$ cat {1..20}.txt
# show content of all file in order 1.txt, 2.txt, ..., 20.txt
$ cat {1..20}.txt > 1_20.txt
# concatenate them to a large file named 1_20.txt
In bash, or any other shell, glob expansions are done in lexicographical order. With numbered files this sadly means that 11.txt < 1.txt < 2.txt. This weird ordering comes from the fact that, lexicographically, the digit 1 sorts before the dot character (".").
So here are a couple of ways to operate on your files in order:
rename all your files:
for i in *.txt; do mv "$i" "$(printf "%0.5d.txt" "${i%.*}")"; done
paste *.txt
use brace-expansion:
Brace expansion is a mechanism that allows for the generation of arbitrary strings. For integers you can use {n..m} to generate all numbers from n to m or {n..m..s} to generate all numbers from n to m in steps of s:
paste {1..2000}.txt
The downside here is that it is possible that a file is missing (e.g. 1234.txt). So you can do
shopt -s extglob; paste ?({1..2000}.txt)
The pattern ?(pattern) matches zero or one occurrence of the pattern, so this will exclude the missing files but keep the order.
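A quick demonstration of the idea (a sketch; nullglob is also enabled here so that patterns for files that do not exist expand to nothing instead of being passed to paste literally):
$ shopt -s extglob nullglob
$ touch 1.txt 3.txt          # 2.txt is deliberately missing
$ echo ?({1..3}.txt)
1.txt 3.txt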

Get last line from grep search on multiple files, and write them in a output file

I have multiple files located in multiple directories. From them I search a keyword 'ENERGY' by grep. In each file I get multiple match cases. I want to take the last line from each file and save the results in the output.txt file. I wrote the following code:
labl=SubDir
ENERGY=`grep 'ENERGY' MyDir*${labl}*/*.txt`
cat > output.txt << EOF
${ENERGY}
EOF
This code saves all match cases from each file. But as mentioned, I need the last match case from each file. For that I modified the grep command as:
ENERGY=`grep 'ENERGY' MyDir*${labl}*/*.txt | tail -1`
Unfortunately this doesn't do the job either; it only saves the last match case, from the last file.
How to solve it?
Please don't run multiple processes/pipes to achieve this.
gawk '/ENERGY/{last=$0} ENDFILE{if(last!="") print last; last=""}' MyDir*"$labl"*/*.txt
/ENERGY/{last=$0}: On lines which match the regex ENERGY, set variable last to the contents of the entire line $0
ENDFILE{...} Run this {action} at the end of every input file supplied by the glob.
if(last!="") print last: print last if it's not null
last="": reset this variable to null, avoiding duplication
MyDir*"${labl}"*/*.txt: Quoted variable in glob will match directory names that include spaces
Use a for loop:
for f in MyDir*"$labl"*/*.txt; do
    grep ENERGY "$f" | tail -1 >> output.txt
done
Yet another (though probably not the last) possible approach is to use parallel like this. You could probably achieve the same with xargs, but I personally prefer parallel as it is simpler and makes it easy to scale the process.
ls -1 file* | parallel -j1 "grep ENERGY {} | tail -n 1" > output.txt

Copy a txt file twice to a different file using bash

I am trying to cat a file.txt and loop it twice through the whole content and copy it to a new file file_new.txt. The bash command I am using is as follows:
for i in {1..3}; do cat file.txt > file_new.txt; done
The above command is just giving me the same file contents as file.txt. Hence file_new.txt is also of the same size (1 GB).
Basically, if file.txt is a 1GB file, then I want file_new.txt to be a 2GB file, double the contents of file.txt. Please, can someone help here? Thank you.
Simply apply the redirection to the for loop as a whole:
for i in {1..3}; do cat file.txt; done > file_new.txt
The advantage of this over using >> (aside from not having to open and close the file multiple times) is that you needn't ensure that a preexisting output file is truncated first.
Note that the generalization of this approach is to use a group command ({ ...; ...; }) to apply redirections to multiple commands; e.g.:
$ { echo hi; echo there; } > out.txt; cat out.txt
hi
there
Given that whole files are being output, the cost of invoking cat for each repetition will probably not matter that much, but here's a robust way to invoke cat only once:[1]
# Create an array of repetitions of filename 'file' as needed.
files=(); for ((i=0; i<3; ++i)); do files[i]='file'; done
# Pass all repetitions *at once* as arguments to `cat`.
cat "${files[#]}" > file_new.txt
[1] Note that, hypothetically, you could run into your platform's command-line length limit, as reported by getconf ARG_MAX - given that on Linux that limit is 2,097,152 bytes (2MB) that's not likely, though.
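To check that limit on your own system (a quick sketch):
$ getconf ARG_MAX    # prints the limit in bytes; typically 2097152 (2 MB) on Linux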
You could use the append operator, >>, instead of >. Then adjust your loop count as needed to get the output size desired.
You should adjust your code so it is as follows:
for i in {1..3}; do cat file.txt >> file_new.txt; done
The >> operator appends data to a file rather than writing over it (>)
if file.txt is a 1GB file,
cat file.txt > file_new.txt
cat file.txt >> file_new.txt
The > operator will create file_new.txt (1 GB).
The >> operator will append, giving file_new.txt (2 GB).
for i in {1..3}; do cat file.txt >> file_new.txt; done
This command will make file_new.txt (3 GB), because for i in {1..3} runs three times.
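To convince yourself of the sizes, a quick check after running the loop (a sketch; wc -c prints sizes in bytes):
$ wc -c file.txt file_new.txt    # after the {1..3} append loop, file_new.txt should be three times the size of file.txt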
As others have mentioned, you can use >> to append. But, you could also just invoke cat once and have it read the file 3 times. For instance:
n=3; cat $( yes file.txt | sed ${n}q ) > file_new.txt
Note that this solution exhibits a common anti-pattern and fails to properly quote the arguments, which will cause issues if the filename contains whitespace. See mklement's solution for a more robust solution.

Creating a file by merging two files

I would like to merge two files and create a new file using Linux command.
I have the two files named as a1b.txt and a1c.txt
Content of a1b.txt
Hi,Hi,Hi
How,are,you
Content of a1c.txt
Hadoop|are|world
Data|Big|God
And I need a new file called merged.txt with the below content (expected output):
Hi,Hi,Hi
How,are,you
Hadoop|are|world
Data|Big|God
To achieve that, in the terminal I am running the below command, but it gives me output like this:
Hi,Hi,Hi
How,are,youHadoop|are|world
Data|Big|God
cat /home/cloudera/inputfiles/a1* > merged.txt
Could somebody help me get the expected output?
Probably your files do not end with a newline character. Here is how to add the missing newline to them:
$ sed -i -e '$a\' /home/cloudera/inputfiles/a1*
$ cat /home/cloudera/inputfiles/a1* > merged.txt
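To check whether a file really ends with a newline, before or after the fix, one quick test (a sketch using tail and od):
$ tail -c 1 a1b.txt | od -c    # the last byte shows up as \n once the file ends with a newline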
If you are allowed to be destructive (not have to keep the original two files unmodified) then:
robert@debian:/tmp$ cat fileB.txt >> fileA.txt
robert@debian:/tmp$ cat fileA.txt
this is file A
This is file B.

How to append contents of multiple files into one file

I want to copy the contents of five files to one file as is. I tried doing it using cp for each file. But that overwrites the contents copied from the previous file. I also tried
paste -d "\n" 1.txt 0.txt
and it did not work.
I want my script to add the newline at the end of each text file.
e.g. files 1.txt, 2.txt, 3.txt. Put the contents of 1, 2, 3 in 0.txt.
How do I do it ?
You need the cat (short for concatenate) command, with shell redirection (>) into your output file
cat 1.txt 2.txt 3.txt > 0.txt
Another option, for those of you who still stumble upon this post like I did, is to use find -exec:
find . -type f -name '*.txt' -exec cat {} + >> output.file
In my case, I needed a more robust option that would look through multiple subdirectories so I chose to use find. Breaking it down:
find .
Look within the current working directory.
-type f
Only interested in files, not directories, etc.
-name '*.txt'
Whittle down the result set by name
-exec cat {} +
Execute the cat command for each result. "+" means only 1 instance of cat is spawned (thx @gniourf_gniourf)
>> output.file
As explained in other answers, append the cat-ed contents to the end of an output file.
If all your files have a certain extension (here .txt), then do something like this:
cat /path/to/files/*.txt >> finalout.txt
If all your files are named similarly you could simply do:
cat *.log >> output.log
If all your files are in a single directory you can simply do
cat * > 0.txt
Files 1.txt, 2.txt, ... will go into 0.txt.
for i in {1..3}; do cat "$i.txt" >> 0.txt; done
I found this page because I needed to join 952 files into one, and this works much better when you have many files. It loops over however many numbers you need and cats each file, using >> to append onto the end of 0.txt.
Edit:
as brought up in the comments:
cat {1..3}.txt >> 0.txt
or
cat {0..3}.txt >> all.txt
Another option is sed:
sed r 1.txt 2.txt 3.txt > merge.txt
Or...
sed h 1.txt 2.txt 3.txt > merge.txt
Or...
sed -n p 1.txt 2.txt 3.txt > merge.txt # -n is mandatory here
Or without redirection ...
sed wmerge.txt 1.txt 2.txt 3.txt
Note that the last line also writes merge.txt (not wmerge.txt!). You can use w"merge.txt" to avoid confusion with the file name, and -n for silent output.
Of course, you can also shorten the file list with wildcards. For instance, in case of numbered files as in the above examples, you can specify the range with braces in this way:
sed -n w"merge.txt" {1..3}.txt
If your files contain headers and you want to remove them in the output file, you can use:
for f in *.txt; do sed '2,$!d' "$f" >> 0.out; done
All of the (text-) files into one
find . | xargs cat > outfile
xargs makes the output-lines of find . the arguments of cat.
find has many options, like -name '*.txt' or -type.
you should check them out if you want to use it in your pipeline
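If any of the file names may contain spaces or newlines, a null-delimited variant is safer (a sketch; -print0 and xargs -0 are widely available GNU/BSD extensions):
find . -type f -name '*.txt' -print0 | xargs -0 cat > outfile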
If the original file contains non-printable characters, they will be lost when using the cat command. Using 'cat -v', the non-printables will be converted to visible character strings, but the output file would still not contain the actual non-printable characters of the original file. With a small number of files, an alternative might be to open the first file in an editor (e.g. vim) that handles non-printing characters. Then maneuver to the bottom of the file and enter ":r second_file_name". That will pull in the second file, including non-printing characters. The same can be done for additional files. When all files have been read in, enter ":w". The end result is that the first file will now contain what it did originally, plus the content of the files that were read in.
Send multiple files to one file (textall.txt):
cat *.txt > textall.txt
If you want to append contents of 3 files into one file, then the following command will be a good choice:
cat file1 file2 file3 | tee -a file4 > /dev/null
It will combine the contents of all files into file4, throwing console output to /dev/null.
