Cat several thousand files - linux

I have several(60,000) files in a folder that need to be combined into 3 separate files.
How would I cat this so that I could have each file containing the contents of ~20,000 of these files?
I know it would be like a loop:
for i in {1..20000}
do
cat file-$i > new_file_part_1
done

Doing:
cat file-$i > new_file_part_1
Will truncate new_file_part_1 every time the loop iterates. You want to append to the file:
cat file-$i >> new_file_part_1

The other answers close and open the file on every iteration. I would prefer
for i in {1..20000}
do
cat file-$i
done > new_file_part_1
so the output of all cat runs are piped into one file opend once for all.

Assuming it doesn't matter which input file goes to which output file:
for i in {1..60000}
do
cat file$i >> out$(($i % 3))
done
This script uses the modulo operator % to divide the input into 3 bins; it will generate 3 output files:
out0 contains file3, file6, file9, ...
out1 contains file1, file4, file7, ...
out2 contains file2, file5, file8, ...

#!/bin/bash
cat file-{1..20000} > new_file_part_1
This launches cat only once and opens and closes the output file only once. No loop required, since cat can accept all 20000 arguments.
An astute observer noted that on some systems, the 20000 arguments may exceed the system's ARG_MAX limit. In such a case, xargs can be used, with the penalty that cat will be launched more than once (but still significantly fewer than 20000 times).
echo file-{1..20000} | xargs cat > new_file_part_1
This works because, in Bash, echo is a shell built-in and as such is not subject to ARG_MAX.

Related

How to introduce an input in a C program through shell script?

When I execute the program in console I just do this:
./c1 2500
textfile.txt
and it just print a integer. The thing here is that I want to introduce 1000 textfiles as input so I made this script:
c=1
while [ $c -le 1000 ]
do
./c1 2500 >> sal.txt
$c.txt
(( c++ ))
done
The trouble here is that the script is not putting the output in the file text because is not iterating as it should, I think the problem is when the name of the filetext is introduced as $c.txt, how can i solve this?
Thanks for reading
$c.txt is not a command and the bash interpreter can't understand what that means
if you want to create a file, use touch [file]
or you want to copy a existing file to the destination, use cp [src_file] [dst_file]
so the code may like this:
./c1 2500 > $c.txt
or you may want to append the result to a file:
./c1 2500 > $c.txt
cat $c.txt >> sal.txt
ps:
> and >> both of these operators represent output redirection
> writes the output to the file
>> appends the output to the file
cat concatenates files and print on the standard output

pasting many files to a single large file

i have many text files in a directory like 1.txt 2.txt 3.txt 4.txt .......2000.txt and i want to paste them to make a large file.
In this regard i did something like
paste *.txt > largefile.txt
but the above command reads the .txt file randomly, so i need to read the files sequentially and paste as 1.txt 2.txt 3.txt....2000.txt
please suggest a better solution for pasting many files.
Thanks and looking forward to hearing from you.
Sort the file names numerically yourself then.
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
When dealing with many files, you may hit ulimit -n. On my system ulimit -n is 1024, but this is a soft limit and can be raised with just like ulimit -n 99999.
Without raising the soft limit, go with a temporary file that would accumulate results each "round" of ulimit -n count of files, like:
touch accumulator.txt
... | xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
paste accumulator.txt "$#" > accumulator.txt.sav;
mv accumulator.txt.sav accumulator.txt
' _
cat accumulator.txt
Instead use the wildcard * to enumerate all your files in a directory, if your file names pattern are sequentially ordered, you can manually list all files in order and concatenate to a large file. The output order of * enumeration might look different in different environment, as it not works as you expect.
Below is a simple example
$ for i in `seq 20`;do echo $i > $i.txt;done
# create 20 test files, 1.txt, 2.txt, ..., 20.txt with number 1 to 20 in each file respectively
$ cat {1..20}.txt
# show content of all file in order 1.txt, 2.txt, ..., 20.txt
$ cat {1..20}.txt > 1_20.txt
# concatenate them to a large file named 1_20.txt
In bash or any other shell, glob expansions are done in lexicographical order. When having files numberd, this sadly means that 11.txt < 1.txt < 2.txt. This weird ordering comes from the fact that, lexicographically, 1 < . (<dot>-character (".")).
So here are a couple of ways to operate on your files in order:
rename all your files:
for i in *.txt; do mv "$i" "$(sprintf "%0.5d.txt" ${i%.*}"); done
paste *.txt
use brace-expansion:
Brace expansion is a mechanism that allows for the generation of arbitrary strings. For integers you can use {n..m} to generate all numbers from n to m or {n..m..s} to generate all numbers from n to m in steps of s:
paste {1..2000}.txt
The downside here is that it is possible that a file is missing (eg. 1234.txt). So you can do
shopt -s extglob; paste ?({1..2000}.txt)
The pattern ?(pattern) matches zero or one glob-matches. So this will exclude the missing files but keeps the order.

Matching file content with other filenames to extract and merge contents

I have two directories.
In directory_1, I have many .txt files
Content of these files (for example file1.txt) are a list of characters
file1.txt
--
rer_098
dfrkk9
In directory_2, I have many files, two of them are ‘rer_098’ and ‘dfrkk9’.
Content of these files are as follows:
rer_098
--
>123_nbd
sasert
>456_nbd
ffjko
dfrkk9
--
>789_nbd
figyi
>012_nbd
jjjygk
Now in a separate output directory (directory_3), for this above example, I want output files like:
file1.txt
--
>123_nbd
sasert
>456_nbd
ffjko
>789_nbd
figyi
>012_nbd
jjjygk
and so on for file2.txt
Thanks!
This might work for you (GNU parallel):
parallel 'cat {} | parallel -I## cat dir_2/## > dir_3/{/}' ::: dir_1/*.txt
Use two invocations of parallel, the first traverses dir_1 and pipes its output in a second parallel. This cats the input files and outputs the result dir_3 keeping the original name from the first parallel invocation.
N.B. The use of the -I option to rename the parameter delimiters from the default {} to ##.
Pretty easy to do with just shell. Something like
for fullname in directory_1/*.txt; do
file=$(basename "$fullname")
while read -r line; do
cat "directory_2/$line"
done <"$fullname" >"directory_3/$file"
done
for file in directory_1/*.txt; do
awk 'NR==FNR{ARGV[ARGC++]="directory_2/"$0; next} 1' "$file" > "directory_3/${file##%/}"
done

Copy a txt file twice to a different file using bash

I am trying to cat a file.txt and loop it twice through the whole content and copy it to a new file file_new.txt. The bash command I am using is as follows:
for i in {1..3}; do cat file.txt > file_new.txt; done
The above command is just giving me the same file contents as file.txt. Hence file_new.txt is also of the same size (1 GB).
Basically, if file.txt is a 1GB file, then I want file_new.txt to be a 2GB file, double the contents of file.txt. Please, can someone help here? Thank you.
Simply apply the redirection to the for loop as a whole:
for i in {1..3}; do cat file.txt; done > file_new.txt
The advantage of this over using >> (aside from not having to open and close the file multiple times) is that you needn't ensure that a preexisting output file is truncated first.
Note that the generalization of this approach is to use a group command ({ ...; ...; }) to apply redirections to multiple commands; e.g.:
$ { echo hi; echo there; } > out.txt; cat out.txt
hi
there
Given that whole files are being output, the cost of invoking cat for each repetition will probably not matter that much, but here's a robust way to invoke cat only once:[1]
# Create an array of repetitions of filename 'file' as needed.
files=(); for ((i=0; i<3; ++i)); do files[i]='file'; done
# Pass all repetitions *at once* as arguments to `cat`.
cat "${files[#]}" > file_new.txt
[1] Note that, hypothetically, you could run into your platform's command-line length limit, as reported by getconf ARG_MAX - given that on Linux that limit is 2,097,152 bytes (2MB) that's not likely, though.
You could use the append operator, >>, instead of >. Then adjust your loop count as needed to get the output size desired.
You should adjust your code so it is as follows:
for i in {1..3}; do cat file.txt >> file_new.txt; done
The >> operator appends data to a file rather than writing over it (>)
if file.txt is a 1GB file,
cat file.txt > file_new.txt
cat file.txt >> file_new.txt
The > operator will create file_new.txt(1GB),
The >> operator will append file_new.txt(2GB).
for i in {1..3}; do cat file.txt >> file_new.txt; done
This command will make file_new.txt(3GB),because for i in {1..3} will run three times.
As others have mentioned, you can use >> to append. But, you could also just invoke cat once and have it read the file 3 times. For instance:
n=3; cat $( yes file.txt | sed ${n}q ) > file_new.txt
Note that this solution exhibits a common anti-pattern and fails to properly quote the arguments, which will cause issues if the filename contains whitespace. See mklement's solution for a more robust solution.

Problem with Bash output redirection [duplicate]

This question already has answers here:
Why doesnt "tail" work to truncate log files?
(6 answers)
Closed 1 year ago.
I was trying to remove all the lines of a file except the last line but the following command did not work, although file.txt is not empty.
$cat file.txt |tail -1 > file.txt
$cat file.txt
Why is it so?
Redirecting from a file through a pipeline back to the same file is unsafe; if file.txt is overwritten by the shell when setting up the last stage of the pipeline before tail starts reading off the first stage, you end up with empty output.
Do the following instead:
tail -1 file.txt >file.txt.new && mv file.txt.new file.txt
...well, actually, don't do that in production code; particularly if you're in a security-sensitive environment and running as root, the following is more appropriate:
tempfile="$(mktemp file.txt.XXXXXX)"
chown --reference=file.txt -- "$tempfile"
chmod --reference=file.txt -- "$tempfile"
tail -1 file.txt >"$tempfile" && mv -- "$tempfile" file.txt
Another approach (avoiding temporary files, unless <<< implicitly creates them on your platform) is the following:
lastline="$(tail -1 file.txt)"; cat >file.txt <<<"$lastline"
(The above implementation is bash-specific, but works in cases where echo does not -- such as when the last line contains "--version", for instance).
Finally, one can use sponge from moreutils:
tail -1 file.txt | sponge file.txt
You can use sed to delete all lines but the last from a file:
sed -i '$!d' file
-i tells sed to replace the file in place; otherwise, the result would write to STDOUT.
$ is the address that matches the last line of the file.
d is the delete command. In this case, it is negated by !, so all lines not matching the address will be deleted.
Before 'cat' gets executed, Bash has already opened 'file.txt' for writing, clearing out its contents.
In general, don't write to files you're reading from in the same statement. This can be worked around by writing to a different file, as above:$cat file.txt | tail -1 >anotherfile.txt
$mv anotherfile.txt file.txtor by using a utility like sponge from moreutils:$cat file.txt | tail -1 | sponge file.txt
This works because sponge waits until its input stream has ended before opening its output file.
When you submit your command string to bash, it does the following:
Creates an I/O pipe.
Starts "/usr/bin/tail -1", reading from the pipe, and writing to file.txt.
Starts "/usr/bin/cat file.txt", writing to the pipe.
By the time 'cat' starts reading, 'file.txt' has already been truncated by 'tail'.
That's all part of the design of Unix and the shell environment, and goes back all the way to the original Bourne shell. 'Tis a feature, not a bug.
tmp=$(tail -1 file.txt); echo $tmp > file.txt;
This works nicely in a Linux shell:
replace_with_filter() {
local filename="$1"; shift
local dd_output byte_count filter_status dd_status
dd_output=$("$#" <"$filename" | dd conv=notrunc of="$filename" 2>&1; echo "${PIPESTATUS[#]}")
{ read; read; read -r byte_count _; read filter_status dd_status; } <<<"$dd_output"
(( filter_status > 0 )) && return "$filter_status"
(( dd_status > 0 )) && return "$dd_status"
dd bs=1 seek="$byte_count" if=/dev/null of="$filename"
}
replace_with_filter file.txt tail -1
dd's "notrunc" option is used to write the filtered contents back, in place, while dd is needed again (with a byte count) to actually truncate the file. If the new file size is greater or equal to the old file size, the second dd invocation is not necessary.
The advantages of this over a file copy method are: 1) no additional disk space necessary, 2) faster performance on large files, and 3) pure shell (other than dd).
As Lewis Baumstark says, it doesn't like it that you're writing to the same filename.
This is because the shell opens up "file.txt" and truncates it to do the redirection before "cat file.txt" is run. So, you have to
tail -1 file.txt > file2.txt; mv file2.txt file.txt
echo "$(tail -1 file.txt)" > file.txt
Just for this case it's possible to use cat < file.txt | (rm file.txt; tail -1 > file.txt)
That will open "file.txt" just before connection "cat" with subshell in "(...)". "rm file.txt" will remove reference from disk before subshell will open it for write for "tail", but contents will be still available through opened descriptor which is passed to "cat" until it will close stdin. So you'd better be sure that this command will finish or contents of "file.txt" will be lost
It seems to not like the fact you're writing it back to the same filename. If you do the following it works:
$cat file.txt | tail -1 > anotherfile.txt
tail -1 > file.txt will overwrite your file, causing cat to read an empty file because the re-write will happen before any of the commands in your pipeline are executed.

Resources