Merge multiple files preserving the original sequence in Unix/Linux

I have multiple (more than 100) text files in the directory such as
files_1_100.txt
files_101_200.txt
The contents of each file are the names of some variables; for example, files_1_100.txt contains some variable names between 1 and 100:
"var.2"
"var.5"
"var.15"
Similarly, files_201_300.txt contains some variables between 201 and 300:
"var.203"
"var.227"
"var.285"
and files_1001_1100.txt contains
"var.1010"
"var.1006"
"var.1025"
I can merge them using the command
cat files_*00.txt > ../all_files.txt
However, the order of the contents does not follow the numeric order of the source files. For example, all_files.txt shows
"var.1010"
"var.1006"
"var.1025"
"var.1"
"var.5"
"var.15"
"var.203"
"var.227"
"var.285"
So, how can I ensure that the contents of files_1_100.txt come first, followed by files_201_300.txt and then files_1001_1100.txt, so that all_files.txt contains
"var.1"
"var.5"
"var.15"
"var.203"
"var.227"
"var.285"
"var.1010"
"var.1006"
"var.1025"

Let me try it out, but I think that this will work:
ls files_*.txt | sort -n -t _ -k2 -k3 | xargs cat
The idea is to take your list of files and sort them and then pass them to the cat command.
The sort uses several options:
-n - use a numeric sort rather than alphabetic
-t _ - divide the input (the filename) into fields using the underscore character
-k2 -k3 - sort first by the 2nd field and then by the 3rd field (the 2 numbers)
You have said that your files are named files_1_100.txt, files_101_200.txt, etc. If that means (as it seems to indicate) that the first numeric "chunk" is always unique, then you can leave off the -k3 flag. That flag is needed only if you would end up with, for instance, files_100_2.txt and files_100_10.txt, where you have to look at the 2nd numeric "chunk" to determine the preferred order.
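For instance, with two such hypothetical names (not from your listing, just to illustrate the flag), the sort would be:
printf '%s\n' files_100_10.txt files_100_2.txt | sort -n -t _ -k2 -k3
which prints:
files_100_2.txt
files_100_10.txt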
Depending on the number of files you are working with, you may find that the expanded glob (files_*.txt) overwhelms the command line and causes "argument list too long" errors. If that's the case you could do it like this:
ls | grep '^files_.*\.txt$' | sort -n -t _ -k2 -k3 | xargs cat

You can use printf and sort, and pipe that to xargs cat:
printf "%s\0" f*txt | sort -z -t_ -nk2 | xargs -0 cat > ../all_files.txt
Note that the whole pipeline works on NUL-terminated filenames, making sure this command works even for filenames with spaces, newlines, etc.

If your filenames are free from any special characters or whitespace, then the other answers should be easy solutions.
Otherwise, try this rename-based approach:
$ ls files_*.txt
files_101_200.txt files_1_100.txt
$ rename 's/files_([0-9]*)_([0-9]*)/files_000$1_000$2/;s/files_0*([0-9]{3})_0*([0-9]{3})/files_$1_$2/' files_*.txt
$ ls files_*.txt
files_001_100.txt files_101_200.txt
$ cat files_*.txt > outputfile.txt
$ rename 's/files_0*([0-9]*)_0*([0-9]*)/files_$1_$2/' files_*.txt

The default ordering of cat file_* is alphabetical (the shell expands the glob lexicographically), rather than numeric.
List them in numerical order and then cat each one, appending the output to some file.
ls -1 | sort -n | xargs -I{} cat {} >> file.out

You could try using a for loop and adding the files one by one (ls -v sorts the files correctly when the numbers are not zero-padded):
for i in $(ls -v files_*.txt)
do
cat $i >> ../all_files.txt
done
or more convenient in a single line:
for i in $(ls -v files_*.txt) ; do cat $i >> ../all_files.txt ; done

You could also do this with Awk by splitting and sorting ARGV:
awk 'BEGIN {
    # insertion sort of ARGV[1..ARGC-1] by the first number in each filename
    for (i = 2; i < ARGC; i++) {
        tmp = ARGV[i]
        split(tmp, curr, "_")
        for (j = i - 1; j >= 1; j--) {
            split(ARGV[j], prev, "_")
            if (prev[2] + 0 <= curr[2] + 0)
                break
            ARGV[j + 1] = ARGV[j]
        }
        ARGV[j + 1] = tmp
    }
}1' files_*00.txt

Related

Bash, listing files according to the file names with numbers and strings

I am trying to sort the files based on their file names:
first sort based on the number, then sort based on the letters.
The numbers are the age, so adult should come last.
If the same numbers are used, sort based on rep1, rep2, and rep3.
The original files, as listed by ls, look like this:
abc.0.5
abc.1.2
abc.15.3
abc.2.3
abc.35.5dpp
abc.35.5dpp.rep2
abc.7.3
abc.adult
abc.adult.rep2
abc.adult.rep3
the result should be
abc.0.5
abc.1.2
abc.2.3
abc.7.3
abc.15.3
abc.35.5dpp
abc.35.5dpp.rep2
abc.adult
abc.adult.rep2
abc.adult.rep3
I have tried
ls -v a*| sed 's/\.adult\./.99./' | sort -V | sed 's/\.99\./.adult./'
But it won't order correctly when there is dpp or rep after the numbers or after adult.
If you want to force adult to be last, just move it to the end by doing two separate batches, one of filenames that include that string, and another batch of filenames that don't. That is:
shopt -s extglob nullglob
print_lines() { (( $# )) && printf '%s\n' "$@"; }
{ print_lines abc!(*adult*) | sort -V; print_lines abc*adult* | sort -V; }
I think you want:
ls -v a* | sed 's/\.adult/.99/' | sort -t. -k2,2n -k3,3n | sed 's/\.99/.adult/'
This one does the job:
ls -l abc* | awk '{print $9}'| sed 's/\.adult/.99/' | sort -V | sed 's/\.99/.adult/'

concatenate ordered files using cat on Linux

I have files from 1 to n, which look like the following:
sim.o500.1
sim.o500.2
.
.
.
sim.o500.n
Each file contains only one line. Now I want to concatenate them in the order from 1 to n.
I tried cat sim.o500.* > out.dat. Sadly this does not work if e.g. n is larger than 9, because this then concatenates sim.o500.1 followed by sim.o500.10, and not sim.o500.1 followed by sim.o500.2.
How can I loop through the file names using the numeric order?
Since * expands in a non-numeric-sorted way, you'd better create the sequence yourself with seq: this way, 10 will come after 9, etc.
for id in $(seq $n)
do
cat sim.o500.$id >> out.dat
done
Note I use seq so that you are able to use a variable to indicate the length of the sequence. If this value happens to be fixed and known beforehand, you can directly use range expansion writing the n value like: for id in {1..23}.
echo {1..12}
would print
1 2 3 4 5 6 7 8 9 10 11 12
You can use this Bash's range expansion feature to your advantage.
cat sim.o500.{1..20}
would expand to the filenames in numerically sorted order and it is terse (fewer keystrokes).
One caveat is that you may hit the "too many arguments" error if the file count exceeds the limit.
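If you do hit that limit, one workaround (just a sketch, reusing the sim.o500. prefix and a shell variable n for the file count) is to generate the names and feed them through xargs, which batches the arguments for you:
seq 1 "$n" | sed 's/^/sim.o500./' | xargs cat > out.dat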
Try
ls sim.o500.* | sort -t "." -n -k 3,3 | xargs cat > out.dat
Explanation:
ls
ls sim.o500.* will produce a list of file names, matching pattern sim.o500.*, and pass it over the pipe to sort
sort
sort -t "." -n -k 3,3 will grab all those file names and sort them as numbers (-n) in descending order using 3rd column (-k 3,3) and pass it over the pipe to xargs.
-t "." tells sort to use . as a separator, instead of spacing characters, as it does by default. So, having sim.o500.5 as an example, the first column would be sim, the second column: o500 and the third column: 5.
xargs
xargs cat > out.dat will start cat with the filenames received via the pipe from sort as command arguments, and the combined output is redirected to out.dat. It does something like:
execute("cat sim.o500.1 sim.o500.2 sim.o500.3 ... sim.o500.n > out.dat")
> out.dat
for i in `ls -v sim*`
do
echo $i
cat $i >> out.dat
done
If you have all the files in one folder, try this command (sort -V sorts version-style numbers, so 10 comes after 9; a plain sort would repeat the lexicographic problem):
cat $(ls -- sim* | sort -V) >> out.dat

How to match strings in file names and rename file according to string?

I have many files with matching strings in file names.
foostring.bar
barstring.bar
fuustring.bar
aha_foostring.abc
meh_barstring.abc
lol_fuustring.abc
...
I need to find the bar and abc files with matching strings, and rename the *.bar files' basenames to look like the *.abc files' basenames. In other words, add a string prefix.
The result I'm looking for should look like this:
aha_foostring.bar
meh_barstring.bar
lol_fuustring.bar
aha_foostring.abc
meh_barstring.abc
lol_fuustring.abc
...
Clarification Edit: The strings in the *.abc files are always situated after the last underscore (_) and before the dot (.). The string only contains letters and numbers. The prefix can contain any number of characters of any type, including _ and ., which means I also need to take the example below into consideration.
dingdongstring.bar
w_h.a.t_e_v_e.r_dingdongstring.abc
I've been experimenting with find, prefix and basename, but I need help and advice here.
Thanks
I would go with something like this:
(I am sure there are more elegant ways to do it (awk/sed))
#!/bin/bash
for filename in *.abc
do
prefix=${filename%_*}
searchstring=${filename%.abc}
searchstring=${searchstring#*_}
if [[ -f "$searchstring.bar" ]]
then
mv "${searchstring}.bar" "${prefix}_${searchstring}.bar"
fi
done
# show the result
ls -al
Apologies for adding this in your answer, but since I've deleted my answer, your answer is closest to what OP needs. (I don't mind... I care about solutions =)
EDIT: Probably this is what OP wants:
for f in *.abc; do
prefix=${f%_*}
bar=${f%.abc}
bar="${bar##*_}.bar"
[[ -f "$bar" ]] && mv "$bar" "${prefix}_${bar}"
done
I suggest trying the following "magick":
$ join -j 2 <(ls -1 . | sed -n '/\.bar/s/^\(.*\)\(\.[^.]\+\)$/\1\2\t\1/p' | sort -k2) <(ls -1 . | sed -n '/\.abc/s/^\(.\+_\)\?\([a-zA-Z0-9]\+\)\(\.[^.]\+\)$/\1\2\3\t\2\t\1/p' | sort -k2) | awk '{print $2 " " $4}' | while read FILE PREFIX; do echo mv -v "$FILE" "$PREFIX$FILE"; done
mv -v barstring.bar meh_barstring.bar
mv -v dingdongstring.bar w_h.a.t_e_v_e.r_dingdongstring.bar
mv -v foostring.bar aha_foostring.bar
mv -v fuustring.bar lol_fuustring.bar
If it shows the expected commands, remove the echo before mv and run it again to make the changes.
Note also that I use the ls -1 . command to list the files of the current directory; you'll probably need to change directory or run the command in the directory with the files.
A little explanation:
The idea behind that code is to create pairs of filename and common part for the .bar and .abc files:
$ ls -1 . | sed -n '/\.bar/s/^\(.*\)\(\.[^.]\+\)$/\1\2\t\1/p' | sort -k2
barstring.bar barstring
dingdongstring.bar dingdongstring
foostring.bar foostring
fuustring.bar fuustring
$ ls -1 . | sed -n '/\.abc/s/^\(.\+_\)\?\([a-zA-Z0-9]\+\)\(\.[^.]\+\)$/\1\2\3\t\2\t\1/p' | sort -k2
meh_barstring.abc barstring meh_
w_h.a.t_e_v_e.r_dingdongstring.abc dingdongstring w_h.a.t_e_v_e.r_
aha_foostring.abc foostring aha_
lol_fuustring.abc fuustring lol_
As you can see, the 2nd field is the common part. After that we join these lists together by common part and leave only the .bar filename and the prefix:
$ join -j 2 <(ls -1 . | sed -n '/\.bar/s/^\(.*\)\(\.[^.]\+\)$/\1\2\t\1/p' | sort -k2) <(ls -1 . | sed -n '/\.abc/s/^\(.\+_\)\?\([a-zA-Z0-9]\+\)\(\.[^.]\+\)$/\1\2\3\t\2\t\1/p' | sort -k2) | awk '{print $2 " " $4}'
barstring.bar meh_
dingdongstring.bar w_h.a.t_e_v_e.r_
foostring.bar aha_
fuustring.bar lol_
And the final step is to rename the files by adding the appropriate prefix to them.

How to find file in each directory with the highest number as filename?

I have a file structure that looks like this
./501.res/1.bin
./503.res/1.bin
./503.res/2.bin
./504.res/1.bin
and I would like to find the file path to the .bin file in each directory which have the highest number as filename. So the output I am looking for would be
./501.res/1.bin
./503.res/2.bin
./504.res/1.bin
The highest number a file can have is 9.
Question
How do I do that in BASH?
I have come as far as find .|grep bin|sort
Globs are guaranteed to be expanded in lexical order.
for dir in ./*/
do
files=($dir/*) # create an array
echo "${files[#]: -1}" # access its last member
done
What about using awk? You can get the FIRST occurrence really simply:
[ghoti@pc ~]$ cat data1
./501.res/1.bin
./503.res/1.bin
./503.res/2.bin
./504.res/1.bin
[ghoti@pc ~]$ awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' data1
./501.res/1.bin
./503.res/1.bin
./504.res/1.bin
[ghoti@pc ~]$
To get the last occurrence you could pipe through a couple of sorts:
[ghoti@pc ~]$ sort -r data1 | awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' | sort
./501.res/1.bin
./503.res/2.bin
./504.res/1.bin
[ghoti@pc ~]$
Given that you're using "find" and "grep", you could probably do this:
find . -name \*.bin -type f -print | sort -r | awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' | sort
How does this work?
The find command has many useful options, including the ability to select your files by glob, select the type of file, etc. Its output you already know, and that becomes the input to sort -r.
First, we sort our input data in reverse (sort -r). This ensures that within any directory, the highest numbered file will show up first. That result gets fed into awk. FS is the field separator, which makes $2 into things like "/501", "/502", etc. Awk scripts have sections in the form of condition {action} which get evaluated for each line of input. If a condition is missing, the action runs on every line. If "1" is the condition and there is no action, it prints the line. So this script is broken out as follows:
a[$2] {next} - If the array a with the subscript $2 (i.e. "/501") exists, just jump to the next line. Otherwise...
{a[$2]=1} - set the array a subscript $2 to 1, so that in future the first condition will evaluate as true, then...
1 - print the line.
The output of this awk script will be the data you want, but in reverse order. The final sort puts things back in the order you'd expect.
Now ... that's a lot of pipes, and sort can be a bit resource hungry when you ask it to deal with millions of lines of input at the same time. This solution will be perfectly sufficient for small numbers of files, but if you're dealing with large quantities of input, let us know, and I can come up with an all-in-one awk solution (that will take longer than 60 seconds to write).
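For the curious, a rough sketch of such an all-in-one awk approach (assuming, as in the example data, that the directory is the second-to-last path component and the filename starts with the number) might be:
find . -name '*.bin' -type f -print | awk -F'/' '
{
    dir = $(NF-1)                  # e.g. "501.res"
    num = $NF + 0                  # e.g. "2.bin" -> 2
    if (!(dir in best) || num > best[dir]) {
        best[dir] = num            # highest number seen so far for this directory
        path[dir] = $0             # and the full path that goes with it
    }
}
END {
    for (d in path) print path[d]
}' | sort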
UPDATE
Per Dennis' sage advice, the awk script I included above could be improved by changing it from
BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1
to
BEGIN{FS="."} $2 in a {next} {a[$2]} 1
While this is functionally identical, the advantage is that you simply define array members rather than assigning values to them, which may save memory or cpu depending on your implementation of awk. At any rate, it's cleaner.
Tested:
find . -type d -name '*.res' | while read dir; do
find "$dir" -maxdepth 1 | sort -n | tail -n 1
done
I came up with something like this:
for dir in $(find . -mindepth 1 -type d | sort); do
file=$(ls "$dir" | sort | tail -n 1);
[ -n "$file" ] && (echo "$dir/$file");
done
Maybe it can be simpler
If invoking a shell from within find is an option try this
find * -type d -exec sh -c "echo -n './'; ls -1 {}/*.bin | sort -n -r | head -n 1" \;
And here is a one-liner:
find . -mindepth 1 -type d | sort | sed -e "s/.*/ls & | sort | tail -n 1 | xargs -I{} echo &\/{}/" | bash

How to split a file and keep the first line in each of the pieces?

Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).
Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
I am guessing some concoction of split and head will do the trick?
This is robhruska's script cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.
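As another illustration, a sketch of a filter that keeps the header on each piece and compresses it on the fly (same file.txt and split_ prefix as above; the gzip step and .gz suffix are just for the example):
tail -n +2 file.txt | split -l 4 --filter='{ head -n 1 file.txt; cat; } | gzip > "$FILE.gz"' - split_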
This one-liner will split the big csv into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows)
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer.
See comments for some tips on installing parallel
You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
You can use [mg]awk:
awk 'NR==1{
    header=$0;
    count=1;
    print header > ("x_" count);
    next
}
!( (NR-1) % 100){
    count++;
    print header > ("x_" count);
}
{
    print $0 > ("x_" count)
}' file
100 is the number of lines of each slice.
It doesn't require temp files and can be put on a single line.
I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done
This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.
Use GNU Parallel:
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
If you need to run a command on each of the parts, then GNU Parallel can help do that, too:
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
If you want to split into 10 MB blocks:
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
Below is a 4-liner that can be used to split bigfile.csv into multiple smaller files while preserving the csv header. It uses only standard commands (head, split, find, grep, xargs, and sed) which should work on most *nix systems. It should also work on Windows if you install mingw-64 / git-bash.
csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find .|grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00
Line by line explanation:
Capture the header to a variable named csvheader
Split the bigfile.csv into a number of smaller files with prefix smallfile_
Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that you need to use sed within "double quotes" in order to use variables.
The first file named smallfile_00 will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with sed -i '1d' command.
This is a more robust version of Dennis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around after an incomplete run. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyway.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat $file >> tmp_file
mv -f tmp_file $file
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
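Combining the trap with mktemp, a sketch (reusing the same file.txt / split_ names as the script above) might look like this:
tmp_file=$(mktemp) || exit 1
trap 'rm -f split_* "$tmp_file"; exit 13' INT TERM QUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > "$tmp_file"     # header first
    cat "$file" >> "$tmp_file"           # then the piece itself
    cp "$tmp_file" "$file"               # cp keeps the piece's own permissions
done
rm -f "$tmp_file"
trap - INT TERM QUIT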
I liked marco's awk version and adapted it into this simplified one-liner, where you can specify the split fraction as granularly as you want:
awk 'NR==1{print $0 > (FILENAME ".split1"); print $0 > (FILENAME ".split2");} NR>1{if (NR % 10 > 5) print $0 >> (FILENAME ".split1"); else print $0 >> (FILENAME ".split2")}' file
I really liked Rob and Dennis' versions, so much so that I wanted to improve them.
Here's my version:
in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done
Differences:
in_file is the file argument you want to split maintaining headers
Use awk instead of tail due to awk having better performance
split into 100,000 line files instead of 4
Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
Use mktemp to safely handle temporary files
Use single head | cat line instead of two lines
Inspired by @Arkady's comment on a one-liner.
MYFILE variable simply to reduce boilerplate
split doesn't show file name, but the --additional-suffix option allows us to easily control what to expect
removal of intermediate files via rm $part (assumes no files with same suffix)
MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done
Evidence:
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xaafoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xabfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xacfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040110 Jun 1 23:18 mycsv.csv.xadfoo
and of course head -2 *foo to see the header is added.
A simple but maybe not as elegant way: cut off the header beforehand, split the file, and then rejoin the header onto each piece with cat, or with whatever tool is reading it in.
So something like:
head -n1 file.txt > header.txt
tail -n +2 file.txt | split -l 500 -
cat header.txt xaa
I had a better result using the following code: every split file will have a header, and the generated files will have normalized names.
export F=input.csv && LINES=3 &&\
export PF="${F%.*}_" &&\
split -l $LINES "${F}" "${PF}" &&\
for fn in $PF*
do
mv "${fn}" "${fn}.csv"
done &&\
export FILES=($PF*) && for file in "${FILES[@]:1}"
do
head -n 1 "${F}" > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "${file}"
done
output
$ wc -l input*
22 input.csv
3 input_aa.csv
4 input_ab.csv
4 input_ac.csv
4 input_ad.csv
4 input_ae.csv
4 input_af.csv
4 input_ag.csv
2 input_ah.csv
51 total
