I have files from 1 to n, which look like the following:
sim.o500.1
sim.o500.2
.
.
.
sim.o500.n
Each file contains only one line. Now I want to concatenate them in the order from 1 to n.
I tried cat sim.o500.* > out.dat. Sadly this does not work if n is larger than 9, because the glob then concatenates sim.o500.1 followed by sim.o500.10, not sim.o500.1 followed by sim.o500.2.
How can I loop through the file names using the numeric order?
Since * expands in lexicographic rather than numeric order, you are better off generating the sequence yourself with seq: that way, 10 comes after 9, and so on.
for id in $(seq $n)
do
cat sim.o500.$id >> out.dat
done
Note I use seq so that you can use a variable for the length of the sequence. If that value is fixed and known beforehand, you can use brace expansion directly, writing the value of n in place, e.g. for id in {1..23}.
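If seq is not available, a Bash arithmetic for loop is an equivalent sketch (assuming n is already set):
for (( id = 1; id <= n; id++ ))
do
cat "sim.o500.$id" >> out.dat
done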
echo {1..12}
would print
1 2 3 4 5 6 7 8 9 10 11 12
You can use this Bash's range expansion feature to your advantage.
cat sim.o500.{1..20}
would expand to the filenames in numeric order, and it is terse (fewer keystrokes).
One caveat is that you may hit the "Argument list too long" error if the file count exceeds the system's argument-length limit.
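If you do hit that limit, one possible workaround (a sketch, not part of the answer above) is to pipe the expanded names through xargs: printf is a shell builtin, so it is not subject to the exec argument-length limit, and xargs batches the cat invocations as needed (this assumes the file names contain no whitespace or quotes, which is true here).
printf '%s\n' sim.o500.{1..100000} | xargs cat > out.dat    # the upper bound is just a placeholder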
Try
ls sim.o500.* | sort -t "." -n -k 3,3 | xargs cat > out.dat
Explanation:
ls
ls sim.o500.* will produce the list of file names matching the pattern sim.o500.* and pass it over the pipe to sort
sort
sort -t "." -n -k 3,3 will grab all those file names and sort them as numbers (-n) in descending order using 3rd column (-k 3,3) and pass it over the pipe to xargs.
-t "." tells sort to use . as a separator, instead of spacing characters, as it does by default. So, having sim.o500.5 as an example, the first column would be sim, the second column: o500 and the third column: 5.
xargs
xargs cat > out.dat will start cat with all the file names received via the pipe from sort as its arguments; the > out.dat redirection is set up by the shell and inherited by cat. In effect it executes something like:
cat sim.o500.1 sim.o500.2 sim.o500.3 ... sim.o500.n > out.dat
> out.dat    # create/empty the output file first
for i in `ls -v sim*`
do
echo $i
cat $i >> out.dat
done
If you have all the files in one folder, you can try this command (sort -V sorts the embedded numbers in natural order, so 10 follows 9):
cat $(ls -- sim* | sort -V) >> out.dat
I have files in the below format, generated by another system:
12;453453;TBS;OPPS;
12;453454;TGS;OPPS;
12;453455;TGS;OPPS;
12;453456;TGS;OPPS;
20;787899;THS;CLST;
33;786789;
I have to check whether the last line starts with 33; if it does, I have to copy the file/files to another location, else discard the file.
Currently I am doing it as below:
tail -1 abc.txt >> c.txt
awk '{print substr($0,0,2)}' c.txt
Then the output is saved to another variable and used to decide whether to copy.
Can anyone suggest a simpler way?
Thank you!
Imagine you have the following input file:
$ cat file
a
b
c
d
e
agc
Then you can run any of the following (awk, sed, grep, cut, or a Bash substring) to get the first 2 characters of the last line:
AWK
$ awk 'END{print substr($0,1,2)}' file
ag
SED
$ sed -n '$s/^\(..\).*/\1/p' file
ag
GREP
$ tail -1 file | grep -oE '^..'
ag
CUT
$ tail -1 file | cut -c '1-2'
ag
BASH SUBSTRING
line=$(tail -1 file); echo ${line:0:2}
All of those commands do what you are looking for. The awk command operates directly on the last line of the file, so you no longer need tail. The sed command pulls the last line of the file into its pattern space, replaces everything except the first 2 characters with nothing, and then prints the pattern space (the first 2 characters of the last line). Another option is to tail the last line of the file and extract the first 2 characters with grep; by piping those 2 commands you can also do it in one step without intermediate variables or files.
Now if you want to put everything in one script this become:
$ more file check_2chars.sh
::::::::::::::
file
::::::::::::::
a
b
c
d
e
33abc
::::::::::::::
check_2chars.sh
::::::::::::::
#!/bin/bash
s1=$(tail -1 file | cut -c 1-2) #you can use other commands from this post
s2=33
if [ "$s1" == "$s2" ]
then
echo "match" #implement the copy/discard logic
fi
Execution:
$ ./check_2chars.sh
match
I will let you implement the copy/discard logic
Given the task of either copying or deleting files based on their contents, shell variables aren't necessary.
Using sed's F command (print the current input file name) and xargs, the whole task can be done in just one line:
find | xargs -l sed -n '${/^33/!F}' | xargs -r rm ; cp * dest/dir/
Or preferably, with GNU sed:
sed -sn '${/^33/!F}' * | xargs -r rm ; cp * dest/dir/
Or if all the filenames contain no whitespace:
rm -r $(sed -sn '${/^33/!F}' *) ; cp * dest/dir/
That assumes all the files in the current directory are to be tested.
sed looks at the last line ($) of every file, and runs what's in the curly braces.
For any of those last lines that does not begin with 33 (/^33/!), sed outputs just that unwanted file name (F).
Supposing the unwanted files are named foo and baz -- those are piped to xargs which runs rm foo baz.
At this point the only files left should be copied over to dest/dir/: cp * dest/dir/.
It's efficient, cp and rm need only be run once.
If a shell variable must be used, here are two more methods:
Using tail and bash, store the first two chars of the last line in $n:
n="$(tail -1 abc.txt)" n="${n:0:2}"
Here's a more portable POSIX shell version:
n="$(tail -1 abc.txt)" n="${n%${n#??}}"
You may explicitly test with sed for the last line ($) starting with 33 (/^33.*/):
echo " 12;453453;TBS;OPPS;
12;453454;TGS;OPPS;
12;453455;TGS;OPPS;
12;453456;TGS;OPPS;
20;787899;THS;CLST;
33;786789;" | sed -n "$ {/^33.*/p}"
33;786789;
If you store the result in a variable, you may test it for being empty or not:
lastline33=$(echo " 12;453453;TBS;OPPS;
12;453454;TGS;OPPS;
12;453455;TGS;OPPS;
12;453456;TGS;OPPS;
20;787899;THS;CLST;
33;786789;" | sed -n "$ {/^33.*/p}")
echo $(test -n "$lastline33" && echo not null || echo null)
not null
You probably want the regular expression to include the semicolon, because otherwise it would also match 330, 331, ... 339, 33401345 and so on. Maybe that can be ruled out from the context, but to me it seems a good idea:
lastline33=$(sed -n "$ {/^33;.*/p}" abc.txt)
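A minimal sketch wiring that test into the copy/discard step (abc.txt and the destination directory are placeholders):
if [ -n "$(sed -n '$ {/^33;/p}' abc.txt)" ]
then
cp abc.txt /some/other/location/    # last line starts with 33: copy
fi                                  # otherwise: discard (skip or rm, as required)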
I want to rename multiple jpg files in a directory so they have 9 digit sequence number. I also want the files to be sorted by date from oldest to newest. I came up with this:
ls -tr | nl -v 100000000 | while read n f; do mv "$f" "$n.jpg"; done
this renames the files as I want them but the sequence numbers do not follow the date. I have also tried doing
ls -tr | cat -n .....
but that does not allow me to specify the starting sequence number.
Any suggestions as to what's wrong with my syntax?
Any other ways of achieving my goal?
Thanks
If any of your filenames contain whitespace, you can use the following:
i=100000000
find -type f -printf '%T# %p\0' | \
sort -zk1nr | \
sed -z 's/^[^ ]* //' | \
xargs -0 -I % echo % | \
while read f; do
mv "$f" "$(printf "%09d" $i).jpg"
let i++
done
Note that this doesn't parse the output of ls; instead it uses the null byte as the separator throughout the pipeline, specified as \0, -z and -0 respectively.
The find command prints each file's modification time together with its name.
Then the files are sorted by time, sed removes the timestamp, and xargs/read hand the filenames to the mv command one at a time.
DIR="/tmp/images"
FILELIST=$(ls -tr ${DIR})
n=1
for file in ${FILELIST}; do
printf -v digit "%09d" $n
mv "$DIR/${file}" "$DIR/${digit}.jpg"
n=$((n + 1))
done
Something like this? Then you can use n to specify the starting sequence number. However, if you have spaces in your file names this would not work.
If using external tool is acceptable, you can use rnm:
rnm -ns '/i/.jpg' -si 100000000 -s/mt *.jpg
-ns: Name string (new name).
/i/: Index (A name string rule).
-si: Option that sets starting index.
-s/mt: Sort according to modification time.
If you want an arbitrary increment value:
rnm -ns '/i/.jpg' -si 100000000 -inc 45 -s/mt *.jpg
-inc: Specify an increment value.
I have multiple (more than 100) text files in the directory such as
files_1_100.txt
files_101_200.txt
The contents of each file are the names of some variables; for example, files_1_100.txt contains some variable names between 1 and 100:
"var.2"
"var.5"
"var.15"
Similarly, files_201_300.txt contains some variable names between 201 and 300:
"var.203"
"var.227"
"var.285"
and files_1001_1100.txt as
"var.1010"
"var.1006"
"var.1025"
I can merge them using the command
cat files_*00.txt > ../all_files.txt
However, the contents of all_files.txt do not follow the numeric order of the parent files. For example, all_files.txt shows
"var.1010"
"var.1006"
"var.1025"
"var.1"
"var.5"
"var.15"
"var.203"
"var.227"
"var.285"
So, how can I ensure that the contents of files_1_100.txt come first, followed by files_201_300.txt and then files_1001_1100.txt, so that all_files.txt contains
"var.1"
"var.5"
"var.15"
"var.203"
"var.227"
"var.285"
"var.1010"
"var.1006"
"var.1025"
Let me try it out, but I think that this will work:
ls file*.txt | sort -n -t _ -k2 -k3 | xargs cat
The idea is to take your list of files and sort them and then pass them to the cat command.
The sort uses several options:
-n - use a numeric sort rather than alphabetic
-t _ - divide the input (the filename) into fields using the underscore character
-k2 -k3 - sort first by the 2nd field and then by the 3rd field (the 2 numbers)
You have said that your files are named file_1_100.txt, file_101_201.txt, etc. If that means (as it seems to indicate) that the first numeric "chunk" is always unique then you can leave off the -k3 flag. That flag is needed only if you will end up, for instance, with file_100_2.txt and file_100_10.txt where you have to look at the 2nd numeric "chunk" to determine the preferred order.
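For instance (a quick sketch of that corner case):
printf '%s\n' file_100_10.txt file_100_2.txt | sort -n -t _ -k2 -k3
prints
file_100_2.txt
file_100_10.txt
which only comes out right because of the -k3 comparison.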
Depending on the number of files you are working with you may find that specifying the glob (file*.txt) may overwhelm the shell and cause errors about the line being too long. If that's the case you could do it like this:
ls | grep '^file.*\.txt$' | sort -n -t _ -k2 -k3 | xargs cat
You can use printf with sort and pipe that to xargs cat:
printf "%s\0" f*txt | sort -z -t_ -nk2 | xargs -0 cat > ../all_files.txt
Note that the whole pipeline works on NUL-terminated filenames, which makes the command work even for filenames containing spaces, newlines, etc.
If your filenames are free from any special characters or white-spaces, then other answers should be easy solutions.
Else, try this rename based approach:
$ ls files_*.txt
files_101_200.txt files_1_100.txt
$ rename 's/files_([0-9]*)_([0-9]*)/files_000$1_000$2/;s/files_0*([0-9]{3})_0*([0-9]{3})/files_$1_$2/' files_*.txt
$ ls files_*.txt
files_001_100.txt files_101_200.txt
$ cat files_*.txt > outputfile.txt
$ rename 's/files_0*([0-9]*)_0*([0-9]*)/files_$1_$2/' files_*.txt
The shell expands the glob in cat file_* alphabetically rather than numerically.
List the files in numeric order and then cat each one, appending the output to some file (sort -V sorts the embedded numbers in natural order):
ls -1 | sort -V | xargs -i cat {} >> file.out
You could try using a for loop and adding the files one by one (ls -v sorts the files correctly even when the numbers are not zero-padded):
for i in $(ls -v files_*.txt)
do
cat $i >> ../all_files.txt
done
or more convenient in a single line:
for i in $(ls -v files_*.txt) ; do cat $i >> ../all_files.txt ; done
You could also do this with Awk by sorting ARGV numerically on the number after the first underscore (a small insertion sort in the BEGIN block, forcing numeric comparison with +0):
awk 'BEGIN {
  for(i=2; i<=ARGC-1; i++) {
    for(j=i; j>1; j--) {
      split(ARGV[j], curr, "_")
      split(ARGV[j-1], last, "_")
      if (curr[2] + 0 < last[2] + 0) {
        tmp=ARGV[j]
        ARGV[j]=ARGV[j-1]
        ARGV[j-1]=tmp
      } else break
    }
  }
}1' files_*00.txt
I have two queries:
I do a grep and get the line number in the input file. I want to retrieve a set of lines before and after that line number from the input file and redirect them to a /tmp/testout file. How can I do it?
I have two line numbers, 10000 and 20000. I want to retrieve the lines between 10000 and 20000 of the input file and redirect them to a /tmp/testout file. How can I do it?
For the first question, grep -C is the straightforward option.
For the 2nd question, try this:
sed -n "10000,20000p" bar.txt > foo.txt
You want to look into the -A -B and -C options of grep. See man grep for more information
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing -- between contiguous groups of
matches.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing -- between contiguous groups of
matches.
-C NUM, --context=NUM
Print NUM lines of output context. Places a line containing --
between contiguous groups of matches.
For redirecting the output, do the following: grep "your pattern" yourinputfile > /tmp/testout
See head and/or tail.
For example:
head -n 20000 <input> | tail -n 10001 > /tmp/testout
where the argument of tail is (20000 - 10000 + 1), so that line 10000 itself is included.
If you're using GNU grep, you can supply -B and -A to get lines before and after the match with grep.
E.g.
grep -B 5 -A 10 SearchString File
will give print each line matching SearchString from File plus 5 lines before and 10 lines after the matching line.
For the other part of your question, you can use head/tail or sed. Please see other answers for details.
For part 2, awk will allow you to print a range of lines thus:
awk 'NR==10000,NR==20000{print}' inputfile.txt > /tmp/testout
This basically gives a range based on the record number NR.
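A slightly shorter equivalent (just a sketch, not required) that also stops reading once the range ends:
awk 'NR >= 10000 && NR <= 20000; NR == 20000 { exit }' inputfile.txt > /tmp/testout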
For part 1, context from grep can be obtained using the --after-context=X and --before-context=X switches. If you're running a grep that doesn't allow that, you can dummy up an awk script based on the part 2 answer above.
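Here is a rough sketch of one such awk substitute, using a rolling buffer rather than the range form. N and PATTERN are placeholders, and unlike grep it may reprint lines when two matches sit within N lines of each other:
awk -v N=3 -v pat="PATTERN" '
{
    if ($0 ~ pat) {
        for (i = NR - N; i < NR; i++)      # leading context from the rolling buffer
            if (i > 0) print buf[i % N]
        print                              # the matching line itself
        after = N                          # arm N lines of trailing context
    } else if (after > 0) {
        print
        after--
    }
    buf[NR % N] = $0                       # remember this line for later leading context
}' inputfile.txt > /tmp/testout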
To see the lines before and after a match (3 lines before and 3 lines after):
grep -C3 foo bar.txt
For the second question:
head -20000 bar.txt | tail -10000 > foo.txt
You can do both with just awk, e.g. display 2 lines before and after "6", and display the range from line number 4 to 8:
$ cat file
1
2
3
4
5
6
7
8
9
10
$ awk 'c--&&c>=0{print "2 numbers below 6: "$0};/6/{c=2;for(i=d;i>d-2;i--)print "2 numbers above 6: "a[i];delete a}{a[++d]=$0} NR>3&&NR<9{print "With range: ->"$0}' file
With range: ->4
With range: ->5
2 numbers above 6: 5
2 numbers above 6: 4
With range: ->6
2 numbers below 6: 7
With range: ->7
2 numbers below 6: 8
With range: ->8
If your grep doesn't have -A, -B and -C, then this sed command may work for you:
sed -n '1bb;:a;/PATTERN/{h;n;p;H;g;bb};N;//p;:b;99,$D;ba' inputfile > outputfile
where PATTERN is the regular expression you're looking for and 99 is one greater than the number of context lines you want (equivalent to -C 98).
It works by keeping a window of lines in memory and when the regex matches, the captured lines are output.
If your sed doesn't like semicolons and prefers -e, this version may work for you:
sed -n -e '1bb' -e ':a' -e '/PATTERN/{h' -e 'n' -e 'p' -e 'H' -e 'g' -e 'bb}' -e 'N' -e '//p' -e ':b' -e '99,$D' -e 'ba' inputfile > outputfile
For your line range output, this will work and will finish a little more quickly if there are a large number of lines after the end of the range:
sed -n '10000,20000p;20000q' inputfile > outputfile
or
sed -n -e '10000,20000p' -e '20000q' inputfile > outputfile
Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).
Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
I am guessing some concoction of split and head will do the trick?
This is robhruska's script cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.
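For instance, a possible --filter sketch that compresses each piece on the fly while still prepending the header (file.txt, the chunk size and the part_ prefix are just placeholders):
tail -n +2 file.txt | split -l 4 --filter='{ head -n 1 file.txt; cat; } | gzip > "$FILE.gz"' - part_
This writes part_aa.gz, part_ab.gz, ... each containing the header plus one chunk.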
This one-liner will split the big csv into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows)
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer.
You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
You can use [mg]awk:
awk 'NR==1{
header=$0;
count=1;
print header > "x_" count;
next
}
!( (NR-1) % 100){
count++;
print header > "x_" count;
}
{
print $0 > "x_" count
}' file
100 is the number of lines of each slice.
It doesn't require temp files and can be put on a single line.
I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done
This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.
Use GNU Parallel:
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
If you need to run a command on each of the parts, then GNU Parallel can help do that, too:
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
If you want to split into 10 MB blocks:
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
Below is a 4-liner that can be used to split bigfile.csv into multiple smaller files while preserving the csv header. It uses only standard commands (head, split, find, grep, xargs, and sed), which should work on most *nix systems. It should also work on Windows if you install mingw-64 / git-bash.
csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find .|grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00
Line by line explanation:
Capture the header to a variable named csvheader
Split the bigfile.csv into a number of smaller files with prefix smallfile_
Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that you need to use sed within "double quotes" in order to use variables.
The first file named smallfile_00 will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with sed -i '1d' command.
This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyways.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat $file >> tmp_file
mv -f tmp_file $file
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyways (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
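Putting the trap and the mktemp suggestion together, a possible sketch (file name and chunk size are the same placeholders as above; cp is used instead of mv so one temp file can be reused across iterations):
tmp_file=$(mktemp) || exit 1
trap 'rm -f split_* "$tmp_file"; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > "$tmp_file"
cat "$file" >> "$tmp_file"
cp "$tmp_file" "$file"
done
rm -f "$tmp_file"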
I liked marco's awk version; adapted from it, here is a simplified one-liner where you can easily specify the split fraction as granularly as you want:
awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
I really liked Rob and Dennis' versions, so much so that I wanted to improve them.
Here's my version:
in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done
Differences:
in_file is the file argument you want to split maintaining headers
Use awk instead of tail due to awk having better performance
split into 100,000 line files instead of 4
Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
Use mktemp to safely handle temporary files
Use single head | cat line instead of two lines
Inspired by @Arkady's comment on a one-liner.
MYFILE variable simply to reduce boilerplate
split doesn't show file name, but the --additional-suffix option allows us to easily control what to expect
removal of intermediate files via rm $part (assumes no files with same suffix)
MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done
Evidence:
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xaafoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xabfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xacfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040110 Jun 1 23:18 mycsv.csv.xadfoo
and of course head -2 *foo to see the header is added.
A simple but maybe not as elegant way: Cut off the header beforehand, split the file, and then rejoin the header on each file with cat, or with whatever file is reading it in.
So something like:
head -n1 file.txt > header.txt
tail -n +2 file.txt | split -l 4 -
cat header.txt xaa    # xaa, xab, ... are the pieces split creates by default
I had a better result using the following code; every split file will have a header and the generated files will have a normalized name.
export F=input.csv && LINES=3 &&\
export PF="${F%.*}_" &&\
split -l $LINES "${F}" "${PF}" &&\
for fn in $PF*
do
mv "${fn}" "${fn}.csv"
done &&\
export FILES=($PF*) && for file in "${FILES[@]:1}"
do
head -n 1 "${F}" > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "${file}"
done
output
$ wc -l input*
22 input.csv
3 input_aa.csv
4 input_ab.csv
4 input_ac.csv
4 input_ad.csv
4 input_ae.csv
4 input_af.csv
4 input_ag.csv
2 input_ah.csv
51 total