I use the command wc -l to count the number of lines in my text files (I also want to sort everything through a pipe), like this:
wc -l $directory-path/*.txt | sort -rn
The output includes a "total" line, which is the sum of the lines of all files:
10 total
5 ./directory/1.txt
3 ./directory/2.txt
2 ./directory/3.txt
Is there any way to suppress this summary line? Or even better, to change the way the summary line is worded? For example, instead of "10", the word "lines" and instead of "total" the word "file".
Yet another sed solution!
1. short and quick
As the total comes on the last line, $d is the sed command for deleting the last line.
wc -l $directory-path/*.txt | sed '$d'
2. with header line addition:
wc -l $directory-path/*.txt | sed '$d;1ilines total'
Unfortunately, there is no alignment.
3. With alignment: formatting the left column at an 11-character width.
wc -l $directory-path/*.txt |
sed -e '
s/^ *\([0-9]\+\)/ \1/;
s/^ *\([0-9 ]\{11\}\) /\1 /;
/^ *[0-9]\+ total$/d;
1i\ lines filename'
Will do the job
lines file
5 ./directory/1.txt
3 ./directory/2.txt
2 ./directory/3.txt
4. But what if your wc version really could put the total on the 1st line?
This one is for fun, because I don't believe there is a wc version that puts the total on the 1st line, but...
This version drops the total line wherever it appears and adds a header line at the top of the output.
wc -l $directory-path/*.txt |
sed -e '
s/^ *\([0-9]\+\)/ \1/;
s/^ *\([0-9 ]\{11\}\) /\1 /;
1{
/^ *[0-9]\+ total$/ba;
bb;
:a;
s/^.*$/ lines file/
};
bc;
:b;
1i\ lines file' -e '
:c;
/^ *[0-9]\+ total$/d
'
This is more complicated because we must not drop the 1st line, even if it is the total line.
This is actually fairly tricky.
I'm basing this on the GNU coreutils version of the wc command. Note that the total line is normally printed last, not first (see my comment on the question).
wc -l prints one line for each input file, consisting of the number of lines in the file followed by the name of the file. (The file name is omitted if there are no file name arguments; in that case it counts lines in stdin.)
If and only if there's more than one file name argument, it prints a final line containing the total number of lines and the word total. The documentation indicates no way to inhibit that summary line.
Other than the fact that it's preceded by other output, that line is indistinguishable from output for a file whose name happens to be total.
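This ambiguity is easy to reproduce (a minimal sketch in a scratch directory, GNU wc assumed):

```shell
cd "$(mktemp -d)"                # scratch directory for the demo
printf 'a\nb\n' > total          # a real file literally named "total"
printf 'x\n'   > other.txt
wc -l total other.txt            # first and last lines both end in "total"
wc -l other.txt                  # a single operand: no summary line at all
```

The per-file line for the file named total and wc's own summary line have exactly the same shape.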
So to reliably filter out the total line, you'd have to read all the output of wc -l, and remove the final line only if the total length of the output is greater than 1. (Even that can fail if you have files with newlines in their names, but you can probably ignore that possibility.)
A more reliable method is to invoke wc -l on each file individually, avoiding the total line:
for file in $directory-path/*.txt ; do wc -l "$file" ; done
And if you want to sort the output (something you mentioned in a comment but not in your question):
for file in $directory-path/*.txt ; do wc -l "$file" ; done | sort -rn
If you happen to know that there are no files named total, a quick-and-dirty method is:
wc -l $directory-path/*.txt | grep -v ' total$'
If you want to run wc -l on all the files and then filter out the total line, here's a bash script that should do the job. Adjust the *.txt as needed.
#!/bin/bash
wc -l *.txt > .wc.out
lines=$(wc -l < .wc.out)
if [[ lines -eq 1 ]] ; then
cat .wc.out
else
(( lines-- ))
head -n $lines .wc.out
fi
rm .wc.out
Another option is this Perl one-liner:
wc -l *.txt | perl -e '@lines = <>; pop @lines if scalar @lines > 1; print @lines'
@lines = <> slurps all the input into an array of strings. pop @lines discards the last line if there is more than one, i.e., if the last line is the total line.
The program wc always displays the total when there are two or more files (fragment of wc.c):
if (argc > 2)
report ("total", total_ccount, total_wcount, total_lcount);
return 0;
So the easiest fix is to use wc with only one file at a time, and let find present the files to wc one after the other:
find $dir -name '*.txt' -exec wc -l {} \;
Or as specified by liborm.
dir="."
find $dir -name '*.txt' -exec wc -l {} \; | sort -rn | sed 's/\.txt$//'
This is a job tailor-made for head:
wc -l $directory-path/*.txt | head --lines=-1
This way, you can still run in one process.
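A minimal sketch of the whole pipeline, assuming GNU head (negative --lines counts are a GNU extension):

```shell
cd "$(mktemp -d)"                            # scratch files for the demo
printf '1\n2\n' > a.txt
printf '3\n'    > b.txt
wc -l ./*.txt | head --lines=-1 | sort -rn   # summary line dropped before sorting
```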
Can you use another wc ?
The POSIX wc manual (man -s1p wc) shows:
If more than one input file operand is specified, an additional line shall be written, of the same format as the other lines, except that the word total (in the POSIX locale) shall be written instead of a pathname and the total of each column shall be written as appropriate. Such an additional line, if any, is written at the end of the output.
You said the total line was the first line; the manual states it's the last, and other wc implementations don't show it at all. Removing the first or last line is dangerous, so I would grep -v the line with the total (in the POSIX locale...), or just grep for the slash that's part of all the other lines:
wc -l $directory-path/*.txt | grep "/"
Not the most optimized way, since you can use combinations of cat, echo, coreutils, awk, sed, tac, etc., but this will get you what you want:
wc -l ./*.txt | awk 'BEGIN{print "Line\tFile"}1' | sed '$d'
wc -l ./*.txt will extract the line counts. awk 'BEGIN{print "Line\tFile"}1' will add the header titles; the trailing 1 is an always-true pattern that makes awk print every input line. sed '$d' will print all lines except the last one.
Example Result
Line File
6 ./test1.txt
1 ./test2.txt
You can solve it (and many other problems that appear to need a for loop) quite succinctly using GNU Parallel like this:
parallel wc -l ::: tmp/*txt
Sample Output
3 tmp/lines.txt
5 tmp/unfiltered.txt
42 tmp/file.txt
6 tmp/used.txt
The simplicity of using just grep -c
I rarely use wc -l in my scripts because of these issues. I use grep -c instead. Though it is not as efficient as wc -l, we don't need to worry about other issues like the summary line, white space, or forking extra processes.
For example:
/var/log# grep -c '^' *
alternatives.log:0
alternatives.log.1:3
apache2:0
apport.log:160
apport.log.1:196
apt:0
auth.log:8741
auth.log.1:21534
boot.log:94
btmp:0
btmp.1:0
<snip>
Very straightforward for a single file:
line_count=$(grep -c '^' my_file.txt)
Performance comparison: grep -c vs wc -l
/tmp# ls -l *txt
-rw-r--r-- 1 root root 721009809 Dec 29 22:09 x.txt
-rw-r----- 1 root root 809338646 Dec 29 22:10 xyz.txt
/tmp# time grep -c '^' *txt
x.txt:7558434
xyz.txt:8484396
real 0m12.742s
user 0m1.960s
sys 0m3.480s
/tmp/# time wc -l *txt
7558434 x.txt
8484396 xyz.txt
16042830 total
real 0m9.790s
user 0m0.776s
sys 0m2.576s
Similar to Mark Setchell's answer you can also use xargs with an explicit separator:
ls | xargs -I% wc -l %
This way xargs doesn't send all the inputs to a single wc invocation, but runs wc on one operand line at a time.
Shortest answer:
ls | xargs -l wc
What about using sed with a pattern deletion, as below? It removes the total line if present, but beware that it also removes any line for a file with "total" in its name.
wc -l $directory-path/*.txt | sort -rn | sed '/total/d'
Related
Specify a command / set of commands that displays the number of lines of code in the .c and .h files in the current directory, listing each file in alphabetical order followed by ":" and its number of lines, and finally the total of the lines of code.
An example that might be displayed would be:
main.c: 202
util.c: 124
util.h: 43
TOTAL: 369
After many attempts, my final result of a one-liner command was:
wc -l *.c *.h | awk '{print $2 ": " $1}' | sed "$ s/total/TOTAL/g"
The problem is that I don't know how to sort them alphabetically without moving the TOTAL as well (considering we don't know how many files are in that folder). I'm not sure if the command above is that efficient, so if you have a better one you can include more variations of it.
Perhaps easier would be to sort the input arguments to wc -- something like this:
$ find . -maxdepth 1 '(' -name '*.py' -o -name '*.md' ')' | sort | xargs -d'\n' wc -l | awk '{print $2": "$1}' | sed 's/^total:/TOTAL:/'
./__main__.py: 70
./README.md: 96
./SCREENS.md: 76
./setup.py: 2
./t.py: 10
TOTAL: 254
note that I use xargs -d'\n' to avoid problems with filenames containing spaces (if I were targeting GNU tools I would drop the . from find . and perhaps use -print0 | sort -z | xargs -0 instead)
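Under those GNU assumptions, the null-delimited variant might look like this sketch (the total line still appears here, since wc sees multiple operands):

```shell
cd "$(mktemp -d)"                     # scratch files for the demo
printf 'a\nb\n' > one.md
printf 'c\n'    > 'two words.md'      # a name with a space is handled fine
find . -maxdepth 1 -name '*.md' -print0 | sort -z | xargs -0 wc -l
```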
You could save input into a variable, extract the last line, sort everything except the last line and then output the last line:
printf '%s\n' 'main.c: 202' 'util.c: 124' 'util.h: 43' 'TOTAL: 369' | {
v=$(cat);
total=$(tail -n1 <<<"$v");
head -n-1 <<<"$v" | sort -r;
printf "%s\n" "$total";
}
here's a meme-ier answer which uses python3.8+ (hey, awk is Turing complete, why not use python?)
python3.8 -c 'import sys;s=0;[print(x+":",-s+(s:=s+len(list(open(x)))))for x in sorted(sys.argv[1:])];print("TOTAL:",s)' *.py *.cfg
__main__.py: 70
setup.cfg: 55
setup.py: 2
t.py: 10
TOTAL: 137
expanding it out, it abuses a few things:
len(list(open(x))) - open(x) returns a file, list exhausts the iterator (by lines) and then the length of that is the number of lines
-s+(s:=s+...) -- this is an assignment expression which has the side-effect of accumulating s but having the expression be the difference (...)
sorted(sys.argv[1:]) satisfies the sorting part
this is my script
SourceFile='/root/Document/Source/'
FND=$(find $SourceFile. -regextype posix-regex -iregex "^.*/ABCDEF_555_[0-9]{5}\.txt$")
echo $FND
# I've tried using awk but haven't gotten perfect results
File Name:
ABCDEF_555_12345.txt
ABCDEF_555_54321.txt
ABCDEF_555_11223.txt
BEFORE
File Content from ABCDEF_555_12345.txt:
no|name|address|pos_code
1|rick|ABC|12342
2|rock|ABC|12342
3|Robert|DEF|54321
File Content from ABCDEF_555_54321.txt:
no|id|name|city
1|0101|RIZKI|JKT
2|0102|LALA|SMG
3|0302|ROY|YGY
I want to append a column that shows the file name in every row starting from the 2nd, append a "name_file" column header to the first row, and I want to change the contents of the original files.
AFTER
file: ABCDEF_555_12345.txt
no|name|address|pos_code|name_file
1|rick|ABC|12342|ABCDEF_555_12345.txt
2|rock|ABC|12342|ABCDEF_555_12345.txt
3|Robert|DEF|54321|ABCDEF_555_12345.txt
file: ABCDEF_555_54321.txt
no|id|name|city|name_file
1|0101|RIZKI|JKT|ABCDEF_555_54321.txt
2|0102|LALA|SMG|ABCDEF_555_54321.txt
3|0302|ROY|YGY|ABCDEF_555_54321.txt
please give me light to find a solution :))
Thanks :))
The best solution is to use awk.
If it's the first line (NR == 1), print the line and append |name_file.
For all other lines print the line and append the filename using the FILENAME variable:
awk 'NR == 1 {print $0 "|name_file"; next;}{print $0 "|" FILENAME;}' foo.txt
You can either use it with multiple files:
find . -iname "*.txt" -print0 | xargs -0 awk '
NR == 1 {print $0 "|name_file"; next;}
FNR == 1 {next;} # Skip header of subsequent files
{print $0 "|" FILENAME;}'
My first solution used the paste command.
Paste allows you to concatenate files horizontally (compared to cat which concatenates vertically).
To achieve the following with paste, do:
first, concatenate the first line of your file (head -n1 foo.txt) with the column header (echo "name_file"). The paste command accepts the -d flag to define the separator between columns.
second, extract all lines except the first (tail -n+2 foo.txt) and concatenate them with as many repetitions of "foo.txt" as required (using a for loop that computes the number of lines to fill).
The solution looks like this:
paste -d'|' <(head -n1 foo.txt) <(echo "name_file")
paste -d'|' <(tail -n+2 foo.txt) <(for i in $(seq $(tail -n+2 foo.txt | wc -l)); do echo "foo.txt"; done)
no|name|address|pos_code|name_file
1|rick|ABC|12342|foo.txt
2|rock|ABC|12342|foo.txt
3|Robert|DEF|54321|foo.txt
However, the awk solution should be preferred because it is clearer (only one call, fewer process substitutions, and so on), and faster.
$ wc -l foo.txt
100004 foo.txt
$ time ./awk.sh >/dev/null
./awk.sh > /dev/null 0,03s user 0,01s system 98% cpu 0,041 total
$ time ./paste.sh >/dev/null
./paste.sh > /dev/null 0,38s user 0,33s system 154% cpu 0,459 total
Using find and GNU awk:
My find implementation doesn't have regextype posix-regex and I used posix-extended instead, but since you got the correct results it should be fine.
srcdir='/root/Document/Source/'
find "$srcdir" -regextype posix-regex -iregex ".*/ABCDEF_555_[0-9]{5}\.txt$"\
-exec awk -i inplace -v fname="{}" '
BEGIN{ OFS=FS="|"; sub(/.*\//, "", fname) } # set field separators / extract filename
{ $(NF+1)=NR==1 ? "name_file" : fname; print } # add header field / filename, print line
' {} \;
The pathname found by find is passed to awk in variable fname. In the BEGIN block the filename is extracted from the path.
The files are modified "inplace", make sure you make a backup of your files before running this.
Consider today's date as 24/02/14.
I have a set of files in the directory "/apps_kplus/KplusOpenReport/crs_scripts/rconvysya" with file names such as
INTTRADIVBMM20142402
INTTRADIVBFX20142402
INTTRADIVBFI20142402
INTTRADIVBDE20142402
INTPOSIVBIR20142402
INTPOSIVBIR20142302
INTTRADIVBDE20142302
INTTRADIVBFI20142302
INTTRADIVBFX20142302
INTTRADIVBMM20142302
I want to get the count of files with the current date. For that I am using the script below:
#! /bin/bash
tm=$(date +%y%d%m)
x=$(ls /apps_kplus/KplusOpenReport/crs_scripts/rconvysya/INT*$tm.txt 2>/dev/null | wc -l)
if [ $x -eq 5 ]
then
exit 4
else
exit 3
fi
But I am not getting the desired output. What is wrong?
You're searching for files with a .txt extension, and the files you list have none. If you remove the .txt from the pathname, it works.
With Awk, anything is possible!
tm=$(date +%Y%d%m) # Capital Y is for the four digit year!
ls -a | awk -v tm=$tm '$0 ~ tm' | wc -l
This is the Great and Awful Awk Command. (Awful as in both full of awe and as in Starbucks Coffee).
The -v parameter allows me to set Awk variables. The $0 represents the entire line which is the file name from the ls command. The $0 ~ tm means I'm looking for files that contain the time string you specified.
I could do this too:
ls -a | awk "/$tm\$/"
Which lets the shell interpolate $tm into the Awk program. This is looking only for files that end in your $tm stamp. The /.../ by itself means matching the entire line. It's an even shorter Awk shortcut than I had before. Plus, it makes sure you're only matching the timestamp on the end of the file.
You can try the following awk:
awk -v d="$date" '{split(d,ary,/\//);if($0~"20"ary[3]ary[1]ary[2]) cnt++}END{print cnt}' file
where $date is your shell variable containing the date you wish to search for.
$ cat file
INTTRADIVBMM20142402
INTTRADIVBFX20142402
INTTRADIVBFI20142402
INTTRADIVBDE20142402
INTPOSIVBIR20142402
INTPOSIVBIR20142302
INTTRADIVBDE20142302
INTTRADIVBFI20142302
INTTRADIVBFX20142302
INTTRADIVBMM20142302
$ awk -v d="$date" '{split(d,ary,/\//);if($0~"20"ary[3]ary[1]ary[2]) cnt++}END{print cnt}' file
5
ls /apps_kplus/KplusOpenReport/crs_scripts/rconvysya | grep "`date +%Y%m%d`"| wc -l
This worked for me and it seems like the simplest solution; I hope it fits your needs.
To get exit status 4 when no zero-length (0 KB) files are present, use the code below:
count=$(ls /apps_kplus/KplusOpenReport/crs_scripts/rconvysya | grep "$(date +%Y%m%d)" | xargs -n 1 stat -c%s | grep -c "^0$")
if [ "$count" -eq 0 ]
then
exit 4
else
exit 3
fi
I have a file structure that looks like this
./501.res/1.bin
./503.res/1.bin
./503.res/2.bin
./504.res/1.bin
and I would like to find the file path to the .bin file in each directory which have the highest number as filename. So the output I am looking for would be
./501.res/1.bin
./503.res/2.bin
./504.res/1.bin
The highest number a file can have is 9.
Question
How do I do that in BASH?
I have come as far as find . | grep bin | sort
Globs are guaranteed to be expanded in lexical order.
for dir in ./*/
do
files=($dir/*) # create an array
echo "${files[@]: -1}" # access its last member
done
What about using awk? You can get the FIRST occurrence really simply:
[ghoti@pc ~]$ cat data1
./501.res/1.bin
./503.res/1.bin
./503.res/2.bin
./504.res/1.bin
[ghoti@pc ~]$ awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' data1
./501.res/1.bin
./503.res/1.bin
./504.res/1.bin
[ghoti@pc ~]$
To get the last occurrence you could pipe through a couple of sorts:
[ghoti@pc ~]$ sort -r data1 | awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' | sort
./501.res/1.bin
./503.res/2.bin
./504.res/1.bin
[ghoti@pc ~]$
Given that you're using "find" and "grep", you could probably do this:
find . -name \*.bin -type f -print | sort -r | awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' | sort
How does this work?
The find command has many useful options, including the ability to select your files by glob, select the type of file, etc. Its output you already know, and that becomes the input to sort -r.
First, we sort our input data in reverse (sort -r). This ensures that within any directory, the highest numbered file will show up first. That result gets fed into awk. FS is the field separator, which makes $2 into things like "/501", "/502", etc. Awk scripts have sections in the form of condition {action} which get evaluated for each line of input. If a condition is missing, the action runs on every line. If "1" is the condition and there is no action, it prints the line. So this script is broken out as follows:
a[$2] {next} - If the array a with the subscript $2 (i.e. "/501") exists, just jump to the next line. Otherwise...
{a[$2]=1} - set the array a subscript $2 to 1, so that in future the first condition will evaluate as true, then...
1 - print the line.
The output of this awk script will be the data you want, but in reverse order. The final sort puts things back in the order you'd expect.
Now ... that's a lot of pipes, and sort can be a bit resource hungry when you ask it to deal with millions of lines of input at the same time. This solution will be perfectly sufficient for small numbers of files, but if you're dealing with large quantities of input, let us know, and I can come up with an all-in-one awk solution (that will take longer than 60 seconds to write).
UPDATE
Per Dennis' sage advice, the awk script I included above could be improved by changing it from
BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1
to
BEGIN{FS="."} $2 in a {next} {a[$2]} 1
While this is functionally identical, the advantage is that you simply define array members rather than assigning values to them, which may save memory or cpu depending on your implementation of awk. At any rate, it's cleaner.
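For reference, one possible shape of that all-in-one awk pass, assuming paths of the form ./DIR/N.bin as in the question (a sketch, not the promised tuned version):

```shell
cd "$(mktemp -d)"                                  # recreate the example tree
mkdir 501.res 503.res 504.res
touch 501.res/1.bin 503.res/1.bin 503.res/2.bin 504.res/1.bin
find . -name '*.bin' -type f | awk -F/ '{
    n = $3 + 0                                     # numeric prefix of the filename
    if (!($2 in best) || n > best[$2]) {           # keep the highest per directory
        best[$2] = n; path[$2] = $0
    }
}
END { for (d in path) print path[d] }' | sort
```

This keeps only one candidate path per directory, so the only sort left is the final cosmetic one over a handful of lines.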
Tested:
find . -type d -name '*.res' | while read dir; do
find "$dir" -maxdepth 1 | sort -n | tail -n 1
done
I came up with something like this:
for dir in $(find . -mindepth 1 -type d | sort); do
file=$(ls "$dir" | sort | tail -n 1);
[ -n "$file" ] && (echo "$dir/$file");
done
Maybe it can be simpler.
If invoking a shell from within find is an option try this
find * -type d -exec sh -c "echo -n './'; ls -1 {}/*.bin | sort -n -r | head -n 1" \;
And here is one liner
find . -mindepth 1 -type d | sort | sed -e "s/.*/ls & | sort | tail -n 1 | xargs -I{} echo &\/{}/" | bash
Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).
Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
I am guessing some concoction of split and head will do the trick?
This is robhruska's script cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.
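A minimal sketch of that variable-directory idea (GNU split assumed; an inline filter command avoids the exported-function requirement, and part_ is just a hypothetical prefix):

```shell
cd "$(mktemp -d)"                                  # demo input: header + 8 data rows
printf 'h\n1\n2\n3\n4\n5\n6\n7\n8\n' > file.txt
tail -n +2 file.txt | split --lines=4 \
    --filter='mkdir -p "$FILE"; { head -n 1 file.txt; cat; } > "$FILE/data.dat"' \
    - part_
head -n 1 part_aa/data.dat                         # each chunk directory starts with the header
```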
This one-liner will split the big csv into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows)
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer.
See comments for some tips on installing parallel
You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
You can use [mg]awk:
awk 'NR==1{
header=$0;
count=1;
print header > "x_" count;
next
}
!( (NR-1) % 100){
count++;
print header > "x_" count;
}
{
print $0 > "x_" count
}' file
100 is the number of lines of each slice.
It doesn't require temp files and can be put on a single line.
I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done
This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.
Use GNU Parallel:
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
If you need to run a command on each of the parts, then GNU Parallel can help do that, too:
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
If you want to split into 10 MB blocks:
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
Below is a 4-liner that can be used to split a bigfile.csv into multiple smaller files while preserving the csv header. It uses only standard commands (head, split, find, grep, xargs, and sed) which should work on most *nix systems. It should also work on Windows if you install mingw-64 / git-bash.
csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find .|grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00
Line by line explanation:
Capture the header to a variable named csvheader
Split the bigfile.csv into a number of smaller files with prefix smallfile_
Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that the sed expression needs "double quotes" so the shell expands the variable.
The first file named smallfile_00 will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with the sed -i '1d' command.
This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyways.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat $file >> tmp_file
mv -f tmp_file $file
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
I liked marco's awk version and adapted from it this simplified one-liner where you can easily specify the split fraction as granularly as you want:
awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
I really liked Rob and Dennis' versions, so much so that I wanted to improve them.
Here's my version:
in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done
Differences:
in_file is the file argument you want to split maintaining headers
Use awk instead of tail due to awk having better performance
split into 100,000 line files instead of 4
Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
Use mktemp to safely handle temporary files
Use single head | cat line instead of two lines
Inspired by @Arkady's comment on a one-liner.
MYFILE variable simply to reduce boilerplate
split doesn't show file name, but the --additional-suffix option allows us to easily control what to expect
removal of intermediate files via rm $part (assumes no files with same suffix)
MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done
Evidence:
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xaafoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xabfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xacfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040110 Jun 1 23:18 mycsv.csv.xadfoo
and of course head -2 *foo to see the header is added.
A simple but maybe not as elegant way: cut off the header beforehand, split the file, and then rejoin the header to each piece with cat, or with whatever tool is reading it in.
So something like:
head -n1 file.txt > header.txt
tail -n +2 file.txt | split -l 4 -
cat header.txt xaa
I had a better result using the following code: every split file will have a header, and the generated files will have a normalized name.
export F=input.csv && LINES=3 &&\
export PF="${F%.*}_" &&\
split -l $LINES "${F}" "${PF}" &&\
for fn in $PF*
do
mv "${fn}" "${fn}.csv"
done &&\
export FILES=($PF*) && for file in "${FILES[@]:1}"
do
head -n 1 "${F}" > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "${file}"
done
output
$ wc -l input*
22 input.csv
3 input_aa.csv
4 input_ab.csv
4 input_ac.csv
4 input_ad.csv
4 input_ae.csv
4 input_af.csv
4 input_ag.csv
2 input_ah.csv
51 total