Boosting the grep search using GNU parallel

Boosting the grep search using GNU parallel - multithreading

I am using the following grep script to output all the unmatched patterns:
grep -oFf patterns.txt large_strings.txt | grep -vFf - patterns.txt > unmatched_patterns.txt
patterns file contains the following 12-characters long substrings (some instances are shown below):
6b6c665d4f44
8b715a5d5f5f
26364d605243
717c8a919aa2
large_strings file contains extremely long strings of around 20-100 million characters longs (a small piece of the string is shown below):
121b1f212222212123242223252b36434f5655545351504f4e4e5056616d777d80817d7c7b7a7a7b7c7d7f8997a0a2a2a3a5a5a6a6a6a6a6a7a7babbbcbebebdbcbcbdbdbdbdbcbcbcbcc2c2c2c2c2c2c2c2c4c4c4c3c3c3c2c2c3c3c3c3c3c3c3c3c2c2c1c0bfbfbebdbebebebfbfc0c0c0bfbfbfbebebdbdbdbcbbbbbababbbbbcbdbdbdbebebfbfbfbebdbcbbbbbbbbbcbcbcbcbcbcbcbcbcb8b8b8b7b7b6b6b6b8b8b9babbbbbcbcbbbabab9b9bababbbcbcbcbbbbbababab9b8b7b6b6b6b6b7b7b7b7b7b7b7b7b7b7b6b6b5b5b6b6b7b7b7b7b8b8b9b9b9b9b9b8b7b7b6b5b5b5b5b5b4b4b3b3b3b6b5b4b4b5b7b8babdbebfc1c1c0bfbec1c2c2c2c2c1c0bfbfbebebebebfc0c1c0c0c0bfbfbebebebebebebebebebebebebebdbcbbbbbab9babbbbbcbcbdbdbdbcbcbbbbbbbbbbbabab9b7b6b5b4b4b4b4b3b1aeaca9a7a6a9a9a9aaabacaeafafafafafafafafafb1b2b2b2b2b1b0afacaaa8a7a5a19d9995939191929292919292939291908f8e8e8d8c8b8a8a8a8a878787868482807f7d7c7975716d6b6967676665646261615f5f5e5d5b5a595957575554525
How can we speed up the above script (gnu parallel, xargs, fgrep, etc.)? I tried using --pipepart and --block but it doesn't allow you to pipe two grep commands.
Btw these are all hexadecimal strings and patterns.
The working code below is a little faster than the traditional grep:
rg -oFf patterns.txt large_strings.txt | rg -vFf - patterns.txt > unmatched_patterns.txt
grep took an hour to finish the process of pattern matching while it took ripgrep around 45 mins.

If you do not need to use grep try:
build_k_mers() {
k="$1"
slot="$2"
perl -ne 'for $n (0..(length $_)-'"$k"') {
$prefix = substr($_,$n,2);
$fh{$prefix} or open $fh{$prefix}, ">>", "tmp/kmer.$prefix.'"$slot"'";
$fh = $fh{$prefix};
print $fh substr($_,$n,'"$k"'),"\n"
}'
}
export -f build_k_mers
rm -rf tmp
mkdir tmp
export LC_ALL=C
# search strings must be sorted for comm
parsort patterns.txt | awk '{print >>"tmp/patterns."substr($1,1,2)}' &
# make shorter lines: Insert \n(last 12 char before \n) for every 32k
# This makes it easier for --pipepart to find a newline
# It will not change the kmers generated
perl -pe 's/(.{32000})(.{12})/$1$2\n$2/g' large_strings.txt > large_lines.txt
# Build 12-mers
parallel --pipepart --block -1 -a large_lines.txt 'build_k_mers 12 {%}'
# -j10 and 20s may be adjusted depending on hardware
parallel -j10 --delay 20s 'parsort -u tmp/kmer.{}.* > tmp/kmer.{}; rm tmp/kmer.{}.*' ::: `perl -e 'map { printf "%02x ",$_ } 0..255'`
wait
parallel comm -23 {} {=s/patterns./kmer./=} ::: tmp/patterns.??
I have tested this on patterns.txt: 9GBytes/725937231 lines, large_strings.txt: 19GBytes/184 lines and on my 64-core machine it completes in 3 hours.

Related

Need to reduce the execution time

We are trying to execute below script for finding out the occurrence of a particular word in a log file
Need suggestions to optimize the script.
Test.log size - Approx to 500 to 600 MB
$wc -l Test.log
16609852 Test.log
po_numbers - 11 to 12k po's to search
$more po_numbers
xxx1335
AB1085
SSS6205
UY3347
OP9111
....and so on
Current Execution Time - 2.45 hrs
while IFS= read -r po
do
check=$(grep -c "PO_NUMBER=$po" Test.log)
echo $po "-->" $check >>list3
if [ "$check" = "0" ]
then
echo $po >>po_to_server
#else break
fi
done < po_numbers

You are reading your big file too many times when you execute
grep -c "PO_NUMBER=$po" Test.log
You can try to split your big file into smaller ones or write your patterns to a file and make grep use it
echo -e "PO_NUMBER=$po\n" >> patterns.txt
then
grep -f patterns.txt Test.log

$ grep -Fwf <(sed 's/.*/PO_NUMBER=&/' po_numbers) Test.log
create the lookup file from po_numbers (process substitution) check for literal word matches from the log file. This assumes the searched PO_NUMBER=xxx is a separate word, if not remove -w, also assumes there is no regex but just literal matches, if not remove -F, however both will slow down searches.

Using Grep :
sed -e 's|^|PO_NUMBER=|' po_numbers | grep -o -F -f - Test.log | sed -e 's|^PO_NUMBER=||' | sort | uniq -c > list3
grep -o -F -f po_numbers list3 | grep -v -o -F -f - po_numbers > po_to_server
Using awk :
This awk program might work faster
awk '(NR==FNR){ po[$0]=0; next }
{ for(key in po) {
str=$0
po[key]+=gsub("PO_NUMBER="key,"",str)
}
}
END {
for(key in po) {
if (po[key]==0) {print key >> "po_to_server" }
else {print key"-->"po[key] >> "list3" }
}
}' po_numbers Test.log
This does the following :
The first line loads the po keys from the file po_numbers
The second awk parser, will pars the file for occurences of PO_NUMBER=key per line. (gsub is a function which performs a substitutation and returns the substitution count)
In the end we print out the requested output to the requested files.
The assumption here is that is might be possible that multiple patterns could occure multiple times on a single line of Test.log
Comment: the original order of po_numbers will not be satisfied.

"finding out the occurrence"
Not sure if you mean to count the number of occurrences for each searched word or to output the lines in the log that contain at least one of the searched words. This is how you could solve it in the latter case:
(cat po_numbers; echo GO; cat Test.log) | \
perl -nle'$r?/$r/&&print:/GO/?($r=qr/#{[join"|",#s]}/):push#s,$_'

Suppressing summary information in `wc -l` output

I use the command wc -l count number of lines in my text files (also i want to sort everything through a pipe), like this:
wc -l $directory-path/*.txt | sort -rn
The output includes "total" line, which is the sum of lines of all files:
10 total
5 ./directory/1.txt
3 ./directory/2.txt
2 ./directory/3.txt
Is there any way to suppress this summary line? Or even better, to change the way the summary line is worded? For example, instead of "10", the word "lines" and instead of "total" the word "file".

Yet a sed solution!
1. short and quick
As total are comming on last line, $d is the sed command for deleting last line.
wc -l $directory-path/*.txt | sed '$d'
2. with header line addition:
wc -l $directory-path/*.txt | sed '$d;1ilines total'
Unfortunely, there is no alignment.
3. With alignment: formatting left column at 11 char width.
wc -l $directory-path/*.txt |
sed -e '
s/^ *\([0-9]\+\)/ \1/;
s/^ *\([0-9 ]\{11\}\) /\1 /;
/^ *[0-9]\+ total$/d;
1i\ lines filename'
Will do the job
lines file
5 ./directory/1.txt
3 ./directory/2.txt
2 ./directory/3.txt
4. But if really your wc version could put total on 1st line:
This one is for fun, because I don't belive there is a wc version that put total on 1st line, but...
This version drop total line everywhere and add header line at top of output.
wc -l $directory-path/*.txt |
sed -e '
s/^ *\([0-9]\+\)/ \1/;
s/^ *\([0-9 ]\{11\}\) /\1 /;
1{
/^ *[0-9]\+ total$/ba;
bb;
:a;
s/^.*$/ lines file/
};
bc;
:b;
1i\ lines file' -e '
:c;
/^ *[0-9]\+ total$/d
'
This is more complicated because we won't drop 1st line, even if it's total line.

This is actually fairly tricky.
I'm basing this on the GNU coreutils version of the wc command. Note that the total line is normally printed last, not first (see my comment on the question).
wc -l prints one line for each input file, consisting of the number of lines in the file followed by the name of the file. (The file name is omitted if there are no file name arguments; in that case it counts lines in stdin.)
If and only if there's more than one file name argument, it prints a final line containing the total number of lines and the word total. The documentation indicates no way to inhibit that summary line.
Other than the fact that it's preceded by other output, that line is indistinguishable from output for a file whose name happens to be total.
So to reliably filter out the total line, you'd have to read all the output of wc -l, and remove the final line only if the total length of the output is greater than 1. (Even that can fail if you have files with newlines in their names, but you can probably ignore that possibility.)
A more reliable method is to invoke wc -l on each file individually, avoiding the total line:
for file in $directory-path/*.txt ; do wc -l "$file" ; done
And if you want to sort the output (something you mentioned in a comment but not in your question):
for file in $directory-path/*.txt ; do wc -l "$file" ; done | sort -rn
If you happen to know that there are no files named total, a quick-and-dirty method is:
wc -l $directory-path/*.txt | grep -v ' total$'
If you want to run wc -l on all the files and then filter out the total line, here's a bash script that should do the job. Adjust the *.txt as needed.
#!/bin/bash
wc -l *.txt > .wc.out
lines=$(wc -l < .wc.out)
if [[ lines -eq 1 ]] ; then
cat .wc.out
else
(( lines-- ))
head -n $lines .wc.out
fi
rm .wc.out
Another option is this Perl one-liner:
wc -l *.txt | perl -e '#lines = <>; pop #lines if scalar #lines > 1; print #lines'
#lines = <> slurps all the input into an array of strings. pop #lines discards the last line if there are more than one, i.e., if the last line is the total line.

The program wc, always displays the total when they are two or more than two files ( fragment of wc.c):
if (argc > 2)
report ("total", total_ccount, total_wcount, total_lcount);
return 0;
also the easiest is to use wc with only one file and find present - one after the other - the file to wc:
find $dir -name '*.txt' -exec wc -l {} \;
Or as specified by liborm.
dir="."
find $dir -name '*.txt' -exec wc -l {} \; | sort -rn | sed 's/\.txt$//'

This is a job tailor-made for head:
wc -l | head --lines=-1
This way, you can still run in one process.

Can you use another wc ?
The POSIX wc(man -s1p wc) shows
If more than one input file operand is specified, an additional line shall be written, of the same format as the other lines, except that the word total (in the POSIX locale) shall be written instead of a pathname and the total of each column shall be written as appropriate. Such an additional line, if any, is written at the end of the output.
You said the Total line was the first line, the manual states its the last and other wc's don't show it at all. Removing the first or last line is dangerous, so I would grep -v the line with the total (in the POSIX locale...), or just grep the slash that's part of all other lines:
wc -l $directory-path/*.txt | grep "/"

Not the most optimized way since you can use combinations of cat, echo, coreutils, awk, sed, tac, etc., but this will get you want you want:
wc -l ./*.txt | awk 'BEGIN{print "Line\tFile"}1' | sed '$d'
wc -l ./*.txt will extract the line count. awk 'BEGIN{print "Line\tFile"}1' will add the header titles. The 1 corresponds to the first line of the stdin. sed '$d' will print all lines except the last one.
Example Result
Line File
6 ./test1.txt
1 ./test2.txt

You can solve it (and many other problems that appear to need a for loop) quite succinctly using GNU Parallel like this:
parallel wc -l ::: tmp/*txt
Sample Output
3 tmp/lines.txt
5 tmp/unfiltered.txt
42 tmp/file.txt
6 tmp/used.txt

The simplicity of using just grep -c
I rarely use wc -l in my scripts because of these issues. I use grep -c instead. Though it is not as efficient as wc -l, we don't need to worry about other issues like the summary line, white space, or forking extra processes.
For example:
/var/log# grep -c '^' *
alternatives.log:0
alternatives.log.1:3
apache2:0
apport.log:160
apport.log.1:196
apt:0
auth.log:8741
auth.log.1:21534
boot.log:94
btmp:0
btmp.1:0
<snip>
Very straight forward for a single file:
line_count=$(grep -c '^' my_file.txt)
Performance comparison: grep -c vs wc -l
/tmp# ls -l *txt
-rw-r--r-- 1 root root 721009809 Dec 29 22:09 x.txt
-rw-r----- 1 root root 809338646 Dec 29 22:10 xyz.txt
/tmp# time grep -c '^' *txt
x.txt:7558434
xyz.txt:8484396
real 0m12.742s
user 0m1.960s
sys 0m3.480s
/tmp/# time wc -l *txt
7558434 x.txt
8484396 xyz.txt
16042830 total
real 0m9.790s
user 0m0.776s
sys 0m2.576s

Similar to Mark Setchell's answer you can also use xargs with an explicit separator:
ls | xargs -I% wc -l %
Then xargs explicitly doesn't send all the inputs to wc, but one operand line at a time.

Shortest answer:
ls | xargs -l wc

What about using sed with the pattern removal option as below which would only remove the total line if it is present (but also any files with total in them).
wc -l $directory-path/*.txt | sort -rn | sed '/total/d'

Find a maximum and minimum in subdirectory (less than 10 seconds with a shell script)

I have several files of this type:
Sensor Location Temp Threshold
------ -------- ---- ---------
#1 PROCESSOR_ZONE 23C/73F 62C/143F
#2 CPU#1 30C/86F 73C/163F
#3 I/O_ZONE 32C/89F 68C/154F
#4 CPU#2 22C/71F 73C/163F
#5 POWER_SUPPLY_BAY 17C/62F 55C/131F
There is approximately 124630 in several sub-directories
I try to determine the maximal and minimal temperature of PROCESSOR_ZONE
Here is my script at the moment:
#!/bin/bash
max_value=0
min_value=50
find $1 -name hp-temps.txt -exec grep "PROCESSOR_ZONE" {} + | sed -e 's/\ \+/,/g' | cut -d, -f3 | cut -dC -f1 | while read current_value ;
do
echo $current_value;
done
output after my script:
30
28
26
23
...
My script is not finished and it sets 10 minutes to show all the temperatures.
I think that to arrive there, I have to put the result of my command in a file, sort out it and get back the first line which is the max and the last one the min. But I do not know how to make it.

Instead of this bit:
... | while read current_value ;
do
echo $current_value;
done
just direct the output after cut to a file:
... > temperatures.txt
If you need them sorted, sort them first:
... | sort -n > temperatures.txt
Then the first line of the file will be the minimum temperature returned, and the last line will be the maximum temperature.
Performance suggestion:
This find command runs a new grep process on each file. If there are hundreds of thousands of these files in your directory, it will run grep hundreds of thousands of times. You can speed it up by telling find to run a grep command once for each batch of a few thousand files:
find $1 -name hp-temps.txt -print | xargs grep -h "PROCESSOR_ZONE" | sed ...
The find command prints the filenames out on standard output; the xargs command reads these in and runs grep on a batch of files at once. The -h option to grep means "don't include the filename in the output".
Running it this way should speed up your search considerably if there are thousands of files to be processed.

If your script is slow, you might want to analyse first which command is slow. Using find on for example Windows/Cygwin with a lot of files is going to be slow.
Perl is a perfect match for your problem:
find $1 -name hp-temps.txt -exec perl -ne '/PROCESSOR_ZONE\s+(\d+)C/ and print "$1\n"' {} +
In this way you do a (Perl) regular expression match on a lot of files at the same time. The brackets match the temperature digits (\d+) and $1 references that. The and makes sure print only is executed if the match succeeds.
You could even consider using opendir and readdir to recursively decend into directories in Perl to get rid of the find, but it is not going to be faster.
To get the minimum and maximum values:
find $1 -name hp-temps.txt -exec perl -ne 'if (/PROCESSOR_ZONE\s+(\d+)C/){ $min=$1 if $1<$min or $min == undef; $max=$1 if $1>$max }; sub END { print "$min - $max\n" }' {} +
With 100k+ lines of output on your terminal this should save quite a bit of time.

#!/bin/bash
max_value=0
min_value=50
find $1 -name file.txt -exec grep "PROCESSOR_ZONE" {} + | sed -e 's/\ \+/,/g' | cut -d, -f3 | cut -dC -f1 |
{
while read current_value ; do
#For maximum
if [[ $current_value -gt $max_value ]]; then
max_value=$current_value
fi
#For minimum
if [[ $current_value -lt $min_value ]]; then
min_value=$current_value
echo "new min $min_value"
fi
done
echo "NEW MAX : $max_value °C"
echo "NEW MIN : $min_value °C"
}

Finding the longest word in a text file

I am trying to make a a simple script of finding the largest word and its number/length in a text file using bash. I know when I use awk its simple and straight forward but I want to try and use this method...lets say I know if a=wmememememe and if I want to find the length I can use echo {#a} its word I would echo ${a}. But I want to apply it on this below
for i in `cat so.txt` do
Where so.txt contains words, I hope it makes sense.

bash one liner.
sed 's/ /\n/g' YOUR_FILENAME | sort | uniq | awk '{print length, $0}' | sort -nr | head -n 1
read file and split the words (via sed)
remove duplicates (via sort | uniq)
prefix each word with it's length (awk)
sort the list by the word length
print the single word with greatest length.
yes this will be slower than some of the above solutions, but it also doesn't require remembering the semantics of bash for loops.

Normally, you'd want to use a while read loop instead of for i in $(cat), but since you want all the words to be split, in this case it would work out OK.
#!/bin/bash
longest=0
for word in $(<so.txt)
do
len=${#word}
if (( len > longest ))
then
longest=$len
longword=$word
fi
done
printf 'The longest word is %s and its length is %d.\n' "$longword" "$longest"

Another solution:
for item in $(cat "$infile"); do
length[${#item}]=$item # use word length as index
done
maxword=${length[#]: -1} # select last array element
printf "longest word '%s', length %d" ${maxword} ${#maxword}

longest=""
for word in $(cat so.txt); do
if [ ${#word} -gt ${#longest} ]; then
longest=$word
fi
done
echo $longest

awk script:
#!/usr/bin/awk -f
# Initialize two variables
BEGIN {
maxlength=0;
maxword=0
}
# Loop through each word on the line
{
for(i=1;i<=NF;i++)
# Assign the maxlength variable if length of word found is greater. Also, assign
# the word to maxword variable.
if (length($i)>maxlength)
{
maxlength=length($i);
maxword=$i;
}
}
# Print out the maxword and the maxlength
END {
print maxword,maxlength;
}
Textfile:
[jaypal:~/Temp] cat textfile
AWK utility is a data_extraction and reporting tool that uses a data-driven scripting language
consisting of a set of actions to be taken against textual data (either in files or data streams)
for the purpose of producing formatted reports.
The language used by awk extensively uses the string datatype,
associative arrays (that is, arrays indexed by key strings), and regular expressions.
Test:
[jaypal:~/Temp] ./script.awk textfile
data_extraction 15

Relatively speedy bash function using no external utils:
# Usage: longcount < textfile
longcount ()
{
declare -a c;
while read x; do
c[${#x}]="$x";
done;
echo ${#c[#]} "${c[${#c[#]}]}"
}
Example:
longcount < /usr/share/dict/words
Output:
23 electroencephalograph's
'Modified POSIX shell version of jimis' xargs-based
answer; still very slow, takes two or three minutes:
tr "'" '_' < /usr/share/dict/words |
xargs -P$(nproc) -n1 -i sh -c 'set -- {} ; echo ${#1} "$1"' |
sort -n | tail | tr '_' "'"
Note the leading and trailing tr bit to get around GNU xargs
difficulty with single quotes.

for i in $(cat so.txt); do echo ${#i}; done | paste - so.txt | sort -n | tail -1

Slow because of the gazillion of forks, but pure shell, does not require awk or special bash features:
$ cat /usr/share/dict/words | \
xargs -n1 -I '{}' -d '\n' sh -c 'echo `echo -n "{}" | wc -c` "{}"' | \
sort -n | tail
23 Pseudolamellibranchiata
23 pseudolamellibranchiate
23 scientificogeographical
23 thymolsulphonephthalein
23 transubstantiationalist
24 formaldehydesulphoxylate
24 pathologicopsychological
24 scientificophilosophical
24 tetraiodophenolphthalein
24 thyroparathyroidectomize
You can easily parallelize, e.g. to 4 CPUs by providing -P4 to xargs.
EDIT: modified to work with the single quotes that some dictionaries have. Now it requires GNU xargs because of -d argument.
EDIT2: for the fun of it, here is another version that handles all kinds of special characters, but requires the -0 option to xargs. I also added -P4 to compute on 4 cores:
cat /usr/share/dict/words | tr '\n' '\0' | \
xargs -0 -I {} -n1 -P4 sh -c 'echo ${#1} "$1"' wordcount {} | \
sort -n | tail

How to split a file and keep the first line in each of the pieces?

Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).
Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
I am guessing some concoction of split and head will do the trick?

This is robhruska's script cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.

This one-liner will split the big csv into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows)
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer.
See comments for some tips on installing parallel

You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'

You can use [mg]awk:
awk 'NR==1{
header=$0;
count=1;
print header > "x_" count;
next
}
!( (NR-1) % 100){
count++;
print header > "x_" count;
}
{
print $0 > "x_" count
}' file
100 is the number of lines of each slice.
It doesn't require temp files and can be put on a single line.

I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done
This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.

Use GNU Parallel:
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
If you need to run a command on each of the parts, then GNU Parallel can help do that, too:
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
If you want to split into 10 MB blocks:
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

Below is a 4 liner that can be used to split a bigfile.csv into multiple smaller files, and preserve the csv header. Uses only built-in Bash commands (head, split, find, grep, xargs, and sed) which should work on most *nix systems. Should also work on Windows if you install mingw-64 / git-bash.
csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find .|grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00
Line by line explanation:
Capture the header to a variable named csvheader
Split the bigfile.csv into a number of smaller files with prefix smallfile_
Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that you need to use sed within "double quotes" in order to use variables.
The first file named smallfile_00 will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with sed -i '1d' command.

This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyways.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat $file >> tmp_file
mv -f tmp_file $file
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyways (as some have already suggested), so go ahead and remove 'tmp_file" from the rm in the trap line. See the signal man page for more signals to catch.

I liked the awk version of marco, adopted from this a simplified one-liner where you can easily specify the split fraction as granular as you want:
awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file

I really liked Rob and Dennis' versions, so much so that I wanted to improve them.
Here's my version:
in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done
Differences:
in_file is the file argument you want to split maintaining headers
Use awk instead of tail due to awk having better performance
split into 100,000 line files instead of 4
Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
Use mktemp to safely handle temporary files
Use single head | cat line instead of two lines

Inspired by #Arkady's comment on a one-liner.
MYFILE variable simply to reduce boilerplate
split doesn't show file name, but the --additional-suffix option allows us to easily control what to expect
removal of intermediate files via rm $part (assumes no files with same suffix)
MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done
Evidence:
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xaafoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xabfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xacfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040110 Jun 1 23:18 mycsv.csv.xadfoo
and of course head -2 *foo to see the header is added.

A simple but maybe not as elegant way: Cut off the header beforehand, split the file, and then rejoin the header on each file with cat, or with whatever file is reading it in.
So something like:
head -n1 file.txt > header.txt
split -l file.txt
cat header.txt f1.txt

I had a better result using the following code, every split file will have a header and the generated files will have a normalized name.
export F=input.csv && LINES=3 &&\
export PF="${F%.*}_" &&\
split -l $LINES "${F}" "${PF}" &&\
for fn in $PF*
do
mv "${fn}" "${fn}.csv"
done &&\
export FILES=($PF*) && for file in "${FILES[#]:1}"
do
head -n 1 "${F}" > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "${file}"
done
output
$ wc -l input*
22 input.csv
3 input_aa.csv
4 input_ab.csv
4 input_ac.csv
4 input_ad.csv
4 input_ae.csv
4 input_af.csv
4 input_ag.csv
2 input_ah.csv
51 total

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Boosting the grep search using GNU parallel - multithreading

Related

Need to reduce the execution time

Suppressing summary information in `wc -l` output

Find a maximum and minimum in subdirectory (less than 10 seconds with a shell script)

Finding the longest word in a text file

How to split a file and keep the first line in each of the pieces?

Categories

Resources