split a large file into multiple files but without a case statement - linux

I'm new to bash scripting and I'm trying to make a script which splits a large file into multiple files.
I succeeded with a case statement, but how can I do it without one? For example, if I have a file with 30 million lines (some database file).
Thank you in advance!
echo Which file do you want to split
read pathOfFile
echo
countLines=`wc -l < $pathOfFile`
echo The file has $countLines lines
echo
echo In how many files do you want to split?
echo -e "a = 2 files\nb = 3 files\nc = 4 files\nd = 5 files\ne = 10 files\nf = 25 files"
read numberOfFiles
echo
echo The file names should start with:
read nameForFiles
echo
#Split the file
case $numberOfFiles in
    a) split -l $(($countLines / 2)) $pathOfFile $nameForFiles;;
    b) split -l $(($countLines / 3)) $pathOfFile $nameForFiles;;
    c) split -l $(($countLines / 4)) $pathOfFile $nameForFiles;;
    d) split -l $(($countLines / 5)) $pathOfFile $nameForFiles;;
    e) split -l $(($countLines / 10)) $pathOfFile $nameForFiles;;
    f) split -l $(($countLines / 25)) $pathOfFile $nameForFiles;;
    *) echo Invalid choice.
esac

You can just use an array to store values then convert your character to an integer to use as an index:
# ...
z=('2' '3' '4' '5' '10' '25')
x=$(( $(printf '%d' "'$numberOfFiles") - 97 ))
if [[ $x -lt "${#z[@]}" ]] && [[ $x -ge '0' ]] ; then
    split -l $(($countLines / ${z[x]})) $pathOfFile $nameForFiles
else
    echo "Invalid choice"
fi
As you can see, converting the character to its ASCII code and then subtracting 97 ensures the index lines up with the range of array z.
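For example, a quick way to see the conversion at work:
printf '%d\n' "'a" "'c" "'f"    # prints 97, 99 and 102
So 'a' maps to index 0 (split into 2 files), 'c' to index 2 (4 files) and 'f' to index 5 (25 files).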

I found another way to solve this one, see below:
echo Which file do you want to split
read pathOfFile
echo
countLines=`wc -l < $pathOfFile`
echo The file has $countLines lines
echo
echo In how many files do you want to split?
read numberOfFiles
echo
echo The file names should start with:
read nameForFiles
echo
#Split the file
if [[ -n ${numberOfFiles//[0-9]/} ]];
then
    echo You typed something other than a number. - Bye
    exit 1
else
    split -l $(($countLines / $numberOfFiles)) -a 3 -d $pathOfFile $nameForFiles
fi
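If your split comes from GNU coreutils, there is also an option that avoids the line-count arithmetic entirely; this is only a sketch of an alternative, not part of the script above:
# Sketch: let split do the division itself (GNU split only; l/N means N chunks without splitting lines)
split -n l/"$numberOfFiles" -a 3 -d "$pathOfFile" "$nameForFiles"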

Related

Bash script: max,min,sum - many sources as parameter

Is it possible to write a script that reads a file containing numbers (one per line) and writes their maximum, minimum and sum? If the file is empty, it should print an appropriate message. The name of the file is to be given as the parameter of the script. I managed to create the script below, but there are 2 errors:
./4.3: line 20: syntax error near unexpected token `done'
./4.3: line 20: `done echo "Max: $max" '
Is it possible to add multiple files as parameters?
lines=`cat "$1" | wc -l`
if [ $lines -eq 0 ];
then echo "File $1 is empty!"
exit fi min=`cat "$1" | head -n 1`
max=$min sum=0
while [ $lines -gt 0 ];
do num=`cat "$1" |
tail -n $lines`
if [ $num -gt $max ];
then max=$num
elif [ $num -lt $min ];
then min=$num fiS
sum=$[ $sum + $num] lines=$[ $lines - 1 ]
done echo "Max: $max"
echo "Min: number $min"
echo "Sum: $sum"
Pretty compelling use of GNU datamash here:
read sum min max < <( datamash sum 1 min 1 max 1 < "$1" )
[[ -z $sum ]] && echo "file is empty"
echo "sum=$sum; min=$min; max=$max"
Or, sort and awk:
sort -n "$1" | awk '
NR == 1 { min = $1 }
{ sum += $1 }
END {
if (NR == 0) {
print "file is empty"
} else {
print "min=" min
print "max=" $1
print "sum=" sum
}
}
'
Here's how I'd fix your original attempt, preserving as much of the intent as possible:
#!/usr/bin/env bash
lines=$(wc -l < "$1")
if [ "$lines" -eq 0 ]; then
echo "File $1 is empty!"
exit
fi
min=$(head -n 1 "$1")
max=$min
sum=0
while [ "$lines" -gt 0 ]; do
num=$(tail -n "$lines" "$1")
if [ "$num" -gt "$max" ]; then
max=$num
elif [ "$num" -lt "$min" ]; then
min=$num
fi
sum=$(( sum + num ))
lines=$(( lines - 1 ))
done
echo "Max: $max"
echo "Min: number $min"
echo "Sum: $sum"
The dealbreakers were missing linebreaks (can't use exit fi on a single line without ;); other changes are good practice (quoting expansions, useless use of cat), but wouldn't have prevented your script from working; and others are cosmetic (indentation, no backticks).
The overall approach is a massive antipattern, though: you read the whole file for each line being processed.
Here's how I would do it instead:
#!/usr/bin/env bash
for fname in "$#"; do
[[ -s $fname ]] || { echo "file $fname is empty" >&2; continue; }
IFS= read -r min < "$fname"
max=$min
sum=0
while IFS= read -r num; do
(( sum += num ))
(( max = num > max ? num : max ))
(( min = num < min ? num : min ))
done < "$fname"
printf '%s\n' "$fname:" " min: $min" " max: $max" " sum: $sum"
done
This uses the proper way to loop over an input file and utilizes the ternary operator in the arithmetic context.
The outermost for loop loops over all arguments.
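Assuming the script is saved as minmaxsum (a made-up name) and numbers.txt contains the lines 3, 1 and 7, a run would look like:
./minmaxsum numbers.txt
numbers.txt:
 min: 1
 max: 7
 sum: 11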
You can do the whole thing in one while loop inside a shell script. Here's the bash version:
s=0
while read x; do
    if [ ! $mi ]; then
        mi=$x
    elif [ $mi -gt $x ]; then
        mi=$x
    fi
    if [ ! $ma ]; then
        ma=$x
    elif [ $ma -lt $x ]; then
        ma=$x
    fi
    s=$((s+x))
done
if [ ! $ma ]; then
    echo "File is empty."
else
    echo "s=$s, mi=$mi, ma=$ma"
fi
Save that script into a file, and then you can use pipes to send as many input files into it as you wish, like so (assuming the script is called "mysum"):
cat file1 file2 file3 | mysum
or for a single file
mysum < file1
(Make sure the script is executable and on the $PATH, otherwise use "./mysum" for the script in the current directory, or indeed "bash mysum" if it isn't executable.)
The script assumes that the numbers are one per line and that there's nothing else on the line. It gives a message if the input is empty.
How does it work? The "read x" will take input from stdin line-by-line. If the file is empty, the while loop will never be run, and thus variables mi and ma won't be set. So we use this at the end to trigger the appropriate message. Otherwise the loop checks first if the mi and ma variables exist. If they don't, they are initialised with the first x. Otherwise it is checked if the next x requires updating the mi and ma found thus far.
Note that this trick ensures that you can feed-in any sequence of numbers. Otherwise you have to initialise mi with something that's definitely too large and ma with something that's definitely too small - which works until you encounter a strange number list.
Note further, that this works for integers only. If you need to work with floats, then you need to use some other tool than the shell, e.g. awk.
Just for fun, here's the awk version, a one-liner, use as-is or in a script, and it will work with floats, too:
cat file1 file2 file3 | awk 'BEGIN{s=0}; {s+=$1; if(length(mi)==0)mi=$1; if(length(ma)==0)ma=$1; if(mi>$1)mi=$1; if(ma<$1)ma=$1} END{print s, mi, ma}'
or for one file:
awk 'BEGIN{s=0}; {s+=$1; if(length(mi)==0)mi=$1; if(length(ma)==0)ma=$1; if(mi>$1)mi=$1; if(ma<$1)ma=$1} END{print s, mi, ma}' < file1
Downside: it doesn't give a decent error message for an empty file.
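If that message matters, the awk END block can check NR; here is one hedged variation of the same one-liner:
awk '{s+=$1; if(NR==1){mi=$1; ma=$1} if($1<mi)mi=$1; if($1>ma)ma=$1} END{if(NR==0) print "File is empty."; else print s, mi, ma}' < file1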
a script that reads the file containing numbers (one per line) and writes their maximum, minimum and sum
Bash solution using sort:
<file sort -n | {
    read -r sum
    echo "Min is $sum"
    max=$sum
    while read -r num; do
        max=$num
        sum=$((sum+num))
    done
    echo "Max is $max"
    echo "Sum is $sum"
}
Let's speed it up with some smart parsing using tee and tr, doing the arithmetic with bc, if we don't mind using stderr for part of the output. (We could also set up a small fifo to synchronize the tee output.) Anyway:
{
    <file sort -n |
        tee >(echo "Min is $(head -n1)" >&2) >(echo "Max is $(tail -n1)" >&2) |
        tr '\n' '+';
    echo 0;
} | bc | sed 's/^/Sum is /'
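To see why the trailing echo 0 is there: tr turns the sorted numbers into one long addition for bc, and the 0 closes the dangling +. With a hypothetical three-number input:
{ printf '1\n2\n3\n' | tr '\n' '+'; echo 0; } | bc
# the pipeline feeds "1+2+3+0" to bc, which prints 6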
And there is always datamash. The following will output 3 numbers: sum, min and max:
<file datamash sum 1 min 1 max 1
You can try a shell loop and dc:
while [ $# -gt 0 ] ; do
dc -f - -e '
['"$1"' is empty]sa
[la p q ]sZ
z 0 =Z
# if file is empty
dd sb sc
# populate max and min with the first value
[d sb]sY
[d lb <Y ]sM
# if max keep it
[d sc]sX
[d lc >X ]sN
# if min keep it
[lM x lN x ld + sd z 0 <B]sB
lB x
# on each line look for max, min and keep the sum
[max for '"$1"' = ] n lb p
[min for '"$1"' = ] n lc p
[sum for '"$1"' = ] n ld p
# print summary at end of each file
' <"$1"
shift
done

Add a special value to a variable

I want to add a specific value to the value of a variable. This is my script:
x 55;
y 106;
Now I want to change the value of x from 55 to 60.
Generally, how can we apply a math expression to the values of variables in a script?
Others might come up with something simpler (ex: sed, awk, ...), but this quick and dirty script works. It assumes your input file is exactly like you posted:
this is my script.
x 55;
y 106;
And the code:
#!/bin/bash
#
if [ $# -ne 1 ]
then
    echo "ERROR: usage $0 <file>"
    exit 1
else
    inputfile=$1
    if [ ! -f $inputfile ]
    then
        echo "ERROR: could not find $inputfile"
        exit 1
    fi
fi
tempfile="/tmp/tempfile.$$"
>$tempfile
while read line
do
    firstelement=$(echo $line | awk '{print $1}')
    if [ "$firstelement" == 'x' ]
    then
        secondelement=$(echo $line | awk '{print $2}' | cut -d';' -f1)
        (( secondelement = secondelement + 5 ))
        echo "$firstelement $secondelement;" >>$tempfile
    else
        echo "$line" >>$tempfile
    fi
done <$inputfile
mv $tempfile $inputfile
So it reads the input file line by line. If the line starts with variable x, it takes the number that follows, adds 5 to it and writes the result to a temp file. If the line does not start with x, it writes the line, unchanged, to the temp file. Lastly, the temp file overwrites the input file.
Copy this code in a file, make it executable and run it with the input file as an argument.
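Along the lines of the "something simpler (sed, awk)" caveat above, here is a hedged awk sketch; it assumes exactly the layout shown (value in the second field, trailing semicolon) and uses made-up file names:
awk '$1 == "x" { sub(/;$/, "", $2); $2 = ($2 + 5) ";" } { print }' input.txt > input.new && mv input.new input.txt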

bash returning erroneous results after about 36 million lines when iterating through a pair of files - is this a memory error?

I have written a simple script in bash to iterate through a pair of text files to make sure they are properly formatted.
The required format is as follows:
Each file contains millions of ‘records’.
Each record takes up two lines in each file – a header line and a sequence line.
Each header line consists of a ">" symbol, followed by a sample name (an alphanumeric string), followed by a period, followed by a unique record identifier number (an integer), followed by a suffix of either '/1' or '/2'.
Each sequence line contains a string of 30-100 A,C,G and T characters (the four DNA nucleotides, if anyone is wondering).
The files are paired, in that the first record in one file corresponds to the first record in the second file, and so forth. The header lines in the two files should be identical, except that in one file they will all have a '/1' suffix and in the other file they will all have a '/2' suffix. The sequence lines can be very different between the two files.
The code I developed is designed to check that (a) the header lines in each record follow the correct format, (b) the header lines in the corresponding records in the two files match (i.e. are identical except for the /1 and /2 suffixes) and (c) the sequence lines contain only A, C, G and T characters.
Example of properly formatted records:
> cat -n file1 | head -4
1 >SRR573705.1/1
2 ATAATCATTTGCCTCTTAAGTGGGGGCTGGTATGAATGGCAAGACGGGAATCTAGCTGTCTCTCCCTTATATCTTGAAGTTAATATTTCTGTGAAGAAGC
3 >SRR573705.2/1
4 CCACTTGTCCCAGTCTGTGCTGCCTGTACAATGGATTAGCTGAGGAAAACTGGCATCCCATGGCCTCAAACAGACGCAGCAAGTCCATGAAGCCATAATT
> cat -n file2 | head -4
1 >SRR573705.1/2
2 TTTCTAACAATTGAATTAGCAACACAAACACTATTGACAAAGCTATATCTTATTTCTACTAAAGCTCGATAGGGTCTTCTCGTCCTGCGATCCCATTCCT
3 >SRR573705.2/2
4 GTATGATGGGTGTGTCAAGGAGCTCAACCATCGTGATAGGCTACCTCATGCATCGAGACAAGATCACATTTAATGAGGCATTTGACATGGTCAGGAAGCA
My code is below. It works perfectly well for small test files containing only a couple of hundred records. When reading a real data file with millions of records, however, it returns nonsensical errors, for example:
Inaccurate header line in read 18214236 of file2
Line 36428471: TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
The error above is simply wrong. Line 36,428,471 of file2 is ‘>SRR573705.19887618/2’
The string reported in the error is not even present in file 2. It does, however, appear multiple times in file1, i.e.:
cat -n /file1 | grep 'TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC'
4632838 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
24639990 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
36428472 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
143478526 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
The data in the two files seems to match perfectly in the region where the error was returned:
cat -n file1 | head -36428474 | tail
36428465 >SRR573705.19887614/1
36428466 CACCCCAGCATGTTGACCACCCATGCCATTATTTCATGGTATTTTCTTACATTTTGTATATAACAGATGCATTACGTATTATAGCATTGCTTTTCGTAAA
36428467 >SRR573705.19887616/1
36428468 AGATCCTCCTCCTCATCGGTCAGTCGCCAATCCAACAACTCAACCTTCTTCTTCAAGTCACTCAGCCGTCGGCCCGGGACTGCCGTTTCATGATGCCTAT
36428469 >SRR573705.19887617/1
36428470 CAATAGCGTATATTAAAATTGCTGCAGTTAAAAAGCTCGTAGTTGGATCTTGGGCGCAGGCTGGCGGTCCGCCGCAAGGCGCGCCACTGCCAGCCTGGCC
36428471 >SRR573705.19887618/1
36428472 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
36428473 >SRR573705.19887619/1
36428474 CCAGCCTGCGCCCAAGATCCAACTACGAGCTTTTTAACTGCAGCAATTTTAATATACGCTATTGGAGCTGGAATTACCGCGGCTGCTGGCACCAGACTTG
>cat -n file2 | head -36428474 | tail
36428465 >SRR573705.19887614/2
36428466 GTAATTTACAGGAATTGTTTACATTCTGAGCAAATAAAACAAATAATTTTAATACACAAACTTGTTGAAAGTTAATTAGGTTTTACGAAAA
36428467 >SRR573705.19887616/2
36428468 GCCGTCGCAGCAACATTTGAGATATCCCGTAAGACGTCTTGAACGGCTGGCTCTGTCTGCTCTCGGAGAACCTGCCGGCTGAACCGGACAGCGCAGACG
36428469 >SRR573705.19887617/2
36428470 CTCGAGTTCCGAAAACCAACGCAATAGAACCGAGGTCCTATTCCATTATTCCATGCTCTGCTGTCCAGGCGGTCGGCCTG
36428471 >SRR573705.19887618/2
36428472 GGACATGGAAACAGAAAATAATGAAAAGACCAAAGAAGATGCACTTGAGGTTGATAAGCCTAAAGG
36428473 >SRR573705.19887619/2
36428474 CCCGACACGGGGAGGTAGTGACGAAAAATAGCAATACAGGACTCTTTCGAGGCCCTGTAATTGGAATGAGTACACTTTAAATCCTTTAACGAGGATCTAT
Is there some sort of memory limit in bash that could cause such an error? I have run various versions of this code over multiple files and consistently get this problem after 36,000,000 lines.
My code:
set -u

function fastaConsistencyChecker {
    F_READS=$1
    R_READS=$2
    echo -e $F_READS
    echo -e $R_READS
    if [[ ! -s $F_READS ]]; then echo -e "File $F_READS could not be found."; exit 0; fi
    if [[ ! -s $R_READS ]]; then echo -e "File $R_READS could not be found."; exit 0; fi
    exec 3<$F_READS
    exec 4<$R_READS
    line_iterator=1
    read_iterator=1
    while read FORWARD_LINE <&3 && read REVERSE_LINE <&4; do
        if [[ $(( $line_iterator % 2 )) == 1 ]]; then
            ## This is a header line ##
            if [[ ! ( $FORWARD_LINE =~ ^">"[[:alnum:]]+\.[0-9]+/1$ ) ]]; then
                echo -e "Inaccurate header line in read ${read_iterator} of file ${F_READS}"
                echo -e "Line ${line_iterator}: ${FORWARD_LINE}"
                exit 0
            fi
            if [[ ! ( $REVERSE_LINE =~ ^">"[[:alnum:]]+\.[0-9]+/2$ ) ]]; then
                echo -e "Inaccurate header line in read ${read_iterator} of file ${R_READS}"
                echo -e "Line ${line_iterator}: ${REVERSE_LINE}"
                exit 0
            fi
            F_Name=${FORWARD_LINE:1:${#FORWARD_LINE}-3}
            R_Name=${REVERSE_LINE:1:${#REVERSE_LINE}-3}
            if [[ $F_Name != $R_Name ]]; then
                echo -e "Record names do not match. "
                echo -e "Line ${line_iterator}: ${FORWARD_LINE}"
                echo -e "Line ${line_iterator}: ${REVERSE_LINE}"
                exit 0
            fi
            line_iterator=$(( $line_iterator + 1 ))
        else
            if [[ ! ( $FORWARD_LINE =~ ^[ATCGNatcgn]+$ ) ]]; then
                echo -e "Ambiguous sequence detected for read ${read_iterator} at line ${line_iterator} in file ${F_READS}"
                exit 0
            fi
            read_iterator=$(( $read_iterator + 1 ))
            line_iterator=$(( $line_iterator + 1 ))
        fi
        unset FORWARD_LINE
        unset REVERSE_LINE
    done
    echo -e "$line_iterator lines and $read_iterator reads"
    echo -e "No errors detected."
    echo -e ""
}
export -f fastaConsistencyChecker
FILE3="filepath/file1"
FILE4="filepath/file2"
fastaConsistencyChecker $FILE3 $FILE4
I think you've proven there's an issue related to memory usage in bash. I think you can accomplish your format verification without running afoul of the memory issue by calling text processing tools from bash.
#!/bin/bash
if ! [[ $1 && $2 && -s $1 && -s $2 ]]; then
    echo "usage: $0 <file1> <file2>"
    exit 1
fi

set -e
dir=`mktemp -d`
clean () { rm -fr $dir; }
trap clean EXIT

pairs () { sed 'N;s/\n/\t/' "$@"; }

pairs $1 > $dir/$1
pairs $2 > $dir/$2

paste $dir/$1 $dir/$2 | grep -vP '^>(\w+\.\d+)/1\t[ACGT]+\t>\1/2\t[ACGT]+$' && exit 1
exit 0
The sed script takes a line and concatenates it with the next, separated by a tab. This:
>SRR573705.1/1
ATAATCATTTGCCTCTT...
becomes this:
>SRR573705.1/1 ATAATCATTTGCCTCTT...
The paste takes the first line of file 1 and the first line of file 2 and outputs them as one line separated by a tab. It does the same for the second line, and so forth. grep sees input like this:
>SRR573705.1/1    ATAATCATTTGCCTCT....    >SRR573705.1/2    TTTCTAACAATTGAAT...
The regular expression captures the first identifier and matches the same identifier later in the line with the backreference \1.
The script outputs any lines failing to match the regex due to the -v switch to grep. If lines are output, the script exits with status 1.
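Assuming the script is saved as check_pairs.sh (a made-up name), a run would look like this, with the exit status telling you whether everything matched:
./check_pairs.sh file1 file2
echo "exit status: $?"    # 0 = every record pair matched, 1 = bad usage or offending pairs printed above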

Bash expr error

I have a few text files with numbers (structure as below). I'd like to sum up every line from one file with every line from the other files (line1 from file1 + line1 from file2, etc.). I have written the bash script below, but it gives me an expr error.
function countHourly () {
    for i in {1..24}
    do
        for file in $PLACE/*.dailycount.txt
        do
            SECBUFF=`head -n $i $file`
            VAL=`expr $VAL + $SECBUFF` ## <-- this causes the expr error
        done
        echo line $i from all files counts: $VAL
    done
}
file structure *.dailycount.txt:
1
0
14
56
45
0
3
45
23
23
9
(every number on a new line).
Assuming your files each contain exactly 24 lines, you could solve this problem with a simple one-liner:
counthourly() {
    paste -d+ $PLACE/*.dailycount.txt | bc
}
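To see what the one-liner does, take two hypothetical files a.dailycount.txt (lines 1, 2, 3) and b.dailycount.txt (lines 10, 20, 30):
paste -d+ a.dailycount.txt b.dailycount.txt         # 1+10, 2+20, 3+30 (one sum expression per line)
paste -d+ a.dailycount.txt b.dailycount.txt | bc    # 11, 22, 33
Each output line is the sum of the corresponding lines across all the files.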
The head -n NUMBER FILE command outputs the first NUMBER lines. This means that SECBUFF ends up being 1 0 on the second run of the loop, and something like expr 1 + 2 3 is not a valid expression so you get an error from expr.
You can use sed to pick only the nth line from a file, but I wonder if you shouldn't restructure the program somehow.
SECBUFF=`sed -ne ${i}p $file`
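Folding that into the original function would look roughly like this; it is only a sketch, it swaps expr for bash arithmetic, and it resets VAL for every line (the original never reset it):
function countHourly () {
    for i in {1..24}
    do
        VAL=0
        for file in "$PLACE"/*.dailycount.txt
        do
            SECBUFF=$(sed -n "${i}p" "$file")
            VAL=$(( VAL + SECBUFF ))
        done
        echo "line $i from all files counts: $VAL"
    done
}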
This could help. With this variation you can check every input line so that only numbers are added to the sum, even if there are invalid lines.
function countHourly {
    local NUMBERS TOTAL=0 I
    readarray -t NUMBERS < <(cat "$PLACE"/*.dailycount.txt)
    for I in "${NUMBERS[@]}"; do
        [[ $I =~ ^[[:digit:]]+$ ]] && (( TOTAL += I ))
    done
    echo "Total: $TOTAL"
}
Or
function countHourly {
    local NUMBERS TOTAL=0 I
    while read I; do
        [[ $I =~ ^[[:digit:]]+$ ]] && (( TOTAL += I ))
    done < <(cat "$PLACE"/*.dailycount.txt)
    echo "Total: $TOTAL"
}

Awk: loop & save different lines to different files?

I'm looping over a series of large files with a shell script:
i=0
while read line
do
# get first char of line
first=`echo "$line" | head -c 1`
# make output filename
name="$first"
if [ "$first" = "," ]; then
name='comma'
fi
if [ "$first" = "." ]; then
name='period'
fi
# save line to new file
echo "$line" >> "$2/$name.txt"
# show live counter and inc
echo -en "\rLines:\t$i"
((i++))
done <$file
The first character in each line will either be alphanumeric, or one of the above defined characters (which is why I'm renaming them for use in the output file name).
It's way too slow.
5,000 lines takes 128 seconds.
At this rate I've got a solid month of processing.
Will awk be faster here?
If so, how do I fit the logic into awk?
This can certainly be done more efficiently in bash.
To give you an example: echo foo | head does a fork() call, creates a subshell, sets up a pipeline, starts the external head program... and there's no reason for it at all.
If you want the first character of a line, without any inefficient mucking with subprocesses, it's as simple as this:
c=${line:0:1}
I would also seriously consider sorting your input, so you only need to re-open the output file when a new first character is seen, rather than every time through the loop.
That is -- preprocess with sort (as by replacing <$file with < <(sort "$file")) and do the following each time through the loop, reopening the output file only conditionally:
if [[ $name != "$current_name" ]] ; then
    current_name="$name"
    exec 4>>"$2/$name" # open the output file on FD 4
fi
...and then append to the open file descriptor:
printf '%s\n' "$line" >&4
(not using echo because it can behave undesirably if your line is, say, -e or -n).
Alternately, if the number of possible output files is small, you can just open them all on different FDs up-front (substituting other, higher numbers where I chose 4), and conditionally output to one of those pre-opened files. Opening and closing files is expensive -- each close() forces a flush to disk -- so this should be a substantial help.
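A rough sketch of that idea; the descriptor numbers, the "other" bucket, and the two-character case are illustrative, not from the original script:
# Open one FD per output file up front (FD numbers chosen arbitrarily)
exec 4>>"$2/comma.txt" 5>>"$2/period.txt" 6>>"$2/other.txt"
while IFS= read -r line; do
    case ${line:0:1} in
        ,) printf '%s\n' "$line" >&4 ;;
        .) printf '%s\n' "$line" >&5 ;;
        *) printf '%s\n' "$line" >&6 ;;
    esac
done < "$file"
exec 4>&- 5>&- 6>&-    # close the descriptors when done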
A few things to speed it up:
Don't use echo/head to get the first character. You're spawning at least two additional processes per line. Instead, use bash's parameter expansion facilities to get the first character.
Use if-elif to avoid checking $first against all the possibilities each time. Even better, if you are using bash 4.0 or later, use an associative array to store the output file names, rather than checking against $first in a big if-statement for each line.
If you don't have a version of bash that supports associative arrays, replace your if statements with the following.
if [[ "$first" = "," ]]; then
name='comma'
elif [[ "$first" = "." ]]; then
name='period'
else
name="$first"
fi
But the following is suggested. Note the use of $REPLY as the default variable used by read if no name is given (just FYI).
declare -A output
output[","]=comma
output["."]=period
output["?"]=question_mark
output["!"]=exclamation_mark
output["-"]=hyphen
output["'"]=apostrophe

i=0
while read
do
    # get first char of line
    first=${REPLY:0:1}
    # make output filename
    name=${output[$first]:-$first}
    # save line to new file
    echo $REPLY >> "$name.txt"
    # show live counter and inc
    echo -en "\r$i"
    ((i++))
done <$file
#!/usr/bin/awk -f
BEGIN {
    punctlist = ", . ? ! - '"
    pnamelist = "comma period question_mark exclamation_mark hyphen apostrophe"
    pcount = split(punctlist, puncts)
    ncount = split(pnamelist, pnames)
    if (pcount != ncount) {print "error: counts don't match, pcount:", pcount, "ncount:", ncount; exit}
    for (i = 1; i <= pcount; i++) {
        punct_lookup[puncts[i]] = pnames[i]
    }
}
{
    # look up the output name; fall back to the character itself (e.g. for alphanumerics)
    first = substr($0, 1, 1)
    name = (first in punct_lookup) ? punct_lookup[first] : first
    print > (name ".txt")
    printf "\r%6d", i++
}
END {
    printf "\n"
}
The BEGIN block builds an associative array so you can do punct_lookup[","] and get "comma".
The main block simply does the lookups for the filenames and outputs the line to the file. In AWK, > truncates the file the first time and appends subsequently. If you have existing files that you don't want truncated, then change it to >> (but don't use >> otherwise).
Yet another take:
declare -i i=0
declare -A names
while read line; do
    first=${line:0:1}
    if [[ -z ${names[$first]} ]]; then
        case $first in
            ,) names[$first]="$2/comma.txt" ;;
            .) names[$first]="$2/period.txt" ;;
            *) names[$first]="$2/$first.txt" ;;
        esac
    fi
    printf "%s\n" "$line" >> "${names[$first]}"
    printf "\rLine $((++i))"
done < "$file"
and
awk -v dir="$2" '
{
first = substr($0,1,1)
if (! (first in names)) {
if (first == ",") names[first] = dir "/comma.txt"
else if (first == ".") names[first] = dir "/period.txt"
else names[first] = dir "/" first ".txt"
}
print > names[first]
printf("\rLine %d", NR)
}
'
