how do I split a file into chunks by regexp line separators? - linux

I would like to have a Linux oneliner that runs a "split" command in a slightly different way -- rather than by splitting the file into smaller files by using a constant number of lines or bytes, it will split the file according to a regexp separator that identifies where to insert the breaks and start a new file.
The problem is that most pipe commands work on one stream and can't split a stream into multiple files, unless there is some command that does that.
The closest I got to was:
cat myfile |
perl -pi -e 's/theseparatingregexp/SPLITHERE/g' |
split -l 1 -t SPLITHERE - myfileprefix
but it appears that split command cannot take multi-character delimeters.

Related

Split large gz files while preserving rows

I have a larger .gz file (2.1G) that I am trying to load into R, but it is large enough that I have to split it into pieces and load each individually before recombining them. However, I am having difficulty in splitting the file in a way that preserves the structure of the data. The file itself, with the exception of the first two rows, is a 56318 x 9592 matrix with non-homogenous entries.
I'm using Ubuntu 16.04. First, I tried using the split command from terminal as suggested by this link (https://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts?rq=1)
$ split --lines=10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
Doing this, though, creates far more files than I would have expected (since my matrix has 57000 rows, I was hoping to output 6 files, each 10000 rows in size). When reading one of these into R and investigating the dimensions, I see that each is a matrix of 62x9592, indicating that the columns have all been preserved, but I'm getting significantly less rows than I would have hoped. Further, when reading it in, I get an error specifying an unexpected end of file. My thought is that it's not reading in how I want it to.
I found a two possible alternatives here - https://superuser.com/questions/381394/unix-split-a-huge-gz-file-by-line
In particular, I've tried piping different arguments using gunzip, and then passing the output through to split (with the assumption that perhaps the file being compressed is what led to inconsistent end lines). I tried
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
but, doing this, I ended up with the exact same splits that I had previously. I have the same problem replacing "zcat" with "gunzip -c", which should have sent the uncompressed output to the split command.
Another answer on that link suggested piping to head or tail with something like zcat, for example
$ zcat originalFile.gct.gz | head -n 10000 >> "originalFile.gct.gz.1"
With zcat, this works perfectly, and it's exactly what I want. The dimension for this ends up being 10000x9592, so this is the ideal solution. One thing that I'll note is that this output is an ASCII text file rather than a compressed file, and I'm perfectly OK with that.
However, I want to be able to do this until end up file, making an additional output file for each 10000 rows. For this particular case, it's not a huge deal to just make the six, but I have tens of files like this, some of which are >10gb. My question, then, is how can I use split command that will take the first 10000 lines of the unzipped file and then output them, automatically updating the suffix with each new file? Basically, I want the output that I got from using "head", but with "split" so that I can do it over the entire file.
Here is the solution that ended up working for me
$ zcat originalFile.gct.gz | split -l 10000 - "originalFile.gtc.gz-"
As Guido mentioned in the comment, my original command
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
was discarding the output of zcat, and split was once again reading from the compressed data. By including the "-" after the split argument, I was able to pass the standard output from zcat into split, and now the piping works as I was expecting it to.
When you want to control your splitting better, you can use awk.
You mentioned that the first two rows were special.
Try something like
zcat originalFile.gct.gz |
awk 'BEGIN {j=1} NR<3 {next} {i++} i%5==0 {j++} {print > "originalFile.gct.part"j }'
When you want your outfiles compressed, modify the awk command: Let is print the completed files and use xargs to gzip them.
If spliting based on the content of the file works for you. Try:
zcat originalFile.gct.gz | awk -F$',' '{print $0 | "gzip > /tmp/name_"$1".gct.gz";}'
and example line of my file was: 2014,daniel,2,1,2,3
So I was splitting the files for the year (first column) using the
If spliting based on the content of the file works for you. Try:
zcat originalFile.gct.gz | awk -F$',' '{print $0 | "gzip > /tmp/file_"$1".gct.gz";}'
and example line of my file was:
2014,daniel,2,1,2,3
So I was splitting the files for the year (first column) using the variable $1
Getting and ouput of:
/tmp/file_2014.gct.gz
/tmp/file_2015.gct.gz
/tmp/file_2016.gct.gz
/tmp/file_2017.gct.gz
/tmp/file_2018.gct.gz

How do I pipe to Linux split command?

I'm a bit useless at Linux CLI, and I am trying to run the following commands to randomly sort, then split a file with output file prefixes 'out' (one output file will have 50 lines, the other the rest):
sort -R somefile | split -l 50 out
I get the error
split: cannot open ‘out’ for reading: No such file or directory
this is presumably because the third parameter of split should be its input file. How do I pass the result of the sort to split? TIA!!
Use - for stdin:
sort -R somefile | split -l 50 - out
From man split:
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'. With no
INPUT, or when INPUT is -, read standard input.
Allowing - to specify input is stdin is a convention many UNIX utilities follow.
out is interpreted as input file. You can should a single dash to indicate reading from STDIN:
sort -R somefile | split - -l 50 out
For POSIX systems like mac os the - parameter is not accepted and you need to completely omit the filename, and let it generate it's own names.
sort -R somefile | split -l 50

Improve performance of Bash loop that removes windows line endings

Editor's note: This question was always about loop performance, but the original title led some answerers - and voters - to believe it was about how to remove Windows line endings.
The below bash loop below just remove the windows line endings and converts them to unix and appears to be running, but it is slow. The input files are small (4 files ranging from 167 bytes - 1 kb), and are all the same structure (list of names) and the only thing that varies is the length (ie. some files are 10 names others are 50). Is it supposed to take over 15 minutes to complete this task using a xeon processor? Thank you :)
for f in /home/cmccabe/Desktop/files/*.txt ; do
bname=`basename $f`
pref=${bname%%.txt}
sed 's/\r//' $f - $f > /home/cmccabe/Desktop/files/${pref}_unix.txt
done
Input .txt files
AP3B1
BRCA2
BRIP1
CBL
CTC1
EDIT
This is not a duplicate as I was more asking for why my bash loop that uses sed to remove windows line endings was running so slow. I did not mean to imply how to remove them, was asking for ideas that might speed up the loop and I got many. Thank you :). I hope this helps.
Use the utilities dos2unix and unix2dos to convert between unix and windows style line endings.
Your 'sed' command looks wrong. I believe the trailing $f - $f should simply be $f. Running your script as written hangs for a very long time on my system, but making this change causes it to complete almost instantly.
Of course, the best answer is to use dos2unix, which was designed to handle this exact thing:
cd /home/cmccabe/Desktop/files
for f in *.txt ; do
pref=$(basename -s '.txt' "$f")
dos2unix -q -n "$f" "${pref}_unix.txt"
done
This always works for me:
perl -pe 's/\r\n/\n/' inputfile.txt > outputfile.txt
you can use dos2unix as stated before or use this small sed:
sed 's/\r//' file
The key to performance in Bash is to avoid loops in general, and in particular those that call one or more external utilities in each iteration.
Here is a solution that uses a single GNU awk command:
awk -v RS='\r\n' '
BEGINFILE { outFile=gensub("\\.txt$", "_unix&", 1, FILENAME) }
{ print > outFile }
' /home/cmccabe/Desktop/files/*.txt
-v RS='\r\n' sets CRLF as the input record separator, and by virtue of leaving ORS, the output record separator at its default, \n, simply printing each input line will terminate it with \n.
the BEGINFILE block is executed every time processing of a new input file starts; in it, gensub() is used to insert _unix before the .txt suffix of the input file at hand to form the output filename.
{print > outFile} simply prints the \n-terminated lines to the output file at hand.
Note that use of a multi-char. RS value, the BEGINFILE block, and the gensub() function are GNU extensions to the POSIX standard.
Switching from the OP's sed solution to a GNU awk-based one was necessary in order to provide a single-command solution that is both simpler and faster.
Alternatively, here's a solution that relies on dos2unix for conversion of Window line-endings (for instance, you can install dos2unix with sudo apt-get install dos2unix on Debian-based systems); except for requiring dos2unix, it should work on most platforms (no GNU utilities required):
It uses a loop only to construct the array of filename arguments to pass to dos2unix - this should be fast, given that no call to basename is involved; Bash-native parameter expansion is used instead.
then uses a single invocation of dos2unix to process all files.
# cd to the target folder, so that the operations below do not need to handle
# path components.
cd '/home/cmccabe/Desktop/files'
# Collect all *.txt filenames in an array.
inFiles=( *.txt )
# Derive output filenames from it, using Bash parameter expansion:
# '%.txt' matches '.txt' at the end of each array element, and replaces it
# with '_unix.txt', effectively inserting '_unix' before the suffix.
outFiles=( "${inFiles[#]/%.txt/_unix.txt}" )
# Create an interleaved array of *input-output filename pairs* to be passed
# to dos2unix later.
# To inspect the resulting array, run `printf '%s\n' "${fileArgs[#]}"`
# You'll see pairs like these:
# file1.txt
# file1_unix.txt
# ...
fileArgs=(); i=0
for inFile in "${inFiles[#]}"; do
fileArgs+=( "$inFile" "${outFiles[i++]}" )
done
# Now, use a *single* invocation of dos2unix, passing all input-output
# filename pairs at once.
dos2unix -q -n "${fileArgs[#]}"

Use Linux with line breaks

I am using this:
perl -pi -w -e 's/SEARCH_FOR/REPLACE_WITH/g;' *.txt
and the SEARCH_FOR input has line breaks. For example:
SEARCH_FOR:
I want to get
rid of this text
in multiple files
REPLACE_WITH:
[Nothingness / 0 bytes]
-0 command line parameter changes input record separator $/, and -0777 sets it to undef which effectively puts readline() into slurping whole file at once, so you can successfully apply multi line substitution regex.
Since perl -p reads, processes and prints one line at a time, the multi-line search-for pattern is never going to match a single line of input. Therefore, you're going to have to find a way to make Perl read multiple lines at a time.

How do I insert the results of several commands on a file as part of my sed stream?

I use DJing software on linux (xwax) which uses a 'scanning' script (visible here) that compiles all the music files available to the software and outputs a string which contains a path to the filename and then the title of the mp3. For example, if it scans path-to-mp3/Artist - Test.mp3, it will spit out a string like so:
path-to-mp3/Artist - Test.mp3[tab]Artist - Test
I have tagged all my mp3s with BPM information via the id3v2 tool and have a commandline method for extracting that information as follows:
id3v2 -l name-of-mp3.mp3 | grep TBPM | cut -D: -f2
That spits out JUST the numerical BPM to me. What I'd like to do is prepend the BPM number from the above command as part of the xwax scanning script, but I'm not sure how to insert that command in the midst of the script. What I'd want it to generate is:
path-to-mp3/Artist - Test.mp3[tab][bpm]Artist - Test
Any ideas?
It's not clear to me where in that script you want to insert the BPM number, but the idea is this:
To embed the output of one command into the arguments of another, you can use the "command substitution" notation `...` or $(...). For example, this:
rm $(echo abcd)
runs the command echo abcd and substitutes its output (abcd) into the overall command; so that's equivalent to just rm abcd. It will remove the file named abcd.
The above doesn't work inside single-quotes. If you want, you can just put it outside quotes, as I did in the above example; but it's generally safer to put it inside double-quotes (so as to prevent some unwanted postprocessing). Either of these:
rm "$(echo abcd)"
rm "a$(echo bc)d"
will remove the file named abcd.
In your case, you need to embed the command substitution into the middle of an argument that's mostly single-quoted. You can do that by simply putting the single-quoted strings and double-quoted strings right next to each other with no space in between, so that Bash will combine them into a single argument. (This also works with unquoted strings.) For example, either of these:
rm a"$(echo bc)"d
rm 'a'"$(echo bc)"'d'
will remove the file named abcd.
Edited to add: O.K., I think I understand what you're trying to do. You have a command that either (1) outputs out all the files in a specified directory (and any subdirectories and so on), one per line, or (2) outputs the contents of a file, where the contents of that file is a list of files, one per line. So in either case, it's outputting a list of files, one per line. And you're piping that list into this command:
sed -n '
{
# /[<num>[.]] <artist> - <title>.ext
s:/\([0-9]\+.\? \+\)\?\([^/]*\) \+- \+\([^/]*\)\.[A-Z0-9]*$:\0\t\2\t\3:pi
t
# /<artist> - <album>[/(Disc|Side) <name>]/[<ABnum>[.]] <title>.ext
s:/\([^/]*\) \+- \+\([^/]*\)\(/\(disc\|side\) [0-9A-Z][^/]*\)\?/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t\1\t\6:pi
t
# /[<ABnum>[.]] <name>.ext
s:/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t\t\2:pi
}
'
which runs a sed script over that list. What you want is for all of the replacement-strings to change from \0\t... to \0\tBPM\t..., where BPM is the BPM number computed from your command. Right? And you need to compute that BPM number separately for each file, so instead of relying on seds implicit line-by-line looping, you need to handle the looping yourself, and process one line at a time. Right?
So, you should change the above command to this:
while read -r LINE ; do # loop over the lines, saving each one as "$LINE"
BPM=$(id3v2 -l "$LINE" | grep TBPM | cut -D: -f2) # save BPM as "$BPM"
sed -n '
{
# /[<num>[.]] <artist> - <title>.ext
s:/\([0-9]\+.\? \+\)\?\([^/]*\) \+- \+\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\2\t\3:pi
t
# /<artist> - <album>[/(Disc|Side) <name>]/[<ABnum>[.]] <title>.ext
s:/\([^/]*\) \+- \+\([^/]*\)\(/\(disc\|side\) [0-9A-Z][^/]*\)\?/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\1\t\6:pi
t
# /[<ABnum>[.]] <name>.ext
s:/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\t\2:pi
}
' <<<"$LINE" # take $LINE as input, rather than reading more lines
done
(where the only change to the sed script itself was to insert '"$BPM"'\t in a few places to switch from single-quoting to double-quoting, then insert the BPM, then switch back to single-quoting and add a tab).

Resources