How do I pipe to Linux split command? - linux

I'm a bit useless at Linux CLI, and I am trying to run the following commands to randomly sort, then split a file with output file prefixes 'out' (one output file will have 50 lines, the other the rest):
sort -R somefile | split -l 50 out
I get the error
split: cannot open ‘out’ for reading: No such file or directory
this is presumably because the third parameter of split should be its input file. How do I pass the result of the sort to split? TIA!!

Use - for stdin:
sort -R somefile | split -l 50 - out
From man split:
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'. With no
INPUT, or when INPUT is -, read standard input.
Allowing - to mean "read from standard input" is a convention many UNIX utilities follow.
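As a quick sanity check (the line counts below are only an illustration; the question assumes a file of just over 50 lines), you can count the lines in the resulting pieces:
sort -R somefile | split -l 50 - out
wc -l out*    # outaa holds 50 lines, outab the rest (more pieces appear for longer input)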

out is interpreted as the input file. You should add a single dash to indicate reading from STDIN:
sort -R somefile | split - -l 50 out

For POSIX systems like macOS, the - parameter is not accepted, so you need to omit the filename entirely and let split generate its own names.
sort -R somefile | split -l 50
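Per the man page text quoted above, the default PREFIX is 'x', so in that case the pieces come out as xaa, xab, and so on:
sort -R somefile | split -l 50
ls x*    # xaa, xab, ...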

Related

how do I split a file into chunks by regexp line separators?

I would like to have a Linux oneliner that runs a "split" command in a slightly different way -- rather than by splitting the file into smaller files by using a constant number of lines or bytes, it will split the file according to a regexp separator that identifies where to insert the breaks and start a new file.
The problem is that most pipe commands work on one stream and can't split a stream into multiple files, unless there is some command that does that.
The closest I got to was:
cat myfile |
perl -pi -e 's/theseparatingregexp/SPLITHERE/g' |
split -l 1 -t SPLITHERE - myfileprefix
but it appears that the split command cannot take multi-character delimiters.
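One way around split's single-character -t limit is GNU csplit, which starts a new file at every line matching a regular expression. A rough sketch, reusing the placeholder names from the question:
# '{*}' repeats the pattern for every match; output goes to myfileprefix00, myfileprefix01, ...
csplit --quiet --prefix=myfileprefix myfile '/theseparatingregexp/' '{*}'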

Split a text file using gsplit on a delimiter on OSX Mojave [duplicate]

This question already has answers here:
Split one file into multiple files based on delimiter
(12 answers)
Closed 2 years ago.
Have searched many answers for hours but none have helped me use gsplit with a delimiter. Frustrating that there is no well-explained answer to this in 2020. So far I've tried:
First I install coreutils:
brew install coreutils
Then I run this command, which works for splitting by 5000 lines. However, I need it to split by a delimiter, not by 5000 lines.
gsplit -l 5000 -d --additional-suffix=.txt $FileName file
I can't find anything in the help file about how to split by a delimiter, any delimiter like 'abc' for example. And there are so many answers on here that simply don't explain how to get some other utility they use to work (awk or gawk??), with no explanation of how to install it or what operating system they use, etc.
My file (myfile.txt) that I want to split on the 'abc' delimiter looks like this:
myfile.txt:
randomHTML
randomHTML
randomHTML
randomHTML
abc
randomHTML
abc
randomHTML
randomJS
randomHTML
randomHTML
abc
randomHTML
randomJS
abc
There's no mention of a delimiter in the gsplit help
gsplit --help
Usage: gsplit [OPTION]... [FILE [PREFIX]]
Output pieces of FILE to PREFIXaa, PREFIXab, ...;
default size is 1000 lines, and default PREFIX is 'x'.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of records per output file
-d use numeric suffixes starting at 0, not alphabetic
--numeric-suffixes[=FROM] same as -d, but allow setting the start value
-x use hex suffixes starting at 0, not alphabetic
--hex-suffixes[=FROM] same as -x, but allow setting the start value
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines/records per output file
-n, --number=CHUNKS generate CHUNKS output files; see explanation below
-t, --separator=SEP use SEP instead of newline as the record separator;
'\0' (zero) specifies the NUL character
-u, --unbuffered immediately copy input to output with '-n r/...'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).
Binary prefixes can be used, too: KiB=K, MiB=M, and so on.
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines/records
l/K/N output Kth of N to stdout without splitting lines/records
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Full documentation <https://www.gnu.org/software/coreutils/split>
or available locally via: info '(coreutils) split invocation'
How about:
awk -F\(abc\) 'RS="^$" { for (i=1;i<NF;i++) { system("echo \""$i"\" > "i"-abc.txt") } }' abc.txt
Setting RS to "^$" lets us treat the whole file as a single record. "(abc)" then acts as the field separator, and for each field we use the system command to echo its contents into a file named after the field number with an -abc.txt suffix (1-abc.txt, 2-abc.txt, ...).
abc.txt holds the original data
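Alternatively, the same brew coreutils install also provides gcsplit, so a shorter sketch (assuming the g-prefixed csplit is present; xx00, xx01, ... are its default output names) is:
# split at every line that is exactly 'abc'; --suppress-matched drops those delimiter lines
gcsplit --quiet --suppress-matched myfile.txt '/^abc$/' '{*}'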

Why can't I pass a file path argument to the shell command 'more' in pipeline mode?

I have a text file a.txt
hello world
I use following commands:
cmd1:
$ more a.txt
output:
hello world
cmd2:
$ echo 'a.txt'|more
output:
a.txt
I thought cmd2 should be equivalent to echo 'a.txt' | xargs -i more {}, but it's not.
I want to know why cmd2 works like that and how to write code that behaves differently in pipeline mode.
Redirection with | or < controls what the stdin stream contains; it has no impact on a program's command line argument list.
Thus, more <a.txt (efficiently) or cat a.txt | more (inefficiently) both attach a file handle from which one can read the contents of a.txt to the stdin file handle of a new process before replacing that process with more. Similarly, echo a.txt | more makes a.txt itself the literal text that more reads from its stdin stream, which is the default place it's documented to get the input to display from, if not given any more specific filename(s) on its command line.
Generally, if you have a list of filenames and want to convert them to command-line arguments, this is what xargs is for (though using it without a great deal of care can introduce bugs, potentially-security-impacting ones).
Consider the following, which (using NUL rather than newline delimiters to separate filenames) is a safe use of xargs to take a list of filenames being piped into it, and transform that into an argument list to cat, used to concatenate all those files together and generate a single stream of input to more:
printf '%s\0' a.txt b.txt |
xargs -0 cat -- |
more

How to pipe multiple binary files to an application which reads from stdin

For a single file,
$ my_app < file01.binary
For multiple files,
$ cat file*.binary | my_app
Each binary file is of size 500MB and the total size of all file*.binary is around 8GB. Based on my understanding, cat will first concatenate all files then redirect the single big file to my_app.
Is there a better way to send multiple binary files to my_app without first concatenating them?
No. cat will just read lines/blocks from the input files in a loop and print them to the pipe. No worries.
The "concatenate" in cat means that it concatenates its input to its output. It does not imply that it concatenates its input(s) in memory first.
ls file*.binary | xargs cat | my_app
xargs is a command to build and execute commands from standard input. It converts input from standard input into arguments to a command.
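As in the xargs answer further up, parsing ls output is fragile with unusual filenames; a safer variant of the same idea (file*.binary and my_app are the names from the question) is:
printf '%s\0' file*.binary | xargs -0 cat -- | my_app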

How to take advantage of filters

I've read here that
To make a pipe, put a vertical bar (|) on the command line between two commands.
then
When a program takes its input from another program, performs some operation on that input, and writes the result to the standard output, it is referred to as a filter.
So I've first tried the ls command whose output is:
Desktop HelloWord.java Templates glassfish-4.0
Documents Music Videos hs_err_pid26742.log
Downloads NetBeansProjects apache-tomcat-8.0.3 mozilla.pdf
HelloWord Pictures examples.desktop netbeans-8.0
Then I tried ls | echo, which outputs absolutely nothing.
I'm looking for a way to take advantages of pipelines and filters in my bash script. Please help.
echo doesn't read from standard input. It only writes its command-line arguments to standard output. The cat command is what you want: it copies what it reads from standard input to standard output.
ls | cat
(Note that the pipeline above is a little pointless, but does demonstrate the idea of a pipe. The command on the right-hand side must read from standard input.)
Don't confuse command-line arguments with standard input.
echo doesn't read standard input. For something more useful, try
ls | sort -r
to get the output sorted in reverse,
or
ls | grep '[0-9]'
to only keep the lines containing digits.
In addition to what others have said: if your command (echo in this example) does not read from standard input, you can use xargs to "feed" it from standard input, so
ls | echo
doesn't work, but
ls | xargs echo
works fine.
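Putting those pieces together, a minimal sketch of a filter pipeline inside a bash script (the grep pattern and sort order are just the examples from the answers above) could be:
#!/bin/bash
# list the directory, keep names containing digits, sort them in reverse,
# then pass the surviving names to echo as command-line arguments via xargs
ls | grep '[0-9]' | sort -r | xargs echo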
