Split a text file using gsplit on a delimiter on OSX Mojave [duplicate]

This question already has answers here:
Split one file into multiple files based on delimiter
(12 answers)
Closed 2 years ago.
I have searched many answers for hours, but none have helped me use gsplit with a delimiter. It's frustrating that there is no well-explained answer to this in 2020. So far I've tried the following.
First I install coreutils:
brew install coreutils
Then I run this command, which works for splitting every 5000 lines. However, I need it to split on a delimiter, not every 5000 lines.
gsplit -l 5000 -d --additional-suffix=.txt "$FileName" file
I can't find anything in the help file about how to split on a delimiter, any delimiter, 'abc' for example. And there are so many answers on here that simply don't explain how to get whatever other utility they use (awk or gawk??) to work, with no explanation of how to install it, what operating system they use, etc.
My file (myfile.txt), which I want to split on the 'abc' delimiter, looks like this:
myfile.txt:
randomHTML
randomHTML
randomHTML
randomHTML
abc
randomHTML
abc
randomHTML
randomJS
randomHTML
randomHTML
abc
randomHTML
randomJS
abc
There's no mention of a delimiter in the gsplit help:
gsplit --help
Usage: gsplit [OPTION]... [FILE [PREFIX]]
Output pieces of FILE to PREFIXaa, PREFIXab, ...;
default size is 1000 lines, and default PREFIX is 'x'.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of records per output file
-d use numeric suffixes starting at 0, not alphabetic
--numeric-suffixes[=FROM] same as -d, but allow setting the start value
-x use hex suffixes starting at 0, not alphabetic
--hex-suffixes[=FROM] same as -x, but allow setting the start value
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines/records per output file
-n, --number=CHUNKS generate CHUNKS output files; see explanation below
-t, --separator=SEP use SEP instead of newline as the record separator;
'\0' (zero) specifies the NUL character
-u, --unbuffered immediately copy input to output with '-n r/...'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).
Binary prefixes can be used, too: KiB=K, MiB=M, and so on.
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines/records
l/K/N output Kth of N to stdout without splitting lines/records
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Full documentation <https://www.gnu.org/software/coreutils/split>
or available locally via: info '(coreutils) split invocation'

How about:
awk -F 'abc' 'BEGIN { RS = "^$" } { for (i = 1; i < NF; i++) print $i > (i "-abc.txt") }' abc.txt
We set the record separator to ^$ (a GNU awk idiom) in a BEGIN block so the whole file is read as one record, set "abc" as the field separator, and then loop over the fields, redirecting each one to a file named after its field number with an -abc.txt suffix. Note that a regex record separator needs GNU awk (gawk, brew install gawk on macOS); the BSD awk that ships with macOS only honours a single-character RS. Redirecting print output directly also avoids the quoting problems of shelling out to echo via system().
abc.txt holds the original data
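If gawk is available anyway, a shorter route is to make the delimiter line the record separator itself, so each chunk arrives as its own record. A minimal sketch under the same assumptions (delimiter always on its own line; file names are illustrative):
gawk 'BEGIN { RS = "abc\n" } NF { f = NR "-abc.txt"; printf "%s", $0 > f; close(f) }' abc.txt
Here printf avoids appending an extra newline, NF skips the empty record that a trailing delimiter would otherwise produce, and close() keeps the number of simultaneously open files down.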

Related

how do I split a file into chunks by regexp line separators?

I would like a Linux one-liner that runs a "split" command in a slightly different way: rather than splitting the file into smaller files by a constant number of lines or bytes, it would split the file at a regexp separator that identifies where to insert the breaks and start a new file.
The problem is that most pipe commands work on one stream and can't split a stream into multiple files, unless there is some command that does that.
The closest I got to was:
cat myfile |
perl -pe 's/theseparatingregexp/SPLITHERE/g' |
split -l 1 -t SPLITHERE - myfileprefix
but it appears that the split command cannot take multi-character delimiters.
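For what it's worth, GNU csplit splits a file at lines matching a regexp directly, with no pipeline needed; on macOS the coreutils build is installed as gcsplit. A minimal sketch reusing the names from the attempt above:
csplit -z -f myfileprefix myfile '/theseparatingregexp/' '{*}'
Each new piece begins at a line matching the regexp; -z drops empty pieces and -f sets the output prefix.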

Grep for specific numbers within a text file and output per number text file

I have a text file chunk_names.txt that looks like this:
chr1_12334_64321
chr1_134435_77474
chr10_463252_74754
chr10_54265_423435
chr13_5464565_547644567
This is an example, but all chromosomes are represented (1...22, X and Y). All entries follow the same format: chr{1..22, X or Y}_*string of numbers*_*string of numbers*.
I would like to split these into per-chromosome files, e.g. all of the chunks starting with chr10 put into a file called chr10.txt.
In Linux I have tried:
for i in {1..22}
do
grep chr$i chunk_names.txt > chr$i.txt
done
However, the chr1.txt output file now contains all the chromosome chunks with 1 in them (1, 10, 11, 12, etc.).
How would I modify this script to separate out the chromosomes?
I also haven't tackled how to include chromosomes X and Y within the same script and am currently running them separately.
Things I have tried:
grep -o gives me just "chr$i" as an output
grep 'chr$i' gives me blank files
grep "chr$i" has the initial problem
Many thanks for your time.
Your 'for' loop will mean parsing your file N times (where N is the number of chromosomes/contigs in your list). Here's an agnostic approach using awk that will parse the file just once:
awk -F '_' '{ print > ($1 ".txt") }' chunk_names.txt
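A variant of the same idea that also closes each output file as it goes, which may help if the list of contigs grows large or your awk limits open files (note that >> appends, so remove old outputs before re-running):
awk -F '_' '{ f = $1 ".txt"; print >> f; close(f) }' chunk_names.txt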
If you include the _ following the number, you can distinguish between chr1_ and e.g. chr10_. To include X and Y, simply include these in the loop:
for i in {1..22} X Y
do
grep "chr${i}_" chunk_names.txt > chr$i.txt
done
To search at the beginning of the line only, you can add a leading ^ to the pattern:
grep "^chr${i}_" chunk_names.txt > chr$i.txt
Explanation about your attempts:
grep chr$i searches for the pattern anywhere in the line. The shell replaces $i with the value of the variable i, so you get chr1, chr2 etc.
If you enclose the pattern in double quotes as grep "chr$i" the shell will not do any file name globbing or splitting of the string, but still expand variables. In your case it is the same as without quotes.
If you use single quotes, the shell takes the literal string as is, so you always search for a line that contains chr$i (instead of chr1 etc.), which does not occur in your file.
Explanation about quotes:
The quotes in my proposed solution are not necessary in your case, but it is a good habit to quote everything. If your pattern contained spaces or characters that are special to the shell, the quoting would make a difference.
Example:
If your file contained chr1* instead of chr1_, the unquoted pattern chr${i}* would be replaced by the list of matching file names.
Once you have created your output files chr1.txt etc., try these commands:
$ i=1; echo chr$i*
chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt
$ i=1; echo "chr$i*"
chr1*
In the first case, the grep command
grep chr${i}* chunk_names.txt
would be expanded as
grep chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt chunk_names.txt
which would search for the pattern chr10.txt in files chr11.txt ... chr1.txt and chunk_names.txt.

Linux tail command includes more lines than intended

So, I want to get into Linux scripting a little and started with a simple example from a book. In this book, the author wants me to grab the five lines before "Step #6: Configure output plugins" from snort.conf.
Following the author, I determined the number of the line I want, which is 445 for me. But if I then use tail, the result returns more text than I expect, and the searched line, which should be at line 5, is at line 88. I fail to understand how I can use the tail command to start at a specific line and still end up with more text than intended.
To search for the line I used
nl /etc/snort/snort.conf | grep output.
To get the 5 lines before including the searched line:
tail -n+440 /etc/snort/snort.conf | head -n+6
whereas the tail statement seems to be the problem. Any help on why my approach is not working is appreciated!
Your tail command is correct in principle.
The problem lies in the way in which you acquire the line number using nl. The nl command does not count empty lines by default, while the tail command does. You should tell nl to count the empty lines as well, which you can do with the -b (--body-numbering) option and the style a. This would look as follows:
nl -ba /etc/snort/snort.conf | grep output.
From nl --help:
Usage: nl [OPTION]... [FILE]...
Write each FILE to standard output, with line numbers added.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-b, --body-numbering=STYLE use STYLE for numbering body lines
[...]
By default, selects -v1 -i1 -l1 -sTAB -w6 -nrn -hn -bt -fn. CC are
two delimiter characters for separating logical pages, a missing
second character implies :. Type \\ for \. STYLE is one of:
a number all lines
t number only nonempty lines
Number all lines and use that line number in tail.
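Putting it together: grep -n numbers lines the same way tail counts them (empty lines included), so the line number never has to be read off by hand. A sketch, assuming the pattern matches exactly once:
LINE=$(grep -n 'Step #6: Configure output plugins' /etc/snort/snort.conf | cut -d: -f1)
tail -n +"$((LINE - 5))" /etc/snort/snort.conf | head -n 6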
Hello, I was trying the same thing with the same book that you are using. I didn't find any great solution with tail or nl, but I came up with a simple one using grep's -B and -A switches, which print lines of context before and after a match.
I achieved this by typing
grep -B 5 "Step #6: Configure output plugins" /etc/snort/snort.conf
That gets you the 5 lines before the matching line; -A does the same for the lines after it.
Hope this will help someone. Stay safe and happy learning 🙂

Removing content existing in another file in bash [duplicate]

This question already has answers here:
Print lines in one file matching patterns in another file
(5 answers)
Closed 4 years ago.
I am attempting to clean file1.txt, which always contains lines in the same format, using file2.txt, which contains a list of IP addresses I want to remove.
I believe the working script I have written can be enhanced somehow to execute faster.
My script:
#!/bin/bash
IFS=$'\n'
for i in $(cat file1.txt); do
for j in $(cat file2.txt); do
echo ${i} | grep -v ${j}
done
done
I have tested the script with the following data set:
Amount of lines in file1.txt = 10,000
Amount of lines in file2.txt = 3
Script execution time:
real 0m31.236s
user 0m0.820s
sys 0m6.816s
file1.txt content:
I3fSgGYBCBKtvxTb9EMz,1.1.2.3,45,This IP belongs to office space,1539760501,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.2.2.3,45,This IP belongs to office space,1539760502,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.3.2.3,45,This IP belongs to office space,1539760503,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.4.2.3,45,This IP belongs to office space,1539760504,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.5.2.3,45,This IP belongs to office space,1539760505,https://myoffice.com
... lots of other lines in the same format
I3fSgGYBCBKtvxTb9EMz,4.1.2.3,45,This IP belongs to office space,1539760501,https://myoffice.com
file2.txt content:
1.1.2.3
1.2.2.3
... lots of other IPs here
1.2.3.9
How can I improve those timings?
I am confident that the files will grow over time, and I will run the script every hour from cron, so I would like to improve things here.
You want to get rid of all lines in file1.txt that contain substrings matching lines in file2.txt. grep to the rescue:
grep -vFwf file2.txt file1.txt
The -w is needed so that 11.11.11.11 does not match 111.11.11.111.
-F, --fixed-strings, --fixed-regexp Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX, --fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
-f FILE, --file=FILE Obtain patterns from FILE, one per line. The empty file contains zero patterns and therefore matches nothing. (-f is specified by POSIX.)
-w, --word-regexp Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
source: man grep
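A quick illustration of what -w buys you here (the addresses are made up):
$ printf '111.11.11.111\n' | grep -Fw '11.11.11.11'
$ printf 'a,11.11.11.11,b\n' | grep -Fw '11.11.11.11'
a,11.11.11.11,b
The first command prints nothing because the match is flanked by digits; the second matches because commas are not word characters.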
On a further note, here are a couple of pointers for your script:
Don't use for loops to read files (http://mywiki.wooledge.org/DontReadLinesWithFor).
Don't use cat (See How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?)
Use quotes! (See Bash and Quotes)
This allows us to rewrite it as:
#!/bin/bash
while IFS=$'\n' read -r i; do
while IFS=$'\n' read -r j; do
echo "$i" | grep -v "$j"
done < file2.txt
done < file1.txt
Now the problem is that you read file2.txt N times, where N is the number of lines in file1.txt. This is not really efficient, and luckily grep has the solution for us (see top).
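If the IPs should only ever be compared against the second CSV field, an awk lookup is another option that sidesteps substring matching entirely; a minimal sketch assuming the layout shown in the question:
awk -F ',' 'NR == FNR { bad[$0]; next } !($2 in bad)' file2.txt file1.txt
The first pass loads file2.txt into an array; the second pass prints only those lines of file1.txt whose second field is not in it.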

How do I pipe to Linux split command?

I'm a bit useless at the Linux CLI, and I am trying to run the following commands to randomly sort, then split a file with output file prefix 'out' (one output file will have 50 lines, the other the rest):
sort -R somefile | split -l 50 out
I get the error
split: cannot open ‘out’ for reading: No such file or directory
This is presumably because split is treating the third parameter, out, as its input file. How do I pass the result of sort to split? TIA!!
Use - for stdin:
sort -R somefile | split -l 50 - out
From man split:
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'. With no
INPUT, or when INPUT is -, read standard input.
Allowing - to specify stdin as the input is a convention many UNIX utilities follow.
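As an aside, if the goal really is exactly one 50-line file plus one file holding the rest (not a series of 50-line pieces), head and tail on an intermediate file may be closer to it; a sketch with illustrative names:
sort -R somefile > shuffled
head -n 50 shuffled > out1
tail -n +51 shuffled > out2
rm shuffled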
out is interpreted as the input file. You should use a single dash to indicate reading from STDIN:
sort -R somefile | split - -l 50 out
For POSIX systems like macOS, the - parameter is not accepted; you need to omit the filename completely and let split generate its own names.
sort -R somefile | split -l 50
