Search and replace text containing line breaks - linux

I am using this:
perl -pi -w -e 's/SEARCH_FOR/REPLACE_WITH/g;' *.txt
and the SEARCH_FOR input has line breaks. For example:
SEARCH_FOR:
I want to get
rid of this text
in multiple files
REPLACE_WITH:
[Nothingness / 0 bytes]

The -0 command-line switch changes the input record separator $/, and -0777 sets it to undef, which effectively puts readline() into slurp mode, reading the whole file at once, so a multi-line substitution regex can match.

Since perl -p reads, processes and prints one line at a time, the multi-line search-for pattern is never going to match a single line of input. Therefore, you're going to have to find a way to make Perl read multiple lines at a time.

Related

how do I split a file into chunks by regexp line separators?

I would like to have a Linux oneliner that runs a "split" command in a slightly different way -- rather than by splitting the file into smaller files by using a constant number of lines or bytes, it will split the file according to a regexp separator that identifies where to insert the breaks and start a new file.
The problem is that most pipe commands work on one stream and can't split a stream into multiple files, unless there is some command that does that.
The closest I got to was:
cat myfile |
perl -pi -e 's/theseparatingregexp/SPLITHERE/g' |
split -l 1 -t SPLITHERE - myfileprefix
but it appears that the split command cannot take multi-character delimiters.
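One way around split's limitation is to let awk do the splitting itself instead of piping into split; a sketch using the question's own names (theseparatingregexp, myfileprefix, myfile):

```shell
# Bump a counter at every line matching the separator regex, and append
# each line to the output file for the current chunk. n+0 forces the
# still-unset counter to 0 for the lines before the first separator.
awk '/theseparatingregexp/{n++} {print > ("myfileprefix" n+0)}' myfile
```

For a large number of chunks you would also want to close() each file as you move past it, to stay under the open-file limit of non-GNU awks.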

How to truncate rest of the text in a file after finding a specific text pattern, in unix?

I have an HTML page which I extracted in Unix using the wget command. I need to remove all of the text after the words "Check list", and with the remainder I am trying to grep some data. I cannot think of a way to remove the text after a keyword. If I do
s/Check list.*//g
It just removes the line, but I want everything below that to be gone. How do I perform this?
The other solutions you have so far require non-POSIX-mandatory tools (GNU sed, GNU awk, or perl) so YMMV with their availability and will read the whole file into memory at once.
These will work in any awk in any shell on every Unix box and only read 1 line at a time into memory:
awk -F 'Check list' '{print $1} NF>1{exit}' file
or:
awk 'sub(/Check list.*/,""){f=1} {print} f{exit}' file
With GNU awk for multi-char RS you could do:
awk -v RS='Check list' '{print; exit}' file
but that would still read all of the text before Check list into memory at once.
Depending on which sed version you have, maybe
sed -z 's/Check list.*//'
The /g flag is useless as you only want to replace everything once.
If your sed does not have the -z option (which says to use the ASCII null character as line terminator instead of newline; this hinges on your file not containing any actual nulls, but that should trivially be true for any text file), try Perl:
perl -0777 -pe 's/Check list.*//s'
Unlike sed -z, this explicitly says to slurp the entire file into memory (the argument to -0 is the octal character code of a terminator character, but 777 is not a valid terminator character at all, so it always reads the entire file as a single "line") so this works even if there are spurious nulls in your file. The final s flag says to include newline in what . matches (otherwise s/.*// would still only substitute on the matching physical line).
I assume you are aware that removing everything will violate the integrity of the HTML file; every start tag near the beginning of the document needs a matching closing tag (so if it starts with <html><body> you should keep </body></html> just before the end of the file, for example).
With awk you could use the RS variable to read the whole file as one record, set the field separator to a regex with word boundaries, and then print the very first field as per need.
awk -v RS="^$" -v FS='\\<Check list\\>' '{print $1}' Input_file
You might use q to instruct GNU sed to quit, thus ending processing, consider following simple example, let file.txt content be
123
456
789
and say you want to jettison everything beyond 5, then you could do
sed '/5/{s/5.*//;q}' file.txt
which gives output
123
4
Explanation: for the line containing 5, substitute 5 and everything beyond it with the empty string (i.e. delete it), then q. Observe that lowercase q is used so the altered line is printed before quitting.
(tested in GNU sed 4.7)

How to add character at the end of specific line in UNIX/LINUX?

Here is my input file. I want to add a ":" character to the end of lines that have ">" at the beginning. I tried sed -i 's|$|:|' input.txt but ":" was added to the end of every line. It is also impractical to target specific line numbers because, in each of my input files, the lines containing ">" appear at different line numbers, and I want to run a loop over multiple files.
>Pas_pyrG_2
AAAGTCACAATGGTTAAAATGGATCCTTATATTAATGTCGATCCAGGGACAATGAGCCCA
TTCCAGCATGGTGAAGTTTTTGTTACCGAAGATGGTGCAGAAACAGATCTGGATCTGGGT
>Pas_rpoB_4
CAAACTCACTATGGTCGTGTTTGTCCAATTGAAACTCCTGAAGGTCCAAACATTGGTTTG
ATCAACTCGCTTTCTGTATACGCAAAAGCGAATGACTTCGGTTTCTTGGAAACTCCATAC
CGCAAAGTTGTAGATGGTCGTGTAACTGATGATGTTGAATATTTATCTGCAATTGAAGAA
>Pas_cpn60_2
ATGAACCCAATGGATTTAAAACGCGGTATCGACATTGCAGTAAAAACTGTAGTTGAAAAT
ATCCGTTCTATTGCTAAACCAGCTGATGATTTCAAAGCAATTGAACAAGTAGGTTCAATC
TCTGCTAACTCTGATACTACTGTTGGTAAACTTATTGCTCAAGCAATGGAAAAAGTAGGT
AAAGAAGGCGTAATCACTGTAGAAGAAGGCTCAGGCTTCGAAGACGCATTAGACGTTGTA
Here is the expected output file:
>Pas_pyrG_2:
AAAGTCACAATGGTTAAAATGGATCCTTATATTAATGTCGATCCAGGGACAATGAGCCCA
TTCCAGCATGGTGAAGTTTTTGTTACCGAAGATGGTGCAGAAACAGATCTGGATCTGGGT
>Pas_rpoB_4:
CAAACTCACTATGGTCGTGTTTGTCCAATTGAAACTCCTGAAGGTCCAAACATTGGTTTG
ATCAACTCGCTTTCTGTATACGCAAAAGCGAATGACTTCGGTTTCTTGGAAACTCCATAC
CGCAAAGTTGTAGATGGTCGTGTAACTGATGATGTTGAATATTTATCTGCAATTGAAGAA
>Pas_cpn60_2:
ATGAACCCAATGGATTTAAAACGCGGTATCGACATTGCAGTAAAAACTGTAGTTGAAAAT
ATCCGTTCTATTGCTAAACCAGCTGATGATTTCAAAGCAATTGAACAAGTAGGTTCAATC
TCTGCTAACTCTGATACTACTGTTGGTAAACTTATTGCTCAAGCAATGGAAAAAGTAGGT
AAAGAAGGCGTAATCACTGTAGAAGAAGGCTCAGGCTTCGAAGACGCATTAGACGTTGTA
Does sed have more options to do this, or can other commands solve this problem?
sed -i '/^>/ s/$/:/' input.txt
Search the input for lines that match ^> (the regex for "starts with the > character"). For those that do, substitute : for the end of line (you got this part right).
/ is the standard separator character in sed's s command, though any character can be used. Since /, unlike |, has no special meaning to the shell, it's best to stick with / unless the pattern itself contains slashes, in which case things get unwieldy.
Be careful with sed -i. Make a backup - make sure you know what's changing by using diff to compare the files.
On OSX -i requires an argument.
Using ed to edit the file:
printf "%s\n" 'g/^>/s/$/:/' w | ed -s input.txt
For every line starting with >, add a colon to the end, and then write the changed file back to disk.

linux command for finding a substring and moving it to the end of line

I need to read a file line by line in Linux, find a substring in each line, remove it and place it at the end of that line.
Example:
Line in the original file:
a,b,c,substring,d,e,f
Line in the output file:
a,b,c,d,e,f,substring
How do I do it with the Linux command? Thanks!
sed '/substring/{ s///; s/$/substring/;} '
will handle a fixed substring. Note that if substring begins with a ,, this handles your example case well. If the substring is not fixed but may be a general regular expression:
sed 's/\(substring\)\(.*\)/\2\1/'
If you are looking for general csv parsing, you should rephrase the question. (It will be difficult to apply this solution to find a fixed string at the start of a line if you are thinking of the input as comma separated fields.)
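If you do want to treat the line as comma-separated fields, here is a sketch in awk (assuming the field to move is exactly the literal text substring):

```shell
# Collect every field except the literal "substring", then append it
# at the end of the line if it was present anywhere.
awk -F, -v OFS=, '{
  out = ""; found = 0
  for (i = 1; i <= NF; i++)
    if ($i == "substring") found = 1
    else out = (out == "" ? $i : out OFS $i)
  if (found) out = (out == "" ? "substring" : out OFS "substring")
  print out
}' input.txt
```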
I always prefer perl's command line for such regex tasks; perl is powerful enough to cover awk and sed in most of my usage, it is available on both Windows and Linux, and it is easy and handy for me. A solution in perl would be:
perl -ne 's/^(.*?)(?:(?<comma>,)(?<substr>substring)|(?<substr>substring)(?<comma>,))(?<right>.*)$/$1$+{right}$+{comma}$+{substr}/; print' input.txt > output.txt
or a simpler one:
perl -lpe 'if(s/(,substring|substring,)//){ s/$/,substring/ }' input.txt > output.txt
input.txt
substring,a,b,c,d,e,f
a,b,c,substring,d,e,f
a,b,c,d,e,f,substring
substring,a
a,substring
substring
a
output.txt
a,b,c,d,e,f,substring
a,b,c,d,e,f,substring
a,b,c,d,e,f,substring
a,substring
a,substring
substring
a
You can adjust the regex for your actual input, e.g.:
if there are spaces between words and commas
if you are using a tab as the separator
Some explanation of the command line:
use perl's -n -e options: -n means process the input line by line in a loop; -e means one line program in the command line
use perl's -l -p options: -l strips the newline from input lines and adds one back on output; -p processes line by line and always prints
The one line program is just a regex replacement and a print
(?:pattern) means group but don't capture the match
(?<comma>...) is a named group; you then access its match through the %+ hash, as $+{comma}

grep giving error

I am trying to extract numbers from a file, so I created a script, but grep is giving an error: grep: line too long. Can anyone tell me where I am wrong? The command is:
echo $(cat filename|grep '\<[0-9]*\>')
Thanks in advance
grep is line-oriented; it will print matching lines to output. Probably you have a huge line in your file, and the resulting line cannot be converted into a string value by shell, as $(...) requires.
First of all, try just cat filename | grep '\<[0-9]*\>' > results and see what is in the results file. Maybe it's enough.
But if you have multiple numbers in a line and you want to extract them all, use -o: grep -o '\<[0-9]*\>'. This will print only matching parts, every match on a new line, even if original matches are on the same line. If you need line numbers, too, add -n.
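For instance, a small demo of -o (the sample file content is made up):

```shell
# -o prints only the matching parts, one match per line; \< \> are GNU
# word boundaries, so only whole numbers match.
printf 'abc 12 def 345\n678\n' > sample.txt
grep -o '\<[0-9]*\>' sample.txt
# prints 12, 345 and 678, one per line
```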
