does sed treat newline and regular character identically?

Given a data file tmp_file,
5 0 0 0
0 5 0 0
0 0 6 0
0 0 0 6
the following two commands render different results. Why is that?
sed 's/\n/ /g' tmp_file
5 0 0 0
0 5 0 0
0 0 6 0
0 0 0 6
sed 's/0/ /g' tmp_file
5
  5
    6
      6
EDIT:
A previous post, How can I replace a newline (\n) using sed?, was suggested as having resolved the issue. The solution is indeed the same, but to a newbie the question looks quite different. Also, one should not expect anybody to be able to find the right answer among millions of posts, even with hours of online research, as was done for the current post. I would rather withdraw the post than have it marked down for being similar to previous posts.

Let us try to understand how sed reads the Input_file: sed reads the input line by line, and the separator between lines is the newline itself. This means that, since lines are split on newlines, a single line CANNOT have a newline in it, unless we tell sed to read the whole Input_file in a loop and then replace the newlines (for example by keeping values in the hold space, an out-of-the-box facility sed provides for saving values in memory for operational purposes). That is why your first command has no effect: sed never finds a line with a newline in it.
Here is a very nice thread (How can I replace a newline (\n) using sed?) I found on SO; you could go through its solution, and its explanation (taken from that thread) follows.
sed ':a;N;$!ba;s/\n/ /g' Input_file
Explanation of above code:
1. Create a label via :a.
2. Append the current and next line to the pattern space via N.
3. If we are before the last line, branch to the created label via $!ba ($! means not to do it on the last line, as there should be one final newline).
4. Finally, the substitution replaces every newline with a space on the pattern space (which is the whole file).
I wanted to add my explanation above and then add that thread's details.
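For reference, this is what the thread's command produces on the question's tmp_file (a hypothetical run):
$ sed ':a;N;$!ba;s/\n/ /g' tmp_file
5 0 0 0 0 5 0 0 0 0 6 0 0 0 0 6
With GNU sed there is also a shorter alternative (an assumption: GNU sed's -z option reads NUL-separated records, so the whole file lands in the pattern space at once; note that the trailing newline gets replaced too):
sed -z 's/\n/ /g' tmp_file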


Replace same occurrence of word with different words using sed

I want to replace 2 occurrences of the same word in a file (app.properties) with 2 different words using the sed command.
Example:
mysql.host=<<CHANGE_ME>>
mysql.username=testuser
mysql.port=3306
mysql.db.password=<<CHANGE_ME>>
The required output will be:
mysql.host=localhost
mysql.username=testuser
mysql.port=3306
mysql.db.password=password123
I tried the below command:
sed -e "s/<<CHANGE_ME>>/localhost/1" -e "s/<<CHANGE_ME>>/password123/2" app.properties > /home/centos/SCRIPT/io.properties_new
However, I am getting localhost in both places.
I'm sure it's not impossible, but I'm also sure you will not be able to figure out how it works once you find an answer. A better solution is to switch to a language which is more human-readable, so that you can understand what it does.
awk 'BEGIN { split("localhost:password123", items, ":") }
/<<CHANGE_ME>>/ { sub(/<<CHANGE_ME>>/, items[++i]) } 1' input_file >output_file
The BEGIN block creates an array items of replacements. The main script then increments i every time we perform a replacement, indexing further into items for the replacement string.
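For example, run against the app.properties from the question, this produces the required output (a hypothetical run):
$ awk 'BEGIN { split("localhost:password123", items, ":") }
/<<CHANGE_ME>>/ { sub(/<<CHANGE_ME>>/, items[++i]) } 1' app.properties
mysql.host=localhost
mysql.username=testuser
mysql.port=3306
mysql.db.password=password123
Adding a third replacement is just a matter of extending the colon-separated list passed to split.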
This may be possible, but I don't know if it is really readable for everyone.
Something like this might suit you:
sed -e '0,/<<CHANGE_ME>>/{s/<<CHANGE_ME>>/localhost/}' -e '1,/<<CHANGE_ME>>/{s/<<CHANGE_ME>>/password123/}' app.properties > /home/centos/SCRIPT/io.properties_new
If you have any idea to improve this, don't hesitate. I would really like to learn the best way to do this too :D
With sed, using a wonderfully confusing if (first time) do x, else do y logic:
sed '/<<CHANGE_ME>>/{bb;:a;s/<<CHANGE_ME>>/password123/;b;:b;x;s/E//;x;ta;s/<<CHANGE_ME>>/localhost/;x;s/^/E/;x}' input_file
Writing each command of the sed script on its own line makes it more understandable, or at least easier for me to explain:
/<<CHANGE_ME>>/{
bb
:a
s/<<CHANGE_ME>>/password123/
b
:b
x
s/E//
x
ta
s/<<CHANGE_ME>>/localhost/
x
s/^/E/
x
}
Here's the explanation:
/<<CHANGE_ME>>/{…} means that the stuff in {…} is only applied to lines matching <<CHANGE_ME>>;
1. bb: "branch to (go to) :b", in this case used to skip the first substitution command;
2. :a: a target for another branch or test-and-branch command;
3. s/…/…/: you know what it does, but we skip this the first time the script is run;
4. b: branches to the end of the script, skipping everything (because we are giving no argument to b);
5. :b: the target of the command bb at step 1;
6. x: swaps the pattern space (the line you're dealing with at the moment) with the hold space (a kind of variable that you can put stuff into via the x, h, and H commands);
7. s/E//: tries to match and delete an E (just because that's the initial of my name), which fails the first time we run this, because the hold space that we've swapped earlier with the pattern space was empty;
8. x: undoes what the previous x did, so we're back to working with the line matching <<CHANGE_ME>>;
9. ta: tests whether the last performed s/…/…/ command succeeded and, if so, goes to :a, otherwise it's a no-op; the first time we run the script this is a no-op, because step 7 failed;
10. s/…/…/: you know what it does;
11. x: see above;
12. s/^/E/: inserts the E at the beginning of the line, so that the next time we run the script the substitution of step 7 succeeds, step 9 successfully branches to :a, step 3 is performed for the first time, and step 4 exits the script forever;
13. x: see above.
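Run against the question's app.properties, the one-liner above produces the required output (a hypothetical run, using the question's <<CHANGE_ME>> placeholder):
$ sed '/<<CHANGE_ME>>/{bb;:a;s/<<CHANGE_ME>>/password123/;b;:b;x;s/E//;x;ta;s/<<CHANGE_ME>>/localhost/;x;s/^/E/;x}' app.properties
mysql.host=localhost
mysql.username=testuser
mysql.port=3306
mysql.db.password=password123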
Perhaps this might help:
sed -e '1s/<<CHANGE_ME>>/localhost/' \
-e '4s/<<CHANGE_ME>>/password123/' \
app.properties > /home/centos/SCRIPT/io.properties_new

Log Splitting based on Recurring Pattern and Number of Lines

I have a system that is generating very large text logs (in excess of 1GB each). The utility into which I am feeding them requires that each file be less than 500MB. I cannot simply use the split command because this runs the risk of splitting a log entry in half, which would cause errors in the utility to which they are being fed.
I have done some research into split, csplit, and awk. So far I have had the most luck with the following:
awk '/REG_EX/{if(NR%X >= (X-Y) || NR%X <= Y)x="split"++i;}{print > x;}' logFile.txt
In the above example, X represents the number of lines I want each split file to contain. In practice, this ends up being about 10 million. Y represents a "plus or minus." So if I want "10 million plus or minus 50", Y allows for that.
The actual regular expression I use is not important, because that part works. The goal is that the file be split every X lines, but only if it is an occurrence of REG_EX. This is where the if() clause comes in. I attempted to have some "wiggle room" of plus or minus Y lines, because there is no guarantee that REG_EX will exist at exactly NR%X. My problem is that if I set Y too small, then I end up with files with two or three times the number of lines I am aiming for. If I set Y too large, then I end up with some files containing anywhere between 1 and X lines (it is possible for REG_EX to occur several times in immediate succession).
Short of writing my own program that traverses the file line by line with a line counter, how can I go about elegantly solving this problem? I have a script that a co-worker created, but it easily takes over an hour to complete. My awk command completes in less than 60 seconds on a 1.5GB file with an X value of 10 million, but it is not a 100% solution.
== EDIT ==
Solution found. Thank you to everyone who took the time to read my question, understand it, and provide a suggested solution. Most of them were very helpful, but the one I marked as the solution provided the greatest assistance. My problem was with my modular math being the cutoff point. I needed a way to keep track of lines and reset the counter each time I split a file. Being new to awk, I wasn't sure how to utilize the BEGIN{ ... } feature. Allow me to summarize the problem set and then list the command that solved the problem.
PROBLEM:
-- System produces text logs > 1.5GB
-- System into which logs are fed requires logs <= 500MB.
-- Every log entry begins with a standardized line
-- using the split command risks a new file beginning WITHOUT the standard line
REQUIREMENTS:
-- split files at Xth line, BUT
-- IFF Xth line is in the standard log entry format
NOTE:
-- log entries vary in length, with some being entirely empty
SOLUTION:
awk 'BEGIN {min_line=10000000; curr_line=1; new_file="split1"; suff=1;} \
/REG_EX/ \
{if(curr_line >= min_line){new_file="split"++suff; curr_line=1;}} \
{++curr_line; print > new_file;}' logFile.txt
The command can be typed on one line; I broke it up here for readability. Ten million lines works out to between 450MB and 500MB. I realized that, given how frequently the standard log entry line occurs, I didn't need to set an upper line limit so long as I picked a lower limit with room to spare. Each time the REG_EX is matched, it checks to see if the current number of lines is greater than my limit, and if it is, starts a new file and resets my counter.
Thanks again to everyone. I hope that anyone else who runs into this or a similar problem finds this useful.
If you want to create split files based on exact n-count of pattern occurrences, you could do this:
awk '/^MYREGEX/ {++i; if(i%3==1){++j}} {print > "splitfilename"j}' logfile.log
Where:
^MYREGEX is your desired pattern.
3 is the count of pattern occurrences you want in each file.
splitfilename is the prefix of the filenames to be created.
logfile.log is your input log file.
i is a counter which is incremented for each occurrence of the pattern.
j is a counter which is incremented for each n-th occurrence of the pattern.
Example:
$ cat test.log
MY
123
ksdjfkdjk
MY
234
23
MY
345
MY
MY
456
MY
MY
xyz
xyz
MY
something
$ awk '/^MY/ {++i; if(i%3==1){++j}} {print > "file"j}' test.log
$ ls
file1 file2 file3 test.log
$ head file*
==> file1 <==
MY
123
ksdjfkdjk
MY
234
23
MY
345
==> file2 <==
MY
MY
456
MY
==> file3 <==
MY
xyz
xyz
MY
something
If splitting based on the regex is not important, one option would be to create new files line by line, keeping track of the number of characters you are adding to an output file. If the number of characters is greater than a certain threshold, you can start outputting to the next file. An example command-line script is:
awk 'BEGIN{sum=0; suff=1; new_file="tmp1"} \
{len=length($0); if ((sum + len) > 500000000) { ++suff; new_file = "tmp"suff; sum = 0 } sum += len; print $0 > new_file}' logfile.txt
In this script, sum keeps track of the number of characters we have parsed from the given log file. If sum is within 500 MB, it keeps outputting to tmp1. Once sum is about to exceed that limit, it will start outputting to tmp2, and so on.
This script will not create files that are greater than the size limit. It will also not break a log entry.
Please note that this script doesn't make use of any pattern matching that you used in your script.
You could potentially split the log file by 10 million lines.
Then, if the 2nd split file does not start with the desired line, go find the last desired line in the 1st split file, delete that line and the subsequent lines from it, then prepend those lines to the 2nd file.
Repeat for each subsequent split file.
This would produce files with a very similar count of your regex matches.
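Here is a rough, untested sketch of that repair pass (assumptions: GNU split with an explicit part_ prefix, and a placeholder YOUR_REGEXP for the standard entry-header pattern):
#!/bin/bash
# split blindly first, then fix up the chunk boundaries
split -l 10000000 logFile.txt part_
prev=""
for f in part_*; do
    if [ -n "$prev" ] && ! head -n 1 "$f" | grep -q 'YOUR_REGEXP'; then
        # find the last entry header in the previous chunk
        n=$(grep -n 'YOUR_REGEXP' "$prev" | tail -n 1 | cut -d: -f1)
        # move that header and everything after it to the front of this chunk
        tail -n +"$n" "$prev" > tmp_move
        head -n $((n - 1)) "$prev" > tmp_keep && mv tmp_keep "$prev"
        cat tmp_move "$f" > tmp_new && mv tmp_new "$f"
        rm -f tmp_move
    fi
    prev="$f"
done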
In order to improve performance and not have to actually write out intermediary split files and edit them, you could use a tool such as pt-fifo-split for "virtually" splitting your original log file.
Replace the fout and slimit values to suit your needs:
#!/bin/bash
# big log filename
f="test.txt"
fout="$(mktemp -p . f_XXXXX)"
fsize=0
slimit=2500
while IFS= read -r line; do
    if [ "$fsize" -gt "$slimit" ]; then
        echo "size of last file $fout: $fsize"
        # create a new log file
        fout="$(mktemp -p . f_XXXXX)"
        # reset size counter
        fsize=0
    fi
    # append to log file and get line size at the same time ;-)
    lsize=$(echo "$line" | tee -a "$fout" | wc -c)
    # add to file size
    fsize=$(( fsize + lsize ))
done < <(grep 'YOUR_REGEXP' "$f")
size of last file ./f_GrWgD: 2537
size of last file ./f_E0n7E: 2547
size of last file ./f_do2AM: 2586
size of last file ./f_lwwhI: 2548
size of last file ./f_4D09V: 2575
size of last file ./f_ZuNBE: 2546

'N' and 'D' not working as expected with sed

sed 'N; D' testfile
testfile contains:
this is the first line
this is the second line
this is the third line
this is the fourth line
I am using RHEL 6 and the output comes as:
this is the fourth line
As per my understanding, N just pulls in the next line into the pattern space and D deletes just the first line of the pattern space. Therefore, the output should have been:
this is the second line
this is the fourth line
Can someone please explain why the output is coming as mentioned above?
According to the documentation:
D
If pattern space contains no newline, start a normal new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the first newline, and restart cycle with the resultant pattern space, *without reading a new line of input*.
(Emphasis mine.)
It sounds like this would restart your sed program from the beginning, reading and deleting lines until it runs out of input, at which point only the last line is left in the buffer.
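You can watch this happen on a smaller input (a hypothetical run):
$ printf 'a\nb\nc\nd\n' | sed 'N; D'
d
N appends the next line to the pattern space, D deletes everything up to the first newline and restarts the cycle without reading new input, so the pattern space is whittled down until only the last line remains.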
As already shown, using D will move back to the beginning of the program. You can however use the following to print even lines:
sed -n 'n;p'
and to print odds:
sed 'n;d'
In GNU sed you can also use:
sed '0~2!d' # Even
sed '1~2!d' # Odd
An alternative can be something like:
N;s/^[^\n]*\n//
which will read the next line into the pattern space and then substitute the first away.
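For example, printing only the even lines of a four-line input (a hypothetical run; GNU sed, which accepts \n inside a bracket expression):
$ printf 'a\nb\nc\nd\n' | sed 'N;s/^[^\n]*\n//'
b
d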
One might ask why this is the behavior. One reason is to make things like the following possible, working with multiple lines in the pattern space:
$!N;/\npattern$/d;P;D
The above will delete lines matching pattern as well as the line before.
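For instance, pattern and the line before it are both removed here (a hypothetical run):
$ printf 'one\ntwo\npattern\nthree\n' | sed '$!N;/\npattern$/d;P;D'
one
three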

delete strings from file bash

I have a file with tons of call logs and I am trying to clean it up using bash. I figured out how to search for a string and delete the entire line it is on but that isn't what I want to accomplish.
I want to search for a string. As an example:
There are tons of MAC addresses in the file and I want to remove them all, e.g. MAC:00-0A-DD-84-01-33.
There is also a call ID at the beginning of each line that looks like 354469805 or 354469894, and I want to remove all of those as well.
I'm just starting in bash, so please excuse my ignorance. I am entering 2 lines of the call log below for clarification. I want to delete the 3544 number, the MAC address, and the word TelePacific.
354469725 06/24/2013 09:34 00:03:26 Chante Squires 105 TelePacific MAC:00-0A-DD-84-01-1D TelePacific 17025290701 1
354469732 06/24/2013 09:59 00:01:16 Chante Squires 105 TelePacific MAC:00-0A-DD-84-01-1D TelePacific 12132238375 1
You could use sed:
sed -i 's/^[0-9]\{9\}\|MAC:[0-9A-Fa-f]\{2\}\([-:][0-9A-Fa-f]\{2\}\)\{5\}//g' input.log
Between the 's/ and //g' is a regular expression that matches the removal criteria in your question. The s in front is the substitute command: "search and replace" the regular expression. The // means replace the regular expression with nothing. The g flag at the end means "replace all matches" if they occur more than once in a line. Finally, the -i switch means "edit the file in-place".
This solution assumes that your call IDs are all 9 digits and that the MAC address has six groups of two hexadecimal digits separated by dashes or colons.
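To preview the effect without editing the file in place, drop the -i switch; on the two sample lines this prints (a hypothetical run; note the leftover spaces where the call ID and the MAC address used to be):
$ sed 's/^[0-9]\{9\}\|MAC:[0-9A-Fa-f]\{2\}\([-:][0-9A-Fa-f]\{2\}\)\{5\}//g' input.log
 06/24/2013 09:34 00:03:26 Chante Squires 105 TelePacific  TelePacific 17025290701 1
 06/24/2013 09:59 00:01:16 Chante Squires 105 TelePacific  TelePacific 12132238375 1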
One way with awk (you will lose the extra tabs and spacing; every field will be separated by a single space):
awk '{for(i=2;i<NF;i++) if(8>i || i>10) printf "%s ", $i; print $NF}' log
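On the two sample lines this produces (a hypothetical run; note that it also drops the TelePacific fields, which the sed answer above leaves in place):
$ awk '{for(i=2;i<NF;i++) if(8>i || i>10) printf "%s ", $i; print $NF}' log
06/24/2013 09:34 00:03:26 Chante Squires 105 17025290701 1
06/24/2013 09:59 00:01:16 Chante Squires 105 12132238375 1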

sed regex not being greedy?

In bash I have a string variable tempvar, which is created thus:
tempvar=`grep -n 'Mesh Tally' ${meshtalfile}`
meshtalfile is a (large) input file which contains some header lines and a number of blocks of data lines, each marked by a beginning line which is searched for in the grep above.
In the case at hand, the variable tempvar contains the following string:
5: Mesh Tally Number 4 977236: Mesh Tally Number 14 1954467: Mesh Tally Number 24 4354479: Mesh Tally Number 34
I now wish to extract the line number relating to a particularly mesh tally number - so I define a variable meshnum1 as equal to 24, and run the following sed command:
echo ${tempvar} | sed -r "s/^.*([0-9][0-9]*):\sMesh\sTally\sNumber\s${meshnum1}.*$/\1/"
This is where things go wrong. I expect the output 1954467, but instead I get 7. Trying with number 34 instead returns 9 instead of 4354479. It seems that sed is returning only the last digit of the number - which surely violates the principle of greedy matching? And oddly, when I move the open parenthesis ( left a couple of characters to include .*, it returns the whole line up to and including the single character it was previously returning. Surely it cannot be greedy in one situation and antigreedy in another? Hopefully I have just done something stupid with the syntax...
The problem is that the .* is greedy too, which means that it will grab all the digits as well. Since you only force [0-9][0-9]* to match at least one digit, the greedy .* before it consumes as much as it can and leaves just one digit for the expression after it.
A solution could be:
echo ${tempvar} | sed -r "s/^.*\s([0-9][0-9]*):\sMesh\sTally\sNumber\s${meshnum1}.*$/\1/"
Where now the \s between the .* and the [0-9][0-9]* explicitly forces there to be a space before the digits you want to match.
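With the string from the question and meshnum1=24, the corrected command now prints the full line number (a hypothetical run):
$ echo ${tempvar} | sed -r "s/^.*\s([0-9][0-9]*):\sMesh\sTally\sNumber\s${meshnum1}.*$/\1/"
1954467
The greedy .* now has to stop before a whitespace character, so it can no longer eat into the digits of the line number.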
Hope this helps =)
Are the values in $tempvar supposed to be on multiple lines or a single line? Because if it is a single line, ".*$" should match to the end of the line, meaning all the other values too, right?
There's no need for sed, here's one way using GNU grep:
echo "$tempvar" | grep -oP "[0-9]+(?=:\sMesh\sTally\sNumber\s${meshnum1}\b)"
