remove a line with special character with pattern - linux

I'm trying to remove lines containing special characters that are not prefixed with \.
Below are the special characters:
^$%.*+?!(){}[]|\
I need to check the 2nd column for any of the above special characters that is not prefixed with \.
I'm trying to do this with awk, but no luck. I want the output as below.
input.txt
1,ap^ple
2,o$range
3,bu+tter
4,gr(ape
5,sm\(oke
6,ra\in
7,pla\\y
8,wor\+k
output.txt
1,ap^ple
2,o$range
3,bu+tter
4,gr(ape
6,ra\in

I believe you are simply looking for:
awk '$2 !~ /\\[][|\\{}()!?+*.%$^]/' FS=,
This gives the desired output on the given input file, but does not at all match the description given in the question.
EDIT
Given the discussion in the comment section, it appears that the desired solution should output all lines that contain a special character, unless that character is preceded by a backslash. Given that description, we must remove backslash from the list of special characters. A (non-working, given for the purpose of description) solution is:
awk '$2 ~ /[^\\][][|{}()!?+*.%$^]/' FS=,
This simply matches any two character string in which the first is not a backslash and the 2nd is one of the characters ][|{}()!?+*.%$^. This fails because it does not catch the case in which a special character occurs as the first element of the string. For that, we extend the regex so that the first character can be either the beginning of the string or anything that is not a backslash.
awk '$2 ~ /(^|[^\\])[][|{}()!?+*.%$^]/' FS=,
The reason we need to re-order the special characters is that ] has a special meaning inside brackets (namely, it closes the brackets!) and it must be listed first to avoid that meaning. Similarly, ^ must not be first because it has a special meaning when it is the first member of a character class (it negates the class). (The other characters don't matter; they just got reordered as a typographical accident.)
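To see why the (^|...) alternative matters, here is a minimal sketch that runs both regexes over three rows. Rows 1 and 5 come from the question; row 9 is a hypothetical row invented for this illustration, chosen because its second field starts with a special character.

```shell
# Rows 1 and 5 are from the question; row 9 is made up for this sketch.
rows='1,ap^ple
5,sm\(oke
9,+plus'

# Without the ^ alternative: a non-backslash character must precede the special one.
without_anchor=$(printf '%s\n' "$rows" | awk -F, '$2 ~ /[^\\][][|{}()!?+*.%$^]/')

# With the ^ alternative: a special character at the start of the field also counts.
with_anchor=$(printf '%s\n' "$rows" | awk -F, '$2 ~ /(^|[^\\])[][|{}()!?+*.%$^]/')

printf 'without anchor:\n%s\n' "$without_anchor"
printf 'with anchor:\n%s\n' "$with_anchor"
```

The first command should print only row 1, while the second should print rows 1 and 9. Row 5 is skipped by both because its only special character is escaped.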

One part of the trick is to put the special characters into a character class safely, remembering that ], ^ and - (not present in your list) have special rules associated with them in character classes. Specifically, the ^ as first character negates the character class (so place it somewhere other than first), and the ] character terminates the character class unless it is either first or second after a ^.
Hence, you want:
awk '/\\[]^$%.*+?!(){}[\\|]/ { next } { print }' input.txt
The complex (ghastly) regex matches a backslash followed by one of the special characters; the action is next to skip that line. The { print } (which could also be written 1 or any other true value) prints those lines which are not eliminated by the regex.
Example output
1,ap^ple
2,o$range
3,bu+tter
4,gr(ape
6,ra\in
You can refine the processing to ignore the first field and so on as in William Pursell's answer, which does the reordering of the characters in the list substantially the same way I did, but without explaining why.
awk -F, '$2 !~ /\\[]^$%.*+?!(){}[\\|]/ { print }' input.txt
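For anyone who wants to reproduce this, the following sketch recreates the question's input.txt in a temporary directory and runs the field-restricted command on it. The temporary directory and the variable names are assumptions made for the example.

```shell
# Recreate the question's input.txt in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' '1,ap^ple' '2,o$range' '3,bu+tter' '4,gr(ape' \
              '5,sm\(oke' '6,ra\in' '7,pla\\y' '8,wor\+k' > "$tmpdir/input.txt"

# Keep rows whose second field has no backslash followed by a special character.
kept=$(awk -F, '$2 !~ /\\[]^$%.*+?!(){}[\\|]/ { print }' "$tmpdir/input.txt")
printf '%s\n' "$kept"

rm -r "$tmpdir"
```

Rows 5, 7 and 8 are dropped because each contains a backslash followed by a listed character. Row 6 survives because \i is not such a pair.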

Related

Replace line in text containing special characters (mathematical equation) linux text

I want to replace a line, that represents a part of mathematical equation:
f(x,z,time,temp)=-(2.0)/(exp(128*((x-2.5*time)*(x-2.5*time)+(z-0.2)*(z-0.2))))+(
with a new one similar to the above. Both new and old lines are saved in bash variables.
Main problem is that mathematical equation is full with special characters that do not allow proper search and replace in bash mode, even when I used as delimiter special character that is not used in equation.
I used
sed -n "s|$OLD|$NEW|g" restart.k
and
sed -i "s|$OLD|$NEW|g" restart.k
but all times I get wrong results.
Any idea to solve this?
The only character in your pattern here that causes trouble for sed is * (the dots are special too, but . also matches a literal dot), so escape it and do the replacement as usual:
sed "s:$(sed 's:[*]:\\&:g' <<<"$old"):$new:" infile
if there are more special characters in your real sample, then you will need to add them inside the brackets []; there are some exceptions:
^ character: it can be placed anywhere in [] except first, because a leading ^ negates the bracket expression.
] character: it should be the first character, because this character is also used to end the bracket expression.
- character: it should be the first or last character, because this character can also be used to define a range of characters.
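A more general sketch of the same idea escapes both variables before building the final command. It assumes / is the delimiter of the final s command, that the pattern is a basic regular expression, and that neither variable contains a newline. The shortened equation and the names esc_old and esc_new are made up for the example.

```shell
# Shortened, hypothetical stand-ins for the two equation fragments.
old='-(2.0)/(exp(128*(x-2.5*time)))+('
new='-(3.0)/(exp(64*(x-1.5*time)))+('

# Escape what is special on the pattern side: ] [ \ / . * ^ $
# (] is listed first and ^ is not first, as described above).
esc_old=$(printf '%s\n' "$old" | sed 's:[][\\/.*^$]:\\&:g')

# Escape what is special on the replacement side: \ / &
esc_new=$(printf '%s\n' "$new" | sed 's:[\\/&]:\\&:g')

result=$(printf '%s\n' "f(x)=$old" | sed "s/$esc_old/$esc_new/")
printf '%s\n' "$result"
```

The escaping commands use : as their own delimiter so that / can sit inside the bracket expression without extra escaping.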

What does this sed command line do?

I see this lines in my study.
$temp = 'echo $line | sed s/[a-z AZ 0-9 _]//g'
IF($temp != '')
echo "Line contains illegal characters"
I don't understand. Isn't sed a kind of substitution function? In the code, anything matching [a-z AZ 0-9 _] should be replaced with ''. I don't understand how this determines whether $line has illegal characters.
sed is a stream editor tool that applies regular expressions to transform the input. The command
sed s/regex/replace/g
reads from stdin and every time it finds something matching regex, it replaces it with the contents of replace. In your case, the command
sed s/[a-z A-Z 0-9 _]//g
has [a-z A-Z 0-9 _] as its regular expression and the empty string as its replacement. (Did you forget a dash between the A and the Z?) This means that anything matching the regular expression gets deleted. This regular expression means "any character that's either between a and z, between A and Z, between 0 and 9, a space, or an underscore," so this command essentially deletes any alphanumeric characters, spaces, or underscores from the input and dumps what's left to stdout. Testing whether the output is empty then asks whether there were any characters in there that weren't alphanumeric, spaces, or underscores, which is how the code works.
I'd recommend adding sed to the list of tools you should get a basic familiarity with, since it's a fairly common one to see on the command-line.
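As a rough sketch, the same check can be written in plain sh like this. The function name has_illegal and the two sample lines are invented for the illustration.

```shell
# Succeeds (exit status 0) when the argument contains a character other than
# a letter, digit, underscore, or space. has_illegal is a made-up name.
has_illegal() {
  leftover=$(printf '%s\n' "$1" | sed 's/[a-zA-Z0-9_ ]//g')
  [ -n "$leftover" ]   # anything left over was not in the allowed set
}

for line in 'hello_world 42' 'rm -rf /tmp'; do
  if has_illegal "$line"; then
    echo "Line contains illegal characters: $line"
  else
    echo "Line is fine: $line"
  fi
done
```

For the second sample, sed leaves "-/" behind, so the test for a non-empty result succeeds.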

Write a command that lists the unique values (remember, name/value pairs) that are actually set (i.e. ignore commented lines)

The /etc/ssl/openssl.cnf file contains a number of name, value pairs of the form “name = value”. It also contains a number of other things including headers, in which the line starts with a [, and comments, which have a # as the first character. Write a command that lists the unique values (remember, name/value pairs) that are actually set (i.e. ignore commented lines).
Yes this is a homework question but I do not know what I am doing and was hoping someone could help or give me an example of something that could work so I can have a better understanding?
Check out awk for questions that require you to go line by line, parsing the line, and making decisions on what gets output. For example:
awk -F"[= \t]+" '$0 ~ /^[^#\r\n\t \[]/ {print $1, $2}' /etc/ssl/openssl.cnf
awk moves through files line by line. We first specify a delimiter to split each line into tokens using the -F flag. Here we split lines into multiple tokens using =, a space, or a tab (\t) occurring one or more times in a row (-F"[= \t]+").
Then in the awk script itself we have blocks where we take action on each line encountered. Blocks occur between squiggly brackets { this is a block }. We can refer to each split off token using the dollar sign and a number starting at 1. $1 means the first token encountered. Here our block that is executed on each line is {print $1, $2} which says "Print out the first and second token encountered for this line", which is the stuff before the equal sign and the stuff directly after.
Furthermore, a block can be made conditional by putting a conditional statement right before it that returns true or false. Here we use a regex condition on the special token $0, which means the entire line's content: $0 ~ /^[^#\r\n\t \[]/ That says: "If the line doesn't start with a #, carriage return, line feed, tab, space, or [ character, then execute the block for this line."
You can specify the character that you would like to separate your key/value pairs with by setting awk's built-in OFS variable in a special BEGIN{} block, which is executed once before the file processing begins:
awk -F"[= \t]+" 'BEGIN{OFS="|"} $0 ~ /^[^#\r\n\t \[]/ {print $1, $2}' /etc/ssl/openssl.cnf
With that you'll get a pipe delimiter between your key/value pair from the config file.
One caveat here: a value that itself contains one of the delimiters will be truncated, so a for loop in the awk script may be needed to iterate through the tokens, starting at $2 and continuing to the last token on the line, to collect all the bits of the value.
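One way around that caveat, sketched here under some assumptions, is to split each line at its first = sign rather than tokenizing on whitespace, and then pipe through sort -u for uniqueness. The sample configuration text is invented for the example (it is not the real openssl.cnf), and trailing comments after a value are not handled.

```shell
# Hypothetical sample input standing in for /etc/ssl/openssl.cnf.
conf='# comment line
[ req ]
default_bits = 2048
distinguished_name = req distinguished name
#commented = yes
default_bits = 2048'

pairs=$(printf '%s\n' "$conf" | awk '
  /^[ \t]*(#|\[|$)/ { next }            # skip comments, section headers, blank lines
  {
    eq = index($0, "=")                 # position of the first equals sign
    if (eq == 0) next                   # not a name = value line
    name  = substr($0, 1, eq - 1)
    value = substr($0, eq + 1)          # everything after the first equals sign
    gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding whitespace
    gsub(/^[ \t]+|[ \t]+$/, "", value)
    print name "|" value
  }' | sort -u)

printf '%s\n' "$pairs"
```

Because the value is taken with substr rather than from $2, spaces inside it are preserved, and sort -u collapses the repeated default_bits line.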

extract first instance per line (maybe grep?)

I want to extract the first instance of a string per line in linux. I am currently trying grep but it yields all the instances per line. Below I want the strings (numbers and letters) after "tn="...but only the first set per line. The actual characters could be any combination of numbers or letters. And there is a space after them. There is also a space before the tn=
Given the following file:
hello my name is dog tn=12g3 fun 23k3 hello tn=1d3i9 cheese 234kd dks2 tn=6k4k ksk
1263 chairs are good tn=k38493kd cars run vroom it95958 tn=k22djd fair gold tn=293838 tounge
Desired output:
12g3
k38493
Here's one way you can do it if you have GNU grep, which (mostly) supports Perl Compatible Regular Expressions with -P. Also, the non-standard switch -o is used to only print the part matching the pattern, rather than the whole line:
grep -Po '^.*?tn=\K\S+' file
The pattern matches the start of the line ^, followed by any characters .*?, where the ? makes the match non-greedy. After the first match of tn=, \K "kills" the previous part so you're only left with the bit you're interested in: one or more non-space characters \S+.
As in Ed's answer, you may wish to add a space before tn to avoid accidentally matching something like footn=.... You might also prefer to use something like \w to match "word" characters (equivalent to [[:alnum:]_]).
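Where grep -P is not available, roughly the same result can be had from sed. This is a swapped-in technique rather than the grep approach above: it relies on the leftmost match to cut the line after the first " tn=" value and then strips everything before it. The variable names are made up, and the sample text is the question's two lines.

```shell
sample='hello my name is dog tn=12g3 fun 23k3 hello tn=1d3i9 cheese 234kd dks2 tn=6k4k ksk
1263 chairs are good tn=k38493kd cars run vroom it95958 tn=k22djd fair gold tn=293838 tounge'

# 1st expression: keep the line only up to the end of the first " tn=value".
# 2nd expression: drop everything through " tn=" and print what remains.
first_values=$(printf '%s\n' "$sample" |
  sed -n -e 's/\( tn=[^ ]*\).*/\1/' -e 's/.* tn=//p')

printf '%s\n' "$first_values"
```

Lines without " tn=" print nothing, because the p flag only fires when the second substitution succeeds.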
Just split the input in tn=-separators and pick the second one. Then, split again to get everything up to the first space:
$ awk -F"tn=" '{split($2,a, " "); print a[1]}' file
12g3
k38493kd
$ awk 'match($0,/ tn=[[:alnum:]]+/) {print substr($0,RSTART+4,RLENGTH-4)}' file
12g3
k38493kd

A Linux Shell Script Problem

I have a string separated by dot in Linux Shell,
$example=This.is.My.String
I want to
1.Add some string before the last dot, for example, I want to add "Good.Long" before the last dot, so I get:
This.is.My.Goood.Long.String
2.Get the part after the last dot, so I will get
String
3.Turn the dot into underscore except the last dot, so I will get
This_is_My.String
If you have time, please explain a little bit, I am still learning Regular Expression.
Thanks a lot!
I don't know what you mean by 'Linux Shell' so I will assume bash. This solution will also work in zsh, etcetera:
example=This.is.My.String
before_last_dot=${example%.*}
after_last_dot=${example##*.}
echo ${before_last_dot}.Goood.Long.${after_last_dot}
This.is.My.Goood.Long.String
echo ${before_last_dot//./_}.${after_last_dot}
This_is_My.String
The interim variables before_last_dot and after_last_dot should explain my usage of the % and ## operators. The //, I also think is self-explanatory but I'd be happy to clarify if you have any questions.
This doesn't use sed (or even regular expressions), but bash's inbuilt parameter substitution. I prefer to stick to just one language per script, with as few forks as possible :-)
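For reference, here is a small sketch contrasting the single and doubled forms of the two operators on the same string. The variable names are invented for the example; these four forms are standard in POSIX shells, not only bash.

```shell
example=This.is.My.String

shortest_suffix=${example%.*}    # remove the shortest match of ".*" from the end
longest_suffix=${example%%.*}    # remove the longest match of ".*" from the end
shortest_prefix=${example#*.}    # remove the shortest match of "*." from the start
longest_prefix=${example##*.}    # remove the longest match of "*." from the start

echo "$shortest_suffix"   # This.is.My
echo "$longest_suffix"    # This
echo "$shortest_prefix"   # is.My.String
echo "$longest_prefix"    # String
```

Note that the patterns here are shell globs, not regular expressions, so . is literal and * means "any run of characters".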
Other users have given good answers for #1 and #2. There are some disadvantages to some of the answers for #3. In one case, you have to run the substitution twice. In another, if your string has other underscores they might get clobbered. This command works in one go and only affects dots:
sed 's/\(.*\)\./\1\n./;h;s/[^\n]*\n//;x;s/\n.*//;s/\./_/g;G;s/\n//'
It splits the line before the last dot by inserting a newline and copies the result into hold space:
s/\(.*\)\./\1\n./;h
removes everything up to and including the newline from the copy in pattern space and swaps hold space and pattern space:
s/[^\n]*\n//;x
removes everything after and including the newline from the copy that's now in pattern space
s/\n.*//
changes all dots into underscores in the copy in pattern space and appends hold space onto the end of pattern space
s/\./_/g;G
removes the newline that the append operation adds
s/\n//
Then the sed script is finished and the pattern space is output.
At the end of each numbered step (some consist of two actual steps):
Step  Pattern Space          Hold Space
1     This.is.My\n.String    This.is.My\n.String
2     This.is.My\n.String    .String
3     This.is.My             .String
4     This_is_My\n.String    .String
5     This_is_My.String      .String
Solution
1. Two versions of this, too:
Complex: sed 's/\(.*\)\([.][^.]*$\)/\1.Goood.Long\2/'
Simple: sed 's/.*\./&Goood.Long./' - thanks Dennis Williamson
2. What do you want?
Complex: sed 's/.*[.]\([^.]*\)$/\1/'
Simpler: sed 's/.*\.//' - thanks, glenn jackman.
3. sed 's/\([^.]*\)[.]\([^.]*[.]\)/\1_\2/g'
With 3, you probably need to run the substitute (in its entirety) at least twice, in general.
Explanation
Remember, in sed, the notation \(...\) is a 'capture' that can be referenced as '\1' or similar in the replacement text.
1. Capture everything up to a string starting with a dot followed by a sequence of non-dots (which you also capture); replace by what came before the last dot, the new material, and the last dot and what came after it.
2. Ignore everything up to the last dot followed by a capture of a sequence of non-dots; replace with the capture only.
3. Find and capture a sequence of non-dots, a dot (not captured), followed by a sequence of non-dots and a dot (also captured); replace the first dot with an underscore. This is done globally, but the second and subsequent matches won't touch anything already matched. Therefore, I think you need ceil(log2(N+1)) passes, where N is the number of dots to be replaced. One pass deals with 1 dot to replace; two passes deal with 2 or 3; three passes deal with 4-7, and so on.
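Instead of re-running the whole command several times, the repetition can be handled inside sed with a label and the t (branch if a substitution was made) command. This is a sketch of that alternative, not part of the answer above. It drops the g flag and loops until the substitution stops matching; the function name dots_to_underscores is made up.

```shell
# Replace every dot except the last with an underscore, looping inside sed.
dots_to_underscores() {
  printf '%s\n' "$1" |
    sed -e ':again' \
        -e 's/\([^.]*\)[.]\([^.]*[.]\)/\1_\2/' \
        -e 't again'
}

dots_to_underscores 'This.is.My.String'
dots_to_underscores 'a.b.c.d.e'
dots_to_underscores 'one.dot'
```

Each pass through the loop converts the leftmost dot that still has another dot after it, so the loop ends exactly when only the last dot remains.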
Here's a version that uses Bash's regex matching (Bash 3.2 or greater).
[[ $example =~ ^(.*)\.(.*)$ ]]
echo ${BASH_REMATCH[1]//./_}.${BASH_REMATCH[2]}
Here's a Bash version that uses IFS (Internal Field Separator).
saveIFS=$IFS
IFS=.
array=($example) # * split the string at each dot
lastword=${array[@]: -1}
unset "array[${#array[@]}-1]" # * remove the last element
IFS=_
echo "${array[*]}.$lastword" # The asterisk as a subscript when inside quotes causes IFS (an underscore in this case) to be inserted between each element of the array
IFS=$saveIFS
* use declare -p array after these steps to see what the array looks like.
1.
$ echo 'This.is.my.string' | sed 's}[^\.][^\.]*$}Good Long.&}'
This.is.my.Good Long.string
before: the run of non-dot characters at the end of the line, i.e. everything after the last dot. after: the new text followed by &, which stands for whatever the pattern matched
2.
$ echo 'This.is.my.string' | sed 's}.*\.}}'
string
sed greedy matches, so it will extend the first closure (.*) as far as possible i.e. to the last dot.
3.
$ echo 'This.is.my.string' | tr . _ | sed 's/_\([^_]*\)$/\.\1/'
This_is_my.string
convert all dots to _, then turn the last _ to a dot.
(caveat: this will turn 'This.is.my.string_foo' to 'This_is_my_string.foo', not 'This_is_my.string_foo')
You don't need regular expressions at all (those complex things hurt my eyes!) if you use Awk and are a little creative.
1. echo $example| awk -v ins="Good.long" -F . '{OFS="."; $NF = ins"."$NF;print}'
What this does:
-v ins="Good.long" tells awk to create a variable called 'ins' with "Good.long" as content,
-F . tells awk to use the dot as a separator for your fields for input,
OFS="." tells awk to use the dot as the separator between fields on output,
NF is the number of fields, so $NF represents the last field,
the $NF=... part replaces the last field, it appends the current last string to what you want to insert (the variable called "ins" declared earlier).
2. echo $example| awk -F . '{print $NF}'
$NF is the last field, so that's all!
3. echo $example| awk -F . '{OFS="_"; $(NF-1) = $(NF-1)"."$NF; NF=NF-1; print}'
Here we have to be creative, as Awk AFAIK doesn't allow deleting fields. Of course, we set the output field separator to underscore.
$(NF-1) = $(NF-1)"."$NF: First, we replace the second last field with the last glued to the second last, with a dot between.
Then, we fool awk to make it think the Number of fields is equal to the number of fields minus one, hence deleting the last field!
Note you can't say $NF="", because then it would display two underscores.
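If changing NF feels too clever, a plain loop gives the same result for the third task. This is a sketch of an alternative, not the method above: it builds the output string by hand and leaves the fields untouched. The variable names are made up.

```shell
example=This.is.My.String

result=$(printf '%s\n' "$example" | awk -F . '{
  out = $1
  for (i = 2; i < NF; i++)
    out = out "_" $i             # join all but the last field with underscores
  if (NF > 1)
    out = out "." $NF            # re-attach the last field with a dot
  print out
}')

printf '%s\n' "$result"
```

A string with no dot at all has NF equal to 1 and is printed unchanged.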

Resources