Removing special characters in bash

I have a variable "text" which contains the value "IN▒ENJERING"
Hex-values : 49 4E B4 45 4E 4A 45 52 49 4E 47
I want to remove the special character B4.
Now if I remove a regular character (e.g. "I") using the command
text=$(printf "$text" | sed "s/\x49/ /g")
the command works fine
Result : text=N▒ENJER NG
If I want to remove the special character, it seems not to work
text=$(printf "$text" | sed "s/\xB4/ /g")
Result : IN▒ENJERING
Any idea what is wrong ?

This might work for you (GNU sed):
sed 's/\o342\o226\o222/ /g' file
To find the octal representation use:
<<<"IN▒ENJERING" sed -n l
IN\342\226\222ENJERING$
Then prepend \o to each octal value and use the result as the pattern to match.
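If the variable really holds the single raw byte B4 from the hex dump (rather than a multi-byte UTF-8 sequence), a byte-oriented tool may be simpler. In a UTF-8 locale a lone B4 byte is not a valid character, which is a likely reason the sed pattern does not match. A minimal sketch, assuming the value is as shown in the hex dump; the sample string is constructed here only for illustration:

```shell
# Build the sample value from the question: "IN", the raw byte 0xB4, "ENJERING".
text=$(printf 'IN\264ENJERING')

# LC_ALL=C makes tr treat the input as plain bytes; \264 is octal for 0xB4.
clean=$(printf '%s' "$text" | LC_ALL=C tr -d '\264')

printf '%s\n' "$clean"
```

Here the byte is deleted rather than replaced with a space; tr '\264' ' ' would replace it instead.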

Why does grep search '0,^M$' return empty lines?

$ cat -e test.csv | grep 150463452112
65,150463452112,609848340831,2.87,126,138757585104,0,0,^M$
65,150463452112,609848340832,3.37,126,138757585105,1,0,^M$
$ grep 150463452112 test.csv | grep '0,^M$'
$
I enter the '^M' with Ctrl+V Ctrl+M and need to match the lines ending in '0,^M$'. However, grep returns empty lines.
Question> What is the correct syntax to search the ending?
Thank you
,0,0, seen in hexdump is as follows:
2c 30 2c 30 2c 0d 0a
|,0,0,..|

The underlying problem is that your file doesn't actually contain any two-character ^M sequence (and even if it did, ^ is special to regex and doesn't match itself). Rather, it contains a carriage return before its final linefeed (being a DOS-style rather than UNIX-style text file). What you want to match is not a ^M sequence but a literal carriage return.
One way to do this is to pass grep a shell literal using bash and ksh $'' C-style string literal syntax:
grep $'0,\r$'
...which you can test as follows:
## test function: generate two lines with CRLFs, one "hello world", the other "foo,0,"
$ generate_sample_data() { printf '%s\r\n' 'hello world' 'foo,0,'; }
## demonstrate that we have 0d 0a line endings on the output from this function
$ generate_sample_data | hexdump -C
00000000 68 65 6c 6c 6f 20 77 6f 72 6c 64 0d 0a 66 6f 6f |hello world..foo|
00000010 2c 30 2c 0d 0a |,0,..|
00000015
## demonstrate that the given grep matches only the line ending in "0," before the CRLF
$ generate_sample_data | grep $'0,\r$'
foo,0,
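In a shell without $'' quoting, the same literal carriage return can be obtained from printf. This is a sketch of that variant, reusing the test function above; the variable names cr and matches are made up for the example:

```shell
# Sample data with CRLF line endings, as in the test function above.
generate_sample_data() { printf '%s\r\n' 'hello world' 'foo,0,'; }

# Capture a literal carriage return without relying on $'' quoting.
cr=$(printf '\r')

# Count the lines ending in "0," followed by CR.
# The \$ passes a literal $ (the end-of-line anchor) through the double quotes.
matches=$(generate_sample_data | grep -c "0,${cr}\$")

printf '%s\n' "$matches"
```

Counting with -c avoids printing the carriage return itself, which can make terminal output look empty or overwritten.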

As pointed out here, escape sequences for color highlighting might be interfering with the ^M character.
You probably have grep aliased to grep --color=auto or something similar. Use \grep or grep --color=never.
$ grep 150463452112 test.csv | \grep '0,^M$'
65,150463452112,609848340831,2.87,126,138757585104,0,0,
65,150463452112,609848340832,3.37,126,138757585105,1,0,
With ^M entered with Ctrl+V Ctrl+M.

How to delimit file with "\t\n" on a Mac

I have a document whose lines are separated by "\t\n". Records are separated either by "\t", OR by "\n".
Normally, this should be a straightforward awk query:
BEGIN {
RS='\t\n';
}
{
print;
print "Next entry:";
}
However, on a Mac, regular expressions do not seem to be supported (maybe I'm not doing something right?) So I tried, RS="\t\n"; however, this is interpreted as RS='\t | \n'. Similar problems running awk from the command line:
awk 1 RS='\t\n' ORS='abc' input > output
replaces the \t's, but leaves the \n's be.
Next try: using tr. This obviously fails for sequence of more than one character-- since \t and \n are both used individually in the rows.
Next:
sed -e '/\t\n/s//NextEntry:/g' input > output
However, doesn't work. Entering any ASCII character sequence instead of \t\n works.
Read the manual. It says that \t is not supported in sed strings. Fair enough
sed -e '/\x9\xa/s//abc/' input > output
Still doesn't work. Idea: use tr to replace \t and \n by characters unused in the input file, use sed to change them to what I want, and then tr to change the remaining characters back to what they should be.
tr: Illegal byte sequence
Turns out, that f6 character makes tr just totally fail.
Went through the suggestions in Sed not recognizing \t instead it is treating it as 't' why? . That might work for replacing output strings (except the "Pasting tab into command prompt via CTRL+V" suggestion-- the shell just rejected that paste.), but did not seem to help in my case.
Maybe it's because it's a Mac? Maybe it's because that's the text I'm looking for, not replacing with? Maybe it's the combination with \n?
Any other suggestions?
UPDATE:
I found thread How can I replace a newline (\n) using sed? . Apparently, I am unable even to replace a \n by the string "abc" using the suggestions in that thread.
EDIT: Hex head of source file:
5a 20 4e 4f 09 0a 41 53 20 4f 46 20 30 31 2d 30
34 2d 30 35 20 45 4d 50 4c 4f 59 45 45 0a 47 52
4f 55 50 09 48 49 52 45 20 44 41 54 45 09 53 41
4c 41 52 59 09 4a 4f 42 20 54 49 54 4c 45 09 0a
4a 4f 42 20 4c 45 56 45 4c 0a 53 45 52 49 45 53
09 41 50 50 54 20 54 59 50 45 09 0a 50 41 59 20
53 54 41 54 55 53 0a f6

Unfortunately, BSD awk, as also used on macOS, doesn't support multi-character record separators (RS) at all (in line with POSIX) - only a single, literal character is supported.
BSD sed, as also used on macOS, supports only \n in regexes - any other escapes, including hex ones (e.g., \x09) are not supported.
See this answer of mine for a comprehensive comparison of GNU and BSD sed.
Assuming that your sed command works in principle, you can use an ANSI C-quoted string ($'\t') to splice a literal tab character into your sed script (this assumes bash (the macOS default shell), ksh, or zsh):
sed -e ':a' -e '$!{N;ba' -e '}' -e '/'$'\t''\n/s//NextEntry:/g'
Note that, in order to replace newlines, you must instruct sed to read the entire file into memory first, which is what -e ':a' -e '$!{N;ba' -e '}' does (the BSD Sed-compatible form of the common GNU sed idiom :a;$!{N;ba}).
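The read-everything-then-substitute idiom can be tried on a few lines of sample data. The sketch below is only an illustration: the input loosely follows the hex dump in the question, and the literal tab is obtained with printf instead of $'\t' so that it also runs in shells without ANSI C quoting.

```shell
# Literal tab character, obtained portably.
tab=$(printf '\t')

# Sample input: records end in TAB+newline, and one record contains a bare newline.
result=$(printf 'Z NO\t\nAS OF\nGROUP\t\nJOB LEVEL\n' |
  sed -e ':a' -e '$!{N;ba' -e '}' -e "s/${tab}\\n/NextEntry:/g")

printf '%s\n' "$result"
```

Only the TAB+newline pairs are replaced; the newline after "AS OF" is left alone.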

Confusion about the text file encoding and how to transform between different encoding method?

I have a text file f.txt which is encoded as UTF-8, as shown below:
chengs-MBP:test cheng$ cat f.txt
Wіnd
like
chengs-MBP:test cheng$ file -I f.txt
f.txt: text/plain; charset=utf-8
However, the two words in this file, Wind and like, are different: like can be found by the grep command while Wind cannot, which confuses me:
chengs-MBP:test cheng$ cat f.txt | grep like
like
chengs-MBP:test cheng$ cat f.txt | grep Wind
chengs-MBP:test cheng$
And I want to transform this file to us-ascii by iconv command, but I failed:
chengs-MBP:test cheng$ iconv -f UTF-8 -t US-ASCII f.txt > new.txt
iconv: f.txt:1:0: cannot convert
My goal is to transform this file into a format in which all of the words can be found by grep or sed... that's all.

UTF-8 is an encoding for the Unicode character set. Some Unicode characters are in look-alike subsets, sometimes collectively called "confusables". So,
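cat f.txt | grep "Wind"
will look for LATIN SMALL LETTER I, while
cat f.txt | grep "Wіnd"
will look for CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I. In bash, ksh, or zsh you could also spell out its UTF-8 bytes with ANSI C quoting, so that grep receives the literal bytes:
cat f.txt | grep $'W\xD1\x96nd'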
Since і is not a member of the ASCII character set, it can't be encoded in ASCII.
If you want to take this further, I suggest that you don't abandon UTF-8 as your text file encoding but you might want to transliterate confusable letters to Basic Latin or use a search library that deals with transliteration as a feature. grep just gives you exactly what you ask for.

In my examples, the 'i' in the word 'Wind' is CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I, encoded in UTF-8 as the bytes d1 96.
To find the hex representation of a character, you can use xxd or hexdump:
$ xxd -g 1 f.txt
00000000: 57 d1 96 6e 64 0a 6c 69 6b 65 0a W..nd.like.
$ hexdump -C f.txt
00000000 57 d1 96 6e 64 0a 6c 69 6b 65 0a |W..nd.like.|
As you can see, in the ASCII column on the right, the bytes of non-ASCII characters are shown as dots.
You can also use the Unicode Utilities.
$ uniname f.txt
character  byte  UTF-32  encoded as  glyph  name
0          0     000057  57          W      LATIN CAPITAL LETTER W
1          1     000456  D1 96       і      CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
2          3     00006E  6E          n      LATIN SMALL LETTER N
3          4     000064  64          d      LATIN SMALL LETTER D
4          5     00000A  0A                 LINE FEED (LF)
5          6     00006C  6C          l      LATIN SMALL LETTER L
6          7     000069  69          i      LATIN SMALL LETTER I
7          8     00006B  6B          k      LATIN SMALL LETTER K
8          9     000065  65          e      LATIN SMALL LETTER E
9          10    00000A  0A                 LINE FEED (LF)
Once you have found out which characters in your file are not ASCII, you can replace them in the file with their ASCII equivalents.
$ sed -i 's/\xd1\x96/i/g' f.txt
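The command above relies on \x escapes and on -i without an argument, both of which behave differently outside GNU sed. A sketch of a variant that hands sed the raw bytes and writes to a new file instead; the temporary directory and the name new.txt are assumptions made for the example:

```shell
workdir=$(mktemp -d)

# Recreate the sample file: "W", the UTF-8 bytes d1 96 (octal 321 226), "nd", then "like".
printf 'W\321\226nd\nlike\n' > "$workdir/f.txt"

# Build the two-byte sequence with printf so that sed receives literal bytes.
cyr_i=$(printf '\321\226')

# LC_ALL=C makes sed operate on bytes; write to a new file rather than editing in place.
LC_ALL=C sed "s/${cyr_i}/i/g" "$workdir/f.txt" > "$workdir/new.txt"

# "Wind" with a Latin i is now found once.
found=$(grep -c 'Wind' "$workdir/new.txt")
printf '%s\n' "$found"

rm -r "$workdir"
```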

How can I replace byte sequences in my data using Sed?

I have this rule in my Makefile, to replace ||| (three pipe characters; hex 7c 7c 7c) with CRLFNUL (carriage return + line feed + null; hex 0d 0a 00):
rom.hex: rom.txt
	hexdump -C rom.txt | cut -c10-60 > rom.hex
	sed -i -e 's/  / /g' rom.hex
	sed -i -e 's/7c 7c 7c/0d 0a 00/g' rom.hex
This works some of the time - but, if the output of hexdump splits a 7c 7c 7c sequence across two lines it isn't matched by sed.
The replacement has to be the same length as the match, so as not to shift the subsequent bytes.

You could make the replacement first, before transforming into hex:
rom.hex: rom.txt
	sed -e 's/|||/\r\n\x00/g' $< | hexdump -v | cut -c'10-60' >$@
Note that the backslash escapes are a GNU sed extension, so this is not a completely portable solution. If you need a portable sed command, you'll need to put it in a separate file, because you can't include a NUL in a command-line argument. The literal newline must be quoted, too:
s/|||/^M\
^@/g
For clarity, the control characters above are
73 2f 7c 7c 7c 2f 0d 5c 0a 00 2f 67 |s/|||/.\../g|
Then the rule would be
rom.hex: rom.txt
	sed -f "transform.sed" $< | hexdump -v | cut -c'10-60' >$@

- Toby Speight's helpful answer elegantly bypasses the OP's problem by using GNU sed to replace data at the source, without needing to operate on a hex. representation (his portable alternative doesn't work with BSD sed, but that's only because of the NUL character in the replacement string).
- The value of this answer is in solving the OP's problem exactly as stated, notably using tr -s '\n' ' ', and in providing a relatively simple portable solution at the bottom - it is of interest from a byte-representation / text-processing perspective.
- See my other answer for a simpler solution that uses hexdump's formatting options to produce the desired output format directly.
Note:
The solutions below transform the byte-value representation of the input into a single line, so as to enable robust use of sed to replace values.
If you do want the fixed-width multi-line output that hexdump produces by default, pipe the output to ... | fmt -w48
The following command normalizes all whitespace in the output from hexdump -C:
hexdump -vC rom.txt | cut -c10-60 | tr -s '\n' ' ' > rom.hex
Note the addition of -v, which prevents loss of information.
Without -v, duplicates in adjacent repeating lines would be represented as *.
The result is:
- a single line bookended by a leading and a trailing space (if you want to strip these, see the portable solution at the bottom),
- with byte values all separated by a single space each; e.g.:
23 21 2f 62 69 6e 2f 62 61 73 68 0a 0a 23 20 23 20 76 3d 24 5f 0a 23 20 23 20 65 63 68 6f 20 22 ....
Note that tr's -s ("squeeze") option, after having performed the translation (\n to a space in this case), folds runs of multiple occurrences of the target character (a space in this case) into a single occurrence.
Thus:
The intermediate sed command (sed -i -e 's/  /...) to normalize the line-internal spaces is no longer needed.
The final sed command (sed -i -e 's/7c 7c 7c/ ...) can safely use space-separated values as the search string, without worrying about where the line breaks happened to be in hexdump -C's output.
There is room for simplification:
A single pipeline can be used - no need to write to the file in an intermediate form and update it in place later.
As a side effect, because -i is no longer needed, the sed command becomes portable (POSIX-compliant); while this form will work on both Linux and BSD/OSX platforms, it is still not strictly POSIX-compliant as a whole, because hexdump is a nonstandard utility; see the bottom for a strictly POSIX-compliant solution.
The special make variables $< (the first prerequisite, rom.txt) and $@ (the target, rom.hex) can be used.
There is no need for the -C option of hexdump, if only the byte values are needed; this allows simplification of the cut command, which, incidentally, strips the leading space from the output (and also makes tr's -s option unnecessary):
rom.hex: rom.txt
	hexdump -v $< | cut -sd' ' -f2- | tr '\n' ' ' | sed 's/7c 7c 7c/0d 0a 00/g' > $@
cut -sd' ' -f2-:
-s means that lines not containing the delimiter (separator) specified with -d are skipped, which skips a trailing empty line (empty except for the byte-offset column) that hexdump may output.
-d' ' splits the input into fields using a single space as the delimiter.
-f2- outputs the 2nd field through the end of the line (-), effectively stripping the 1st field (the input-address offset column in hexdump's output).
To make the command fully portable, POSIX utility od can be used in lieu of the nonstandard hexdump utility.
Furthermore, an extra sed command is used to strip the leading and trailing space from the output.
rom.hex: rom.txt
	od -t x1 -A n -v $< | tr -s '\n' ' ' | sed 's/^ //; s/ $$//' | sed 's/7c 7c 7c/0d 0a 00/g' > $@
od -t x1 -A n -v outputs hex. (x) bytes (1) across multiple lines of fixed width, similar to hexdump, except that -A n blanks out the input-address offset column; -v ensures that all bytes are represented; without it, adjacent duplicate lines would be represented as *.
tr -s '\n' ' ', as above, normalizes the whitespace to produce a single, long line with byte values separated by a single space, bookended by a single leading and trailing space.
sed 's/^ //; s/ $//' removes the leading and trailing space; inside the Makefile recipe the $ is doubled ($$) so that make passes a literal $ to the shell.
The rest of the command is as before.
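The od-based pipeline can be checked at a shell prompt, outside of make. The sketch below uses made-up sample data in which the ||| sequence straddles od's 16-byte line boundary, which is the case that broke the original line-by-line approach:

```shell
# 15 filler bytes followed by "|||", so the sequence is split across two od output lines.
hex=$(printf '%s' '0123456789abcde|||X' |
  od -t x1 -A n -v |
  tr -s '\n' ' ' |
  sed 's/^ //; s/ $//' |
  sed 's/7c 7c 7c/0d 0a 00/g')

printf '%s\n' "$hex"
```

Because every byte value is followed by exactly one space after the tr step, the search string 7c 7c 7c can only match on byte boundaries.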

- See my other answer for how to solve the problem as stated or if you need a POSIX-compliant solution.
- This answer is of interest from a byte-representation formatting perspective.
Note:
The solutions below transform the byte-value representation of the input into a single line, so as to enable robust use of sed to replace values.
If you do want the fixed-width multi-line output that hexdump produces by default, pipe the output to ... | fmt -w48
The problem can be bypassed by passing formatting options to hexdump:
hexdump -ve '1/1 "%02x "'
produces the desired output format as a single line directly (there will be a single trailing space).
-v prevents abbreviation of repeating bytes as *
-e '1/1 "%02x "':
1/1 specifies that the following format string be applied to 1 unit of byte size 1, i.e., each byte.
"%02x " is the format string to apply to each byte: a 2-digit hex number followed by a space.
To put it all together, using the special make variables $< (the first prerequisite, rom.txt) and $@ (the target, rom.hex):
rom.hex: rom.txt
	hexdump -ve '1/1 "%02x "' $< | sed 's/7c 7c 7c/0d 0a 00/g' > $@
Alternative solution, using the (also nonstandard) xxd utility; like hexdump, however, it is available on both Linux and BSD/OSX:
rom.hex: rom.txt
	xxd -p $< | tr -d '\n' | sed 's/../& /g; s/ $$//' | sed 's/7c 7c 7c/0d 0a 00/g' > $@
xxd -p prints a stream of byte values without separators, broken into lines of fixed length.
tr -d '\n' removes the newlines from the output.
sed 's/../& /g; s/ $//' inserts a space after every 2 characters, then deletes the trailing space at the end of the line; inside the Makefile recipe the $ is doubled ($$) so that make passes a literal $ to the shell.
Finally, as Toby Speight points out in a [since cleaned-up] comment, you can use the GNU version of od with the nonstandard -w option:
rom.hex: rom.txt
	od -t x1 -A n -w1 -v $< | tr -d '\n' | sed 's/7c 7c 7c/0d 0a 00/g' > $@
od -t x1 -A n -w1 -v outputs hex. (x) bytes (1) 1 byte at a time (-w1); -A n omits the input-address offset column; -v ensures that all bytes are represented; without it, adjacent duplicate lines would be represented as *.
tr -d '\n' simply removes all newlines, and since each line starts with a space, the result is a single long line with a leading space.

Remove lines from file using Hex locations

I have a big binary file from which I want to remove some content. I don't have line numbers, only hex addresses, so how can I remove the region between:
0x13e70a00 and 0x1eaec03ff
With sed (both inclusive)
Will something like this work?
sed -n 's/\x13e70a00/,s/\x1eaec03ff/ p' orig-data-file > new-file

From what you wrote, it looks like you are trying to delete all the bytes between the two hex patterns.
The following deletes all the bytes between the patterns, including the patterns themselves:
sed 's/\x13\xe7\x0a\x00.*\x1e\xae\xc0\x3f//g' in >out
This deletes all bytes between the patterns, leaving the patterns intact (there is a way to do this with numbered parts of regexes, but this is a bit clearer to begin with):
sed 's/\x13\xe7\x0a\x00.*\x1e\xae\xc0\x3f/\x13\xe7\x0a\x00\x1e\xae\xc0\x3f/g' in >out
Both commands search (s/) for <pattern1>, followed by any text (.*), followed by <pattern2>, and replace the match either with nothing (//) or with just the two edges (/<pattern1><pattern2>/), throughout the file (g).
If you want to delete (or replace) from byte 300 to byte 310:
sed 's/\(.\{300\}\).\{10\}/\1rep-str/' in>out
This matches the first 300 characters (.\{300\}) and remembers them (the \(\)), then matches the next 10 characters as well. It replaces the whole match with the first 300 characters (\1) followed by your replacement string rep-str; the replacement string can be empty to simply delete the text between bytes 300 and 310.
However, this is quite brittle if there are any newline characters. If you can live without a replacement:
dd if=file bs=1 skip=310|dd of=file bs=1 seek=300 conv=notrunc
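This edits the file in place by copying everything from byte 310 onward back into the file starting at position 300, which overwrites the 10 bytes in question. Because conv=notrunc leaves the file length unchanged, the last 10 bytes remain duplicated at the end and still have to be truncated.
An even more general alternative is: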
dd if=in bs=1 count=300>out
printf "replacement text">>out
dd if=in bs=1 skip=310>>out
Though the simplest thing to do would be to use a hex editor such as Bless.
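The dd approach can be combined with shell arithmetic so that hex offsets are used directly. The following is a sketch under assumptions: the offsets 0x0f and 0x10 and the 32-byte sample file are made up for illustration, and the offsets are taken to be zero-based and inclusive, as in the question.

```shell
workdir=$(mktemp -d)
printf '%s' '0123456789abcdeffedcba9876543210' > "$workdir/in"

# Hypothetical offsets: remove bytes 0x0f through 0x10 (zero-based, inclusive).
start=$((0x0f))
end=$((0x10))

{
  # Everything before the region: the first $start bytes.
  dd if="$workdir/in" bs=1 count="$start" 2>/dev/null
  # Everything after the region: skip up to and including offset $end.
  dd if="$workdir/in" bs=1 skip="$((end + 1))" 2>/dev/null
} > "$workdir/out"

result=$(cat "$workdir/out")
printf '%s\n' "$result"

rm -r "$workdir"
```

With bs=1, dd copies one byte per read, which is simple but slow; for a file as large as the one in the question, a larger block size with adjusted offsets would likely be needed.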

You should be able to use a clever combination of converting bash numbers from hex to decimal, bash math to add 1 to the decimal offsets, and cut --complement -b to remove the correct segment from the file.
EDIT: Like this:
$ snip_out 0x0f 0x10 <<< "0123456789abcdeffedcba9876543210" | od -t x1
0000000 30 31 32 33 34 35 36 37 38 39 61 62 63 64 65 65
0000020 64 63 62 61 39 38 37 36 35 34 33 32 31 30
0000036
Where snip_out is a two-parameter shell-script that operates on stdin and stdout:
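#!/bin/bash
# Convert the (possibly hex) offsets given as $1 and $2 to decimal.
START_RANGE_DEC=$(printf "%d" "$1")
END_RANGE_DEC=$(printf "%d" "$2")
# Most hex ranges begin with 0; cut begins with 1.
CUT_START_DEC=$(( START_RANGE_DEC + 1 ))
CUT_END_DEC=$(( END_RANGE_DEC + 1 ))
# Note: cut works line by line, so the byte range is applied to every line of the input.
# cut ends its output with a newline. Use head to remove it (head -c -1 requires GNU head).
cut --complement -b "$CUT_START_DEC-$CUT_END_DEC" | head -c -1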
