How to search, replace specific hex code in automated way - linux

I have a 100M row file that has some encoding problems -- was "originally" EBCDIC, saved as US-ASCII, now UTF-8. I don't know much more about its heritage, sorry -- I've just been asked to analyze the content.
The "cents" character from EBCDIC is "hidden" in this file in random places, causing all sorts of errors. Here is more on this bugger: cents character in hex
Converting this file using iconv -f foo -t UTF-8 -c is not working -- the cents character prevails.
When I use hex editor, I can find the appearance of 0xC2 0xA2 (c2a2). But in a BIG file, this isn't ideal. Sed doesn't work at hex level, so... Not sure about tr -- I only really use it for carriage return / new line.
What linux utility / command can I use to find and delete this character reasonably quickly on very big files?
2 parts:
1 -- utility / command to find / count the number of these occurrences (octal \242)
2 -- command to replace (this works tr '\242' ' ' < source > output )
How the text appears on my ubuntu terminal:
1019EQ?IT DEPT GENERATED
With xxd, how it looks at hex level (ascii to the side looks the same as above):
0000000: 3130 3139 4551 a249 5420 4445 5054 2047 454e 4552 4154 4544 0d0a
With xxd, how it looks with "show ebcdic" -- here, just showing the ebcdic from side:
......s.....&....+........
So hex "a2" is the culprit. I'm now trying xxd -E foo | grep a2 to count the instances up.
Adding output from od -ctx1, rather than xxd, for those interested:
0000000 1 0 1 9 E Q 242 I T D E P T G
31 30 31 39 45 51 a2 49 54 20 44 45 50 54 20 47
0000020 E N E R A T E D \r \n
45 4e 45 52 41 54 45 44 0d 0a

When you say the file was converted what do you mean? Do you mean the binary file was simply dumped from an IBM 360 to another ASCII based computer, or was the file itself converted over to ASCII when it was transferred over?
The question is whether the file is actually in a well encoded state or not. The other question is how do you want the file encoded?
On my Mac (which uses UTF-8 by default, just like Linux systems), I have no problem using sed to get rid of the ¢ character:
Here's my file:
$ cat test.txt
This is a test --¢-- TEST TEST
$ od -ctx1 test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - ¢ ** - - T E S T T E S T \n
2d c2 a2 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000040
You can see that cat has no problems printing out that ¢ character. And, you can see in the od dump the c2a2 encoding of the ¢ character.
$ sed 's/¢/$/g' test.txt > new_test.txt
$ cat new_test.txt
This is a test --$-- TEST TEST
$ od -ctx1 new_test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - $ - - T E S T T E S T \n
2d 24 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000037
Here, sed has no problem changing that ¢ into a $ sign. The dump now shows that this test file is equivalent to a strictly ASCII-encoded file. The two-byte ¢ (c2 a2) is now a clean single-byte $ (24).
It looks like sed can handle your issue.
If you want to use this file on a Windows system, you can convert the file to the standard Windows Code Page 1252:
$ iconv -f utf8 -t cp1252 test.txt > new_test.txt
$ cat new_test.txt
This is a test --?-- TEST TEST
$ od -ctx1 new_test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - 242 - - T E S T T E S T \n
2d a2 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000037
Here's the file now in Code Page 1252, just the way Windows likes it! Note that the ¢ is now a single byte: octal 242 (hex a2).
So, what exactly is the issue? Do you need the file in pure 7-bit ASCII (128 defined characters)? Do you need the file encoded so Windows machines can work on it? Are you having problems entering the ¢ character?
Let me know. I'm not from the government, and yet I'm here to help you.
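For the raw a2 byte shown in the question's xxd dump (as opposed to the two-byte c2 a2 sequence in the test file above), a byte-level approach may be simpler than decoding the file at all. The following is a minimal sketch, not a definitive recipe: the function names and the sample data are made up for illustration, and it assumes the stray byte is always a lone 0xA2 (octal 242). Running tr under LC_ALL=C makes it work on plain bytes, so invalid UTF-8 does not get in the way.

```shell
#!/bin/sh
# Sketch: count and replace a raw 0xA2 byte (octal 242) without decoding the file.
# LC_ALL=C makes tr treat its input as plain bytes.

# count_byte_a2 FILE: print how many 0xA2 bytes FILE contains (hypothetical helper).
count_byte_a2() {
  # Delete every byte except 0xA2, then count what is left.
  LC_ALL=C tr -cd '\242' < "$1" | wc -c | tr -d ' '
}

# blank_byte_a2 SRC DST: copy SRC to DST, turning every 0xA2 byte into a space.
blank_byte_a2() {
  LC_ALL=C tr '\242' ' ' < "$1" > "$2"
}

# Hypothetical sample data modelled on the dump in the question.
sample=$(mktemp)
cleaned=$(mktemp)
printf '1019EQ\242IT DEPT GENERATED\r\n' > "$sample"
printf 'A\242B\242C\r\n' >> "$sample"

echo "before: $(count_byte_a2 "$sample")"
blank_byte_a2 "$sample" "$cleaned"
echo "after: $(count_byte_a2 "$cleaned")"
```

Because tr streams its input and maps one byte to one byte, the file size and record layout stay the same, which tends to matter for fixed-width mainframe extracts. If the file really contains c2 a2 pairs, this sketch would leave the c2 bytes behind, and the sed approach above is the better fit.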

Related

How to delimit file with "\t\n" on a Mac

I have a document whose lines are separated by "\t\n". Records are separated either by "\t", OR by "\n".
Normally, this should be a straightforward awk query:
BEGIN {
RS='\t\n';
}
{
print;
print "Next entry:";
}
However, on a Mac, regular expressions do not seem to be supported (maybe I'm not doing something right?) So I tried, RS="\t\n"; however, this is interpreted as RS='\t | \n'. Similar problems running awk from the command line:
awk 1 RS='\t\n' ORS='abc' input > output
replaces the \t's, but leaves the \n's be.
Next try: using tr. This obviously fails for sequence of more than one character-- since \t and \n are both used individually in the rows.
Next:
sed -e '/\t\n/s//NextEntry:/g' input > output
However, doesn't work. Entering any ASCII character sequence instead of \t\n works.
Read the manual. It says that \t is not supported in sed strings. Fair enough
sed -e '/\x9\xa/s//abc/' input > output
Still doesn't work. Idea: use tr to replace \t and \n by characters unused in the input file, use sed to change them to what I want, and then tr to change the remaining characters back to what they should be.
tr: Illegal byte sequence
Turns out, that f6 character makes tr just totally fail.
Went through the suggestions in Sed not recognizing \t instead it is treating it as 't' why? . That might work for replacing output strings (except the "Pasting tab into command prompt via CTRL+V" suggestion-- the shell just rejected that paste.), but did not seem to help in my case.
Maybe it's because it's a Mac? Maybe it's because that's the text I'm looking for, not replacing with? Maybe it's the combination with \n?
Any other suggestions?
UPDATE:
I found thread How can I replace a newline (\n) using sed? . Apparently, I am unable even to replace a \n by the string "abc" using the suggestions in that thread.
EDIT: Hex head of source file:
5a 20 4e 4f 09 0a 41 53 20 4f 46 20 30 31 2d 30
34 2d 30 35 20 45 4d 50 4c 4f 59 45 45 0a 47 52
4f 55 50 09 48 49 52 45 20 44 41 54 45 09 53 41
4c 41 52 59 09 4a 4f 42 20 54 49 54 4c 45 09 0a
4a 4f 42 20 4c 45 56 45 4c 0a 53 45 52 49 45 53
09 41 50 50 54 20 54 59 50 45 09 0a 50 41 59 20
53 54 41 54 55 53 0a f6
Unfortunately, BSD awk, as also used on macOS, doesn't support multi-character record separators (RS) at all (in line with POSIX); only a single, literal character is supported.
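One way around the single-character RS limit is to keep the default newline separator and reassemble records by hand. This is only a sketch under assumptions: the function name split_records is hypothetical, and it assumes a record ends exactly where a line ends with a tab, as in the hex dump in the question.

```shell
#!/bin/sh
# Sketch: emulate RS="\t\n" in an awk that only supports a single-character RS.
# split_records is a hypothetical helper; it reads stdin and writes stdout.
split_records() {
  awk '
    # Append the current line to the record being assembled.
    { buf = buf $0 }
    # A trailing tab means the original separator "\t\n" was here: emit the record.
    /\t$/ {
      sub(/\t$/, "", buf)
      print buf
      print "Next entry:"
      buf = ""
      next
    }
    # Otherwise the newline belongs to the record itself, so keep it.
    { buf = buf "\n" }
    # Flush a final record that was not terminated by "\t\n".
    END { if (buf != "") printf "%s", buf }
  '
}

printf 'Z NO\t\nAS OF 01-04-05 EMPLOYEE\nGROUP\tHIRE DATE\t\nJOB LEVEL\n' | split_records
```

Tabs in the middle of a line are left alone; only a tab immediately before a newline is treated as a record boundary.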
BSD sed, as also used on macOS, supports only \n in regexes - any other escapes, including hex ones (e.g., \x09) are not supported.
See this answer of mine for a comprehensive comparison of GNU and BSD sed.
Assuming that your sed command works in principle, you can use an ANSI C-quoted string ($'\t') to splice a literal tab char. into your sed script (assumes bash (the macOS default shell), ksh, or zsh):
sed -e ':a' -e '$!{N;ba' -e '}' -e '/'$'\t''\n/s//NextEntry:/g'
Note that, in order to replace newlines, you must instruct sed to read the entire file into memory first, which is what -e ':a' -e '$!{N;ba' -e '}' does (the BSD Sed-compatible form of the common GNU sed idiom :a;$!{N;ba}).
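If the script also has to run under a plain POSIX sh, where $'\t' is not available, the literal tab can come from printf instead. A minimal sketch, with a hypothetical function name and sample input; it uses the same read-everything loop as the command above:

```shell
#!/bin/sh
# Sketch: replace every tab+newline pair with a marker, using only POSIX sh features.
# join_records is a hypothetical helper; it reads stdin and writes stdout.
join_records() {
  # Command substitution strips trailing newlines only, so the tab survives.
  tab=$(printf '\t')
  # :a ... ba reads the whole input into the pattern space,
  # then the s command can see the embedded newlines as \n.
  sed -e ':a' -e '$!{N;ba' -e '}' -e "s/${tab}\n/NextEntry:/g"
}

printf 'Z NO\t\nAS OF 01-04-05 EMPLOYEE\nGROUP\tHIRE DATE\t\nJOB LEVEL\n' | join_records
```

Since the whole input ends up in memory, this fits small and medium files better than very large ones.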

Confusion about the text file encoding and how to transform between different encoding method?

I have a text file f.txt which is encoded as UTF-8, as shown below:
chengs-MBP:test cheng$ cat f.txt
Wіnd
like
chengs-MBP:test cheng$ file -I f.txt
f.txt: text/plain; charset=utf-8
However, the two words in this file, Wind and like, are different: like can be found by the grep command while Wind cannot, which confuses me:
chengs-MBP:test cheng$ cat f.txt | grep like
like
chengs-MBP:test cheng$ cat f.txt | grep Wind
chengs-MBP:test cheng$
I tried to convert this file to US-ASCII with the iconv command, but that failed:
chengs-MBP:test cheng$ iconv -f UTF-8 -t US-ASCII f.txt > new.txt
conv: f.txt:1:0: cannot convert
My goal is to convert this file to a format in which all of the words can be found by grep or sed. That's all.
UTF-8 is an encoding for the Unicode character set. Some Unicode characters are in look-alike subsets, sometimes collectively called "confusables". So,
grep "Wind" f.txt
will look for LATIN SMALL LETTER I, while
grep "Wіnd" f.txt
will look for CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I. In a shell with ANSI C quoting (bash, ksh, zsh), you could write the latter as
grep $'W\xD1\x96nd' f.txt
Since і is not a member of the ASCII character set, it can't be encoded in ASCII.
If you want to take this further, I suggest that you don't abandon UTF-8 as your text file encoding but you might want to transliterate confusable letters to Basic Latin or use a search library that deals with transliteration as a feature. grep just gives you exactly what you ask for.
In my examples, the 'i' in the word 'Wind' is CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I, which is d1 96 in hex.
To find the hex representation of a symbol, you can use xxd or hexdump:
$ xxd -g 1 f.txt
00000000: 57 d1 96 6e 64 0a 6c 69 6b 65 0a W..nd.like.
$ hexdump -C f.txt
00000000 57 d1 96 6e 64 0a 6c 69 6b 65 0a |W..nd.like.|
As you can see on the right side, in the ASCII column the non-ASCII UTF-8 bytes are replaced with dots.
You can also use Unicode Utilities.
$ uniname f.txt
character byte UTF-32 encoded as glyph name
0 0 000057 57 W LATIN CAPITAL LETTER W
1 1 000456 D1 96 і CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
2 3 00006E 6E n LATIN SMALL LETTER N
3 4 000064 64 d LATIN SMALL LETTER D
4 5 00000A 0A LINE FEED (LF)
5 6 00006C 6C l LATIN SMALL LETTER L
6 7 000069 69 i LATIN SMALL LETTER I
7 8 00006B 6B k LATIN SMALL LETTER K
8 9 000065 65 e LATIN SMALL LETTER E
9 10 00000A 0A LINE FEED (LF)
Once you know which symbols in your file are not ASCII, you can replace them with their ASCII equivalents.
$ sed -i 's/\xd1\x96/i/g' f.txt
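The \xd1\x96 escapes in the command above are a GNU sed feature. Assuming a sed without them (BSD sed on macOS, for example), the same bytes can be produced with printf and spliced into the script. The sketch below also counts the non-ASCII bytes first, which is a quick way to tell whether any replacement is needed. The function names are hypothetical, and the sample input stands in for f.txt.

```shell
#!/bin/sh
# Sketch: find and replace non-ASCII bytes without relying on \x escapes in sed.

# count_non_ascii: print the number of bytes >= 0x80 on stdin (hypothetical helper).
count_non_ascii() {
  # Delete all 7-bit bytes (octal 000-177) and count the remainder.
  LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# to_latin_i: replace CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I (bytes d1 96)
# with LATIN SMALL LETTER I.
to_latin_i() {
  # Octal 321 226 is hex d1 96.
  cyr_i=$(printf '\321\226')
  LC_ALL=C sed "s/${cyr_i}/i/g"
}

# Stand-in for f.txt: "Wіnd" with the Cyrillic letter, then "like".
printf 'W\321\226nd\nlike\n' | count_non_ascii
printf 'W\321\226nd\nlike\n' | to_latin_i | grep 'Wind'
```

A single hard-coded substitution only covers one confusable letter; a file with many different look-alikes would need a proper transliteration table.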

Read 3 output lines into 3 variables in bash

I have a command that generates 3 lines of output such as
$ ./mycommand
1
asdf
qwer zxcv
I'd like to assign those 3 lines to 3 different variables ($a, $b, $c) such that
$ echo $a
1
$ echo $b
asdf
$ echo $c
qwer zxcv
I'm familiar with the while loop method that would normally be used for reading 3 lines at a time from output that contains sets of 3 lines. But that seems less than elegant considering my command will only ever output 3 lines.
I tried playing around with various combinations of values for IFS= and options for read -r a b c, sending the command output as stdin, but I could only ever get it to set the first line to the first variable. Some examples:
IFS= read -r a b c < <(./mycommand)
IFS=$'\n' read -r a b c < <(./mycommand)
IFS= read -r -d $'\n' < <(./mycommand)
If I modify my command so that the 3 lines are separated by spaces instead of newlines, I can successfully just use this variation as long as each former line is properly quoted:
read -r a b c < <(./mycommand)
And while that is working for the purposes of my current project, it's still bugging me that I couldn't get it to work the other way. So I'm wondering if anyone can see and explain what I was missing in my original attempt with the 3 lines of output.
If you want to read data from three lines, use three reads:
{ read -r a; read -r b; read -r c; } < <(./mycommand)
read reads a chunk of data and then splits it up. You couldn't get it to work because your chunks were always single lines.
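To make the difference concrete, here is a small sketch that contrasts a single read with three reads. It sticks to POSIX sh, so a here-document replaces the process substitution; mycommand is a stand-in function, and the two helper names are hypothetical.

```shell
#!/bin/sh
# Stand-in for ./mycommand from the question.
mycommand() { printf '1\nasdf\nqwer zxcv\n'; }

# one_read: a single read consumes only the first line, then splits that line.
one_read() {
  read -r a b c <<EOF
$(mycommand)
EOF
  printf 'a=%s|b=%s|c=%s\n' "$a" "$b" "$c"
}

# read_three: three reads share the same stdin, and each consumes one line.
read_three() {
  { IFS= read -r a; IFS= read -r b; IFS= read -r c; } <<EOF
$(mycommand)
EOF
  printf 'a=%s|b=%s|c=%s\n' "$a" "$b" "$c"
}

one_read
read_three
```

Setting IFS to a newline does not help a single read, because read has already stopped at the end of the first line before any splitting takes place.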
Bash 4 and later support the mapfile command, which reads all the lines into an array:
mapfile -t ary < <(./mycommand)
Check the array content:
declare -p ary
declare -a ary='([0]="1" [1]="asdf" [2]="qwer zxcv")'
Perhaps this explanation will be useful to you.
... it's still bugging me that I couldn't get it to work the other way. So I'm wondering if anyone can see and explain what I was missing in my original attempt with the 3 lines of output.
Simple: read works only with one line (by default). This:
#!/bin/bash
mycommand(){ echo -e "1\nasdf\nqwer zxcv"; }
read a b c < <(mycommand)
printf 'first : %s\nsecond : %s\nthird : %s\n' "$a" "$b" "$c"
Will print:
first : 1
second :
third :
However, using a null delimiter (-d '') makes read capture the whole output (replace the read line above with this one):
read -d '' a b c < <(mycommand)
Will print:
first : 1
second : asdf
third : qwer zxcv
The read command absorbed the whole output of the command, which was then split into parts using the default value of IFS: space, tab, and newline.
In this specific example, that worked correctly because the last value is the one with more than one "part".
But this kind of processing is very brittle. For example, with this other possible output of the command, the assignment to variables breaks:
mycommand(){ echo -e "1 and 2\nasdf and dfgh\nqwer zxcv"; }
Will output (incorrectly):
first : 1
second : and
third : 2
asdf and dfgh
qwer zxcv
The processing is brittle. To make it robust we need to use a loop. But you say that that is something you already know:
#!/bin/bash
mycommand(){ echo -e "1 and 2\nasdf and dfgh\nqwer zxcv"; }
i=0; while IFS= read -r "arr[i]"; do ((i++)); done < <(mycommand)
printf 'first : %s\nsecond : %s\nthird : %s\n' "${arr[0]}" "${arr[1]}" "${arr[2]}"
Which will (correctly) print:
first : 1 and 2
second : asdf and dfgh
third : qwer zxcv
However, the loop could be made simpler using bash command readarray:
#!/bin/bash
mycommand(){ echo -e "1 and 2\nasdf and dfgh\nqwer zxcv"; }
readarray -t arr < <(mycommand)
printf 'first : %s\nsecond : %s\nthird : %s\n' "${arr[0]}" "${arr[1]}" "${arr[2]}"
And using a printf "loop" will make the structure work for any count of input lines:
#!/bin/bash
mycommand(){ echo -e "1 and 2\nasdf and dfgh\n*\nqwer zxcv"; }
readarray -t arr < <(mycommand)
printf 'value : %s\n' "${arr[@]}"
Hope that this helped.
EDIT
About nulls (in simple read):
In bash, the use of nulls is almost never practical. Specifically, nulls are silently erased in most conditions. This solution does suffer from that limitation.
Including a null in the input:
mycommand(){ echo -e "1 and 2\nasdf and dfgh\n\000\n*\nqwer zxcv"; }
will make a simple read -r -d '' get the input up to the first null (understanding such null as the character with octal 000).
echo "test one:"; echo
echo "input"; echo
mycommand | od -tcx1
echo "output"; echo
read -r -d '' arr < <(mycommand)
echo "$arr" | od -tcx1
Gives this as output:
test one:
input
0000000 1 a n d 2 \n a s d f a n d
31 20 61 6e 64 20 32 0a 61 73 64 66 20 61 6e 64
0000020 d f g h \n \0 \n * \n q w e r z
20 64 66 67 68 0a 00 0a 2a 0a 71 77 65 72 20 7a
0000040 x c v \n
78 63 76 0a
0000044
output
0000000 1 a n d 2 \n a s d f a n d
31 20 61 6e 64 20 32 0a 61 73 64 66 20 61 6e 64
0000020 d f g h \n
20 64 66 67 68 0a
0000026
It is clear that the value captured by read stops at the first octal 000.
Which, frankly, is to be expected.
About nulls (in readarray):
I have to report, however, that readarray does not stop at the octal 000 but just silently removes it (a usual shell trait).
Running this code:
#!/bin/bash
mycommand(){ echo -e "1 and 2\nasdf and dfgh\n\000\n*\nqwer zxcv"; }
echo "test two:"; echo
echo "input"; echo
mycommand | od -tcx1
echo "output"; echo
readarray -t arr < <(mycommand)
printf 'value : %s\n' "${arr[@]}"
echo
printf 'value : %s\n' "${arr[@]}"|od -tcx1
Renders this output:
test two:
input
0000000 1 a n d 2 \n a s d f a n d
31 20 61 6e 64 20 32 0a 61 73 64 66 20 61 6e 64
0000020 d f g h \n \0 \n * \n q w e r z
20 64 66 67 68 0a 00 0a 2a 0a 71 77 65 72 20 7a
0000040 x c v \n
78 63 76 0a
0000044
output
value : 1 and 2
value : asdf and dfgh
value :
value : *
value : qwer zxcv
0000000 v a l u e : 1 a n d 2 \n
76 61 6c 75 65 20 3a 20 31 20 61 6e 64 20 32 0a
0000020 v a l u e : a s d f a n d
76 61 6c 75 65 20 3a 20 61 73 64 66 20 61 6e 64
0000040 d f g h \n v a l u e : \n v
20 64 66 67 68 0a 76 61 6c 75 65 20 3a 20 0a 76
0000060 a l u e : * \n v a l u e :
61 6c 75 65 20 3a 20 2a 0a 76 61 6c 75 65 20 3a
0000100 q w e r z x c v \n
20 71 77 65 72 20 7a 78 63 76 0a
0000113
That is, the null 000 or just \0 gets silently removed.

How to convert a text file containing hexadecimal to binary file using linux commands?

I have the hex code of a binary in text (string) format. How do I convert it to a binary file using linux commands like cat and echo?
I know the following command will create a binary file test.bin. But what if the hex code is in another .txt file? How do I "cat" the content of the text file to "echo" and generate a binary file?
# echo -e "\x00\x001" > test.bin
Use xxd -r. It reverts a hexdump to its binary representation.
Edit: The -p parameter is also very useful. It accepts "plain" hexadecimal values, but ignores whitespace and line changes.
So, if you have a plain text dump like this:
echo "0000 4865 6c6c 6f20 776f 726c 6421 0000" > text_dump
You can convert it to binary with:
xxd -r -p text_dump > binary_dump
And then get useful output with something like:
xxd binary_dump
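On a machine where xxd is not installed, a pure-shell fallback is possible, in the spirit of the echo approach from the question. This is a rough sketch under assumptions: hex_to_bin is a hypothetical helper, the input is plain hex digits with arbitrary whitespace (like the text_dump above without the offset column), and it runs one printf per byte, so it is only reasonable for small inputs.

```shell
#!/bin/sh
# Sketch: convert plain hex text on stdin to raw bytes on stdout without xxd.
# hex_to_bin is a hypothetical helper, far slower than xxd -r -p.
hex_to_bin() {
  # Keep only hex digits, terminate the stream with a newline,
  # then cut it into two-digit lines.
  { tr -cd '0-9A-Fa-f'; echo; } | fold -w 2 | while read -r pair; do
    # Skip the empty line produced by empty input.
    [ -n "$pair" ] || continue
    # Turn the hex pair into a three-digit octal escape and print that byte.
    printf "\\$(printf '%03o' "0x$pair")"
  done
}

printf '4865 6c6c 6f20\n776f 726c 6421\n' | hex_to_bin
echo
```

The octal detour is there because \ddd escapes in the printf format string are specified by POSIX, while \x escapes are not.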
If you have long text or text in a file, you can also use the binmake tool, which lets you describe binary data in text format and generate a binary file (or output to stdout). It lets you change the endianness and number formats, and it accepts comments.
Its default format is hexadecimal but not limited to this.
First get and compile binmake:
$ git clone https://github.com/dadadel/binmake
$ cd binmake
$ make
You can pipe it using stdin and stdout:
$ echo '32 decimal 32 61 %x20 %x61' | ./binmake | hexdump -C
00000000 32 20 3d 20 61 |2 = a|
00000005
Or use files. So create your text file file.txt:
# an exemple of file description of binary data to generate
# set endianess to big-endian
big-endian
# default number is hexadecimal
00112233
# man can explicit a number type: %b means binary number
%b0100110111100000
# change endianess to little-endian
little-endian
# if no explicit, use default
44556677
# bytes are not concerned by endianess
88 99 aa bb
# change default to decimal
decimal
# following number is now decimal
0123
# strings are delimited by " or '
"this is some raw string"
# explicit hexa number starts with %x
%xff
Generate your binary file file.bin:
$ ./binmake file.txt file.bin
$ hexdump -C file.bin
00000000 00 11 22 33 4d e0 77 66 55 44 88 99 aa bb 7b 74 |.."3M.wfUD....{t|
00000010 68 69 73 20 69 73 20 73 6f 6d 65 20 72 61 77 20 |his is some raw |
00000020 73 74 72 69 6e 67 ff |string.|
00000027
In addition to xxd, you should also look at the packages/commands od and hexdump. All are similar, but each provides slightly different options that let you tailor the output to your needs. For example, hexdump -C is the traditional hexdump with the associated ASCII translation alongside.

Shell script printing contents of variable containing output of a command removes newline characters [duplicate]

I'm writing a shell script which will store the output of a command in a variable, process the output, and later echo the results. Here's what I've got:
stuff=$(diff -u pens tape)
# process the output
echo $stuff
The problem is, the output I get from running the script is this:
--- pens 2009-09-27 10:29:06.000000000 -0400 +++ tape 2009-09-18 16:45:08.000000000 -0400 @@ -1,4 +1,2 @@ -highlighter -marker -pencil -POSIX +masking +duct
Whereas I was expecting this:
--- pens 2009-09-27 10:29:06.000000000 -0400
+++ tape 2009-09-18 16:45:08.000000000 -0400
@@ -1,4 +1,2 @@
-highlighter
-marker
-pencil
-POSIX
+masking
+duct
It looks like the newline characters are being removed somehow. How do I get them to stay in?
If you want to preserve the newlines, enclose the variable in double quotes:
echo "$stuff"
When you write it without the double quotes, the shell expands $stuff into a space-separated list of words (where 'words' are sequences of non-space characters, and the space characters are blanks and tabs and newlines; upon experimentation, it seems that form feeds, carriage returns and back-spaces are not counted as space).
Demonstrating interpretation of control characters as white space. ASCII 8 is backspace, 9 is tab, 10 is new line (LF), 11 is vertical tab, 12 is form feed, 13 is carriage return. The first command generates a sequence of characters separated by the various control characters. The second command echoes the result with the original characters preserved; see the hex dump. The third command echoes the result with the shell splitting the words; you can see that the tab and newline were replaced by blank (0x20).
$ x=$(./ascii 64 65 8 66 67 9 68 69 10 70 71 11 72 73 12 74 75 13 76 77)
$ echo "$x" | odx
0x0000: 40 41 08 42 43 09 44 45 0A 46 47 0B 48 49 0C 4A @A.BC.DE.FG.HI.J
0x0010: 4B 0D 4C 4D 0A K.LM.
0x0015:
$ echo $x | odx
0x0000: 40 41 08 42 43 20 44 45 20 46 47 0B 48 49 0C 4A @A.BC DE FG.HI.J
0x0010: 4B 0D 4C 4D 0A K.LM.
0x0015:
$
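The same effect can be seen without the custom ascii and odx helpers by counting lines. A minimal sketch with made-up sample text; the variable names are only for illustration:

```shell
#!/bin/sh
# Sketch: unquoted expansion splits on whitespace and echo rejoins with spaces,
# so the newlines stored in the variable are lost.
stuff=$(printf 'line one\nline two\nline three\n')

# Unquoted: the shell passes six separate words to echo, which prints them on one line.
unquoted_lines=$(echo $stuff | wc -l | tr -d ' ')

# Quoted: echo receives a single argument with the newlines intact.
quoted_lines=$(echo "$stuff" | wc -l | tr -d ' ')

echo "unquoted: $unquoted_lines line(s), quoted: $quoted_lines line(s)"
```

The variable itself holds the newlines in both cases; only the way it is expanded on the echo command line differs.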

Resources