thegladiator:~/cp$ cat new.txt
Hello World This is a Trest Progyy
thegladiator:~/cp$ hexdump new.txt
0000000 6548 6c6c 206f 6f57 6c72 2064 6854 7369
0000010 6920 2073 2061 7254 7365 2074 7250 676f
0000020 7979 000a
0000023
How is that text data represented in hex like that? What does this output mean?
It's just what it says: a dump of the data in hexadecimal format:
H 48
e 65
l 6c
l 6c
o 6f
It is odd though that all of the bytes are swapped (65 48 : e H)
If you're on a *nix system, you can use 'od -x', or 'man od' will tell you all the ways to get data from od :)
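For example, on a little-endian machine od -x groups the file into the same swapped 16-bit words as the hexdump above, just with octal offsets (spacing may differ slightly):
$ od -x new.txt
0000000 6548 6c6c 206f 6f57 6c72 2064 6854 7369
0000020 6920 2073 2061 7254 7365 2074 7250 676f
0000040 7979 000a
0000043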
The text in the file new.txt is stored using ASCII encoding. Each printable character is represented by a number in the range 32-127 decimal (20-7F hexadecimal). So the first three letters (H, e, l) are represented by the decimal numbers 72, 101, 108, or in hexadecimal 48, 65, 6C.
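You can check those values from the shell with printf, whose leading-quote conversion turns a character into its numeric code:
$ printf '%d %d %d\n' "'H" "'e" "'l"
72 101 108
$ printf '%x %x %x\n' "'H" "'e" "'l"
48 65 6c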
hexdump by default takes each 16-bit word of the input file new.txt and outputs that word as a hexadecimal number. Because it operates on 16-bit words rather than single bytes, and reads each word in your machine's little-endian byte order, you see the two bytes of every pair in an unexpected (swapped) order.
If you instead use xxd new.txt, you will see the output in the expected order.
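For example (the exact spacing can differ between xxd versions, but the bytes appear in file order):
$ xxd new.txt
00000000: 4865 6c6c 6f20 576f 726c 6420 5468 6973  Hello World This
00000010: 2069 7320 6120 5472 6573 7420 5072 6f67   is a Trest Prog
00000020: 7979 0a                                   yy.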
Related
I have a text file f.txt which is encoded as UTF-8, as shown below:
chengs-MBP:test cheng$ cat f.txt
Wіnd
like
chengs-MBP:test cheng$ file -I f.txt
f.txt: text/plain; charset=utf-8
However, the two words in this file, Wind and like, behave differently: like can be found with grep while Wind cannot, which confused me:
chengs-MBP:test cheng$ cat f.txt | grep like
like
chengs-MBP:test cheng$ cat f.txt | grep Wind
chengs-MBP:test cheng$
I wanted to transform this file to us-ascii with the iconv command, but it failed:
chengs-MBP:test cheng$ iconv -f UTF-8 -t US-ASCII f.txt > new.txt
iconv: f.txt:1:0: cannot convert
My goal is to transform this file into a format in which all of the words can be found by grep or sed... that's all.
UTF-8 is an encoding for the Unicode character set. Some Unicode characters are in look-alike subsets, sometimes collectively called "confusables". So,
grep "Wind" f.txt
will look for LATIN SMALL LETTER I, while
grep "Wіnd" f.txt
will look for CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I. You could write it as
grep $'W\xd1\x96nd' f.txt
(using the shell's $'...' quoting, since grep itself does not interpret \x escapes).
Since і is not a member of the ASCII character set, it can't be encoded in ASCII.
If you want to take this further, I suggest that you don't abandon UTF-8 as your text file encoding but you might want to transliterate confusable letters to Basic Latin or use a search library that deals with transliteration as a feature. grep just gives you exactly what you ask for.
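As a quick sanity check, you can at least locate the lines that contain non-ASCII bytes at all; in the C locale the bracket range [ -~] covers the printable ASCII characters, so the negated class matches anything else (tabs and other control characters would match too):
$ LC_ALL=C grep -n '[^ -~]' f.txt
1:Wіnd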
In my examples, the 'і' in the word 'Wіnd' is CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I, which is d1 96 in hex (UTF-8).
To find out a symbol's hex representation you can use xxd or hexdump:
$ xxd -g 1 f.txt
00000000: 57 d1 96 6e 64 0a 6c 69 6b 65 0a W..nd.like.
$ hexdump -C f.txt
00000000 57 d1 96 6e 64 0a 6c 69 6b 65 0a |W..nd.like.|
As you can see on the right side, in the ASCII section, the non-ASCII UTF-8 bytes are replaced with dots.
You can also use the Unicode Utilities.
$ uniname f.txt
character  byte  UTF-32  encoded as  glyph  name
        0     0  000057  57          W      LATIN CAPITAL LETTER W
        1     1  000456  D1 96       і      CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
        2     3  00006E  6E          n      LATIN SMALL LETTER N
        3     4  000064  64          d      LATIN SMALL LETTER D
        4     5  00000A  0A                 LINE FEED (LF)
        5     6  00006C  6C          l      LATIN SMALL LETTER L
        6     7  000069  69          i      LATIN SMALL LETTER I
        7     8  00006B  6B          k      LATIN SMALL LETTER K
        8     9  000065  65          e      LATIN SMALL LETTER E
        9    10  00000A  0A                 LINE FEED (LF)
After you find out which symbols in your file are not ASCII, you can replace them in the file with their ASCII equivalents.
$ sed -i 's/\xd1\x96/i/g' f.txt
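Note that the \xNN escapes (and -i with no backup suffix) are GNU sed features; on macOS you can let the shell expand the bytes instead, and then confirm with grep that the word is now plain ASCII:
$ sed -i '' $'s/\xd1\x96/i/g' f.txt
$ grep Wind f.txt
Wind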
I have the hex code of a binary in text (string) format. How do I convert it to a binary file using Linux commands like cat and echo?
I know the following command will create a binary test.bin. But what if this hex code is in another .txt file? How do I "cat" the content of the text file into "echo" and generate a binary file?
# echo -e "\x00\x001" > test.bin
Use xxd -r. It reverts a hexdump to its binary representation.
Edit: The -p parameter is also very useful. It accepts "plain" hexadecimal values, but ignores whitespace and line breaks.
So, if you have a plain text dump like this:
echo "0000 4865 6c6c 6f20 776f 726c 6421 0000" > text_dump
You can convert it to binary with:
xxd -r -p text_dump > binary_dump
And then get useful output with something like:
xxd binary_dump
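For the example above, the round trip should produce something like this (exact formatting may vary slightly between xxd versions):
$ xxd binary_dump
00000000: 0000 4865 6c6c 6f20 776f 726c 6421 0000  ..Hello world!..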
If you have long text, or the text is in a file, you can also use the binmake tool, which lets you describe binary data in a text format and generate a binary file (or write it to stdout). It allows you to change the endianness and number formats, and it accepts comments.
Its default format is hexadecimal, but it is not limited to that.
First get and compile binmake:
$ git clone https://github.com/dadadel/binmake
$ cd binmake
$ make
You can pipe it using stdin and stdout:
$ echo '32 decimal 32 61 %x20 %x61' | ./binmake | hexdump -C
00000000 32 20 3d 20 61 |2 = a|
00000005
Or use files. So create your text file file.txt:
# an example of a file describing binary data to generate
# set endianess to big-endian
big-endian
# default number is hexadecimal
00112233
# one can make a number type explicit: %b means a binary number
%b0100110111100000
# change endianess to little-endian
little-endian
# if not made explicit, the default is used
44556677
# single bytes are not affected by endianness
88 99 aa bb
# change default to decimal
decimal
# following number is now decimal
0123
# strings are delimited by " or '
"this is some raw string"
# an explicit hexadecimal number starts with %x
%xff
Generate your binary file file.bin:
$ ./binmake file.txt file.bin
$ hexdump file.bin -C
00000000 00 11 22 33 4d e0 77 66 55 44 88 99 aa bb 7b 74 |.."3M.wfUD....{t|
00000010 68 69 73 20 69 73 20 73 6f 6d 65 20 72 61 77 20 |his is some raw |
00000020 73 74 72 69 6e 67 ff |string.|
00000027
In addition to xxd, you should also look at the commands od and hexdump. All are similar, but each provides slightly different options that let you tailor the output to your needs. For example, hexdump -C is the traditional hex dump with the associated ASCII translation alongside.
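For example, the binary_dump file from above looks like this with hexdump -C (spacing sketched, it may differ slightly on your system):
$ hexdump -C binary_dump
00000000  00 00 48 65 6c 6c 6f 20  77 6f 72 6c 64 21 00 00  |..Hello world!..|
00000010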
input
pattern01 pattern11
pattern02 NonNumeric pattern12
output
pattern01 pattern11
pattern02 pattern12
pattern0x is 7 to 15 characters long and may contain digits or '.'
pattern1x is 2 to 4 numerical characters long
NonNumeric is a single, strictly non-numerical character
ex:
input
12.1 58
135454& 548
124.485* 5587
12.58.336./ 54
output
12.1 58
135454 548
124.485 5587
12.58.336. 54
thank you very much!
Try sed as below:
sed -r 's/[^0-9. ]//g' test.txt
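Assuming the sample input above is saved as test.txt (note that -r is the GNU spelling; BSD/macOS sed uses -E), this produces the desired output:
$ sed -r 's/[^0-9. ]//g' test.txt
12.1 58
135454 548
124.485 5587
12.58.336. 54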
I have a big file from which I want to remove some content. The file is binary, and I don't have line numbers, only hex addresses, so how can I remove the region between:
0x13e70a00 and 0x1eaec03ff
With sed (both inclusive)
Will something like this work?
sed -n 's/\x13e70a00/,s/\x1eaec03ff/ p' orig-data-file > new-file
From what you wrote, it looks like you are trying to delete all the bytes between the two hex patterns. For that you will need the following.
This deletes all the bytes between the patterns, inclusive of the patterns:
sed 's/\x13\xe7\x0a\x00.*\x1e\xae\xc0\x3f//g' in >out
This deletes all bytes between the patterns, leaving the patterns intact. (There is a way to do this with numbered parts of regexes, but this is a bit clearer to begin with.)
sed 's/\x13\xe7\x0a\x00.*\x1e\xae\xc0\x3f/\x13\xe7\x0a\x00\x1e\xae\xc0\x3f/g' in >out
Both search (s/) for <pattern1> followed by any text (.*) followed by <pattern2>, and replace it with either nothing (//g) or just the two edges (/<pattern1><pattern2>/g), throughout the file (the trailing g).
If you want to delete (or replace) from byte 300 to byte 310:
sed 's/\(.\{300\}\).\{10\}/\1rep-str/' in>out
This matches the first 300 characters (.\{300\}) and remembers them (the \(...\)). It also matches the next 10 characters. It replaces the whole combined match with the first 300 characters (\1) followed by your replacement string rep-str; the replacement string can be empty to simply delete the text between bytes 300 and 310.
However, this is quite brittle if there are any newline characters. If you can live without replacement:
dd if=file bs=1 skip=310|dd of=file bs=1 seek=300 conv=notrunc
This does an in-place edit by copying everything from byte 310 onwards back into the file starting at position 300, which removes the 10 bytes in between. Note that with conv=notrunc the file keeps its original length, so the last 10 bytes remain duplicated at the end.
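If you do want the file to actually shrink by those 10 bytes, GNU coreutils truncate can trim them from the end afterwards (a size with a leading minus is relative):
$ truncate -s -10 file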
an even more general alternative is
dd if=in bs=1 count=300>out
printf "replacement text">>out
dd if=in bs=1 skip=310>>out
Though the simplest thing to do would be to use a hex editor like Bless.
You should be able to use a clever combination of converting bash numbers from hex to decimal, bash math to add 1 to the decimal offsets, and cut --complement -b to remove the correct segment from the file.
EDIT: Like this:
$ snip_out 0x0f 0x10 <<< "0123456789abcdeffedcba9876543210" | od -t x1
0000000 30 31 32 33 34 35 36 37 38 39 61 62 63 64 65 65
0000020 64 63 62 61 39 38 37 36 35 34 33 32 31 30
0000036
Where snip_out is a two-parameter shell-script that operates on stdin and stdout:
#!/bin/bash
START_RANGE_DEC=$(printf "%d" $1)
END_RANGE_DEC=$(printf "%d" $2)
# Most hex ranges begin with 0; cut begins with 1.
CUT_START_DEC=$(( $START_RANGE_DEC + 1 ))
CUT_END_DEC=$(( $END_RANGE_DEC + 1 ))
# cut likes to append a newline after output. Use head to remove it.
exec cut --complement -b $CUT_START_DEC-$CUT_END_DEC | head -c -1
I have a 100M row file that has some encoding problems -- was "originally" EBCDIC, saved as US-ASCII, now UTF-8. I don't know much more about its heritage, sorry -- I've just been asked to analyze the content.
The "cents" character from EBCDIC is "hidden" in this file in random places, causing all sorts of errors. Here is more on this bugger: cents character in hex
Converting this file using iconv -f foo -t UTF-8 -c is not working -- the cents character prevails.
When I use a hex editor, I can find occurrences of 0xC2 0xA2 (c2a2). But in a BIG file, this isn't ideal. sed doesn't work at the hex level, so... Not sure about tr -- I only really use it for carriage return / newline conversion.
What linux utility / command can I use to find and delete this character reasonably quickly on very big files?
2 parts:
1 -- utility / command to find / count the number of these occurrences (octal \242)
2 -- command to replace (this works tr '\242' ' ' < source > output )
How the text appears on my ubuntu terminal:
1019EQ?IT DEPT GENERATED
With xxd, how it looks at the hex level (the ASCII to the side looks the same as above):
00000000: 3130 3139 4551 a249 5420 4445 5054 2047
00000010: 454e 4552 4154 4544 0d0a
With xxd, how it looks with "show ebcdic" -- here, just showing the EBCDIC side:
......s.....&....+........
So hex "a2" is the culprit. I'm now trying xxd -E foo | grep a2 to count the instances up.
Adding output from od -ctx1, rather than xxd, for those interested:
0000000 1 0 1 9 E Q 242 I T D E P T G
31 30 31 39 45 51 a2 49 54 20 44 45 50 54 20 47
0000020 E N E R A T E D \r \n
45 4e 45 52 41 54 45 44 0d 0a
When you say the file was converted, what do you mean? Do you mean the binary file was simply dumped from an IBM 360 to another ASCII-based computer, or was the file itself converted to ASCII when it was transferred over?
The question is whether the file is actually in a well-encoded state or not. The other question is how you want the file encoded.
On my Mac (which uses UTF-8 by default, just like Linux systems), I have no problem using sed to get rid of the ¢ character:
Here's my file:
$ cat test.txt
This is a test --¢-- TEST TEST
$ od -ctx1 test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - ¢ ** - - T E S T T E S T \n
2d c2 a2 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000040
You can see that cat has no problems printing out that ¢ character. And, you can see in the od dump the c2a2 encoding of the ¢ character.
$ sed 's/¢/$/g' test.txt > new_test.txt
$ cat new_test.txt
This is a test --$-- TEST TEST
$ od -ctx1 new_test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - $ - - T E S T T E S T \n
2d 24 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000037
Here, my sed has no problems changing that ¢ into a $ sign. The dump now shows that this test file is equivalent to a strictly ASCII encoded file. The two-byte encoded ¢ is now a nice clean single-byte $.
It looks like sed can handle your issue.
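For the counting half of your question, something like this should work against the test.txt example, assuming a bash or zsh shell where $'...' expands the byte escapes:
$ grep -c $'\xc2\xa2' test.txt
1
$ grep -o $'\xc2\xa2' test.txt | wc -l
1
The first counts lines containing at least one ¢; the second counts every individual occurrence.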
If you want to use this file on a Windows system, you can convert the file to the standard Windows Code Page 1252:
$ iconv -f utf8 -t cp1252 test.txt > new_test.txt
$ cat new_test.txt
This is a test --?-- TEST TEST
$ od -ctx1 new_test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - 242 - - T E S T T E S T \n
2d a2 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000037
Here's the file, now in Code Page 1252, just the way Windows likes it! Note that the ¢ is now a single byte, shown by od as octal 242 (hex a2).
So, what exactly is the issue? Do you need the file restricted to the 128 characters defined by pure ASCII? Do you need the file encoded so Windows machines can work with it? Are you having problems entering the ¢ character?
Let me know. I'm not from the government, and yet I'm here to help you.