Text to hexadecimals using hexdump - æøå characters - linux

Using a shell/bash script, I need to convert some text to hexadecimals so I pipe the source text into hexdump, so far so good. The problem is æøå characters. They show up fine in the console (UTF-8), but the hexadecimal values hexdump provides isn't correct. All other standard latin letters. echo -en "Some text containing æøåÆØÅ"|hexdump -v -e '"xx" 1/1 "%02X", then I use sed to replace the xx with %. Well, all letters, punctations, new line, etc is, just not non-standard-latin letters.
So, how do I go about solving this? Is it the input codepage that is the problem, or is there some limitations wiht hexdump? Thanks!
EDIT: By codepage, I mean character encoding. Not 100% sure it is the same thing. Bear with me please! :)

Related

How to echo/print actual file contents on a unix system

I would like to see the actual file contents without it being formatted to print. For example, to show:
\n0.032,170\n0.034,290
Instead of:
0.032,170
0.34,290
Is there a command to echo the file's actual data in bash? I've tried using head, cat, more, etc. but all those seem to echo the "print-formatted" text. For example:
$ cat example.csv
0.032,170
0.34,290
How can I print the actual characters within the file?
This reads as if you miss understand what the "actual characters in the file" are. You will not find the characters \ and n in that file. But only a line feed, which is a specific character. So the utilities like cat do actually output exactly the characters in the file.
Putting it the other way around: if you really had those two characters literally in the file, then a utility like cat would actually output them. I just checked that, just to be sure.
You can easily check that yourself if you open the file using a hexeditor. There you will see the character 0A (decimal 10) which is a line feed character. You will not see the pair of the two characters \ and n somewhere in that file.
Many programming languages and also shell environments use escape sequences like \n in string definitions and identify those as control characters which would not be typable otherwise. So maybe that is where your impression comes from that your files should contain those two characters.
To display newlines as \n, you might try:
awk 1 ORS='\\n' input-file
This is not the "actual characters in the file", as \n is merely a conventional method of displaying a newline, but this does seem to be what you want.

using tr to strip characters but keep line breaks

I am trying to format some text that was converted from UTF-16 to ASCII, the output looks like this:
C^#H^#M^#M^#2^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
T^#h^#e^#m^#e^# ^#M^#a^#n^#a^#g^#e^#r^# ^#f^#o^#r^# ^#3^#D^#S^#^#^#^#^#^#^#^#^#^#^#^#^#^#
The only text I want out of that is:
CHMM2
Theme Manager for 3DS
So there is a line break "\n" at the end of each line and when I use
tr -cs 'a-zA-Z0-9' 'newtext' infile.txt > outfile.txt
It is stripping the new line as well so all the text ends up in one big string on one line.
Can anyone assist with figuring out how to strip out only the ^#'s and keeping spaces and new lines?
The ^#s are most certainly null characters, \0s, so:
tr -d '\0'
Will get rid of them.
But this is not really the correct solution. You should simply use theiconv command to convert from UTF-16 to UTF-8 (see its man page for more information). That is, of course, what you're really trying to accomplish here, and this will be the correct way to do it.
This is an XY problem. Your problem is not deleting the null characters. Your real problem is how to convert from UTF-16 to either UTF-8, or maybe US-ASCII (and I chose UTF-8, as the conservative answer).

node JS console log ascii symbol

HiI am using node JS for my app, and I want to print ascii symbols in terminal.Here is a table for ascii symbols. Please check Extended ASCII Codes field. I want to print square or circle, for example 178 or 219.
Can anyone say me, how can do it?Thank you
Like several other languages, Javascript suffers from The UTF‐16
Curse. Except that Javascript has an even worse form of it, The UCS‐2
Curse. Things like charCodeAt and fromCharCode only ever deal with
16‐bit quantities, not with real, 21‐bit Unicode code points.
Therefore, if you want to print out something like 𝒜, U+1D49C,
MATHEMATICAL SCRIPT CAPITAL A, you have to specify not one character
but two “char units”: "\uD835\uDC9C".
Please refer to this link: https://dheeb.files.wordpress.com/2011/07/gbu.pdf
Your desired character is not a printable ASCII character. On linux you can print all the printable ascii characters by running this command:
for((i=32;i<=127;i++)) do printf \\$(printf '%03o\t' "$i"); done;printf "\n"
or
man ascii
So what you can do is to print unicode characters. Here is a list of all the available unicode characters, and you can select one which is looking almost identical with your desired character.
http://unicode-table.com/en/#2764
I've tested on a windows terminal but it is still not showing the desired character, but it's working on linux. If it's still not working you had to make sure to set LANGUAGE="en_US.UTF-8" in /etc/rc.conf and LANG="en_US.UTF-8" in /etc/locale.conf.
So printing out something like this on node console:
console.log('\u2592 start typing...');
will output this result:
▒ start typing...
Actually, if you only care about ASCII that should not be a real problem at all. You only have to properly escape them. A good reference for this is https://mathiasbynens.be/notes/javascript-escapes
console.log('\xB2 \xDB')
Works for me with recentish node under Windows (cmd shell) and mac OS. For ASCII characters you can just convert them to hex and prepend them with \x in your strings. Give it a try with node -e "console.log('\xB2')"
And when you try this answer, and it works, you might want to try:
node -e "console.log('\x07')"

Print non-ascii/unicode characters in shell

I am using following command to search and print non-ascii characters:
grep --color -R -C 2 -P -n "[\x80-\xFF]" .
The output that I get, prints the line which has non-ascii characters in it.
However it does not print the actual unicode character.
Is there a way to print the unicode character?
output
./test.yml-35-
./test.yml-36-- name: Flush Handlers
./test.yml:37:  meta: flush_handlers
./test.yml-38-
--
This was answered in Searching for non-ascii characters. The real issue as shown in Filtering invalid utf8 is that the regular expression you are using is for single bytes, while UTF-8 is a multibyte encoding (and the pattern must therefore cover multiple bytes).
The extensive answer by #Peter O in the latter Q/A appears to be the best one, using Perl. grep is the wrong tool.

How to tell sed "do not remove some characters"?

I have a text file containing Arabic characters and some other characters (punctuation marks, numbers, English characters, ... ).
How can I tell sed to remove all the characters in the file, except Arabic ones? In short I can say that we typically tell sed to remove/replace some specific characters and print others, but now I am looking for a way to tell sed just print my desired characters, and remove all other characters.
With GNU sed, you should be able to specify characters by their hex code. You can use those in a a character class:
sed 's/[\x00-\x7F]//g' # hex notation
sed 's/[\o000-\o177]//g' # octal notation
You should also be able to achieve the same effect with the tr command:
tr -d '[\000-\177]'
Both methods assume UTF8 encoding of your input file. Multi-byte characters have their highest bit set, so you can simply strip everything that's a standard ASCII (7 bits) character.
To keep everything except some well defined characters, use a negative character classe:
sed 's/[^characters you want to keep]//g'
Using a pattern alike to [^…]\+ might improve performance of the regex.

Resources