I've got a list of songs that look like this.
GIFT〜白〜(冬恋/君の歌をうたう)【完全生産限定盤】
The Latin letters GIFT here look odd, and I can't figure out how to make them read like normal text. For example, if you copy this word, there are no spaces between the letters or anything, but it seems to be in a different text format.
Can someone help me convert this into normal text?
These are fullwidth Unicode characters.
For example, the 'G' is
Unicode Character 'FULLWIDTH LATIN CAPITAL LETTER G' (U+FF27)
UTF-8 (hex) 0xEF 0xBC 0xA7 (efbca7)
see here
You can copy the string into Notepad++ and convert it to hex (Plugins > Converter > ASCII -> HEX)
and get EFBCA7EFBCA9EFBCA6EFBCB4 for the word 'GIFT'.
Then search for "Unicode EFBCA7" to find the above information.
These can be converted into normal Latin characters. For example, .NET has the String.Normalize method:
using System;
using System.Text;

public class Program
{
    public static void Main()
    {
        Console.WriteLine("Unicode:");
        // Fullwidth letters G, I, F, T (U+FF27, U+FF29, U+FF26, U+FF34), not ASCII "GIFT".
        string text = "\uFF27\uFF29\uFF26\uFF34";
        Console.WriteLine(text);
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        foreach (var b in bytes)
            Console.Write("{0:X} ", b);

        Console.WriteLine("\nASCII:");
        // Compatibility normalization (NFKC) maps fullwidth forms to their ASCII equivalents.
        string text2 = text.Normalize(NormalizationForm.FormKC);
        Console.WriteLine(text2);
        bytes = Encoding.UTF8.GetBytes(text2);
        foreach (var b in bytes)
            Console.Write("{0:X} ", b);
    }
}
try it on .Net Fiddle
This will print out:
Unicode:
GIFT
EF BC A7 EF BC A9 EF BC A6 EF BC B4
ASCII:
GIFT
47 49 46 54
Other languages probably have similar functions. Now you know what to look for.
The search term "convert unicode FULLWIDTH LATIN" will help you.
See also here
If you can't find such a function, you can also do the conversion yourself; after all, each fullwidth character code is just a fixed offset (0xFEE0) from the corresponding ASCII character. See examples here.
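The offset idea can be sketched in a few lines. This is a minimal Python sketch rather than .NET code, and the helper name fullwidth_to_ascii is made up for illustration. It assumes that only the fullwidth ASCII variants U+FF01 to U+FF5E (plus the ideographic space U+3000) need converting, and it prints the standard library's NFKC result next to it for comparison.

```python
import unicodedata

FULLWIDTH_OFFSET = 0xFEE0  # U+FF01 minus U+0021


def fullwidth_to_ascii(text: str) -> str:
    """Map fullwidth ASCII variants (U+FF01..U+FF5E) to plain ASCII.

    Hypothetical helper for illustration; other characters pass through unchanged.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - FULLWIDTH_OFFSET))
        elif code == 0x3000:  # ideographic (fullwidth) space
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


# Fullwidth G, I, F, T followed by three characters that should stay as they are.
title = "\uFF27\uFF29\uFF26\uFF34\u301C\u767D\u301C"
print(fullwidth_to_ascii(title))
print(unicodedata.normalize("NFKC", title))
```

The manual mapping only touches the fullwidth ASCII range, while NFKC also rewrites other compatibility characters (halfwidth katakana, ligatures and so on), so which one fits depends on how much of the rest of the title should stay untouched.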
I am learning Go and am unsure why this piece of code prints nothing:
package main

import (
	"strings"
)

func main() {
	var sb strings.Builder
	sb.WriteByte(byte(127))
	println(sb.String())
}
I would expect it to print 127
You are appending a byte to the string's buffer, not the characters "127".
Since Go strings are UTF-8, any byte value <= 127 is the same character as that value in ASCII. As you can see in this ASCII chart, 127 is the "delete" (DEL) control character. Since DEL is non-printable, println writes it but nothing visible appears.
Here's an example that does the same thing as the code in your question, but with a printable character: 90, for "Z". You can see that it does print Z.
If you want to append the characters "127" you can use sb.WriteString("127") or sb.Write([]byte("127")). If you want to append the string representation of a byte, you might want to look at using fmt.Sprintf.
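The same distinction between a raw byte and the text of a number shows up outside Go as well. Here is a minimal Python sketch, offered only as an analogy; the variable names are illustrative.

```python
raw = bytes([127])          # one byte with value 127: the ASCII DEL control character
digits = str(127).encode()  # three bytes: the characters '1', '2', '7'

print(raw.decode("ascii").isprintable())  # whether DEL counts as printable
print(list(digits))                       # byte values behind the text "127"
print(bytes([90]).decode("ascii"))        # byte 90 decoded as a character
```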
Note: I'm not an expert on character encoding so apologies if the terminology in this answer is incorrect.
In RFC 7616 (https://www.rfc-editor.org/rfc/rfc7616#page-19), on page 19, there's this example of text encoded in UTF-8:
J  U+00E4 s  U+00F8 n     D  o  e
4A C3A4   73 C3B8   6E 20 44 6F 65
How do I represent it in a Rust String?
I tried https://mothereff.in/utf-8 and doing J\00E4s\00F8nDoe but it didn't work.
"Jäsøn Doe" should work fine. Rust source files are always UTF-8 encoded and a string literal may contain any Unicode scalar value (that is, any code point except surrogates, which must not be encoded in UTF-8).
If your editor does not support UTF-8 encoding, but supports ASCII, you can use Unicode code point escapes, which are documented in the Rust reference:
A 24-bit code point escape starts with U+0075 (u) and is followed by up to six hex digits surrounded by braces U+007B ({) and U+007D (}). It denotes the Unicode code point equal to the provided hex value.
suggesting the correct syntax should be "J\u{E4}s\u{F8}n Doe".
You can refer to Rust By Example, since not everything is covered in the Rust book
(https://doc.rust-lang.org/stable/rust-by-example/std/str.html#literals-and-escapes)
You can use the syntax \u{your_unicode}
let unicode_str = String::from("J\u{00E4}s\u{00F8}n Doe");
println!("{}", unicode_str);
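As a cross-check of the RFC's byte sequence, here is a small Python sketch (Python rather than Rust, purely for verification; the variable names are illustrative). It assumes the name contains a space between "Jäsøn" and "Doe", which is what the 20 byte in the RFC example indicates.

```python
# Same text as the Rust literal, written with code point escapes.
name = "J\u00e4s\u00f8n Doe"
encoded = name.encode("utf-8")

# Print the UTF-8 bytes as space-separated hex for comparison with the RFC.
print(encoded.hex(" ").upper())
print(len(name), "characters,", len(encoded), "bytes")
```

The two non-ASCII letters take two bytes each in UTF-8, which is why the byte count is higher than the character count.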
I've been trying different ways to read this hex string, and I cannot figure out how. Each method only converts part of it. The online converters don't do it, and this is the method I tried:
function string.fromhex(str)
    return (str:gsub('..', function (cc)
        return string.char(tonumber(cc, 16))
    end))
end
packedStr = "1b4c756151000104040408001900000040746d702f66756e632e416767616d656e696f6e2e6c756100160000002b0000000000000203000000240000001e0000011e008000000000000100000000000000170000002900000000000008410000000500000006404000068040001a400000160000801e0080000a00800041c0000022408000450001004640c1005a40000016800c80458001008500000086404001868040018600420149808083458001008580020086c04201c5000000c640c001c680c001c600c301014103009c80800149808084450001004980c38245c003004600c4005c408000458002004640c40085800400860040018640400186c0440186004201c14003005c00810116000280450105004641c50280010000c00100025c8180015a01000016400080420180005e010001614000001600fd7f4580050081c005005c400001430080004700060043008000478001004300800047c003001e008000190000000405000000676d63700004050000004368617200040700000053746174757300040b000000416767616d656e696f6e000404000000746d7000040f000000636865636b65645f7374617475730004030000007477000409000000636861724e616d6500040900000066756c6c6e616d6500040a0000006775696c644e616d65000407000000737472696e670004060000006d617463680004060000006775696c6400040400000025772b0001010403000000756900040d00000064726177456c656d656e7473000407000000676d617463680004030000005f470004050000004e616d650004060000007461626c6500040900000069734d656d6265720004060000006572726f7200044e0000005468697320786d6c206973206e6f742070726f7065726c7920636f6e6669677572656420666f722074686973206368617261637465722e20506c6561736520636f6e74616374204b616575732100040900000071756575654163740000000000410000001800000018000000180000001800000018000000180000001900000019000000190000001a0000001a0000001a0000001a0000001b0000001b0000001b0000001b0000001b0000001b0000001c0000001c0000001c0000001c0000001c0000001c0000001c0000001c0000001c0000001c0000001d0000001d0000001e0000001e0000001e0000001f0000001f0000001f0000001f0000001f0000001f0000001f0000001f0000001f0000001f0000002000000020000000200000002000000020000000200000002000000021000000210000001f000000220000002400000024000000240000002500000025000000260000002600000027000000270000002900000005000
000020000006e0009000000400000001000000028666f722067656e657261746f7229002b000000370000000c00000028666f7220737461746529002b000000370000000e00000028666f7220636f6e74726f6c29002b000000370000000200000074002c000000350000000000000003000000290000002a0000002b000000010000000800000072657466756e6300010000000200000000000000"
local f = assert(io.open("unsquished.lua", "w+"));
f:write(packedStr:fromhex());
f:close()
This simply gives me a bunch of gibberish surrounded by a few readable strings.
Could someone please tell me how to convert the entirety of this string into readable format? Thank you!
Break your packedStr into pairs of hex digits:
1b = 27
4c = 76
75 = 117
61 = 97
and so forth. string.char() turns each resulting number into the single byte with that value. ASCII defines only 128 values (0 to 127), and only 95 of those are printable characters; byte values from 128 to 255 are not ASCII at all.
So most of the output will always look like gibberish. Here's what you get when printing each character separately: http://codepad.org/orM7pmAb, and that is the only "readable" output you can get this way.
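To pull out just the readable parts, one option is to decode the hex and keep only runs of printable ASCII. The Python sketch below does this for the first 41 bytes of packedStr; the minimum run length of 4 is an arbitrary choice. Note that the leading bytes 1b 4c 75 61 51 are ESC, "Lua" and 0x51, which matches the header of precompiled Lua 5.1 bytecode, so this is probably a compiled chunk rather than encoded source text. That is an inference from the header, not something stated in the question.

```python
import re

# First 41 bytes of packedStr from the question, split up for readability.
prefix_hex = (
    "1b4c7561" "51" "00" "0104040408" "00"
    "19000000"
    "40746d702f66756e632e416767616d656e696f6e2e6c7561" "00"
)
data = bytes.fromhex(prefix_hex)

# Keep only runs of at least 4 printable ASCII bytes (0x20..0x7E).
readable = re.findall(rb"[\x20-\x7e]{4,}", data)

print(data[:5])   # the header bytes
print(readable)   # the readable strings found in this prefix
```

Running the same regular expression over the whole decoded string should list the remaining embedded names and messages, while the bytes in between are instructions and length fields that have no text form.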
I'm attempting to convert a string of characters from ASCII to EBCDIC using an IBM codepage. The conversion is correct except for lower case 'a' which is converted to an unprintable character.
Here is a piece of a Groovy script running on Windows 7 that illustrates the problem.
groovy:000> letters='abcdABCD'
===> abcdABCD
groovy:000> String.format("%04x", new BigInteger(1, letters.getBytes()))
===> 6162636441424344
groovy:000> lettersx=new String(letters.getBytes('IBM500'))
===> ?éâä┴┬├─
groovy:000> String.format("%04x", new BigInteger(1, lettersx.getBytes()))
===> 3f828384c1c2c3c4
After converting to EBCDIC all the characters in the string are valid except the first one, a lower case 'a'. Try as I might I can't find any information on this problem. I've tried a number of IBM code pages with the same results (IBM01140, IBM1047 etc.)
The problem is in this expression:
new String(letters.getBytes('IBM500'))
letters.getBytes creates a byte-array containing (in hexadecimal):
81 82 83 84 C1 C2 C3 C4
but then you're immediately converting that back to a Unicode String using your platform default encoding:
new String( <byte-array> );
If you want the ordinal values of the characters in your String to be equal to the byte values, you must specify an encoding that does that, for example ISO-8859-1:
new String(letters.getBytes('IBM500'), "ISO-8859-1")
The encoding you're using does not define a character for byte 0x81, so it is replaced with ? (0x3f). You're most likely using Windows-1252.
Strings contain characters, not bytes. Java will always apply an encoding conversion when going from one to the other.
EDIT: responding to @mister270's comment:
Here's a program in Java to demonstrate:
public class Ebcdic
{
    public static void main(String[] args) throws Exception
    {
        String letters = "abcdABCD";

        // Encode the characters as EBCDIC bytes.
        byte[] ebcdic = letters.getBytes("IBM500");
        System.out.print("Ebcdic bytes:");
        for (byte b: ebcdic)
        {
            System.out.format(" %02X", b & 0xFF);
        }
        System.out.println();

        // ISO-8859-1 maps each byte to the char with the same numeric value.
        String lettersEbcdic = new String(ebcdic, "ISO-8859-1");
        System.out.print("Ebcdic bytes stored in chars:");
        for (char c: lettersEbcdic.toCharArray())
        {
            System.out.format(" %04X", (int) c);
        }
        System.out.println();

        System.out.println("Ebcdic bytes in chars printed in using my default platform encoding: " + lettersEbcdic);
    }
}
Output is:
Ebcdic bytes: 81 82 83 84 C1 C2 C3 C4
Ebcdic bytes stored in chars: 0081 0082 0083 0084 00C1 00C2 00C3 00C4
Ebcdic bytes in chars printed in using my default platform encoding: ????ÁÂÃÄ
What this shows is that:
the Ebcdic conversion into the byte-array is occurring correctly using "IBM500"
the "identity" conversion of bytes to chars using "ISO-8859-1" is occurring correctly
my system doesn't have a mapping to convert Unicode character U+0081 etc. to my default platform character encoding, so it displays each of them as ?
Java (so Groovy too) stores characters internally as Unicode. UTF-16, to be precise. If you want to encode them as Ebcdic, then they stop being characters and should no longer be held in Strings. Ebcdic is an 8-bit encoding, so each character can be stored in a byte. If you need to interface with a system that expects a particular encoding (in your case, Ebcdic), then that system really should accept bytes, not Strings; otherwise you end up with just this sort of confusion.
If you must use Strings to hold Ebcdic bytes, then you must use the ISO-8859-1 encoding whenever you use an InputStream or OutputStream (including System.out) to ensure that your Ebcdic codes are not "translated" on the way from chars back to bytes.
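The same byte-versus-character behaviour can be reproduced in a few lines. This is a minimal Python sketch, not the Java program above. It assumes Python's "cp500" codec corresponds to IBM500 and that the asker's platform default was Windows-1252, as guessed earlier in this answer.

```python
letters = "abcdABCD"

# Encode the text as EBCDIC bytes (cp500 is Python's name for IBM500).
ebcdic = letters.encode("cp500")
print(ebcdic.hex(" ").upper())

# ISO-8859-1 maps every byte value to the code point with the same number,
# so the byte values survive inside a str.
as_latin1 = ebcdic.decode("iso-8859-1")
print([f"{ord(c):04X}" for c in as_latin1])

# Windows-1252 has no character for byte 0x81, so decoding has to substitute
# a replacement character and the original byte is lost.
lossy = ebcdic.decode("cp1252", errors="replace")
print(repr(lossy[0]))

# The ISO-8859-1 route round-trips back to the original text.
print(as_latin1.encode("iso-8859-1").decode("cp500"))
```

Keeping the data as a bytes object and never decoding it with the wrong codec avoids the problem entirely, which is the same advice as passing byte arrays rather than Strings in Java.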
I started coding with Groovy today and I noticed that if I take the following code:
int aaa = "6"
log.info(aaa)
The output I get is:
54 <-- (ASCII Code for '6')
If I assign aaa with any number which is beyond the range of 0..9 I get a class cast exception.
It looks like, if the string is actually a single character, Groovy converts it to its ASCII code/hashCode.
I tried this code:
int aaa = "A"
log.info(aaa)
And the output I got was:
65 <-- (ASCII code for 'A')
What is the official reason for this?
Is it because groovy automatically changes "A" into 'A'?
As Jochen says here in the JIRA, Strings of length 1 are converted to chars if needed (and by putting it into an int variable, Groovy assumes that this is what you want to do).
If you want to accept bigger numbers, you can do:
int a = '12345' as int
And that will convert the whole number to an int.
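The two conversions involved, taking a character's code versus parsing a string as a number, are separate operations in most languages. Here is a minimal Python sketch of that distinction, as an analogy only; Python never performs the char coercion implicitly the way Groovy does.

```python
print(ord("6"))      # character code of the single character '6'
print(ord("A"))      # character code of 'A'
print(int("6"))      # numeric value parsed from the text "6"
print(int("12345"))  # parsing also works for multi-character strings
```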