Is DCPU-16 assembler 'dat' with a string supposed to generate a byte or word per character?

It's not clear to me whether
dat "Hello"
is supposed to generate 5 words or 3 (with one byte of padding)

According to this pic it is one word per 8-bit character, so
:data dat 0x170, "Hello ", 0x2e1 ...
will generate
0x0170 0x0048 0x0065 0x006c 0x006c 0x006f 0x0020 0x02e1
etc.
He tests the difference between normal chars and the special chars with
ifg a, 0xff
which would suggest that every ASCII char gets its own word.
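For illustration, here is a minimal Python sketch of that word-per-character expansion (dat_words is a hypothetical helper, not part of any real DCPU-16 assembler):
def dat_words(*operands):
    # Expand dat operands into 16-bit words: one word per character
    # for string operands, one word per numeric operand.
    words = []
    for op in operands:
        if isinstance(op, str):
            words.extend(ord(c) for c in op)   # each 8-bit char is padded to a full word
        else:
            words.append(op & 0xFFFF)
    return words

print(" ".join("0x%04x" % w for w in dat_words(0x170, "Hello ", 0x2e1)))
# prints: 0x0170 0x0048 0x0065 0x006c 0x006c 0x006f 0x0020 0x02e1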

Related

Confusion regarding UTF8 substring length

Can someone please help me deal with byte-order mark (BOM) bytes versus UTF8 characters in the first line of an XHTML file?
Using Python 3.5, I opened the XHTML file as UTF8 text:
inputTopicFile = open(inputFileName, "rt", encoding="utf8")
As shown in this hex editor, the first line of that UTF8-encoded XHTML file begins with the three-byte UTF8 BOM EF BB BF:
I wanted to remove the UTF8 BOM from what I supposed were equivalent to the three initial character positions [0:2] in the string. So I tried this:
firstLine = firstLine[3:]
Didn't work -- the characters <? were no longer present at the start of the resulting line.
So I did this experiment:
for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos]))
Which printed:
charPos 0 ==
charPos 1 == <
charPos 2 == ?
I then added .encode to that loop as follows:
for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos].encode('utf8')))
Which gave me:
charPos 0 == b'\xef\xbb\xbf'
charPos 1 == b'<'
charPos 2 == b'?'
Evidently Python 3 in some way "knows" that the 3-byte BOM is a single unit of non-character data? Meaning that one cannot process the first three 8-bit bytes(?) in the line as if they were UTF8 characters?
At this point I know that I can "trick" my code into giving me what I want by specifying firstLine = firstLine[1:]. But it seems wrong to do it that way(?)
So what's the correct way to discard the first three BOM bytes in a UTF8 string on the way to working with only the UTF8 characters?
EDIT: The solution, per the comment made by Anthony Sottile, turned out to be as simple as using encoding="utf-8-sig" when I opened the source XHTML file:
inputTopicFile = open(inputFileName, "rt", encoding="utf-8-sig")
That strips out the BOM. Voila!
As you mentioned in your edit, you can open the file with the utf-8-sig encoding, but to answer your question of why it was behaving this way:
Python 3 distinguishes between byte strings (the ones with the b prefix) and character strings (without the b prefix), and prefers to use character strings whenever possible. A byte string works with the actual bytes; a character string works with Unicode codepoints. The BOM is a single codepoint, U+FEFF, so in a regular string Python 3 will treat it as a single character (because it is a single character). When you call encode, you turn the character string into a byte string.
Thus the results you were seeing are exactly what you should have: Python 3 does know what counts as a single character, which is all it sees until you call encode.
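A short Python sketch of that distinction, using a hypothetical first line (not the asker's actual file):
line = "\ufeff<?xml version=\"1.0\"?>"  # hypothetical first line that starts with a BOM
print(repr(line[0]))                # '\ufeff' -> the BOM is one codepoint
print(line[0].encode("utf8"))       # b'\xef\xbb\xbf' -> three bytes only after encoding
print(line[1:])                     # '<?xml version="1.0"?>' -> drop one codepoint, not three
# Or let the codec strip it while reading the file:
# open(inputFileName, "rt", encoding="utf-8-sig")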

VBA Byte Array to String

Apologies if this question has been previously answered; I was unable to find an explanation. I've created a script in VBScript to encrypt a user input and match it to an already encrypted password. I ran into some issues along the way and managed to deduce the following.
I have a byte array (1 To 2) with values (16, 1). I am then defining a string with the value of the array, as per below:
Dim bytArr(1 To 2) As Byte
Dim output As String
bytArr(1) = 16
bytArr(2) = 1
output = bytArr
Debug.Print output
The output I get is Ð (Eth) ASCII Value 208. Could someone please explain how the byte array is converted to this character?
In VBA, byte arrays are special because, unlike arrays of other datatypes, a string can be directly assigned to a byte array. VBA strings are Unicode (UTF-16) strings, so when you assign a string to a byte array it stores two bytes for each character;
although the glyphs look almost the same, they are different characters, see charmap:
Ð is Unicode Character 'LATIN CAPITAL LETTER ETH' (U+00D0), shown in charmap in the DOS Western (Central) Europe character set at 0xD1 (decimal 209);
Đ is Unicode Character 'LATIN CAPITAL LETTER D WITH STROKE' (U+0110), shown in charmap in the Windows Western (Central Europe) character set at 0xD0 (decimal 208).
Put the above statements together, keeping in mind the endianness (byte order) of the computer architecture: Intel x86 processors are little-endian, so the byte array (0x10, 0x01) is the same as the Unicode string U+0110.
The two characters get conflated in a flagrant case of mojibake. For proof, use the Asc and AscW functions as follows: Debug.Print output, Asc(output), AscW(output) under different console code pages, e.g. chcp 852 and chcp 1250.
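The byte-order point can be checked outside VBA; here is a quick Python sketch (illustration only, not the VBA code):
raw = bytes([0x10, 0x01])          # bytArr(1) = 16, bytArr(2) = 1
char = raw.decode("utf-16-le")     # VBA strings are stored as little-endian UTF-16
print(char, hex(ord(char)))        # Đ 0x110 -> U+0110, not U+00D0 (Ð)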

What is the syntax to define a string constant in assembly?

I am learning assembly and I see two examples of defining a string:
msg db 'Hello, world!',0xa
what does the 0xa mean here?
message DB 'I am loving it!', 0
why do we have a 0 here?
is it a trailing null character?
why do we have 0xa in the above example but 0 here? (they don't seem to be related to the string length)
If the above examples are two ways of defining an assembly string, how could the program differentiate between them?
Thanks ahead for any help :)
Different assemblers have different syntax, but in the case of the db directive they are pretty consistent.
db is an assembly directive that defines bytes with the given values at the place where the directive is located in the source. Optionally, a label can be assigned to the directive.
The common syntax is:
[label] db n1, n2, n3, ..., nk
where n1..nk are byte-sized numbers (0..0xff) or string constants.
Since an ASCII string consists of bytes, the directive simply places those bytes in memory, exactly like the other numbers in the directive.
Example:
db 1, 2, 3, 4
will allocate 4 bytes and will fill them with the numbers 1, 2, 3 and 4
string db 'Assembly', 0, 1, 2, 3
will be compiled to:
string: 41h, 73h, 73h, 65h, 6Dh, 62h, 6Ch, 79h, 00h, 01h, 02h, 03h
The character with ASCII code 0Ah (0xa) is LF (line feed), which is used in Linux as the newline character for the console.
The character with ASCII code 00h (0) is the NUL character, which is used as an end-of-string mark in C-like languages (and probably in OS API calls, because most OSes are written in C).
Appendix 1: There are several other assembly directives similar to DB in that they define data in memory, but with other sizes. The most common are DW (define word), DD (define double word) and DQ (define quadword) for 16-, 32- and 64-bit data. However, their syntax accepts only numbers, not strings.
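A quick Python sketch of the byte layout the db examples above produce, assuming plain ASCII (illustration only):
data = b"Assembly" + bytes([0, 1, 2, 3])     # string db 'Assembly', 0, 1, 2, 3
print(" ".join("%02Xh" % b for b in data))
# prints: 41h 73h 73h 65h 6Dh 62h 6Ch 79h 00h 01h 02h 03h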
0 is a trailing null, yes. 0xa is a newline. They don’t define the same string, so that’s how you would differentiate them.
0xa stands for the hexadecimal value A, which is 10 in decimal. The line feed control character has ASCII code 10 (carriage return is 0xD hexadecimal, or 13 decimal).
Strings are commonly terminated by a NUL character to mark their end.

Convert two chars at a time from a string to hex

I have the following piece of code, which converts one char at a time to hex. I want to convert two chars at a time, i.e. 99ab should be treated as '99', 'ab' and converted to its equivalent hex.
Current implementation is as follows
$final =~ s/(.)/sprintf("0x%X ",ord($1))/eg;
chop($final);
TIA
Your question doesn't make much sense. Hex is a string representation of a number. You can't convert a string to hex.
You can convert individual characters of a string to hex, since characters are merely numbers, but that's clearly not what you want. (That's what your code does.)
I think you are trying to convert from hex to chars.
6 chars "6a6b0a" ⇒ 3 chars "\x6a\x6b\x0a"
If so, you can use your choice of
$final =~ s/(..)/ chr(hex($1)) /seg;
or
$final = pack 'H*', $final;
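For comparison, the same hex-pairs-to-bytes conversion sketched in Python (not part of the Perl answer; the input is the example from above):
final = "6a6b0a"             # hypothetical input
print(bytes.fromhex(final))  # b'jk\n', i.e. "\x6a\x6b\x0a"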
The other possibility I can think of is that you want to unpack 16-bit integers.
4 chars "6a6b" ⇒ 13 chars "0x6136 0x6236" (LE byte order)
-or-
4 chars "6a6b" ⇒ 13 chars "0x3661 0x3662" (BE byte order)
If so, you can use
my @nums = unpack 'S<*', $packed; # For 16-bit ints, LE byte order
-or-
my @nums = unpack 'S>*', $packed; # For 16-bit ints, BE byte order
my $final = join ' ', map sprintf('0x%04X', $_), @nums;
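A rough Python equivalent of that 16-bit unpacking, assuming the same 4-character input (sketch only):
import struct
packed = b"6a6b"                             # the raw bytes of the string
count = len(packed) // 2
le = struct.unpack("<%dH" % count, packed)   # little-endian 16-bit ints
be = struct.unpack(">%dH" % count, packed)   # big-endian 16-bit ints
print(" ".join("0x%04X" % n for n in le))    # 0x6136 0x6236
print(" ".join("0x%04X" % n for n in be))    # 0x3661 0x3662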

What is the size of a character packed integer in comparison with its original size?

Suppose that I is the size of an integer T.
What is the maximum size of a string S that contains the digits of T as characters?
For example:
T = 12345
S = '12345'
floor(log10(T)) + 1 will give you the size (in characters) of the string S
Actually, the basic equation only works for ASCII characters in ASCII or UTF-8 encoding: one byte per character. In UTF-16 these same characters would be encoded as 2 bytes each, and in UTF-32 as 4 bytes each. This matters depending on the programming language and runtime; .NET strings, for example, are stored and encoded in UTF-16.
So it's actually (floor(log10(T)) + 1) * sizeof(char)
The answer is floor(log10(I)) + 1.
Thus for an int I of 1000, log10(I) is 3, which gives 3 + 1 = 4.
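A quick Python check of the formula and of the encoding sizes mentioned above (sketch only):
import math
T = 12345
digits = math.floor(math.log10(T)) + 1    # 5 characters for a positive integer
S = str(T)
print(digits, len(S))                     # 5 5
# The byte size then depends on the encoding of S:
print(len(S.encode("utf-8")), len(S.encode("utf-16-le")), len(S.encode("utf-32-le")))
# prints: 5 10 20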
