How to construct a String instance from a sequence of integers?

I would like to create a test string from Unicode code points, something like this:
65 asCharacter asString,
66 asCharacter asString,
67 asCharacter asString,
65 asCharacter asString,
769 asCharacter asString
Or
String with: 65 asCharacter
with: 66 asCharacter
with: 67 asCharacter
with: 65 asCharacter
with: 769 asCharacter
This works, but I am looking for a way to convert an array of integer values to an instance of class String:
#(65 66 67 65 769)
Is there a built-in method for this?
I am looking for an answer like the one to "What is the correct way to test Unicode support in a Smalltalk implementation?", but for Strings.

There are many ways.
1. #streamContents:
Use a stream if you are doing larger string concatenation/building, as it is faster. If you are just concatenating a couple of strings, use whatever is more readable.
String streamContents: [ :aStream |
#(65 66 67 65 769) do: [ :each |
aStream nextPut: each asCharacter
]
]
or
String streamContents: [ :aStream |
aStream nextPutAll: (#(65 66 67 65 769) collect: #asCharacter)
]
2. #withAll:
String withAll: (#(65 66 67 65 769) collect: #asCharacter)
3. #collect:as: String
#(65 66 67 65 769) collect: #asCharacter as: String
4. #joinUsing: the characters
(#(65 66 67 65 769) collect: #asCharacter) joinUsing: ''
Note:
At least in Pharo, you can use either [ :each | each selector ] or simply #selector. I find the latter more readable for simple things, but that may be personal preference.

Construct the String instance with #withAll:
String withAll:
(#(65 66 67 65 769) collect: [:codepoint | codepoint asCharacter])

Here is a "low level" variant:
codepoints := #(65 66 67 65 769).
string := WideString new: codepoints size.
codepoints withIndexDo: [:cp :i | string wordAt: i put: cp].
^string

Please consider the following an awfully hackish, undocumented, unsupported and thus absolutely wrong way to do it!
You would think that you cannot mix characters and integers that easily. Err, you can:
'' asWideString copyReplaceFrom: 1 to: 0 with: (#(65 66 67 65 769) as: WordArray).
Indeed, this goes through a primitive that doesn't really check the class, only that both receiver and argument are VariableWord classes...
For the very same reason this can work (it depends on the WriteStream implementation, so let's call it fragile):
^'' asWideString writeStream
nextPutAll: (#(65 66 67 65 769) as: WordArray);
contents
The same applies to ByteString and ByteArray.
And of course, in the same vein, let's not forget the most convoluted way to do it, BitBlt:
^((BitBlt toForm: (Form new hackBits: (WideString new: 5)))
sourceForm: (Form new hackBits: (#(65 66 67 65 769) as: WordArray));
combinationRule: Form over;
copyBits;
destForm) bits
We again exploit the WordArray nature of WideString to serve as the container for the bits of a Form (a bitmap).
Hopefully this answer won't get too many votes, it doesn't deserve it!

Related

How to delimit file with "\t\n" on a Mac

I have a document whose lines are separated by "\t\n". Records are separated either by "\t", OR by "\n".
Normally, this should be a straightforward awk query:
BEGIN {
RS='\t\n';
}
{
print;
print "Next entry:";
}
However, on a Mac, regular expressions do not seem to be supported (maybe I'm not doing something right?) So I tried, RS="\t\n"; however, this is interpreted as RS='\t | \n'. Similar problems running awk from the command line:
awk 1 RS='\t\n' ORS='abc' input > output
replaces the \t's, but leaves the \n's be.
Next try: using tr. This obviously fails for a sequence of more than one character, since \t and \n are also used individually in the rows.
Next:
sed -e '/\t\n/s//NextEntry:/g' input > output
However, this doesn't work. Entering any ASCII character sequence instead of \t\n works.
I read the manual. It says that \t is not supported in sed strings. Fair enough:
sed -e '/\x9\xa/s//abc/' input > output
Still doesn't work. Idea: use tr to replace \t and \n by characters unused in the input file, use sed to change them to what I want, and then tr to change the remaining characters back to what they should be.
tr: Illegal byte sequence
It turns out that the f6 character makes tr fail completely.
I went through the suggestions in "Sed not recognizing \t instead it is treating it as 't' why?". That might work for replacing output strings (except the "Pasting tab into command prompt via CTRL+V" suggestion; the shell just rejected that paste), but did not seem to help in my case.
Maybe it's because it's a Mac? Maybe it's because that's the text I'm looking for, not replacing with? Maybe it's the combination with \n?
Any other suggestions?
UPDATE:
I found the thread "How can I replace a newline (\n) using sed?". Apparently, I am unable even to replace a \n with the string "abc" using the suggestions in that thread.
EDIT: Hex head of source file:
5a 20 4e 4f 09 0a 41 53 20 4f 46 20 30 31 2d 30
34 2d 30 35 20 45 4d 50 4c 4f 59 45 45 0a 47 52
4f 55 50 09 48 49 52 45 20 44 41 54 45 09 53 41
4c 41 52 59 09 4a 4f 42 20 54 49 54 4c 45 09 0a
4a 4f 42 20 4c 45 56 45 4c 0a 53 45 52 49 45 53
09 41 50 50 54 20 54 59 50 45 09 0a 50 41 59 20
53 54 41 54 55 53 0a f6
Unfortunately, BSD awk, as used on macOS, doesn't support multi-character record separators (RS) at all (in line with POSIX); only a single, literal character is supported.
BSD sed, as used on macOS, supports only \n in regexes; any other escapes, including hex ones (e.g., \x09), are not supported.
See this answer of mine for a comprehensive comparison of GNU and BSD sed.
Assuming that your sed command works in principle, you can use an ANSI C-quoted string ($'\t') to splice a literal tab character into your sed script (this assumes bash (the macOS default shell before Catalina), ksh, or zsh):
sed -e ':a' -e '$!{N;ba' -e '}' -e '/'$'\t''\n/s//NextEntry:/g'
Note that, in order to replace newlines, you must instruct sed to read the entire file into memory first, which is what -e ':a' -e '$!{N;ba' -e '}' does (the BSD Sed-compatible form of the common GNU sed idiom :a;$!{N;ba}).
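If sed stays awkward, the same whole-file replacement can be sketched in Python. This is only an illustration, not part of the answer above: the function name replace_tab_newline is hypothetical, and the sample bytes are copied from the start of the hex dump in the question. Working on bytes rather than text means the stray f6 byte that made tr fail is passed through untouched.

```python
def replace_tab_newline(data: bytes, replacement: bytes = b"NextEntry:") -> bytes:
    """Replace every tab+newline pair; lone tabs and lone newlines are left alone."""
    return data.replace(b"\t\n", replacement)


# Sample taken from the beginning of the hex dump in the question,
# with the trailing f6 byte appended to show that it survives.
sample = b"Z NO\t\nAS OF 01-04-05 EMPLOYEE\nGROUP\tHIRE DATE\tSALARY\tJOB TITLE\t\n\xf6"
result = replace_tab_newline(sample)
print(result)
```

For a real file, the bytes would be read with open(path, "rb") and written back with open(path, "wb"); like the sed idiom, this holds the whole file in memory.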

Variable Delimited Text in Excel

I have a string of text that I need delimited:
New Utilizers 75 28 9 66 66 79 74 69 29 21 84 75 675 20,511 45,925
Ordinarily I would just use a space delimiter and I'd be all set, but this splits the "New Utilizers" string into two columns instead of one. Is there a way to start delimiting after a certain point, in this case after "New Utilizers"?
Can you choose another delimiter, say $ or ;?
With $, for example:
New Utilizers$75$28$9$66$66$79$74$69$29$21$84$75$675$20,511$45,925
Then split by $.

Python3 ValueError: too many values to unpack with for loop

I've seen this same question asked around, but it's always with something like:
val1, val2 = input("Enter 2 numbers")
My problem is different.
I have two strings, str1 and str2. I want to compare them byte-by-byte such that the output would look something like this:
str1 str2
0A 0A
20 20
41 41
42 42
43 43
31 31
32 32
33 33
2E 21
So, I've tried various syntaxes to compare them, but it always ends in the same error. Here's one of my latest attempts:
#!/usr/bin/python3
for c1, c2 in (tuple("\n ABC123."), tuple("\n ABC123!")):
print("%02X %02X" % (ord(c1), ord(c2)))
And the error:
$ python3 test.py
Traceback (most recent call last):
File "test.py", line 1, in <module>
ValueError: too many values to unpack (expected 2)
Of course, this line:
for c1, c2 in (tuple("\n ABC123."), tuple("\n ABC123!")):
has gone through many different iterations:
for c1, c2 in "asdf", "asdf"
for c1, c2 in list("asdf"), list("asdf")
for c1, c2 in tuple("asdf"), tuple("asdf")
for c1, c2 in (tuple("asdf"), tuple("asdf"))
for (c1, c2) in (tuple("asdf"), tuple("asdf"))
All of which threw the same error.
I don't think I quite understand Python's zipping/unzipping syntax, and I'm just about ready to resort to hacking a low-level solution together.
Any ideas?
Okay, so I ended up doing this:
for char in zip(s1,s2):
print("%02X %02X" % ( ord(char[0]), ord(char[1]) ))
However, I notice that if the two strings have differing lengths, the longer one gets truncated at the end. For example:
s1 = "\n ABC123."
s2 = "\n ABC123!."
0A 0A
20 20
41 41
42 42
43 43
31 31
32 32
33 33
2E 21
# !! <-- There is no "2E"
So I guess I could work around that by printing the len() for each string, and then padding the shorter one to meet the longer one.
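For reference, the original loop fails because `for c1, c2 in (a, b)` iterates over a two-element tuple and tries to unpack each whole 9-character tuple into just two names. zip pairs the characters up instead, but it stops at the shorter input. A minimal sketch of an alternative to manual padding uses itertools.zip_longest from the standard library, which fills the missing positions with None. The helper name hex_pairs and the "--" placeholder are choices made for this illustration.

```python
from itertools import zip_longest


def hex_pairs(s1, s2):
    """Return one 'XX YY' line per position; '--' marks a missing character."""
    lines = []
    for c1, c2 in zip_longest(s1, s2):
        # zip_longest yields None once the shorter string is exhausted.
        left = "%02X" % ord(c1) if c1 is not None else "--"
        right = "%02X" % ord(c2) if c2 is not None else "--"
        lines.append(left + " " + right)
    return lines


for line in hex_pairs("\n ABC123.", "\n ABC123!."):
    print(line)
```

With the two strings from the example above, the last line printed is "-- 2E", so the extra character in the longer string is no longer dropped.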

How to search, replace specific hex code in automated way

I have a 100M row file that has some encoding problems -- was "originally" EBCDIC, saved as US-ASCII, now UTF-8. I don't know much more about its heritage, sorry -- I've just been asked to analyze the content.
The "cents" character from EBCDIC is "hidden" in this file in random places, causing all sorts of errors. Here is more on this bugger: cents character in hex
Converting this file using iconv -f foo -t UTF-8 -c is not working -- the cents character prevails.
When I use hex editor, I can find the appearance of 0xC2 0xA2 (c2a2). But in a BIG file, this isn't ideal. Sed doesn't work at hex level, so... Not sure about tr -- I only really use it for carriage return / new line.
What linux utility / command can I use to find and delete this character reasonably quickly on very big files?
2 parts:
1 -- utility / command to find / count the number of these occurrences (octal \242)
2 -- command to replace (this works tr '\242' ' ' < source > output )
How the text appears on my ubuntu terminal:
1019EQ?IT DEPT GENERATED
With xxd, how it looks at hex level (ascii to the side looks the same as above):
0000000: 3130 3139 4551 a249 5420 4445 5054 2047 454e 4552 4154 4544 0d0a
With xxd, how it looks with "show ebcdic" -- here, just showing the ebcdic from side:
......s.....&....+........
So hex "a2" is the culprit. I'm now trying xxd -E foo | grep a2 to count the instances up.
Adding output from od -ctx1, rather than xxd, for those interested:
0000000 1 0 1 9 E Q 242 I T D E P T G
31 30 31 39 45 51 a2 49 54 20 44 45 50 54 20 47
0000020 E N E R A T E D \r \n
45 4e 45 52 41 54 45 44 0d 0a
When you say the file was converted, what do you mean? Do you mean the binary file was simply dumped from an IBM 360 to another ASCII-based computer, or was the file itself converted to ASCII when it was transferred?
The question is whether the file is actually in a well encoded state or not. The other question is how do you want the file encoded?
On my Mac (which uses UTF-8 by default, just like Linux systems), I have no problem using sed to get rid of the ¢ character:
Here's my file:
$ cat test.txt
This is a test --¢-- TEST TEST
$ od -ctx1 test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - ¢ ** - - T E S T T E S T \n
2d c2 a2 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000040
You can see that cat has no problems printing out that ¢ character. And, you can see in the od dump the c2a2 encoding of the ¢ character.
$ sed 's/¢/$/g' test.txt > new_test.txt
$ cat new_test.txt
This is a test --$-- TEST TEST
$ od -ctx1 new_test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - $ - - T E S T T E S T \n
2d 24 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000037
Here, my sed has no problem changing that ¢ into a $ sign. The dump now shows that this test file is equivalent to a strictly ASCII-encoded file. The two-byte encoding of ¢ (c2 a2) is now a nice clean single-byte $ (24).
It looks like sed can handle your issue.
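The xxd dump in the question shows a lone a2 byte rather than the UTF-8 pair c2 a2, so a byte-level pass may be the safer route there. The following is a minimal sketch, not a tested tool for 100M-row files: the function name scrub_byte and its parameters are hypothetical. It counts and replaces the byte in one pass, reading in chunks so the whole file never has to fit in memory. Because the target is a single byte, a match can never straddle a chunk boundary.

```python
import io


def scrub_byte(src, dst, target=b"\xa2", replacement=b" ", chunk_size=1 << 20):
    """Copy src to dst, replacing each target byte; return how many were replaced."""
    count = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        count += chunk.count(target)
        dst.write(chunk.replace(target, replacement))
    return count


# In-memory demo using the bytes from the xxd dump in the question.
source = io.BytesIO(b"1019EQ\xa2IT DEPT GENERATED\r\n")
output = io.BytesIO()
print(scrub_byte(source, output), output.getvalue())
```

For real files, src and dst would be opened with open(path, "rb") and open(path, "wb"). If the file really contains the two-byte sequence c2 a2, this single-byte approach would leave the c2 behind, so the target would need to change and chunk boundaries would need handling.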
If you want to use this file on a Windows system, you can convert the file to the standard Windows Code Page 1252:
$ iconv -f utf8 -t cp1252 test.txt > new_test.txt
$ cat new_test.txt
This is a test --?-- TEST TEST
$ od -ctx1 new_test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - 242 - - T E S T T E S T \n
2d a2 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000037
Here's the file now in code page 1252, just the way Windows likes it! Note that the ¢ is now a single byte: octal 242 (hex a2).
So, what exactly is the issue? Do you need the file in pure 7-bit ASCII? Do you need the file encoded so that Windows machines can work on it? Are you having problems entering the ¢ character?
Let me know. I'm not from the government, and yet I'm here to help you.

Haskell doubt: how to transform a Matrix represented as: [String] to a Matrix Represented as [[Int]]?

I'm trying to solve Problem 11 of Project Euler in Haskell. I almost did it, but right now I'm stuck: I want to transform a matrix represented as [String] into a matrix represented as [[Int]].
I drew the matrices:
What I want:
"08 02 22 97 38 15 00 40                    [ ["08","02","22","97","38","15","00","40"],               [[08,02,22,97,38,15,00,40]
 49 49 99 40 17 81 18 57   map words lines    ["49","49","99","40","17","81","18","57"],      ??a       [49,49,99,40,17,81,18,57]
 81 49 31 73 55 79 14 29   ---------->        ["81","49","31","73","55","79","14","29"],   --------->   [81,49,31,73,55,79,14,29]
 52 70 95 23 04 60 11 42                      ["52","70","95","23","04","60","11","42"],                [52,70,95,23,04,60,11,42]
 22 31 16 71 51 67 63 89                      ["22","31","16","71","51","67","63","89"],                [22,31,16,71,51,67,63,89]
 24 47 32 60 99 03 45 02"                     ["24","47","32","60","99","03","45","02"] ]               [24,47,32,60,99,03,45,02]]
I'm stuck on the last transformation (??a).
Out of curiosity (and to learn), I also want to know how to build a matrix of digits:
Input:
"123456789                  [ "123456789"                  [ [1,2,3,4,5,6,7,8,9]
 124834924      lines         "124834924"        ??b         [1,2,4,8,3,4,9,2,4]
 328423423    --------->      "328423423"     --------->     [3,2,8,4,2,3,4,2,3]
 334243423                    "334243423"                    [3,3,4,2,4,3,4,2,3]
 932402343"                   "932402343" ]                  [9,3,2,4,0,2,3,4,3] ]
What is the best way to do (??a) and (??b)?
What you want is the read function:
read :: (Read a) => String -> a
This thoughtfully parses a string into whatever you're expecting (as long as it's an instance of the class Read, but fortunately Int is such).
So just map that over the words, like so:
parseMatrix :: (Read a) => String -> [[a]]
parseMatrix s = map (map read . words) $ lines s
Just use that in a context that expects [[Int]] and Haskell's type inference will take it from there.
To get the digits, just remember that String is actually just [Char]. Instead of using words, map a function that turns each Char into a single-element list (that is, a one-character String that read can parse); everything else is the same.
