Converting Text to HTML In D - string

I'm trying to figure out the best way of encoding text (either 8-bit ubyte[] or string) to its HTML counterpart.
My proposal so far is to use a lookup table to map the 8-bit characters that have special meaning in HTML
string[256] lutLatin1ToHTML;
lutLatin1ToHTML[0x22] = "&quot;";
lutLatin1ToHTML[0x26] = "&amp;";
...
using the function
pure string toHTML(in string src,
ref in string[256] lut) {
return src.map!(a => (lut[a] ? lut[a] : new string(a))).reduce!((a, b) => a ~ b) ;
}
It almost works, except that I don't know how to create a string from a ubyte (the no-translation case).
I tried
writeln(new string('a'));
but it prints garbage and I don't know why.
For more details on HTML encoding see https://en.wikipedia.org/wiki/Character_entity_reference
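As a minimal Python sketch of the lookup-table idea from the question (the table contents and the name to_html are illustrative assumptions, not part of any library): each Latin-1 byte is either replaced by its entity or passed through as the character with the same code point.

```python
# Hypothetical table: only a few entries, keyed by Latin-1 byte value.
LUT = {0x22: "&quot;", 0x26: "&amp;", 0x3C: "&lt;", 0x3E: "&gt;"}

def to_html(src: bytes) -> str:
    """Map each Latin-1 byte to its entity, or to the character itself."""
    # chr(b) is valid here because Latin-1 bytes equal their Unicode code points.
    return "".join(LUT.get(b, chr(b)) for b in src)

print(to_html('caf\xe9 < "bar" & baz'.encode("latin-1")))
```

The pass-through branch (chr(b)) is the "no-translation case" the question asks about.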

You can make a string from a ubyte most easily by doing "" ~ b, for example:
ubyte b = 65;
string a = "" ~ b;
writeln(a); // prints A
BTW, if you want to do a lot of html stuff, my dom.d and characterencodings.d might be useful:
https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff
It has a html parser, dom manipulation functions similar to javascript (e.g. ele.querySelector(), getElementById, ele.innerHTML, ele.innerText, etc.), conversion from a few different character encodings, including latin1, and outputs ascii safe html with all special and unicode characters properly encoded.
assert(htmlEntitiesEncode("foo < bar") == "foo &lt; bar");
stuff like that.

In this case Adam's solution works just fine, of course. (It takes advantage of the fact that ubyte is implicitly convertible to char, which is then appended to the immutable(char)[] array for which string is an alias.)
In general the safe way of converting types is to use std.conv.
import std.stdio, std.conv;
void main() {
// utf-8
char cc = 'a';
string s1 = text(cc);
string s2 = to!string(cc);
writefln("%c %s %s", cc, s1, s2);
// utf-16
wchar wc = 'a';
wstring s3 = wtext(wc);
wstring s4 = to!wstring(wc);
writefln("%c %s %s", wc, s3, s4);
// utf-32
dchar dc = 'a';
dstring s5 = dtext(dc);
dstring s6 = to!dstring(dc);
writefln("%c %s %s", dc, s5, s6);
ubyte b = 65;
string a = to!string(b); // note: ubyte is an integer type, so this yields "65", not "A"
}
NB. text() is actually intended for processing multiple arguments, but is conveniently short.
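The difference between treating a byte as a character and converting it as a number matters for the no-translation case. A small Python analogue, offered only as an illustration (Python is not the language of the answers above):

```python
b = 65
as_char = chr(b)     # the character whose code point is 65
as_number = str(b)   # the decimal digits of the value
print(as_char, as_number)
```

Here as_char corresponds to the "" ~ b approach, while as_number corresponds to a numeric conversion.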

Related

How to convert saved text file encoding to UTF8?

Recently I saved a text file on my computer, but when I opened it again I saw strings like:
"˜ÌÇí ÍÑÝã ÚÌíÈå¿"
Is it possible to convert it back to the original text (UTF-8)?
I tried this code, but it doesn't work:
string tempStr="˜ÌÇí ÍÑÝã ÚÌíÈå¿";
Encoding ANSI = Encoding.GetEncoding(1256);
byte[] ansiBytes = ANSI.GetBytes(tempStr);
byte[] utf8Bytes = Encoding.Convert(ANSI, Encoding.UTF8, ansiBytes);
String utf8String = Encoding.UTF8.GetString(utf8Bytes);
You can use something like:
string str = Encoding.GetEncoding(1256).GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(tempStr))
The string wasn't really decoded... Its bytes were simply "enlarged" to char, with something like:
byte[] bytes = ...
char[] chars = new char[bytes.Length];
for (int i = 0; i < bytes.Length; i++)
{
chars[i] = (char)bytes[i]; // C# needs an explicit cast from byte to char
}
string str = new string(chars);
Now, this transformation is the same one that code page ISO-8859-1 performs. So I could either have done the reverse by hand or used that code page to do it for me; I chose the latter.
Encoding.GetEncoding("iso-8859-1").GetBytes(tempStr)
This gave me the original byte[].
Then I ran some tests, and it seems that the original text wasn't UTF-8 but code page 1256, which is an Arabic code page. So I decoded the bytes with it:
string str = Encoding.GetEncoding(1256).GetString(...);
The only oddity is that the ˜ doesn't seem to be part of the original string.
There is another possibility:
string str = Encoding.GetEncoding(1256).GetString(Encoding.GetEncoding(1252).GetBytes(tempStr));
Code page 1252 is the code page used in the USA and in a large part of Europe. If your Windows is configured for English, there is a good chance it uses 1252 as the default code page. The result is slightly different from the one obtained with iso-8859-1.
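The same round trip can be sketched in Python. The sample text is an assumption chosen for this sketch; the point is only that re-encoding with the wrongly applied code page recovers the original bytes, which can then be decoded with the right one.

```python
original = "\u0633\u0644\u0627\u0645"        # sample Arabic text (assumed for this sketch)
raw = original.encode("cp1256")              # the bytes as stored in the file
garbled = raw.decode("latin-1")              # wrong decoding: each byte widened to one char

# Undo the wrong decoding, then decode with the correct code page.
recovered = garbled.encode("latin-1").decode("cp1256")
print(recovered == original)
```

Python's latin-1 codec maps all 256 byte values, so this round trip never loses a byte. Its cp1252 codec leaves a few byte values undefined, which is one reason the two code pages can give different results.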

D: how to remove last char in string?

I need to remove the last char in a string; in my case it's a comma (","):
foreach(line; fcontent.splitLines)
{
string row = line.split.map!(a=>format("'%s', ", a)).join;
writeln(row.chop.chop);
}
I have found only one way: calling chop twice. The first call removes \r\n and the second removes the last char.
Is there a better way?
import std.array;
if (!row.empty)
row.popBack();
As usually happens with string processing, it depends on how much you care about Unicode.
If you only work with ASCII it is very simple:
import std.encoding;
// no "nice" ASCII literals, D really encourages Unicode
auto str1 = cast(AsciiString) "abcde";
str1 = str1[0 .. $-1]; // get slice of everything but last byte
auto str2 = cast(AsciiString) "abcde\n\r";
str2 = str2[0 .. $-3]; // same principle
If "last char" actually means a Unicode code point (http://unicode.org/glossary/#code_point), it gets a bit more complicated. The easy way is to just rely on D's automatic decoding and algorithms:
import std.range, std.stdio;
auto range = "кириллица".retro.drop(1).retro();
writeln(range);
Here retro (http://dlang.org/phobos/std_range.html#.retro) is a lazy reverse iteration function. It takes any bidirectional range (a Unicode string is a valid range) and returns a wrapper that iterates it backwards.
drop (http://dlang.org/phobos/std_range.html#.drop) pops the given number of range elements (here one) and ignores them. Calling retro again reverses the iteration order back to normal, but now with the last element dropped.
The reason this differs from the ASCII version is the nature of Unicode (specifically UTF-8, which D defaults to): it does not allow random access to an arbitrary code point. You actually need to decode code points one by one to reach a desired index. Fortunately, D takes care of all the decoding for you, hiding it behind a convenient range interface.
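The byte versus code point distinction can be illustrated with a short Python sketch (Python strings are indexed by code point, so this is an analogue of the behaviour described above, not a translation of the D code):

```python
s = "кириллица"
print(s[:-1])                 # slicing a str drops the last code point

data = s.encode("utf-8")      # every Cyrillic letter here takes two bytes
truncated = data[:-1]         # drops only one byte of the last letter
try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("slicing bytes split a multi-byte character")
```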
For those who want even more Unicode correctness, it should be possible to operate on graphemes (http://unicode.org/glossary/#grapheme):
import std.range, std.uni, std.stdio;
auto range = "abcde".byGrapheme.retro.drop(1).retro();
writeln(range);
Sadly, it looks like this specific pattern is not currently supported because of a bug in Phobos. I have created an issue about it: https://issues.dlang.org/show_bug.cgi?id=14394
NOTE: Updated my answer to be a bit cleaner and removed the lambda function in 'map!' as it was a little ugly.
import std.algorithm, std.stdio;
import std.string;
void main(){
string fcontent = "I am a test\nFile\nwith some,\nCommas here and\nthere,\n";
auto data = fcontent
.splitLines
.map!(a => a.replaceLast(","))
.join("\n");
writefln("%s", data);
}
auto replaceLast(string line, string toReplace){
auto o = line.lastIndexOf(toReplace);
return o >= 0 ? line[0..o] : line;
}
module main;
import std.stdio : writeln;
import std.string : lineSplitter, join;
import std.algorithm : map, splitter, each;
enum fcontent = "some text\r\nnext line\r\n";
void main()
{
fcontent.lineSplitter.map!(a=>a.splitter(' ')
.map!(b=>"'" ~ b ~ "'")
.join(", "))
.each!writeln;
}
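The idea behind the snippet above, joining with ", " so that no trailing separator is ever produced, can be sketched in Python as follows (quote_row is a hypothetical helper name):

```python
def quote_row(line: str) -> str:
    # join() puts ", " only between items, so there is nothing to chop afterwards.
    return ", ".join(f"'{word}'" for word in line.split())

for line in "some text\r\nnext line\r\n".splitlines():
    print(quote_row(line))
```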
Take a look, I use this extension method to replace any last character or sub-string, for example:
string testStr = "Happy holiday!";
Console.Write(testStr.ReplaceVeryLast("holiday!", "Easter!"));
public static class StringExtensions
{
public static string ReplaceVeryLast(this string sStr, string sSearch, string sReplace = "")
{
int pos = 0;
sStr = sStr.Trim();
do
{
pos = sStr.LastIndexOf(sSearch, StringComparison.CurrentCultureIgnoreCase);
if (pos >= 0 && pos + sSearch.Length == sStr.Length)
sStr = sStr.Substring(0, pos) + sReplace;
} while (pos == (sStr.Length - sSearch.Length + 1));
return sStr;
}
}

Defining a custom PURE Swift Character Set

So, using Foundation you can use NSCharacterSet to define character sets and test character membership in Strings. I would like to do so without Cocoa classes, but in a purely Swift manner.
Ideally, code could be used like so:
struct ReservedCharacters: CharacterSet {
characters "!", "#", "$", "&", ... etc.
func isMember(character: Character) -> Bool
func encodeCharacter(parameters) { accepts a closure }
func decodeCharacter(parameters) { accepts a closure }
}
This is probably a very loaded question. But I'd like to see what you Swifters think.
You can already test for membership in a character set by initializing a String and using the contains global function:
let vowels = "aeiou"
let isVowel = contains(vowels, "i") // isVowel == true
As far as your encode and decode functions go, are you just trying to get the 8-bit or 16-bit encodings for the Character? If that is the case, then just convert them to a String and access their utf8 or utf16 properties:
let char = Character("c")
let a = Array(String(char).utf8)
println(a) // This prints [99]
Decode would take a little more work, but I know there's a function for it...
Edit: This will replace a character from a characterSet with '%' followed by the character's hex value:
let encode: String -> String = { s in
reduce(String(s).unicodeScalars, "") { x, y in
switch contains(charSet, Character(y)) {
case true:
return x + "%" + String(y.value, radix: 16)
default:
return x + String(y)
}
}
}
let badURL = "http://why won't this work.com"
let encoded = encode(badURL)
println(encoded) // prints "http://why%20won%27t%20this%20work.com"
Decoding, again, is a bit more challenging, but I'm sure it can be done...
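For comparison, a minimal Python sketch of both directions. The reserved set is an illustrative assumption, not a complete list, and encode is a hypothetical helper; decoding uses urllib.parse.unquote from the standard library.

```python
from urllib.parse import unquote

RESERVED = set(" '!#$&")   # illustrative reserved set (assumption)

def encode(text: str) -> str:
    out = []
    for ch in text:
        if ch in RESERVED:
            # Escape every UTF-8 byte of the character as %XX.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)

encoded = encode("why won't this work")
print(encoded)            # why%20won%27t%20this%20work
print(unquote(encoded))   # why won't this work
```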

How to check whether a string is Base64 encoded or not

I want to decode a Base64 encoded string, then store it in my database. If the input is not Base64 encoded, I need to throw an error.
How can I check if a string is Base64 encoded?
You can use the following regular expression to check if a string constitutes a valid base64 encoding:
^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$
In base64 encoding, the character set is [A-Z, a-z, 0-9, and + /]. If the final group has fewer than 4 characters, the string is padded with '=' characters.
^([A-Za-z0-9+/]{4})* means the string starts with 0 or more base64 groups.
([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$ means the string ends in one of three forms: [A-Za-z0-9+/]{4}, [A-Za-z0-9+/]{3}= or [A-Za-z0-9+/]{2}==.
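A short Python sketch of the same pattern, using re.fullmatch so that the ^ and $ anchors are not needed (looks_like_base64 is a hypothetical name):

```python
import re

BASE64_RE = re.compile(r"([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?")

def looks_like_base64(s: str) -> bool:
    # fullmatch requires the whole string to match the pattern.
    return BASE64_RE.fullmatch(s) is not None

for candidate in ("QUJD", "QUI=", "QQ==", "ABC", "QUJD="):
    print(candidate, looks_like_base64(candidate))
```

Note that this form of the pattern also accepts the empty string, since every part of it is optional.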
If you are using Java, you can actually use commons-codec library
import org.apache.commons.codec.binary.Base64;
String stringToBeChecked = "...";
boolean isBase64 = Base64.isArrayByteBase64(stringToBeChecked.getBytes());
[UPDATE 1] Deprecation Notice
Use instead
Base64.isBase64(value);
/**
* Tests a given byte array to see if it contains only valid characters within the Base64 alphabet. Currently the
* method treats whitespace as valid.
*
* @param arrayOctet
* byte array to test
* @return {@code true} if all bytes are valid characters in the Base64 alphabet or if the byte array is empty;
* {@code false}, otherwise
* @deprecated 1.5 Use {@link #isBase64(byte[])}, will be removed in 2.0.
*/
@Deprecated
public static boolean isArrayByteBase64(final byte[] arrayOctet) {
return isBase64(arrayOctet);
}
Well you can:
Check that the length is a multiple of 4 characters
Check that every character is in the set A-Z, a-z, 0-9, +, / except for padding at the end which is 0, 1 or 2 '=' characters
If you're expecting that it will be base64, then you can probably just use whatever library is available on your platform to try to decode it to a byte array, throwing an exception if it's not valid base 64. That depends on your platform, of course.
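The "just try to decode it" approach might look like this in Python. This is a sketch under the assumption that the standard alphabet with padding is expected; is_strict_base64 is a hypothetical name.

```python
import base64
import binascii

def is_strict_base64(s: str) -> bool:
    try:
        # validate=True rejects characters outside the alphabet instead of skipping them.
        base64.b64decode(s, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False

for candidate in ("QUJD", "ABC", "QU JD"):
    print(candidate, is_strict_base64(candidate))
```

Without validate=True, b64decode silently discards characters that are not in the alphabet, so some malformed inputs would be accepted.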
As of Java 8, you can simply use java.util.Base64 to try and decode the string:
String someString = "...";
Base64.Decoder decoder = Base64.getDecoder();
try {
decoder.decode(someString);
} catch(IllegalArgumentException iae) {
// That string wasn't valid.
}
Try like this for PHP5
//where $json is some data that can be base64 encoded
$json=some_data;
//this will check whether data is base64 encoded or not
if (base64_decode($json, true) == true)
{
echo "base64 encoded";
}
else
{
echo "not base64 encoded";
}
Use this for PHP7
//$string parameter can be base64 encoded or not
function is_base64_encoded($string){
//this will check if $string is base64 encoded and return true, if it is.
if (base64_decode($string, true) !== false){
return true;
}else{
return false;
}
}
var base64Rejex = /^(?:[A-Z0-9+\/]{4})*(?:[A-Z0-9+\/]{2}==|[A-Z0-9+\/]{3}=|[A-Z0-9+\/]{4})$/i;
var isBase64Valid = base64Rejex.test(base64Data); // base64Data is the base64 string
if (isBase64Valid) {
// true if in base64 format
console.log('It is base64');
} else {
// false if not in base64 format
console.log('it is not in base64');
}
Try this:
public void checkForEncode(String string) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(string);
if (m.find()) {
System.out.println("true");
} else {
System.out.println("false");
}
}
It is impossible to check if a string is base64 encoded or not. It is only possible to validate if that string is of a base64 encoded string format, which would mean that it could be a string produced by base64 encoding (to check that, string could be validated against a regexp or a library could be used, many other answers to this question provide good ways to check this, so I won't go into details).
For example, the string flow is a valid base64-encoded string. But it is impossible to know whether it is just the plain English word flow or the base64 encoding of the string ~Z0.
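The ambiguity is easy to reproduce in Python with the standard base64 module:

```python
import base64

print(base64.b64decode("flow"))                  # b'~Z0'
print(base64.b64encode(b"~Z0").decode("ascii"))  # flow
```

Both directions succeed, so a format check alone cannot tell which interpretation was intended.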
There are many variants of Base64, so consider just determining whether your string resembles the variant you expect to handle. As such, you may need to adjust the regex below with respect to the index and padding characters (i.e. +, /, =).
class String
def resembles_base64?
self.length % 4 == 0 && self =~ /^[A-Za-z0-9+\/=]+\Z/
end
end
Usage:
raise 'the string does not resemble Base64' unless my_string.resembles_base64?
Check that the string's length is a multiple of 4. Afterwards, use this regex to make sure all characters in the string are base64 characters.
\A[a-zA-Z\d\/+]+={,2}\z
If the library you use adds a newline as a way of observing the 76 max chars per line rule, replace them with empty strings.
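A Python sketch of that newline handling, assuming an encoder that wraps its output at 76 characters (base64.encodebytes does this):

```python
import base64

payload = b"x" * 100
wrapped = base64.encodebytes(payload).decode("ascii")   # contains "\n" line breaks
unwrapped = "".join(wrapped.split())                    # remove all whitespace

print(len(unwrapped) % 4 == 0)
print(base64.b64decode(unwrapped, validate=True) == payload)
```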
/^([A-Za-z0-9+\/]{4})*([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)$/
This regular expression helped me identify base64 in my Rails application. I had only one problem: it also matches the string "errorDescripcion", which caused an error. To solve it, I additionally validate the length of the string.
For Flutter, I tested a couple of the suggestions above and translated them into a Dart function as follows:
static bool isBase64(dynamic value) {
if (value.runtimeType == String){
final RegExp rx = RegExp(r'^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$',
multiLine: true,
unicode: true,
);
final bool isBase64Valid = rx.hasMatch(value);
if (isBase64Valid == true) {return true;}
else {return false;}
}
else {return false;}
}
In Java below code worked for me:
public static boolean isBase64Encoded(String s) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(s);
return m.find();
}
This works in Python:
import base64
def IsBase64(str):
    try:
        base64.b64decode(str)
        return True
    except Exception as e:
        return False
if IsBase64("ABC"):
    print("ABC is Base64-encoded and its result after decoding is: " + str(base64.b64decode("ABC")).replace("b'", "").replace("'", ""))
else:
    print("ABC is NOT Base64-encoded.")
if IsBase64("QUJD"):
    print("QUJD is Base64-encoded and its result after decoding is: " + str(base64.b64decode("QUJD")).replace("b'", "").replace("'", ""))
else:
    print("QUJD is NOT Base64-encoded.")
Summary: IsBase64("string here") returns true if string here is Base64-encoded, and it returns false if string here was NOT Base64-encoded.
C#
This is performing great:
static readonly Regex _base64RegexPattern = new Regex(BASE64_REGEX_STRING, RegexOptions.Compiled);
private const String BASE64_REGEX_STRING = @"^[a-zA-Z0-9\+/]*={0,3}$";
private static bool IsBase64(this String base64String)
{
var rs = (!string.IsNullOrEmpty(base64String) && !string.IsNullOrWhiteSpace(base64String) && base64String.Length != 0 && base64String.Length % 4 == 0 && !base64String.Contains(" ") && !base64String.Contains("\t") && !base64String.Contains("\r") && !base64String.Contains("\n")) && (base64String.Length % 4 == 0 && _base64RegexPattern.Match(base64String, 0).Success);
return rs;
}
There is no way to distinguish a plain string from a base64-encoded one, unless the strings in your system have some specific limitation or identifying mark.
This snippet may be useful when you know the length of the original content (e.g. a checksum). It checks that encoded form has the correct length.
public static boolean isValidBase64( final int initialLength, final String string ) {
final int padding ;
final String regexEnd ;
switch( ( initialLength ) % 3 ) {
case 1 :
padding = 2 ;
regexEnd = "==" ;
break ;
case 2 :
padding = 1 ;
regexEnd = "=" ;
break ;
default :
padding = 0 ;
regexEnd = "" ;
}
final int encodedLength = ( ( ( initialLength / 3 ) + ( padding > 0 ? 1 : 0 ) ) * 4 ) ;
final String regex = "[a-zA-Z0-9/\\+]{" + ( encodedLength - padding ) + "}" + regexEnd ;
return Pattern.compile( regex ).matcher( string ).matches() ;
}
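The length arithmetic used above can be checked with a short Python sketch (the function names are hypothetical): every started group of 3 input bytes yields 4 output characters, and the padding is whatever is needed to complete the last group.

```python
import base64

def encoded_length(n: int) -> int:
    # Every started group of 3 input bytes becomes 4 output characters.
    return 4 * ((n + 2) // 3)

def padding_length(n: int) -> int:
    # 0, 2 or 1 '=' characters for n % 3 == 0, 1 or 2 respectively.
    return (3 - n % 3) % 3

for n in (0, 1, 2, 3, 16):
    sample = base64.b64encode(b"\0" * n).decode("ascii")
    print(n, encoded_length(n), padding_length(n), sample)
```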
If the RegEx does not work and you know the format style of the original string, you can reverse the logic, by regexing for this format.
For example, I work with base64-encoded XML files and just check whether the file contains valid XML markup. If it does not, I can assume that it's base64 encoded. This is not very dynamic, but it works fine for my small application.
This works in Python:
import re
def is_base64(string):
    # re has no test() function; re.search returns a match object or None
    if len(string) % 4 == 0 and re.search(r'^[A-Za-z0-9+/=]+\Z', string):
        return True
    else:
        return False
Try this using a previously mentioned regex:
String regex = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
if("TXkgdGVzdCBzdHJpbmc/".matches(regex)){
System.out.println("it's a Base64");
}
...We can also make a simple validation like, if it has spaces it cannot be Base64:
String myString = "Hello World";
if(myString.contains(" ")){
System.out.println("Not B64");
}else{
System.out.println("Could be B64 encoded, since it has no spaces");
}
If decoding yields a string that contains non-ASCII characters, then the string was not encoded.
(RoR) ruby solution:
def encoded?(str)
Base64.decode64(str.downcase).scan(/[^[:ascii:]]/).count.zero?
end
def decoded?(str)
Base64.decode64(str.downcase).scan(/[^[:ascii:]]/).count > 0
end
Function Check_If_Base64(ByVal msgFile As String) As Boolean
Dim I As Long
Dim Buffer As String
Dim Car As String
Check_If_Base64 = True
Buffer = Leggi_File(msgFile)
Buffer = Replace(Buffer, vbCrLf, "")
For I = 1 To Len(Buffer)
Car = Mid(Buffer, I, 1)
If (Car < "A" Or Car > "Z") _
And (Car < "a" Or Car > "z") _
And (Car < "0" Or Car > "9") _
And (Car <> "+" And Car <> "/" And Car <> "=") Then
Check_If_Base64 = False
Exit For
End If
Next I
End Function
Function Leggi_File(PathAndFileName As String) As String
Dim FF As Integer
FF = FreeFile()
Open PathAndFileName For Binary As #FF
Leggi_File = Input(LOF(FF), #FF)
Close #FF
End Function
import java.util.Base64;
public static String encodeBase64(String s) {
return Base64.getEncoder().encodeToString(s.getBytes());
}
public static String decodeBase64(String s) {
try {
if (isBase64(s)) {
return new String(Base64.getDecoder().decode(s));
} else {
return s;
}
} catch (Exception e) {
return s;
}
}
public static boolean isBase64(String s) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(s);
return m.find();
}
For Java flavour I actually use the following regex:
"([A-Za-z0-9+]{4})*([A-Za-z0-9+]{3}=|[A-Za-z0-9+]{2}(==){0,2})?"
This also treats the == as optional in some cases.
Best!
I tried this one, and yes, it works:
^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$
but I added a condition to check that the string contains at least one = character:
string.lastIndexOf("=") >= 0

how to convert hexadecimal from string in c#

I have a string which contains hexadecimal values; I want to know how to convert that string to hexadecimal using C#.
There are several ways of doing this, depending on how efficient you need it to be.
Convert.ToInt32(value, fromBase) // ie Convert.ToInt32("FF", 16) == 255
That is the easy way to convert to an Int32. You can use Byte, Int16, Int64, etc. If you need to convert to an array of bytes you can chew through the string 2 characters at a time parsing them into bytes.
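The two conversions described above, a single number and a byte array built two characters at a time, can be sketched in Python (the sample hex text is an assumption):

```python
print(int("FF", 16))   # 255

hex_text = "10A0ff"
# Walk the string two characters at a time, parsing each pair as one byte.
pairs = bytes(int(hex_text[i:i + 2], 16) for i in range(0, len(hex_text), 2))
print(pairs == bytes.fromhex(hex_text))   # True
```

bytes.fromhex is the built-in shortcut; the explicit loop mirrors the "chew through the string 2 characters at a time" description.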
If you need to do this in a fast loop or with large byte arrays, I think this class is probably the fastest way to do it in purely managed code. I'm always open to suggestions for how to improve it though.
Given the following formats
10A
0x10A
0X10A
Perform the following.
public static int ParseHexadecimalInteger(string v)
{
var r = 0;
if (!string.IsNullOrEmpty(v))
{
var s = v.ToLower().Replace("0x", "");
var c = CultureInfo.CurrentCulture;
int.TryParse(s, NumberStyles.HexNumber, c, out r);
}
return r;
}
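A comparable Python sketch, assuming the same convention of returning 0 when the text cannot be parsed (parse_hex is a hypothetical name):

```python
def parse_hex(text: str) -> int:
    """Parse hex text with an optional 0x/0X prefix; return 0 on failure."""
    try:
        # With base 16, int() already accepts an optional 0x or 0X prefix.
        return int(text.strip(), 16)
    except (ValueError, AttributeError):
        return 0

for sample in ("10A", "0x10A", "0X10A", "xyz"):
    print(sample, parse_hex(sample))
```

Returning 0 on failure makes an invalid input indistinguishable from a genuine zero, so callers that care about the difference may prefer to let the exception propagate.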

Resources