Last character of a multibyte string

One of the things I often need to do when handling a multibyte string is deleting its last character. How do I locate this last character so I can chop it off using normal byte operations, preferably with as few reads as possible?
Note that this question is intended to cover most, if not all, multibyte encodings. The answer for self-synchronizing encodings like UTF-8 is trivial, as you can just scan the bytestring right-to-left for a start marker.

The answer will be written in C, with the POSIX multibyte functions. Said functions are also found on Windows. Assume that the bytestring ends at len and is well-formed up to that point; assume appropriate setlocale calls. Porting to mbrlen is left as an exercise for the reader.
The naive solution
The obviously correct solution involves parsing the encoding "as intended", going from left-to-right.
#include <stdlib.h>     // mblen
#include <sys/types.h>  // ssize_t

ssize_t index_of_last_char_left(const char *c, size_t len) {
    size_t pos = 0;
    size_t next = 1;
    mblen(NULL, 0); // reset the conversion state
    while (pos < len) {
        int n = mblen(c + pos, len - pos);
        if (n <= 0) // invalid input: give up at the current offset
            return pos;
        next = (size_t)n;
        pos += next;
    }
    return (ssize_t)(pos - next);
}
Deleting multiple characters like this will cause an "accidentally quadratic" situation; memoizing intermediate positions will help, but requires additional bookkeeping.
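One way to avoid it is to record every character boundary in a single left-to-right pass and then chop from the back at no extra cost. A minimal sketch, under the same preconditions as above (chop_last_chars is my own name, not a standard function):
#include <stdlib.h>

// Sketch: one pass to collect boundaries, then chop the last n_chars
// characters off s in place. Returns the new length.
size_t chop_last_chars(char *s, size_t len, size_t n_chars) {
    size_t *starts = malloc(len * sizeof *starts); // one slot per possible boundary
    size_t count = 0, pos = 0;
    if (starts == NULL || n_chars == 0 || len == 0) {
        free(starts);
        return len;
    }
    mblen(NULL, 0); // reset the conversion state
    while (pos < len) {
        int next = mblen(s + pos, len - pos);
        if (next <= 0)
            break; // invalid input: keep the boundaries found so far
        starts[count++] = pos;
        pos += (size_t)next;
    }
    size_t new_len = (n_chars >= count) ? 0 : starts[count - n_chars];
    s[new_len] = '\0';
    free(starts);
    return new_len;
}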
The right-to-left solution
As I mentioned in the question, for self-synchronizing encodings the only thing to do is to look for a start marker. But what breaks with the ones that don't self-synchronize?
The one-or-two-byte EUC encodings have both bytes of a two-byte sequence above 0x7f, and there is almost no way to tell start bytes from continuation bytes. For those we can check for mblen(pos) == bytes_left, since we know the string is well-formed.
The Big5, GBK, and GB18030 encodings also allow a continuation byte in the ASCII range, so a lookbehind is mandatory.
With that cleared up (and assuming the bytestring up to len is well-formed), we can have:
#include <assert.h>
#include <stdbool.h>

// As much as the CJK encodings need. I don't have time to see if it works for UTF-1.
#define MAX_MB_LEN 4

ssize_t index_of_last_char_right(const char *c, size_t len) {
    ssize_t pos = (ssize_t)len - 1;
    bool last = true;
    bool last_is_okay = false;
    assert(!mblen(NULL, 0)); // No, we really cannot handle shift states.
    for (; pos >= 0 && pos >= (ssize_t)len - MAX_MB_LEN; pos--) {
        int next = mblen(c + pos, len - pos);
        // A character starting at pos must span exactly to the end of the string.
        bool okay = next > 0 && next == (int)(len - pos);
        if (last) {
            last_is_okay = okay; // the final byte may be a lone character...
            last = false;
        } else if (okay) {
            return pos; // ...or the tail of a longer character that starts here
        }
    }
    return last_is_okay ? (ssize_t)len - 1 : -1;
}
(You should be able to find the last good character of a malformed string with (next > 0) && (next <= (int)(len - pos)). But don't return that when the last byte is okay!)
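A quick usage sketch, assuming the function above is in scope (the locale name ja_JP.eucJP and the sample bytes are assumptions; substitute any non-self-synchronizing locale your system provides):
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

int main(void) {
    // Assumption: the system provides an EUC-JP locale under this name.
    if (!setlocale(LC_CTYPE, "ja_JP.eucJP"))
        return 1;
    char buf[] = "\xa4\xa2\xa4\xa4"; // assumed: two two-byte EUC-JP characters
    ssize_t last = index_of_last_char_right(buf, strlen(buf));
    if (last >= 0)
        buf[last] = '\0'; // chop the last character off
    puts(buf);
    return 0;
}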
What's the point of this?
The code sample above is for the idealist who does not want to write just "UTF-8 support" but "locale support" based on the C library. There might not be a point to this at all in 2021 :)

Related

find the number of ways you can form a string of size N, given an unlimited number of 0s and 1s

The question below was asked in the Atlassian company online test. I don't have test cases; this is the question I took from this link:
find the number of ways you can form a string of size N, given an unlimited number of 0s and 1s, but
you cannot have D consecutive 0s or T consecutive 1s. N, D, T are given as inputs.
Please help me with this problem; any approach for how to proceed would be appreciated.
My approach was simply to apply recursion, try every possibility, and then memoize the results using a hash map.
But it seems to me there must be some combinatoric approach that can solve this in less time and space. For debugging purposes I am also printing the strings generated during recursion; if there is a flaw in my approach, please do tell me.
#include <bits/stdc++.h>
using namespace std;

unordered_map<string, int> dp;

int recurse(int d, int t, int n, int oldd, int oldt, string s)
{
    if (d <= 0)
        return 0;
    if (t <= 0)
        return 0;
    cout << s << "\n";
    if (n == 0 && d > 0 && t > 0)
        return 1;
    string h = to_string(d) + " " + to_string(t) + " " + to_string(n);
    if (dp.find(h) != dp.end())
        return dp[h];
    int ans = 0;
    ans += recurse(d - 1, oldt, n - 1, oldd, oldt, s + '0') + recurse(oldd, t - 1, n - 1, oldd, oldt, s + '1');
    return dp[h] = ans;
}

int main()
{
    int n, d, t;
    cin >> n >> d >> t;
    dp.clear();
    cout << recurse(d, t, n, d, t, "") << "\n";
    return 0;
}
You are right: instead of generating strings, it is worth considering a combinatoric approach using (a kind of) dynamic programming.
A "good" sequence of length K may end with 1..D-1 zeros or 1..T-1 ones.
To make a good sequence of length K+1, you can append a zero to every sequence except those already ending in D-1 zeros, getting 2..D-1 trailing zeros from the first kind of precursor and 1 trailing zero from the second kind.
Similarly, you can append a one to every sequence of the first kind, and to every sequence of the second kind except those already ending in T-1 ones, getting 1 trailing one from the first kind of precursor and 2..T-1 trailing ones from the second kind.
Make two tables
Zeros[N][D] and Ones[N][T]
Fill the first row with zero counts, except for Zeros[1][1] = 1, Ones[1][1] = 1
Fill row by row using the rules above.
Zeros[K][1] = Sum(Ones[K-1][C=1..T-1])
for C in 2..D-1:
Zeros[K][C] = Zeros[K-1][C-1]
Ones[K][1] = Sum(Zeros[K-1][C=1..D-1])
for C in 2..T-1:
Ones[K][C] = Ones[K-1][C-1]
Result is sum of the last row in both tables.
Also note that you really need only two active rows of the table, so you can optimize size to Zeros[2][D] after debugging.
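For concreteness, a minimal C sketch of the scheme above, keeping only two active rows as suggested (count_strings is my own name; it assumes n >= 1, D and T below 128, and counts that fit in a long long — for large N you would reduce modulo something):
#include <stdio.h>
#include <string.h>

// zeros[c]: good length-k strings ending in exactly c zeros; ones[c] likewise.
// Assumptions: n >= 1, d, t < 128, and the counts fit in long long.
long long count_strings(int n, int d, int t) {
    long long zeros[128] = {0}, ones[128] = {0};
    zeros[1] = 1; // the string "0"
    ones[1] = 1;  // the string "1"
    for (int k = 2; k <= n; k++) {
        long long nz[128] = {0}, no[128] = {0};
        for (int c = 1; c < t; c++) nz[1] += ones[c];     // append 0 after a run of ones
        for (int c = 2; c < d; c++) nz[c] = zeros[c - 1]; // extend a run of zeros
        for (int c = 1; c < d; c++) no[1] += zeros[c];    // append 1 after a run of zeros
        for (int c = 2; c < t; c++) no[c] = ones[c - 1];  // extend a run of ones
        memcpy(zeros, nz, sizeof zeros);
        memcpy(ones, no, sizeof ones);
    }
    long long total = 0;
    for (int c = 1; c < d; c++) total += zeros[c];
    for (int c = 1; c < t; c++) total += ones[c];
    return total;
}

int main(void) {
    int n, d, t;
    if (scanf("%d %d %d", &n, &d, &t) == 3)
        printf("%lld\n", count_strings(n, d, t));
    return 0;
}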
This can be solved using dynamic programming. I'll give a recursive solution to the same. It'll be similar to generating a binary string.
States will be:
i: The ith character that we need to insert to the string.
cnt: The number of consecutive characters before i
bit: The character which was repeated cnt times before i. Value of bit will be either 0 or 1.
Base case: return 1 when we reach n, since we are starting from 0 and ending at n-1.
Define the size of the dp array accordingly. The time complexity will be O(2 * N * max(D, T)).
#include <bits/stdc++.h>
using namespace std;

int dp[1000][1000][2];
int n, d, t;

int count(int i, int cnt, int bit) {
    if (i == n) {
        return 1;
    }
    int &ans = dp[i][cnt][bit];
    if (ans != -1) return ans;
    ans = 0;
    if (bit == 0) {
        ans += count(i + 1, 1, 1);
        if (cnt != d - 1) {
            ans += count(i + 1, cnt + 1, 0);
        }
    } else {
        // bit == 1
        ans += count(i + 1, 1, 0);
        if (cnt != t - 1) {
            ans += count(i + 1, cnt + 1, 1);
        }
    }
    return ans;
}

signed main() {
    ios_base::sync_with_stdio(false), cin.tie(nullptr);
    cin >> n >> d >> t;
    memset(dp, -1, sizeof dp);
    cout << count(0, 0, 0);
    return 0;
}

Append Char To StringBuilder C++/CLI

I am trying to use StringBuilder to create the output that is being sent over the serial port for a log file. The output is stored in a byte array, and I am iterating through it.
ref class UART_G {
public:
    static array<System::Byte>^ message = nullptr;
    static uint8_t message_length = 0;
};

static void logSend()
{
    StringBuilder^ outputsb = gcnew StringBuilder();
    outputsb->Append("Sent ");
    for (uint8_t i = 0; i < UART_G::message_length; i++)
    {
        unsigned char mychar = UART_G::message[i];
        if (
            (mychar >= ' ' && mychar <= 'Z') ||  // Includes 0-9, A-Z.
            (mychar >= '^' && mychar <= '~') ||  // Includes a-z.
            (mychar >= 128 && mychar <= 254))    // I think these are okay.
        {
            outputsb->Append(L"" + mychar);
        }
        else
        {
            outputsb->Append("[");
            outputsb->Append(mychar);
            outputsb->Append("]");
        }
    }
    log_line(outputsb->ToString());
}
I want all plain-text characters (e.g. A, :) to be sent as text, while functional characters (e.g. BEL, NEWLINE) will be sent like [7][13].
What is happening is that the StringBuilder, in all cases, is outputting the character as a number. For example, A is being sent out as 65.
For example, if I have the string 'APPLE' and a newline in my byte array, I want to see:
Sent APPLE[13]
Instead, I see:
Sent 6580807669[13]
I have tried every way imaginable to get it to display the character properly, including type-casting, concatenating it to a string, changing the variable type, etc... I would really appreciate if anyone knows how to do this. My log files are largely unreadable without this function.
You're getting the ASCII values because the compiler is choosing one of the Append overloads that takes an integer of some sort. To fix this, you could do an explicit cast to System::Char, to force the correct overload.
However, that won't necessarily give the proper results for 128-255. You could cast a value in that range from Byte to Char, and it'll give something, but not necessarily what you expect. First off, 0x80 through 0x9F are control characters, and wherever you're getting the bytes from might not intend the same representation for 0xA0 through 0xFF as Unicode has.
In my opinion, the best solution would be to use the "[value]" syntax that you're using for the other control characters for 0x80 through 0xFF as well. However, if you do want to convert those to characters, I'd use Encoding::Default, not Encoding::ASCII. ASCII only defines 0x00 through 0x7F; 0x80 and higher will come out as "?". Encoding::Default is whatever code page is defined for the language you have selected in Windows.
Combine all that, and here's what you'd end up with:
for (uint8_t i = 0; i < UART_G::message_length; i++)
{
    unsigned char mychar = UART_G::message[i];
    if (mychar >= ' ' && mychar <= '~' && mychar != '[' && mychar != ']')
    {
        // Use the character directly for all ASCII printable characters,
        // except '[' and ']', because those have a special meaning, below.
        outputsb->Append((System::Char)(mychar));
    }
    else if (mychar >= 128)
    {
        // Non-ASCII characters, use the default encoding to convert to Unicode.
        outputsb->Append(Encoding::Default->GetChars(UART_G::message, i, 1));
    }
    else
    {
        // Unprintable characters, use the byte value in brackets.
        // Also do this for bracket characters, so there's no ambiguity
        // what a bracket means in the logs.
        outputsb->Append("[");
        outputsb->Append((unsigned int)mychar);
        outputsb->Append("]");
    }
}
You are receiving the ASCII values of the string.
See the ASCII chart:
65 = A
80 = P
80 = P
76 = L
69 = E
Just write a function that converts the ASCII values back to characters.
Here is the code I came up with which resolved the issue:
static void logSend()
{
    StringBuilder^ outputsb = gcnew StringBuilder();
    ASCIIEncoding^ ascii = gcnew ASCIIEncoding;
    outputsb->Append("Sent ");
    for (uint8_t i = 0; i < UART_G::message_length; i++)
    {
        unsigned char mychar = UART_G::message[i];
        if (
            (mychar >= ' ' && mychar <= 'Z') ||  // Includes 0-9, A-Z.
            (mychar >= '^' && mychar <= '~') ||  // Includes a-z.
            (mychar >= 128 && mychar <= 254))    // I think these are okay.
        {
            outputsb->Append(ascii->GetString(UART_G::message, i, 1));
        }
        else
        {
            outputsb->Append("[");
            outputsb->Append(mychar);
            outputsb->Append("]");
        }
    }
    log_line(outputsb->ToString());
}
I still appreciate any alternatives which are more efficient or simpler to read.

Counter for two binary strings C++

I am trying to add two binary numbers given as strings. The maximum number of digits is 253. Short numbers work, but when I enter longer numbers, the output is wrong. An example of a bad result is "10100101010000111111" with "000011010110000101100010010011101010001101011100000000111000000000001000100101101111101000111001000101011010010111000110".
#include <iostream>
#include <stdlib.h>
using namespace std;

bool isBinary(string b1, string b2);

int main()
{
    string b1, b2;
    long binary1, binary2;
    int i = 0, remainder = 0, sum[254];
    cout << "Get two binary numbers:" << endl;
    cin >> b1 >> b2;
    binary1 = atol(b1.c_str());
    binary2 = atol(b2.c_str());
    if (isBinary(b1, b2) == true) {
        while (binary1 != 0 || binary2 != 0) {
            sum[i++] = (binary1 % 10 + binary2 % 10 + remainder) % 2;
            remainder = (binary1 % 10 + binary2 % 10 + remainder) / 2;
            binary1 = binary1 / 10;
            binary2 = binary2 / 10;
        }
        if (remainder != 0) {
            sum[i++] = remainder;
        }
        --i;
        cout << "Result: ";
        while (i >= 0) {
            cout << sum[i--];
        }
        cout << endl;
    } else cout << "Wrong input" << endl;
    return 0;
}

bool isBinary(string b1, string b2) {
    bool rozhodnuti1, rozhodnuti2;
    for (int i = 0; i < b1.length(); i++) {
        if (b1[i] != '0' && b1[i] != '1') {
            rozhodnuti1 = false;
            break;
        } else rozhodnuti1 = true;
    }
    for (int k = 0; k < b2.length(); k++) {
        if (b2[k] != '0' && b2[k] != '1') {
            rozhodnuti2 = false;
            break;
        } else rozhodnuti2 = true;
    }
    if (rozhodnuti1 == false || rozhodnuti2 == false) { return false; }
    else { return true; }
}
One of the problems might be here: sum[i++]
This expression, as it is, first returns the value of i and then increases it by one.
Did you do it on purpose?
Change it to ++i.
It'd help if you could also post the "bad" output, so that we can try to move backward through the code starting from it.
EDIT 2015-11-7_17:10
Just to be sure everything was correct, I've added a cout to check what binary1 and binary2 contain after you assign them the result of the atol function: they contain the integer numbers 547284487 and 18333230, which obviously don't represent the correct binary-to-integer transposition of the two 01 strings you presented in your post.
Probably they somehow exceed the capacity of atol.
Also, the result of your "math" operations leads to an even stranger result, which is 6011111101, which obviously doesn't make any sense.
What do you mean, exactly, when you say you want to count these two numbers? Maybe you want to make a sum? I guess that's it.
But then, again, what you have there are two signed integers and not two binaries, which means those %10 and %2 operations are (probably) misused.
EDIT 2015-11-07_17:20
I've tried to use your program with small binary strings, and it actually works; but only with small binary strings.
It's a fact(?), at this point, that atol can't handle numerical strings that long.
My suggestion: use char arrays instead of strings and replace the '0' and '1' characters with numerical values (if (bin1[i] == '1') { bin1[i] = 1; } else { bin1[i] = 0; }), with which you'll be able to perform all the math operations you want (you've already written a working sum routine, after all).
Once done with the math, you can just convert the char array back to actual '0' and '1' characters and cout it on the screen.
EDIT 2015-11-07_17:30
Tested atol on my own: it correctly converts only strings that are up to 10 characters long.
Anything beyond the 10th character makes the function go crazy.
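For what it's worth, a minimal sketch of that approach in C (add_binary is my own name; it assumes both inputs are already-validated binary strings of at most 253 digits):
#include <stdio.h>
#include <string.h>

// Add two binary numbers given as character strings, right to left.
// No integer conversion anywhere, so the 253-digit limit is no problem.
// Assumption: b1 and b2 contain only '0'/'1' and fit in the tmp buffer.
void add_binary(const char *b1, const char *b2, char *out) {
    int i = (int)strlen(b1) - 1, k = (int)strlen(b2) - 1, carry = 0, pos = 0;
    char tmp[256];
    while (i >= 0 || k >= 0 || carry) {
        int sum = carry;
        if (i >= 0) sum += b1[i--] - '0';
        if (k >= 0) sum += b2[k--] - '0';
        tmp[pos++] = (char)('0' + sum % 2);
        carry = sum / 2;
    }
    // Digits were produced least significant first; reverse them into out.
    for (int j = 0; j < pos; j++)
        out[j] = tmp[pos - 1 - j];
    out[pos] = '\0';
}

int main(void) {
    char result[256];
    add_binary("10100101010000111111", "110", result);
    printf("%s\n", result);
    return 0;
}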

Convert binary (integer and fraction) from VHDL to decimal, negative value in C code

I have 14-bit data fed from an FPGA written in VHDL. The Nios II processor reads the 14-bit data from the FPGA and does some processing, where the Nios II system is programmed in C.
The 14-bit data can be positive, zero, or negative. In the Altera compiler, I can only define the data to be 8, 16, or 32 bits, so I define it as 16-bit data.
First, I need to check if the data is negative; if it is, I need to pad the two MSBs with '1' so the system treats it as a negative value instead of a positive one.
Second, I need to compute the real value of this binary representation as a decimal value with BOTH integer and fraction parts.
I learned from this link (Correct algorithm to convert binary floating point "1101.11" into decimal (13.75)?) that I can convert a binary number (consisting of both integer and fraction) to a decimal value.
To be specific, I am able to use this code quoted from that link, reproduced below:
#include <stdio.h>
#include <math.h>

double convert(const char binary[]) {
    int bi, i;
    int len = 0;
    int dot = -1;
    double result = 0;
    for (bi = 0; binary[bi] != '\0'; bi++) {
        if (binary[bi] == '.') {
            dot = bi;
        }
        len++;
    }
    if (dot == -1)
        dot = len;
    for (i = dot; i >= 0; i--) {
        if (binary[i] == '1') {
            result += (double) pow(2, (dot - i - 1));
        }
    }
    for (i = dot; binary[i] != '\0'; i++) {
        if (binary[i] == '1') {
            result += 1.0 / (double) pow(2.0, (double)(i - dot));
        }
    }
    return result;
}

int main()
{
    char bin[] = "1101.11";
    char bin1[] = "1101";
    char bin2[] = "1101.";
    char bin3[] = ".11";
    printf("%s -> %f\n", bin, convert(bin));
    printf("%s -> %f\n", bin1, convert(bin1));
    printf("%s -> %f\n", bin2, convert(bin2));
    printf("%s -> %f\n", bin3, convert(bin3));
    return 0;
}
I am wondering if this code can be used to check for negative value? I did try with a binary string of 11111101.11 and it gives the output of 253.75...
I have two questions:
What are the modifications I need to do in order to read a negative value?
I know that I can do a bitwise AND (as below) to check if the MSB is 1; if it is 1, I know it is a negative value...
if (14bit_data & 0x2000) //if true, it is negative value
The issue is that since it involves a fractional part (not only an integer), I am a bit confused about whether the method still works...
If the binary number is originally not in string format, is there any way I could convert it to a string? The binary number is originally fed from an FPGA block written in VHDL, say, 14 bits, with the MSB as the sign bit, the following 6 bits as the magnitude of the integer part, and the last 6 bits as the magnitude of the fractional part. I need the decimal value in C code for the Altera Nios II processor.
OK, so I'm focusing on the fact that you want to reuse the algorithm you mention at the beginning of your question, and I assume that the binary representation you have for your signed number is two's complement, but I'm not really sure, according to your comments, that the input you have is the same as the one used by the algorithm.
First, pad the 2 MSBs to get a 16-bit representation:
16bit_data = (14_bit_data & 0x2000) ? (14_bit_data | 0xC000) : 14_bit_data;
In case the value is positive it will remain unchanged, and if negative this will be the correct two's complement representation on 16 bits.
For the fractional part everything is the same as in the algorithm you mentioned in your question.
For the integer part everything is the same except the treatment of the MSB.
For an unsigned number the MSB (i.e. bit[15]) represents pow(2,15-6) (6 is the width of the fractional part), whereas for a signed number in two's complement representation it represents -pow(2,15-6), meaning that the algorithm becomes
/* integer part operation */
while (p >= 1)
{
    rem = (int)fmod(p, 10);
    p = (int)(p / 10);
    dec = dec + rem * pow(2, t) * (9 != t ? 1 : -1);
    ++t;
}
or, said differently, if you don't want the * operator:
/* integer part operation */
while (p >= 1)
{
    rem = (int)fmod(p, 10);
    p = (int)(p / 10);
    if (9 != t)
    {
        dec = dec + rem * pow(2, t);
    }
    else
    {
        dec = dec - rem * pow(2, t);
    }
    ++t;
}
For the second algorithm that you mention, considering your format, if dot == 11 and i == 0 we are at the MSB (10 integer bits followed by the dot), so the code becomes
for (i = dot - 1; i >= 0; i--)
{
    if (binary[i] == '1')
    {
        if (11 != dot || i)
        {
            result += (double) pow(2, (dot - i - 1));
        }
        else
        {
            // result -= (double) pow(2,(dot-i-1));
            // Due to your number format, i == 0 and dot == 11, so:
            result -= 512;
        }
    }
}
WARNING: in Brice's algorithm the input is a character string like "11011.101", whereas according to your description you have an integer input, so I'm not sure that this algorithm is suited to your case.
I think this should work:
#include <stdint.h>

float convert14BitsToFloat(int16_t in)
{
    /* Sign-extend in, since it is 14 bits */
    if (in & 0x2000) in |= 0xC000;
    /* convert to float with 6 fractional bits (64 = 2^6) */
    return (float)in / 64.0f;
}
To convert any number to a string, I would use sprintf. Be aware it may significantly increase the size of your application. If you don't need the float and want to keep the application small, you should make your own conversion function.
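For example, combined with the function above (the 0x3FD0 sample value is made up; it sign-extends to -48, i.e. -0.75 with 6 fractional bits):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    char buf[32];
    int16_t in = 0x3FD0;             // hypothetical 14-bit sample from the FPGA
    if (in & 0x2000) in |= 0xC000;   // sign-extend 14 -> 16 bits
    float value = (float)in / 64.0f; // 6 fractional bits
    snprintf(buf, sizeof buf, "%f", value);
    printf("%s\n", buf);             // prints -0.750000
    return 0;
}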

Remove occurrences of substring recursively

Here's a problem:
Given a string A and a substring B, remove the first occurrence of substring B in string A till it is possible to do so. Note that removing a substring can further create a new occurrence of the same substring. E.g., removing 'hell' from 'hehelllloworld' once would yield 'helloworld', which after removing once more would become 'oworld', the desired string.
Write a program for the above for input constraints of length 10^6 for A, and length 100 for B.
This question was asked to me in an interview. I gave them a simple algorithm that does exactly what the statement says and removes occurrences iteratively (to decrease overhead calls). I later came to know there's a much faster solution. I've thought of a few optimizations, but it's still not as fast as the fastest solution for the problem (according to the company), so can anyone tell me a faster way to solve it?
P.S. I know the Stack Overflow rules and that having code is better, but for this problem I don't think having code would be in any way beneficial...
Your approach has a pretty bad complexity. In a very bad case the string a will be aaaaaaaaabbbbbbbbb, and the string b will be ab, in which case you will need O(|a|) searches, each taking O(|a| + |b|) (assuming some sophisticated search algorithm is used), resulting in a total complexity of O(|a|^2 + |a| * |b|), which with their constraints is far too slow.
For their constraints a good complexity to aim for would be O(|a| * |b|), which is around 100 million operations and will finish in under a second. Here's one way to approach it. For each position i in the string a let's compute the largest length n_i such that a[i - n_i : i] = b[0 : n_i] (in other words, the longest suffix of a at that position which is a prefix of b). We can compute it in O(|a| + |b|) using the Knuth-Morris-Pratt algorithm.
After we have n_i computed, finding the first occurrence of b in a is just a matter of finding the first n_i that is equal to |b|. This will be the right end of one of the occurrences of b in a.
Finally, we will need to modify Knuth-Morris-Pratt slightly. We will be logically removing occurrences of b as soon as we compute an n_i that is equal to |b|. To account for the fact that some letters were removed from a we will rely on the fact that Knuth-Morris-Pratt only relies on the last value of n_i (and those computed for b), and the current letter of a, so we just need a fast way of retrieving the last value of n_i after we logically remove an occurrence of b. That can be done with a deque, that stores all the valid values of n_i. Each value will be pushed into the deque once, and popped from it once, so that complexity of maintaining it is O(|a|), while the complexity of the Knuth-Morris-Pratt is O(|a| + |b|), resulting in O(|a| + |b|) total complexity.
Here's a C++ implementation. It could have some off-by-one errors, but it works on your sample, and it flies for the worst case that I described at the beginning.
#include <deque>
#include <string>
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

int main() {
    string a, b;
    cin >> a >> b;
    size_t blen = b.size();
    // make a = b$a
    a = b + "$" + a;
    vector<size_t> n(a.size());      // array for Knuth-Morris-Pratt
    vector<bool> removals(a.size()); // positions of right ends at which we remove `b`s
    deque<size_t> lastN;
    n[0] = 0;
    // For the first blen + 1 iterations just do vanilla Knuth-Morris-Pratt
    for (size_t i = 1; i < blen + 1; ++i) {
        size_t z = n[i - 1];
        while (z && a[i] != a[z]) {
            z = n[z - 1];
        }
        if (a[i] != a[z]) n[i] = 0;
        else n[i] = z + 1;
        lastN.push_back(n[i]);
    }
    // For the remaining iterations some characters could have been logically
    // removed from `a`, so use lastN to get the last value of n instead
    // of actually getting it from `n[i - 1]`
    for (size_t i = blen + 1; i < a.size(); ++i) {
        size_t z = lastN.back();
        while (z && a[i] != a[z]) {
            z = n[z - 1];
        }
        if (a[i] != a[z]) n[i] = 0;
        else n[i] = z + 1;
        if (n[i] == blen) // found a match
        {
            removals[i] = true;
            // kill the last |b| - 1 `n_i`s
            for (size_t j = 0; j < blen - 1; ++j) {
                lastN.pop_back();
            }
        }
        else {
            lastN.push_back(n[i]);
        }
    }
    string ret;
    size_t toRemove = 0;
    for (size_t pos = a.size() - 1; a[pos] != '$'; --pos) {
        if (removals[pos]) toRemove += blen;
        if (toRemove) --toRemove;
        else ret.push_back(a[pos]);
    }
    reverse(ret.begin(), ret.end());
    cout << ret << endl;
    return 0;
}
[in] hehelllloworld
[in] hell
[out] oworld
[in] abababc
[in] ababc
[out] ab
[in] caaaaa ... aaaaaabbbbbb ... bbbbc
[in] ab
[out] cc
