How can i ensure a string has even quotations using automata? - state-machine

How can i ensure a string has even quotations using automata?
For e.g.
This is a "valid" string which has two quotes.
This is "invalid" string " because it has three quotes.

You can use following autumata:
Online Demo

We need to remember the parity (even vs odd) of the number of quotes we've seen so far. We'll do this using the current state. We'll accept if we end in a state corresponding to having seen an even number.
Initially, before processing any symbols, we have seen zero quotes. Zero is typically considered to be an even number, though some people may not mean to include it some of the time. We'll assume the usual meaning and accept strings with no quotes at all. This tells us that the initial state must be the one for even parity and must therefore be accepting. The other state represents odd parity and is not accepting.
The transitions will all be self-loops on the states, except that a quote symbol takes each state to the other of the two.

Related

How to deal with CHARACTER variables, longer than the allowed maximum of 2000?

I'm working with Progress-4GL, appBuilder and procedure editor, release 11.6.
I've just found a CHARACTER type global variable (DEF VAR global_variable AS CHAR NO-UNDO.), containing up to 12901 characters. The variable is only used for passing information within the application, the information will never be stored as one tuple within a table.
The information in that variable seems to be handled well: the content is correct.
Yet, as this URL mentions, the maximum length of a character variable in Progress being 2000 characters, and this makes me worry: I'm afraid that one day, another limit may be crossed and from that moment on, we'll need to rethink the whole idea, and I'd like to be prepared for that day.
Therefore, does anybody know the "next" length limit of a character variable in Progress?
That reference you mention points to SQL limitations.
In the ABL, a CHARACTER variable can hold ~ 32 k
DEFINE VARIABLE c AS CHARACTER NO-UNDO.
ASSIGN c = FILL ("*", 31000) .
MESSAGE LENGTH (c)
VIEW-AS ALERT-BOX INFORMATION BUTTONS OK.
Beyond that you have to use LONGCHAR with it's limitations:
slightly slower
cannot be indexed in temp-tables or database tables.
CHARACTER variables are always stored in the CPINTERNAL codepage. LONGCHAR's can use a different codepage through the FIX-CODEPAGE statement.

Read substrings from a string containing multiplication [duplicate]

This question already has answers here:
'*' and '/' not recognized on input by a read statement
(2 answers)
Closed 4 years ago.
I am a scientist programming in Fortran, and I came up with a strange behaviour. In one of my programs I have a string containing several "words", and I want to read all words as substrings. The first word starts with an integer and a wildcard, like "2*something".
When I perform an internal read on that string, I expect to read all wods, but instead, the READ function repeatedly reads the first substring. I do not understand why, nor how to avoid this behaviour.
Below is a minimalist sample program that reproduces this behaviour. I would expect it to read the three substrings and to print "3*a b c" on the screen. Instead, I get "a a a".
What am I doing wrong? Can you please help me and explain what is going on?
I am compiling my programs under GNU/Linux x64 with Gfortran 7.3 (7.3.0-27ubuntu1~18.04).
PROGRAM testread
IMPLICIT NONE
CHARACTER(LEN=1024):: string
CHARACTER(LEN=16):: v1, v2, v3
string="3*a b c"
READ(string,*) v1, v2, v3
PRINT*, v1, v2, v3
END PROGRAM testread
You are using list-directed input (the * format specifier). In list-directed input, a number (n) followed by an asterisk means "repeat this item n times", so it is processed as if the input was a a a b c. You would need to have as input '3*a' b c to get what you want.
I will use this as another opportunity to point out that list-directed I/O is sometimes the wrong choice as its inherent flexibility may not be what you want. That it has rules for things like repeat counts, null values, and undelimited strings is often a surprise to programmers. I also often see programmers complaining that list-directed input did not give an error when expected, because the compiler had an extension or the programmer didn't understand just how liberal the feature can be.
I suggest you pick up a Fortran language reference and carefully read the section on list-directed I/O. You may find you need to use an explicit format or change your program's expectations.
Following the answer of #SteveLionel, here is the relevant part of the reference on list-directed sequential READ statements (in this case, for Intel Fortran, but you could find it for your specific compiler and it won't be much different).
A character string does not need delimiting apostrophes or quotation marks if the corresponding I/O list item is of type default character, and the following is true:
The character string does not contain a blank, comma (,), or slash ( / ).
The character string is not continued across a record boundary.
The first nonblank character in the string is not an apostrophe or a quotation mark.
The leading character is not a string of digits followed by an asterisk.
A nondelimited character string is terminated by the first blank, comma, slash, or end-of-record encountered. Apostrophes and quotation marks within nondelimited character strings are transferred as is.
In total, there are 4 forms of sequential read statements in Fortran, and you may choose the option that best fits your need:
Formatted Sequential Read:
To use this you change the * to an actual format specifier. If you know the length of the strings at advance, this would be as easy as '(a3,a2,a2)'. Or, you could come with a format specifier that matches your data, but this generally demands you knowing the length or format of stuff.
Formatted Sequential List-Directed:
You are currently using this option (the * format descriptor). As we already showed you, this kind of I/O comes with a lot of magic and surprising behavior. What is hitting you is the n*cte thing, that is interpreted as n repetitions of cte literal.
As said by Steve Lionel, you could put quotation marks around the problematic word, so it will be parsed as one-piece. Or, as proposed by #evets, you could split or break your string using the intrinsics index or scan. Another option could be changing your wildcard from asterisk to anything else.
Formatted Namelist:
Well, that could be an option if your data was (or could be) presented in the namelist format, but I really think it's not your case.
Unformatted:
This may not apply to your case because you are reading from a character variable, and an internal READ statement can only be formatted.
Otherwise, you could split your string by means of a function instead of a I/O operation. There is no intrinsic for this, but you could come with one without much trouble (see this thread for reference). As you may have noted already, manipulating strings in fortran is... awkward, at least. There are some libraries out there (like this) that may be useful if you are doing lots of string stuff in Fortran.

Why do ANSI color escapes end in 'm' rather than ']'?

ANSI terminal color escapes can be done with \033[...m in most programming languages. (You may need to do \e or \x1b in some languages)
What has always seemed odd to me is how they start with \033[, but they end in m Is there some historical reason for this (perhaps ] was mapped to the slot that is now occupied by m in the ASCII table?) or is it an arbitrary character choice?
It's not completely arbitrary, but follows a scheme laid out by committees, and documented in ECMA-48 (the same as ISO 6429). Except for the initial Escape character, the succeeding characters are specified by ranges.
While the pair Escape[ is widely used (this is called the control sequence introducer CSI), there are other control sequences (such as Escape], the operating system command OSC). These sequences may have parameters, and a final byte.
In the question, using CSI, the m is a final byte, which happens to tell the terminal what the sequence is supposed to do. The parameters if given are a list of numbers. On the other hand, with OSC, the command-type is at the beginning, and the parameters are less constrained (they might be any string of printable characters).

Elegant way to parse "line splices" (backslashes followed by a newline) in megaparsec

for a small compiler project we are currently working on implementing a compiler for a subset of C for which we decided to use Haskell and megaparsec. Overall we made good progress but there are still some corner cases that we cannot correctly handle yet. One of them is the treatment of backslashes followed by a newline. To quote from the specification:
Each instance of a backslash character () immediately followed by a
new-line character is deleted, splicing physical source lines to form
logical source lines. Only the last backslash on any physical source
line shall be eligible for being part of such a splice.
(ยง5.1.1., ISO/IEC9899:201x)
So far we came up with two possible approaches to this problem:
1.) Implement a pre-lexing phase in which the initial input is reproduced and every occurence of \\\n is removed. The big disadvantage we see in this approach is that we loose accurate error locations which we need.
2.) Implement a special char' combinator that behaves like char but looks an extra character ahead and will silently consume any \\\n. This would give us correct positions. The disadvantage here is that we need to replace every occurence of char with char' in any parser, even in the megaparsec-provided ones like string, integer, whitespace etc...
Most likely we are not the first people trying to parse a language with such a "quirk" with parsec/megaparsec, so I could imagine that there is some nicer way to do it. Does anyone have an idea?

Haskell Char and unicode surrogates

I was playing around with strings and discovered that Haskell (correctly) disallows characters above Unicode code point 0x10ffff (ie one gets something like a sequence out of range error if one attempts to use something above this limit). Out of curiosity, i played around with the Unicode surrogate halves (0xd800 to 0xdfff) - invalid Unicode codepoints, and discovered that they seem to be permitted. I am curious as to why this is. Is it simply because being a bounded item means only defining a maximum and a minimum?
Disallowing the surrogate code units would indeed make Char a more correct type for Unicode code points. The Report says that Char is "an enumeration whose values represent Unicode characters", so probably this should be considered a GHC bug.
There's no specific notion of "a bounded item", but it would require extra checks in various places (right now chr just needs to make one comparison to check if its argument is valid, for instance) and possibly make some things behave more strangely (if people indirectly expect code points to be contiguous).
I don't know that there's an especially good rationale for it, though, or that the trade-off was even considered originally. In Haskell 1.4, Char was just a 16-bit type, so it would have been natural to extend it to 17*2^16 values without adding extra checks. This issue is occasionally brought up -- I've brought it up before -- but most people don't seem to worry about it very much. It's probably reasonable to file a GHC bug about it, though, to get a proper discussion going.
Note that Data.Text (which uses UTF-16 as its internal representations) does disallow the invalid code units (it has to).

Resources