Read token after token from a file in Haskell

I want to read from a file in Visual Haskell Studio, token after token, loading the next token into a variable each time.
For example: getNextToken.
Thanks!! :)

You can use Alex, a lexer generator: the lexer it produces splits a string into a list of tokens. Then you can do whatever you want with the token list. If you really want to "load" them one at a time and put them in a variable, in a procedural way, I'm not sure Haskell is the right language for it.

Well, the right answer is somewhat complicated:
Use a parser combinator library like Parsec that will let you fully define the meaning of the word 'token', which varies from context to context.

Related

Converting an ASTNode into code

How does one convert an ASTNode (or at least a CompilationUnit) into a valid piece of source code?
The documentation says that one shouldn't use toString, but doesn't mention any alternatives:
Returns a string representation of this node suitable for debugging purposes only.
CompilationUnits have rewrite, but that one does not work for ASTs created by hand.
Formatting options would be nice to have, but I'd basically be satisfied with anything that turns arbitrary ASTNodes into semantically equivalent source code.
In JDT the normal way for AST manipulation is to start with a basic CompilationUnit and then use a rewriter to add content. Then ASTRewriteAnalyzer / ASTRewriteFormatter should take care of creating formatted source code. Creating a CU just containing a stub type declaration shouldn't be hard, so that's one option.
If that doesn't suit your needs, you may want to experiment with directly calling the internal org.eclipse.jdt.internal.core.dom.rewrite.ASTRewriteFlattener.asString(ASTNode, RewriteEventStore). If you are not editing existing files, you can probably ignore the events collected in the RewriteEventStore and just use the returned String.

make menhir find all alternatives?

I would like to change the behavior of menhir's output in the following way:
I want it to look up all grammatical alternatives, if it finds any, put them in a list, and give me back this ambiguous interpretation. It should not resolve conflicts, just store them.
In the source code of menhir, it seems to me, that I have to look in "Engine.ml". The resultant syntactically determined token comes in a variant type item "Accepted v" as a state of a checkpoint of the grammatical automaton. This content is found by a function "accept env prod" before, that is part of a bundle of recursive functions, that change the states.
Do you have a tip on how I could change these functions to put all the possible results in a list and proceed as if nothing happened? Or do you think that this won't work anyway?
Thanks.
What you are looking for is a GLR parser generator (G is for generalized). Menhir is not such a tool, and I doubt you could modify it easily to do what you want.
However, there is another tool that does exactly what you want: dypgen.

Use Alex macros from another file

Is there any way to have an Alex macro defined in one source file and used in other source files? In my case, I have definitions for $LowerCaseLetter and $UpperCaseLetter (these are all letters except e and O, since they have special roles in my code). How can I refer to these macros from other .x files?
Proving that something doesn't exist is always harder than finding something that does, but I think the info below shows that Alex can only get macro definitions from the .x file it is reading (other than predefined stuff like $white), and not via includes from other files.
You can get the sourcecode for Alex by doing the following:
> cabal unpack alex
> cd alex-3.1.3
In src/Main.hs, predefined macros are first set in variables called initSetEnv (charset macros $white, $printable, and "."), and initREEnv (regexp macros, there are none). This gets passed into runP, in src/ParseMonad.hs, which is used to hold the current parsing state, including all defined macros. The initial state is set using the values passed in, but macros can be added using a function called newSMac (or newRMac for regular expression macros).
Since this seems to be the only way that macros can be set, it is then only a matter of some grep bookkeeping to verify that the only way macros can be added is through an actual macro definition in the source .x file. Unsurprisingly, Alex recursively uses its own .x/.y files for .x source file parsing (src/parser.y, src/Scan.x). It is a couple of levels of indirection away, but you can verify that the only way newSMac can be called is through the src/Scan.x macro
#smac = \$ #id | \$ \{ #id \}
<0> #smac #ws? \= { smacdef }
Other than some obvious predefined stuff, I don't believe reuse in lexers is all that typical anyway, because at the token level things are usually pretty simple (often simple tokens like SPACE, WORD, NUMBER, and a few operators, symbols and parens are all that are needed). The complexity comes at the parsing stage, although for technical reasons, parser-includes aren't that common either (see scannerless parsing for a newer technology that does allow reuse through nesting, like javascript embedded in html.... The tools for scannerless parsing are still pretty primitive though).

Does this Groovy closure token '->' have a name or a nickname?

Followup to this question. Groovy.codehaus.org just refers to it as a token. Are there any informal monikers floating around like 'splat' and 'bang'?
It is informally known as "the arrow". I cite this post from 2008. Though I can't claim it is a standard idiom, I've never heard it called anything else (in conversation).
In the antlr parser script, it's called the 'closable block operator'
I don't know of a snappier name than 'arrow'
I've heard it called "Rocket" before.
Also a "Fat Rocket"/"Hash Rocket" would be '=>'
But this more applies to Ruby lingo.

lexer/parser ambiguity

How does a lexer solve this ambiguity?
/*/*/
How is it that it doesn't just say, oh yeah, that's the beginning of a multi-line comment, followed by another multi-line comment?
Wouldn't a greedy lexer just return the following tokens?
/*
/*
/
I'm in the midst of writing a shift-reduce parser for CSS and yet this simple comment thing is in my way. You can read this question if you want some more background information.
UPDATE
Sorry for leaving this out in the first place. I'm planning to add extensions to the CSS language in this form /* # func ( args, ... ) */ but I don't want to confuse an editor which understands CSS but not this extension comment of mine. That's why the lexer just can't ignore comments.
One way to do it is for the lexer to enter a different internal state on encountering the first /*. For example, flex calls these "start conditions" (matching C-style comments is one of the examples on that page).
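The state-switching idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how flex implements start conditions; the state names INITIAL and COMMENT and the token kinds TEXT, COMMENT and UNTERMINATED_COMMENT are made up for the example.

```python
def lex(text):
    """Split text into (kind, lexeme) tokens using two lexer states."""
    tokens = []
    state = "INITIAL"  # hypothetical state names, modelled on start conditions
    start = 0          # index where the current token began
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if state == "INITIAL" and pair == "/*":
            if i > start:
                tokens.append(("TEXT", text[start:i]))
            state, start = "COMMENT", i
            i += 2  # consume both characters so this '*' cannot close the comment
        elif state == "COMMENT" and pair == "*/":
            i += 2
            tokens.append(("COMMENT", text[start:i]))
            state, start = "INITIAL", i
        else:
            i += 1  # inside a comment, a second "/*" is just ordinary content
    if i > start:
        # Report an unterminated comment instead of silently dropping it.
        kind = "UNTERMINATED_COMMENT" if state == "COMMENT" else "TEXT"
        tokens.append((kind, text[start:i]))
    return tokens
```

With this sketch, lex("/*/*/") yields a single COMMENT token, because the opening /* is consumed as a unit and the lexer only looks for */ while it is in the COMMENT state.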
The simplest way would probably be to lex the comment as one single token - that is, don't emit a "START COMMENT" token, but instead continue reading in input until you can emit a "COMMENT BLOCK" token that includes the entire /*(anything)*/ bit.
Since comments are not relevant to the actual parsing of executable code, it's fine for them to basically be stripped out by the lexer (or at least, clumped into a single token). You don't care about token matches within a comment.
In most languages, this is not ambiguous: the first slash and asterisk are consumed to produce the "start of multi-line comment" token. It is followed by a slash which is plain "content" within the comment and finally the last two characters are the "end of multi-line comment" token.
Since the first 2 characters are consumed, the first asterisk cannot also be used to produce an end of comment token. I just noted that it could produce a second "start of comment" token... oops, that could be a problem, depending on how much context is available to the parser.
I speak here of tokens, assuming a parser-level handling of the comments. But the same applies to a lexer, whereby the underlying rule is to start with '/*' and then not stop till '*/' is found. Effectively, a lexer-level handling of the whole comment wouldn't be confused by the second "start of comment".
Since CSS does not support nested comments, your example would typically parse into a single token, COMMENT.
That is, the lexer would see /* as a start-comment marker and then consume everything up to and including a */ sequence.
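As a rough sketch of lexing the whole comment as one token, the Python snippet below uses a non-greedy regular expression so that matching stops at the first */. The token names COMMENT, SPACE and CHAR are invented for the example, and everything that is not a comment or whitespace is deliberately left as single characters; a real CSS lexer would have many more rules.

```python
import re

# One alternative per token kind; COMMENT comes first so "/*" is never split.
# re.DOTALL lets '.' match newlines, so multi-line comments are one token.
TOKEN_RE = re.compile(
    r"(?P<COMMENT>/\*.*?\*/)"  # non-greedy: ends at the first "*/"
    r"|(?P<SPACE>\s+)"
    r"|(?P<CHAR>.)",           # fallback: any other single character
    re.DOTALL,
)

def tokenize(css):
    """Return a list of (kind, lexeme) pairs for the given input."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(css)]
```

Here tokenize("/*/*/") returns one COMMENT token covering all five characters, and two comments on the same line stay separate because the match is non-greedy.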
Use the regexp's algorithm: when you reach a closing */, search backward from the current location toward the beginning of the string for the opening /*.
if (chars[currentLocation] == '/' && chars[currentLocation - 1] == '*') {
    // Start at currentLocation - 3 so the '*' of the closing "*/" is not reused as the '*' of an opening "/*".
    for (int i = currentLocation - 3; i >= 0; i--) {
        if (chars[i] == '/' && chars[i + 1] == '*') {
            // .......
        }
    }
}
It's like applying the regexp /\*([^*]|\*+[^*/])*\*+/ greedy and bottom-up.
One way to solve this would be to have your lexer return:
/
*
/
*
/
And have your parser deal with it from there. That's what I'd probably do for most programming languages, as the /'s and *'s can also be used for multiplication and other such things, which are all too complicated for the lexer to worry about. The lexer should really just be returning elementary symbols.
If what the token is starts to depend too much on context, what you're looking for may very well be a simpler token.
That being said, CSS is not a programming language, so /'s and *'s can't be overloaded. As far as I know they can't be used for anything other than comments. So I'd be very tempted to just pass the whole thing as a comment token unless you have a good reason not to: /\*.*?\*/ (non-greedy, so the match stops at the first */ rather than swallowing everything up to the last one).
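For comparison, the symbol-per-character alternative described above might look like the following Python sketch, where the lexer only emits elementary symbols and a later stage folds them into comments. The function names lex_symbols and group_comments and the node kinds SYMBOL and COMMENT are hypothetical. Note that this toy lexer drops whitespace, so the comment body loses its spacing, which is one reason a single comment token is usually the easier route for CSS.

```python
def lex_symbols(text):
    """Lexer stage: emit every non-space character as its own symbol."""
    return [ch for ch in text if not ch.isspace()]

def group_comments(symbols):
    """Parser stage: fold '/', '*', ..., '*', '/' runs into COMMENT nodes."""
    out, i = [], 0
    while i < len(symbols):
        if symbols[i:i + 2] == ["/", "*"]:
            end = i + 2  # search for the closing pair after the opener
            while end < len(symbols) and symbols[end:end + 2] != ["*", "/"]:
                end += 1
            if end >= len(symbols):
                raise ValueError("unterminated comment")
            out.append(("COMMENT", "".join(symbols[i + 2:end])))
            i = end + 2
        else:
            out.append(("SYMBOL", symbols[i]))
            i += 1
    return out
```

Because the search for the closing pair starts after the two opening symbols, group_comments(lex_symbols("/*/*/")) produces one COMMENT node whose body is the single character "/".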

Resources