Use Alex macros from another file - haskell

Is there any way to have an Alex macro defined in one source file and used in other source files? In my case, I have definitions for $LowerCaseLetter and $UpperCaseLetter (these are all letters except e and O, since they have special roles in my code). How can I refer to these macros from other .x files?

Disproving that something exists is always harder than finding something that does exist, but I think the info below shows that Alex can only get macro definitions from the .x file it is reading (other than predefined stuff like $white), and not via includes from other files.
You can get the source code for Alex by doing the following:
> cabal unpack alex
> cd alex-3.1.3
In src/Main.hs, predefined macros are first set in variables called initSetEnv (the charset macros $white, $printable, and ".") and initREEnv (regexp macros; there are none). These get passed into runP, in src/ParseMonad.hs, which holds the current parsing state, including all defined macros. The initial state is set from the values passed in, but macros can be added using a function called newSMac (or newRMac for regular-expression macros).
Since this seems to be the only way that macros can be set, it is then only a matter of some grep bookkeeping to verify that the only way macros can be added is through an actual macro definition in the source .x file. Unsurprisingly, Alex uses its own .x/.y files to parse .x source files (src/parser.y, src/Scan.x). It is a couple of levels of indirection away, but you can verify that the only way newSMac can be called is through the src/Scan.x macro
@smac = \$ @id | \$ \{ @id \}
<0> @smac @ws? \= { smacdef }
Other than some obvious predefined stuff, I don't believe reuse in lexers is all that typical anyway, because at the token level things are usually pretty simple (often simple tokens like SPACE, WORD, NUMBER, and a few operators, symbols, and parens are all that are needed). The complexity comes at the parsing stage, although for technical reasons parser-includes aren't that common either (see scannerless parsing for a newer technology that does allow reuse through nesting, like JavaScript embedded in HTML; the tools for scannerless parsing are still pretty primitive, though).
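If you really do need to share $LowerCaseLetter and $UpperCaseLetter, the usual workaround (an assumption about your build setup, not anything Alex itself supports) is to keep the definitions in a shared fragment and concatenate it with each lexer body before invoking alex. A minimal sketch in Python, with hypothetical file names macros.x and MyLexer.body.x:

import subprocess

# macros.x holds the shared macro definitions; MyLexer.body.x holds the
# rest of the lexer specification. Both file names are made up.
def build_lexer(body, out="MyLexer.x"):
    with open(out, "w") as dst:
        for part in ("macros.x", body):
            with open(part) as src:
                dst.write(src.read())
            dst.write("\n")
    subprocess.run(["alex", out], check=True)  # writes MyLexer.hs

build_lexer("MyLexer.body.x")

The same thing can of course be done with cat in a Makefile rule; the point is only that the concatenation happens before Alex ever reads the file.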

Related

Arbitrary lookaheads in PLY

I am trying to parse a config, which would translate to a structured form. This new form requires that comments within the original config be preserved. The parsing tool is PLY. I am running into an issue with my current approach, which I will describe in detail below, with links to code as well. The config file is going to contain multiple config blocks, each of which is of the following format:
<optional comments>
start_of_line request_stmts(one or more)
indent reply_stmts (zero or more)
include_stmts (type 3)(zero or more)
An example config file looks like this.
While I am able to partially parse the config file with the grammar below, I fail to accommodate comments which would exist within a block.
For example, a block like this raises syntax errors, and any comments in a block of config fail to parse.
<optional comments>
start_of_line request_stmts(type 1)(one or more)
indent reply_stmts (type 2)(one or more)
<comments>
include_stmts (type 3)(one or more)(optional)
The parser.out mentions one shift/reduce conflict, which I think arises because once the reply_stmts are parsed, a comments section which follows could either mark the start of a new block or be comments within the sub-block. The current grammar's parsing result for the example file:
[['# test comment ', '# more of this', '# does this make sense'], 'DEFAULT', [['x', '=', 'y']], [['y', '=', '1']], ['# Transmode', '# maybe something else', '# comment'], '/random/location/test.user']
As you might notice, the second config block completely misses the username, request_stmt, and reply_stmt sections.
What I have tried
I have tried moving the comments section around in the grammar, by specifying it before specific blocks or in the statement grammar. In the code link pasted above, the comments section has been specified in the overall statement grammar. Both of these approaches fail to parse comments within a config block.
username : comments username
| username
include_stmt : comments includes
| includes
I have two main questions:
Is there a mistake I am making in my implementation or understanding of LR parsing, solving which I could achieve what I want?
Is there a better way to achieve the same goal than my current approach? (PLY-fu, a different parser, a different grammar)
P.S. I wasn't able to include the actual code in the question; it is mentioned in the comments.
You are correct that the problem is that when the parser sees a comment, it cannot know whether the comment belongs to the same section or whether the previous section is finished. In the former case, the parser needs to shift the comment, while in the latter case it needs to reduce the configuration section.
Since there could be any number of comments, the necessary lookahead could be arbitrarily large, in which case LR parsing wouldn't be possible. But a simple trick can reduce the lookahead to two tokens: just combine consecutive comments into a single token.
Any LR(k) grammar has an equivalent LR(1) grammar. In effect, the LR(1) grammar works by delaying all decisions for k-1 tokens, accumulating those tokens into the parser state. That's a massive increase in grammar size, but it's usually possible to achieve the same effect in other ways, and that's certainly the case here.
The basic idea is that any comment is (temporarily) accumulated into a list of comments. When a non-comment token is encountered, this temporary list is attached to that token.
This can be done either in the lexical scanner or in the parser actions, depending on your inclinations.
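Here is a minimal sketch of the scanner-side version in PLY. The token names, regexes, and toy input are assumptions standing in for your real grammar; the idea is a thin wrapper around the generated lexer that swallows COMMENT tokens and attaches the accumulated comments to the next real token, so the parser never sees comments at all:

import ply.lex as lex

# Assumed toy token set. COMMENT must be declared for PLY's sake,
# but the grammar never uses it.
tokens = ('COMMENT', 'WORD')

t_ignore = ' \t'

def t_COMMENT(t):
    r'\#[^\n]*'
    # Accumulate the comment and emit nothing; it is attached to the
    # next real token by the wrapper below.
    t.lexer.pending_comments.append(t.value)

def t_WORD(t):
    r'[^\s#]+'
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    t.lexer.skip(1)

class CommentAttachingLexer:
    """Wraps a PLY lexer so each real token carries the comments that
    immediately preceded it, as a list in token.comments."""
    def __init__(self, lexer):
        self.lexer = lexer
        self.lexer.pending_comments = []

    def input(self, data):
        self.lexer.input(data)

    def token(self):
        tok = self.lexer.token()
        if tok is not None:
            tok.comments = self.lexer.pending_comments
            self.lexer.pending_comments = []
        return tok

lexer = CommentAttachingLexer(lex.lex())
lexer.input("# one\n# two\nkey = value\n")
while True:
    tok = lexer.token()
    if tok is None:
        break
    print(tok.type, tok.value, tok.comments)

With this wrapper passed to the parser (parser.parse(data, lexer=lexer)), the grammar needs no COMMENT token and no comments nonterminal; each statement can read its leading comments off its first token in the action, and the shift/reduce conflict disappears.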
Before attempting all that, you should make sure that retaining comments is really useful to your application. Comments are normally not relevant to the semantics of a program (or configuration file), and it would certainly be much simpler for the lexer to just drop comments into the bit-bucket. If your application will end up reformatting the input, then it will have to retain comments. But if it only needs to extract information from the configuration, putting a lot of effort into handling comments is hard to justify.

squeak (smalltalk) how to 'inject' string into string

I'm writing a class named "MyObject".
One of its methods is:
addTo: aCodeString assertType: aTypeCollection
When the method is called with aCodeString, I want to add (at runtime) a new method to the MyObject class whose source code is aCodeString, and to inject type-checking code into that source code.
For example, if I call addTo:assertType: like this:
a := MyObject new.
a addTo: 'foo: a boo: b baz: c
^(a*b+c)'
assertType: #(SmallInteger SmallInteger SmallInteger).
I expect that I could later write:
answer := (a foo: 2 boo: 5 baz: 10).
and get 20 in answer.
and if I write:
a foo: 'someString' boo: 5 baz: 10.
I'd get an appropriate error message, because 'someString' is not a SmallInteger.
I know how to write the type-checking code, and I know that to add the method to the class at runtime I can use the compile: method from the Behavior class.
The problem is that I want to add the type-checking code inside the source code.
I'm not really familiar with all of Squeak's classes, so I'm not sure whether I should edit aCodeString as a string inside addTo:assertType: and then use compile: (and I don't know how to do so), or whether there is a way to inject code into an existing method via Behavior or some other Squeak class.
So basically, what I'm asking is how I can inject a string into an existing string, or inject code into an existing method.
There are many ways you could achieve such type checking...
The one you propose is to modify the source code (a String) so as to insert additional pre-condition type checks.
The key point with this approach is that you will have to insert the type checking at the right place. That means somehow parsing the original source (or at least the selector and arguments) so as to find its exact span (and the argument names).
See the method initPattern:return: in Parser and its senders. You will find quite low-level (not the most beautiful) code that feeds the block (passed through the return: keyword) with sap, an Array of 3 objects: the method selector, the method arguments, and the method precedence (a code telling whether the method corresponds to a unary, binary, or keyword message). From there, you'll have enough material for the source-code manipulation (insert a string into another with copyReplace:from:to:with:).
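To make the string-splicing idea concrete, here is a deliberately naive, language-neutral sketch in Python (in Squeak you would locate the pattern with the Parser as just described rather than splitting on the first line; the isKindOf: check and the error message are my assumptions about what your type-checking code looks like):

# Given a method source whose first line is the message pattern
# (e.g. 'foo: a boo: b baz: c'), splice one type check per argument
# in right after the pattern.
def inject_checks(source, types):
    pattern, _, body = source.partition('\n')
    args = [w for w in pattern.split() if not w.endswith(':')]
    checks = ''.join(
        "(%s isKindOf: %s) ifFalse: [self error: 'type mismatch'].\n"
        % (arg, typ)
        for arg, typ in zip(args, types))
    return pattern + '\n' + checks + body

print(inject_checks('foo: a boo: b baz: c\n^(a*b+c)',
                    ['SmallInteger', 'SmallInteger', 'SmallInteger']))

The result is the original method with one precondition statement per argument, ready to be handed to compile:.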
Do not hesitate to write small snippets of code and execute them in the Debugger (select the code to debug, then use the 'debug it' menu entry or ALT+Shift+D). Also use the inspectors extensively to gain more insight into how things work!
Another solution is to build the whole Abstract Syntax Tree (AST) of the source code and manipulate that AST to insert the type checks. Normally, the Parser builds the AST, so observe how it works. From the modified AST, you can then generate a new CompiledMethod (the bytecode instructions) and install it in the methodDictionary; see the source code of compile: and follow the messages sent until you discover generateMethodFromNode:trailer:. This is a bit more involved, and has the bad side effect that the source code is now out of phase with the generated code, which might become a problem once you want to debug the method (fortunately, Squeak can use decompiled code in place of source code!).
Last, you can also arrange to have an alternate compiler and parser for some of your classes (see compilerClass and/or parserClass). The alternate TypeHintParser would accept a modified syntax with the type hints in the source code (once upon a time, this was implemented with type hints following the arguments inside angle brackets: foo: x <Integer> bar: y <Number>). The alternate TypeHintCompiler would then arrange to compile the preconditions automatically, given those type hints. Since you would by then be very advanced in Squeak, you would also create a special mapping between source-code indices and bytecodes so as to have a sane debugger, and even a special Decompiler class that could recognize the precondition type checks and transform them back into type hints, just in case.
My advice would be to start with the first approach that you are proposing.
EDIT
I forgot to say: there is yet another way, but it is currently available in Pharo rather than Squeak. The Pharo compiler (named OpalCompiler) reifies the bytecode instructions as objects (class names beginning with IR) during the generation phase, so it is also possible to manipulate the bytecode instructions directly by hacking at this stage... I'm pretty sure we could find examples of usage. It is probably the most advanced technique.

Converting an ASTNode into code

How does one convert an ASTNode (or at least a CompilationUnit) into a valid piece of source code?
The documentation says that one shouldn't use toString, but doesn't mention any alternatives:
Returns a string representation of this node suitable for debugging purposes only.
CompilationUnits have rewrite, but that one does not work for ASTs created by hand.
Formatting options would be nice to have, but I'd basically be satisfied with anything that turns arbitrary ASTNodes into semantically equivalent source code.
In JDT the normal way to manipulate an AST is to start with a basic CompilationUnit and then use a rewriter to add content. ASTRewriteAnalyzer / ASTRewriteFormatter should then take care of creating formatted source code. Creating a CU just containing a stub type declaration shouldn't be hard, so that's one option.
If that doesn't suit your needs, you may want to experiment with directly calling the internal org.eclipse.jdt.internal.core.dom.rewrite.ASTRewriteFlattener.asString(ASTNode, RewriteEventStore). If you are not editing existing files, you can probably ignore the events collected in the RewriteEventStore and just use the returned String.
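Spelled out, that call looks like the following sketch. These are internal, unsupported classes, and their package location and signatures can change between JDT releases, so treat the exact names as an assumption about your version:

import org.eclipse.jdt.core.dom.ASTNode;
import org.eclipse.jdt.internal.core.dom.rewrite.ASTRewriteFlattener;
import org.eclipse.jdt.internal.core.dom.rewrite.RewriteEventStore;

public final class AstFlatten {
    // Flatten a hand-built AST node into source text; an empty event
    // store is fine when no existing file is being edited.
    static String flatten(ASTNode node) {
        return ASTRewriteFlattener.asString(node, new RewriteEventStore());
    }
}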

sas generate all possible misspellings

Does anyone know how to generate the possible misspellings of a word?
Example: unemployment
- uemployment
- onemploymnet
- etc.
If you just want to generate a list of possible misspellings, you might try a tool like this one. Otherwise, in SAS you might be able to use a function like COMPGED to compute a measure of the similarity between the string someone entered, and the one you wanted them to type. If the two are "close enough" by your standard, replace their text with the one you wanted.
Here is an example that computes the Generalized Edit Distance between "unemployment" and a variety of plausible misspellings.
data misspell;
   length misspell string $16.;  /* set lengths before the variables are used */
   retain string "unemployment";
   input misspell $16.;
   GED = compged(misspell, string, 'iL'); /* i: ignore case, L: remove leading blanks */
datalines;
nemployment
uemployment
unmployment
uneployment
unemloyment
unempoyment
unemplyment
unemploment
unemployent
unemploymnt
unemploymet
unemploymen
unemploymenyt
unemploymenty
unemploymenht
unemploymenth
unemploymengt
unemploymentg
unemploymenft
unemploymentf
blahblah
;
proc print data=misspell label;
   label GED='Generalized Edit Distance';
   var misspell string GED;
run;
Essentially you are trying to develop a list of text strings based on some rules of thumb: one letter is missing from the word, a letter is misplaced into the wrong spot, one letter was mistyped, and so on. The problem is that these rules have to be explicitly defined before you can write the code, in SAS or any other language (this is what Chris was referring to). If your requirement is reduced to this one-wrong-letter scenario, then it might be manageable; otherwise, the commenters are correct and you can easily create massive lists of incorrect spellings (after all, every combination except "unemployment" constitutes a misspelling of that word).
Having said that, there are many ways in SAS to accomplish this text manipulation (the rx functions, some combination of other text-string functions, macros); however, there are probably better ways. I would suggest an external Perl process to generate a text file that can be read into SAS, but other programmers might have better alternatives.
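For the one-wrong-letter scenario, that external process is only a few lines of script. A sketch in Python rather than Perl (the function name is mine; it enumerates the classic edit-distance-1 set of deletions, transpositions, substitutions, and insertions):

import string

def one_edit_variants(word, alphabet=string.ascii_lowercase):
    """All strings one deletion, transposition, substitution,
    or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | transposes | substitutes | inserts) - {word}

for m in sorted(one_edit_variants("unemployment")):
    print(m)

Redirect the output to a text file and read it into SAS with an infile statement.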
If you are looking for a general spell checker, SAS does have proc spell.
It will take some tweaking to get it working for your situation; it's very old and clunky. It doesn't work well in this case, but you may have better results if you try another dictionary. A Google search will show other examples.
filename name temp lrecl=256;
options caps;
data _null_;
   file name;
   informat name $256.;
   input name &;
   put name;
cards;
uemployment
onemploymnet
;
proc spell in=name
   dictionary=SASHELP.BASE.NAMES
   suggest;
run;
options nocaps;

How to number floats in LaTeX consistently?

I have a LaTeX document where I'd like the numbering of floats (tables and figures) to be in one numeric sequence from 1 to x rather than two sequences according to their type. I'm not using lists of figures or tables either and do not need to.
My documentclass is report and typically my floats have captions like this:
\caption{Breakdown of visualisations created.}
\label{tab:Visualisation_By_Types}
A quick way to do it is to put \addtocounter{table}{1} after each figure, and \addtocounter{figure}{1} after each table.
It's not pretty, and on a longer document you'd probably want to either include that in your style sheet or template, or go with cristobalito's solution of linking the counters.
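Concretely, the quick-fix pattern looks like this (a sketch, reusing the caption from the question):

\begin{figure}
  \centering
  % \includegraphics[width=.5\textwidth]{example-image}
  \caption{Breakdown of visualisations created.}
\end{figure}
\addtocounter{table}{1}% keep the table counter in step with the figure counter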
The differences between the figure and table environments are very minor -- little more than them using different counters, and being maintained in separate sequences.
That is, there's nothing stopping you putting your {tabular} environments in a {figure}, or your graphics in a {table}, which would mean that they'd end up in the same sequence. The problem with this case (as Joseph Wright notes) is that you'd have to adjust the \caption, so that doesn't work perfectly.
Try the following, in the preamble:
\makeatletter
\newcounter{unisequence}
\def\ucaption{%
  \ifx\@captype\@undefined
    \@latex@error{\noexpand\ucaption outside float}\@ehd
    \expandafter\@gobble
  \else
    \refstepcounter{unisequence}% <-- the only change from default \caption
    \expandafter\@firstofone
  \fi
  {\@dblarg{\@caption\@captype}}%
}
\def\thetable{\@arabic\c@unisequence}
\def\thefigure{\@arabic\c@unisequence}
\makeatother
Then use \ucaption in your tables and figures, instead of \caption (change the name ad lib). If you want to use this same sequence in other environments (say, listings?), then define \the<foo> the same way.
My earlier attempt at this is in fact completely broken, as the OP spotted: getting the list of figures wrong is, instead of being trivial and only fiddly to fix, absolutely fundamental (ho, hum).
(For the aficionados: it comes about because \advance commands are processed in TeX's gut, but the content of the .lof, .lot, and .aux files is fixed in TeX's mouth, at expansion time. Thus what was written to the files was whatever random value \@tempcnta had at the point \caption was called, ignoring the \advance calculations, which were dutifully written to the file and then ignored. Doh: how long have I known this but never internalised it!?)
Dutiful retention of earlier attempt (on the grounds that it may be instructively wrong):
No problem: try putting the following in the preamble:
\makeatletter
\def\tableandfigurenum{\@tempcnta=0
  \advance\@tempcnta\c@figure
  \advance\@tempcnta\c@table
  \@arabic\@tempcnta}
\let\thetable\tableandfigurenum
\let\thefigure\tableandfigurenum
\makeatother
...and then use the {table} and {figure} environments as normal. The captions will have the correct 'Table/Figure' text, but they'll share a single numbering sequence.
Note that this example gets the numbers wrong in the listoffigures/listoftables, but (a) you say you don't care about that, (b) it's fixable, though probably mildly fiddly, and (c) life is hard!
I can't remember the syntax, but you're essentially looking for counters. Have a look here, under the custom floats section. Assign the counters for both tables and figures to the same thing and it should work.
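For the record, the simplest way to tie the counters together is a preamble one-liner (a sketch of the linking idea; it makes both environments step the same underlying count register, and I haven't checked how it interacts with hyperref or the lists of floats):

\makeatletter
\let\c@table\c@figure % {table} and {figure} now share one counter register
\makeatother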
I'd just use one type of float (let's say 'figure'), then use the caption package to remove the automatically added "Figure" text from the caption and deal with it by hand.
