Using XText to create a DSL for describing proprietary XML-formats

Using XText to create a DSL for describing proprietary XML-formats - dsl

At the moment, I have to work with XACML. As there doesn't seem to be an editor to fit my needs, and as writing documents in it is a real pain, I wonder if I could not create some sort of DSL to make creating documents easier (are less error-prone). Is this possible with XText? I have a feeling it's possible but quite hard to do (especially for someone who doesn't know XText ;-)).

Getting rid of manually edited XML files is a typical use case for Xtext. The tedious part is the syntax definition itself. As soon as you have an idea how your files should look like, it's usually straight forward to get a working prototype with Xtext. What sort of concerns do you have?

Related

Tools for Domain Specific Language/Functions

Our users can enter questions that get answered by students. Our users need a extensible, flexible way to define the correct answers to these questions (which are stored as a simple string).
I would like to expose a library of domain specific functions that users can call on to describe the correct answer. Eg:
exact_match("puppy") // means the correct answer is the string 'puppy'
or
contains("yesterday") // means any answer with the word 'yesterday' is correct
The naive implementation would involve eval'ing user supplied strings in a sandboxed runtime (like a javascript vm or ruby vm). But I'd like to go further and only allow specific functions to be called. Any other scripting would be discarded. Such that:
puts("foo"); contains("yesterday")
would be illegal. Since we don't expose or allow puts().
How can I constrain the execution environment to only run a whitelist of functions? Or is there a different approach to build this kind of external-facing DSL instead of trying to constrain an existing language to a subset of functions?

I would check out MPS by JetBrains if I were you, its an open source DSL creation tool. I have never used it myself, but from everything I have seen on it, it's very intuitive; and all of their other products are incredibly powerful.

Just because you're creating a DSL, that doesn't necessarily mean that you have to give the user the ability to enter the code in text.
The key to this is providing a list of method names and your special keyword for them, the "FunCode" tag in the code example below:
Create a mapping from keyword to code, and letting them define everything they need, and then use it. And I would actually build my own XML parser so that it's not hackable, at least not on a list of zero-day-exploits hackable.
<strDefs>
<strDef><strNam>sickStr</strNam>
<strText>sick</strText><strNum>01</strNum><strDef>
<strDef><strNam>pupStr</strNam>
<strText>puppy</strText><strNum>02</strNum><strDef>
</strDefs>
<funDefs>
<funDef><funCode>pfContainsStr</funCode><funLabel>contains</funLabel>
<funNum>01</funNum></funDef>
<funDef><funCode>pfXact</funCode><funLabel>exact_match</funLabel>
<funNum>02</funNum></funDef>
</funDefs>
<queries>
<query><fun>01</fun><str>02</str>
</query>
</queries>
The above XML more represents the idea and the structure of what to do, but rather in a user interface, so the user is constrained. The user interface code that allows the data-entry of the above data should be running on your server, and they only interact with it. Any code that runs on their browser is hackable, because they can just save the page, edit the HTML (and/or JavaScript), and run that, which is their code now, not yours anymore.
You can't really open the door (pandora's box) and allow just anyone to write just any code and have it evaluated / interpreted by the language parser, because some hacker is going to exploit it. You must lock down the strings, probably by having them enter them into your database in an earlier step, and each string gets its own token that YOU generate (a SQL Server primary key is very simple, usable, and secure), but give them a display representation so it's readable to them.
Then give them a list of methods / functions they can use, along with a token (a primary key can also serve here, perhaps with a kind of table prefix) and also a display representation (label).
If you have them put all of their labels into yet another table, you can have SQL make sure that all of their labels are unique to each other in the whole "language", and then you can allow them to try to define their expressions in the language they want to use. This has the advantage that foreign languages can be used, but you don't have to do anything terribly special.
An important piece would be the verify button, that would translate their expression into unique tokens and back again, checking that the round-trip was successful. If it wasn't successful, there's some kind of ambiguity, and you might be able to allow them an option to use the list of tokens as the source in that case.
If you heavily rely on set-based logic for the underlying foundation of the language and your tables, you should be able to produce a coherent DSL that works. Many DSL creation problems are ones of integrity, where there are underlying assumptions that are contradictory, unintentionally mutually exclusive, or nonsensical. Truth is an unshakeable foundation. Anything else has a lie somewhere -- that you're trying to build on.
Sudoku is illustrative here. When you screw up a Sudoku, you often don't know that you have done so, and you keep building on that false foundation, until you get to the completion of the puzzle, and one whole string of assumptions disagrees with a different string of assumptions. They can't both be true. But you can't tell where you went wrong because you're too far away from the mistake and can not work backwards (easily). All steps taken look correct. A DSL, a database schema, and code, are all this way. Baby steps, that are double- and even triple-checked, and hopefully "correct by inspection", are the best way to "grow" a DSL, slowly, piece-by-piece. The best way to not have flaws is to not add them in the first place.
You don't want bugs in your DSL. Keep it spartan. KISS - Keep it simple, Sparticus! And I have personally found that keeping it set-based, if not overtly, under the covers, accomplishes this very well.
Finally, to be able to think this way, I've studied languages for a long time, and have cultivated a curiosity about how languages have come to be. Books are a good quality source of information, as they have a higher quality level than the internet, which is nevertheless also an indispensable source. Some of my favorite languages: Forth, Factor, SETL, F#, C#, Visual FoxPro (especially for its embedded SQL), T-SQL, Common LISP, Clojure, and probably my favorite, Dylan, an INFIX Lisp without parentheses that Apple experimented with and abandoned, with a syntax that seems to me reminiscent of Pascal, which I sort of liked. The language list is actually much longer than that (and I haven't written code for many of them -- just studied them or their genesis), but that's enough for now.
One of my favorite books, and immensely interesting for the "people" side of it, is "Masterminds of Programming: Conversations with the Creators of Major Programming Languages" (Theory in Practice (O'Reilly)) 1st Edition, Kindle Edition
by Federico Biancuzzi (Author), Chromatic (Author)
By the way, don't let them compromise the integrity of your DSL -- require that it is expressible set-based, and things should go well (IMHO). I hope it works out well for you. Add a comment to my answer telling me how it worked out, if you think of it. And don't forget to choose my answer if you think it's the best! We work hard for the money! ;-)

Extracting Validations using NLP

Considering me a newbie into this but if I want to develop an engine to extract only validations from a technical documents like functional specification. Usually the validations are quick to identify.
If this can be done somehow I can use it for further automation.
I checked and few frameworks are available like
https://opennlp.apache.org/
http://nlp.stanford.edu/
I did few POC's as well , but designing an intelligent engine having generic rules is where I am getting blocked.
Any pointers will be helpful ...

For extracting only validation you need to have the grammar or similar like structure maybe a regular expression. If you dont have ANY idea what type of validation you might face then you can follow un-supervised learning approach. Designing any generic rule is not as easy as it sound so you need to work extra hard for this.

Groovy and dynamic methods: need groovy veteran enlightment

First, I have to say that I really like Groovy and all the good stuff it is bringing to the Java dev world. But since I'm using it for more than little scripts, I have some concerns.
In this Groovy help page about dynamic vs static typing, there is this statement about the absence of compilation error/warning when you have typo in your code because it could be a call to a method added later at runtime:
It might be scary to do away with all of your static typing and
compile time checking at first. But many Groovy veterans will attest
that it makes the code cleaner, easier to refactor, and, well, more
dynamic.
I'm pretty agree with the 'more dynamic' part, but not with cleaner and easier to refactor:
For the other two statements I'm not sure: from my Groovy beginner perspective, this is resulting in less code, but in more difficult to read later and in more trouble to maintain (can not rely on the IDE anymore to find who is declaring a dynamic method and who is using one).
To clarify, I find that reading groovy code is very pleasant, I love the collection and closure (concise and expressive way of tackle complicated problem).
But I have a lot of trouble in these situations:
no more auto-completion inside 'builder' using Map (Of Map (of Map))
everywhere
confusing dynamic methods call (you don't know if it is a typo or a
dynamic name)
method extraction is more complicated inside closure (often resulting in code duplicate: 'it is only a small closure after all')
hard to guess closure parameters when you have to write one for a method of a subsystem
no more learning by browsing the code: you have to use text search instead
I can only saw some benefits with GORM, but in this case the dynamic method are wellknown and my IDE is aware of them (so it is more looking like a systematic code generation than dynamic method for me)
I would be very glad to learn from groovy veteran how they can attest of these benefits.

It does lead to different classes of bugs and processes. It also makes writing tests faster and more natural, helping to alleviate the bug issues.
Discovering where behavior is defined, and used, can be problematic. There isn't a great way around it, although IDEs are getting better at it over time.
Your code shouldn't be more difficult to read--mainline code should be easier to read. The dynamic behavior should disappear into the application, and be documented appropriately for developers that need to understand functionality at those levels.
Magic does make discovery more difficult. This implies that other means of documentation, particularly human-readable tests (think easyb, spock, etc.) and prose, become that much more important.

This is somewhat old, but i'd like to share my experience if someone comes looking for some thoughts on the topic:
Right now we are using eclipse 3.7 and groovy-eclipse 2.7 on a small team (3 developers) and since we don't have tests scripts, mostly of our groovy development we do by explicitly using types.
For example, when using service classes methods:
void validate(Product product) {
// groovy stuff
}
Box pack(List<Product> products) {
def box = new Box()
box.value = products.inject(0) { total, item ->
// some BigDecimal calculations =)
}
box
}
We usually fill out the type, which enable eclipse to autocomplete and, most important, allows us to refactor code, find usages, etc..
This blocks us from using metaprogramming, except for Categories which i found that are supported and is detected by groovy-eclipse.
Still, Groovy is pretty good and a LOT of our business logic is in groovy code.
We had two issues in production code when using groovy, and both cases were due bad manual testing.
We also have a lot of XML building and parsing, and we validate it before sending it to webservices and the likes.
There's a small script we use to connect to an internal system whose usage is very restricted (and not needed in other parts of the system). This code i developed using entirely dynamic typing, overriding methods using metaclass and all that stuff, but this is an exception.
I think groovy 2.0 (with groovy-eclipse coming along, of course) and it's #TypeChecked will be great for those of us that uses groovy as a "java++".

To me there are 2 types of refactoring:
IDE based refactoring (extract to method, rename method, introduce variable, etc.).
Manual refactoring. (moving a method to a different class, changing the return value of a method)
For IDE based refactoring I haven't found an IDE that does as good of a job with Groovy as it does with Java. For example in eclipse when you extract to method it looks for duplicate instances to refactor to call the method instead of having duplicated code. For Groovy, that doesn't seem to happen.
Manual refactoring is where I believe that you could see refactoring made easier. Without tests though I would agree that it is probably harder.
The statement at cleaner code is 100% accurate. I would venture a guess that good Java to good Groovy code is at least a 3:1 reduction in lines of code. Being a newbie at Groovy though I would strive to learn at least 1 new way to do something everyday. Something that greatly helped me improve my Groovy was to simply read the APIs. I feel that Collection, String, and List are probably the ones that have the most functionality and I used the most to help make my Groovy code actually Groovy.
http://groovy.codehaus.org/groovy-jdk/java/util/Collection.html
http://groovy.codehaus.org/groovy-jdk/java/lang/String.html
http://groovy.codehaus.org/groovy-jdk/java/util/List.html
Since you edited the question I'll edit my answer :)
One thing you can do is tell intellij about the dynamic methods on your objects: What does 'add dynamic method' do in Groovy/IntelliJ?. That might help a little bit.
Another trick that I use is to type my objects when doing the initial coding and remove the typing when I'm done. For example I can never seem to remember if it's .substring(..) or .subString(..) on a String. So if you type your object you get a little better code completion.
As for your other bullet points, I'd really need to look at some code to be able to give a better answer.

Refactoring: Cross Cutting Concerns workaround

Is there a workaround for implementing cross cutting concerns without going into aspects and point cuts et al.?
We're in Spring MVC, and working on a business app where it's not feasible to go into AspectJ or Spring's aspect handling due to various reasons.
And some of our controllers have become heavily bloated (too heavily), with tons of out-of-focus code creeping in everywhere.
Everytime I sit down to refactor, I see the same things being done over and over again. Allow me to explain:
Everytime I have to prepare a view, I add a list of countries to it for the UI. (Object added to the ModelAndView). That list is pulled out of a DB into ehCache.
Now, initially it was terrible when I was trying to add the lists INLINE to the mav's everywhere. Instead, I prepared a function which would process every ModelAndView. How? well, with more garbage calls to the function!
And I bought out one trouble for another.
What's a design pattern/trick which can help me out a bit? I'm sick of calling functions to add things to my ModelAndView, and with over 3500 lines of only controller code, I'm going mad finding all the glue points where things have gone missing!
Suggestion are welcome. Cross cutting concerns flavor without AspectJ or Spring native.

Since you are using Java, you may consider moving your code to Scala, since it interacts well with Java, then you can use traits to get the functionality you want.
Unfortunately cross-cutting is a problem with OOP, so changing to functional programming may be a solution, but, I expect that in actuality they are using AOP to implement these mixins, so it would still be AOP, just abstracted out.
The other option is to look at redesigning your application, and make certain that you don't have duplicate code, but a major refactoring is very difficult and fraught with risk.
But, for example, you may end up with your ModelAndView calling several static utility classes to get the data it needs, or do ensure that the user has the correct role, for example.
You may want to look at a book, Refactoring to Patterns (http://www.industriallogic.com/xp/refactoring/) for some ideas.

Writing easily modified code

What are some ways in which I can write code that is easily modified?
The one I have learned from experience is that I almost always need to write one to throw away. That way I have developed a sense of the domain knowledge and program structure required before coding the actual application.

The general guidelines are offcourse
High cohesion, low coupling
Dont repeat yourself
Recognize design patterns and implement them
Dont recognize design patterns where they are not existing or necassary
Use a coding standard, stick to it
Comment everyting that should be commented, when in doubt : comment
Use unit tests
Write comments and tests before implementation, that way you know exactly what you want to do
And when it goes wrong : refactor, refactor, refactor. With good tests you can be sure nothing breaks
And oh yeah:
read this : http://www.pragprog.com/the-pragmatic-programmer
Everything (i think) above and more is in it

I think your emphasis on modifiability is more important than readability. It is not hard to make something easy to read, but the real test of how well it is understood comes when someone else (or you) has to modify it in repsonse to changing requirements.
What I try to do is assume that modifications will be necessary, and if it is not really clear how to do them, leave explicit directions in the code for how to do them.
I assume that I may have to do some educating of the reader of the code to get him or her to know how to modify the code properly. This requires energy on my part, and it requires energy on the part of the person reading the code.
So while I admire the idea of literate programming, that can be easily read and understood, sometimes it is more like math, where the only way to do it is for the reader to buckle down, pay close attention, re-read it a few times, and make sure they understand.

Readability helps a lot: If you do something non-obvious, or you are taking a shortcut, comment. Comments are places where you can go back and refactor if you have time later. Use sensible names for everything, makes it easier to understand what is going on.
Continuous revision will let you move from that first draft to a better one without throwing away (too much) work. Any time you rewrite from scratch you may lose lessons learned. As you code, use refactoring tools to eliminate code representing areas of exploration that are no longer needed, and to make obvious things that were obscure. The first one reduces the amount that you need to maintain; the second reduces the effort per square foot. (Sqft makes about as much sense as lines of code, really.)
Modularize appropriately and enforce encapsulation and separation of logic between your modules. You don't want too many dependencies on any one part of the code or that part becomes inherently harder to understand.
Considering using tried and true methods over cutting edge ones. You give up some functionality for predictability.
Finally, if this is code that people will be using before and after modification, you need(ed) to have an appropriate API insulating your code from theirs. Having a strong API lets you change things behind the scenes without needing to alert all your consumers. I think there's a decent article on Coding Horror about this.

Hang Your Code Out to D.R.Y.
I learned this early when assigned the task of changing the appearance of a web-interface. The code was in C, which I hated, and was compiled to a CGI executable. And, worse, it was built on a library that was abandoned—no updates, no support, and too many man-hours put into its use to change it. On top of the framework was a disorderly web of code, consisting of various form and element builders, custom string implementations, and various other arcane things (for a non-C programmer to commit suicide with).
For each change I made there were several, sometimes many, exceptions to the output HTML. Each one of these exceptions required a small change or improvement in the form builder, thanks to the language there's no inheritance and therefore only functions and structs, and instead of putting the hours in the team instead wrote these exceptions frequently.
In my inexperience I was forced to change the output of each exception, rather than consolidate the changes in an improved form builder. But, trawling through 15,000 lines of code for several hours after ineffective changes would induce code-burn, and a fogginess that took a night's sleep to cure.
Always run your code through the DRY-er.

The easiest way to modify a code is NOT to write code. Write pseudo code not just for algo but how your code should be structured if you are unsure.
Designing while writing code never works...for me :-)

Here is my current experience: I'm working (Java) with a kind of database schema that might often change (fields added/removed, data types modified). My strategy is to parse this schema and to generate the code with apache velocity. The BaseClass generated is never modified by the programmer. Else, a MyClass extends BaseClass is created and the logical components of this class (e.g. toString() ! )are implemented using the 'getters' and the 'setters' of the super class.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string