Tools to create a DFA from grammar rules

I am looking for any tool or software that converts a set of rules into a deterministic finite automaton (DFA). I'm developing a stemmer, something like the Porter stemmer for English. I have a set of rules that remove suffixes and/or prefixes from terms, leaving a stem. I can translate these rules into a DFA manually, but that is an ad hoc solution and leads to flexibility problems.
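To show what I mean by translating rules by hand, here is a minimal Python sketch (the rules below are made-up Porter-style examples, not my real rule set): it compiles suffix rules into a trie keyed on reversed words, which acts as a DFA reading each word from the end.

    # Minimal sketch: compile suffix rules into a trie over reversed
    # words, which behaves like a DFA reading the word from its end.
    # The rules here are made-up Porter-style examples.
    rules = {"sses": "ss", "ies": "i", "s": ""}

    def build_suffix_dfa(rules):
        root = {}
        for suffix, replacement in rules.items():
            node = root
            for ch in reversed(suffix):
                node = node.setdefault(ch, {})
            node["$accept"] = (len(suffix), replacement)
        return root

    def stem(word, dfa):
        node, best = dfa, None
        for ch in reversed(word):        # longest match wins
            if ch not in node:
                break
            node = node[ch]
            if "$accept" in node:
                best = node["$accept"]
        if best is None:
            return word
        cut, replacement = best
        return word[:len(word) - cut] + replacement

    dfa = build_suffix_dfa(rules)
    print(stem("ponies", dfa))    # -> "poni"
    print(stem("caresses", dfa))  # -> "caress"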
Any help appreciated.
Thanks!

I don't know about converting rules to a DFA, but for manipulating DFAs, as well as testing and debugging them, a great (free) program is JFLAP. It has a ton of built-in tools, as well as support for a variety of automata and machines. Maybe somewhere in there is what you're looking for!


Are there published generative grammars for natural languages?

I have some ideas to do with natural language processing. I will need some grammars of the
S -> NP VP
variety in order to play with them.
If I try to write these rules myself it will be a tedious and error-prone business. Has anyone ever typed up and released comprehensive rule sets for English and other natural languages? Ideally written in BNF, Prolog or similar syntax.
My project only relates to context-free grammars; I'm not interested in statistical methods or machine learning. I need to systematically produce English-like and Foobarian-like sentences.
If you know where to find such material, I'd very much appreciate it.
You might want to look at Attempto Controlled English and its Prolog-based tools.
Since statistical parsing came into vogue in the early 90s, grammars have usually not been distributed (except for specific problem domains) but rather derived from distributed corpora such as the Penn Treebank. If you can get hold of that (I believe a sample is distributed with NLTK), you can "roll your own" grammar by looking at all the tree fragments and translating them to rules. (E.g., if you find a node labeled S with children labeled NP and VP, you know there should be a rule S -> NP VP. Pruning the rules that occur infrequently would be a good idea.)
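A rough sketch of that derivation in Python with NLTK (assuming the nltk package and its treebank sample are installed; the pruning threshold below is an arbitrary choice):

    # Derive candidate CFG rules from the Penn Treebank sample shipped
    # with NLTK, then prune rules that occur infrequently.
    from collections import Counter
    from nltk.corpus import treebank  # requires nltk.download('treebank')

    rule_counts = Counter()
    for tree in treebank.parsed_sents():
        for production in tree.productions():  # e.g. S -> NP-SBJ VP .
            rule_counts[production] += 1

    threshold = 5  # arbitrary cutoff for illustration
    grammar = [rule for rule, n in rule_counts.items() if n >= threshold]
    for rule in sorted(grammar, key=rule_counts.get, reverse=True)[:20]:
        print(rule_counts[rule], rule)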
The most comprehensive context-free grammar for English that I know of is the one described in:
Gazdar, Gerald; Ewan H. Klein, Geoffrey K. Pullum, Ivan A. Sag. 1985. Generalized Phrase Structure Grammar. Oxford: Blackwell.
There are also several rule-based but non-context-free grammars freely available online, e.g., the Penn XTAG grammar or the HPSG English Resource Grammar.
Look into the Grammatical Framework. It is a functional programming language for multilingual grammar applications which comes with libraries for ~30 languages, among them English.

Creating a source-to-source translator

I want to know what the strategies are for creating a source-to-source translator, i.e. translating from one high-level language to another. The two approaches that come to mind are:
1. Transforming the syntax tree of one language into the other language's syntax tree
2. Translating to an intermediate language and then converting that into the other high-level language
Is it possible to do the conversion using both strategies, and which is more feasible? Can anyone give references to theory, or to an implementation of a converter that uses either method? Also, is there any standard XML-based intermediate language? I know that XMLVM uses XML as an intermediate language, but it does not provide a proper specification of it.
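To make strategy 1 concrete, here is a toy sketch using Python's standard ast module; it stays within a single language, but the tree-to-tree shape is the same (requires Python 3.9+ for ast.unparse):

    # Toy version of strategy 1: rewrite one syntax tree into another.
    # A real translator would build the target language's tree instead.
    import ast

    class PowToCall(ast.NodeTransformer):
        """Rewrite a ** b as pow(a, b), as a target language
        lacking an exponentiation operator might require."""
        def visit_BinOp(self, node):
            self.generic_visit(node)
            if isinstance(node.op, ast.Pow):
                return ast.Call(func=ast.Name(id="pow", ctx=ast.Load()),
                                args=[node.left, node.right], keywords=[])
            return node

    tree = PowToCall().visit(ast.parse("y = x ** 2 + 1"))
    ast.fix_missing_locations(tree)
    print(ast.unparse(tree))  # y = pow(x, 2) + 1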
Any compiler is, roughly, a source-to-source translator. The target language can be an assembly language (or directly a binary machine code), or C, or whatever high-level language you fancy. So the general theory of compilers is applicable.
And just as a word of advice - one intermediate language is normally not nearly enough. Use more. Use dozens of intermediate languages, each different from the previous one in just one tiny aspect. This way, any single language-to-language step is trivial.
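To illustrate with a toy (the tuple IR below is invented, not any real compiler's): each pass changes exactly one aspect of the representation, so no single step is ever complicated.

    # IR: ints, or ("+", a, b), ("-", a, b), ("neg", a)

    def lower_neg(e):
        """Pass 1: rewrite unary negation as subtraction from zero."""
        if isinstance(e, tuple):
            if e[0] == "neg":
                return ("-", 0, lower_neg(e[1]))
            return (e[0],) + tuple(lower_neg(x) for x in e[1:])
        return e

    def to_source(e):
        """Pass 2: emit infix source for the neg-free IR."""
        if isinstance(e, tuple):
            op, a, b = e
            return f"({to_source(a)} {op} {to_source(b)})"
        return str(e)

    print(to_source(lower_neg(("+", ("neg", 5), 3))))  # ((0 - 5) + 3)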
Another word of advice (anticipating downvotes here) - stay away from XML, especially as a representation for ASTs.
I would look at LLVM, which can do source to source. Although the output isn't pretty, it might provide some good ideas.
Converters are usually based on constructing the semantic tree of one program and then reshaping it for the target language. As an example, take a look at a C# to Java converter.
The second approach is also possible, but the organization of your code may change completely after conversion. So it is better to keep the common intermediate structure (IL, syntax tree, etc.) as high-level as possible.
Try Clang! It is powerful for source-to-source translation. As of now it fully supports C, C++, Objective-C and Objective-C++.
You may also want to look at ROSE compiler infrastructure.

Programmatic parsing and understanding of language (English)

I am looking for some resources pertaining to the parsing and understanding of English (or just human language in general). While this is obviously a fairly complicated and wide field of study, I was wondering if anyone had any book or internet recommendations for study of the subject. I am aware of the basics, such as searching for copulas to draw word relationships, but anything you guys recommend I will be sure to thoroughly read.
Thanks.
Check out WordNet.
You probably want a book like "Representation and Inference for Natural Language - A First Course in Computational Semantics"
http://homepages.inf.ed.ac.uk/jbos/comsem/book1.html
Another approach is to look at existing tools that already do the job, built on the basis of published research: http://nlp.stanford.edu/index.shtml
I've used this tool once, and it's very nice. There's even an online version that lets you parse English and draws dependency trees and so on.
So you can start taking a look at their papers or the code itself.
Anyway, take into consideration that in any field, what you get from such generic tools is almost always not what you want, in the sense that the semantics attributed by such tools are not what you would expect. In most cases, given a specific constrained domain, it's preferable to roll your own parser and do your best to avoid any ambiguities beforehand.
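For instance, a hand-rolled parser for a constrained domain can be as small as this Python sketch (the command grammar here is entirely invented):

    # A tiny domain-specific parser: commands like "move 3 north".
    import re

    COMMAND = re.compile(r"(move|turn)\s+(\d+)\s+(north|south|east|west)$")

    def parse_command(text):
        m = COMMAND.match(text.strip().lower())
        if not m:
            raise ValueError(f"unrecognized command: {text!r}")
        verb, amount, direction = m.groups()
        return {"verb": verb, "amount": int(amount), "direction": direction}

    print(parse_command("Move 3 North"))
    # {'verb': 'move', 'amount': 3, 'direction': 'north'}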
The process that you describe is called natural language understanding. There are various algorithms and software tools that have been developed for this purpose.

Why are there not more control structures in most programming languages?

Why do most languages seem to exhibit only fairly basic control structures from a logic point of view? Stuff like if...then/else, loops, for-each, switch statements, etc. The standard list seems fairly limited.
Why is there not much more in the way of logical syntactic sugar? Perhaps something like a proposition engine, where you could feed in an array of premises, or functions that return complicated self-referential, interdependent functions and results. Something where you could chain together a complex array of conditions, but represented in a way that is easy and clear to read in the code.
Premise 1
Premise 2 if and only if Premise 1
Premise 3
Premise 4 if Premise 2 and Premise 3
Premise 5 if and only if Premise 4
etc...
Conclusion
I realize that this kind of logic can be constructed with functions and/or nested conditional statements. But why are there not generally more syntax options for structuring these kinds of logical propositions without ending up with hairy-looking conditional statements that can be hard to read and debug?
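For example, the premise chain above collapses into ordinary functions, as in this Python sketch (the base premises are stubs, and premise 4's one-way "if" is simplified to a definition):

    # The premise chain from above, written as plain functions.
    def premise1(): return True   # stub: some base fact
    def premise3(): return True   # stub: another base fact

    def premise2(): return premise1()                  # P2 iff P1
    def premise4(): return premise2() and premise3()   # P4 if P2 and P3
    def premise5(): return premise4()                  # P5 iff P4

    conclusion = premise5()
    print(conclusion)  # True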
Is there an explanation for the kinds of control structures we typically see in mainstream programming languages? Are there specific control structures you would like to see directly supported by a language's syntax? Does this just add unnecessary complexity to the language?
Have you looked at Prolog? A Prolog program is basically a set of rules that is turned into one big evaluation engine.
From my personal experience Prolog is a bit too weird and I actually prefer ifs, whiles and so on but YMMV.
Boolean algebra is not difficult, and provides a solution for any conditionals you can think of, plus an infinite number of other variants.
You might as well ask for special syntax for "commonly-used" arithmetic expressions. Who is to say what qualifies as commonly-used? And where do you stop adding special-case syntax?
Adding to the complexity of a language parser is not preferable to using constructive expression syntax, combined with extensibility through defining functions.
It's been a long time since my Logic class in college but I would guess it's a mixture of difficulty in writing them into the language vs. the frequency with which they'd be used. I can't say I've ever had the need for them (not that I can recall). For those times that you would require something of that ilk the language designers probably figure you can work out the logic yourself using just the basic structures.
Just my wild guess though.
Because most programming languages don't provide sufficient tools for users to implement them, it is not seen as an important enough feature for the implementer to provide as an extension, and it isn't demanded enough or used enough to be added to the standard.
If you really want it, use a language that provides it, or provides the tools to implement it (for instance, lisp macros).
It sounds as though you are describing a rules engine.
The basic control structures we use mirror what the processor can do efficiently. Basically, this boils down to simple test-and-branch instructions.
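You can see this even from within Python: the bytecode for a simple if statement is literally a comparison followed by a conditional jump (a sketch; exact opcode names vary across CPython versions):

    import dis

    def f(x):
        if x > 0:
            return 1
        return 0

    dis.dis(f)  # shows COMPARE_OP followed by a conditional jump
                # such as POP_JUMP_IF_FALSE (CPython 3.10)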
It may seem limiting to you, but many people don't like the idea of writing a simple-looking line of code that requires hundreds or thousands (or millions) of processor cycles to complete. Among these people are systems software folks, who write things like Operating Systems and compilers. Naturally most compilers are going to reflect their own writer's concerns.
It relates to the concern regarding atomicity: if you can express A, B, C, and D in terms of simpler structures Y and Z, why supply A, B, C, and D at all rather than just Y and Z?
Existing languages reflect 60 years of tension between atomicity and usability. The modern approach is "small language, large libraries" (C#, Java, C++, etc.).
Because computers are binary, all decisions must come down to a 1/0, yes/no, true/false, etc.
To be efficient, the language constructs must reflect this.
Eventually all your code comes down to microcode that is executed one instruction at a time. Until the microcode and the accompanying CPU can describe something more colorful, we are stuck with a very plain language.

A common set of problems to learn new languages

With "Polyglot" programming techniques becoming more relevant, it is almost a necessity to use the "right" PL for the problem. However, learning new languages takes time which usually most project team can't afford. What is the best way to learn a new programming language? Is there a common set of problems that can be solved to reach a certain level of competence?
Well, it depends what you want to do. (web, db, whatever).
Generally I'd want to know:
What's the standard library like, and how do I reference it?
What ORMs are there?
What build/deployment platforms exist for it?
How does it handle updates?
How do I do general things, like:
DB access
File operations
Displaying UIs
and so on.
Really, learning is only by doing -- you need a project that you can use the given language for.
Project Euler is the first thing to come to mind as an oft-used set of problems to try in a new language, even if it's not something I've ever tried.
If the language is another JVM or CLR hosted one, the issues about learning the environment can be set aside -- you can use all your familiar APIs in your Clojure/Scala/F#... code -- and concentrate on the syntax and idiom.
Otherwise, you're probably using the new language because it has a good fit for the particular problem you want to solve (e.g. native code and functional -> Haskell; distributed and concurrent -> Erlang) so the fit of the feature set is known in advance but you have the extra load of learning the standard APIs. And that's what prototyping is for.
The book Programming Challenges and the associated website provide a large list of algorithmic problems, with automatic online judging in several languages (Java, C, C++). Any algorithm textbook can give you lots of examples of basic data structures and procedures to try and implement, which is often a nice way to get some practice with basic language syntax and features. My personal favourite for this is The Algorithm Design Manual, which is language agnostic, but there are plenty of good language-specific books available as well (Mastering Algorithms in Perl or Data Structures and Algorithms in Java, for example).
If you're interested in a general set of mathematical problems to try and solve, Project Euler is a great resource.
For more day-to-day problems, I find the cookbook approach most helpful. For example, both Perl and Python have excellent O'Reilly cookbooks, as well as online resources, which provide short examples of many common and important problems. As mentioned in another answer, the key here is to find canonical examples of the basic features you will need, particularly by leveraging what's available in standard libraries. I usually try to build up my own small library of examples as I go along, e.g. a socket example, a DB access example, a file reading example, a simple numerical solver, etc., which I then pillage for ideas when it's time to write production code.
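As an illustration, one such pocket example might be as small as this Python sketch (the file name and field name are placeholders):

    # Cookbook entry: read a delimited file into a list of dicts.
    import csv

    def load_records(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    # records = load_records("users.csv")   # hypothetical file
    # print(records[0]["email"])            # hypothetical field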
