I'm looking for a way to accomplish data mining tasks in Common Lisp; does anything exist that would make this possible? I found Incanter for Clojure, but I have to stick to Common Lisp for the task at hand.
These are libraries I use often and find helpful:
GSLL: GNU Scientific Library for Lisp
LLA: Lisp Linear Algebra (BLAS and LAPACK bindings)
Gabor Melis's ML libraries (SVM, SVD, statistics, etc.)
There are a lot more listed on cliki that I haven't had a chance to evaluate.
Your question is a bit vague; data mining is a huge field.
For statistics, I would also check out:
Tamas Papp's other libraries as well as LLA. In particular, cl-random, cl-slice, and cl-num-utils contain useful stuff.
Mirko Vukovic has a nice implementation of data tables.
For the moment I would not worry too much about common-lisp-stat. To say that it's pre-alpha would be an understatement. However, that will change Real Soon Now, as I intend to do another round of development.
For data munging: Alain Picard's CSV library (or the many variants thereof, or Pascal Bourguignon's implementation).
Check out the http://www.cliki.net/database page for various database clients.
I want to write video editing software, and the "logical" conclusion is that the language I must use is C++... But I don't like it (sorry, C++ fans).
I would like to write it in something cool, like Lisp or Haskell or Erlang... But I don't know whether the open source implementations of those languages (I don't have money to buy licenses) would let me make competitive software (in the performance area).
What do you think? What do you recommend?
I can't speak to Lisp, but both Erlang and Haskell are capable of the performance necessary for video processing. Achieving that performance is likely to be more difficult than with C++ because there are fewer existing libraries in the domain, so you'll have to implement more yourself. Which means you'll have to be capable of writing high-performance code yourself. In Haskell, I expect becoming proficient enough for that would require a significant investment of time (6 months minimum).
Which language you choose should depend a great deal upon the goals of the project. If it's a hobby project, or you want to learn a lot about processing algorithms (and therefore don't mind having to do a lot of low-level coding yourself), there's nothing wrong with using an out-of-mainstream language. Haskell has bindings to a lot of things you would probably want to use eventually, such as a wrapper for GLSL.
As somebody working with audio processing (including real-time), I can say that Haskell's performance hasn't been a problem for me. For a recent project I did write some functions in C, but that was necessary to implement a custom vectorization scheme. Doing high-level work in Haskell and calling out to C when necessary is a perfectly valid approach, although thankfully it's less necessary now than in the past.
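For the curious, the FFI boilerplate for that approach is small. Here's a minimal sketch of calling C from Haskell; the scale_buffer function (name and signature) is invented for illustration, so substitute your own C code:

    {-# LANGUAGE ForeignFunctionInterface #-}
    -- Minimal sketch of calling out to C from Haskell via the FFI.
    -- Assumes a hypothetical C function, compiled and linked in:
    --   void scale_buffer(double *buf, int len, double gain);
    module Main where

    import Foreign.C.Types (CDouble (..), CInt (..))
    import Foreign.Marshal.Array (peekArray, withArray)
    import Foreign.Ptr (Ptr)

    foreign import ccall unsafe "scale_buffer"
      c_scaleBuffer :: Ptr CDouble -> CInt -> CDouble -> IO ()

    -- Marshal a list out to C, scale it in place, and read it back.
    scale :: Double -> [Double] -> IO [Double]
    scale gain xs =
      withArray (map realToFrac xs) $ \buf -> do
        c_scaleBuffer buf (fromIntegral (length xs)) (realToFrac gain)
        map realToFrac <$> peekArray (length xs) buf

    main :: IO ()
    main = scale 2.0 [1, 2, 3] >>= print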
Of course, this presumes a few things about the nature of your project. If you want something you can use right away, Haskell, Lisp, and Erlang are probably not the languages for you, because there are fewer resources. Have you considered Processing? It's Java; I don't know whether you consider that better or worse than C++.
I had motivations besides productivity for working in Haskell (and my productivity took a big hit for a while); without those other goals I wouldn't have persevered. If you just want to write something you can use, stick with what's going to be most productive. If you have other motivations, tell us what they are and it's more likely people will make helpful suggestions.
For what it's worth, Wings3D is written in Erlang.
You could always try D, if you want something somewhat similar to C++ but not C++. Also, D could use some love.
For both Haskell and Erlang, the open source implementation is the standard, most efficient implementation available. There's no reason Haskell shouldn't be performant enough for your needs -- for video stuff I assume you'll be using matrices and such. There are quality bindings to BLAS & co. available for Haskell. I don't know of a great deal of existing video editing work, but Alberto Ruiz (the author of HMatrix) has done work with Haskell and computer vision: http://dis.um.es/profesores/alberto/research.html
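For instance, hmatrix puts a high-level interface over BLAS/LAPACK. A small sketch (treat it as indicative; operator names have shifted a little between hmatrix versions):

    -- Small hmatrix sketch: BLAS-backed matrix and matrix-vector products.
    import Prelude hiding ((<>))  -- hmatrix's (<>) clashes with the Prelude's on newer GHCs
    import Numeric.LinearAlgebra

    main :: IO ()
    main = do
      let a = (2><2) [1, 2, 3, 4] :: Matrix Double  -- 2x2, row major
          v = vector [1, 1]
      print (a <> a)  -- matrix product (dgemm underneath)
      print (a #> v)  -- matrix-vector product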
There's also a great deal of work on sound libraries and processing in Haskell.
I'd use the language that gives me the best coverage by third-party libraries for what I'm trying to do; for manipulating video data that's probably going to be a mainstream language like C++.
If this project is for fun/to learn a new language then by all means, take the road less traveled. But if this is something you need to ship in a reasonable amount of time, avoiding the best tools for the job because you don't like them is an unsound strategy.
That depends, at least, on your goal for the project. If it's a hobby project and you want to learn a different language, then you should choose that language; in that case, however, I assume you're already familiar with video processing. On the other hand, if you want to learn about video processing, I'd recommend using a language you are already comfortable with.
Now, if it's a professional project of a decent size (video processing software can be huge), you should probably consider using different languages for different things. The kind of systems I work with usually require writing some code in C (for efficiency reasons), but we always try to keep that to the indispensable minimum and use a higher-level language for most of the system behaviour (we use Erlang, but that applies to any other higher-level language).
IMO, writing big systems entirely in C or C++ is almost suicide. There are projects that succeed, but I find that much harder than complementing the C part with higher-level languages.
There is already a video streaming server written in Erlang: http://erlyvideo.org/. You can look to it for some inspiration: https://github.com/erlyvideo/erlyvideo.
I have planned to develop a tool that converts a program written in one programming language (e.g. Java) to a common markup language (e.g. XML), and then converts that markup code to another language (e.g. C#).
In simple words, it is a programming language converter that converts a program written in one language to another language.
I think it is possible, but I don't know where to start. I want to know the possibilities of doing so, and information about some existing systems.
What you are trying to do is extremely hard, but if you want to know what you are up for I've listed the steps you need to follow below:
First the hard bit:
1. First you obtain or derive an operational semantics for your source and target languages.
2. Then you enhance the semantics to capture your source and target memory models.
3. Then you need to unify the two enhanced semantics within a common operational model.
4. Then you need to define a mapping from your source languages onto the common operational model.
5. Then you need to define a mapping from your operational model to your target language.
Step 4, as you pointed out in your question, is trivial.
Step 1 is difficult, as most languages do not have sufficiently formal semantics specified; but I recommend checking out http://lucacardelli.name/TheoryOfObjects.html as this is the best starting point for building a traditional OO semantics.
Step 2 is almost certainly impossible in general, but may be merely obscenely difficult if you are willing to sacrifice some efficiency.
Step 3 will depend on how clean the result of step 1 turned out, but is going to be anything from delicate and tricky to impossible.
Step 5 is not going to be trivial, it is effectively writing a compiler.
Ultimately, what you propose to do is impossible in general, due to the difficulties inherent in steps 1 and 2. However, it should be difficult but doable if you are willing to: severely restrict the source language constructs supported; pretty much forget about handling threads correctly; and pick two languages with sufficiently similar semantics (i.e. Java and C# are OK, but C++ and anything else is not).
It depends on what languages you want to support, but in general this is a huge & difficult task unless you plan to only support a very small subset of each language.
The real problem is that each programming language has different features (with some areas that overlap and others that don't) and different ways of solving the same problems -- and it's pretty tricky to detect the problem the programmer is trying to solve and convert that to a new idiom. :) And think about the differences between GUIs created in different languages....
See http://xmlvm.org/ as an example (a project aimed at converting between source code of many different languages, with an XML middle-point) -- the site covers in some depth the challenges they are tackling and the compromises they take, and (if you still have any interest in this kind of project...) ask more specific followup questions.
Notice specifically what the output source code looks like -- it's not at all readable, maintainable, efficient, etc..
It is "technically easy" to produce XML for any single langauge: build a parser, construct and abstract syntax tree, and dump out that tree as XML. (I build tools that do this off-the-shelf for many languages). By technically easy, I mean that the community knows how to do this (see any compiler textbook, e.g., Aho&Ullman Dragon book). I do not mean this is a trivial exercise in terms of effort, because real languages are complicated and messy; there have been many attempts to build C++ parsers and few successes. (I have one of the successes, and it was expensive to get right).
What is really hard (and I don't try to do) is produce XML according to a single schema in which the language semantics are exposed. And without that, it will be essentially impossible to write a translator from a generic XML to an arbitrary target language. This is known as the UNCOL problem and people have been looking since 1958 for the answer. I note that the Wikipedia article seems to indicate the problem is solved, but you can't find many references to UNCOL in the literature since 1961.
The closest attempt I've seen to this is the OMG's "ASTM" model (http://www.omg.org/spec/ASTM/1.0/Beta1/); it exports XMI, which is XML. But the ASTM model has lots of escapes built into it to allow languages that it doesn't model perfectly (AFAIK, that means every language) to extend the XMI in arbitrary ways so that the language-specific information can be encoded. Consequently each language parser produces a custom version of the XMI, and thus each reader has to pretty much know about the extensions, and full generality vanishes.
I am looking for some resources pertaining to the parsing and understanding of English (or just human language in general). While this is obviously a fairly complicated and wide field of study, I was wondering if anyone had any book or internet recommendations for study of the subject. I am aware of the basics, such as searching for copulas to draw word relationships, but anything you guys recommend I will be sure to thoroughly read.
Thanks.
Check out WordNet.
You probably want a book like "Representation and Inference for Natural Language - A First Course in Computational Semantics"
http://homepages.inf.ed.ac.uk/jbos/comsem/book1.html
Another way is to look at existing tools that already do the job, built on the basis of research papers: http://nlp.stanford.edu/index.shtml
I've used this tool once, and it's very nice. There's even an online version that lets you parse English and draws dependency trees and so on.
So you can start taking a look at their papers or the code itself.
Anyway, take into consideration that in any field, what you get from such generic tools is almost always not what you want, in the sense that the semantics attributed by such tools are not what you would expect. In most cases, given a specific constrained domain, it's preferable to roll your own parser and do your best to avoid any ambiguities beforehand.
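As a toy illustration of what rolling your own can look like in a constrained domain, a few lines of pattern matching often beat a general-purpose pipeline (the command vocabulary here is invented):

    -- Hand-rolled parser for a tiny, constrained command language.
    data Command = Open FilePath | Close FilePath | Unknown String
      deriving Show

    parse :: String -> Command
    parse s = case words s of
      ["open",  f] -> Open f
      ["close", f] -> Close f
      _            -> Unknown s  -- no ambiguity: anything else is rejected

    main :: IO ()
    main = mapM_ (print . parse) ["open notes.txt", "close notes.txt", "please help"]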
The process that you describe is called natural language understanding. There are various algorithms and software tools that have been developed for this purpose.
With "Polyglot" programming techniques becoming more relevant, it is almost a necessity to use the "right" PL for the problem. However, learning new languages takes time which usually most project team can't afford. What is the best way to learn a new programming language? Is there a common set of problems that can be solved to reach a certain level of competence?
Well, it depends on what you want to do (web, DB, whatever).
Generally I'd want to know:
What's the library like, and how do I reference it?
What ORMs are there?
What build/deployment platforms exist for it?
How does it handle updates?
How do I do general things, like:
DB Access
File things
Display UIs
and so on.
Really, learning is only by doing -- you need a project that you can use the given language for.
Project Euler is the first thing to come to mind as an oft-used set of problems to try in a new language, even if it's not something I've ever tried.
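To give a flavour, Problem 1 (sum of all the multiples of 3 or 5 below 1000) fits in one line of Haskell:

    -- Project Euler, Problem 1.
    main :: IO ()
    main = print (sum [n | n <- [1 .. 999], n `mod` 3 == 0 || n `mod` 5 == 0])
    -- 233168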
If the language is another JVM or CLR hosted one, the issues about learning the environment can be set aside -- you can use all your familiar APIs in your Clojure/Scala/F#... code -- and concentrate on the syntax and idiom.
Otherwise, you're probably using the new language because it has a good fit for the particular problem you want to solve (e.g. native code and functional -> Haskell; distributed and concurrent -> Erlang) so the fit of the feature set is known in advance but you have the extra load of learning the standard APIs. And that's what prototyping is for.
The book Programming Challenges and the associated website provide a large list of algorithmic problems, with automatic online judging in several languages (Java, C, C++). Any algorithm textbook can give you lots of examples of basic data structures and procedures to try and implement, which is often a nice way to get some practice with basic language syntax and features. My personal favourite for this is The Algorithm Design Manual, which is language agnostic, but there are plenty of good language-specific books available as well (Mastering Algorithms in Perl or Data Structures and Algorithms in Java, for example).
If you're interested in a general set of mathematical problems to try and solve, Project Euler is a great resource.
For more day to day problems, I find the cookbook approach most helpful. For example, both Perl and Python have excellent O'Reilly cookbooks, as well as online resources, which provide short examples of many common and important problems. As mentioned in another answer, the key here is to find canonical examples of basic features you will need, particularly by leveraging what's available in standard libraries. I usually try and build up my own small library of examples as I go along, e.g. a socket example, a DB access example, a file reading example, a simple numerical solver, etc, which I then pillage for ideas when it's time to write production code.
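For example, the file-reading entry in such a personal cookbook can be as small as this (a sketch; the script and file names are placeholders):

    -- Cookbook snippet: read a file and print its lines, numbered.
    import System.Environment (getArgs)

    main :: IO ()
    main = do
      [path] <- getArgs  -- e.g. runghc numbered.hs input.txt
      contents <- readFile path
      mapM_ putStrLn (zipWith (\i l -> show i ++ ": " ++ l) [1 :: Int ..] (lines contents))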
I'm working on some code generation tools, and a lot of complexity comes from doing scope analysis.
I frequently find myself wanting to know things like
What are the free variables of a function or block?
Where is this symbol declared?
What does this declaration mask?
Does this usage of a symbol potentially occur before initialization?
Does this variable potentially escape?
and I think it's time to rethink my scoping kludge.
I can do all this analysis but am trying to figure out a way to structure APIs so that it's easy to use, and ideally, possible to do enough of this work lazily.
What tools like this are people familiar with, and what did they do right and wrong in their APIs?
I'm a bit surprised at the question, as I've done tons of code generation and the question of scoping rarely comes up (except occasionally the desire to generate unique names).
To answer your example questions requires serious program analysis well beyond scoping. Escape analysis by itself is nontrivial. Use-before-initialization can be trivial or nontrivial depending on the target language.
In my experience, APIs for program analysis are difficult to design and frequently language-specific. If you're targeting a low-level language you might learn something useful from the Machine SUIF APIs.
In your place I would be tempted to steal someone else's framework for program analysis. George Necula and his students built CIL, which seems to be the current standard for analyzing C code. Laurie Hendren's group have built some nice tools for analyzing Java.
If I had to roll my own I'd worry less about APIs and more about a really good representation for abstract-syntax trees.
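To sketch what a good representation buys you: even with a toy lambda-calculus AST (invented here for illustration), the free-variables question from the list above falls out in three lines:

    import Data.Set (Set)
    import qualified Data.Set as Set

    -- Toy lambda-calculus AST.
    data Term = Var String
              | Lam String Term
              | App Term Term

    -- Free variables: everything used, minus everything bound on the way down.
    freeVars :: Term -> Set String
    freeVars (Var x)   = Set.singleton x
    freeVars (Lam x b) = Set.delete x (freeVars b)
    freeVars (App f a) = freeVars f `Set.union` freeVars a

    main :: IO ()
    main = print (freeVars (Lam "x" (App (Var "x") (Var "y"))))
    -- fromList ["y"]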
In the very limited domain of dataflow analysis (which includes the uninitialized-variable question), João Dias and I have adapted some nice work by Sorin Lerner, David Grove, and Craig Chambers. Only our preliminary results are published.
Finally if you want to generate code in multiple languages this is a complete can of worms. I have done it badly several times. If you create something you like, publish it!