Driver code in one language and executors in different languages - apache-spark

How can I use different programming languages to define the logic of my executors, than what I use for the driver? Is this possible at all?
E.g.: I would write the driver in Scala, then call different functions written in Java, Python for the distributed processing of the dataset.

You could, but only under certain circumstances, and with some work.
It should be possible to use the code generation feature of SparkSQL/DataSet to implement methods in other languages and call them through JNI or other interfaces.
Furthermore, the generated code is Java code, so technically you are already running Java code, independently of which language you use to program the Spark program.
As far as I know, it's also possible to use Python UDFs inside a Spark program written in Java or Scala.
With the RDD API it should also be possible to call libraries in other programming languages - with Scala-Java mixes being trivial to implement, and non-JVM languages needing the appropriate bridging logic.
There's - at least in current versions of Spark - a performance penalty to pay, for getting data out of the JVM and back into it, so I would use this sparingly, and only when you have weighed the performance pros and cons carefully.

Related

Kotlin for game dev

Background:
I'm always searching for a language to replace Java for game development. Kotlin looks promising with a good IDE support and Java interop. But one of the FPS killers for a game (on Android especially) is GC usage. So, some libraries (like libgdx) are using pools of objects, custom collections and other tricks to avoid frequent GC run. For Java that can be done in a clear way. Some other JVM languages espesially with functional support using a lot of GC by it's nature, so it is hard to avoid.
Questions:
Does Kotlin creates any invisible GC overhead in comparison to Java?
Which features of Kotlin is better to avoid to have less GC work?
You can write Kotlin Code for the JVM which causes the same allocations than the Java corresponding logic. In both cases you have to carefully check if a library call allocates new memory on the heap, or not. Using Kotlin in combination with LibGDX doesn't introduce any invisible GC overhead. It's an effective way and works well (especially with the ktx extension.
But there are Kotlin language features which may help you to write your code with fewer allocations.
Singletons are a language feature. (Object declarations, companion object )
You can create wrapper classes for primitive types which compile to primitives. But you get the power of type safety and rich domain models (Inline classes).
With the combination of Operator overloading and Inline Functions you can build nice APIs which modify objects without allocating new ones. (Example: Allocation-free Vectorial operations using custom operators)
If you use any kind of dependency injection mechanism or object pooling to connect your game logic and reuse objects, then Reified type parameters may help to use it in a very elegant an short way. You can skip a class as type parameter, if the compiler knows the actual type.
But there is also another option which indeed gives you a different behavior in memory management. Thanks to Kotlin Multiplatform, you can write your game logic as Kotlin common module and cross compile it to native code or to Javascript.
I did this in a sample Game project Candy Crush Clone. It works with Korge a Modern Multiplatform Game Engine for Kotlin. The game runs on the JVM, as HTML web app and as Native binary in Win, Linux, Mac, Android or IOS.
The native compiled code has its own simpler garbage collection and can run faster. So the speed-increase and the different memory management may give you the power reserve to bother even less with the GC.
In conclusion I can recommend Kotlin for Game dev, also for GC critical scenarios. In my projects I tend to create more classes and allocate more memory when I write Kotlin code. But this is a question of programming style, not a technical one.
As a rule of thumb, Kotlin generates bytecode as close as possible to the one generated by Java. So, for example, if you use a function as a value, an inner class will be created, like in Java, but no more. There are also some optimization tricks like IntArray and inline to perform even better.
And as #Peter-Lawrey said, it's always a better idea to measure the values for your specific case.
Technically, your questions comparing Kotlin to Java are moot, they will perform the same. But Kotlin will be a better development experience.
If Java is good for writing Games, then Kotlin would only better due to developer productivity.
Note: the gaming library LWJGL 3 uses Kotlin in part, with GitHub stats showing 67.3% of the code being Kotlin (template module looks to be mostly Kotlin). So asking people who work with LWJGL will give you the best answer to this question since they have a lot of experience in this area.

API compatibility between scala and python?

I have read a dozen pages of docs, and it seems that:
I can skip learning the scala part
the API is completely implemented in python (I don't need to learn scala for anything)
the interactive mode works as completely and as quickly as the scala shell and troubleshooting is equally easy
python modules like numpy will still be imported (no crippled python environment)
Are there fall-short areas that will make it impossible?
In recent Spark releases (1.0+), we've implemented all of the missing PySpark features listed below. A few new features are still missing, such as Python bindings for GraphX, but the other APIs have achieved near parity (including an experimental Python API for Spark Streaming).
My earlier answers are reproduced below:
Original answer as of Spark 0.9
A lot has changed in the seven months since my original answer (reproduced at the bottom of this answer):
Spark 0.7.3 fixed the "forking JVMs with large heaps" issue.
Spark 0.8.1 added support for persist(), sample(), and sort().
The upcoming Spark 0.9 release adds partial support for custom Python -> Java serializers.
Spark 0.9 also adds Python bindings for MLLib (docs).
I've implemented tools to help keep the Java API up-to-date.
As of Spark 0.9, the main missing features in PySpark are:
zip() / zipPartitions.
Support for reading and writing non-text input formats, like Hadoop SequenceFile (there's an open pull request for this).
Support for running on YARN clusters.
Cygwin support (Pyspark works fine under Windows powershell or cmd.exe, though).
Support for job cancellation.
Although we've made many performance improvements, there's still a performance gap between Spark's Scala and Python APIs. The Spark users mailing list has an open thread discussing its current performance.
If you discover any missing features in PySpark, please open a new ticket on our JIRA issue tracker.
Original answer as of Spark 0.7.2:
The Spark Python Programming Guide has a list of missing PySpark features. As of Spark 0.7.2, PySpark is currently missing support for sample(), sort(), and persistence at different StorageLevels. It's also missing a few convenience methods added to the Scala API.
The Java API was in sync with the Scala API when it was released, but a number of new RDD methods have been added since then and not all of them have been added to the Java wrapper classes. There's a discussion about how to keep the Java API up-to-date at https://groups.google.com/d/msg/spark-developers/TMGvtxYN9Mo/UeFpD17VeAIJ. In that thread, I suggested a technique for automatically finding missing features, so it's just a matter of someone taking the time to add them and submit a pull request.
Regarding performance, PySpark is going to be slower than Scala Spark. Part of the performance difference stems from a weird JVM issue when forking processes with large heaps, but there's an open pull request that should fix that. The other bottleneck comes from serialization: right now, PySpark doesn't require users to explicitly register serializers for their objects (we currently use binary cPickle plus some batching optimizations). In the past, I've looked into adding support for user-customizable serializers that would allow you to specify the types of your objects and thereby use specialized serializers that are faster; I hope to resume work on this at some point.
PySpark is implemented using a regular cPython interpreter, so libraries like numpy should work fine (this wouldn't be the case if PySpark was written in Jython).
It's pretty easy to get started with PySpark; simply downloading a pre-built Spark package and running the pyspark interpreter should be enough to test it out on your personal computer and will let you evaluate its interactive features. If you like to use IPython, you can use IPYTHON=1 ./pyspark in your shell to launch Pyspark with an IPython shell.
I'd like to add some points about why many people who have used both APIs recommend the Scala API. It's very difficult for me to do this without pointing out just general weaknesses in Python vs Scala and my own distaste of dynamically typed and interpreted languages for writing production quality code. So here are some reasons specific to the use case:
Performance will never be quite as good as Scala, not by orders, but by fractions, this is partly because python is interpreted. This gap may widen in future as Java 8 and JIT technology becomes part of the JVM and Scala.
Spark is written in Scala, so debugging Spark applications, learning how Spark works, and learning how to use Spark is much easier in Scala because you can just quite easily CTRL + B into the source code and read the lower levels of Spark to suss out what is going on. I find this particularly useful for optimizing jobs and debugging more complicated applications.
Now my final point may seem like just a Scala vs Python argument, but it's highly relevant to the specific use case - that is scale and parallel processing. Scala actually stands for Scalable Language and many interpret this to mean it was specifically designed with scaling and easy multithreading in mind. It's not just about lambda's, it's head to toe features of Scala that make it the perfect language for doing Big Data and parallel processing. I have some Data Science friends that are used to Python and don't want to learn a new language, but stick to their hammer. Python is a scripting language, it was not designed for this specific use case - it's an awesome tool, but the wrong one for this job. The result is obvious in the code - their code is often 2 - 5x longer than my Scala code as Python lacks a lot of features. Furthermore they find it harder to optimize their code as they are further away from the underlying framework.
Let me put it this way, if someone knows both Scala and Python, then they will nearly always choose to use the Scala API. The only people IME that use Python are those that simply do not want to learn Scala.

Functional static typed languages and parallel computation

It seems to me that statically typed, and functional languages are perfect for parallel computation. Since asserting strong type constraints such as the functional purity of functions should be easy. And additionally, these programming languages are already well suited for the types of computation-heavy programs that would trivially benefit from data parallelism.
However, it seems that beyond Haskell, none of the other strongly typed functional languages support OS-level threads to back their parallelism. Is it actually the case that Haskell is the only language that supports this sort of thing in the modern day, and that any of the ML series languages, don't provide good threading support among other statically typed language?
In Frege, you can use the fork/join Java API, here is a blog post covering it: http://fregepl.blogspot.de/2011/09/parallelism-in-frege-employing-forkjoin.html
I don't know much about Haskell, but I know that Erlang deals pretty well with concurrency. However, Erlang is a dynamically typed language and the concurrency is handled in the programming language and not in the operating system.
It is suited for parallel programming as it creates sets of parallel processes that may interact via message passing without requiring the use of locks.
For those who are not familiar with Erlang, here is the link for the language introduction.
There is also Scala, which integrates features of functional programming and object oriented programming. Scala runs on the Java VM and the threads communicate using an event-based model. This may be an indicator of the OS-level threads that you are looking for. Furthermore, it has built-in support for the actor model.
There is a book about Scala and multicore systems that you should take a look.

Are there real world applications that use metaprogramming?

We all know that MetaProgramming is a Concept of Code == Data (or programs that write programs).
But are there any applications that use it & what are the advantages of using it?
This Question can be closed but i didnt see any related questions.
IDEs are full with metaprogramming:
code completion
code generation
automated refactoring
Metaprogramming is often used to work around the limitations of Java:
code generation to work around the verbosity (e.g. getter/setter)
code generation to work around the complexity (e.g. generating Swing code from a WYSIWIG editor)
compile time/load time/runtime bytecode rewriting to work around missing features (AOP, Kilim)
generating code based on annotations (Hibernate)
Frameworks are another example:
generating Models, Views, Controllers, Helpers, Testsuites in Ruby on Rails
generating Generators in Ruby on Rails (metacircular metaprogramming FTW!)
In Ruby, you pretty much cannot do anything without metaprogramming. Even simply defining a method is actually running code that generates code.
Even if you just have a simple shell script that sets up your basic project structure, that is metaprogramming.
Since code as data is one of key concepts of Lisp, the best thing would be to see the real applications of projects written in these.
On this link you can see an article about a real world application written partly in Clojure, a dialect of Lisp.
The thing is not to write programs that write programs, just because you can, but to add new functionality to your language when you really need it. Just think if you could simply add new keyword to Java or C#...
If you implement metaprogramming in a language-independent way, you get a program analysis and transformation system. This is precisely a tool that treats (arbitrary) programs as data. These can be used to carry out arbitrary transformations on arbitrary programs.
It also means you aren't limited by the specific metaprogramming features that the compiler guys happened to put into your language. For instance, while C++ has templates, it has no "reflection". But a program transformation system can provide reflection even if the base langauge doesn't have it. In particular, having a program transformation engine means never having to say "I'm sorry, your language doesn't support metaprogramming (well enough) so I can't do much except write code manually".
See our DMS Software Reengineering Toolkit for such a program transformation system. It has been used to build test coverage and profiling tools, code generation tools, tools to reshape the architecture of large scale C++ applications, tools to migrate applications from one langauge to another, ... This is all extremely practical. Most of the tasks done with DMS would completely impractical to do by hand.
Not a real world application, but a talk about metaprogramming in ruby:
http://video.google.com/videoplay?docid=1541014406319673545
Google TechTalks August 3, 2006 Jack Herrington, the author of Code Generation in Action (Manning, July 2003) , will talk about code generation techniques using Ruby. He will cover both do-it-yourself and off-the-shelf solutions in a conversation about where Ruby is as a tool, and where it's going.
A real world example would be Django's model metaclass. It is the class of the class, from which models inherit from and responsible for the outfit of the model instances with all their attributes and methods.
Any ORM in a dynamic language is an instant example of practical metaprogramming. E.g. see how SQLAlchemy or Django's ORM creates classes for tables it discovers in the database, dynamically, in runtime.
ORMs and other tools in Java world that use #annotations to modify class behavior do a bit of metaprogramming, too.
Metaprogramming in C++ allows you to write code that will get transformed at compilation.
There are a few great examples I know about (google for them):
Blitz++, a library to write efficient code for manipulating arrays
Intel Array Building Blocks
CGAL
Boost::spirit, Boost::graph
Many compilers and interpreters are implemented with metaprogramming techniques internally - as a chain of code rewriting passes.
ORMs, project templates, GUI code generation in IDEs had been mentioned already.
Domain Specific Languages are widely used, and the best way to implement them is to use metaprogramming.
Things like Autoconf are obviously cases of metaprogramming.
Actually, it's unlikely one can find an area of software development which won't benefit from one or another form of metaprogramming.

What's so great about Scala? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
What makes Scala such a wonderful language, other than the type system? Almost everything I read about the language brings out 'strong typing' as a big reason to use Scala, but there has to be more than that. What are some of the other compelling and/or cool language features that make Scala a really useful tool?
Here are some of the things that made me favour Scala (over, say, usual Java):
a) Type inference. The Java way of doing it:
Map<Something, List<SomethingElse>> list = new HashMap<Something, List<SomethingElse>>()
.. is rather verbose compared to Scala. The compiler should be able to figure it out if you give one of these lists.
b) First-order functions. Again, this functionality can be emulated with classes, but it's ugly.
c) Collections that have map and fold. These two tie in with (b), and also these two are something I wish for every time I have to write Java.
d) Pattern matching and case classes.
e) Variances, which mean that if S extends T, then List[S] extends List[T] as well.
Throw in some static types goodness as well, and I was sold on the language quite fast.
It's a mash up of the best bits from a bunch of languages, what's not to love:
Ruby's terse syntax
Java's performance
Erlang's Actor Support
Closures/Blocks
Convenient shorthand for maps & arrays
Scala is often paraded for having closures and implicits. Not surprising really, as lack of closures and explicit typing are perhaps the two biggest sources of Java boilerplate!
But once you examine it a little deeper, it goes far beyond Java-without-the-annoying bits, Perhaps the greatest strength of Scala is not one specific named feature, but how successful it is in unifying all of the features mentioned in other answers.
Post Functional
The union of object orientation and functional programming for example: Because functions are objects, Scala was able to make Maps implement the Function interface, so when you use a map to look up a value, it's no different syntactically from using a function to calculate a value. In unifying these paradigms so well, Scala truly is a post-functional language.
Or operator overloading, which is achieved by not actually having operators, they're just methods used in infix notation. So 1 + 2 is just calling the + method on an integer. If the method was named plus instead then you'd use it as 1 plus 2 which is no different from 1.plus(2). This is made possible because of another combination of features; everything in Scala is an object, there are no primitives, so integers can have methods.
Other Feature Fusion
Type classes were also mentioned, achieved by a combination of higher-kinded types, singleton objects, and implicits.
Other features that work well together are case classes and pattern matching, allowing you to easily build and deconstruct algebraic data types, without having to manually write all the tedious equality, hashcode, constructor and getter/setter logic that Java demands.
Specifying immutability by default, offering lazy values, and providing first class functions all combine to give you a language that's very suited to building efficient functional data structures.
The list goes on, but I've been using Scala for over 3 years now, and I'm still amazed almost daily at how well everything just works together.
Efficient and Versatile
Scala is also a small language, with a spec that (surprisingly!) only needs to be around 1/3 the size of Java's. This is partly because Java has a lot of special cases in the spec that Scala simplifies away, partly because of removing features such as primitives and operators, and partly because a lot of functionality has been moved from the language and into the libraries.
As a benefit of this, all the techniques available to the Scala library authors are also available to any Scala user, which makes it a great language for defining your own control-flow constructs and for building DSLs. This has been used to great effect in projects like Akka - a 3rd-party Actor framework.
Deep
Finally, it scales the full range of programming styles.
The runtime interpreter (known as the REPL) allows you to very quickly explore ideas in an interactive session, and Scala files can also be run as scripts without needing explicit compilation. When coupled with type inference, this gives Scala the feel of a dynamic language such as Ruby, Perl, or a bash script.
At the other end of the spectrum, traits, classes, objects and self-types allow you to build a full-scale enterprise system based on distinct components and using dependency injection without the need of 3rd-party tools. Scala also integrates with Java libraries at a level almost on-par with native Java, and by running on the JVM can take advantage of all the speed benefits offered on that platform, as well as being perfectly usable in containers such as tomcat, or with OSGi.
I'm new to Scala, but my impression is:
Really good JVM integration will be the driving factor. JRuby can call java and java can call JRuby code, but it's explicitly calling into another language, not the clean integration of Scala-Java. So you can use Java libraries, and even mix and match in the same project.
I started looking at scala when I had a realization that the thing which will drive the next great language is easy concurrency. The JVM has good concurrency from a performance standpoint. I'm sure someone will say that Erlang is better, but Scala is actually usable by normal programmers.
Where Java falls down is that it's just so painfully verbose. It takes way too many characters to create and pass a Functor. Scala allows passing functions as arguments.
It isn't possible in Java to create a union type, or to apply an interface to an existing class. These are both easy in Scala.
Static typing usually has a big penalty of verboseness. Scala eliminates this downside while still giving the upside of static typing, which is compile time type checking, and it makes code assist in editors easier.
The ability to extend the language. This has been the thing that has kept Lisp going for decades, and that allowed Ruby on Rails.
The type system really is Scala's most distinguishing feature. It also has a lot of syntactic conveniences over, say, Java.
But for me, the most compelling features of Scala are:
First-class modules.
Higher-kinded types (type constructor polymorphism).
Implicits.
In effect, these features let you approximate (and in some ways surpass) Haskell's type classes. Combined, they let you write exceptionally modular code.
Just shortly:
You get the power and platform-independency of the Java libraries, but without the boilerplate and verbosity.
You get the simplicity and productivity of Ruby, but with static typing and compiled bytecode.
You get the functional goodnesses and concurrency support of Haskell, but without complete paradigm shift and with the benefits of object-orientation.
What I find especially attractive in all of its magnificient features, among others:
Most of the object-oriented design patterns which require loads of boilerplate code in Java are supported natively, e.g. Singleton (via objects), Adapter, Decorator (via traits and implicits), Visitor (via pattern matching), Strategy (via closures) etc.
You can define your domain models and DSLs very concisely, then you can extend them with the necessary features (notification, association handling; parsing, serialization), without the need of code generation or frameworks.
And finally, there is full interoperability with the well-supported Java platform. You can mix Java and Scala in both directions. There is not much penalty nor compatibility problems when switching to Scala after having experienced the annoyances of Java which make code hard to maintain.
Functional Programming brought to JVM
Supposedly it's very easy to make Scala code run concurrently on multiple processors.
Expressiveness of control flow. For example, it's very common to have a collection of data which you need to process in some way. This might be a list of trades in which the processing involves grouping by some properties (the currencies of the investment instruments) and then doing a summation (to get totals-per-currency perhaps).
In Java this involves separating out a piece of code to do the grouping (a few lines of for-loop) and then another piece of code to do the summation (another for loop). In Scala, this type of thing is typically achievable in one line of code using functional programming and then folding, which reads very expressively l-to-r.
Of course, this is just an argument for a functional language over Java.
The great features of Scala has already been mentioned. One thing that shines through past all features though, is how tastefully everything is integrated.
Scala manages to be one of the most powerful language around without having a feeling of having bolted on features in haste. Neither are the language an academic exercise in proving a point. Innovation and really advanced concepts are brought in to the language with uncanny practicality and elegance.
In short: Martin Odersky is a pure design genius. That is what's so great about Scala!
I want to add the multi-paradigm (OO and FP) nature gives Scala an edge over other languages
Every day you code Java you will become more and more miserable, every day you code Scala you will become happier.
Here's a few fairly in depth explanations for the appeal of functional languages.
How/why do functional languages (specifically Erlang) scale well?
If we abandon feature discussion and will talk about style, i would say it's pipe-line style of coding. You start from some object or collection, types dot and property or dot and transformation and do it until you form desired result. This way it's easy to write a chain of transoformations that will be easy to read them also. Traits to some extend will also allow you to apply the same approach to constructing types.

Resources