KMeans with Spark 1.6.2 VS Spark 2.0.0 - apache-spark

I am using KMeans() in an environment I have no control over and will abandon in less than a month. Spark 1.6.2 is installed.
Should I pay the price of urging 'them' to upgrade to Spark 2.0.0 before I leave? In other words, does Spark 2.0.0 introduce any significant improvements to Spark MLlib KMeans()?
In my case, quality is a more important factor than speed.

It is rather unlikely.
Spark 2.0.0 doesn't introduce any significant improvements to the core RDD API, and the KMeans implementation hasn't changed much since 1.6, with the only relatively significant changes introduced by SPARK-15322, SPARK-16696 and SPARK-16694.
If you use the ML API there may also be some improvements related to SPARK-14850, but overall I don't see any game changers here.
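For what it's worth, a rough sketch of the ML (DataFrame-based) KMeans mentioned above; df, the column names and the value of k are placeholders I made up, not anything from the original question:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the numeric input columns into the single "features" vector column KMeans expects.
// "x1" and "x2" are placeholder column names of an existing DataFrame df.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
val features = assembler.transform(df)

// k and the seed are arbitrary here; the underlying algorithm behaves the same in 1.6 and 2.0.
val model = new KMeans().setK(5).setSeed(1L).fit(features)
model.clusterCenters.foreach(println)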

Related

Tungsten encoding in Spark SQL?

I am running a Spark application that has a series of Spark SQL statements that are executed one after the other. The SQL queries are quite complex and the application is working (generating output). These days, I am working towards improving the performance of processing within Spark.
Please suggest whether Tungsten encoding has to be enabled separately, or whether it kicks in automatically when running Spark SQL.
I am using Cloudera 5.13 for my cluster (2 nodes).
It is enabled by default in Spark 2.x (and possibly in 1.6, but I'm not sure about that).
In any case, you can set it explicitly:
spark.sql.tungsten.enabled=true
That can be passed to spark-submit as follows:
spark-submit --conf spark.sql.tungsten.enabled=true
Tungsten should be enabled if you see a * next to the plan.
Also see: How to enable Tungsten optimization in Spark 2?
Tungsten became the default in Spark 1.5 and can be enabled in earlier versions by setting spark.sql.tungsten.enabled to true.
Even without Tungsten, Spark SQL uses a columnar storage format with Kryo serialization to minimize storage cost.
To make sure your code benefits as much as possible from Tungsten optimizations, try to use the Dataset API with Scala (instead of RDDs).
Dataset brings the best of both worlds, mixing relational (DataFrame) and functional (RDD) transformations. The Dataset APIs are the most up to date and add type safety, better error handling, and far more readable unit tests.
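For illustration, a minimal sketch of setting the flag programmatically and going through the Dataset API; the Sale case class and the sample data are made up, and on Spark 2.x the flag is already on by default:

import org.apache.spark.sql.SparkSession

case class Sale(region: String, amount: Double)

val spark = SparkSession.builder()
  .appName("tungsten-example")
  .config("spark.sql.tungsten.enabled", "true") // redundant on 2.x, only needed pre-1.5
  .getOrCreate()
import spark.implicits._

// Use a typed Dataset instead of a raw RDD so the optimizer and Tungsten can work on it.
val sales = Seq(Sale("EU", 10.0), Sale("US", 20.0)).toDS()
val totals = sales.groupBy($"region").sum("amount")

// Look for the * / WholeStageCodegen markers in the physical plan.
totals.explain()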

Which version of hadoop-aws should I use

I'm running Spark jobs on YARN on EMR 5.14 (Hadoop 2.8.3).
Can I use a newer version of hadoop-aws (e.g. 2.9 or 3.1) to benefit from recent optimizations in the s3a protocol?
You need to stick with whatever EMR gives you. Their s3:// connector is the one AWS develops and is probably your safest option.
FWIW, s3a input performance hasn't changed much since 2.8.3. The exception is Hadoop 3.1: if you leave fs.s3a.experimental.fadvise at normal, it automatically switches from optimising for sequential IO to random IO (columnar data) on the first backward seek. It's still best to set that property to random from the outset if you know all your data is stored as Parquet/ORC in a seekable compression format (i.e. not gzip). There's no speedup in writes either. You do get a consistency layer equivalent to "consistent EMR" in Hadoop 2.9+, and a high-performance output committer in Hadoop 3.1, but you cannot get those features by dropping in the later JARs; that will only give you stack traces.
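For what it's worth, if you do experiment with s3a on the Hadoop version EMR ships, a rough sketch of setting the fadvise policy from Spark could look like this; the bucket and path are placeholders, and the spark.hadoop.* prefix is just the standard way to pass Hadoop properties through Spark:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-random-io")
  // Optimise s3a reads for random IO (Parquet/ORC) instead of sequential streaming.
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()

// Placeholder bucket/path; note s3a:// rather than EMR's default s3:// (EMRFS) scheme.
val df = spark.read.parquet("s3a://my-bucket/path/to/table")
df.show()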

Which is the best HBase connector to use for batch loading data into HBase from Spark?

As also mentioned in
Which HBase connector for Spark 2.0 should I use?
there are mainly two options:
RDD based: https://github.com/apache/hbase/tree/master/hbase-spark
DataFrame based: https://github.com/hortonworks-spark/shc
I do understand the optimizations and the differences with regard to READING from HBase.
However, it's not clear to me which one I should use for BATCH inserting into HBase.
I am not interested in writing records one by one, but in high throughput.
After digging through the code, it seems that both resort to TableOutputFormat,
http://hbase.apache.org/1.2/book.html#arch.bulk.load
The project uses Scala 2.11, Spark 2 and HBase 1.2.
Does the DataFrame library provide any performance improvements over the RDD library specifically for BULK LOAD?
Recently, the hbase-spark connector was released to Maven Central as version 1.0.0; it supports Spark 2.4.0 and Scala 2.11.12:
<dependency>
  <groupId>org.apache.hbase.connectors.spark</groupId>
  <artifactId>hbase-spark</artifactId>
  <version>1.0.0</version>
</dependency>
This supports both RDDs and DataFrames. Please refer to spark-hbase-connectors for more details.
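For the RDD side, a rough sketch of a batch write with HBaseContext.bulkPut (from the HBase-Spark module) could look like the following; the table name, column family and sample data are placeholders I made up:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes

// Assumes hbase-site.xml is on the classpath and sc is an existing SparkContext.
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

val rdd = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))

// Each record is converted to a Put and written in bulk through the connector.
hbaseContext.bulkPut[(String, String)](
  rdd,
  TableName.valueOf("my_table"),
  { case (rowKey, value) =>
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
    put
  })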
Happy Learning !!
Have you looked at the bulk load examples in the HBase project?
See HBase Bulk Examples; the GitHub page has Java examples, and you can easily write the equivalent Scala code.
Also read Apache Spark Comes to Apache HBase with HBase-Spark Module.
Given a choice between RDD and DataFrame, we should use DataFrame, as per the recommendation in the official documentation:
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
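As a rough illustration of the DataFrame route with shc (not a definitive recipe; the catalog JSON, table name and column family are placeholders, not from the original question):

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Maps DataFrame columns "key" and "value" to HBase table "my_table", column family "cf".
val catalog =
  """{
    |"table":{"namespace":"default", "name":"my_table"},
    |"rowkey":"key",
    |"columns":{
    |  "key":{"cf":"rowkey", "col":"key", "type":"string"},
    |  "value":{"cf":"cf", "col":"value", "type":"string"}
    |}
    |}""".stripMargin

// df is an existing DataFrame with "key" and "value" columns; newTable pre-splits into 5 regions.
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()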
Hoping this helps.
Cheers !

Apache Spark - MLlib - Matrix multiplication

I'm trying to use MLlib for a matrix multiplication problem.
I am aware that Spark MLlib uses native libraries, which need to be present on the nodes (they do not come with the Spark installation).
So I already installed the libgfortran library on all nodes (the same as in
Apache Spark -- MlLib -- Collaborative filtering).
But I still encounter this error when running on a cluster:
Lost task 0.3 in stage 2.0 (TID 11, ibm-power-6.dima.tu-berlin.de): java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dgemm(CCIIID[DII[DIID[DII)V
at org.jblas.NativeBlas.dgemm(Native Method)
at org.jblas.SimpleBlas.gemm(SimpleBlas.java:247)
.....
How can I solve this error?
Spark hasn't used jblas for a while; as far as I can tell, not since 1.4.0, which came out more than a year ago. The answer you linked to points to documentation for Spark 0.9.0, which is definitely ancient. So the simplest solution seems to be to use a more up-to-date version of Spark.
If that is not possible, or if you run into a situation where you have to use jblas again: it looks like you are using IBM PowerLinux hardware. Support for this platform was added to jblas in version 1.2.4, so you would have to make sure you are using at least that version.
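If the end goal is just distributed matrix multiplication on a current Spark, a minimal sketch with the built-in distributed linear algebra API (which relies on Breeze/netlib rather than jblas) could look like this; the matrix entries are made up:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Two small sparse matrices given as (row, col, value) entries.
val entriesA = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 2.0), MatrixEntry(1, 1, 3.0)))
val entriesB = sc.parallelize(Seq(
  MatrixEntry(0, 0, 4.0), MatrixEntry(1, 0, 5.0), MatrixEntry(1, 1, 6.0)))

// Convert to block matrices and multiply; no native jblas binding is involved.
val a = new CoordinateMatrix(entriesA).toBlockMatrix().cache()
val b = new CoordinateMatrix(entriesB).toBlockMatrix().cache()
val product = a.multiply(b)

println(product.toLocalMatrix())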

Are there any use cases where hadoop map-reduce can do better than apache spark?

I agree that iterative and interactive programming paradigms work much better with Spark than with MapReduce. And I also agree that we can use HDFS or any Hadoop data store, such as HBase, as the storage layer for Spark.
Therefore, my question is: are there any real-world use cases where Hadoop MR is better than Apache Spark? Here "better" is used in terms of performance, throughput, and latency. Is Hadoop MR still the better choice for BATCH processing than Spark?
If so, can anyone please explain the advantages of Hadoop MR over Apache Spark? Please keep the entire scope of the discussion to the COMPUTATION LAYER.
As you said, for iterative and interactive programming Spark is better than Hadoop. But Spark has a huge need for memory; if memory is not enough, it easily throws OOM exceptions, whereas Hadoop deals with this situation very well because it has a good fault-tolerance mechanism.
Secondly, if data skew happens, Spark may also collapse. I compare Spark and Hadoop on system robustness, because this decides the success of a job.
Recently I tested Spark and Hadoop performance using some benchmarks; according to the results, Spark's performance is not better than Hadoop's on some workloads, e.g. kmeans and pagerank. Maybe memory is a limitation for Spark.
