Apache Spark - MLlib - Matrix multiplication

I'm trying to use MLlib for a matrix multiplication problem.
I am aware that Spark MLlib uses native libraries, which need to be present on the nodes (they do not come with the Spark installation).
So I already installed the libgfortran library on all nodes (following the same steps as in
Apache Spark -- MlLib -- Collaborative filtering).
But I still encounter this error when running on a cluster:
Lost task 0.3 in stage 2.0 (TID 11, ibm-power-6.dima.tu-berlin.de): java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dgemm(CCIIID[DII[DIID[DII)V
at org.jblas.NativeBlas.dgemm(Native Method)
at org.jblas.SimpleBlas.gemm(SimpleBlas.java:247)
.....
How can I solve this error?

Spark hasn't used jblas for a while; as far as I can tell, not since 1.4.0, which came out more than a year ago. The answer you linked to points to documentation for Spark 0.9.0, which is definitely ancient. So the simplest solution seems to be to use a more up-to-date version of Spark.
If that is not possible, or if you run into a situation where you have to use jblas again: it looks like you are using IBM PowerLinux hardware. Support for this platform was added to jblas in version 1.2.4, so you would have to make sure you are using at least that version.
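For reference, recent Spark versions expose distributed matrix multiplication through the netlib-backed linalg API rather than jblas. A minimal sketch, assuming a Spark 2.x cluster with a SparkContext available as sc (matrix contents and block sizes here are purely illustrative):

# Distributed matrix multiplication without jblas, via BlockMatrix
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

a = IndexedRowMatrix(sc.parallelize([IndexedRow(0, [1.0, 2.0]),
                                     IndexedRow(1, [3.0, 4.0])]))
b = IndexedRowMatrix(sc.parallelize([IndexedRow(0, [5.0, 6.0]),
                                     IndexedRow(1, [7.0, 8.0])]))

# Convert to block matrices and multiply; the result is again a BlockMatrix
product = a.toBlockMatrix().multiply(b.toBlockMatrix())
print(product.toLocalMatrix())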

Related

Which version of hadoop-aws should I use?

I'm running Spark jobs on YARN on EMR 5.14 (Hadoop 2.8.3).
Can I use a newer version of hadoop-aws (e.g. 2.9 or 3.1) to benefit from recent optimizations in the s3a protocol?
You need to stick with whatever EMR gives you. Their s3:// connector is the one which AWS develops and is probably your safest option.
FWIW, for input performance s3a hasn't changed much since 2.8.3, except that in 3.1, if you leave fs.s3a.experimental.fadvise at normal, it automatically switches from optimising for sequential IO to random IO (columnar data) on the first backward seek. It is still best to set that property to random from the outset if you know all your data is stored as Parquet/ORC in a seekable compression format (i.e. not gzip). There is no speedup in writes either. You do get a consistency layer equivalent to "consistent EMR" in Hadoop 2.9+, and a high-performance output committer in Hadoop 3.1. But you cannot get those features by dropping in the later JARs; that will only give you stack traces.
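If you do set that fadvise property, the least intrusive place is the Spark configuration, since spark.hadoop.* keys are forwarded to the Hadoop configuration. A hedged sketch in PySpark (bucket name and path are placeholders):

from pyspark.sql import SparkSession

# Hint s3a to optimise for random IO (columnar formats such as Parquet/ORC)
spark = (SparkSession.builder
         .appName("s3a-columnar-read")
         .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
         .getOrCreate())

df = spark.read.parquet("s3a://your-bucket/path/to/table/")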

What is the difference between sqlContext.read.load and sqlContext.read.text?

I am only trying to read a text file into a pyspark RDD, and I am noticing huge differences between sqlContext.read.load and sqlContext.read.text.
s3_single_file_inpath='s3a://bucket-name/file_name'
indata = sqlContext.read.load(s3_single_file_inpath, format='com.databricks.spark.csv', header='true', inferSchema='false', sep=',')
indata = sqlContext.read.text(s3_single_file_inpath)
The sqlContext.read.load command above fails with
Py4JJavaError: An error occurred while calling o227.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
But the second one succeeds?
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
It is not clear to me when to use which of these. Is there a clear distinction between them?
What is the difference between sqlContext.read.load and sqlContext.read.text?
sqlContext.read.load assumes parquet as the data source format while sqlContext.read.text assumes text format.
With sqlContext.read.load you can define the data source format using the format parameter.
Depending on the version of Spark (1.6 vs 2.x) you may or may not need to load an external Spark package to have support for the csv format.
As of Spark 2.0 you no longer have to load spark-csv Spark package since (quoting the official documentation):
NOTE: This functionality has been inlined in Apache Spark 2.x. This package is in maintenance mode and we only accept critical bug fixes.
That would explain your confusion: you may have been using Spark 1.6.x without loading the Spark package that adds csv support.
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
https://spark.apache.org/docs/1.6.1/sql-programming-guide.html is for Spark 1.6.1, when the spark-csv package was not part of Spark; that only happened in Spark 2.0.
It is not clear to me when to use which of these. Is there a clear distinction between them?
There's none, actually, if you use Spark 2.x.
If, however, you use Spark 1.6.x, spark-csv has to be loaded separately using the --packages option (as described in Using with Spark shell):
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell
As a matter of fact, you can still use com.databricks.spark.csv format explicitly in Spark 2.x as it's recognized internally.
The difference is:
text is a built-in input format in Spark 1.6
com.databricks.spark.csv is a third party package in Spark 1.6
To use the third-party spark-csv package (no longer needed in Spark 2.0) you have to follow the instructions on the spark-csv site, for example provide the
--packages com.databricks:spark-csv_2.10:1.5.0
argument with spark-submit / pyspark commands.
Beyond that, sqlContext.read.formatName(...) is syntactic sugar for sqlContext.read.format("formatName").load(...) and sqlContext.read.load(..., format="formatName").
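To make the equivalences concrete, here is a small sketch for Spark 2.x, where csv is built in (the path is taken from the question; on Spark 1.6.x you would use format 'com.databricks.spark.csv' and pass the --packages option shown above):

path = "s3a://bucket-name/file_name"

# These three read the same CSV file and are interchangeable in Spark 2.x
df1 = sqlContext.read.load(path, format="csv", header="true", inferSchema="false", sep=",")
df2 = sqlContext.read.format("csv").option("header", "true").option("sep", ",").load(path)
df3 = sqlContext.read.csv(path, header=True, sep=",")

# text does no parsing at all: one string column named "value" per line
lines = sqlContext.read.text(path)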

KMeans with Spark 1.6.2 VS Spark 2.0.0

I am using KMeans() in an environment I have no control over and will abandon in less than a month. Spark 1.6.2 is installed.
Should I pay the price for urging 'them' to upgrade to Spark 2.0.0 before I leave? In other words, does Spark 2.0.0 introduce any significant improvements when it comes to Spark MLlib KMeans()?
In my case, quality is a more important factor than speed.
It is rather unlikely.
Spark 2.0.0 doesn't introduce any significant improvements to the core RDD API, and the KMeans implementation hasn't changed much since 1.6, with relatively significant changes introduced only by SPARK-15322, SPARK-16696 and SPARK-16694.
If you use the ML API there may also be some improvements related to SPARK-14850, but overall I don't see any game changers here.
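If you do want to try the DataFrame-based (ML) API, the call pattern is the same in 1.6.x and 2.0.0. A minimal sketch, assuming an existing DataFrame df with numeric columns x and y (the column names are illustrative):

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Assemble the feature columns into the single vector column KMeans expects
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
features = assembler.transform(df)

model = KMeans(k=3, seed=1, featuresCol="features").fit(features)
print(model.clusterCenters())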

Zeppelin fails; Class UTF8String.class is different between Vora and Spark 1.5.2 libraries

I installed Vora 1.1 Patch 1 on HDP 2.3 with Spark 1.5.2, on SLES 11 SP3. It's not precisely the configuration mentioned in Note 2213226, but the shell version of Vora seems to be working properly with test 2.7 of the Installation manual (the latter didn't prescribe HDP versions depending on the OS version, hence I went for HDP 2.3 under SLES).
I have problems with Zeppelin, though. The GitHub installation of version 0.5.6 seems to be successful, and I can execute a "create table" statement in a Zeppelin notebook, but when executing a "show tables" statement I get this error:
Error: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 36, eba156.extendtec.com.au): java.io.InvalidClassException: org.apache.spark.unsafe.types.UTF8String; local class incompatible: stream classdesc serialVersionUID = 7459647620003804432, local class serialVersionUID = 7786395165093970948 at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:621) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at
(blablabla)
I believe I found the reason why:
The class UTF8String.class coming from the library spark-sap-datasources-1.2.10-assembly.jar (and then used by Zeppelin) is dated Jan 20 and has size 17919 bytes.
The class UTF8String.class contained in Spark's 1.5.2 library is dated Dec 16 and has size 18653 bytes.
So I guess versions of these libraries do not match. How should I proceed? Thanks!
Up to Vora 1.1 Patch 1, the Spark 1.5.2 version that comes with HDP 2.3.4 is not officially supported (the HDP Spark 1.5.2 build is slightly different from the Apache Spark 1.5.2 release). There are two known issues with the Thriftserver and Zeppelin. The easiest workaround is to install Apache Spark 1.5.2 outside of Ambari and not use the HDP Spark version.
As of Vora 1.2 (released March 31, 2016) both issues with the HDP Spark 1.5.2 version are resolved and Vora is fully compatible with it.
I copied the mentioned class over from Spark's own 1.5.2 library into the combined spark-vora-zeppelin one, overwriting the class there. The "SHOW TABLES" statement then executed without any issues. I wonder whether it is the appropriate solution, but so far it has worked.

Why do apache spark artifact names include scala versions

In the Maven repository http://mvnrepository.com/artifact/org.apache.spark, Apache Spark version 1.4.1 is available in two flavours:
spark-*_2.10 & spark-*_2.11
These seem to be Scala versions. Which of these is preferred if I am deploying Spark with a Java application?
The Scala SDK is not binary compatible between major releases (for example, 2.10 and 2.11). If you have Scala code that you will be using with Spark and that code is compiled against a particular major version of Scala (say 2.10) then you will need to use the compatible version of Spark. For example, if you are writing Spark 1.4.1 code in Scala and you are using the 2.11.4 compiler, then you should use Spark 1.4.1_2.11.
If you are not using Scala code then there should be no functional difference between Spark 1.4.1_2.10 and Spark 1.4.1_2.11 (if there is, it is most likely a bug). The only difference should be the version of the Scala compiler used to compile Spark and the corresponding libraries.
I don't think it matters if you are using Java, as the bytecode should be close enough. The current default for Spark is 2.10, but you might get some minor gains if you choose 2.11. Ultimately, though, I don't think it matters.
As zero323 mentions, there are some areas that might not be fully supported in 2.11, so as I stated above, 2.10 is the default for now and probably the safest route.
