Can I have two working pyspark versions (1.6.1 and 2.0) on my Macbook Pro at the same time? - apache-spark

I am currently working with Spark 1.6.1 and use it in both Jupyter IPython notebooks and from Java 8. For Java I can just modify my Maven POM to import Spark 2.0, but I'm not sure how to do the equivalent for IPython. I think I need to install 2.0, but is that doable since I already have Spark 1.6.1 installed? Can I have both versions on my MacBook and select which one to use from pyspark? How?
Update: this is how I launch my Jupyter pyspark notebook from the terminal: % IPYTHON_OPTS="notebook" pyspark. How do I tell it to launch with Spark 2.0?

Can I have both versions on my macbook and select which one to use from pyspark
Yes.
Say you have extracted Spark under the /opt/apache-spark folder. You could then keep both 2.0.0 and 1.6.1 in there, each in its own subdirectory.
Then, to run the 2.0.0 version of pyspark, you simply run
/opt/apache-spark/2.0.0/bin/pyspark
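To address the update about Jupyter, here is a minimal sketch of picking a version per shell session, assuming the layout above. Note that Spark 2.0 dropped IPYTHON_OPTS in favor of the PYSPARK_DRIVER_PYTHON variables.
export SPARK_HOME=/opt/apache-spark/2.0.0          # or /opt/apache-spark/1.6.1
export PATH="$SPARK_HOME/bin:$PATH"
# Spark 1.6.x: launch the notebook the way you already do
IPYTHON_OPTS="notebook" pyspark
# Spark 2.0.x: IPYTHON_OPTS was removed, so use the driver-python variables instead
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook pyspark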
The real question you need to ask is why do you think you need both versions?

Related

Which spark should I download?

I'm new to Spark and am trying to build a Spark + Hadoop + Hive environment.
I've downloaded the latest version of Hive, and according to the [Version Compatibility] section at https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started, I should download Spark 2.3.0. On the page https://archive.apache.org/dist/spark/spark-2.3.0/ I found several different packages, such as spark-2.3.0-bin-hadoop2.7.tgz, spark-2.3.0-bin-without-hadoop.tgz, SparkR_2.3.0.tar.gz and so on.
Now I'm confused! I don't know which Spark package I need to download. If I download spark-2.3.0-bin-hadoop2.7.tgz, does that mean I don't need to download Hadoop? And what's the difference between SparkR_2.3.0.tar.gz and spark-2.3.0-bin-without-hadoop.tgz?
thanks
You should download the latest version that includes Hadoop, since that's what you want to set up. That would be Spark 3.x, not 2.3.
If you already have a Hadoop environment (HDFS/YARN), download the one without Hadoop.
If you're not going to write R code, don't download the SparkR package.
AFAIK, "Spark on Hive" execution engine is no longer being worked on. Spark Thriftserver can be used in place of running your own HiveServer
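As a rough sketch of the two download paths (using the 2.3.0 archive URLs from the question; for the "without Hadoop" build, pointing Spark at an existing Hadoop installation via SPARK_DIST_CLASSPATH is the documented approach):
# option 1: package that bundles Hadoop client libraries (no separate Hadoop needed for local use)
wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
tar xzf spark-2.3.0-bin-hadoop2.7.tgz
# option 2: "Hadoop free" build that reuses the Hadoop you already run
wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-without-hadoop.tgz
tar xzf spark-2.3.0-bin-without-hadoop.tgz
echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> spark-2.3.0-bin-without-hadoop/conf/spark-env.sh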

Different Spark versions when building from source code vs. using a pre-built version

I have downloaded the Spark source code (branch 2.4) and built the jars using the build instructions for Hadoop 2.7.4. I have also downloaded a pre-built version of Spark 2.4.4 (pre-built for Hadoop 2.7).
When I start spark-shell I see two different versions of Spark in the two setups.
For the jars built from the branch 2.4 source code, the reported version is 3.0.0; the other output is from the pre-built version available from the Apache Spark website. Not only that, the query plans use RelationV2 in the first case and a Relation logical node in the second case.
Can anyone explain why there is such a difference?
Pretty sure you got mixed up, as 3.0.0 is the default choice when downloading either the source or a pre-built version. Maybe I am mistaken, but, as I said in my comment, carefully check which version you have actually built.
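For example, a quick sanity check (a sketch, assuming you built from a local git checkout of apache/spark):
cd spark
git rev-parse --abbrev-ref HEAD    # should print branch-2.4; at the time, master built as 3.0.0-SNAPSHOT
./bin/spark-submit --version       # reports the version of the build you are actually launching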

SPARK individual upgrade to 2.1.0 in Ambari HDP 2.5.0

I want to upgrade my Spark component to 2.1.0 from its default 2.0.x.2.5 in Ambari.
I am using HDP 2.5.0 with Ambari 2.4.2.
Appreciate any idea to achieve this.
HDP 2.5 shipped with a technical preview of Spark 2.0, and also Spark 1.6.x. If you do not want to use either of those versions and you want Ambari to manage the service for you, then you will need to write a custom service for the Spark version that you want. If you don't want Ambari to manage the Spark instance, you can follow similar instructions as provided on the Hortonworks Community Forum to manually install Spark 2.x without management.
Newer versions of Ambari (maybe 3.0) will probably support per-component upgrades/multiple component versions.
See https://issues.apache.org/jira/browse/AMBARI-12556 for details.

Unable to build Spark+Cassandra using sbt-assembly

I am trying to build a simple project with Spark+Cassandra for a SQL-analytics demo.
I need to use Cassandra v2.0.14 (can't upgrade it for now). I am unable to find the correct versions of Spark and spark-cassandra-connector. I referred to Datastax's git project at https://github.com/datastax/spark-cassandra-connector, and I know that the Spark and spark-cassandra-connector versions need to match and be compatible with Cassandra. Hence, I would like someone to help point out the exact versions of Spark and spark-cassandra-connector. I tried using v1.1.0 and v1.2.1 for both Spark and spark-cassandra-connector, but I am unable to build the spark-cassandra-connector fat jar with either the supplied sbt (it fails because the downloaded sbt-launch jar just contains a 404 Not Found HTML page) or my local sbt v0.13.8 (it fails with compilation errors on "import sbtassembly.Plugin." and "import AssemblyKeys.").
The connector works with Cassandra 2.0 and 2.1, but some features may also work fine with 2.2 and 3.0 (not officially supported yet) using the older Java driver 2.1. This is because the C* Java driver supports a wide range of Cassandra versions: the newer driver works with older C* versions, and the older driver versions also work with newer C* versions, excluding new C* features.
However, there is one minor caveat with using C* 2.0:
Since version 1.3.0, we dropped the Thrift client from the connector. This move was to simplify connectivity code and make it easier to debug; debugging one type of connection should be easier than two. It either connects or it doesn't, with no more surprises of the kind "it writes fine, but can't connect for reading". Unfortunately, not all of the Thrift functionality was exposed by the native protocol in C* 2.0, nor by the system tables. Therefore, if you use a C* version prior to 2.1.5, automatic split sizing won't work properly and you have to tell the connector the preferred number of splits. This is set in the ReadConf object passed at the creation of the RDD.
As for the interface between the Connector and Spark, there is much less freedom. Spark APIs change quite often and you typically need a connector dedicated to the Spark version you use. See the version table in the README.
(fails because the downloaded sbt-launch jar just contains a 404 not found html)
This looks like an SBT problem, not a connector problem.
I just tried sbt clean assembly on v1.2.5, v1.3.0 and b1.4, and it worked fine for all of them.
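For reference, a sketch of those steps (the tag to check out should come from the version table in the README):
git clone https://github.com/datastax/spark-cassandra-connector.git
cd spark-cassandra-connector
git checkout v1.2.5        # or v1.3.0 / b1.4, whichever matches your Spark version
sbt clean assembly         # builds the assembly ("fat") jar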
If you can upgrade your version of Spark, then you can connect Spark with Cassandra.
Put the following Maven dependencies in your POM file:
cassandra-all
cassandra-core
cassandra-mapping
cassandra-thrift
cassandra-client
spark-cassandra-connector
spark-cassandra-connector-java
This will work.

How to build and debug Spark from an IDE (preferably Eclipse)?

I want to contribute to Spark.
I cloned the git repository locally. Please suggest how to set up Spark first and then run a hello world on it from the IDE itself.
For importing/building Spark in IntelliJ or Eclipse follow this guide.
If you are interested in contributing to Spark visit this wiki page for more information:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
I assume you already have the latest release of Scala IDE (4.0 at this point) from scala-ide.org.
Export the projects using sbt eclipse; I guess you figured that out already.
Import all projects into your workspace (Import Existing Projects).
You will probably see a number of errors related to "cross-compiled libraries".
If you want to develop on Scala 2.10, you need to configure a Scala installation for the exact Scala version that's used to compile Spark. At the time of this writing that is Scala 2.10.4.
You can do that in Eclipse Preferences -> Scala -> Installations by pointing to the lib/ directory of your Scala 2.10.4 distribution.
Select all Spark projects, right-click, choose Scala -> Set Scala Installation, and point to the 2.10.4 installation. This should clear all errors about invalid cross-compiled libraries.
A clean build should then succeed.
You can easily find examples on getting started with Spark, for example here. You can run a Spark app using right-click -> Run As Scala Application.
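If you just want a quick hello world against your clone before the IDE import is sorted out, here is a command-line sketch (build/mvn and run-example ship with the Spark repository):
git clone https://github.com/apache/spark.git
cd spark
./build/mvn -DskipTests clean package    # build/mvn bootstraps a suitable Maven for you
./bin/run-example SparkPi 10             # classic hello world: estimates pi on a local master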
