Spark integration in KNIME - apache-spark

I am planning to execute Spark from the KNIME Analytics Platform. For this I need to install the KNIME Spark Executor in the KNIME Analytics Platform.
Can anyone please let me know how to install the KNIME Spark Executor in the KNIME Analytics Platform for the Hadoop distribution CDH 5.10.x?
I am referring to the installation guide at the link below:
https://www.knime.org/knime-spark-executor

I could successfully configure/integrate Spark in KNIME.
I did it on CDH 5.7.
I followed these steps:
1. Downloaded knime-full_3.3.2.linux.gtk.x86_64.tar.gz.
2. Extracted the above-mentioned package and ran the KNIME installation.
3. After KNIME is installed, go to File -> Install KNIME Extensions -> install the Big Data extensions (check all the Spark-related extensions and proceed).
Follow this link:
https://tech.knime.org/installation-instructions#download
4. Up to this point only the Big Data extensions have been installed, but they need a license to be functional.
5. A license needs to be purchased. However, a free 30-day trial can be availed, after which the license needs to be purchased.
Follow this link:
https://www.knime.org/knime-spark-executor
6. After the plugins are installed, we need to configure the Spark Job Server.
For that we need to download the version of spark-job-server that is compatible with the Hadoop version we have; a quick reachability check is sketched below.
Follow this link for spark-job-server versions and their compatibility:
https://www.knime.org/knime-spark-executor
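Once the Spark Job Server is deployed, a quick way to confirm that it is reachable from the machine running KNIME is to query its REST API. This is only a sketch: the host name is a placeholder and spark-jobserver's default port of 8090 is assumed.

```scala
// Sketch: minimal reachability check against spark-jobserver's REST API.
// GET /contexts returns the Spark contexts the job server currently knows about.
import scala.io.Source

object JobServerCheck {
  def main(args: Array[String]): Unit = {
    val jobServerUrl = "http://your-jobserver-host:8090" // hypothetical host, default port 8090
    val contexts = Source.fromURL(s"$jobServerUrl/contexts").mkString
    println(contexts) // JSON list of context names
  }
}
```

If this call fails, the KNIME Spark nodes will not be able to create a Spark context either, so it is worth checking before troubleshooting inside KNIME.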

I'm pretty sure it's as easy as registering for the free trial (and buying the license if you need it for longer than 30 days) and then installing the software from the Help -> Install New Software menu.

As of KNIME 3.6 (the latest version), it should be possible to connect to Spark via Livy, with no specific executor deployment needed on a KNIME Server. It is still in preview, but it should do the job.
https://www.knime.com/whats-new-in-knime-36
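For the Livy route, the prerequisite on the cluster side is just a reachable Livy REST endpoint. As a sketch (the host is a placeholder and Livy's default port 8998 is assumed), you can verify the endpoint answers before configuring the Livy-based Spark context node in KNIME:

```scala
// Sketch: check that the Livy REST endpoint is up by listing its interactive sessions.
import scala.io.Source

object LivyCheck {
  def main(args: Array[String]): Unit = {
    val livyUrl = "http://your-livy-host:8998" // hypothetical host, default Livy port 8998
    println(Source.fromURL(s"$livyUrl/sessions").mkString) // JSON listing of active Livy sessions
  }
}
```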

Related

Which spark should I download?

I'm new to Spark and am trying to build a Spark + Hadoop + Hive environment.
I've downloaded the latest version of Hive, and according to the [Version Compatibility] section at https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started, I should download Spark 2.3.0. At https://archive.apache.org/dist/spark/spark-2.3.0/ I found several different packages, such as spark-2.3.0-bin-hadoop2.7.tgz, spark-2.3.0-bin-without-hadoop.tgz, SparkR_2.3.0.tar.gz and so on.
Now I'm confused! I don't know which version of Spark I need to download, and if I download spark-2.3.0-bin-hadoop2.7.tgz, does that mean I needn't download Hadoop? And what's the difference between SparkR_2.3.0.tar.gz and spark-2.3.0-bin-without-hadoop.tgz?
thanks
You should download the latest version that includes Hadoop, since that's what you want to set up. That would be Spark 3.x, not 2.3.
If you already have a Hadoop environment (HDFS/YARN), download the one without Hadoop.
If you're not going to write R code, don't download the SparkR version.
AFAIK, the "Hive on Spark" execution engine is no longer being worked on. The Spark Thrift Server can be used in place of running your own HiveServer2.
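As a sketch of that last point: the Spark Thrift Server speaks the same JDBC protocol as HiveServer2, so an ordinary Hive JDBC client can connect to it unchanged. The host, port (10000 is the usual default) and query are placeholders, and the Hive JDBC driver is assumed to be on the classpath.

```scala
// Sketch: connect to a running Spark Thrift Server (started via sbin/start-thriftserver.sh)
// over the HiveServer2 JDBC protocol and list the databases it exposes.
import java.sql.DriverManager

object ThriftServerCheck {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    val rs = conn.createStatement().executeQuery("SHOW DATABASES")
    while (rs.next()) println(rs.getString(1)) // e.g. "default"
    rs.close()
    conn.close()
  }
}
```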

Spring Version in Azure databricks

I am currently using Spring Boot version 2.3.0 to build an Apache Spark job in Java. This job works fine on my local machine. I want to deploy this Spring Boot Spark job on Azure Databricks (7.2.0), but while deploying the Spring Boot jar on Azure Databricks I am getting the following error:
java.lang.NoSuchMethodError: org.springframework.core.ResolvableType.forInstance(Ljava/lang/Object;)Lorg/springframework/core/ResolvableType;
at org.springframework.context.event.SimpleApplicationEventMulticaster.resolveDefaultEventType(SimpleApplicationEventMulticaster.java:145)
at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:127)
at org.springframework.boot.context.event.EventPublishingRunListener.starting(EventPublishingRunListener.java:74)
at org.springframework.boot.SpringApplicationRunListeners.starting(SpringApplicationRunListeners.java:47)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:305)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1237)
I have checked the Azure Databricks documentation: it has Spring Core 4.1.3 installed by default, while my code uses Spring Core 5.2.8. So I want to ask if there is any way I can upgrade the Spring Core version on Azure Databricks.
To make third-party or locally built code available to notebooks and jobs running on your clusters, you can install a library. Libraries can be written in Python, Java, Scala, and R. You can upload Java, Scala, and Python libraries and point to external packages in PyPI, Maven, and CRAN repositories.
Steps to install a Spring version in Azure Databricks:
Step 1: Download the Spring Core library from the Maven repository. Click on the jar file to download it.
Step 2: Choose the cluster on which you want to install the library.
Libraries => Install New => Library Source: "Upload", Library Type: "Jar", drop or browse for the previously downloaded jar file => click Install.
This successfully installs the spring_core_5_2_8 library on the cluster.
For different methods to install packages in Azure Databricks, refer: How to install a library on a databricks cluster using some command in the notebook?
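After the library is installed and the cluster restarted, it can help to check which spring-core the driver JVM actually loads, since the NoSuchMethodError comes from an older spring-core winning on the classpath. A minimal sketch to run in a Scala notebook cell (class-loading details can differ between Databricks runtimes, so treat this as a diagnostic, not a guarantee):

```scala
// Sketch: print where org.springframework.core.ResolvableType is loaded from and which
// implementation version it reports, to see whether the installed spring-core 5.2.8 or the
// runtime's bundled 4.1.3 is actually being used.
val cls = Class.forName("org.springframework.core.ResolvableType")
println(cls.getProtectionDomain.getCodeSource.getLocation) // jar the class came from
println(cls.getPackage.getImplementationVersion)           // e.g. 4.1.3.RELEASE or 5.2.8.RELEASE
```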

Different Spark versions when building from source code vs. using a pre-built version

I have downloaded the Spark source code (branch 2.4) and built the jars using the build instructions for Hadoop 2.7.4. I have also downloaded a pre-built version of Spark 2.4.4 (pre-built for Hadoop 2.7).
When I start spark-shell I see two different versions of Spark, as shown in the pictures below:
In the first picture, the version is 3.0.0 for the jars built after downloading the source code of branch 2.4. The second picture is from the pre-built version available from the Apache Spark website. Not only that, the plans use a RelationV2 node in the first case and a Relation logical node in the second case.
Can anyone explain why there is such a difference?
Pretty sure you got mixed up, as 3.0.0 is the default choice when downloading the source or a pre-built version. Maybe I am mistaken, but, as per my comment, carefully check which version you have built.
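One way to check carefully, as suggested above, is to print the version directly from each installation's spark-shell, which removes any ambiguity about which build is actually on the PATH. A minimal sketch:

```scala
// Run inside the spark-shell of each installation (spark and sc are predefined there).
println(spark.version)               // version of the Spark libraries on the classpath
println(sc.version)                  // same value, reported via the SparkContext
println(System.getenv("SPARK_HOME")) // which installation directory this shell came from
```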

SPARK individual upgrade to 2.1.0 in Ambari HDP 2.5.0

I want to upgrade my Spark component to 2.1.0 from its default 2.0.x.2.5 in Ambari.
I am using HDP 2.5.0 with Ambari 2.4.2.
I would appreciate any ideas on how to achieve this.
HDP 2.5 shipped with a technical preview of Spark 2.0, and also Spark 1.6.x. If you do not want to use either of those versions and you want Ambari to manage the service for you, then you will need to write a custom service for the Spark version that you want. If you don't want Ambari to manage the Spark instance, you can follow similar instructions as provided on the Hortonworks Community Forum to manually install Spark 2.x without management.
Newer versions of Ambari (maybe 3.0) will probably support per-component upgrades/multiple component versions.
See https://issues.apache.org/jira/browse/AMBARI-12556 for details.

Upgrade Apache Spark version from 1.6 to 2.0

Currently I have Spark version 1.6.2 installed.
I want to upgrade the Spark version to the newest 2.0.1. How do I do this without losing the existing configurations?
Any help would be appreciated.
If it's a Maven or sbt application, you simply change the Spark dependency version (see the sbt sketch below) and migrate your code to 2.0, so you will not lose your configurations. For the Spark binary distribution, you can take a backup of the conf folder.
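For example, with sbt the upgrade itself is just a version bump of the Spark artifacts. A sketch (the artifact list, Scala version, and provided scope depend on your project):

```scala
// build.sbt sketch: bump the Spark dependencies from 1.6.2 to 2.0.1 (Scala 2.11 builds).
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.1" % "provided"
)
```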
There is not much change related to configuration; some method signatures have changed. The major changes I observed were the mapPartitions method signature and some changes to the metrics/listener API, apart from new features.
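As an illustration of the kind of signature change mentioned above (a sketch, not code from the question): in Spark 1.6 the DataFrame map/mapPartitions operators returned RDDs and took a ClassTag, while in 2.0 a DataFrame is a Dataset[Row], so mapPartitions needs an implicit Encoder and returns a Dataset.

```scala
// Spark 2.0-style mapPartitions on a DataFrame (Dataset[Row]); the Encoder for the result type
// comes from spark.implicits._, which is the main migration step from the 1.6-style API.
import org.apache.spark.sql.SparkSession

object MapPartitionsMigration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("migration-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // In 2.0 this returns a Dataset[Int]; the 1.6 equivalent returned an RDD[Int].
    val keyLengths = df.mapPartitions(rows => rows.map(_.getString(0).length))
    keyLengths.show()

    spark.stop()
  }
}
```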
