Difference between Apache Spark 2 and Cloudera Spark 2 - apache-spark

I searched but couldn't find a concrete difference between the Apache distribution of Spark 2 and the Cloudera distribution of Spark 2. Can anybody help me understand how they differ in Spark Core, Spark SQL and Spark Streaming?

They refer to the same thing. Cloudera distributes a packaged version of Hadoop (CDH) that includes Apache Spark 2. There are slight differences between the Spark 2 that Cloudera ships and the latest upstream Spark 2 release from https://spark.apache.org/. These are usually spelled out in the Release Notes for CDH Spark 2.
For example, the release notes have a section called "Spark 2 Known Issues", which describes some missing features.
In general, incompatibilities arise because CDH releases lag behind upstream releases, and because CDH has to maintain compatibility across the minor releases within a major version.
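To see which upstream release a vendor build is based on, you can look at the version string that `spark-submit --version` or `spark.version` reports. The following is a minimal sketch that splits such a string into the upstream part and a vendor suffix. The suffixed strings used here (`2.4.0-cdh6.3.2`, `2.3.0.cloudera2`) are assumed examples of the format, not an authoritative list; check what your own cluster prints.

```python
import re

# Three numeric components, optionally followed by a vendor suffix that is
# separated by "." or "-". The suffix formats are an assumption.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:[.-](.+))?$")


def split_spark_version(version):
    """Split a Spark version string into (upstream_version, vendor_suffix).

    vendor_suffix is None for a plain upstream version.
    """
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognised Spark version string: {version!r}")
    major, minor, patch, suffix = match.groups()
    return f"{major}.{minor}.{patch}", suffix


if __name__ == "__main__":
    # Illustrative inputs only.
    for v in ("2.4.4", "2.4.0-cdh6.3.2", "2.3.0.cloudera2"):
        print(v, "->", split_spark_version(v))
```

Knowing the upstream base version tells you which Apache release notes to compare against the vendor's known-issues list.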

Related

CDH (Cloudera Distributed Hadoop) to CDP (Cloudera Data Platform) migration: Spark 1.x to 3.x query

We are currently doing a feasibility study on migrating from CDH (Cloudera Distributed Hadoop) to CDP (Cloudera Data Platform) with respect to Spark (currently on version 1.6).
From the documentation, we understood that Spark 1.6 is not supported on CDP and that we need to refactor to 2.4. The manual steps are given here:
https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-cdh/topics/cdp-one-workload-migra...
But we are planning to migrate to Spark 3.x on CDP. One of the Cloudera blog posts on this topic is linked below:
https://blog.cloudera.com/upgrade-journey-the-path-from-cdh-to-cdp-private-cloud/
As part of the pre-upgrade step, it says that we need to convert Spark 1.x jobs to 2.4.5:
Phase 2: Pre-upgrade
Backup existing cluster using the backup steps list here
Confirm if all the prerequisites are addressed. Ensure all outstanding dependencies are met.
Convert Spark 1.x jobs to Spark 2.4.5. Test and validate the jobs to ensure all the required code changes are performed and tested.
My question is:
If we are going from Spark 1.x to 3.x as part of the move from CDH to CDP, is an intermediate step mandatory, i.e. converting from Spark 1.x to 2.x and then from 2.x to 3.x? If so, is the 1.x-to-2.x refactoring automated, or does it have to be done manually following the steps in the Cloudera documentation?
https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-cdh/topics/cdp-one-workload-migration-spark16-to-spark24.html
If not, can we refactor directly from Spark 1.x to 3.x when moving from CDH to CDP? Kindly help.
Thanks in advance.
I tried looking for a solution in the existing Cloudera documentation but couldn't find anything. In terms of migrating Spark workloads to CDP, there are only 2 options:
Spark 1.6 to Spark 2.4 Refactoring
Because Spark 1.6 is not supported on CDP, you need to refactor Spark workloads from Spark 1.6 on CDH or HDP to Spark 2.4 on CDP.
Spark 2.3 to Spark 2.4 Refactoring
Because Spark 2.3 is not supported on CDP, you need to refactor Spark workloads from Spark 2.3 on CDH or HDP to Spark 2.4 on CDP.
Spark 2.4 to 3.x
But if we have Spark 1.6, moving it to 2.4 and then to 3 would be double the effort.
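Whichever path is taken, the first task is finding the Spark 1.x API usage in the existing jobs. The following is a minimal sketch of a source scanner, not a Cloudera tool: it flags a few entry points and methods that Spark 2.x superseded or deprecated (`SQLContext`, `HiveContext`, `registerTempTable`). The pattern list is illustrative and far from exhaustive, and the function name is hypothetical.

```python
import re

# Spark 1.x names that Spark 2.x superseded or deprecated, with a hint for
# the replacement. Illustrative only; a real migration needs a longer list.
LEGACY_PATTERNS = {
    r"\bSQLContext\b": "use SparkSession",
    r"\bHiveContext\b": "use SparkSession.builder.enableHiveSupport()",
    r"\.registerTempTable\(": "use createOrReplaceTempView()",
}


def find_legacy_calls(source):
    """Return (line_number, line, hint) for every legacy pattern match."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for pattern, hint in LEGACY_PATTERNS.items():
            if re.search(pattern, line):
                hits.append((number, line.strip(), hint))
    return hits


if __name__ == "__main__":
    # A made-up Spark 1.6-style job used as sample input.
    sample = (
        "from pyspark.sql import SQLContext\n"
        "sqlContext = SQLContext(sc)\n"
        "df.registerTempTable('events')\n"
    )
    for number, line, hint in find_legacy_calls(sample):
        print(f"line {number}: {line}  ->  {hint}")
```

A scan like this only sizes the refactoring work; the code changes themselves still have to be made and tested by hand.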

Spark streaming with version > 2.1.1 is slower than 2.1.1

I have a Spark Streaming application on Spark 2.1.1, and after upgrading to a higher version the performance is worse (higher computation time, based on the UI statistics). Specifically, I compared it out of the box against Spark versions 2.3.1, 2.3.3, 2.4.3 and 2.4.4 (the latest).
I compared the Spark configuration pages and didn't find anything suspicious. In my case, I use PySpark; the application is a streaming job that reads from Kafka, does some aggregations and writes Parquet files to HDFS.
Does anyone know what has changed in the configuration that would make the performance worse?
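Rather than comparing the configuration pages by eye, one option is to dump the configuration from a job on each version and diff the two dumps. The sketch below assumes each dump is a list of `(key, value)` pairs, which is the shape `spark.sparkContext.getConf().getAll()` returns in PySpark; note that this only covers explicitly set values, not built-in defaults. The function name and the sample keys are hypothetical placeholders, not real Spark settings.

```python
def diff_conf(old_pairs, new_pairs):
    """Return {key: (old_value, new_value)} for keys whose values differ.

    A value of None means the key is absent on that side.
    """
    old, new = dict(old_pairs), dict(new_pairs)
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


if __name__ == "__main__":
    # Placeholder data standing in for two real configuration dumps.
    conf_old = [("spark.example.a", "1"), ("spark.example.b", "x")]
    conf_new = [("spark.example.a", "2"), ("spark.example.c", "y")]
    for key, (before, after) in diff_conf(conf_old, conf_new).items():
        print(f"{key}: {before!r} -> {after!r}")
```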

Apache Spark 2.3.1 compatibility with Hadoop 3.0 in HDP 3.0

I am planning to upgrade from Hortonworks Data Platform (HDP) version 2.6.x to HDP 3.0. However, there seem to be some major bugs in Apache Spark 2.3.x and its integration with Hadoop 3.0 that are still unresolved in the Apache Spark JIRA, although the Spark development team is working on them. Does the Hortonworks team have workarounds or resolutions for these issues, or do they still exist in HDP 3.0?
Some unresolved issues concerning my use case:
Spark DataFrames do not work with Hadoop 3.0 https://issues.apache.org/jira/browse/SPARK-18673
Kerberos Ticket renewal fails in Hadoop 3 https://issues.apache.org/jira/browse/SPARK-24493
Spark run on Hadoop 3 https://issues.apache.org/jira/browse/SPARK-23534
I checked the integration of HDP Spark 2.3.1 with Hadoop 3.0.1. It works perfectly, and the issues above are resolved in the HDP version of Spark, although this is not mentioned in the HDP 3 release notes.

How to run different Spark versions on each node in a cluster?

Can I have an Apache Spark cluster where different nodes run different versions of Spark? For example, could I have a master running Spark 2.2.0, one node running 2.0.1, another running 2.2.0 and another running 1.6.3, or should all nodes have the same version of Spark?
Usually, when we want to install different versions of Spark on a cluster, all the versions are installed on all the nodes. Which version executes depends on which spark-submit (Spark 1.6, Spark 2.0 or Spark 2.2) is used to run the script.
Let's say we have installed Spark 1.6 on the master node only. When we submit a job to the cluster and the master node is fully utilized, the YARN ResourceManager looks for a node that is free to run the job. YARN will not wait until the master node has resources available; it will submit the job to a node that has free resources. For this reason, all versions of Spark have to be installed on all nodes in the cluster.
Can I have an Apache Spark cluster where different nodes run different versions of Spark?
No. This is not possible.
The reason is that there is no notion of a Spark installation. Spark is a library and, as such, a dependency of an application that, once submitted for execution, will be deployed and executed on cluster nodes (at least one, i.e. the driver).
That said, the version of the Spark dependency of your application is exactly the version of Spark in use. To be precise, it is the version of the spark-submit in use (unless you use a so-called uber-jar with the Spark dependency bundled).
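Both answers come down to the same practical point: the spark-submit you invoke determines the Spark version the job runs with. A minimal sketch of picking the launcher per job, assuming each version is unpacked under its own directory at the same path on every node. The `/opt/spark-*` paths and the function name are hypothetical.

```python
import posixpath

# Hypothetical install locations, assumed identical on every node.
SPARK_HOMES = {
    "1.6.3": "/opt/spark-1.6.3",
    "2.2.0": "/opt/spark-2.2.0",
}


def spark_submit_command(version, app, *app_args):
    """Build the argument list that submits `app` to YARN with a given Spark version."""
    try:
        home = SPARK_HOMES[version]
    except KeyError:
        raise ValueError(f"Spark {version} is not installed") from None
    launcher = posixpath.join(home, "bin", "spark-submit")
    return [launcher, "--master", "yarn", app, *app_args]


if __name__ == "__main__":
    # The resulting list can be passed to subprocess.run() on a real cluster.
    print(spark_submit_command("2.2.0", "job.py", "--date", "2020-01-01"))
```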

Does any Cloudera Hadoop distribution support Apache Spark SQL?

I am new to Apache Spark. I heard that none of the versions of CDH support Apache Spark SQL as of now, and that the same is true of the Hortonworks distribution. Is that true?
Also, I have CDH 5.0.0 installed on my PC. Which version of Apache Spark does my CDH support?
Could someone also please provide the steps to run my Spark program on my CDH distribution? I have written some basic programs using Apache Spark 1.2, and I am not able to run them in the CDH environment. I hit a very basic problem when I run a Spark program with the spark-submit command:
spark-submit: Command not found
Do I need to configure anything before running my Spark program?
Thanks in advance
All of the distributions of CDH include the whole Spark distribution, including Spark SQL.
EDIT: It is supported as of CDH 5.5.x.
CDH 5.0.x includes Spark 0.9.x. CDH 5.3.x includes Spark 1.2.x and 5.4.x should ship 1.3.x since it is about to be released upstream.
spark-submit is already on your path if you are using CDH. If you're running from somewhere else, you have to put this file on your path or give the full path to it, the same as for any program. So this points to a problem with your setup.
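A quick way to check whether spark-submit is reachable is to look it up the same way the shell does. This is a minimal sketch: it searches `PATH` first and then falls back to a `bin/spark-submit` under `SPARK_HOME`, on the assumption that the environment variable is set for your installation. The function name is hypothetical.

```python
import os
import shutil


def locate_spark_submit(spark_home=None):
    """Return the path to spark-submit, or None if it cannot be found."""
    # Same lookup the shell performs before reporting "command not found".
    found = shutil.which("spark-submit")
    if found:
        return found
    # Fall back to an explicit or environment-provided Spark directory.
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if spark_home:
        candidate = os.path.join(spark_home, "bin", "spark-submit")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    path = locate_spark_submit()
    if path is None:
        print("spark-submit not found: add its directory to PATH or call it by full path")
    else:
        print("spark-submit found at", path)
```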
