I tried to run a simple piece of test code in IntelliJ IDEA. Here is my code:
import org.apache.spark.sql.functions._
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
object hbasetest {
val spconf = new SparkConf()
val spark = SparkSession.builder().master("local").config(spconf).getOrCreate()
import spark.implicits._
def main(args: Array[String]): Unit = {
val df = spark.read.parquet("file:///Users/cy/Documents/temp")
df.show()
spark.close()
}
}
My dependency list:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
<!--<scope>provided</scope>-->
</dependency>
When I click the Run button, it throws an exception:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.TaskID.<init>(Lorg/apache/hadoop/mapreduce/JobID;Lorg/apache/hadoop/mapreduce/TaskType;I)V
I checked this post, but the situation didn't change after making the modification. Can I get some help with running a local Spark application in IDEA? Thanks.
Update: I can run this code with spark-submit. I want to run it directly with the Run button in IDEA.
Are you using the Cloudera sandbox to run this application? I ask because in your POM.xml I could see the CDH dependency '2.6.0-mr1-cdh5.5.0'.
If you are using Cloudera, please use the dependencies below for your Spark Scala project, because the 'spark-core_2.10' artifact version changes.
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.10.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.0.0-cdh5.1.0</version>
</dependency>
</dependencies>
I used the reference below to run my Spark application.
Reference: http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/
Here are the settings I use for Run/Debug configuration in IntelliJ:
*Main class:*
org.apache.spark.deploy.SparkSubmit
*VM Options:*
-cp <spark_dir>/conf/:<spark_dir>/jars/* -Xmx6g
*Program arguments:*
--master
local[*]
--conf
spark.driver.memory=6G
--class
com.company.MyAppMainClass
--num-executors
8
--executor-memory
6G
<project_dir>/target/scala-2.11/my-spark-app.jar
<my_spark_app_args_if_any>
The spark-core and spark-sql jars are referenced in my build.sbt as "provided" dependencies, and their versions must match the Spark installation in spark_dir. I use Spark 2.0.2 at the moment with hadoop-aws jar version 2.7.2.
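For reference, a minimal build.sbt sketch of that setup (the Scala patch version and the hadoop-aws scope are assumptions; the Spark version must match the installation in spark_dir):
scalaVersion := "2.11.8"  // assumed 2.11.x patch release, matching target/scala-2.11 above

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided",
  "org.apache.hadoop" % "hadoop-aws" % "2.7.2"               // scope left default; adjust as needed
)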
This reply may be late, but I just had the same issue. You can run with spark-submit, so you probably already have the related dependencies. My solution is:
Change the related dependencies in the IntelliJ Module Settings of your project from provided to compile. You may only need to change some of them, but you have to experiment; the brute-force solution is to change all of them.
If you get a further exception after this step, such as some dependencies being "too old", change the order of the related dependencies in the module settings.
I ran into this issue as well, and I also had an old Cloudera Hadoop reference in my code. (You have to click the 'edited' link in the original poster's question to see the original POM settings.)
I could leave that reference in as long as I put this at the top of my dependencies (order matters!). You should match it against your own Hadoop cluster settings.
<dependency>
<!-- THIS IS REQUIRED FOR LOCAL RUNNING IN INTELLIJ -->
<!-- IT MUST REMAIN AT TOP OF DEPENDENCY LIST TO 'WIN' AGAINST OLD HADOOP CODE BROUGHT IN-->
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.6.0-cdh5.12.0</version>
<scope>provided</scope>
</dependency>
Note that in the 2018.1 version of IntelliJ, you can check "Include dependencies with 'Provided' scope", which is a simple way to keep your POM scopes clean.
Related
java.lang.NoSuchMethodError: com.google.common.io.ByteStreams.exhaust(Ljava/io/InputStream;)J
I am getting the above error message when using Guava 18.0 with Cloud Storage 2.2.2:
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-storage</artifactId>
<version>2.2.2</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>18.0</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>3.2.0</version>
</dependency>
If I use Guava 23 with DataStax 3.2.0, I get the error message below:
java.lang.NoClassDefFoundError:
com/google/common/util/concurrent/FutureFallback
So Cloud Storage needs a Guava version above 20, but DataStax needs a version below 20; only one of them works at a time, but I need both.
My code:
import java.io.FileInputStream;
import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.ReadChannel;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

StorageOptions options = StorageOptions.newBuilder()
    .setProjectId(PROJECT_ID)
    .setCredentials(GoogleCredentials
        .fromStream(new FileInputStream(PATH_TO_JSON_KEY))).build();
Storage storage = options.getService();
Blob blob = storage.get(BUCKET_NAME, OBJECT_NAME);
ReadChannel r = blob.reader();
3.2.0 seems like a very old version of the Cassandra driver, from 2017 -- try upgrading to the latest in this major version, 3.11.0.
It also looks like the driver switched coordinates to com.datastax.oss:java-driver-core and moved to the major version 4.x.
I tried to reproduce your error using the code and dependencies you provided, and was able to resolve the issue by using these versions:
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-storage</artifactId>
<version>2.2.2</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>20.0</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>3.11.0</version>
</dependency>
It is also advisable to use the latest version of the DataStax driver when you're using Guava 20.0 or above.
I've been trying to submit applications to a Kubernetes cluster. I have followed the tutorial at https://spark.apache.org/docs/latest/running-on-kubernetes.html, including building the Spark image and so on.
But whenever I try to run the spark-submit command, the pod always throws an error. These are the logs from the command kubectl logs <spark-driver-pods>:
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
I have tried to use something like:
spark-submit
...
--jars $(echo /opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/pyspark/jars/*.jar | tr ' ' ',')
...
But that still throws an error.
Some notes related to my development environment:
I use the Kubernetes built into Docker Desktop.
I use pyspark in a conda environment, and yes, I have activated the environment; that's why I can use pyspark in the terminal.
Anything else I should do? Or forget to do?
I'm using Maven, but I encountered this error while migrating from log4j 1.x to log4j 2.x and realized I still had some code that only worked with 1.x. Instead of refactoring code, I added this dependency to my pom.xml in order to maintain compatibility.
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-1.2-api</artifactId>
<version>2.17.1</version>
</dependency>
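If the build is sbt rather than Maven, the same bridge artifact would be declared like this (same coordinates, just sbt syntax):
libraryDependencies += "org.apache.logging.log4j" % "log4j-1.2-api" % "2.17.1"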
The line below worked for me:
libraryDependencies += "log4j" % "log4j" % "1.2.17"
I'm trying to use Spark Streaming 2.0.0 with Kafka 0.10. I'm using https://github.com/manub/scalatest-embedded-kafka for my integration tests, but I have some problems starting the server. When I try with Spark 2.2.0 it works.
<dependency>
<groupId>net.manub</groupId>
<artifactId>scalatest-embedded-kafka_2.11</artifactId>
<version>${embedded-kafka.version}</version> <!-- I tried many versions. -->
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.0.2</version>
</dependency>
An exception or error caused a run to abort: kafka.server.KafkaServer$.$lessinit$greater$default$2()Lorg/apache/kafka/common/utils/Time;
java.lang.NoSuchMethodError: kafka.server.KafkaServer$.$lessinit$greater$default$2()Lorg/apache/kafka/common/utils/Time;
at net.manub.embeddedkafka.EmbeddedKafkaSupport$class.startKafka(EmbeddedKafka.scala:467)
at net.manub.embeddedkafka.EmbeddedKafka$.startKafka(EmbeddedKafka.scala:38)
at net.manub.embeddedkafka.EmbeddedKafka$.start(EmbeddedKafka.scala:55)
at iris.orange.ScalaTest$$anonfun$1.apply$mcV$sp(ScalaTest.scala:10)
It seems to be a dependency problem, but I couldn't get it to work. I chose an embedded Kafka that uses the same Kafka version.
You need to use the proper version of spark-streaming-kafka:
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.10/2.0.0
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.10</artifactId>
<version>2.0.0</version>
</dependency>
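If the build uses sbt instead of Maven, a hedged equivalent of that dependency would be the line below; the %% keeps the Scala suffix consistent with the rest of the build rather than hard-coding _2.10:
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"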
I've created a Spark standalone cluster on my laptop.
Then I go into an sbt console in a Spark project and try to embed a Spark instance like so:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("foo").setMaster(/* Spark Master URL */)
val sc = new SparkContext(conf)
Up to there everything works fine, then I try
sc.parallelize(Array(1,2,3))
// and I get: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
How do I fix this?
Maybe you are missing the following lib:
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.4.4</version>
</dependency>
This error message is usually accompanied by:
Cause: com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.9.8
It means there are conflicting versions in the dependencies (obviously). In the Spark world, it is usually because some library we use has a dependency conflict with one shipped with Spark.
Using coursier's resolve command can show what is happening (Gradle also has dependency debugging):
cs resolve org.apache.spark:spark-core_2.11:2.4.5 | grep jackson
cs resolve com.thesamet.scalapb:scalapb-json4s_2.11:0.10.0 | grep jackson
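If the project builds with sbt, its built-in evicted task gives a similar view of which dependency versions were evicted and which won:
sbt evicted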
Then either build an uber jar for your application, or exclude the conflict in the build (if possible), e.g. in build.gradle:
testCompile 'com.thesamet.scalapb:scalapb-json4s_%%:0.10.0', { exclude group: 'com.fasterxml.jackson.core' }
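A hedged sbt equivalent of the Gradle exclusion above (the coordinates mirror the cs resolve example; adjust them to your own build):
libraryDependencies += ("com.thesamet.scalapb" %% "scalapb-json4s" % "0.10.0" % Test)
  .excludeAll(ExclusionRule(organization = "com.fasterxml.jackson.core"))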
Add the following jar to the bin folder of Spark if you are using the Spark console:
https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind/2.9.9.3
My build.sbt file has this:
scalaVersion := "2.10.3"
libraryDependencies += "com.databricks" % "spark-csv_2.10" % "1.1.0"
I am running Spark in standalone cluster mode and my SparkConf is SparkConf().setMaster("spark://ec2-[ip].compute-1.amazonaws.com:7077").setAppName("Simple Application") (I am not using the method setJars, not sure whether I need it).
I package the jar using the command sbt package. Command I use to run the application is ./bin/spark-submit --master spark://ec2-[ip].compute-1.amazonaws.com:7077 --class "[classname]" target/scala-2.10/[jarname]_2.10-1.0.jar.
On running this, I get this error:
java.lang.RuntimeException: Failed to load class for data source:
com.databricks.spark.csv
What's the issue?
Use the dependencies accordingly. For example:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.4.0</version>
</dependency>
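With those dependencies on the classpath, loading the CSV source looks roughly like this (a minimal sketch against the Spark 1.6 API used above; the app name, header option, and path are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("Simple Application")   // placeholder app name
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // placeholder option
  .load("path/to/data.csv")      // placeholder path
df.show()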
Include the option --packages com.databricks:spark-csv_2.10:1.2.0, but put it after --class and before the target/ jar path.
Add the --jars option and download the jars below from a repository such as search.maven.org:
--jars commons-csv-1.1.jar,spark-csv-csv.jar,univocity-parsers-1.5.1.jar \
Using the --packages option as claudiaann1 suggested also works if you have internet access without a proxy. If you need to go through a proxy, it won't work.
Here is the example that worked: spark-submit --jars file:/root/Downloads/jars/spark-csv_2.10-1.0.3.jar,file:/root/Downloads/jars/commons-csv-1.2.jar,file:/root/Downloads/jars/spark-sql_2.11-1.4.1.jar --class "SampleApp" --master local[2] target/scala-2.11/my-proj_2.11-1.0.jar
Use the command below; it works:
spark-submit --class ur_class_name --master local[*] --packages com.databricks:spark-csv_2.10:1.4.0 project_path/target/scala-2.10/jar_name.jar
Have you tried using the --packages argument with spark-submit? I've run into this issue with spark not respecting the dependencies listed as libraryDependencies.
Try this:
./bin/spark-submit --master spark://ec2-[ip].compute-1.amazonaws.com:7077
--class "[classname]" target/scala-2.10/[jarname]_2.10-1.0.jar
--packages com.databricks:spark-csv_2.10:1.1.0
From the Spark Docs:
Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates with --packages. All transitive dependencies will be handled when using this command.
https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management