Training a Spark ML linear regression model fails after migrating to 1.6.1 - apache-spark

I use Spark ML to train a linear regression model.
It worked perfectly with Spark 1.5.2, but with 1.6.1 I now get the following error:
java.lang.AssertionError: assertion failed: lapack.dppsv returned 228.
It seems to be related to some low-level linear algebra library, but it worked fine before the Spark version update.
In both versions I get the same warnings before training starts, saying that the native BLAS and LAPACK implementations cannot be loaded:
[Executor task launch worker-6] com.github.fommil.netlib.BLAS - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
[Executor task launch worker-6] com.github.fommil.netlib.BLAS - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
[main] com.github.fommil.netlib.LAPACK - Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
[main] com.github.fommil.netlib.LAPACK - Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
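These warnings only mean that the pure-JVM F2J fallback is used instead of native BLAS/LAPACK; since they already appeared with 1.5.2 when training worked, they are unlikely to be the cause of the failure. If you want native linear algebra anyway, one optional route (not required for the fix described at the end) is to add the netlib-java natives to the POM, for example:
<dependency>
<groupId>com.github.fommil.netlib</groupId>
<artifactId>all</artifactId>
<version>1.1.2</version>
<type>pom</type>
</dependency>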
Here is a minimal example:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.OneHotEncoder;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
public class Application {
public static void main(String args[]) {
// create context
JavaSparkContext javaSparkContext = new JavaSparkContext("local[*]", "CalculCote");
SQLContext sqlContext = new SQLContext(javaSparkContext);
// describe fields
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataTypes.createStructField("brand", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("commercial_name", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("mileage", DataTypes.IntegerType, true));
fields.add(DataTypes.createStructField("price", DataTypes.DoubleType, true));
// load dataframe from file
DataFrame df = sqlContext.read().format("com.databricks.spark.csv") //
.option("header", "true") //
.option("InferSchema", "false") //
.option("delimiter", ";") //
.schema(DataTypes.createStructType(fields)) //
.load("input.csv").persist();
// show first rows
df.show();
// indexers and encoders for non numerical values
StringIndexer brandIndexer = new StringIndexer() //
.setInputCol("brand") //
.setOutputCol("brandIndex");
OneHotEncoder brandEncoder = new OneHotEncoder() //
.setInputCol("brandIndex") //
.setOutputCol("brandVec");
StringIndexer commNameIndexer = new StringIndexer() //
.setInputCol("commercial_name") //
.setOutputCol("commNameIndex");
OneHotEncoder commNameEncoder = new OneHotEncoder() //
.setInputCol("commNameIndex") //
.setOutputCol("commNameVec");
// model predictors
VectorAssembler predictors = new VectorAssembler() //
.setInputCols(new String[] { "brandVec", "commNameVec", "mileage" }) //
.setOutputCol("features");
// train model
LinearRegression lr = new LinearRegression().setLabelCol("price");
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { //
brandIndexer, brandEncoder, commNameIndexer, commNameEncoder, predictors, lr });
PipelineModel pm = pipeline.fit(df);
DataFrame result = pm.transform(df);
result.show();
}
}
And the input.csv data:
brand;commercial_name;mileage;price
APRILIA;ATLANTIC 125;18237;1400
BMW;R1200 GS;10900;12400
HONDA;CB 1000;58225;4250
HONDA;CB 1000;1780;7610
HONDA;CROSSRUNNER 800;2067;11490
KAWASAKI;ER-6F 600;51600;2010
KAWASAKI;VERSYS 1000;5900;13900
KAWASAKI;VERSYS 650;3350;6200
KTM;SUPER DUKE 990;36420;4760
The pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>test</groupId>
<artifactId>sparkmigration</artifactId>
<packaging>jar</packaging>
<name>sparkmigration</name>
<version>0.0.1</version>
<url>http://maven.apache.org</url>
<properties>
<java.version>1.8</java.version>
<spark.version>1.6.1</spark.version>
<!-- <spark.version>1.5.2</spark.version> -->
<spark.csv.version>1.3.0</spark.csv.version>
<slf4j.version>1.7.2</slf4j.version>
<logback.version>1.0.9</logback.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>${spark.csv.version}</version>
</dependency>
<!-- Logs -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>log4j-over-slf4j</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jcl-over-slf4j</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-core</artifactId>
<version>${logback.version}</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>${logback.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.2</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
</plugins>
</build>
</project>

Problem fixed (thanks to the Apache Spark mailing list).
Since Spark 1.6 the linear regression solver defaults to "auto"; under certain conditions (at most 4096 features, no elastic-net parameter set, ...) the WLS (weighted least squares / normal equation) algorithm is used instead of L-BFGS. The dppsv error code presumably means the normal-equation matrix is not positive definite, which is plausible here since the one-hot encoded features outnumber the nine training rows.
I forced the solver to L-BFGS and it worked:
LinearRegression lr = new LinearRegression().setLabelCol("price").setSolver("l-bfgs");
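If you prefer to keep the default solver, an alternative I have not verified (an assumption on my part, not something confirmed on the mailing list) is to add a small L2 penalty so that the normal-equation system stays positive definite:
// Untested alternative: a small ridge penalty should keep the WLS normal equations
// positive definite, so the default "auto" solver no longer trips the dppsv assertion.
LinearRegression lr = new LinearRegression()
.setLabelCol("price")
.setRegParam(0.01) // small L2 regularization
.setElasticNetParam(0.0); // WLS is only selected when the elastic-net mixing parameter is 0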

Related

NoSuchMethodError: org.apache.hadoop.hive.ql.exec.Utilities.copyTableJobPropertiesToConf

Hi everyone:
There is an exception I have never encountered. Please see below:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.Utilities.copyTableJobPropertiesToConf(Lorg/apache/hadoop/hive/ql/plan/TableDesc;Lorg/apache/hadoop/conf/Configuration;)V
at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:399)
at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:314)
at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:314)
at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
The code is:
import org.apache.spark.sql.SparkSession
object test {
def main(args:Array[String]): Unit = {
System.setProperty("HADOOP_USER_NAME", "nuochengze")
val spark: SparkSession = SparkSession.builder()
.appName("Test")
.master("local[*]")
.config("hadoop.home.dir", "hdfs://pc001:8082/user/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
spark.sql("use test")
spark.sql(
"""
|select * from emp
|""".stripMargin).show
spark.close()
}
}
Something that left me at a loss happened when I used Spark to operate on Hive:
I can perform DDL operations through spark.sql(...), but when I try to perform DML operations such as select, the above exception is thrown. I know this method is missing, but after searching the internet I did not find any blog explaining how to solve it.
Have you encountered this? If so, can I ask for help?
Thanks!!!
I have found the cause of the error. Due to my negligence, when importing modules into pom.xml, there were some inconsistencies between the versions of some modules. If you encounter similar errors, you can refer to my current Maven configuration:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>test</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.25</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.12</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.10.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
<version>3.1.2</version>
</dependency>
</dependencies>
</project>
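If you hit a similar NoSuchMethodError, a quick way to spot mismatched module versions before rewriting the POM is to inspect the resolved dependency tree, for example:
mvn dependency:tree -Dincludes=org.apache.hive,org.apache.spark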

Spark phoenix read breaks due to hbase-spark dependency with ClassNotFoundException: org.apache.hadoop.hbase.client.HConnectionManager

I am writing a simple Spark program to read from Phoenix and write to HBase using the Spark-HBase Connector (SHC). I can read from Phoenix and write to HBase with SHC separately, but when I put everything together (specifically, after adding the hbase-spark dependency) the pipeline breaks at the Phoenix read statement.
Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog
object SparkHbasePheonix {
def main(args: Array[String]): Unit = {
def catalog =
s"""{
|"table":{"namespace":"default", "name":"employee"},
|"rowkey":"key",
|"columns":{
|"key":{"cf":"rowkey", "col":"key", "type":"string"},
|"fName":{"cf":"person", "col":"firstName", "type":"string"},
|"lName":{"cf":"person", "col":"lastName", "type":"string"},
|"mName":{"cf":"person", "col":"middleName", "type":"string"},
|"addressLine":{"cf":"address", "col":"addressLine", "type":"string"},
|"city":{"cf":"address", "col":"city", "type":"string"},
|"state":{"cf":"address", "col":"state", "type":"string"},
|"zipCode":{"cf":"address", "col":"zipCode", "type":"string"}
|}
|}""".stripMargin
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("HbaseSparkWrite")
.getOrCreate()
val df = spark.read.format("org.apache.phoenix.spark")
.option("table", "ph_employee")
.option("zkUrl", "0.0.0.0:2181")
.load()
df.write.options(
Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "4"))
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()
}
}
pom:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<scala.tools.version>2.11</scala.tools.version>
<scala.version>2.11.8</scala.version>
<spark.version>2.3.2.3.1.0.31-28</spark.version>
<hbase.version>2.0.2.3.1.0.31-28</hbase.version>
<phoenix.version>5.0.0.3.1.5.9-1</phoenix.version>
</properties>
<!-- Hbase dependencies-->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-spark -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-spark</artifactId>
<version>2.0.2.3.1.0.6-1</version>
</dependency>
<dependency>
<groupId>com.hortonworks</groupId>
<artifactId>shc-core</artifactId>
<version>1.1.1-2.1-s_2.11</version>
</dependency>
<!-- Phoenix dependencies-->
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-client</artifactId>
<version>${phoenix.version}</version>
<exclusions>
<exclusion>
<groupId>org.glassfish</groupId>
<artifactId>javax.el</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-spark</artifactId>
<version>${phoenix.version}</version>
<exclusions>
<exclusion>
<groupId>org.glassfish</groupId>
<artifactId>javax.el</artifactId>
</exclusion>
</exclusions>
</dependency>
Exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/client/HConnectionManager
at org.apache.phoenix.query.HConnectionFactory$HConnectionFactoryImpl.createConnection(HConnectionFactory.java:47)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:396)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:228)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2374)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2352)
at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:76)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2352)
at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:232)
at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:147)
at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:202)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:98)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:57)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:279)
at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:118)
at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:60)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:432)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at com.test.SparkPheonixToHbase$.main(SparkHbasePheonix.scala:33)
at com.test.SparkPheonixToHbase.main(SparkHbasePheonix.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.HConnectionManager
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 24 more
20/05/19 16:57:44 INFO SparkContext: Invoking stop() from shutdown hook
The Phoenix read fails when I add the hbase-spark dependency:
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-spark</artifactId>
<version>2.0.2.3.1.0.6-1</version>
</dependency>
How can I get rid of this error?
Just use one of those connectors, not both.
If you want to read a Phoenix table and the output table is not a Phoenix table but a standard HBase table, use just SHC or the HBase Spark connector. They can read the Phoenix table directly from HBase, without the Phoenix layer. See the options here: https://sparkbyexamples.com/hbase/spark-hbase-connectors-which-one-to-use/#spark-sql
If you want to save to Phoenix as well, just use the Phoenix connector for both reading and writing.
Mixing connectors tends to cause build conflicts, since they may overlap in their internal classes, especially if you don't take care to import versions that use the same HBase client under the hood. Unless you have a really good reason to use different libraries for reading and writing, stick with the one that best fits your needs.
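For the second case, here is a minimal, untested Java sketch that uses only the Phoenix connector for both sides; EMPLOYEE_OUT is a placeholder output table, and the Phoenix data source expects SaveMode.Overwrite when saving:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
public class PhoenixOnly {
public static void main(String[] args) {
// Assumes only phoenix-spark (plus phoenix-client) is on the classpath; hbase-spark and SHC are removed.
SparkSession spark = SparkSession.builder().master("local[1]").appName("PhoenixOnly").getOrCreate();
Dataset<Row> df = spark.read()
.format("org.apache.phoenix.spark")
.option("table", "ph_employee")
.option("zkUrl", "0.0.0.0:2181")
.load();
df.write()
.format("org.apache.phoenix.spark")
.mode(SaveMode.Overwrite) // the Phoenix data source saves in Overwrite mode
.option("table", "EMPLOYEE_OUT")
.option("zkUrl", "0.0.0.0:2181")
.save();
spark.stop();
}
}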

Getting error : Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf

I am working on Kafka Spark Streaming. The IDE doesn't show any errors and the program builds successfully, but I am getting this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
at KafkaSparkStream1$.main(KafkaSparkStream1.scala:13)
at KafkaSparkStream1.main(KafkaSparkStream1.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 2 more
I am using Maven. I have also set up my environment variables correctly, as every component works individually. My Spark version is 3.0.0-preview2 and my Scala version is 2.12.
I have exported a spark-streaming-kafka jar file.
Here is my pom file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.org.cpg.casestudy</groupId>
<artifactId>Kafka_casestudy</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<spark.version>3.0.0</spark.version>
<scala.version>2.12</scala.version>
</properties>
<build>
<plugins>
<!-- Maven Compiler Plugin-->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<!-- Apache Kafka Clients-->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.5.0</version>
</dependency>
<!-- Apache Kafka Streams-->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>2.5.0</version>
</dependency>
<!-- Apache Log4J2 binding for SLF4J -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>2.11.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0-preview2</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.0.0-preview2</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
<version>3.0.0-preview2</version>
</dependency>
</dependencies>
Here is my code (a word count of the messages sent by the producer):
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark._
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.codehaus.jackson.map.deser.std.StringDeserializer
object KafkaSparkStream {
def main(args: Array[String]): Unit = {
val brokers = "localhost:9092";
val groupid = "GRP1";
val topics = "KafkaTesting";
val SparkConf = new SparkConf().setMaster("local[*]").setAppName("KafkaSparkStreaming");
val ssc = new StreamingContext(SparkConf,Seconds(10))
val sc = ssc.sparkContext
sc.setLogLevel("off")
val topicSet = topics.split(",").toSet
val kafkaPramas = Map[String , Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupid,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
)
val messages = KafkaUtils.createDirectStream[String,String](
ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String,String](topicSet,kafkaPramas)
)
val line=messages.map(_.value)
val words = line.flatMap(_.split(" "))
val wordCount = words.map(x=> (x,1)).reduceByKey(_+_)
wordCount.print()
ssc.start()
ssc.awaitTermination()
}
}
Try cleaning your local Maven repository, or run the command below to re-download your dependency JARs:
mvn clean install -U
Your Spark dependencies, especially spark-core_2.12-3.0.0-preview2.jar, are not added to your classpath when the Spark JAR is executed.
You can add them via:
spark-submit --jars <path>/spark-core_2.12-3.0.0-preview2.jar
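Alternatively, if you launch the program from the IDE or with plain java rather than through spark-submit, the provided scope keeps Spark off the runtime classpath; assuming you do want to run it that way, one option is to drop that scope, e.g.:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0-preview2</version>
<!-- no <scope>provided</scope> when running outside spark-submit -->
</dependency>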

Unable to read file from Azure Blob Storage mount from Databrick's Connect Apache Spark

I configured Databricks Connect on Azure to run my Spark programs in the Azure cloud. As a dry run I tested a word-count program, but it fails with the following error:
"Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:"
I am using IntelliJ to run the program and I have the necessary permissions to access the cluster, but I am still getting this error.
The following program is a wrapper which takes in the parameters and publishes the results.
package com.spark.scala
import com.spark.scala.demo.{Argument, WordCount}
import org.apache.spark.sql.SparkSession
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils
import scala.collection.mutable.Map
object Test {
def main(args: Array[String]): Unit = {
val argumentMap: Map[String, String] = Argument.parseArgs(args)
val spark = SparkSession
.builder()
.master("local")
.getOrCreate()
println(spark.range(100).count())
val rawread = String.format("/mnt/%s", argumentMap.get("--raw-reads").get)
val data = spark.sparkContext.textFile(rawread)
print(data.count())
val rawwrite = String.format("/dbfs/mnt/%s", argumentMap.get("--raw-write").get)
WordCount.executeWordCount(spark, rawread, rawwrite);
// The Spark code will execute on the Databricks cluster.
spark.stop()
}
}
The following code performs the word-count logic:
package com.spark.scala.demo
import org.apache.spark.sql.SparkSession
object WordCount{
def executeWordCount(sparkSession:SparkSession, read: String, write: String)
{
println("starting word count process ")
//val path = String.format("/mnt/%s", "tejatest\wordcount.txt")
//Reading input file and creating rdd with no of partitions 5
val bookRDD=sparkSession.sparkContext.textFile(read)
//Regex to clean text
val pat = """[^\w\s\$]"""
val cleanBookRDD=bookRDD.map(line=>line.replaceAll(pat, ""))
val wordsRDD=cleanBookRDD.flatMap(line=>line.split(" "))
val wordMapRDD=wordsRDD.map(word=>(word->1))
val wordCountMapRDD=wordMapRDD.reduceByKey(_+_)
wordCountMapRDD.saveAsTextFile(write)
}
}
I have written a mapper to map the given paths, and I pass the read and write locations on the command line. My pom.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>ex-com.spark.scala</groupId>
<artifactId>ex-demo</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
<scope>compile</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.clapper</groupId>
<artifactId>grizzled-slf4j_2.11</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>dbutils-api_2.11</artifactId>
<version>0.0.3</version>
</dependency>
<!-- Test -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>

Spark Cassandra Java integration Issues

I am new to both Spark and Cassandra.
I am trying to achieve aggregation using Spark + Java on Cassandra data.
I am not able to fetch the Cassandra data in my code. I read multiple discussions and found out that there are some compatibility issues between Spark and the Spark-Cassandra connector. I tried a lot to fix my issue but was not able to.
Find the pom.xml below (please ignore the extra dependencies; I need to work out which library is causing the issue):
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>IBeatCassPOC</groupId>
<artifactId>ibeatCassPOC</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<!--CASSANDRA START-->
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-mapping</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-extras</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>com.sparkjava</groupId>
<artifactId>spark-core</artifactId>
<version>2.5.4</version>
</dependency>
<!--https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10-->
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.10</artifactId>
<version>2.0.0-M3</version>
</dependency>
<!--CASSANDRA END-->
<!-- Kafka -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.10</artifactId>
<version>0.8.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.8.2.1</version>
</dependency>
<dependency>
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>
<version>1.2</version>
</dependency>
<!-- Spark -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.4.0</version>
</dependency>
<!-- Logging -->
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>2.1.0</version>
</dependency>
<!-- Spark-Kafka -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.4.0</version>
</dependency>
<!-- Jackson -->
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
<version>1.9.13</version>
</dependency>
<!-- Google Collection Library -->
<dependency>
<groupId>com.google.collections</groupId>
<artifactId>google-collections</artifactId>
<version>1.0-rc2</version>
</dependency>
<!--UA Detector dependency for AgentType in PageTrendLog-->
<dependency>
<groupId>net.sf.uadetector</groupId>
<artifactId>uadetector-core</artifactId>
<version>0.9.12</version>
</dependency>
<dependency>
<groupId>net.sf.uadetector</groupId>
<artifactId>uadetector-resources</artifactId>
<version>2013.12</version>
</dependency>
<dependency>
<groupId>com.esotericsoftware</groupId>
<artifactId>kryo</artifactId>
<version>3.0.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>org.twitter4j</groupId>
<artifactId>twitter4j-stream</artifactId>
<version>4.0.4</version>
</dependency>
<!-- MongoDb Java Connector -->
<!-- <dependency> <groupId>org.mongodb</groupId> <artifactId>mongo-java-driver</artifactId>
<version>2.13.0</version> </dependency> -->
</dependencies>
The Java source code used to fetch the data:
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.CassandraRow;
import com.datastax.spark.connector.japi.rdd.CassandraJavaRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import java.util.ArrayList;
public class ReadCassData {
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("Spark-Cassandra Integration");
sparkConf.setMaster("local[4]");
sparkConf.set("spark.cassandra.connection.host", "stagingServer22");
sparkConf.set("spark.cassandra.connection.port", "9042");
sparkConf.set("spark.cassandra.connection.timeout_ms", "5000");
sparkConf.set("spark.cassandra.read.timeout_ms", "200000");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
String keySpaceName = "testKeyspace";
String tableName = "testTable";
CassandraJavaRDD<CassandraRow> cassandraRDD = CassandraJavaUtil.javaFunctions(javaSparkContext).cassandraTable(keySpaceName, tableName);
System.out.println("Cassandra Count" + cassandraRDD.cassandraCount());
final ArrayList<CassandraRow> data = new ArrayList<CassandraRow>();
cassandraRDD.reduce(new Function2<CassandraRow, CassandraRow, CassandraRow>() {
public CassandraRow call(CassandraRow v1, CassandraRow v2) throws Exception {
System.out.println("hello");
System.out.println(v1 + " ____ " + v2);
data.add(v1);
data.add(v2);
return null;
}
});
System.out.println( "data Size -" + data.size());
}
}
The exception encountered is:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 1 times, most recent failure: Lost task 2.0 in stage 0.0 (TID 2, localhost): java.lang.NoSuchMethodError: org.apache.spark.TaskContext.getMetricsSources(Ljava/lang/String;)Lscala/collection/Seq;
at org.apache.spark.metrics.MetricsUpdater$.getSource(MetricsUpdater.scala:20)
at org.apache.spark.metrics.InputMetricsUpdater$.apply(InputMetricsUpdater.scala:56)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.compute(CassandraTableScanRDD.scala:329)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
I have a Cassandra cluster deployed at a remote location, and the Cassandra version in use is 3.9.
Please advise which dependencies are compatible. I cannot change my Cassandra version (currently 3.9). Which Spark / spark-cassandra-connector versions should I use to successfully run map-reduce jobs against the database?
I have tried connecting with Spark using the Spark Cassandra connector in Scala:
val spark = "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0"
val sparkCore = "org.apache.spark" %% "spark-sql" % "1.6.1"
And below is my working code:
import com.datastax.driver.dse.graph.GraphResultSet
import com.spok.util.LoggerUtil
import com.datastax.spark.connector._
import org.apache.spark._
object DseSparkGraphFactory extends App {
val dseConn = {
LoggerUtil.info("Connecting with DSE Spark Cluster....")
val conf = new SparkConf(true)
.setMaster("local[*]")
.setAppName("test")
.set("spark.cassandra.connection.host", "Ip-Address")
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("spokg_test", "Url_p")
rdd.collect().map(println)
}
}
Please refer to the Spark Cassandra Connector compatibility table for the connector version that matches the Spark version in your environment; it should be 1.5, 1.6 or 2.0.
The following POM worked for me:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.6.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.6.2</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.10</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector-java_2.10</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>jdk.tools</groupId>
<artifactId>jdk.tools</artifactId>
<version>1.6</version>
<scope>system</scope>
<systemPath>D:\Jars\tools-1.6.0.jar</systemPath>
</dependency>
</dependencies>
Check it. I successfully ingested streaming data from Kafka into Cassandra. Similarly, you can pull data into a JavaRDD.
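For completeness, here is a minimal Java sketch (assuming the versions from the POM above) that pulls the rows back to the driver with collect(); filling a driver-side ArrayList inside reduce, as in the question, only appears to work in local mode because the closure is serialized and copied to the executors:
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.List;
public class ReadCassDataCollect {
public static void main(String[] args) {
SparkConf conf = new SparkConf()
.setAppName("Spark-Cassandra Integration")
.setMaster("local[4]")
.set("spark.cassandra.connection.host", "stagingServer22");
JavaSparkContext jsc = new JavaSparkContext(conf);
// collect() materialises the rows on the driver, which is the reliable way to inspect them locally
List<CassandraRow> rows = CassandraJavaUtil.javaFunctions(jsc)
.cassandraTable("testKeyspace", "testTable")
.collect();
System.out.println("Fetched " + rows.size() + " rows");
jsc.stop();
}
}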
