Spark Cassandra Streaming - apache-spark

i want to save data from spark streaming to cassandra using scala maven project. this is the code that save data to cassandra table
import org.apache.maventestsparkproject._
import com.datastax.spark.connector.streaming._
import com.datastax.spark.connector.SomeColumns
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object SparkCassandra {
def main(args: Array[String]) {
val sparkConf = new SparkConf()
.setAppName("KakfaStreamToCassandra").setMaster("local[*]")
.set("spark.cassandra.connection.host", "localhost")
.set("spark.cassandra.connection.port", "9042")
val topics = "fayssal1,fayssal2"
val ssc = new StreamingContext(sparkConf, Seconds(5))
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.saveToCassandra(keysspace, table, SomeColumns("word", "count"))
ssc.awaitTermination()
ssc.start()
}
}
the project is builting successfly, this is my pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.apache.maventestsparkproject</groupId>
<artifactId>testmavenapp</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<name>testmavenapp</name>
<url>http://maven.apache.org</url>
<properties>
<scala.version>2.11.8</scala.version>
<spark.version>1.6.2</spark.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.10</artifactId>
<version>1.0.0-rc4</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector-java_2.10</artifactId>
<version>1.0.0-rc4</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>2.1.5</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
</project>
but when i run this commande:
scala -cp /home/darif/TestProject/testmavenapp/target/testmavenapp-1.0-SNAPSHOT.jar /home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala
i get the following errors look like this:
home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:1: error: object apache is not a member of package org
import org.apache.maventestsparkproject._
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:2: error: object datastax is not a member of package com
import com.datastax.spark.connector.streaming._
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:3: error: object datastax is not a member of package com
import com.datastax.spark.connector.SomeColumns
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:5: error: object apache is not a member of package org
import org.apache.spark.SparkConf
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:6: error: object apache is not a member of package org
import org.apache.spark.streaming._
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:7: error: object apache is not a member of package org
import org.apache.spark.streaming.kafka._
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:12: error: not found: type SparkConf
val sparkConf = new SparkConf()
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:19: error: not found: type StreamingContext
val ssc = new StreamingContext(sparkConf, Seconds(5))
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:19: error: not found: value Seconds
val ssc = new StreamingContext(sparkConf, Seconds(5))
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:22: error: not found: value brokers
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:23: error: not found: value KafkaUtils
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:23: error: not found: type StringDecoder
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:23: error: not found: type StringDecoder
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
^
13 errors found
i am using :
Scala 2.11.8
Spark 1.6.2
Kafka Client APIs 0.8.2.11
Cassandra 3.9
Datastax Spark-Cassandra Connector compatible with Spark 1.6.2

The classpath for your application has not been set up correctly. It is recommend in various places to use spark-submit as your launcher as it will setup the vast majority of the classpath. Third party dependencies would be set using --packages.
Datastax Example
Spark Documentation
That said you could achieve the same result by custom setting various things in your Spark Conf as well as manually setting the classpath to include all of the Spark, DSE, and Kafka libraries.

Related

spark elasticsearch read with basic auth giving 403 error

I get this below error when I am trying to read from my ES cluster
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [DELETE] on [_search/scroll] failed; server[<HOST>:<PORT>] returned [403|Forbidden:]
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:505)
at org.elasticsearch.hadoop.rest.RestClient.executeNotFoundAllowed(RestClient.java:476)
at org.elasticsearch.hadoop.rest.RestClient.deleteScroll(RestClient.java:541)
at org.elasticsearch.hadoop.rest.ScrollQuery.close(ScrollQuery.java:77)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.close(AbstractEsRDDIterator.scala:81)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.closeIfNeeded(AbstractEsRDDIterator.scala:74)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator$$anonfun$1.apply$mcV$sp(AbstractEsRDDIterator.scala:54)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator$$anonfun$1.apply(AbstractEsRDDIterator.scala:54)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator$$anonfun$1.apply(AbstractEsRDDIterator.scala:54)
Code used to read
// spark conf.
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("ReadORESData").setMaster("local[*]");
// elasticsearch specific configuration.
sparkConf.set("es.nodes", "<HOST>")
.set("es.port", "<PORT>")
.set("es.net.ssl", "true")
.set("es.index.read.missing.as.empty", "true")
.set("es.net.http.auth.user", "<USERNAME>")
.set("es.net.http.auth.pass", "<PASSWORD>")
.set("es.nodes.wan.only", "true")
.set("es.nodes.discovery","false")
.set("es.input.use.sliced.partitions","false")
.set("es.resource", "<INDEX_NAME>")
.set("es.scroll.size","500");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaPairRDD<String, Map<String, Object>> rdd = JavaEsSpark.esRDD(jsc);
for ( Map<String, Object> item : rdd.values().collect()) {
System.out.println(item);
}
jsc.stop();
curl is working fine from my machine.
curl -XGET 'https://<HOST>:<PORT>/<INDEX>/_search' --user <USERNAME>:<PASSWORD>
Any index I try I am seeing that same error. If invalid index it is correctly saying index not found. I am connecting to ES 5.6 cluster using these below dependencies.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-spark-20_2.11</artifactId>
<version>5.6.16</version>
</dependency>
Noticed that the account I was using doesn't have r/w permissions. Used a different account and it fixed that 403 error.

Spark phoenix read breaks due to hbase-spark dependency with ClassNotFoundException: org.apache.hadoop.hbase.client.HConnectionManager

I am writing a simple spark program to read from Phoenix and Write to Hbase using Spark-Hbase-Connector. I am successful in reading from Phoenix and write to Hbase using SHC separately. But, when I put everything together(adding hbase-spark dependency in specific) the pipeline breaks at Phoenix read statement.
Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog
object SparkHbasePheonix {
def main(args: Array[String]): Unit = {
def catalog =
s"""{
|"table":{"namespace":"default", "name":"employee"},
|"rowkey":"key",
|"columns":{
|"key":{"cf":"rowkey", "col":"key", "type":"string"},
|"fName":{"cf":"person", "col":"firstName", "type":"string"},
|"lName":{"cf":"person", "col":"lastName", "type":"string"},
|"mName":{"cf":"person", "col":"middleName", "type":"string"},
|"addressLine":{"cf":"address", "col":"addressLine", "type":"string"},
|"city":{"cf":"address", "col":"city", "type":"string"},
|"state":{"cf":"address", "col":"state", "type":"string"},
|"zipCode":{"cf":"address", "col":"zipCode", "type":"string"}
|}
|}""".stripMargin
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("HbaseSparkWrite")
.getOrCreate()
val df = spark.read.format("org.apache.phoenix.spark")
.option("table", "ph_employee")
.option("zkUrl", "0.0.0.0:2181")
.load()
df.write.options(
Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "4"))
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()
}
}
pom:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<scala.tools.version>2.11</scala.tools.version>
<scala.version>2.11.8</scala.version>
<spark.version>2.3.2.3.1.0.31-28</spark.version>
<hbase.version>2.0.2.3.1.0.31-28</hbase.version>
<phoenix.version>5.0.0.3.1.5.9-1</phoenix.version>
</properties>
<!-- Hbase dependencies-->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-spark -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-spark</artifactId>
<version>2.0.2.3.1.0.6-1</version>
</dependency>
<dependency>
<groupId>com.hortonworks</groupId>
<artifactId>shc-core</artifactId>
<version>1.1.1-2.1-s_2.11</version>
</dependency>
<!-- Phoenix dependencies-->
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-client</artifactId>
<version>${phoenix.version}</version>
<exclusions>
<exclusion>
<groupId>org.glassfish</groupId>
<artifactId>javax.el</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-spark</artifactId>
<version>${phoenix.version}</version>
<exclusions>
<exclusion>
<groupId>org.glassfish</groupId>
<artifactId>javax.el</artifactId>
</exclusion>
</exclusions>
</dependency>
Exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/client/HConnectionManager
at org.apache.phoenix.query.HConnectionFactory$HConnectionFactoryImpl.createConnection(HConnectionFactory.java:47)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:396)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:228)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2374)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2352)
at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:76)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2352)
at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:232)
at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:147)
at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:202)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:98)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:57)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:279)
at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:118)
at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:60)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:432)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at com.test.SparkPheonixToHbase$.main(SparkHbasePheonix.scala:33)
at com.test.SparkPheonixToHbase.main(SparkHbasePheonix.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.HConnectionManager
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 24 more
20/05/19 16:57:44 INFO SparkContext: Invoking stop() from shutdown hook
Phoenix read fails when I add hbase-spark dependency.
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-spark</artifactId>
<version>2.0.2.3.1.0.6-1</version>
</dependency>
How can I get rid this error?
Just use either one of those connectors.
If you want to read phoenix table and the output table is not a Phoenix table, but standard HBase table, use just SHC or HBase Spark connector. They can read Phoenix table directly from HBase, without the Phoenix layer. See here the options: https://sparkbyexamples.com/hbase/spark-hbase-connectors-which-one-to-use/#spark-sql
If you want to save to Phoenix as well, just use the Phoenix conector for reading and writing.
Normally, mixing up connectors can cause conflict in building, since they may overlap in their internal classes, especially if you don't care to import exactly the versions that use the same HBase client under the hoods. Unless you have a really good reason to use different libraries for reading and writing, stick just with one of them that fits your needs the most.

Getting error : Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf

I am working on Kafka Spark Streaming. The IDLE doesn't show any errors and the program builds successfully as well but I am getting this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
at KafkaSparkStream1$.main(KafkaSparkStream1.scala:13)
at KafkaSparkStream1.main(KafkaSparkStream1.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 2 more
I am using maven. I have also set up my environment variables correctly as every component is working individually My spark version is 3.0.0-preview2, Scala version is 2.12
I have exported a spark-streaming-Kafka jar file.
Here is my pom file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.org.cpg.casestudy</groupId>
<artifactId>Kafka_casestudy</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<spark.version>3.0.0</spark.version>
<scala.version>2.12</scala.version>
</properties>
<build>
<plugins>
<!-- Maven Compiler Plugin-->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<!-- Apache Kafka Clients-->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.5.0</version>
</dependency>
<!-- Apache Kafka Streams-->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>2.5.0</version>
</dependency>
<!-- Apache Log4J2 binding for SLF4J -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>2.11.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0-preview2</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.0.0-preview2</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
<version>3.0.0-preview2</version>
</dependency>
</dependencies>
Here is my code (word count of message send by producer):
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark._
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.codehaus.jackson.map.deser.std.StringDeserializer
object KafkaSparkStream {
def main(args: Array[String]): Unit = {
val brokers = "localhost:9092";
val groupid = "GRP1";
val topics = "KafkaTesting";
val SparkConf = new SparkConf().setMaster("local[*]").setAppName("KafkaSparkStreaming");
val ssc = new StreamingContext(SparkConf,Seconds(10))
val sc = ssc.sparkContext
sc.setLogLevel("off")
val topicSet = topics.split(",").toSet
val kafkaPramas = Map[String , Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupid,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
)
val messages = KafkaUtils.createDirectStream[String,String](
ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String,String](topicSet,kafkaPramas)
)
val line=messages.map(_.value)
val words = line.flatMap(_.split(" "))
val wordCount = words.map(x=> (x,1)).reduceByKey(_+_)
wordCount.print()
ssc.start()
ssc.awaitTermination()
}
}
Try cleaning your mvn local repository or else run below command to override you dependency JARs from online
mvn clean install -U
Your spark dependencies, specially spark-core_2.12-3.0.0-preview2.jar is not added to your class path while executing the Spark JAR.
you can do it via
spark-submit --jars <path>/spark-core_2.12-3.0.0-preview2.jar

Unable to read file from Azure Blob Storage mount from Databrick's Connect Apache Spark

I configured a databricks connect on Azure to run my spark programs on Azure cloud. For a dry run I tested a wordcount program. But the program is failing with following error.
"Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:"
I am using Intellij to run the program. I have the necessary permissions to access the cluster. But I still I am getting this error.
The following program is a wrapper which takes in the parameters and publishes the results.
package com.spark.scala
import com.spark.scala.demo.{Argument, WordCount}
import org.apache.spark.sql.SparkSession
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils
import scala.collection.mutable.Map
object Test {
def main(args: Array[String]): Unit = {
val argumentMap: Map[String, String] = Argument.parseArgs(args)
val spark = SparkSession
.builder()
.master("local")
.getOrCreate()
println(spark.range(100).count())
val rawread = String.format("/mnt/%s", argumentMap.get("--raw-reads").get)
val data = spark.sparkContext.textFile(rawread)
print(data.count())
val rawwrite = String.format("/dbfs/mnt/%s", argumentMap.get("--raw-write").get)
WordCount.executeWordCount(spark, rawread, rawwrite);
// The Spark code will execute on the Databricks cluster.
spark.stop()
}
}
The following code performs the wordcount logic:-
package com.spark.scala.demo
import org.apache.spark.sql.SparkSession
object WordCount{
def executeWordCount(sparkSession:SparkSession, read: String, write: String)
{
println("starting word count process ")
//val path = String.format("/mnt/%s", "tejatest\wordcount.txt")
//Reading input file and creating rdd with no of partitions 5
val bookRDD=sparkSession.sparkContext.textFile(read)
//Regex to clean text
val pat = """[^\w\s\$]"""
val cleanBookRDD=bookRDD.map(line=>line.replaceAll(pat, ""))
val wordsRDD=cleanBookRDD.flatMap(line=>line.split(" "))
val wordMapRDD=wordsRDD.map(word=>(word->1))
val wordCountMapRDD=wordMapRDD.reduceByKey(_+_)
wordCountMapRDD.saveAsTextFile(write)
}
}
I have written a mapper to map the paths given and I am passing the read and write locations through command line. My pom.xml is as follows: -
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>ex-com.spark.scala</groupId>
<artifactId>ex- demo</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
<scope>compile</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.clapper</groupId>
<artifactId>grizzled-slf4j_2.11</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>dbutils-api_2.11</artifactId>
<version>0.0.3</version>
</dependency>
<!-- Test -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>

Trainning a spark ml linear regresion model fail after migrating to 1.6.1

I use spark-ml to train a linear regression model.
It worked perfectly with spark version 1.5.2 but now with 1.6.1 I get the following error :
java.lang.AssertionError: assertion failed: lapack.dppsv returned 228.
It seems to be related to some low level linear algebra library but it worked fine before the spark version update.
In both version I get the same warnings before the training start saying that it can't load BLAS and LAPACK
[Executor task launch worker-6] com.github.fommil.netlib.BLAS - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
[Executor task launch worker-6] com.github.fommil.netlib.BLAS - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
[main] com.github.fommil.netlib.LAPACK - Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
[main] com.github.fommil.netlib.LAPACK - Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
Here is a minimal code :
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.OneHotEncoder;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
public class Application {
public static void main(String args[]) {
// create context
JavaSparkContext javaSparkContext = new JavaSparkContext("local[*]", "CalculCote");
SQLContext sqlContext = new SQLContext(javaSparkContext);
// describre fields
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataTypes.createStructField("brand", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("commercial_name", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("mileage", DataTypes.IntegerType, true));
fields.add(DataTypes.createStructField("price", DataTypes.DoubleType, true));
// load dataframe from file
DataFrame df = sqlContext.read().format("com.databricks.spark.csv") //
.option("header", "true") //
.option("InferSchema", "false") //
.option("delimiter", ";") //
.schema(DataTypes.createStructType(fields)) //
.load("input.csv").persist();
// show first rows
df.show();
// indexers and encoders for non numerical values
StringIndexer brandIndexer = new StringIndexer() //
.setInputCol("brand") //
.setOutputCol("brandIndex");
OneHotEncoder brandEncoder = new OneHotEncoder() //
.setInputCol("brandIndex") //
.setOutputCol("brandVec");
StringIndexer commNameIndexer = new StringIndexer() //
.setInputCol("commercial_name") //
.setOutputCol("commNameIndex");
OneHotEncoder commNameEncoder = new OneHotEncoder() //
.setInputCol("commNameIndex") //
.setOutputCol("commNameVec");
// model predictors
VectorAssembler predictors = new VectorAssembler() //
.setInputCols(new String[] { "brandVec", "commNameVec", "mileage" }) //
.setOutputCol("features");
// train model
LinearRegression lr = new LinearRegression().setLabelCol("price");
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { //
brandIndexer, brandEncoder, commNameIndexer, commNameEncoder, predictors, lr });
PipelineModel pm = pipeline.fit(df);
DataFrame result = pm.transform(df);
result.show();
}
}
And input.csv data
brand;commercial_name;mileage;price
APRILIA;ATLANTIC 125;18237;1400
BMW;R1200 GS;10900;12400
HONDA;CB 1000;58225;4250
HONDA;CB 1000;1780;7610
HONDA;CROSSRUNNER 800;2067;11490
KAWASAKI;ER-6F 600;51600;2010
KAWASAKI;VERSYS 1000;5900;13900
KAWASAKI;VERSYS 650;3350;6200
KTM;SUPER DUKE 990;36420;4760
the pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>test</groupId>
<artifactId>sparkmigration</artifactId>
<packaging>jar</packaging>
<name>sparkmigration</name>
<version>0.0.1</version>
<url>http://maven.apache.org</url>
<properties>
<java.version>1.8</java.version>
<spark.version>1.6.1</spark.version>
<!-- <spark.version>1.5.2</spark.version> -->
<spark.csv.version>1.3.0</spark.csv.version>
<slf4j.version>1.7.2</slf4j.version>
<logback.version>1.0.9</logback.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>${spark.csv.version}</version>
</dependency>
<!-- Logs -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>log4j-over-slf4j</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jcl-over-slf4j</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-core</artifactId>
<version>${logback.version}</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>${logback.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.2</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
Problem fixed (thanks apache spark mailing list)
Since spark 1.6, linear regression model is set to "auto", in some coditions (features<=4096, no elastic net param set, ...), WSL algo is used instead of L-BFGS.
I forced the solver to l-bfgs and it worked
LinearRegression lr = new LinearRegression().setLabelCol("price").setSolver("l-bfgs");

Resources