spark elasticsearch read with basic auth giving 403 error - apache-spark

I get the error below when I try to read from my ES cluster:
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [DELETE] on [_search/scroll] failed; server[<HOST>:<PORT>] returned [403|Forbidden:]
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:505)
at org.elasticsearch.hadoop.rest.RestClient.executeNotFoundAllowed(RestClient.java:476)
at org.elasticsearch.hadoop.rest.RestClient.deleteScroll(RestClient.java:541)
at org.elasticsearch.hadoop.rest.ScrollQuery.close(ScrollQuery.java:77)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.close(AbstractEsRDDIterator.scala:81)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.closeIfNeeded(AbstractEsRDDIterator.scala:74)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator$$anonfun$1.apply$mcV$sp(AbstractEsRDDIterator.scala:54)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator$$anonfun$1.apply(AbstractEsRDDIterator.scala:54)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator$$anonfun$1.apply(AbstractEsRDDIterator.scala:54)
Code used to read
// spark conf.
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("ReadORESData").setMaster("local[*]");

// elasticsearch specific configuration.
sparkConf.set("es.nodes", "<HOST>")
    .set("es.port", "<PORT>")
    .set("es.net.ssl", "true")
    .set("es.index.read.missing.as.empty", "true")
    .set("es.net.http.auth.user", "<USERNAME>")
    .set("es.net.http.auth.pass", "<PASSWORD>")
    .set("es.nodes.wan.only", "true")
    .set("es.nodes.discovery", "false")
    .set("es.input.use.sliced.partitions", "false")
    .set("es.resource", "<INDEX_NAME>")
    .set("es.scroll.size", "500");

JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaPairRDD<String, Map<String, Object>> rdd = JavaEsSpark.esRDD(jsc);
for (Map<String, Object> item : rdd.values().collect()) {
    System.out.println(item);
}
jsc.stop();
curl works fine from my machine:
curl -XGET 'https://<HOST>:<PORT>/<INDEX>/_search' --user <USERNAME>:<PASSWORD>
I see the same error for any index I try; if the index is invalid, it correctly reports index not found. I am connecting to an ES 5.6 cluster using the dependencies below.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-spark-20_2.11</artifactId>
<version>5.6.16</version>
</dependency>

I noticed that the account I was using didn't have read/write permissions on the index. Switching to a different account fixed the 403 error.
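As a sanity check before switching accounts, it can help to confirm what the connector's user is actually allowed to do. If the cluster runs X-Pack security (an assumption on my part; the question doesn't say which security plugin is in use), the authenticate endpoint shows the user and its roles:
curl -XGET 'https://<HOST>:<PORT>/_xpack/security/_authenticate' --user <USERNAME>:<PASSWORD>
The stack trace shows the 403 on the scroll cleanup (DELETE on _search/scroll), which es-hadoop issues as part of every scrolled read, so the account needs read access on the index for the job to finish cleanly, not just for the initial search.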

Related

Spark phoenix read breaks due to hbase-spark dependency with ClassNotFoundException: org.apache.hadoop.hbase.client.HConnectionManager

I am writing a simple Spark program to read from Phoenix and write to HBase using the Spark-HBase Connector (SHC). I can read from Phoenix and write to HBase with SHC separately, but when I put everything together (specifically, after adding the hbase-spark dependency) the pipeline breaks at the Phoenix read statement.
Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog

object SparkHbasePheonix {
  def main(args: Array[String]): Unit = {
    def catalog =
      s"""{
         |"table":{"namespace":"default", "name":"employee"},
         |"rowkey":"key",
         |"columns":{
         |"key":{"cf":"rowkey", "col":"key", "type":"string"},
         |"fName":{"cf":"person", "col":"firstName", "type":"string"},
         |"lName":{"cf":"person", "col":"lastName", "type":"string"},
         |"mName":{"cf":"person", "col":"middleName", "type":"string"},
         |"addressLine":{"cf":"address", "col":"addressLine", "type":"string"},
         |"city":{"cf":"address", "col":"city", "type":"string"},
         |"state":{"cf":"address", "col":"state", "type":"string"},
         |"zipCode":{"cf":"address", "col":"zipCode", "type":"string"}
         |}
         |}""".stripMargin

    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("HbaseSparkWrite")
      .getOrCreate()

    val df = spark.read.format("org.apache.phoenix.spark")
      .option("table", "ph_employee")
      .option("zkUrl", "0.0.0.0:2181")
      .load()

    df.write.options(
      Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "4"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
}
pom:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<scala.tools.version>2.11</scala.tools.version>
<scala.version>2.11.8</scala.version>
<spark.version>2.3.2.3.1.0.31-28</spark.version>
<hbase.version>2.0.2.3.1.0.31-28</hbase.version>
<phoenix.version>5.0.0.3.1.5.9-1</phoenix.version>
</properties>
<!-- Hbase dependencies-->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-spark -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-spark</artifactId>
<version>2.0.2.3.1.0.6-1</version>
</dependency>
<dependency>
<groupId>com.hortonworks</groupId>
<artifactId>shc-core</artifactId>
<version>1.1.1-2.1-s_2.11</version>
</dependency>
<!-- Phoenix dependencies-->
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-client</artifactId>
<version>${phoenix.version}</version>
<exclusions>
<exclusion>
<groupId>org.glassfish</groupId>
<artifactId>javax.el</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-spark</artifactId>
<version>${phoenix.version}</version>
<exclusions>
<exclusion>
<groupId>org.glassfish</groupId>
<artifactId>javax.el</artifactId>
</exclusion>
</exclusions>
</dependency>
Exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/client/HConnectionManager
at org.apache.phoenix.query.HConnectionFactory$HConnectionFactoryImpl.createConnection(HConnectionFactory.java:47)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:396)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:228)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2374)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2352)
at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:76)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2352)
at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:232)
at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:147)
at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:202)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:98)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:57)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:279)
at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:118)
at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:60)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:432)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at com.test.SparkPheonixToHbase$.main(SparkHbasePheonix.scala:33)
at com.test.SparkPheonixToHbase.main(SparkHbasePheonix.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.HConnectionManager
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 24 more
20/05/19 16:57:44 INFO SparkContext: Invoking stop() from shutdown hook
The Phoenix read fails as soon as I add the hbase-spark dependency:
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-spark</artifactId>
<version>2.0.2.3.1.0.6-1</version>
</dependency>
How can I get rid of this error?
Just use one of those connectors.
If you want to read a Phoenix table and the output table is not a Phoenix table but a standard HBase table, use just SHC or the HBase Spark connector. They can read the Phoenix table directly from HBase, without the Phoenix layer. See the options here: https://sparkbyexamples.com/hbase/spark-hbase-connectors-which-one-to-use/#spark-sql
If you want to save to Phoenix as well, just use the Phoenix connector for both reading and writing, as in the sketch below.
Mixing connectors usually causes conflicts at build time, since they may overlap in their internal classes, especially if you don't take care to import exactly the versions that use the same HBase client under the hood. Unless you have a really good reason to use different libraries for reading and writing, stick with the one that best fits your needs.
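For illustration, here is a minimal sketch of that single-connector approach, reading and writing through phoenix-spark only. The output table name PH_EMPLOYEE_OUT and the zkUrl are placeholders, and as far as I know the Phoenix connector expects the target table to already exist and the save mode to be Overwrite:

import org.apache.spark.sql.{SaveMode, SparkSession}

object PhoenixOnlyPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("PhoenixReadWrite")
      .getOrCreate()

    // Read through the Phoenix layer, as in the question.
    val df = spark.read.format("org.apache.phoenix.spark")
      .option("table", "ph_employee")
      .option("zkUrl", "0.0.0.0:2181")
      .load()

    // Write back through the same connector instead of SHC / hbase-spark.
    // PH_EMPLOYEE_OUT is a hypothetical, pre-created Phoenix table.
    df.write.format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)
      .option("table", "PH_EMPLOYEE_OUT")
      .option("zkUrl", "0.0.0.0:2181")
      .save()

    spark.stop()
  }
}

The idea is simply to keep one set of HBase client classes on the classpath, in line with the advice above.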

Unable to read file from Azure Blob Storage mount from Databricks Connect Apache Spark

I configured Databricks Connect on Azure to run my Spark programs on the Azure cloud. As a dry run I tested a word count program, but it fails with the following error:
"Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:"
I am using IntelliJ to run the program, and I have the necessary permissions to access the cluster, but I still get this error.
The following program is a wrapper which takes in the parameters and publishes the results.
package com.spark.scala

import com.spark.scala.demo.{Argument, WordCount}
import org.apache.spark.sql.SparkSession
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

import scala.collection.mutable.Map

object Test {
  def main(args: Array[String]): Unit = {
    val argumentMap: Map[String, String] = Argument.parseArgs(args)

    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()

    println(spark.range(100).count())

    val rawread = String.format("/mnt/%s", argumentMap.get("--raw-reads").get)
    val data = spark.sparkContext.textFile(rawread)
    print(data.count())

    val rawwrite = String.format("/dbfs/mnt/%s", argumentMap.get("--raw-write").get)
    WordCount.executeWordCount(spark, rawread, rawwrite)

    // The Spark code will execute on the Databricks cluster.
    spark.stop()
  }
}
The following code performs the word count logic:
package com.spark.scala.demo

import org.apache.spark.sql.SparkSession

object WordCount {
  def executeWordCount(sparkSession: SparkSession, read: String, write: String) {
    println("starting word count process ")
    // val path = String.format("/mnt/%s", "tejatest\wordcount.txt")

    // Reading input file and creating rdd with no of partitions 5
    val bookRDD = sparkSession.sparkContext.textFile(read)

    // Regex to clean text
    val pat = """[^\w\s\$]"""
    val cleanBookRDD = bookRDD.map(line => line.replaceAll(pat, ""))

    val wordsRDD = cleanBookRDD.flatMap(line => line.split(" "))
    val wordMapRDD = wordsRDD.map(word => (word -> 1))
    val wordCountMapRDD = wordMapRDD.reduceByKey(_ + _)

    wordCountMapRDD.saveAsTextFile(write)
  }
}
I have written a mapper to map the given paths, and I am passing the read and write locations through the command line. My pom.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>ex-com.spark.scala</groupId>
<artifactId>ex- demo</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
<scope>compile</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.clapper</groupId>
<artifactId>grizzled-slf4j_2.11</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>dbutils-api_2.11</artifactId>
<version>0.0.3</version>
</dependency>
<!-- Test -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>

Janusgraph OLAP query outside gremlin console

I have a graph in which some nodes have millions of incoming edges. I need to obtain the edge count of such nodes periodically. I'm using Cassandra as the storage backend.
Query:
g.V().has('vid','qwerty').inE().count().next()
All the available documentation explains how to leverage Apache Spark to do this from the Gremlin console.
Would it be possible to write the logic outside the Gremlin console as a Spark job and run it periodically on a Hadoop cluster?
Here's the output of the query on the Gremlin console when I'm not using Spark:
14108889 [gremlin-server-session-1] WARN org.apache.tinkerpop.gremlin.server.op.AbstractEvalOpProcessor -
Exception processing a script on request [RequestMessage{, requestId=c3d902b7-0fdd-491d-8639-546963212474, op='eval', processor='session', args={gremlin=g.V().has('vid','qwerty').inE().count().next(), session=2831d264-4566-4d15-99c5-d9bbb202b1f8, bindings={}, manageTransaction=false, batchSize=64}}]. TimedOutException()
at org.apache.cassandra.thrift.Cassandra$multiget_slice_result$multiget_slice_resultStandardScheme.read(Cassandra.java:14696)
at org.apache.cassandra.thrift.Cassandra$multiget_slice_result$multiget_slice_resultStandardScheme.read(Cassandra.java:14633)
at org.apache.cassandra.thrift.Cassandra$multiget_slice_result.read(Cassandra.java:14559)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_multiget_slice(Cassandra.java:741)
at org.apache.cassandra.thrift.Cassandra$Client.multiget_slice(Cassandra.java:725)
at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.getNamesSlice(CassandraThriftKeyColumnValueStore.java:143)
at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.getSlice(CassandraThriftKeyColumnValueStore.java:100)
at org.janusgraph.diskstorage.keycolumnvalue.KCVSProxy.getSlice(KCVSProxy.java:82)
at org.janusgraph.diskstorage.keycolumnvalue.cache.ExpirationKCVSCache.getSlice(ExpirationKCVSCache.java:129)
at org.janusgraph.diskstorage.BackendTransaction$2.call(BackendTransaction.java:288)
at org.janusgraph.diskstorage.BackendTransaction$2.call(BackendTransaction.java:285)
at org.janusgraph.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:69)
at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:55)
at org.janusgraph.diskstorage.BackendTransaction.executeRead(BackendTransaction.java:470)
at org.janusgraph.diskstorage.BackendTransaction.edgeStoreMultiQuery(BackendTransaction.java:285)
at org.janusgraph.graphdb.database.StandardJanusGraph.edgeMultiQuery(StandardJanusGraph.java:441)
at org.janusgraph.graphdb.transaction.StandardJanusGraphTx.lambda$executeMultiQuery$3(StandardJanusGraphTx.java:1054)
at org.janusgraph.graphdb.query.profile.QueryProfiler.profile(QueryProfiler.java:98)
at org.janusgraph.graphdb.query.profile.QueryProfiler.profile(QueryProfiler.java:90)
at org.janusgraph.graphdb.transaction.StandardJanusGraphTx.executeMultiQuery(StandardJanusGraphTx.java:1054)
at org.janusgraph.graphdb.query.vertex.MultiVertexCentricQueryBuilder.execute(MultiVertexCentricQueryBuilder.java:113)
at org.janusgraph.graphdb.query.vertex.MultiVertexCentricQueryBuilder.edges(MultiVertexCentricQueryBuilder.java:133)
at org.janusgraph.graphdb.tinkerpop.optimize.JanusGraphVertexStep.initialize(JanusGraphVertexStep.java:95)
at org.janusgraph.graphdb.tinkerpop.optimize.JanusGraphVertexStep.processNextStart(JanusGraphVertexStep.java:101)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.hasNext(AbstractStep.java:143)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.ExpandableStepIterator.hasNext(ExpandableStepIterator.java:42)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.ReducingBarrierStep.processAllStarts(ReducingBarrierStep.java:83)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.ReducingBarrierStep.processNextStart(ReducingBarrierStep.java:113)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:128)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:38)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.next(DefaultTraversal.java:200)
at java_util_Iterator$next.call(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:117)
at Script13.run(Script13.groovy:1)
at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.eval(GremlinGroovyScriptEngine.java:843)
at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.eval(GremlinGroovyScriptEngine.java:548)
at javax.script.AbstractScriptEngine.eval(AbstractScriptEngine.java:233)
at org.apache.tinkerpop.gremlin.groovy.engine.ScriptEngines.eval(ScriptEngines.java:120)
at org.apache.tinkerpop.gremlin.groovy.engine.GremlinExecutor.lambda$eval$0(GremlinExecutor.java:290)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
However, g.V().has('vid','qwerty').inE().limit(10000).count().next() works fine and gives ==>10000
Here is the Java client, which uses SparkGraphComputer to create the graph traversal:
public class FollowCountSpark {
    private static Graph hgraph;
    private static GraphTraversalSource traversalSource;

    public static void main(String[] args) {
        createHGraph();
        System.exit(0);
    }

    private static void createHGraph() {
        hgraph = GraphFactory.open("/resources/jp_spark.properties");
        traversalSource = hgraph.traversal().withComputer(SparkGraphComputer.class);
        System.out.println("traversalSource = " + traversalSource);
        getAllEdgesFromHGraph();
    }

    static long getAllEdgesFromHGraph() {
        try {
            GraphTraversal<Vertex, Vertex> allV = traversalSource.V();
            GraphTraversal<Vertex, Vertex> gt = allV.has("vid", "supernode");
            GraphTraversal<Vertex, Long> c = gt.inE()
                    // .limit(600000)
                    .count();
            long l = c.next();
            System.out.println("All edges = " + l);
            return l;
        } catch (Exception e) {
            System.out.println("Error while fetching the edges for : ");
            e.printStackTrace();
        }
        return -1;
    }
}
And the corresponding properties file is:
storage.backend=cassandrathrift
storage.cassandra.keyspace=t_graph
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
ids.block-size = 100000
storage.batch-loading = true
storage.buffer-size = 1000
# read-cassandra-3.properties
#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
#
# JanusGraph Cassandra InputFormat configuration
#
# These properties defines the connection properties which were used while write data to JanusGraph.
janusgraphmr.ioformat.conf.storage.backend=cassandrathrift
# This specifies the hostname & port for Cassandra data store.
#janusgraphmr.ioformat.conf.storage.hostname=10.xx.xx.xx,xx.xx.xx.18,xx.xx.xx.141
janusgraphmr.ioformat.conf.storage.port=9160
# This specifies the keyspace where data is stored.
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=t_graph
#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
spark.cassandra.input.split.size=256
#
# SparkGraphComputer Configuration
#
spark.master=local[1]
spark.executor.memory=1g
spark.cassandra.input.split.size_in_mb=512
spark.executor.extraClassPath=/opt/lib/janusgraph/*
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
And the corresponding pom.xml dependencies for all the spark and hadoop specific classes:
<dependencies>
<dependency>
<groupId>org.janusgraph</groupId>
<artifactId>janusgraph-core</artifactId>
<version>${janusgraph.version}</version>
</dependency>
<dependency>
<groupId>org.janusgraph</groupId>
<artifactId>janusgraph-cassandra</artifactId>
<version>${janusgraph.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>spark-gremlin</artifactId>
<version>3.1.0-incubating</version>
<exclusions>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>spark-gremlin</artifactId>
<version>3.2.5</version>
<exclusions>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.janusgraph</groupId>
<artifactId>janusgraph-hadoop-core</artifactId>
<version>${janusgraph.version}</version>
</dependency>
<dependency>
<groupId>org.janusgraph</groupId>
<artifactId>janusgraph-hbase</artifactId>
<version>${janusgraph.version}</version>
</dependency>
<dependency>
<groupId>org.janusgraph</groupId>
<artifactId>janusgraph-cql</artifactId>
<version>${janusgraph.version}</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.8.1</version>
</dependency>
</dependencies>
Hope this helps :)

Spark Cassandra Streaming

I want to save data from Spark Streaming to Cassandra using a Scala Maven project. This is the code that saves data to the Cassandra table:
import org.apache.maventestsparkproject._
import com.datastax.spark.connector.streaming._
import com.datastax.spark.connector.SomeColumns
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._

object SparkCassandra {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf()
      .setAppName("KakfaStreamToCassandra").setMaster("local[*]")
      .set("spark.cassandra.connection.host", "localhost")
      .set("spark.cassandra.connection.port", "9042")
    val topics = "fayssal1,fayssal2"
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
    val lines = messages.map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
    wordCounts.saveToCassandra(keysspace, table, SomeColumns("word", "count"))
    ssc.awaitTermination()
    ssc.start()
  }
}
The project builds successfully. This is my pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.apache.maventestsparkproject</groupId>
<artifactId>testmavenapp</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<name>testmavenapp</name>
<url>http://maven.apache.org</url>
<properties>
<scala.version>2.11.8</scala.version>
<spark.version>1.6.2</spark.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.10</artifactId>
<version>1.0.0-rc4</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector-java_2.10</artifactId>
<version>1.0.0-rc4</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>2.1.5</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
</project>
but when I run this command:
scala -cp /home/darif/TestProject/testmavenapp/target/testmavenapp-1.0-SNAPSHOT.jar /home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala
I get errors like the following:
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:1: error: object apache is not a member of package org
import org.apache.maventestsparkproject._
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:2: error: object datastax is not a member of package com
import com.datastax.spark.connector.streaming._
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:3: error: object datastax is not a member of package com
import com.datastax.spark.connector.SomeColumns
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:5: error: object apache is not a member of package org
import org.apache.spark.SparkConf
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:6: error: object apache is not a member of package org
import org.apache.spark.streaming._
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:7: error: object apache is not a member of package org
import org.apache.spark.streaming.kafka._
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:12: error: not found: type SparkConf
val sparkConf = new SparkConf()
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:19: error: not found: type StreamingContext
val ssc = new StreamingContext(sparkConf, Seconds(5))
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:19: error: not found: value Seconds
val ssc = new StreamingContext(sparkConf, Seconds(5))
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:22: error: not found: value brokers
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:23: error: not found: value KafkaUtils
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:23: error: not found: type StringDecoder
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
^
/home/darif/TestProject/testmavenapp/src/main/java/org/apache/maventestsparkproject/SparkCassandra.scala:23: error: not found: type StringDecoder
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
^
13 errors found
I am using:
Scala 2.11.8
Spark 1.6.2
Kafka Client APIs 0.8.2.11
Cassandra 3.9
Datastax Spark-Cassandra Connector compatible with Spark 1.6.2
The classpath for your application has not been set up correctly. It is recommended in various places to use spark-submit as your launcher, since it sets up the vast majority of the classpath for you. Third-party dependencies are then added with --packages.
Datastax Example
Spark Documentation
That said, you could achieve the same result by setting various things in your SparkConf yourself and manually building a classpath that includes all of the Spark, DSE, and Kafka libraries.
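As a sketch, a launch along these lines lets spark-submit resolve and distribute the third-party jars. The class name assumes the object lives in the org.apache.maventestsparkproject package (which the file path suggests), and the connector and Kafka artifact versions are guesses you would need to align with your Spark 1.6.2 / Scala 2.11 build:
spark-submit \
  --class org.apache.maventestsparkproject.SparkCassandra \
  --master local[*] \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:1.6.2,org.apache.spark:spark-streaming-kafka_2.11:1.6.2 \
  /home/darif/TestProject/testmavenapp/target/testmavenapp-1.0-SNAPSHOT.jar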

Training a spark ml linear regression model fails after migrating to 1.6.1

I use spark-ml to train a linear regression model.
It worked perfectly with Spark 1.5.2, but now with 1.6.1 I get the following error:
java.lang.AssertionError: assertion failed: lapack.dppsv returned 228.
It seems to be related to some low level linear algebra library but it worked fine before the spark version update.
In both versions I get the same warnings before training starts, saying that native BLAS and LAPACK implementations can't be loaded:
[Executor task launch worker-6] com.github.fommil.netlib.BLAS - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
[Executor task launch worker-6] com.github.fommil.netlib.BLAS - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
[main] com.github.fommil.netlib.LAPACK - Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
[main] com.github.fommil.netlib.LAPACK - Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
Here is a minimal example:
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.OneHotEncoder;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;

public class Application {

    public static void main(String args[]) {
        // create context
        JavaSparkContext javaSparkContext = new JavaSparkContext("local[*]", "CalculCote");
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // describe fields
        List<StructField> fields = new ArrayList<StructField>();
        fields.add(DataTypes.createStructField("brand", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("commercial_name", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("mileage", DataTypes.IntegerType, true));
        fields.add(DataTypes.createStructField("price", DataTypes.DoubleType, true));

        // load dataframe from file
        DataFrame df = sqlContext.read().format("com.databricks.spark.csv") //
                .option("header", "true") //
                .option("InferSchema", "false") //
                .option("delimiter", ";") //
                .schema(DataTypes.createStructType(fields)) //
                .load("input.csv").persist();

        // show first rows
        df.show();

        // indexers and encoders for non numerical values
        StringIndexer brandIndexer = new StringIndexer() //
                .setInputCol("brand") //
                .setOutputCol("brandIndex");
        OneHotEncoder brandEncoder = new OneHotEncoder() //
                .setInputCol("brandIndex") //
                .setOutputCol("brandVec");
        StringIndexer commNameIndexer = new StringIndexer() //
                .setInputCol("commercial_name") //
                .setOutputCol("commNameIndex");
        OneHotEncoder commNameEncoder = new OneHotEncoder() //
                .setInputCol("commNameIndex") //
                .setOutputCol("commNameVec");

        // model predictors
        VectorAssembler predictors = new VectorAssembler() //
                .setInputCols(new String[] { "brandVec", "commNameVec", "mileage" }) //
                .setOutputCol("features");

        // train model
        LinearRegression lr = new LinearRegression().setLabelCol("price");
        Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { //
                brandIndexer, brandEncoder, commNameIndexer, commNameEncoder, predictors, lr });
        PipelineModel pm = pipeline.fit(df);
        DataFrame result = pm.transform(df);
        result.show();
    }
}
And the input.csv data:
brand;commercial_name;mileage;price
APRILIA;ATLANTIC 125;18237;1400
BMW;R1200 GS;10900;12400
HONDA;CB 1000;58225;4250
HONDA;CB 1000;1780;7610
HONDA;CROSSRUNNER 800;2067;11490
KAWASAKI;ER-6F 600;51600;2010
KAWASAKI;VERSYS 1000;5900;13900
KAWASAKI;VERSYS 650;3350;6200
KTM;SUPER DUKE 990;36420;4760
The pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>test</groupId>
<artifactId>sparkmigration</artifactId>
<packaging>jar</packaging>
<name>sparkmigration</name>
<version>0.0.1</version>
<url>http://maven.apache.org</url>
<properties>
<java.version>1.8</java.version>
<spark.version>1.6.1</spark.version>
<!-- <spark.version>1.5.2</spark.version> -->
<spark.csv.version>1.3.0</spark.csv.version>
<slf4j.version>1.7.2</slf4j.version>
<logback.version>1.0.9</logback.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>${spark.csv.version}</version>
</dependency>
<!-- Logs -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>log4j-over-slf4j</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jcl-over-slf4j</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-core</artifactId>
<version>${logback.version}</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>${logback.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.2</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
Problem fixed (thanks to the Apache Spark mailing list).
Since Spark 1.6 the linear regression solver defaults to "auto"; under some conditions (features <= 4096, no elastic net param set, ...) the WLS (weighted least squares) algorithm is used instead of L-BFGS.
I forced the solver to l-bfgs and it worked:
LinearRegression lr = new LinearRegression().setLabelCol("price").setSolver("l-bfgs");
