org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame - apache-spark

I'm trying to write a Spark Structured Streaming (2.3) dataset to ScyllaDB (Cassandra).
My code to write the dataset:
def saveStreamSinkProvider(ds: Dataset[InvoiceItemKafka]) = {
ds
.writeStream
.format("cassandra.ScyllaSinkProvider")
.outputMode(OutputMode.Append)
.queryName("KafkaToCassandraStreamSinkProvider")
.options(
Map(
"keyspace" -> namespace,
"table" -> StreamProviderTableSink,
"checkpointLocation" -> "/tmp/checkpoints"
)
)
.start()
}
My ScyllaDB Streaming Sinks:
class ScyllaSinkProvider extends StreamSinkProvider {
override def createSink(sqlContext: SQLContext,
parameters: Map[String, String],
partitionColumns: Seq[String],
outputMode: OutputMode): ScyllaSink =
new ScyllaSink(parameters)
}
class ScyllaSink(parameters: Map[String, String]) extends Sink {
override def addBatch(batchId: Long, data: DataFrame): Unit =
data.write
.cassandraFormat(
parameters("table"),
parameters("keyspace")
//parameters("cluster")
)
.mode(SaveMode.Append)
.save()
}
However, when I run this code, I receive an exception:
...
[error] +- StreamingExecutionRelation KafkaSource[Subscribe[transactions_load]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]
[error] at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
[error] at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
[error] Caused by: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame;
[error] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[error] at org.apache.spark.sql.Dataset.write(Dataset.scala:3103)
[error] at cassandra.ScyllaSink.addBatch(CassandraDriver.scala:113)
[error] at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
...
I have seen a similar question, but that is for CosmosDB - Spark CosmosDB Sink: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame

You could convert it to an RDD first and then write:
class ScyllaSink(parameters: Map[String, String]) extends Sink {
override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
val schema = data.schema
// this ensures that the same query plan will be used
val rdd: RDD[Row] = data.queryExecution.toRdd.mapPartitions { rows =>
val converter = CatalystTypeConverters.createToScalaConverter(schema)
rows.map(converter(_).asInstanceOf[Row])
}
// write the RDD to Cassandra
}
}
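A minimal sketch of how the `// write the RDD to Cassandra` step might be completed, assuming the spark-cassandra-connector is on the classpath (it provides `cassandraFormat` through `org.apache.spark.sql.cassandra._`). The idea is to rebuild a batch DataFrame from the RDD, so that `write` is no longer called on a streaming Dataset. The `table` and `keyspace` keys are the ones passed in the question's options.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SaveMode}
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.execution.streaming.Sink

class ScyllaSink(parameters: Map[String, String]) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    val schema = data.schema

    // Reuse the already-planned micro-batch and convert internal rows to Rows.
    val rdd: RDD[Row] = data.queryExecution.toRdd.mapPartitions { rows =>
      val converter = CatalystTypeConverters.createToScalaConverter(schema)
      rows.map(converter(_).asInstanceOf[Row])
    }

    // Rebuild a plain (non-streaming) DataFrame, on which write is allowed.
    val batch = data.sparkSession.createDataFrame(rdd, schema)

    batch.write
      .cassandraFormat(parameters("table"), parameters("keyspace"))
      .mode(SaveMode.Append)
      .save()
  }
}
```

On Spark 2.4 or later, `DataStreamWriter.foreachBatch` hands each micro-batch to a function as a regular DataFrame, which may make a custom `Sink` unnecessary.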

Related

write delta lake in Databricks error: HttpRequest 409 err PathAlreadyExist

Sometimes I get this error when a Databricks job writes to Azure Data Lake:
HttpRequest: 409,err=PathAlreadyExists,appendpos=,cid=f448-0832-41ac-a2ab-8821453ef3c8,rid=7d4-101f-005a-578c-f82000000,connMs=0,sendMs=0,recvMs=38,sent=0,recv=168,method=PUT,url=https://awutmp.dfs.core.windows.net/bronze/app/_delta_log/_last_checkpoint?resource=file&timeout=90
My code reads from blob storage using Auto Loader and writes to Azure Data Lake:
Schemas:
val binarySchema = StructType(List(
StructField("path", StringType, true),
StructField("modificationTime", TimestampType, true),
StructField("length", LongType, true),
StructField("content", BinaryType, true)
))
val jsonSchema = StructType(List(
StructField("EquipmentId", StringType, true),
StructField("EquipmentName", StringType, true),
StructField("EquipmentType", StringType, true),
StructField("Name", StringType, true),
StructField("Value", StringType, true),
StructField("ValueType", StringType, true),
StructField("LastSourceTimeStamp", StringType, true),
StructField("LastReprocessDate", StringType, true),
StructField("LastStateDuration", StringType, true),
StructField("MessageId", StringType, true)
))
Create delta table if not exists:
val sinkPath = "abfss://bronze@awutmp.dfs.core.windows.net/app"
val tableSQL =
s"""
CREATE TABLE IF NOT EXISTS bronze.awutmpapp(
path STRING,
file_modification_time TIMESTAMP,
file_length LONG,
value STRING,
json struct<EquipmentId STRING, EquipmentName STRING, EquipmentType STRING, Name STRING, Value STRING,ValueType STRING, LastSourceTimeStamp STRING, LastReprocessDate STRING, LastStateDuration STRING, MessageId STRING>,
job_name STRING,
job_version STRING,
schema STRING,
schema_version STRING,
timestamp_etl_process TIMESTAMP,
year INT GENERATED ALWAYS AS (YEAR(file_modification_time)) COMMENT 'generated from file_modification_time',
month INT GENERATED ALWAYS AS (MONTH(file_modification_time)) COMMENT 'generated from file_modification_time',
day INT GENERATED ALWAYS AS (DAY(file_modification_time)) COMMENT 'generated from file_modification_time'
)
USING DELTA
PARTITIONED BY (year, month, day)
LOCATION '${sinkPath}'
"""
spark.sql(tableSQL)
Options:
val options = Map[String, String](
"cloudFiles.format" -> "BinaryFile",
"cloudFiles.useNotifications" -> "true",
"cloudFiles.queueName" -> queue,
"cloudFiles.connectionString" -> queueConnString,
"cloudFiles.validateOptions" -> "true",
"cloudFiles.allowOverwrites" -> "true",
"cloudFiles.includeExistingFiles" -> "true",
"recursiveFileLookup" -> "true",
"modifiedAfter" -> "2022-01-01T00:00:00.000+0000",
"pathGlobFilter" -> "*.json.gz",
"ignoreCorruptFiles" -> "true",
"ignoreMissingFiles" -> "true"
)
Method that processes each micro-batch:
def decompress(compressed: Array[Byte]): Option[String] =
Try {
val inputStream = new GZIPInputStream(new ByteArrayInputStream(compressed))
scala.io.Source.fromInputStream(inputStream).mkString
}.toOption
def binaryToStringUDF: UserDefinedFunction = {
udf { (data: Array[Byte]) => decompress(data).orNull }
}
def processMicroBatch: (DataFrame, Long) => Unit = (df: DataFrame, id: Long) => {
val resultDF = df
.withColumn("content_string", binaryToStringUDF(col("content")))
.withColumn("array_value", split(col("content_string"), "\n"))
.withColumn("array_noempty_values", expr("filter(array_value, value -> value <> '')"))
.withColumn("value", explode(col("array_noempty_values")))
.withColumn("json", from_json(col("value"), jsonSchema))
.withColumnRenamed("length", "file_length")
.withColumnRenamed("modificationTime", "file_modification_time")
.withColumn("job_name", lit("jobName"))
.withColumn("job_version", lit("1.0"))
.withColumn("schema", lit(schema.toString))
.withColumn("schema_version", lit("1.0"))
.withColumn("timestamp_etl_process", current_timestamp())
.withColumn("timestamp_tz", expr("current_timezone()"))
.withColumn("timestamp_etl_process",
to_utc_timestamp(col("timestamp_etl_process"), col("timestamp_tz")))
.drop("timestamp_tz", "array_value", "array_noempty_values", "content", "content_string")
resultDF
.write
.format("delta")
.mode("append")
.option("path", sinkPath)
.save()
}
val storagePath = "wasbs://signal@externalaccount.blob.core.windows.net/"
val checkpointPath = "/checkpoint/signal/autoloader"
spark
.readStream
.format("cloudFiles")
.options(options)
.schema(binarySchema)
.load(storagePath)
.writeStream
.format("delta")
.outputMode("append")
.foreachBatch(processMicroBatch)
.option("checkpointLocation", checkpointPath)
.trigger(Trigger.AvailableNow)
.start()
.awaitTermination()
This is additional information I have seen in Azure Log Analytics:
How can I solve this error?
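This does not address the 409 response itself, but the `decompress` helper and the line-splitting step can be checked in isolation before they run inside `foreachBatch`. A minimal sketch using only the JDK and the Scala standard library; `compress` and `nonEmptyLines` are hypothetical helpers added for the check, and the explicit UTF-8 charset is an assumption about the input files.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.util.Try

object DecompressCheck {
  // Same logic as the question's decompress, with an explicit charset
  // and the stream closed after reading.
  def decompress(compressed: Array[Byte]): Option[String] =
    Try {
      val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
      try scala.io.Source.fromInputStream(in, "UTF-8").mkString
      finally in.close()
    }.toOption

  // Hypothetical helper: gzip a string in memory to build test input.
  def compress(text: String): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(buffer)
    try out.write(text.getBytes(StandardCharsets.UTF_8))
    finally out.close()
    buffer.toByteArray
  }

  // Mirrors the split on "\n" followed by the filter on empty values.
  def nonEmptyLines(content: String): Seq[String] =
    content.split("\n").filter(_.nonEmpty).toSeq
}
```

Input that is not valid gzip yields `None` here, which the UDF in the question turns into a null `content_string`.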

Default schema value conversion fails in to_avro() while publishing data to Kafka using databricks spark-avro

I am trying to publish data to a Kafka topic using the Confluent Schema Registry.
This is how I register the schema:
schemaRegistryClient.register("primitive_type_str_avsc", new Schema.Parser().parse(
s"""
|{
| "type": "record",
| "name": "RecordLevel",
| "fields": [
| {"name": "id", "type":["string","null"], "default": null}
| ]
|}
""".stripMargin
))
The following case class is used to match the schema:
case class myCaseClass (id:Option[String] = None)
Here is my notebook code snippet:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import scala.util.Try
import spark.implicits._
val df1 = Seq(("Welcome")).toDF("a")
.map(row => myCaseClass(Some(row.getAs("a"))))
val cols = df1.columns
df1.select(struct(cols.map(column):_*).as('struct))
.select(to_avro('struct, lit("primitive_type_str_avsc"), schemaRegistryAddress).as('value))
.show()
I get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 77.0 failed 4 times, most recent failure: Lost task 0.3 in stage 77.0 (TID 186, 10.73.122.72, executor 3): org.spark_project.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:191)
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:218)
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:284)
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:272)
at org.spark_project.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getIdFromRegistry(CachedSchemaRegistryClient.java:78)
at org.spark_project.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getId(CachedSchemaRegistryClient.java:205)
at org.apache.spark.sql.avro.SchemaRegistryClientProxy.getId(SchemaRegistryClientProxy.java:52)
at org.apache.spark.sql.avro.SchemaRegistryAvroEncoder.encoder(SchemaRegistryUtils.scala:97)
at org.apache.spark.sql.avro.CatalystDataToAvroWithSchemaRegistry.nullSafeEval(CatalystDataToAvroWithSchemaRegistry.scala:57)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:544)
Could you please help me resolve this issue? Thanks in advance.
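One detail worth checking in the schema, which may or may not be related to the 40403 "Schema not found" response (that error comes from the subject lookup): the Avro specification says the default value of a union field must match the first branch of the union, so a field with `"default": null` would normally list `"null"` first. A minimal sketch, assuming the Apache Avro library is on the classpath, that parses a corrected variant of the question's schema and prints the first union branch:

```scala
import org.apache.avro.Schema

object NullableFieldSketch {
  // Same record as in the question, with "null" moved to the front of the
  // union so that it agrees with the null default.
  val correctedSchemaJson: String =
    """{
      |  "type": "record",
      |  "name": "RecordLevel",
      |  "fields": [
      |    {"name": "id", "type": ["null", "string"], "default": null}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser().parse(correctedSchemaJson)
    // getTypes returns the union branches in declaration order.
    val branches = schema.getField("id").schema().getTypes
    println(s"first union branch: ${branches.get(0).getType}")
  }
}
```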

Using Sparksql and SparkCSV with SparkJob Server

I am trying to package a simple Scala application as a JAR. It uses spark-csv and Spark SQL to create a DataFrame from a CSV file stored in HDFS, then runs a simple query to return the max and min of a specific column in the CSV file.
I get an error when I use the sbt command to create the JAR, which I will later curl to the jobserver /jars folder and execute from a remote machine.
Code:
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.SparkContext._
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object sparkSqlCSV extends SparkJob {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[4]").setAppName("sparkSqlCSV")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val config = ConfigFactory.parseString("")
val results = runJob(sc, config)
println("Result is " + results)
}
override def validate(sc: sqlContext, config: Config): SparkJobValidation = {
SparkJobValid
}
override def runJob(sc: sqlContext, config: Config): Any = {
val value = "com.databricks.spark.csv"
val ControlDF = sqlContext.load(value,Map("path"->"hdfs://mycluster/user/Test.csv","header"->"true"))
ControlDF.registerTempTable("Control")
val aggDF = sqlContext.sql("select max(DieX) from Control")
aggDF.collectAsList()
}
}
Error:
[hduser@ptfhadoop01v spark-jobserver]$ sbt ashesh-jobs/package
[info] Loading project definition from /usr/local/hadoop/spark-jobserver/project
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
[info] Set current project to root (in build file:/usr/local/hadoop/spark-jobserver/)
[info] scalastyle using config /usr/local/hadoop/spark-jobserver/scalastyle-config.xml
[info] Processed 2 file(s)
[info] Found 0 errors
[info] Found 0 warnings
[info] Found 0 infos
[info] Finished in 9 ms
[success] created output: /usr/local/hadoop/spark-jobserver/ashesh-jobs/target
[warn] Credentials file /home/hduser/.bintray/.credentials does not exist
[info] Updating {file:/usr/local/hadoop/spark-jobserver/}ashesh-jobs...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] scalastyle using config /usr/local/hadoop/spark-jobserver/scalastyle-config.xml
[info] Processed 5 file(s)
[info] Found 0 errors
[info] Found 0 warnings
[info] Found 0 infos
[info] Finished in 1 ms
[success] created output: /usr/local/hadoop/spark-jobserver/job-server-api/target
[info] Compiling 2 Scala sources and 1 Java source to /usr/local/hadoop/spark-jobserver/ashesh-jobs/target/scala-2.10/classes...
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:8: object sql is not a member of package org.apache.spark
[error] import org.apache.spark.sql.SQLContext
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:14: object sql is not a member of package org.apache.spark
[error] val sqlContext = new org.apache.spark.sql.SQLContext(sc)
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:25: not found: type sqlContext
[error] override def runJob(sc: sqlContext, config: Config): Any = {
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:21: not found: type sqlContext
[error] override def validate(sc: sqlContext, config: Config): SparkJobValidation = {
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:27: not found: value sqlContext
[error] val ControlDF = sqlContext.load(value,Map("path"->"hdfs://mycluster/user/Test.csv","header"->"true"))
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:29: not found: value sqlContext
[error] val aggDF = sqlContext.sql("select max(DieX) from Control")
[error] ^
[error] 6 errors found
[error] (ashesh-jobs/compile:compileIncremental) Compilation failed
[error] Total time: 10 s, completed May 26, 2016 4:42:52 PM
[hduser@ptfhadoop01v spark-jobserver]$
I guess the main issue is that the dependencies for spark-csv and Spark SQL are missing, but I have no idea where to declare them before compiling the code with sbt.
I am issuing the following command to package the application. The source code is placed under the "ashesh_jobs" directory:
[hduser@ptfhadoop01v spark-jobserver]$ sbt ashesh-jobs/package
I hope someone can help me resolve this issue. In which file should I specify the dependencies, and in what format?
The following link has more information on creating other contexts: https://github.com/spark-jobserver/spark-jobserver/blob/master/doc/contexts.md
You also need job-server-extras.
Add the library dependency in build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.2"
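Apart from the missing dependency, the `not found: type sqlContext` errors come from using the value name `sqlContext` where a type is expected; the type is `SQLContext`. A sketch of how the job might look when it extends `SparkSqlJob`, assuming that trait from job-server-extras and a context created with the SQL context factory described on the linked page. The object name `SparkSqlCsvJob` is made up for the example, and the spark-csv package must also be available to the job at compile time and at run time.

```scala
import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkJobValid, SparkJobValidation, SparkSqlJob}

object SparkSqlCsvJob extends SparkSqlJob {
  // The job server passes in the SQLContext; no SparkContext is built here.
  override def validate(sql: SQLContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sql: SQLContext, config: Config): Any = {
    // Read the CSV through spark-csv (path and column taken from the question).
    val control = sql.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs://mycluster/user/Test.csv")

    control.registerTempTable("Control")
    sql.sql("select max(DieX), min(DieX) from Control").collect()
  }
}
```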

saveToCassandra works with Cassandra Lucene plugin?

I am implementing the example on Lucene plugin for Cassandra page (https://github.com/Stratio/cassandra-lucene-index) and when I try to save the data using saveToCassandra I get the exception NoSuchElementException.
If I use CassandraConnector.withSessionDo I am able to add elements into Cassandra and no exception is raised.
The tables:
CREATE KEYSPACE demo
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE tweets (
id INT PRIMARY KEY,
user TEXT,
body TEXT,
time TIMESTAMP,
latitude FLOAT,
longitude FLOAT
);
CREATE CUSTOM INDEX tweets_index ON tweets ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds' : '1',
'schema' : '{
fields : {
id : {type : "integer"},
user : {type : "string"},
body : {type : "text", analyzer : "english"},
time : {type : "date", pattern : "yyyy/MM/dd", sorted : true},
place : {type : "geo_point", latitude:"latitude", longitude:"longitude"}
}
}'
};
The code :
import org.apache.spark.{SparkConf, SparkContext, Logging}
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector._
object App extends Logging{
def main(args: Array[String]) {
// Get the cassandra IP and create the spark context
val cassandraIP = System.getenv("CASSANDRA_IP");
val sparkConf = new SparkConf(true)
.set("spark.cassandra.connection.host", cassandraIP)
.set("spark.cleaner.ttl", "3600")
.setAppName("Simple Spark Cassandra Example")
val sc = new SparkContext(sparkConf)
// Works
CassandraConnector(sparkConf).withSessionDo { session =>
session.execute("INSERT INTO demo.tweets(id, user, body, time, latitude, longitude) VALUES (19, 'Name', 'Body', '2016-03-19 09:00:00-0300', 39, 39)")
}
// Does not work
val demo = sc.parallelize(Seq((9, "Name", "Body", "2016-03-29 19:00:00-0300", 29, 29)))
// Raises the exception
demo.saveToCassandra("demo", "tweets", SomeColumns("id", "user", "body", "time", "latitude", "longitude"))
}
}
The exception:
16/03/28 14:15:41 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
Exception in thread "main" java.util.NoSuchElementException: Column not found in demo.tweets
at com.datastax.spark.connector.cql.StructDef$$anonfun$columnByName$2.apply(Schema.scala:60)
at com.datastax.spark.connector.cql.StructDef$$anonfun$columnByName$2.apply(Schema.scala:60)
at scala.collection.Map$WithDefault.default(Map.scala:52)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at com.datastax.spark.connector.cql.TableDef$$anonfun$9.apply(Schema.scala:153)
at com.datastax.spark.connector.cql.TableDef$$anonfun$9.apply(Schema.scala:152)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.TableDef.<init>(Schema.scala:152)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:283)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:271)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.Set$Set4.foreach(Set.scala:137)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:271)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:295)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:294)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:294)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:307)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:304)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:120)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:120)
at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:304)
at com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:275)
at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:36)
at com.webradar.spci.spark.cassandra.App$.main(App.scala:27)
at com.webradar.spci.spark.cassandra.App.main(App.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
EDITED:
Versions
Spark 1.6.0
Cassandra 3.0.3
Lucene plugin 3.0.3.1
For Jar creation I used maven-assembly-plugin to get a fat JAR.
If I remove the custom index I am able to use saveToCassandra
The problem seems to be caused by the Cassandra Spark driver rather than by the plugin.
Since CASSANDRA-10217, Cassandra 3.x per-row indexes no longer need to be created on a fake column. Thus, from Cassandra 3.x the column-based syntax "CREATE CUSTOM INDEX %s ON %s(%s)" is replaced with the new row-based syntax "CREATE CUSTOM INDEX %s ON %s()". However, the DataStax Spark driver doesn't seem to support this new feature yet.
When "com.datastax.spark.connector.RDDFunctions.saveToCassandra" is called, it tries to load the table schema and the index schema related to a table column. Since the new index syntax no longer has the fake column, the lookup uses an empty column name and fails with a NoSuchElementException.
However, saveToCassandra works well if you run the same example with the earlier fake-column syntax:
CREATE KEYSPACE demo
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE tweets (
id INT PRIMARY KEY,
user TEXT,
body TEXT,
time TIMESTAMP,
latitude FLOAT,
longitude FLOAT,
lucene TEXT
);
CREATE CUSTOM INDEX tweets_index ON tweets (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds' : '1',
'schema' : '{
fields : {
id : {type : "integer"},
user : {type : "string"},
body : {type : "text", analyzer : "english"},
time : {type : "date", pattern : "yyyy/MM/dd", sorted : true},
place : {type : "geo_point", latitude:"latitude", longitude:"longitude"}
}
}'
};
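The failure mode described above can be mimicked in plain Scala. This is only an illustration, not the connector's code: a by-name column lookup that throws for unknown names, queried with the empty target column that the row-based `ON tweets ()` syntax would produce. `ColumnLookupSketch` and its members are hypothetical names.

```scala
import scala.util.Try

object ColumnLookupSketch {
  // Columns of demo.tweets from the question (name -> CQL type).
  val columns: Map[String, String] = Map(
    "id" -> "INT",
    "user" -> "TEXT",
    "body" -> "TEXT",
    "time" -> "TIMESTAMP",
    "latitude" -> "FLOAT",
    "longitude" -> "FLOAT"
  )

  // Illustrative stand-in for a lookup that throws on unknown column names.
  val columnByName: Map[String, String] = columns.withDefault { name =>
    throw new NoSuchElementException(s"Column $name not found in demo.tweets")
  }

  // Resolve the column an index is declared on; "" models a row-based index.
  def indexTargetType(target: String): Try[String] = Try(columnByName(target))
}
```

With an empty target the lookup fails and the column name in the message is blank, which matches the shape of the message in the stack trace. With the fake `lucene` column added to the table, the index has a real target column and the lookup would succeed.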

Error in simple spark application

I'm running a simple Spark application that computes Word2Vec. Here is my code (taken from the Spark website):
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
object SimpleApp {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Word2Vector")
val sc = new SparkContext(conf)
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
val synonyms = model.findSynonyms("china", 40)
for((synonym, cosineSimilarity) <- synonyms) {
println(s"$synonym $cosineSimilarity")
}
// Save and load model
model.save(sc, "myModelPath")
}
}
When I run it, I get the following error message:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://GXYDEVVM:8020/user/hadoop/YOUR_SPARK_HOME/README.md
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:442)
at org.apache.spark.api.java.AbstractJavaRDDLike.count(JavaRDDLike.scala:47)
at SimpleApp.main(SimpleApp.java:13)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
What is the problem? Where is the address /user/hadoop/YOUR_SPARK_HOME/README.md coming from?
This is probably related to your default Spark configuration.
Take a look (or use grep) in the conf directory of your Spark home directory. You should find a spark-env.sh file, which could contain a reference to the strange file.
In fact, Spark is trying to load a file from HDFS. This is fairly standard when you run Spark on a cluster: your input and output should be reachable by the master and the workers. If you use Spark locally, you have to configure the SparkContext using the setMaster method. Here is my version:
object SparkDemo {
def log[A](key:String)(job : =>A) = {
val start = System.currentTimeMillis
val output = job
println("===> %s in %s seconds"
.format(key, (System.currentTimeMillis - start) / 1000.0))
output
}
def main(args: Array[String]):Unit ={
val modelName ="w2vModel"
val sc = new SparkContext(
new SparkConf()
.setAppName("SparkDemo")
.set("spark.executor.memory", "4G")
.set("spark.driver.maxResultSize", "16G")
.setMaster("spark://192.168.1.53:7077") // ip of the spark master.
// .setMaster("local[2]") // does not work... workers lose contact with the master after 120s
)
// take a look in the target folder if you are unsure how the jar is named
// one-liner to compile and run: sbt package && sbt run
sc.addJar("./target/scala-2.10/sparkling_2.10-0.1.jar")
val input = sc.textFile("./text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = log("compute model") { word2vec.fit(input) }
log ("save model") { model.save(sc, modelName) }
val synonyms = model.findSynonyms("china", 40)
for((synonym, cosineSimilarity) <- synonyms) {
println(s"$synonym $cosineSimilarity")
}
val model2 = log("reload model") { Word2VecModel.load(sc, modelName) }
}
}
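As for where an address like /user/hadoop/YOUR_SPARK_HOME/README.md can come from: a path without a scheme is resolved against the default file system and the user's working directory, which on HDFS is normally /user/<name>. `YOUR_SPARK_HOME/README.md` is the placeholder path from the Spark quick-start example, and the trace mentions SimpleApp.java, so it may be worth checking that the submitted jar really contains the Word2Vec code. The sketch below uses `java.net.URI` as a rough stand-in for that resolution; it is not Hadoop's implementation, and `qualify` is a hypothetical helper.

```scala
import java.net.URI

object PathResolutionSketch {
  // Resolve a path against a default file system and working directory,
  // roughly what happens to paths given without a scheme.
  def qualify(defaultFs: String, workingDir: String, path: String): String = {
    val base = new URI(defaultFs).resolve(workingDir.stripSuffix("/") + "/")
    // A path that already has a scheme (file://, hdfs://) is kept as-is.
    base.resolve(path).toString
  }
}
```

Passing an explicit scheme to `sc.textFile`, for example a `file://` or `hdfs://` URI, removes the ambiguity about which file system is used.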

Resources