I am having an issue when trying to unpickle an array of tuples. Here is the use case:
import scala.pickling._
import json._
object JsonTest extends App {
  val simplePickled = new Simple(Array(("test", 3))).pickle
  val unpickled = simplePickled.unpickle[Simple]
}

class Simple(val x: Array[(String, Int)]) {}
The above produces a runtime exception when unpickling; thanks in advance for any help. Here is the exception I'm getting:
Exception in thread "main" scala.reflect.internal.MissingRequirementError: class scala.Tuple2[java.lang.String in JavaMirror with...
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
at scala.pickling.internal.package$.typeFromString(package.scala:61)
at scala.pickling.internal.package$$anonfun$2.apply(package.scala:63)
at scala.pickling.internal.package$$anonfun$2.apply(package.scala:63)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at scala.pickling.internal.package$.typeFromString(package.scala:63)
at scala.pickling.FastTypeTag$.apply(FastTags.scala:57)
I'm trying to tokenize a piece of Chinese text with Stanford CoreNLP, but the program keeps throwing exceptions. I have tried different ways of loading the properties file, but none of them worked.
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.io.InputStream;
import java.util.*;

public class Spider {
    public static void main(String[] args) {
        try {
            StanfordCoreNLP ppl;
            Properties prop = new Properties();
            InputStream in = Spider.class.getClassLoader().getResourceAsStream("StanfordCoreNLP-chinese.properties");
            prop.load(in);
            ppl = new StanfordCoreNLP(prop);
            Annotation doc = new Annotation("浮云白日,山川庄严温柔。");
            ppl.annotate(doc);
            ppl.prettyPrint(doc, System.out);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The exceptions are as follows:
java.io.StreamCorruptedException: invalid type code: 3F
    at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1622)
    at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:1993)
    at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1588)
    at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
    at edu.stanford.nlp.ie.crf.CRFClassifier.loadClassifier(CRFClassifier.java:2642)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1473)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1505)
    at edu.stanford.nlp.ie.crf.CRFClassifier.getClassifier(CRFClassifier.java:2939)
    at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifierFromPath(ClassifierCombiner.java:286)
    at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifiers(ClassifierCombiner.java:270)
    at edu.stanford.nlp.ie.ClassifierCombiner.<init>(ClassifierCombiner.java:142)
    at edu.stanford.nlp.ie.NERClassifierCombiner.<init>(NERClassifierCombiner.java:108)
    at edu.stanford.nlp.pipeline.NERCombinerAnnotator.<init>(NERCombinerAnnotator.java:125)
    at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:68)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$5(StanfordCoreNLP.java:523)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$30(StanfordCoreNLP.java:602)
    at edu.stanford.nlp.util.Lazy$3.compute(Lazy.java:126)
    at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
    at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:149)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:251)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:192)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:188)
    at Spider.main(Spider.java:13)
edu.stanford.nlp.io.RuntimeIOException: java.io.IOException: Couldn't load classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
    at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:70)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$5(StanfordCoreNLP.java:523)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$30(StanfordCoreNLP.java:602)
    at edu.stanford.nlp.util.Lazy$3.compute(Lazy.java:126)
    at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
    at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:149)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:251)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:192)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:188)
    at Spider.main(Spider.java:13)
Caused by: java.io.IOException: Couldn't load classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
    at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifierFromPath(ClassifierCombiner.java:296)
    at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifiers(ClassifierCombiner.java:270)
    at edu.stanford.nlp.ie.ClassifierCombiner.<init>(ClassifierCombiner.java:142)
    at edu.stanford.nlp.ie.NERClassifierCombiner.<init>(NERClassifierCombiner.java:108)
    at edu.stanford.nlp.pipeline.NERCombinerAnnotator.<init>(NERCombinerAnnotator.java:125)
    at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:68)
    ... 9 more
Caused by: java.lang.ClassCastException: class java.util.ArrayList cannot be cast to class edu.stanford.nlp.classify.LinearClassifier (java.util.ArrayList is in module java.base of loader 'bootstrap'; edu.stanford.nlp.classify.LinearClassifier is in unnamed module of loader 'app')
    at edu.stanford.nlp.ie.ner.CMMClassifier.loadClassifier(CMMClassifier.java:1095)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1473)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1505)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1495)
    at edu.stanford.nlp.ie.ner.CMMClassifier.getClassifier(CMMClassifier.java:1141)
    at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifierFromPath(ClassifierCombiner.java:292)
    ... 14 more
I am connecting Spark Streaming to an external source (MS SQL Server) and publishing the table data to Kafka. I am getting a java.io.NotSerializableException: org.apache.kafka.clients.producer.KafkaProducer error. Please find the details below.
**CustomReceiver.scala**
package com.sparkdemo.app
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import java.util.List
import java.util.Map
import com.sparkdemo.entity.Inventory
import org.apache.kafka.clients.consumer.{ConsumerRecords, KafkaConsumer}
import java.net.ConnectException
import scala.util.{Try, Success, Failure}
import collection.JavaConversions._
class CustomReceiver(topic: String, kafkaParams: Map[String, Object]) extends Receiver[Inventory](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    val dataService = new DataService()
    var records: Inventory = dataService.selectAll()
    new Thread("Socket Receiver") {
      override def run() {
        Try {
          val consumer = new KafkaConsumer(kafkaParams)
          consumer.subscribe(topic)
          while (!isStopped && records != null) {
            // store(tokenData)
            // tokenData = new DataService().selectAll();
            val records = new DataService().selectAll()
            store(records)
          }
        } match {
          case e: ConnectException =>
            restart("Error connecting to...", e)
          case t: Throwable =>
            restart("Error receiving data", t)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    println("Nothing")
  }
}
**DataService.scala**
package com.sparkdemo.app;
import java.sql.Connection
import java.sql.DriverManager
import java.sql.ResultSet
import java.sql.Statement
import java.util._
import scala.collection.JavaConversions._
import com.sparkdemo.entity.Inventory
class DataService {

  var connect: Connection = DriverManager.getConnection(
    "jdbc:sqlserver://localhost;databaseName=TestDB;user=SA;password=Sqlserver#007")
  var statement: Statement = connect.createStatement()
  var resultSet: ResultSet = null
  var inv: Inventory = new Inventory()

  Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")

  def selectAll(): Inventory = {
    resultSet = statement.executeQuery("select * from Inventory")
    while (resultSet.next()) {
      val inv: Inventory = new Inventory()
      inv.setId(resultSet.getInt("id"))
      inv.setName(resultSet.getString("name"))
      inv.setQuantity(resultSet.getInt("quantity"))
    }
    inv
  }
}
Scala main class **Stream.scala**
package com.sparkdemo.app
import org.apache.spark.streaming.dstream.DStream
import com.sparkdemo.config.Config
import org.apache.kafka.common.serialization.{ StringDeserializer, StringSerializer }
import org.apache.kafka.clients.producer.{ ProducerRecord, KafkaProducer }
import java.util.Properties
import collection.JavaConversions._
import com.sparkdemo.entity.Inventory
object Stream extends Serializable {

  def main(args: Array[String]) {
    import org.apache.spark.streaming._

    def getKafkaParams: Map[String, Object] = {
      Map[String, Object](
        "auto.offset.reset" -> "earliest",
        "bootstrap.servers" -> "localhost:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "group3")
    }

    val properties = new Properties()
    properties.put("bootstrap.servers", "localhost:9092")
    properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val topic1 = "topic1"
    val topic2 = "topic2"
    val producer: KafkaProducer[String, Object] = new KafkaProducer(properties)

    val ssc = Config.configReceiver()
    val stream = ssc.receiverStream(new CustomReceiver(topic1, getKafkaParams))
    stream.map(Inventory => producer.send(new ProducerRecord[String, Object](topic2, Inventory)))
    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Entity class: **Inventory.scala**
package com.sparkdemo.entity
import scala.beans.{BeanProperty}
class Inventory extends Serializable {
  @BeanProperty
  var id: Int = _
  @BeanProperty
  var name: String = _
  @BeanProperty
  var quantity: Int = _
}
Error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:547)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:547)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:265)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:546)
at com.sparkdemo.app.Stream$.main(Stream.scala:36)
at com.sparkdemo.app.Stream.main(Stream.scala)
Caused by: java.io.NotSerializableException: org.apache.kafka.clients.producer.KafkaProducer
Serialization stack:
- object not serializable (class: org.apache.kafka.clients.producer.KafkaProducer, value: org.apache.kafka.clients.producer.KafkaProducer@557286ad)
- field (class: com.sparkdemo.app.Stream$$anonfun$main$1, name: producer$1, type: class org.apache.kafka.clients.producer.KafkaProducer)
- object (class com.sparkdemo.app.Stream$$anonfun$main$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 12 more
You have run into an issue where the KafkaProducer is unintentionally shipped to the executors because of the code below:
stream.map(Inventory => producer.send(new ProducerRecord[String, Object](topic2, Inventory)))
You can use mapPartitions (or foreachPartition) and create the producer inside the partition function, so that it is never serialized with the driver-side closure; see the sketch below.
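A hedged sketch of that idea follows. It reuses ssc, topic1, topic2, getKafkaParams and CustomReceiver from the question, keeps the StringSerializer settings, and creates the producer inside foreachPartition (the output-side equivalent of mapPartitions) so it is never captured by the driver-side closure:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val stream = ssc.receiverStream(new CustomReceiver(topic1, getKafkaParams))

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Built per partition on the executor, never serialized with the closure.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    records.foreach { inv =>
      // String payload to match the StringSerializer; a custom value serializer is discussed below.
      producer.send(new ProducerRecord[String, String](topic2, inv.toString))
    }
    producer.close()
  }
}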
The problem is the type of Serializer you are using for the Object-typed value:
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
Please write a Serializer for the Object-typed values; a sketch follows below. You can also refer to this link:
Send Custom Java Objects to Kafka Topic
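The linked approach boils down to implementing Kafka's Serializer interface for the value type. Below is a hedged sketch for the Inventory class from the question; it relies on Inventory extending Serializable, uses plain Java serialization, and the class/package name you register under value.serializer is an assumption:
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.kafka.common.serialization.Serializer
import com.sparkdemo.entity.Inventory

// Sketch of a value serializer for Inventory based on plain Java serialization
// (Inventory already extends Serializable in the question).
class InventorySerializer extends Serializer[Inventory] {

  override def configure(configs: java.util.Map[String, _], isKey: Boolean): Unit = ()

  override def serialize(topic: String, data: Inventory): Array[Byte] = {
    if (data == null) null
    else {
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(data)
      out.close()
      bytes.toByteArray
    }
  }

  override def close(): Unit = ()
}

// Registered in place of the StringSerializer for values (class name is an assumption):
// properties.put("value.serializer", "com.sparkdemo.app.InventorySerializer")
With this in place the producer would be typed as KafkaProducer[String, Inventory] rather than KafkaProducer[String, Object].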
I have developed this simple LogTransformer by extending UnaryTransformer to apply a log transformation on the "age" column in the DataFrame. I am able to apply this transformer, include it as a pipeline stage, and persist the pipeline model after training.
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.types.{DataType, DataTypes}

class LogTransformer(override val uid: String)
    extends UnaryTransformer[Int, Double, LogTransformer] with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("logTransformer"))

  override protected def createTransformFunc: Int => Double = (feature: Int) => Math.log10(feature)

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == DataTypes.IntegerType, s"Input type must be integer type but got $inputType.")
  }

  override protected def outputDataType: DataType = DataTypes.DoubleType

  override def copy(extra: ParamMap): LogTransformer = defaultCopy(extra)
}

object LogTransformer extends DefaultParamsReadable[LogTransformer]
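For context, here is a minimal sketch of how such a transformer might be wired into a pipeline as a stage; trainingDF, the column names, and the single-stage pipeline are assumptions (the question actually trains through a CrossValidator and persists cvModel.bestModel, as shown below):
import org.apache.spark.ml.Pipeline

// Hedged sketch: trainingDF and the column names are assumptions.
val logAge = new LogTransformer()
  .setInputCol("age")
  .setOutputCol("log_age")

val pipeline = new Pipeline().setStages(Array(logAge))
val pipelineModel = pipeline.fit(trainingDF)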
But when I read the persisted model I get the following exception.
val MODEL_PATH = "model/census_pipeline_model"
cvModel.bestModel.asInstanceOf[PipelineModel].write.overwrite.save(MODEL_PATH)
val same_pipeline_model = PipelineModel.load(MODEL_PATH)
Exception in thread "main" java.lang.NoSuchMethodException: dsml.Census$LogTransformer$2.read()
at java.lang.Class.getMethod(Class.java:1786)
at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:652)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:274)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:272)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:272)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332)
at dsml.Census$.main(Census.scala:572)
at dsml.Census.main(Census.scala)
Any pointers on how to fix that would be helpful. Thank You.
My code:
package read_write;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import com.google.common.base.Function;

public class Readexcel {
    public static void main(String[] args) throws IOException {
        File src = new File("D:\\J\\clients_pw.xlsx");
        FileInputStream fis = new FileInputStream(src);
        XSSFWorkbook wb = new XSSFWorkbook(fis);
        XSSFSheet sheet1 = wb.getSheet("MAS_details");
        String data1 = sheet1.getRow(1).getCell(0).getStringCellValue();
        System.out.println(data1);
    }
}
I am facing the following error while trying to execute this:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/compress/archivers/zip/ZipFile
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298)
    at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:37)
    at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:307)
    at read_write.Readexcel.main(Readexcel.java:19)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.compress.archivers.zip.ZipFile
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown Source)
    at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
I am not sure if I have added all the required jars. I have added all the Apache POI jars and google-collect.
Thank you so much, I had the same issue as well. It was resolved after adding the Commons Compress jar.
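For anyone using a build tool instead of hand-copied jars: the missing class comes from Apache Commons Compress, which POI's OOXML support needs on the classpath. A hedged sbt example (the version is an assumption; match it to whatever your POI release expects):
// build.sbt (sketch)
libraryDependencies += "org.apache.commons" % "commons-compress" % "1.21" // version is an assumption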
I am using Spark 1.5.0 from the CDH 5.5.2 distribution. I switched from Scala 2.10.4 to 2.10.5. I am using the following code for a UDAF. Is this somehow a String vs. UTF8String issue? If so, any help would be greatly appreciated.
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String

object GroupConcat extends UserDefinedAggregateFunction {
  def inputSchema = new StructType().add("x", StringType)
  def bufferSchema = new StructType().add("buff", ArrayType(StringType))
  def dataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, ArrayBuffer.empty[String])
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0))
      buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
  }

  def evaluate(buffer: Row) = UTF8String.fromString(
    buffer.getSeq[String](0).mkString(","))
}
However, I get this error message at runtime:
Exception in thread "main" java.lang.InternalError: Malformed class name
at java.lang.Class.getSimpleName(Class.java:1190)
at org.apache.spark.sql.execution.aggregate.ScalaUDAF.toString(udaf.scala:464)
at java.lang.String.valueOf(String.java:2847)
at java.lang.StringBuilder.append(StringBuilder.java:128)
at scala.StringContext.standardInterpolator(StringContext.scala:122)
at scala.StringContext.s(StringContext.scala:90)
at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression2.toString(interfaces.scala:96)
at org.apache.spark.sql.catalyst.expressions.Expression.prettyString(Expression.scala:174)
at org.apache.spark.sql.GroupedData$$anonfun$1.apply(GroupedData.scala:86)
at org.apache.spark.sql.GroupedData$$anonfun$1.apply(GroupedData.scala:80)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:80)
at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:227)
I received the same exception because my object extending UserDefinedAggregateFunction was defined inside another function.
Change this:
object Driver {
  def main(args: Array[String]) {
    object GroupConcat extends UserDefinedAggregateFunction {
      ...
    }
  }
}
To this:
object Driver {
  def main(args: Array[String]) {
    ...
  }

  object GroupConcat extends UserDefinedAggregateFunction {
    ...
  }
}
I ran into this as a conflict with packages I was importing. If you are importing anything, try testing in a spark-shell with nothing imported.
When you define your UDAF, check what the returned name looks like. It should be something like:
FractionOfDayCoverage: org.apache.spark.sql.expressions.UserDefinedAggregateFunction{def dataType: org.apache.spark.sql.types.DoubleType.type; def evaluate(buffer: org.apache.spark.sql.Row): Double} = $anon$1@27506b6d
That $anon$1@27506b6d at the end is a reasonable name that will work. When I imported the conflicting package, the returned name was three times as long. Here is an example:
$$$$bec6d1991b88c272b3efac29d720f546$$$$anon$1@6886601d