documentation about the file format of spark rdd.saveAsObjectFile

documentation about the file format of spark rdd.saveAsObjectFile - apache-spark

Spark can save a rdd to a file with rdd.saveAsObjectFile("file").
I need to read this file outside Spark. According to doc, using the default spark serializer, this file is just a sequence of objects serialized with the standard Java serialization. However, I guess the file has a header and a separator between objects. I need to read this file, and use jdeserialize to deserialize each Java/Scala object (as I don't have the class definition).
Where can I find the documentation about the file format produced by rdd.saveAsObjectFile("file") (with the standard serializer, not Kryo serializer)?
Update
Working example based on VladoDemcak answer:
import org.apache.hadoop.io._
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
def deserialize(data: Array[Byte]) =
new ObjectInputStream(new ByteArrayInputStream(data)).readObject()
val path = new Path("/tmp/part-00000")
val config = new Configuration()
val reader = new SequenceFile.Reader(FileSystem.get(new Configuration()), path, config)
val key = NullWritable.get
val value = new BytesWritable
while (reader.next(key, value)) {
println("key: {} and value: {}.", key, value.getBytes)
println(deserialize(value.getBytes()))
}
reader.close()

It is very interesting question so I will try to explain what I know about this staff. You can check saveAsObjectFile and only documentation I saw about some details is API javadoc
/**
* Save this RDD as a SequenceFile of serialized objects.
*/
def saveAsObjectFile(path: String): Unit = withScope {
this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
.map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
.saveAsSequenceFile(path)
}
so as I know saveAsObjectFile produces SequenceFile. And based on documentation for sequenceFile it has header with version, classname, metadata ...
There are 3 different SequenceFile formats:
Uncompressed key/value records. Record compressed key/value records -
only 'values' are compressed here. Block compressed key/value records
- both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
all of the above formats share a common header (which is used by the
SequenceFile.Reader to return the appropriate key/value pairs).
For reading sequencefile we can use hadoop SequenceFile.Reader implementation.
Path path = new Path("/hdfs/file/path/seqfile");
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(new Configuration()), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)){
logger.info("key: {} and value: {}.", key, value.getBytes());
// (MyObject) deserialize(value.getBytes());
}
reader.close();
I have not tested this but based on doc link you noticed in your question:
By default, Spark serializes objects using Java’s ObjectOutputStream
framework
so in loop you can get bytes for value and deserialize with ObjectInputStream
public static Object deserialize(byte[] data){
return new ObjectInputStream(new ByteArrayInputStream(data)).readObject();
}
in your case you need to use your library (jdeserialize) in deserialize method - i guess run(InputStream is, boolean shouldConnect) etc.

Related

Working with Protobuf-encoded MQTT streams in Apache Beam

I am trying to decode and process protobuf-encoded MQTT messages (from an Eclipse Mosquitto broker) using Apache Beam. In addition to the encoded fields, I also want to process the full topic of each message for grouping and aggregations, as well as the timestamp.
What I have tried so far
I can connect to Mosquitto via
val options = PipelineOptionsFactory.create()
val pipeline = Pipeline.create(options)
val mqttReader: MqttIO.Read = MqttIO
.read()
.withConnectionConfiguration(
MqttIO.ConnectionConfiguration.create(
"tcp://localhost:1884",
"my/topic/+"
)
)
val readMessages = pipeline.apply<PCollection<ByteArray>>(mqttReader)
In order to decode the messages, I have compiled the .proto schema (in my case quote.proto containing the Quote message) via Gradle, which allows my to transform ByteArray into Quote objects via Quote.parseFrom():
val quotes = readMessages
.apply(
ParDo.of(object : DoFn<ByteArray, QuoteOuterClass.Quote>() {
#ProcessElement
fun processElement(context: ProcessContext) {
val protoRow = context.element()
context.output(QuoteOuterClass.Quote.parseFrom(protoRow))
}
})
)
Using this, in the next apply, I can then access individual fields with a ProcessFunction and a lambda, e.g. { quote -> "${quote.volume}" }. However, there are two problems:
With this pipeline I do not have access to the topic or timestamp of each message.
After sending the decoded messages back to the broker with plain UTF8 encoding, I believe that they do not get decoded correctly.
Additional considerations
Apache Beam provides a ProtoCoder class, but I cannot figure out how to use it in conjunction with MqttIO. I suspect that the implementation has to look similar to
val coder = ProtoCoder
.of(QuoteOuterClass.Quote::class.java)
.withExtensionsFrom(QuoteOuterClass::class.java)
Instead of a PCollection<ByteArray>, the Kafka IO reader provides a PCollection<KafkaRecord<Long, String>>, which has all the relevant fields (including topic). I am wondering if something similar can be achieved with Mqtt + ProtoBuf.
A similar implementation to what I want to achieve can be done in Spark Structured Streaming + Apache Bahir as follows:
val df_mqttStream = spark.readStream
.format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
.option("topic", topic)
.load(brokerUrl)
val parsePayload = ProtoSQL.udf { bytes: Array[Byte] => Quote.parseFrom(bytes) }
val quotesDS = df_mqttStream.select("id", "topic", "payload")
.withColumn("quote", parsePayload($"payload"))
.select("id", "topic", "quote.*")
However, with Spark 2.4 (the latest supported version), accessing the message topic is broken (related issue, my ticket in Apache Jira).

From my understanding, the latest version of Apache Beam (2.27.0) does simply not offer a way to extract the specific topics of MQTT messages.
I have extended the MqttIO to return MqttMessage objects that include a topic (and a timestamp) in addition to the byte array payload. The changes currently exist as a pull request draft.
With these changes, the topic can simply be accessed as message.topic.
val readMessages = pipeline.apply<PCollection<MqttMessage>>(mqttReader)
val topicOfMessages: PCollection<String> = mqttMessages
.apply(
ParDo.of(object : DoFn<MqttMessage, String>() {
#ProcessElement
fun processElement(
#Element message: MqttMessage,
out: OutputReceiver<String>
) { out.output(message.topic) }
})
)

Azure DataBricks Stream foreach fails with NotSerializableException

I want to continuously elaborate rows of a dataset stream (originally initiated by a Kafka): based on a condition I want to update a Radis hash. This is my code snippet (lastContacts is the result of a previous command, which is a stream of this type: org.apache.spark.sql.DataFrame = [serialNumber: string, lastModified: long]. This expands to org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]):
class MyStreamProcessor extends ForeachWriter[Row] {
override def open(partitionId: Long, version: Long): Boolean = {
true
}
override def process(record: Row) = {
val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)
}
override def close(errorOrNull: Throwable): Unit = {}
}
val query = lastContacts
.writeStream
.foreach(new MyStreamProcessor())
.start()
query.awaitTermination()
I receive a huge stack trace, which the relevant part (I think) is this: java.io.NotSerializableException: org.apache.spark.sql.streaming.DataStreamWriter
Could anyone explain why this exception occurs and how to avoid? Thank you!
This question is related to the following two:
DataFrame to RDD[(String, String)] conversion
Call a function with each element a stream in Databricks

Spark Context is not serializable.
Any implementation of ForeachWriter must be serializable because each task will get a fresh serialized-deserialized copy of the provided object. Hence, it is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done after the open(...) method has been called, which signifies that the task is ready to generate data.
In your code, you are trying to use spark context within process method,
override def process(record: Row) = {
val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
*sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)*
}
To send data to redis, you need to create your own connection and open it in the open method and then use it in the process method.
Take a look how to create redis connection pool. https://github.com/RedisLabs/spark-redis/blob/master/src/main/scala/com/redislabs/provider/redis/ConnectionPool.scala

How to load tar.gz files in streaming datasets?

I would like to do streaming from tar-gzip files (tgz) which include my actual CSV stored data.
I already managed to do structured streaming with spark 2.2 when my data comes in as CSV files, but actually, the data comes in as gzipped csv files.
Is there a way that the trigger done by structured streaming does an decompress before handling the CSV stream?
The code I use to process the files is this:
val schema = Encoders.product[RawData].schema
val trackerData = spark
.readStream
.option("delimiter", "\t")
.schema(schema)
.csv(path)
val exceptions = rawCientData
.as[String]
.flatMap(extractExceptions)
.as[ExceptionData]
produced output as expected when path points to csv files.
But I would like to use tar gzip files.
When I try to place those files at the given path, I do not get any exceptions and batch output tells me
"sources" : [ {
"description" : "FileStreamSource[file:/Users/matthias/spark/simple_spark/src/main/resources/zsessionlog*]",
"startOffset" : null,
"endOffset" : {
"logOffset" : 0
},
"numInputRows" : 1095,
"processedRowsPerSecond" : 211.0233185584891
} ],
But I do not get any actual data processed.
Console sink looks like this:
+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+

I solved the part of reading .tar.gz (.tgz) files this way:
Inspired by this site I created my own TGZ codec
final class DecompressTgzCodec extends CompressionCodec {
override def getDefaultExtension: String = ".tgz"
override def createOutputStream(out: OutputStream): CompressionOutputStream = ???
override def createOutputStream(out: OutputStream, compressor: Compressor): CompressionOutputStream = ???
override def createCompressor(): Compressor = ???
override def getCompressorType: Class[_ <: Compressor] = ???
override def createInputStream(in: InputStream): CompressionInputStream = {
new TarDecompressorStream(new TarArchiveInputStream(new GzipCompressorInputStream(in)))
}
override def createInputStream(in: InputStream, decompressor: Decompressor): CompressionInputStream = createInputStream(in)
override def createDecompressor(): Decompressor = null
override def getDecompressorType: Class[_ <: Decompressor] = null
final class TarDecompressorStream(in: TarArchiveInputStream) extends DecompressorStream(in) {
def updateStream(): Unit = {
// still have data in stream -> done
if (in.available() <= 0) {
// create stream content from following tar elements one by one
in.getNextTarEntry()
}
}
override def read: Int = {
checkStream()
updateStream()
in.read()
}
override def read(b: Array[Byte], off: Int, len: Int): Int = {
checkStream()
updateStream()
in.read(b, off, len)
}
override def resetState(): Unit = {}
}
}
And registered it for use by spark.
val conf = new SparkConf()
conf.set("spark.hadoop.io.compression.codecs", classOf[DecompressTgzCodec].getName)
val spark = SparkSession
.builder()
.master("local[*]")
.config(conf)
.appName("Streaming Example")
.getOrCreate()
Works exactly like I wanted it to do.

I do not think reading tar.gz'ed files is possible in Spark (see Read whole text files from a compression in Spark or gzip support in Spark for some ideas).
Spark does support gzip files, but they are not recommended as not splittable and result in a single partition (that in turn makes Spark of little to no help).
In order to have gzipped files loaded in Spark Structured Streaming you have to specify the path pattern so the files are included in loading, say zsessionlog*.csv.gz or alike. Else, csv alone loads CSV files only.
If you insist on Spark Structured Streaming to handle tar.gz'ed files, you could write a custom streaming data Source to do the un-tar.gz.
Given gzip files are not recommended as data format in Spark, the whole idea of using Spark Structured Streaming does not make much sense.

Reading/writing with Avro schemas AND Parquet format in SparkSQL

I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads.
My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader), and which integrate well with SparkSQL (I will be writing and reading Dataset's).
I can't for the life of me figure out how to do this, and am wondering if this is possible at all. The only options the SparkSQL parquet format seems to support are "compression" and "mergeSchema" -- i.e. no options for specifying an alternate schema format or alternate schema. In other words, it appears that there is no way to read/write Parquet files using Avro schemas using the SparkSQL API. But perhaps I'm just missing something?
To clarify, I also understand that this will basically just add the Avro schema to the Parquet metadata on write, and will add one more translation layer on read (Parquet format -> Avro schema -> SparkSQL internal format) but will specifically allow me to add default values for missing columns (which Avro schema supports but Parquet schema does not).
Also, I am not looking for a way to convert Avro to Parquet, or Parquet to Avro (rather a way to use them together), and I am not looking for a way to read/write plain Avro within SparkSQL (you can do this using databricks/spark-avro).

I am doing something similar. I use avro schema to write into parquet file however, dont read it as avro. But the same technique should work on read as well. I am not sure if this is the best way to do it, but here it is anyways:
I have AvroData.avsc which has the avro schema.
KafkaUtils.createDirectStream[String,Array[Byte],StringDecoder,DefaultDecoder,Tuple2[String, Array[Byte]]](ssc, kafkaProps, fromOffsets, messageHandler)
kafkaArr.foreachRDD { (rdd,time)
=> { val schema = SchemaConverters.toSqlType(AvroData.getClassSchema).dataType.asInstanceOf[StructType] val ardd = rdd.mapPartitions{itr =>
itr.map { r =>
try {
val cr = avroToListWithAudit(r._2, offsetSaved, loadDate, timeNow.toString)
Row.fromSeq(cr.toArray)
} catch{
case e:Exception => LogHandler.log.error("Exception while converting to Avro" + e.printStackTrace())
System.exit(-1)
Row(0) //This is just to allow compiler to accept. On exception, the application will exit before this point
}
}
}
public static List avroToListWithAudit(byte[] kfkBytes, String kfkOffset, String loaddate, String loadtime ) throws IOException {
AvroData av = getAvroData(kfkBytes);
av.setLoaddate(loaddate);
av.setLoadtime(loadtime);
av.setKafkaOffset(kfkOffset);
return avroToList(av);
}
public static List avroToList(AvroData a) throws UnsupportedEncodingException{
List<Object> l = new ArrayList<>();
for (Schema.Field f : a.getSchema().getFields()) {
String field = f.name().toString();
Object value = a.get(f.name());
if (value == null) {
//System.out.println("Adding null");
l.add("");
}
else {
switch (f.schema().getType().getName()){
case "union"://System.out.println("Adding union");
l.add(value.toString());
break;
default:l.add(value);
break;
}
}
}
return l;
}
The getAvroData method needs to have code to construct the avro object from raw bytes. I am also trying to figure out a way to do that without having to specifying each attribute setter explicitly, but seems like there isnt one.
public static AvroData getAvroData (bytes)
{
AvroData av = AvroData.newBuilder().build();
try {
av.setAttr(String.valueOf("xyz"));
.....
}
}
Hope it helps

Cannot read map from cassandra

I am using map in one of my table as below:
media map<UUID, frozen<map<int, varchar>>>
Alhough I was able to successfully insert/update into this map, couldn't read from it.
I am using datastax java driver 3.0.0
So far I have tried this:
Map<UUID, Map> media = row.getMap("media", UUID.class, Map.class);
But this line gives below exception:
com.datastax.driver.core.exceptions.CodecNotFoundException: Codec not found for requested operation: [map<int, varchar> <-> java.util.Map]
How can I read from this field?

You are trying to use GettableByNameData.getMap(String, Class, Class) method. Instead of that, to get complex data type like Map, you can use GettableByNameData.getMap(String, TypeToken, TypeToken) method.
import com.google.common.reflect.TypeToken;
TypeToken uuidToken = new TypeToken<UUID>() {};
TypeToken mapToken = new TypeToken<Map<Integer, String>>() {};
Map<UUID, Map<Integer, String>> media = row.getMap("media", uuidToken, mapToken);

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string