Reading/writing with Avro schemas AND Parquet format in SparkSQL - apache-spark

I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads.
My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader), which integrate well with SparkSQL (I will be writing and reading Datasets).
I can't for the life of me figure out how to do this, and am wondering if this is possible at all. The only options the SparkSQL parquet format seems to support are "compression" and "mergeSchema" -- i.e. no options for specifying an alternate schema format or alternate schema. In other words, it appears that there is no way to read/write Parquet files using Avro schemas using the SparkSQL API. But perhaps I'm just missing something?
To clarify, I also understand that this will basically just add the Avro schema to the Parquet metadata on write, and will add one more translation layer on read (Parquet format -> Avro schema -> SparkSQL internal format) but will specifically allow me to add default values for missing columns (which Avro schema supports but Parquet schema does not).
Also, I am not looking for a way to convert Avro to Parquet, or Parquet to Avro (rather a way to use them together), and I am not looking for a way to read/write plain Avro within SparkSQL (you can do this using databricks/spark-avro).
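For reference, the manual approach I mean looks roughly like this: a sketch assuming parquet-avro's AvroParquetWriter and Avro's Generic API, with a made-up schema and path; the "age" default illustrates the schema-evolution point above.

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Hypothetical Avro schema; "age" carries a default so readers using this schema
// can fill the column even for older files that lack it.
val avroSchema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"User","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"age","type":"int","default":0}
    |]}""".stripMargin)

val writer = AvroParquetWriter
  .builder[GenericRecord](new Path("/tmp/users.parquet"))
  .withSchema(avroSchema)
  .build()

val record = new GenericData.Record(avroSchema)
record.put("name", "alice")
record.put("age", 42)
writer.write(record)
writer.close()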

I am doing something similar. I use an Avro schema to write into Parquet files; I don't read them back as Avro, but the same technique should work on read as well. I am not sure if this is the best way to do it, but here it is anyway:
I have AvroData.avsc, which holds the Avro schema.
val kafkaArr = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder,
  Tuple2[String, Array[Byte]]](ssc, kafkaProps, fromOffsets, messageHandler)

kafkaArr.foreachRDD { (rdd, time) =>
  val schema = SchemaConverters.toSqlType(AvroData.getClassSchema).dataType.asInstanceOf[StructType]
  val ardd = rdd.mapPartitions { itr =>
    itr.map { r =>
      try {
        val cr = avroToListWithAudit(r._2, offsetSaved, loadDate, timeNow.toString)
        Row.fromSeq(cr.toArray)
      } catch {
        case e: Exception =>
          LogHandler.log.error("Exception while converting to Avro", e)
          System.exit(-1)
          Row(0) // Only here to satisfy the compiler; on exception the application exits before this point
      }
    }
  }
  // ... build a DataFrame from `ardd` with `schema` and write it as Parquet (see the sketch below)
}
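The remaining step inside foreachRDD (not shown in the snippet above) is to turn ardd into a DataFrame using the converted schema and write it out as Parquet. A minimal sketch, assuming a sqlContext is in scope and using a made-up output path:

import org.apache.spark.sql.SaveMode

val df = sqlContext.createDataFrame(ardd, schema)
df.write
  .mode(SaveMode.Append)
  .parquet("/data/avrodata-parquet") // hypothetical output path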
public static List<Object> avroToListWithAudit(byte[] kfkBytes, String kfkOffset, String loaddate, String loadtime) throws IOException {
    AvroData av = getAvroData(kfkBytes);
    av.setLoaddate(loaddate);
    av.setLoadtime(loadtime);
    av.setKafkaOffset(kfkOffset);
    return avroToList(av);
}

public static List<Object> avroToList(AvroData a) throws UnsupportedEncodingException {
    List<Object> l = new ArrayList<>();
    for (Schema.Field f : a.getSchema().getFields()) {
        String field = f.name().toString();
        Object value = a.get(f.name());
        if (value == null) {
            // System.out.println("Adding null");
            l.add("");
        } else {
            switch (f.schema().getType().getName()) {
                case "union":
                    // System.out.println("Adding union");
                    l.add(value.toString());
                    break;
                default:
                    l.add(value);
                    break;
            }
        }
    }
    return l;
}
The getAvroData method needs to have code to construct the Avro object from the raw bytes. I am also trying to figure out a way to do that without having to specify each attribute setter explicitly, but it seems like there isn't one.
public static AvroData getAvroData(byte[] kfkBytes) {
    AvroData av = AvroData.newBuilder().build();
    try {
        av.setAttr(String.valueOf("xyz"));
        .....
    }
}
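If the incoming bytes are plain Avro-encoded records (no schema-registry framing), one way to avoid writing a setter per attribute would be Avro's SpecificDatumReader. A sketch (in Scala here; the Java version is analogous), assuming the writer schema is the same as, or compatible with, AvroData's schema:

import org.apache.avro.io.DecoderFactory
import org.apache.avro.specific.SpecificDatumReader

// Decode raw Avro bytes directly into the generated AvroData class.
def getAvroData(kfkBytes: Array[Byte]): AvroData = {
  val reader  = new SpecificDatumReader[AvroData](AvroData.getClassSchema)
  val decoder = DecoderFactory.get().binaryDecoder(kfkBytes, null)
  reader.read(null, decoder)
}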
Hope it helps

Related

Working with Protobuf-encoded MQTT streams in Apache Beam

I am trying to decode and process protobuf-encoded MQTT messages (from an Eclipse Mosquitto broker) using Apache Beam. In addition to the encoded fields, I also want to process the full topic of each message for grouping and aggregations, as well as the timestamp.
What I have tried so far
I can connect to Mosquitto via
val options = PipelineOptionsFactory.create()
val pipeline = Pipeline.create(options)
val mqttReader: MqttIO.Read = MqttIO
    .read()
    .withConnectionConfiguration(
        MqttIO.ConnectionConfiguration.create(
            "tcp://localhost:1884",
            "my/topic/+"
        )
    )
val readMessages = pipeline.apply<PCollection<ByteArray>>(mqttReader)
In order to decode the messages, I have compiled the .proto schema (in my case quote.proto containing the Quote message) via Gradle, which allows me to transform a ByteArray into Quote objects via Quote.parseFrom():
val quotes = readMessages
    .apply(
        ParDo.of(object : DoFn<ByteArray, QuoteOuterClass.Quote>() {
            @ProcessElement
            fun processElement(context: ProcessContext) {
                val protoRow = context.element()
                context.output(QuoteOuterClass.Quote.parseFrom(protoRow))
            }
        })
    )
Using this, in the next apply, I can then access individual fields with a ProcessFunction and a lambda, e.g. { quote -> "${quote.volume}" }. However, there are two problems:
With this pipeline I do not have access to the topic or timestamp of each message.
After sending the decoded messages back to the broker with plain UTF8 encoding, I believe that they do not get decoded correctly.
Additional considerations
Apache Beam provides a ProtoCoder class, but I cannot figure out how to use it in conjunction with MqttIO. I suspect that the implementation has to look similar to
val coder = ProtoCoder
.of(QuoteOuterClass.Quote::class.java)
.withExtensionsFrom(QuoteOuterClass::class.java)
Instead of a PCollection<ByteArray>, the Kafka IO reader provides a PCollection<KafkaRecord<Long, String>>, which has all the relevant fields (including the topic). I am wondering if something similar can be achieved with MQTT + Protobuf.
A similar implementation to what I want to achieve can be done in Spark Structured Streaming + Apache Bahir as follows:
val df_mqttStream = spark.readStream
.format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
.option("topic", topic)
.load(brokerUrl)
val parsePayload = ProtoSQL.udf { bytes: Array[Byte] => Quote.parseFrom(bytes) }
val quotesDS = df_mqttStream.select("id", "topic", "payload")
.withColumn("quote", parsePayload($"payload"))
.select("id", "topic", "quote.*")
However, with Spark 2.4 (the latest supported version), accessing the message topic is broken (related issue, my ticket in Apache Jira).
From my understanding, the latest version of Apache Beam (2.27.0) simply does not offer a way to extract the topic of individual MQTT messages.
I have extended MqttIO to return MqttMessage objects that include the topic (and a timestamp) in addition to the byte array payload. The changes currently exist as a draft pull request.
With these changes, the topic can simply be accessed as message.topic.
val readMessages = pipeline.apply<PCollection<MqttMessage>>(mqttReader)

val topicOfMessages: PCollection<String> = readMessages
    .apply(
        ParDo.of(object : DoFn<MqttMessage, String>() {
            @ProcessElement
            fun processElement(
                @Element message: MqttMessage,
                out: OutputReceiver<String>
            ) { out.output(message.topic) }
        })
    )

Spark - ignoring corrupted files

In the ETL process that we are managing, we sometimes receive corrupted files.
We tried this Spark configuration and it seems it works (the Spark job is not failing because the corrupted files are discarded):
spark.sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")
But I don't know if there is any way to know which files were ignored. Is there any way to get those filenames?
Thanks in advance
One way is to look through your executor logs, provided you have set the following configurations to true in your Spark configuration.
RDD: spark.files.ignoreCorruptFiles
DataFrame: spark.sql.files.ignoreCorruptFiles
Spark will then log the corrupted files as WARN messages in your executor logs.
Here is the code snippet from Spark that does that:
if (ignoreCorruptFiles) {
  currentIterator = new NextIterator[Object] {
    // The readFunction may read some bytes before consuming the iterator, e.g.,
    // vectorized Parquet reader. Here we use lazy val to delay the creation of
    // iterator so that we will throw exception in `getNext`.
    private lazy val internalIter = readCurrentFile()

    override def getNext(): AnyRef = {
      try {
        if (internalIter.hasNext) {
          internalIter.next()
        } else {
          finished = true
          null
        }
      } catch {
        // Throw FileNotFoundException even `ignoreCorruptFiles` is true
        case e: FileNotFoundException => throw e
        case e @ (_: RuntimeException | _: IOException) =>
          logWarning(
            s"Skipped the rest of the content in the corrupted file: $currentFile", e)
          finished = true
          null
      }
    }
  }
}
Did you solve it?
If not, maybe you can try the below approach (sketched in code after the list):
Read everything from the location with that ignoreCorruptFiles setting.
Get the file name each record belongs to using the input_file_name function, and take the distinct names out.
Separately, get the list of all the objects in the respective directory.
Find the difference.
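A minimal sketch of that approach, assuming a Parquet dataset on a Hadoop-compatible filesystem; dataPath is a hypothetical location:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val dataPath = "/data/incoming" // hypothetical directory

// File names that actually contributed records to the read.
val readFiles = spark.read.parquet(dataPath)
  .select(input_file_name().as("file"))
  .distinct()
  .as[String]
  .collect()
  .toSet

// Everything sitting in the directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val allFiles = fs.listStatus(new Path(dataPath)).map(_.getPath.toString).toSet

// Files present on disk but absent from the successful read were skipped.
// Note: you may need to normalize the URIs (scheme/authority) before diffing.
val skippedFiles = allFiles -- readFiles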
Did you use a different approach?

How to load tar.gz files in streaming datasets?

I would like to stream from tar-gzipped (tgz) files which contain my actual CSV data.
I already managed to do Structured Streaming with Spark 2.2 when my data comes in as plain CSV files, but in reality the data comes in as gzipped CSV files.
Is there a way for the trigger fired by Structured Streaming to decompress the files before handling the CSV stream?
The code I use to process the files is this:
val schema = Encoders.product[RawData].schema

val trackerData = spark
  .readStream
  .option("delimiter", "\t")
  .schema(schema)
  .csv(path)

val exceptions = rawCientData
  .as[String]
  .flatMap(extractExceptions)
  .as[ExceptionData]
This produced output as expected when path points to plain CSV files.
But I would like to use tar-gzipped files.
When I place those files at the given path, I do not get any exceptions, and the batch output tells me:
"sources" : [ {
"description" : "FileStreamSource[file:/Users/matthias/spark/simple_spark/src/main/resources/zsessionlog*]",
"startOffset" : null,
"endOffset" : {
"logOffset" : 0
},
"numInputRows" : 1095,
"processedRowsPerSecond" : 211.0233185584891
} ],
But no actual data gets processed. The console sink looks like this:
+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+
I solved the part about reading .tar.gz (.tgz) files this way:
Inspired by this site, I created my own TGZ codec:
final class DecompressTgzCodec extends CompressionCodec {
  override def getDefaultExtension: String = ".tgz"

  override def createOutputStream(out: OutputStream): CompressionOutputStream = ???
  override def createOutputStream(out: OutputStream, compressor: Compressor): CompressionOutputStream = ???
  override def createCompressor(): Compressor = ???
  override def getCompressorType: Class[_ <: Compressor] = ???

  override def createInputStream(in: InputStream): CompressionInputStream = {
    new TarDecompressorStream(new TarArchiveInputStream(new GzipCompressorInputStream(in)))
  }
  override def createInputStream(in: InputStream, decompressor: Decompressor): CompressionInputStream = createInputStream(in)

  override def createDecompressor(): Decompressor = null
  override def getDecompressorType: Class[_ <: Decompressor] = null

  final class TarDecompressorStream(in: TarArchiveInputStream) extends DecompressorStream(in) {
    def updateStream(): Unit = {
      // If the current tar entry is exhausted, advance to the next entry
      // so its content becomes the stream content; otherwise keep reading.
      if (in.available() <= 0) {
        in.getNextTarEntry()
      }
    }

    override def read: Int = {
      checkStream()
      updateStream()
      in.read()
    }

    override def read(b: Array[Byte], off: Int, len: Int): Int = {
      checkStream()
      updateStream()
      in.read(b, off, len)
    }

    override def resetState(): Unit = {}
  }
}
And registered it for use by Spark:

val conf = new SparkConf()
conf.set("spark.hadoop.io.compression.codecs", classOf[DecompressTgzCodec].getName)

val spark = SparkSession
  .builder()
  .master("local[*]")
  .config(conf)
  .appName("Streaming Example")
  .getOrCreate()
It works exactly like I wanted it to.
I do not think reading tar.gz'ed files is possible in Spark (see Read whole text files from a compression in Spark or gzip support in Spark for some ideas).
Spark does support gzip files, but they are not recommended since they are not splittable and result in a single partition (which in turn makes Spark of little to no help).
In order to have gzipped files loaded in Spark Structured Streaming, you have to specify the path pattern so the files are included in loading, say zsessionlog*.csv.gz or similar; otherwise csv alone loads plain CSV files only.
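For example, a sketch that reuses the schema and delimiter option from the question; the glob itself is an assumption based on the source path shown above:

val trackerData = spark
  .readStream
  .option("delimiter", "\t")
  .schema(schema)
  .csv("src/main/resources/zsessionlog*.csv.gz") // gzipped CSVs are decompressed transparently, one partition per file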
If you insist on using Spark Structured Streaming to handle tar.gz'ed files, you could write a custom streaming data Source that does the un-tar.gz.
Given that gzip files are not recommended as a data format in Spark, the whole idea of using Spark Structured Streaming does not make much sense.

documentation about the file format of spark rdd.saveAsObjectFile

Spark can save an RDD to a file with rdd.saveAsObjectFile("file").
I need to read this file outside Spark. According to the docs, with the default Spark serializer this file is just a sequence of objects serialized with standard Java serialization. However, I guess the file has a header and a separator between objects. I need to read this file and use jdeserialize to deserialize each Java/Scala object (as I don't have the class definitions).
Where can I find the documentation about the file format produced by rdd.saveAsObjectFile("file") (with the standard serializer, not Kryo serializer)?
Update
Working example based on VladoDemcak's answer:
import java.io.{ByteArrayInputStream, ObjectInputStream}

import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
import org.apache.hadoop.io._

def deserialize(data: Array[Byte]) =
  new ObjectInputStream(new ByteArrayInputStream(data)).readObject()

val path = new Path("/tmp/part-00000")
val config = new Configuration()
val reader = new SequenceFile.Reader(FileSystem.get(config), path, config)
val key = NullWritable.get
val value = new BytesWritable

while (reader.next(key, value)) {
  println(s"key: $key and value: ${value.getLength} bytes")
  println(deserialize(value.getBytes()))
}
reader.close()
It is a very interesting question, so I will try to explain what I know about this. You can check saveAsObjectFile; the only documentation I saw about the details is the API javadoc:
/**
 * Save this RDD as a SequenceFile of serialized objects.
 */
def saveAsObjectFile(path: String): Unit = withScope {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
    .saveAsSequenceFile(path)
}
So as far as I know, saveAsObjectFile produces a SequenceFile (note that, per the code above, each value is a serialized array of up to 10 of your objects, because of iter.grouped(10)). And based on the documentation for SequenceFile, it has a header with version, classname, metadata...
There are 3 different SequenceFile formats:
Uncompressed key/value records.
Record compressed key/value records - only 'values' are compressed here.
Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
All of the above formats share a common header (which is used by the SequenceFile.Reader to return the appropriate key/value pairs).
For reading the SequenceFile we can use the Hadoop SequenceFile.Reader implementation:
Configuration config = new Configuration();
Path path = new Path("/hdfs/file/path/seqfile");
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
BytesWritable value = (BytesWritable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
    logger.info("key: {} and value: {}.", key, value.getBytes());
    // (MyObject) deserialize(value.getBytes());
}
reader.close();
I have not tested this, but based on the doc link you mentioned in your question:
By default, Spark serializes objects using Java’s ObjectOutputStream
framework
so in the loop you can get the bytes for the value and deserialize them with an ObjectInputStream:
public static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
    return new ObjectInputStream(new ByteArrayInputStream(data)).readObject();
}
In your case you would need to use your library (jdeserialize) in the deserialize method - I guess run(InputStream is, boolean shouldConnect), etc.

How to insert (not save or update) RDD into Cassandra?

I am working with Apache Spark and Cassandra, and I want to save my RDD to Cassandra with spark-cassandra-connector.
Here's the code:
def saveToCassandra(step: RDD[(String, String, Date, Int, Int)]) = {
  step.saveToCassandra("keyspace", "table")
}
This works fine most of the time, but it overwrites data that is already present in the db. I would like not to overwrite any data. Is that somehow possible?
What I do is this:
rdd.foreachPartition(x => connector.withSessionDo(session => {
  someUpdater.UpdateEntries(x, session)
  // or
  x.foreach(y => someUpdater.UpdateEntry(y, session))
}))
The connector above is CassandraConnector(sparkConf).
It's not as nice as a simple saveToCassandra, but it allows for fine-grained control.
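For the original question (insert without overwriting existing rows), a hypothetical UpdateEntry could use CQL's lightweight transaction (INSERT ... IF NOT EXISTS) so rows that already exist are left untouched. A sketch with made-up keyspace, table, and column names, reusing the tuple shape from the question:

import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sparkConf)

step.foreachPartition { partition =>
  connector.withSessionDo { session =>
    // IF NOT EXISTS only writes the row when no row with that primary key exists yet.
    val stmt = session.prepare(
      "INSERT INTO keyspace.table (name, category, created, a, b) VALUES (?, ?, ?, ?, ?) IF NOT EXISTS")
    partition.foreach { case (name, category, created, a, b) =>
      session.execute(stmt.bind(name, category, created, Int.box(a), Int.box(b)))
    }
  }
}

Note that lightweight transactions add an extra round of coordination per write, so this is noticeably slower than a plain saveToCassandra.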
I think it's better to use withSessionDo outside the foreachPartition instead; there's overhead involved in that call that need not be repeated.
