I have an InputStream coming from the source and want to read the data using Spark.
How can I do that?
I have a Kafka topic where Protobuf messages arrive. I want to store them in blob storage and then be able to query them. Due to the volume of data I want to do this with Spark in order to have a scalable solution. The problem is that Spark does not have native support for Protobuf. I can think of two possible solutions:
1. Store the raw data as a binary column in some format such as Parquet. When querying, use a UDF to parse the raw data (see the sketch after this list).
2. Convert the Protobuf to Parquet (this should map 1:1) on write. Then reading/querying with Spark becomes trivial.
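For option (1), a minimal sketch of what the query-time UDF could look like, assuming the raw bytes sit in a binary column named payload, MyProtobufClass is the generated Protobuf class, and getId is a hypothetical accessor returning a String:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("proto-udf-sketch").getOrCreate()
import spark.implicits._

// Raw Protobuf payloads stored as a binary column "payload" in Parquet (assumed layout)
val raw = spark.read.parquet("/data/raw_proto")

// Parse the bytes on demand and pull out a single field
val extractId = udf { bytes: Array[Byte] =>
  MyProtobufClass.parseFrom(bytes).getId   // getId is a placeholder accessor
}

raw.select(extractId($"payload").as("id")).show()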
I have managed to implement (2) in plain Java using org.apache.parquet.proto.ProtoParquetWriter, i.e.
ProtoParquetWriter w = new ProtoParquetWriter(file, MyProtobufClass.class);
for (Object record : messages) {
    w.write(record);
}
w.close();
The question is: how do I implement this using Spark?
There is this sparksql-protobuf project, but it hasn't been updated in years.
I have found this related question, but it is 3.5 years old and has no answer.
AFAIK there is no easy way to do it using the Dataset APIs. We (in our team/company) use the Spark Hadoop MapReduce APIs together with parquet-protobuf, and use Spark RDDs to encode/decode Protobuf in Parquet format. I can show some examples if you have trouble following the parquet-protobuf docs.
Working sample
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.proto.ProtoParquetOutputFormat;
import org.apache.parquet.proto.ProtoReadSupport;
import org.apache.parquet.proto.ProtoWriteSupport;
...

this.job = Job.getInstance();
ParquetInputFormat.setReadSupportClass(job, ProtoReadSupport.class);
ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
ProtoParquetOutputFormat.setProtobufClass(job, YourProtoMessage.class); // your proto class

// read
JavaSparkContext context = new JavaSparkContext(sparkContext);
JavaPairRDD<Void, YourProtoMessage.Builder> input =
    context.newAPIHadoopFile(
        inputDir,
        ParquetInputFormat.class,
        Void.class,
        YourProtoMessage.Builder.class,
        job.getConfiguration());

// write
// rdd is a JavaPairRDD<Void, YourProtoMessage> holding the messages to persist
rdd.saveAsNewAPIHadoopFile(OutputDir,
    Void.class,
    YourProtoMessage.class,
    ParquetOutputFormat.class,
    job.getConfiguration());
Is there a way to read gzip files from Event Hub and decompress them using Spark Structured Streaming? I want to store the uncompressed JSON in ADLS using Spark Structured Streaming with Trigger once.
I'm getting NULL data when I try to read the Event Hub data, which is currently gzip-compressed, via Spark Structured Streaming. I need some logic for decompressing the Event Hub data while reading.
Any help would be greatly appreciated.
I was able to achieve this by writing a Scala UDF. Hope this helps somebody in the future.
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream

import org.apache.spark.sql.functions.udf

// Decompress a gzip-compressed binary payload into a String
val decompress = udf { compressed: Array[Byte] =>
  val inputStream = new GZIPInputStream(new ByteArrayInputStream(compressed))
  try scala.io.Source.fromInputStream(inputStream).mkString
  finally inputStream.close()
}
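For context, here is a hedged sketch of how the UDF might be wired into the streaming job, assuming the Azure Event Hubs connector (the eventhubs source) exposes the payload as a binary body column, that eventHubsConf holds the connection settings as a Map[String, String], and that the ADLS paths are only illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("eventhub-gunzip").getOrCreate()

// Assumed: the Event Hubs connector exposes the payload as a binary "body" column
val events = spark.readStream
  .format("eventhubs")
  .options(eventHubsConf)   // connection settings (assumed to be defined elsewhere)
  .load()

// Decompress each gzip payload into a JSON string using the UDF above
val json = events.select(decompress(col("body")).as("value"))

// Write the uncompressed JSON lines to ADLS, processing the available data once
json.writeStream
  .format("text")   // each row already holds a complete JSON document
  .option("path", "abfss://container@account.dfs.core.windows.net/out")               // illustrative
  .option("checkpointLocation", "abfss://container@account.dfs.core.windows.net/chk") // illustrative
  .trigger(Trigger.Once())
  .start()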
I have a Dataset I've read in from Hive/ORC in Spark, but I'm getting all kinds of errors I did not get when reading from a CSV. How can I tell Spark to convert that Dataset to something that's not ORC without hitting the disk? Right now I'm using this:
FileSystem.get(sc.hadoopConfiguration).delete(new Path(name));
loadedTbl.write.json(name);
val q = hc.read.json(name);
You can rewrite the data to any other format and then use it:
df.write.json("json_file_name")
df.write.parquet("parquet_file_name")
I have a DStream of type [String, ArrayList[String]], and I want to convert this DStream to Avro format and save it to HDFS. How can I accomplish that?
You can convert your stream to a JavaRDD or a DataFrame, then write it to a file, specifying Avro as the format.
// Apply a schema to an RDD
DataFrame booksDF = sqlContext.createDataFrame(books, Books.class);

booksDF.write()
    .format("com.databricks.spark.avro")
    .save("/output");
Please visit Accessing Avro Data Files From Spark SQL for more examples.
Hope this helps.
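To go from the DStream in the question to that kind of DataFrame write, one hedged sketch in Scala (column names, the append mode, and the output path are illustrative; the Java ArrayList values are converted to Scala sequences so they become an array column):

import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

// stream: DStream[(String, java.util.ArrayList[String])] as described in the question
def writeAsAvro(stream: DStream[(String, java.util.ArrayList[String])], spark: SparkSession): Unit = {
  import spark.implicits._
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // Convert each micro-batch to a DataFrame and append it as Avro
      val df = rdd.map { case (key, values) => (key, values.asScala.toSeq) }
                  .toDF("key", "values")
      df.write
        .format("com.databricks.spark.avro") // spark-avro package, as in the example above
        .mode("append")
        .save("/output")                     // illustrative path
    }
  }
}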
I'm working on a scenario where I want to broadcast the Spark context and use it on the other side (on the executors). Is it possible in any other way? If not, can someone explain why?
Any help is highly appreciated.
final JavaStreamingContext jsc = new JavaStreamingContext(conf,
        Durations.milliseconds(2000));
final JavaSparkContext context = jsc.sc();
final Broadcast<JavaSparkContext> broadcastedFieldNames = context.broadcast(context);
Here's what I'm trying to achieve:
1. We have an XML event coming from Kafka.
2. In the XML event we have an HDFS file path (hdfs:localhost//test1.txt).
3. We use a Spark streaming context to create a DStream and fetch the XML. Using a map function we read the file path from each XML event.
4. Now we need to read that file from HDFS (hdfs:localhost//test1.txt).
To read it I need sc.readFile, so I'm trying to broadcast the Spark context to the executors for a parallel read of the input file.
Currently we read the file through the HDFS API, but that will not read in parallel, right?
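A SparkContext only lives on the driver and is neither serializable nor broadcastable, so it cannot be shipped to the executors. A common alternative is to open the referenced HDFS files directly on the executors with the Hadoop FileSystem API inside a transformation. A rough sketch, assuming pathStream is the DStream[String] of paths extracted from the XML events:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Read each referenced file on the executors and pair it with its path
val contents = pathStream.mapPartitions { paths =>
  // Created on the executor; picks up the cluster's Hadoop configuration from the classpath
  val fs = FileSystem.get(new Configuration())
  paths.map { p =>
    val in = fs.open(new Path(p))
    val text = try scala.io.Source.fromInputStream(in).mkString finally in.close()
    (p, text)
  }
}

Note that this reads each file whole on a single executor; if individual files are large enough to need a parallel read, the usual route is to collect the paths on the driver and call sc.textFile per file (or wholeTextFiles on a directory) from the driver.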
You can't delete rows using Apache Spark, but if you use Spark as an OLAP engine to run SQL queries, you could also check Apache CarbonData (incubating). It provides support for updating and deleting records and is built on top of Spark.
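For illustration only, a minimal sketch assuming a CarbonData-backed table (the table and column names are made up) and a SparkSession already set up for CarbonData, whose DML support then lets you delete through Spark SQL:

// Assumes "events" is a CarbonData table registered with this SparkSession
spark.sql("DELETE FROM events WHERE event_id = 42")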