How to create a custom streaming data source? - apache-spark

I have a custom reader for Spark Streaming that reads data from a WebSocket. I'm going to try Spark Structured Streaming.
How do I create a streaming data source in Spark Structured Streaming?

As Spark is moving to the V2 API, you now have to implement DataSourceV2, MicroBatchReadSupport, and DataSourceRegister.
This will involve creating your own implementation of Offset, MicroBatchReader, DataReader<Row>, and DataReaderFactory<Row>.
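To give a feel for how those pieces fit together, here is a rough skeleton against the Spark 2.3.x V2 interfaces (the exact packages and method names shifted between 2.3 and 2.4, so check the javadoc for your version). Everything except the Spark types is an invented name (WebSocketSourceProvider, WebSocketOffset, and so on), and the WebSocket-specific bodies are left unimplemented:

import java.util.Optional

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}
import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset}
import org.apache.spark.sql.types.StructType

// Entry point: V2 marker interface + micro-batch support + short-name registration.
class WebSocketSourceProvider extends DataSourceV2
    with MicroBatchReadSupport with DataSourceRegister {

  override def shortName(): String = "websocket"

  override def createMicroBatchReader(
      schema: Optional[StructType],
      checkpointLocation: String,
      options: DataSourceOptions): MicroBatchReader =
    new WebSocketMicroBatchReader(options)
}

// Offsets only need to serialize to JSON and back.
class WebSocketOffset(val position: Long) extends Offset {
  override def json(): String = position.toString
}

class WebSocketMicroBatchReader(options: DataSourceOptions) extends MicroBatchReader {
  override def readSchema(): StructType = ???                    // schema of your messages
  override def setOffsetRange(start: Optional[Offset], end: Optional[Offset]): Unit = ???
  override def getStartOffset: Offset = ???
  override def getEndOffset: Offset = ???
  override def deserializeOffset(json: String): Offset = new WebSocketOffset(json.toLong)
  override def commit(end: Offset): Unit = ???                   // drop buffered data up to `end`
  override def stop(): Unit = ???                                // close the WebSocket connection
  override def createDataReaderFactories(): java.util.List[DataReaderFactory[Row]] = ???
}

class WebSocketDataReaderFactory(/* a slice of buffered messages */) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = ???         // DataReader[Row]: next() / get() / close()
}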
There are some examples of custom Structured Streaming sources online (in Scala) which were helpful to me in writing mine.
Once you've implemented your custom source, you can follow Jacek Laskowski's answer in registering the source.
Also, depending on the encoding of the messages you receive from the socket, you may be able to use the default socket source together with a custom map function that parses the information into whatever Beans you'll be using. Note, though, that Spark says the default socket streaming source shouldn't be used in production!
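For example, a quick prototyping sketch of that approach, where the host, port and the JSON message layout are assumptions:

// Prototype only: the built-in "socket" source is not meant for production.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder().appName("socket-prototype").getOrCreate()
import spark.implicits._

// Assumed message layout: one JSON object per line, e.g. {"user":"a","text":"b","ts":"..."}
val messageSchema = new StructType()
  .add("user", StringType)
  .add("text", StringType)
  .add("ts", TimestampType)

val parsed = spark.readStream
  .format("socket")
  .option("host", "localhost")   // assumed host
  .option("port", "9999")        // assumed port
  .load()                        // a single column: value: String
  .select(from_json($"value", messageSchema).as("msg"))
  .select("msg.*")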
Hope this helps!

A streaming data source implements org.apache.spark.sql.execution.streaming.Source.
The scaladoc of org.apache.spark.sql.execution.streaming.Source should give you enough information to get started (just follow the types to develop a compilable Scala type).
Once you have the Source you have to register it so you can use it in the format of a DataStreamReader. The trick to making the streaming source available under a short name for format is to create a DataSourceRegister for it. You can find examples in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.json.JsonFileFormat
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.text.TextFileFormat
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.TextSocketSourceProvider
org.apache.spark.sql.execution.streaming.RateSourceProvider
That's the file that links the short name in format to the implementation.
What I usually recommend people do during my Spark workshops is to start development from both sides:
Write the streaming query (with format), e.g.
val input = spark
  .readStream
  .format("yourCustomSource") // <-- your custom source here
  .load
Implement the streaming Source and a corresponding DataSourceRegister (it could be the same class); a rough sketch follows at the end of this answer
(optional) Register the DataSourceRegister by writing the fully-qualified class name, say com.mycompany.spark.MyDataSourceRegister, to META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:
$ cat META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
com.mycompany.spark.MyDataSourceRegister
The last step, where you register the DataSourceRegister implementation for your custom Source, is optional; it only registers the data source alias that your end users use in the DataFrameReader.format method.
format(source: String): DataFrameReader
Specifies the input data source format.
Review the code of org.apache.spark.sql.execution.streaming.RateSourceProvider for a good head start.
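To make step 2 a bit more concrete, here is a hedged sketch in which the register class is also the provider that creates the Source (via org.apache.spark.sql.sources.StreamSourceProvider, which is what RateSourceProvider mixes in). MyDataSourceRegister and MySource are made-up names, and the connection details are left out:

package com.mycompany.spark

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.types.{StringType, StructType}

class MyDataSourceRegister extends DataSourceRegister with StreamSourceProvider {

  // The alias your users pass to .format(...)
  override def shortName(): String = "yourCustomSource"

  private val mySchema = new StructType().add("value", StringType)

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (shortName(), mySchema)

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    new MySource(sqlContext, mySchema)
}

class MySource(sqlContext: SQLContext, override val schema: StructType) extends Source {
  override def getOffset: Option[Offset] = ???                               // highest offset buffered so far
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ??? // the rows in (start, end]
  override def stop(): Unit = ???                                            // release connections, threads, ...
}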

As Spark 3.0 introduced some major changes to the data source API, here is an updated version:
A class named DefaultSource extending TableProvider is the entry point for the API. Its getTable method returns a table class extending SupportsRead. This class has to provide a ScanBuilder as well as define the source's capabilities, in this case TableCapability.MICRO_BATCH_READ.
The ScanBuilder creates a class extending Scan that has to implement the toMicroBatchStream method (for a non-streaming use case we would implement the toBatch method instead). toMicroBatchStream then returns a class extending MicroBatchStream, which implements the logic of what data is available and how to partition it (docs).
Now the only thing left is a PartitionReaderFactory that creates a PartitionReader responsible for actually reading a partition of the data with get returning the rows one by one. You can use InternalRow.fromSeq(List(1,2,3)) to convert the data to an InternalRow.
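To tie those steps together, here is a condensed, hedged sketch of the whole Spark 3.x chain; every class name other than the Spark interfaces is invented, and the offset and partitioning logic is left out:

import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, Offset}
import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap
import org.apache.spark.unsafe.types.UTF8String

// format("com.example.mysource") resolves to com.example.mysource.DefaultSource,
// or you can register a short alias via DataSourceRegister as before.
class DefaultSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("value", StringType)

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new MyStreamingTable(schema)
}

class MyStreamingTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "my-streaming-table"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.Collections.singleton(TableCapability.MICRO_BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new MyScanBuilder(tableSchema)
}

class MyScanBuilder(tableSchema: StructType) extends ScanBuilder {
  override def build(): Scan = new MyScan(tableSchema)
}

class MyScan(tableSchema: StructType) extends Scan {
  override def readSchema(): StructType = tableSchema
  override def toMicroBatchStream(checkpointLocation: String): MicroBatchStream =
    new MyMicroBatchStream
}

class MyMicroBatchStream extends MicroBatchStream {
  override def initialOffset(): Offset = ???        // where a fresh query starts
  override def latestOffset(): Offset = ???         // newest data seen so far
  override def deserializeOffset(json: String): Offset = ???
  override def planInputPartitions(start: Offset, end: Offset): Array[InputPartition] = ???
  override def createReaderFactory(): PartitionReaderFactory = new MyReaderFactory
  override def commit(end: Offset): Unit = ()
  override def stop(): Unit = ()
}

class MyReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    new MyPartitionReader(partition)
}

class MyPartitionReader(partition: InputPartition) extends PartitionReader[InternalRow] {
  // Placeholder data; string columns go into InternalRow as UTF8String.
  private val rows = Iterator(InternalRow.fromSeq(Seq(UTF8String.fromString("hello"))))
  private var current: InternalRow = _

  override def next(): Boolean =
    if (rows.hasNext) { current = rows.next(); true } else false

  override def get(): InternalRow = current
  override def close(): Unit = ()
}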
I created a minimal example project: here

Also, here is a sample implementation of a custom WebSocket stream reader/writer which implements Offset, MicroBatchReader, DataReader<Row>, and DataReaderFactory<Row>.

Related

How programmatically via GeoMesa/Spark can I read a shapefile?

I am going through the documentation https://www.geomesa.org/documentation/user/convert/shp.html but I cannot find a way to read shapefiles (in my case stored on S3) using GeoMesa/Spark. Any idea?
There are three broad options.
GeoMesa loads data into Spark via 'RDD Providers'. The converters you linked to can be used in Spark via the Converter RDD Provider (https://www.geomesa.org/documentation/user/spark/providers.html#converter-rdd-provider). This may just work.
There is also a GeoTools DataStore RDD Provider implementation (https://www.geomesa.org/documentation/user/spark/providers.html#geotools-rdd-provider). That could be used with the GeoTools ShapefileDataStore (https://docs.geotools.org/stable/userguide/library/data/shape.html). The work here is to line up the correct jars and parameters.
If you are fine with using the GeoTools Shapefile DataStore, you could use that directly in Spark to load features into memory and then sort out how to make an RDD/DataFrame. (This skips the RDD Provider bits entirely.)
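For that third option, a rough sketch of reading the shapefile with plain GeoTools and handing the attributes to Spark might look like the following; the path, the NAME attribute and the WKT geometry handling are assumptions, and for S3 you would first copy the .shp/.shx/.dbf files to local disk since the shapefile store works on files:

import java.io.File

import org.apache.spark.sql.SparkSession
import org.geotools.data.FileDataStoreFinder
import org.locationtech.jts.geom.Geometry

import scala.collection.mutable.ArrayBuffer

val spark = SparkSession.builder().appName("shapefile-to-df").getOrCreate()
import spark.implicits._

val store = FileDataStoreFinder.getDataStore(new File("/tmp/my_layer.shp"))
val features = store.getFeatureSource.getFeatures.features()

// Pull (geometry as WKT, name attribute) pairs into the driver, then parallelize.
val rows = ArrayBuffer.empty[(String, String)]
try {
  while (features.hasNext) {
    val f = features.next()
    val geom = f.getDefaultGeometry.asInstanceOf[Geometry]
    rows += ((geom.toText, String.valueOf(f.getAttribute("NAME"))))
  }
} finally {
  features.close()
  store.dispose()
}

val df = spark.sparkContext.parallelize(rows).toDF("wkt", "name")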

Data source on GCP BigQuery

I have looked for any existing Intake components, such as a driver or plugin, that support GCP BigQuery. If none exists, please advise on how to subclass intake.source.base.DataSource.
Pandas can read from BigQuery with the function read_gbq. If you are only interested in reading whole results in a single shot, then this is all you need. You would need to do something like the sql source, which calls pandas to load the data in its _get_schema method.
There is currently no GBQ reader for Dask, so you cannot load out-of-core or in parallel, but see the discussion in this thread.

How to use a Twitter stream source in Hazelcast Jet without needing a DAG?

I want to do simple analysis on a live stream of tweets.
How do you use a Twitter stream source in Hazelcast Jet without needing a DAG?
Details
The Twitter API is nicely encapsulated in StreamTwitterP.java.
However, the caller uses it as part of a DAG:
Vertex twitterSource = dag.newVertex("twitter",
    StreamTwitterP.streamTwitterP(properties, terms));
My use case doesn't need the power of DAG, so I'd rather avoid that needless extra complexity.
To avoid a DAG, I'm looking to use SourceBuilder to define a new data source for live stream of tweets.
I assume that would involve code similar to StreamTwitterP.java mentioned above; however, it's not clear to me how that fits into the Hazelcast Jet API.
I was referring to SourceBuilder example from the docs.
You can convert a processor to a pipeline source:
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<String>streamFromProcessor("twitter",
        streamTwitterP(properties, terms)))
 ...
There's also a twitterSource version that uses SourceBuilder here.

How to create an HBase sequencefile key in Spark for loading to Bigtable?

I want to be able to easily create test data files that I can save and re-load into a dev Bigtable instance at will, and pass to other members of my team so they can do the same. The suggested way of using Dataflow to Bigtable seems ridiculously heavy-weight (anyone loading a new type of data--not for production purposes, even just playing around with Bigtable for the first time--needs to know Apache Beam, Dataflow, Java, and Maven??--that's potentially going to limit Bigtable adoption for my team) and my data isn't already in HBase so I can't just export a sequencefile.
However, per this document, it seems like the sequencefile key for HBase should be constructible in regular Java/Scala/Python code:
The HBase Key consists of: the row key, column family, column qualifier, timestamp and a type.
It just doesn't go into enough detail for me to actually do it. What delimiters exist between the different parts of the key? (This is my main question).
From there, Spark at least has a method to write a sequencefile so I think I should be able to create the files I want as long as I can construct the keys.
I'm aware that there's an alternative (described in this answer, whose example link is broken) that would involve writing a script to spin up a Dataproc cluster, push a TSV file there, and use HBase ImportTsv to push the data to Bigtable. This also seems overly heavy-weight to me but maybe I'm just not used to the cloud world yet.
The sequence file solution is meant for situations where large sets of data need to be imported into and/or exported from Cloud Bigtable. If your file is small enough, then create a script that creates a table, reads from a file, and uses a BufferedMutator (or batch writes in your favorite language) to write to Cloud Bigtable.
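For illustration, a hedged sketch of such a script using the Cloud Bigtable HBase client's BufferedMutator; the project, instance, table, column family and the two-column TSV layout are all assumptions:

import scala.io.Source

import com.google.cloud.bigtable.hbase.BigtableConfiguration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

object LoadTestData {
  def main(args: Array[String]): Unit = {
    val connection = BigtableConfiguration.connect("my-project", "my-instance")
    val mutator = connection.getBufferedMutator(TableName.valueOf("test-table"))
    try {
      // Assumed file layout: one mutation per line, "<rowKey>\t<value>"
      for (line <- Source.fromFile("test-data.tsv").getLines()) {
        val Array(rowKey, value) = line.split("\t", 2)
        mutator.mutate(
          new Put(Bytes.toBytes(rowKey))
            .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(value)))
      }
    } finally {
      mutator.close()     // flushes any pending mutations
      connection.close()
    }
  }
}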

What is the simplest way to write to Kafka from a Spark stream?

I would like to write data from a Spark stream to Kafka.
I know that I can use KafkaUtils to read from Kafka.
But KafkaUtils doesn't provide an API to write to Kafka.
I checked a past question and its sample code.
Is that sample code the simplest way to write to Kafka?
If I adopt the approach from that sample, I must create many classes...
Do you know a simpler way, or a library that helps with writing to Kafka?
Have a look here:
Basically, this blog post summarises your possibilities, which are described in different variations in the link you provided.
If we look at your task directly, we can make several assumptions:
Your output data is divided to several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using standard Kafka Producer API
You don't want to pass data between machines before the actual sending to Kafka
Given those assumptions, your set of solutions is pretty limited: you either have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of factory / sink, but the essential operation remains the same: you'll still request a producer object for each partition and use it to send the partition's records.
I suggest you continue with one of the examples in the provided link; the code is pretty short, and any library you find would most probably do the exact same thing behind the scenes.
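For reference, a minimal sketch of the producer-per-partition approach; the broker address, topic name and String serialization are assumptions, and a factory / sink wrapper would reuse producers instead of recreating them for every batch:

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

def writeToKafka(stream: DStream[String]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One producer per partition, created on the executor that holds the partition.
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try {
        records.foreach(r => producer.send(new ProducerRecord[String, String]("my-topic", r)))
      } finally {
        producer.close()
      }
    }
  }
}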
