Data source on GCP BigQuery - intake

I have looked for an existing intake component (a driver or plugin) that supports GCP BigQuery. If none exists, please advise on how to implement one by subclassing intake.source.base.DataSource.

Pandas can read from BigQuery with the function read_gbq. If you are only interested in reading whole results in a single shot, then this is all you need. You would need to do something like the sql source, which calls pandas to load the data in the _get_schema method.
There is currently no GBQ reader for Dask, so you cannot load out-of-core or in parallel, but see the discussion in this thread.

Related

How can I programmatically read a shapefile via GeoMesa/Spark?

I am going through the documentation https://www.geomesa.org/documentation/user/convert/shp.html but I cannot find a way to read shapefiles (in my case stored on S3) using GeoMesa/Spark. Any idea?
There are three broad options.
GeoMesa loads data into Spark via 'RDD Providers'. The converters you linked to can be used in Spark via the Converter RDD Provider (https://www.geomesa.org/documentation/user/spark/providers.html#converter-rdd-provider). This may just work.
There is also a GeoTools DataStore RDD Provider implementation (https://www.geomesa.org/documentation/user/spark/providers.html#geotools-rdd-provider). That could be used with the GeoTools ShapefileDataStore (https://docs.geotools.org/stable/userguide/library/data/shape.html). The work here is to line up the correct jars and parameters.
If you are fine with using the GeoTools Shapefile DataStore, you could use that directly in Spark to load features into memory and then sort out how to make an RDD/DataFrame. (This essentially skips the RDD Provider bits.)
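If you go with the third option, a minimal sketch in Scala might look like the following. It assumes the gt-shapefile jar is on the classpath and that the shapefile has already been pulled down from S3 to local disk; the file path and column names are made up for illustration.
import java.io.File
import org.apache.spark.sql.SparkSession
import org.geotools.data.FileDataStoreFinder

// Hypothetical path: assumes the shapefile was first copied from S3 to local disk.
val shp = new File("/tmp/my_layer.shp")
val store = FileDataStoreFinder.getDataStore(shp)
val features = store.getFeatureSource.getFeatures.features()

// Pull each feature's id and geometry (as WKT) into plain Scala values.
val rows = scala.collection.mutable.ArrayBuffer.empty[(String, String)]
try {
  while (features.hasNext) {
    val f = features.next()
    rows += ((f.getID, String.valueOf(f.getDefaultGeometry)))
  }
} finally {
  features.close()
  store.dispose()
}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = rows.toSeq.toDF("feature_id", "geometry_wkt")
Everything is read into driver memory here, which is exactly the trade-off of skipping the RDD Provider route.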

How to use a Twitter stream source in Hazelcast Jet without needing a DAG?

I want to do simple analysis on a live stream of tweets.
How do you use a Twitter stream source in Hazelcast Jet without needing a DAG?
Details
The Twitter API is nicely encapsulated in StreamTwitterP.java.
However, the caller uses it as part of a DAG:
Vertex twitterSource =
    dag.newVertex("twitter", StreamTwitterP.streamTwitterP(properties, terms));
My use case doesn't need the power of a DAG, so I'd rather avoid that needless extra complexity.
To avoid a DAG, I'm looking to use SourceBuilder to define a new data source for a live stream of tweets.
I assume that would involve code similar to StreamTwitterP.java, mentioned above, but it's not clear to me how that fits into the Hazelcast Jet API.
I was referring to the SourceBuilder example from the docs.
You can convert a processor to a pipeline source:
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<String>streamFromProcessor("twitter",
    streamTwitterP(properties, terms)))
...
There's also a twitterSource version that uses SourceBuilder here.

How to create an hbase sequencefile key in Spark for load to Bigtable?

I want to be able to easily create test data files that I can save and re-load into a dev Bigtable instance at will, and pass to other members of my team so they can do the same. The suggested way of loading data into Bigtable via Dataflow seems ridiculously heavyweight (anyone loading a new type of data--not for production purposes, even just playing around with Bigtable for the first time--needs to know Apache Beam, Dataflow, Java, and Maven?? That's potentially going to limit Bigtable adoption for my team), and my data isn't already in HBase, so I can't just export a sequencefile.
However, per this document, it seems like the sequencefile key for HBase should be constructible in regular Java/Scala/Python code:
The HBase Key consists of: the row key, column family, column qualifier, timestamp and a type.
It just doesn't go into enough detail for me to actually do it. What delimiters exist between the different parts of the key? (This is my main question).
From there, Spark at least has a method to write a sequencefile so I think I should be able to create the files I want as long as I can construct the keys.
I'm aware that there's an alternative (described in this answer, whose example link is broken) that would involve writing a script to spin up a Dataproc cluster, push a TSV file there, and use HBase ImportTsv to push the data to Bigtable. This also seems overly heavyweight to me, but maybe I'm just not used to the cloud world yet.
The sequence file solution is meant for situations where large sets of data need to be imported into and/or exported from Cloud Bigtable. If your file is small enough, then create a script that creates a table, reads from a file, and uses a BufferedMutator (or batch writes in your favorite language) to write to Cloud Bigtable.
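For example, here is a small sketch along those lines using the Cloud Bigtable HBase client's BufferedMutator from Scala. The project, instance, table, column family, and file path below are placeholders, and the table and column family are assumed to already exist.
import com.google.cloud.bigtable.hbase.BigtableConfiguration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical project/instance/table names.
val connection = BigtableConfiguration.connect("my-project", "my-instance")
val mutator = connection.getBufferedMutator(TableName.valueOf("my-test-table"))

try {
  // One Put per line of a simple "rowkey<TAB>value" test file.
  scala.io.Source.fromFile("/tmp/test-data.tsv").getLines().foreach { line =>
    val Array(rowKey, value) = line.split("\t", 2)
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
    mutator.mutate(put)
  }
} finally {
  mutator.close()    // flushes any buffered mutations
  connection.close()
}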

How to create a custom streaming data source?

I have a custom reader for Spark Streaming that reads data from a WebSocket. I'm going to try Spark Structured Streaming.
How to create a streaming data source in Spark Structured Streaming?
As Spark is moving to the V2 API, you now have to implement DataSourceV2, MicroBatchReadSupport, and DataSourceRegister.
This will involve creating your own implementation of Offset, MicroBatchReader, DataReader<Row>, and DataReaderFactory<Row>.
There are some examples of custom structured streaming sources online (in Scala) which were helpful to me in writing mine.
Once you've implemented your custom source, you can follow Jacek Laskowski's answer in registering the source.
Also, depending on the encoding of messages you'll receive from the socket, you may be able to just use the default socket source and use a custom map function to parse the information into whatever Beans you'll be using. Although do note that Spark says that the default socket streaming source shouldn't be used in production!
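For illustration, here is a rough sketch of that approach; the host, port, and comma-separated message format are assumptions, not anything Spark prescribes.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("socket-parse-demo").getOrCreate()
import spark.implicits._

// Assumes each line arriving on the socket looks like "id,payload".
case class Message(id: String, payload: String)

val messages = spark.readStream
  .format("socket")            // built-in source, not for production use
  .option("host", "localhost")
  .option("port", "9999")
  .load()                      // one string column named "value"
  .as[String]
  .map { line =>
    val Array(id, payload) = line.split(",", 2)
    Message(id, payload)
  }

val query = messages.writeStream.format("console").start()
query.awaitTermination()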
Hope this helps!
A streaming data source implements org.apache.spark.sql.execution.streaming.Source.
The scaladoc of org.apache.spark.sql.execution.streaming.Source should give you enough information to get started (just follow the types to develop a compilable Scala type).
Once you have the Source, you have to register it so you can use it in the format of a DataStreamReader. You do that by creating a DataSourceRegister for the streaming source. You can find examples in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.json.JsonFileFormat
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.text.TextFileFormat
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.TextSocketSourceProvider
org.apache.spark.sql.execution.streaming.RateSourceProvider
That's the file that links the short name in format to the implementation.
What I usually recommend during my Spark workshops is to start development from both sides:
Write the streaming query (with format), e.g.
val input = spark
  .readStream
  .format("yourCustomSource") // <-- your custom source here
  .load
Implement the streaming Source and a corresponding DataSourceRegister (it could be the same class)
(optional) Register the DataSourceRegister by writing the fully-qualified class name, say com.mycompany.spark.MyDataSourceRegister, to META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:
$ cat META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
com.mycompany.spark.MyDataSourceRegister
The last step, where you register the DataSourceRegister implementation for your custom Source, is optional and only serves to register the data source alias that your end users use in the DataFrameReader.format method.
format(source: String): DataFrameReader
Specifies the input data source format.
Review the code of org.apache.spark.sql.execution.streaming.RateSourceProvider for a good head start.
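As a rough, compile-level sketch of what that can look like in Spark 2.x, one class can be both the DataSourceRegister and the provider for the Source. StreamSourceProvider (in org.apache.spark.sql.sources, not mentioned above) is one way to wire the two together; the class name, alias, and schema below are hypothetical, and the ??? stubs are where your source's real logic goes. Keep in mind that Source and Offset live in an internal package, so this is not a stable API.
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class MyDataSourceRegister extends DataSourceRegister with StreamSourceProvider {

  private val mySchema = StructType(StructField("value", StringType) :: Nil)

  // The alias end users pass to DataStreamReader.format(...)
  override def shortName(): String = "yourCustomSource"

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (shortName(), mySchema)

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    new Source {
      override def schema: StructType = mySchema
      override def getOffset: Option[Offset] = ???                              // latest offset seen so far
      override def getBatch(start: Option[Offset], end: Offset): DataFrame = ??? // data in (start, end]
      override def stop(): Unit = ()
    }
}
Writing this class's fully-qualified name into the META-INF/services file shown above is what makes the "yourCustomSource" alias resolvable.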
As Spark 3.0 introduced some major changes to the data source API, here is an updated version:
A class named DefaultSource extending TableProvider is the entry point for the API. Its getTable method returns a table class extending SupportsRead. This class has to provide a ScanBuilder as well as define the source's capabilities, in this case TableCapability.MICRO_BATCH_READ.
The ScanBuilder creates a class extending Scan that has to implement the toMicroBatchStream method (for a non-streaming use case we would implement the toBatch method instead). toMicroBatchStream returns a class extending MicroBatchStream, which implements the logic of what data is available and how to partition it (docs).
The only thing left is a PartitionReaderFactory that creates a PartitionReader, which is responsible for actually reading a partition of the data, with get returning the rows one by one. You can use InternalRow.fromSeq(List(1,2,3)) to convert the data to an InternalRow.
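To make the shape of that API concrete, here is a compile-level skeleton of the classes just described. All names other than the Spark interfaces are hypothetical, the schema is a stand-in, and the ??? stubs mark where your source's real logic goes.
package com.example.mysource

import java.util
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, Offset}
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Entry point: spark.readStream.format("com.example.mysource") resolves to this DefaultSource class.
class DefaultSource extends TableProvider {
  private val mySchema = StructType(StructField("value", LongType) :: Nil)

  override def inferSchema(options: CaseInsensitiveStringMap): StructType = mySchema

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new MyTable(schema)
}

class MyTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "my_streaming_table"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.Collections.singleton(TableCapability.MICRO_BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder {
      override def build(): Scan = new MyScan(tableSchema)
    }
}

class MyScan(tableSchema: StructType) extends Scan {
  override def readSchema(): StructType = tableSchema
  override def toMicroBatchStream(checkpointLocation: String): MicroBatchStream =
    new MyMicroBatchStream
}

case class MyOffset(value: Long) extends Offset {
  override def json(): String = value.toString
}

class MyMicroBatchStream extends MicroBatchStream {
  override def initialOffset(): Offset = MyOffset(0L)
  override def latestOffset(): Offset = ???                 // how far the source has progressed
  override def deserializeOffset(json: String): Offset = MyOffset(json.toLong)
  override def planInputPartitions(start: Offset, end: Offset): Array[InputPartition] = ???
  override def createReaderFactory(): PartitionReaderFactory = new MyReaderFactory
  override def commit(end: Offset): Unit = ()
  override def stop(): Unit = ()
}

class MyReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    new PartitionReader[InternalRow] {
      override def next(): Boolean = ???                    // is another row available?
      override def get(): InternalRow = ???                 // e.g. InternalRow.fromSeq(Seq(1L))
      override def close(): Unit = ()
    }
}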
I created a minimal example project: here
Also, here is a sample implementation of a custom WebSocket stream reader/writer which implements Offset, MicroBatchReader, DataReader<Row>, and DataReaderFactory<Row>.

How to write a Spark data frame to a Neo4j database

I'd like to build this workflow:
preprocess some data with Spark, ending with a data frame
write such dataframe to Neo4j as a set of nodes
My idea is really basic: write each row of the df as a node, where each column value represents the value of a node attribute.
I have seen many articles, including neo4j-spark-connector and Introducing the Neo4j 3.0 Apache Spark Connector, but they all focus on importing data from a Neo4j db into Spark; so far, I haven't been able to find a clear example of writing a Spark data frame to a Neo4j database.
Any pointers to documentation or very basic examples are much appreciated.
Reading this issue answered my question.
Long story short, neo4j-spark-connector can write Spark data to a Neo4j db, and yes, there is a gap in the documentation of the new release.
You can write a routine and use the open-source Neo4j Java driver
https://github.com/neo4j/neo4j-java-driver
for example.
Simply serialize the result of an RDD (using rdd.toJson) and then use the above driver to create your Neo4j nodes and push them into your Neo4j instance.
I know the question is pretty old, but I don't think the neo4j-spark-connector can solve your issue. The full story, sample code, and details are available here, but to cut a long story short, if you look carefully at the Neo4jDataFrame.mergeEdgeList example (which has been suggested), you'll notice that what it does is instantiate a driver for each row in the dataframe. That will work in a unit test with 10 rows, but you can't expect it to work in a real-world scenario with millions or billions of rows. Besides, there are other defects explained in the link above, where you can find a CSV-based solution. Hope it helps.
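As a sketch of that idea with one driver per partition rather than per row (addressing the concern above), something along these lines could work. The Bolt URI, credentials, and node label are placeholders, the 4.x Java driver package names are assumed, and column values are assumed to be simple types Neo4j can store as properties.
import org.apache.spark.sql.DataFrame
import org.neo4j.driver.{AuthTokens, GraphDatabase}

def writeToNeo4j(df: DataFrame): Unit = {
  val columns = df.columns
  df.rdd.foreachPartition { rows =>
    // One driver/session per partition, not one per row.
    val driver  = GraphDatabase.driver("bolt://localhost:7687",
      AuthTokens.basic("neo4j", "password"))
    val session = driver.session()
    try {
      rows.foreach { row =>
        val props = new java.util.HashMap[String, Object]()
        columns.foreach(c => props.put(c, row.getAs[Object](c)))
        // One node per dataframe row; every column becomes a node property.
        session.run("CREATE (n:Record) SET n += $props",
          java.util.Collections.singletonMap[String, Object]("props", props))
      }
    } finally {
      session.close()
      driver.close()
    }
  }
}
For larger volumes you would probably also batch rows per transaction (for example with a Cypher UNWIND over a list of property maps) rather than issuing one CREATE per row.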

Resources