How can I get kafka schema registry in Pyspark? - apache-spark

I am looking for the relevant library for PySpark to fetch a topic's schema from the Kafka Schema Registry and decode the data. Does anyone know which code/library to use, or how to convert the existing Scala code to PySpark?

You can use the requests package to call the Schema Registry REST API and get the schema of your topic. If you are listening to a few specific topics, you can also cache their schemas on the Spark side and reuse them.
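A minimal sketch of that approach, assuming a Confluent-style Schema Registry at http://schema-registry:8081, a topic named my-topic, producers that use the Confluent wire format (1 magic byte + 4-byte schema id before the Avro body), an existing SparkSession named spark with the spark-sql-kafka package available, and fastavro for decoding; all of these names and endpoints are placeholders, not the asker's setup:

```python
import io
import json

import requests
import fastavro
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

SCHEMA_REGISTRY = "http://schema-registry:8081"   # assumed registry endpoint
TOPIC = "my-topic"                                # assumed topic name

# GET /subjects/<topic>-value/versions/latest returns the registered Avro schema
# as a JSON string under the "schema" key.
resp = requests.get(f"{SCHEMA_REGISTRY}/subjects/{TOPIC}-value/versions/latest")
resp.raise_for_status()
schema = fastavro.parse_schema(json.loads(resp.json()["schema"]))

def decode_confluent_avro(raw: bytes) -> str:
    # Confluent wire format: 1 magic byte + 4-byte schema id, then the Avro body.
    payload = io.BytesIO(raw[5:])
    record = fastavro.schemaless_reader(payload, schema)
    return json.dumps(record)

decode_udf = udf(decode_confluent_avro, StringType())

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker address
      .option("subscribe", TOPIC)
      .load()
      .select(decode_udf("value").alias("json_value")))
```

Because the schema is fetched once on the driver and captured by the UDF, it is effectively cached for the lifetime of the job; a production setup would look the schema id up per message instead of assuming a single version.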

PySpark can import and use any JVM Spark class, so any Scala or Java examples you find should just work:
Running custom Java class in PySpark
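To illustrate what that answer means, here is a small sketch of reaching into the driver JVM from PySpark via the py4j gateway; java.util.UUID is just a stand-in class, and any class on the driver's classpath (for example, one added with --jars) can be reached the same way:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jvm-access").getOrCreate()

jvm = spark._jvm                        # py4j view onto the driver JVM
uuid = jvm.java.util.UUID.randomUUID()  # call a JVM class directly
print(uuid.toString())
```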

Related

How to programmatically create a Topic in Kafka through spark structured streaming

I want to create multiple Kafka topics at run time in my Spark Structured Streaming application. I found that there are various methods available in the Java API, but I couldn't find any in Spark Structured Streaming.
Please let me know whether there is a way to do this, or whether I need to use the Java library.
My Apache Spark version is 2.4.4 and the Kafka library dependency is spark-sql-kafka-0-10_2.12.
AFAIK, Spark doesn't create topics.
You can use the same Java APIs you've found before initializing your SparkSession
spark-sql-kafka includes kafka-clients, so you have the AdminClient class available
How to create a Topic in Kafka through Java
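The answer above refers to the JVM AdminClient that ships with kafka-clients; as a rough Python equivalent, this sketch uses the confluent-kafka package's AdminClient (an assumption — it is not pulled in by spark-sql-kafka), run before the streaming query starts; the broker address and topic names are placeholders:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker:9092"})  # assumed broker address

# Request creation of the topics the streaming job will write to.
futures = admin.create_topics([
    NewTopic("events-a", num_partitions=3, replication_factor=1),
    NewTopic("events-b", num_partitions=3, replication_factor=1),
])
for topic, fut in futures.items():
    fut.result()  # raises if creation failed (e.g. the topic already exists)
    print(f"created {topic}")
```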

How to connect Spark Streaming to standalone Solr on windows?

I want to integrate Spark Streaming with standalone Solr. I am using Spark 1.6.1 and Solr 5.2 standalone on Windows with no ZooKeeper configuration. I was able to find some solutions where they connect to Solr from Spark by passing the ZooKeeper config.
How can I connect my Spark program to standalone Solr?
Please see if this example is helpful: http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
Following that example, you will need to write your own Connection class that wraps an HttpSolrClient or ConcurrentUpdateSolrClient object. You also need to write your own ConnectionPool class that implements a pool of your Connection objects (or, if the client is thread-safe, just returns the same singleton object).
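A hedged Python analogue of that foreachRDD pattern, with the pysolr client standing in for SolrJ's HttpSolrClient; the Solr URL, core name, and document fields are made-up placeholders:

```python
import pysolr

SOLR_URL = "http://localhost:8983/solr/mycore"  # assumed standalone Solr core

def send_partition(records):
    # One client per partition instead of per record, as the guide recommends;
    # a real ConnectionPool would reuse clients across batches.
    solr = pysolr.Solr(SOLR_URL, timeout=10)
    docs = [{"id": key, "text_t": value} for key, value in records]
    if docs:
        solr.add(docs)

def send_rdd(rdd):
    rdd.foreachPartition(send_partition)

# In the streaming job:
# dstream.foreachRDD(send_rdd)
```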

Source API in Spark 2.0+

I would like to write a data source using the Spark source API. The examples and documentation I found on the internet were written on top of Spark 1.x using RDDs.
Is this still relevant for Spark 2.0+?
It is still relevant. RDD is a core data structure in Spark and it didn't change with Spark 2.0.

Cache Dataframe with Apache Ignite RDD using scala

I have a DataFrame that reads data from an Oracle DB. I need to cache this DataFrame to an Ignite shared RDD so it can be accessed across multiple sessions. I tried the solution posted in "How to cache Dataframe in Apache ignite". It looks like the loadCache() APIs are available in Java, but I am not able to find the loadCache() method in Scala when I import the libraries. Any info on this will help. Thanks, VM
The loadCache() method is on the IgniteCache API, not Ignite. Looks like there is a mistake in that response. All Java APIs can be used in Scala.

Loading Avro into BigQuery via the Spark Connector

From what I've seen in these examples, it's only doable via Gson. Is it possible to directly load Avro objects into a BigQuery table via the Spark Connector? Converting from Avro to BigQuery JSON becomes a pain when the Avro specification goes beyond simple primitive values (e.g. unions).
Cheers
Not through Spark Connector, but BigQuery supports loading AVRO files directly: https://cloud.google.com/bigquery/loading-data#loading_avro_files
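A sketch of that direct-load route using the google-cloud-bigquery Python client (an assumption, as the answer only links the docs); the bucket path and table id are placeholders, and BigQuery reads the Avro schema itself, so no Gson/JSON conversion is needed:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/events-*.avro",   # Avro files written e.g. by Spark
    "my-project.my_dataset.events",           # destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```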
