Cache DataFrame with Apache Ignite RDD using Scala - apache-spark

I have a DataFrame that reads data from an Oracle DB. I need to cache this DataFrame in an Ignite shared RDD so it can be accessed across multiple sessions. I tried the solution posted in "How to cache Dataframe in Apache ignite". It looks like the loadCache() APIs are available in Java, but I am not able to find the loadCache() method in Scala when I import the libraries. Any info on this will help. Thanks, VM

The loadCache() method is on the IgniteCache API, not on Ignite; it looks like there is a mistake in that response. All Java APIs can be used in Scala.
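For illustration, here is a minimal Scala sketch of calling that Java API directly. The config path, cache name, and key/value types are placeholders, and a CacheStore (for example a JDBC store pointing at the Oracle DB) is assumed to be configured on the cache:

    import org.apache.ignite.Ignition

    object LoadCacheFromScala {
      def main(args: Array[String]): Unit = {
        // Ignition and IgniteCache are plain Java APIs and are usable from Scala as-is.
        val ignite = Ignition.start("ignite-config.xml")                   // placeholder config
        val cache  = ignite.getOrCreateCache[Long, String]("oracleCache")  // placeholder cache name

        // loadCache() lives on IgniteCache, not on Ignite.
        // With a CacheStore configured, this pulls the backing data into the cluster;
        // passing null means "no filter".
        cache.loadCache(null)
      }
    }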

Related

Multiple tables to json with Spark

Can you explain what is the best way to export an extensive list of tables from various Oracle and SQL Server schemas in JSON format using Apache Spark? Can Spark handle multiple DataFrames in the same application?
Thanks!
Yes, you can. Assuming you have data in both SQL Server and an Oracle DB, create a connection to each and load the data into two DataFrames; after that you can use toJSON or similar functions to build whatever JSON structure the requirement calls for. In short, Spark can read from multiple different sources in a single application.
Examples of reading from various sources such as Oracle and PostgreSQL are easy to find with a web search.
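As a rough sketch of that flow (the JDBC URLs, credentials, table names, and output paths below are placeholders):

    import org.apache.spark.sql.SparkSession

    object MultiSourceToJson {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("multi-source-to-json").getOrCreate()

        // One DataFrame per source; both reads happen in the same application.
        val oracleDf = spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
          .option("dbtable", "SCHEMA1.CUSTOMERS")
          .option("user", "user").option("password", "password")
          .option("driver", "oracle.jdbc.OracleDriver")
          .load()

        val sqlServerDf = spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://mssql-host:1433;databaseName=sales")
          .option("dbtable", "dbo.ORDERS")
          .option("user", "user").option("password", "password")
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .load()

        // Each DataFrame can be written out as JSON (or reshaped first with toJSON).
        oracleDf.write.json("/out/customers_json")
        sqlServerDf.write.json("/out/orders_json")
      }
    }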

How can I get kafka schema registry in Pyspark?

I am looking for the right library for PySpark to fetch the schema from the Kafka schema registry and decode the data. Does anyone know of a library, or how to port the Scala code, to do this in PySpark?
You can use the requests package to call the schema registry's REST API and get the schema of your topic. If you are listening to specific topics, you can also cache their schemas on the Spark side and reuse them.
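The answer above uses Python's requests; the same REST call can also be sketched in Scala (the registry URL and topic are placeholders, and the subject name assumes the usual <topic>-value convention):

    import scala.io.Source

    object FetchSchemaSketch {
      def main(args: Array[String]): Unit = {
        val registryUrl = "http://schema-registry:8081"   // placeholder registry address
        val topic = "my-topic"                            // placeholder topic

        // Confluent-style endpoint: /subjects/<subject>/versions/latest
        val json = Source.fromURL(s"$registryUrl/subjects/$topic-value/versions/latest").mkString

        // The response JSON contains the schema string, which can be cached and reused
        // when decoding records read from Kafka.
        println(json)
      }
    }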
PySpark can import and use any JVM Spark class, so any Scala or Java examples you find should just work:
Running custom Java class in PySpark

Source API in Spark 2.0+

I would like to write a data source using the Spark source API. On the internet I found examples and documentation that were written on top of Spark 1.x using RDDs.
Is it still relevant for Spark 2.0+?
It is still relevant. RDD is a core data structure in Spark and it didn't change with Spark 2.0.
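As an illustration that the RDD-based sources API still applies in Spark 2.x, here is a minimal sketch of a custom relation; the package and class names are made up:

    package com.example.datasource

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    // Entry point Spark looks up when you call .format("com.example.datasource").
    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation =
        new ExampleRelation(sqlContext)
    }

    class ExampleRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
      override def schema: StructType = StructType(Seq(StructField("value", IntegerType)))

      // buildScan still returns an RDD[Row], which is why the RDD-based examples remain valid.
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(1 to 10).map(Row(_))
    }

It can then be loaded with spark.read.format("com.example.datasource").load().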

Is it possible to use an apache-ignite rdd implementation in pyspark?

I am using Apache Spark to run some Python data code via PySpark. I am running in Spark standalone mode with 7 nodes.
Is it possible to use the apache-ignite RDD implementation in this setup? Does it offer any advantage?
Many thanks
Yes, you can use Ignite in any Spark deployment. Please refer to the documentation to better understand the possible advantages: https://apacheignite-fs.readme.io/docs/ignite-for-spark
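For reference, the shared-RDD integration itself is exposed as a Scala/Java API (IgniteContext / IgniteRDD). A minimal Scala sketch of what it provides, with a placeholder config path and cache name:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.ignite.spark.IgniteContext

    object IgniteSharedRddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ignite-shared-rdd"))

        // IgniteContext wraps the SparkContext; the XML path is a placeholder.
        val ic = new IgniteContext(sc, "ignite-config.xml")

        // fromCache returns an IgniteRDD backed by an Ignite cache, so another Spark
        // application pointed at the same cluster sees the same data.
        val sharedRdd = ic.fromCache[Int, String]("sharedCache")
        sharedRdd.savePairs(sc.parallelize(1 to 100).map(i => (i, s"value-$i")))

        println(sharedRdd.count())
      }
    }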

Is it possible to run Shark queries over Spark Streaming data?

Is it possible to run Shark queries over the data contained in the DStreams of a Spark Streaming application? (for instance inside a foreachRDD call)
Are there any specific API to do that?
Thanks.
To answer my own question, in case someone is facing the same problem:
the direct answer to my question is NO, you cannot run Shark directly on Spark Streaming data.
Spark SQL is currently a valid alternative, at least it was for my needs.
It is included in Spark and doesn't require any extra configuration; you can have a look at it here: http://spark.apache.org/docs/latest/sql-programming-guide.html
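For completeness, a rough sketch of that Spark SQL alternative over a DStream's micro-batches, using current APIs (the host, port, and query are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Word(text: String)

    object SqlOverStreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("sql-over-stream")
        val ssc  = new StreamingContext(conf, Seconds(10))
        val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source

        lines.flatMap(_.split(" ")).foreachRDD { rdd =>
          // Convert each micro-batch RDD to a DataFrame and query it with Spark SQL.
          val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
          import spark.implicits._

          rdd.map(Word(_)).toDF().createOrReplaceTempView("words")
          spark.sql("SELECT text, COUNT(*) AS cnt FROM words GROUP BY text").show()
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }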
