Ignite for Spark - apache-spark

I am trying to load an Ignite cache into Spark, but I am getting java.sql.SQLException: Unsupported type 1111
SparkSession spark = SparkSession.builder()
    .appName("Java Spark SQL data sources example")
    .config("spark.master", "spark://10.104.146.199:7077")
    .getOrCreate();

Dataset<Row> df = spark.read().format("jdbc")
    .option("url", "jdbc:ignite:cfg://cache=DEVICE_CACHE:distributedJoins=false#file:///C:/Users/IBM_ADMIN/Desktop/ignite-client-config-with-timeout.xml")
    .option("driver", "org.apache.ignite.IgniteJdbcDriver")
    .option("dbtable", "DEVICE")
    .load();

df.count();
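SQL type 1111 is java.sql.Types.OTHER, so the error usually means one of the DEVICE columns is reported by the Ignite JDBC driver as a type Spark cannot map. One hedged workaround, assuming Spark 2.3+ and purely hypothetical column names, is to override the mapping for the offending columns with the JDBC reader's customSchema option; a minimal PySpark sketch of the same read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ignite-jdbc-sketch").getOrCreate()

# Sketch only: ID and NAME are placeholder column names; customSchema tells the
# JDBC reader which Spark SQL types to use instead of deriving them from the
# driver's metadata (which here reports the unsupported type OTHER / 1111).
df = (spark.read.format("jdbc")
      .option("url", "jdbc:ignite:cfg://cache=DEVICE_CACHE:distributedJoins=false#file:///C:/Users/IBM_ADMIN/Desktop/ignite-client-config-with-timeout.xml")
      .option("driver", "org.apache.ignite.IgniteJdbcDriver")
      .option("dbtable", "DEVICE")
      .option("customSchema", "ID STRING, NAME STRING")
      .load())

print(df.count())

If your Ignite version ships the native Spark integration (Ignite 2.3+), reading through format("ignite") instead of JDBC sidesteps the JDBC type mapping altogether.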

Related

Unable to Connect to Apache Spark Kafka Streaming Using SSL on EMR Notebooks

I'm trying to consume from an SSL-secured Kafka topic with Spark Structured Streaming, but I am having issues. I have a working consumer reading events with confluent_kafka using the following code:
conf = {
    'bootstrap.servers': 'url:port',
    'group.id': 'group1',
    'enable.auto.commit': False,
    'security.protocol': 'SSL',
    'ssl.key.location': 'file1.key',
    'ssl.ca.location': 'file2.pem',
    'ssl.certificate.location': 'file3.cert',
    'auto.offset.reset': 'earliest'
}
consumer = Consumer(conf)
consumer.subscribe(['my_topic'])

# Reads events without issues
msg = consumer.poll(timeout=0)
I'm having issues replicating this code with Spark Structured Streaming on EMR Notebooks.
This is the current setup I have on EMR Notebooks:
%%configure -f
{
    "conf": {
        "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0",
        "livy.rsc.server.connect.timeout": "600s",
        "spark.pyspark.python": "python3",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
        "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"
    }
}
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark import SparkFiles
cert_file = 'file3.cert'
pem_file = 'file2.pem'
key_file = 'file1.key'
sc.addFile(f's3://.../{cert_file}')
sc.addFile(f's3://.../{pem_file}')
sc.addFile(f's3://.../{key_file}')
spark = SparkSession \
    .builder \
    .getOrCreate()

# SparkFiles.get() below works (it resolves the files added with sc.addFile())
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "url:port") \
    .option("kafka.group.id", "group1") \
    .option("enable.auto.commit", "False") \
    .option("kafka.security.protocol", "SSL") \
    .option("kafka.ssl.key.location", SparkFiles.get(key_file)) \
    .option("kafka.ssl.ca.location", SparkFiles.get(pem_file)) \
    .option("kafka.ssl.certificate.location", SparkFiles.get(cert_file)) \
    .option("startingOffsets", "earliest") \
    .option("subscribe", "my_topic") \
    .load()
query = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
No rows appear in the Structured Streaming tab of the Spark UI, even though I expect them to show up immediately since I am using startingOffsets = "earliest".
My hypothesis is that readStream doesn't work because the SSL information is not set up correctly. I've looked and haven't found .option() parameters that directly correspond to the confluent_kafka API.
Any help would be appreciated.
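One likely cause: confluent_kafka is built on librdkafka, which accepts separate PEM key/cert/CA files, whereas Spark's Kafka source wraps the Java Kafka client, which does not recognize ssl.key.location / ssl.certificate.location / ssl.ca.location and instead expects keystore/truststore settings. A hedged sketch, assuming the PEM material has been repackaged into JKS (or PKCS12) stores and that the store paths and passwords below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch only: the truststore/keystore are hypothetical files built from
# file2.pem (CA) and file1.key + file3.cert (client key and cert), e.g. with
# openssl and keytool; paths and passwords are placeholders.
kafka_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "url:port")
    .option("kafka.security.protocol", "SSL")
    .option("kafka.ssl.truststore.location", "/path/on/every/node/truststore.jks")
    .option("kafka.ssl.truststore.password", "changeit")
    .option("kafka.ssl.keystore.location", "/path/on/every/node/keystore.jks")
    .option("kafka.ssl.keystore.password", "changeit")
    .option("kafka.ssl.key.password", "changeit")
    .option("startingOffsets", "earliest")
    .option("subscribe", "my_topic")
    .load())

The Kafka consumers run on the executors, so the store files must be readable there, not just on the driver; an absolute path that exists on every node (or a file distributed to the executors' working directories) is safer than a driver-local SparkFiles path.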

Error while passing dataframe to UDF in Structured Streaming

I am reading events from Kafka in Spark Structured Streaming and need to process the events one by one and write them to Redis. I wrote a UDF for that, but it gives me a SparkContext error.
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, explode, from_json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

conf = SparkConf()\
    .setAppName(spark_app_name)\
    .setMaster(spark_master_url)\
    .set("spark.redis.host", "redis")\
    .set("spark.redis.port", "6379")\
    .set("spark.redis.auth", "abc")
spark = SparkSession.builder\
    .config(conf=conf)\
    .getOrCreate()

def func(element, event, timestamp):
    # redis i/o
    pass

schema = ArrayType(StructType(
    [
        StructField("element_id", StringType()),
        StructField("event_name", StringType()),
        StructField("event_time", StringType())
    ]
))
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", topic) \
    .load()
# .option("includeTimestamp", value=True)

ds = df.selectExpr("CAST(value AS STRING)")\
    .withColumn("value", explode(from_json("value", schema)))

filter_func = udf(func, ArrayType(StringType()))
ds = ds.withColumn("column_name", filter_func(
    ds['value']['element_id'],
    ds['value']['event_name'],
    ds['value']['event_time']
))

query = ds.writeStream \
    .format("console") \
    .start()
query.awaitTermination()
Error message: _pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Any help is appreciated.
I was trying to access the SparkContext from within a user-defined function, which is not allowed: inside the UDF I was writing to Redis via spark-redis, which goes through the SparkContext.
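Since spark-redis relies on the SparkContext, which only exists on the driver, a common pattern is to drop the UDF and write from foreachBatch (Spark 2.4+) with a plain Redis client on the executors. A minimal sketch reusing the df and schema defined above, assuming the redis-py package is installed on the executors and that the key/value scheme, host, port and password are placeholders:

from pyspark.sql.functions import col, explode, from_json

def write_to_redis(batch_df, batch_id):
    # Called on the driver once per micro-batch; the inner function runs on executors.
    def handle_partition(rows):
        import redis  # assumes redis-py is available on the executors
        r = redis.Redis(host="redis", port=6379, password="abc")
        for row in rows:
            # placeholder key/value scheme
            r.set(row["element_id"], row["event_name"] + "|" + row["event_time"])
    batch_df.foreachPartition(handle_partition)

events = df.selectExpr("CAST(value AS STRING)") \
    .withColumn("value", explode(from_json(col("value"), schema))) \
    .select("value.element_id", "value.event_name", "value.event_time")

query = events.writeStream.foreachBatch(write_to_redis).start()
query.awaitTermination()

Opening the connection inside the partition handler keeps it to one Redis connection per partition per batch, and nothing in the worker-side code touches the SparkContext.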

Writing Spark Structured Streaming Output to a Kafka Topic

I have a simple structured streaming application that just reads data from one Kafka topic and writes to another.
SparkConf conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("test");

SparkSession spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate();

Dataset<Row> dataset = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "start")
    .load();

StreamingQuery query = dataset
    .writeStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("checkpointLocation", "checkpoint")
    .option("topic", "end")
    .start();

query.awaitTermination(20000);
There are two messages waiting on the topic start. The code runs without exception; however, no messages ever end up on the topic end. What is wrong with this example?
The problem was that the messages were already on the topic and the starting offset was not set to "earliest":
Dataset<Row> dataset = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", start.getTopicName())
    .option("startingOffsets", "earliest")
    .load();
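To confirm that messages actually reached the end topic, the Kafka source can also be queried in batch mode (startingOffsets / endingOffsets are supported there). A small PySpark sketch of that check, assuming the same local broker:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("verify-end-topic").getOrCreate()

# Batch read of everything currently on the "end" topic.
end_df = (spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "end")
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load())

end_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(truncate=False)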

Spark 2.0 - Databricks xml reader Input path does not exist

I am trying to use the Databricks XML file reader API.
Sample code:
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Java Spark SQL basic example")
  .config("spark.sql.warehouse.dir", "file:///C:/TestData")
  .getOrCreate()

// val sqlContext = new SQLContext(sc)
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("books.xml")

df.show()
If I give the file path directly, it looks for a warehouse directory, so I set the spark.sql.warehouse.dir option, but then it throws "Input path does not exist".
It is actually looking under the project root directory; why is it looking there?
Finally it's working. We need to specify the warehouse directory as well as pass the absolute file path in the load method. I am not sure what the warehouse directory is used for.
The main point is that we don't need to give C:, as mentioned in the other Stack Overflow answer.
Working code:
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Java Spark SQL basic example")
  .config("spark.sql.warehouse.dir", "file:///TestData/")
  .getOrCreate()

// val sqlContext = new SQLContext(sc)
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("file:///TestData/books.xml")

df.show()

SparkSession: using the SQL API to query Cassandra

In Python, using SparkSession I can load a Cassandra keyspace and table like:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("TestApp") \
    .getOrCreate()

cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testdb", table="test")
df.collect()
How can I use the SQL API instead? Something like:
SELECT * FROM testdb.test
Try registering a temp view in Spark and running SQL queries against it, as in the following snippet:
df.createOrReplaceTempView("my_table")
df2 = spark.sql("SELECT * FROM my_table")
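If you'd rather skip the manual load-and-register step, newer versions of the Spark Cassandra Connector (3.0+) can expose Cassandra as a Spark SQL catalog so that keyspace.table names are queryable directly. A hedged sketch, assuming connector 3.x is on the classpath; the catalog name cass is arbitrary and the host is a placeholder:

from pyspark.sql import SparkSession

# Sketch only: registers Cassandra as a SQL catalog named "cass" (connector 3.x feature).
spark = SparkSession.builder \
    .master("local") \
    .appName("TestApp") \
    .config("spark.sql.catalog.cass", "com.datastax.spark.connector.datasource.CassandraCatalog") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()

spark.sql("SELECT * FROM cass.testdb.test").show()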
