Connect Remote Cassandra Node From Spark Structured Streaming - apache-spark

I'm trying to connect to a remote Cassandra node from Spark Structured Streaming.
I can connect to an existing Cassandra node on my local machine.
This is the code that connects to Cassandra on my local machine:
from pyspark.sql.functions import window
parsed = parsed_df \
    .withWatermark("sourceTimeStamp", "10 minutes") \
    .groupBy(
        window(parsed_df.sourceTimeStamp, "4 seconds"),
        parsed_df.id
    ) \
    .agg({"value": "avg"}) \
    .withColumnRenamed("avg(value)", "avg") \
    .withColumnRenamed("window", "sourceTime")
def writeToCassandra(writeDF, epochId):
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode('append') \
        .options(table="opc", keyspace="poc") \
        .save()
parsed.writeStream \
    .foreachBatch(writeToCassandra) \
    .outputMode("update") \
    .start()
However, I want to connect to a remote Cassandra node. How can I specify that?

To connect to remote hosts, specify a single address or a comma-separated list of addresses of Cassandra nodes in the spark.cassandra.connection.host Spark configuration property. You can set it either via command-line parameters (the most flexible option) or in your code. If the Cassandra cluster uses authentication, you also need to provide the spark.cassandra.auth.username and spark.cassandra.auth.password properties. For SSL and other connection settings, see the connector's parameters reference.
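As a minimal sketch of the in-code option, the properties named above can be collected in a dictionary and applied to the SparkSession builder. The host addresses, user name, and password below are placeholders, and the helper names (cassandra_conf, build_session, to_submit_args) are hypothetical, not part of the connector. The session helper assumes pyspark and the Spark Cassandra Connector are available on the cluster.

```python
def cassandra_conf(hosts, username=None, password=None, port=None):
    """Build Spark properties for the Spark Cassandra Connector (hypothetical helper)."""
    # One address or several, joined into the comma-separated form the property expects.
    conf = {"spark.cassandra.connection.host": ",".join(hosts)}
    if port is not None:
        conf["spark.cassandra.connection.port"] = str(port)
    # Credentials are only needed when the cluster has authentication enabled.
    if username is not None and password is not None:
        conf["spark.cassandra.auth.username"] = username
        conf["spark.cassandra.auth.password"] = password
    return conf


def build_session(conf, app_name="cassandra-remote"):
    """Create a SparkSession carrying the given properties (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import: the helpers above work without Spark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def to_submit_args(conf):
    """Render the same properties as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder addresses and credentials, for illustration only.
conf = cassandra_conf(["10.0.0.11", "10.0.0.12"], username="spark_user", password="change-me")
```

With the session built this way, the writeToCassandra function from the question should not need any change, since the connector reads the host from the session configuration. The to_submit_args helper shows the command-line form of the same settings.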

Related

Can I send messages to a Kafka cluster via Azure Databricks as a batch job (closing my connection once the messages I sent are consumed)?

I want to send messages to Kafka once a day via Azure Databricks. I want the messages to be received as a batch job.
I need to send them to a Kafka server, but we don't want to keep a cluster running all day for this job.
I saw the Databricks writeStream method (I can't make it work yet, but that is not the purpose of my question). It looks like I need to be streaming day and night to make it run.
Is there a way to use it as a batch job? Can I send the messages to the Kafka server and shut down my cluster once they are received?
df = spark \
    .readStream \
    .format("delta") \
    .option("numPartitions", 5) \
    .option("rowsPerSecond", 5) \
    .load('/mnt/sales/marketing/numbers/DELTA/')
(df.select("Sales", "value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "rferferfeez.eu-west-1.aws.confluent.cloud:9092")
    .option("topic", "bingofr")
    .option("kafka.sasl.username", "jakich")
    .option("kafka.sasl.password", 'ozifjoijfziaihufzihufazhufhzuhfzuoehza')
    .option("checkpointLocation", "/mnt/sales/marketing/numbers/temp/")
    .option("spark.kafka.clusters.cluster.sasl.token.mechanism", "cluster-buyit")
    .option("request.timeout.ms", 30)
    .option("includeHeaders", "true")
    .start()
)
kafkashaded.org.apache.kafka.common.errors.TimeoutException: Topic bingofr not present in metadata after 60000 ms.
It is worth noting that we also have an Event Hub. Would I be better off sending messages to our Event Hub and implementing a triggered function that writes to Kafka?
I just want to elaborate on @Alex Ott's comment, as it seems to work.
By adding ".trigger(availableNow=True)", you can
"periodically spin up a cluster, process everything that is available since the last period, and then shutdown the cluster. In some case, this may lead to significant cost savings."
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
(
    df.select("key", "value", "partition")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", host)
    .option("topic", topic)
    .trigger(availableNow=True)
    .option("kafka.sasl.jaas.config",
            'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="{}" password="{}";'.format(userid, password))
    .option("checkpointLocation", "/mnt/Sales/Markerting/Whiteboards/temp/")
    .option("kafka.security.protocol", "SASL_SSL")
    .start()
)
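The run-once pattern described in this answer can be sketched as two small helpers: one that assembles the Kafka sink options, and one that starts the query with the availableNow trigger and blocks until the available data has been written. The function names and the broker, topic, and credential values are hypothetical. The kafka.sasl.mechanism setting is an assumption that is not in the answer above (PLAIN is what the PlainLoginModule is normally paired with), and the login module class name defaults to the unshaded Kafka one; on Databricks the kafkashaded prefix used above would be passed instead.

```python
def kafka_sasl_options(bootstrap_servers, topic, api_key, api_secret, checkpoint,
                       login_module="org.apache.kafka.common.security.plain.PlainLoginModule"):
    """Assemble Kafka sink options for a SASL_SSL cluster (hypothetical helper)."""
    # JAAS entry: login module, the 'required' flag, then credentials, ending with ';'.
    jaas = f'{login_module} required username="{api_key}" password="{api_secret}";'
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",  # assumption: the cluster uses SASL/PLAIN
        "kafka.sasl.jaas.config": jaas,
        "topic": topic,
        "checkpointLocation": checkpoint,
    }


def run_once(df, options):
    """Write everything currently available to Kafka, then return (requires pyspark)."""
    # df is assumed to be a streaming DataFrame that already has a 'value' column.
    query = (
        df.writeStream
        .format("kafka")
        .options(**options)
        .trigger(availableNow=True)  # process the backlog, then stop the query
        .start()
    )
    query.awaitTermination()  # returns once the available data has been processed


# Placeholder values, for illustration only.
options = kafka_sasl_options("broker.example.com:9092", "demo-topic",
                             "my-key", "my-secret", "/tmp/checkpoints/demo")
```

Because run_once returns after the backlog is written, a scheduled job that calls it can let the cluster terminate afterwards, which is the cost saving the quoted documentation refers to. The checkpoint location is what lets the next run pick up only the new data.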
Normally Kafka is a continuous service; at least, it has been wherever I have worked.
I would consider a cloud service like Azure, where an Event Hub is used on a per-message basis through the Kafka API. It is always on, and you pay per message.
Otherwise, you will need a batch job that starts Kafka, does your execution, and then stops Kafka. You do not state whether everything is on Databricks, though.

I cannot connect from my cloud kafka to databricks community edition's spark cluster

1- I have a Spark cluster on Databricks Community Edition and a Kafka instance on GCP.
2- I just want to ingest streaming data from Kafka in Databricks Community Edition and analyze it with Spark.
3- This is my connection code:
val UsYoutubeDf =
  spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "XXX.XXX.115.52:9092")
    .option("subscribe", "usyoutube")
    .load()
As mentioned, my data is arriving in Kafka.
I am adding the spark.driver.host address to the firewall settings; otherwise I cannot ping my Kafka machine from the Databricks cluster.
import org.apache.spark.sql.streaming.Trigger.ProcessingTime
val sortedModelCountQuery = sortedyouTubeSchemaSumDf
  .writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .trigger(ProcessingTime("5 seconds"))
  .start()
After this step, the data does not arrive in Spark on my cluster. The only output is:
import org.apache.spark.sql.streaming.Trigger.ProcessingTime
sortedModelCountQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@3bd8a775
It stays like this. Actually, the data is arriving, but the code I wrote for the analysis does not work here.

Query remote Hive Metastore from PySpark

I am trying to query a remote Hive metastore within PySpark using a username/password/jdbc url. I can initialize the SparkSession just fine but am unable to actually query the tables. I would like to keep everything in a python environment if possible. Any ideas?
from pyspark.sql import SparkSession
url = f"jdbc:hive2://{jdbcHostname}:{jdbcPort}/{jdbcDatabase}"
driver = "org.apache.hive.jdbc.HiveDriver"
# initialize
# (also tried .config("javax.jdo.option.ConnectionURL", url) in place of hive.metastore.uris)
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("hive.metastore.uris", url) \
    .config("javax.jdo.option.ConnectionDriverName", driver) \
    .config("javax.jdo.option.ConnectionUserName", username) \
    .config("javax.jdo.option.ConnectionPassword", password) \
    .enableHiveSupport() \
    .getOrCreate()
# query
spark.sql("select * from database.tbl limit 100").show()
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Before, I was able to connect to a single table using JDBC but was unable to retrieve any data; see "Errors querying Hive table from PySpark".
The metastore URIs are not JDBC addresses; they are simply server:port addresses opened by the Metastore server process, typically on port 9083 (for example, thrift://host:9083).
The metastore itself is not reached through a jdbc:hive2 connection. The javax.jdo options refer instead to the RDBMS that backs the metastore, as configured in hive-site.xml.
If you want to use Spark with JDBC, you don't need those javax.jdo options, because the JDBC reader has its own user, driver, and other options.
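To make the distinction concrete, here is a small sketch that keeps the two kinds of address apart: one helper accepts only thrift:// metastore URIs, the other builds options for Spark's JDBC reader. Both helper names are hypothetical, and the host, port, database, and table values are placeholders.

```python
from urllib.parse import urlparse


def metastore_conf(uris):
    """Validate a comma-separated list of metastore URIs and return the Spark setting."""
    for uri in uris.split(","):
        parsed = urlparse(uri.strip())
        # A metastore URI is thrift://host:port, never a jdbc:hive2:// URL.
        if parsed.scheme != "thrift" or parsed.hostname is None or parsed.port is None:
            raise ValueError(f"not a thrift://host:port metastore URI: {uri!r}")
    return {"hive.metastore.uris": uris}


def hive_jdbc_options(host, port, database, table, user, password,
                      driver="org.apache.hive.jdbc.HiveDriver"):
    """Options for Spark's JDBC reader when going through HiveServer2 instead."""
    return {
        "url": f"jdbc:hive2://{host}:{port}/{database}",
        "driver": driver,
        "dbtable": table,
        "user": user,
        "password": password,
    }


# Placeholder values, for illustration only.
conf = metastore_conf("thrift://metastore.example.com:9083")
jdbc_opts = hive_jdbc_options("hs2.example.com", 10000, "sales", "orders", "alice", "secret")

# With pyspark available, the two routes would be used roughly like this:
#   SparkSession.builder.config("hive.metastore.uris", conf["hive.metastore.uris"]).enableHiveSupport()
#   spark.read.format("jdbc").options(**jdbc_opts).load()
```

Passing the jdbc:hive2 URL from the question to metastore_conf raises a ValueError, which is the same mix-up that leads to the "Unable to instantiate SessionHiveMetaStoreClient" error.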

Remote Database not found while Connecting to remote Hive from Spark using JDBC in Python?

I am using a PySpark script to read data from a remote Hive through the JDBC driver. I have tried other methods using enableHiveSupport and hive-site.xml, but that approach is not possible for me due to some limitations (access to launch YARN jobs from outside the cluster is blocked). Below is the only way I can connect to Hive.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("hive") \
    .config("spark.sql.hive.metastorePartitionPruning", "true") \
    .config("hadoop.security.authentication", "kerberos") \
    .getOrCreate()
jdbcdf = spark.read.format("jdbc") \
    .option("url", "urlname") \
    .option("driver", "com.cloudera.hive.jdbc41.HS2Driver") \
    .option("user", "username") \
    .option("dbtable", "dbname.tablename") \
    .load()
spark.sql("show tables from dbname").show()
This gives me the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.sql.
: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'vqaa' not found;
Could someone please help me access the remote databases and tables using this method? Thanks.
Add .enableHiveSupport() to your SparkSession in order to access the Hive catalog.

How to connect to remote hive server from spark [duplicate]

I'm running Spark locally and want to access Hive tables that are located in a remote Hadoop cluster.
I'm able to access the Hive tables by launching beeline under SPARK_HOME:
[ml@master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
Enter username for jdbc:hive2://remote_hive:10000: root
Enter password for jdbc:hive2://remote_hive:10000: ******
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ml/spark/spark-2.0.0/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/12 19:06:39 INFO jdbc.Utils: Supplied authorities: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.Utils: Resolved authority: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://remote_hive:10000
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://remote_hive:10000>
How can I access the remote Hive tables programmatically from Spark?
JDBC is not required
Spark connects directly to the Hive metastore, not through HiveServer2. To configure this:
Put hive-site.xml on your classpath and set hive.metastore.uris to where your Hive metastore is hosted. Also see "How to connect to a Hive metastore programmatically in SparkSQL?"
Import org.apache.spark.sql.hive.HiveContext, as it can perform SQL queries over Hive tables.
Define val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Run sqlContext.sql("show tables") to verify that it works.
SparkSQL on Hive tables
Conclusion: if you must go the JDBC way
Have a look at "Connecting Apache Spark with Apache Hive remotely".
Please note that beeline also connects through JDBC; this is evident from your log itself:
[ml@master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
So please have a look at this interesting article, which describes three methods:
Method 1: Pull table into Spark using JDBC
Method 2: Use Spark JdbcRDD with HiveServer2 JDBC driver
Method 3: Fetch dataset on a client side, then create RDD manually
Currently the HiveServer2 driver doesn't allow us to use the "Sparkling" Methods 1 and 2, so we can rely only on Method 3.
Below is an example code snippet through which it can be achieved:
Loading data from one Hadoop cluster (aka "remote") into another one (where my Spark lives, aka "domestic") through a HiveServer2 JDBC connection.
import java.sql.{Connection, DriverManager, ResultSet, Timestamp}
import scala.collection.mutable.MutableList
case class StatsRec (
  first_name: String,
  last_name: String,
  action_dtm: Timestamp,
  size: Long,
  size_p: Long,
  size_d: Long
)
// url, user and password are the HiveServer2 JDBC connection details, defined elsewhere
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement
  .executeQuery("SELECT * FROM stats_201512301914")
// Collect all rows on the client (driver) side
val fetchedRes = MutableList[StatsRec]()
while(res.next()) {
  val rec = StatsRec(res.getString("first_name"),
    res.getString("last_name"),
    Timestamp.valueOf(res.getString("action_dtm")),
    res.getLong("size"),
    res.getLong("size_p"),
    res.getLong("size_d"))
  fetchedRes += rec
}
conn.close()
// Turn the fetched rows into an RDD
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()
// Basically we are done. To check loaded data:
println(rddStatsDelta.count)
rddStatsDelta.take(10).foreach(println)
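For readers working in PySpark, the same "fetch on the client, then parallelize" pattern can be sketched in Python with any DB-API connection. In the sketch below an in-memory sqlite3 database stands in for the HiveServer2 connection purely so that the example runs on its own; the table, columns, and the fetch_records helper are hypothetical. With a real Hive connection only the connect call would differ.

```python
import sqlite3
from collections import namedtuple

# Record type mirroring a subset of the StatsRec case class above.
StatsRec = namedtuple("StatsRec", ["first_name", "last_name", "size"])


def fetch_records(conn, query):
    """Run a query through a DB-API connection and return all rows as StatsRec tuples."""
    cur = conn.cursor()
    try:
        cur.execute(query)
        return [StatsRec(*row) for row in cur.fetchall()]
    finally:
        cur.close()


# Stand-in for the remote HiveServer2 connection (assumption: any DB-API driver works the same way).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (first_name TEXT, last_name TEXT, size INTEGER)")
conn.executemany("INSERT INTO stats VALUES (?, ?, ?)",
                 [("Ada", "Lovelace", 10), ("Alan", "Turing", 20)])
records = fetch_records(conn, "SELECT first_name, last_name, size FROM stats ORDER BY size")
conn.close()

# With a live SparkContext `sc`, the fetched rows become an RDD:
#   rdd = sc.parallelize(records)
```

As with the Scala version, everything passes through the driver's memory, so this only suits result sets small enough to fit there.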
After providing the hive-site.xml configuration to Spark and starting the Hive Metastore service,
two things need to be configured in the Spark session when connecting to Hive:
Since Spark SQL connects to the Hive metastore using Thrift, we need to provide the Thrift server URI while creating the Spark session.
The Hive Metastore warehouse, which is the directory where Spark SQL persists tables.
Use the property 'spark.sql.warehouse.dir', which corresponds to 'hive.metastore.warehouse.dir' (the latter is deprecated since Spark 2.0).
Something like:
SparkSession spark = SparkSession.builder()
    .appName("Spark_SQL_5_Save To Hive")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate();
Hope this was helpful !!
As per documentation:
Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
So in the SparkSession, use spark.sql.warehouse.dir for the warehouse location. The metastore address is still given with hive.metastore.uris:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("hive.metastore.uris", "thrift://<remote_ip>:9083") \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("show tables").show()
