Spark stream with console sink wants HDFS write access - apache-spark

I have a simple setup that reads from Kafka and writes to the local console.
The SparkSession is created with .master("local[*]"),
and I start the stream with:
var df = spark.readStream.format("kafka").options(...).load()
df = df.select("some_column")
df.writeStream.format("console")
.outputMode("append")
.start()
.awaitTermination()
The same Kafka setup works perfectly fine with a batch/normal DataFrame, but for this streaming job I get the exception:
Permission denied: user=user, access=WRITE, inode="/":hdfs:hdfs:drwxr-xr-x
Why does it want write access to HDFS when I only want to print the data locally to the console?
And how can I solve this?
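A likely explanation: even with a console sink, Structured Streaming keeps checkpoint/metadata files for the query, and when no checkpoint location is given the path is resolved against the Hadoop default filesystem, which becomes HDFS as soon as a cluster configuration is on the classpath. A minimal sketch of pinning the checkpoint to an explicit local path instead (the file:///tmp/... path is only an example):
// Sketch, assuming the same spark/df as above; only the checkpointLocation option is new.
df.writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", "file:///tmp/console-checkpoint") // example path on the local filesystem
  .start()
  .awaitTermination()
Alternatively, spark.sql.streaming.checkpointLocation can be set once on the session so every query defaults to a local base directory.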

Related

Spark NiFi site to site connection

I am new to NiFi. I am trying to send data from NiFi to Spark, i.e. to establish a stream from a NiFi output port to Spark, following this tutorial.
NiFi is running on Kubernetes, and I am using the Spark operator on the same cluster to submit my applications.
It seems like Spark is able to reach the NiFi web UI, and it starts a streaming receiver. However, no data is coming into the Spark app through the output port and I get empty RDDs. I have not seen any warnings or errors in the Spark logs.
Any idea or information that could help me solve this issue is appreciated.
My code:
val conf = new SiteToSiteClient.Builder()
.keystoreFilename("..")
.keystorePass("...")
.keystoreType(...)
.truststoreFilename("..")
.truststorePass("..")
.truststoreType(...)
.url("https://...../nifi")
.portName("spark")
.buildConfig()
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
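One quick way to narrow this down is to count what the receiver actually hands to Spark before any further processing; the sketch below assumes ssc is the StreamingContext already used for receiverStream above:
// Print how many NiFi data packets arrive per batch interval.
// A constant 0 points at the site-to-site transfer itself (output port name,
// whether the executors can reach the NiFi nodes directly, TLS settings);
// anything non-zero means the data is arriving and the problem is downstream.
lines.count().print()

ssc.start()
ssc.awaitTermination()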

I cannot connect from my cloud kafka to databricks community edition's spark cluster

1- I have a Spark cluster on Databricks Community Edition and a Kafka instance on GCP.
2- I just want to ingest the Kafka stream from Databricks Community Edition and analyze the data with Spark.
3- This is my connection code:
val UsYoutubeDf =
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "XXX.XXX.115.52:9092")
.option("subscribe", "usyoutube")
.load()
As mentioned, my data is arriving in Kafka.
I added spark.driver.host to the firewall settings; otherwise I cannot even ping my Kafka machine from the Databricks cluster.
import org.apache.spark.sql.streaming.Trigger.ProcessingTime
val sortedModelCountQuery = sortedyouTubeSchemaSumDf
.writeStream
.outputMode("complete")
.format("console")
.option("truncate","false")
.trigger(ProcessingTime("5 seconds"))
.start()
After this, the data does not seem to come into my Spark cluster:
import org.apache.spark.sql.streaming.Trigger.ProcessingTime
sortedModelCountQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper#3bd8a775
It stays like this. Actually, the data is coming in, but the code I wrote for the analysis does not run here.
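One thing worth checking (just a sketch, not from the thread): with the console sink on Databricks the batch output typically lands in the driver log rather than in the notebook cell, so the query handle itself is often the easier thing to inspect. Assuming the sortedModelCountQuery from the snippet above:
// status shows whether a trigger is currently running or waiting for data;
// lastProgress (null until the first completed batch) shows numInputRows per batch.
println(sortedModelCountQuery.status)
println(sortedModelCountQuery.lastProgress)

// When running outside a notebook (e.g. via spark-submit), the stream also needs to be kept alive:
// sortedModelCountQuery.awaitTermination()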

Spark integration with Vertica Failing

We are using Vertica Community Edition "vertica_community_edition-11.0.1-0" and Spark 3.2 with a local[*] master. When we try to save data into the Vertica database using the following:
member.write()
.format("com.vertica.spark.datasource.VerticaSource")
.mode(SaveMode.Overwrite)
.option("host", "192.168.1.25")
.option("port", "5433")
.option("user", "Fred")
.option("db", "store")
.option("password", "password")
//.option("dbschema", "store")
.option("table", "Test")
// .option("staging_fs_url", "hdfs://172.16.20.17:9820")
.save();
we get the following exception:
com.vertica.spark.util.error.ConnectorException: Fatal error: spark context did not exist
at com.vertica.spark.datasource.VerticaSource.extractCatalog(VerticaDatasourceV2.scala:76)
at org.apache.spark.sql.connector.catalog.CatalogV2Util$.getTableProviderCatalog(CatalogV2Util.scala:363)
Kindly let us know how to solve the exception.
We had a similar case. The root cause was that SparkSession.getActiveSession() returned None, because the Spark session had been registered on another thread of the JVM. We could still get to the single session we had via SparkSession.getDefaultSession() and manually register it with SparkSession.setActiveSession(...).
Our case happened in a Jupyter kernel where we were using PySpark.
The workaround code was:
sp = sc._jvm.SparkSession.getDefaultSession().get()
sc._jvm.SparkSession.setActiveSession(sp)
I can't try Scala or Java, but I suppose it should look roughly like this:
SparkSession.setActiveSession(SparkSession.getDefaultSession())
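Since getDefaultSession returns an Option on the JVM side, a Scala version that compiles would look roughly like this (same idea as the PySpark workaround above, just a sketch):
import org.apache.spark.sql.SparkSession

// If no session is registered as active on this thread, fall back to the default
// session and mark it active so connectors calling getActiveSession can find it.
if (SparkSession.getActiveSession.isEmpty) {
  SparkSession.getDefaultSession.foreach(SparkSession.setActiveSession)
}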
Vertica doesn't officially support Spark 3.2 with Vertica 11.0. Please see the documentation link below:
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SupportedPlatforms/SparkIntegration.htm
Please try using Spark connector v2 with a supported version of Spark, and try running the examples from GitHub:
https://github.com/vertica/spark-connector/tree/main/examples

Query remote Hive Metastore from PySpark

I am trying to query a remote Hive metastore from PySpark using a username/password/JDBC URL. I can initialize the SparkSession just fine, but I am unable to actually query the tables. I would like to keep everything in a Python environment if possible. Any ideas?
from pyspark.sql import SparkSession
url = f"jdbc:hive2://{jdbcHostname}:{jdbcPort}/{jdbcDatabase}"
driver = "org.apache.hive.jdbc.HiveDriver"
# initialize
# also tried .config("javax.jdo.option.ConnectionURL", url) instead of hive.metastore.uris
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("hive.metastore.uris", url) \
    .config("javax.jdo.option.ConnectionDriverName", driver) \
    .config("javax.jdo.option.ConnectionUserName", username) \
    .config("javax.jdo.option.ConnectionPassword", password) \
    .enableHiveSupport() \
    .getOrCreate()
# query
spark.sql("select * from database.tbl limit 100").show()
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Earlier I was able to connect to a single table using JDBC but was unable to retrieve any data; see Errors querying Hive table from PySpark.
The metastore URIs are not JDBC addresses; they are simply server:port addresses opened up by the metastore server process, typically on port 9083.
The metastore itself is not a jdbc:hive2 connection either; it is backed by whatever RDBMS the metastore is configured with (as set in hive-site.xml).
If you want to use Spark with JDBC, then you don't need those javax.jdo options, as the JDBC reader has its own username, driver, etc. options.
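To illustrate the two different sets of options the answer refers to, here is a sketch in Scala (the same builder/reader options apply from PySpark; the hosts below are placeholders):
import org.apache.spark.sql.SparkSession

// Talking to the metastore directly: a thrift address, not a JDBC URL.
val spark = SparkSession.builder()
  .appName("hive-metastore-example")
  .config("hive.metastore.uris", "thrift://metastore-host:9083") // placeholder host
  .enableHiveSupport()
  .getOrCreate()
spark.sql("select * from database.tbl limit 100").show()

// Going through HiveServer2 over JDBC instead: the JDBC reader takes its own
// url/driver/user/password options, so no javax.jdo.* settings are involved.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hiveserver2-host:10000/default") // placeholder host
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", "database.tbl")
  .load()
df.show()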

Remote Database not found while Connecting to remote Hive from Spark using JDBC in Python?

I am using a PySpark script to read data from a remote Hive instance through the JDBC driver. I have tried another method using enableHiveSupport and hive-site.xml, but that technique is not possible for me due to some limitations (access to launch YARN jobs from outside the cluster was blocked). Below is the only way I can connect to Hive.
from pyspark.sql import SparkSession
spark=SparkSession.builder \
.appName("hive") \
.config("spark.sql.hive.metastorePartitionPruning", "true") \
.config("hadoop.security.authentication" , "kerberos") \
.getOrCreate()
jdbcdf = spark.read.format("jdbc") \
    .option("url", "urlname") \
    .option("driver", "com.cloudera.hive.jdbc41.HS2Driver") \
    .option("user", "username") \
    .option("dbtable", "dbname.tablename") \
    .load()
spark.sql("show tables from dbname").show()
This gives me the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.sql.
: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'vqaa' not found;
Could someone please help me with how I can access the remote database/tables using this method? Thanks.
Add .enableHiveSupport() to your SparkSession in order to access the Hive catalog.
