Spark integration with Vertica Failing - apache-spark

We are using Vertica Community Edition "vertica_community_edition-11.0.1-0", and are using Spark 3.2, with local[*] master. When we are trying to save data in vertica database using following:
member.write()
.format("com.vertica.spark.datasource.VerticaSource")
.mode(SaveMode.Overwrite)
.option("host", "192.168.1.25")
.option("port", "5433")
.option("user", "Fred")
.option("db", "store")
.option("password", "password")
//.option("dbschema", "store")
.option("table", "Test")
// .option("staging_fs_url", "hdfs://172.16.20.17:9820")
.save();
We are getting following exception:
com.vertica.spark.util.error.ConnectorException: Fatal error: spark context did not exist
at com.vertica.spark.datasource.VerticaSource.extractCatalog(VerticaDatasourceV2.scala:76)
at org.apache.spark.sql.connector.catalog.CatalogV2Util$.getTableProviderCatalog(CatalogV2Util.scala:363)
Kindly let know how to solve the exception.

We had a similar case. The root cause was that SparkSession.getActiveSession() returned None, due to that spark session was registered on another thread of the JVM. We could still get to the single session we had using SparkSession.getDefaultSession() and manually register it with SparkSession.SetActiveSession(...).
Our case happened in a jupyter kernel where we were using pyspark.
The workaround code was:
sp = sc._jvm.SparkSession.getDefaultSession().get()
sc._jvm.SparkSession.setActiveSession(sp)
I can't try scala or java, I suppose it should look like this:
SparkSession.setActiveSession(SparkSession.getDefaultSession())

vertica doesn't support spark version 3.2 with vertica 11.0 officially. Please find the below documentation link.
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SupportedPlatforms/SparkIntegration.htm
Please try using spark connector v2 with the supported version of spark and try running examples from github
https://github.com/vertica/spark-connector/tree/main/examples

Related

how can spark read / write from azurite

I am trying to read (and eventually write) from azurite (version 3.18.0) using spark (3.1.1)
i can't understand what spark configurations and file uri i need to set to make this work properly
for example these are the containers and files i have inside azurite
/devstoreaccount1/container1/file1.avro
/devstoreaccount1/container2/file2.avro
This is the code that im running - the uri val is one of the values below
val uri = ...
val spark = SparkSession.builder()
.appName(appName)
.master("local")
.config("spark.driver.host", "127.0.0.1").getOrCreate()
spark.conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(s"spark.hadoop.fs.azure.account.auth.type.devstoreaccount1.blob.core.windows.net", "SharedKey")
spark.conf.set(s"spark.hadoop.fs.azure.account.key.devstoreaccount1.blob.core.windows.net", <azurite account key>)
spark.read.format("avro").load(uri)
uri value - what is the correct one?
http://127.0.0.1:10000/container1/file1.avro
I get UnsupportedOperationException when i perform the spark.read.format("avro").load(uri) because spark will use the HttpFileSystem implementation and it doesn't support listStatus
wasb://container1#devstoreaccount1.blob.core.windows.net/file1.avro
Spark will try to authenticate against azure servers (and will fail for obvious reasons)
I have tried to follow this stackoverflow post without success.
I have also tried to remove the blob.core.windows.net configuration postfix but then i don't how to give spark the endpoint for the azurite container?
So my question is what are the correct configurations to give spark so it will be able to read from azurite, and what are the correct file path formats to pass as the URI?

Spark NiFi site to site connection

I am new with NiFi, I am trying to send data from NiFi to Spark or to establish a stream from NiFi output port to Spark according to this tutorial.
Nifi is running on Kubernetes and I am using Spark operator on the same cluster to submit my applications.
It seems like Spark is able to access the web NiFi and it starts a streaming receiver. However, data is not coming to the Spark app through output and I have empty rdds. I have not seen any warnings or errors in Spark logs
Any Idea or information which could help me to solve this issue is appreciated.
My code:
val conf = new SiteToSiteClient.Builder()
.keystoreFilename("..")
.keystorePass("...")
.keystoreType(...)
.truststoreFilename("..")
.truststorePass("..")
.truststoreType(...)
.url("https://...../nifi")
.portName("spark")
.buildConfig()
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))

Remote Database not found while Connecting to remote Hive from Spark using JDBC in Python?

I am using pyspark script to read data from remote Hive through JDBC Driver. I have tried other method using enableHiveSupport, Hive-site.xml. but that technique is not possible for me due to some limitations(Access was blocked to launch yarn jobs from outside the cluster). Below is the only way I can connect to Hive.
from pyspark.sql import SparkSession
spark=SparkSession.builder \
.appName("hive") \
.config("spark.sql.hive.metastorePartitionPruning", "true") \
.config("hadoop.security.authentication" , "kerberos") \
.getOrCreate()
jdbcdf=spark.read.format("jdbc").option("url","urlname")\
.option("driver","com.cloudera.hive.jdbc41.HS2Driver").option("user","username").option("dbtable","dbname.tablename").load()
spark.sql("show tables from dbname").show()
Giving me below error:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.sql.
: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'vqaa' not found;
Could someone please help how I can access remote db/tables using this method? Thanks
add .enableHiveSupport() to your sparksession in order to access hive catalog

Accessing Spark RDDs from a web browser via thrift server - java

We have processed our data using Spark 1.2.1 with Java and stored in Hive tables. We want to access this data as RDDs from an web browser.
I read documentation and I understood the steps to do the task.
I am unable to find the way to interact with Spark SQL RDDs via thrift server. Examples I found have belw line in the code and I am not find the class for this in Spark 1.2.1 java API docs.
HiveThriftServer2.startWithContext
In github i saw scala examples using
import org.apache.spark.sql.hive.thriftserver , but I dont see this in Java API docs. Not sure if I am missing something.
Did anybody had luck with accessing Spark SQL RDDs from a browser via thrift? Can you post the code snippet. We are using Java.
I've got most of this working. Lets dissect each part of it: (References at bottom of post)
HiveThriftServer2.startWithContext is defined in Scala. I was never able to access it from Java or from Python using Py4j, and am no JVM expert, but I ended up switching to Scala. This may have something to do with the annotation #DeveloperApi . This is how I imported it Scala in Spark 1.6.1:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
For anyone reading this and not using Hive, a Spark SQL context won't do, and you need a hive context. However, the HiveContext constructor requires a Java spark context, not a scala one.
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.hive.HiveContext
var hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc))
Now start the thrift server
HiveThriftServer2.startWithContext(hiveContext)
// Yay
Next, we need to make our RDDs available as SQL tables. First, we have to convert them into Spark SQL DataFrames:
val someDF = hiveContext.createDataFrame(someRDD)
Then, we need to turn them into Spark SQL tables. You do this by persisting them to Hive, or making the RDD available as a temporary table.
Persist to Hive:
// Deprecated since Spark 1.4, to be removed in Spark 2.0:
someDF.saveAsTable("someTable")
// Up-to-date at time of writing
someDF.write().saveAsTable("someTable")
Or, use a temporary table:
// Use the Data Frame as a Temporary Table
// Introduced in Spark 1.3.0
someDF.registerTempTable("someTable")
Note - temporary tables are isolated to an SQL session.
Spark's hive thrift server is multi-session by default
in version 1.6 (one session per connection). Therefore,
for clients to access temporary tables you've registered,
you'll need to set the option spark.sql.hive.thriftServer.singleSession to true
You can test this by querying the tables in beeline, a command line utility for interacting with the hive thrift server. It ships with Spark.
Finally, you need a way of accessing the hive thrift server from the browser. Thanks to its awesome developers, it has an HTTP mode, so if you want to build a web app, you can use the thrift protocol over AJAX requests from the browser. A simpler strategy might be to create an IPython notebook, and use pyhive to connect to the thrift server.
Data Frame Reference:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/DataFrame.html
singleSession option pull request:
https://mail-archives.apache.org/mod_mbox/spark-commits/201511.mbox/%3Cc2bd1313f7ca4e618ec89badbd8f9f31#git.apache.org%3E
HTTP mode and beeline howto:
https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
Pyhive:
https://github.com/dropbox/PyHive
HiveThriftServer2 startWithContext definition:
https://github.com/apache/spark/blob/6b1a6180e7bd45b0a0ec47de9f7c7956543f4dfa/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56-73
Thrift is JDBC/ODBC server.
You can connect to it via JDBC/ODBC connections and access content through the HiveDriver.
You can not get RDDs back from it, because HiveContext is not available.
What you refered to is an experimental feature not available for Java.
As a workaround, you could re-parse the results and create your structures for your client.
For example:
private static String driverName = "org.apache.hive.jdbc.HiveDriver";
private static String hiveConnectionString = "jdbc:hive2://YourHiveServer:Port";
private static String tableName = "SOME_TABLE";
Class c = Class.forName(driverName);
Connection con = DriverManager.getConnection(hiveConnectionString, "user", "pwd");
Statement stmt = con.createStatement();
String sql = "select * from "+tableName;
ResultSet res = stmt.executeQuery(sql);
parseResultsToObjects(res);

Connecting to Cassandra with Spark

First, I have bought the new O'Reilly Spark book and tried those Cassandra setup instructions. I've also found other stackoverflow posts and various posts and guides over the web. None of them work as-is. Below is as far as I could get.
This is a test with only a handful of records of dummy test data. I am running the most recent Cassandra 2.0.7 Virtual Box VM provided by plasetcassandra.org linked from the main Cassandra project page.
I downloaded Spark 1.2.1 source and got the latest Cassandra Connector code from github and built both against Scala 2.11. I have JDK 1.8.0_40 and Scala 2.11.6 setup on Mac OS 10.10.2.
I run the spark shell with the cassandra connector loaded:
bin/spark-shell --driver-class-path ../spark-cassandra-connector/spark-cassandra-connector/target/scala-2.11/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
Then I do what should be a simple row count type test on a test table of four records:
import com.datastax.spark.connector._
sc.stop
val conf = new org.apache.spark.SparkConf(true).set("spark.cassandra.connection.host", "192.168.56.101")
val sc = new org.apache.spark.SparkContext(conf)
val table = sc.cassandraTable("mykeyspace", "playlists")
table.count
I get the following error. What is confusing is that it is getting errors trying to find Cassandra at 127.0.0.1, but it also recognizes the host name that I configured which is 192.168.56.101.
15/03/16 15:56:54 INFO Cluster: New Cassandra host /192.168.56.101:9042 added
15/03/16 15:56:54 INFO CassandraConnector: Connected to Cassandra cluster: Cluster on a Stick
15/03/16 15:56:54 ERROR ServerSideTokenRangeSplitter: Failure while fetching splits from Cassandra
java.io.IOException: Failed to open thrift connection to Cassandra at 127.0.0.1:9160
<snip>
java.io.IOException: Failed to fetch splits of TokenRange(0,0,Set(CassandraNode(/127.0.0.1,/127.0.0.1)),None) from all endpoints: CassandraNode(/127.0.0.1,/127.0.0.1)
BTW, I can also use a configuration file at conf/spark-defaults.conf to do the above without having to close/recreate a spark context or pass in the --driver-clas-path argument. I ultimately hit the same error though, and the above steps seem easier to communicate in this post.
Any ideas?
Check the rpc_address config in your cassandra.yaml file on your cassandra node. It's likely that the spark connector is using that value from the system.local/system.peers tables and it may be set to 127.0.0.1 in your cassandra.yaml.
The spark connector uses thrift to get token range splits from cassandra. Eventually I'm betting this will be replaced as C* 2.1.4 has a new table called system.size_estimates (CASSANDRA-7688). It looks like it's getting the host metadata to find the nearest host and then making the query using thrift on port 9160.

Resources