I have created an Azure Cosmos DB account using the MongoDB API. I need to connect the Cosmos DB (MongoDB API) account to an Azure Databricks cluster in order to read and write data from Cosmos.
How do I connect an Azure Databricks cluster to a Cosmos DB account?
Here is the PySpark code I use to connect to a Cosmos DB database using the MongoDB API from Azure Databricks (runtime 5.2 ML Beta, which includes Apache Spark 2.4.0 and Scala 2.11, with the MongoDB connector org.mongodb.spark:mongo-spark-connector_2.11:2.4.0):
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .getOrCreate()

df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", CONNECTION_STRING) \
    .load()
With a CONNECTION_STRING that looks like this:
"mongodb://USERNAME:PASSWORD@testgp.documents.azure.com:10255/DATABASE_NAME.COLLECTION_NAME?ssl=true&replicaSet=globaldb"
I tried a lot of other options (adding the database and collection names as read options or as SparkSession config) without success.
Tell me if it works for you...
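Since the original question also asks about writing data, here is a minimal sketch of the corresponding write path, assuming the same connector and a CONNECTION_STRING of the form shown above (not something I have verified against Cosmos DB):

# Hedged sketch: write a DataFrame back through the MongoDB API using the same
# connector; the URI is assumed to already contain DATABASE_NAME.COLLECTION_NAME.
df.write.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", CONNECTION_STRING) \
    .mode("append") \
    .save()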
After adding the org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 package, this worked for me:
import json

query = {
    '$limit': 100,
}

query_config = {
    'uri': 'myConnectionString',
    'database': 'myDatabase',
    'collection': 'myCollection',
    'pipeline': json.dumps(query),
}

df = spark.read.format("com.mongodb.spark.sql") \
    .options(**query_config) \
    .load()
I do, however, get this error with some collections:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.139.64.6, executor 0): com.mongodb.MongoInternalException: The reply message length 10168676 is less than the maximum message length 4194304
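One workaround I have seen suggested (not verified here) is to lower the connector's read batch size so that each reply stays under the driver's message limit, for example:

# Assumption: your mongo-spark-connector version supports the 'batchSize' read
# option controlling the cursor batch size; a smaller value keeps each reply
# under the 4 MB limit reported in the error above.
query_config = {
    'uri': 'myConnectionString',
    'database': 'myDatabase',
    'collection': 'myCollection',
    'batchSize': '50',
    'pipeline': json.dumps(query),
}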
Answering my own question, the same way I did on the original post.
Using Maven as the source, I installed the right library on my cluster using the coordinate
org.mongodb.spark:mongo-spark-connector_2.11:2.4.0
(for Spark 2.4).
An example of the code I used is as follows (for those who want to try it):
# Read configuration
readConfig = {
    "URI": "<URI>",
    "Database": "<database>",
    "Collection": "<collection>",
    "ReadingBatchSize": "<batchSize>"
}

pipelineAccounts = "{'$sort' : {'account_contact': 1}}"

# Connect via azure-cosmosdb-spark to create a Spark DataFrame
accountsTest = (spark.read
    .format("com.mongodb.spark.sql")
    .options(**readConfig)
    .option("pipeline", pipelineAccounts)
    .load())

accountsTest.select("account_id").show()
We are using Vertica Community Edition "vertica_community_edition-11.0.1-0" and Spark 3.2 with a local[*] master. When we try to save data to the Vertica database using the following:
member.write()
    .format("com.vertica.spark.datasource.VerticaSource")
    .mode(SaveMode.Overwrite)
    .option("host", "192.168.1.25")
    .option("port", "5433")
    .option("user", "Fred")
    .option("db", "store")
    .option("password", "password")
    //.option("dbschema", "store")
    .option("table", "Test")
    //.option("staging_fs_url", "hdfs://172.16.20.17:9820")
    .save();
We are getting the following exception:
com.vertica.spark.util.error.ConnectorException: Fatal error: spark context did not exist
at com.vertica.spark.datasource.VerticaSource.extractCatalog(VerticaDatasourceV2.scala:76)
at org.apache.spark.sql.connector.catalog.CatalogV2Util$.getTableProviderCatalog(CatalogV2Util.scala:363)
Kindly let us know how to solve this exception.
We had a similar case. The root cause was that SparkSession.getActiveSession() returned None, because the Spark session had been registered on another thread of the JVM. We could still get to the single session we had using SparkSession.getDefaultSession() and manually register it with SparkSession.setActiveSession(...).
Our case happened in a Jupyter kernel where we were using PySpark.
The workaround code was:
sp = sc._jvm.SparkSession.getDefaultSession().get()
sc._jvm.SparkSession.setActiveSession(sp)
I can't try Scala or Java, but I suppose it would look like this:
SparkSession.setActiveSession(SparkSession.getDefaultSession.get)
Vertica doesn't officially support Spark 3.2 with Vertica 11.0. Please see the documentation link below:
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SupportedPlatforms/SparkIntegration.htm
Please try using the Spark connector v2 with a supported version of Spark, and try running the examples from GitHub:
https://github.com/vertica/spark-connector/tree/main/examples
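If it helps, here is a hedged PySpark sketch of the same write against a Spark version the connector supports, reusing the option names from the question (df stands in for your DataFrame; whether staging_fs_url is required depends on your connector version):

# Hedged sketch: same write as the Java snippet in the question, expressed in
# PySpark against a supported Spark version. Option names come from the question.
df.write \
    .format("com.vertica.spark.datasource.VerticaSource") \
    .mode("overwrite") \
    .option("host", "192.168.1.25") \
    .option("port", "5433") \
    .option("user", "Fred") \
    .option("db", "store") \
    .option("password", "password") \
    .option("table", "Test") \
    .option("staging_fs_url", "hdfs://172.16.20.17:9820") \
    .save()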
I submitted a Spark batch job through Livy to a remote cluster with the following request body:
REQUEST_BODY = {
    'file': '/spark/batch/job.py',
    'conf': {
        'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation': 'true',
        'spark.driver.cores': 1,
        'spark.driver.memory': '12g',
        'spark.executor.cores': 1,
        'spark.executor.memory': '8g',
        'spark.dynamicAllocation.maxExecutors': 4,
    },
}
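For context, here is a sketch of how a body like this is typically submitted to Livy's /batches endpoint (the Livy URL below is a placeholder, not something from the original post):

import requests

# Placeholder Livy endpoint; replace with your own Livy server address.
LIVY_URL = "http://livy-server:8998"

# POST the batch definition; Livy responds with JSON describing the new batch.
resp = requests.post(f"{LIVY_URL}/batches",
                     json=REQUEST_BODY,
                     headers={"Content-Type": "application/json"})
print(resp.json())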
And in the Python file containing the application to be run, the SparkSession is created with the following command:
# inside /spark/batch/job.py
spark = SparkSession.builder.getOrCreate()
# Spark application continues after this point, using the SparkSession created above
The application works just fine, but Spark acquires all of the resources in the cluster, ignoring the configuration set in the request body 😕.
I suspect that /spark/batch/job.py creates another SparkSession apart from the one specified in the Livy request body, but I am not sure how to use the SparkSession provided by Livy. The documentation on this topic is sparse.
Is anyone facing the same issue? How can I solve this problem?
Thanks in advance, everyone!
I am trying to query a remote Hive metastore within PySpark using a username, password, and JDBC URL. I can initialize the SparkSession just fine, but I am unable to actually query the tables. I would like to keep everything in a Python environment if possible. Any ideas?
from pyspark.sql import SparkSession

url = f"jdbc:hive2://{jdbcHostname}:{jdbcPort}/{jdbcDatabase}"
driver = "org.apache.hive.jdbc.HiveDriver"

# initialize (also tried .config("javax.jdo.option.ConnectionURL", url) instead of "hive.metastore.uris")
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("hive.metastore.uris", url) \
    .config("javax.jdo.option.ConnectionDriverName", driver) \
    .config("javax.jdo.option.ConnectionUserName", username) \
    .config("javax.jdo.option.ConnectionPassword", password) \
    .enableHiveSupport() \
    .getOrCreate()

# query
spark.sql("select * from database.tbl limit 100").show()
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Previously, I was able to connect to a single table using JDBC but was unable to retrieve any data; see Errors querying Hive table from PySpark.
The metastore URIs are not JDBC addresses; they are simply server:port addresses opened up by the metastore server process, typically on port 9083.
The metastore itself would not be a jdbc:hive2 connection; it would instead be the respective RDBMS that the metastore is configured with (as set in hive-site.xml).
If you want to use Spark with JDBC, then you don't need those javax.jdo options, as the JDBC reader has its own username, driver, etc. options.
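Put differently, here is a hedged sketch of the two separate routes, with placeholder hosts and ports:

from pyspark.sql import SparkSession

# Route 1 (sketch): talk to the metastore over its Thrift endpoint, not jdbc:hive2.
# "metastore-host" and 9083 are placeholders for your metastore server.
spark = SparkSession.builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("hive.metastore.uris", "thrift://metastore-host:9083") \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("select * from database.tbl limit 100").show()

# Route 2 (sketch): go through HiveServer2 with the JDBC reader instead;
# it takes its own user/password/driver options, so no javax.jdo settings are needed.
df = spark.read.format("jdbc") \
    .option("url", "jdbc:hive2://hiveserver2-host:10000/default") \
    .option("driver", "org.apache.hive.jdbc.HiveDriver") \
    .option("user", username) \
    .option("password", password) \
    .option("dbtable", "database.tbl") \
    .load()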
I am using a PySpark script to read data from a remote Hive instance through the JDBC driver. I have tried another method using enableHiveSupport and hive-site.xml, but that technique is not possible for me due to some limitations (access to launch YARN jobs from outside the cluster is blocked). The following is the only way I can connect to Hive:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hive") \
    .config("spark.sql.hive.metastorePartitionPruning", "true") \
    .config("hadoop.security.authentication", "kerberos") \
    .getOrCreate()

jdbcdf = spark.read.format("jdbc") \
    .option("url", "urlname") \
    .option("driver", "com.cloudera.hive.jdbc41.HS2Driver") \
    .option("user", "username") \
    .option("dbtable", "dbname.tablename") \
    .load()

spark.sql("show tables from dbname").show()
This gives me the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.sql.
: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'vqaa' not found;
Could someone please explain how I can access the remote database/tables using this method? Thanks.
Add .enableHiveSupport() to your SparkSession in order to access the Hive catalog.
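For example, a sketch of the same builder from the question with Hive support enabled (everything else unchanged):

from pyspark.sql import SparkSession

# Same configuration as in the question, plus enableHiveSupport() so that
# spark.sql(...) resolves databases and tables through the Hive catalog.
spark = SparkSession.builder \
    .appName("hive") \
    .config("spark.sql.hive.metastorePartitionPruning", "true") \
    .config("hadoop.security.authentication", "kerberos") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("show tables from dbname").show()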
I am writing this not to ask a question, but to share the knowledge.
I was using Spark to connect to Snowflake, but I could not access Snowflake. It seemed like there was something wrong with the internal JDBC driver in Databricks.
Here is the error I got:
java.lang.NoClassDefFoundError:net/snowflake/client/jdbc/internal/snowflake/common/core/S3FileEncryptionMaterial
I tried many versions of the Snowflake JDBC driver and the Snowflake Spark connector, but it seemed like I could not find a matching combination.
Answer as given by the asker (I just extracted it from the question for better site usability):
Step 1: Create a cluster with Spark version 2.3.0 and Scala version 2.11.
Step 2: Attach snowflake-jdbc-3.5.4.jar to the cluster.
https://mvnrepository.com/artifact/net.snowflake/snowflake-jdbc/3.5.4
Step 3: Attach the spark-snowflake_2.11-2.3.2 driver to the cluster.
https://mvnrepository.com/artifact/net.snowflake/spark-snowflake_2.11/2.3.2
Here is the sample code.
val SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

val sfOptions = Map(
  "sfURL" -> "<snowflake_url>",
  "sfAccount" -> "<your account name>",
  "sfUser" -> "<your account user>",
  "sfPassword" -> "<your account pwd>",
  "sfDatabase" -> "<your database name>",
  "sfSchema" -> "<your schema name>",
  "sfWarehouse" -> "<your warehouse name>",
  "sfRole" -> "<your account role>",
  "region_id" -> "<your region name, if you are out of us region>"
)

val df: DataFrame = sqlContext.read
  .format(SNOWFLAKE_SOURCE_NAME)
  .options(sfOptions)
  .option("dbtable", "<your table>")
  .load()
If you are using Databricks, there is a Databricks Snowflake connector created jointly by Databricks and Snowflake. You just have to provide a few items to create a Spark DataFrame (see below, copied from the Databricks documentation).
# snowflake connection options
options = dict(
    sfUrl="<URL for your Snowflake account>",
    sfUser=user,
    sfPassword=password,
    sfDatabase="<The database to use for the session after connecting>",
    sfSchema="<The schema to use for the session after connecting>",
    sfWarehouse="<The default virtual warehouse to use for the session after connecting>"
)

df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("dbtable", "<The name of the table to be read>") \
    .load()

display(df)
As long as you are accessing your own databases with all the access rights granted correctly, this only takes a few minutes, even on a first attempt.
Good luck!
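If you also need to write back to Snowflake, here is a minimal sketch of the corresponding write path with the same options dict (the target table name is a placeholder):

# Sketch: write a DataFrame back to Snowflake using the same connection options.
df.write \
    .format("snowflake") \
    .options(**options) \
    .option("dbtable", "<The name of the target table>") \
    .mode("overwrite") \
    .save()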
You need to set the CLASSPATH variable to point to the jars, as below. You also need to set SPARK_HOME and SCALA_HOME in addition to PYTHONPATH.
export CLASSPATH=/snowflake-jdbc-3.8.0.jar:/spark-snowflake_2.11-2.4.14-spark_2.4.jar
You can also load the jars in memory in your code to resolve this issue:
spark = SparkSession \
    .builder \
    .config("spark.jars", "file:///app/snowflake-jdbc-3.9.1.jar,file:///app/spark-snowflake_2.11-2.5.3-spark_2.2.jar") \
    .config("spark.repl.local.jars",
            "file:///app/snowflake-jdbc-3.9.1.jar,file:///app/spark-snowflake_2.11-2.5.3-spark_2.2.jar") \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .getOrCreate()
Please update to the latest version of the Snowflake JDBC driver (3.2.5); that should resolve this issue. Thanks!