I'm trying to connect to an Azure MySQL database server to create a table from a DataFrame in Azure Synapse with Spark.
I have this URL and these properties.
All variables like jdbcXYZ are filled with the correct values from the database.
import java.util.Properties
val jdbc_url = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/database=${jdbcDatabase};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=60;"
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
And I try to write to the database with:
spark.table("tabletemp").write.mode("append").jdbc(jdbc_url, "table", connectionProperties)
I also tried
df.write.format("jdbc").mode("append").option("url", jdbc_url).option("dbtable", jdbcDatabase).option("user", jdbcUsername).option("password", jdbcPassword).save()
And I'm always receiving the same error:
com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server
Do you know how to solve it? Thanks in advance
I am very confident that this is a config issue; you are not passing the correct values for all the properties. A quick scan makes me think that this is not correct:
.option("dbtable", jdbcDatabase).
I have trouble connecting to the Azure Postgres database from Python. I am following the guide here - https://learn.microsoft.com/cs-cz/azure/postgresql/connect-python
I have basically the same code for setting up the connection.
But psycopg2 and SQLAlchemy both throw me the same error:
OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
I am able to connect to the instance with other client tools like DBeaver, but from Python it does not work.
When I investigate the Postgres logs, I can see that the server actually authorized the connection, but the next line says:
could not receive data from client: An existing connection was forcibly closed by the remote host.
Python is 3.7
psycopg2's version is 2.8.5
Azure Postgres region is in West Europe
Does anyone have any suggestions on what I should try to make it work?
Thank you!
EDIT:
The issue resolved itself. I tried the same setup a few days later and it started working. Might have been something wrong with Azure West Europe.
I had this issue too. I think I read somewhere (I forget where) that Azure has an issue with the @ you have to use for the username (user@serverName).
I created variables and an f-string and then it worked OK.
import sqlalchemy

# Azure Postgres usernames take the form user@server_name
username = 'user@server_name'
password = 'PassWord!'
host = 'server_name.postgres.database.azure.com'
database = 'your_database'
# If the '@' inside the username confuses the URL parser, encode it as %40
conn_str = f'postgresql+psycopg2://{username}:{password}@{host}/{database}'
After that:
engine = sqlalchemy.create_engine(conn_str, pool_pre_ping=True)
conn = engine.connect()
Test it with a simple SQL statement.
sql = 'SELECT * FROM public.some_table;'
results = conn.engine.execute(sql)
This was a connection in UK South. Before that it did complain about the format of the username having to use @, although the username was correct, as tested from the command line with psql and another SQL client.
I am able to connect to Hive using hive.metastore.uris in SparkSession. What I want is to connect to a particular database of Hive with this connection so that I don't need to add the database name to each table name in queries. Is there any way to achieve this?
Expecting code something like
SparkSession sparkSession = SparkSession.config("hive.metastore.uris", "thrift://dhdhdkkd136.india.sghjd.com:9083/hive_database")
You can use the catalog API accessible from the SparkSession.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Catalog
You can then call sparkSession.catalog.setCurrentDatabase(<db_name>)
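For example, a minimal Scala sketch, reusing the metastore URI from the question and a placeholder database/table name:

import org.apache.spark.sql.SparkSession

// Sketch only: connect to the metastore, then make hive_database the current
// database so queries can drop the database prefix.
val sparkSession = SparkSession.builder()
  .appName("hive-example")
  .config("hive.metastore.uris", "thrift://dhdhdkkd136.india.sghjd.com:9083")
  .enableHiveSupport()
  .getOrCreate()

sparkSession.catalog.setCurrentDatabase("hive_database")

// some_table now resolves against hive_database
sparkSession.sql("SELECT * FROM some_table").show()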
I have created a Java application that starts Spark (local[*]) and uses it to read a CSV file as a Dataset<Row> and to create a temporary view with createOrReplaceTempView.
At this point I am able to use SQL to query the view inside my application.
What I would like to do, for development and debugging purposes, is to execute queries in an interactive way from outside my application.
Any hints?
Thanks in advance
You can use Spark's DeveloperApi - HiveThriftServer2.
@DeveloperApi
def startWithContext(sqlContext: SQLContext): Unit = {
  val server = new HiveThriftServer2(sqlContext)
The only thing you need to do in your application is to get the SQLContext and use it as follows:
HiveThriftServer2.startWithContext(sqlContext)
This will start the Hive Thrift server (by default on port 10000) and you can use a SQL client - e.g. beeline - for accessing and querying your data in temp tables.
Also, you will need to set --conf spark.sql.hive.thriftServer.singleSession=true, which allows you to see temp tables. By default it's set to false, so each connection has its own session and they don't see each other's temp tables.
"spark.sql.hive.thriftServer.singleSession" - When set to true, Hive Thrift server is running in a single session
mode. All the JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database.
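Putting it together, a minimal sketch, assuming Spark with the Thrift-server module on the classpath; the app name, CSV path and view name are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Sketch only: expose this application's temp view over the Thrift server.
val spark = SparkSession.builder()
  .appName("my-app")
  .master("local[*]")
  .config("spark.sql.hive.thriftServer.singleSession", "true")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.option("header", "true").csv("/path/to/file.csv")
df.createOrReplaceTempView("tabletemp")

// Starts the Hive Thrift server, on port 10000 by default
HiveThriftServer2.startWithContext(spark.sqlContext)

You can then attach interactively with beeline, e.g. beeline -u jdbc:hive2://localhost:10000, and query tabletemp from there.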
Thanks for your time. I am trying to access a remote Cassandra DB in order to complete my assertions. I see that the server is running:
Cassandra V 3.0.8.1293
Driver Type: Cassandra CQL
Datastax Java Driver for Apache Cassandra - Core [3.0.5]
So, I am trying to access the DB with the following simple code:
import com.datastax.driver.core.*;

Cluster cluster = null;
try {
    cluster = Cluster.builder()
        .addContactPoint("x.x.x.x")
        .withCredentials("xxxxxxx", "xxxxxx")
        .withPort(9042)
        .build();
    Session session = cluster.connect();
    ResultSet rs = session.execute("select * from TABLE");
    Row row = rs.one();
} finally {
    if (cluster != null) cluster.close();
}
When I use the cassandra-driver-core-2.0.1.jar I am getting this error:
ERROR:com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /x.x.x.x(null))
I read the documentation and a lot of posts here and on other blogs and saw that there may be an incompatibility with the driver version, so I tried to upgrade the driver to several versions (cassandra-driver-core-2.5, cassandra-driver-core-3, cassandra-driver-core-3.2), but with those I am getting the following:
ERROR:java.lang.ExceptionInInitializerError
I have also tried to connect using JDBC, but to no avail, using the configuration proposed in this thread:
SoapUI JDBC connection with Apache Cassandra
Actually, I am running out of ideas. Can anyone propose or point me in some direction on how to achieve this, either with a tutorial or any other idea?
Thank you very much
I think you haven't enabled remote access to Cassandra.
Try enabling remote access using the configuration below.
File path: /etc/cassandra/default.conf/cassandra.yaml
rpc_address: 0.0.0.0
broadcast_rpc_address: <serverIPAddress>
After that, restart the Cassandra service.
I have set up Vertica on a cluster; there are 5 nodes. I am using the code below to write a data frame to a Vertica table:
Map<String, String> opts = new HashMap<>();
opts.put("table", tableName);
opts.put("db", verticaDB);
opts.put("dbschema", dashboardSchema);
opts.put("user", verticaUserName);
opts.put("password", options.verticaPassword);
opts.put("host", verticaHost);
opts.put("hdfs_url",hdfs url);
opts.put("web_hdfs_url",web_hdfs_url);
String SPARK_VERTICA_SOURCE = "com.vertica.spark.datasource.DefaultSource";
dataFrame.write()
    .format(SPARK_VERTICA_SOURCE)
    .options(opts)
    .mode(saveMode)
    .save();
The above code is working fine, but it is connecting to a single master node of Vertica.
I tried to pass the host as a connection URL for multiple cluster nodes:
master_node_ip:5433/schema?Connectionloadbalance=1&backupservernode=node2_ip,node3_ip
I am new to Spark. How can I use load balancing to connect to Vertica from Spark?
Thanks in advance.
If you connect to Vertica that way, ConnectionLoadBalance has exactly this effect: you send the connection request to master_node_ip (a strange name, as Vertica has no master node). To put it in a simplified way: the node in the cluster receiving the connect request "asks" all nodes in the cluster which one currently has the lowest load in number of connections. That node will then respond to the connection request, and you will be connected with that one.
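As an illustration only, a plain JDBC connection with native load balancing might look like the sketch below; the node addresses, database name and credentials are placeholders, it assumes the Vertica JDBC driver is on the classpath, and the server side must have a load balancing policy enabled for the redirect to happen:

import java.sql.DriverManager
import java.util.Properties

// Sketch only, placeholders throughout. ConnectionLoadBalance=1 asks the cluster to
// redirect the session according to its load balancing policy; BackupServerNode lists
// fallback nodes to try if the first contact point is unreachable.
val url = "jdbc:vertica://master_node_ip:5433/verticaDB" +
  "?ConnectionLoadBalance=1&BackupServerNode=node2_ip:5433,node3_ip:5433"

val props = new Properties()
props.put("user", "verticaUserName")
props.put("password", "verticaPassword")

val conn = DriverManager.getConnection(url, props)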
If you want more than that, your client (Spark in this case) will have to instantiate, for example, as many threads as you have Vertica nodes; each connects to a different Vertica node, with ConnectionLoadBalance=False, so that each remains connected exactly where it "wanted" to.
Hope this helps - Marco