PySpark: How to set up multiple JDBC connections? - apache-spark

Use case: I have two databases, one for prod and one for dev. The prod database uses an SAP JDBC driver and the dev database uses an Oracle JDBC driver, as they are built on different database engines. I have to fetch data from the prod DB, perform a few operations, and save it in the dev DB for some project needs.
Issue: Currently I am loading these third-party drivers by setting "spark.driver.extraClassPath" in the Spark context, but this setting takes only one value, so I can connect to only one of the DBs at a time.
Is there any way I can configure two different JDBC classpaths? If not, how should I approach this issue? Any guidance is much appreciated!

Solution:
Instead of pointing to a single driver file, providing a folder path loads all the drivers in that folder. So, in my case, I placed both the SAP and Oracle JDBC drivers in the same folder and referenced it in the Spark context configuration as shown in the snippet below.
.set("spark.driver.extraClassPath", r"<folder_path_jdbc_drivers>\*")

Related

Databricks Lakehouse JDBC and Docker

Pretty new to Databricks.
I've got a requirement to access data in the Lakehouse using a JDBC driver. This works fine.
I now want to stub the Lakehouse using a docker image for some tests I want to write. Is it possible to get a Databricks / spark docker image with a database in it? I would also want to bootstrap the database on startup to create a bunch of tables.
No - Databricks is not a database but a hosted service (PaaS). Theoretically you could use OSS Spark with a Thrift server running on it, but the connection strings and other functionality would be very different, so it makes no sense to spend time on it (imho). The real solution depends on the type of tests you want to do.
Regarding bootstrapping the database and creating a bunch of tables - just issue commands like CREATE DATABASE IF NOT EXISTS or CREATE TABLE IF NOT EXISTS when your application starts up (see the documentation for the exact syntax).
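As an illustration of that bootstrap step, a minimal PySpark sketch; the database name, table name and columns are hypothetical examples, not from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Idempotent setup: safe to run on every application startup.
spark.sql("CREATE DATABASE IF NOT EXISTS testdb")
spark.sql("""
    CREATE TABLE IF NOT EXISTS testdb.events (
        id BIGINT,
        payload STRING,
        created_at TIMESTAMP
    )
""")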

Creating catalog/schema/table in prestosql/presto container

I would like to use the prestosql/presto container for automated tests. For this purpose I want to be able to programmatically create a catalog/schema/table. Unfortunately, I didn't find an option to do this via Docker environment variables. If I try to do it via the JDBC connector, I receive the following error: "This connector does not support creating tables".
How can I create schemas or tables using prestosql/presto container?
If you are writing tests in Java (as suggested by the JDBC tag), you can use the testcontainers library. It comes with a Presto module, which:
uses the prestosql/presto container under the hood
comes with the Presto memory connector pre-installed, so you can create schemas & tables there
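If your tests are in Python instead, a rough equivalent is sketched below. This is an assumption on my part: it uses the generic DockerContainer from the testcontainers-python package together with presto-python-client, and creates a schema and table in the memory catalog. The image tag, helper names and the log line waited on should be double-checked against the versions you use.

from prestodb.dbapi import connect                      # presto-python-client
from testcontainers.core.container import DockerContainer
from testcontainers.core.waiting_utils import wait_for_logs

# Start the prestosql/presto image and wait until the server reports it is up.
with DockerContainer("prestosql/presto:latest").with_exposed_ports(8080) as presto:
    wait_for_logs(presto, "SERVER STARTED", timeout=120)

    conn = connect(
        host=presto.get_container_host_ip(),
        port=int(presto.get_exposed_port(8080)),
        user="test",
        catalog="memory",          # the memory connector ships with the image
        schema="default",
    )
    cur = conn.cursor()

    # DDL works against the memory catalog.
    cur.execute("CREATE SCHEMA IF NOT EXISTS memory.test")
    cur.fetchall()
    cur.execute("CREATE TABLE memory.test.events (id bigint, name varchar)")
    cur.fetchall()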

SSIS to Azure HDInsight Using Microsoft Hive ODBC Driver

Currently driving an R&D project testing hard against Azure's HDInsight Hadoop service. We use SQL Server Integration Services to manage ETL workflows, so making HDInsight work with SSIS is a must.
I've had good success with a few of the Azure Feature Pack tasks. But there is no native HDInsight/Hadoop Destination task for use with DFTs.
Problem With Microsoft's Hive ODBC Driver Within An SSIS DFT
I create a DFT with a simple SQL Server "OLE DB Source" pointing to the cluster with an "ODBC Destination" using the Microsoft Hive ODBC Driver. (Ignore the red error - it has detected that the cluster is destroyed.)
I've tested the cluster ODBC connection after entering all the parameters, and it tests "OK". It is even able to read the Hive table and map all the columns. The problem arrives at run time: it generally just locks up with no rows in the counter, or it gets a handful of rows into the buffer and freezes.
Troubleshooting I've tried:
Verified connection string and Hadoop cluster username/password.
Recreated cluster and task several times.
The source is SQL Server, and it runs fine if I point it at only a file destination or recordset destination.
Tested a smaller number of rows to see if it is a simple performance issue (SELECT TOP 100 * FROM stupidTable). Also tested with only 4 columns.
Tested on a separate workstation to make sure it wasn't related to the machine.
All that said, I can't figure out what else to try. I'm not doing much differently from examples on the web like this one, except that I'm using ODBC as a destination and not a source.
Has anyone had success using the Hive driver, or another one, within an SSIS destination task? Thanks in advance.

Writable Shared Memory in Apache Spark

I am working on a Twitter data analysis project using Apache Spark with Java, and Cassandra as the NoSQL database.
In this project I want to maintain an ArrayList of LinkedLists (using Java's built-in ArrayList and LinkedList) that is common to all mapper nodes. I mean, if one mapper writes some data into the ArrayList, it should be reflected on all other mapper nodes.
I am aware of broadcast variables, but those are read-only shared variables; what I want is a shared writable structure where changes made by one mapper are reflected on all the others.
Any advice on how to achieve this in Apache Spark with Java would be of great help.
Thanks in advance
The short, and most likely disappointing, answer is that it is not possible given Spark's architecture. Worker nodes don't communicate with each other, and neither broadcast variables nor accumulators (write-only variables) are truly shared variables. You can try workarounds like using external services or a shared file system to communicate, but that introduces all kinds of issues such as idempotency and synchronization.
As far as I can tell, the best you can get is updating state between batches, or using tools like StreamingContext.remember.
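To make "updating state between batches" concrete, here is a minimal sketch using PySpark's (legacy) DStream API rather than Java, to stay consistent with the rest of this page. The socket source, host/port and word-count logic are hypothetical examples; updateStateByKey keeps per-key state across micro-batches, which is not the same as a shared writable structure across mappers.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stateful-word-counts")
ssc = StreamingContext(sc, 5)                 # 5-second micro-batches
ssc.checkpoint("/tmp/streaming-checkpoint")   # required for stateful operations

def update_count(new_values, running_count):
    # Merge this batch's counts into the running total for the key.
    return sum(new_values) + (running_count or 0)

lines = ssc.socketTextStream("localhost", 9999)   # hypothetical text source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .updateStateByKey(update_count))

counts.pprint()
ssc.start()
ssc.awaitTermination()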

Set cluster name when using Cassandra CQL/JDBC driver

I'm using the Cassandra CQL/JDBC driver I got from Google Code, but it doesn't seem to let me provide a cluster name - is there a way?
I'm using cluster names to ensure I don't run commands against a live system; it has a different cluster name from my dev systems.
Edit: Just to clarify, I have two totally separate Cassandra clusters, one live and one for test. They have different cluster names to ensure that I don't accidentally run test code meant for the test cluster on the live cluster. Therefore any client I need to use must let me set a cluster name. Hector does this.
There is no built-in protection for checking cluster names in Cassandra clients. The cluster name exists to ensure nodes from different clusters don't try to join together, not to ensure that clients connect to the right cluster. It would be possible to add this check to a client, though (since the cluster name is exposed to the client), but I'm not aware of any clients doing this.
I'd strongly recommend firewalling off your different environments to avoid this kind of mistake. If that isn't possible, you should choose different ports to avoid confusion. Change this with the 'rpc_port' setting in cassandra.yaml.
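If you want to roll that check yourself, a minimal sketch follows. It assumes the DataStax Python driver rather than the CQL/JDBC driver from the question, and the contact point and expected name are placeholders; the same idea (reading cluster_name from system.local before doing anything else) should port to any client.

from cassandra.cluster import Cluster

EXPECTED_CLUSTER_NAME = "DevCluster"        # placeholder

cluster = Cluster(["127.0.0.1"])            # placeholder contact point
session = cluster.connect()

# system.local exposes the cluster name to any connected client.
row = session.execute("SELECT cluster_name FROM system.local").one()
if row.cluster_name != EXPECTED_CLUSTER_NAME:
    cluster.shutdown()
    raise RuntimeError(
        "Refusing to run: connected to '%s', expected '%s'"
        % (row.cluster_name, EXPECTED_CLUSTER_NAME)
    )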
You'd have to mirror the data on two different clusters. You can't access the same cluster with different names.
To rename your cluster (from the default 'Test Cluster'), edit the Cassandra configuration file found at location/of/cassandra/conf/cassandra.yaml. It's the cluster_name setting on the top line; if you need more details, look at the DataStax configuration documentation.
