How to distribute JDBC jar on Cloudera cluster? - apache-spark

I've just installed a new Spark 2.4 from CSD on my CDH cluster (28 nodes) and am trying to install a JDBC driver in order to read data from a database from within a Jupyter notebook.
I downloaded the driver and copied it to the /jars folder on one node; however, it seems I have to do the same on each and every host (!). Otherwise I'm getting the following error from one of the workers:
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
Is there any easy way (without writing bash scripts) to distribute the jar files with packages across the whole cluster? I wish Spark could distribute them itself (or maybe it does and I just don't know how).

Spark has a jdbc format reader you can use.
Launch a Scala shell to confirm your MS SQL Server driver is on your classpath.
Example:
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
If the driver class isn't found, make sure you place the jar on an edge node and include it in your classpath where you initialize your session.
Example:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
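Since the original question is about avoiding a manual copy to all 28 nodes: anything passed via --jars (or resolved via --packages) is shipped by Spark to the executors automatically, so one copy on the edge node is enough. A hedged sketch for the MS SQL Server driver; the local path, Maven coordinates, and version below are assumptions, so verify them against your environment:
# ship a local copy of the driver to the executors (path is a placeholder)
bin/spark-shell --driver-class-path /path/to/mssql-jdbc.jar --jars /path/to/mssql-jdbc.jar
# or let Spark resolve it from Maven Central (coordinates/version are an assumption)
bin/spark-shell --packages com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8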
Connect to your MS SQL Server via Spark jdbc
Example using PySpark:
# option 1
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

# option 2
jdbcDF2 = spark.read \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})
Specifics and additional ways to build connection strings can be found here:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
You mentioned Jupyter: if you still cannot get the above to work, try setting some environment variables as described in this post (I cannot confirm whether this works, though):
https://medium.com/@thucnc/pyspark-in-jupyter-notebook-working-with-dataframe-jdbc-data-sources-6f3d39300bf6
At the end of the day, all you really need is the driver jar on an edge node (the client where you launch Spark), appended to your classpath. Then make the connection and partition your DataFrame to scale performance, since a plain JDBC read from an RDBMS is single-threaded and therefore lands in one partition.
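A minimal PySpark sketch of such a partitioned read, assuming a SQL Server URL and a numeric column to split on; the host, database, table, and column names are placeholders rather than anything from your setup:
# Spark issues one query per partition instead of a single-threaded scan
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb") \
    .option("dbtable", "dbo.my_table") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "8") \
    .load()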

Related

Saving JSON DataFrame from Spark Standalone Cluster

I am running some tests on my local machine with a Spark standalone cluster (4 Docker containers: 3 workers and 1 master). I'm trying to save a DataFrame in JSON format, like so:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('sparkApp') \
    .master("spark://172.19.0.2:7077") \
    .getOrCreate()

data = spark.range(5, 10)
data.write.format("json") \
    .mode("overwrite") \
    .save("/home/alvaro/Alvaro/ICI/DeltaLake/SparkCluster/output11")
However, the folder created looks like the one in output9. I ran the same code but replaced the master URL with local[5], and it worked, resulting in the file output8.
What should I do to get the DataFrame JSON to be created on my local machine? Is it possible to do so using this kind of Spark cluster?
Thanks in advance!

How to write structured stream data to Cassandra table using pyspark?

This is my terminal command to run the strm.py file:
$SPARK_HOME/bin/spark-submit --master local --driver-memory 4g --num-executors 2 --executor-memory 4g --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 org.apache.spark:spark-cassandra-connector_2.11:2.4.0 strm.py
Error:
Cannot load main class from JAR org.apache.spark:spark-cassandra-connector_2.11:2.4.0 with URI org.apache.spark. Please specify a class through --class.
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:224)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:907)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:907)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
So can anyone help me out: what is the issue here, and why can't it load?
You have 2 problems:
you're submitting your application incorrectly - there is no comma between org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 and org.apache.spark:spark-cassandra-connector_2.11:2.4.0, so spark-submit treats the Cassandra connector as the application jar instead of your Python file (a corrected command is sketched after the code below).
the current version of the Spark Cassandra Connector doesn't support direct writes of Spark Structured Streaming data - this functionality is available only in DSE Analytics. But you can work around this by using foreachBatch, something like this (not tested, the working Scala code is available here):
def foreach_batch_function(df, epoch_id):
    df.write.format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "test").option("table", "my_tables") \
        .mode("append").save()

# streaming_df is the streaming DataFrame built with spark.readStream
streaming_df.writeStream.foreachBatch(foreach_batch_function).start()
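For the first problem, a corrected submit command would look roughly like the sketch below. It also assumes the Cassandra connector's usual Maven group com.datastax.spark rather than org.apache.spark, so verify the coordinates for your versions:
$SPARK_HOME/bin/spark-submit --master local --driver-memory 4g \
  --num-executors 2 --executor-memory 4g \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0,com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
  strm.py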

How to connect to redshift data using Spark on Amazon EMR cluster

I have an Amazon EMR cluster running. If I do
ls -l /usr/share/aws/redshift/jdbc/
it gives me
RedshiftJDBC41-1.2.7.1003.jar
RedshiftJDBC42-1.2.7.1003.jar
Now, I want to use this jar to connect to my Redshift database in my spark-shell. Here is what I do:
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)
val df: DataFrame = sqlContext.read
  .option("url", "jdbc:redshift://host:PORT/DB-name?user=user&password=password")
  .option("dbtable", "tablename")
  .load()
and I get this error -
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
I am not sure whether I am specifying the correct format while reading the data. I have also read that a spark-redshift driver is available, but I do not want to run spark-submit with extra JARs.
How do I connect to Redshift data from spark-shell? Is that the correct JAR for configuring the connection in Spark?
The error being generated is because you are missing the .format("jdbc") in your read. It should be:
val df: DataFrame = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:redshift://host:PORT/DB-name?user=user&password=password")
  .option("dbtable", "tablename")
  .load()
By default, Spark assumes sources to be Parquet files, hence the mention of Parquet in the error.
You may still run into classpath issues finding the drivers, but this change should give you more useful error output. I assume the folder location you listed is on the classpath for Spark on EMR, and those driver versions look fairly current, so they should work.
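If the driver does turn out to be missing from the classpath, launching the shell with the jar passed explicitly is a quick check. A sketch using the path from the listing above (untested):
spark-shell --driver-class-path /usr/share/aws/redshift/jdbc/RedshiftJDBC42-1.2.7.1003.jar \
  --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC42-1.2.7.1003.jar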
Note, this will only work for reading from Redshift. If you need to write to Redshift your best bet is using the Databricks Redshift data source for Spark - https://github.com/databricks/spark-redshift.
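For completeness, a hedged PySpark sketch of a write through that data source; the URL, table, and S3 temp directory are placeholders, and the option names assume a recent version of the library:
# spark-redshift unloads/loads data through S3, hence the tempdir option
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://host:5439/dbname?user=user&password=password") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3a://my-bucket/tmp") \
    .option("forward_spark_s3_credentials", "true") \
    .mode("append") \
    .save()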

Data validation for Oracle to Cassandra Data Migration

We are migrating data from Oracle to Cassandra as part of a daily ETL process. I would like to perform data validation between the two databases once the Spark jobs are complete, to ensure that both databases are in sync. We are using DSE 5.1. Could you please provide your valuable inputs to ensure the data has been migrated properly?
I assume you have DSE Max with Spark support.
Spark SQL should suit this best.
First, connect to Oracle with JDBC:
https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#jdbc-to-other-databases
I have no Oracle DB, so the following code is untested; check the JDBC URL and drivers before running it:
dse spark --driver-class-path ojdbc7.jar --jars ojdbc7.jar

scala> val oData = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:hr/hr@//localhost:1521/pdborcl")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
C* data is already mapped to a Spark SQL table, so:
scala> val cData = spark.sql("select * from keyspace.table")
You will need to check the schemas of both and the data conversion details to compare the tables properly. A simple integration check that all data from Oracle exists in C*:
scala> oData.except(cData).count
res0: Long = 0

How to Access RDD Tables via Spark SQL as a JDBC Distributed Query Engine?

Several postings on Stack Overflow have responses with partial information about how to access RDD tables via Spark SQL as a JDBC distributed query engine. So I'd like to ask the following questions for complete information on how to do that:
In the Spark SQL app, do we need to use HiveContext to register tables? Or can we use just SQLContext?
Where and how do we use HiveThriftServer2.startWithContext?
When we run start-thriftserver.sh as in
/opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://spark-master:7077 --hiveconf hive.server2.thrift.bind.host spark-master --hiveconf hive.server2.thrift.port 10001
besides specifying the jar and main class of the Spark SQL app, do we need to specify any other parameters?
Are there any other things we need to do?
Thanks.
To expose DataFrame temp tables through HiveThriftServer2.startWithContext(), you may need to write and run a simple application; you may not need to run start-thriftserver.sh.
To your questions:
HiveContext is needed; sqlContext is implicitly converted to a HiveContext in spark-shell.
Write a simple application, for example:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver._

val hiveContext = new HiveContext(sparkContext)
hiveContext.parquetFile(path).registerTempTable("my_table1")
HiveThriftServer2.startWithContext(hiveContext)
No need to run start-thriftserver.sh, but run your own application instead, e.g.:
spark-submit --class com.xxx.MyJdbcApp ./package_with_my_app.jar
Nothing else is needed on the server side; it should start on the default port 10000.
You may verify by connecting to the server with beeline.
In Java I was able to expose DataFrames as temp tables and read the table content via beeline (like a regular Hive table).
I haven't posted the entire program (assuming you already know how to create DataFrames):
import org.apache.spark.sql.hive.thriftserver.*;
HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
DataFrame orgDf = sqlContext.createDataFrame(orgPairRdd.values(), OrgMaster.class);
orgPairRdd is a JavaPairRDD; orgPairRdd.values() contains the entire class value (rows fetched from HBase).
OrgMaster is a serializable Java bean class.
orgDf.registerTempTable("spark_org_master_table");
HiveThriftServer2.startWithContext(sqlContext);
I submitted the program locally (as the Hive Thrift server is not running on port 10000 on that machine):
hadoop_classpath=$(hadoop classpath)
HBASE_CLASSPATH=$(hbase classpath)
spark-1.5.2/bin/spark-submit --name tempSparkTable \
  --class packageName.SparkCreateOrgMasterTableFile \
  --master local[4] --num-executors 4 --executor-cores 4 --executor-memory 8G \
  --conf "spark.executor.extraClassPath=${HBASE_CLASSPATH}" \
  --conf "spark.driver.extraClassPath=${HBASE_CLASSPATH}" \
  --conf "spark.executor.extraClassPath=${hadoop_classpath}" \
  --jars /path/programName-SNAPSHOT-jar-with-dependencies.jar \
  /path/programName-SNAPSHOT.jar
In another terminal, start beeline, pointing it to the Thrift service started by this Spark program:
/opt/hive/hive-1.2/bin/beeline -u jdbc:hive2://<ipaddressofMachineWhereSparkPgmRunninglocally>:10000 -n anyUsername
The show tables command will display the table that you registered in Spark.
You can also run describe.
In this example
describe spark_org_master_table;
Then you can run regular queries in beeline against this table (until you kill the Spark program).
