Data validation for Oracle to Cassandra Data Migration - apache-spark

We are migrating data from Oracle to Cassandra daily as part of an ETL process. I would like to validate the data between the two databases once the Spark jobs are complete, to ensure that both databases are in sync. We are using DSE 5.1. Could you suggest how to verify that the data has migrated properly?

I assume you have DSE Max with Spark support.
Spark SQL should suit this best.
First, connect to Oracle with JDBC:
https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#jdbc-to-other-databases
I have no Oracle DB, so the following code is untested; check the JDBC URL and drivers before running it:
dse spark --driver-class-path ojdbc7.jar --jars ojdbc7.jar
scala> val oData = spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:hr/hr@//localhost:1521/pdborcl")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.load()
C* data is already mapped to a Spark SQL table, so:
scala> val cData = spark.sql("select * from keyspace.table")
You will need to check the schemas of both tables and the details of any data conversions in order to compare them properly. A simple integrity check that all data from Oracle exists in C*:
scala> oData.except(cData).count
res0: Long = 0
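A two-way version of that check can be sketched in PySpark as below. This is a minimal sketch and not tested against DSE: `o_data` and `c_data` stand for the Oracle and Cassandra DataFrames loaded above, and the helper names `common_columns` and `validation_counts` are hypothetical. It assumes the column types on both sides are already compatible (add casts otherwise). Note that `subtract` compares distinct rows, so duplicates are not detected; the plain row counts are included for that reason.

```python
def common_columns(oracle_cols, cassandra_cols):
    """Return (oracle_name, cassandra_name) pairs matched case-insensitively."""
    by_lower = {c.lower(): c for c in cassandra_cols}
    return [(o, by_lower[o.lower()]) for o in oracle_cols if o.lower() in by_lower]


def validation_counts(o_data, c_data):
    """Count rows that exist on one side only, over the shared columns."""
    pairs = common_columns(o_data.columns, c_data.columns)
    # Select the shared columns in the same order, aliased to lowercase names,
    # because set operations on DataFrames match columns by position.
    o_sel = o_data.select([o_data[o].alias(o.lower()) for o, _ in pairs])
    c_sel = c_data.select([c_data[c].alias(c.lower()) for _, c in pairs])
    return {
        "oracle_rows": o_sel.count(),
        "cassandra_rows": c_sel.count(),
        "missing_in_cassandra": o_sel.subtract(c_sel).count(),
        "extra_in_cassandra": c_sel.subtract(o_sel).count(),
    }
```

If both "missing" and "extra" counts are zero and the row counts match, the two tables agree on the shared columns.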

Related

Best way to process Redshift data on Spark (EMR) via Airflow MWAA?

We have an Airflow MWAA cluster and a huge volume of data in our Redshift data warehouse. We currently process the data directly in Redshift (with SQL), but given the amount of data, this puts a lot of pressure on the data warehouse, which is becoming less and less resilient.
A potential solution we found would be to decouple the data storage (Redshift) from the data processing (Spark). First of all, what do you think about this solution?
To do this, we would like to use Airflow MWAA and SparkSQL to:
Transfer data from Redshift to Spark
Process the SQL scripts that were previously done in Redshift
Transfer the newly created table from Spark to Redshift
Is it a use case that someone here has already put in production?
What would, in your opinion, be the best way to interact with the Spark cluster: EmrAddStepsOperator or PythonOperator + PySpark?
You can use one of these two connectors:
spark-redshift connector: open source connector developed and maintained by Databricks
EMR spark-redshift connector: developed by AWS and based on the first one, but with some improvements (github).
To load data from Redshift into Spark, you can read a table and process it in Spark:
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("dbtable", "my_table") \
.option("tempdir", "s3a://path/for/temp/data") \
.load()
Or let Redshift do part of the processing by reading from a query result (you can filter, join, or aggregate your data in Redshift before loading it into Spark):
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("query", "select x, count(*) from my_table group by x") \
.option("tempdir", "s3a://path/for/temp/data") \
.load()
You can do what you want with the loaded dataframe, and you can store the result in another data store if needed. You can use the same connector to write the result (or any other dataframe) to Redshift:
df.write \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("dbtable", "my_table_copy") \
.option("tempdir", "s3a://path/for/temp/data") \
.mode("error") \
.save()
P.S.: the connector is fully supported by Spark SQL, so you can add the dependencies to your EMR cluster, then use the SparkSqlOperator to extract, transform, and reload your Redshift tables (SQL syntax example), or the SparkSubmitOperator if you prefer Python/Scala/Java jobs.
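If you go the EmrAddStepsOperator route instead, the step it submits is a plain dictionary in the EMR step format, which runs `spark-submit` through `command-runner.jar`. A minimal sketch of building such a step follows; the function name `spark_submit_step`, the S3 paths, and the job arguments are hypothetical, and the connector jar location depends on your cluster setup.

```python
def spark_submit_step(name, script_s3_path, jars=(), script_args=()):
    """Build one EMR step definition that runs a PySpark script via spark-submit."""
    args = ["spark-submit", "--deploy-mode", "cluster"]
    if jars:
        # Extra jars (for example the Redshift connector) are passed comma-separated.
        args += ["--jars", ",".join(jars)]
    args.append(script_s3_path)
    args += list(script_args)
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


# Hypothetical usage: this list is what you would hand to the operator's `steps` argument.
steps = [
    spark_submit_step(
        "redshift_etl",
        "s3://my-bucket/jobs/etl.py",
        jars=["s3://my-bucket/jars/connector.jar"],
        script_args=["--date", "2021-01-01"],
    )
]
```

The script itself would contain the read, transform, and write calls shown above.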

How to distribute JDBC jar on Cloudera cluster?

I've just installed a new Spark 2.4 from CSD on my CDH cluster (28 nodes) and am trying to install JDBC driver in order to read data from a database from within Jupyter notebook.
I downloaded and copied it on one node to the /jars folder, however it seems that I have to do the same on each and every host (!). Otherwise I'm getting the following error from one of the workers:
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
Is there any easy way (without writing bash scripts) to distribute the jar files with packages on the whole cluster? I wish Spark could distribute it itself (or maybe it does and I don't know how to do it).
Spark has a JDBC format reader you can use.
Launch a Scala shell to confirm that your MS SQL Server driver is on your classpath.
Example:
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
If the driver class isn't found, make sure you place the jar on an edge node and include it in your classpath where you initialize your session.
Example (shown with a PostgreSQL driver jar; substitute your SQL Server driver jar):
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Connect to your MS SQL Server via Spark JDBC.
Example in PySpark (the URLs below are PostgreSQL placeholders; use your SQL Server JDBC URL):
# option1
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.load()
# option2
jdbcDF2 = spark.read \
.jdbc("jdbc:postgresql:dbserver", "schema.tablename",
properties={"user": "username", "password": "password"})
Specifics and additional ways to build connection strings can be found here:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
You mentioned Jupyter. If you still cannot get the above to work, try setting some environment variables as described in this post (I cannot confirm that this works, though):
https://medium.com/@thucnc/pyspark-in-jupyter-notebook-working-with-dataframe-jdbc-data-sources-6f3d39300bf6
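For the notebook case, one common approach is to set `PYSPARK_SUBMIT_ARGS` before the first Spark session is created in the kernel. The sketch below is untested on CDH; the helper names and the jar path in the usage line are hypothetical. It relies on `--jars` shipping the listed files to the executors, which is what avoids copying the jar to every host by hand.

```python
import os


def pyspark_submit_args(jar_paths):
    """Build a PYSPARK_SUBMIT_ARGS value that ships the given jars with the job."""
    jars = ",".join(jar_paths)        # --jars takes a comma-separated list
    driver_cp = ":".join(jar_paths)   # classpath entries are colon-separated on Linux
    # The trailing "pyspark-shell" token is required when PySpark is started from Python.
    return "--jars {0} --driver-class-path {1} pyspark-shell".format(jars, driver_cp)


def configure_notebook(jar_paths):
    """Call this before creating the first SparkSession or SparkContext in the kernel."""
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args(jar_paths)
```

Usage would look like `configure_notebook(["/opt/jars/mssql-jdbc.jar"])` in the first notebook cell. If your kernel already starts with a preconfigured Spark session, the variable is read too late and this has no effect.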
At the end of the day, all you really need is to place the driver jar on an edge node (the client where you launch Spark) and add it to your classpath. Then make the connection and partition the read to scale performance: by default, a JDBC read from an RDBMS runs on a single thread and therefore produces one partition.
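To get more than one partition, the JDBC reader accepts `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions`. A minimal sketch of wiring them up is below; the helper names are hypothetical, and the column and bounds must come from your own table.

```python
def partitioned_jdbc_options(url, table, user, password,
                             partition_column, lower_bound, upper_bound,
                             num_partitions):
    """Build reader options that split a JDBC read into parallel range queries."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not exceed upper_bound")
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        # Should be a numeric (or, in Spark 2.4+, date/timestamp) column of the table.
        "partitionColumn": partition_column,
        # The bounds only set the partition stride; rows outside them are still read.
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


def read_partitioned(spark, options):
    """Load the table using up to numPartitions concurrent JDBC connections."""
    return spark.read.format("jdbc").options(**options).load()
```

Picking a roughly evenly distributed column (such as an integer primary key) keeps the partitions balanced.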

Unable to insert into Hive partitioned table from Spark

I created an external partitioned table in Hive.
The logs show numInputRows, which means the query is working and sending data. But when I connect to Hive using beeline and run select * or count(*), the result is always empty.
def hiveOrcSetWriter[T](event_stream: Dataset[T])( implicit spark: SparkSession): DataStreamWriter[T] = {
import spark.implicits._
val hiveOrcSetWriter: DataStreamWriter[T] = event_stream
.writeStream
.partitionBy("year","month","day")
.format("orc")
.outputMode("append")
.option("compression", "zlib")
.option("path", _table_loc)
.option("checkpointLocation", _table_checkpoint)
hiveOrcSetWriter
}
What can be the issue? I'm unable to understand.
msck repair table tablename
It checks the table's location and adds any new partitions that exist there.
Add this step to your Spark process so that the data can be queried from Hive.
Your streaming job is writing new partitions to the table location, but the Hive metastore is not aware of them.
When you run a select query on the table, Hive checks the metastore to get the list of table partitions. Since the information in the metastore is outdated, the data doesn't show up in the result.
You need to run -
ALTER TABLE <TABLE_NAME> RECOVER PARTITIONS
command from Hive/Spark to update the metastore with new partition info.
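If a repair over the whole table location becomes slow, an alternative is to register only the partitions the job has just written. The sketch below builds the statements from Hive-style partition paths such as `year=2021/month=03/day=07`; the function names and the table name in the example are hypothetical, and it assumes the partitions sit in the default layout under the table location.

```python
def partition_spec(relative_path):
    """Parse 'year=2021/month=03/day=07' into an ordered list of (column, value)."""
    spec = []
    for segment in relative_path.strip("/").split("/"):
        column, sep, value = segment.partition("=")
        if not sep or not column or not value:
            raise ValueError("not a Hive-style partition path: %r" % relative_path)
        spec.append((column, value))
    return spec


def add_partition_sql(table, relative_path):
    """Statement that registers one partition directory in the metastore."""
    cols = ", ".join("{0}='{1}'".format(c, v) for c, v in partition_spec(relative_path))
    return "ALTER TABLE {0} ADD IF NOT EXISTS PARTITION ({1})".format(table, cols)


def repair_sql(table):
    """Statement that rescans the table location for any unregistered partitions."""
    return "MSCK REPAIR TABLE {0}".format(table)
```

Either statement can be passed to `spark.sql(...)` or run from beeline, for example `spark.sql(add_partition_sql("db.events", "year=2021/month=03/day=07"))`.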

What is the efficient way to save a dataframe into a Hive table?

We are migrating from Greenplum to HDFS.
Data comes from source tables into Greenplum through a large ETL process, and from Greenplum we simply dump the data into HDFS using Spark.
So I am trying to read a GP table and load it into Hive tables on HDFS using Spark.
I have a dataframe read from a GP table as below:
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName)
.option("password", devPassword)
.option("numPartitions",10)
.load()
There are different options to save a dataframe into a Hive table.
First Method:
yearDF.write.mode("overwrite").partitionBy("source_system_name","period_year","period_num").saveAsTable("schemaName.tableName")
Second Method:
yearDF.createOrReplaceTempView("yearData")
spark.sql("insert into schema.table partition(source_system_name, period_year, period_num) select * from yearData")
What are the pros and cons of the two approaches above?
We have huge tables in production which usually take a lot of time to load into Hive. Could anyone let me know which is the more efficient and recommended way to save data from a dataframe to a Hive table?
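One detail that matters for the second method: a dynamic-partition insert matches columns by position, and the partition columns have to come last in the select. A minimal PySpark sketch of enforcing that order before the insert is below. The helper names are hypothetical, and it assumes the remaining data columns are already in the table's column order and that dynamic partitioning is enabled in nonstrict mode.

```python
def partitions_last(columns, partition_columns):
    """Reorder column names so that the partition columns come last."""
    missing = [p for p in partition_columns if p not in columns]
    if missing:
        raise ValueError("partition columns not in DataFrame: %s" % missing)
    data_cols = [c for c in columns if c not in partition_columns]
    return data_cols + list(partition_columns)


def insert_into_partitioned(df, table, partition_columns):
    """Append df to an existing partitioned Hive table, matching columns by position."""
    ordered = df.select(partitions_last(df.columns, partition_columns))
    ordered.write.mode("append").insertInto(table)
```

With the dataframe above, the call would be `insert_into_partitioned(yearDF, "schema.table", ["source_system_name", "period_year", "period_num"])`.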

How do I save spark.writeStream results in hive?

I am using spark.readStream to read data from Kafka and running an explode on the resulting dataframe.
I am trying to save the result of the explode in a Hive table and I am not able to find any solution for that.
I tried the following method but it doesn't work (it runs but I don't see any new partitions created)
val query = tradelines.writeStream.outputMode("append")
.format("memory")
.option("truncate", "false")
.option("checkpointLocation", checkpointLocation)
.queryName("tl")
.start()
sc.sql("set hive.exec.dynamic.partition.mode=nonstrict;")
sc.sql("INSERT INTO TABLE default.tradelines PARTITION (dt) SELECT * FROM tl")
Check HDFS for the dt partitions on the file system.
You need to run MSCK REPAIR TABLE on the Hive table to see new partitions.
If you aren't doing anything special with Spark, then it's worth pointing out that Kafka Connect HDFS is capable of registering Hive partitions directly from Kafka.
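If you do stay with Spark and are on Spark 2.4 or later, another option is `foreachBatch`, which hands each micro-batch to a function as a regular dataframe that can then be inserted into the Hive table. This is a sketch under those assumptions, not a tested solution for the setup in the question; the function names are hypothetical, and it expects the table to exist with `dt` as the last column and dynamic partitioning set to nonstrict.

```python
def make_batch_writer(table):
    """Return a foreachBatch callback that appends each micro-batch to a Hive table."""
    def write_batch(batch_df, batch_id):
        # insertInto matches columns by position, so the partition column goes last.
        batch_df.write.mode("append").insertInto(table)
    return write_batch


def start_stream(stream_df, table, checkpoint_location):
    """Start a streaming query that writes every micro-batch into the given table."""
    return (stream_df.writeStream
            .foreachBatch(make_batch_writer(table))
            .option("checkpointLocation", checkpoint_location)
            .start())
```

For the question's example, the call would be `start_stream(tradelines, "default.tradelines", checkpointLocation)`.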

Resources