Dataset read from Hive is inconsistent with JDBC - apache-spark

I am trying to query data from Hive in Spark. According to the Spark SQL documentation, there are two ways to do this:
The first way is to initialize the session with enableHiveSupport:
SparkSession session = SparkSession.builder().enableHiveSupport().getOrCreate();
session.sql("select dw_date from tfdw.dwd_dim_date limit 10").show();
This dataset shows the correct result.
The second way is through JDBC:
Dataset<Row> ds = session.read()
.format("jdbc")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.option("url", "jdbc:hive2://iZ11syxr6afZ:21050/;auth=noSasl")
.option("dbtable", "tfdw.dwd_dim_date")
.load();
ds.select("dw_date").limit(10).show();
But this dataset only shows the column name in the result rather than the data of the column.
The two results should be consistent, I think. Is there anything obvious I missed? Many thanks!

Related

Optimizing Spark JDBC connection read time by adding query parameter

I am connecting SQL Server to Spark using the following package: https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16. At the moment I am reading the entire table, which is bad for performance. To optimize performance I want to pass a query to the spark.read config below, for example select * from my_table where record_time > timestamp. Is this possible? How would I do this?
DF = spark.read \
.format("com.microsoft.sqlserver.jdbc.spark") \
.option("url", jdbcUrl) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password).load()
You can just filter the data frame that you are creating. Spark supports predicate pushdown, which means the filter will most likely run directly in the database. You can verify that this works by looking at the Spark UI / explain plan.
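For example (a minimal sketch in Scala, assuming a timestamp column named record_time and reusing the jdbcUrl/table/user/password values from the question; the question uses PySpark, but the same pattern applies):
import org.apache.spark.sql.functions.col

// Read the table as before, then filter; a simple comparison like this is
// eligible for predicate pushdown, so it can run inside SQL Server.
val df = spark.read
  .format("com.microsoft.sqlserver.jdbc.spark")
  .option("url", jdbcUrl)
  .option("dbtable", tableName)
  .option("user", username)
  .option("password", password)
  .load()
  .filter(col("record_time") > "2023-01-01 00:00:00") // example cutoff value

// Look for a "PushedFilters" entry in the physical plan to confirm the pushdown.
df.explain()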

How to design spark application find lower bound and upper bound at runtime for jdbc read

For an ETL process I'm creating a Spark application, and I'm new to this. I have created the application with the lowerBound, upperBound, and partitionColumn options. As I understand it, this makes Spark open parallel JDBC connections to read the data.
I'm wondering how to determine the lower and upper bound at runtime. I tried the case below, but finding the bounds itself takes a long time because the table has billions of records.
val bounds = spark.read.format("jdbc")
.option("url", url)
.option("driver", driver)
.option("user", user)
.option("password", password)
.option("dbtable", "(select min(partitionColumn) as minimum, max(partitionColum) as maximum from tablename where convert(date,columnname) in (date) ) as tablename")
.load()
Is there any alternative way to find these values?
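For reference, once the minimum and maximum have been collected they would typically feed straight into the partitioned read; a minimal Scala sketch using the option names from the question (the numeric types and the numPartitions value are assumptions):
// Pull the single-row result of the min/max query computed above.
val row = bounds.collect()(0)
val lower = row.getAs[Long]("minimum") // adjust the type to the partition column
val upper = row.getAs[Long]("maximum")

// Use the bounds for the parallel, partitioned JDBC read.
val df = spark.read.format("jdbc")
  .option("url", url)
  .option("driver", driver)
  .option("user", user)
  .option("password", password)
  .option("dbtable", "tablename")
  .option("partitionColumn", "partitionColumn")
  .option("lowerBound", lower)
  .option("upperBound", upper)
  .option("numPartitions", "10")
  .load()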

save spark dataframe to multiple targets parallel

I need to write my final dataframe to HDFS and to an Oracle database.
Currently, once saving to HDFS is done, it starts writing to the RDBMS. Is there any way to use Java threads to save the same dataframe to HDFS and the RDBMS in parallel?
finalDF.write().option("numPartitions", "10").jdbc(url, exatable, jdbcProp);
finalDF.write().mode("OverWrite").insertInto(hiveDBWithTable);
Thanks.
Cache finalDF before writing to HDFS and the RDBMS, then make sure enough executors are available to write simultaneously. If the number of partitions in finalDF is p and the number of cores per executor is c, you need at least ceil(p/c) + ceil(10/c) executors (the 10 being the numPartitions of the JDBC write). For example, with p = 40 and c = 5, that is ceil(40/5) + ceil(10/5) = 8 + 2 = 10 executors.
df.show and df.write are actions, and actions run sequentially in Spark. So the answer is no, it is not possible in standard Spark unless threads are used.
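As a rough illustration of the "threads" approach, the two writes from the question could be launched concurrently with Scala Futures (a sketch only; finalDF, url, exatable, jdbcProp and hiveDBWithTable come from the question):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Cache and materialize once so both writes reuse the same computed data.
finalDF.cache()
finalDF.count()

// Launch both writes on separate threads; Spark schedules the jobs concurrently.
val jdbcWrite = Future {
  finalDF.write.option("numPartitions", "10").jdbc(url, exatable, jdbcProp)
}
val hiveWrite = Future {
  finalDF.write.mode("overwrite").insertInto(hiveDBWithTable)
}

// Block until both writes have finished.
Await.result(Future.sequence(Seq(jdbcWrite, hiveWrite)), Duration.Inf)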
We can use the code below to append dataframe values to a table:
DF.write
.mode("append")
.format("jdbc")
.option("driver", driverProp)
.option("url", urlDbRawdata)
.option("dbtable", TABLE_NAME)
.option("user", userName)
.option("password", password)
.option("numPartitions", maxNumberDBPartitions)
.option("batchsize",batchSize)
.save()

What is the efficient way to save a dataframe into a Hive table?

We are migrating from Greenplum to HDFS.
Data comes from the source tables into Greenplum through a huge ETL process, and from Greenplum we are just dumping the data into HDFS using Spark.
So I am trying to read a GP table and load it into Hive tables on HDFS using Spark.
I have a dataframe read from a GP table as below:
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName)
.option("password", devPassword)
.option("numPartitions",10)
.load()
There are different options to save a dataframe into a Hive table.
First Method:
yearDF.write.mode("overwrite").partitionBy("source_system_name","period_year","period_num").saveAsTable("schemaName.tableName")
Second Method:
yearDF.createOrReplaceTempView("yearData")
spark.sql("insert into schema.table partition(source_system_name, period_year, period_num) select * from yearData")
What are the pros and cons of the above-mentioned approaches?
We have huge tables in production which usually take a lot of time to load into Hive. Could anyone let me know which is the efficient and recommended way to save data from a dataframe to a Hive table?

Data validation for Oracle to Cassandra Data Migration

We are migrating data from Oracle to Cassandra as part of a daily ETL process. I would like to perform data validation between the two databases once the Spark jobs are complete, to ensure that both databases are in sync. We are using DSE 5.1. Could you please provide your valuable inputs on how to ensure the data has been properly migrated?
I assume you have DSE Max with Spark support.
SparkSQL should suit this best.
First you connect to Oracle with JDBC:
https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#jdbc-to-other-databases
I have no Oracle DB, so the following code is not tested; check the JDBC URL and drivers before running it:
dse spark --driver-class-path ojdbc7.jar --jars ojdbc7.jar
scala> val oData = spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:hr/hr#//localhost:1521/pdborcl")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.load()
C* data is already mapped to a SparkSQL table, so:
scala> val cData = spark.sql("select * from keyspace.table")
You will need to check the schemas of both and the data conversion details to compare the tables properly. A simple integration check that all data from Oracle exists in C*:
scala> cData.except(oData).count
0: Long
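To make the check symmetric, the reverse direction can be compared as well, ideally after selecting and casting the columns so both sides share the same schema (a sketch; id and updated_at are hypothetical column names):
// Align both sides on the same columns and types before comparing
// (id and updated_at are hypothetical; use the real column list).
val oNorm = oData.selectExpr("cast(id as string) as id", "cast(updated_at as timestamp) as updated_at")
val cNorm = cData.selectExpr("cast(id as string) as id", "cast(updated_at as timestamp) as updated_at")

// Rows present in Oracle but missing in Cassandra, and vice versa;
// both counts should be 0 if the tables are in sync.
val missingInCassandra = oNorm.except(cNorm).count
val extraInCassandra = cNorm.except(oNorm).count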
