Table not found while creating dataframe from Hive Table - apache-spark

Spark Command:
spark-submit \
--class com.dev.SparkHiveToHdfs \
--jars /home/dev/dbJars/datanucleus-api-jdo-3.2.6.jar,/home/dev/dbJars/datanucleus-rdbms-3.2.9.jar,/home/dev/dbJars/datanucleus-core-3.2.10.jar \
--master yarn-cluster \
--name DCA_SPARK_JOB \
/home/dev/dbJars/data-connector-spark.jar dev.emp
data-connector-spark.jar contains the code below:
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class SparkHiveToHdfs {
    public static void main(String[] args) throws Exception {
        String hiveTableNameWithSchema = args[0];
        SparkConf conf = new SparkConf(true).setMaster("yarn-cluster").setAppName("DCA_HIVE_HDFS");
        SparkContext sc = new SparkContext(conf);
        HiveContext hc = new HiveContext(sc);
        DataFrame df = hc.sql("select * from " + hiveTableNameWithSchema);
        df.printSchema();
    }
}
Properties in hive-site.xml in $SPARK_HOME/conf:
<property>
<name>hive.metastore.client.connect.retry.delay</name>
<value>5</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>1800</value>
</property>
<property>
<name>hive.metastore.connect.retries</name>
<value>24</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://xxxx:9083</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.transport.mode</name>
<value>binary</value>
</property>
Error log:
ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Table not found: `dev`.`emp`; line 1 pos 18
org.apache.spark.sql.AnalysisException: Table not found: `dev`.`emp`; line 1 pos 18
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:54)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at com.impetus.idw.data.connector.SparkHiveToHdfs.main(SparkHiveToHdfs.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)

Please try passing hive-site.xml via --files in the spark-submit command:
spark-submit \
--class com.dev.SparkHiveToHdfs \
--jars /home/dev/dbJars/datanucleus-api-jdo-3.2.6.jar,/home/dev/dbJars/datanucleus-rdbms-3.2.9.jar,/home/dev/dbJars/datanucleus-core-3.2.10.jar \
--master yarn-cluster \
--name DCA_SPARK_JOB \
--files hive-site.xml \
/home/dev/dbJars/data-connector-spark.jar dev.emp
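If the table is still not found after that, it can help to confirm what the driver-side metastore client actually sees. Below is a minimal diagnostic sketch using the same Spark 1.x APIs as the question; the class name is just for illustration:
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class ListHiveTables {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf(true).setAppName("LIST_HIVE_TABLES");
        SparkContext sc = new SparkContext(conf);
        HiveContext hc = new HiveContext(sc);
        // If hive-site.xml is being picked up, these should list the dev schema and the emp table.
        hc.sql("SHOW DATABASES").show();
        hc.sql("SHOW TABLES IN dev").show();
    }
}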

Related

Apache Spark integration with Kafka

I am following a course on Udemy about Kafka and Spark, and I'm learning Apache Spark integration with Kafka.
Below is the Apache Spark code:
SparkSession session = SparkSession.builder().appName("KafkaConsumer").master("local[*]").getOrCreate();
session.sparkContext().setLogLevel("ERROR");
Dataset<Row> df = session
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "second_topic")
    .load();
df.show();
And below is the content of the pom.xml file
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example.kafka.spark</groupId>
<artifactId>Kafka-Spark-Integration-Code</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<!-- <dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.0.0</version>
</dependency> -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.12</artifactId>
<version>3.0.0</version>
</dependency>
</dependencies>
</project>
However, when I run the code I get the error below, which I am not able to resolve. I am using OpenJDK 8 and Spark 3 on MX Linux. Thanks.
exception in thread "main" java.lang.ClassFormatError: Invalid code attribute name index 24977 in class file org/apache/spark/sql/execution/columnar/InMemoryRelation
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:83)
at org.apache.spark.sql.SparkSession.$anonfun$sharedState$1(SparkSession.scala:132)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:132)
at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:131)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:323)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1107)
at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:157)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:155)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:152)
at org.apache.spark.sql.streaming.DataStreamReader.<init>(DataStreamReader.scala:519)
at org.apache.spark.sql.SparkSession.readStream(SparkSession.scala:657)
at example.code.spark.kafka.KafkaSparkConsumer.main(KafkaSparkConsumer.java:19)
You could follow the examples given in the Structured Streaming + Kafka Integration Guide:
SparkSession session = SparkSession.builder()
.appName("KafkaConsumer")
.master("local[*]")
.getOrCreate();
Dataset<Row> df = session
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "second_topic")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
to consume the data. The Structured Streaming Programming Guide shows you how to print the data to the console:
StreamingQuery query = df
.writeStream()
.format("console")
.outputMode("append")
.option("checkpointLocation", "path/to/checkpoint/dir")
.start();
query.awaitTermination();
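Putting the two snippets together, a minimal self-contained version of the consumer (broker address, topic, and checkpoint path taken from the snippets above; the class name is just for the sketch) would look roughly like this:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaConsumerSketch {
    public static void main(String[] args) throws Exception {
        SparkSession session = SparkSession.builder()
            .appName("KafkaConsumer")
            .master("local[*]")
            .getOrCreate();
        session.sparkContext().setLogLevel("ERROR");

        Dataset<Row> df = session
            .readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "second_topic")
            .load()
            .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        // Note: df.show() cannot be called on a streaming Dataset; the data has to go to a sink.
        StreamingQuery query = df
            .writeStream()
            .format("console")
            .outputMode("append")
            .option("checkpointLocation", "path/to/checkpoint/dir")
            .start();
        query.awaitTermination();
    }
}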

Hive is not accessible via Spark In Kerberos Environment : Client cannot authenticate via:[TOKEN, KERBEROS]

Hi all, I'm running Spark (2.4.4) in a Kerberos environment, and I've written code to query a Hive table via Spark. I am also doing kinit in the spark-submit command, but I'm still facing:
java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS];
Here is my code:
@transient lazy val spark: SparkSession = getSparkSession()

def getSparkSession(): SparkSession = {
  log.info("Creating spark session")
  var sparkBuilder: SparkSession.Builder = SparkSession.builder().
    master("local[*]").
    appName("Query Hive Via Spark").
    config("hive.exec.scratchdir", "/tmp/hive").enableHiveSupport().
    config("hive.exec.dynamic.partition", "true").
    config("hive.exec.dynamic.partition.mode", "nonstrict").
    config("hive.exec.max.dynamic.partitions", "1000")
  @transient lazy val spark: SparkSession = sparkBuilder.getOrCreate()
  registerUdfs(spark)
  spark.sparkContext.setLogLevel(logLevel)
  spark
}
Code to access Hive tables via Spark SQL:
val resultDF= spark.sql(s"SELECT count(*) AS cnt FROM brl_in_cash.cash_in_incoming_data WHERE insert_date='20200821'")
resultDF.printSchema()
resultDF.show(false)
I am executing a shell script for spark-submit where I am doing kinit and also passing --principal $KERBEROS_PRINCIPAL --keytab $KERBEROS_KEYTAB.
spark-submit command:
spark-submit --master yarn --deploy-mode cluster \
--verbose \
--name ${appName} \
--principal $KERBEROS_PRINCIPAL \
--keytab $KERBEROS_KEYTAB \
--driver-memory 4g \
--executor-memory 4g \
--executor-cores 2 \
--files ${hiveSite.xml} \
--conf spark.hadoop.yarn.timeline-service.enabled=false \
--conf spark.hadoop.yarn.client.failover-proxy-provider=org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider \
--conf spark.security.credentials.EsServiceCredentialProvider.enabled=false \
--class com.dpk.hive.HiveViaSpark "${jarPath}"
Error log:
20/08/26 13:34:17 INFO TezClient: Failed to retrieve AM Status via proxy
com.google.protobuf.ServiceException: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "dfghcv012.global.xyz.com/10.7.1.52"; destination host is: "dfghcv013.global.xyz.com":43890;
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:243)
at com.sun.proxy.$Proxy36.getAMStatus(Unknown Source)
at org.apache.tez.client.TezClient.getAppMasterStatus(TezClient.java:618)
at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:697)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:205)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:116)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:532)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:271)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:384)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:736)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:747)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:681)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:713)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:706)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:706)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:652)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
at com.sc.sdm.rt.oa.recon.TestConnection$.main(TestConnection.scala:34)
at com.sc.sdm.rt.oa.recon.TestConnection.main(TestConnection.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "dfghcv012.global.xyz.com/10.7.1.52"; destination host is: "dfghcv013.global.xyz.com":43890;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
... 91 more
Any help is appreciated!!
I was able to resolve this.
I removed
--files hive-site.xml
and added the Spark configuration
--conf spark.security.credentials.hadoopfs.enabled=true
The above changes worked for me.
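For reference, with that change the spark-submit command from the question becomes (everything else unchanged):
spark-submit --master yarn --deploy-mode cluster \
--verbose \
--name ${appName} \
--principal $KERBEROS_PRINCIPAL \
--keytab $KERBEROS_KEYTAB \
--driver-memory 4g \
--executor-memory 4g \
--executor-cores 2 \
--conf spark.security.credentials.hadoopfs.enabled=true \
--conf spark.hadoop.yarn.timeline-service.enabled=false \
--conf spark.hadoop.yarn.client.failover-proxy-provider=org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider \
--conf spark.security.credentials.EsServiceCredentialProvider.enabled=false \
--class com.dpk.hive.HiveViaSpark "${jarPath}"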

Count operation not working on aggregated IgniteDataFrame

I am working with Apache Spark and Apache Ignite. I have a Spark dataset which I wrote to Ignite using the following code:
dataset.write()
.mode(SaveMode.Overwrite)
.format(FORMAT_IGNITE())
.option(OPTION_CONFIG_FILE(), "ignite-server-config.xml")
.option(OPTION_TABLE(), "CUSTOM_VALUES")
.option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "ID")
.save();
And I am reading it back to perform a group-by operation, which will be pushed down to Ignite.
Dataset igniteDataset = sparkSession.read()
.format(FORMAT_IGNITE())
.option(OPTION_CONFIG_FILE(), "ignite-server-config.xml")
.option(OPTION_TABLE(), "CUSTOM_VALUES")
.load();
RelationalGroupedDataset idGroupedData = igniteDataset.groupBy(customized_id);
Dataset<Row> result = idGroupedData.agg(count(id).as("count_id"),
count(fid).as("count_custom_field_id"),
count(type).as("count_customized_type"),
count(val).as("count_value"), count(customized_id).as("groupCount"));
Now, I want to get the number of rows returned by the groupBy action, so I am calling count() on the dataset as result.count();
When I do this, I get the following exception.
Caused by: org.h2.jdbc.JdbcSQLException: Syntax error in SQL statement "SELECT COUNT(1) AS COUNT FROM (SELECT FROM CUSTOM_VALUES GROUP[*] BY CUSTOMIZED_ID) TABLE1 "; expected "., (, USE, AS, RIGHT, LEFT, FULL, INNER, JOIN, CROSS, NATURAL, ,, SELECT"; SQL statement:
SELECT COUNT(1) AS count FROM (SELECT FROM CUSTOM_VALUES GROUP BY CUSTOMIZED_ID) table1 [42001-197]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:357)
at org.h2.message.DbException.getSyntaxError(DbException.java:217)
Other functions such as show() and collectAsList().size() work.
What am I missing here?
I tested your example against the latest community version 8.7.5 of GridGain, which is the open-source version of GridGain based on Ignite 2.7.0 sources with a subset of additional fixes (https://www.gridgain.com/resources/download).
Here is the code:
public class Main {
    public static void main(String[] args) {
        if (args.length < 1)
            throw new IllegalArgumentException("You should set the path to client configuration file.");
        String configPath = args[0];
        SparkSession session = SparkSession.builder()
            .enableHiveSupport()
            .getOrCreate();
        Dataset<Row> igniteDataset = session.read()
            .format(IgniteDataFrameSettings.FORMAT_IGNITE()) //Data source
            .option(IgniteDataFrameSettings.OPTION_TABLE(), "Person") //Table to read.
            .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), configPath) //Ignite config.
            .load();
        RelationalGroupedDataset idGroupedData = igniteDataset.groupBy("CITY_ID");
        Dataset<Row> result = idGroupedData.agg(count("id").as("count_id"),
            count("city_id").as("count_city_id"),
            count("name").as("count_name"),
            count("age").as("count_age"),
            count("company").as("count_company"));
        result.show();
        session.close();
    }
}
Here are the maven dependencies:
<dependencies>
<dependency>
<groupId>org.gridgain</groupId>
<artifactId>gridgain-core</artifactId>
<version>8.7.5</version>
</dependency>
<dependency>
<groupId>org.gridgain</groupId>
<artifactId>ignite-core</artifactId>
<version>8.7.5</version>
</dependency>
<dependency>
<groupId>org.gridgain</groupId>
<artifactId>ignite-spring</artifactId>
<version>8.7.5</version>
</dependency>
<dependency>
<groupId>org.gridgain</groupId>
<artifactId>ignite-indexing</artifactId>
<version>8.7.5</version>
</dependency>
<dependency>
<groupId>org.gridgain</groupId>
<artifactId>ignite-spark</artifactId>
<version>8.7.5</version>
</dependency>
</dependencies>
Here is the cache configuration:
<property name="cacheConfiguration">
<list>
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="Person"/>
<property name="cacheMode" value="PARTITIONED"/>
<property name="atomicityMode" value="ATOMIC"/>
<property name="sqlSchema" value="PUBLIC"/>
<property name="queryEntities">
<list>
<bean class="org.apache.ignite.cache.QueryEntity">
<property name="keyType" value="PersonKey"/>
<property name="valueType" value="PersonValue"/>
<property name="tableName" value="Person"/>
<property name="keyFields">
<list>
<value>id</value>
<value>city_id</value>
</list>
</property>
<property name="fields">
<map>
<entry key="id" value="java.lang.Integer"/>
<entry key="city_id" value="java.lang.Integer"/>
<entry key="name" value="java.lang.String"/>
<entry key="age" value="java.lang.Integer"/>
<entry key="company" value="java.lang.String"/>
</map>
</property>
<property name="aliases">
<map>
<entry key="id" value="id"/>
<entry key="city_id" value="city_id"/>
<entry key="name" value="name"/>
<entry key="age" value="age"/>
<entry key="company" value="company"/>
</map>
</property>
</bean>
</list>
</property>
</bean>
</list>
</property>
Using Spark 2.3.0, which is the only version supported by the ignite-spark dependency, I get the following result on my test data:
Data:
ID,CITY_ID,NAME,AGE,COMPANY,
4,1,Justin Bronte,23,bank,
3,1,Helen Richard,49,bank,
Result:
+-------+--------+-------------+----------+---------+-------------+
|CITY_ID|count_id|count_city_id|count_name|count_age|count_company|
+-------+--------+-------------+----------+---------+-------------+
| 1| 2| 2| 2| 2| 2|
+-------+--------+-------------+----------+---------+-------------+
Also, this code could be fully applied to Ignite 2.7.0.
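If you want to check the call that failed in the question against this setup, it can be added right after result.show(); I have not re-run it here, so treat it as a sketch rather than a verified result:
// The operation reported as failing in the question: count the grouped rows.
long groups = result.count();
System.out.println("Number of CITY_ID groups: " + groups);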

I save a DataFrame in Hbase and I get: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/client/TableDescriptor

I created a project using Apache Spark.
Versions:
scala 2.11.8
apache spark 2.3.0
apache hbase 1.2.0
hortonworks shc 1.1.0.3.1.2.0-4 (the hortonworks connector)
I need to save a simple DataFrame in an HBase table. For this I started HBase 1.2.0 in a Docker container (https://github.com/zhao-y/docker-hbase-pseudo) and created the following table:
$ hbase(main):002:0> create "table1", "cf1", "cf2", "cf3", "cf4", "cf5", "cf6", "cf7", "cf8"
$ 0 row (s) in 1.4440 seconds
To save a DataFrame in HBase I use https://github.com/hortonworks-spark/shc:
I declared the catalog exactly as in the example
I created a catalog-based DataFrame
I tried to save the DataFrame in HBase as in the example:
dataFrame.write.options(
Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()
Code:
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.junit.Test
class SparkTest {
case class HBaseRecord(
col0: String,
col1: Boolean,
col2: Double,
col3: Float,
col4: Int,
col5: Long,
col6: Short,
col7: String,
col8: Byte)
object HBaseRecord {
def apply(i: Int, t: String): HBaseRecord = {
val s = s"""row${"%03d".format(i)}"""
HBaseRecord(s,
i % 2 == 0,
i.toDouble,
i.toFloat,
i,
i.toLong,
i.toShort,
s"String$i: $t",
i.toByte)
}
}
@Test
def bar(): Unit = {
val sparkSession = SparkSession.builder
.appName("SparkTest")
.master("local[*]")
.config("spark.testing.memory", 2147480000)
.getOrCreate()
val data = (0 to 255).map { i => HBaseRecord(i, "extra") }
val dataFrame = sparkSession.createDataFrame(data)
dataFrame.show
dataFrame.write.options(
Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()
}
}
Error:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/client/TableDescriptor
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:63)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at SparkTest.bar(SparkTest.scala:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.internal.runners.TestMethod.invoke(TestMethod.java:59)
at org.junit.internal.runners.MethodRoadie.runTestMethod(MethodRoadie.java:98)
at org.junit.internal.runners.MethodRoadie$2.run(MethodRoadie.java:79)
at org.junit.internal.runners.MethodRoadie.runBeforesThenTestThenAfters(MethodRoadie.java:87)
at org.junit.internal.runners.MethodRoadie.runTest(MethodRoadie.java:77)
at org.junit.internal.runners.MethodRoadie.run(MethodRoadie.java:42)
at org.junit.internal.runners.JUnit4ClassRunner.invokeTestMethod(JUnit4ClassRunner.java:88)
at org.junit.internal.runners.JUnit4ClassRunner.runMethods(JUnit4ClassRunner.java:51)
at org.junit.internal.runners.JUnit4ClassRunner$1.run(JUnit4ClassRunner.java:44)
at org.junit.internal.runners.ClassRoadie.runUnprotected(ClassRoadie.java:27)
at org.junit.internal.runners.ClassRoadie.runProtected(ClassRoadie.java:37)
at org.junit.internal.runners.JUnit4ClassRunner.run(JUnit4ClassRunner.java:42)
at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.TableDescriptor
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 41 more
val sparkSession = SparkSession.builder
.appName("SparkTest")
.master("local[*]")
.config("spark.testing.memory", 2147480000)
.getOrCreate()
means you are running it in local mode and your HBase client jar is missing from the classpath. (If it is already on the classpath, you can change the Maven scope to runtime rather than compile.)
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-client -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>2.1.4</version>
</dependency>
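A quick, throwaway way to check whether the class from the stack trace is actually visible at runtime (the class name below is just for the check, not part of the connector API):
public class HBaseClientCheck {
    public static void main(String[] args) throws ClassNotFoundException {
        // TableDescriptor only exists in the HBase 2.x client; this throws
        // ClassNotFoundException under the same condition that surfaces as
        // NoClassDefFoundError in the test.
        Class.forName("org.apache.hadoop.hbase.client.TableDescriptor");
        System.out.println("hbase-client with TableDescriptor is on the classpath");
    }
}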
If you are using IntelliJ to run locally, you can see that the HBase client jar is present in the .iml file.
The normal way of running in cluster or client mode (not local) would be to add the HBase classpath:
export HBASE_CLASSPATH=$HBASE_CLASSPATH:`hbase classpath`
which will add all the HBase jars to the classpath.
To see/print all the jars in the classpath, the following is helpful for understanding which jars are on your classpath:
def urlsinclasspath(cl: ClassLoader): Array[java.net.URL] = cl match {
case null => Array()
case u: java.net.URLClassLoader => u.getURLs() ++ urlsinclasspath(cl.getParent)
case _ => urlsinclasspath(cl.getParent)
}
The caller would be:
urlsinclasspath(getClass.getClassLoader).foreach(println)

unable to communicate with hbase through spark

I have a project where I need to configure Spark and HBase in a local environment. I downloaded Spark 2.2.1, Hadoop 2.7, and HBase 1.1.8 and configured them accordingly on a standalone single-node Ubuntu 14.04 OS.
I am able to pull and push data between Spark and HDFS, but not with HBase.
core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml:
[root@localhost conf]# cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>dfs.namenode.servicerpc-bind-host</name>
<value>0.0.0.0</value>
</property>
</configuration>
spark-env.sh
[root@localhost conf]# cat spark-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=1
export SPARK_MASTER_IP=127.0.0.1
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_DIR=/app/spark/tmp
# Options read in YARN client mode
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_EXECUTOR_INSTANCES=1
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=1G
export SPARK_DRIVER_MEMORY=1G
export SPARK_YARN_APP_NAME=Spark
export SPARK_CLASSPATH=/opt/hbase/lib/*
hbase-site.xml:
[root@localhost conf]# cat hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>hdfs://localhost:9000/zookeeper</value>
</property>
<property>
<name>hbase.master.dns.interface</name>
<value>default</value>
</property>
<property>
<name>hbase.master.ipc.address</name>
<value>localhost</value>
</property>
<property>
<name>hbase.regionserver.dns.interface</name>
<value>default</value>
</property>
<property>
<name>hbase.regionserver.ipc.address</name>
<value>HOSTNAME</value>
</property>
<property>
<name>hbase.zookeeper.dns.interface</name>
<value>default</value>
</property>
</configuration>
spark-defaults.conf:
[root@localhost conf]# cat spark-defaults.conf
spark.master             spark://127.0.0.1:7077
spark.yarn.dist.files    /opt/spark/conf/hbase-site.xml
Errors:
Even though the HBase lib jars are exported in spark-env.sh, it is unable to import the HBase libraries (e.g. HBaseConfiguration):
scala> import org.apache.hadoop.hbase.HBaseConfiguration
<console>:23: error: object hbase is not a member of package org.apache.hadoop
import org.apache.hadoop.hbase.HBaseConfiguration
^
If I load these jars through --driver-class-path:
spark-shell --master local --driver-class-path=/opt/hbase/lib/*
scala> conf.set("hbase.zookeeper.quorum","localhost")
scala> conf.set("hbase.zookeeper.property.clientPort", "2181")
scala> val connection: Connection = ConnectionFactory.createConnection(conf)
connection: org.apache.hadoop.hbase.client.Connection = hconnection-0x2a4cb8ae
scala> val tableName = connection.getTable(TableName.valueOf("employee"))
tableName: org.apache.hadoop.hbase.client.Table = employee;hconnection-0x2a4cb8ae
scala> val insertData = new Put(Bytes.toBytes("1"))
insertData: org.apache.hadoop.hbase.client.Put = {"totalColumns":0,"row":"1","families":{}}
scala>
| insertData.addColumn(Bytes.toBytes("emp personal data "), Bytes.toBytes("Name"), Bytes.toBytes("Jeevan"))
res3: org.apache.hadoop.hbase.client.Put = {"totalColumns":1,"row":"1","families":{"emp personal data ":[{"qualifier":"Name","v
n":6,"tag":[],"timestamp":9223372036854775807}]}}
scala> insertData.addColumn(Bytes.toBytes("emp personal data "), Bytes.toBytes("City"), Bytes.toBytes("San Jose"))
res4: org.apache.hadoop.hbase.client.Put = {"totalColumns":2,"row":"1","families":{"emp personal data ":[{"qualifier":"Name","v
n":6,"tag":[],"timestamp":9223372036854775807},{"qualifier":"City","vlen":8,"tag":[],"timestamp":9223372036854775807}]}}
scala> insertData.addColumn(Bytes.toBytes("emp personal data "), Bytes.toBytes("Company"), Bytes.toBytes("Cisco"))
res5: org.apache.hadoop.hbase.client.Put = {"totalColumns":3,"row":"1","families":{"emp personal data ":[{"qualifier":"Name","v
n":6,"tag":[],"timestamp":9223372036854775807},{"qualifier":"City","vlen":8,"tag":[],"timestamp":9223372036854775807},{"qualifi
":"Company","vlen":5,"tag":[],"timestamp":9223372036854775807}]}}
scala> insertData.addColumn(Bytes.toBytes("emp personal data "), Bytes.toBytes("location"), Bytes.toBytes("San Jose"))
res6: org.apache.hadoop.hbase.client.Put = {"totalColumns":4,"row":"1","families":{"emp personal data ":[{"qualifier":"Name","v
n":6,"tag":[],"timestamp":9223372036854775807},{"qualifier":"City","vlen":8,"tag":[],"timestamp":9223372036854775807},{"qualifi
":"Company","vlen":5,"tag":[],"timestamp":9223372036854775807},{"qualifier":"location","vlen":8,"tag":[],"timestamp":9223372036
4775807}]}}
but I don't see any new columns in HBase.
Can anyone help, please? Any reference to configuration would be great. Do I need to configure ZooKeeper separately? I appreciate your help.
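One detail worth checking in the spark-shell session above: a Put is built and columns are added, but it is never submitted with a put(...) call on the table, so nothing is actually written. A minimal sketch of the full write path with the HBase 1.x Java client (table and column-family names taken from the session; error handling omitted):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EmployeePutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {
            Put insertData = new Put(Bytes.toBytes("1"));
            insertData.addColumn(Bytes.toBytes("emp personal data "), Bytes.toBytes("Name"), Bytes.toBytes("Jeevan"));
            insertData.addColumn(Bytes.toBytes("emp personal data "), Bytes.toBytes("City"), Bytes.toBytes("San Jose"));
            // The step missing from the shell session: actually send the Put to HBase.
            table.put(insertData);
        }
    }
}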
