Spark write to Ignite on Ignite custom schema - apache-spark

I use Spark 2.3.0 and Ignite 2.7.0. I have created custom schemas in Ignite using the following configuration:
<property name="sqlSchemas">
<list>
<value>dbsch</value>
<value>dbschma</value>
<value>db_ui</value>
</list>
</property>
I can create tables in each of these schemas, but I cannot load data into a particular schema.
df.write
.format(FORMAT_IGNITE)
.option(OPTION_CONFIG_FILE, CONFIG)
.option(OPTION_TABLE, "db_ui."+tableName)
.option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, primaryKey)
.option(OPTION_CREATE_TABLE_PARAMETERS, "template=replicated")
.mode(SaveMode.Append)
.save()
Spark tries to create a table matching the DataFrame and eventually fails with an error that reports null in the Spark config. The write works if I load the table into the public (default) schema. I get the same error when I try to read from Ignite. Any input, please?
Ash

Related

Spark not able to write into a new hive table in partitioned and append mode

I created a new table in Hive, partitioned and in ORC format.
I write into this table from Spark in append mode, using the ORC format and partitioning.
It fails with the exception:
org.apache.spark.sql.AnalysisException: The format of the existing table test.table1 is `HiveFileFormat`. It doesn't match the specified format `OrcFileFormat`.;
I changed the format from "orc" to "hive" while writing. It still fails, with the exception:
Spark is not able to understand the underlying structure of the table.
So the issue seems to be that Spark cannot write into the Hive table in append mode, because it cannot create a new table. Overwrite succeeds because Spark creates the table again.
But my use case requires writing in append mode from the start. insertInto also does not work, specifically for partitioned tables. I am pretty much blocked. Any help would be great.
Edit1:
Working on HDP 3.1.0 environment.
Spark Version is 2.3.2
Hive Version is 3.1.0
Edit 2:
// Reading the table
val inputdf=spark.sql("select id,code,amount from t1")
//writing into table
inputdf.write.mode(SaveMode.Append).partitionBy("code").format("orc").saveAsTable("test.t2")
Edit 3: Using insertInto()
val df2 =spark.sql("select id,code,amount from t1")
df2.write.format("orc").mode("append").insertInto("test.t2");
I get the error as:
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN HiveMetastoreCatalog: Unable to infer schema for table test.t1 from file format ORC (inference mode: INFER_AND_SAVE). Using metastore schema.
If I rerun the insertInto command I get the following exception :
20/05/17 19:16:37 ERROR Hive: MetaException(message:The transaction for alter partition did not commit successfully.)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
Error in hive metastore logs :
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(907)) - 163: alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(349)) - ugi=X#A.ORG ip=10.10.1.36 cmd=alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions_with_environment_context(5119)) - New partition values:[BR]
2020-05-17T21:17:43,913 ERROR [pool-8-thread-198]: metastore.ObjectStore (ObjectStore.java:alterPartitions(4397)) - Alter failed
org.apache.hadoop.hive.metastore.api.MetaException: Cannot change stats state for a transactional table without providing the transactional write state for verification (new write ID -1, valid write IDs null; current state null; new state {}
I was able to resolve the issue by using external tables. There is currently an open issue in Spark related to Hive's ACID properties. Once I create the Hive table as an external table, append operations work on both partitioned and non-partitioned tables.
https://issues.apache.org/jira/browse/SPARK-15348
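The external-table workaround described above might look roughly like the sketch below. The column types, the HDFS location, and the object and method names are assumptions made for illustration; they do not come from the original post. Note that insertInto matches columns by position, so the partition column is selected last here.

```scala
import org.apache.spark.sql.SparkSession

object ExternalTableAppendSketch {
  // DDL for an external, partitioned ORC table.
  // Column types and the location are assumptions; adjust them to the real schema.
  def externalTableDdl(location: String): String =
    s"""CREATE EXTERNAL TABLE IF NOT EXISTS test.t2 (id INT, amount DOUBLE)
       |PARTITIONED BY (code STRING)
       |STORED AS ORC
       |LOCATION '$location'""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("external-table-append")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical location for the external table's data files.
    spark.sql(externalTableDdl("/tmp/external/test/t2"))

    // Dynamic-partition inserts into Hive tables need non-strict mode.
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    // insertInto is position-based: data columns first, partition column last.
    spark.sql("SELECT id, amount, code FROM t1")
      .write
      .mode("append")
      .insertInto("test.t2")

    spark.stop()
  }
}
```

Whether the table ends up in the catalog that Hive reads depends on how the cluster's Spark and Hive catalogs are configured, so treat this as a starting point rather than a verified recipe for HDP 3.1.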

Hive Databases Only List default DB

When I try to list all Hive databases through Spark (1.6):
scala> val tdf = sqlContext.sql("SHOW DATABASES");
tdf: org.apache.spark.sql.DataFrame = [result: string]
scala> tdf.show
+-------+
| result|
+-------+
|default|
+-------+
When I try to list all Hive databases through the Hive shell:
hive> show databases;
OK
default
Time taken: 0.621 seconds, Fetched: 1 row(s)
However, my Hive installation actually has a lot of databases already. Am I missing some configuration on my Cloudera cluster? Or is there a problem with my Hive metastore?
Use HiveContext to fetch data from Hive, and set hive.metastore.uris.
In Spark code:
System.setProperty("hive.metastore.uris","thrift://hostserver:9083")
val hivecontext = new HiveContext(sparkContext)
val tdf = hivecontext.sql("SHOW DATABASES");
In spark-shell:
spark-shell --driver-java-options "-Dhive.metastore.uris=thrift://hostserver:9083"
Since the Hive shell also shows only the default database, check the Hive metastore configuration.
To start with, log into the database that holds the metastore and run a query that lists the Hive databases. An example query for a MySQL database:
mysql> SELECT NAME, DB_LOCATION_URI FROM hive.DBS;
Then verify and update hive-site.xml as shown below. On CDH this file is generally at /usr/lib/hive/conf/hive-site.xml, and on HDP it is generally at /usr/hdp/current/hive-client/conf/hive-site.xml.
The documentation reference for configuration of the metastore:
a) https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-RemoteMetastoreDatabase
b) (CDH) https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hive_metastore_configure.html (Refer to the section: 4. Configure the metastore service to communicate with the MySQL database)
Example configuration:
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>

Spark Cassandra Connector Issue

I am trying to integrate Cassandra with Spark and am facing the issue below.
Issue:
com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches:
spark.cassandra.sql.keyspace
spark.cassandra.output.batch.grouping.key
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:50)
at com.datastax.spark.connector.cql.CassandraConnectorConf$.apply(CassandraConnectorConf.scala:253)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:263)
at org.apache.spark.sql.cassandra.CassandraCatalog.org$apache$spark$sql$cassandra$CassandraCatalog$$buildRelation(CasandraCatalog.scala:41)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:26)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:23)
These are the versions of Spark, Cassandra, and the connector I am using:
Spark : 1.6.0
Cassandra : 2.1.17
Connector Used : spark-cassandra-connector_2.10-1.6.0-M1.jar
Below is the code snippet I use to connect to Cassandra from Spark.
val conf: org.apache.spark.SparkConf = new SparkConf(true)
  .setAppName("Spark Cassandra")
  .set("spark.cassandra.connection.host", "abc.efg.lkh")
  .set("spark.cassandra.auth.username", "xyz")
  .set("spark.cassandra.auth.password", "1234")
  .set("spark.cassandra.keyspace","abcded")
val sc = new SparkContext("local[*]", "Spark Cassandra",conf)
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("abcded")
val my_df = csc.sql("select * from table")
When I try to create the DataFrame here, I get the error posted above. I tried without passing the keyspace in the conf, but then it tries to access the default keyspace, which this user does not have access to.
A JIRA for this was already opened and closed:
https://datastax-oss.atlassian.net/browse/SPARKC-102
Yet I am still getting this issue. Please let me know whether I need to use the latest connector to resolve it.
Thanks in advance.
The important information is in the error message you posted [formatted for readability]:
Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches: spark.cassandra.sql.keyspace
spark.cassandra.keyspace is not an available property for the connector. A full list of the available properties can be found here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
You may have some luck using the suggested spark.cassandra.sql.keyspace; otherwise you may just need to explicitly specify the keyspace for every Cassandra interaction you perform using the connector.
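A minimal sketch of the second suggestion, assuming connector 1.6 with CassandraSQLContext as in the question: the invalid spark.cassandra.keyspace entry is dropped from the SparkConf, and the keyspace is supplied either through setKeyspace or by qualifying the table name in the query. The table name my_table and the object name are hypothetical; the host and credentials are the placeholders from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext

object CassandraKeyspaceSketch {
  // Only connector properties that the question already used successfully.
  // There is deliberately no "spark.cassandra.keyspace" entry here.
  def buildConf(): SparkConf =
    new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("Spark Cassandra")
      .set("spark.cassandra.connection.host", "abc.efg.lkh")
      .set("spark.cassandra.auth.username", "xyz")
      .set("spark.cassandra.auth.password", "1234")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(buildConf())
    val csc = new CassandraSQLContext(sc)

    // Option 1: set the default keyspace on the context.
    csc.setKeyspace("abcded")
    val viaDefault = csc.sql("SELECT * FROM my_table")

    // Option 2: name the keyspace explicitly in the query.
    val viaQualified = csc.sql("SELECT * FROM abcded.my_table")

    viaDefault.show()
    viaQualified.show()
    sc.stop()
  }
}
```

The property suggested by the error message, spark.cassandra.sql.keyspace, could be tried in the SparkConf instead. Its effect on this connector version is not verified here.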

spark-sql Table or view not found error

I'm trying to run a basic Java program using spark-sql and JDBC, and I'm running into the following error. I'm not sure what's wrong here. Most of the material I have read does not explain what needs to be done to fix this problem.
It would also be great if someone could point me to some good material on Spark SQL (Spark 2.1.1). I'm planning to use Spark to implement ETL jobs, connecting to MySQL and other data sources.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: myschema.mytable; line 1 pos 21;
String MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/myschema";
String MYSQL_USERNAME = "root";
String MYSQL_PWD = "root";
Properties connectionProperties = new Properties();
connectionProperties.put("user", MYSQL_USERNAME);
connectionProperties.put("password", MYSQL_PWD);
Dataset<Row> jdbcDF2 = spark.read()
.jdbc(MYSQL_CONNECTION_URL, "myschema.mytable", connectionProperties);
spark.sql("SELECT COUNT(*) FROM myschema.mytable").show();
This happens because Spark does not, by default, register tables from the connection's schemas in the Spark SQL context. You must register the table yourself:
jdbcDF2.createOrReplaceTempView("mytable");
spark.sql("select count(*) from mytable");
Your jdbcDF2 is backed by myschema.mytable in MySQL and loads data from that table when an action runs.
Remember that a MySQL table is not the same as a Spark table or view. You are telling Spark to read data from MySQL, but you must still register the DataFrame or Dataset as a table or view in the current Spark SQL context or Spark session.

Datastax 4.0 Error: Does not contain a valid host:port authority: ${dse.job.tracker}

I just got a single-node cluster up and running with the new DataStax 4.0.
It works great. We use Hive to build and query our data.
On the server itself, I can start Hive
$>dse hive
and query tables just fine.
When I try to use the newest Hive ODBC driver to run the same query, I see the error below.
It connects just fine, and I can query the keyspace and see the tables. But when I run the query, the MapReduce job appears to get queued and then errors out with the following.
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
java.lang.IllegalArgumentException: Does not contain a valid host:port authority: ${dse.job.tracker}
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:147)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:2584)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:474)
at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:457)
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:402)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:144)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:198)
at org.apache.hadoop.hive.service.ThriftHive$Processor$execute.getResult(ThriftHive.java:646)
at org.apache.hadoop.hive.service.ThriftHive$Processor$execute.getResult(ThriftHive.java:630)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:225)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Job Submission failed with exception 'java.lang.IllegalArgumentException(Does not contain a valid host:port authority: ${dse.job.tracker})'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Any thoughts on what I should try?
Thanks ahead of time for any thoughts and or suggestions/assistance you all can provide.
Cheers,
Eric
I solved the issue by manually configuring host:port in the mapred-site.xml configuration file.
Just add the lines
<property>
<name>mapred.job.tracker</name>
<value>host:port</value>
</property>
where host and port are the IP address of your Hive server and the port in use (usually 8012).
This overrides the default placeholder ${dse.job.tracker} present in the dse-mapred-default.xml configuration file.
The dse.job.tracker property needs to be set in System properties of the JVM that starts Hadoop Jobs. Hadoop will substitute the placeholder with an appropriate system property value if it is defined. Otherwise, it will be just left as is, thus the error you see.
For hive, pig and mahout the mapred.job.tracker property is set in the bin/dse script as follows:
if [ -z "$HADOOP_JT" ]; then
HADOOP_JT=`$BIN/dsetool jobtracker --use-hadoop-config`
fi
if [ -z "$HADOOP_JT" ]; then
echo "Unable to run $HADOOP_CMD: jobtracker not found"
exit 2
fi
#set the JT param as a JVM arg
export HADOOP_OPTS="$HADOOP_OPTS -Ddse.job.tracker=$HADOOP_JT"
So you should do the same for your program using the Hive ODBC driver and I guess it should be fine.
By hardcoding the Hadoop JT location, you make it harder to move the JT to another node, because you would then have to update the config file manually. Moreover, the automatic JT failover of DSE won't work properly if your primary JT goes down, because your program would still try to connect to the old one.
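The placeholder behavior described above can be illustrated with a small model. This is a simplified sketch written for this explanation, not Hadoop's actual implementation: it replaces each ${name} with the matching JVM system property and leaves the placeholder untouched when the property is not defined, which is the situation that produces the "valid host:port authority" error. The object name and the sample address in the usage note are hypothetical.

```scala
import scala.util.matching.Regex

object PlaceholderSketch {
  // Matches ${name} and captures the property name.
  private val Placeholder: Regex = """\$\{([^}]+)\}""".r

  // Replace each ${name} with the JVM system property of that name.
  // If the property is not set, the placeholder is left exactly as written.
  def substitute(raw: String): String =
    Placeholder.replaceAllIn(raw, m => {
      val resolved = sys.props.get(m.group(1)).getOrElse(m.matched)
      // Escape '$' and '\' so the replacement is taken literally.
      Regex.quoteReplacement(resolved)
    })
}
```

With no -Ddse.job.tracker on the command line, PlaceholderSketch.substitute("${dse.job.tracker}") returns the placeholder unchanged. After starting the JVM with, for example, -Ddse.job.tracker=10.0.0.5:8012, it returns that address.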
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>local</value>
</property>
<property>
<name>mapreduce.jobtracker.http.address</name>
<value>localhost:50030</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
</configuration>
