I am running Hive queries using Spark SQL.
I created a HiveContext object:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
Then, when I try to run the command:
hiveContext.sql("use db_name");
OR
hiveContext.hiveql("use db_name");
it doesn't work; it says the database was not found.
When I try to run
val db = hiveContext.hiveql("show databases");
db.collect.foreach(println);
it prints nothing but [default].
Any help would be appreciated.
hiveContext.sql("SELECT * FROM database.table")
I'm trying to fetch data from DB2 using:
df = spark.read.format("jdbc").option("user", "user").option("password", "password")\
    .option("driver", "com.ibm.db2.jcc.DB2Driver")\
    .option("url", "jdbc:db2://url:<port>/<DB>")\
    .option("query", query)\
    .load()
Locally the query option works, but on the server it asks me to use dbtable instead.
When I use dbtable I get a SQL syntax error (SQLCODE=-104, SQLSTATE=42601) and it picks up the wrong columns.
Can someone help me with this?
You can use the AS400 driver to fetch DB2 data using Spark.
Your DB2 URL will look something like this: jdbc:as400://<DBIPAddress>
val query = "(select * from db.temptable) temp"
val df = spark.read.format("jdbc")
  .option("url", <YourURL>)
  .option("driver", "com.ibm.as400.access.AS400JDBCDriver")
  .option("dbtable", query)
  .option("user", <Username>)
  .option("password", <Password>)
  .load()
Please note that you will need to keep the query format as shown above (i.e. give an alias to the query). Hope this resolves your issue.
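If you would rather stay on the regular DB2 driver (com.ibm.db2.jcc.DB2Driver) instead of switching to the AS400 one, the same aliased-subquery trick should work through the dbtable option; a sketch reusing the placeholders from the question:
val query = "(select * from db.temptable) temp"   // the alias after the subquery is required
val df = spark.read.format("jdbc")
  .option("url", "jdbc:db2://url:<port>/<DB>")
  .option("driver", "com.ibm.db2.jcc.DB2Driver")
  .option("dbtable", query)
  .option("user", "user")
  .option("password", "password")
  .load()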
Using an EMR cluster, I created an external Hive table (over 800 million rows) that maps to a DynamoDB table. It works well and I can run queries and inserts through Hive.
If I run a query in Hive with a condition on the hash key, I get the results in seconds. But running the same query through spark-submit using Spark SQL with enableHiveSupport (accessing Hive), it never finishes. It seems that from Spark it is doing a full scan of the table.
I tried several configurations (different hive-site.xml files, for example), but it doesn't seem to work well from Spark. How should I do it through Spark? Any suggestions?
Thanks
Just make sure to use the DynamoDB connector open-sourced by AWS. By default it is available on EMR, AFAIK.
Syntax to create a table using the DynamoDBStorageHandler class:
CREATE EXTERNAL TABLE hive_tablename (
hive_column1_name column1_datatype,
hive_column2_name column2_datatype
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "dynamodb_tablename",
"dynamodb.column.mapping" =
"hive_column1_name:dynamodb_attribute1_name,hive_column2_name:dynamodb_attribute2_name"
);
For any Spark job, you need the following configuration:
$ spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
...
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable
// Point the job at the DynamoDB table and wire up the connector's input/output formats
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.input.tableName", "myDynamoDBTable")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
// Read the table as an RDD of (Text, DynamoDBItemWritable) pairs
var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
orders.count()
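If you need the actual attributes rather than just a count, each value in the RDD is a DynamoDBItemWritable wrapping the item's attribute map; a small sketch (the attribute name customerId is hypothetical):
// getItem returns the item's attribute map; "customerId" is a made-up attribute name
orders.map { case (_, item) => item.getItem.get("customerId") }.take(10).foreach(println)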
References:
https://github.com/awslabs/emr-dynamodb-connector
I am trying to integrate Cassandra with Spark and am facing the issue below.
Issue:
com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches:
spark.cassandra.sql.keyspace
spark.cassandra.output.batch.grouping.key
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:50)
at com.datastax.spark.connector.cql.CassandraConnectorConf$.apply(CassandraConnectorConf.scala:253)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:263)
at org.apache.spark.sql.cassandra.CassandraCatalog.org$apache$spark$sql$cassandra$CassandraCatalog$$buildRelation(CasandraCatalog.scala:41)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:26)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:23)
Please find below the versions of Spark, Cassandra, and the connector I am using.
Spark : 1.6.0
Cassandra : 2.1.17
Connector Used : spark-cassandra-connector_2.10-1.6.0-M1.jar
Below is the code snippet I am using to connect to Cassandra from Spark.
val conf: org.apache.spark.SparkConf = new SparkConf(true)
  .setAppName("Spark Cassandra")
  .set("spark.cassandra.connection.host", "abc.efg.lkh")
  .set("spark.cassandra.auth.username", "xyz")
  .set("spark.cassandra.auth.password", "1234")
  .set("spark.cassandra.keyspace", "abcded")
val sc = new SparkContext("local[*]", "Spark Cassandra",conf)
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("abcded")
val my_df = csc.sql("select * from table")
When I try to create the DataFrame here, I get the error posted above. I tried without passing the keyspace in the conf, but then it tries to access the default keyspace, where the mentioned user doesn't have access.
A JIRA for this was already opened and closed:
https://datastax-oss.atlassian.net/browse/SPARKC-102
yet I am still getting the issue. Please let me know whether I need to use the latest connector to resolve it.
Thanks in advance.
The important information is in the error message you posted [formatted for readability]:
Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches: spark.cassandra.sql.keyspace
spark.cassandra.keyspace is not an available property for the connector. A full list of the available properties can be found here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
You may have some luck using the suggested spark.cassandra.sql.keyspace; otherwise you may just need to explicitly specify the keyspace for every Cassandra interaction you perform using the connector.
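For example, against the 1.6-era API from your snippet, something along these lines should get past the config check (a sketch, with your placeholder values kept):
val conf = new SparkConf(true)
  .setAppName("Spark Cassandra")
  .set("spark.cassandra.connection.host", "abc.efg.lkh")
  .set("spark.cassandra.auth.username", "xyz")
  .set("spark.cassandra.auth.password", "1234")
  .set("spark.cassandra.sql.keyspace", "abcded")   // the property the error message suggests
val sc = new SparkContext("local[*]", "Spark Cassandra", conf)
val csc = new CassandraSQLContext(sc)
// Or qualify the keyspace in the query itself instead of setting it globally
val my_df = csc.sql("select * from abcded.table")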
I am using HDP-2.6.0.3 but I need Zeppelin 0.8, so I have installed it as an independent service. When I run:
%sql
show tables
I get nothing back and I get 'table not found' when I run Spark2 SQL commands. Tables can be seen in the 0.7 Zeppelin that is part of HDP.
Can anyone tell me what I am missing, for Zeppelin/Spark to see Hive?
The steps I performed to set up Zeppelin 0.8 are as follows:
mvn clean package -DskipTests -Pspark-2.1 -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Ppyspark -Psparkr -Pr -Pscala-2.11
Copied zeppelin-site.xml and shiro.ini from /usr/hdp/2.6.0.3-8/zeppelin/conf to /home/ed/zeppelin/conf.
Created /home/ed/zeppelin/conf/zeppelin-env.sh, in which I put the following:
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.6.0.3-8"
Copied /etc/hive/conf/hive-site.xml to /home/ed/zeppelin/conf
EDIT:
I have also tried:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("interfacing spark sql to hive metastore without configuration file")
.config("hive.metastore.uris", "thrift://s2.royble.co.uk:9083") // replace with your hivemetastore service's thrift url
.config("url", "jdbc:hive2://s2.royble.co.uk:10000/default")
.config("UID", "admin")
.config("PWD", "admin")
.config("driver", "org.apache.hive.jdbc.HiveDriver")
.enableHiveSupport() // don't forget to enable hive support
.getOrCreate()
same result, and:
import java.sql.{DriverManager, Connection, Statement, ResultSet}
val url = "jdbc:hive2://"
val driver = "org.apache.hive.jdbc.HiveDriver"
val user = "admin"
val password = "admin"
Class.forName(driver).newInstance
val conn: Connection = DriverManager.getConnection(url, user, password)
which gives:
java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
ERROR XSDB6: Another instance of Derby may have already booted the database /home/ed/metastore_db
Fixed error with:
val url = "jdbc:hive2://s2.royble.co.uk:10000"
but still no tables :(
This works:
import java.sql.{DriverManager, Connection, Statement, ResultSet}
val url = "jdbc:hive2://s2.royble.co.uk:10000"
val driver = "org.apache.hive.jdbc.HiveDriver"
val user = "admin"
val password = "admin"
Class.forName(driver).newInstance
val conn: Connection = DriverManager.getConnection(url, user, password)
val r: ResultSet = conn.createStatement.executeQuery("SELECT * FROM tweetsorc0")
but then I have the pain of converting the ResultSet to a DataFrame. I'd rather have SparkSession work and give me a DataFrame, so I will add a bounty later today.
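In the meantime I can build a DataFrame from the ResultSet by hand, but it is clunky; a rough all-strings sketch (assuming the spark and sc that Zeppelin provides):
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val meta = r.getMetaData
val cols = (1 to meta.getColumnCount).map(i => meta.getColumnName(i))
val rows = ListBuffer[Row]()
while (r.next()) rows += Row.fromSeq(cols.map(c => r.getString(c)))
val schema = StructType(cols.map(c => StructField(c, StringType, nullable = true)))
val df = spark.createDataFrame(sc.parallelize(rows), schema)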
I had a similar problem in Cloudera Hadoop. In my case the problem was that Spark SQL did not see my Hive metastore, so when I used my SparkSession object for Spark SQL I could not see my previously created tables. I managed to solve it by adding the following to zeppelin-env.sh:
export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
export HADOOP_HOME=/opt/cloudera/parcels/CDH
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
(I assume for Hortonworks these paths are different.) I also changed spark.master from local[*] to yarn-client in the Interpreter UI. Most importantly, I manually copied hive-site.xml into /etc/spark/conf/, because I thought it was strange that it was not in that directory, and that solved my problem.
So my advice is to check whether hive-site.xml exists in your SPARK_CONF_DIR, and if not, add it manually. I also found a guide for Hortonworks and Zeppelin in case this does not work.
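A quick way to confirm, from a Zeppelin paragraph, which catalog Spark 2 is actually using (spark here is the SparkSession Zeppelin injects):
// If hive-site.xml is being picked up, this lists your Hive databases,
// not just the "default" of a throwaway local Derby metastore
spark.catalog.listDatabases().show(false)
spark.catalog.listTables("default").show(false)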
I'm trying to run a basic Java program using Spark SQL & JDBC, and I'm running into the following error. I'm not sure what's wrong here, and most of the material I have read does not explain what needs to be done to fix this problem.
It would also be great if someone could point me to some good material on Spark SQL (Spark 2.1.1). I'm planning to use Spark to implement ETLs, connecting to MySQL and other data sources.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: myschema.mytable; line 1 pos 21;
String MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/myschema";
String MYSQL_USERNAME = "root";
String MYSQL_PWD = "root";
Properties connectionProperties = new Properties();
connectionProperties.put("user", MYSQL_USERNAME);
connectionProperties.put("password", MYSQL_PWD);
Dataset<Row> jdbcDF2 = spark.read()
.jdbc(MYSQL_CONNECTION_URL, "myschema.mytable", connectionProperties);
spark.sql("SELECT COUNT(*) FROM myschema.mytable").show();
It's because Spark does not register any tables from any schemas of a JDBC connection by default in the Spark SQL context. You must register it yourself:
jdbcDF2.createOrReplaceTempView("mytable");
spark.sql("select count(*) from mytable");
Your jdbcDF2 has its source in myschema.mytable in MySQL and will load data from that table on some action.
Remember that a MySQL table is not the same as a Spark table or view. You are telling Spark to read data from MySQL, but you must register this DataFrame or Dataset as a table or view in the current Spark SQL context or Spark session.
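Putting it together (a sketch in Scala; the Java API is equivalent, and the names are reused from the question): register the JDBC DataFrame as a temp view and query that, or push the aggregation down to MySQL with an aliased subquery:
// Register the MySQL-backed DataFrame under a name Spark SQL knows about
val jdbcDF2 = spark.read.jdbc(MYSQL_CONNECTION_URL, "myschema.mytable", connectionProperties)
jdbcDF2.createOrReplaceTempView("mytable")
spark.sql("SELECT COUNT(*) FROM mytable").show()
// Alternatively, let MySQL do the counting by passing an aliased subquery as the "table"
val countDF = spark.read.jdbc(MYSQL_CONNECTION_URL,
  "(SELECT COUNT(*) AS cnt FROM myschema.mytable) t", connectionProperties)
countDF.show()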