prestodb vs prestosql insert-existing-partitions-behavior

The config below exists in PrestoSQL (Trino) but not in PrestoDB. How can we get the same functionality using PrestoDB?
"hive.insert-existing-partitions-behavior"

Related

Azure data bricks external hive metastore creation

I am creating an external Hive metastore in Azure Databricks, backed by Azure SQL. I added the settings below to the cluster config, using the 7.3 runtime, as mentioned in the documentation:
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore#spark-options
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://xxx.database.windows.net:1433;database=hivemetastore
spark.hadoop.javax.jdo.option.ConnectionUserName xxxx
datanucleus.fixedDatastore false
spark.hadoop.javax.jdo.option.ConnectionPassword xxxx
datanucleus.autoCreateSchema true
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 1.2.1
hive.metastore.schema.verification.record.version false
hive.metastore.schema.verification false
--
After this, when I try to run create database metastore, the command gets cancelled automatically.
The error shows up in the Data section in Databricks, and I am not able to copy it.

--Update
According to the error message posted in the comments:
The maximum length allowed is 8000 when a length is specified in declaring a VARCHAR column.
Workaround: use either VARCHAR(8000) or VARCHAR(MAX) for the column 'PARAM_VALUE'. I would prefer nvarchar(max), because an nvarchar(max) column can store up to 2 GB of data.
Apparently there is an official record of the known issue!
See Error in CREATE TABLE with external Hive metastore
This is a known issue with MySQL 8.0 when the default charset is
utf8mb4.
Try running this to confirm
SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "<database-name>"
If so, refer to the solution:
You need to update or recreate the database and set the charset to
latin1.
You have 2 options:
Manually run the create statements in the Hive database with DEFAULT CHARSET=latin1 at the end of each CREATE TABLE statement.
Set up the database and user accounts, create the database, and run alter database hive character set latin1; before you launch the metastore. (This command sets the default CHARSET for the database. It is applied when the metastore creates tables.)

How can we add MySQL details as property in PySpark?

When creating a SparkSession, there is a property for connecting to Cassandra:
.config("spark.cassandra.connection.host", "ip-address")
It can be added directly while creating the SparkSession. Can we add the MySQL details in a similar way, so that we can avoid passing them in every Spark function?

No, there is no such option when connecting to MySQL. Cassandra has its own spark-cassandra-connector, while MySQL is accessed through JDBC, which requires the connection parameters to be passed as Java Properties.
The two differ in their configuration options and in how they work.
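One way to avoid repeating the JDBC details is to keep them in one place and wrap the read call. The following is a minimal sketch, not a built-in Spark feature: the URL, credentials, and the helper name read_mysql_table are hypothetical, and it assumes the MySQL Connector/J driver is on the Spark classpath. It only relies on DataFrameReader.jdbc(url, table, properties=...).

```python
# Hypothetical connection details, defined once and reused everywhere.
MYSQL_URL = "jdbc:mysql://db-host:3306/mydb"
MYSQL_PROPS = {
    "user": "my_user",                      # placeholder
    "password": "my_password",              # placeholder
    "driver": "com.mysql.cj.jdbc.Driver",   # Connector/J 8.x driver class
}


def read_mysql_table(spark, table, url=MYSQL_URL, props=MYSQL_PROPS):
    """Read one MySQL table through JDBC using the shared settings.

    `spark` is an existing SparkSession; the properties are copied so a
    caller cannot mutate the shared dict by accident.
    """
    return spark.read.jdbc(url=url, table=table, properties=dict(props))
```

With a live session this would be called as read_mysql_table(spark, "orders"), where "orders" is just an example table name.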

How to configure SSL between Spark and Cassandra?

I'm trying to configure SSL for the Cassandra Spark connector, but I couldn't find an example of how to do it.
I'm trying to configure it like this:
SparkConf conf = new SparkConf().setAppName("someApp")
.set("spark.cassandra.connection.host", "111.111.111.111")
.set("spark.cassandra.connection.ssl.enabled", "true")
.set("spark.cassandra.connection.ssl.trustStore.path", "/some/tfile.jks")
.set("spark.cassandra.connection.ssl.trustStore.password", "apassword")
.set("spark.cassandra.connection.ssl.trustStore.type", "JKS")
.set("spark.cassandra.connection.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA")
.set("spark.cassandra.connection.ssl.keyStore.path", "/some/kfile.jks")
.set("spark.cassandra.connection.ssl.keyStore.password", "anotherpassword")
.set("spark.cassandra.connection.ssl.keyStore.type", "JKS")
.set("spark.cassandra.connection.ssl.protocol", "TLS");
When I try to submit the spark job, I get these errors:
Exception in thread "main" com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.ssl.keyStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabled is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.protocol is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabledAlgorithms is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
So I'm not sure if this is supported or I'm just using the wrong property names.
I saw this ticket for release 1.2.3 of the connector, but I couldn't find an example of how to use it and it sounded like it may not support keystores. I'm using version 1.4.0-M1 of the connector.
Can anyone show me an example of how to configure SSL for the Spark Cassandra connector? Thanks.

Though I don't see any keystore configurations, I can see the config variables below, and they are working fine for me.
Note: I am using version 1.5.0-M1. I am not sure whether there is a bug in the version you are using.
sparkConf.set("spark.cassandra.connection.ssl.enabled", "true");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.password", "password");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.path", "jks file path");
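The same three properties can also be passed on the command line. Below is a small sketch that collects them in one dict and flattens it into spark-submit style --conf arguments. The helper names are hypothetical, and whether a given connector version accepts these keys is an assumption (the question above shows 1.4.0-M1 rejecting them).

```python
def cassandra_ssl_conf(host, truststore_path, truststore_password):
    """Return the connector properties from the answer above as a dict."""
    return {
        "spark.cassandra.connection.host": host,
        "spark.cassandra.connection.ssl.enabled": "true",
        "spark.cassandra.connection.ssl.trustStore.path": truststore_path,
        "spark.cassandra.connection.ssl.trustStore.password": truststore_password,
    }


def to_submit_args(conf):
    """Flatten a conf dict into ['--conf', 'key=value', ...], sorted by key."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", "{}={}".format(key, value)]
    return args
```

The resulting list can be appended to a spark-submit command, so the settings live in one place instead of being repeated per job.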

Accessing Spark RDDs from a web browser via thrift server - java

We have processed our data using Spark 1.2.1 with Java and stored it in Hive tables. We want to access this data as RDDs from a web browser.
I read the documentation and understood the steps needed for the task.
I am unable to find a way to interact with Spark SQL RDDs via the thrift server. The examples I found have the line below in the code, and I cannot find the class for it in the Spark 1.2.1 Java API docs.
HiveThriftServer2.startWithContext
On GitHub I saw Scala examples using
import org.apache.spark.sql.hive.thriftserver, but I don't see this in the Java API docs. I am not sure if I am missing something.
Has anybody had luck accessing Spark SQL RDDs from a browser via thrift? Can you post a code snippet? We are using Java.

I've got most of this working. Let's dissect each part of it (references at the bottom of the post):
HiveThriftServer2.startWithContext is defined in Scala. I was never able to access it from Java or from Python using Py4j, and I am no JVM expert, but I ended up switching to Scala. This may have something to do with the annotation @DeveloperApi. This is how I imported it in Scala in Spark 1.6.1:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
For anyone reading this and not using Hive: a Spark SQL context won't do, you need a Hive context. Note that the HiveContext below is built from a Scala SparkContext, so if you hold a JavaSparkContext (sc here) you have to convert it first.
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.hive.HiveContext
var hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc))
Now start the thrift server
HiveThriftServer2.startWithContext(hiveContext)
// Yay
Next, we need to make our RDDs available as SQL tables. First, we have to convert them into Spark SQL DataFrames:
val someDF = hiveContext.createDataFrame(someRDD)
Then, we need to turn them into Spark SQL tables. You do this by persisting them to Hive, or making the RDD available as a temporary table.
Persist to Hive:
// Deprecated since Spark 1.4, to be removed in Spark 2.0:
someDF.saveAsTable("someTable")
// Up-to-date at time of writing
someDF.write.saveAsTable("someTable")
Or, use a temporary table:
// Use the Data Frame as a Temporary Table
// Introduced in Spark 1.3.0
someDF.registerTempTable("someTable")
Note - temporary tables are isolated to an SQL session.
Spark's hive thrift server is multi-session by default
in version 1.6 (one session per connection). Therefore,
for clients to access temporary tables you've registered,
you'll need to set the option spark.sql.hive.thriftServer.singleSession to true
You can test this by querying the tables in beeline, a command line utility for interacting with the hive thrift server. It ships with Spark.
Finally, you need a way of accessing the hive thrift server from the browser. Thanks to its awesome developers, it has an HTTP mode, so if you want to build a web app, you can use the thrift protocol over AJAX requests from the browser. A simpler strategy might be to create an IPython notebook, and use pyhive to connect to the thrift server.
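For the notebook route, a minimal sketch of the client side is shown here. It is written against the generic Python DB-API, which PyHive connections follow, so the helper itself does not import PyHive. The function name fetch_rows and the table name are hypothetical, and the thrift host and port in the usage note are assumptions.

```python
def fetch_rows(conn, table, limit=10):
    """Run a SELECT through a DB-API 2.0 connection and return rows as dicts."""
    # Table names cannot be bound as query parameters, so validate them
    # before formatting them into the SQL text.
    if not table.replace("_", "").isalnum():
        raise ValueError("unexpected table name: {!r}".format(table))
    cursor = conn.cursor()
    try:
        cursor.execute("SELECT * FROM {} LIMIT {}".format(table, int(limit)))
        # cursor.description holds one tuple per column; the name comes first.
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()
```

With PyHive installed, the connection would typically come from pyhive.hive.Connection(host="thrift-host", port=10000), where the host is a placeholder and 10000 is assumed to be the thrift server's binary port; the result of fetch_rows(conn, "someTable") can then be rendered in the notebook or serialized for a web client.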
Data Frame Reference:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/DataFrame.html
singleSession option pull request:
https://mail-archives.apache.org/mod_mbox/spark-commits/201511.mbox/%3Cc2bd1313f7ca4e618ec89badbd8f9f31@git.apache.org%3E
HTTP mode and beeline howto:
https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
Pyhive:
https://github.com/dropbox/PyHive
HiveThriftServer2 startWithContext definition:
https://github.com/apache/spark/blob/6b1a6180e7bd45b0a0ec47de9f7c7956543f4dfa/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56-73

The thrift server is a JDBC/ODBC server.
You can connect to it via JDBC/ODBC connections and access content through the HiveDriver.
You cannot get RDDs back from it, because the HiveContext is not available on the client side.
What you referred to is an experimental feature that is not available for Java.
As a workaround, you could re-parse the results and create your own structures for your client.
For example:
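import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

private static String driverName = "org.apache.hive.jdbc.HiveDriver";
private static String hiveConnectionString = "jdbc:hive2://YourHiveServer:Port";
private static String tableName = "SOME_TABLE";

// Inside a method that declares "throws Exception":
// load the Hive JDBC driver, run the query, and close everything afterwards.
Class.forName(driverName);
try (Connection con = DriverManager.getConnection(hiveConnectionString, "user", "pwd");
     Statement stmt = con.createStatement();
     ResultSet res = stmt.executeQuery("select * from " + tableName)) {
    // parseResultsToObjects is your own helper that maps rows to client objects.
    parseResultsToObjects(res);
}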

how to use presto to query hive data

I just installed presto and when I use the presto-cli to query hive data, I get the following error:
$ ./presto --server node6:8080 --catalog hive --schema default
presto:default> show tables;
Query 20131113_150006_00002_u8uyp failed: Table hive.information_schema.tables does not exist
The config.properties is:
coordinator=true
datasources=jmx,hive
http-server.http.port=8080
presto-metastore.db.type=h2
presto-metastore.db.filename=/root/h2
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://node6:8080
And the hive.properties is:
connector.name=hive-cdh4
hive.metastore.uri=thrift://node6:9083
The hadoop distribution I used is CDH 4.4. I believe it's properly installed and hive can process queries successfully on its own.
Can anyone help me work it out? Any ideas will be appreciated.

As recommended by the Getting Started guide, I created a coordinator (jmx only) and a separate worker (jmx,hive), each on a separate machine.
What finally solved this for me was to specify the worker's hostname and http-server.http.port as the --server argument to presto. When I specified the coordinator, it didn't work.
This all makes sense, but I am still wondering what will happen when I have two Presto-Hive workers...

Add one more line to etc/catalog/hive.properties:
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Of course, check that these paths are correct before you do it.
presto-metastore.db.filename= <- is this the value for the Hive warehouse directory?
=> No, this is Presto's own metastore, not Hive's.

I just figured out what was wrong in my case:
You also have to add the following line to $HIVE_HOME/conf/hive-env.sh to tell Hive to open the thrift port (the same port as the one in the hive.metastore.uris property in the hive-site.xml file). This port is used by the Hive client to connect to the metastore through RPC.
export METASTORE_PORT=9084
This should sync your Hive with Presto.
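Before changing more configuration, it can help to confirm that the metastore's thrift port is actually open from the Presto machine. The following is a rough sketch using only the standard library. The helper names are hypothetical, and the URI is the one from the question's hive.properties; substitute whatever port your metastore really listens on.

```python
import socket
from urllib.parse import urlparse


def metastore_endpoint(uri):
    """Split a thrift://host:port URI into (host, port)."""
    parsed = urlparse(uri)
    return parsed.hostname, parsed.port


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host, port = metastore_endpoint("thrift://node6:9083")
    print(host, port, is_reachable(host, port))
```

A False result points at the metastore service or the network rather than at the Presto catalog settings. A True result only shows that something is listening on the port, not that it is a working metastore.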
