Runtime Error: Cannot set database in spark! [DBT + Spark + Thrift] - apache-spark

Can anyone help me on this?
I'm getting the error ***Runtime Error: Cannot set database in spark!*** while running a dbt model via Spark in thrift mode with a remote Hive metastore.
I need to transform some models in dbt using Apache Spark as the adapter. For now, I'm running Spark locally on my machine.
I started the Thrift server as below, with the remote Hive metastore URI.
Started master:
./sbin/start-master.sh
Started worker:
./sbin/start-worker.sh spark://master_url:7077
Started Thrift Server:
./sbin/start-thriftserver.sh --master spark://master_url:7077 \
  --packages org.apache.iceberg:iceberg-spark3-runtime:0.13.1 \
  --hiveconf hive.metastore.uris=thrift://ip:9083
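A quick way to verify the Thrift endpoint that dbt will connect to is the Beeline client that ships with Spark (the host, port and user here are assumed from the dbt profile below):
./bin/beeline -u jdbc:hive2://localhost:10000 -n admin
SHOW DATABASES;   -- run inside Beeline to confirm the metastore is reachable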
My dbt profile (profiles.yml) looks like this:
project_name:
  outputs:
    dev:
      host: localhost
      method: thrift
      port: 10000
      schema: test_dbt
      threads: 4
      type: spark
      user: admin
  target: dev
While executing dbt run, I get the following error:
dbt run --select test -t dev
Running with dbt=1.1.0
Partial parse save file not found. Starting full parse.
Encountered an error:
Runtime Error
Cannot set database in spark!
Please note that there is not much info in dbt.log.
This error was occurring because of the "database" field in the source YAML file.
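A sketch of the kind of source definition involved (source and table names here are placeholders): the dbt-spark adapter only works with a single schema-level namespace, so the database key either has to be removed or must match schema.
Before (triggers the error):
version: 2
sources:
  - name: my_source
    database: some_db      # dbt-spark cannot use a separate database here
    schema: test_dbt
    tables:
      - name: test_table
After (works):
version: 2
sources:
  - name: my_source
    schema: test_dbt
    tables:
      - name: test_table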

Related

Is it possible to configure one Presto instance to act as both coordinator and worker?

I have installed the Presto server from this repo:
https://repo.maven.apache.org/maven2/io/prestosql/presto-server/330/
Then I downloaded apache-hive-3.1.3-bin and hadoop-3.3.3.
Then I initialized the Hive metastore and launched presto-server with bin/launcher run.
Then I launched presto-cli with:
`./presto-cli --server 127.0.0.1:8080 --catalog hive --schema default`
In it, I'm trying to create a schema:
`presto:default> create schema hive.mytest with (location = 's3a://my-bucket/mytest');`
and get a very unclear output:
`Query 20220828_084647_00002_rnxa4 failed: localhost:9083`
In the server stderr I see this:
io.prestosql.NotInTransactionException: Unknown transaction ID: eadd5d61-4524-4b9e-9ade-6596089b0712. Possibly expired? Commands ignored until end of transaction block
....
These are my Presto config files. config.properties:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
node.properties:
node.environment=demo
node.data-dir=/home/patrick/presto-server-330/var/data
and hive.properties:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=**************
hive.s3.aws-secret-key=***************
So... my question is: is Presto missing a worker node?
Is it possible to configure one instance as both coordinator and worker?
Where can I see more verbose logs of Presto SQL statements?
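For what it's worth, node-scheduler.include-coordinator=true in config.properties above is what allows a single instance to act as both coordinator and worker. More verbose server logging can be enabled via etc/log.properties; a minimal sketch, assuming the io.prestosql distribution linked above:
# etc/log.properties - raise Presto's own classes to DEBUG in server.log
io.prestosql=DEBUG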

Failed to bring up Cloud SQL Metastore when creating a Dataproc cluster using the preview image

I am using Spark to do some computation over some data and then push it to Hive. The Cloud Dataproc version is 1.2, with Hive 2.1 included. The MERGE command in Hive is only supported from version 2.2 onwards, so I have to use the preview image version for the Dataproc cluster. When I use version 1.2 for the Dataproc cluster, I can create the cluster without any issue, but I get the error "Failed to bring up Cloud SQL Metastore" when using the preview version.
The initialisation script is here. Has anyone ever met this problem before?
hive-metastore.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install is-enabled hive-metastore
mysql.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable mysql
insserv: warning: current start runlevel(s) (empty) of script `mysql` overrides LSB defaults (2 3 4 5).
insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `mysql' overrides LSB defaults (0 1 6).
Created symlink /etc/systemd/system/multi-user.target.wants/cloud-sql-proxy.service → /usr/lib/systemd/system/cloud-sql-proxy.service.
Cloud SQL Proxy installation succeeded
hive-metastore.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install is-enabled hive-metastore
[2018-06-06T12:43:55+0000]: Failed to bring up Cloud SQL Metastore
I believe the issue may be that your metastore was initialized from an older version of Dataproc and thus has an outdated schema.
If you still have the failed cluster (if not, please create a new one as before; you can use the --single-node option to reduce cost), then SSH to the master node and upgrade the schema:
$ gcloud compute ssh my-cluster-m
$ /usr/lib/hive/bin/schematool -dbType mysql -info
Hive distribution version: 2.3.0
Metastore schema version: 2.1.0 <-- you will need this
org.apache.hadoop.hive.metastore.HiveMetaException: Metastore schema version is
not compatible. Hive Version: 2.3.0, Database Schema Version: 2.1.0
*** schemaTool failed ***
$ /usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 2.1.0
Unfortunately this cluster cannot be returned to running state, so please delete and recreate it.
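A sketch of the delete-and-recreate step (cluster name and region are placeholders, and the Cloud SQL proxy initialization-action flags from the original setup are omitted; your exact flags may differ):
gcloud dataproc clusters delete my-cluster --region=us-central1
gcloud dataproc clusters create my-cluster --region=us-central1 --single-node --image-version=preview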
I have created this PR to make the issue more discoverable:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/pull/278

Presto Cassandra connector issue

I am trying to connect to Presto for Cassandra as below:
./presto --server localhost:7070 --catalog cassandra
When I try to execute any query on it, it shows the following error:
Error running command: Server refused connection: http://localhost:7070/v1/statement
I am new to this and have tried everything I can think of to solve it.
Could someone help me with this?
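For context only: Presto's HTTP port defaults to 8080 unless http-server.http.port was changed, so the --server value is worth double-checking, and the Cassandra catalog is usually defined in etc/catalog/cassandra.properties, roughly like this (the contact point is a placeholder):
connector.name=cassandra
cassandra.contact-points=127.0.0.1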

How to renew a Kerberos ticket in Spark YARN client mode?

I was using Spark 1.6.0 to access data on Kerberos-enabled HDFS via the API DataFrame.read.parquet($path).
My application is deployed as Spark on YARN in client mode.
By default, the Kerberos ticket expires every 24 hours. Everything works fine in the first 24 hours, but it fails to read files after 24 hours (or more, like 27 hours).
I have tried several ways to log in and renew the ticket; none of them worked (a sketch of the second approach is shown after this list):
Set spark.yarn.keytab and spark.yarn.principal in spark-defaults.conf
Set --keytab and --principal in the spark-submit command line
Start a timer in code to call UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab() every 2 hours.
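A minimal sketch of the second approach, with placeholder keytab path, application class and jar (the principal is taken from the logs below):
spark-submit \
  --master yarn \
  --deploy-mode client \
  --principal adam/cluster1@DEV.COM \
  --keytab /path/to/adam.keytab \
  --class com.example.MyApp \
  my-app.jar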
Error details are:
WARN [org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:671)] - Couldn't setup connection for adam/cluster1@DEV.COM to cdh01/192.168.1.51:8032
DEBUG [org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1632)] - PrivilegedActionException as:adam/cluster1@DEV.COM (auth:KERBEROS) cause:java.io.IOException: Couldn't setup connection for adam/cluster1@DEV.COM to cdh01/192.168.1.51:8032
ERROR [org.apache.spark.Logging$class.logError(Logging.scala:95)] - Failed to contact YARN for application application_1490607689611_0002.
java.io.IOException: Failed on local exception: java.io.IOException: Couldn't setup connection for adam/cluster1@DEV.COM to cdh01/192.168.1.51:8032; Host Details : local host is: "cdh05/192.168.1.41"; destination host is: "cdh01":8032;
The problem was solved.
It was caused by the wrong version of the Hadoop libraries.
The Spark 1.6 assembly jar bundled an old version of the Hadoop libraries, so I downloaded Spark again without the built-in Hadoop libraries and pointed it at a third-party Hadoop 2.8 installation.
Then it just worked.
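A sketch of how a Hadoop-free Spark build can be pointed at an external Hadoop installation, following the approach in Spark's "Hadoop Free" build documentation (the path is a placeholder):
# conf/spark-env.sh
export HADOOP_HOME=/opt/hadoop-2.8.0
export SPARK_DIST_CLASSPATH=$("${HADOOP_HOME}/bin/hadoop" classpath)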

Why does Spark job fail on Mesos with "hadoop: not found"?

I use Spark 1.6.1, Hadoop 2.6.4 and Mesos 0.28 on Debian 8.
While trying to submit a job via spark-submit to a Mesos cluster, a slave fails with the following in its stderr log:
I0427 22:35:39.626055 48258 fetcher.cpp:424] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/ad642fcf-9951-42ad-8f86-cc4f5a5cb408-S0\/hduser","items":[{"action":"BYP$
I0427 22:35:39.628031 48258 fetcher.cpp:379] Fetching URI 'hdfs://xxxxxxxxx:54310/sources/spark/SimpleEventCounter.jar'
I0427 22:35:39.628057 48258 fetcher.cpp:250] Fetching directly into the sandbox directory
I0427 22:35:39.628078 48258 fetcher.cpp:187] Fetching URI 'hdfs://xxxxxxx:54310/sources/spark/SimpleEventCounter.jar'
E0427 22:35:39.629243 48258 shell.hpp:93] Command 'hadoop version 2>&1' failed; this is the output:
sh: 1: hadoop: not found
Failed to fetch 'hdfs://xxxxxxx:54310/sources/spark/SimpleEventCounter.jar': Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was e$
Failed to synchronize with slave (it's probably exited)
My jar file contains the Hadoop 2.6 binaries.
The path to the Spark executor/binary is via an hdfs:// link.
My jobs don't appear in the framework tab, but they do appear in the driver with the status 'queued', and they just sit there until I shut down the spark-mesos-dispatcher.sh service.
I was seeing a very similar error and figured out that my problem was that hadoop_home wasn't set on the Mesos agent.
On each mesos-slave I added the following line to /etc/default/mesos-slave (the path may be different on your install): MESOS_hadoop_home="/path/to/my/hadoop/install/folder/"
EDIT: Hadoop has to be installed on each slave; /path/to/my/hadoop/install/folder is a local path.
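A minimal sketch of applying that on one slave (the Hadoop path is a placeholder, and the service name may differ per distribution):
# on each mesos-slave
echo 'MESOS_hadoop_home="/opt/hadoop-2.6.4"' | sudo tee -a /etc/default/mesos-slave
sudo systemctl restart mesos-slave   # or: sudo service mesos-slave restart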
