Is there any way to support session-level connector configuration for Presto?

Based on the connector configuration docs, all workers share the same connector configuration under the catalog folder, which means the same connector configuration is used for every user of a Presto cluster. For example, queries from Presto users user1 and user2 will use the same JDBC config. However, traditional RDBMS ACLs rely on the username in the JDBC configuration to enforce isolation; e.g., user1 would use jdbc:mysql://user1#host:3306 while user2 would use jdbc:mysql://user2#host:3306.
Question: are there any pointers or directions for supporting session-level connector configuration within the same Presto cluster? E.g., when user1 runs a query against the MySQL connector it picks up jdbc:mysql://user1#host:3306, and it switches to jdbc:mysql://user2#host:3306 when user2 runs a query against MySQL.
I'm open to any design input, such as using a centralized configuration management tool like Consul or etcd.

You can configure this in the MySQL connector's catalog properties file:
user-credential-name=mysql_user
password-credential-name=mysql_password
This allows the user to provide the MySQL username and password as extra credentials that are passed directly to the backend MySQL server when running a Presto query:
presto --extra-credential mysql_user=user1 --extra-credential mysql_password=secret
The credential names mysql_user and mysql_password are arbitrary and give you flexibility on how to configure it. For example, suppose you have two MySQL catalogs pointing at two different MySQL servers. If both of the servers share the same users, then you would configure both catalogs with the same credential names, allowing the same credentials to be used for both. If they are different, then you would name them differently, e.g., mysql1_user and mysql2_user, allowing the user to provide different credentials for each catalog.
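For example, a minimal sketch of two catalog properties files, assuming hypothetical catalogs named mysql1 and mysql2 pointing at different servers (the connection URLs and credential names here are illustrative, not prescribed):
# etc/catalog/mysql1.properties
connector.name=mysql
connection-url=jdbc:mysql://mysql1.example.com:3306
user-credential-name=mysql1_user
password-credential-name=mysql1_password
# etc/catalog/mysql2.properties
connector.name=mysql
connection-url=jdbc:mysql://mysql2.example.com:3306
user-credential-name=mysql2_user
password-credential-name=mysql2_password
A user could then pass credentials for each catalog separately, e.g. presto --extra-credential mysql1_user=user1 --extra-credential mysql1_password=secret1 --extra-credential mysql2_user=user1 --extra-credential mysql2_password=secret2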

This is directly supported by use of extra credentials.
See https://github.com/trinodb/trino/blob/8c1a4a10abaad91a2f4656653a3f5eb0e44aa6c1/presto-base-jdbc/src/main/java/io/prestosql/plugin/jdbc/credential/ExtraCredentialProvider.java#L26
See the following documentation tasks for more information:
https://github.com/trinodb/trino/issues/1910
https://github.com/trinodb/trino/issues/1911

If you're using Starburst Presto, there is also the option of user impersonation. See https://docs.starburstdata.com/latest/security/impersonation.html. Among others, it is currently supported for the PostgreSQL connector, but not yet for the MySQL connector.

Related

How do I create database users for Cassandra on CosmosDB?

I want to be able to create multiple users in my Cassandra app such that a user can only access specific databases. I've tried following the official DataStax docs to create users. However, Cosmos DB doesn't let me add users or even run LIST USERS; it gives a No Host Available error. I've tried altering the system_auth keyspace to change the replication strategy, as mentioned in this thread, but to no avail; it gives me the message "system_auth keyspace is not user-modifiable."
How can I use the database users feature of Cassandra on Cosmos DB?
What alternatives do I have to create logins such that a user is only able to access a specific keyspace?
Azure Cosmos DB for Apache Cassandra (the product) is not a true Cassandra cluster. Instead, Cosmos DB provides a CQL-compliant API, so you can connect to Cosmos DB using Cassandra drivers.
There are limits to what Cosmos DB's API can provide, and several Cassandra features are not supported, including (but not limited to) the following CQL commands:
CREATE ROLE
CREATE USER
LIST ROLES
LIST USERS
From what I understand, Cosmos DB uses role-based access control (RBAC) and you'll need to provision access to your DB using the Azure portal.
Cosmos DB also does not provide support for Cassandra permissions so you will not be able to grant granular permissions in the same way that you would in a traditional Cassandra cluster.
For more information, see Apache Cassandra features supported by Azure Cosmos DB. Cheers!
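As a hedged illustration of the portal/RBAC route, the same kind of assignment can be made from the Azure CLI; the resource group, account name, and principal below are hypothetical, and note that this grants control-plane access to the whole account, not per-keyspace permissions:
az role assignment create \
  --assignee user1@example.com \
  --role "Cosmos DB Account Reader Role" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/my-rg/providers/Microsoft.DocumentDB/databaseAccounts/my-cosmos-account"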

Is there a way to access the internal metastore of Azure HDInsight to run queries on Hive metastore tables?

I am trying to access internal Hive metastore tables like HIVE.SDS, HIVE.TBLS, etc.
I have an HDInsight Hadoop cluster running with the default internal metastore. From the Ambari screen, I got the advanced settings required for connections, such as javax.jdo.option.ConnectionDriverName, javax.jdo.option.ConnectionURL, and javax.jdo.option.ConnectionUserName, as well as the password.
When I try connecting to the SQL Server instance (the internal Hive metastore) from a local machine, I get a message asking me to add my IP address to the allowed list. However, since this Azure SQL server is not visible in the list of Azure SQL databases in the portal, it is not possible for me to whitelist my IP.
So I logged in to the cluster via the secure shell user (sshuser) and tried accessing the HIVE database from within the cluster using the metastore credentials provided in Ambari, connecting with sqlcmd. I am still not able to access it.
Does HDInsight prevent direct access to internal metastores? Is an external metastore the only way to move ahead? Any leads would be helpful.
Update: I created an external SQL Server instance, used it as an external metastore, and was able to access it programmatically.
No luck with the internal one yet.
There is no way to access the internal metastores of an HDInsight cluster. The internal metastores live in an internal subscription that only the product groups (PGs) can access.
If you want more control over your metastore, it is recommended to bring your own "external" metastore.
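For illustration, once you have an external Azure SQL metastore you can query the Hive schema tables (like TBLS from the question) directly with sqlcmd; the server, database, and login names below are hypothetical:
sqlcmd -S tcp:mymetastoreserver.database.windows.net,1433 -d hivemetastoredb -U metastoreuser -P '<password>' -Q "SELECT TBL_ID, TBL_NAME, TBL_TYPE FROM TBLS;"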

How can I enable replication in a Snowflake database using Terraform?

How can I enable replication in a Snowflake database using Terraform?
I am using the chanzuckerberg Snowflake provider in Terraform and was able to create the DB/schema/warehouse/tables/shares, but I am not able to find an option to enable database replication through Terraform.
Or is there a way to run an ALTER command on the Snowflake DB to enable replication using Terraform, like:
alter database ${var.var_database_name} enable replication to accounts ${var.var_account_name};
You first have to enable replication for your organization. There are two ways to do this: open a case with Snowflake Support, or use the Organizations feature. If you do not have the Organizations feature enabled, you will need to open a case with Support to enable it. If you do have it, at least one person in your organization will have the ORGADMIN role.
That person will need to follow these steps: https://docs.snowflake.com/en/user-guide/database-replication-config.html#prerequisite-enable-replication-for-your-accounts
Once the accounts are enabled for replication, you can use SQL statements (orchestrated from Terraform or elsewhere) to promote a local database to the primary in a replication group: https://docs.snowflake.com/en/user-guide/database-replication-config.html#promoting-a-local-database
Once you have created secondary databases, you can run a single line of SQL in the secondary database's account to initiate a refresh: alter database <DB_NAME> refresh;
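Putting it together, the SQL you would orchestrate (from Terraform or anywhere else) looks roughly like this, assuming a hypothetical organization myorg with accounts account1 (primary) and account2 (secondary) and a database my_db:
-- on the primary account (account1), once ORGADMIN has enabled replication for both accounts
alter database my_db enable replication to accounts myorg.account2;
-- on the secondary account (account2), create the replica once
create database my_db as replica of myorg.account1.my_db;
-- then refresh the secondary on whatever schedule you need
alter database my_db refresh;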

Common metastore in Databricks clusters

I have 3-4 clusters in my Databricks instance on the Azure cloud platform. I want to maintain a common metastore for all the clusters. Let me know if anyone has implemented this.
I recommend configuring an external Hive metastore. By default, Databricks spins up its own metastore behind the scenes, but you can create your own database (Azure SQL works, as do MySQL and Postgres) and specify it during cluster startup.
Here are detailed steps:
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore
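As a rough sketch of what that guide boils down to, the cluster's Spark config (or an init script) points the built-in Hive client at your own database; the server, database, credentials, and Hive version below are assumptions, so check them against the doc and your Databricks Runtime:
spark.sql.hive.metastore.version 2.3.7
spark.sql.hive.metastore.jars builtin
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://myserver.database.windows.net:1433;database=metastoredb
spark.hadoop.javax.jdo.option.ConnectionUserName metastoreuser
spark.hadoop.javax.jdo.option.ConnectionPassword <password>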
Things to be aware of:
Data tab in Databricks - you can choose the cluster and see different metastores.
To avoid using a SQL username and password, look at managed identities: https://learn.microsoft.com/en-us/azure/stream-analytics/sql-database-output-managed-identity
Automate external Hive metastore connections by using initialization scripts for your cluster
Permissions management on your sources. In the case of ADLS Gen2, consider using credential passthrough.

Data Migration from AWS RDS to Azure SQL Data Warehouse

I have my application's database running in AWS RDS (PostgreSQL). I need to migrate the data from AWS to Azure SQL Data Warehouse.
This is a kind of ETL process: I need to do some calculations/computations/aggregations on the data from PostgreSQL and put it in a different schema in Azure SQL Data Warehouse for reporting purposes.
Also, I need to sync the data on a regular basis without duplication.
I am new to this data migration concept, so kindly let me know the best possible ways to achieve this task.
Thanks!!!
Azure Data Factory is the option for you. It is a cloud data integration service that composes data storage, movement, and processing services into automated data pipelines.
Please find the PostgreSQL connector below.
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-onprem-postgresql-connector
For the transform part, you may have to add some custom intermediate steps to do the data massaging.
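As a hedged example of that kind of data massaging, the aggregation into a reporting schema can be done after loading with a CTAS in Azure SQL Data Warehouse; the schema, table, and column names here are made up:
CREATE TABLE reporting.daily_order_totals
WITH (DISTRIBUTION = HASH(customer_id), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT customer_id, CAST(order_ts AS DATE) AS order_date, SUM(amount) AS total_amount
FROM staging.orders
GROUP BY customer_id, CAST(order_ts AS DATE);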
Have you tried the Azure Data Factory suggestion?
Did it solve your issue?
If not, you can try using Alooma. This solution can replicate a PostgreSQL database hosted on Amazon RDS to Azure SQL Data Warehouse in near real time. (https://www.alooma.com/integrations/postgresql/)
Follow these steps to migrate from RDS to Azure SQL Data Warehouse:
Verify your host configuration
On the RDS dashboard under Parameter Groups, navigate to the group that's associated with your instance.
Verify that hot_standby and hot_standby_feedback are set to 1.
Verify that max_standby_archive_delay and max_standby_streaming_delay are greater than 0 (we recommend 30000).
If any of the parameter values need to be changed, click Edit Parameters.
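If you prefer the AWS CLI to the console for that step, the change looks roughly like this (a sketch, assuming a hypothetical custom parameter group named my-postgres-params):
aws rds modify-db-parameter-group \
  --db-parameter-group-name my-postgres-params \
  --parameters "ParameterName=hot_standby_feedback,ParameterValue=1,ApplyMethod=immediate" \
               "ParameterName=max_standby_archive_delay,ParameterValue=30000,ApplyMethod=immediate"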
Connect to Alooma
You can connect via an SSH server (https://support.alooma.com/hc/en-us/articles/214021869-Connecting-to-an-input-via-SSH) or whitelist access to Alooma's IP addresses:
52.35.19.31/32
52.88.52.130/32
52.26.47.1/32
52.24.172.83/32
Add and name your PostgreSQL input from the Plumbing screen and enter the following details:
Hostname or IP address of the PostgreSQL server (default port is 5432)
User name and Password
Database name
Choose the replication method you'd like to use for PostgreSQL database replication
For full dump/load replication, provide:
A space- or comma-separated list of the names of the tables you want to replicate.
The frequency at which you'd like to replicate your tables. The more frequent the replication, the fresher your data will be, but the more load it puts on your PostgreSQL database.
For incremental dump/load replication, provide:
A table and update-indicator-column pair for each table you want to replicate.
Don't have an update indicator column? Let us know! We can still make incremental load work for you.
Keep the mapping mode to the default of OneClick if you'd like Alooma to automatically map all PostgreSQL tables exactly to your target data warehouse. Otherwise, they'll have to be mapped manually from the Mapper screen.
