Authorization through Apache Ranger in Spark

We have Ranger policies defined on Hive tables, and authorization works as expected when we use the Hive CLI and Beeline. But when we access those tables using spark-shell or spark-submit, the policies are not enforced.
Is there any way to set this up?
Problem Statement:
Ranger secures only the HiveServer2 (JDBC) endpoint. Spark does not go through HiveServer2; it talks directly to the metastore, so Hive Ranger policies are not evaluated. Hence, the only way to use Ranger policies is to access Hive via JDBC. Another option is HDFS or storage ACLs, which give coarse-grained control over file paths; you can use Ranger to manage HDFS ACLs as well, and in that scenario Spark will be bound by those policies.
But if I use Ranger to manage HDFS ACLs, as you mentioned, that is only coarse-grained control over files. I might have a few fine-grained use cases at the row/column level.
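In line with the "via JDBC" point above, one workaround is to have Spark read the table through the HiveServer2 JDBC endpoint instead of through the metastore, so that Ranger's Hive policies are enforced. Below is a rough PySpark sketch, not a drop-in solution: the HS2 host, table, and user are placeholders, the Hive JDBC driver must be on Spark's classpath, and Spark's generic JDBC source has known quirks with the Hive driver (e.g., column names coming back prefixed with the table name).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-via-jdbc").getOrCreate()

# Reading through HiveServer2 means Ranger's Hive plugin authorizes the query,
# unlike spark.table(...), which bypasses HS2 and reads the metastore and files directly.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:hive2://hs2.example.com:10000/default")  # placeholder HS2 endpoint
      .option("driver", "org.apache.hive.jdbc.HiveDriver")
      .option("dbtable", "customers")                               # placeholder table
      .option("user", "analyst1")                                   # user the Ranger policy applies to
      .load())
df.show()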

Also check the Ranger audits in the Ranger UI: look for denied results on those tables and verify which user the requests came in as.

Related

Databricks and Informatica Delta Lake connector spark configuration

I am working with Informatica Data Integrator and trying to set up a connection with a Databricks cluster. So far everything seems to work fine, but one issue is that under Spark configuration we had to put the SAS key for the ADLS gen 2 storage account.
The reason for this is that when Informatica tries to write to Databricks it first has to write that data into a folder in ADLS gen 2 and then Databricks essentially takes that file and writes it as a Delta Lake table.
Now one issue is that the field where we put the Spark config contains the full SAS value (URL plus token and password). That is not really a good thing unless we make only one person an admin.
Has anyone worked with Informatica and Databricks? Is it possible to put the Spark config in a file and have the Informatica connector read that file? Or is it possible to add the SAS key to the Spark cluster (the interactive cluster we use) and have that cluster read the info from that file?
Thank you for any help with this.
You really don't need to put the SAS key value into the Spark configuration. Instead, store that value in an Azure Key Vault-backed secret scope (on Azure) or a Databricks-backed secret scope (on other clouds), and then refer to it from the Spark configuration using the syntax {{secrets/<secret-scope-name>/<secret-key>}} (see the docs). That way the SAS key value is read at cluster start and won't be available to users who have access to the cluster UI.
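For illustration, here is a minimal notebook-side sketch of the same idea using dbutils.secrets instead of hard-coding the token. The scope name, key name, and storage account are placeholders, and it assumes a Databricks notebook where spark and dbutils are already defined; the {{secrets/...}} syntax in the cluster's Spark config achieves the same thing at cluster start, which is usually what an external tool like Informatica needs.
# Read the SAS token from a secret scope at runtime (scope/key names are placeholders).
sas_token = dbutils.secrets.get(scope="informatica-scope", key="adls-sas-token")

# Point ABFS at the staging storage account using that SAS token
# ("mystorageacct" is a placeholder account name).
spark.conf.set("fs.azure.account.auth.type.mystorageacct.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.mystorageacct.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.mystorageacct.dfs.core.windows.net", sas_token)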

Is there a way to access internal metastore of Azure HDInsight to fire queries on Hive metastore tables?

I am trying to access the internal Hive metastore tables like HIVE.SDS, HIVE.TBLS etc.
I have an HDInsight Hadoop cluster running with the default internal metastore. From the Ambari screen, I got the advanced settings required for connections: javax.jdo.option.ConnectionDriverName, javax.jdo.option.ConnectionURL, and javax.jdo.option.ConnectionUserName, as well as the password.
When I try connecting to the SQL Server instance (the internal Hive metastore) from a local machine, I get a message asking me to add my IP address to the allowed list. However, since this Azure SQL server is not visible in the list of Azure SQL databases in the portal, it is not possible for me to whitelist my IP.
So I tried logging in to the cluster via the secure shell user (SSHUSER) and accessing the HIVE database from within the cluster using the metastore credentials provided in Ambari. I am still not able to access it. I am using sqlcmd to connect to SQL Server.
Does HDInsight prevent direct access to internal Metastores? Is External Metastore the only way to move ahead? Any leads would be helpful.
Update: I created an external SQL Server instance, used it as an external metastore, and was able to access it programmatically.
No luck with the Internal one yet.
There is no way to access the internal metastores of an HDInsight cluster. The internal metastores live in an internal Microsoft subscription that only the product groups (PGs) are able to access.
If you want more control over your metastore, it is recommended to bring your own "external" metastore.
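As a follow-up to the update above: once you have your own external metastore database, the standard Hive metastore schema tables (DBS, TBLS, SDS, ...) can be queried directly. A minimal Python sketch with pyodbc, assuming a placeholder Azure SQL server, database name, and credentials:
import pyodbc

# Placeholder server, database, and credentials for the *external* metastore.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myexternalsql.database.windows.net,1433;"
    "DATABASE=hivemetastore;UID=hiveuser;PWD=<password>"
)

# List Hive tables with their databases and storage locations
# (DBS, TBLS, and SDS are standard Hive metastore schema tables).
cur = conn.cursor()
cur.execute("""
    SELECT d.NAME AS db_name, t.TBL_NAME, s.LOCATION
    FROM TBLS t
    JOIN DBS d ON t.DB_ID = d.DB_ID
    JOIN SDS s ON t.SD_ID = s.SD_ID
""")
for db_name, tbl_name, location in cur.fetchall():
    print(db_name, tbl_name, location)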

Common metadata in databricks cluster

I have 3-4 clusters in my Databricks instance on the Azure cloud platform. I want to maintain a common metastore for all the clusters. Let me know if anyone has implemented this.
I recommend configuring an external Hive metastore. By default, Databricks spins up its own metastore behind the scenes, but you can create your own database (Azure SQL works, as do MySQL and Postgres) and specify it during cluster startup.
Here are detailed steps:
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore
Things to be aware of:
The Data tab in Databricks: you can choose the cluster and see the different metastores.
To avoid using a SQL username and password, look at Managed Identities: https://learn.microsoft.com/en-us/azure/stream-analytics/sql-database-output-managed-identity
Automate the external Hive metastore connection by using initialization scripts for your cluster (see the configuration sketch below).
Permissions management on your sources: in the case of ADLS Gen2, consider using credential passthrough.
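For reference, the cluster-level Spark configuration for an external Hive metastore typically looks something like the sketch below. Everything here is a placeholder (server, database, secret scope and keys), and the right spark.sql.hive.metastore.version and jars values depend on your Hive and Databricks Runtime versions, as described in the linked doc. Applying the same values to every cluster, e.g. via an init script or cluster policy, is what gives all clusters a common metastore.
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://myexternalsql.database.windows.net:1433;database=hivemetastore
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionUserName {{secrets/metastore-scope/hive-user}}
spark.hadoop.javax.jdo.option.ConnectionPassword {{secrets/metastore-scope/hive-password}}
spark.sql.hive.metastore.version 2.3.7
spark.sql.hive.metastore.jars builtin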

Any way to support session-level connector configuration for Presto

Based on the connector configuration docs, all workers share the same connector configuration under the catalog folder, which means the same connector configuration is used for every user of the Presto cluster. E.g., queries from Presto users user1 and user2 will use the same JDBC config. However, traditional RDBMS ACLs rely on the username in the JDBC configuration to provide isolation; e.g., user1 would use jdbc:mysql://user1@host:3306 while user2 would use jdbc:mysql://user2@host:3306.
Question: are there any pointers or directions for supporting session-level connector configuration within the same Presto cluster? E.g., when user1 runs a query against the MySQL connector it picks up jdbc:mysql://user1@host:3306, and it switches to jdbc:mysql://user2@host:3306 when user2 runs a query against MySQL.
I'm open to any design input, such as using a centralized config-management tool like Consul or etcd.
You can configure this in the MySQL connector's catalog properties file:
user-credential-name=mysql_user
password-credential-name=mysql_password
This allows the user to provide the MySQL username and password as extra credentials that are passed directly to the backend MySQL server when running a Presto query:
presto --extra-credential mysql_user=user1 --extra-credential mysql_password=secret
The credential names mysql_user and mysql_password are arbitrary and give you flexibility on how to configure it. For example, suppose you have two MySQL catalogs pointing at two different MySQL servers. If both of the servers share the same users, then you would configure both catalogs with the same credential names, allowing the same credentials to be used for both. If they are different, then you would name them differently, e.g., mysql1_user and mysql2_user, allowing the user to provide different credentials for each catalog.
This is directly supported by use of extra credentials.
See https://github.com/trinodb/trino/blob/8c1a4a10abaad91a2f4656653a3f5eb0e44aa6c1/presto-base-jdbc/src/main/java/io/prestosql/plugin/jdbc/credential/ExtraCredentialProvider.java#L26
See documentation tasks for more information:
https://github.com/trinodb/trino/issues/1910
https://github.com/trinodb/trino/issues/1911
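If clients connect programmatically rather than through the CLI, the same extra credentials can be passed from the client. A rough sketch using the Trino Python client (the successor to the PrestoSQL client), assuming a client version that supports the extra_credential argument; the host, catalog, schema, table, and credential values are placeholders, and in a real deployment this should go over HTTPS since credentials travel in request headers:
import trino.dbapi

# Extra credentials are forwarded to the coordinator and matched against the
# user-credential-name / password-credential-name keys configured in the catalog.
conn = trino.dbapi.connect(
    host="presto.example.com",   # placeholder coordinator host
    port=8080,
    user="user1",
    catalog="mysql",
    schema="mydb",
    extra_credential=[("mysql_user", "user1"), ("mysql_password", "secret")],
)
cur = conn.cursor()
cur.execute("SELECT * FROM mytable LIMIT 10")
print(cur.fetchall())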
In case you're using Starburst Presto, there is also the option of user impersonation; see https://docs.starburstdata.com/latest/security/impersonation.html. Among others, it is currently supported for the PostgreSQL connector, but not yet for the MySQL connector.

How to access a table in a Hive cluster located in HDInsight from a local Spark server built in IntelliJ

I am not able to access and read data from a Hive table located in HDInsight from my local instance, where the application is built with IntelliJ and Maven.
Could someone please tell me the prerequisites for the scenario where we need to write data from Spark to Hive, but Hive is located on HDInsight and Spark runs on a local native instance?
Note: I don't have a Spark cluster on HDInsight; I only have a Hive cluster on HDInsight.
Please share your comments.
Please add the cluster's hive-site.xml to your resources folder. Also, make sure the required network ports are open from your local network. Please refer to the link below.
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
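To make the prerequisites concrete: the key setting hive-site.xml supplies is the metastore Thrift URI, and actually reading or writing data also requires access to the cluster's underlying storage. A minimal PySpark sketch (shown in Python for brevity, although the question uses a Scala/Maven project) with a placeholder head-node host; equivalently, keeping the cluster's hive-site.xml on the classpath as suggested above supplies the same settings:
from pyspark.sql import SparkSession

# Placeholder metastore URI; in practice it comes from the cluster's hive-site.xml,
# and port 9083 on the head node must be reachable from the local machine.
spark = (SparkSession.builder
         .appName("local-spark-to-hdinsight-hive")
         .config("hive.metastore.uris", "thrift://hn0-mycluster.internal.cloudapp.net:9083")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
# Placeholder table; reading the data also requires access to the cluster's storage account.
spark.table("default.mytable").show()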
