How to define the AWS Athena s3 output location using terraform when using aws_glue_catalog_database and aws_glue_catalog_table resources

How to define the AWS Athena s3 output location using terraform when using aws_glue_catalog_database and aws_glue_catalog_table resources - terraform

Summary
The terraform config below creates aws_glue_catalog_database and aws_glue_catalog_table resources, but does not define an s3 bucket output location which is necessary to use these resources in the context of Athena. I can add the s3 output location manually through the AWS console, but need to do it programmatically using terraform.
Detail
Minimal example terraform config which creates an aws glue database and table:
resource "aws_glue_catalog_database" "GlueDB" {
name = "gluedb"
}
resource "aws_glue_catalog_table" "GlueTable" {
name = "gluetable"
database_name = aws_glue_catalog_database.gluedb.name
table_type = "EXTERNAL_TABLE"
parameters = {
EXTERNAL = "TRUE"
}
storage_descriptor {
location = var.GLUE_SOURCE_S3_LOCATION
input_format = "org.apache.hadoop.mapred.TextInputFormat"
output_format = "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"
ser_de_info {
name = "jsonserde"
serialization_library = "org.openx.data.jsonserde.JsonSerDe"
parameters = {
"serialization.format" = "1"
}
}
columns {
name = "messageId"
type = "string"
comment = ""
}
}
}
The aim is to be able to access the table either via the Athena query editor (AWS console), or using the python boto3 library (boto3.client('athena')).
However, before Athena access works in either case I need to define an output location for the query. This is easy to do in the AWS console (Amazon Athena -> Query Editor -> Manage settings -> Location of query result), but I need to do this via terraform so that the entire aws infrastructure stack can be setup in one go.
There is a terraform resource called aws_athena_workgroup which has an output_location property, but it is unclear how a separate aws_athena_workgroup resource would be related to the aws_glue_catalog_database already defined (there doesn't seem to be any way to link these two resources).
This answer suggests importing the existing primary workgroup into terraform and modifying it. But what I need is a terraform implementation which sets everything up from scratch in one go.
Any suggestions on how to wire up an s3 output-location in terraform so the above glue resources can be used in the context of Athena would be greatly appreciated!

AWS Glue and Athena are two independent services. Glue doesn't have to know the Athena query output location configuration at all. It just stores the query results run in Athena.
You can simply create a new resources for aws_athena_workgroup next to Glue resources and define the result configuration bucket.
resource "aws_athena_workgroup" "example" {
name = "example"
configuration {
enforce_workgroup_configuration = true
publish_cloudwatch_metrics_enabled = true
result_configuration {
output_location = "s3://${aws_s3_bucket.example.bucket}/output/"
encryption_configuration {
encryption_option = "SSE_KMS"
kms_key_arn = aws_kms_key.example.arn
}
}
}
}

Related

Automating Permissions for Databricks SQL Tables or Views

Trying to automate the setup of Databricks SQL.
I have done it from the UI and it works, so this is a natural next step.
The one thing I am unsure about is how to automate granting of the access to SQL tables and/or views using REST. I am trying to avoid a Notebooks job.
I have seen this microsoft documentation and downloaded the specification but when I opened it with Postman, I see permissions/objectType/Object id, but the only sample I have seen there is for "queries". It just seems to be applicable for Queries and Dashboards. Can't this be done for Tables and views? There is no further documentation that I could see.
So, basically how to do something like
grant select on tablename to group using REST api without using a Notebook job. I am interested to see if I can just call a REST endpoint from our release pipeline (Azure DevOps)

As of right now, there is no REST API for setting Table ACLs. But it's available as part of the Unity Catalog that is right now in the public preview.
If you can't use Unity Catalog yet, then you still have a possibility to automate assignment of Table ACLs by using databricks_sql_permissions resource of Databricks Terraform Provider - it sets permissions by executing SQL commands on a cluster, but this is hidden from administrator.

This is an extension to Alex Ott `s answer giving some details on what I tried to make the databricks_sql_permissions Resource work for Databricks SQL as was the OP's original question. All this assumes that one does not want/can use Unity Catalog which follows a different permission model and has a different Terraform resource, namely databricks_grants Resource.
Alex`s answer refers to table ACLs which had me surprised as the OP (and myself) were looking for Databricks SQL object security and not table ACLs in the classic workspace. But from what I understand so far, it seems the two are closely interlinked and the Terraform provider addresses table ACLs in the classic workspace (i.e. non-SQL) which are mirrored to SQL objects in the SQL workspace. It follows that if you like to steer SQL permissions in Databricks SQL via Terraform, you need to enable table ACLs in classic workspace (in admin console). If you (for whatever reason) cannot enable table ACLs, it seems to me the only other option is via sql scripts in the SQL workspace with the disadvantage of having to explicitly write out grants and revokes. Potentially an alternative is to throw away all permissions before one only runs grant statements but this has other negative implications.
So here is my approach:
Enable table ACL in classic workspace (this has no implications in classic workspace if you don`t use table ACL-enabled clusters afaik)
Use azurerm_databricks_workspace resource to register Databricks Azure infrastructure
Use databricks_sql_permissions Resource to manage table ACLs and thus SQL object security
Below is a minimal example that worked for me and may inspire others. It certainly does not follow Terraform config guidance but is merely used for minimal illustration.
NOTE: Due to a Terraform issue I had to ignore changes from attribute public_network_access_enabled, see GitHub issues: "azurerm_databricks_workspace" forces replacement on public_network_access_enabled while it never existed #15222
terraform {
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "=3.0.0"
}
databricks = {
source = "databricks/databricks"
version = "=1.4.0"
}
}
backend "azurerm" {
resource_group_name = "tfstate"
storage_account_name = "tfsa"
container_name = "tfstate"
key = "terraform.tfstate"
}
}
provider "azurerm" {
features {}
}
provider "databricks" {
azure_workspace_resource_id = "/subscriptions/mysubscriptionid/resourceGroups/myresourcegroup/providers/Microsoft.Databricks/workspaces/mydatabricksworkspace"
}
resource "azurerm_databricks_workspace" "adbtf" {
customer_managed_key_enabled = false
infrastructure_encryption_enabled = false
load_balancer_backend_address_pool_id = null
location = "westeurope"
managed_resource_group_name = "databricks-rg-myresourcegroup-abcdefg12345"
managed_services_cmk_key_vault_key_id = null
name = "mydatabricksworkspace"
network_security_group_rules_required = null
public_network_access_enabled = null
resource_group_name = "myresourcegroup"
sku = "premium"
custom_parameters {
machine_learning_workspace_id = null
nat_gateway_name = "nat-gateway"
no_public_ip = false
private_subnet_name = null
private_subnet_network_security_group_association_id = null
public_ip_name = "nat-gw-public-ip"
public_subnet_name = null
public_subnet_network_security_group_association_id = null
storage_account_name = "dbstorageabcde1234"
storage_account_sku_name = "Standard_GRS"
virtual_network_id = null
vnet_address_prefix = "10.139"
}
tags = {
creator = "me"
}
lifecycle {
ignore_changes = [
public_network_access_enabled
]
}
}
data "databricks_current_user" "me" {}
resource "databricks_sql_permissions" "database_test" {
database = "test"
privilege_assignments {
principal = "myuser#mydomain.com"
privileges = ["USAGE"]
}
}
resource "databricks_sql_permissions" "table_test_student" {
database = "test"
table = "student"
privilege_assignments {
principal = "myuser#mydomain.com"
privileges = ["SELECT", "MODIFY"]
}
}
output "adb_id" {
value = azurerm_databricks_workspace.adbtf.id
}
NOTE: Serge Smertin (Terraform Databricks maintainer) mentioned in GitHub issues: [DOC] databricks_sql_permissions Resource to be deprecated ? #1215 that the databricks_sql_permissions resource is deprecated but I could not find any indication about that in the docs, only a recommendation to use another resource when leveraging Unity Catalog which I'm not doing.

terraform: database does not exist

I am fairly new to Terraform and trying to create database and schema using terraform in Snowflake.
To start with I have written a script to create a Database and Schema.
However when I run the below code. On the first run It creates the Database and then throw error on schema -- Database SNOWACC does not exist.
and when I run again it creates the schema successfully.
Seems like terraform is not getting the latest DB status in same script.
terraform {
required_providers {
snowflake = {
source = "chanzuckerberg/snowflake"
version = "0.25.36"
}
}
}
provider "snowflake" {
alias = "sys_admin"
role = "DBA"
}
resource "snowflake_database" "db" {
provider = snowflake.sys_admin
name = "SNOWACC"
comment = "Database created for snowflake monitoring through terraform"
}
resource "snowflake_schema" "schema" {
provider = snowflake.sys_admin
database = "SNOWACC"
name = "MONITOR"
comment = "Schema for snowflake monitoring created by TF"
}
I have a feeling, I will have to run DB and Schema script separately and then all object (table, view, task etc) seperately.
I am hoping If there is better way.

Terraform check if resource exists before creating it

Is there a way in Terraform to check if a resource in Google Cloud exists prior to trying to create it?
I want to check if the following resources below exist in my CircleCI CI/CD pipeline during a job. I have access to terminal commands, bash, and gcloud commands. If the resources do exist, I want to use them. If they do not exist, I want to create them. I am doing this logic in CircleCI's config.yml as steps where I have access to terminal commands and bash. My goal is to create my necessary infrastructure (resources) in GCP when they are needed, otherwise use them if they are created, without getting Terraform errors in my CI/CD builds.
If I try to create a resource that already exists, Terraform apply will result in an error saying something like, "you already own this resource," and now my CI/CD job fails.
Below is pseudo code describing the resources I am trying to get.
resource "google_artifact_registry_repository" "main" {
# this is the repo for hosting my Docker images
# it does not have a data source afaik because it is beta
}
For my google_artifact_registry_repository resource. One approach I have is to do a Terraform apply using a data source block and see if a value is returned. The problem with this is that the google_artifact_registry_repository does not have a data source block. Therefore, I must create this resource once using a resource block and every CI/CD build thereafter can rely on it being there. Is there a work-around to read that it exists?
resource "google_storage_bucket" "bucket" {
# bucket containing the folder below
}
resource "google_storage_bucket_object" "content_folder" {
# folder containing Terraform default.tfstate for my Cloud Run Service
}
For my google_storage_bucket and google_storage_bucket_object resources. If I do a Terraform apply using a data source block to see if these exist, one issue I run into is when the resources are not found, Terraform takes forever to return that status. It would be great if I could determine if a resource exists within like 10-15 seconds or something, and if not assume these resources do not exist.
data "google_storage_bucket" "bucket" {
# bucket containing the folder below
}
output bucket {
value = data.google_storage_bucket.bucket
}
When the resource exists, I can use Terraform output bucket to get that value. If it does not exist, Terraform takes too long to return a response. Any ideas on this?

Thanks to the advice of Marcin, I have a working example of how to solve my problem of checking if a resource exists in GCP using Terraform's external data sources. This is one way that works. I am sure there are other approaches.
I have a CircleCI config.yml where I have a job that uses run commands and bash. From bash, I will init/apply a Terraform script that checks if my resource exists, like so below.
data "external" "get_bucket" {
program = ["bash","gcp.sh"]
query = {
bucket_name = var.bucket_name
}
}
output "bucket" {
value = data.external.get_bucket.result.name
}
Then in my gcp.sh, I use gsutil to get my bucket if it exists.
#!/bin/bash
eval "$(jq -r '#sh "BUCKET_NAME=\(.bucket_name)"')"
bucket=$(gsutil ls gs://$BUCKET_NAME)
if [[ ${#bucket} -gt 0 ]]; then
jq -n --arg name "" '{name:"'$BUCKET_NAME'"}'
else
jq -n --arg name "" '{name:""}'
fi
Then in my CircleCI config.yml, I put it all together.
terraform init
terraform apply -auto-approve -var bucket_name=my-bucket
bucket=$(terraform output bucket)
At this point I check if the bucket name is returned and determine how to proceed based on that.

TF does not have any build in tools for checking if there are pre-existing resources, as this is not what TF is meant to do. However, you can create your own custom data source.
Using the custom data source you can program any logic you want, including checking for pre-existing resources and return that information to TF for future use.

There is a way to check if a resource already exists before creating the resource. But you should be aware of whether it exists. Using this approach, you need to know if the resource exists. If the resource does not exist, it'll give you an error.
I will demonstrate it by create/reading data from an Azure Resource Group. First, create a boolean variable azurerm_create_resource_group. You can set the value to true if you need to create the resource; otherwise, if you just want to read data from an existing resource, you can set it to false.
variable "azurerm_create_resource_group" {
type = bool
}
Next up, get data about the resource using the ternary operator supplying it to count, next do the same for creating the resource:
data "azurerm_resource_group" "rg" {
count = var.azurerm_create_resource_group == false ? 1 : 0
name = var.azurerm_resource_group
}
resource "azurerm_resource_group" "rg" {
count = var.azurerm_create_resource_group ? 1 : 0
name = var.azurerm_resource_group
location = var.azurerm_location
}
The code will create or read data from the resource group based on the value of the var.azurerm_resource_group. Next, combine the data from both the data and resource sections into a locals.
locals {
resource_group_name = element(coalescelist(data.azurerm_resource_group.rg.*.name, azurerm_resource_group.rg.*.name, [""]), 0)
location = element(coalescelist(data.azurerm_resource_group.rg.*.location, azurerm_resource_group.rg.*.location, [""]), 0)
}
Another way of doing it might be using terraformer to import the infra code.
I hope this helps.

This work for me:
Create data
data "gitlab_user" "user" {
for_each = local.users
username = each.value.user_name
}
Create resource
resource "gitlab_user" "user" {
for_each = local.users
name = each.key
username = data.gitlab_user.user[each.key].username != null ? data.gitlab_user.user[each.key].username : split("#", each.value.user_email)[0]
email = each.value.user_email
reset_password = data.gitlab_user.user[each.key].username != null ? false : true
}
P.S.
Variable
variable "users_info" {
type = list(
object(
{
name = string
user_name = string
user_email = string
access_level = string
expires_at = string
group_name = string
}
)
)
description = "List of users and their access to team's groups for newcommers"
}
Locals
locals {
users = { for user in var.users_info : user.name => user }
}

Create new resources on muti-accounts on AWS?

I am building a small tool using Terraform to generate sandbox on AWS. I will be the owner of every new sandbox and each user will be added as an IAM user with appropriate rights.
The input file looks like this: users.auto.tfvars
sandboxs_manager = "pierre-alexandre.mousset"
dev_team_members = [
{ name = "brian.davids", is_enabled = true }, { name = "tom.hanks", is_enabled = true }]
Which is going to create 2 different AWS accounts:
pierre-alexandre.mousset+brian.davids#company.com
pierre-alexandre.mousset+tom.hanks#company.com
What I am now trying to achieve, is to create a simple S3 bucket on every new aws_organizations_account generated by Terraform.
This is how I am generating AWS accounts on my Terraform:
resource "aws_organizations_account" "this" {
for_each = local.all_user_names
name = "Dev Sandbox ${each.value}"
email = "${var.manager}+sbx_${each.value}#company.com"
role_name = "Administrator"
parent_id = var.sandbox_organizational_unit_id
}
Is there a way to loop over every id generated by aws_organizations_account to create a S3 bucket on each of those newly created account?
Based on this Github issue, I would need to use muti-provider which is not yet supported by Terraform and would probably look like this before to generate my s3 buckets:
provider "aws" {
for_each = local.aws_accounts
alias = each.value.aws_account_id
assume_role {
role_arn = "arn:aws:iam::${aws_organizations_account.this.id}:role/TerraformAccessRole"
}
}
Is there any way to deal with this?

Terraform: Unable to fetch connection strings when using 'data' resource for existing cluster

I'm new to terraform, and trying to setup mongoatlas privatelink for an existing cluster. The issue here is that, terraform is able to setup the privatelink, but when I try to fetch the privatelink string, then it gives the following error
data.mongodbatlas_cluster.cluster-atlas.connection_strings[0].aws_private_link is empty map of string
I'm using data resource of terraform to fetch the details of the cluster. Though I have read on some forums that data resource are read before creating resources so for that I'm using depends_on on the data resource but still facing the issue here.
My cluster code:
data "mongodbatlas_cluster" "cluster-atlas" {
project_id = var.atlasprojectid
name = "mongoatlas-cluster-xxxxxxx"
depends_on = [time_sleep.wait_300_seconds]
}
output "atlasclusterstring" {
value = data.mongodbatlas_cluster.cluster-atlas.connection_strings
}
output "plstring" {
value = lookup(data.mongodbatlas_cluster.cluster-
atlas.connection_strings[0].aws_private_link, aws_vpc_endpoint.ptfe_service.id)
}

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string