Neo4j 4.4.11 Enterprise - Causal Cluster Deployment using Terraform - Azure

I followed the instructions from here: https://neo4j.com/docs/operations-manual/4.4/kubernetes/quickstart-cluster/server-setup/ and deployed a cluster of three core members using Terraform.
Helm charts used: https://github.com/neo4j/helm-charts/releases/tag/4.4.10
Neo4j version used: Neo4j 4.4.11 Enterprise
The code structure is as follows:
module/neo4j:
- main.tf
- variables.tf
- core-1/main.tf
- core-1/variables.tf
- core-1/core-1.values.yaml
- core-2/main.tf
- core-2/variables.tf
- core-2/core-2.values.yaml
- core-3/main.tf
- core-3/variables.tf
- core-3/core-3.values.yaml
So the root main.tf instantiates a module for each core. Nothing special, nothing fancy.
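For context, the root main.tf wires the three modules together roughly like this (a sketch; the module inputs shown are illustrative):

module "neo4j-cluster-core-1" {
  source    = "./core-1"
  namespace = var.namespace
  password  = var.neo4j_password
  # ... remaining chart inputs (image, memory, cpu, share settings)
}

module "neo4j-cluster-core-2" {
  source = "./core-2"
  # ... same inputs as core-1
}

module "neo4j-cluster-core-3" {
  source = "./core-3"
  # ... same inputs as core-1
}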
The helm deployment is as follows:
resource "helm_release" "neo4j-core-1" {
name = "neo4j-core-1"
chart = "https://github.com/neo4j/helm-charts/releases/download/${var.chart_version}/neo4j-cluster-core-${var.chart_version}.tgz"
namespace = var.namespace
wait = false
values = [
templatefile("${path.module}/core-1.values.yaml", {
share_secret = var.share_secret_name
share_name = var.share_name
share_dir = var.share_dir
image_name = var.image
image_version = var.image_version
})
]
timeout = 600
force_update = true
reset_values = true
set {
name = "neo4j.name"
value = "neo4j-cluster"
}
set_sensitive {
name = "neo4j.password"
value = var.password
}
set {
name = "dbms.mode"
value = "CORE"
}
# backup configuration
set {
name = "dbms.backup.enabled"
value = true
}
set {
name = "neo4j.resources.memory" # sets both requests and limit
value = var.memory
}
set {
name = "neo4j.resources.cpu" # sets both requests and limit
value = var.cpu
}
set {
name = "dbms.memory.heap.initial_size"
value = var.dbms_memory
}
set {
name = "dbms.memory.heap.max_size"
value = var.dbms_memory
}
set {
name = "dbms.memory.pagecache.size"
value = var.dbms_memory
}
set {
name = "causal_clustering.minimum_core_cluster_size_at_formation"
value = 3
}
set {
name = "causal_clustering.minimum_core_cluster_size_at_runtime"
value = 3
}
set {
name = "causal_clustering.discovery_type"
value = "K8S"
}
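  # Hedged note: with discovery_type "K8S", Neo4j 4.x also expects
  # causal_clustering.kubernetes.label_selector and
  # causal_clustering.kubernetes.service_port_name to be set; here the
  # neo4j-cluster-core chart is assumed to template those itself.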
dynamic "set" {
for_each = local.nodes
content {
name = "nodeSelector.${set.key}"
value = set.value
}
}
}
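The local.nodes map referenced by the dynamic block is just a map of node selector labels, e.g. (keys and values are illustrative):

locals {
  nodes = {
    agentpool = "neo4jpool"
  }
}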
The problem I am facing is: the deployment only succeeds about 1 out of 10 times. Whenever it fails, it is due to a timeout of the Terraform helm_release for one or two core members, stating: Secret "neo4j-cluster-auth" exists.
Looking into the log of the one (or two) members already deployed, startup failed because the cluster is missing members. (initialDelaySeconds has been configured for each core member and was increased for testing as well.)
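For reference, the probe delay was raised via the chart values along these lines (a sketch; the exact probe keys in the 4.4 chart values are an assumption):

set {
  name  = "readinessProbe.initialDelaySeconds"
  value = "120"
}

The log of one of the already-deployed pods: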
2022-11-17 08:59:22.738+0000 ERROR Failed to start Neo4j on 0.0.0.0:7474.
java.lang.RuntimeException: Error starting Neo4j database server at /var/lib/neo4j/data/databases
at org.neo4j.graphdb.facade.DatabaseManagementServiceFactory.startDatabaseServer(DatabaseManagementServiceFactory.java:227) ~[neo4j-4.4.11.jar:4.4.11]
at org.neo4j.graphdb.facade.DatabaseManagementServiceFactory.build(DatabaseManagementServiceFactory.java:180) ~[neo4j-4.4.11.jar:4.4.11]
at com.neo4j.causalclustering.core.CoreGraphDatabase.createManagementService(CoreGraphDatabase.java:38) ~[neo4j-causal-clustering-4.4.11.jar:4.4.11]
at com.neo4j.causalclustering.core.CoreGraphDatabase.<init>(CoreGraphDatabase.java:30) ~[neo4j-causal-clustering-4.4.11.jar:4.4.11]
at com.neo4j.server.enterprise.EnterpriseManagementServiceFactory.createManagementService(EnterpriseManagementServiceFactory.java:34) ~[neo4j-enterprise-4.4.11.jar:4.4.11]
at com.neo4j.server.enterprise.EnterpriseBootstrapper.createNeo(EnterpriseBootstrapper.java:20) ~[neo4j-enterprise-4.4.11.jar:4.4.11]
at org.neo4j.server.NeoBootstrapper.start(NeoBootstrapper.java:142) [neo4j-4.4.11.jar:4.4.11]
at org.neo4j.server.NeoBootstrapper.start(NeoBootstrapper.java:95) [neo4j-4.4.11.jar:4.4.11]
at com.neo4j.server.enterprise.EnterpriseEntryPoint.main(EnterpriseEntryPoint.java:24) [neo4j-enterprise-4.4.11.jar:4.4.11]
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'com.neo4j.dbms.ClusteredDbmsReconcilerModule#5c2ae7d7' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join or bootstrap a raft group with id RaftGroupId{00000000} and members RaftMembersSnapshot{raftGroupId=Not yet published, raftMembersSnapshot={ServerId{c72f54d8}=Published as : RaftMemberId{c72f54d8}}} in time. Please restart the cluster. Clue: not enough cores found".
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:463) ~[neo4j-common-4.4.11.jar:4.4.11]
at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:110) ~[neo4j-common-4.4.11.jar:4.4.11]
at org.neo4j.graphdb.facade.DatabaseManagementServiceFactory.startDatabaseServer(DatabaseManagementServiceFactory.java:218) ~[neo4j-4.4.11.jar:4.4.11]
... 8 more
I tried different settings for the following two config parameters:
causal_clustering.discovery_type
causal_clustering.initial_discovery_members
First, the default discovery_type=K8S, which omits any initial_discovery_members setting.
Second, discovery_type=LIST with initial_discovery_members defined by member name and port 5000.
Both settings led to successful clustering only about 1 in 10 times.
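For reference, the LIST combination was expressed roughly like this (the service DNS names are illustrative):

set {
  name  = "causal_clustering.discovery_type"
  value = "LIST"
}
set {
  name  = "causal_clustering.initial_discovery_members"
  value = "neo4j-core-1.neo4j.svc.cluster.local:5000,neo4j-core-2.neo4j.svc.cluster.local:5000,neo4j-core-3.neo4j.svc.cluster.local:5000"
}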
As the cluster members search for each other while being deployed, another thing I tried was configuring the Terraform dependencies so that two of the cluster members are built with wait = false and the third member gets a depends_on:
module "neo4j-cluster-core-3" { depends_on = [module.neo4j-cluster-core-1, module.neo4j-cluster-core-2]

Related

PagerDuty Terraform API Limitations

Just inquiring if anyone's aware of any permission limitations with the PagerDuty Terraform provider? With a base role of Observer in PagerDuty, it appears as though certain objects (which my user created) can be deleted via the GUI but not via Terraform, even though I'm using the same user account. A PagerDuty Extension is an example of an object where I'm hitting this issue.
The same test case works as expected if I try it with a user with a base role of Manager, though. Here's a quick Terraform file I threw together to verify this test case:
resource "pagerduty_schedule" "schedule" {
name = "terraform-test-schedule"
time_zone = "America/Denver"
teams = ["PRDBAEK"]
layer {
name = "weekly"
start = "2020-02-05T09:00:00-06:00"
rotation_virtual_start = "2020-02-05T09:00:00-06:00"
rotation_turn_length_seconds = 604800
users = ["PN94M6Q"]
}
}
resource "pagerduty_escalation_policy" "escalation_policy" {
name = "terraform-test-ep"
description = "terraform-test-ep"
num_loops = 0
teams = ["PRDBAEK"]
rule {
escalation_delay_in_minutes = 10
target {
type = "schedule_reference"
id = pagerduty_schedule.schedule.id
}
}
}
resource "pagerduty_service" "event" {
name = "terraform-test-service"
description = "terraform-test-service"
alert_creation = "create_alerts_and_incidents"
escalation_policy = pagerduty_escalation_policy.escalation_policy.id
incident_urgency_rule {
type = "constant"
urgency = "severity_based"
}
alert_grouping_parameters {
type = "intelligent"
config {
fields = []
timeout =0
}
}
auto_resolve_timeout = "null"
acknowledgement_timeout = "null"
}
resource "pagerduty_extension" "test_extension" {
name = "terraform-test-extension"
extension_schema = data.pagerduty_extension_schema.generic_v2_webhook.id
endpoint_url = https://fakeurl.com
extension_objects = [
pagerduty_service.event.id
]
config = jsonencode({})
}
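(The config also references an extension schema data source that isn't shown above; it would look something like this, where the schema name string is an assumption:)

data "pagerduty_extension_schema" "generic_v2_webhook" {
  name = "Generic V2 Webhook"
}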
All objects can be created successfully. However, when testing terraform destroy with an account whose base role is Observer, I get the following error; it can't delete the extension.
Error: DELETE API call to https://api.pagerduty.com/extensions/P53423F failed 403 Forbidden. Code: 2010, Errors: <nil>, Message: Access Denied
But using that same account, I can delete that extension in the GUI with no issues.

Terraform outputs to another module when using a for_each

I'm having trouble referencing the outputs of a module as an input to another module. I have a module that creates target groups using a for_each, and another module that also uses a for_each to create listener rules.
I would like to use the output from the target group module as the input for the target group in the listener rule module, while being able to specify which target group attaches to which listener rule.
I have tried many different suggestions with numerous variations for outputs but have not been successful.
In my main, I have these two modules created for the target groups and the listener rules:
Target Groups:
module "aws_lb_target_group_ui" {
source = "../../terraform/modules/aws_lb_target_group"
for_each = var.target_group_values
target_group_name = each.value["name"]
target_group_target_type = each.value["target_type"]
target_group_port = each.value["port"]
target_group_protocol = each.value["protocol"]
target_group_slow_start = each.value["slow_start"]
/* Health Check */
target_group_health_check_path = each.value["healthcheckpath"]
target_group_health_check_port = each.value["healthcheckport"]
target_group_health_check_enabled = each.value["healthcheckenabled"]
target_group_health_check_healthy_threshold = each.value["healthy_threshold"]
target_group_health_check_interval = each.value["interval"]
target_group_health_check_timeout = each.value["timeout"]
target_group_tags = {
Environment = each.value["Environment"]
Component = each.value["Component"]
Application = each.value["Application"]
TechContact = each.value["TechContact"]
}
}
Listener Rule:
module "aws_alb_listener_rule_ui" {
depends_on = [
module.aws_lb_target_group_ui,
module.aws_alb_listener_ui_https
]
source = "../../terraform/modules/aws_lb_listener_rule_https"
listener_rule = module.aws_alb_listener_ui_https.ui_lb_listener
for_each = var.ui_https_listener_rules
listener_rule_host_header = each.value["values"]
listener_rule_target_group = each.value["target_group"]
listener_rule_action_type = var.ui_https_listener_rule_action_type
}
The resources themselves are very generic with variables but I can post those if it's helpful.
In my tfvars I have this:
Target Groups:
I have removed most of the variables, but this should give an idea of how it replicates through my config via the tfvars.
target_group_values = {
  Application1 = {
    "name" = "TargetGroup1",
  },
  Application2 = {
    "name" = "TargetGroup2",
  },
}
Listener Rules:
ui_https_listener_rule_action_type = "forward"
ui_https_listener_rules = {
  Application1 = {
    values       = ["Host_Header_1"]
    target_group = "Manual_ARN_Entry"
  },
  Application2 = {
    values       = ["Host_Header_2"]
    target_group = "Manual_ARN_Entry"
  },
}
If I run it in this format, with the ARN manually entered from an existing target group, it runs without issue. I would like to use the target groups I am creating, though, and feed their outputs into the code instead of the manual entry.
What I was trying to accomplish was to take the output and update the "listener_rule_target_group" value with it, while being able to identify one specific target group, since target groups and rules are 1:1.
I have not posted my output attempts, as none of them have been successful yet; I was mostly looking for help on how to approach this setup.
Update:
I was able to resolve this.
I moved "values" and "target_group_values" variables into the "target_group_values" variable and deleted the "ui_https_listener_rules" variable.
This allowed me to use one variable across both modules for my for_each. From there I was able to use the below to resolve the issue.
listener_rule_target_group = module.aws_lb_target_group_ui[each.key].arn
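For this to work, the target group module must expose the ARN as an output; a minimal sketch (the internal resource address aws_lb_target_group.this is an assumption about the module's contents):

output "arn" {
  value = aws_lb_target_group.this.arn
}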

Terraform fires Cycle error when applying

I am trying to build a Galera cluster using Terraform. To do that I need to render the Galera config with the nodes' IPs, so I use a file template.
When applying, Terraform fires an error:
Error: Cycle: data.template_file.galera_node_config, hcloud_server.galera_node
It seems there is a circular reference when applying, because the servers are not created before the data template is used.
How can I circumvent this?
Thanks
galera_node.tf
data "template_file" "galera_node_config" {
template = file("sys/etc/mysql/mariadb.conf/galera.cnf")
vars = {
galera_node0 = hcloud_server.galera_node[0].ipv4_address
galera_node1 = hcloud_server.galera_node[1].ipv4_address
galera_node2 = hcloud_server.galera_node[2].ipv4_address
curnode_ip = hcloud_server.galera_node[count.index].ipv4_address
curnode = hcloud_server.galera_node[count.index].id
}
}
resource "hcloud_server" "galera_node" {
count = var.galera_nodes
name = "galera-${count.index}"
image = var.os_type
server_type = var.server_type
location = var.location
ssh_keys = [hcloud_ssh_key.default.id]
labels = {
type = "cluster"
}
user_data = file("galera_cluster.sh")
provisioner "file" {
content = data.template_file.galera_node_config.rendered
destination = "/tmp/galera_cnf"
connection {
type = "ssh"
user = "root"
host = self.ipv4_address
private_key = file("~/.ssh/id_rsa")
}
}
}
The problem here is that you have multiple nodes that all depend on each other, so there is no valid order in which Terraform can create them: each one would have to be created before any of the others.
To address this will require a different approach. There are a few different options for this, but the one that seems closest to what you were already trying is to use the special resource type null_resource to factor out the provisioning into a separate resource that Terraform can work on only after all of the hcloud_server instances are ready.
Note also that the template_file data source is deprecated in favor of the templatefile function, so this is a good opportunity to simplify the configuration by using the function instead.
Both of those changes together lead to this:
resource "hcloud_server" "galera_node" {
count = var.galera_nodes
name = "galera-${count.index}"
image = var.os_type
server_type = var.server_type
location = var.location
ssh_keys = [hcloud_ssh_key.default.id]
labels = {
type = "cluster"
}
user_data = file("galera_cluster.sh")
}
resource "null_resource" "galera_config" {
count = length(hcloud_server.galera_node)
triggers = {
config_file = templatefile("${path.module}/sys/etc/mysql/mariadb.conf/galera.cnf", {
all_addresses = hcloud_server.galera_node[*].ipv4_address
this_address = hcloud_server.galera_node[count.index].ipv4_address
this_id = hcloud_server.galera_node[count.index].id
})
}
provisioner "file" {
content = self.triggers.config_file
destination = "/tmp/galera_cnf"
connection {
type = "ssh"
user = "root"
host = hcloud_server.galera_node[count.index].ipv4_address
private_key = file("~/.ssh/id_rsa")
}
}
}
The triggers argument above serves to tell Terraform that it must re-run the provisioner each time the configuration file changes in any way, which could for example be because you've added a new node: all of the existing nodes would then be reprovisioned to include that additional node in their configurations.
Provisioners are considered a last resort in the Terraform documentation, but in this particular case the alternatives would likely be considerably more complicated. A typical non-provisioner answer to this would be to use a service discovery system where each node can register itself on startup and then discover the other nodes, for example with HashiCorp Consul's service catalog. But unless you have lots of similar use-cases in your infrastructure which could all share the Consul cluster, having to run another service is likely an unreasonable cost in comparison to just using a provisioner.
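For completeness, the galera.cnf template consumed by templatefile above can then interpolate the full member list itself; a sketch (the wsrep directives and values shown are illustrative, not taken from the question):

# sys/etc/mysql/mariadb.conf/galera.cnf
[galera]
wsrep_on              = ON
wsrep_cluster_address = "gcomm://${join(",", all_addresses)}"
wsrep_node_address    = "${this_address}"
wsrep_node_name       = "galera-${this_id}"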
The issue is that you use data.template_file.galera_node_config inside your resource "hcloud_server" "galera_node", while the data.template_file itself references hcloud_server.galera_node, hence the cycle.
To avoid this problem:
Remove provisioner "file" from your hcloud_server.galera_node
Move this provisioner "file" to a new null_resource, e.g. like this:
resource "null_resource" template_upload {
count = var.galera_nodes
provisioner "file" {
content = data.template_file.galera_node_config.rendered
destination = "/tmp/galera_cnf"
connection {
type = "ssh"
user = "root"
host = hcloud_server.galera_nodes[count.index].ipv4_address
private_key = file("~/.ssh/id_rsa")
}
depends_on = [hcloud_server.galera_node]
}

GCP Terraform: Setting port for backend service

Reading through the docs here:
https://www.terraform.io/docs/providers/google/r/compute_backend_service.html
We can define a backend service:
resource "google_compute_backend_service" "kubernetes-nginx-prod" {
name = "kubernetes-nginx-prod"
health_checks = [google_compute_health_check.kubernetes-nginx-prod-healthcheck.self_link]
backend {
group = replace(google_container_node_pool.pool-1.instance_group_urls[0], "instanceGroupManagers", "instanceGroups")
# TODO missing port 31443
}
}
It seems we are unable to set the backend service port via the Terraform settings.
Recreating the backend service without this setting actually leads to downtime for us, and the port then has to be set manually.
You need to reference the port name that you gave in the instance group, e.g.:
resource "google_compute_backend_service" "test" {
name = "test-service"
port_name = "test-port"
protocol = "HTTP"
timeout_sec = 5
health_checks = []
backend {
group = "${google_compute_instance_group.test-ig.self_link}"
}
}
resource "google_compute_instance_group" "test-ig" {
name = "test-ig"
instances = []
named_port {
name = "test-port"
port = "${var.app_port}"
}
zone = "${var.zone}"
}
I resolved this by using a Terraform data source to extract the instance group data from my google_container_node_pool, and then attaching a google_compute_instance_group_named_port resource to it.
NOTE: my google_container_node_pool spans 3 zones (a, b, c); the code below shows the solution for zone c only.
# extract google_compute_instance_group from google_container_node_pool
data "google_compute_instance_group" "instance_group_zone_c" {
  # .2 refers to the last of the 3 instance groups in my node pool (zone c)
  name    = regex("([^/]+)/?$", "${google_container_node_pool.k8s_node_pool.instance_group_urls.2}").0
  zone    = regex("${var.region}-?[abc]", "${google_container_node_pool.k8s_node_pool.instance_group_urls.2}")
  project = var.project_id
}

# define a named port to attach to the google_compute_instance_group in my node pool
resource "google_compute_instance_group_named_port" "named_port_zone_c" {
  group = data.google_compute_instance_group.instance_group_zone_c.id
  zone  = data.google_compute_instance_group.instance_group_zone_c.zone
  name  = "port31090"
  port  = 31090
}
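With the named port in place, the backend service can then reference it by name; a sketch (reusing the health check from the earlier example is an assumption):

resource "google_compute_backend_service" "kubernetes-nginx-prod" {
  name          = "kubernetes-nginx-prod"
  port_name     = "port31090"
  health_checks = [google_compute_health_check.kubernetes-nginx-prod-healthcheck.self_link]
  backend {
    group = data.google_compute_instance_group.instance_group_zone_c.self_link
  }
}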

How to conditionally populate an argument value in Terraform?

I am writing a Terraform script to spin up resources in Google Cloud Platform.
Some resources require one argument only if another one is set. How can I populate an argument only if another one is populated (or under any similar condition)?
For example:
resource "google_compute_router" "compute_router" {
name = "my-router"
network = "${google_compute_network.foobar.name}"
bgp {
asn = 64514
advertise_mode = "CUSTOM"
advertised_groups = ["ALL_SUBNETS"]
advertised_ip_ranges {
range = "1.2.3.4"
}
advertised_ip_ranges {
range = "6.7.0.0/16"
}
}
}
In the above resource (google_compute_router), the description of both advertised_groups and advertised_ip_ranges says: "This field can only be populated if advertise_mode is CUSTOM and is advertised to all peers of the router."
Now, if I keep the value of advertise_mode as DEFAULT, my code looks something like this:
resource "google_compute_router" "compute_router" {
name = "my-router"
network = "${google_compute_network.foobar.name}"
bgp {
asn = 64514
#Changin only the value below
advertise_mode = "DEFAULT"
advertised_groups = ["ALL_SUBNETS"]
advertised_ip_ranges {
range = "1.2.3.4"
}
advertised_ip_ranges {
range = "6.7.0.0/16"
}
}
}
The above script, however, fails on running with the following error:
* google_compute_router.compute_router_default: Error creating Router: googleapi: Error 400: Invalid value for field 'resource.bgp.advertiseMode': 'DEFAULT'. Router cannot have a custom advertisement configuration in default mode., invalid
As a workaround, I have created two resources with different names doing almost the same thing. The script looks something like this:
resource "google_compute_router" "compute_router_default" {
count = "${var.advertise_mode == "DEFAULT" ? 1 : 0}"
name = "${var.router_name}"
region = "${var.region}"
network = "${var.network_name}"
bgp {
asn = "${var.asn}"
advertise_mode = "${var.advertise_mode}"
#Removed some codes from here
}
}
resource "google_compute_router" "compute_router_custom" {
count = "${var.advertise_mode == "CUSTOM" ? 1 : 0}"
name = "${var.router_name}"
region = "${var.region}"
network = "${var.network_name}"
bgp {
asn = "${var.asn}"
advertise_mode = "${var.advertise_mode}"
advertised_groups = ["${var.advertised_groups}"]
advertised_ip_ranges {
range = "${var.advertised_ip_range}"
description = "${var.advertised_ip_description}"
}
}
}
The above script works fine, but it involves a lot of code repetition and feels like a hack. For two dependent attribute options it is manageable; with, say, five options, the repetition for such a small thing would be excessive.
Is there a better way to do what I am trying to achieve?
This is pretty much what you are restricted to in Terraform < 0.12. Some resources allow you to use an empty string to omit basic values; the provider interprets this as a null value and doesn't pass it to the API endpoint, so the API won't complain about it not being set properly. From my brief experience with the GCP provider, however, this is not the case for most things there.
Terraform 0.12 introduces nullable arguments, which would allow you to set these conditionally with something like the following:
variable "advertise_mode" {}
resource "google_compute_router" "compute_router" {
name = "my-router"
network = "${google_compute_network.foobar.name}"
bgp {
asn = 64514
advertise_mode = "${var.advertise_mode}"
advertised_groups = ["${var.advertise_mode == "DYNAMIC" ? ALL_SUBNETS : null}"]
advertised_ip_ranges {
range = "${var.advertise_mode == "DYNAMIC" ? 1.2.3.4 : null}"
}
advertised_ip_ranges {
range = "${var.advertise_mode == "DYNAMIC" ? 6.7.0.0/16 : null}"
}
}
}
It will also introduce dynamic blocks that you can loop over, so you can also have a dynamic number of advertised_ip_ranges blocks.
The above answer is incorrect, as advertised_ip_ranges won't accept a null value; the solution is to leverage a dynamic block, which can handle a null value for this resource and additionally lets the resource accept a variable number of IP ranges.
variable "custom_ranges" {
  default = ["172.16.31.0/24", "172.16.32.0/24"]
}

resource "google_compute_router" "router_01" {
  name    = "cr-bgp-${var.gcp_bgp_asn}"
  region  = var.gcp_region
  project = var.gcp_project
  network = var.gcp_network
  bgp {
    asn               = var.gcp_bgp_asn
    advertise_mode    = var.advertise_mode
    advertised_groups = var.advertise_mode == "CUSTOM" ? ["ALL_SUBNETS"] : null
    dynamic "advertised_ip_ranges" {
      for_each = var.advertise_mode == "CUSTOM" ? var.custom_ranges : []
      content {
        range = advertised_ip_ranges.value
      }
    }
  }
}
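With this shape a single resource covers both modes. For example (a usage sketch; values illustrative):

# terraform.tfvars
advertise_mode = "DEFAULT" # advertised_groups becomes null and the dynamic block iterates over [], so no custom advertisement config is sent
custom_ranges  = []        # ignored in DEFAULT mode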
