How to Scale out Gitlab EE - gitlab

Currently I am running the whole gitlab EE as a single container. I need to scale out the service so that it can support more users and more operations/pull/push/Merge Requests etc simultanously.
I need to run a redis cluster of its own
I need to run a PG cluster separate
I need to integrate elasticsearch for search
But how can I scale out the remaning core gitlab services. Do they support a scale out architecture.
gitlab workhorse
unicorn ( gitlab rails )
sidekiq ( gitlab rails )
gitaly
gitlab shell

Do they support a scale out architecture.
Not exactly, considering the GitLab Omnibus image is one package with bundled dependencies.
But I never experienced so much traffic that it needed to be split up and scaled out.
There is though a proposal for splitting up the Omnibus image: gitlab-org/omnibus-gitlab issue 1800.
It points out to gitlab-org/build/CNG which does just what you are looking for:
Each directory contains the Dockerfile for a specific component of the infrastructure needed to run GitLab.
rails - The Rails code needed for both API and web.
unicorn - The Unicorn container that exposes Rails.
sidekiq - The Sidekiq container that runs async Rails jobs
shell - Running GitLab Shell and OpenSSH to provide git over ssh, and authorized keys support from the database
gitaly - The Gitaly container that provides a distributed git repos
The other option, using Kubernetes, is the charts/gitlab:
The gitlab chart is the best way to operate GitLab on Kubernetes. This chart contains all the required components to get started, and can scale to large deployments.
Some of the key benefits of this chart and corresponding containers are:
Improved scalability and reliability
No requirement for root privileges
Utilization of object storage instead of NFS for storage
The default deployment includes:
Core GitLab components: Unicorn, Shell, Workhorse, Registry, Sidekiq, and Gitaly
Optional dependencies: Postgres, Redis, Minio
An auto-scaling, unprivileged GitLab Runner using the Kubernetes executor
Automatically provisioned SSL via Let's Encrypt.
Update Sept. 2020:
GitLab 13.4 offers one feature which can help scaling out GitLab on-premise:
Gitaly Cluster majority-wins reference transactions (beta)
Gitaly Cluster allows Git repositories to be replicated on multiple warm Gitaly nodes. This improves fault tolerance by removing single points of failure.
Reference transactions, introduced in GitLab 13.3, causes changes to be broadcast to all the Gitaly nodes in the cluster, but only the Gitaly nodes that vote in agreement with the primary node persist the changes to disk.
If all the replica nodes dissented, only one copy of the change would be persisted to disk, creating a single point of failure until asynchronous replication completed.
Majority-wins voting improves fault tolerance by requiring a majority of nodes to agree before persisting changes to disk. When the feature flag is enabled, writes must succeed on multiple nodes. Dissenting nodes are automatically brought in sync by asynchronous replication from the nodes that formed the quorum.
See Documentation and Issue.

Related

How to scale up Kubernetes cluster with Terraform avoiding downtime?

Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one Nodepool with 3 nodes which are fairly low on resources because we still don't have that many active users/requests simultaneously.
Our backend APIs app is running on three pods, one on each node. I was told I will have need to increase resources soon (I'm thinking more memory or even replacing the VMs of the nodes with better ones).
We structured everything Kubernetes related using Terraform and I know that replacing VMs in a node is a destructive action, meaning the cluster will have to be replaces, new config and all deployments, services and etc will have to be reapplied.
I am fairly new to the Kubernetes and Terraform world, meaning I can do the basics to get an application up and running but I would like to learn what is the best practice when it comes to scaling and performance. How can I perform such increase in resources without having any downtime of our services?
I'm wondering if having an extra Nodepool would help while I replace the VM's of the other one (I might be absolutely wrong here)
If there's any link, course, tutorial you can point me to it's highly appreciated.
(Moved from comments)
In Azure, when you're performing cluster upgrade, there's a parameter called "max surge count" which is equal to 1 by default. What it means is when you update your cluster or node configuration, it will first create one extra node with the updated configuration - and only then it will safely drain and remove one of old ones. More on this here: Azure - Node Surge Upgrade

How to patch GKE Managed Instance Groups (Node Pools) for package security updates?

I have a GKE cluster running multiple nodes across two zones. My goal is to have a job scheduled to run once a week to run sudo apt-get upgrade to update the system packages. Doing some research I found that GCP provides a tool called "OS patch management" that does exactly that. I tried to use it but the Patch Job execution raised an error informing
Failure reason: Instance is part of a Managed Instance Group.
I also noticed that during the creation of the GKE Node pool, there is an option for enabling "Auto upgrade". But according to its description, it will only upgrade the version of the Kubernetes.
According to the Blog Exploring container security: the shared responsibility model in GKE:
For GKE, at a high level, we are responsible for protecting:
The nodes’ operating system, such as Container-Optimized OS (COS) or Ubuntu. GKE promptly makes any patches to these images available. If you have auto-upgrade enabled, these are automatically deployed. This is the base layer of your container—it’s not the same as the operating system running in your containers.
Conversely, you are responsible for protecting:
The nodes that run your workloads. You are responsible for any extra software installed on the nodes, or configuration changes made to the default. You are also responsible for keeping your nodes updated. We provide hardened VM images and configurations by default, manage the containers that are necessary to run GKE, and provide patches for your OS—you’re just responsible for upgrading. If you use node auto-upgrade, it moves the responsibility of upgrading these nodes back to us.
The node auto-upgrade feature DOES patch the OS of your nodes, it does not just upgrade the Kubernetes version.
OS Patch Management only works for GCE VM's. Not for GKE
You should refrain from doing OS level upgrades in GKE, that could cause some unexpected behavior (maybe a package get's upgraded and changes something that will mess up the GKE configuration).
You should let GKE auto-upgrade the OS and Kubernetes. Auto-upgrade will upgrade the OS as GKE releases are inter-twined with the OS release.
One easy way to go is to signup your clusters to release channels, this way they get upgraded as often as you want (depending on the channel) and your OS will be patched regularly.
Also you can follow the GKE hardening guide which provide you with step to make sure your GKE clusters are as secured as possible

Does Kubernetes restart a failed container or create a new container when the running container fails for any reason?

I have ran a docker container locally and it stores data in a file (currently no volume is mounted). I stored some data using the API. After that I failed the container using process.exit(1) and started the container again. The previously stored data in the container survives (as expected). But when I do this same thing in Kubernetes (minikube) the data is lost.
Posting this as a community wiki for better visibility, feel free to edit and expand it.
As described in comments, kubernetes replaces failed containers with new (identical) ones and this explain why container's filesystem will be clean.
Also as said containers should be stateless. There are different options how to run different applications and take care about its data:
Run a stateless application using a Deployment
Run a stateful application either as a single instance or as a replicated set
Run automated tasks with a CronJob
Useful links:
Kubernetes workloads
Pod lifecycle

Drone slaves provided by CoreOs

I have a drone host and a CoreOS cluster with fleet.
The drone now have only unix:///var/run/docker.sock in the nodes menu.
As I understand, I could add other docker nodes defined by docker URLs and certificates. However once I have a CoreOS cluster, it seems logical to use that as the provider of the slaves. I am looking for a solution where
(1)I don't have to configure the nodes whenever the CoreOS cluster configration changes, and
(2) provides correct resource management.
I could think of the following solutions:
Expose docker uris in the CoreOS cluster nodes, and configure all of them directly in drone. In this case I would have follow CoreOs cluster changes manually. Resource management would probably conflict with that of fleet.
Expose docker uris in the CoreOS cluster nodes, and provide a DNS round-robin based access. Seems to be a terrible way of resource management, and would most probably conflict with feet.
Install Swarm on the CoreOs nodes. Resource management would probably conflict with that of fleet.
Have fleet or RKT expose a docker uri, and fleet/RKT would decide on which node the container runs on. The problem is that I could not find any way to do this.
Have drone.io use fleet or RKT. Same problem. Is it possible?
Is there any way to provide solutions for all of my requirements with drone.io and CoreOs?
As I understand, I could add other docker nodes defined by docker URLs
and certificates. However once I have a CoreOS cluster, it seems
logical to use that as the provider of the slaves.
The newest version of drone supports build agents. Build agents are installed per-server and will communicate with the central drone server to pull builds from the queue, execute and send back the results.
docker run \
-e DRONE_SERVER=http://my.drone.server \
-e DRONE_SECRET=passcode \
-v /var/run/docker.sock:/container/path/docker.sock \
drone/drone:0.5 agent
This allows you to add and remove agents on the fly without having to register or manage them at the server level.
I believe this should solve the basic problem you've outlined, although I'm not sure it will provide the level of integration you desire with fleet and coreos. Perhaps a coreos expert can augment my answer.

How to setup Jenkins with HA?

Currently we are using a Jenkins as our CI system and there is one master server and slaves which are provisioned by Saltstack on Openstack. If our Jenkins master server goes down, we need to create a new master and we need to pull the files from the old master & put it in new ones but it's gonna take at least 30mins.
Is there any way to setup Jenkins with High Availability?
I already check with Gearman Plugin, however if the Gearman server goes down for some reason, we need to setup a HA for Gearman also.
Is there any other ways to setup a High Availability for Jenkins?
Jenkins doesn't have a great HA story; the best you can do with the open source version is to put all of the files in $JENKINS_HOME on a shared file system, and then have a cold standby master machine that you can spin up if the active master goes down. That would reduce your failover time to however long it takes for the master to restart, which is usually just a few minutes.
You could also look at CloudBees' Jenkins Enterprise offering, which includes a High Availability Plugin.
I use cluster from scratch doc to create a Jenkins WAN-HA active/passive cluster. See the attached Architecture Diagram for Jenkins HA using pacemaker .
/etc/init.d/jenkins will need to be converted to be an ocf agent script. Currently I manually start up Jenkins via systemd on pcmk-2 server when pcmk-1 is down.

Resources