I'm running an instance of Gitlab Omnibus CE, version 8.15.2, on CentOS 7.3.1611. Upgrading from the 8.14 release family didn't go quite according to plan; since doing that, I've been unable to access the Gitlab browser interface.
When I try to access the browser interface, I can access the login screen and log in, but after I'm logged in, going to any page results in an Error 500: Whoops, something went wrong on our end.
So I used gitlab-ctl tail to grab some log data on what's happening, and it looks like a problem with PostgreSQL's data for one of my projects:
http://pastebin.com/VDMk0eKr
But I'm not sure how I should fix this. Any ideas?
It's a known issue that has been fixed in the newest release, 8.15.3. If you don't want to upgrade GitLab, there is an existing workaround (Edit: as mentioned in the comments, the workaround does not always work, so consider upgrading as the primary fix).
File:
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/has_status.rb
Replace
builds = scope.select('count(*)').to_sql
created = scope.created.select('count(*)').to_sql
success = scope.success.select('count(*)').to_sql
pending = scope.pending.select('count(*)').to_sql
running = scope.running.select('count(*)').to_sql
skipped = scope.skipped.select('count(*)').to_sql
canceled = scope.canceled.select('count(*)').to_sql
with
builds = scope.select('count(*)').reorder(nil).to_sql
created = scope.created.select('count(*)').reorder(nil).to_sql
success = scope.success.select('count(*)').reorder(nil).to_sql
pending = scope.pending.select('count(*)').reorder(nil).to_sql
running = scope.running.select('count(*)').reorder(nil).to_sql
skipped = scope.skipped.select('count(*)').reorder(nil).to_sql
canceled = scope.canceled.select('count(*)').reorder(nil).to_sql
And restart GitLab.
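On an Omnibus install that typically means:
sudo gitlab-ctl restart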
I had the same issue and the above didn't work, so I ran the following commands to downgrade.
To check the current version installed:
sudo dpkg -l | grep gitlab-ce
To see which versions were available:
sudo apt-cache madison gitlab-ce | less
and the following to "downgrade", since the above showed I was at 9.2.0-rc2.ce.0:
sudo apt-get install gitlab-ce=9.2.0-rc1.ce.0
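If you stay on the downgraded version for a while, you may also want to hold the package so it doesn't get upgraded again automatically (optional):
sudo apt-mark hold gitlab-ce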
Related
After installing the self-managed GitLab Docker container, I'm facing an issue when trying to initialize a GitLab Kubernetes Agent.
First of all, I've added the .gitlab/agents/<agent-name>/config.yaml according to the GitLab docs, and it's possible to click the green integrate with GitLab Agent button, but then the dropdown is empty and the console returns a 500 internal server error without any useful information.
The gitlab-kas section in /etc/gitlab/gitlab.rb contained these default settings:
##! Settings used by the GitLab application
# gitlab_rails['gitlab_kas_enabled'] = true
# gitlab_rails['gitlab_kas_external_url'] = ws://gitlab.example.com/-/kubernetes-agent
# gitlab_rails['gitlab_kas_internal_url'] = grpc://localhost:8153
##! Enable GitLab KAS
# gitlab_kas['enable'] = true
Last but not least, I found some more helpful logs via docker logs -f gitlab:
Gitlab::Kas::Client::ConfigurationError (GitLab KAS is not enabled):
lib/gitlab/kas/client.rb:16:in `initialize'
ee/app/graphql/resolvers/kas/agent_configurations_resolver.rb:28:in `new'
ee/app/graphql/resolvers/kas/agent_configurations_resolver.rb:28:in `kas_client'
ee/app/graphql/resolvers/kas/agent_configurations_resolver.rb:16:in `resolve'
lib/gitlab/graphql/present/field_extension.rb:18:in `resolve'
lib/gitlab/graphql/generic_tracing.rb:40:in `with_labkit_tracing'
lib/gitlab/graphql/generic_tracing.rb:30:in `platform_trace'
lib/gitlab/graphql/generic_tracing.rb:40:in `with_labkit_tracing'
lib/gitlab/graphql/generic_tracing.rb:30:in `platform_trace'
lib/gitlab/graphql/generic_tracing.rb:40:in `with_labkit_tracing'
lib/gitlab/graphql/generic_tracing.rb:30:in `platform_trace'
app/graphql/gitlab_schema.rb:40:in `multiplex'
...
So it seems that the gitlab-kas service is not running, but how can I boot it up?
OMG, ID10T error incoming: after studying the /etc/gitlab/gitlab.rb config again, I found the mistake, and it's kind of obvious. Changing settings is good, but it doesn't help at all if they stay commented out.
In reference to the original question, in the config snippet provided above you can see that the settings are actually comments. After removing the # prefixes it works fine.
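For reference, the relevant lines should end up uncommented, roughly like this (the hostname is still the example placeholder from the snippet above), followed by a reconfigure:
gitlab_rails['gitlab_kas_enabled'] = true
gitlab_rails['gitlab_kas_external_url'] = 'ws://gitlab.example.com/-/kubernetes-agent'
gitlab_rails['gitlab_kas_internal_url'] = 'grpc://localhost:8153'
gitlab_kas['enable'] = true
sudo gitlab-ctl reconfigure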
I am running a Node.js app on Google App Engine, using the following command to deploy my code:
gcloud app deploy --stop-previous-version
My desired behavior is for all instances running previous versions to be terminated, but they always seem to stick around. Is there something I'm missing?
I realize they are not receiving traffic, but I am still paying for them and they cause some background telemetry noise. Is there a better way of running this command?
Example output of gcloud app instances list (screenshot omitted):
As you can see, I have two different versions running.
We accidentally blew through our free Google App Engine credit in less than 30 days because of an errant flexible instance that wasn't cleared by subsequent deployments. When we pinpointed it as the cause it had scaled up to four simultaneous instances that were basically idling away.
tl;dr: Use the --version flag when deploying to specify a version name. An existing instance with the same version will be replaced the next time you deploy.
That led me down the rabbit hole that is --stop-previous-version. Here's what I've found out so far:
--stop-previous-version doesn't seem to be supported anymore. It's mentioned under Flags on the gcloud app deploy reference page, but if you look at the top of the page where all the flags are listed, it's nowhere to be found.
I tried deploying with that flag set to see what would happen but it seemingly had no effect. A new version was still created, and I still had to go in and manually delete the old instance.
There's an open Github issue on the gcloud-maven-plugin repo that specifically calls this out as an issue with that plugin but the issue has been seemingly ignored.
At this point our best bet is to add --version=staging or similar to gcloud app deploy. The reference docs for that flag seem to indicate that it'll replace an existing instance that shares that "version":
--version=VERSION, -v VERSION
The version of the app that will be created or replaced by this deployment. If you do not specify a version, one will be generated for you.
(emphasis mine)
Additionally, Google's own reference documentation on app.yaml (the link's for the Python docs but it's still relevant) specifically calls out the --version flag as the "preferred" way to specify a version when deploying:
The recommended approach is to remove the version element from your app.yaml file and instead, use a command-line flag to specify your version ID
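Putting that together, a deploy pinned to a fixed version ID looks roughly like this (staging is just an example name):
gcloud app deploy --version=staging --promote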
As far as I can tell, for the Standard Environment with automatic scaling at least, it is normal for old versions to remain "serving", though they should hopefully have zero instances (even if your scaling configuration specifies a nonzero minimum). At least that's what I've seen. I think (I hope) that those old "serving" versions won't result in any charges, since billing is per instance.
I know most of the above answers are for Flexible Environment, but I thought I'd include this here for people who are wondering.
(And it would be great if someone from Google could confirm.)
I had the same problem as the OP. Using the flex environment (some of this also applies to the standard environment) with Docker (runtime: custom in app.yaml), I've finally solved this! I tried a lot of things and I'm not sure which one fixed it (or whether it was a combination), so I'll list the things I did here, with the most likely solutions listed first.
SOLUTION 1) Ensure that cloud storage deletes old versions
What does cloud storage have to do with anything? (I hear you ask)
Well there's a little tooltip (Google Cloud Platform Web UI (GCP) > App Engine > Versions > Size) that when you hover over it says:
(Google App Engine) Flexible environment code is stored and billed from Google Cloud Storage ... yada yada yada
So based on this info and this answer I visited GCP > Cloud Storage > Browser and found my storage bucket AND a load of other storage buckets I didn't know existed. It turns out that some of the buckets store cached cloud functions code, some store cached docker images and some store other cached code/stuff (you can tell which is which by browsing the buckets).
So I added a deletion policy to all the buckets (except the cloud functions bucket) as follows:
Go to GCP > Cloud Storage > Browser and click the link (for the relevant bucket) in the Lifecycle Rules column > Click ADD A RULE > THEN:
For SELECT ACTION choose "Delete Object" and click continue
For SELECT OBJECT choose "Number of newer versions" and enter 1 in the input
Click CREATE
This will return you to the table view and you should now see the rule in the lifecycle rules column.
REPEAT this process for all relevant buckets (the relevant buckets were described earlier).
THEN delete the contents of the relevant buckets. WARNING: Some buckets warn you NOT to delete the bucket itself, only the contents!
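If you prefer the CLI over the web UI, the same rule can be set with gsutil; this is just a sketch, with the bucket name as a placeholder. Save the rule as lifecycle.json:
{"rule": [{"action": {"type": "Delete"}, "condition": {"numNewerVersions": 1}}]}
and apply it to each relevant bucket:
gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET_NAME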
Now re-deploy and your latest version should now get deployed and hopefully you will never have this problem again!
SOLUTION 2) Use deploy flags
I added these flags
gcloud app deploy --quiet --promote --stop-previous-version
This probably doesn't help since these flags seem to be the default but worth adding just in case.
Note that for the standard environment only (I heard on the grapevine) you can also use the --no-cache flag which might help but with flex, this flag caused the deployment to fail (when I tried).
SOLUTION 3)
This probably does not help at all, but I added:
COPY app.yaml .
to the Dockerfile
TIP 1)
This is probably more of a helpful / useful debug approach than a fix.
Visit GCP > App Engine > Versions
This shows all versions of your app (1 per deployment) and it also shows which version each instance is running (instances are configured in app.yaml).
Make sure all instances are running the latest version (this should happen by default), and consider deleting old versions; see the gcloud sketch after this tip.
You can determine your version from the gcloud app deploy logs (at the start of the logs) but it seems that the versions are listed by order of deployment anyway (most recent at top).
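Old versions can also be inspected and removed from the command line; the version ID and service name below are just examples:
gcloud app versions list
gcloud app versions delete 20210101t000000 --service=default --quiet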
TIP 2)
Visit GCP > App Engine > Instances
SSH into an instance. This is just a matter of clicking a few buttons in the console (screenshot omitted). Once you have SSH'd in, run:
docker exec -it gaeapp /bin/bash
Which will get you into the docker container running your code. Now you can browse around to make sure it has your latest code.
Well I think my answer is long enough now. If this helps, don't thank me, J-ES-US is the one you should thank ;) I belong to Him ^^
Google may have updated the documentation cited in #IAmKale's answer:
Note that if the version is running on an instance of an auto-scaled service, using --stop-previous-version will not work and the previous version will continue to run because auto-scaled service instances are always running.
Seems like that flag only works with manually scaled services.
This is a supplementary, optional answer to go with my other main answer.
In addition to that answer, I am now auto-incrementing the version on each deploy using a script.
My script contents are below.
Basically, the script auto-increments the version every time you deploy. I am using Node.js, so the script uses npm version to bump the version, but this line could easily be tweaked for whatever language you use.
The script requires a clean git working directory for deployment.
The script assumes that when the version is bumped, this will result in file changes (e.g. changes to package.json version) that need pushing.
The script essentially tries to find your SSH key; if it finds it, it starts an SSH agent and uses your SSH key to git commit and git push the file changes. Otherwise it just does a git commit without a push.
It then does a deploy using the --version flag ... --version="${deployVer}"
Thought this might help someone, especially since the top answer talks a lot about using the --version flag on a deploy.
#!/usr/bin/env bash
projectName="vehicle-damage-inspector-app-engine"

# Find SSH key
sshFile1=~/.ssh/id_ed25519
sshFile2=~/Desktop/.ssh/id_ed25519
sshFile3=~/.ssh/id_rsa
sshFile4=~/Desktop/.ssh/id_rsa
if [ -f "${sshFile1}" ]; then
  sshFile="${sshFile1}"
elif [ -f "${sshFile2}" ]; then
  sshFile="${sshFile2}"
elif [ -f "${sshFile3}" ]; then
  sshFile="${sshFile3}"
elif [ -f "${sshFile4}" ]; then
  sshFile="${sshFile4}"
fi

# If SSH key found then fire up SSH agent
if [ -n "${sshFile}" ]; then
  pub=$(cat "${sshFile}.pub")
  for i in ${pub}; do email="${i}"; done
  name="Auto Deploy ${projectName}"
  git config --global user.email "${email}"
  git config --global user.name "${name}"
  echo "Git SSH key = ${sshFile}"
  echo "Git email = ${email}"
  echo "Git name = ${name}"
  eval "$(ssh-agent -s)"
  ssh-add "${sshFile}" &>/dev/null
  sshKeyAdded=true
fi

# Bump version and git commit (and git push if SSH key added) and deploy
if [ -z "$(git status --porcelain)" ]; then
  echo "Working directory clean"
  echo "Bumping patch version"
  ver=$(npm version patch --no-git-tag-version)
  git add -A
  git commit -m "${projectName} version ${ver}"
  if [ -n "${sshKeyAdded}" ]; then
    echo ">>>>> Bumped patch version to ${ver} with git commit and git push"
    git push
  else
    echo ">>>>> Bumped patch version to ${ver} with git commit only, please git push manually"
  fi
  deployVer="${ver//"."/"-"}"
  gcloud app deploy --quiet --promote --stop-previous-version --version="${deployVer}"
else
  echo "Working directory unclean, please commit changes"
fi
For Node.js users: if you call the script deploy.sh, you should add
"deploy": "sh deploy.sh"
to your package.json scripts, and deploy with npm run deploy.
Problems
1. Browser = Firefox (Non Geckodriver, Selenium v2.53.4)
(Works on one linux thin client but not on another...)
$ bundle exec rake parallel:spec
Selenium::WebDriver::Error::WebDriverError:
unable to bind to locking port 7054 within 120 seconds
2. Browser = Firefox (Geckodriver v0.14.0, Selenium-webdriver v3.1.0)
$ bundle exec rake parallel:spec
Net::ReadTimeout:
Net::ReadTimeout
3. Browser = Chrome (Chromedriver v2.27, Selenium-webdriver v3.1.0)
$ bundle exec rake parallel:spec
Selenium::WebDriver::Error::NoSuchDriverError:
no such session
(Driver info: chromedriver=2.27.440175 ,platform=Linux 3.16.0-0.bpo.4-amd64 x86_64)
My Setup
Server with the following installed:
-Linux - Debian x86_64 Wheezy
-ruby 2.2.5p319 (2016-04-26 revision 54774)
-Firefox v46.0.3
-Chrome 56.0.2924.87 (64-bit)
-ChromeDriver 2.27.440175
-Xvfb (x11-xserver-utils v 7.7~3 through headless gem)
Gems
-Selenium v3.1.0 (was 2.53.4)
-parallel_tests v2.10.0
-capybara (2.7.1)
-rspec-activemodel-mocks (1.0.3)
-rspec-core (3.4.4)
-rspec-expectations (3.4.0)
-rspec-mocks (3.4.1)
-rspec-rails (3.4.2)
-rspec-support (3.4.1)
-headless (2.2.3) (Xvfb)
Multiple thin clients running their software off the mentioned server setup.
My computer is one of many...
Important: There is another computer that does NOT encounter the mentioned problem running the same software and same versions off the same server!
Things that are NOT the problem
It is not an incompatibility between my firefox browser version and Selenium.
Why not?
a) Firefox v46.0.3 and Selenium v2.53.4 is currently installed on our server and another client of this server successfully runs parallel_tests using the mentioned versions of Firefox & Selenium.
b) Which Firefox version is compatible with Selenium 2.53.0?
There are no zombie processes (still-running firefox) causing firefox to lock port 7054.
This is specifically after each failure has occurred and prior to starting a new $ bundle exec rake parallel:spec run.
Why not?
refer to items 1 & 2 in 'Things I Have Tried'
Turned out this was not entirely the case, although it was not the cause of the problem.
Databases were not always properly killed; see Update 5.
However, the databases not being killed were an outcome of the problem,
not its cause; refer to the solution section.
Side note
For those wishing to install the mentioned versions to get selenium / firefox working:
Installing a previous version doesn't fix most problems
Things I have tried
Removed any processes still running
$ killall ruby; killall rspec; killall firefox
Result: Failed...
Discovered that completing step 1 is not enough to kill all zombie processes.
After logging out and into a different tty, I discovered that there was still an rspec, ruby and firefox process running!
So I logged out of my user, logged into a new tty, killed all zombie processes using:
$ kill -9 process_id
I then reran $ ps aux to ensure all processes were cleaned up.
Result: Failed...
Gain insight into the problem.
Ran $ lsof -i TCP:7054 to see what was holding that port.
Result: It was my firefox test, no surprise, no real insight gained.
Ensured parallel test databases were running correctly.
Dropped all databases, recreated databases, reloaded all schemas, reseeded (development), reprepared.
Result: Failed... I doubted this was the cause, but doing this certainly eliminated it.
Deleted the firefox cache, all persisted settings, everything, for a clean start.
Result: Failed...
Tried to eliminate any local environment variables obtained from the project.
Did this by copying the project directory from the working computer.
Then reran $ bundle exec rake parallel:spec.
Result: Failed...
Tried to eliminate all local environment variables (project and linux).
Did this by creating a new linux user.
Then switched into the new user.
$ su new_user -l
Copied over the minimum zsh items needed.
Then ran $ bundle exec rake parallel:spec
Result: Failed...
Ensured that /etc/hosts contained:
127.0.0.1 localhost
Result: Failed...
Ran the tests in a single thread (not parallel).
$ rspec spec
Result: Successfully runs (does not hit the problem)
See Update 1
See Update 2
See Update 3
See Update 4
See Update 5
Partial Solution
See Update 6
Debugged Selenium & Parallel_tests gems
Result: Identified that the issue is NOT in Selenium
See Update 7
Result: Running tests in parallel worked. But why?
See Update 8
Result:
Discovered Selenium 3.1.0 changed the way files are automatically downloaded.
This caused tests to hang indefinitely whilst running parallel tests,
which in turn caused the databases to be held open.
Things I am going to try (Updates)
Run tests with chromedriver in chrome browser and see if it passes after the fix.
Update 1
I replaced firefox with chrome.
When I run a single test, the test successfully completes with chrome.
It did so with firefox as well.
However running $ bundle exec rake parallel:spec
Result: Failed...
Selenium::WebDriver::Error::NoSuchDriverError:
no such session
(Driver info: chromedriver=2.27.440175 ,platform=Linux 3.16.0-0.bpo.4-amd64 x86_64)
Update 2
I updated the selenium-webdriver gem to the latest version (was v2.53.4, now 3.2.2)
Result: Failed...
Selenium::WebDriver::Error::NoSuchDriverError:
no such session
(Driver info: chromedriver=2.27.440175 ,platform=Linux 3.16.0-0.bpo.4-amd64 x86_64)
Update 3
Located the lock file location for parallel tests (~/.config/google-chrome).
Identified 3 persisting lock files.
Other users only had 1.
Deleted these and reran tests.
Result: Failed...
Selenium::WebDriver::Error::NoSuchDriverError:
no such session
(Driver info: chromedriver=2.27.440175 ,platform=Linux 3.16.0-0.bpo.4-amd64 x86_64)
Update 4
Upgraded selenium-webdriver to v3.1.0 (latest stable)
Upgraded parallel_tests to v2.13.0 (latest)
Installed Geckodriver v0.14.0 (latest)
Then ran $ bundle exec rake parallel:spec
Result: Failed...
Failure/Error: visit "#/login"
Net::ReadTimeout:
Net::ReadTimeout
Update 5
Whilst in the firefox (Geckodriver v0.14.0, Selenium-webdriver v3.1.0) branch.
I only realised when I had to drop all my parallel_test databases that some were still open.
#ltsp:~/ap$ bundle exec rake parallel:drop[32]
Couldn't drop ap_test_andre32 : #<ActiveRecord::StatementInvalid: PG::ObjectInUse: ERROR: database "ap_test_andre32" is being accessed by other users
DETAIL: There are 3 other sessions using the database.
: DROP DATABASE IF EXISTS "ap_test_andre32">
Couldn't drop ap_test_andre25 : #<ActiveRecord::StatementInvalid: PG::ObjectInUse: ERROR: database "ap_test_andre25" is being accessed by other users
DETAIL: There are 3 other sessions using the database.
: DROP DATABASE IF EXISTS "ap_test_andre25">
When rake parallel:spec does not complete (hangs indefinitely),
the process must be killed manually.
Doing so leaves databases locked to the parallel_tests processes that were using them at the time,
so they must be identified and cleaned up.
postgres 743 0.0 0.0 222364 33628 ? Ss 15:30 0:00 postgres: andre ap_test_andre32 [local] idle in transaction
andre 24581 0.0 0.0 7852 2028 pts/36 S+ 15:49 0:00 grep andre32
postgres 26822 0.0 0.0 220032 23400 ? Ss 15:35 0:00 postgres: andre ap_test_andre32 [local] ALTER TABLE waiting
postgres 29684 0.0 0.0 220032 24064 ? Ss 15:40 0:00 postgres: andre ap_test_andre32 [local] ALTER TABLE waiting
Update 5 Solution:
Search for the database processes and kill all of them:
ps aux | grep test_andre
andre#ltsp:~/ap$ sudo kill -9 743 26822 29684
I was then able to drop my databases.
bundle exec rake parallel:drop[32]
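The same lookup-and-kill can be done in one line (a sketch; the bracketed grep pattern simply avoids matching the grep process itself):
ps aux | grep '[t]est_andre' | awk '{print $2}' | xargs -r sudo kill -9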
Update 6
Whilst in the firefox (Geckodriver v0.14.0, Selenium-webdriver v3.1.0) branch.
Cloned parallel_tests & Selenium projects locally.
Replaced my gems with a path to the locally cloned projects.
Debugged starting with the error stack trace.
Results
Updated to selenium 3.1.0 and loaded geckodriver (marionette).
I discovered that my firefox profile was not set up correctly with Capybara.
This broke my local single thread tests.
Fixed this.
Discovered that geckodriver is not to be used for FF<48.
Also discovered that the capybara, selenium 3+ & FF48+ combo is not yet ready for use.
Some vital functions are not working. (Right clicking, window resizing...)
Refer here for full details
After investigating parallel_tests, I was able to rule that out.
Continued to debug in the firefox test case.
Used the locking port error as my guide.
Ruled out Selenium as the cause of the error.
After debugging the stack trace, it was proving to be very likely that the error state was inherited.
This was just a strong hunch at the time.
It later proved to be correct...
So the summary here was that firefox had processes that were being locked.
And they were not being locked by Selenium.
Update 7
Whilst in the firefox (Selenium-webdriver v2.53.4) branch.
Went back to the new linux user that was created.
In light of Update 5, I cleaned up all running processes.
Dropped all databases.
$ bundle exec rake parallel:spec
Result: Parallel tests worked
But why?
The databases were not the cause of the issue.
There was something else.
Update 8
Whilst in the firefox (Geckodriver v0.14.0, Selenium-webdriver v3.1.0) branch.
Identified the reason why the tests were failing and indefinitely hanging.
This caused the issues described in Update 5 & 6.
It was caused by a change in the way Selenium accepts firefox profile settings.
I identified that the integration tests that were failing were ones that launched a pdf download.
Previously, I had this automated so that the download modal would not appear.
Instead it would automatically download the file to a specified folder.
Updating to Selenium 3.1.0 broke this.
Tests hung indefinitely.
Databases were held open.
The problems identified in the updates were not the root cause.
The root cause was that the firefox/chrome browser ports were not being closed and were held open.
Looking at htop, Polkitd was seen to be taking up 16.5 GB of RAM!
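A quick way to spot a memory hog like this without htop (a generic command, nothing specific to this setup):
ps aux --sort=-%mem | head -n 5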
This was caused by a memory leak in Polkitd.
After checking the reported issues, it was confirmed that the Polkitd memory leak is a known issue.
The issue has been fixed, but only in later Debian releases, not for Wheezy.
After restarting Polkitd, and rerunning the tests in parallel they worked!
This explains why the parallel test issues were still occurring the first time I created a new linux user with a clean profile: memory leaks are unpredictable.
It also explains why another computer did not run into the issue.
And why the second time I created a new user the parallel tests worked!
Phew, that took a lot of effort!
Polkitd was uninstalled as it was not needed for any printers or other software that we run.
Overall, if anyone else has the locking issue, it would be helpful to follow some of the process detection that I have done, as some of the issues are common to all OSes.
Context: GitLab 8 with external nginx and PostgreSQL on Ubuntu 15.04. It all worked with GitLab 7.10, and I started with a fresh install to avoid upgrade problems.
In the gitlab.rb there is:
gitlab_rails['db_adapter'] = "postgresql"
gitlab_rails['db_encoding'] = "unicode"
gitlab_rails['db_database'] = "gitlabdb"
gitlab_rails['db_pool'] = 10
gitlab_rails['db_username'] = "gitlab"
When doing a reconfigure and "gitlab-rake gitlab:setup" there is no problem, and the database gets recreated. So far, looking good. Unfortunately the page doesn't load and I get a 500; the log file tells me that it cannot log in with the given password. I made the database accept all connections (without a password) and then got this weird error:
ActiveRecord::NoDatabaseError (FATAL: database "gitlabhq_production" does not exist
Nowhere in the config files is a database gitlabhq_production mentioned, so I'm clueless here. Can you help out?
It was an old instance of GitLab misbehaving. A reboot helped.
Ok, so I'm trying to configure and install svnserve on my Ubuntu server. So far so good, up to the point where I try to configure sasl (to prevent plain-text passwords).
So: I installed svnserve and made it run as a daemon (also installed it as a startup script with the command svnserve -d -r /var/svn).
My repository is in /var/svn and has following configuration (to be found in /var/svn/myrepo/conf/svnserve.conf) (I left comments out):
[general]
anon-access = none
auth-access = write
realm = my_repo
[sasl]
use-sasl = true
min-encryption = 128
max-encryption = 256
Over to SASL: I created an svn.conf file in /usr/lib/sasl2/:
pwcheck_method: auxprop
auxprop_plugin: sasldb
sasldb_path: /etc/my_sasldb
mech_list: DIGEST-MD5
I created it in that folder as the article at this link suggested: http://svnbook.red-bean.com/nightly/en/svn.serverconfig.svnserve.html#svn.serverconfig.svnserve.sasl (and also because it existed and was listed as a result when I executed locate sasl).
Right after that I executed this command:
saslpasswd2 -c -f /etc/my_sasldb -u my_repo USERNAME
Which also asked me for a password twice, which I supplied. All going great.
When issuing the following command:
sasldblistusers2 -f /etc/my_sasldb
I get the - correct, as far as I can see - result:
USERNAME#my_repo: userPassword
Restarted svnserve, also restarted the whole server, and tried to connect.
This was the result from my TortoiseSVN client:
Authentication error from server: SASL(-13): user not found: unable to canonify
user and get auxprops
I have no clue at all what I'm doing wrong. I've been scouring the web for the past few hours, but haven't found anything except that I might need to move the svn.conf file to another location, for example the install location of Subversion itself. Running which svn gives /usr/bin/svn, so I moved svn.conf to /usr/bin (although that doesn't feel right to me).
Still doesn't work, even after a new reboot.
I'm running out of ideas. Anyone else?
EDIT
I tried changing this (according to what some other forums on the internet told me to do): in the file /etc/default/saslauthd, I changed
START=no
MECHANISMS="pam"
to
START=yes
MECHANISMS="sasldb"
(Actually I had already changed START=no to START=yes before, but I forgot to mention it). But still no luck (I did reboot the whole server).
It looks like svnserve uses default values for SASL...
Check that /etc/sasl2/svn.conf is readable by the svnserve process owner.
If /etc/sasl2/svn.conf is owned by user root, group root, with mode -rw------- (600), svnserve silently falls back to the default values.
You will not be warned by any log file entry.
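A quick way to check and fix this (644 is just one reasonable choice that makes the file world-readable):
ls -l /etc/sasl2/svn.conf
sudo chmod 644 /etc/sasl2/svn.conf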
see section 4 of https://svn.apache.org/repos/asf/subversion/trunk/notes/sasl.txt:
This file must be named svn.conf, and must be readable by the svnserve process.
(it took me more than 3 days to understand both svnserve-sasl-ldap and this pitfall at the same time..)
I recommend installing the package cyrus-sasl2-doc and reading the section Cyrus SASL for System Administrators carefully.
I expect this is caused by the SASL API behaviour for this call:
result = sasl_server_new(SVN_RA_SVN_SASL_NAME,
                         hostname, b->realm,
                         localaddrport, remoteaddrport,
                         NULL, SASL_SUCCESS_DATA,
                         &sasl_ctx);
if (result != SASL_OK)
  {
    svn_error_t *err = svn_error_create(SVN_ERR_RA_NOT_AUTHORIZED, NULL,
                                        sasl_errstring(result, NULL, NULL));
    SVN_ERR(write_failure(conn, pool, &err));
    return svn_ra_svn__flush(conn, pool);
  }
As you can see, handling of this access failure by svnserve is not foreseen; only OK or an error is expected...
I looked in /var/log/messages and found
localhost svnserve: unable to open Berkeley db /etc/sasldb2: No such file or directory
When I created the sasldb at the above path and got the permissions right, it worked. It looks like svnserve ignores or does not use the configured sasldb_path.
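In practice that meant re-creating the user entries in the default location, mirroring the commands from the question (USERNAME and the my_repo realm are the question's values):
sudo saslpasswd2 -c -f /etc/sasldb2 -u my_repo USERNAME
sudo sasldblistusers2 -f /etc/sasldb2
and then making sure /etc/sasldb2 is readable by the user that runs svnserve.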
There was another suggestion that rebooting solved the problem but that option was not available to me.