Azure Key Vault is a great way to secure your secrets and roll them out to your cloud applications from a centralized location. This keeps your configuration in line with the 12-factor approach to environment configuration, and it also ensures that your secrets can easily be changed and managed within your cloud application.
Key Vault Overview
The documentation for Key Vault is readily available. The basics are that you add a key-value pair to Key Vault and it stores your secret in secure storage. You then use a Client ID and Client Secret to access those secrets wherever you may be using them - whether that is a C# web API, an Angular SPA, or any number of other things. Microsoft has also provided us with client packages in many different languages to integrate directly with Azure Key Vault. You can look at most tutorials to see the details of how to use these and how they work.
The basics of integrating are that you configure a KeyVaultProvider, which can implement a filter, and request a value back from that provider. The provider loops through all of the keys, finds the keys that match, and returns the secret values for those items.
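As a concrete sketch of that default integration, assuming the 2.x-era Microsoft.Extensions.Configuration.AzureKeyVault package (the vault URL, client ID, and client secret below are placeholders):

```csharp
using Microsoft.Extensions.Configuration;

public static class KeyVaultConfig
{
    // Minimal sketch of the default Key Vault integration. The vault URL,
    // client ID, and client secret are placeholders - supply your own
    // AAD application values.
    public static IConfiguration Build() =>
        new ConfigurationBuilder()
            .AddAzureKeyVault(
                "https://my-vault.vault.azure.net/", // vault URL (placeholder)
                "<client-id>",                       // AAD application (client) ID
                "<client-secret>")                   // AAD client secret
            .Build();
}
```

With the default secret manager, a secret named `ConnectionStrings--Sql` in the vault surfaces as the configuration key `ConnectionStrings:Sql`, because `--` is translated into the configuration section delimiter.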
Key Vault Provider Scaling
In order to fully understand the default Key Vault provider we need to look at the actual code. In most circumstances this code will work very well. In 3.0, they even added polling for changes in the Key Vault, which is a nice addition. There's one issue with this - Key Vault has a limitation of about 2,000 operations per 10 seconds. The stronger the encryption you use, the lower the throughput you have.
Now for 10, maybe 25, or so keys this is not really an issue. When you have about 1,000 keys for a large environment, you start to run into issues where the initial load of Key Vault parses through all 1,000 keys (including all historically tracked versions, if enabled). For a pure microservices environment you may have 5 to 10 services at a minimum, each with 2-3 secrets. That works out to 10 secrets at minimum and 30 at most. If you have historical tracking in place, multiply that by the number of times each individual secret has been adjusted.
Now say I need a key to access Event Grid, Service Bus, Cosmos DB, and SQL Server, and I also have 25 different microservices that each use about 5 different keys. That equates to 125 keys within Key Vault. If I require a multi-tenanted database to segregate client data, I increase this by the number of microservices multiplied by the number of clients I may have. If I reach 100 clients, I now have 2,500 keys to deal with at a minimum. Suddenly my initial 125 keys look small, and if each microservice is trying to pull these values on every request - we will easily exceed our total throughput for Key Vault.
We can add caching for the keys so that we do not hit Key Vault on every single request. This may reduce the problem down to individual spikes of Key Vault throughput errors at initial startup and at cache expiration. But even fetching a single key still takes a long time - why is this? Simple: as our list of keys grows, so does the time taken to get any individual item when using the default Key Vault provider. This occurs because it does a full list operation on the keys to filter down to the ones we want.
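A minimal sketch of that caching idea, using `IMemoryCache` around the single-key `GetSecretAsync` call (the class name and the 5-minute TTL are illustrative choices, not part of any SDK):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.KeyVault;
using Microsoft.Extensions.Caching.Memory;

// Caps Key Vault traffic by caching each secret for a fixed TTL, so only
// cache misses generate a request against the vault.
public class CachedSecretReader
{
    private readonly KeyVaultClient _client;
    private readonly IMemoryCache _cache;
    private readonly string _vaultUrl;

    public CachedSecretReader(KeyVaultClient client, string vaultUrl)
    {
        _client = client;
        _vaultUrl = vaultUrl;
        _cache = new MemoryCache(new MemoryCacheOptions());
    }

    public Task<string> GetSecretAsync(string name) =>
        _cache.GetOrCreateAsync(name, async entry =>
        {
            entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5);
            // Single-key GET: no list operation, so the cost of a miss does
            // not grow with the number of keys in the vault.
            var bundle = await _client.GetSecretAsync(_vaultUrl, name);
            return bundle.Value;
        });
}
```

Note that this only smooths out steady-state traffic; every expiration still produces a spike of vault requests, which is the behavior described above.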
Detailed Key Vault code workflow
Each time we use the default Key Vault provider we go through multiple steps:
- List out all keys in the key vault
- Filter to determine if we need to load the secret for the key
- Load the secret for the individual key
As the number of keys grows within our Key Vault, ALL Key Vault operations slow down. This is because of the first step and how the filters actually execute. In 2.2 we can see this in the actual code of the LoadAsync method:
private async Task LoadAsync()
{
    var data = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    // Full list operation over every secret in the vault
    var secrets = await _client.GetSecretsAsync(_vault).ConfigureAwait(false);
    do
    {
        foreach (var secretItem in secrets)
        {
            // Skip secrets the manager filters out, and disabled secrets
            if (!_manager.Load(secretItem) || (secretItem.Attributes?.Enabled != true))
            {
                continue;
            }

            // One additional GET per secret that passes the filter
            var value = await _client.GetSecretAsync(secretItem.Id).ConfigureAwait(false);
            var key = _manager.GetKey(value);
            data.Add(key, value.Value);
        }

        secrets = secrets.NextPageLink != null ?
            await _client.GetSecretsNextAsync(secrets.NextPageLink).ConfigureAwait(false) :
            null;
    } while (secrets != null);

    Data = data;
}
The exact issue is the GetSecretsAsync call, which is a full list operation. This becomes very expensive to execute as your number of keys expands. It also goes to show that we need to clean up our keys in Key Vault to only those that need to be available. Another thing to note: the list operation still enumerates secrets that are not enabled - they are filtered out before their values are loaded, but they still add to the cost of the listing.
If we take a look at the 3.1 code we see a drastic change because caching/polling has been added, but we still have the same issue with GetSecrets. The main difference now is that all pages of the list are gathered up front before the individual secrets are loaded:
var secretPage = await _client.GetSecretsAsync(_vault).ConfigureAwait(false);
var allSecrets = new List<SecretItem>(secretPage.Count());
do
{
    allSecrets.AddRange(secretPage.ToList());
    secretPage = secretPage.NextPageLink != null ?
        await _client.GetSecretsNextAsync(secretPage.NextPageLink).ConfigureAwait(false) :
        null;
} while (secretPage != null);
So how do we go about resolving this?
Resolution - Custom Key Vault Providers
Once you reach a certain number of keys within Key Vault, you need to request the individual keys specifically rather than filtering through the default Key Vault provider. Paging through an entire list of keys in order to get 1, 2, or maybe 5 different keys is wildly inefficient. It also eats up your throughput, which will effectively shut down your Key Vault as it begins rejecting requests due to throughput limitations.
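Requesting individual keys looks like the sketch below, using the Microsoft.Azure.KeyVault client with AzureServiceTokenProvider from Microsoft.Azure.Services.AppAuthentication (the vault URL and secret name are placeholders):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.KeyVault;
using Microsoft.Azure.Services.AppAuthentication;

public static class DirectSecrets
{
    // Fetch only the secrets you need, by name, instead of listing the
    // whole vault. The vault URL and secret name are placeholders.
    public static async Task<string> GetSqlConnectionStringAsync()
    {
        var tokenProvider = new AzureServiceTokenProvider();
        var client = new KeyVaultClient(
            new KeyVaultClient.AuthenticationCallback(tokenProvider.KeyVaultTokenCallback));

        // One GET per secret - 5 operations for 5 secrets, no matter how
        // many thousands of keys the vault holds.
        var bundle = await client.GetSecretAsync(
            "https://my-vault.vault.azure.net/", "Sql--ConnectionString");
        return bundle.Value;
    }
}
```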
Build out your own custom provider, or fetch the individual keys directly, rather than using the default provider. 3.0 and 3.1 actually compound the issue stated above because they have implemented polling - and along with that polling, they do the full load of secrets on every cycle:
private async Task PollForSecretChangesAsync()
{
    while (!_cancellationToken.IsCancellationRequested)
    {
        await WaitForReload();
        try
        {
            await LoadAsync();
        }
        catch (Exception)
        {
            // Ignore
        }
    }
}

protected virtual async Task WaitForReload()
{
    await Task.Delay(_reloadInterval.Value, _cancellationToken.Token);
}
But how do I create a custom Key Vault provider? You override how the actual Load works inside the AzureKeyVaultConfigurationProvider. Sadly, most of these classes are marked internal, which complicates all of this. The simplest way is to build out a Key Vault provider following the same ConfigurationProvider pattern that is shown in the .NET Core code for Azure Key Vault.
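A minimal sketch of such a provider follows. Instead of listing the vault, it loads only an explicit set of secret names. The class names and the blocking Load() call are illustrative; a production version should mirror the real AzureKeyVaultConfigurationProvider's handling of reload and error cases.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Azure.KeyVault;
using Microsoft.Extensions.Configuration;

// Loads only the named secrets - no list operation at all, so cost is
// proportional to the number of secrets this service actually uses.
public class NamedSecretsConfigurationProvider : ConfigurationProvider
{
    private readonly KeyVaultClient _client;
    private readonly string _vaultUrl;
    private readonly IReadOnlyList<string> _secretNames;

    public NamedSecretsConfigurationProvider(
        KeyVaultClient client, string vaultUrl, IReadOnlyList<string> secretNames)
    {
        _client = client;
        _vaultUrl = vaultUrl;
        _secretNames = secretNames;
    }

    public override void Load()
    {
        // One GET per named secret; "--" maps to the ":" section delimiter,
        // matching the default secret manager's convention.
        Data = _secretNames.ToDictionary(
            name => name.Replace("--", ConfigurationPath.KeyDelimiter),
            name => _client.GetSecretAsync(_vaultUrl, name)
                           .GetAwaiter().GetResult().Value);
    }
}

public class NamedSecretsConfigurationSource : IConfigurationSource
{
    private readonly KeyVaultClient _client;
    private readonly string _vaultUrl;
    private readonly IReadOnlyList<string> _secretNames;

    public NamedSecretsConfigurationSource(
        KeyVaultClient client, string vaultUrl, IReadOnlyList<string> secretNames)
    {
        _client = client;
        _vaultUrl = vaultUrl;
        _secretNames = secretNames;
    }

    public IConfigurationProvider Build(IConfigurationBuilder builder) =>
        new NamedSecretsConfigurationProvider(_client, _vaultUrl, _secretNames);
}
```

You would register the source with `configBuilder.Add(new NamedSecretsConfigurationSource(client, vaultUrl, names))`, where `names` is the handful of secrets this service needs.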
Maybe I'll figure out a better way, but for now this is an issue with the scalability of Key Vault that we as a community need to be aware of in the cloud. Also, looking at the newest version of .NET, we can see that the issue is not getting fixed - in fact, it is getting worse because of polling.