Why control matters

In March we moved from Groove to Zendesk - and with this migration, our Knowledge Base (KB) moved too.

The challenge we faced was namespacing: KB articles hosted on Groove lived under http://help.canary.tools/knowledge_base/topics/, but the namespace /knowledge* is reserved on Zendesk and is not available for our use. This forced us to migrate all KB pages to new URLs and update the cross-references between articles. That addressed the user experience when one lands at our KB portal by clicking a valid URL or by typing https://help.canary.tools into a browser.

What isn't resolved, though, are the thousands of Canaries in the field with URLs that now point to the old namespace. We design Canary to be dead simple, but realise that users may sometimes look for assistance. To this end, the devices often offer simple "What is this?" links in the interface that lead a user to a discussion of the feature.

With the move (and with Zendesk stealing the namespace), a customer who clicked on one of those links would get an amateurish, uninformative white screen saying “Not Found”.

This is a terrible customer experience! 

The obvious way forward is a redirect mechanism which maps https://help.canary.tools/knowledge_base/* URLs to the Zendesk namespace. By implication, the DNS entry help.canary.tools cannot point directly to Zendesk; it needs to point to a system that is less opaque to us, so we can configure it at will.


That's straightforward! As a first rough fit, AWS CloudFront should have us up and running in minutes: it allows us to map namespaces from https://help.canary.tools/* to https://thinkst.zendesk.com/* with minimal effort.

Step 1: the browser requests a URL on our hostname
URL: https://help.canary.tools/some/uri
GET /some/uri HTTP/1.1

Step 2: CloudFront forwards the request to the Zendesk origin
URL: https://thinkst.zendesk.com/hc/en-gb/some/uri
GET /hc/en-gb/some/uri HTTP/1.1

The next step is to intercept requests to the /knowledge_base namespace, and return an HTTP 301 redirect to the correct URL in Zendesk. We make use of the Lambda@Edge functionality to implement a request handler.
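A minimal sketch of such a handler - the single mapping shown is a placeholder borrowed from the Nginx config later in this post; the real table holds every migrated KB URL:

# Lambda@Edge viewer-request handler (sketch)
REDIRECTS = {
    # old Groove path -> new Zendesk article path (placeholder entry)
    '/knowledge_base/topics/url1': '/hc/en-gb/articles/360002426777',
}

def handler(event, context):
    request = event['Records'][0]['cf']['request']
    target = REDIRECTS.get(request['uri'])
    if target is None:
        # Not an old KB URL: let CloudFront pass it through to the origin
        return request
    # Old KB URL: send the browser to the article's new home
    return {
        'status': '301',
        'statusDescription': 'Moved Permanently',
        'headers': {
            'location': [{'key': 'Location',
                          'value': 'https://help.canary.tools' + target}],
        },
    }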


Thirty minutes and a few lines of Python later, it seemed like we had it all figured out, but for one not-so-minor detail: images in KB articles weren't loading. What was going on?

The DNS record help.canary.tools was pointing to CloudFront while the origin for the content was configured as thinkst.zendesk.com, so when CloudFront requested an image it got an HTTP redirect back to itself, causing an infinite redirect loop.

Surely this is fixable by adding the correct Host: help.canary.tools header to the request? Nope! Instead of a redirect, we were now getting a 403 from CloudFlare (N.B. NOT CloudFront) - Zendesk uses CloudFlare for its own content delivery. WTF?!?

After a few iterations the "magic" (read: plain obvious) incantation was discovered. Note that 104.16.55.111 is the IP address behind thinkst.zendesk.com.
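A sketch of that incantation using curl's --resolve option, which pins the hostname to a specific IP so the request connects to Zendesk's address while still presenting help.canary.tools:

# Connect to Zendesk's IP, but negotiate TLS (SNI) and send Host:
# for help.canary.tools
curl -sv -o /dev/null https://help.canary.tools/ \
     --resolve help.canary.tools:443:104.16.55.111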

This is somewhat expected, since Zendesk is configured to think it's serving requests for help.canary.tools.

Without this option, Zendesk rewrites all relative URIs in the KB to yet another namespace - https://thinkst.zendesk.com/* - which brought its own set of challenges, complexity and non-deterministic behavior.

To avoid confusion and further issues down the line we imposed a design constraint on ourselves - a simplifying assumption: the browser's address bar should only ever display help.canary.tools; the thinkst.zendesk.com namespace should never leak to customers.

Committed to this approach, the next hurdle we faced was Server Name Indication (SNI). 

Server Name Indication (SNI) is an extension to the Transport Layer Security (TLS) computer networking protocol by which a client indicates which hostname it is attempting to connect to at the start of the handshaking process. This allows a server to present multiple certificates on the same IP address and TCP port number and hence allows multiple secure (HTTPS) websites (or any other service over TLS) to be served by the same IP address without requiring all those sites to use the same certificate.

CloudFront was doing exactly what it was configured to do. It connected to (and negotiated SNI for) thinkst.zendesk.com, which resulted in a 403 error because Zendesk expects SNI for help.canary.tools.

For any of this to work, CloudFront needed to connect to thinkst.zendesk.com (104.16.55.111), but negotiate SNI for help.canary.tools. By any other name - we needed "SNI spoofing" (not really a thing - I just coined the phrase).
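From the command line this is trivial; a sketch with openssl's s_client, connecting to Zendesk's IP while naming help.canary.tools in the handshake:

# Open a TLS connection to Zendesk's IP, but place help.canary.tools
# in the SNI extension of the ClientHello
openssl s_client -connect 104.16.55.111:443 -servername help.canary.tools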

Can CloudFront do that? No, it can't :( And just like that we had to rethink our approach - CloudFront was not the solution.

Another failed approach was setting the Host mapping field in Zendesk to kb.canary.tools. It might have worked, but for a bug in Zendesk: it fails to re-generate SSL certificates when the Host mapping field is updated in their admin console, so browsing to https://kb.canary.tools was met with certificate validation errors. How long does it take Zendesk to rotate certificates? We don't know (but it's more than 30 minutes).

There were just too many moving parts in too many systems to allow us to sanely and consistently reason about the customer experience.

In retrospect, the root cause of all our problems was still related to namespacing: both CloudFront and Zendesk (rightfully) believed they were authoritative for the hostname help.canary.tools:

  • From the perspective of the entire Internet, help.canary.tools points to CloudFront.
  • From the perspective of CloudFront, help.canary.tools points to Zendesk.

So if both systems share the same name in public, how do the two systems address each other in private?

The answer was some form of split-horizon DNS. The least sophisticated version would've been to simply hack the /etc/hosts file on the host serving requests for help.canary.tools (see the sketch below), but this functionality exists natively in Nginx's upstream{} block. Of course, those IP addresses could change, but this is a manageable risk that can be remediated in minutes; in contrast, round-trip times on tickets to Zendesk are measured in days.
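For illustration, the /etc/hosts version of the hack would be a single line (using one of the Zendesk IPs from the dig output below):

# /etc/hosts on the proxy host: privately pin help.canary.tools
# to Zendesk's IP, regardless of what public DNS says
104.16.55.111    help.canary.tools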

The proxy_ssl_server_name option enables SNI on the proxied connection, and the $kb_uri variable uses the map module (ngx_http_map_module) to perform lookups/rewrites on URLs in the https://help.canary.tools/knowledge_base/* namespace.

In the end, the Nginx configuration necessary to address our needs was as simple as this:

map $request_uri $kb_uri {
    default "";
    # old Groove path -> new Zendesk article path
    /knowledge_base/topics/url1 /hc/en-gb/articles/360002426777;
    # ....
}

upstream help.canary.tools {
    # dig help.canary.tools
    server 104.16.51.111:443;
    server 104.16.52.111:443;
    server 104.16.53.111:443;
    server 104.16.54.111:443;
    server 104.16.55.111:443;
}

server {
    listen 443 ssl;
    server_name help.canary.tools;

    location / {
        # Proxy to the Zendesk IPs above, while negotiating SNI
        # (and sending Host:) for help.canary.tools
        proxy_pass https://help.canary.tools/;
        proxy_ssl_server_name on;
    }

    location /knowledge_base/ {
        # Old KB URLs get a permanent redirect to the mapped article
        if ($kb_uri != "") {
            return 301 https://help.canary.tools$kb_uri;
        }
        # Anything unmapped lands on the KB home page
        return 302 https://help.canary.tools;
    }
}
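As a sanity check, a request against the placeholder mapping above should now come back as a redirect (hypothetical URL, taken from the map block):

# Expect: HTTP/1.1 301 Moved Permanently, with
# Location: https://help.canary.tools/hc/en-gb/articles/360002426777
curl -sI https://help.canary.tools/knowledge_base/topics/url1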

Where are we now?

https://help.canary.tools is now Nginx running on EC2. It's all Dockerized and Terraformed, so the configuration and deployment are reproducible in minutes.



Nginx SSL certificate renewals and refreshes are automated using CertBot (thanks to this guide). Down the line we can add mod-security, giving us a level of visibility into potential attacks against the site - a level of visibility that would have been unattainable had CloudFront stayed in front of everything.

Using Docker's native support for AWS CloudWatch, all Nginx access logs end up in CloudWatch, which gives us dashboarding, metrics and alarming for free.
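A sketch of the relevant Docker bits - the log group and region names here are hypothetical, not our actual configuration:

# Run the proxy with Docker's awslogs driver so access logs
# stream straight to a CloudWatch log group
docker run -d --name help-proxy -p 443:443 \
    --log-driver=awslogs \
    --log-opt awslogs-region=eu-west-1 \
    --log-opt awslogs-group=help-canary-tools-nginx \
    nginx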


We now get alerted every time a customer attempts to access a missing URL under https://help.canary.tools/knowledge_base/*. Meanwhile, the customer doesn't get an ugly "Not Found" error message - they are redirected to our Knowledge Base home page, where they can simply use the search function. This has already paid dividends in helping us rectify missing mappings.
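One way to wire up that alerting (a sketch with hypothetical names - the 302s are the unmapped /knowledge_base hits from the Nginx config above):

# Turn unmapped /knowledge_base requests (the 302 fallback) into a
# CloudWatch metric that an alarm can watch
aws logs put-metric-filter \
    --log-group-name help-canary-tools-nginx \
    --filter-name kb-missing-mapping \
    --filter-pattern '"/knowledge_base" "302"' \
    --metric-transformations \
        metricName=KBMissingMapping,metricNamespace=HelpProxy,metricValue=1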


From CloudWatch we can directly drill down into Nginx access logs to examine any anomalous behavior.



This is in stark contrast to the world where the application layer was opaque to us - bad user experiences and broken links would have gone completely unnoticed.

Control matters. This is why.

