How to Scale APIs in the cloud with Security, Reliability and Quality of Service

Companies are moving to the cloud faster than ever, and the motivation to do so has never been stronger.

Netflix was one of the first companies to put APIs in the cloud, so it ran into many of the issues that others would later face before anyone else did.

Microservices

The key takeaway from Netflix’s experience could be microservices. There are many lessons, but the chief one is that a microservices architecture made scaling to millions and even a billion users possible.

There are many definitions of microservices, some of them pedantic, but the most important distinction is the need to start fast and operate lean. Container technology married with microservices is a natural fit. A Docker container can be instantiated and start operating in less than 100 milliseconds. A VM takes 100 times as long or more.

In order to make containers and microservices work well you need a lot of other pieces and you need to build your microservice functionality so it is lean.

Lean means the functionality does not depend too much on other services. A lot of cross-dependence creates a lot of chatter and waiting. If a service is bulky, in that it requires a lot of other things to start up or large amounts of data to operate, then startup time and response time will suffer.

Now you’ve built your API lean and mean using a microservice architecture

The problems many companies run into come after this part. They can write their APIs, and they use best practices to make those APIs easy to use.

You need to make the service secure

Your service will need to follow best practices for security. This means standard things like creating multiple subnets for different components to protect data and services from simple attacks. You need to scan for viruses, malware and the like. You need to check that the open source software your code depends on is at known-safe versions, so that you do not ship software with known flaws. You need to detect intrusions and monitor logs to ensure that only appropriate activity is going on.
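The open source version check above can be sketched as a simple lookup. This is a minimal illustration, not a real scanner: the function name, the package names and the flagged-version list are all hypothetical, and in practice the flagged list would come from a vulnerability feed you trust.

```python
def find_flagged_dependencies(installed, flagged):
    """Return (name, version) pairs from `installed` that are listed in `flagged`.

    installed: dict mapping package name -> installed version string
    flagged:   dict mapping package name -> set of version strings known to be unsafe
    Both structures are assumptions made for this sketch.
    """
    hits = []
    for name, version in sorted(installed.items()):
        # A package is flagged only on an exact version match.
        if version in flagged.get(name, set()):
            hits.append((name, version))
    return hits


if __name__ == "__main__":
    # Hypothetical package names and versions, for illustration only.
    installed = {"libfoo": "1.2.0", "libbar": "3.4.1", "libbaz": "0.9.0"}
    flagged = {"libfoo": {"1.2.0", "1.2.1"}, "libbaz": {"0.8.0"}}
    for name, version in find_flagged_dependencies(installed, flagged):
        print(f"{name} {version} is flagged; upgrade before deploying")
```

Running a check like this on every build, rather than by hand, is what keeps it from being skipped.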

Netflix says you need to assume your network is compromised. Today there are so many ways into your infrastructure that it is best to scan continuously, look for signs of bad or inappropriate behavior all the time, and assume there may be bad actors in your network.

You need to make sure that certificates for containers are valid and that your APIs enforce TLS, HTTP Strict Transport Security (HSTS) and current protocol versions such as TLS 1.2. You may want to perform penetration tests and other intrusion tests using standard tools.

You need to make sure your databases are secure and possibly encrypted, that file systems are secured to the extent they need to be, and that authentication and authorization are being done right.
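As a rough sketch of what enforcing a minimum TLS version and HSTS can look like in code, the example below uses Python's standard `ssl` module. The helper names and the one-year `max-age` are assumptions chosen for illustration; a real service would also load its certificate and key, which is omitted here.

```python
import ssl


def make_server_tls_context(min_version=ssl.TLSVersion.TLSv1_2):
    """Build a server-side TLS context that refuses anything older than TLS 1.2."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = min_version
    # A real server would also call context.load_cert_chain(...) with its
    # own certificate and key files; that step is left out of this sketch.
    return context


def hsts_header(max_age_seconds=31536000, include_subdomains=True):
    """Return the HSTS response header as a (name, value) pair.

    The one-year max-age default is an assumption, not a recommendation
    taken from the article.
    """
    value = f"max-age={max_age_seconds}"
    if include_subdomains:
        value += "; includeSubDomains"
    return ("Strict-Transport-Security", value)


if __name__ == "__main__":
    ctx = make_server_tls_context()
    print(ctx.minimum_version)
    print(hsts_header())
```

The HSTS header is sent on HTTPS responses and tells browsers to refuse plain HTTP for that host in future; it complements the TLS settings rather than replacing them.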

You may want to do code scanning for potential security flaws.

This may seem like a lot, but security is a huge concern for companies and their customers today. If you run an insecure service, customers will leave and you will not succeed in the end. Security is a selling point. This is only the basic list, not a complete one. There is a raft of additional measures you can take.

Security as specified above is a huge bottleneck for IT. If the security practices above are not automated, you will find yourself cutting security short. This is a huge mistake. Automating all your security processes is a non-trivial amount of work. Sorry. If you don’t do it, eventually somebody will find the gap and pierce your security. If you try to do it by hand, you will eventually make a mistake and become next week’s story.

Integrating security into your DevOps automation, described below, is important for enforcing best practices and essential to actually being secure.

You need a container management framework

Once you have a number of components, deploying them by hand is no longer fun or safe. Similarly, monitoring them and numerous other management tasks are best done by a container management framework. This is what Docker Swarm, Kubernetes, Mesos and other products do. You need this technology to run a service with more than a couple of containers. With all the security components mentioned above, you need a container management system right from the start.

The container management framework is critical for some security functions as well.  It provides the ability to deploy containers safely.

You need to make it reliable

Reliable software is something most of us have cut our teeth on over the years. The familiar tools are active/active and active/passive servers, service discovery and registration, heartbeats, transaction protocols, replication of databases and other means.

Most likely you will use a service registry and heartbeat with active/active servers for most of your reliability.  You may also want to bring in a message queue or some other transactional system.   Databases typically have their own reliability mechanisms.   Your service needs to support these components and all the other pieces they need to implement fast recovery of failed components and hardware.

Today, a lot of this can be implemented by policy and with standard components like ZooKeeper, Consul and etcd, which allow you to monitor heartbeats, set up automatic restarts and restore configuration quickly.
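To make the heartbeat idea concrete, here is a minimal in-memory sketch. It is not how ZooKeeper, Consul or etcd are implemented or used; the class name, method names and the 10-second timeout are assumptions for illustration only.

```python
import time


class HeartbeatRegistry:
    """Track the last heartbeat of each instance and report which ones are overdue."""

    def __init__(self, timeout_seconds=10.0, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock          # injectable so the logic can be tested without sleeping
        self._last_seen = {}

    def heartbeat(self, instance_id):
        """Record that `instance_id` is alive right now."""
        self._last_seen[instance_id] = self._clock()

    def healthy(self):
        """Instances whose last heartbeat is within the timeout."""
        now = self._clock()
        return sorted(i for i, t in self._last_seen.items() if now - t <= self._timeout)

    def expired(self):
        """Instances that missed the timeout and are candidates for restart."""
        now = self._clock()
        return sorted(i for i, t in self._last_seen.items() if now - t > self._timeout)


if __name__ == "__main__":
    registry = HeartbeatRegistry(timeout_seconds=10.0)
    registry.heartbeat("api-1")
    print(registry.healthy())
```

A restart policy would poll `expired()` and relaunch whatever it returns. In production you would rely on one of the components named above rather than a hand-rolled registry like this one.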

You need scalability

You need components to help you scale your API service. Having a microservice architecture doesn’t automatically make it scalable. Kubernetes, for instance, can monitor load and keep it balanced across a set of containers in a cluster. You can say, for instance, that when the load on these containers goes above 50% on any server, a new instance should be created automatically. You can use this to scale a single microservice to multiple instances and keep response time consistent across your service even as load builds rapidly.

Sometimes this type of scaling is not enough. You may need a more policy-based scaling component that can scale multiple components in different ways depending on different load indicators.
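The 50% rule above can be sketched as a small calculation. It follows the proportional formula that Kubernetes documents for its Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current load / target load)), but the function itself, its name and the replica bounds are assumptions for illustration, not Kubernetes code.

```python
import math


def desired_replicas(current_replicas, current_load, target_load=0.50,
                     min_replicas=1, max_replicas=20):
    """Return how many instances are needed to bring average load back to the target.

    current_load and target_load are fractions of capacity (0.75 means 75%).
    The min/max bounds are hypothetical defaults chosen for this sketch.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    raw = math.ceil(current_replicas * current_load / target_load)
    # Clamp so a load spike cannot scale without limit and idle load cannot scale to zero.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Four instances averaging 75% load against a 50% target.
    print(desired_replicas(4, 0.75))
```

The upper bound matters for cost as much as for stability: without it, a runaway client can scale your bill along with your service.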

You need DevOps Fullstack Automation with security

Once you’ve picked all these components to have your API secure, reliable and scalable you need to automate all this.

The fact is that your development, test, staging and production environments need to be completely in sync in order to be sure that when you deploy something it actually works. To achieve the 13X improvement in agility the cloud can give you, you need a finely automated infrastructure with which you can replicate stacks of components reliably, over and over.

You may want to deploy multiple production environments for different regions, different customers or for any number of reasons. Different APIs may share a lot of similar underlying infrastructure. If you have multiple APIs, automating your stacks will make it easier to deploy new APIs and new services. You may want to run a synthetic test of your APIs or services with 1000 times the load you expect, as AWS, Netflix and other sophisticated cloud services do.

It is critical that the DevOps automation include automatic security configuration and checks; otherwise mistakes will be made. People are notoriously bad at doing routine, tedious things over and over. Expecting that you can do security as a separate function from DevOps is a mistake.
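One way to picture security checks living inside the deployment automation is a gate that refuses to deploy when any check fails. The sketch below is a generic illustration, not the API of any particular CI/CD tool; `run_security_gate`, `deploy`, the check names and `deploy_fn` are all hypothetical.

```python
def run_security_gate(checks):
    """Run each (name, callable) check and return the names of the ones that failed.

    A check passes only if it returns a truthy value; a check that raises
    is treated as a failure rather than being silently skipped.
    """
    failures = []
    for name, check in checks:
        try:
            passed = bool(check())
        except Exception:
            passed = False
        if not passed:
            failures.append(name)
    return failures


def deploy(artifact, checks, deploy_fn):
    """Deploy `artifact` with `deploy_fn` only if every security check passes."""
    failures = run_security_gate(checks)
    if failures:
        raise RuntimeError(
            "deployment blocked by failed security checks: " + ", ".join(failures)
        )
    return deploy_fn(artifact)


if __name__ == "__main__":
    # Hypothetical checks; real ones would call your scanners and config validators.
    checks = [("tls_enforced", lambda: True), ("no_flagged_dependencies", lambda: True)]
    print(deploy("api-v2", checks, deploy_fn=lambda a: f"deployed {a}"))
```

Because the gate sits on the only path to production, nobody has to remember to run the checks.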

You need to upgrade the stack of all these things regularly

If you have done all of the above, you need to do something even harder. If you have 20 different components in the stack for your APIs or services, and each component is upgraded 2 or 3 times a year (which is typical), then you will have 40 to 60 upgrades a year to do.

In the past many companies were scared to do upgrades because they put the production environment at risk.   Therefore, companies let the components become stale and let upgrades go for a year or more before trying a big upgrade.

The problem with this approach is that it is extremely painful when you finally decide to do the upgrades.   It is also somewhat impractical in a microservices architecture.

Components are upgraded frequently, and these upgrades provide performance improvements, security patches or bug fixes that your customers will need. Since a microservices architecture shares services across components, if one consumer needs the upgrade you will end up upgrading that microservice for everything that uses it. You will therefore be forced to upgrade more frequently, and to test your whole stack and all its components with each new upgrade.

Since so many upgrades are coming through, you will need a way to do them with minimal interruption. Several other components can help you apply upgrades and security patches without bringing down any users of your services.

You need to have test suites for the stacks not just the API

Since you will be upgrading your components many times a year, you will want to build test suites that exercise the whole stack of components in your service to see whether any upgrade or security patch breaks some part of it. A new upgrade may seem great, and it may work fine when you test it against your application, but other components in the stack that are not your responsibility may use the upgraded component and may not work with it. So, before you deploy anything to production, you need test suites for your stacks that you can run against a deployed copy of your production environment to see if anything breaks.

You need to keep the automation and the stack test suites up to date

Over time, component upgrades, new features and added components will mean you have to maintain your automation. This is a tedious and risky job. Each change to the automation is essentially like an upgrade of a component, and you need to test the automation and the resulting deployments as you would for an upgrade or security patch.

As you detect flaws in your test suites, you need to fix them and keep them up to date. You need to be constantly improving them to cover new test cases and potential issues.
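Deciding which parts of the stack to retest after an upgrade can be sketched as a walk over the dependency graph. This is an illustrative sketch under stated assumptions: the function name, the shape of the `depends_on` mapping and the component names are all hypothetical.

```python
from collections import deque


def affected_by_upgrade(depends_on, upgraded):
    """Return every component that directly or indirectly depends on `upgraded`.

    depends_on: dict mapping component name -> list of components it uses.
    """
    # Invert the graph: for each component, who uses it?
    dependents = {}
    for component, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(component)

    # Breadth-first walk outward from the upgraded component.
    seen = set()
    queue = deque([upgraded])
    while queue:
        current = queue.popleft()
        for user in dependents.get(current, ()):
            if user not in seen:
                seen.add(user)
                queue.append(user)
    seen.discard(upgraded)  # only relevant if the graph contains a cycle
    return sorted(seen)


if __name__ == "__main__":
    # Hypothetical stack, for illustration only.
    stack = {
        "gateway": ["billing-api", "catalog-api"],
        "billing-api": ["auth-service", "postgres"],
        "catalog-api": ["postgres"],
        "auth-service": ["redis"],
    }
    print(affected_by_upgrade(stack, "redis"))
```

The stack test suite would then run at least the tests for the returned components, including ones that are not your responsibility.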

You need to understand the costs of the components in your stack and to manage that cost

Finally, you have all of this infrastructure working and your APIs are secure, reliable, scalable, upgradable and automated. You can make 13X as many changes to your APIs as before and you are seeing a successful service that is growing. You are ecstatic.

However, you notice that the costs of your infrastructure are growing non-linearly, in a bad way. You get a bill from your cloud provider that shows you thousands of lines of detail, but you have no idea how to map those servers and that usage to the underlying services. You can’t figure out what is costing the most money or whether that is reasonable.

You need to implement data gathering and instrument all your services so they produce information you can use to diagnose what is using what and how you could save money. Maybe some services shouldn’t be scaled arbitrarily, and you should cap their scaling at a certain amount. Maybe a configuration change, or a change in how often something runs, could drastically improve costs. Possibly a component you selected is too expensive, using far more resources than it should. Maybe that is a bug, or maybe a different component that does substantially the same job is less expensive.

You need to instrument your services and make that instrumentation part of your automation. You also need to build analysis tools, or use existing ones, to mine the data and find where your problems are and how you might be able to save money.
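A first analysis step is often just rolling the bill up by service. The sketch below assumes each billing line has already been reduced to a dict with a `cost` and a `tags` mapping; that shape, the `service` tag key and the service names are assumptions for illustration and do not match any specific cloud provider's billing export.

```python
from collections import defaultdict


def cost_by_service(billing_lines, tag_key="service"):
    """Sum cost per service tag and return (service, total) pairs, most expensive first.

    Lines without the tag are grouped under "untagged" so they stay visible.
    """
    totals = defaultdict(float)
    for line in billing_lines:
        service = line.get("tags", {}).get(tag_key, "untagged")
        totals[service] += line["cost"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical billing lines, for illustration only.
    lines = [
        {"resource": "vm-1", "cost": 120.0, "tags": {"service": "checkout-api"}},
        {"resource": "vm-2", "cost": 30.0, "tags": {"service": "search-api"}},
        {"resource": "db-1", "cost": 45.5, "tags": {"service": "checkout-api"}},
        {"resource": "disk-9", "cost": 10.0, "tags": {}},
    ]
    for service, total in cost_by_service(lines):
        print(f"{service}: {total:.2f}")
```

This only works if tagging is part of the deployment automation. A large "untagged" total is itself a finding, because it is spend you cannot attribute to any service.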

Conclusion

In the end, building a successful, scalable, reliable and secure service and API using microservices is harder than you may have realized. Numerous companies that are building these services are facing the issues above today.

Yenlo Inc provides consulting and support for companies trying to build services like those described above. Call us if you have any questions or are looking for someone to help you do this faster and cheaper.
