Cloud-native deployments have transformed software development. According to the Cloud Native Computing Foundation’s 2021 Annual Survey, 96% of organizations are using or evaluating Kubernetes, and 5.6 million developers are now using it, a 67% increase over the previous year.
Cloud-native architectures enable loosely coupled services to be scalable, manageable, and observable. When combined with automation, cloud-native capabilities can also enable frequent, high-impact changes with minimal disruption.
But while more and more developers are embracing cloud-native deployments, the technology is still relatively new in the telco business support systems (BSS) space. Teams deploying cloud-native applications there face several challenges, especially with stateful applications such as those within BSS.
Let’s take a look at the main challenges enterprises face today with running cloud-native deployments.
Table Of Contents
- Cloud-native challenge #1: Running Helm charts
- Cloud-native challenge #2: CI/CD
- Cloud-native challenge #3: Auto-scaling
- Cloud-native challenge #4: Service meshes
- Cloud-native challenge #5: Cluster upgrades
- Cloud-native challenge #6: Multisite XDCR upgrades
- Cloud-native challenge #7: Security
- Cloud-native challenge #8: Operations and troubleshooting
- Cloud-native is still the future
Cloud-native challenge #1: Running Helm charts
Traditionally, BSS deployments and upgrades have been a disaster: full upgrades take anywhere from 12 to 18 months, and provisioning environments takes days or weeks. With the right approach, though, cloud-native application deployments promise to entirely disrupt this model with seamless upgrades that don’t interrupt service and take a fraction of the usual time.
However, an important part of setting up automated cloud-based deployments to accelerate upgrades is having the ability to quickly spin up test or even full-scale production environments.
Helm charts are one way to do this. With Helm, you should be able to get a system up and running to the desired size and specifications with the touch of a button. But in order for this to work, the organization’s entire tech stack, including its data platform, needs to work together seamlessly—which can be a challenge.
Assuming you can meet this challenge, it’s possible to spin up dev and test environments as needed using Helm and destroy them when you’re done. This is useful for sharing hardware and keeping public cloud costs down, since you pay only for resources while you’re actually using them.
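As a rough illustration, the sketch below drives the Helm CLI from Python to install an environment sized via value overrides and tear it down afterwards. The chart name (bss/policy-charging), namespace, and value keys are hypothetical placeholders, not a real product chart.

```python
import subprocess

def run(cmd):
    """Run a CLI command and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def spin_up_test_env(release: str, namespace: str):
    # Install a chart sized for a small test environment.
    # "bss/policy-charging" and the value keys are placeholders.
    run([
        "helm", "install", release, "bss/policy-charging",
        "--namespace", namespace, "--create-namespace",
        "--set", "replicaCount=3",
        "--set", "resources.requests.cpu=2",
        "--wait", "--timeout", "15m",
    ])

def tear_down_test_env(release: str, namespace: str):
    # Destroy the environment when testing is done to stop paying for it.
    run(["helm", "uninstall", release, "--namespace", namespace])

if __name__ == "__main__":
    spin_up_test_env("bss-test", "bss-test")
    # ... run tests against the environment here ...
    tear_down_test_env("bss-test", "bss-test")
```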
Cloud-native challenge #2: CI/CD
Using Helm to manage deployments and upgrades only gets you so far, though. Multiple geo-redundant clusters have to be managed, with any changes orchestrated across them to ensure continuous service. Continuous integration and continuous delivery (CI/CD) pipelines can add significant business value here via automated deployments that relieve teams of the burden of manual configuration.
CI/CD tools automate a pipeline of steps for initiating changes such as upgrades, along with other tasks otherwise performed manually on production deployments. Instead of having a developer work through a 100-step process at 3 a.m. to complete an upgrade, making mistakes out of exhaustion along the way, CI/CD takes care of it all.
In order to unlock the promise of CI/CD, though, development teams need to ensure their tools work together seamlessly.
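For instance, one pipeline stage might roll an upgrade across geo-redundant clusters one at a time, waiting for each rollout to finish before moving on. The sketch below assumes the clusters are reachable as named kubectl contexts; the context names, release, chart, and version are placeholders.

```python
import subprocess

CLUSTERS = ["site-a", "site-b", "site-c"]                        # placeholder kubectl contexts
RELEASE, CHART, VERSION = "bss", "bss/policy-charging", "2.4.0"  # placeholders

def run(cmd):
    subprocess.run(cmd, check=True)

def upgrade_cluster(context: str):
    # Point kubectl/helm at one cluster, upgrade it, and wait for completion.
    run(["kubectl", "config", "use-context", context])
    run([
        "helm", "upgrade", "--install", RELEASE, CHART,
        "--version", VERSION, "--namespace", "bss",
        "--wait", "--timeout", "30m",
    ])

def main():
    # Upgrade one site at a time so the others keep serving traffic.
    for context in CLUSTERS:
        upgrade_cluster(context)
        # A real pipeline would run smoke tests here before continuing.

if __name__ == "__main__":
    main()
```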
Cloud-native challenge #3: Auto-scaling
When companies deploy applications on a public cloud, they pay for the resources they use (e.g., per hour).
Typically, policy and charging volumes are much lower at night than during the day, and traffic patterns also fluctuate seasonally and as the user base grows. That being the case, communications service providers (CSPs) don’t want to pay from day one for capacity sized to the busiest hours of future volumes on the most expensive resources. Instead, they want to pay for the resources they actually use, no more and no less, in a cost-effective manner.
This is where autoscaling can be particularly helpful. With autoscaling, applications are always scaled to match current traffic demands, scaling up and down dynamically as requirements change.
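For stateless parts of the stack, Kubernetes provides this out of the box via the Horizontal Pod Autoscaler. Here is a minimal sketch using the official Python client; the deployment name, namespace, and thresholds are illustrative assumptions only.

```python
from kubernetes import client, config

def create_hpa(namespace: str = "bss", deployment: str = "charging-frontend"):
    """Create a CPU-based HorizontalPodAutoscaler for a (hypothetical) deployment."""
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    hpa = client.V1HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name=f"{deployment}-hpa"),
        spec=client.V1HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V1CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name=deployment,
            ),
            min_replicas=2,   # keep a floor for redundancy
            max_replicas=20,  # cap spend during traffic spikes
            target_cpu_utilization_percentage=70,
        ),
    )
    client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
        namespace=namespace, body=hpa
    )

if __name__ == "__main__":
    create_hpa()
```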
For stateful applications, however, this is problematic. While stateful applications can scale up by adding new pods, scaling down comes with a cost; you have to do something with the data stored on the pods you want to remove, and that takes time and effort. Depending on what’s involved, the effort could be more costly than just leaving the solution alone!
Additionally, Kubernetes may know nothing about your data platform’s resiliency capabilities or how partitions are spread across pods, which means it can’t safely scale pods up or down on its own when more than one pod has to change in order to preserve data redundancy. As a result, you may need to manage all of this in your Operator, which requires additional engineering effort.
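This is the kind of guard an Operator typically has to provide. The sketch below is a simplification: it only reduces a StatefulSet’s replica count after a hypothetical data_rebalanced() check confirms the departing pod’s data has been redistributed, and that check is entirely product-specific.

```python
from kubernetes import client, config

def data_rebalanced(pod_name: str) -> bool:
    """Hypothetical check against the data platform's API: has the data
    owned by this pod been safely redistributed to the remaining pods?"""
    raise NotImplementedError("depends entirely on the data platform in use")

def scale_down_stateful_set(name: str, namespace: str, target_replicas: int):
    config.load_kube_config()
    apps = client.AppsV1Api()
    sts = apps.read_namespaced_stateful_set(name, namespace)
    current = sts.spec.replicas
    # Remove one pod at a time, highest ordinal first, as StatefulSets do.
    while current > target_replicas:
        departing_pod = f"{name}-{current - 1}"
        if not data_rebalanced(departing_pod):
            raise RuntimeError(f"{departing_pod} still owns live data; aborting")
        current -= 1
        apps.patch_namespaced_stateful_set(
            name, namespace, {"spec": {"replicas": current}}
        )
```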
Cloud-native challenge #4: Service meshes
Breaking up monolithic applications, such as those for policy and charging, into several microservices increases the number of interfaces between them. And since each part of the solution is scaled independently for its own needs, more than one instance of each microservice will be running at any given time.
This is where a cloud-based service mesh can be particularly helpful. Simply put, a service mesh aims to simplify and manage traffic routing between microservices, given that the number of pods backing each one changes dynamically. A service mesh can also encrypt the traffic on those interfaces, which saves the application from having to manage encryption itself. In some cases, it can help with logging and transaction tracing, too.
Tracing, however, only really works with HTTP-based protocols: it will cover 5G interfaces, but not 4G or anything before it. And since the internal interfaces between your microservices are unlikely to be HTTP-based, the mesh won’t help with tracing those either.
Even so, some RFPs mandate the use of Istio-based service meshes for all internal and external interfaces. Requiring a mesh to manage system-internal interfaces, such as intra-cluster traffic and cross data center replication (XDCR), will likely cause performance problems that take engineering effort to resolve.
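One common workaround is to keep latency-sensitive internal traffic out of the sidecar’s path. The sketch below patches a deployment’s pod template so Istio’s sidecar skips a given outbound port; the annotation is a standard Istio one, but the deployment name and port are placeholders, and whether bypassing the mesh is acceptable depends on the requirements you’re working to.

```python
from kubernetes import client, config

def bypass_sidecar_for_port(deployment: str, namespace: str, port: int):
    """Annotate a deployment's pod template so the Istio sidecar does not
    intercept outbound traffic on the given port (e.g., replication traffic)."""
    config.load_kube_config()
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        # Istio annotation for excluding outbound ports from interception
                        "traffic.sidecar.istio.io/excludeOutboundPorts": str(port),
                    }
                }
            }
        }
    }
    client.AppsV1Api().patch_namespaced_deployment(deployment, namespace, patch)

if __name__ == "__main__":
    # "datastore" and the port number are illustrative placeholders.
    bypass_sidecar_for_port("datastore", "bss", 11210)
```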
Cloud-native challenge #5: Cluster upgrades
With 5G service-level agreements (SLAs) requiring single-digit-millisecond responses, there’s no time to route traffic to another data center over XDCR. As such, each cluster should remain in service, even during upgrades.
Cluster upgrades should happen pod by pod, which means you’ll have to manage clusters running mixed versions until every pod is upgraded. While this is a common ask, for stateful applications it requires the product to handle multiple concurrent versions within a single cluster, which is a complex problem to solve.
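For StatefulSets, Kubernetes supports this pattern through partitioned rolling updates: spec.updateStrategy.rollingUpdate.partition controls which ordinals are allowed to move to the new version. A rough sketch driven through kubectl, where the StatefulSet name, container name ("app"), and image are placeholders:

```python
import json
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

def set_partition(sts: str, namespace: str, partition: int):
    patch = {"spec": {"updateStrategy": {
        "type": "RollingUpdate",
        "rollingUpdate": {"partition": partition},
    }}}
    run(["kubectl", "patch", "statefulset", sts, "-n", namespace,
         "-p", json.dumps(patch)])

def rolling_upgrade(sts: str, namespace: str, replicas: int, new_image: str):
    """Upgrade a StatefulSet one pod at a time using a partitioned rolling update."""
    # Hold back every pod, then record the new image.
    set_partition(sts, namespace, replicas)
    run(["kubectl", "set", "image", f"statefulset/{sts}",
         f"app={new_image}", "-n", namespace])
    # Release one ordinal at a time, highest first, waiting for each to settle.
    for partition in range(replicas - 1, -1, -1):
        set_partition(sts, namespace, partition)
        run(["kubectl", "rollout", "status", f"statefulset/{sts}", "-n", namespace])
        # A real controller would also verify data health before continuing.
```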
Cloud-native challenge #6: Multisite XDCR upgrades
Everything from software version upgrades to schema changes to stored procedure changes needs to be performed while every cluster in an XDCR setup remains in service. In other words, you can’t take a cluster down and leave only one available.
Once again, pulling this off correctly requires a lot of capability in the product itself.
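At a minimum, whatever orchestrates the change has to serialize the work and gate each step on replication health. The sketch below is purely illustrative: xdcr_lag_seconds() stands in for whatever health metric the data platform actually exposes, and the thresholds are assumptions.

```python
import time

def xdcr_lag_seconds(cluster: str) -> float:
    """Hypothetical: return the current XDCR replication lag reported by the
    data platform for the given cluster. Real implementations are product-specific."""
    raise NotImplementedError

def wait_for_replication_to_catch_up(clusters, max_lag=1.0, timeout=1800):
    """Block until every cluster's XDCR lag is within bounds, so the next
    cluster can be taken through its change without risking data loss."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(xdcr_lag_seconds(c) <= max_lag for c in clusters):
            return
        time.sleep(10)
    raise TimeoutError("XDCR replication did not catch up in time")
```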
Cloud-native challenge #7: Security
Regulators are getting stricter and stricter about security and encryption on applications deployed on a public cloud. Since the policy and charging functions contain very sensitive data, regulators are increasingly scrutinizing these areas to protect consumers.
Looking ahead, it’s likely that all interfaces—and possibly even run-time memory—will have to be encrypted when applications run in the public cloud. For example, some European regulators already mandate encryption of run-time memory where sensitive data is held in a public cloud.
As security requirements increase, products will need to manage passwords, users, roles, access, and encryption certificates more effectively across the application stack. They’ll also need to ensure all traffic is encrypted while maintaining expected performance.
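Kubernetes at least gives you a standard place to keep certificates: TLS secrets that pods and ingress controllers can mount. A minimal sketch using the official Python client follows; the secret name and file paths are placeholders, and a production setup would more likely delegate issuance and rotation to a certificate manager.

```python
from kubernetes import client, config

def create_tls_secret(name: str, namespace: str, cert_path: str, key_path: str):
    """Store a TLS certificate/key pair as a standard kubernetes.io/tls secret."""
    config.load_kube_config()
    with open(cert_path) as crt, open(key_path) as key:
        secret = client.V1Secret(
            metadata=client.V1ObjectMeta(name=name),
            type="kubernetes.io/tls",
            string_data={"tls.crt": crt.read(), "tls.key": key.read()},
        )
    client.CoreV1Api().create_namespaced_secret(namespace=namespace, body=secret)

if __name__ == "__main__":
    # Names and paths here are illustrative placeholders.
    create_tls_secret("charging-api-tls", "bss", "server.crt", "server.key")
```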
Cloud-native challenge #8: Operations and troubleshooting
Breaking up applications into several microservices that dynamically scale makes operations and troubleshooting much harder. When something goes wrong, there’s every chance the instance that serviced a failed request won’t be running by the time someone tries to see what happened.
For example, auto-scaling down kills pods, and the pod that caused the problem could be long gone by the time someone gets around to troubleshooting it.
Kubernetes is designed to handle auto-healing; pods are built to fail and die and be replaced. For this reason, teams will need new monitoring and tracing tools to help understand the lay of the land. And those tools will need to be managed, too.
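Even with central tooling, it helps to capture evidence before a pod disappears. The sketch below uses the official Python client to pull the previous container’s logs and recent events for pods that have restarted; the namespace and label selector are placeholder assumptions.

```python
from kubernetes import client, config

def capture_evidence(namespace: str = "bss", label_selector: str = "app=charging"):
    """Dump previous-container logs and recent events for pods that have
    restarted, before auto-healing erases the trail."""
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(namespace, label_selector=label_selector)
    for pod in pods.items:
        for status in pod.status.container_statuses or []:
            if status.restart_count > 0:
                # Logs from the container instance that crashed, not the current one.
                logs = core.read_namespaced_pod_log(
                    pod.metadata.name, namespace,
                    container=status.name, previous=True, tail_lines=200,
                )
                print(f"--- {pod.metadata.name}/{status.name} (previous) ---")
                print(logs)
    # Recent events often explain evictions, OOM kills, and rescheduling.
    for event in core.list_namespaced_event(namespace).items:
        print(event.last_timestamp, event.reason, event.message)

if __name__ == "__main__":
    capture_evidence()
```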
Cloud-native is still the future
Despite these challenges, the future of software development remains cloud-native. That’s because the approach delivers a ton of benefits, including faster iteration, reduced costs, scalability, flexibility, automation, and more.
By being aware of the challenges inherent in cloud-native deployments and proactively working to address them, development teams can unlock the full promise of cloud-native applications—much to the delight of their users and their organizations’ bottom lines.