Migrating 70+ Microservices to Azure Kubernetes Service — A Platform Engineer's Playbook

From on-premises to cloud-native at enterprise scale

Posted by Saurabh Chaubey on Saturday, September 20, 2025

Migrating a handful of microservices to Kubernetes is one thing. Migrating over seventy — with their interconnected dependencies, legacy configurations, and the weight of years of organic growth — is an entirely different challenge. This is the story of how our platform engineering team planned and executed a large-scale migration from on-premises infrastructure to Azure Kubernetes Service (AKS) at a major enterprise insurance company, and the lessons we learned along the way.

The Starting Point

Our on-premises estate was a product of evolution rather than design. Over the years, the organisation had built a microservices architecture that ran across a fleet of virtual machines managed by a combination of Ansible scripts and manual processes. Deployments involved SSH-ing into machines, running shell scripts, and hoping that the environment variables were configured correctly. Some services had automated deployments through Jenkins; many did not.

The infrastructure worked, but it was showing its age. Scaling was slow and manual. Environment parity between development, staging, and production was aspirational at best. Deploying a new service meant provisioning VMs, configuring load balancers, setting up monitoring — a process that could take two weeks or more. And troubleshooting issues often came down to “it works on my machine” because the environments were genuinely different.

The decision to move to Kubernetes wasn’t made lightly. It came after months of evaluation, proof-of-concept work, and building the business case. We chose AKS specifically because the organisation was already invested in the Azure ecosystem, and AKS offered a managed control plane that reduced our operational burden. We weren’t interested in running our own Kubernetes clusters — we wanted to focus on what ran on the clusters, not the clusters themselves.

Assessment and Planning

Before migrating a single service, we spent three months on assessment and planning. This phase was unglamorous but absolutely essential.

Service Inventory

The first task was understanding what we actually had. It sounds obvious, but in a large organisation, the answer to “how many microservices do we have?” is surprisingly hard to pin down. Services had been created by different teams over several years, documentation was inconsistent, and some services were running but effectively abandoned — still consuming resources but no longer actively maintained.

We built a comprehensive inventory that captured each service’s technology stack, resource requirements, dependencies, data persistence needs, and criticality. We categorised services into tiers: Tier 1 (business-critical, high-traffic), Tier 2 (important but not customer-facing), and Tier 3 (internal tools and batch jobs).
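To keep the inventory comparable across teams, each service got a structured record covering the attributes above. A minimal sketch of one entry — the field names and values here are illustrative, not our actual schema:

```yaml
# Illustrative inventory entry. Field names and values are
# hypothetical, shown only to convey the shape of the record.
service: policy-quote-api
team: underwriting-platform
stack: mule-runtime            # technology stack
tier: 1                        # 1 = business-critical, 2 = important, 3 = internal
resources:
  cpu: "500m"
  memory: "1.5Gi"
dependencies:
  - policy-db                  # on-prem database, not migrating
  - document-service
persistence: none              # stateless; state lives in backing services
status: active                 # active | abandoned-candidate
```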

Containerisation Readiness

Not all services were equally ready for containerisation. Most were MuleSoft applications and APIs, which needed careful handling: MuleSoft runtimes have specific memory and configuration requirements, and many of our APIs depended on shared domains, custom connectors, and environment-specific property files. We also had a few legacy Node.js apps with hardcoded file paths and batch processing jobs that assumed access to network file shares.

For each service, we assessed containerisation complexity on a simple scale: low (already follows 12-factor patterns), medium (needs configuration externalisation), and high (requires code changes or architectural modifications). About 60% fell into the low category, 30% were medium, and 10% were high — with many of the MuleSoft APIs falling into the medium category due to their configuration and connector dependencies.

Migration Strategy: Lift-and-Shift vs Re-Platform

We made a pragmatic decision early on: this migration was about getting services onto Kubernetes, not about rewriting them. We adopted a “lift-and-shift with guardrails” approach. Services would be containerised as-is wherever possible, with modifications limited to what was necessary for container compatibility — externalising configuration, removing filesystem dependencies, and ensuring graceful shutdown handling.

The “guardrails” part meant that while we weren’t rewriting services, we were establishing standards. Every migrated service would have health check endpoints, structured logging, Dynatrace monitoring integration, and a standardised pipeline template. If a service didn’t have these, we’d add them as part of the migration — but we wouldn’t refactor the business logic.
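In Kubernetes terms, the health-check and graceful-shutdown guardrails boil down to a few standard fields stamped into every deployment manifest. A minimal sketch, assuming a service that exposes liveness and readiness endpoints (the paths and port here are illustrative, not our actual template):

```yaml
# Sketch of the probe and shutdown settings from the standard
# deployment template. Endpoint paths and port are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      # Give in-flight requests time to drain before SIGKILL.
      terminationGracePeriodSeconds: 30
      containers:
        - name: app
          image: example-service:1.0.0
          ports:
            - containerPort: 8081
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8081
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8081
            periodSeconds: 5
```

The readiness probe is what makes rolling deployments safe: traffic only reaches a pod once it reports ready, which is exactly the behaviour the old SSH-and-shell-script deployments lacked.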

This approach let us move fast. Trying to modernise seventy services simultaneously would have turned a migration into a multi-year rewrite programme.

Networking Challenges

Networking was, predictably, one of the hardest aspects of the migration. Our on-premises services communicated over a flat network with relatively simple firewall rules. Moving to AKS introduced a new networking model that we had to design carefully.

Cluster Networking

We chose Azure CNI Overlay networking for our AKS clusters. With CNI Overlay, pods receive IP addresses from a private CIDR overlay network rather than directly from the VNet subnet. This was a significant advantage for our scale — with 70+ services, multiple replicas per service, and several environments (dev, staging, production), CNI Overlay eliminated the IP exhaustion concerns that come with traditional Azure CNI, where every pod consumes a VNet IP address. Our subnet sizing could remain manageable while supporting significant pod density.

The Azure landing zone was connected to the organisation’s on-premises network via ExpressRoute, providing a private, high-bandwidth, low-latency connection to on-premises databases and legacy systems that weren’t migrating to the cloud. This was critical because many of our MuleSoft APIs needed to communicate with backend systems that remained on-premises.

All traffic between the cloud environment and the on-premises network was secured through Azure Firewalls, which provided network-level filtering, threat intelligence, and centralised logging of all cross-boundary traffic. The firewall rules were managed as code and reviewed as part of our change management process, ensuring that connectivity changes were auditable and controlled.

Service Mesh Considerations

We evaluated Istio and Linkerd for service mesh capabilities but ultimately decided against introducing a service mesh during the initial migration. The reasoning was simple: the migration itself was complex enough without adding another layer of infrastructure to learn and operate. We used Kubernetes-native services and ingress controllers for traffic management, with the option to introduce a service mesh later once the migration was stable.

This decision was controversial within the team, but I stand by it. You can always add complexity later; removing it is much harder.

DNS and Service Discovery

On-premises, services discovered each other through a combination of environment variables, configuration files, and in some cases, hardcoded IP addresses. In Kubernetes, we used internal DNS (service names) for intra-cluster communication and configured external-dns for services that needed to be reachable from outside the cluster.
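Concretely, intra-cluster callers switch from hardcoded IPs to the Kubernetes DNS name (`<service>.<namespace>.svc.cluster.local`), while external-dns watches annotated resources and manages the external records. A sketch of the pattern — the hostname and namespace are illustrative:

```yaml
# Intra-cluster callers reach this service at
#   example-service.example-ns.svc.cluster.local
# external-dns creates the external record from the annotation.
apiVersion: v1
kind: Service
metadata:
  name: example-service
  namespace: example-ns
  annotations:
    external-dns.alpha.kubernetes.io/hostname: example-service.internal.example.com
spec:
  type: LoadBalancer
  selector:
    app: example-service
  ports:
    - port: 80
      targetPort: 8081
```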

Migrating DNS was one of those tasks that sounds trivial but consumed weeks of effort. Every service had to be updated to use the new endpoints, and we had to maintain backward compatibility during the transition period when some services were on-premises and others were in AKS.

Secrets Management

Secrets management was a critical workstream. On-premises, secrets were stored in a mix of places — environment variables baked into VM images, files on shared drives, and entries in a legacy vault solution. The migration required rethinking how credentials were handled, particularly for our MuleSoft applications running on Runtime Fabric.

Unlike a traditional Kubernetes deployment where you might mount secrets directly into pods via the Azure Key Vault CSI driver, MuleSoft Runtime Fabric manages secrets through its own mechanisms. We used a two-tier approach: infrastructure credentials (database connection strings, API keys) were set as secure properties at the namespace level using the rtfctl CLI, while application-level secrets in Mule property files were encrypted at rest using MuleSoft’s Secure Configuration Properties module. This gave us defence in depth without relying on Kubernetes-native secrets.

The migration of secrets was still painstaking — for each service, we had to identify every secret, determine which tier it belonged to, and provision accordingly. We built automation to scan service configurations and categorise secrets, but it still required manual verification for each service.

One lesson we learned the hard way: audit your secrets before migrating them. We discovered several services using secrets that were years old and no longer valid, and a few cases where multiple services shared the same credential (a security anti-pattern we cleaned up during the migration).

Observability

Moving to Kubernetes gave us an opportunity to standardise our observability stack. On-premises, monitoring was fragmented — some services used Sumo Logic, others used custom logging solutions, and a few had no meaningful monitoring at all.

For the AKS environment, we adopted Dynatrace as our single-pane-of-glass observability platform. We deployed the Dynatrace Operator on our AKS clusters, which provided automatic instrumentation of pods with OneAgent, enabling deep visibility without requiring code changes in individual services.
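Rolling out the Operator amounts to installing it into its own namespace and applying a DynaKube custom resource that points at the Dynatrace environment. A minimal sketch — the tenant URL is a placeholder and the field layout follows the v1beta1 CRD, so treat it as indicative rather than exact:

```yaml
# DynaKube custom resource (sketch). The tenant URL is a
# placeholder; cloudNativeFullStack injects OneAgent into pods
# automatically, giving instrumentation without code changes.
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: aks-prod
  namespace: dynatrace
spec:
  apiUrl: https://<your-environment-id>.live.dynatrace.com/api
  oneAgent:
    cloudNativeFullStack: {}
  activeGate:
    capabilities:
      - kubernetes-monitoring
```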

Dynatrace gave us a unified platform covering all observability pillars:

  • Metrics: Automatic collection of infrastructure and application metrics, including Kubernetes cluster health, pod resource utilisation, and custom application metrics — all surfaced through Dynatrace dashboards
  • Logging: Centralised log ingestion from all AKS workloads, with full-text search, log analytics, and correlation with traces and metrics
  • Distributed Tracing: Automatic end-to-end trace capture across our MuleSoft APIs and supporting services, with AI-powered root cause analysis through Dynatrace’s Davis AI engine
  • Alerting: Dynatrace’s anomaly detection and alerting capabilities replaced the need for manually configured thresholds, automatically baselining service behaviour and alerting on deviations

Every migrated service was automatically instrumented by the Dynatrace Operator, giving teams immediate visibility into their service’s health post-migration. Dynatrace dashboards were configured for each service, providing a comprehensive view of performance, dependencies, and error rates from day one.

CI/CD Pipeline Modernisation

The migration was also the forcing function for modernising our CI/CD pipelines. We migrated from Jenkins to Azure DevOps Pipelines, taking advantage of the tighter integration with AKS and the broader Azure ecosystem.

Since the majority of our services were MuleSoft API applications being migrated to AKS via MuleSoft Runtime Fabric, the CI/CD pipeline didn’t need to handle container image builds, registry management, or deployment orchestration directly. Runtime Fabric’s control plane manages the full lifecycle of MuleSoft applications on Kubernetes — from pulling the application archive, building and managing the container image, to deploying and scaling the runtime pods on the AKS cluster.

Our standardised pipeline followed this flow:

  1. Build: Compile the Mule application and run unit tests (MUnit)
  2. Scan: Static code analysis (SonarQube) and dependency vulnerability scanning
  3. Package: Build the Mule application archive and publish to Anypoint Exchange
  4. Deploy to Dev: Trigger deployment to the Runtime Fabric dev environment via Anypoint Platform
  5. Integration Tests: Run automated integration tests against the dev environment
  6. Deploy to Staging: Manual approval gate, then trigger Runtime Fabric deployment to staging
  7. Deploy to Production: Manual approval gate with change management integration
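The seven steps above map naturally onto an Azure Pipelines multi-stage YAML template. A condensed sketch — stage names and the `deploy.sh` scripts are placeholders for the real Maven and Anypoint Platform calls, and the deploy stages invoke Runtime Fabric rather than kubectl, since the control plane owns the container lifecycle:

```yaml
# Condensed sketch of the shared pipeline template.
# Script bodies are placeholders, not the actual commands used.
trigger:
  branches:
    include: [main]

stages:
  - stage: Build
    jobs:
      - job: BuildAndTest
        steps:
          - script: mvn clean verify          # compile + MUnit tests
          - script: mvn sonar:sonar           # SonarQube static analysis
  - stage: Package
    jobs:
      - job: Publish
        steps:
          - script: mvn deploy                # publish archive to Anypoint Exchange
  - stage: DeployDev
    jobs:
      - job: Deploy
        steps:
          - script: ./deploy.sh dev           # trigger Runtime Fabric deployment
          - script: ./run-integration-tests.sh dev
  - stage: DeployStaging
    jobs:
      - deployment: Deploy
        environment: staging                  # approval gate on the environment
        strategy:
          runOnce:
            deploy:
              steps:
                - script: ./deploy.sh staging
  - stage: DeployProd
    jobs:
      - deployment: Deploy
        environment: production               # approval + change management gate
        strategy:
          runOnce:
            deploy:
              steps:
                - script: ./deploy.sh production
```

Using `deployment` jobs bound to Azure DevOps environments is what gives the manual approval gates: approvals are configured on the environment once, and every pipeline that targets it inherits them.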

By delegating container lifecycle management to the Runtime Fabric control plane, our pipelines became simpler and more focused — teams only needed to worry about their application code and configuration, not the underlying container infrastructure.

The pipeline standardisation had an unexpected benefit: it became much easier to enforce security and compliance requirements. Instead of checking each service’s bespoke pipeline, we could update the base pipeline template and have the changes propagate to all services.

The Rollout Approach

We didn’t migrate all seventy services at once. We used a phased approach:

Phase 1 — Pathfinder (2 services, 4 weeks): We chose two low-risk, Tier 3 services as pathfinders. The goal was to validate our migration process, identify gaps in our tooling, and build confidence. These first two services took disproportionately long because we were building the foundation — the pipeline templates, the networking configuration, and the operational runbooks.

Phase 2 — Proof (8 services, 6 weeks): We expanded to eight services across different teams and technology stacks. This phase validated that our approach worked beyond the platform team’s own services. It also surfaced issues with team-specific configurations and undocumented dependencies.

Phase 3 — Scale (30 services, 8 weeks): With the process proven, we parallelised. Multiple teams migrated their services concurrently, with the platform team providing support and guidance. We ran migration workshops, created detailed runbooks, and established a dedicated Teams channel for migration questions.

Phase 4 — Complete (remaining services, 10 weeks): The final phase tackled the harder services — the ones with complex dependencies, legacy code, or high criticality. These required more hand-holding and in some cases, code changes to achieve container compatibility.

Throughout the rollout, we ran services in parallel — traffic flowing to both on-premises and AKS instances — before cutting over. This gave us a safety net and the ability to roll back quickly if issues arose.

Results

The migration took approximately seven months from the first planning session to the last service cutover. Here’s what we achieved:

  • Deployment frequency: Increased from an average of once per fortnight to multiple times per week
  • Environment provisioning: Reduced from 2 weeks to under 1 hour
  • Incident response time: Improved by approximately 40%, thanks to standardised observability
  • Resource utilisation: Improved by roughly 35% due to Kubernetes bin-packing and right-sizing
  • Cost: Initial cloud costs were higher than on-premises (as expected), but the operational efficiency gains and reduced manual toil more than compensated

Lessons Learned

Invest disproportionately in the first few services. The foundation you build during the pathfinder phase determines the speed of everything that follows. Don’t rush it.

Networking will take longer than you think. Every networking assumption you made on-premises will be challenged in the cloud. Start the networking workstream early and involve your network team from day one.

Don’t underestimate the people side. Migrating to Kubernetes isn’t just a technology change — it’s a skills change. We ran Kubernetes training sessions for development teams and created a certification pathway. Developers who understood the platform were significantly more effective at troubleshooting and optimising their services.

Automate the boring stuff. The migration involved a huge amount of repetitive work — creating namespaces, configuring RBAC, setting up monitoring. Every task we automated freed up time for the harder problems.
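As one example, per-service namespace and RBAC setup was reduced to applying a templated manifest. A sketch of the pattern — the group object ID and label values are placeholders:

```yaml
# Per-service namespace plus an RBAC binding granting the owning
# team edit rights scoped to that namespace only. IDs are placeholders.
apiVersion: v1
kind: Namespace
metadata:
  name: example-service
  labels:
    team: example-team
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-team-edit
  namespace: example-service
subjects:
  - kind: Group
    name: "<azure-ad-group-object-id>"   # the team's Azure AD group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                             # built-in aggregate role
  apiGroup: rbac.authorization.k8s.io
```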

Keep the old infrastructure running longer than you’d like. We maintained on-premises environments for three months after the migration completed. It felt wasteful, but it was invaluable for debugging issues and providing a rollback path. The cost of keeping the lights on was far less than the cost of a botched migration.

Perfect is the enemy of migrated. Some services migrated with known imperfections — suboptimal resource limits, missing metrics, incomplete documentation. We tracked these as tech debt and addressed them post-migration. Trying to make everything perfect before cutover would have delayed the migration by months.

Looking Back

If I had to do this migration again, I’d change a few things.

First, I’d invest in a migration tracking dashboard from day one. We used spreadsheets initially, and they became unwieldy quickly. A proper dashboard with per-service status, blocking issues, and timeline tracking would have improved coordination significantly.

Second, I’d run a dedicated tribal knowledge capture sprint before the migration started. We repeatedly hit undocumented dependencies — services that relied on specific environment variables nobody remembered setting, batch jobs triggered by obscure cron entries on shared VMs, and Mule connector configurations that only the original developer understood. Each discovery meant pausing the migration to reverse-engineer the behaviour. A structured knowledge capture exercise — even just a few days of workshops with each team — would have surfaced these surprises early and saved weeks of debugging during the actual migration.

But overall, the migration was a success. Our services are more reliable, our deployments are faster, and our platform team can focus on building capabilities rather than babysitting infrastructure. Moving to AKS wasn’t just a lift-and-shift — it was the foundation for the next generation of our platform.