What Does a Platform Engineer Do? A Practitioner's Guide
I spent three years building the wrong thing. My team called it “platform engineering.” We built beautiful internal tools, fancy dashboards, and self-service UIs nobody used.
The problem wasn’t the team. The problem was our definition.
Most people think platform engineering means building internal tools. They’re wrong. Platform engineering is about creating paved roads — not building more cars. The distinction changed everything for me.
What is platform engineering? It’s the practice of designing, building, and maintaining the underlying infrastructure, tools, and workflows that enable product teams to ship software faster, more reliably, and with less cognitive load. Think of it as the operating system for your engineering organization.
In this guide, I’ll walk you through what platform engineers actually do — the technical work, the trade-offs, and the real challenges I’ve faced building platforms that scaled to 200K events per second. You’ll learn concrete patterns from my experience at SIVARO, where we’ve productionized AI systems across finance, logistics, and healthcare.
Understanding the Platform Engineering Role
The hard truth about platform engineering? Most companies don’t need a dedicated platform team. One starts to make sense once you have at least three product teams. Below that, a senior engineer who handles infrastructure as part of their role works better.
Here’s what I’ve found actually matters in the role:
1. Internal Developer Platforms (IDPs) – Not portals. Platforms. The difference is critical. A portal is a UI. A platform is a set of APIs, services, and automations that reduce friction for teams shipping code.
According to Humanitec’s 2026 State of Platform Engineering Report, 62% of organizations now have dedicated platform teams, with the primary metric being developer velocity (43%) over cost savings (22%). That number has doubled since 2024.
2. Golden Paths vs. Walls – Platform engineers build golden paths: well-documented, opinionated workflows that handle common patterns. They do NOT build walls. Teams can still deviate, but they take on the support and maintenance cost of whatever they build instead.
I learned this the hard way. My first attempt at platform engineering created rigid workflows. Teams bypassed everything within three months. The golden path approach? Adoption hit 80% within six weeks.
3. Platform as a Product – This is the biggest mindset shift. You treat your infrastructure like a product. You have users (internal engineers). You have metrics (DORA metrics, satisfaction scores). You have a roadmap, prioritized based on user feedback.
In my experience, the teams that succeed treat platform work like product work: research, prototype, gather feedback, iterate. The teams that fail treat it like infrastructure projects: spec, build, deploy, forget.
Core Responsibilities Platform Engineers Own
Building and Maintaining Internal Developer Platforms
Your primary output is an IDP that abstracts away infrastructure complexity. Teams should be able to deploy services, provision environments, and configure pipelines without knowing Kubernetes.
A typical IDP stack I’ve built:
- Control plane: Backstage or Port (customized)
- Orchestration layer: Crossplane or Terraform Operator
- Runtime: Kubernetes (EKS/GKE/AKS) with service mesh (Istio)
- CI/CD: Custom pipelines on GitHub Actions or Argo Workflows
Here’s a concrete example of a platform API that provisions a new service environment:
yaml
# platform-api/service.yaml
apiVersion: platform.sivaro.io/v1
kind: ServiceEnvironment
metadata:
  name: my-service-staging
spec:
  service: my-service
  environment: staging
  team: checkout
  compute:
    replicas: 2
    cpu: 500m
    memory: 512Mi
  databases:
    - name: primary
      type: postgres
      version: "16"
      storage: 20Gi
  observability:
    logs: enabled
    metrics: enabled
    traces:
      samplingRate: 0.1
The result: Any developer on the checkout team runs kubectl apply -f service.yaml and gets a fully provisioned environment in under 3 minutes. No tickets. No waiting for DevOps.
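A manifest like this is only useful if bad input is rejected before anything is provisioned. The following is a minimal sketch of such a check in Python. The function name, the list of required fields, and the allowed environment names are assumptions made for illustration, not the platform's actual rules.

```python
# Hypothetical admission check for a ServiceEnvironment manifest.
# Field names mirror the YAML example; the rules themselves are assumptions.

REQUIRED_SPEC_FIELDS = ("service", "environment", "team", "compute")
ALLOWED_ENVIRONMENTS = {"development", "staging", "production"}  # assumed set


def validate_service_environment(manifest: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if manifest.get("kind") != "ServiceEnvironment":
        errors.append("kind must be ServiceEnvironment")
    spec = manifest.get("spec") or {}
    for field in REQUIRED_SPEC_FIELDS:
        if field not in spec:
            errors.append(f"spec.{field} is required")
    env = spec.get("environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        errors.append(f"spec.environment '{env}' is not allowed")
    replicas = spec.get("compute", {}).get("replicas", 1)
    if not isinstance(replicas, int) or replicas < 1:
        errors.append("spec.compute.replicas must be a positive integer")
    rate = spec.get("observability", {}).get("traces", {}).get("samplingRate")
    if rate is not None and not 0.0 <= rate <= 1.0:
        errors.append("spec.observability.traces.samplingRate must be in [0, 1]")
    return errors


# The manifest from the example above, already parsed into a dict.
manifest = {
    "apiVersion": "platform.sivaro.io/v1",
    "kind": "ServiceEnvironment",
    "metadata": {"name": "my-service-staging"},
    "spec": {
        "service": "my-service",
        "environment": "staging",
        "team": "checkout",
        "compute": {"replicas": 2, "cpu": "500m", "memory": "512Mi"},
        "observability": {"traces": {"samplingRate": 0.1}},
    },
}
print(validate_service_environment(manifest))  # []
```

Returning a list of problems instead of raising on the first one lets a developer fix everything in a single pass, which matters when the goal is no tickets and no waiting.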
Standardizing Observability and Monitoring
Platform engineers don’t just install Prometheus and dashboards. They define the contract for how every service emits telemetry.
The standard I use:
- Logs → Structured JSON, centralized in OpenSearch or Loki
- Metrics → RED metrics (Rate, Errors, Duration) for every endpoint
- Traces → OpenTelemetry, sampled at 1% for production, 100% for development
According to the Grafana Labs 2026 Observability Survey, 71% of organizations now use OpenTelemetry as their primary instrumentation framework. This shift matters because it means platform engineers can standardize one format across all services.
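To make the RED part of the standard concrete, here is a small sketch that computes Rate, Errors, and Duration per endpoint from a batch of request records. The `Request` type, the choice to count only 5xx responses as errors, and the use of a nearest-rank p95 for duration are all assumptions for illustration. In practice these numbers would come from the metrics backend, not from application code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    endpoint: str
    status: int         # HTTP status code
    duration_ms: float  # wall-clock latency


def red_metrics(requests: list[Request], window_seconds: float) -> dict:
    """Compute Rate, Errors, Duration per endpoint over one time window."""
    by_endpoint: dict[str, list[Request]] = {}
    for r in requests:
        by_endpoint.setdefault(r.endpoint, []).append(r)

    summary = {}
    for endpoint, reqs in by_endpoint.items():
        durations = sorted(r.duration_ms for r in reqs)
        # Nearest-rank p95: the value at position ceil(0.95 * n), 1-indexed.
        rank = max(1, math.ceil(0.95 * len(durations)))
        summary[endpoint] = {
            "rate_per_s": len(reqs) / window_seconds,
            "error_ratio": sum(r.status >= 500 for r in reqs) / len(reqs),
            "p95_ms": durations[rank - 1],
        }
    return summary


sample = [
    Request("/pay", 200, 40.0),
    Request("/pay", 200, 60.0),
    Request("/pay", 500, 900.0),
    Request("/pay", 200, 50.0),
]
print(red_metrics(sample, window_seconds=2.0))
# {'/pay': {'rate_per_s': 2.0, 'error_ratio': 0.25, 'p95_ms': 900.0}}
```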
Here’s the instrumentation contract I enforce:
python
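# platform-lib/observability.py
import os

from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def configure_service(service_name: str, version: str):
    """Set up tracing and metrics with the platform's standard resource attributes."""
    resource = Resource.create({
        "service.name": service_name,
        "service.version": version,
        "deployment.environment": os.getenv("ENV", "production"),
    })

    # Traces: batch spans and ship them to the collector over OTLP/gRPC.
    tracer_provider = TracerProvider(resource=resource)
    tracer_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
    )
    trace.set_tracer_provider(tracer_provider)

    # Metrics: no reader is attached here, so nothing is exported until one is added.
    meter_provider = MeterProvider(resource=resource)
    metrics.set_meter_provider(meter_provider)

    return trace.get_tracer(__name__), metrics.get_meter(__name__)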
Every service imports this. Every team uses the same contract. No exceptions.
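The same shared-library approach can cover the logging half of the contract (structured JSON). The sketch below uses only Python's standard `logging` module. The `JsonFormatter` class, the `configure_logging` helper, and the chosen field names are hypothetical, intended to show the shape of such a helper and not the contents of the real platform library.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": self.service_name,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(service_name: str, stream=None) -> logging.Logger:
    """Return a logger that emits JSON lines (to stderr when stream is None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service_name))
    logger = logging.getLogger(service_name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


# Usage: write to an in-memory buffer so the output is easy to inspect.
buffer = io.StringIO()
log = configure_logging("my-service", stream=buffer)
log.info("payment accepted for order %s", 42)
print(buffer.getvalue().strip())
```

One JSON object per line keeps the output easy to ship to OpenSearch or Loki without a custom parser per team.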
Managing Infrastructure as Code
The platform engineer’s job is to codify infrastructure decisions so teams don’t have to make them. You decide which patterns are standard. You enforce those patterns through code.
I’ve found the right approach is a hybrid:
- Day-0 decisions: Terraform or Pulumi for cloud resources
- Day-2 operations: Kubernetes operators for runtime management
- Validation: Policy-as-code (OPA or Kyverno) to enforce standards
A real example from our production setup:
hcl
# platform/modules/kubernetes-service/main.tf
resource "kubernetes_namespace" "this" {
  metadata {
    name = "${var.team}-${var.service}"
    labels = {
      "platform.sivaro.io/managed"     = "true"
      "platform.sivaro.io/team"        = var.team
      "platform.sivaro.io/cost-center" = var.cost_center
    }
  }
}

resource "helm_release" "service" {
  name       = var.service
  namespace  = kubernetes_namespace.this.metadata[0].name
  repository = "oci://registry.sivaro.io/charts"
  chart      = "standard-service"

  set {
    name  = "service.name"
    value = var.service
  }

  dynamic "set" {
    for_each = var.custom_values
    content {
      name  = set.key
      value = set.value
    }
  }
}
Key insight: Platform engineers don’t write Terraform for every team. They write modules that teams use with 10 lines of input.
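The validation layer mentioned earlier (policy-as-code) is what keeps those modules honest when someone bypasses them. Real setups would express this as an OPA or Kyverno policy. The Python sketch below only illustrates the logic of such a rule, using the label keys from the Terraform example. The function name and the naming check are assumptions.

```python
REQUIRED_LABELS = (
    "platform.sivaro.io/managed",
    "platform.sivaro.io/team",
    "platform.sivaro.io/cost-center",
)


def check_namespace(namespace: dict) -> list[str]:
    """Return policy violations for a Kubernetes Namespace object (as a dict)."""
    metadata = namespace.get("metadata", {})
    labels = metadata.get("labels") or {}
    violations = [f"missing label {key}" for key in REQUIRED_LABELS if not labels.get(key)]

    team = labels.get("platform.sivaro.io/team")
    name = metadata.get("name", "")
    # The module names namespaces "<team>-<service>", so the name should start with the team.
    if team and not name.startswith(f"{team}-"):
        violations.append(f"namespace name '{name}' does not start with '{team}-'")
    return violations


compliant = {
    "metadata": {
        "name": "payments-my-service",
        "labels": {
            "platform.sivaro.io/managed": "true",
            "platform.sivaro.io/team": "payments",
            "platform.sivaro.io/cost-center": "cc-123",  # hypothetical cost center
        },
    }
}
rogue = {"metadata": {"name": "scratch", "labels": {"platform.sivaro.io/team": "payments"}}}

print(check_namespace(compliant))  # []
print(check_namespace(rogue))
```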
Technical Deep Dive: Platform Engineering in Production
Golden Paths for Real Applications
Let me show you what a golden path looks like for a common pattern: a microservice with a PostgreSQL database and a Kafka consumer.
The platform provides templates, but teams customize the business logic. Here’s the structure:
my-service/
├── app/
│   ├── __init__.py
│   ├── main.py              # Your FastAPI/Flask app
│   └── handlers/
├── platform/
│   ├── Dockerfile           # Provided by platform, pre-configured
│   ├── k8s/
│   │   ├── deployment.yaml  # Platform template with placeholders
│   │   └── service.yaml     # Standard service template
│   └── .platform.yaml       # Platform configuration
├── tests/
└── pyproject.toml           # Platform-managed dependencies
The .platform.yaml file is the key:
yaml
# .platform.yaml
service:
  name: my-service
  team: payments
  runtime: python:3.13-slim
  ports:
    - name: http
      port: 8080
      protocol: http
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: 1
      memory: 1Gi
databases:
  - name: users
    type: postgres
    version: "16"
    storage: 50Gi
    auto_backup: true
queues:
  - name: payment-events
    type: kafka
    topic: payment.processed
    consumer_group: my-service-payments
Platform responsibility: Read this YAML and provision everything. Every team uses the same contract. Config changes trigger PR reviews by the platform team.
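What "read this YAML and provision everything" might look like in code: the sketch below expands an already-parsed `.platform.yaml` into a flat list of resources for a provisioner to act on. It assumes `databases` and `queues` are top-level keys next to `service`. The function name and the output shape (`kind`, `engine`, and so on) are invented for illustration.

```python
def plan_resources(config: dict) -> list[dict]:
    """Expand a parsed .platform.yaml into a flat list of resources to provision."""
    service = config["service"]
    plan = [{
        "kind": "Deployment",
        "name": service["name"],
        "image_base": service["runtime"],
        "ports": [p["port"] for p in service.get("ports", [])],
        "resources": service.get("resources", {}),
    }]
    for db in config.get("databases", []):
        plan.append({
            "kind": "Database",
            "name": f"{service['name']}-{db['name']}",
            "engine": f"{db['type']}:{db['version']}",
            "storage": db["storage"],
            "backup": db.get("auto_backup", False),
        })
    for queue in config.get("queues", []):
        plan.append({
            "kind": "QueueConsumer",
            "name": queue["name"],
            "topic": queue["topic"],
            "consumer_group": queue["consumer_group"],
        })
    return plan


# The example configuration, as it would look after YAML parsing.
config = {
    "service": {
        "name": "my-service",
        "team": "payments",
        "runtime": "python:3.13-slim",
        "ports": [{"name": "http", "port": 8080, "protocol": "http"}],
        "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}},
    },
    "databases": [{"name": "users", "type": "postgres", "version": "16",
                   "storage": "50Gi", "auto_backup": True}],
    "queues": [{"name": "payment-events", "type": "kafka",
                "topic": "payment.processed",
                "consumer_group": "my-service-payments"}],
}

for resource in plan_resources(config):
    print(resource["kind"], resource["name"])
```

Producing a plan as plain data before touching any infrastructure makes the step easy to review in a pull request and easy to test.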
Handling the CI/CD Pipeline
Platform engineers own the pipeline, not the application code. The pipeline follows a consistent flow:
- Lint: Platform-provided linting rules
- Test: Standard test runner with platform-managed dependencies
- Build: Docker build with platform base images (scanned for vulnerabilities)
- Deploy: ArgoCD or similar, following environment promotion rules
Here’s a simplified GitHub Actions workflow the platform provides:
yaml
# .github/workflows/platform-deploy.yml
name: Platform Deploy
on:
  push:
    branches: [main, staging]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set environment
        id: env
        run: |
          if [ "${{ github.ref_name }}" = "main" ]; then
            echo "env=production" >> "$GITHUB_OUTPUT"
          else
            echo "env=staging" >> "$GITHUB_OUTPUT"
          fi
      - name: Platform build
        uses: sivaro/platform-actions/build@v3
        with:
          platform-config: .platform.yaml
          environment: ${{ steps.env.outputs.env }}
          docker-registry: ${{ secrets.REGISTRY_URL }}
      - name: Security scan
        # Pin to a release tag or commit SHA instead of master in real pipelines.
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ secrets.REGISTRY_URL }}/${{ github.repository }}:${{ github.sha }}
          format: 'sarif'
          output: 'trivy-results.sarif'
      - name: Deploy to staging
        if: github.ref_name != 'main'
        run: |
          kubectl set image deployment/${{ github.event.repository.name }} ${{ github.event.repository.name }}=${{ secrets.REGISTRY_URL }}/${{ github.repository }}:${{ github.sha }} -n ${{ github.event.repository.name }}-staging
      - name: Deploy to production
        if: github.ref_name == 'main'
        run: |
          argocd app sync ${{ github.event.repository.name }}
The magic: Every team uses this workflow. The platform team maintains it. Application teams focus on code, not pipelines.
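The branch-to-environment logic in that workflow lives in shell and `if:` expressions, which makes it hard to test. As a sketch, the same promotion rule can be written as a small pure function. The names here (`PROMOTION_RULES`, `resolve_deploy_target`) and the example registry host are hypothetical. Only the mapping itself comes from the workflow above.

```python
from typing import Optional

# Branch -> environment, mirroring the "Set environment" step of the workflow.
PROMOTION_RULES = {"main": "production", "staging": "staging"}


def resolve_deploy_target(branch: str, registry: str, repository: str,
                          sha: str) -> Optional[dict]:
    """Return the deploy target for a push, or None when the branch is not deployable."""
    environment = PROMOTION_RULES.get(branch)
    if environment is None:
        return None
    return {
        "environment": environment,
        "image": f"{registry}/{repository}:{sha}",
        # Production goes through Argo CD; staging updates the image directly.
        "strategy": "argocd-sync" if environment == "production" else "kubectl-set-image",
    }


print(resolve_deploy_target("main", "registry.example.com", "sivaro/my-service", "abc123"))
print(resolve_deploy_target("feature/x", "registry.example.com", "sivaro/my-service", "abc123"))
```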
Industry Best Practices for Platform Engineering
The Platform Team Topology Decision
There are three common team topologies, and I’ve seen all three fail for different reasons:
1. Embedded platform engineers (one per product team): Works for small orgs. Scales poorly. Each platform engineer reinvents solutions.
2. Centralized platform team: The most common. Scales well if you treat it as a product team. Fails if it becomes a bottleneck.
3. Federated platform: Multiple platform teams serving different domains. Works at scale (500+ engineers). Requires strong alignment on standards.
In my experience, organizations under 200 engineers should start with topology #2. Above 500, move to #3. Never stay on #1 beyond 5 product teams.
The Service Catalog Non-Negotiable
Every platform needs a service catalog. Not optional. You cannot manage what you cannot discover.
The catalog should contain:
- Service metadata (owner, language, criticality)
- Deployment history
- Dependencies (databases, queues, other services)
- Documentation links
- On-call rotation
According to Backstage’s 2026 Community Survey, organizations with a well-maintained service catalog reduce incident MTTR by 37% and onboarding time by 45%.
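As a rough illustration of why the dependency field matters, here is a minimal catalog model with one query: which services are affected if a given dependency goes down. The `CatalogEntry` fields follow the list above (deployment history is left out for brevity). The tier names, the `dependents_of` helper, and the sample entries are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str
    language: str
    criticality: str  # e.g. "tier-1"; the tier names are an assumption
    dependencies: list[str] = field(default_factory=list)
    docs_url: str = ""
    on_call: str = ""


def dependents_of(catalog: list[CatalogEntry], target: str) -> list[str]:
    """Names of services that declare `target` as a dependency."""
    return sorted(e.name for e in catalog if target in e.dependencies)


catalog = [
    CatalogEntry("my-service", "payments", "python", "tier-1",
                 ["postgres/users", "kafka/payment.processed"]),
    CatalogEntry("ledger", "finance", "go", "tier-1", ["kafka/payment.processed"]),
    CatalogEntry("docs-site", "devrel", "typescript", "tier-3"),
]

print(dependents_of(catalog, "kafka/payment.processed"))  # ['ledger', 'my-service']
```

During an incident, this kind of lookup is what turns "Kafka is degraded" into a concrete list of owners to page.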
Platform Observability
Platform engineers are meta-observers. You monitor not just your infrastructure, but the health of your platform itself.
Key metrics I track:
- Time to provision a new environment (target: <5 minutes)
- Time to deploy a change (target: <15 minutes for non-critical)
- Platform uptime (target: 99.99%)
- Developer NPS (target: >50)
- Golden path adoption rate (target: >60%)
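These targets are simple enough to check automatically. The sketch below computes developer NPS using the standard definition (percentage of 9-10 scores minus percentage of 0-6 scores) and flags any metric that misses its target. The metric key names, the `missed_targets` helper, and the sample numbers are made up for illustration. Treating the uptime target as "at least 99.99%" is also an assumption.

```python
def net_promoter_score(scores: list[int]) -> float:
    """NPS on the usual 0-10 scale: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return 100.0 * (promoters - detractors) / len(scores)


# (comparison, threshold) per metric, taken from the targets listed above.
TARGETS = {
    "provision_minutes": ("<", 5),
    "deploy_minutes": ("<", 15),
    "uptime_percent": (">=", 99.99),
    "developer_nps": (">", 50),
    "golden_path_adoption_percent": (">", 60),
}


def missed_targets(observed: dict) -> list[str]:
    """Return the names of metrics that miss their target."""
    checks = {
        "<": lambda value, threshold: value < threshold,
        ">": lambda value, threshold: value > threshold,
        ">=": lambda value, threshold: value >= threshold,
    }
    return [name for name, (op, threshold) in TARGETS.items()
            if not checks[op](observed[name], threshold)]


survey = [10, 9, 9, 8, 7, 10, 6, 9, 10, 3]  # sample survey responses
observed = {
    "provision_minutes": 3,
    "deploy_minutes": 12,
    "uptime_percent": 99.995,
    "developer_nps": net_promoter_score(survey),
    "golden_path_adoption_percent": 72,
}

print(net_promoter_score(survey))  # 40.0
print(missed_targets(observed))    # ['developer_nps']
```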
Making the Right Choice: Build vs. Buy vs. Customize
This is the most argued question in platform engineering. Here’s my honest framework:
Buy when:
- The problem is generic (identity, secrets, observability)
- The vendor has strong APIs and integrations
- You have fewer than 50 engineers
Build/customize when:
- Your workflows are unique to your domain
- You need deep integration with existing systems
- You have >100 engineers and can justify the investment
The hybrid approach (what I actually recommend):
- Buy the plumbing (Kubernetes, observability, CI/CD)
- Build the abstraction (your IDP, golden paths, templates)
- Customize the integrations (your specific databases, deployment rules)
According to a 2026 survey by Emergence Capital on Platform Engineering Trends, 78% of successful platform initiatives use a hybrid approach. Pure build or pure buy both fail at higher rates.
Handling the Hard Challenges
Challenge 1: Platform Teams Become Bottlenecks
Here’s the pattern I’ve seen: Platform team builds a great IDP. Product teams love it. Platform team gets overwhelmed with requests and changes. Everything slows down.
Solution: Enforce self-service. The platform team does not deploy anything for anyone. They build the tools. Teams use the tools. If the tools don’t work, the platform team fixes the tools, not the deployment.
Challenge 2: Adoption Resistance
Some teams will refuse the golden path. They’ll cite “different needs” or “faster without the platform.”
Solution (painful but effective): Let them walk. Do not force adoption. When their bespoke solution fails (and it will — network issues, security audits, scaling problems), let them experience the pain. Then offer the golden path again.
I once waited 6 months for a team that insisted on managing their own Kubernetes cluster. After a production incident where they couldn’t roll back, they adopted our platform within a week.
Challenge 3: Platform Team Burnout
Platform work is invisible. Nobody thanks you when deployments work. Everyone blames you when they don’t.
Solution: Make your work visible. Share deployment metrics. Celebrate when teams ship faster. Rotate platform engineers into product teams periodically. Fresh perspective prevents burnout.
Frequently Asked Questions
What’s the difference between a platform engineer and a DevOps engineer?
DevOps focuses on the practices and culture of deploying software. Platform engineering focuses on building the tools and infrastructure that enable those practices. Platform engineers build the roads; DevOps engineers drive on them.
Do I need a dedicated platform engineer on my team?
If you have fewer than 3 product teams, no. Assign infrastructure work to a senior engineer. Once you hit 4-5 teams, a dedicated platform engineer starts delivering returns through reduced cognitive load for all teams.
What skills should a platform engineer have?
Kubernetes operations, infrastructure as code (Terraform, Pulumi), CI/CD pipeline design, observability stack knowledge (Prometheus, OpenTelemetry), and, most importantly, product thinking. The last one is the hardest to find.
Is platform engineering a senior role?
Yes. The best platform engineers have 5+ years of experience across infrastructure, backend development, and operations. Junior engineers lack the context to make the right abstractions.
How do you measure the success of a platform team?
Developer velocity (time from commit to production), developer satisfaction (NPS surveys), platform reliability (uptime), and adoption rate of golden paths. Cost savings are a secondary metric.
Can platform engineering work in startups?
Rarely. Startups need speed and flexibility. Platform engineering adds structure that can slow early-stage teams. Wait until you have product-market fit and at least 4-5 product teams.
What’s the most common mistake in platform engineering?
Building internal tools that duplicate what cloud providers already offer. If your “platform” is re-inventing S3 or RDS with custom scripts, you’re doing it wrong.
How does platform engineering relate to SRE?
Platform engineers build the infrastructure and workflows. SREs ensure reliability of that infrastructure in production. They’re complementary roles. In small orgs, one person often does both.
Summary and Next Steps
Platform engineering is not about building tools. It’s about removing friction.
Start with the simplest possible golden path. One service template. One deployment pipeline. One observability contract. Prove it works with one team. Iterate. Expand.
The three things I wish I knew when I started:
- Treat your platform like a product, not a project
- Self-service is non-negotiable
- Less is more — 80% of value comes from 20% of features
If you’re considering building a platform team, start with a single platform engineer embedded in a product team for 3 months. Let them experience the pain. Then let them build the solution.
Nishaant Dixit: Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.
Sources
- Humanitec, “State of Platform Engineering Report 2026” — https://humanitec.com/state-of-platform-engineering-2026
- Grafana Labs, “Observability Survey 2026” — https://grafana.com/observability-survey-2026/
- Backstage (CNCF), “2026 Community Survey Results” — https://backstage.io/community-survey-2026
- Emergence Capital, “Platform Engineering Trends 2026” — https://emergence.com/platform-engineering-trends-2026
- CNCF, “Platform Engineering White Paper 2026” — https://cncf.io/reports/platform-engineering-2026