Claude Code for DevOps

I've spent several hours over several days documenting and optimizing my entire local environment so I can "mechanize" the work I do every day managing infrastructure for multiple startups.

This isn't a mandate, a best practice, or anything set in stone. It's just an example of how Claude Code can help with day-to-day infrastructure work, not only by writing code but also by operating the cloud and its tools.

Without realizing it, over the years I had already built up a routine of commands, paths, and a set of tools that let me always work the same way, which has made it much easier to automate it through Claude Code.

Repositories

Before diving into AI, I want to share how I organize my repositories. This is fairly personal, but since I work with multiple companies, each with its own infrastructure, access, and repositories, I've organized them as follows.

In my user's home folder I have a directory called repository. Inside it I have one directory per client:

➜ repository

--- ..
--- .
--- cliente-A
--- cliente-B
--- cliente-Z
--- helmcode
--- CLAUDE.md
--- .agents
--- .claude

Each client directory also follows the same structure (or a very similar one), for example the contents of cliente-A:

➜ repository
--- cliente-A
------ devops # Infra repository
--------- argocd
--------- helm
--------- terraform
--------- README.md
--------- CLAUDE.md
------ app-A # App repository
--------- .workflow/build-deploy.yaml
--------- [Other directories with the app's code]
------ app-B # App repository
--------- .workflow/build-deploy.yaml
--------- [Other directories with the app's code]

The devops repository is our working base; here we keep manifests, Helm charts, ArgoCD apps, and so on. Everything needed to work on and operate the client's infrastructure lives here.

In the client app repositories, what really matters to us are the pipeline files. In this example, cliente-A uses GitHub Actions, but we have clients that use GitLab CI and others that use Bitbucket Pipelines. Which one they use hardly matters, and that's the beauty of it: each client can have its own context, as we'll see later.

Let's go back for a moment to the devops repository. It has a CLAUDE.md that should give us context about:

  • High-level architecture: which cloud it's on, region, architecture, how its services communicate, whether it has public and private services, and so on
  • Tools/technologies the client uses: Kubernetes, Helm, Argo, and so on
  • Environments the client has: production, dev, staging, pre-production, and so on
  • Main domains per environment: "app.client-a.com", "app.dev.client-a.com", and so on
  • Any quirks, such as requiring a VPN to reach its infrastructure, and what kind of CI/CD they use for their services

It's very important that this file gives high-level context. There's no need to add workflows or more specific processes here since, as we'll see later, there are more useful options for that.

This should be done in each of the client folders. That way the AI will have very clear context about what it can do and how it should do it for each client.
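Since this convention only helps if every client actually has the file, a small check can catch the gaps. What follows is a minimal sketch, assuming the layout described above (<root>/<client>/devops/CLAUDE.md); the function name is hypothetical and not part of any tool.

```python
from pathlib import Path


def clients_missing_context(root: Path) -> list[str]:
    """Return the client directories under root that lack devops/CLAUDE.md.

    Assumption: every non-hidden directory directly under root is a client.
    """
    missing = []
    for entry in sorted(root.iterdir()):
        # Skip plain files (the root CLAUDE.md) and hidden dirs (.claude, .agents).
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        if not (entry / "devops" / "CLAUDE.md").is_file():
            missing.append(entry.name)
    return missing


# Usage, assuming the directory from this post exists:
#   for name in clients_missing_context(Path.home() / "repository"):
#       print(f"{name}: no devops/CLAUDE.md")
```

The check only confirms that the file exists. Whether it actually covers the architecture, tools, environments, and domains is still something to review by hand.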

Context is what matters most

VERY important

If you noticed, in the previous section I had 2 things inside my repository directory that weren't inside a client directory:

➜ repository
--- CLAUDE.md
--- .claude

Let's start with the first one:

CLAUDE.md: This is the file that gives Claude Code its first context. All of my Claude Code sessions start in the repository directory, so it's the first file it reads as soon as the session starts.

This file therefore contains a global view of how I operate and work. It tells Claude Code how my repositories are organized and, when it needs to operate on a client, where to look to understand how things work, what the architecture is, and which tools to use for that client.

It also points it to key paths on my computer, or where I keep important configuration files it needs to operate. For example, if it needs to interact with a client's Kubernetes cluster in a specific environment, this file tells it to look for the kubeconfig at $HOME/.kube/<client>/<kubeconfig_environment>.
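To make the path convention concrete, here is a minimal sketch of how a client and environment map to a kubeconfig and then to a kubectl call. It assumes the kubeconfig file is named after the environment; the function names, namespace, and deployment name are hypothetical. Only the kubectl flags (--kubeconfig, --namespace, --tail) are real.

```python
import shlex
from pathlib import Path
from typing import Optional


def kubeconfig_path(client: str, environment: str, home: Optional[Path] = None) -> Path:
    """Build $HOME/.kube/<client>/<environment>.

    Assumption: the kubeconfig file is named after the environment.
    """
    base = home if home is not None else Path.home()
    return base / ".kube" / client / environment


def kubectl_logs_command(client: str, environment: str, namespace: str,
                         deployment: str, home: Optional[Path] = None) -> list[str]:
    """Return the argv for reading recent logs from one client's cluster."""
    return [
        "kubectl",
        "--kubeconfig", str(kubeconfig_path(client, environment, home)),
        "--namespace", namespace,
        "logs", f"deployment/{deployment}",
        "--tail=200",
    ]


# Hypothetical names, for illustration only.
command = kubectl_logs_command("cliente-Z", "production", "default", "service-x")
print(shlex.join(command))
```

The agent does not need a helper like this. The point is that the rule written in CLAUDE.md is simple enough for it to build the same command by itself.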

That way, when I start a new Claude Code session in the repository directory, I only have to tell it: "Look at the logs for service X for client Z in the production environment and tell me what's going on."

Without any extra message, Claude Code has enough context to:

  1. Know which client I'm talking about.
  2. Know how to interact with that client's cluster.
  3. Know which kubeconfig it should use for the corresponding environment.
  4. Know which service I'm talking about and diagnose what's going on.

It's not magic. It's pure, plain context.

Also, something I've learned over time is that if at any point you have to repeat something to the agent more than twice, it should be documented somewhere so you don't have to repeat it again.

Skills: workflows and processes

➜ repository
--- CLAUDE.md
--- .claude

The second thing we have inside the repository directory is the .claude directory. If you've been working with Claude Code for a while, you'll know this is its main directory. The global one lives in your user's home folder, but in my case I keep this one at the "project" level because I want it to be very specific to my day-to-day managing cloud infrastructure.

Inside the .claude directory I have:

➜ .claude
--- settings.local.json
--- skills

Given the title, let's start with the skills directory. If you haven't heard of skills, I highly recommend this doc from Anthropic, which walks through an example of how to create skills and explains what they're for.

To sum it up briefly, it's basically documentation with a specific structure that lets you "teach" your agents, in this case Claude Code, to do something specific.

For example, in my case I have several skills, not only for infrastructure things but also for development things. You can find and install skills for whatever you want here: https://skills.sh/

That said, something VERY powerful is creating your own skills. For example, in our case we manage the VPN users for all clients through a script. We have one script per client, but the way that script is used for each client is exactly the same; however, it requires a set of guidelines and instructions to use it correctly.

This "knowledge" of which steps to follow is very easy to document in a skill, so that when we need to, for example, create a new VPN user, we only have to tell it: "Set up VPN access for the user Gandalf the Grey on client A".

With that, Claude Code will detect that it has a skill for it and will know exactly what to do to handle that request.
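For reference, a skill is a directory containing a SKILL.md file whose YAML frontmatter carries a name and a description, followed by the instructions. Below is a minimal sketch that scaffolds one. The skill name, description, and steps are hypothetical placeholders, not my actual VPN workflow.

```python
from pathlib import Path


def render_skill(name: str, description: str, steps: list[str]) -> str:
    """Render a SKILL.md: YAML frontmatter plus a numbered list of steps."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        "---\n\n"
        f"# {name}\n\n"
        "## Steps\n\n"
        f"{numbered}\n"
    )


def write_skill(skills_dir: Path, name: str, description: str, steps: list[str]) -> Path:
    """Write <skills_dir>/<name>/SKILL.md and return its path."""
    target = skills_dir / name / "SKILL.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_skill(name, description, steps), encoding="utf-8")
    return target


# Hypothetical content, for illustration only.
print(render_skill(
    "vpn-user",
    "Create or revoke VPN users for a client. Use when asked to set up VPN access.",
    [
        "Identify the client and find its VPN script in the client's devops repository.",
        "Ask for confirmation before creating the user.",
        "Run the script and report where the generated profile was saved.",
    ],
))
```

The description matters most, since it is what the agent reads when deciding whether the skill applies to a request.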

This is just one example; it's up to you, your work, and your business logic to give it as many skills as you see fit so it can run complete workflows from a single message.

I'll share some of the skills I use that are public, and therefore anyone can install. Some are for infrastructure things and others for development things:

  • architecture-patterns
  • async-python-patterns
  • fastapi-templates
  • frontend-design
  • golang-pro
  • helm-chart-scaffolding
  • mcp-builder
  • python-performance-optimization
  • python-testing-patterns
  • skill-development
  • tailwind-design-system
  • terraform-module-library
  • terraform-style-guide
  • terraform-test
  • vercel-react-best-practices

You can find and install all of them with the service I showed you earlier.

Permissions

Another of the things we see inside the .claude directory is the settings.local.json file. This file is the one that lets you configure certain things in Claude Code, such as permissions.

This lets you tell Claude Code which things it can do without asking, which tools it can always use, or, conversely, which things it can NEVER do.

This is up to each person. Although I now use Claude Code locally for everything, I always supervise what it does, since I work with critical infrastructure and can't afford mistakes.

So I don't have many things allowed by default, but I can give you a small example of what's in my settings to give you an idea:

{
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
  },
  "permissions": {
    "allow": [
      "Bash(ls:*)",
      "Bash(wc:*)",
      "Bash(find:*)",
      "WebSearch",
      "Bash(gh pr view:*)",
      "Bash(source:*)",
      "Bash(terragrunt plan:*)",
      "Bash(gh pr comment:*)",
      "Bash(kubectl get:*)",
      "Bash(aws sts get-caller-identity:*)",
      "Bash(git pull:*)",
      "Bash(git add:*)",
      "Bash(git commit:*)",
      "Bash(helm search repo:*)",
      "Bash(helm show chart:*)",
      "Bash(nslookup:*)",
      "Bash(dig:*)",
      "Bash(gh auth status:*)",
      "Skill(bender-config)",
      "WebFetch(domain:docs.anthropic.com)",
      "Bash(gh search:*)",
      "Bash(gh issue list:*)",
      "Bash(gh run view:*)",
      "Bash(gh pr diff:*)"
    ]
  }
}
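To read the list, it helps to think of each "Bash(<prefix>:*)" entry as "commands that start with this prefix run without asking". The sketch below models that reading and nothing more. It is a deliberate simplification and an assumption on my part, not Claude Code's actual matcher, which also has to deal with things like chained commands.

```python
def rule_allows(rule: str, command: str) -> bool:
    """Simplified model: "Bash(<prefix>:*)" allows commands starting with <prefix>.

    Assumption: plain prefix matching on whole words. Not the real implementation.
    """
    if not (rule.startswith("Bash(") and rule.endswith(":*)")):
        return False  # Non-Bash rules (WebSearch, Skill(...)) are out of scope here.
    prefix = rule[len("Bash("):-len(":*)")]
    return command == prefix or command.startswith(prefix + " ")


# A few rules taken from the settings above.
ALLOW = ["Bash(kubectl get:*)", "Bash(terragrunt plan:*)", "Bash(git commit:*)"]


def is_preapproved(command: str) -> bool:
    """True if any allow rule covers the command under the simplified model."""
    return any(rule_allows(rule, command) for rule in ALLOW)


print(is_preapproved("kubectl get pods -n prod"))   # True
print(is_preapproved("kubectl delete pod api-0"))   # False
print(is_preapproved("terragrunt apply"))           # False
```

Read this way, the list shows the pattern I follow: read-only commands (get, plan, view, diff) are pre-approved, while anything that changes infrastructure still needs my confirmation.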

While for infrastructure tasks I'm not using agent teams but just a single agent, for some development tasks I am using Claude Code agent teams.

That's what the env var CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS is for. If you follow me on Twitter you'll have seen my daily ramblings about this, and I've occasionally written about it on LinkedIn too.

Maybe later on I'll write a post about agent teams.

And what about the famous MCPs?

The truth is that at first, when I started using Claude Code, I began installing several MCPs to interact with everything I've mentioned above: Kubernetes, whatever cloud was in play, and so on.

The problem with MCPs, in my particular case, is that their configuration is very rigid. For example, the Kubernetes MCP only let me configure a single kubeconfig in its settings (and yes, I know contexts exist, but I've never liked working with them). Given that we have 20+ clusters, each with its own kubeconfig, this was completely unworkable for me. I hit the same problem with the ArgoCD MCPs and many others.

In the end, I've chosen to mainly use the tool that Claude Code ships with by default, "Bash()", and through it call the CLIs I use in my day-to-day (kubectl, argocd, aws, and so on).

The advantage is that most of these CLIs are fairly popular, so the agent knows very well how to use them and rarely makes mistakes with them. In fact, back when I didn't have as much documentation in place, its main problem was that it didn't know I work with multiple clients. Now, thanks to the context and its own knowledge of the CLI, it hits the mark most of the time.

So, at least for now, in my workflow it's working much better to use the CLIs directly than to spend time configuring MCPs.
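The flexibility I was missing from the MCP comes almost for free with plain CLIs, because the kubeconfig can be chosen per call. Here is a minimal sketch of that idea, assuming the $HOME/.kube/<client>/<environment> layout with the file named after the environment; the function names are hypothetical. KUBECONFIG is the environment variable kubectl reads to locate its configuration.

```python
import os
import subprocess
from pathlib import Path


def env_for(client: str, environment: str) -> dict[str, str]:
    """Copy the current environment and point KUBECONFIG at one client's cluster.

    Assumption: kubeconfigs live at $HOME/.kube/<client>/<environment>.
    """
    env = dict(os.environ)
    env["KUBECONFIG"] = str(Path.home() / ".kube" / client / environment)
    return env


def run_cli(argv: list[str], client: str, environment: str) -> subprocess.CompletedProcess:
    """Run one CLI command against one client's environment, capturing its output."""
    return subprocess.run(
        argv,
        env=env_for(client, environment),
        capture_output=True,
        text=True,
        check=False,  # Leave it to the caller to inspect returncode and stderr.
    )


# Usage, assuming kubectl is installed and the kubeconfig exists:
#   result = run_cli(["kubectl", "get", "pods", "-A"], "cliente-A", "production")
#   print(result.stdout or result.stderr)
```

Nothing is shared between calls, so 20+ clusters are no harder to handle than one.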

Currently, the only MCP I have configured, and which I mainly use when doing frontend development work, is playwright, which lets me control the browser and so test the UI better.

Plans, Models, and so on

The reality is that until a month ago I was running solely on Claude's Pro plan ($20/month), alternating between Sonnet 4.5 and Opus 4.5.

However, ever since they announced agent teams and released Opus 4.6, this plan has fallen well short for me and I've had to move to the Max 5x plan ($100/month). Even on this plan, with Opus 4.6 I hit the limits several times a day, but only when I do pure, hardcore development work. If I stick to infrastructure work, I don't come anywhere close to the Max 5x limits, at least for now.

Now with Sonnet 4.6 the results seem to be good, and even doing development work I don't hit this plan's limits, but we'll see over time how efficient this model is and whether I'll need to fall back on Opus now and then.

As for tokens, over the last few weeks I've been burning ~80K tokens per day. If you're doing full-on development work the numbers would surely be considerably higher, but for operating and managing the infrastructure of several companies, these have been my results.

On Twitter, I'm posting every day about the different experiments I run and their results, both the good and the bad. I'd be happy to hear your feedback or ideas about all of this.

If you don't want to miss more posts about AI focused on cloud infrastructure, don't forget to subscribe. See you next time! 🖖