May 22, 2026
Our Claude Code + n8n + Python Stack for Custom AI Workflows
The three-layer stack we use to ship custom AI workflows: Claude Code for engineering, n8n for orchestration, and Python where determinism matters most.
By Ian Phillips, Founder & CEO, Phillips Data Solutions
The Claude Code + n8n + Python stack is what lets us ship tailored AI apps in days instead of months. Everyone wants to know the stack. Tools change, but the shape of our 2026 stack has been stable for about a year now, and it is the single biggest reason our build cycles look the way they do.
This post is the unvarnished version: what we use, why we picked it, what each layer is good at, and the things we deliberately do not use yet.
The Three-Layer Stack
- Claude Code — the engineering loop. Writes code, runs tests, iterates.
- n8n — the orchestrator. Catches events, fans out, retries, logs.
- Python — the deterministic core. Anything that has to be exactly right, every time.
Each layer is good at one thing. None of them is trying to be the others. That separation is the entire trick. Most of the architecture mistakes we see in the wild come from one tool being asked to do all three jobs.
Why Claude Code at Layer 1
Claude Code is our default for the actual building. The reasons:
- It works in real repos. It reads existing code, follows conventions, respects tests, and does not need a sandbox.
- It is honest about uncertainty. It tells you when something is ambiguous instead of producing confident garbage.
- It pairs with humans. We still review every change. Claude Code drafts; the engineer ships. The ratio of typing to thinking inverts in our favor.
- It scales across tasks. From "rename this field" to "rebuild this Zap as a service" — same tool, same workflow.
- It owns the boring parts. Env vars, OAuth flows, retry logic, error handling. The work that used to fill a week now fills an afternoon.
We use this loop heavily in the day-one builds described in Building Custom Internal Tools With Claude Code in One Day.
What Claude Code Is Not
Claude Code is not a replacement for engineering judgment. It is a force multiplier for an engineer who already knows what good looks like. We have not yet seen the "no engineers required" story work cleanly for production systems with real users — and we have tried.
Why n8n at Layer 2
We tried straight code-only architectures. They are clean, but they are also brittle. n8n earns its place because:
- It is visual, which makes it easy for clients to understand and modify after handoff. Six months later, they can still read their own automation.
- It has 400+ integrations, which removes glue-code work for the long tail of SaaS APIs.
- It self-hosts cheaply, so cost is predictable at scale — unlike per-task pricing.
- It is JSON under the hood, so it versions in git like real code. Diffs are reviewable. Changes are auditable.
- It is the cleanest landing place for workflows that graduate off Zapier — same mental model, dramatically better economics.
Where n8n Earns Its Keep
- Webhook handling — receiving events from SaaS tools.
- Scheduled jobs — the cron layer of the architecture.
- Branching logic that a non-engineer needs to read.
- Fan-out to multiple downstream systems with retry semantics.
- Human-in-the-loop queues where someone reviews an AI proposal before it commits.
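To make concrete what this layer handles, here is a minimal Python sketch of the fan-out-with-retry-and-review pattern that n8n gives us out of the box. Everything in it (the handler names, the backoff schedule, the shape of the review-queue entries) is illustrative, not our production code:

```python
import time

def fan_out(event, handlers, max_retries=3, base_delay=0.1, sleep=time.sleep):
    """Send one event to several downstream handlers, retrying each independently.

    Failures that exhaust their retries land in a review queue with full
    context attached, mirroring the human-in-the-loop pattern above.
    """
    review_queue = []
    for name, handler in handlers.items():
        for attempt in range(max_retries):
            try:
                handler(event)
                break
            except Exception as exc:
                if attempt == max_retries - 1:
                    review_queue.append(
                        {"handler": name, "event": event, "error": str(exc)}
                    )
                else:
                    # exponential backoff between attempts
                    sleep(base_delay * 2 ** attempt)
    return review_queue
```

The point of the sketch is what it costs to hand-roll: once you add logging, persistence, and a UI for the review queue, you have rebuilt a worse n8n.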
Why Python at Layer 3
Python is for the parts where "mostly right" is not good enough. The judgment calls live in Claude. The execution lives in Python.
Examples of work that lands in the Python layer for us:
- Deduping a contact list against an existing CRM. The dedup logic needs to be deterministic, testable, and version-controlled.
- Parsing financial documents. When the answer has to balance to the cent, we want a tested parser — not a model.
- Calling internal APIs that need real auth, real rate limiting, and real retry logic.
- Anything where the AI's job is to decide and Python's job is to execute exactly.
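As an illustration of the first bullet, a deterministic dedup can be as simple as a normalization key. This is a hedged sketch, not our actual matcher; the key-building rules (email wins, otherwise name plus company) are assumptions for the example:

```python
import re

def normalize(contact):
    """Build a deterministic match key: lowercased email wins; else name + company."""
    email = (contact.get("email") or "").strip().lower()
    if email:
        return ("email", email)
    name = re.sub(r"\s+", " ", (contact.get("name") or "").strip().lower())
    company = re.sub(r"[^a-z0-9]", "", (contact.get("company") or "").lower())
    return ("name_company", f"{name}|{company}")

def dedupe_against_crm(incoming, existing):
    """Return only the incoming contacts whose key is not already in the CRM."""
    seen = {normalize(c) for c in existing}
    fresh = []
    for c in incoming:
        key = normalize(c)
        if key not in seen:
            seen.add(key)  # also dedupe within the incoming batch itself
            fresh.append(c)
    return fresh
```

Because the logic is a pure function, it is trivially unit-testable and the same input always produces the same output, which is exactly what a model cannot promise.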
This is the same split that powered our 960x CRM enrichment work. AI handled the judgment calls (does this contact match? is this title senior? what industry vertical?). Python handled the writes (HubSpot API calls, batched with backoff, fully logged, idempotent on retry).
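The "batched, with backoff, idempotent on retry" part can be sketched like this. The `write` callable stands in for a real API client (HubSpot in that project), and using `rec["id"]` as the idempotency key is an assumption for the example; in practice the applied-keys set would be persisted, not held in memory:

```python
import time

def batched_idempotent_writes(records, write, batch_size=100, applied=None,
                              max_retries=4, base_delay=0.5, sleep=time.sleep):
    """Write records in batches; skip anything already applied, retry with backoff."""
    applied = applied if applied is not None else set()  # persisted in real use
    for start in range(0, len(records), batch_size):
        for rec in records[start:start + batch_size]:
            key = rec["id"]  # idempotency key: a retry never writes twice
            if key in applied:
                continue
            for attempt in range(max_retries):
                try:
                    write(rec)
                    applied.add(key)
                    break
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    sleep(base_delay * 2 ** attempt)  # exponential backoff
    return applied
```

The design choice that matters is that re-running the whole job after a crash is safe: already-applied records are skipped, so a retry never double-writes.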
Why Python and Not, Say, Node?
Mostly inertia and ecosystem. The data-engineering tooling we lean on (pandas, polars, fuzzy-matching libraries, document parsers) is more mature in Python. Node would also work; we just have less leverage there. If a client's team already runs Node, we will happily build the deterministic layer in TypeScript instead. The shape of the stack matters more than the language.
The Shape of a Typical Build
Almost every workflow we ship has the same five-step skeleton:
- An event lands in n8n — a webhook fires, a schedule triggers, a poll returns new data.
- n8n normalizes the event and calls a Python service (or a Claude-Code-authored Next.js route).
- The service does the deterministic work and, where useful, calls a Claude model with a structured prompt.
- The result goes back through n8n, which writes to the system of record and logs every step.
- Failures route to a human-review queue with full context attached — input, intermediate state, error message, proposed action.
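Steps 2 through 5 can be sketched as one service function. This is a minimal illustration, not a real service: the model call is injected as a plain callable so the deterministic core stays testable without network access (in production that callable would wrap a Claude API call), and the prompt, field names, and action vocabulary are all assumptions:

```python
import json

def handle_event(event, call_model, write_record, review_queue):
    """Normalize the event, let the model decide, let Python execute or escalate."""
    # Step: deterministic normalization of the raw event
    normalized = {"contact": event.get("contact", {}).get("email", "").lower()}
    # Step: one structured model call -- the model decides, Python executes
    raw = call_model(
        "Classify this contact. Respond with JSON: "
        '{"action": "enrich" | "skip", "reason": "..."}\n' + json.dumps(normalized)
    )
    try:
        decision = json.loads(raw)
        action = decision["action"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Step: failures route to human review with full context attached
        review_queue.append({"event": event, "raw": raw, "error": str(exc)})
        return "escalated"
    if action == "enrich":
        write_record(normalized)  # the system-of-record write stays deterministic
        return "written"
    return "skipped"
```

Note that a malformed model response never reaches the system of record; it lands in the review queue with the input and the raw output attached.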
That same shape covers roughly 80% of what we ship — from the HubSpot and Microsoft 365 agents to the AI receptionists to the document-classification workflows. The repetition is a feature: it means we can reason about new builds quickly, and clients can recognize their own architecture.
What We Deliberately Do Not Use (Yet)
Tool selection is also a stack decision. A few things we have looked at and chosen not to make defaults:
Heavy Agent Frameworks for Everything
Useful for some workflows, overkill for most. We reach for full agent frameworks only when a workflow genuinely needs autonomous multi-step reasoning. For 80% of work, a structured Claude call inside a deterministic pipeline is simpler, cheaper, and easier to debug.
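What "a structured call inside a deterministic pipeline" means in practice is a hard validation gate between the model's output and any side effect. A hedged sketch, with an assumed action vocabulary and confidence field:

```python
def validate_decision(raw_decision, allowed_actions=("approve", "reject", "escalate")):
    """Gate a structured model output: the pipeline acts only on outputs that
    pass this deterministic check; everything else returns None and escalates."""
    if not isinstance(raw_decision, dict):
        return None
    action = raw_decision.get("action")
    confidence = raw_decision.get("confidence")
    if action not in allowed_actions:
        return None
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        return None
    return {"action": action, "confidence": float(confidence)}
```

Compared with an agent framework, there is no loop to trace and no hidden state: one call in, one validated decision out, and anything off-vocabulary is rejected deterministically.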
Vector Databases by Default
Many workflows do not need RAG. The instinct in 2024 was to throw a vector DB at every problem. In 2026, we add one only when the workflow genuinely benefits from semantic retrieval over a corpus the model cannot fit in context. Most CRM and email workflows do not need this.
Big Managed AI Platforms
Great for prototyping. Expensive when you scale, and they tend to obscure the parts of the system you most need to debug. We prefer thin, well-understood layers we can replace independently.
"Just Use One Big Vendor"
Pitched constantly, almost never the right answer for SMBs. The reasoning is the same as in our custom AI app integration and HubSpot + Microsoft 365 + AI agents posts: the win is in the seams between tools, not in replacing them.
Cost Profile
A typical custom workflow on this stack lands at:
- Build: 1–2 weeks of engineering time for v1.
- Hosting: $20–$200/month depending on volume (n8n self-hosted, a small Python container, model API calls).
- Model calls: highly variable; for most SMB workflows we have shipped, $50–$500/month.
- Maintenance: roughly an hour every other month at steady state, after the first 30 days of tuning.
Compare this with a Zapier-based workflow at 10,000+ tasks/month, where the platform fee alone often exceeds the all-in cost of the custom stack within three months.
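The underlying arithmetic is simple: per-task pricing scales linearly with volume, while the custom stack's run cost is mostly flat. The numbers below are illustrative assumptions only (real Zapier pricing depends on plan tier; the custom figure uses the midpoints of the ranges above):

```python
def monthly_platform_fee(tasks, per_task_cost):
    """Per-task platforms scale linearly with volume; a self-hosted stack mostly does not."""
    return tasks * per_task_cost

ASSUMED_PER_TASK = 0.05       # assumed blended per-task cost, not a quoted price
CUSTOM_RUN_COST = 150 + 250   # hosting midpoint + model-call midpoint, per month

for tasks in (10_000, 25_000, 50_000):
    fee = monthly_platform_fee(tasks, ASSUMED_PER_TASK)
    print(f"{tasks:>6} tasks/month: platform fee ${fee:,.0f} vs custom run cost ${CUSTOM_RUN_COST}")
```

Whatever the exact per-task rate, the structural point holds: the platform fee grows with every new task, and the gap widens as volume grows.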
Conclusion
The best stack is the one your client can still understand six months after you hand it off. Claude Code at the build layer, n8n at the orchestration layer, Python at the deterministic core. Each layer does one job. None of them tries to be the others. That separation, more than any specific tool choice, is why this stack keeps shipping on time.
Ready to automate?
Start a free discovery at phillipsdatasolutions.com/contact — we'll map your highest-ROI automation opportunities in 30 minutes.
Book Free Discovery