Charlotte Malmberg

Frameworks for simplicity beyond complex systems.

Blog

  • Writing With AI When Your Idea Is Original

    I’ve been writing a personal finance book based on my own method: ClearFlow.

    The system I created deliberately does not track transactions. That isn’t an omission. It’s the core design choice.

    Most personal finance systems are ledger-based — track every transaction, categorize spending, reconcile monthly, and analyse what already happened. ClearFlow works differently. It is built on forward constraints: spending boundaries, daily limits, and save-to-spend buckets. Prospective, not retrospective.

    That distinction isn’t a feature. It is the architecture.

    When I tried to have AI help draft sections of the book, transaction tracking kept appearing in the text. Not once, not occasionally — repeatedly. Even after I removed it. Even after I clarified the structure in detail.

    I assumed the model was misunderstanding me.

    It wasn’t.

    It was doing exactly what it is built to do. If most personal finance systems in its training data include transaction logs, then “personal finance system” and “track transactions” are strongly associated. Open a drafting space — or even ask for feedback — and the model drifts toward that dominant pattern.

    Not wrong. Typical.

    That was the moment something clicked.

    AI generation pulls toward what is common. If you are building something deliberately different, that pull becomes visible very quickly.

    I tried correcting it through prompts.

    “Do not include transaction tracking.”
    “This system does not rely on logs.”

    It would hold for a section or two. Then, as we moved further through the book, the familiar pattern returned.

    That’s when I realised I was working at the wrong level.

    Prompting is conversation. You are trying to steer behaviour with words. But the system’s underlying objective hasn’t changed. It is still optimized toward what is statistically normal. Each time I removed the drift, I was correcting entropy rather than preventing it.

    The problem wasn’t output quality. It was task design.

    The shift came when I stopped asking the model to draft freely and created a Claude Skill that encoded the structural rules of ClearFlow.

    Not stylistic guidance — structural constraints.

    What the system includes.
    What it excludes.
    How decisions are framed.
    What must never appear.

    Once those boundaries were explicit, the behaviour changed. Suggestions to add transaction tracking stopped appearing.
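    To make this concrete: an exclusion rule of the kind the Skill encoded can be reduced to a mechanical check. The sketch below is hypothetical, not the actual Claude Skill; the forbidden phrases and function name are invented for illustration.

```python
# Hypothetical sketch: structural exclusions expressed as checkable rules
# instead of being restated in every prompt.
FORBIDDEN = [
    "transaction tracking",
    "transaction log",
    "categorize spending",
    "reconcile",
]

def check_chapter(name: str, text: str) -> list[str]:
    """Return one violation message per forbidden concept found in the text."""
    lowered = text.lower()
    return [
        f"{name}: contains forbidden concept '{phrase}'"
        for phrase in FORBIDDEN
        if phrase in lowered
    ]

draft = "Each week, review your buckets. Then reconcile your ledger."
print(check_chapter("ch02", draft))  # flags the 'reconcile' violation
```

    The point isn't the code. It's that "what must never appear" becomes something a machine can enforce on every draft, rather than something a prompt hopes to prevent.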

    More importantly, the model became useful in a different way. I could use it to check consistency across chapters, identify terminology drift, test whether examples aligned with stated principles across more than 30,000 words, and surface contradictions I had missed.

    It stopped acting like a co-author and started acting like a validator.

    The ideas remained mine. The architecture remained intact. The model enforced consistency against the structure I had defined.

    That experience changed how I think about AI.

    When something keeps reappearing in the output, the instinct is to improve the prompt. In my case, that wasn’t enough. The issue wasn’t phrasing. It was boundaries.

    Once those existed, I stopped fighting the system. The drift reduced. The work became cleaner. The AI could finally do what it is genuinely good at: systematic comparison and structural checking at scale.

    Prompting persuades. Boundaries constrain.

    When you are building something deliberately different, constraint isn’t restrictive. It is what allows the difference to survive.

    That experience also made something else obvious.

    If I, working on a small, well-defined system, saw drift this quickly, the same dynamic will exist anywhere AI is drafting inside an organization.

    Most operating models, risk frameworks, policies, and architecture documents follow established patterns. Those patterns dominate the training data. If AI is used to generate inside those domains without explicit structural constraints, it will tend to reinforce what is already common.

    That may not be a problem when you are formalizing standard practice.

    It becomes a problem when you are deliberately building something different.

    In those cases, the absence of boundaries doesn’t just create noise. It slowly reshapes the system back toward the norm.

    I learned that the hard way while writing a book.

  • If Not “Human-in-the-Loop”, Then What?

    “Human-in-the-loop” isn’t a safety strategy. Yet in many cases, it’s treated as one.

    So what does a better model look like?

    The shift is from human-in-the-loop as informal reviewer to AI-in-the-loop as structured validator. This isn’t about removing humans. It’s about putting both humans and AI where they’re actually good at something.

    The Right Division of Labor

    Humans are good at creating intent, defining what “good” looks like, making judgment calls, handling exceptions, and taking accountability when things go wrong.

    They are not good at systematic consistency checking across long material, maintaining vigilance over repetitive validation work, or spotting subtle structural contradictions.

    AI, by contrast, is good at rule-based comparison, consistency checking, and repeatable structural validation. It is not good at intent, ethical trade-offs, or accountability.

    The model should be:

    Human creates

    AI validates structure

    Human decides

    Not: AI creates, Human tries to catch everything, Hope it works.

    A Simple Example: Hair Care Advice

    This pattern shows up everywhere, even in casual AI use.

    I asked an AI system for hair care recommendations. I’d already described my hair type and goals. The system responded confidently with suggestions that completely ignored the constraints I’d just given. It defaulted to the generic patterns from its training data instead of my stated context.

    A human reviewer would need to go back, hold all my criteria in mind, compare them to each suggestion, and spot the mismatch. That’s cognitive work that gets skipped when you’re just scanning for “does this look reasonable?”

    So I changed the task structure.

    Instead of asking AI to generate recommendations, I selected candidate products myself and asked: “Does each one meet these specific criteria?”

    The recommendations aligned precisely with the criteria I had given.

    The AI wasn’t generating freely. It was validating against defined boundaries. That’s a much more reliable task.
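    The shift from generation to validation can be sketched as a tiny program. The criteria and products below are invented for illustration; what matters is that each candidate is checked against explicit, pre-stated constraints rather than generated into an open space.

```python
# Hypothetical sketch: pre-selected candidates validated against explicit
# criteria, instead of asking a model to generate recommendations freely.
criteria = {"sulfate_free": True, "max_price": 20.0}

candidates = [
    {"name": "Product A", "sulfate_free": True, "price": 14.0},
    {"name": "Product B", "sulfate_free": False, "price": 9.0},
]

def meets_criteria(product: dict, rules: dict) -> bool:
    """Check one candidate against every stated constraint."""
    return (
        product["sulfate_free"] == rules["sulfate_free"]
        and product["price"] <= rules["max_price"]
    )

approved = [p["name"] for p in candidates if meets_criteria(p, criteria)]
print(approved)  # ['Product A']
```

    Validation is closed-ended: every candidate either meets the stated criteria or it doesn’t, and a mismatch can’t slip through on “looks reasonable.”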

    A Complex Example: Writing a Book

    I experienced this at scale while writing a book on personal finance systems. After drafting 30,000+ words across multiple chapters, I needed to check whether the framework stayed consistent throughout.

    Questions like:

    • Did chapter 7’s advice contradict chapter 3?
    • Had terminology shifted between sections?
    • Were the seven core principles applied consistently across different scenarios?
    • Did examples align with the stated framework?

    A human reviewer (me, or a beta reader) could catch obvious contradictions. But systematic consistency checking across an entire book? That’s exactly what AI should validate.

    I used AI to check:

    • Framework consistency across chapters
    • Terminology drift
    • Whether examples aligned with stated principles
    • Whether reasoning remained coherent throughout
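    One of these checks, terminology drift, is easy to sketch mechanically: count which variant of a term each chapter uses and see where the manuscript diverges. The variants and chapter text here are invented examples, not taken from the actual book.

```python
from collections import Counter
import re

# Variants of a single concept; a consistent manuscript should settle on one.
VARIANTS = ["spending boundary", "spending limit", "spending cap"]

def drift_report(chapters: dict[str, str]) -> dict[str, Counter]:
    """Count how often each variant appears in each chapter."""
    report = {}
    for name, text in chapters.items():
        lowered = text.lower()
        report[name] = Counter(
            {v: len(re.findall(re.escape(v), lowered)) for v in VARIANTS}
        )
    return report

chapters = {
    "ch01": "Set a spending boundary for each category.",
    "ch05": "If you hit your spending cap, stop spending.",
}
for name, counts in drift_report(chapters).items():
    print(name, [v for v, n in counts.items() if n > 0])
```

    A language model can run this kind of comparison semantically rather than by literal string match, but the task shape is the same: systematic counting and comparing, which is exactly what humans do badly across 30,000 words.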

    The AI flagged inconsistencies I’d missed. Not because I was careless, but because holding an entire book’s logic structure in working memory while writing is cognitively impossible.

    What This Means for Organizations

    The pattern is the same whether you’re validating hair care advice or enterprise documents:

    Generation is open-ended. The space of possible outputs is wide and loosely constrained. Human review becomes effortful, inconsistent, and prone to omission.

    Validation can be structured. The task becomes rule-based, repeatable, and scalable. Humans respond to specific flags rather than scanning everything.

    Organizations building AI systems need explicit validation architecture where AI is used to:

    • Check consistency across outputs
    • Compare decisions against defined rules or constraints
    • Detect drift, contradiction, or anomaly
    • Flag potential bias patterns
    • Maintain traceability between inputs, reasoning, and outcomes
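    A validation layer of this kind can be sketched as a list of independent checks run over every output, with humans only reviewing what gets flagged. The two validators below are hypothetical placeholders, not a real system’s rules.

```python
from typing import Callable

# Hypothetical validation architecture: each check returns a list of flags,
# and a human only reviews outputs that were flagged.
Validator = Callable[[str], list[str]]

def no_forbidden_terms(text: str) -> list[str]:
    # Placeholder structural rule: this system excludes transaction logs.
    return ["forbidden: transaction log"] if "transaction log" in text.lower() else []

def within_length(text: str) -> list[str]:
    # Placeholder bound on output size.
    return ["too long"] if len(text.split()) > 500 else []

VALIDATORS: list[Validator] = [no_forbidden_terms, within_length]

def validate(output: str) -> list[str]:
    """Run every structural check; an empty list means nothing needs human review."""
    return [flag for check in VALIDATORS for flag in check(output)]

print(validate("Keep a transaction log for every purchase."))
# ['forbidden: transaction log']
```

    The design choice is that the checks run the same way every time, regardless of who is on shift or how long they’ve been doing it. Vigilance lives in the architecture, not in a person.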

    Humans remain responsible for judgment and accountability. But they’re no longer acting as general-purpose error detectors hoping to catch whatever slips through. They’re making targeted decisions informed by structured checks.

    Common Mistakes

    The failure modes are predictable.

    Treating validation as optional quality checking rather than core architecture. Validation gets bolted on later, if at all.

    Building generation without thinking about the validation layer. The question “how will we know if this is right?” comes too late.

    Assuming humans will catch what AI misses. That is the structural weakness in many HITL implementations. Humans are systematically bad at the kind of checking that AI excels at.

    Validating outputs without validating reasoning. An output might look correct but the reasoning that produced it could be flawed. Both need checking.

    No traceability. If you can’t trace from input through reasoning to output, you can’t validate properly and you can’t assign accountability when things go wrong.

    The Real Shift

    Organizations keep asking: “Where should we put a human to review the AI?”

    This is the wrong question.

    The right question: “How do we architect validation into the system so it’s repeatable, auditable, and doesn’t depend on someone staying vigilant?”

    That’s not a tooling question. It’s an operating model question.

    If AI is used only to generate and humans are left to review informally, you get:

    • Inconsistent quality
    • Hidden errors
    • Untraceable reasoning
    • Compliance risk disguised as compliance process

    If AI is used for structured validation with humans responsible for judgment and accountability, you get:

    • Systematic quality checking
    • Surfaced errors and edge cases
    • Traceable reasoning
    • Actual governance, not theatre

    Human-in-the-loop is a reassurance phrase.

    AI-in-the-loop for structural validation is system design.

    The organizations that understand this difference will be the ones where AI becomes reliable and scalable, not just impressive in demos.


    Questions to Ask About Your AI Systems:

    What are you using AI to validate, not just generate?

    When AI produces output, what structural checks run automatically before a human ever sees it?

    If an AI-assisted decision goes wrong, can you trace the reasoning that led to it?

    What happens when human reviewers get tired, distracted, or stop paying attention after six months?

    If you can’t answer these questions, you’re hoping your AI systems work reliably. You’re not designing them to.

  • Human-in-the-Loop Is Not a Safety Strategy

    “Human in the loop” sounds like a corporate shorthand for AI safety.

    But it’s not a safety strategy. It’s a hope strategy. Hoping that keeping humans in the loop will make AI safe to use. And hoping that someone catches the error before it ships is not how high-stakes systems should be designed.

    What Does “Human-in-the-Loop” Actually Mean?

    Ask ten organizations how they implement human-in-the-loop and you’ll get ten different answers, if not eleven.

    Some describe collaborative work: humans and AI iterating together, each contributing what they do best. Others mean active oversight: humans making key decisions while AI handles routine processing. Many mean review: AI produces output, humans check it, correct it, before approving it.

    The problem isn’t the variety of definitions, although that is a problem. The real problem arises when compliance frameworks require “human oversight,” regulations mandate human review of AI decisions, and organizations need to demonstrate “responsible AI”: in that environment, the implementation is likely to default to the simplest, most auditable form.

    Someone checks the output before it ships.

    This happens because:

    Review work is easy to audit. You can count reviews completed, track approval rates, measure time-per-review. It produces the metrics compliance needs.

    Collaborative work is hard to document. How do you verify that a human “worked with” AI meaningfully? What does that look like in a compliance report?

    Review scales more easily. You can distribute review work across many people with minimal training, even having the same output reviewed by more than one person. Collaborative work between human and AI, by contrast, requires domain expertise and judgment.

    I don’t have direct evidence yet of organizations implementing HITL as repetitive review work purely to satisfy compliance requirements.

    But I’d be surprised if it isn’t already happening or about to happen.

    Because when requirements say “human oversight” but don’t specify what meaningful oversight looks like, organizations are likely to follow the path of least resistance.

    The gap between “collaborative human-AI work” and “someone reviews the output” is where the safety strategy fails.

    The Real Problem: We Are Likely Putting Humans in the Wrong Place

    The standard HITL implementation seems to look like this, whether by design or by drift:

    AI produces → Human reviews → Output is approved

    This treats humans as a final safety filter.

    The failure of human-in-the-loop isn’t about removing humans from AI systems. It’s that we’ve designed the wrong role for them.

    We placed humans at the end of the process, doing open-ended review, and expected them to provide consistency, bias correction, and governance through attention alone.

    To me, this model relies on assumptions about humans that don’t hold in practice. It asks humans to do exactly the things they’re structurally bad at:

    • Large-scale consistency checking
    • Rule enforcement across long material
    • Spotting subtle distributional bias
    • Maintaining vigilance over time

    Three Assumptions That Break

    Assumption 1. Humans are reliable error detectors

    Reviewing is not the same cognitive skill as producing.

    I’ve written hundreds of business requirements and many architecture documents over 25+ years. The feedback pattern is pretty consistent. In most cases reviewers catch formatting issues, query specific details, challenge individual statements.

    What they rarely catch: what’s missing entirely.

    The gap that would derail implementation six months later. The scenario no one thought to ask about. The dependency that wasn’t documented because everyone assumed it was obvious.

    Finding what’s absent requires work that is cognitively expensive and time-consuming. Most reviewers don’t do it, not because they’re careless, but because it’s not what human brains are optimized for. It also demands deep domain knowledge that they might not have.

    Quality control research from manufacturing offers a sobering benchmark. Trained human inspectors, doing repetitive physical inspection work, typically catch 80-85% of defects. That means even in optimized conditions, 15-20% of errors slip through.

    These are trained inspectors looking for known defect types in physical products. AI review is likely cognitively harder. Reviewers aren’t trained quality inspectors. They’re doing open-ended validation of complex outputs, expected to catch not just errors, but subtle inconsistencies, missing information, and bias.

    If trained inspectors miss 15-20% of physical defects, what’s the error rate for (un)trained reviewers checking AI-generated documents for logical contradictions and structural issues?

    The same limitations are likely to show up in AI review.

    Reviewers are more likely to notice tone problems, flag awkward phrasing, check factual claims they already know about. But spotting that section 3 contradicts section 7, or that the framework quietly shifted assumptions halfway through? That requires holding the logic the document is trying to express in active memory while comparing it systematically.

    Humans aren’t built for systematic consistency checking. Systems are.

    When AI output is reviewed by humans, what gets checked is surface-level correctness — the things that are easy to see. What slips through is structural inconsistency — the things that require systematic comparison.

    Assumption 2. Humans are neutral judges

    There’s an implicit belief that humans will correct bias in AI outputs.

    But humans carry their own biases — and they’re often invisible to us.

    I’ve been consistently referred to as “he” by AI systems because I discuss personal finance, business strategy, or economics. My male colleagues were surprised when I mentioned this. None of them had ever been misgendered by AI.

    The bias was invisible to them because it aligned with their expectations. Technical and financial expertise still maps to “male,” so when they are addressed as “he” it doesn’t register as a choice the AI made based on a model with inherent bias.

    I’ve also watched AI consistently soften women’s voices when they report domestic violence or raise concerns about male behavior. The content stays factually similar, but the framing shifts. Assertive statements become tentative. Direct concerns become qualified worries. The result is that women’s voices are discredited by the AI.

    Many human reviewers would miss both patterns. Not because they approve of gender bias, but because the outputs align with deeply embedded cultural expectations about who has authority in which domains and how women should speak about difficult topics.

    When both the training data and the reviewers share similar blind spots, a human in the loop doesn’t correct bias — it reinforces it.

    HITL can’t fix what humans can’t see.

    Assumption 3. Humans will maintain vigilance over time

    HITL assumes sustained vigilance over time.

    But reviewing AI output violates everything we know about what motivates humans to do work well.

    Research on human motivation identifies three elements that drive sustained performance: autonomy (control over how you work), mastery (getting better at something meaningful), and purpose (understanding why it matters). For more see Daniel Pink’s book Drive.

    Review work provides none of these.

    Autonomy: The reviewer didn’t create the output. They can only approve or reject, and correct what someone else, or something else, produced. They have no control over the quality of what arrives for review.

    Mastery: What does it mean to get better at reviewing AI output? The skill isn’t building toward expertise. It’s maintaining vigilance over repetitive material. There’s no growth trajectory. Besides, how do you learn something when you are not doing the work?

    Purpose: The implicit message is “check that the AI didn’t mess up.” That’s not a purpose. That’s a liability shield.

    Content moderation research demonstrates what happens when review work lacks these motivational elements. Studies of Facebook and Reddit moderators show consistent patterns: burnout, emotional exhaustion, and apathy are common, even among volunteers who initially cared deeply about their communities.

    Commercial content moderators, people paid to review flagged material, report even worse outcomes. Microsoft and Facebook have faced lawsuits from moderators who developed PTSD from the work. Research comparing content moderators to first responders and police analyzing child exploitation material found comparable psychological impacts.

    This isn’t about the disturbing nature of content moderation specifically. It’s about what happens when human work is reduced to validating system output without autonomy, mastery, or purpose.

    AI review likely follows the same pattern. Initial scrutiny is careful. Within months, it becomes perfunctory. The job hasn’t changed. The motivation has collapsed.

    Designing a system that depends on continuous human vigilance isn’t a safety strategy. It’s hoping people won’t get bored, burned out, or detached.

    Why This Matters

    These three limitations aren’t edge cases. They’re fundamental to how review work functions.

    Humans miss structural errors because finding what’s missing requires cognitive work that’s mentally expensive and hard to do well.

    Humans often miss bias because the outputs align with cultural patterns they’ve internalized. When both the training data and the reviewers share similar blind spots, the loop reinforces bias rather than removing it.

    Humans lose vigilance because review work offers no autonomy, no mastery, and no purpose. Even people who start out motivated, whether volunteers who care about their communities or professionals committed to quality, are likely to experience declining attention over time.

    HITL treats review as a safety mechanism.

    But humans don’t catch structural errors. They don’t see their own biases. And their attention degrades over time when the work lacks meaningful motivation.

    The problem isn’t that humans are involved. The problem is that we’ve designed a role humans can’t perform reliably at scale.

    Questions to Ask When Designing Human-in-the-Loop AI Systems

    If you’re implementing human-in-the-loop for AI systems, these are the questions that matter:

    What error types are humans actually catching?

    – Can you distinguish between surface-level corrections (typos, formatting) and structural issues (missing information, logical contradictions)?

    – Are you measuring what reviewers miss, or only what they flag?

    What biases are reviewers systematically missing?

    – How are you testing whether reviewers share the same blind spots as the training data?

    – What happens when bias aligns with reviewer expectations so closely the bias in the AI output becomes invisible?

    How is review quality changing over time?

    – Do you have baseline data from early reviews to compare against current performance?

    – Are approval rates increasing while error rates stay constant — or are you not measuring both?

    What motivates reviewers to maintain vigilance?

    – Does the work provide autonomy, opportunities for mastery, and a clear purpose?

    – Or is it repetitive validation that erodes motivation over time?

    What does “human oversight” mean in your implementation?

    – Is it collaborative work where humans and AI contribute different strengths?

    – Or is it review queues where someone checks output before approval?

    – If you can’t specify exactly what meaningful oversight looks like, you probably don’t have it.

    If you can’t answer these questions with data, you likely don’t have a safety strategy for your AI-enabled work. You have a compliance checkbox that makes you feel protected without actually designing for the limitations it claims to solve.

    HITL isn’t inherently wrong. But treating it as a safety strategy, as something that will catch every error the AI makes, without understanding what humans can and cannot do reliably in review roles, is hoping that vigilance, attention, and bias detection will somehow emerge from a system designed to undermine all three.

    Hope doesn’t scale.

    Neither does human-in-the-loop as it is currently discussed and implemented.

  • Original Thinking in the AI Era

    There’s a common claim that in the age of AI, original thinking no longer matters, that everything worth saying has already been said, and machines can now say it better.

    AI is very good at synthesis. It can summarise, connect, and re-express existing knowledge at scale. Best practices are now cheap. Competent execution is table stakes.

    But synthesis is not origination.

    Original thinking doesn’t come from summarising and synthesising what already exists on the web, in books, or in training data. It comes from lived experience, systematic experimentation, and pattern recognition across domains that don’t usually speak to each other.

    My interests sit at these intersections.

    Whether in enterprise architecture, personal finance, health, or productivity, the same pattern keeps repeating: systems are designed for averages, and built on assumptions we no longer even see and therefore don’t question. The only way to see that clearly is to have skin in the game, gather your own data, and be willing to challenge orthodoxy. To return to first principles and question what we’ve learned to take for granted.

    AI can help articulate insights once they exist. It can help structure thinking, test language, and scale communication.

    What it can’t do is:

    • have lived experience
    • generate new data through personal experimentation
    • notice patterns across a specific combination of domains
    • challenge established assumptions

    In other words, AI can amplify original thinking — it can’t create it.

    As generic synthesis becomes abundant, original insight becomes rarer, more valuable, and more interesting by contrast. The bottleneck isn’t access to information anymore. It’s the willingness to do the work that produces something genuinely new.

    That’s the kind of thinking I’ll be documenting here.

    — Charlotte