NOUS OS Research Line · HAICES · v0

Toward compounding wisdom in human-AI pairs.

Most AI research measures how good the model became. Very little of it measures how much more capable, calibrated, or wise the human became, especially in the moments when AI is no longer in the room. This research line is built around that gap.


00 · Where this page fits

Research overview is the public map. Research Line is the operating evidence system.

Reader path · operator path

Start with Research when you want the plain-language overview: theory, model, metrics, Student Sandbox readiness, and source notes. Use this Research Line when you need the research protocol: preregistration, session review packets, evidence ledger, method commitments, and the gate that turns findings into product changes.

01 · North Star

The long-horizon question this line exists to answer.

North Star · decade-horizon · philosophical

Under what conditions does a human-AI pair accumulate compounding wisdom — not merely accelerated output, not merely personalized convenience — over months and years?

This question is intentionally not testable in a single session. The two near-term instruments below feed it. If those instruments trend upward across many cycles and many people, that is evidence for the north star. If they do not, the north star is wishful.
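One way to make "trend upward across many cycles" concrete is a per-person slope over cycle index, followed by the share of people whose slope is positive. The sketch below is only an illustration under stated assumptions: the function names, the input shape (one list of instrument scores per person, ordered by cycle), and the choice of an ordinary least-squares slope are all hypothetical and are not part of the pinned protocol.

```python
def slope(values):
    """Least-squares slope of values against cycle index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("a trend needs at least two cycles")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def share_trending_up(series_by_person):
    """Return (share of people with a positive slope, slope per person).

    ASSUMPTION: series_by_person maps a de-identified person id to that
    person's instrument scores in cycle order.
    """
    if not series_by_person:
        raise ValueError("no people to summarize")
    slopes = {person: slope(values) for person, values in series_by_person.items()}
    share = sum(s > 0 for s in slopes.values()) / len(slopes)
    return share, slopes
```

A slope says nothing about noise or sample size, so any such summary should still carry its N alongside it.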

02 · Near-term instruments

What we can actually measure on the way there.

The north star is not directly testable. These two questions are.

(a) capability-without-AI delta

Does the human perform measurably better when AI is absent?

After a scaffolded AI-assisted loop, the human is given a similar task without AI access. We compare the result to a no-AI baseline and a cold-chat-AI baseline. This is the reverse of the cognitive-offloading concern: it asks whether assistance leaves the human stronger rather than weaker.

surface · Student Sandbox
horizon · per-session, weeks
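A minimal sketch of how the capability-without-AI delta could be computed, assuming every participant in every condition takes the same no-AI transfer task and receives a numeric score. The function names, the use of a simple difference of means, and the score scale are assumptions for illustration; the actual scoring rubric lives in the metrics document and is not reproduced here.

```python
from statistics import mean


def capability_delta(scaffolded_scores, baseline_scores):
    """Mean no-AI transfer score after the scaffolded loop minus a baseline mean."""
    if not scaffolded_scores or not baseline_scores:
        raise ValueError("each condition needs at least one score")
    return mean(scaffolded_scores) - mean(baseline_scores)


def summarize(scaffolded, no_ai, cold_chat):
    """Report both deltas together with the N behind each condition."""
    return {
        "n_scaffolded": len(scaffolded),
        "n_no_ai": len(no_ai),
        "n_cold_chat": len(cold_chat),
        "delta_vs_no_ai": capability_delta(scaffolded, no_ai),
        "delta_vs_cold_chat": capability_delta(scaffolded, cold_chat),
    }


# Made-up numbers to show the call shape only; these are not results.
example = summarize(scaffolded=[7, 9], no_ai=[6, 6], cold_chat=[8, 6])
```

Reporting N next to each delta keeps the summary honest at small sample sizes, where a difference of means is a description and not an effect estimate.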

(b) calibrated trust + responsibility retention

Does trust track evidence quality rather than fluency?

Over repeated cycles, does the human delegate the right parts and retain the right parts? Does over-reliance shrink? Does the human get better at saying "the AI is wrong here"?

surface · trading-agent / cohort
horizon · per-cycle, months
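One common way to operationalize "does over-reliance shrink" is to log, for each decision, whether the AI suggestion turned out to be correct and whether the human accepted it, then track two rates per cycle. The sketch below assumes exactly that logging shape; the `Decision` record and both rate definitions are hypothetical illustrations, not the instrument this line has committed to.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    ai_was_correct: bool   # judged after the outcome is known
    human_accepted: bool   # did the human go with the AI suggestion?


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return a proportion, or None when there is nothing to divide by."""
    return numerator / denominator if denominator else None


def reliance_rates(decisions):
    """Over-reliance: accepted although the AI was wrong.
    Under-reliance: rejected although the AI was right."""
    wrong = [d for d in decisions if not d.ai_was_correct]
    right = [d for d in decisions if d.ai_was_correct]
    return {
        "n": len(decisions),
        "over_reliance": rate(sum(d.human_accepted for d in wrong), len(wrong)),
        "under_reliance": rate(sum(not d.human_accepted for d in right), len(right)),
    }
```

Calibrated trust in this framing means both rates falling over repeated cycles; a drop in only one of them can simply reflect a shift toward blanket acceptance or blanket rejection.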

03 · Three sub-lines

Three proof beds, three boundary conditions.

Each sub-line has a clear study unit, a baseline condition, and the boundaries it primarily tests. We move sub-lines forward when evidence accumulates, not when arguments accumulate.

L1 · Learning loop

Student Sandbox

Does a 20-minute scaffolded human-AI loop produce a measurable capability-without-AI delta vs (i) no-AI and (ii) cold-chat baselines?

Unit: one 20-min session + delayed transfer task without AI
Baselines: no-AI · cold-chat AI
Boundaries: learning, fact, privacy, taste

N = 0 real trials

L2 · Decision loop

Trading-agent

Does AI-augmented decision-making with explicit boundaries produce decisions the human can defend better, calibrate trust better, and learn from outcomes better over time?

Unit: one promoted candidate + post-outcome review
Baseline: retrospective bin by boundary discipline
Boundaries: decision, responsibility, value

running · co-evolution lens not yet applied

L3 · Knowledge loop

Personal knowledge

Does long-horizon human-AI memory interaction (Obsidian + TrustMem) produce compounding reflection capacity, or does it produce stale personalization and dependence?

Unit: 90-day Obsidian section + retrospective task
Baseline: pre-NOUS journaling segment, if available
Boundaries: identity, taste, responsibility

methodology not pinned

04 · Method commitments

The rules that turn a theory document into a research line.

Pre-register predictions before each session. Five-minute single-page document committing to what we expect to see. Stored at docs/research-line/preregistration/.
Two raters when feasible. Observer + a second reviewer independently score the session using self-evolution-metrics-v0.md. Track inter-rater agreement even at N=1.
De-identified review packets are the data. Every session produces one review packet, and each packet must be published in de-identified form. The packets are the corpus, not folder-dust.
Negative results are first-class. If a session fails the prediction, write it up. Failure-to-publish-negatives is the single most common research-line corruption.
Be explicit about N. N=1 case studies are legitimate but must be labeled. Never write "students tend to…" at N ≤ 5.
No instrument inflation. Resist adding metrics ad hoc. A new metric must be justified in a quarterly synthesis before it is adopted.
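The two-rater commitment above can be tracked with a standard agreement statistic. The sketch below computes Cohen's kappa over the rubric items of a single session, assuming each rater assigns one categorical label per item; that scoring shape is an assumption, since the rubric in self-evolution-metrics-v0.md is not reproduced here. With only a handful of items the value is very noisy, so it is best logged alongside raw percent agreement and the item count.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Returns None when chance agreement is 1 (both raters used one
    identical label throughout), because kappa is undefined there.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty list of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return None
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(observer_labels, reviewer_labels)` would be called once per review packet, where both argument names are placeholders for the two raters' item-by-item scores.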

05 · Position in the literature

What surrounds this work, and what is left empty.

The full anchor atlas (academic + industry + products + voices) — ~30 anchors across 6 buckets, each with our positioning — lives at research-line-atlas (also at docs/research-line/anchor-atlas.md). Summary positioning:

Tradition	Closest anchor	Where NOUS OS adds
Augmentation	Engelbart 1962 · Bush 1945	LLM-era boundary taxonomy + capability-delta instrument
Cognitive offloading	Risko & Gilbert 2016 · Storm & Stone	Reverse question: when does offloading make people stronger?
Self-regulated learning	Zimmerman · Vygotsky ZPD	AI-native instantiation of forethought → performance → reflection
AI literacy	Long & Magerko 2020 · UNESCO 2024	From descriptive taxonomy to instrumented loop
Tools for thought	Matuschak & Nielsen 2019 · Bret Victor	From individual cognition tools to explicit symbiosis
Practical AI advice	Mollick · Co-Intelligence 2024	From descriptive heuristics for adults to prescriptive measured protocol
Industry products	Khanmigo · NotebookLM · Cursor · Claude Code	Product-agnostic protocol layered on any AI surface

The question the literature leaves most open is whether a person is measurably more capable when AI is absent. That is where this line concentrates its effort.

06 · External-input loop

How fresh thinking enters the system every day.

The north star cannot be served by internal work alone. NOUS OS must continuously absorb external thinking — academic preprints, industry research, top products, individual essays and podcasts — without drowning in firehose noise. Three-tier discipline:

L1 · Capture

Daily

Scheduled remote agent pulls ~5–8 narrow high-signal sources, filters by anchor keywords, writes raw daily inbox to docs/research-line/inbound/_inbox/.

status · cron not yet live
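The filter-and-write step of the capture tier could look roughly like the sketch below. It is not the implemented capture script: the item shape (dicts with `title` and optional `summary`), the case-insensitive substring match, and the one-file-per-day ISO-date filename are all assumptions made for illustration. Only the inbox directory comes from the description above.

```python
from datetime import date
from pathlib import PurePosixPath

# Directory named in the capture description above.
INBOX_DIR = PurePosixPath("docs/research-line/inbound/_inbox")


def matches_anchor(text, anchor_keywords):
    """True if any anchor keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in anchor_keywords)


def filter_items(items, anchor_keywords):
    """Keep only fetched items whose title or summary mentions an anchor keyword."""
    return [
        item
        for item in items
        if matches_anchor(f"{item['title']} {item.get('summary', '')}", anchor_keywords)
    ]


def inbox_path(day):
    """ASSUMPTION: one Markdown inbox file per day, named by ISO date."""
    return INBOX_DIR / f"{day.isoformat()}.md"
```

Plain substring matching is deliberately crude; it over-captures, which suits a tier whose output is triaged by a human or agent each week.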

L2 · Triage

Weekly

Scheduled agent (or human) promotes 1–3 candidates per week to full 1-page inbound notes. Each note has an HTML mirror and joins the public corpus.

status · manual until L1 is live

L3 · Synthesis

Quarterly

Human + Claude write a synthesis tying inbound notes and session data to "what we read, what we changed because of it." Published quarterly.

status · template pending

07 · Current state · 2026-05-17

Where this line actually stands today.

N = 0 · real Student Sandbox sessions

1 · AI-simulated dry-run audit, not counted as real evidence

none · latest real review packet

N = 1 · next planned trial: first student-adjacent session

✓ Theory documents. human-ai-symbiosis-self-evolution, self-evolution-metrics-v0, human-ai-coevolution-model-v0, memory-philosophy-v0.
✓ Research-line spec. This page + docs/research-line/research-line.md.
~ Anchor atlas. Drafted in this wave; full living atlas page pending.
~ L1 sub-line (Student Sandbox). Scaffold + public web + recruitment templates complete. N = 0 real trials.
~ L2 sub-line (trading-agent). Active in production; no co-evolution lens applied to existing data yet.
○ L3 sub-line (personal knowledge). Methodology not pinned.
✓ L1 capture cron. Script, workflow, and offline tests implemented; first scheduled PR pending.
✓ Pre-registration template. Implemented at docs/research-line/preregistration/_template.md.
✓ Review Packet Index. Implemented at docs/research-line/session-review-index.md; latest real review is none.
✓ Research-to-Product Gate. Implemented at docs/research-line/research-to-product-gate.md; first use expected after N=1.
✓ Research Pipeline cockpit. Implemented at demo/research-pipeline.html; turns the North Star into preregistration → session → review → ledger → gate.
○ First quarterly synthesis. Pending real session evidence.

The single highest-leverage action remains: run one real Student Sandbox session and produce N = 1.

08 · What this line is not

The deliberate negations.

This research line is NOT —

a tutoring product or an EdTech offering
a benchmark suite
a substitute for peer review
an excuse to write theory papers without running experiments
infrastructure work disguised as research

If we find ourselves adding boundary types or metrics faster than we add sessions, we have drifted.