Scenario Memory Experiment Design

Research question(s)

Primary question:

Can scenario-oriented context construction produce more decision-grade context than fragment-oriented retrieval on SME-like decision support tasks, when both strategies draw on the same underlying graph memory?

Secondary questions:

  • Does scenario-style context improve context completeness?
  • Does it improve evidence traceability?
  • Does it make missing information more visible rather than hiding it?
  • Does it produce a more decision-useful context package for short-horizon SME decisions?

Hypotheses

H1: For SME-style operational or tactical decisions, scenario-style context construction will outperform fragment-oriented retrieval on context completeness and evidence traceability.

H2: Scenario-style context construction will more often surface explicit links between challenge, action, goal, and observed outcome.

H3: Any observed gains will be strongest on short-horizon, operationally grounded tasks rather than on broad strategic questions.

Task definition

This experiment does not compare final business answers. It compares context packages.

Each test item is a decision situation consisting of:

  • a realistic owner query
  • the same underlying ProsperPath graph as the information source
  • two context construction strategies

Candidate task types for the first run:

  • supplier MOQ pressure under cash-flow constraints
  • weekend staffing and rota conflict decisions
  • October sales decline and conversion recovery

These tasks were selected because the current assets include problem traces, suggested actions, and at least partial goal or outcome evidence.
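A test item as described above can be represented as a small record. This is a minimal sketch; the field names and the `DecisionSituation` class are illustrative assumptions, not an existing ProsperPath API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionSituation:
    """One test item: a realistic owner query evaluated under both conditions."""
    item_id: str
    owner_query: str   # the realistic owner query
    task_type: str     # e.g. "supplier_moq_pressure", "weekend_staffing"
    graph_source: str  # identifier of the shared ProsperPath graph snapshot

# The same item is handed to both context construction strategies, so any
# difference between the resulting packages is attributable to the strategy.
item = DecisionSituation(
    item_id="pilot-001",
    owner_query="Supplier raised the MOQ; can we absorb it given current cash flow?",
    task_type="supplier_moq_pressure",
    graph_source="prosperpath-graph-v1",
)
```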

Baseline definition

Condition A: fragment-oriented retrieval

  • Retrieval strategy: node_rrf
  • Interpretation: top matching nodes are treated as the context package
  • Strength: simple and realistic baseline
  • Weakness: does not explicitly recover relational business logic

Condition B: scenario-style context construction

  • Retrieval strategy: combined_rrf
  • Interpretation: top nodes plus episode-linked relational facts are assembled into a bounded context package
  • Strength: more likely to expose challenge, goal, action, and outcome links in one bundle
  • Weakness: still an approximation, not a fully implemented scenario-memory layer

edge_episode was tested during feasibility checking, but it was noisier than combined_rrf in the inspected cases. It is therefore better treated as supporting feasibility evidence rather than the main pilot condition.
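Both condition names suggest reciprocal rank fusion (RRF) as the shared scoring core. As a point of reference, a minimal RRF sketch follows; the document ids and the two input rankings are hypothetical, and this does not reproduce the actual node_rrf or combined_rrf implementations:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankers of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Condition A would fuse node-level rankings only; Condition B would
# additionally fuse rankings over episode-linked relational facts
# before assembling the bounded context package.
fused = rrf_fuse([
    ["challenge:moq", "node:supplier", "node:cashflow"],  # semantic ranking
    ["node:supplier", "node:cashflow", "challenge:moq"],  # keyword ranking
])
print(fused[0])  # node:supplier
```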

Evaluation metrics

Preferred pilot metrics:

  • Context completeness
    • Does the context package expose the main problem, relevant constraint, plausible action, and outcome or target state?
  • Evidence traceability
    • Can the reader trace why the recommendation or action is being surfaced?
  • Missing-information awareness
    • Does the context package make important unknowns visible rather than implying false completeness?
  • Decision usefulness
    • Would a reviewer plausibly regard the package as a better basis for a business decision discussion?

Optional metric:

  • Action justification quality
    • Useful later, but not necessary for the first pilot

For this first run, these metrics are best scored by explicit human judgment using a small rubric. Automatic proxy metrics alone would be too weak.
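The rubric scoring can be kept very lightweight. A sketch of one reviewer's scorecard, assuming a 1-5 scale per metric (the scale and metric keys are assumptions for illustration):

```python
RUBRIC = [
    "context_completeness",    # problem, constraint, action, outcome exposed?
    "evidence_traceability",   # can the reader trace why this surfaced?
    "missing_info_awareness",  # are important unknowns made visible?
    "decision_usefulness",     # better basis for a decision discussion?
]

def score_package(ratings: dict) -> float:
    """Average one reviewer's 1-5 ratings across the four pilot metrics."""
    missing = [m for m in RUBRIC if m not in ratings]
    if missing:
        raise ValueError(f"unscored metrics: {missing}")
    return sum(ratings[m] for m in RUBRIC) / len(RUBRIC)

avg = score_package({
    "context_completeness": 4,
    "evidence_traceability": 3,
    "missing_info_awareness": 5,
    "decision_usefulness": 4,
})
print(avg)  # 4.0
```

Keeping the optional action-justification metric out of `RUBRIC` for the first run matches the pilot scoping above.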

Data selection criteria

Include only cases where:

  • the query corresponds to a concrete SME-style decision problem
  • the graph contains at least one relevant challenge or crisis trace
  • the graph contains at least one action or recommendation trace
  • the graph contains at least one goal, benefit statement, or partial outcome trace

Exclude cases where:

  • retrieval returns mostly unrelated domains
  • there is no visible action or recommendation trace
  • there is no inspectable contextual grounding beyond generic owner/company information
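The inclusion and exclusion criteria can be applied mechanically once each candidate case is annotated with trace flags. A sketch, assuming the boolean flag names below have been extracted from the graph by hand or by a prior script:

```python
def is_eligible(case: dict) -> bool:
    """Apply the pilot's inclusion/exclusion criteria to one candidate case."""
    include = (
        case["concrete_sme_decision"]        # concrete SME-style decision problem
        and case["has_challenge_trace"]      # relevant challenge or crisis trace
        and case["has_action_trace"]         # action or recommendation trace
        and case["has_goal_or_outcome_trace"]  # goal, benefit, or partial outcome
    )
    exclude = (
        case["mostly_unrelated_domains"]         # retrieval mostly off-domain
        or not case["has_action_trace"]          # no visible action trace
        or not case["has_contextual_grounding"]  # nothing beyond generic owner info
    )
    return include and not exclude

eligible = is_eligible({
    "concrete_sme_decision": True,
    "has_challenge_trace": True,
    "has_action_trace": True,
    "has_goal_or_outcome_trace": True,
    "mostly_unrelated_domains": False,
    "has_contextual_grounding": True,
})
```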

Minimum dataset size needed for a meaningful first run

For a pilot:

  • 3-5 decision situations are sufficient to test the pipeline and reveal whether the comparison is promising

For a stronger initial experiment:

  • 15-30 decision situations would be a more defensible minimum, ideally spanning several problem types with repeated instances of each

Risks to validity

  • The current graph is not a clean implementation of a scenario-memory layer.
  • combined_rrf is only a proxy for scenario-oriented context construction.
  • The dataset is synthetic or semi-simulated.
  • Outcome traces are inconsistent across cases.
  • Retrieval noise can inflate or distort apparent scenario gains.
  • Human scoring in the pilot is subjective and unblinded.

Because of these risks, the pilot should be framed as an initial empirical check on feasibility, not as definitive validation.