Scenario Memory Experiment Plan
Exact files and modules likely to be used
Primary ProsperPath components:
- pp_query/service.py - main graph search entrypoint
- pp_query/schemas.py - search strategy definitions
- pp_query/query_client.py - config-driven query client used for the pilot
- decision_feat/graph_signals.py - graph evidence and signal extraction logic
- scripts/query_memories.py - useful pattern for direct Neo4j access
- schema/kg_schema.py - schema reference for interpreting retrieved entities
Existing data assets:
- mock_data/chat.json
- mock_data/ratio.json
- mock_data/user_memories.json
- mock_data/loca_trends.json
- 1001_output.json
- 1001_recommendations.json
Neo4j queries needed
Read-only Cypher inspections:
- count Episodic nodes for the target group
- count entities by label
- count relationships by type
- count Action, Decision, Recommendation, Challenge, and FinancialPeriod nodes
- count edges where r.episodes IS NOT NULL
- sample action-to-goal, action-to-challenge, and recommendation-linked relationships
These queries are sufficient for:
- feasibility checking
- selecting usable cases
- confirming whether episode-linked context is present in the current graph
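A minimal sketch of these inspections, assuming the official neo4j Python driver; the URI, credentials, and group identifier are placeholders, and the `group_id` property and labels should be checked against schema/kg_schema.py before running:

```python
# Read-only feasibility inspections against the graph.
# Assumptions: local Neo4j instance, placeholder credentials,
# and a `group_id` property on Episodic nodes.
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"   # assumption: local instance
AUTH = ("neo4j", "password")    # assumption: placeholder credentials
GROUP_ID = "1001"               # assumption: target group identifier

INSPECTIONS = {
    "episodic_nodes":
        "MATCH (e:Episodic) WHERE e.group_id = $group_id RETURN count(e) AS n",
    "entities_by_label":
        "MATCH (n) RETURN labels(n) AS label, count(n) AS n ORDER BY n DESC",
    "relationships_by_type":
        "MATCH ()-[r]->() RETURN type(r) AS type, count(r) AS n ORDER BY n DESC",
    "episode_linked_edges":
        "MATCH ()-[r]->() WHERE r.episodes IS NOT NULL RETURN count(r) AS n",
}

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    for name, query in INSPECTIONS.items():
        # Extra parameters are ignored by queries that do not use them.
        records, _, _ = driver.execute_query(query, group_id=GROUP_ID)
        print(name, [r.data() for r in records])
```

Because every query is a MATCH-only count or sample, the script cannot mutate the graph, which keeps the feasibility check safe to rerun.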
Expected intermediate outputs
From the pilot pipeline:
- raw search results for each case and each condition
- compact baseline context package
- compact scenario-style context package
- a scored comparison table
- a short interpretation note describing what improved and what remained weak
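One way to keep these intermediates inspectable is a single record per case and condition. The shape below is a hypothetical sketch, not the pipeline's actual schema; every field name is illustrative:

```python
# Hypothetical per-case, per-condition pilot record.
# Field names are illustrative, not taken from the existing pipeline.
from dataclasses import dataclass, field

@dataclass
class PilotCaseResult:
    case_id: str                    # decision situation being tested
    condition: str                  # "fragment" (baseline) or "scenario"
    raw_results: list[dict] = field(default_factory=list)  # raw search hits
    context_package: str = ""       # compact context handed to the rater
    scores: dict[str, int] = field(default_factory=dict)   # rubric dim -> score
    notes: str = ""                 # short interpretation note
```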
Current pilot artifacts:
- output/experiments/scenario_memory_pilot_results.json
- output/experiments/scenario_memory_pilot_results.csv
- output/experiments/scenario_memory_examples.md
- output/experiments/scenario_memory_pilot_summary.md
Evaluation tables to generate
At minimum:
- a case-by-case comparison table with:
  - strategy
  - node count
  - edge count
  - context completeness
  - evidence traceability
  - missing-information awareness
  - decision usefulness
For the paper later:
- a compact task table describing each decision situation
- a results table aggregating rubric scores across fragment vs scenario conditions
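A sketch of how the aggregated results table could be produced from the existing CSV, assuming per-case rows with a `condition` column (fragment vs scenario) and one column per rubric dimension; the column names are assumptions, not the pilot's actual headers:

```python
# Aggregate rubric scores across fragment vs scenario conditions.
# Assumption: one row per (case, condition) with the rubric columns below.
import pandas as pd

RUBRIC = [
    "context_completeness",
    "evidence_traceability",
    "missing_info_awareness",
    "decision_usefulness",
]

df = pd.read_csv("output/experiments/scenario_memory_pilot_results.csv")

# Mean rubric score per condition, one row for fragment and one for scenario.
results_table = df.groupby("condition")[RUBRIC].mean().round(2)
print(results_table.to_markdown())  # requires the optional `tabulate` package
```

The same frame can feed the case-by-case table by pivoting on `case_id` instead of averaging.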
Whether human judgment is needed
Yes.
For the current assets, human judgment is necessary for the main pilot scoring because:
- the target construct is decision-grade context quality
- automatic scores would overfit to surface structure such as edge count
- the graph contains noise, so simple proxies would be misleading
Automatic proxy evaluation is realistic only as a secondary signal, for example:
- whether an output contains explicit problem-action-outcome linkage
- whether edge-backed evidence is present
- whether multiple contextual dimensions appear in one bundle
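A hedged sketch of such proxies as cheap structural checks over a context package; the keyword lists and the relation-fact pattern below are assumptions and would need tuning to the actual bundle format:

```python
# Secondary automatic proxies over a context package (plain text).
# Surface checks only; they corroborate, never replace, human rubric scores.
import re

def proxy_signals(bundle_text: str) -> dict[str, bool]:
    lowered = bundle_text.lower()
    return {
        # explicit problem-action-outcome linkage, via keyword co-occurrence
        "pao_linkage": all(
            k in lowered for k in ("problem", "action", "outcome")
        ),
        # edge-backed evidence, e.g. a cited relation fact like (a)-[:CAUSED]->(b)
        "edge_evidence": bool(re.search(r"-\[:?\w+\]->", bundle_text)),
        # multiple contextual dimensions present in one bundle
        "multi_dimensional": sum(
            k in lowered for k in ("goal", "challenge", "action", "outcome")
        ) >= 3,
    }
```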
Error and failure modes
- search retrieval may surface cross-domain or weakly related facts
- episode-aware retrieval may still fail to isolate one coherent business scenario
- some tasks may have recommendations but weak outcome traces
- some tasks may have goals and outcomes but weak challenge linkage
- graph relation labels may be semantically weak, for example many RELATES_TO edges
These failure modes do not invalidate the pilot, but they must be reported explicitly.
What success would look like for a first paper-quality experiment
Minimum success:
- at least one or two cases where the scenario-style package is clearly better than the fragment package on completeness and traceability
- the improvement is inspectable in retrieved nodes and relation facts
- the limitations are transparent and do not require invented evidence
Stronger success:
- the same pattern repeats across several cases
- the scenario-style package consistently surfaces goal-challenge-action-outcome structure more clearly than the baseline
- the results support a narrow empirical statement in the paper
The current run should therefore be treated as a first inspectable pilot, not as the final paper experiment.