
How to Collaboratively Build a Data Analysis Notebook

This tutorial shows how to use sudocode when building data analysis scripts. You’ll see how to break the work down step by step while capturing requirements iteratively - exploring your data, figuring out what to analyze, and discovering missing requirements as you plan the implementation.

Traditional approach:
  1. Receive data dump
  2. Write exploratory scripts alone
  3. Realize you need different metrics
  4. Rewrite analysis
  5. Discover data quality issues late
  6. Context scattered across notebooks
Using sudocode for data analysis:
  1. Explore existing data with AI assistance
  2. Discuss what metrics and visualizations would be valuable
  3. Capture requirements in a spec
  4. Break down into concrete implementation issues
  5. Discover missing requirements during planning (e.g., deduplication)
  6. Add new issues as needs emerge
  7. Implement with full context preserved in version control

Workflow


1. Explore Data Together

Use AI assistance to understand the data structure and format

Start by asking your agent to explore existing data files and understand the current state:
You: "I want to explore how we'll synthesize the data we've collected about sudocode usage. reference our current specs around
the data collection scripts. In our case, we have some data collection
you can see a test for under the data dir. take a look at this dir to
see how we format this data."
The agent will:
  • Read existing related specs for context
  • Explore the data directory structure
  • Examine CSV headers and sample rows
  • Understand data relationships and granularity
This establishes shared understanding before diving into requirements.
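If you want to poke at the files yourself alongside the agent, a small pandas sketch like the one below prints each CSV's columns and a few sample rows. The `data/` path and file layout here are assumptions for illustration, not the actual collection format:

```python
# Hypothetical exploration sketch; the data/ location and layout are assumptions.
from pathlib import Path
import pandas as pd

data_dir = Path("data")  # assumed root of the collection runs
for csv_path in sorted(data_dir.glob("**/*.csv")):
    sample = pd.read_csv(csv_path, nrows=5)  # headers plus a few sample rows
    print(csv_path)
    print(list(sample.columns))
    print(sample.head(3).to_string())
```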

2. Discuss Analytical Approaches

Have a conversation about what insights you want and how to get them

Before writing a spec, discuss what analysis would be valuable:
You: "Let's first talk about what we should analyze. we want to get
usage data about sudocode from various repositories and PRs, to better
understand how people are using sudocode. you've seen the data, can you
make any suggestions of how we should set up this data to be queryable
and what we should demonstrate?"

Agent: "Looking at what you're collecting, here are the key insights I
think would be valuable:

High-Value Analytics:

1. Adoption & Engagement Metrics
   - Sudocode penetration rate: % of repos with sudocode enabled
   - Active usage rate: Of repos with sudocode, what % use it in PRs?

2. Productivity Signals
   - Time to review: Do PRs with sudocode get first review faster?
   - Time to merge: Do sudocode PRs merge faster?
   - PR size correlation: Are sudocode PRs larger/smaller?

3. Workflow Patterns
   - Specs vs Issues ratio: Are teams writing specs before implementation?
   - Modification patterns: Are specs heavily modified?

I'd recommend using Pandas + Matplotlib/Seaborn because..."
This conversation establishes:
  • What metrics matter
  • What questions to answer
  • Which tools to use
  • What visualizations to create
The agent brings suggestions, you provide domain knowledge and priorities.
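To make the discussion concrete, here is a rough pandas sketch of how two of the proposed metrics could be computed. The file names and columns (`sudocode_enabled`, `uses_sudocode`, `created_at`, `first_review_at`) are assumptions for illustration; the real schema comes from your collection scripts:

```python
# Illustrative sketch only; file and column names are assumed, not the real schema.
import pandas as pd

repos = pd.read_csv("repos.csv")
prs = pd.read_csv("prs.csv", parse_dates=["created_at", "first_review_at"])

# Adoption: share of repos with sudocode enabled
penetration_rate = repos["sudocode_enabled"].mean()

# Productivity signal: median hours to first review, split by sudocode usage
prs["hours_to_review"] = (
    prs["first_review_at"] - prs["created_at"]
).dt.total_seconds() / 3600
time_to_review = prs.groupby("uses_sudocode")["hours_to_review"].median()

print(f"Penetration rate: {penetration_rate:.1%}")
print(time_to_review)
```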

3. Co-create the Specification

Work together to capture requirements in a spec

Once you’ve discussed the approach, create the spec:
You: "great. let's create this spec, let's focus on adoption and workflow patterns."

Agent: [Uses upsert_spec() MCP tool to create comprehensive specification
including goals, data model, metrics, visualizations, and architecture]
The initial spec will likely include:
  • Goals and success metrics
  • Data sources and schema
  • Key analytics and visualizations
  • Technical architecture
  • Implementation phases
Review the spec in the web UI at localhost:3000.

4. Break Down into Issues

Split the spec into actionable implementation tasks
You: "great. let's split this spec into issues"

Agent: [Creates multiple issues following the spec's phases:
- i-3uzc: Build data loading infrastructure
- i-14md: Build metrics computation engine
- i-46w8: Build visualization library
- i-4dmu: Create interactive Jupyter notebooks
- i-6dek: Add query helper utilities (optional)
]
The agent will:
  • Create issues based on implementation phases
  • Link all issues to the spec via the implements relationship
  • Establish dependency chains using depends-on links
  • Set appropriate priorities
Review the issues and dependency graph in the web UI.
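To see why the dependency links matter, the toy sketch below orders the example issues with a topological sort. The specific depends-on edges are assumptions inferred from the phases; in practice sudocode tracks these links for you, so this is only to illustrate the ordering concept:

```python
# Toy illustration of dependency ordering; the depends-on edges are assumed.
from graphlib import TopologicalSorter

depends_on = {
    "i-3uzc": set(),               # data loading has no prerequisites
    "i-14md": {"i-3uzc"},          # metrics depend on data loading
    "i-46w8": {"i-14md"},          # visualizations depend on metrics
    "i-4dmu": {"i-46w8"},          # notebooks depend on visualizations
}
print(list(TopologicalSorter(depends_on).static_order()))
# ['i-3uzc', 'i-14md', 'i-46w8', 'i-4dmu']
```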

5. Implement Issues

Execute the issues in dependency order

Now dispatch agents to implement the work:
  • Start with foundation issues (data loading)
  • Progress through metrics and visualizations
  • End with notebooks and documentation
As agents work, they may provide feedback on the spec or discover additional edge cases that need handling.

6. Discover Missing Requirements

Identify gaps through conversation - add issues iteratively

As you discuss and implement issues, you may realize something is missing:
You: "question about the analytics. so in our case, someone will run
these every few weeks, but there may be overlap between the data dirs
that get generated. is there any deduplication that happens in the
analysis? if not, can we add an issue for this?"

Agent: "Great catch! You're absolutely right - if you run data
collection every few weeks with overlapping lookback periods, you'll
have duplicate PRs, commits, and repos across different run directories.

Without deduplication, metrics will be inflated and trends will be
inaccurate. Let me create a new issue for this:"

Agent: [Uses upsert_issue() to create deduplication issue:
- i-7gkc: Add deduplication logic for multi-run aggregation
- Links to spec and related data loading issue
]
This is key: Requirements emerge through discussion. Don’t try to think of everything upfront - let the conversation surface gaps.
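For a sense of what the new deduplication issue might eventually involve, here is a minimal pandas sketch. The `run_*/prs.csv` layout and the `pr_id`/`collected_at` columns are assumptions, and keeping the latest snapshot per PR is just one possible strategy:

```python
# Hypothetical deduplication sketch; directory layout and columns are assumed.
from pathlib import Path
import pandas as pd

run_files = sorted(Path("data").glob("run_*/prs.csv"))
all_prs = pd.concat(
    [pd.read_csv(p, parse_dates=["collected_at"]) for p in run_files],
    ignore_index=True,
)

# Keep the most recently collected record per PR so later runs override
# earlier snapshots of the same PR
deduped = (
    all_prs.sort_values("collected_at")
           .drop_duplicates(subset="pr_id", keep="last")
)
```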

Best Practices

Don’t jump straight to writing specs. Explore and discuss first.

Approach:
You: "Let's look at this data together and figure out what to analyze."

Agent: [Explores data, suggests metrics and approaches]

You: [Provide domain knowledge, priorities, constraints]

You: "That approach makes sense. Let's create a spec for it."
Benefits:
  • Agent understands context before formalizing
  • You get suggestions informed by data exploration
  • Requirements emerge naturally from discussion
  • Shared understanding before committing to approach
Specs should define what you want, not how to build it.

Wrong approach:
# Bad spec

## Implementation

```python
def load_data(path):
    df = pd.read_csv(path)
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    return df
```
Right approach:
# Good spec

## Requirements

Data loading module should:
- Load CSV files from collection run directories
- Automatically parse datetime fields
- Support single-run and multi-run aggregation
- Handle missing data gracefully
Benefits:
  • Specs stay relevant longer
  • Implementation can adapt to new patterns
  • Focus remains on requirements
  • Easier to review and understand
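For context, code like this belongs in the implementation rather than the spec. A minimal sketch of a loader meeting the requirements above might look as follows, with the directory layout, file name, and datetime columns assumed for illustration:

```python
# Minimal loader sketch; directory layout, file name, and columns are assumptions.
from pathlib import Path
import pandas as pd

def load_runs(data_dir="data", filename="prs.csv", datetime_cols=("created_at",)):
    """Load one CSV per collection run and concatenate them into a single frame."""
    frames = []
    for path in sorted(Path(data_dir).glob(f"*/{filename}")):
        df = pd.read_csv(path)
        for col in datetime_cols:
            if col in df.columns:
                # Tolerate missing or malformed timestamps instead of failing
                df[col] = pd.to_datetime(df[col], errors="coerce")
        frames.append(df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```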
Don’t try to specify everything upfront. Discover gaps iteratively.

Example from walkthrough:
Initial plan: Load data, compute metrics, visualize

During discussion: "Wait, what about deduplication?"

New issue created: Deduplication logic for multi-run aggregation
This is normal and expected:
  • Requirements surface through discussion
  • Edge cases emerge when reviewing the plan
  • Implementation reveals additional needs
  • Specs evolve based on learnings
Benefits:
  • More complete requirements
  • Realistic implementation plans
  • No premature over-specification
  • Context for why requirements were added
Have a conversation about what to analyze before deciding how.

Key questions to discuss:
  • What insights do we want?
  • What metrics would be valuable?
  • What visualizations would tell the story?
  • What tools are appropriate?
  • What are the data quality considerations?
In the walkthrough:
You: "What should we analyze? Can you make suggestions?"

Agent: [Proposes adoption metrics, productivity signals, workflow
patterns, with specific examples and reasoning]

You: "Great. Let's create this spec, with these metrics in mind"
This establishes shared understanding before writing anything formal.
Structure implementation in logical phases with dependencies.

Example breakdown:
Phase 1: Data Loading (foundation)
  ├─ Load CSV files
  └─ Handle deduplication

Phase 2: Metrics (builds on data loading)
  ├─ Compute time metrics
  └─ Aggregate commit-level to PR-level

Phase 3: Visualizations (builds on metrics)
  ├─ Impact dashboard
  ├─ Adoption timeline
  └─ Workflow patterns

Phase 4: Notebooks (brings it all together)
  ├─ Overview notebook
  ├─ Productivity analysis
  └─ Workflow patterns
Benefits:
  • Clear execution order
  • Parallel work opportunities within phases
  • Easy to track progress
  • Natural checkpoints for review
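As a taste of Phase 3, a small matplotlib sketch of the adoption timeline is shown below. The `uses_sudocode` and `created_at` columns are the same assumed schema as earlier, and weekly resampling is just one reasonable choice:

```python
# Hypothetical adoption-timeline sketch; column names are assumed.
import matplotlib.pyplot as plt
import pandas as pd

prs = pd.read_csv("prs.csv", parse_dates=["created_at"])

# Weekly share of PRs that use sudocode
adoption = (
    prs.set_index("created_at")
       .resample("W")["uses_sudocode"]
       .mean()
)

adoption.plot(title="Weekly share of PRs using sudocode")
plt.ylabel("Share of PRs")
plt.tight_layout()
plt.show()
```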

Common Patterns

Pattern: Exploratory Data Analysis to Specification

# 1. Human provides data and general goal
You: "I have this data dump. Let's figure out how to analyze it."

# 2. Agent explores data structure
Agent: [Uses Bash/Read tools to understand data format, schemas, sizes]

# 3. Human and agent discuss analytical approaches
You: "What insights would be valuable?"
Agent: [Proposes metrics, visualizations, technical approaches]
You: [Provides priorities, constraints, domain knowledge]

# 4. Agent creates spec capturing the approach
You: "That makes sense. Create a spec for this."
Agent: [Uses upsert_spec() with full requirements]

# 5. Refine spec to focus on abstractions
You: "Remove the code examples, keep the requirements."
Agent: [Updates spec to remove implementation details]

# 6. Break down into issues with dependencies
You: "Split this into issues."
Agent: [Creates issues, links to spec, establishes dependencies]

# 7. Discover missing requirements during review
You: "What about data deduplication?"
Agent: [Creates new issue, links appropriately]

# 8. Implement in phases
Dispatch agents to work through issues in dependency order

Pattern: Iterative Requirement Discovery

Requirements emerge through conversation, not just upfront specification:
Initial Spec:
  - Load data from CSV files
  - Compute productivity metrics
  - Create visualizations

During Planning Discussion:
  You: "Wait, data collection runs overlap. Do we deduplicate?"
  → New issue created: Deduplication logic

During Implementation:
  Agent: "Found edge case: PRs that change state between runs"
  → Issue updated: Handle state transitions in deduplication

During Testing:
  Agent: "Performance issue with large datasets"
  → New issue created: Optimize data loading for 100K+ records
This is normal and healthy - requirements surface when they’re discoverable.

Troubleshooting

Symptom: Spec contains code snippets, function implementations, detailed pseudocode.

Solution: Ask the agent to refine the spec:
You: "This spec has too much implementation detail. Can you refactor it
to focus on requirements and abstractions? Remove code examples but keep
the high-level architecture and what we need to build."

Agent: [Uses upsert_spec() to update, removing code but keeping
requirements, data model, metrics definitions, visualization
requirements]
What to remove:
  • Specific code implementations
  • Detailed function signatures
  • Step-by-step algorithms
What to keep:
  • Goals and success criteria
  • Data model and schemas
  • Metrics to compute
  • Visualization requirements
  • High-level architecture
Symptom: You realize something is missing after breaking down the spec.

Solution: This is normal! Create a new issue and link it appropriately:
You: "We forgot about data deduplication. Can we add an issue for this?"

Agent: [Uses upsert_issue() to create new issue, links to spec via
'implements', links to related issues via 'related' or 'depends-on']
From the walkthrough:
  • Initial plan: data loading, metrics, visualizations, notebooks
  • During review: “What about deduplication?”
  • New issue created: i-7gkc for deduplication logic
  • Linked to data loading issue as related work
Benefits of this pattern:
  • Requirements emerge when discoverable
  • Context preserved (why it was added)
  • Implementation can proceed with complete requirements
  • No need to restart the whole planning process