
How to Collaboratively Build a Data Analysis Notebook

This tutorial shows how to use sudocode when building data analysis scripts. You’ll see how to break the work down step by step while capturing requirements iteratively - exploring your data, figuring out what to analyze, and discovering missing requirements as you plan the implementation.

Traditional approach:
  1. Receive data dump
  2. Write exploratory scripts alone
  3. Realize you need different metrics
  4. Rewrite analysis
  5. Discover data quality issues late
  6. Context scattered across notebooks
Using sudocode for data analysis:
  1. Explore existing data with AI assistance
  2. Discuss what metrics and visualizations would be valuable
  3. Capture requirements in a spec
  4. Break down into concrete implementation issues
  5. Discover missing requirements during planning (e.g., deduplication)
  6. Add new issues as needs emerge
  7. Implement with full context preserved in version control

Workflow


1. Explore Data Together

Use AI assistance to understand the data structure and format

Start by asking your agent to explore existing data files and understand the current state:
You: "I want to explore how we'll synthesize the data we've collected about sudocode usage. reference our current specs around
the data collection scripts. In our case, we have some data collection
you can see a test for under the data dir. take a look at this dir to
see how we format this data."
The agent will:
  • Read existing related specs for context
  • Explore the data directory structure
  • Examine CSV headers and sample rows
  • Understand data relationships and granularity
This establishes shared understanding before diving into requirements.
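If you want to poke at the files yourself alongside the agent, a small pandas sketch like the one below prints each CSV's columns and a few sample rows. The `data/` path and file layout here are assumptions for illustration, not the actual collection format:

```python
# Hypothetical exploration sketch; the data/ location and layout are assumptions.
from pathlib import Path
import pandas as pd

data_dir = Path("data")  # assumed root of the collection runs
for csv_path in sorted(data_dir.glob("**/*.csv")):
    sample = pd.read_csv(csv_path, nrows=5)  # headers plus a few sample rows
    print(csv_path)
    print(list(sample.columns))
    print(sample.head(3).to_string())
```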

2. Discuss Analytical Approaches

Have a conversation about what insights you want and how to get them

Before writing a spec, discuss what analysis would be valuable:
You: "Let's first talk about what we should analyze. we want to get
usage data about sudocode from various repositories and PRs, to better
understand how people are using sudocode. you've seen the data, can you
make any suggestions of how we should set up this data to be queryable
and what we should demonstrate?"

Agent: "Looking at what you're collecting, here are the key insights I
think would be valuable:

High-Value Analytics:

1. Adoption & Engagement Metrics
   - Sudocode penetration rate: % of repos with sudocode enabled
   - Active usage rate: Of repos with sudocode, what % use it in PRs?

2. Productivity Signals
   - Time to review: Do PRs with sudocode get first review faster?
   - Time to merge: Do sudocode PRs merge faster?
   - PR size correlation: Are sudocode PRs larger/smaller?

3. Workflow Patterns
   - Specs vs Issues ratio: Are teams writing specs before implementation?
   - Modification patterns: Are specs heavily modified?

I'd recommend using Pandas + Matplotlib/Seaborn because..."
This conversation establishes:
  • What metrics matter
  • What questions to answer
  • Which tools to use
  • What visualizations to create
The agent brings suggestions, you provide domain knowledge and priorities.
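To make the discussion concrete, here is a rough pandas sketch of how two of the proposed metrics could be computed. The file names and columns (`sudocode_enabled`, `uses_sudocode`, `created_at`, `first_review_at`) are assumptions for illustration; the real schema comes from your collection scripts:

```python
# Illustrative sketch only; file and column names are assumed, not the real schema.
import pandas as pd

repos = pd.read_csv("repos.csv")
prs = pd.read_csv("prs.csv", parse_dates=["created_at", "first_review_at"])

# Adoption: share of repos with sudocode enabled
penetration_rate = repos["sudocode_enabled"].mean()

# Productivity signal: median hours to first review, split by sudocode usage
prs["hours_to_review"] = (
    prs["first_review_at"] - prs["created_at"]
).dt.total_seconds() / 3600
time_to_review = prs.groupby("uses_sudocode")["hours_to_review"].median()

print(f"Penetration rate: {penetration_rate:.1%}")
print(time_to_review)
```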

3. Co-create the Specification

Work together to capture requirements in a spec

Once you’ve discussed the approach, create the spec:
You: "great. let's create this spec, let's focus on adoption and workflow patterns."

Agent: [Uses upsert_spec() MCP tool to create comprehensive specification
including goals, data model, metrics, visualizations, and architecture]
The initial spec will likely include:
  • Goals and success metrics
  • Data sources and schema
  • Key analytics and visualizations
  • Technical architecture
  • Implementation phases
Review the spec in the web UI at localhost:3000.

4. Break Down into Issues

Split the spec into actionable implementation tasks
You: "great. let's split this spec into issues"

Agent: [Creates multiple issues following the spec's phases:
- i-3uzc: Build data loading infrastructure
- i-14md: Build metrics computation engine
- i-46w8: Build visualization library
- i-4dmu: Create interactive Jupyter notebooks
- i-6dek: Add query helper utilities (optional)
]
The agent will:
  • Create issues based on implementation phases
  • Link all issues to the spec via the implements relationship
  • Establish dependency chains using depends-on links
  • Set appropriate priorities
Review the issues and dependency graph in the web UI.
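To see why the dependency links matter, the toy sketch below orders the example issues with a topological sort. The specific depends-on edges are assumptions inferred from the phases; in practice sudocode tracks these links for you, so this is only to illustrate the ordering concept:

```python
# Toy illustration of dependency ordering; the depends-on edges are assumed.
from graphlib import TopologicalSorter

depends_on = {
    "i-3uzc": set(),               # data loading has no prerequisites
    "i-14md": {"i-3uzc"},          # metrics depend on data loading
    "i-46w8": {"i-14md"},          # visualizations depend on metrics
    "i-4dmu": {"i-46w8"},          # notebooks depend on visualizations
}
print(list(TopologicalSorter(depends_on).static_order()))
# ['i-3uzc', 'i-14md', 'i-46w8', 'i-4dmu']
```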

5. Implement Issues

Execute the issues in dependency order

Now dispatch agents to implement the work:
  • Start with foundation issues (data loading)
  • Progress through metrics and visualizations
  • End with notebooks and documentation
As agents work, they may provide feedback on the spec or discover additional edge cases that need handling.

6. Discover Missing Requirements

Identify gaps through conversation - add issues iteratively

As you discuss and implement issues, you may realize something is missing:
You: "question about the analytics. so in our case, someone will run
these every few weeks, but there may be overlap between the data dirs
that get generated. is there any deduplication that happens in the
analysis? if not, can we add an issue for this?"

Agent: "Great catch! You're absolutely right - if you run data
collection every few weeks with overlapping lookback periods, you'll
have duplicate PRs, commits, and repos across different run directories.

Without deduplication, metrics will be inflated and trends will be
inaccurate. Let me create a new issue for this:"

Agent: [Uses upsert_issue() to create deduplication issue:
- i-7gkc: Add deduplication logic for multi-run aggregation
- Links to spec and related data loading issue
]
This is key: Requirements emerge through discussion. Don’t try to think of everything upfront - let the conversation surface gaps.
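For a sense of what the new deduplication issue might eventually involve, here is a minimal pandas sketch. The `run_*/prs.csv` layout and the `pr_id`/`collected_at` columns are assumptions, and keeping the latest snapshot per PR is just one possible strategy:

```python
# Hypothetical deduplication sketch; directory layout and columns are assumed.
from pathlib import Path
import pandas as pd

run_files = sorted(Path("data").glob("run_*/prs.csv"))
all_prs = pd.concat(
    [pd.read_csv(p, parse_dates=["collected_at"]) for p in run_files],
    ignore_index=True,
)

# Keep the most recently collected record per PR so later runs override
# earlier snapshots of the same PR
deduped = (
    all_prs.sort_values("collected_at")
           .drop_duplicates(subset="pr_id", keep="last")
)
```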

Best Practices

Don’t jump straight to writing specs. Explore and discuss first.

Approach:
You: "Let's look at this data together and figure out what to analyze."

Agent: [Explores data, suggests metrics and approaches]

You: [Provide domain knowledge, priorities, constraints]

You: "That approach makes sense. Let's create a spec for it."
Benefits:
  • Agent understands context before formalizing
  • You get suggestions informed by data exploration
  • Requirements emerge naturally from discussion
  • Shared understanding before committing to approach
Specs should define what you want, not how to build it.

Wrong approach:
# Bad spec

## Implementation

```python
def load_data(path):
    df = pd.read_csv(path)
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    return df
```
Right approach:
# Good spec

## Requirements

Data loading module should:
- Load CSV files from collection run directories
- Automatically parse datetime fields
- Support single-run and multi-run aggregation
- Handle missing data gracefully
Benefits:
  • Specs stay relevant longer
  • Implementation can adapt to new patterns
  • Focus remains on requirements
  • Easier to review and understand
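For context, code like this belongs in the implementation rather than the spec. A minimal sketch of a loader meeting the requirements above might look as follows, with the directory layout, file name, and datetime columns assumed for illustration:

```python
# Minimal loader sketch; directory layout, file name, and columns are assumptions.
from pathlib import Path
import pandas as pd

def load_runs(data_dir="data", filename="prs.csv", datetime_cols=("created_at",)):
    """Load one CSV per collection run and concatenate them into a single frame."""
    frames = []
    for path in sorted(Path(data_dir).glob(f"*/{filename}")):
        df = pd.read_csv(path)
        for col in datetime_cols:
            if col in df.columns:
                # Tolerate missing or malformed timestamps instead of failing
                df[col] = pd.to_datetime(df[col], errors="coerce")
        frames.append(df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```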
Don’t try to specify everything upfront. Discover gaps iteratively.

Example from walkthrough:
Initial plan: Load data, compute metrics, visualize

During discussion: "Wait, what about deduplication?"

New issue created: Deduplication logic for multi-run aggregation
This is normal and expected:
  • Requirements surface through discussion
  • Edge cases emerge when reviewing the plan
  • Implementation reveals additional needs
  • Specs evolve based on learnings
Benefits:
  • More complete requirements
  • Realistic implementation plans
  • No premature over-specification
  • Context for why requirements were added
Have a conversation about what to analyze before deciding how.

Key questions to discuss:
  • What insights do we want?
  • What metrics would be valuable?
  • What visualizations would tell the story?
  • What tools are appropriate?
  • What are the data quality considerations?
In the walkthrough:
You: "What should we analyze? Can you make suggestions?"

Agent: [Proposes adoption metrics, productivity signals, workflow
patterns, with specific examples and reasoning]

You: "Great. Let's create this spec, with these metrics in mind"
This establishes shared understanding before writing anything formal.
Structure implementation in logical phases with dependencies.

Example breakdown:
Phase 1: Data Loading (foundation)
  ├─ Load CSV files
  └─ Handle deduplication

Phase 2: Metrics (builds on data loading)
  ├─ Compute time metrics
  └─ Aggregate commit-level to PR-level

Phase 3: Visualizations (builds on metrics)
  ├─ Impact dashboard
  ├─ Adoption timeline
  └─ Workflow patterns

Phase 4: Notebooks (brings it all together)
  ├─ Overview notebook
  ├─ Productivity analysis
  └─ Workflow patterns
Benefits:
  • Clear execution order
  • Parallel work opportunities within phases
  • Easy to track progress
  • Natural checkpoints for review
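As a taste of Phase 3, a small matplotlib sketch of the adoption timeline is shown below. The `uses_sudocode` and `created_at` columns are the same assumed schema as earlier, and weekly resampling is just one reasonable choice:

```python
# Hypothetical adoption-timeline sketch; column names are assumed.
import matplotlib.pyplot as plt
import pandas as pd

prs = pd.read_csv("prs.csv", parse_dates=["created_at"])

# Weekly share of PRs that use sudocode
adoption = (
    prs.set_index("created_at")
       .resample("W")["uses_sudocode"]
       .mean()
)

adoption.plot(title="Weekly share of PRs using sudocode")
plt.ylabel("Share of PRs")
plt.tight_layout()
plt.show()
```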

Common Patterns

Pattern: Exploratory Data Analysis to Specification

# 1. Human provides data and general goal
You: "I have this data dump. Let's figure out how to analyze it."

# 2. Agent explores data structure
Agent: [Uses Bash/Read tools to understand data format, schemas, sizes]

# 3. Human and agent discuss analytical approaches
You: "What insights would be valuable?"
Agent: [Proposes metrics, visualizations, technical approaches]
You: [Provides priorities, constraints, domain knowledge]

# 4. Agent creates spec capturing the approach
You: "That makes sense. Create a spec for this."
Agent: [Uses upsert_spec() with full requirements]

# 5. Refine spec to focus on abstractions
You: "Remove the code examples, keep the requirements."
Agent: [Updates spec to remove implementation details]

# 6. Break down into issues with dependencies
You: "Split this into issues."
Agent: [Creates issues, links to spec, establishes dependencies]

# 7. Discover missing requirements during review
You: "What about data deduplication?"
Agent: [Creates new issue, links appropriately]

# 8. Implement in phases
Dispatch agents to work through issues in dependency order

Pattern: Iterative Requirement Discovery

Requirements emerge through conversation, not just upfront specification:
Initial Spec:
  - Load data from CSV files
  - Compute productivity metrics
  - Create visualizations

During Planning Discussion:
  You: "Wait, data collection runs overlap. Do we deduplicate?"
  → New issue created: Deduplication logic

During Implementation:
  Agent: "Found edge case: PRs that change state between runs"
  → Issue updated: Handle state transitions in deduplication

During Testing:
  Agent: "Performance issue with large datasets"
  → New issue created: Optimize data loading for 100K+ records
This is normal and healthy - requirements surface when they’re discoverable.

Troubleshooting

Symptom: Spec contains code snippets, function implementations, detailed pseudocode.

Solution: Ask the agent to refine the spec:
You: "This spec has too much implementation detail. Can you refactor it
to focus on requirements and abstractions? Remove code examples but keep
the high-level architecture and what we need to build."

Agent: [Uses upsert_spec() to update, removing code but keeping
requirements, data model, metrics definitions, visualization
requirements]
What to remove:
  • Specific code implementations
  • Detailed function signatures
  • Step-by-step algorithms
What to keep:
  • Goals and success criteria
  • Data model and schemas
  • Metrics to compute
  • Visualization requirements
  • High-level architecture
Symptom: You realize something is missing after breaking down the spec.

Solution: This is normal! Create a new issue and link it appropriately:
You: "We forgot about data deduplication. Can we add an issue for this?"

Agent: [Uses upsert_issue() to create new issue, links to spec via
'implements', links to related issues via 'related' or 'depends-on']
From the walkthrough:
  • Initial plan: data loading, metrics, visualizations, notebooks
  • During review: “What about deduplication?”
  • New issue created: i-7gkc for deduplication logic
  • Linked to data loading issue as related work
Benefits of this pattern:
  • Requirements emerge when discoverable
  • Context preserved (why it was added)
  • Implementation can proceed with complete requirements
  • No need to restart the whole planning process