erp/plans/squishy-pondering-aho-agent-a23c7497aadc6da41.md

# Code Review: ERP Platform Base Design Specification

**Document reviewed:** `G:\erp\docs\superpowers\specs\2026-04-10-erp-platform-base-design.md`
**Verdict:** ISSUES FOUND

---

## What Is Done Well

The spec demonstrates several strengths before I get into the issues:

1. **Clear architectural vision.** The modular monolith with progressive extraction is a sound strategy for a product targeting small-to-large enterprises. It avoids premature microservice complexity while preserving the extraction path.

2. **Crate structure is well-considered.** Separating `erp-core` (traits, types, events) from `erp-common` (utilities, macros) and keeping each business module as its own crate is the right granularity for Rust workspace management and future extraction.

3. **RBAC + ABAC hybrid permission model.** This is the correct approach for a multi-tenant ERP. Pure RBAC breaks down at scale; ABAC alone is hard to administer. The hybrid is pragmatic.

4. **UUID v7 as primary keys.** Time-sortable UUIDs are the right choice for distributed systems and work well with PostgreSQL's B-tree indexes.

5. **Soft delete via `deleted_at`.** Appropriate for an ERP where audit trails are non-negotiable.

6. **Database design principles.** The universal columns (`tenant_id`, `created_at`, `updated_at`, `created_by`, `updated_by`) are a solid foundation.

---

## CRITICAL Issues (Must Fix Before Implementation)

### C1. No Error Handling Strategy Defined

The spec lists `erp-core` as containing "error handling" but provides zero detail on the error model. For a Rust codebase, this is an immediate blocker because error types pervade every crate boundary.

**What is missing:**
- A unified error type hierarchy (domain errors vs infrastructure errors vs API errors)
- How module-level errors map to HTTP responses (status codes, error envelopes)
- Whether a single `erp-core::Error` enum or per-module error types with `From` conversions
- Error chain propagation strategy (anyhow vs thiserror vs custom)
- How validation errors are represented and returned to the client

**Recommendation:** Add a section defining the error architecture. The standard Rust approach for this kind of project is `thiserror` for library crates with per-module error enums, converting to a unified API error type in `erp-server`. Define the HTTP error response envelope (status, code, message, details).

### C2. Event Bus Specification Is Absent

The spec states "Modules communicate via event bus, no direct coupling" as a core design principle but provides zero detail on:
- The event bus implementation (in-process? tokio channels? trait-object dispatch?)
- Event schema and versioning strategy
- Event delivery guarantees (at-least-once? exactly-once?)
- Event persistence (are events stored for audit/replay?)
- Dead letter handling
- How cross-module event subscriptions are registered

This is not an implementation detail -- it is a load-bearing architectural decision. The wrong choice here is extremely expensive to change later.

**Recommendation:** Add an "Event System" subsection under Architecture that specifies the event bus mechanism, event schema, delivery semantics, and at minimum the key events exchanged between the four core modules.

### C3. API Versioning and Contract Strategy Is Undefined

The APIs all use `/api/v1/` prefixes, but there is no statement on:
- How breaking changes are managed across versions
- Whether API versioning is URL-path-based (current) or header-based
- How the Tauri client and backend version compatibility is maintained
- Whether API contracts are formally defined (OpenAPI generation from code vs. design-first)

**Recommendation:** Add an API governance section. At minimum: the versioning strategy, the OpenAPI generation approach (utoipa is listed, but is it code-first or spec-first?), and the client-server compatibility contract.

### C4. No Data Migration Strategy for Multi-Tenant Schema Changes

The spec mentions SeaORM Migrations but does not address the multi-tenant reality:
- How are schema migrations applied across tenants?
- For the "independent schema per tenant" option, how is per-tenant migration managed?
- How are tenant-specific data migrations (dictionary changes, menu changes) handled separately from schema migrations?
- What happens when a migration fails for one tenant but succeeds for others?

**Recommendation:** Add a migration strategy section that covers schema migration in the shared-database model, per-tenant schema model, and data seeding for new tenants.

### C5. Workflow Engine BPMN 2.0 Compatibility Is Underspecified for Implementation

The workflow engine is listed as Phase 4 (3-4 weeks) and claims "BPMN 2.0 compatible," but implementing a BPMN engine from scratch in 3-4 weeks is not realistic. BPMN 2.0 is a massive specification (~500 pages). The spec does not specify:
- Which BPMN elements are in scope for Phase 4 vs. later phases
- How BPMN XML is parsed, stored, and rendered
- The expression language for conditions (mentions "EL expressions" but does not specify the engine)
- How subprocesses and call activities interact with multi-tenancy

**Recommendation:** Define a BPMN subset for Phase 4. A realistic Phase 4 scope would be: Start/End nodes, UserTask, ServiceTask, ExclusiveGateway, ParallelGateway, sequence flows with simple conditions. SubProcesses, InclusiveGateways, timers, and signals should be deferred.

---

## IMPORTANT Issues (Should Fix Before Implementation)

### I1. No Deployment Architecture or Operations Section

For a commercial SaaS product, the spec lacks:
- Production deployment topology (containers, orchestration, load balancing)
- How Tauri desktop clients connect to the backend (direct? reverse proxy? CDN?)
- Database connection pooling strategy
- Redis clustering strategy for production
- Backup and disaster recovery
- Monitoring, alerting, and observability stack
- Rate limiting strategy (mentioned in security checklist but not in the spec)

**Recommendation:** Add an "Operations" or "Infrastructure" section. Even if production deployment is future work, the development Docker Compose topology should be defined, and the backend should be designed to support horizontal scaling from day one (stateless server, sticky-session considerations for WebSocket, etc.).

### I2. Authentication Token Lifecycle Is Incomplete

The spec mentions JWT and refresh tokens but does not address:
- Token format and claims structure
- Access token TTL vs. refresh token TTL
- Refresh token rotation strategy
- Token revocation mechanism (critical for logout and security incidents)
- How tokens are transmitted (authorization header? cookies? both?)
- Tauri-specific secure token storage on the client side
- Concurrent session management (can a user be logged in on multiple devices?)
- Token handling across tenant boundaries (if a user belongs to multiple tenants)

**Recommendation:** Add a token lifecycle section covering issuance, rotation, revocation, storage, and multi-tenant user scenarios.

### I3. No Concurrency and Transaction Strategy

An ERP handles concurrent edits to shared data. The spec does not address:
- Optimistic vs. pessimistic locking strategy for business entities
- How SeaORM transactions are managed across module boundaries
- Idempotency for API operations (especially workflow actions)
- Concurrent workflow task handling (two people approving the same task simultaneously)

**Recommendation:** Add a concurrency section. For ERP, optimistic locking with version columns is the typical baseline. Define how cross-module transactions are handled (saga pattern? eventual consistency?).

### I4. Audit Logging Is Mentioned But Not Specified

The core shared layer includes "audit" but there is no specification for:
- What events are audited (all mutations? login/logout? data access?)
- Audit log storage (same database? separate? append-only?)
- Audit log retention policy
- Audit log query API
- How audit logs relate to multi-tenancy

For a commercial ERP, audit logging is not optional. It is a compliance requirement.

**Recommendation:** Add an audit specification section. Define the audit event schema, what triggers audit events, storage, and retention.

### I5. No Module Registration and Plugin Interface Defined

One of the four design principles is "Plugin extensibility: Industry modules register through standard interfaces, support dynamic enable/disable." But the spec provides:
- No trait definitions for module registration
- No lifecycle hooks (init, start, stop, health-check)
- No mechanism for modules to register their routes, menu items, or event handlers
- No tenant-level module enable/disable data model

This is the mechanism that makes the "platform base + industry plugins" architecture work. Without it, the industry modules cannot be built.

**Recommendation:** Add a "Module System" section that defines the module trait interface, the registration mechanism, and the per-tenant module configuration model.

### I6. Numbering Rules Are Too Simplistic

The spec lists "Numbering rules: Document number generation (e.g., PO-2024-001)" as a Config capability but does not address:
- Concurrency-safe sequence generation (multiple users creating POs simultaneously)
- Sequence reset rules (yearly? monthly? daily? never?)
- Sequence gaps (are gaps allowed? this has legal implications in some jurisdictions)
- Multi-tenant sequence isolation
- How different document types get different sequences

**Recommendation:** Expand the numbering rules section with the sequence generation strategy. PostgreSQL sequences or a dedicated sequence table with row-level locking are the typical approaches.

### I7. Desktop-Backend Communication Protocol Undefined

The spec shows REST APIs and a WebSocket endpoint but does not specify:
- How the Tauri app authenticates WebSocket connections
- WebSocket message format and protocol
- Reconnection and offline queueing strategy
- How Tauri IPC interacts with HTTP calls (does the frontend call the backend directly, or go through Tauri commands?)
- File upload/download handling (common in ERP: attachments, exports)
- How the desktop app handles backend version mismatches

**Recommendation:** Add a "Client-Server Communication" section covering HTTP API patterns, WebSocket protocol, Tauri IPC strategy, and offline behavior.

---

## SUGGESTIONS (Nice to Have)

### S1. Consider Adding a `erp-infra` Crate

The current structure has `erp-core` for shared types/traits and `erp-common` for utilities. Consider a third crate, `erp-infra`, for infrastructure adapters (database connection pool, Redis client, event bus implementation). This keeps `erp-core` purely abstract and makes testing easier since modules depend on `erp-core` traits, not `erp-infra` implementations.

### S2. Define the Configuration File Format

`config-rs` is listed but no application configuration schema is defined. What does `config.toml` or `config.yaml` look like? Database URL, Redis URL, JWT secret, server port, log level, etc. should be enumerated.

### S3. Consider API Response Envelope Standardization

Define a standard API response format (as mentioned in the project's own patterns rules). Example:
```json
{
  "success": true,
  "data": { ... },
  "error": null,
  "meta": { "total": 100, "page": 1, "limit": 20 }
}
```

### S4. Add Database Index Strategy

While the spec mentions "Indexes on `tenant_id` + business keys," a more detailed index strategy would help. Consider composite indexes for common query patterns, partial indexes for soft-deleted rows, and index-only scans for high-frequency queries.

### S5. Consider Health Check and Readiness Endpoints

For containerized deployment and monitoring, define `/health` and `/ready` endpoints. These are essential for Docker orchestration and load balancer configuration.

### S6. The `apps/admin` Directory Implies a Web Admin Panel

The spec mentions it as "Optional web admin panel" but does not define its relationship to the Tauri desktop client. If both exist, is it a separate SPA build? Does it share the `packages/ui-components` library? Clarify whether this is in scope or should be removed to avoid confusion.

### S7. Timeline Estimates May Be Tight

The total roadmap is 10-15 weeks. For a solo developer or small team building a BPMN workflow engine from scratch in Rust, Phase 4 (3-4 weeks) is extremely aggressive. Consider:
- Phase 4 should be 5-8 weeks for a usable workflow engine
- Phase 1 should include CI/CD pipeline as a deliverable, which can take 3-5 days alone
- Total realistic estimate: 14-22 weeks for a solid base

---

## Consistency Check

| Check | Result |
|-------|--------|
| Module API prefixes consistent | PASS - all use `/api/v1/{module}/` |
| Data model matches API surface | PARTIAL - no API for Organization CRUD, Policy CRUD, or Position CRUD in Auth |
| Crate structure matches module list | PASS |
| Tech stack choices are compatible | PASS - Axum + Tokio + SeaORM + redis-rs are a well-tested stack |
| Multi-tenant strategy is applied consistently | PARTIAL - some APIs use `/:tenant_id` in path (Config menus) while Auth relies on middleware injection. Should be consistent. |
| UI features match backend capabilities | PASS |
| Roadmap phases align with module dependencies | PASS - Auth first, then Config, then Workflow (which depends on both) |

---

## Security Gap Analysis

| Area | Status | Notes |
|------|--------|-------|
| Password hashing (Argon2) | Covered | Good |
| JWT authentication | Partially covered | Missing token revocation, rotation |
| SQL injection | Implicitly covered | SeaORM parameterized queries |
| XSS | Not addressed | Should specify CSP policy for Tauri |
| CSRF | Not applicable | Tauri desktop, not browser-based |
| Rate limiting | Not specified | Mentioned in verification plan but not in design |
| Input validation | Not specified | No validation strategy defined |
| CORS | Not specified | Needed if web admin panel is built |
| Secret management | Not specified | Where do JWT secrets, DB credentials, Redis passwords live? |
| Audit trail | Mentioned but not specified | See I4 above |
| Data encryption at rest | Not addressed | Relevant for private deployment customers |
| Backup/restore security | Not addressed | |
| Tenant data isolation verification | Covered | In verification plan |

---

## Summary

The specification provides a strong architectural vision and reasonable technology choices. However, it has five critical gaps that would block or severely impair implementation:

1. **C1** - No error handling strategy
2. **C2** - No event bus specification (despite being a core design principle)
3. **C3** - No API versioning and contract governance
4. **C4** - No multi-tenant migration strategy
5. **C5** - Workflow engine scope is unrealistic for the stated timeline

And seven important issues that should be addressed before coding begins:

1. **I1** - No deployment/operations architecture
2. **I2** - Incomplete authentication token lifecycle
3. **I3** - No concurrency and transaction strategy
4. **I4** - Audit logging mentioned but not specified
5. **I5** - No module registration/plugin interface defined
6. **I6** - Numbering rules too simplistic for ERP use
7. **I7** - Desktop-backend communication protocol undefined

**Recommendation:** Address all five CRITICAL issues and the IMPORTANT issues before beginning Phase 1. The CRITICAL issues represent architectural foundations that are extremely expensive to retrofit. The IMPORTANT issues represent functionality that will be needed within the first three phases of development.

The spec should move from `Status: Draft` to `Status: Under Review` until these issues are resolved.