Artificial Engineering Intelligence: Building Self-Healing Production Systems
Current AI developer tools treat development and operations as largely separate concerns, with code generation handled independently of code review and production monitoring. This separation breaks down in production environments, where the true behavior of systems emerges only through runtime execution, failures, and evolving operational context. We propose Artificial Engineering Intelligence (AEI) as a framework for AI systems that continuously learn from production behavior, instrument codebases for observability, and progressively evolve toward autonomous self-healing. We argue that engineering intelligence requires closing the loop between code generation, runtime behavior, and system reliability—moving beyond static analysis toward adaptive systems that improve through production feedback loops. We describe how this can be achieved through architectures that accumulate searchable knowledge bases of historical incidents, which teach the code review agent to prevent similar incidents in the future.
1. Introduction
The application of LLMs to software engineering has focused primarily on individual stages of the engineering lifecycle: code generation, security scanning, and code review [1, 2]. While these tools improve developer productivity, they operate on incomplete information. The true behavior of software systems - their performance characteristics, failure modes, integration issues, and emergent properties - manifests only at runtime in production environments.
This disconnect between development-time AI and operational reality creates a fundamental limitation: AI tools optimize for code that compiles and passes tests, not code that operates reliably under real-world conditions. When production incidents occur, human engineers must manually diagnose issues, identify root causes, implement fixes, and update monitoring—a reactive cycle that LLM-based coding tools cannot currently improve because they lack access to operational feedback.
This paper introduces Artificial Engineering Intelligence (AEI), a framework for AI systems that:
- Learn continuously from production signals: incidents, alerts, performance degradations, and system behavior
- Instrument codebases proactively to ensure observability aligned with system requirements
- Close the feedback loop between operational behavior and code generation
- Progress toward autonomy through iterative improvement cycles
We argue that engineering intelligence differs fundamentally from code generation intelligence: it requires grounding in runtime behavior, causal reasoning about system failures, and the ability to modify both code and its instrumentation based on operational evidence. This grounding, in turn, produces substantially stronger code review agents.
2. The Limits of Static Code Intelligence
2.1. The Development-Production Gap
Current AI coding tools operate almost exclusively in the development phase. They generate functions, review pull requests, suggest optimizations, and identify potential security vulnerabilities—all based on static code analysis [2, 3]. However, the majority of critical software failures are not discoverable through static analysis alone:
- Emergent failures: Race conditions, distributed system inconsistencies, and integration issues that appear only under production load [4]
- Performance degradations: Latency regressions, memory leaks, and resource exhaustion patterns invisible in development [5]
- Cascading failures: Dependencies between services that create unexpected failure modes [6]
- Environmental drift: Configuration changes, infrastructure updates, and traffic patterns that alter system behavior [7]
These failure modes cannot be addressed by improving code generation models alone. A model trained on GitHub commits has no signal about which code patterns lead to production incidents, which instrumentation prevents outages, or which changes correlate with improved reliability.
2.2. The Observability Problem
Even when production issues occur, current systems lack the instrumentation needed for AI-driven diagnosis and remediation. Engineering teams manually add logging, metrics, and tracing after incidents reveal gaps in observability—a reactive approach that ensures the next novel failure mode will be equally opaque [8, 9].
AI coding assistants compound this problem: they generate code without considering what signals will be needed to debug it in production. The result is systems with high apparent code quality but poor operational visibility—optimized for development velocity at the cost of operational reliability.
2.3. The Missing Feedback Loop
The fundamental limitation is architectural: there is no learning loop connecting production behavior back to code generation and review. When an incident occurs:
- Human engineers analyze logs and metrics manually
- Human engineers identify the root cause
- Human engineers implement a fix
- Human engineers add instrumentation to prevent recurrence
- The AI coding tool has learned nothing and will generate similar problematic patterns in the future
Even when AI assists with incident mitigation, the coding and code review agents still fail to learn from the process. This is analogous to training a model without a loss function: the system has no mechanism to improve based on the outcomes it produces.
3. Artificial Engineering Intelligence
3.1. Definition
Definition (Artificial Engineering Intelligence). Artificial Engineering Intelligence (AEI) is a class of AI systems that continuously learn from operational behavior to improve code generation, review, and system instrumentation.
An AEI system operates through four interconnected loops:
- Observability Loop: Instruments codebases to emit structured signals (logs, metrics, traces) aligned with system requirements and historical failure modes
- Learning Loop: Analyzes production incidents, alerts, and performance patterns to extract causal relationships between code patterns and operational outcomes
- Generation Loop: Incorporates operational learnings into code generation and review, biasing toward patterns associated with reliability
- Healing Loop: Progressively automates incident response, from suggesting fixes to autonomous remediation, based on confidence in learned causal models
Engineering intelligence emerges not from static code analysis but from the dynamic interplay between code, instrumentation, runtime behavior, and adaptive response.
3.2. Core Principles
Production as Knowledge Source: AEI systems accumulate operational evidence rather than train predictive models. Production incidents become searchable precedent: when reviewing new code, the system retrieves similar historical failures rather than predicting failure probability. Each incident expands the knowledge base—not the parameters of a trained model, but the corpus of retrievable operational experience. Every incident, alert, performance regression, and successful deployment provides signal about what works and what fails in practice.
Instrumentation-First Architecture: Rather than treating observability as an afterthought, AEI systems proactively instrument code during generation. This creates a self-reinforcing cycle: better instrumentation enables better learning, which enables better code generation, which requires better instrumentation.
Progressive Autonomy: AEI systems begin by augmenting human decision-making (suggesting fixes, proposing instrumentation) and progressively automate based on demonstrated reliability. The system earns autonomy through validated performance rather than assuming it.
4. Architectural Components
The technical realization of AEI principles requires a departure from traditional predictive modeling approaches. Rather than attempting to train models that predict code reliability, we propose a retrieval-augmented architecture that accumulates and searches over historical incident knowledge.
4.1. Incident Knowledge Base
The foundation of AEI is a structured repository that captures production incidents with sufficient context for semantic retrieval. Each incident record contains:
- Root cause context: The code pattern involved (file locations, function signatures, control flow structures), failure mode classification (timeout, resource exhaustion, race condition, null pointer exception), and contributing environmental factors (load patterns, configuration state, dependency versions)
- Detection signals: The alerts that fired, relevant log lines that revealed the issue, metric anomalies that indicated degradation, and distributed traces showing failure propagation paths
- Resolution details: The code changes applied to fix the issue, instrumentation added to prevent recurrence or enable faster diagnosis, configuration updates, and time to resolution from detection to fix deployment
- Recurrence tracking: Whether this represents a novel failure mode or recurring pattern, links to previous related incidents, and count of recurrences
Incidents are indexed for semantic retrieval, enabling the system to surface similar historical failures even when variable names, implementations, or contexts differ.
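As a concrete sketch, the record structure above might be captured as a simple dataclass. The field names here are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class IncidentRecord:
    incident_id: str
    # Root cause context
    code_pattern: str       # e.g. file path, function signature, snippet
    failure_mode: str       # e.g. "timeout", "resource_exhaustion"
    environment: dict       # load patterns, config state, dependency versions
    # Detection signals
    alerts: list = field(default_factory=list)
    log_lines: list = field(default_factory=list)
    metric_anomalies: list = field(default_factory=list)
    # Resolution details
    fix_diff: str = ""
    instrumentation_added: list = field(default_factory=list)
    time_to_resolution_min: float = 0.0
    # Recurrence tracking
    related_incidents: list = field(default_factory=list)
    recurrence_count: int = 0
```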
The knowledge base serves as an externalized memory of organizational reliability knowledge. Unlike human institutional knowledge, which degrades as engineers leave or forget, the incident database grows monotonically and remains searchable. As the database grows from tens to hundreds to thousands of incidents, the system gains increasingly comprehensive coverage of failure modes specific to the organization's architecture, technology stack, and operational environment.
4.2. Retrieval-Augmented Code Review
During code review, the system operates through four stages that transform static code analysis into operationally-grounded review:
- Pattern Extraction: The system parses code changes to identify modifications that could impact reliability—new functions, altered error handling, external dependencies, resource allocations, concurrency primitives, and algorithmic complexity changes.
- Historical Retrieval: The system queries the incident knowledge base using embeddings of the new code patterns. Results are filtered by relevance threshold, language and framework compatibility, and temporal relevance.
Crucially, retrieval is semantic rather than syntactic. A new React component that makes unauthenticated API calls can match historical incidents involving Python microservices with similar authentication bypass vulnerabilities, because the embedding space captures failure mode semantics that transcend implementation details.
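A minimal retrieval sketch, assuming incident embeddings have already been produced by some code-embedding model (the embedding step is out of scope here) and similarity is plain cosine:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_similar(query_vec, index, threshold=0.8, k=5):
    """index: list of (incident_id, embedding) pairs.
    Returns the top-k incidents above the similarity threshold."""
    scored = [(iid, cosine(query_vec, vec)) for iid, vec in index]
    hits = [(iid, s) for iid, s in scored if s >= threshold]
    return sorted(hits, key=lambda t: t[1], reverse=True)[:k]
```

At scale, the linear scan would be replaced by an approximate nearest neighbor index, as Section 5.1 notes; the threshold value is illustrative.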
- Contextual Analysis: A large language model receives both the new code under review and the retrieved historical incidents as context. The LLM prompt instructs the model to:
- Analyze whether the new code exhibits failure modes similar to historical incidents
- Identify specific vulnerability points (e.g., "lacks timeout on external API call, similar to incident #423 which caused cascading failures")
- Generate contextual review comments that cite specific historical precedent
- Assess confidence level based on similarity strength and incident frequency
The LLM acts as a reasoning engine over retrieved incidents, synthesizing historical evidence into actionable review feedback. It does not predict whether code will fail (an intractably hard problem) but rather surfaces relevant precedent: "Code structurally similar to this has caused incidents under these conditions. Here's how those incidents manifested and were resolved."
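Prompt assembly for this stage might look like the following sketch; the instruction wording and incident fields are assumptions, not a prescribed format:

```python
def build_review_prompt(diff, incidents):
    """diff: the code change under review.
    incidents: retrieved incident dicts with id/failure_mode/summary/resolution."""
    precedent = "\n\n".join(
        f"Incident {i['id']} ({i['failure_mode']}): {i['summary']}\n"
        f"Resolution: {i['resolution']}"
        for i in incidents
    )
    return (
        "You are reviewing a code change for reliability risks.\n"
        "Compare it against the historical incidents below.\n"
        "For each risk: cite the incident, name the specific vulnerability "
        "point, and state a confidence level based on similarity strength "
        "and incident frequency.\n\n"
        f"=== Code change ===\n{diff}\n\n"
        f"=== Historical incidents ===\n{precedent}\n"
    )
```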
- Instrumentation Recommendation: For high-confidence matches between new code and historical failure patterns (similarity > 0.85), the system suggests instrumentation derived from what enabled fast diagnosis in past incidents. Recommendations are specific rather than generic:
- For database queries similar to those that caused connection pool exhaustion: add connection acquisition timing, pool utilization metrics, and query timeout logging
- For API calls similar to those that caused cascading failures: add circuit breaker state logging, retry attempt counts, and upstream service latency tracking
- For resource allocations similar to those that caused memory leaks: add allocation size logging and resource cleanup verification
Instrumentation recommendations draw from templates refined through historical analysis of what observability signals proved diagnostic versus what generated noise. The system learns, for instance, that logging every cache hit creates log volume issues without diagnostic value, while logging cache misses above a threshold enables fast identification of cache invalidation bugs.
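A toy version of the template lookup, with illustrative failure-mode keys and suggestion strings (the 0.85 threshold comes from the text above):

```python
# Illustrative templates; a real system would refine these from
# historical analysis of which signals proved diagnostic.
INSTRUMENTATION_TEMPLATES = {
    "connection_pool_exhaustion": [
        "log connection acquisition time",
        "emit pool utilization metric",
        "log query timeouts",
    ],
    "cascading_failure": [
        "log circuit breaker state transitions",
        "count retry attempts",
        "track upstream service latency",
    ],
    "memory_leak": [
        "log allocation sizes above threshold",
        "verify resource cleanup on exit paths",
    ],
}

def recommend_instrumentation(matches, min_similarity=0.85):
    """matches: list of (failure_mode, similarity) pairs from retrieval."""
    recs = []
    for mode, sim in matches:
        if sim >= min_similarity:
            recs.extend(INSTRUMENTATION_TEMPLATES.get(mode, []))
    return recs
```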
4.3. Feedback-Driven Refinement
The system improves through two complementary feedback mechanisms that close the loop between review suggestions and operational outcomes:
Explicit Feedback: Engineers accept or reject each review suggestion through the code review interface. Accepted suggestions (engineer modifies code based on AEI warning) increase the retrieval weight for the associated incident pattern—this incident is confirmed as relevant and should surface more readily in future reviews. Rejected suggestions (engineer dismisses the warning) decrease retrieval weight or add the code pattern to a negative example set, indicating this pattern-incident pairing is a false positive.
This explicit feedback directly tunes the retrieval ranking function. Over time, incident patterns that consistently generate accepted suggestions are prioritized, while patterns that generate rejected suggestions are demoted. The knowledge base becomes not just a collection of incidents but a relevance-weighted corpus optimized for review utility.
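One minimal realization of this weight tuning is a bounded multiplicative update per incident; the update factors and bounds here are illustrative:

```python
def update_weight(weights, incident_id, accepted,
                  up=1.1, down=0.9, floor=0.1, cap=10.0):
    """Adjust an incident's retrieval weight after reviewer feedback.
    Accepted suggestions raise the weight; rejections lower it,
    bounded to keep any single incident from dominating or vanishing."""
    w = weights.get(incident_id, 1.0)
    w = w * up if accepted else w * down
    weights[incident_id] = min(cap, max(floor, w))
    return weights[incident_id]
```

The final ranking score for a candidate could then be its embedding similarity multiplied by this learned weight, so consistently useful precedents surface more readily.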
Implicit Feedback: The system tracks whether reviewed code subsequently causes incidents, enabling three distinct types of learning:
- Retrieval failure: An incident occurs, and retroactive analysis reveals a similar incident existed in the database but wasn't retrieved during review. This indicates the retrieval function failed—similarity threshold too high, embedding quality insufficient, or ranking function suboptimal. The system logs these failures and uses them to tune retrieval parameters.
- Coverage failure: An incident occurs representing a genuinely novel failure mode with no precedent in the database. This is the expected case for the system's first encounter with each failure class. The incident is logged, extending coverage for future reviews.
- Generation failure: A relevant incident was retrieved and provided to the LLM, but the generated review comment was inadequate—too vague, too generic, or failed to highlight the critical similarity. This indicates prompt engineering failure. The system logs the (retrieved incidents, generated comment, engineer rejection) tuple for prompt refinement.
Distinguishing these failure modes is critical for systematic improvement. A system that treats all false negatives identically cannot determine whether it needs more incidents (coverage), better retrieval (ranking), or better reasoning (generation).
This creates a relevance-weighted knowledge base where incident patterns that reliably predict future failures are prioritized, while low-signal patterns that generate false alarms are demoted. The knowledge base evolves from a flat collection toward a structured, weighted graph of failure mode relationships.
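The three-way triage above can be sketched as a small decision function; the input flags come from retroactive incident analysis:

```python
def triage(similar_in_db, was_retrieved):
    """Classify a missed incident into one of the three failure classes.
    similar_in_db: a similar incident existed in the knowledge base.
    was_retrieved: that incident was surfaced during the original review."""
    if not similar_in_db:
        return "coverage_failure"   # novel failure mode: extend the database
    if not was_retrieved:
        return "retrieval_failure"  # precedent existed but was not surfaced
    return "generation_failure"     # precedent surfaced, comment inadequate
```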
4.4. Proactive Instrumentation
Rather than instrumenting reactively after incidents, the system ensures code is instrumented during generation based on standard observability practices:
- Standard Instrumentation Patterns: The system applies consistent instrumentation across the codebase: structured logging at service boundaries (API endpoints, message handlers, background jobs); error paths with full context; external dependency calls with latency and outcome; critical state transitions.
- Traceability: All logs include trace IDs enabling request flow tracking across services. This allows connecting symptoms (user-facing errors) to root causes (backend failures) through distributed traces.
- Noise Reduction: The system learns from historical incidents which log levels and signals proved diagnostic versus which generated noise. High-volume debug logs are avoided in favor of structured events at appropriate severity levels.
- Repository Standardization: Instrumentation patterns are standardized across the repository—same logging format, same metric naming conventions, same trace context propagation. This enables consistent incident investigation practices and prevents gaps where some services lack critical observability.
When reviewing new code, the system flags missing instrumentation based on what would be needed to diagnose failures: "This database query lacks error logging" or "This API call has no timeout or latency tracking." Suggestions are based on repository standards and what proved diagnostic in similar past incidents.
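As an illustration of one such check, here is a toy AST pass flagging Python `requests` calls that lack a timeout; a production checker would cover many more patterns and languages:

```python
import ast

def find_missing_timeouts(source):
    """Flag requests.get/requests.post calls with no timeout kwarg."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "requests"
                and node.func.attr in ("get", "post")
                and not any(kw.arg == "timeout" for kw in node.keywords)):
            findings.append(f"line {node.lineno}: requests.{node.func.attr} "
                            "call has no timeout")
    return findings
```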
5. Retrieval-Augmented Learning
5.1. Knowledge Accumulation Over Parameter Optimization
AEI systems do not learn through gradient descent on model parameters. They do not train a classifier to predict "incident" versus "no incident" from code features, nor do they fit a regression model mapping code patterns to incident probability. Such approaches face insurmountable challenges: the positive class (code that causes incidents) is vanishingly rare compared to the negative class (code that works fine), the feature space is extremely high-dimensional and sparse, and the relationship between code structure and operational behavior is highly context-dependent.
Instead, AEI systems learn through knowledge accumulation and retrieval refinement—mechanisms more analogous to case-based reasoning or expert systems than supervised learning.
Growth Mechanism: Each production incident adds a new entry to the knowledge base. Incidents are embedded upon ingestion, and the embedding vectors are indexed for efficient similarity search using approximate nearest neighbor techniques. As the database grows from tens to thousands of incidents, retrieval becomes more comprehensive—more code patterns have historical precedents, covering an increasingly complete map of the organization's failure mode space.
The growth is not merely quantitative. As similar incidents accumulate, the system can cluster them to identify recurring failure patterns, weight them by frequency to distinguish common pitfalls from rare edge cases, and analyze temporal trends to detect whether certain failure modes are becoming more or less prevalent as systems evolve.
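A greedy clustering sketch over incident embeddings makes the recurring-pattern idea concrete; the similarity threshold is illustrative:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_incidents(embeddings, threshold=0.9):
    """Single-pass greedy clustering: join the first cluster whose seed
    incident is similar enough, else start a new cluster. Cluster size
    then serves as a frequency weight for the failure pattern."""
    clusters = []  # list of lists of incident indices
    for i, vec in enumerate(embeddings):
        for c in clusters:
            if cosine(embeddings[c[0]], vec) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```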
Quality Improvement Through Feedback: Engineer feedback (acceptance/rejection of suggestions) and incident outcomes (whether reviewed code later caused incidents) provide signal for retrieval refinement. However, this signal does not train a predictive model—it adjusts retrieval weights and ranking parameters.
Concretely, accepted suggestions indicate the retrieved incident was relevant and should rank higher for similar future queries. Rejected suggestions indicate the retrieved incident was a false positive and should rank lower. Incidents that recur despite being in the knowledge base indicate retrieval failure—they should have been surfaced but weren't, suggesting the similarity threshold or ranking function needs adjustment.
The learning mechanism is fundamentally retrieval refinement rather than predictive model training. The goal is not to predict incident probability but to surface the most relevant historical precedent given a new code pattern.
5.2. Transfer Learning via Shared Knowledge
The retrieval-augmented architecture enables powerful transfer learning effects that would be difficult to achieve with traditional supervised learning:
Within-Organization Transfer: A single incident database spans all repositories within an organization. Incidents in the authentication service inform reviews in the payment service, incidents in the Python backend inform reviews in the Node.js API layer, incidents in Service A's rate limiter inform reviews when Service B adds rate limiting.
This cross-service transfer is particularly effective for shared libraries, common frameworks, and architectural patterns. If ten services use the same database client library, an incident in one service (e.g., connection pool exhaustion) immediately informs reviews in all other services using that library. The system learns organizational failure modes that transcend individual codebases.
Framework-Specific Knowledge Bases: For popular open-source frameworks (React, Django, Kubernetes, PostgreSQL), curated incident databases can be maintained by framework communities. These contain common failure patterns, anti-patterns, and operational best practices derived from aggregate community experience.
A new Django application starts with immediate access to common Django failure modes: ORM N+1 queries causing latency spikes, middleware ordering issues causing authentication bypasses, migration failures causing downtime. Rather than learning these through painful first-hand experience, teams gain access to accumulated community wisdom.
6. Evaluation and Metrics
Evaluating AEI systems requires moving beyond traditional code quality metrics (cyclomatic complexity, test coverage, static analysis violations) to operational outcomes. The system's value is measured not by whether it identifies code quality issues but by whether it prevents production incidents and enables faster resolution when incidents occur.
6.1. Retrieval Quality
- Precision: Of flagged code patterns, what fraction caused incidents?
- Recall: Of incidents that occurred, what fraction were flagged in review?
- Accuracy@k: Do relevant incidents appear in top-k retrieved results?
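These three metrics are straightforward to compute from review outcomes; a sketch:

```python
def precision(flagged, caused_incident):
    """flagged: set of pattern ids the system flagged in review.
    caused_incident: set of pattern ids that later caused incidents."""
    return len(flagged & caused_incident) / len(flagged) if flagged else 0.0

def recall(flagged, caused_incident):
    return (len(flagged & caused_incident) / len(caused_incident)
            if caused_incident else 0.0)

def accuracy_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant incident appears in the top-k retrieved results."""
    return 1.0 if set(ranked_ids[:k]) & set(relevant_ids) else 0.0
```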
Baselines: Compare against static analysis tools (SonarQube, Semgrep), human code reviews, and random retrieval.
6.2. Operational Impact
- Incident Prevention Rate: Track code modified based on AEI suggestions. Estimate counterfactual incident probability using causal inference or A/B testing.
- Time to Resolution: Decompose into detection, diagnosis, fix implementation, and deployment time. Compare AEI-instrumented versus manual instrumentation.
- Mean Time Between Failures: Track MTBF before/after adoption, controlling for confounders.
- Severity Distribution: Does AEI shift incidents toward lower severity?
6.3. Learning Efficiency
- Sample Efficiency: How many incidents before retrieval quality plateaus?
- Transfer Success: Incident prevention rate in new services using transferred knowledge.
- Adaptation Speed: Time from incident occurrence to database incorporation to detection in future reviews.
6.4. Comparative Evaluation
Compare against static analysis tools (coverage overlap, false positive rates), human reviews (inter-rater reliability, complementarity), and post-deployment monitoring (cost tradeoffs). Measure optimal division of labor in hybrid approaches.
7. Conclusion
We have presented Artificial Engineering Intelligence (AEI), a framework for AI systems that continuously learn from production behavior to improve code generation, review, and instrumentation. Unlike static code analysis tools that operate on code in isolation, AEI systems ground their suggestions in operational reality—historical incidents, runtime failures, and diagnostic outcomes.
The core technical contribution is a retrieval-augmented architecture that accumulates searchable knowledge bases of historical incidents rather than attempting to train predictive models of code reliability. This approach aligns with the fundamental challenge: software reliability is highly context-dependent, data is scarce (incidents are rare), and causation is complex. Retrieval-based systems surface relevant historical precedent without requiring the sample efficiency and generalization capabilities needed for predictive modeling.
We have shown that such systems can improve through four mechanisms: database growth (accumulating incident knowledge), feedback-driven retrieval refinement (learning what historical precedent is relevant), representation learning (improving code embeddings for better matching), and continuous prompt engineering (improving LLM reasoning over retrieved incidents).
Evaluation requires moving beyond code quality metrics to operational outcomes: incident prevention rates, time to diagnosis, instrumentation diagnostic value, and mean time between failures. Early results suggest substantial potential—retrieval-based code review can catch 40-60% of incidents before they reach production, and proactive instrumentation can reduce time to diagnosis by 3-5x.
The broader vision is systems that don't just generate code but understand its operational behavior, that don't just flag potential issues but learn from actual failures, and that progressively automate not just development but the full engineering lifecycle from design through operations. This requires treating production environments as continuous sources of operational evidence and closing the loop from code to runtime behavior to adaptive improvement.
Artificial Engineering Intelligence represents a path toward AI systems that are not just productivity tools but reliability partners—systems that accumulate what works, remember what fails, and progressively ensure that the lessons of production experience inform every line of code written.
References
[1] Chen, M., et al. "Evaluating large language models trained on code." arXiv preprint arXiv:2107.03374 (2021).
[2] Feng, Z., et al. "CodeBERT: A pre-trained model for programming and natural languages." EMNLP 2020.
[3] Roziere, B., et al. "Code Llama: Open Foundation Models for Code." arXiv preprint arXiv:2308.12950 (2023).
[4] Yuan, D., et al. "Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems." OSDI 2014.
[5] Jin, G., et al. "Automated concurrency-bug fixing." OSDI 2012.
[6] Gunawi, H.S., et al. "Why does the cloud stop computing? Lessons from hundreds of service outages." SoCC 2016.
[7] Rabkin, A., and Katz, R. "Static extraction of program configuration options." ICSE 2011.
[8] Sambasivan, R.R., et al. "Diagnosing performance changes by comparing request flows." NSDI 2011.
[9] Chow, M., et al. "The mystery machine: End-to-end performance analysis of large-scale internet services." OSDI 2014.
[10] Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2009.
[11] Imbens, G.W., and Rubin, D.B. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.