9. November 2025
BillKrat
From Semantic Kernel Awareness to Microsoft Agent Framework
Note on authorship and collaboration: This guide was co-created through an AI-assisted workflow. The human contributor (Bill) set goals, constraints, and provided iterative feedback; GitHub Copilot generated structure, terminology definitions, and suggested wording, then applied edits in the workspace. Treat this as a collaborative, living document reflecting human intent assisted by AI.
Status: Draft v0.1 (skeleton)
Draft notice: The server framework is being refactored to a layered DI architecture; sections may lag the latest design until the refactor is complete.
Audience: C#/.NET 8–9 developers building AI apps without Azure, with a smooth Azure migration path
Project Links (read, run, source)
Note: You may be reading this on the blog or directly from the repository. The links above are canonical for project source and a running demo.
1) Goals and Scope
- Help C# developers understand AI terminology, core flows, and architecture choices.
- Start non-Azure (OpenAI API + local/open-source vector DB). Keep an easy on-ramp to Azure.
- Anchor example: “NotepadAI App” — an open-source starter application enabling document/blog ingestion, semantic search, and RAG answers; a separate private "Saints App" will later build atop NotepadAI for Christian studies and will drive iterative improvements back into the open-source platform.
2) Prerequisites
- Skills: C# async/await, dependency injection, minimal APIs/ASP.NET Core, REST/JSON basics.
- Tooling: .NET 8/9 SDK, Git, Docker (optional), VS/VS Code, Postman/REST Client.
- Accounts/Keys: OpenAI API key (non-Azure baseline). Azure account optional for migration.
3) Terminology (Glossary)
- LLM: Large Language Model. A deep neural network (Transformer architecture) trained on massive text corpora (books, documentation, code, web data) to predict the next token.
- Capabilities:
- Interpret natural language & follow instructions
- Generate / transform / summarize text
- Classify and extract structured data
- Assist reasoning via pattern completion
- Provide multilingual support (model-dependent)
- Limitations:
- Fixed context window (e.g., 16K–200K tokens per request)
- Stateless between calls unless prior messages are re-sent
- Susceptible to hallucination (may fabricate facts)
- No real-time external knowledge unless you retrieve & inject (RAG/tools)
- Sensitive to prompt phrasing and ordering
- Best Practices:
- Keep prompts concise and role-focused
- Supply grounded retrieved context (avoid unsupported speculation)
- Track token usage (cost & latency)
- Summarize older conversation turns to conserve window
- Enforce output schema (JSON / markdown) for predictable parsing
- Log prompts & responses for observability and QA
- Token: The unit of text the model processes and bills against.
- What it is: an integer id for a text piece produced by a tokenizer (e.g., byte-pair/WordPiece). Token ≠ word; it may be a word, subword, punctuation, or whitespace.
- When created:
- Before embedding: input text is tokenized; the embedding model consumes tokens to produce a vector.
- Before generation: the entire prompt (system + user + retrieved chunks + tool outputs) is tokenized; these are input tokens.
- During generation: the model predicts one token at a time; these are output tokens.
- Why it matters:
- Context window is measured in tokens (input + output), limiting how much you can send and receive in one call.
- Cost and latency scale with token counts; most providers price input and output tokens separately.
- Truncation and splitting happen at token boundaries.
- Counting & lengths:
- Token ≠ character; English averages ~3–4 characters/token, but varies by text and tokenizer.
- Spaces/newlines/punctuation can be separate tokens depending on tokenizer.
- Different models have different tokenizers and max windows; don’t assume interchangeability.
- Display vs storage:
- Tokens are not UI elements or database keys.
- The UI displays detokenized text (decoded from token ids) returned by the model.
- Practical tips:
- Budget tokens across: system + retrieved context (chunks) + user + expected output.
- Keep chunks concise; remove boilerplate to save tokens.
- Use a tokenizer utility in .NET to estimate prompt size and cap top-k retrieval dynamically.
- Cache token counts for static corpus content to speed prechecks.
- Example:
- "St. Anthony of Egypt" may tokenize roughly as ["St", ".", "ĠAnthony", "Ġof", "ĠEgypt"] (illustrative; actual pieces depend on the tokenizer).
- Misconceptions:
- A token is not a security key or lookup id; it’s a processing unit for the model.
- Tokens are not raw bytes; they’re vocabulary ids specific to a model’s tokenizer.
- Embedding: A fixed-length numeric representation (vector) of text produced by an embedding model such that semantically similar texts are near each other in vector space.
- Models & Dimensions: e.g., OpenAI text-embedding-3-small (1536 dims), text-embedding-3-large (3072 dims); pick one per index and stay consistent for queries.
- Workflow:
- Ingest: chunk documents → compute embedding for each chunk → store vector + text + metadata in vector store.
- Query: embed user question → similarity search → retrieve top-k chunks → pass retrieved text to LLM.
- Uses:
- Semantic search (RAG)
- Clustering/grouping
- Deduplication & near-duplicate detection
- Lightweight classification/tag suggestion
- Reranking (combine lexical + semantic signals)
- Best Practices:
- Record model name/version + dimension in schema
- Normalize vectors if store expects unit length (cosine)
- Cache embeddings by content hash (avoid recompute)
- Batch API calls for throughput / rate limit efficiency
- Keep chunks topical; strip boilerplate/HTML noise
- Use multilingual model if corpus spans languages
- Pitfalls:
- Mixing different embedding models between index & query
- Dimension mismatch vs. index schema
- Ignoring rate limits (add retry/backoff)
- Embedding giant unchunked documents (low precision)
- Failing to re-embed after model or chunking change
- Using different similarity metric than index assumption
- Cost & Privacy:
- Cheaper than chat completions per token processed
- Still external unless self-hosted model (consider PII redaction)
- Cache + deduplicate reduces spend dramatically
- Chunk: A semantically coherent segment (a self-contained passage focused on one idea—e.g., a paragraph, section, or blog snippet) of a document created during ingestion for retrieval/embedding. Typically 300–1000 tokens with optional overlap to preserve context; stored with metadata (author, tags, category, publishDate, domain) and original source reference.
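A minimal chunker can be sketched as below, using word counts as a stand-in for tokens (an assumption for brevity; real ingestion would split on semantic boundaries such as headings or paragraphs and measure sizes with a tokenizer):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal chunker: split on whitespace and group words into fixed-size
// windows with overlap, so context at chunk boundaries is preserved.
List<string> ChunkWords(string text, int chunkSize = 200, int overlap = 40)
{
    var words = text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);
    var chunks = new List<string>();
    for (int start = 0; start < words.Length; start += chunkSize - overlap)
    {
        chunks.Add(string.Join(' ', words.Skip(start).Take(chunkSize)));
        if (start + chunkSize >= words.Length) break; // last window reached the end
    }
    return chunks;
}

// Demo: a 500-word document yields chunks starting at words 0, 160, 320.
var doc = string.Join(' ', Enumerable.Range(1, 500).Select(i => $"w{i}"));
var parts = ChunkWords(doc, chunkSize: 200, overlap: 40);
Console.WriteLine(parts.Count); // 3
```

Each chunk would then be stored alongside its metadata (author, tags, category, publishDate, domain) and source reference.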
- Vector: A numeric array (float[]) that encodes the meaning of text in a high-dimensional space. Example: "desert hermit" → [0.12, -0.04, 0.98, …]. Dimension depends on the embedding model: e.g., 1536 for OpenAI text-embedding-3-small, 3072 for text-embedding-3-large; smaller open-source models often use 384, 768, or 1024. Similar texts produce vectors that are close under cosine similarity or dot product (e.g., "anchorite monk" is close; "accounting spreadsheet" is far). In practice you: (1) embed each chunk and store its vector with metadata, (2) embed the user query, (3) retrieve nearest neighbors by similarity. Keep indexing and querying with the same model and similarity metric; some stores expect normalized vectors for cosine.
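Cosine similarity itself is only a few lines; this sketch uses toy 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions):

```csharp
using System;

// Cosine similarity: dot product divided by the product of magnitudes.
// 1.0 = same direction, ~0 = unrelated, -1.0 = opposite.
double Cosine(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Dimension mismatch: index and query must use the same embedding model.");
    double dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}

// Toy vectors standing in for embeddings of the phrases in the text above.
var hermit = new float[] { 0.9f, 0.1f, 0.0f };
var monk = new float[] { 0.8f, 0.2f, 0.1f };
var spreadsheet = new float[] { 0.0f, 0.1f, 0.9f };

Console.WriteLine(Cosine(hermit, monk) > Cosine(hermit, spreadsheet)); // True
```

The dimension-mismatch guard mirrors a real failure mode: querying an index with vectors from a different embedding model.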
- Vector Store: Database optimized for nearest-neighbor search over vectors (e.g., Qdrant HNSW, Weaviate, Azure AI Search vector fields). You must set the field dimension to match the embedding model.
- Qdrant: Open-source vector database focused on semantic search. HNSW-based ANN index, payload metadata for filtering, REST/gRPC APIs, easy local Docker usage; accessible from .NET via HTTP clients.
- Weaviate: Open-source vector database with a schema-first approach and hybrid search. GraphQL/REST APIs, modules for reranking/transformers, good Docker support; accessible from .NET via HTTP clients.
- Similarity Search: Find nearest vectors (cosine/dot/Euclidean) to a query vector; returns top-k chunks with scores.
- RAG: Retrieval-Augmented Generation. Architecture that combines external retrieval with generation to produce grounded answers.
- Core Components: retriever (vector/hybrid search), chunk store with metadata, prompt assembler, LLM, post-processor (citations/formatting).
- Workflow: embed query → retrieve top-k relevant chunks (optionally filter by metadata) → build prompt (system + user + context) → generate answer → attach citations/scores.
- Benefits: improved factual accuracy, domain specificity, dynamic updates without fine-tuning, transparent sourcing.
- Best Practices: limit chunk count (balance coverage vs. token budget), deduplicate near-identical chunks, enforce source attribution, fallback to clarification if retrieval confidence low, monitor retrieval precision/recall.
- Pitfalls: prompt stuffing (too many tokens), low-quality chunking (mixed topics), missing metadata filters, stale embeddings after corpus change, ignoring retrieval scoring thresholds.
- Evaluation Metrics: answer correctness (manual or automated), citation validity, retrieval hit rate, latency breakdown (embed vs. search vs. generate), token usage.
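The prompt-assembly step of the workflow can be sketched as follows; the message shape, system wording, and source-tag format are illustrative, not a fixed contract:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assemble a RAG prompt: tag each retrieved chunk with its source id so
// the model can cite it, then combine system rules, context, and question.
// Messages are modeled here as simple (role, content) pairs.
List<(string Role, string Content)> BuildRagPrompt(
    string question, IEnumerable<(string SourceId, string Text)> chunks)
{
    var context = string.Join("\n\n",
        chunks.Select(c => $"[source: {c.SourceId}]\n{c.Text}"));
    return new List<(string, string)>
    {
        ("system", "Answer only from the provided context. Cite sources by id. " +
                   "If the context is insufficient, say so."),
        ("user", $"Context:\n{context}\n\nQuestion: {question}\n\n" +
                 "Return JSON with fields: answer, citations[]."),
    };
}

var messages = BuildRagPrompt(
    "Where did St. Anthony live?",
    new[] { ("blog-042", "St. Anthony withdrew to the Egyptian desert...") });
Console.WriteLine(messages.Count); // 2: one system and one user message
```

Tagging each chunk with its source id is what makes the post-processor's citation step possible.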
- Agent: An active decision-making wrapper around an LLM that can invoke tools, manage memory, and apply policies.
- Responsibilities: interpret user intent, decide which tool(s) to call, integrate tool outputs into prompts, manage conversation context, enforce guardrails.
- Structure: system prompt (role/policies), tool registry (capabilities), memory interfaces (short-term chat history + long-term vector store), orchestrator/planner.
- Memory Types: ephemeral (current conversation turns), persistent (vector store/domain data), summarizations (compressed history), scratchpad (intermediate reasoning steps if supported).
- Best Practices: keep tool surface minimal & well-described, validate tool outputs before injection, cap conversation length via summarization, log decisions for observability, isolate domain-specific rules in system prompt.
- Pitfalls: over-broad tools that leak sensitive data, unbounded conversation growth, inconsistent tool naming, mixing retrieval & generation without grounding checks, silently failing tool calls.
- Observability: log selected tools, latency per tool, tokens in/out, retrieval scores, error/retry counts.
- Tool: An executable capability the agent can call (search DB, call API, run code). Formerly “skill” in SK.
- Orchestration: The control layer that sequences agent actions, tool invocations, and prompt assembly.
- Patterns: single-step (direct answer), multi-step planning (decompose tasks), tool chaining (search → summarize → format), conditional branching (retry/fallback), parallel retrieval (multiple indices).
- Responsibilities: choose next action, manage error handling/retries, consolidate tool outputs, enforce ordering constraints, produce final response package.
- Error Handling: retries with backoff for transient failures, circuit-breaker for failing tools, graceful degradation (fallback answer explaining limitation).
- Best Practices: explicit tool metadata (rate limits, cost hints), timeouts per tool, structured intermediate state, minimal serialization overhead, metrics instrumentation.
- Pitfalls: tight coupling between agent and concrete tools (hard to swap), hidden side-effects, lack of backpressure under high request volume, missing telemetry (hard to debug quality issues).
- Metrics: average orchestration steps per query, tool success rate, cumulative latency, token amplification (retrieved vs. used), failure modes distribution.
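The retry-with-backoff pattern from Error Handling above can be sketched generically (delays are shortened for the demo; production values and the transient-failure filter would be tool-specific):

```csharp
using System;
using System.Threading.Tasks;

// Retry a tool call with exponential backoff on transient failures.
// The final attempt lets the exception propagate to the caller.
async Task<T> RetryAsync<T>(Func<Task<T>> action, int maxAttempts = 3, int baseDelayMs = 100)
{
    for (int attempt = 1; ; attempt++)
    {
        try { return await action(); }
        catch (Exception) when (attempt < maxAttempts)
        {
            await Task.Delay(baseDelayMs * (1 << (attempt - 1))); // 100, 200, 400ms, ...
        }
    }
}

// Demo: a tool that fails twice before succeeding.
int calls = 0;
int result = await RetryAsync(() =>
{
    calls++;
    if (calls < 3) throw new InvalidOperationException("transient failure");
    return Task.FromResult(42);
}, maxAttempts: 5, baseDelayMs: 1);

Console.WriteLine($"{result} after {calls} calls"); // 42 after 3 calls
```

A real orchestrator would also record attempt counts and latency per tool for the metrics listed above, and wrap repeatedly failing tools in a circuit breaker.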
- Memory: Short-term conversation state and/or long-term knowledge (e.g., vector store).
- MCP: Model Context Protocol. A standard for interoperable communication between models, agents, and tools.
- Purpose: decouple agent logic from specific tool/model implementations, enable portable tool definitions.
- Abstractions: tools (declared capabilities & schemas), resources (data references), messages/events (invocation, result, error), context envelopes (state passed along call chain).
- Flow: agent formulates tool request → MCP transports structured call → tool executes (DB/vector search/etc.) → returns standardized result payload → agent integrates into next reasoning step.
- Advantages: interoperability, composability (share tools across agents), consistent error & schema handling, easier migration (swap providers), improved observability via standardized events.
- Best Practices: precise tool schemas (inputs/outputs), include versioning, define error codes, keep payloads lean (avoid huge raw blobs), secure endpoints (auth/z), validate responses before prompt injection.
- Pitfalls: ambiguous schema fields, oversized responses inflating context window, ignoring version changes, lack of auth leading to misuse, mixing untrusted raw data into prompts without sanitization.
- Observability: log tool invocation id, duration, payload size, version, error code.
- Prompt: The structured instruction + input payload you send to an LLM (often represented as ordered messages: system, user, assistant, tool). A good prompt constrains behavior, sets role, supplies task, provides context (retrieved chunks), defines output format, and lists guardrails. Typical RAG pattern:
- System: role + high-level rules (cite sources, domain scope).
- Context: retrieved chunks injected (each tagged with source/ID).
- User: the question or task.
- Optional tool messages: prior tool outputs (e.g., search results JSON).
- Output directive: “Return JSON with fields: answer, citations[].”
- Anti-patterns: stuffing unrelated chunks, contradictory instructions, overlong examples that consume the context window.
- Note: prompts are not automatically visible to end users; you choose what to expose (e.g., only the final answer plus citations).
- Context Window: The maximum number of tokens (input + model generated output) the LLM can hold in a single request. If the sum of: system + prior messages + injected chunks + user question + expected output tokens > window size, older or excess tokens must be truncated or excluded. Drives design decisions: chunk size, top-k retrieval, summary of prior conversation. Not what the user sees; it's the internal working memory limit of the model for that call.
- Grounding Data: External facts injected into prompts to reduce hallucination.
- Hallucination: Confident but incorrect output; mitigated via RAG, citations, constraints.
- Vector Store Selection (1-line): Qdrant for fast semantic + filtering simplicity, Weaviate for schema/hybrid & modules, FAISS for embedded prototyping (in-process, minimal infra), Azure AI Search for managed, scalable hybrid + enterprise integration.
4) High-Level Flows
- Ingestion Flow
- Raw docs (Markdown/HTML/TXT) → Chunk → Add metadata (author, tags, category, publishDate, domain) → Embedding → Upsert to vector DB.
- Query Flow
- User query → Agent → Tool: vector/search (+ optional filters) → Retrieve top-k chunks → Compose prompt → LLM → Answer + citations.
- Migration Flow
- Start: OpenAI + Qdrant/Weaviate (open-source vector stores) + ASP.NET Core.
- Migrate: Azure OpenAI + Azure AI Search + App Service/Container Apps + Key Vault + Monitor.
5) Semantic Kernel (Awareness)
- What it is: A developer library (C# and Python) for orchestrating prompts and LLM “functions,” popular for early prototyping.
- Core concepts (recognizable terms):
Kernel: central container that wires up services (models, memory, connectors) and invokes functions.
Skill/Plugin: a named collection of functions; can be native (C# methods) or semantic (prompt templates).
Function: an invocable unit (native or prompt-based) with descriptions, inputs, and outputs.
Planner: components that generate a plan (a sequence of functions) from a high-level goal (e.g., Action/Sequential planners).
Prompt Template: parameterized prompt with inputs and execution settings (temperature, max tokens, etc.).
Memory: abstractions for storing and retrieving embeddings/metadata via connectors (e.g., Azure Cognitive Search, Qdrant via community connectors).
Connectors: integrations to external services and data sources.
Context Variables: shared bag of inputs/outputs passed between functions in a pipeline.
- Typical workflow: register models and memory → import skills/plugins → use a planner to build a plan → execute plan step-by-step (each step reads/writes context) → return final result.
- Strengths: approachable mental model; easy to bind C# methods as tools; prompt templates + function descriptions; worked well for single-agent demos and small pipelines.
- Limitations: orchestration and multi-agent patterns required boilerplate; planning reliability varied by scenario; uneven connector quality; less standardized tool schema; harder to scale complex agent workflows.
- Evolution: not deprecated, but Microsoft Agent Framework is the successor with unified orchestration, MCP-native tools, clearer memory model, and production-focused APIs.
- Migration map (quick):
Kernel → Agent
Skill/Function → Tool
Planner → Orchestrator
Memory connectors → Agent memory + vector tool/retriever
- When to keep SK: maintaining existing apps or quick one-off prototypes already built on SK.
- When to choose Agent Framework: new builds, multi-tool/multi-agent systems, MCP integration, clearer migration to Azure.
6) Microsoft Agent Framework (Path Forward)
- Value: Unified orchestration, MCP-native tools, simpler APIs, production-first.
- Core pieces:
- Agent: Encapsulates reasoning, memory, and tool usage.
- Tool: Declarative capability (e.g., vector search, summarization, fetch biography).
- Orchestrator: Plans/decides tool calling and error handling.
- Memory: Conversation state + external knowledge via vector store.
- Non-Azure baseline
- Models: OpenAI GPT models + text-embedding-3-small (or compatible).
- Vector DB: Qdrant, Weaviate, or local FAISS.
- API: ASP.NET Core minimal API/Controllers.
- Azure path (drop-in replacements)
- Models: Azure OpenAI (same API surface via SDKs).
- Vector: Azure AI Search (vector + hybrid search, filters, scaling).
- Platform: App Service/Container Apps + Key Vault + Monitor/Log Analytics.
Visual Studio/.NET Aspire: Cloud-ready local development
- What it is: an opinionated stack (templates + tooling) in .NET 8/9 and Visual Studio for building cloud-native apps locally, with first-class Azure deployment when ready.
- Why it helps for AI:
- Orchestrates multi-project solutions locally via an AppHost project (e.g., NotebookAI.AppHost) and shared ServiceDefaults (e.g., NotebookAI.ServiceDefaults) for resiliency, health checks, and OpenTelemetry.
- Runs dependencies (e.g., Qdrant/Weaviate containers) alongside your API/worker services with unified configuration (env vars, connection strings), no Azure subscription required during dev.
- Promotes cloud-neutral code: swap OpenAI→Azure OpenAI or Qdrant→Azure AI Search primarily via configuration and DI wiring, not code rewrites.
- Streamlines deployment: Visual Studio publish targets (e.g., Azure Container Apps/App Service) understand Aspire conventions (health probes, logging, secrets), making the transition from local to Azure low-friction.
- Secrets & config: use User Secrets locally; map to Azure Key Vault/App Config in prod without changing code.
- ROI: prototype and iterate locally without cloud spend; when you need to scale, the same Aspire-based solution is already structured for Azure deployment with minimal changes.
7) Anchor Example: NotepadAI App (Reference Architecture)
- Purpose: Open-source starter for AI-enabled content management (upload documents, author blogs, semantic search, RAG answers) with extensibility via dependency injection so developers can tailor or extend features; a private Saints App will build atop this framework for a specialized religious study corpus.
- Projects (example layout)
- Web API (NotebookAI.Server): endpoints /ingest, /query.
- Agents (NotebookAI.Services): agent definition + system prompt + orchestration.
- Tools (NotebookAI.Services): VectorSearchTool, SummarizeContentTool, optional domain-specific tools.
- Data/Infra: Vector DB client, chunking/embedding service.
- Data model (chunk)
- id, text, embedding, author, tags, category, publishDate, source, domain.
- Request/Response
- Query request: text + optional filters (tags, category, domain, topK).
- Response: answer, citations (source, snippet, score), tokens/latency.
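The request/response shapes above might look like the following C# records (property names and defaults are a suggestion for the starter, not a fixed contract):

```csharp
using System;
using System.Collections.Generic;

// Illustrative usage: build a filtered query request.
var req = new QueryRequest("Who was St. Anthony of Egypt?", Tags: new[] { "saints" }, TopK: 3);
Console.WriteLine(req.TopK); // 3

// Query request: text plus optional metadata filters.
public record QueryRequest(
    string Text,
    string[]? Tags = null,
    string? Category = null,
    string? Domain = null,
    int TopK = 5);

// One cited chunk in the answer.
public record Citation(string Source, string Snippet, double Score);

// Response: answer, citations, and token/latency telemetry.
public record QueryResponse(
    string Answer,
    IReadOnlyList<Citation> Citations,
    int PromptTokens,
    int CompletionTokens,
    double LatencyMs);
```

Records serialize cleanly to JSON for an Angular-friendly API surface, and positional syntax keeps the DTOs immutable by default.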
8) Series Roadmap (Build in Small Steps)
(Current Status: Actively refactoring server framework to a layered DI architecture—roadmap execution is paused and will resume immediately after this refactor is complete.)
- Project setup (.NET 8/9, packages, config, secrets).
- Chunking + embeddings service (stream-safe chunk sizes, overlap, metadata extraction).
- Vector store integration (Qdrant/Weaviate/Azure AI Search) + schema.
- Tools in Agent Framework (vector search, summarize, format citations).
- NotepadAI agent (system prompt, guardrails, tool wiring).
- RAG prompt assembly (top-k selection, metadata filters, answer style).
- Web API endpoints + Angular-friendly DTOs.
- Observability (structured logs, timing, token usage, retrieval metrics).
- Quality & safety (grounding checks, citations, evaluation set).
- Azure migration checklist (resource mapping, config, CI/CD, secrets, monitoring).
9) Prompt & Memory Strategy (Draft)
- System prompt template tuned for mixed document/blog corpus; require citations.
- Retrieval policy: top-k, diversity by source, metadata filters (tags/category/domain).
- Conversation: short-term chat memory (summarized) vs. long-term knowledge (vector store).
- Guardrails: refuse to answer outside corpus domain unless explicit confirmation (Saints App can further narrow domain to religious studies).
10) Quality, Safety, and Cost
- Hallucination reduction: strict use of retrieved context, quote snippets, expose sources.
- Safety: optional content filters; corpus/domain scope prompts (Saints App adds theology-specific constraints).
- Cost/latency: shorter chunks, careful top-k, cache embeddings, reuse conversations.
11) Migration Path Details
- Abstractions: IEmbeddingService, IVectorStoreClient, IRetriever, IAgentAdapter.
- Swap implementations: OpenAI → Azure OpenAI; Qdrant/Weaviate → Azure AI Search.
- Platform: Containerize; use Key Vault for secrets; monitor with App Insights.
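The abstraction seams might be declared as below; the signatures are illustrative assumptions (the real ones depend on your chosen SDKs), and the concrete class names in the comments are placeholders:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Turns text into an embedding vector (OpenAI locally, Azure OpenAI later).
public interface IEmbeddingService
{
    Task<float[]> EmbedAsync(string text, CancellationToken ct = default);
}

// Upserts and searches vectors (Qdrant/Weaviate locally, Azure AI Search later).
public interface IVectorStoreClient
{
    Task UpsertAsync(string id, float[] vector,
        IDictionary<string, string> metadata, CancellationToken ct = default);
    Task<IReadOnlyList<(string Id, double Score)>> SearchAsync(
        float[] query, int topK, CancellationToken ct = default);
}

// Combines the two: embed the question, return the top-k chunk texts.
public interface IRetriever
{
    Task<IReadOnlyList<string>> RetrieveAsync(
        string question, int topK, CancellationToken ct = default);
}

// In Program.cs the provider is chosen once at the composition root, e.g.:
//   services.AddSingleton<IEmbeddingService, OpenAIEmbeddingService>();   // non-Azure baseline
//   services.AddSingleton<IVectorStoreClient, QdrantVectorStoreClient>(); // local Docker
// and for the Azure path, only these registrations change:
//   services.AddSingleton<IEmbeddingService, AzureOpenAIEmbeddingService>();
//   services.AddSingleton<IVectorStoreClient, AzureAISearchClient>();
```

Because callers depend only on the interfaces, the OpenAI → Azure OpenAI and Qdrant/Weaviate → Azure AI Search swaps are DI registrations plus configuration, not code rewrites.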
12) Checklists (Appendix)
- Ingestion
- RAG response
- Migration readiness
13) Next Steps
- Flesh out each section with concrete code samples for .NET 8/9.
- Provide minimal Qdrant/Weaviate docker-compose for local dev.
- Add Azure AI Search mapping guide and sample ARM/Bicep templates.
References
- Microsoft Agent Framework docs and migration guide (Semantic Kernel → Agent Framework).
- OpenAI embeddings and chat completion APIs.