Cursor for LibreOffice Week 2 & 3: How I Added MCP, ACP, a Research Sub-agent, Talk to Your Document, an Eval Dashboard, and Survived Quarzadous’s Total Refactor


I’ve been calling this project Cursor for LibreOffice to myself, but I knew I couldn’t use the name forever, so I did some research and chose WriterAgent. It supports Calc and Draw as well, but I didn’t like the name OfficeAgent, which sounds like some Soviet-era KGB job title. Last week’s post described how I took John Balis’s clean little Localwriter and bolted on threading, tool-calling, chat, and enough other stuff that it started to feel like a powerful chatbox inside LibreOffice.

It became useful enough, and the progress was so fast with all the Python code out there to re-use, that I was motivated to keep going. Meanwhile a chap named Quarzadous dropped a complete refactor and I wanted to integrate it without breaking anything, including the new features I had added.

MCP

After building the initial chat-with-document feature, I realized that many people might want to talk via their local agents (the infamous OpenClaw, Hermes, Claude, etc.) and let those agents edit their documents. These systems have many features: memory of previous conversations, file-system access, and skills they can learn after install, so implementing the Model Context Protocol (MCP) to let them make the same tool calls would also be useful.

I wondered whether supporting both external agents and an internal one in the same codebase was a good idea, since the users and some use-cases are different. However, both use the same API backend and enough other pieces that much of the code is shared. The UI is just a new checkbox, “Enable MCP”, plus a few new files to spin up an HTTP server, process the JSON-RPC, and one day possibly support tunneling. So I decided it was worth supporting both, rather than either-or.
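The heart of that server is just dispatching JSON-RPC requests to tool functions. Here is a minimal sketch of the idea; the tool name and registry are invented for illustration (real MCP also defines methods like initialize, tools/list, and tools/call, which this skips), and in the extension the tools would call into the UNO document API:

```python
import json

# Hypothetical tool registry; in the real extension these wrap UNO calls.
TOOLS = {
    "insert_text": lambda args: f"inserted {len(args['text'])} chars",
}

def handle_jsonrpc(raw: str) -> str:
    """Dispatch one JSON-RPC 2.0 request to a registered tool."""
    req = json.loads(raw)
    try:
        result = TOOLS[req["method"]](req.get("params", {}))
        resp = {"jsonrpc": "2.0", "id": req.get("id"), "result": result}
    except KeyError:
        resp = {"jsonrpc": "2.0", "id": req.get("id"),
                "error": {"code": -32601, "message": "Method not found"}}
    return json.dumps(resp)
```

An HTTP layer (e.g. the stdlib http.server) would simply read the request body, pass it through handle_jsonrpc, and write the response back.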

Actually, the hardest part of building software for non-technical people is that you need to make something Apple-like, very easy to use, which is hard because developers have a much higher tolerance for confusing products.

The libreoffice-mcp-extension, written by Quarzadous, had the missing pieces. I integrated it with the existing code and over time refactored it to remove any duplicate logic. I also added sidebar logging, so that when an MCP tool call happens, you can see the details in the chat, just like for the internal agent.

Huggingface Smolagents

The next feature I wanted was a web search tool the AIs could call. LLMs are generally useful, but their training cutoff is often a year or two in the past, so I wanted a way to let them look up information on the web and plug it into a document.

However, once I thought through the various steps:

  • Make a web search tool-call
  • Read through the results, decide the first page to visit
  • Read the web page and decide if it needs to read another page or whether it has an answer

I realized that it would be much better to have an isolated, specialized sub-agent do all this work and return a distilled answer, rather than distract the main LLM with this specialized task and bloat its context.

After a few minutes of searching, I discovered that Huggingface’s smolagents library already includes this functionality. Huggingface is the man! The code needed slight changes to remove dependencies (Jinja, etc.), but it was easy to vendor the core of their ToolCallingAgent and its ReAct (Reason + Act) loop. Here’s part of the prompt; you can see how it encourages looping until the agent is confident in the answer:

You are an expert assistant who can solve any task using tool calls. You will be given a task to solve as best you can.
To do so, you have been given access to some tools.

The tool call you write is an action: after the tool is executed, you will get the result of the tool call as an "observation".
This Action/Observation can repeat N times, you should take several steps when needed. You can use the result of the previous action as input for the next action.

To provide the final answer to the task, use an action blob with "name": "final_answer" tool. It is the only way to complete the task, else you will be stuck on a loop. So your final output should look like this:
Action:
{
  "name": "final_answer",
  "arguments": {"answer": "insert your final answer here"}
}

Tools list:
- web_search:
  Performs a duckduckgo web search based on your query (think a Google search) then returns the top search results.
  Inputs:
    - query (string): The search query to perform.
  Output type:
    - string

- visit_webpage:
  Visits a webpage at the given url and reads its content as a markdown string. Use this to browse webpages.
  Inputs:
    - url (string): The url of the webpage to visit.
  Output type:
    - string

- final_answer:
  Provides a final answer to the given problem.
  Inputs:
    - answer (any): The final answer to the problem.
  Output type:
    - any

Now Begin!
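The loop that drives this prompt is simple. Here is a sketch with a toy model standing in for the real LlmClient; the model returns action blobs in the format the prompt above asks for, and the loop executes tools until it sees final_answer:

```python
import json

def run_react_loop(model, tools, task, max_steps=6):
    """Minimal ReAct-style loop: call the model, run the tool it names,
    feed the observation back in, stop when it calls final_answer."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = json.loads(model(messages))  # model emits an action blob
        if action["name"] == "final_answer":
            return action["arguments"]["answer"]
        observation = tools[action["name"]](**action["arguments"])
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return None  # gave up after max_steps

# Toy model: searches once, then answers based on what it observed.
def toy_model(messages):
    if any("Observation:" in m["content"] for m in messages):
        return json.dumps({"name": "final_answer",
                           "arguments": {"answer": "found: 42"}})
    return json.dumps({"name": "web_search", "arguments": {"query": "answer"}})

tools = {"web_search": lambda query: "42"}
```

The real vendored agent adds retries, token budgeting, and proper system prompts, but the control flow is essentially this.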

I rewrote their web tools to use just the Python standard library, and wrapped the existing LlmClient so the research sub-agent uses the same model and endpoint as chat-with-document. That way, if a local model gets confused by a complex topic and starts chewing on the furniture, you can easily select a smarter, pricier one and pay a couple of pennies to have the adults handle it.
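A stdlib-only visit_webpage can look roughly like this. It is a sketch, not the extension’s exact code: the real tool does a fuller HTML-to-markdown conversion, while this one just strips scripts and styles and collects the visible text:

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.chunks, self.skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

def visit_webpage(url: str) -> str:
    """Fetch a page with urllib and return its visible text."""
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

The advantage of staying on the standard library is that nothing extra needs to be vendored into the extension for this tool.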

In a couple of hours, it was working and I could type this text in a document:

The price of a Sol-Ark 15K limitless inverter is: $YYY.

In the sidebar, I wrote: What is the real price of the inverter?

Without web research, if you ask a random LLM for the price and specs of a Sol-Ark 15KW inverter, it will hallucinate a price tag of $400, tell you it runs on AA batteries, and confidently suggest wiring it with speaker wire. With the sub-agent, it can learn any details you request, and the AI changed the sentence to:

The price of a Sol-Ark 15 KW Limitless inverter in the US is: $6,979.99 – $6,999.00.

It even fixed the capitalization of Limitless, which is a proper name. I’ve tweaked the prompts to explain to the AI that its primary job is to edit the document, not just answer questions, and the models mostly get it now.

This feature was so exciting to me that I added a Web research checkbox, which lets you talk directly to the sub-agent to have it answer questions or summarize web pages, and it places the answers in the chat window.

This little feature is better than ye olde Google search box since it understands natural language. You can ask it specific questions:

“What is the current version of Python and when was it released?”

And it gives you a natural language answer:

“The current stable version of Python is 3.14.3, which was released on February 3, 2026.”

The LLMs are told about the Web research tool call to use when asked about a topic they are unfamiliar with, but you can also encourage it: “Do web research and write a colorful, detailed summary of the space elevator, suitable for physicists.”

Or you could say “suitable for English teachers”, and get a completely different report!

Reports generated by Nemotron 3 Super

With a typical model on OpenRouter, it takes 30-60 seconds to generate a report on any topic, which isn’t that long in the scheme of things, but I discovered a diffusion model called Mercury-2 which is fairly smart (Claude Haiku level) but much cheaper ($0.25 / M input tokens, $0.75 / M output) and outputs 250-500 tokens per second. With that model, I can get researched documents on any topic faster than I can take a sip of coffee, and each report costs a fraction of a penny. Going back to a standard model feels like watching a dot-matrix printer.

I hardly use search engines directly anymore. For the last couple of years, I would ask an LLM my questions and let it read the pages and synthesize. But now I have WriterAgent running at all times and let it do the research, since it is very fast and puts the information into a chat window or into a document I can further edit.

Talk to your document

The next feature I wanted was talking to the document. I had pushed it off (for almost 2 weeks) because the standard Python runtime has no cross-platform APIs for using the microphone. So I had the Google Jules coding agent do research, and we had a long conversation about the various ways to implement this feature in the constrained LibreOffice environment, including using a local web browser to handle the cross-platform audio headaches.

However, I realized that there was a reasonable vendoring strategy: bundling a few MB of binaries for sounddevice, cffi, and pycparser directly into the extension. The sounddevice packages for Windows and macOS include the compiled binaries, so it was truly plug-and-play, with no need to fire up a bunch of cross-compilers.

Jules was either extremely thorough in the implementation phase, or lacking a bit in common sense, when it grabbed binaries for every device known to man, including the IBM S-390x mainframe. I love supporting all the latest platforms as much as anyone, but decided that the number of banking executives wanting to dictate memos in LibreOffice on the most expensive computer in their data center is probably zero. They can always make a custom build! By narrowing it down to x86 and ARM on Linux, Mac, and Windows, the extension only grew from 500 KB to 4 MB, which I felt was not too bad for a no-hassle install.

Few LLMs support native audio input, so I implemented an automatic fallback. It first tries to send audio, and if it gets an error, it routes your voice to a fallback speech-to-text (STT) model to transcribe it, and the transcript is then sent to the chat model. This happens automatically: the user just clicks record and talks.
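The fallback logic itself is a few lines of control flow. A sketch, with hypothetical names (send_audio, transcribe, the exception type) standing in for the extension’s real client methods:

```python
# Hypothetical names throughout; these stand in for the real client API.
class UnsupportedAudioError(Exception):
    pass

def chat_with_voice(audio, chat_model, stt_model):
    """Try native audio input first; on failure, transcribe with an STT
    model and send the resulting text to the chat model instead."""
    try:
        return chat_model.send_audio(audio)
    except UnsupportedAudioError:
        text = stt_model.transcribe(audio)
        return chat_model.send_text(text)

# Toy doubles simulating a chat model that rejects audio input.
class NoAudioChat:
    def send_audio(self, audio):
        raise UnsupportedAudioError("model rejects audio input")
    def send_text(self, text):
        return f"reply to: {text}"

class ToySTT:
    def transcribe(self, audio):
        return "hello document"
```

The nice property is that the user never sees the difference; models that support native audio just take the fast path.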

The Great Refactor (thanks Quarzadous)

While I was heads-down trying to make the system smarter, Quarzadous opened a ‘framework’ branch that completely rewrote my architecture from a cozy monolith into a maze that even an Enterprise Java developer, who is used to navigating registry classes to find factory classes to instantiate singletons (aka globals), would think was slightly overdone.

Seriously though, he made so many good changes; the only tricky part was that it was all done at the same time, and suddenly the 15-kLOC codebase had more sub-directories than the Linux kernel and every file was in a different location.

I decided to take his changes a step at a time. First, I (mostly) took the new directory layout and build system, and step by step migrated the other features over. Once I had consolidated it into something I felt was appropriate for a codebase of its size, and I knew where the files were, I was happy. He added so many useful features:

  • Each module is its own folder with a module.yaml that auto-generates the settings UI, so no more manual XDL work for every new service.
  • A main-thread executor with backpressure (no more crashes on huge documents)
  • Fresh UNO context on every call
  • Refactored tools and services into common classes

Having a schema generate the config UI is such a nice feature that I would never have added to this codebase without someone else thinking of it and doing it.
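A minimal sketch of that schema-driven approach, shown with the YAML already parsed into a dict (the setting names, types, and widget mapping here are invented for illustration, not the real module.yaml format):

```python
# Hypothetical parsed module.yaml; field names invented for illustration.
MODULE_SCHEMA = {
    "name": "web_research",
    "settings": [
        {"key": "enabled", "type": "bool", "default": False,
         "label": "Enable web research"},
        {"key": "max_steps", "type": "int", "default": 6,
         "label": "Max search steps"},
    ],
}

# Map declared setting types to dialog widget kinds.
WIDGET_FOR_TYPE = {"bool": "CheckBox", "int": "NumericField", "str": "TextField"}

def build_settings_ui(schema):
    """Turn each declared setting into a widget description, so the
    settings dialog is generated instead of hand-written per module."""
    return [
        {"widget": WIDGET_FOR_TYPE[s["type"]],
         "label": s["label"],
         "key": f"{schema['name']}.{s['key']}",
         "value": s["default"]}
        for s in schema["settings"]
    ]
```

Adding a new setting then means adding one line of schema, and the UI, defaults, and storage keys all follow from it.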

ACP

While it was great to talk to the agents, it kinda sucked to interact with them on the command line. I spent several hours trying to implement TTY redirection and other tricks, but it was a pain and would hang. I noticed that on March 14th, Hermes Agent added the Agent Communication Protocol (ACP), which provides an easy way to talk to it without dealing with the mess of a console. So I threw away the unreliable hacks and switched to a simple ACP implementation, and in 10 minutes I had it talking.
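The wire format is what makes this so much easier than scraping a console: newline-delimited JSON-RPC over the agent’s stdin/stdout. A hedged sketch of the framing (the method name here is illustrative, not taken from the ACP spec):

```python
import json

def encode_request(req_id, method, params):
    """Frame one JSON-RPC 2.0 request as a newline-terminated line."""
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params}) + "\n"

def decode_message(line: str) -> dict:
    """Parse one line of agent output back into a message dict."""
    return json.loads(line)

# In the extension, these would wrap a subprocess.Popen(agent_cmd,
# stdin=PIPE, stdout=PIPE) pipe pair instead of a TTY.
```

Compared with TTY redirection, every message is a complete, parseable unit, so nothing hangs waiting on a half-drawn prompt.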

You could ask Hermes to create a report of weekend events in Akihabara, and in less than a minute get pages that look like this:

Evaluation Dashboard

OpenRouter gives you 500 models, but which ones are actually best at editing documents, and which are good value? To answer that, I created some tests I could run against various models to compare how they did. For some tests, it was easy to tell whether the answer was correct (“remove all the excess spacing between the words”), but I realized that for many of them (“make a table from this mess of text”) it would be best to call into a teacher model to grade the answer.

So I used Sonnet 4.6 to create the gold answers, and gave the teacher (Grok 4.1 fast) the gold answer as well as the model’s answer and instructions on how to grade from 0 to 1, considering formatting, naturalness, etc.
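The grading step is just prompt construction plus score parsing. A sketch of the idea; the prompt wording and the SCORE: convention are simplified stand-ins for what the eval harness actually sends the teacher:

```python
# Simplified LLM-as-judge sketch; wording and score format are invented.
def build_grading_prompt(task, gold, candidate):
    return (
        "You are grading a document-editing model.\n"
        f"Task: {task}\n"
        f"Gold answer:\n{gold}\n"
        f"Model answer:\n{candidate}\n"
        "Considering formatting and naturalness, reply with only a score "
        "between 0 and 1, e.g. SCORE: 0.85"
    )

def parse_score(reply: str) -> float:
    """Extract the first numeric token as the 0-1 score, clamped."""
    for token in reply.replace("SCORE:", " ").split():
        try:
            return min(1.0, max(0.0, float(token)))
        except ValueError:
            continue
    return 0.0
```

Clamping matters in practice: graders occasionally emit scores like 1.2 or chatty preambles, and the parser has to survive both.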

Originally I calculated Value = Correctness / Cost, but eventually decided on a quadratic intelligence-per-dollar score (Value = Correctness² / Cost), because accuracy matters more than being cheap but wrong.
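The metric is one line, shown here with the top-ranked model’s numbers plugged in (small rounding differences against the published table are expected, since the costs are rounded):

```python
def value_score(correctness: float, cost_dollars: float) -> float:
    """Quadratic intelligence-per-dollar: Correctness**2 / Cost.
    Squaring correctness penalizes cheap-but-wrong models."""
    return correctness ** 2 / cost_dollars

# gpt-oss-120b: 0.92 average correctness at $0.0032 per run.
top = value_score(0.92, 0.0032)  # roughly 264
```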

| Rank | Model | Value (C²/$) | Avg Correctness | Tokens/Run | Cost ($) |
|---|---|---|---|---|---|
| 1 | openai/gpt-oss-120b | 263.8 | 0.920 | 50,198 | 0.0032 |
| 2 | google/gemini-3-flash-preview | 141.0 | 0.940 | 50,179 | 0.0063 |
| 3 | openai/gpt-4o-mini | 70.5 | 0.790 | 47,540 | 0.0089 |
| 4 | nvidia/nemotron-3-nano-30b-a3b | 60.6 | 0.560 | 50,243 | 0.0052 |
| 5 | x-ai/grok-4.1-fast | 46.5 | 0.980 | 66,929 | 0.0207 |
| 6 | nex-agi/deepseek-v3.1-nex-n1 | 39.4 | 0.915 | 64,222 | 0.0213 |
| 7 | minimax/minimax-m2.1 | 39.2 | 0.983 | 62,394 | 0.0246 |
| 8 | mistralai/devstral-2512 | 27.9 | 0.910 | 57,150 | 0.0297 |
| 9 | z-ai/glm-4.7 | 26.9 | 0.953 | 63,035 | 0.0337 |
| 10 | qwen/qwen3.5-27b | 26.5 | 0.993 | 52,210 | 0.0371 |
| 11 | openai/gpt-5-nano | 26.4 | 0.825 | 99,576 | 0.0258 |
| 12 | allenai/olmo-3.1-32b-instruct | 20.8 | 0.570 | 68,317 | 0.0156 |

DSPy

One of the reasons I love Python is the amazing set of libraries. Another one I wanted to check out is DSPy (Declarative Self-improving Language Programs). Developed at Stanford, DSPy is a framework for programmatic optimization of your prompts: it tries variants to see if it can automatically get greater intelligence and value from the models.

Before DSPy, “prompt engineering” mostly consisted of typing in ALL CAPS, offering $500 tips, or threatening jail time to get the model to follow instructions. DSPy automates the voodoo, creating variants of your prompt and optimizing to find the one that gives the best results with the fewest tokens, so you don’t have to talk like a hostage negotiator just to get a clean table. Using this tool, I’ve taken some of the suggestions, rolled them into my prompts, and tested them against a bunch of models to verify they are generally helpful.
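To make the idea concrete, here is a toy illustration of what DSPy automates, not DSPy’s actual API: score several prompt variants against a small eval set and keep the winner. The stub model and grader are deliberately trivial; real optimizers also mutate the prompts rather than just picking among hand-written ones:

```python
# Toy prompt-variant search; everything here is a stand-in for DSPy.
def pick_best_prompt(variants, eval_set, run_model, grade):
    """Return the variant with the highest average grade on the eval set."""
    best, best_score = None, -1.0
    for prompt in variants:
        score = sum(grade(run_model(prompt, ex), gold)
                    for ex, gold in eval_set) / len(eval_set)
        if score > best_score:
            best, best_score = prompt, score
    return best, best_score

# Stub "model": mentioning 'table' in the prompt makes it emit a table.
def run_model(prompt, example):
    return "table" if "table" in prompt else "text"

grade = lambda out, gold: 1.0 if out == gold else 0.0
eval_set = [("messy text", "table"), ("more mess", "table")]
variants = ["Reformat this.", "Reformat this into a table."]
```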

WriterAgent now feels like a real product instead of a weekend hack. If you want to try it out, the repo is here: https://github.com/KeithCu/writeragent. Let’s make LibreOffice an AI-native office suite!

If you enjoyed this article, check out Part one for background on how I got here.

Epilogue

LLM Slop

A lot of people talk about AIs generating slop, but few talk about how you can prompt AIs to remove slop when you see it. People used to talk about “refactoring code” all the time, yet somehow don’t realize the same process is still needed in the world of AI-assisted code. You can use AIs to remove technical debt, increase test coverage, and do other code-cleanliness work if you bother to ask them.

Slop code used to appear in the world of human programmers too. When in the flow of getting a new feature working, humans would copy and paste logic that should have been put into a shared function, because they didn’t want to deal with that distraction at the time. Cleanup can happen after things are generally working and the test cases pass.

People should look at an AI as a smart person who just joined the team yesterday and therefore doesn’t know everything. AI makes programming more efficient, but you need to oversee it. Someone who complains about slop is not prompting the AI properly.

Testing

Another critical piece of being able to rapidly evolve codebases using AI is thorough test coverage. The standard make test doesn’t need to cover every edge case (codebases depended on by millions should have that), but it should try to exercise every major function in the product. When I get burned tracking down a regression, I add test coverage for that and other nearby parts of the product to prevent it from happening in the future.

You don’t have to write the tests at the same time as the feature work; working on test suites isn’t nearly as fun as seeing a new feature working, but at some point they should be added. Note: when submitting new features to other codebases, including a test suite with the new code is greatly appreciated, since the tests “prove” the correctness of the feature and decrease the ongoing maintenance burden.

I was working on some testing code recently and decided to re-enable an assert that had been commented out. Of course I didn’t really bother to check whether an assert info.structVersion == 1 would be a problem, it looked so innocent, but enabling it broke talk to your document support! It took me almost 30 minutes to track it down to that line because the error handling in that part of the code wasn’t very good yet. So I improved the error handling, and then realized that assert should stay commented out!

By default, the AIs wanted to write mock implementations of LibreOffice functionality, since you can’t depend on it when running tests outside the application. However, the whole point of the test code is that the LibreOffice API is very sophisticated, and you want to actually verify end-to-end that it all works.

Quarzadous had created a pytest harness for the code that doesn’t depend on LibreOffice, which lets you test that half of the plugin codebase. On top of that, I created a mini custom pytest runner that runs the tests inside LibreOffice and returns the results as JSON. The best way to handle the onslaught of new AI-assisted code is with more test coverage!
