Exploring AI Coding Assistants - Our Experiments with Claude Code

We tested Anthropic's Claude Code for adding programming language support to a legacy tool. Here's what we learned about its capabilities and limitations.

The field of artificial intelligence assistance in software development is rapidly advancing. Recently, Anthropic released Claude Code, a tool specifically designed to help developers with their coding tasks. This assistant is an example of a "supervised coding agent," meaning it can handle complex steps within the development workflow, sometimes operating with a degree of autonomy.

Many well-known supervised coding agents, such as Cursor and Cline, integrate directly into development environments (IDEs), and even popular tools like GitHub Copilot are adding agent-like capabilities within the IDE. Claude Code stands out because it operates from the terminal, as do open-source agents such as Aider and Goose. Working via the command line can make it easier to integrate these tools into broader systems, rather than being restricted to a single IDE.

Given the pace of evolution in this area, any comparison of tools should be viewed as a temporary snapshot. These tools are constantly being updated. However, there are distinct differences worth noting right now:

  • LLM Connection Method: Claude Code, similar to Cline and Aider, requires users to provide an API key to connect with a large language model (LLM). This typically involves accessing models from the Claude-Sonnet series, currently favored for coding tasks. In contrast, Cursor, Windsurf, and GitHub Copilot are generally offered as subscription services.
  • Context Integration: Claude Code, along with Cline and Goose, supports integration with the Model Context Protocol (MCP). This is an open standard facilitating the connection of LLMs to various data sources.

The overall effectiveness and utility of Claude Code at this point largely depend on its skill in managing code context, formulating prompts for the underlying model, and integrating with various data providers.

Here’s a brief look at how some of these tools compare:

| Tool        | Interface | LLM Connection                       |
| :---------- | :-------- | :----------------------------------- |
| Claude Code | Terminal  | API key / Pay-as-you-use             |
| Aider       | Terminal  | API key / Pay-as-you-use             |
| Goose       | Terminal  | API key / Pay-as-you-use             |
| Cursor      | IDE       | Subscription / API key (with limits) |
| Cline       | IDE       | API key / Pay-as-you-use             |
| Windsurf    | IDE       | Subscription (with limits)           |

This comparison highlights the variety in how these agent tools are accessed and integrated into developer workflows.

Why We Explored Claude Code

We have a strong interest in AI-assisted coding and agentic AI capabilities. The launch of Claude Code naturally piqued our curiosity.

While these agents have a vast array of potential applications, our specific goal was to investigate whether Claude Code could help with a particular challenge we face in developing CodeConcise, our generative AI tool for understanding complex legacy codebases.

A key aspect of enhancing CodeConcise is adding compatibility for new programming languages. Historically, we've only added support for languages as specific needs arose. However, we theorized that proactively supporting a wider range of languages upfront could significantly boost CodeConcise's overall effectiveness. This isn't just about technical function; reducing the effort needed to add language support means we can allocate developer time to other valuable tasks.

Integrating support for a new language isn't a simple process. It involves understanding the language's structure and how it maps to our tool's architecture. We thought Claude Code might be able to accelerate this. Since our experiments coincided with the release of this specific tool, we decided to try it out right away.

Technical Foundation of Our Code Comprehension Tool

To properly understand our experiments with Claude Code, it helps to know a bit about how our CodeConcise tool operates. CodeConcise relies heavily on abstract syntax trees (ASTs) to parse codebases and extract their structural components. This structure then allows the tool to navigate the code effectively when an LLM is employed to generate summaries and explanations.

More specifically, CodeConcise requires unique, language-specific implementations of a generic component known within our code as an IngestionTool. The purpose of this component is to generate lists of nodes and edges from the processed code, which are then used to populate a knowledge graph. Our current implementations achieve this by analyzing ASTs. Following this, we perform language-agnostic traversals of this graph structure. An LLM is then used to enrich the graph with information extracted and summarized from the code itself. This comprehensive knowledge graph is then utilized by a frontend application, employing an agent and GraphRAG techniques to offer users valuable insights derived from the code.
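To make the IngestionTool idea more concrete, here is a minimal sketch in Python (the language CodeConcise is written in) of what a language-specific implementation could look like. The names used here (Node, Edge, PythonIngestionTool, ingest) are illustrative assumptions rather than the actual CodeConcise interfaces; the sketch only shows the general shape of walking an AST and emitting nodes and edges for a knowledge graph.

```python
# Sketch of a language-specific ingestion step, using hypothetical Node/Edge
# containers; the real CodeConcise interfaces may differ.
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: str
    kind: str  # e.g. "module", "function"


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str  # e.g. "contains", "calls"


class PythonIngestionTool(ast.NodeVisitor):
    """Walks a Python AST and emits nodes and edges for a knowledge graph."""

    def __init__(self, module_name: str):
        self.module_name = module_name
        self.nodes = [Node(module_name, "module")]
        self.edges = []

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        func_id = f"{self.module_name}.{node.name}"
        self.nodes.append(Node(func_id, "function"))
        self.edges.append(Edge(self.module_name, func_id, "contains"))
        # Record direct call dependencies found in the function body.
        for child in ast.walk(node):
            if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                self.edges.append(Edge(func_id, child.func.id, "calls"))
        self.generic_visit(node)


def ingest(module_name: str, source: str) -> tuple[list[Node], list[Edge]]:
    """Parse one module and return the graph elements extracted from it."""
    tool = PythonIngestionTool(module_name)
    tool.visit(ast.parse(source))
    return tool.nodes, tool.edges
```

The real implementations capture far more detail, but the division of labour is the same: the language-specific part produces graph elements, and everything downstream of the graph stays language-agnostic.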

While LLMs are incredibly powerful for code analysis, supporting a new programming language within our tool presents a challenge: whenever we need to support a new language and extract specific details from its AST, we must write new code. This involves:

  • Finding relevant code examples.
  • Building a comprehensive suite of automated tests.
  • Modifying the existing codebase to incorporate the new language support.

This process is time-consuming, typically taking a pair of developers between two and four weeks. This is the primary reason we've historically only added language support when it became absolutely necessary.

What We Found Worked Well with Claude Code

Success: Identifying Necessary Code Modifications

Our first test involved asking Claude Code for guidance on the code changes required to integrate Python support into CodeConcise. A developer who has spent several weeks in the codebase might find this straightforward, but such guidance is immensely valuable for someone new to the project.

By accessing the codebase and the documentation alongside it, Claude Code produced impressive results. It precisely identified all the modifications necessary to support Python. Furthermore, the code samples it suggested at the end showed that the agent had not only examined the ingestion tools we had previously developed but had also recognized the patterns used to implement them and applied those patterns to the new language we were asking about.

Limited Success: Implementing Code Modifications

For our second test, we instructed Claude Code to proceed with implementing the suggested changes itself:

"I need to build a new tool for loading python code into CodeConcise. Please do this and test it."

The agent worked autonomously for slightly over three minutes. It made all the necessary changes locally, including creating tests, and the tests it wrote all passed. However, when we used CodeConcise to load its own source code into a knowledge graph and ran our own end-to-end validation, we uncovered two issues:

  1. The structure of the filesystem was not included in the generated graph.
  2. The connections (edges) between nodes did not conform to the expected model within CodeConcise. For instance, crucial call dependencies were missing, meaning subsequent processing steps, such as the comprehension pipeline, would not be able to navigate the graph as intended.

The Crucial Role of Feedback Loops in AI-Assisted Coding

This experiment provided a clear reminder of the critical importance of having multiple feedback mechanisms when using AI to assist with coding. If we hadn't had robust tests in place to verify that the integration actually functioned correctly from end to end, we would have discovered these issues much later. This delay could have been disruptive and costly, as both the developer and the AI agent would have lost context on the work in progress.
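As an illustration of the kind of end-to-end check that caught these problems, the sketch below asserts that a freshly ingested graph contains filesystem nodes and call edges. The graph API used here (load_codebase, nodes_of_kind, edges_of_kind) is hypothetical; the point is simply that a cheap, automated assertion on the graph's shape surfaces this class of issue immediately instead of weeks later.

```python
# Hypothetical end-to-end check on the generated knowledge graph; the
# load_codebase / nodes_of_kind / edges_of_kind API is illustrative only.
def test_ingested_graph_has_expected_shape():
    graph = load_codebase("path/to/codeconcise")  # ingest the tool's own source

    # 1. The filesystem structure must be represented in the graph.
    directories = graph.nodes_of_kind("directory")
    files = graph.nodes_of_kind("file")
    assert directories and files, "filesystem structure missing from graph"

    # 2. Call dependencies must exist, or later stages (such as the
    #    comprehension pipeline) cannot traverse the graph as intended.
    call_edges = graph.edges_of_kind("calls")
    assert call_edges, "no call edges found between code nodes"
```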

After providing this feedback to Claude Code, we observed the code being updated within seconds. An initial look at the revised code showed how effectively the agent could follow established patterns within the codebase, such as using Observer classes to construct the filesystem representation as the code was being parsed.
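The observer idea the agent picked up on looks roughly like the sketch below. The class names (StageObserver, FilesystemObserver) are assumptions based on the prompts quoted in this article, not the exact CodeConcise classes; the point is that the ingestion pass notifies observers as it visits each file, and one observer accumulates the filesystem structure for the graph.

```python
# Sketch of the observer idea: the ingestion pass notifies observers as it
# visits each file, and one observer builds the filesystem part of the graph.
# Class names are illustrative, not the actual CodeConcise implementations.
from pathlib import Path


class StageObserver:
    """Base observer interface for ingestion events."""

    def on_file(self, path: Path) -> None:
        pass


class FilesystemObserver(StageObserver):
    """Records directory/file nodes and containment edges while parsing."""

    def __init__(self):
        self.nodes: set[tuple[str, str]] = set()  # (id, kind)
        self.edges: set[tuple[str, str]] = set()  # (parent, child)

    def on_file(self, path: Path) -> None:
        self.nodes.add((str(path), "file"))
        for parent in path.parents:
            self.nodes.add((str(parent), "directory"))
        # Link every path element to its containing directory.
        current = path
        for parent in path.parents:
            self.edges.add((str(parent), str(current)))
            current = parent


def ingest_directory(root: Path, observers: list[StageObserver]) -> None:
    for path in root.rglob("*.py"):
        for observer in observers:
            observer.on_file(path)  # parsing of the file itself omitted here
```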

It's worth noting that for this particular task, much of the complex architectural thinking had already been completed by the developers who initially designed the tool. They had made the deliberate choice to separate core domain logic from the more repetitive implementation details needed to support parsing new languages. Claude Code's task was essentially to synthesize this information and deduce from the existing design what specific language-dependent components needed to be built.

Key Learnings from Successful Experiments

Whereas building support for a new language typically takes a pair of developers and a subject matter expert (SME) two to four weeks, an SME working with Claude Code produced the initial code for this specific task in just a few minutes, and validation took only a handful of hours. This suggests significant potential for acceleration under the right conditions.

What Didn't Work as Expected

Complete Failure: Attempting JavaScript Support

Given the positive initial results with Python, we attempted the same approach for JavaScript:

"I want you to add an ingestion tool for javascript. I already have put the lexerbase and parserbase in there for you to use. Please use the stageobserver like in other places and use the visitor pattern for the lexer and parser like it is done in the tsql loader."

On its first attempt, the agent tried implementing this using the ANTLR grammar for JavaScript. We ran into difficulties getting the grammar itself to work (a problem outside the scope of both the AI tool and CodeConcise), so we could not confirm whether the generated code was functional.

For the second try, we prompted Claude Code again, asking it to use tree-sitter instead. Our luck seemed to be running out: the agent started using libraries that simply did not exist, so once again we could not verify the code it generated.
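For reference, working tree-sitter bindings for Python do exist. Assuming the tree_sitter and tree_sitter_javascript packages (and noting that the bindings API has changed across versions), genuine usage looks roughly like the sketch below, rather than the invented imports the agent produced:

```python
# Rough sketch of genuine tree-sitter usage from Python, assuming the
# tree_sitter and tree_sitter_javascript packages (API varies by version).
import tree_sitter_javascript
from tree_sitter import Language, Parser

JS_LANGUAGE = Language(tree_sitter_javascript.language())
parser = Parser(JS_LANGUAGE)

source = b"function greet(name) { return name; }"
tree = parser.parse(source)


def walk(node):
    # Walk the concrete syntax tree and report function declarations.
    if node.type == "function_declaration":
        name_node = node.child_by_field_name("name")
        print("function:", name_node.text.decode())
    for child in node.children:
        walk(child)


walk(tree.root_node)
```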

The third and final attempt in this series saw the agent propose a different strategy: parsing the code using regular expression pattern matching.

So, did this approach succeed? No, it did not. The resulting code included references to internal packages that were not part of our project and didn't even exist.

Evidently, the agent lacks a strong internal mechanism for verifying the code it generates. It appears to miss the basic feedback that a simple unit test could provide. This isn't a new observation; we've seen this behavior frequently with other coding assistants as well.

We wondered if our initial success with Python was just fortunate, so we asked the agent to perform the same task for the C programming language. The results were quite similar to the Python experiment, though the agent chose to use regular expression matching instead of a more reliable, AST-based method.
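To make the reliability gap concrete, the sketch below (our own illustration, not the agent's actual output) shows the kind of regex a pattern-matching approach relies on to find C function definitions, along with a case it silently gets wrong. An AST-based parser does not have this class of problem.

```python
# Illustrative only: a regex-based "parser" for C function definitions, of the
# kind a pattern-matching approach produces. It handles simple cases but,
# unlike an AST, has no real understanding of the language.
import re

C_FUNCTION = re.compile(r"(?m)^[A-Za-z_][\w\s\*]*?\b(\w+)\s*\([^;)]*\)\s*\{")

source = """
static int add(int a, int b) {
    return a + b;
}

int apply(int (*fn)(int), int x) {   /* function-pointer parameter */
    return fn(x);
}
"""

print([m.group(1) for m in C_FUNCTION.finditer(source)])
# -> ['add']  --  'apply' is missed because of the parenthesised parameter,
# and lines inside block comments or macro bodies can produce false positives.
```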

Insights Gained About AI Coding Agents

The experiments described focused on a very specific application within our team. Therefore, drawing broad, unwarranted conclusions about AI coding agents in general wouldn't be appropriate. However, these tests yielded some valuable insights worth sharing:

  • Inconsistent Outputs: Results can vary significantly. Claude Code generated impressive code when asked to implement the ingestion tools for Python and C but failed completely when tasked with the same for JavaScript.
  • Factors Influencing Performance: The effectiveness of coding assistants seems to depend on multiple elements:
    • Code Quality: What we consider good quality code for human developers – modularity, clear design, separation of concerns, and adequate documentation – appears to also benefit AI agents. When working with well-structured code, the agent has a higher probability of producing quality outputs.
    • Library Ecosystem: Was adding the Python parser easier because Python, the language our tool is built on, already provides a standard library for converting code into ASTs? Our hypothesis is that while agents need capable tools to execute their decisions, the quality of their output is also influenced by their ability to leverage well-designed, widely adopted libraries.
    • Training Data: LLMs and, by extension, AI agents typically perform better on programming languages for which extensive data was available during their original training.
    • Underlying LLM: Since agents are built upon large language models, their performance is directly tied to the capabilities of the models they use.
    • Agent Design: This includes factors like the agent's prompt engineering strategies, internal workflow design, and the specific tools it has access to.
    • Human Collaboration: The best outcomes often emerge when developers and agents work together. Developers with experience using agents can often employ strategies to guide the process and achieve better results.

Claude Code shows considerable promise, and it's encouraging to see another supervised coding agent become available, particularly one that functions within the terminal environment. Like the rest of the landscape, Claude Code will undoubtedly continue to evolve. We look forward to exploring how we can further integrate tools like these into our daily development work.
