In 2023, the popularity of generative AI has led more and more organizations to introduce AI-assisted coding. What differs slightly from GitHub Copilot, released in 2021, is that code completion is now only one of many scenarios. A large number of companies are exploring scenarios such as generating complete code from requirements and AI-assisted code review, and are introducing generative AI to improve development efficiency.
In this context, we (the Thoughtworks open source community) have also open sourced a series of AI-assisted tools to help more organizations build their own AI-assisted coding assistants:
When we designed AutoDev, various open source models were constantly evolving. Against this backdrop, it proceeds in the following steps:
Therefore, this tutorial is also organized around these three steps. In addition, based on our experience, the sample technology stack for this tutorial is:
Since our experience in AI is relatively limited, there will inevitably be some mistakes, so we hope to build this open source project together with more developers.
Combined with the artificial intelligence section of JetBrains' 2023 Developer Ecosystem report, we can summarize some common scenarios that reflect where generative AI can play a role in the development process. Here are some of the main scenarios:
When we built AutoDev, we also discovered scenarios such as creating SQL DDL, generating requirements, and TDD. We therefore provide the ability to customize scenarios so that developers can tailor their own AI capabilities. For details, see: https://ide.unitmesh.cc/customize.
In daily coding, different scenarios place different requirements on AI response speed (the following is just an illustration):
Scenario | Response speed | Quality required | Expected model size | Notes |
---|---|---|---|---|
Code completion | Fast | Medium | 1B~6B | Code completion is the most common scenario in daily coding; response speed is crucial. |
Documentation generation | Medium | Medium | 1B~ | Documentation generation requires a full understanding of the code structure; speed and quality are equally important. |
Code review | Fast | Medium | 1B~ | Code reviews require high-quality suggestions but also need to be as responsive as possible. |
Unit test generation | Fast | Medium | 6B~ | Unit tests need less context; responsiveness and AI quality are equally important. |
Code refactoring | Medium | High | 32B~ | Code refactoring may require more contextual understanding, and response times can be moderately slower. |
Requirements generation | Medium | High | 32B~ | Requirements generation is a relatively complex scenario; response speed can be moderately slower to ensure accuracy. |
Natural-language code search and explanation | Medium-low | High | 32B~ | Natural-language code search and explanation are relatively complex scenarios; response speed can be moderately slower to ensure accuracy. |
PS: The 32B here only indicates an order of magnitude; larger models generally produce better results.
Therefore, we summarize it as "one large, one medium, one micro": three models that together provide comprehensive AI-assisted coding:
AI code completion combines IDE tooling to analyze the code context and the rules of the programming language, and the AI automatically generates or suggests code snippets. Code completion tools similar to GitHub Copilot usually divide completion into three fine-grained modes:
Inline completion (Inline)

Similar to the FIM (fill in the middle) mode, the completed content lies within the current line. For example, given `BlogPost blogpost = new`, the completion is `BlogPost();`, yielding `BlogPost blogpost = new BlogPost();`.
We can use Deepseek Coder as an example to see the effect in this scenario:
```python
<|fim▁begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|fim▁hole|>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<|fim▁end|>
```
Here, we need to combine the code before and after the cursor to build the prompt.
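As a minimal sketch (the function name is our own illustration, not a real API; the special tokens follow DeepSeek Coder's FIM format shown above), assembling such a prompt looks like this:

```kotlin
// A minimal sketch: wrap the code before/after the cursor with DeepSeek Coder's
// FIM special tokens. buildFimPrompt is an illustrative name, not a real API.
fun buildFimPrompt(beforeCursor: String, afterCursor: String): String =
    "<|fim▁begin|>$beforeCursor<|fim▁hole|>$afterCursor<|fim▁end|>"
```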
In-block completion (InBlock)

Achieved through In-Context Learning; the completed content lies within the current function block. For example, the original code is:

```kotlin
fun createBlog(blogDto: CreateBlogDto): BlogPost {

}
```

The completed code is:

```kotlin
    val blogPost = BlogPost(
        title = blogDto.title,
        content = blogDto.content,
        author = blogDto.author
    )
    return blogRepository.save(blogPost)
```
After-block completion (AfterBlock)

Achieved through In-Context Learning; completion happens after the current function block, for example completing a new function after the current one. The original code is:

```kotlin
fun createBlog(blogDto: CreateBlogDto): BlogPost {
    // ...
}
```

The completed code is:

```kotlin
fun updateBlog(id: Long, blogDto: CreateBlogDto): BlogPost {
    // ...
}

fun deleteBlog(id: Long) {
    // ...
}
```
When we build the corresponding AI completion feature, we also need to consider applying it to a dataset of the corresponding pattern to improve completion quality and provide a better user experience.
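As a hedged illustration of how these three modes might be distinguished (the enum and heuristics below are our own sketch, not AutoDev's actual rules), the mode can be chosen from the cursor position:

```kotlin
// A sketch of choosing a completion mode from the cursor position; illustrative only.
enum class CompletionMode { INLINE, IN_BLOCK, AFTER_BLOCK }

fun completionMode(lineSuffix: String, isInEmptyFunctionBody: Boolean): CompletionMode = when {
    lineSuffix.isNotBlank() -> CompletionMode.INLINE  // code remains after the cursor on this line: fill in the middle
    isInEmptyFunctionBody -> CompletionMode.IN_BLOCK  // cursor inside an empty body: complete the block
    else -> CompletionMode.AFTER_BLOCK                // cursor after a block: generate the next function
}
```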
Some related resources for writing this article:
Code explanation is designed to help developers manage and understand large code bases more effectively. These assistants can answer questions about the code base, provide documentation, search code, identify sources of errors, reduce code duplication, and so on, thereby improving development efficiency, reducing error rates, and lightening developers' workload.
In this scenario, depending on the generation quality we expect, it is usually composed of two models: one large plus one micro, or one medium plus one micro. The larger the model, the better the generation quality. Combined with our design experience from the Chocolate Factory tool, such a feature can usually be divided into several steps:
As a RAG application, it is divided into two parts: indexing and query.
In the indexing stage, we need to index the code base, which involves text splitting, vectorization, database indexing, and other techniques. One of the most challenging parts is splitting; the splitting rules we refer to are those in: https://docs.sweep.dev/blogs/chunking-2m-files.
In different scenarios, we can also split in different ways. For example, in Chocolate Factory, we split via the AST to ensure the quality of the generated context.
In the query stage, we need to combine some traditional search techniques, such as vector search and path search, to ensure search quality. At the same time, in Chinese-language scenarios, we also need to consider translation, such as converting between Chinese and English, to ensure search quality.
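As a self-contained sketch of the query side (the `IndexedChunk` type and the assumption that the query has already been embedded are our own illustration, not Chocolate Factory's actual API), ranking indexed chunks against a query embedding can look like this:

```kotlin
import kotlin.math.sqrt

// Illustrative type for an indexed code chunk; not a real API.
data class IndexedChunk(val path: String, val text: String, val embedding: DoubleArray)

// Cosine similarity between two embedding vectors of equal length.
fun cosine(a: DoubleArray, b: DoubleArray): Double {
    val dot = a.indices.sumOf { a[it] * b[it] }
    val normA = sqrt(a.sumOf { it * it })
    val normB = sqrt(b.sumOf { it * it })
    return if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Rank all indexed chunks by similarity to the (already embedded) query
// and keep the top-k as context candidates.
fun search(queryEmbedding: DoubleArray, index: List<IndexedChunk>, topK: Int = 5): List<IndexedChunk> =
    index.sortedByDescending { cosine(queryEmbedding, it.embedding) }.take(topK)
```

In practice, the in-memory scan would be backed by a vector database, combined with path search as mentioned above.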
For daily assistance, we can also rely on generative AI, for example to automatically create SQL DDL, test cases, or requirements. These can be achieved simply by customizing prompts and combining them with specific domain knowledge, so I won't go into detail here.
In addition to the model, context is also an important factor affecting AI assistance capabilities. When we built AutoDev, we also discovered two different context modes:
A simple comparison is as follows:
| | Relevant context | Similar context |
|---|---|---|
| Retrieval technique | Static code analysis | Similarity search |
| Data structure information | AST, CFG | Similar chunks |
| Cross-platform capability | Depends on the IDE, or a standalone parser | Not dependent on a specific platform |
| Context quality | Extremely high | High |
| Generated results | Extremely high | High |
| Build cost | Depends on language and platform | Low |
When IDE support is limited, similar context offers better value for the cost.
GitHub Copilot adopts the similar-context architectural pattern, and its detailed architecture is layered as follows:
In the research materials of the "public" Copilot-Explorer project, you can see how the prompt is built. The following is a prompt request sent to the model:
```json
{
  "prefix": "# Path: codeviz\\app.py\n# ....",
  "suffix": "if __name__ == '__main__':\r\n    app.run(debug=True)",
  "isFimEnabled": true,
  "promptElementRanges": [
    {
      "kind": "PathMarker",
      "start": 0,
      "end": 23
    },
    {
      "kind": "SimilarFile",
      "start": 23,
      "end": 2219
    },
    {
      "kind": "BeforeCursor",
      "start": 2219,
      "end": 3142
    }
  ]
}
```
Where:

- The `prefix` part of the prompt is built from promptElements, which include types such as `BeforeCursor`, `AfterCursor`, `SimilarFile`, `ImportedFile`, `LanguageMarker`, `PathMarker`, and `RetrievalSnippet`. The names of these `PromptElementKind`s already reveal their purpose.
- The `suffix` part is taken from the code after the cursor, and how much room remains for it is calculated from the token limit (2048). The token count here is the real LLM token count; in Copilot it is computed with Cushman002. Chinese characters have different token lengths. For example, given `{ context: "console.log('你好,世界')", lineCount: 1, tokenLength: 30 }`, the context string is 20 characters long but its tokenLength is 30: the 5 Chinese characters (including `,`) each occupy 3 tokens.

Here is a more detailed example of a Java application context:
```java
// Path: src/main/cc/unitmesh/demo/infrastructure/repositories/ProductRepository.java
// Compare this snippet from src/main/cc/unitmesh/demo/domain/product/Product.java:
// ....
// Compare this snippet from src/main/cc/unitmesh/demo/application/ProductService.java:
// ...
// @Component
// public class ProductService {
//   //...
// }
//
package cc.unitmesh.demo.repositories;

// ...

@Component
public class ProductRepository {
    //...
```
When computing context, GitHub Copilot uses the Jaccard coefficient (Jaccard similarity). This part is implemented in the Agent; for more detailed logic, please refer to: After spending more than half a month, I finally reverse-engineered GitHub Copilot.
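For reference, the Jaccard coefficient of two token sets A and B is the size of their intersection divided by the size of their union, J(A, B) = |A ∩ B| / |A ∪ B|: identical sets score 1 and disjoint sets score 0. A sketch implementation appears in the Unit Eval section below.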
Related resources:
As mentioned above, relevant code relies on static code analysis, mainly with the help of the code's structural information, such as the AST, CFG, and DDG. In different scenarios and on different platforms, we can combine different static code analysis tools. The following are some common static code analysis tools:
In the completion scenario, through static code analysis, we can get the current context, such as: current function, current class, current file, etc. The following is an example of AutoDev's context for generating unit tests:
```java
// here are related classes:
// 'filePath: /Users/phodal/IdeaProjects/untitled/src/main/java/cc/unitmesh/untitled/demo/service/BlogService.java
// class BlogService {
//   blogRepository
//   + public BlogPost createBlog(BlogPost blogDto)
//   + public BlogPost getBlogById(Long id)
//   + public BlogPost updateBlog(Long id, BlogPost blogDto)
//   + public void deleteBlog(Long id)
// }
// 'filePath: /Users/phodal/IdeaProjects/untitled/src/main/java/cc/unitmesh/untitled/demo/dto/CreateBlogRequest.java
// class CreateBlogRequest ...
// 'filePath: /Users/phodal/IdeaProjects/untitled/src/main/java/cc/unitmesh/untitled/demo/entity/BlogPost.java
// class BlogPost {...
@ApiOperation(value = "Create a new blog")
@PostMapping("/")
public BlogPost createBlog(@RequestBody CreateBlogRequest request) {
```
In this example, the context of the `createBlog` function is analyzed to obtain the function's input and output classes, `CreateBlogRequest` and `BlogPost`, together with the `BlogService` class information, all of which are provided to the model as context (in the comments). The model then generates more accurate constructors as well as more accurate test cases.
Since relevant context relies on static code analysis for different languages and on the APIs of different IDEs, we also need to adapt to each language and each IDE. In terms of construction cost, it is more expensive than similar context.
IDEs and editors are developers' primary tools, and their design and learning costs are relatively high. First, we can generate a project from the official template:
Then add functionality on top. Sounds simple, doesn't it? Of course it is not. The following are some IDEA plugin resources you can refer to:
Of course, the most appropriate reference is the AutoDev plugin itself.
You can directly use the official template to generate the corresponding plug-in: https://github.com/JetBrains/intellij-platform-plugin-template
An IDEA plugin is mainly implemented through Actions and Listeners, which only need to be registered in `plugin.xml`. For details, please refer to the official documentation: IntelliJ Platform Plugin SDK.
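As a minimal sketch (the class name, action id, and behavior below are illustrative examples, not AutoDev's actual registration), an Action looks like this:

```kotlin
import com.intellij.openapi.actionSystem.AnAction
import com.intellij.openapi.actionSystem.AnActionEvent
import com.intellij.openapi.actionSystem.CommonDataKeys

// A minimal Action sketch; the names are illustrative, not AutoDev's code.
// It would be registered in plugin.xml, e.g.:
//   <action id="MyPlugin.ExplainCode" class="cc.unitmesh.example.ExplainCodeAction"
//           text="Explain Code with AI">
//       <add-to-group group-id="EditorPopupMenu" anchor="last"/>
//   </action>
class ExplainCodeAction : AnAction() {
    override fun actionPerformed(e: AnActionEvent) {
        val editor = e.getData(CommonDataKeys.EDITOR) ?: return
        val selectedText = editor.selectionModel.selectedText ?: return
        // Send selectedText to the LLM and render the explanation, e.g. in a tool window.
    }
}
```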
Since we did not consider compatibility across IDE versions in AutoDev's early days, we later needed to perform compatibility processing on the plugin to support older IDE versions. As described in the official document Build Number Ranges, different versions have different JDK requirements. The following are the requirements for different versions:
Branch number | IntelliJ Platform version |
---|---|
233 | 2023.3 |
232 | 2023.2 |
231 | 2023.1 |
223 | 2022.3 |
222 | 2022.2 NOTE Java 17 is now required (blog post) |
221 | 2022.1 |
213 | 2021.3 |
212 | 2021.2 |
211 | 2021.1 |
203 | 2020.3 NOTE Java 11 is now required (blog post) |
Then configure it in `gradle.properties`:

```properties
pluginSinceBuild = 223
pluginUntilBuild = 233.*
```
Configuring compatibility afterwards is troublesome, so you can refer to AutoDev's design.
For automatic code completion, Chinese vendors mainly follow GitHub Copilot's implementation, and the logic is not complicated.
Trigger using shortcut keys
This approach mainly listens for user input in an Action, and then:
Function | Shortcut | Description |
---|---|---|
requestCompletions | Alt + / | Obtain the current context, then fetch completion results from the model |
applyInlays | TAB | Display the completion result in the IDE |
disposeInlays | ESC | Cancel the completion |
cycleNextInlays | Alt + ] | Switch to the next completion result |
cyclePrevInlays | Alt + [ | Switch to the previous completion result |
Use automatic triggering

This approach mainly monitors user input through an `EditorFactoryListener`, and then triggers different completion logic based on the input. The core code is as follows:
```kotlin
class AutoDevEditorListener : EditorFactoryListener {
    override fun editorCreated(event: EditorFactoryEvent) {
        // ...
        editor.document.addDocumentListener(AutoDevDocumentListener(editor), editorDisposable)
        editor.caretModel.addCaretListener(AutoDevCaretListener(editor), editorDisposable)
        // ...
    }

    class AutoDevCaretListener(val editor: Editor) : CaretListener {
        override fun caretPositionChanged(event: CaretEvent) {
            // ...
            val wasTypeOver = TypeOverHandler.getPendingTypeOverAndReset(editor)
            // ...
            llmInlayManager.disposeInlays(editor, InlayDisposeContext.CaretChange)
        }
    }

    class AutoDevDocumentListener(val editor: Editor) : BulkAwareDocumentListener {
        override fun documentChangedNonBulk(event: DocumentEvent) {
            // ...
            val llmInlayManager = LLMInlayManager.getInstance()
            llmInlayManager.editorModified(editor, changeOffset)
        }
    }
}
```
Then, based on the different inputs, different completions are triggered and the results are processed.
Render the completion code

Subsequently, we need to implement an inlay renderer, which inherits from `EditorCustomElementRenderer`.
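A minimal sketch of such a renderer, assuming a single line of gray "ghost text" (the class name and drawing details are our own illustration, not AutoDev's implementation):

```kotlin
import com.intellij.openapi.editor.EditorCustomElementRenderer
import com.intellij.openapi.editor.Inlay
import com.intellij.openapi.editor.colors.EditorFontType
import com.intellij.openapi.editor.markup.TextAttributes
import com.intellij.ui.JBColor
import java.awt.Graphics
import java.awt.Rectangle

// Renders one line of gray completion text after the cursor; illustrative only.
class CompletionInlayRenderer(private val text: String) : EditorCustomElementRenderer {
    override fun calcWidthInPixels(inlay: Inlay<*>): Int {
        val editor = inlay.editor
        val font = editor.colorsScheme.getFont(EditorFontType.ITALIC)
        return editor.contentComponent.getFontMetrics(font).stringWidth(text)
    }

    override fun paint(inlay: Inlay<*>, g: Graphics, targetRegion: Rectangle, textAttributes: TextAttributes) {
        val font = inlay.editor.colorsScheme.getFont(EditorFontType.ITALIC)
        g.font = font
        g.color = JBColor.GRAY
        g.drawString(text, targetRegion.x, targetRegion.y + g.getFontMetrics(font).ascent)
    }
}
```

The renderer can then be attached with `editor.inlayModel.addInlineElement(offset, true, renderer)` and disposed when the caret moves or the user presses ESC.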
Combined with the IDE's interface capabilities, we need to add the corresponding Action, Group, and Icon. The following is an example of an Action:

```xml
<add-to-group group-id="ShowIntentionsGroup" relative-to-action="ShowIntentionActions" anchor="after"/>
```
The following are some ActionGroups of AutoDev:
Group ID | AI use | Description |
---|---|---|
ShowIntentionsGroup | Code refactoring, code explanation, code generation, code testing | Used to display hints in the code context, accessed via the Alt + Enter (⌥ + Enter on macOS) shortcut. |
ConsoleEditorPopupMenu | Fixing errors | The menu displayed in the console, such as the console showing a program's execution results. |
Vcs.MessageActionGroup | Commit-message generation | Menu for writing commit messages in VCS. |
Vcs.Log.ContextMenu | Code review, code explanation, code generation | Menu for viewing logs in VCS; available functions: AI code review, release-log generation. |
EditorPopupMenu | Almost anything | The right-click menu, where you can also add corresponding ActionGroups |
When adding to the ShowIntentionsGroup, we can refer to AutoDev's implementation to build the corresponding Group:
```xml
<group id="AutoDevIntentionsActionGroup" class="cc.unitmesh.devti.intentions.IntentionsActionGroup"
       icon="cc.unitmesh.devti.AutoDevIcons.AI_COPILOT" searchable="false">
    <add-to-group group-id="ShowIntentionsGroup" relative-to-action="ShowIntentionActions" anchor="after"/>
</group>
```
Due to IntelliJ's platform strategy, the differences between running in a Java IDE (IntelliJ IDEA) and in other IDEs, such as a Python IDE (PyCharm), become even greater. We need to provide compatibility across multiple platform products. For a detailed introduction, see: Plugin Compatibility with IntelliJ Platform Products.
First, the plugin architecture is further modularized, that is, different modules are provided for different languages. The following is AutoDev's modular architecture:
```
java/                                    # Java language plugin
  src/main/java/cc/unitmesh/autodev/     # Java language entry point
  src/main/resources/META-INF/plugin.xml
plugin/                                  # Multi-platform entry point
  src/main/resources/META-INF/plugin.xml
src/                                     # Core module
  main/resource/META-INF/core.plugin.xml
```
In `plugin/plugin.xml`, we need to add the corresponding `depends` and `extensions`. The following is an example:
```xml
<idea-plugin package="cc.unitmesh" xmlns:xi="http://www.w3.org/2001/XInclude" allow-bundled-update="true">
    <xi:include href="/META-INF/core.xml" xpointer="xpointer(/idea-plugin/*)"/>
    <content>
        <module name="cc.unitmesh.java"/>
        <!-- other modules -->
    </content>
</idea-plugin>
```
In `java/plugin.xml`, we need to add the corresponding `depends` and `extensions`. The following is an example:
```xml
<idea-plugin package="cc.unitmesh.java">
    <!-- suppress PluginXmlValidity -->
    <dependencies>
        <plugin id="com.intellij.modules.java"/>
        <plugin id="org.jetbrains.plugins.gradle"/>
    </dependencies>
</idea-plugin>
```
Subsequently, IntelliJ will automatically load the corresponding module to provide multi-language support. Depending on the languages we expect to support, we need the corresponding `plugin.xml` files, such as:
cc.unitmesh.javascript.xml
cc.unitmesh.rust.xml
cc.unitmesh.python.xml
cc.unitmesh.kotlin.xml
cc.unitmesh.java.xml
cc.unitmesh.go.xml
cc.unitmesh.cpp.xml
Finally, just implement the corresponding functions in different language modules.
To simplify this process, we use Unit Eval to show how to build the two types of context.
Through static code analysis, we can get the current function, current class, current file, etc. Combined with path similarity, find the most relevant context.
```kotlin
private fun findRelatedCode(container: CodeContainer): List<CodeDataStruct> {
    // 1. collects all similar data structure by imports if exists in a file tree
    val byImports = container.Imports
        .mapNotNull {
            context.fileTree[it.Source]?.container?.DataStructures
        }
        .flatten()

    // 2. collects by inheritance tree for some node in the same package
    val byInheritance = container.DataStructures
        .map {
            (it.Implements + it.Extend).mapNotNull { i ->
                context.fileTree[i]?.container?.DataStructures
            }.flatten()
        }
        .flatten()

    val related = (byImports + byInheritance).distinctBy { it.NodeName }

    // 3. convert all similar data structure to uml
    return related
}

class RelatedCodeStrategyBuilder(private val context: JobContext) : CodeStrategyBuilder {
    override fun build(): List<TypedIns> {
        // ...
        val findRelatedCodeDs = findRelatedCode(container)
        val relatedCodePath = findRelatedCodeDs.map { it.FilePath }
        val jaccardSimilarity = SimilarChunker.pathLevelJaccardSimilarity(relatedCodePath, currentPath)
        val relatedCode = jaccardSimilarity.mapIndexed { index, d ->
            findRelatedCodeDs[index] to d
        }.sortedByDescending {
            it.second
        }.take(3).map {
            it.first
        }
        // ...
    }
}
```
In the above code, we use the code's import information as part of the relevant code, then find related code through the code's inheritance relationships, and finally locate the closest context via path similarity.
Search first, then find related code through code similarity. The core logic is shown below:
```kotlin
fun pathLevelJaccardSimilarity(chunks: List<String>, text: String): List<Double> {
    // ...
}

fun tokenize(chunk: String): List<String> {
    return chunk.split(Regex("[^a-zA-Z0-9]")).filter { it.isNotBlank() }
}

fun similarityScore(set1: Set<String>, set2: Set<String>): Double {
    // ...
}
```
For details, see: SimilarChunker
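Since the bodies above are elided, here is a hedged sketch of what a Jaccard-based implementation can look like (our own illustration, not SimilarChunker's actual code), reusing `tokenize` from above:

```kotlin
// Jaccard similarity: intersection size over union size of the two token sets.
fun similarityScore(set1: Set<String>, set2: Set<String>): Double {
    val intersection = set1.intersect(set2).size.toDouble()
    val union = set1.union(set2).size.toDouble()
    return if (union == 0.0) 0.0 else intersection / union
}

// Score each candidate path against the current path by tokenizing both.
fun pathLevelJaccardSimilarity(chunks: List<String>, text: String): List<Double> {
    val textTokens = tokenize(text).toSet()
    return chunks.map { similarityScore(tokenize(it).toSet(), textTokens) }
}
```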
TODO
TreeSitter is a framework for generating efficient custom parsers, developed by GitHub. It uses an LR(1) parser, which means it can parse any language in O(n) time instead of O(n²) time. It also uses a technique called "syntax tree reuse" that allows it to update syntax trees without reparsing the entire file.
Since TreeSitter already provides multi-language support, you can use Node.js, Rust and other languages to build corresponding plug-ins. See: TreeSitter for details.
Depending on our intentions, there are different ways to use TreeSitter:
Parse symbols

In Bloop, a natural-language code search engine, TreeSitter is used to parse symbols to achieve better search quality:
```scheme
;; methods
(method_declaration
    name: (identifier) @hoist.definition.method)
```
Then, decide how to display it based on the type:
```rust
pub static JAVA: TSLanguageConfig = TSLanguageConfig {
    language_ids: &["Java"],
    file_extensions: &["java"],
    grammar: tree_sitter_java::language,
    scope_query: MemoizedQuery::new(include_str!("./scopes.scm")),
    hoverable_query: MemoizedQuery::new(
        r#"
        [(identifier)
         (type_identifier)] @hoverable
        "#,
    ),
    namespaces: &[&[
        // variables
        "local",
        // functions
        "method",
        // namespacing, modules
        "package",
        "module",
        // types
        "class",
        "enum",
        "enumConstant",
        "record",
        "interface",
        "typedef",
        // misc.
        "label",
    ]],
};
```
Chunk code
The following is how TreeSitter is used in Improving LlamaIndex's Code Chunker by Cleaning Tree-Sitter CSTs:
```python
from tree_sitter import Node, Tree

def chunker(
    tree: Tree,
    source_code: bytes,
    MAX_CHARS=512 * 3,
    coalesce=50  # Any chunk less than 50 characters long gets coalesced with the next chunk
) -> list[Span]:
    # 1. Recursively form chunks based on the last post (https://docs.sweep.dev/blogs/chunking-2m-files)
    def chunk_node(node: Node) -> list[Span]:
        chunks: list[Span] = []
        current_chunk: Span = Span(node.start_byte, node.start_byte)
        node_children = node.children
        for child in node_children:
            if child.end_byte - child.start_byte > MAX_CHARS:
                chunks.append(current_chunk)
                current_chunk = Span(child.end_byte, child.end_byte)
                chunks.extend(chunk_node(child))
            elif child.end_byte - child.start_byte + len(current_chunk) > MAX_CHARS:
                chunks.append(current_chunk)
                current_chunk = Span(child.start_byte, child.end_byte)
            else:
                current_chunk += Span(child.start_byte, child.end_byte)
        chunks.append(current_chunk)
        return chunks

    chunks = chunk_node(tree.root_node)

    # 2. Filling in the gaps
    for prev, curr in zip(chunks[:-1], chunks[1:]):
        prev.end = curr.start
    curr.start = tree.root_node.end_byte

    # 3. Combining small chunks with bigger ones
    new_chunks = []
    current_chunk = Span(0, 0)
    for chunk in chunks:
        current_chunk += chunk
        if non_whitespace_len(current_chunk.extract(source_code)) > coalesce
```