0. Introduction#
The project has been open-sourced on GitHub; what follows are some basic notes on its technical design.
1. Structural Design: Plugin-based Architecture#
The project is built on a plugin-based architecture, leveraging Ncatbot's plugin model to keep the main program cleanly isolated from individual features.
The advantages of a plugin architecture are clear module boundaries, strong dependency isolation, and good scalability. Standardized pluginization decouples the code to a large degree, which makes both development and deployment much easier.
The project system directory structure is as follows:
```text
ModuleChat/
├── main.py            # Main entry for the plugin, responsible for command registration and scheduling logic
├── chat.py            # Model adaptation layer, encapsulating calls to local and cloud models
├── config.yml         # Configuration file, centrally controlling model parameters and enabled options
├── requirements.txt   # Dependency libraries
└── cache/
    └── history.json   # Chat history memory file
```
chat.py is the core module of the system, while the main program main.py receives and parses commands and routes messages to the model module for processing. This split is comfortable to work with: during development you can focus on building and debugging one functional module at a time, which significantly reduces maintenance complexity.
Configuration items are centralized in config.yml, further enhancing flexibility and environmental adaptability.
Recording prompts and replies in a temporary JSON file and passing them back to the API gives the large model a degree of short-term memory. Admittedly this is not the most elegant solution for the system; a database would be cleaner, but it would also add noticeable complexity, so a JSON file is a reasonable trade-off.
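For illustration, here is a minimal sketch of how such a centralized configuration might be loaded and merged with defaults. The key names mirror the options referenced in the code excerpts later in this article (use_local_model, enable_vision, model, vision_model, model_temperature, system_prompt); the max_history_rounds key, the default values, and the placeholder model names are assumptions rather than the project's actual config.yml.

```python
# Hypothetical config loader; key names follow the options used elsewhere in this
# article, while default values and max_history_rounds are illustrative assumptions.
import yaml

DEFAULTS = {
    "use_local_model": False,           # switch between the Ollama local model and the cloud API
    "enable_vision": True,              # allow image recognition via the cloud vision model
    "model": "your-chat-model",         # placeholder model name
    "vision_model": "your-vision-model",
    "model_temperature": 0.6,
    "system_prompt": "You are a chat companion robot",
    "max_history_rounds": 10,           # assumed name for the history window setting
}

def load_config(path: str = "config.yml") -> dict:
    """Read config.yml and fall back to defaults for any missing keys."""
    with open(path, "r", encoding="utf-8") as f:
        loaded = yaml.safe_load(f) or {}
    return {**DEFAULTS, **loaded}
```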
2. Main Plugin Program: Command Decoupling and Routing Hub#
main.py is the main entry for the plugin, registering two commands through the register_user_func method: /chat and /clear chat_history, corresponding to chat functionality and clearing historical records, respectively.
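As a rough sketch of how that registration might look: only the register_user_func method name and the two commands come from the project; the plugin base class, the on_load hook, the ChatModel adapter, and the exact argument signature are assumptions for illustration.

```python
# Illustrative only: BasePlugin, on_load, ChatModel, and the register_user_func
# signature are assumptions; only the method name and the two commands
# /chat and /clear chat_history come from the article.
class ModuleChat(BasePlugin):                # stand-in for Ncatbot's plugin base class
    async def on_load(self):
        self.chat_model_instance = ChatModel(self.config)   # hypothetical adapter from chat.py
        # /chat -> normal chat handler
        self.register_user_func("chat", self.handle_chat, prefix="/chat")
        # /clear chat_history -> history cleanup handler
        self.register_user_func("clear_history", self.handle_clear, prefix="/clear chat_history")
```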
Additionally, the main program supports automatic recognition of image messages, extracting the image URL and passing it to the chat_model_instance.recognize_image method to automatically obtain visual descriptions.
```python
if image_url and self.chat_model.get('enable_vision', True) and not self.chat_model.get('use_local_model'):
    # Use image recognition feature
    image_description = await chat_model_instance.recognize_image(image_url)
    user_input = f"The user sent an image, the description is: {image_description}. The user said: {user_input}"
elif image_url and not self.chat_model.get('enable_vision', True):
    # Image recognition feature is not enabled, but check if it's a local model
    if self.chat_model.get('use_local_model'):
        user_input = f"The user sent an image, but the user is using a local model, unable to perform image recognition. The user said: {user_input}"
    else:
        user_input = f"The user sent an image, but the image recognition feature is not enabled. The user said: {user_input}"
```
Once the visual description is obtained, it is handed to the language model to generate the reply. This works well in practice: in most usage scenarios what matters is analyzing the content after the image has been recognized, not processing the image itself. It greatly reduces API calls, improves cache hit rates, and cuts token usage, lowering API costs; it also allows the cloud model to do the recognition and then hand the conversation to a local model for the response, compressing costs even further.
For error handling, a try...except block wraps the entire chat logic, so failures in image decoding or API exceptions cannot crash the main process, keeping the plugin robust.
Overall, main.py is a typical "light controller" pattern, coordinating various components without taking on business logic details, providing good engineering readability for the entire plugin.
3. Model Adaptation Module: Multi-Model Encapsulation and Semantic Consistency#
chat.py is the core logic of the plugin. It is responsible for model calls, chat-history memory, image recognition, and related tasks. To stay compatible with multiple model interfaces (such as the OpenAI API and a local Ollama service), it adopts a unified wrapper strategy: external callers can focus on the conversation and simply use the useCloudModel() or useLocalModel() methods without worrying about model details.
```python
async def useLocalModel(self, msg: BaseMessage, user_input: str):
    """Use local model to process messages"""
    try:
        # Build message list, including historical records
        messages = self._build_messages(user_input, msg.user_id if hasattr(msg, 'user_id') else None)
        response: ChatResponse = chat(
            model=self.config['model'],
            messages=messages
        )
        reply = response.message.content.strip()
        # Save current conversation to history
        if hasattr(msg, 'user_id'):
            self._update_user_history(msg.user_id, {"role": "user", "content": user_input})
            self._update_user_history(msg.user_id, {"role": "assistant", "content": reply})
    except Exception as e:
        reply = f"An error occurred: {str(e)}"
    return reply
```
It is worth noting that the OpenAI interface and Ollama differ slightly in how they are called, and some model parameters are not fully compatible between them, so calling the cloud model through the OpenAI interface generally gives a better experience. For example, we can control the model's temperature to make it either more imaginative or more focused and precise, reducing hallucinations.
All user histories are stored in the cache/history.json file, a persistent store that also provides a degree of traceability. The history is updated dynamically through the _update_user_history method and capped at the maximum number of rounds set in the configuration file. This prevents the performance problems of an overly large context while still letting the model follow the ongoing conversation, improving response quality and giving an otherwise stateless API connection near-memory capabilities.
The class also integrates OpenAI's image recognition model, constructing the multimodal message structure through _build_vision_messages. Image processing, message construction, exception handling, and model calls are deliberately split into separate functions, which makes issues easier to locate during development and the code easier for other developers to read now that it is open source.
4. Cloud Model Integration (OpenAI): Standardized Encapsulation#
The cloud model calls are mainly encapsulated through the official openai library, utilizing the chat.completions.create method to complete context construction and response generation. Each call constructs a complete dialogue context through the _build_messages() method, adding system prompts and using the historical records saved in cache/history.json to achieve multi-turn memory-style dialogue.
```python
def _build_messages(self, user_input: str, user_id: str = None):
    """Build message list"""
    messages = []
    # Add system prompt
    system_prompt = self.config.get('system_prompt', "You are a chat companion robot")
    messages.append({"role": "system", "content": system_prompt})
    if user_id:
        history = self._get_user_history(user_id)
        messages.extend(history)
    # Add current user input
    messages.append({"role": "user", "content": user_input})
    return messages
```
The calling logic encapsulates the temperature parameter, supporting flexible control of the model's output randomness through the configuration file.
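Putting the pieces together, the cloud path presumably looks something like the following condensed sketch, mirroring the useLocalModel structure shown earlier. The chat.completions.create call, _build_messages, the temperature setting, and the history update come from the description above; the self.client attribute name (assumed to be an openai.OpenAI instance built from the config) and the error message are assumptions.

```python
# Condensed sketch of the cloud path (not the project's verbatim code);
# self.client is assumed to be an openai.OpenAI instance created from the config.
async def useCloudModel(self, msg: BaseMessage, user_input: str):
    """Use the cloud model (OpenAI-compatible API) to process messages"""
    try:
        messages = self._build_messages(user_input, msg.user_id if hasattr(msg, 'user_id') else None)
        response = self.client.chat.completions.create(
            model=self.config['model'],
            messages=messages,
            temperature=self.config.get('model_temperature', 0.6),
        )
        reply = response.choices[0].message.content.strip()
        # Save current conversation to history, as in the local path
        if hasattr(msg, 'user_id'):
            self._update_user_history(msg.user_id, {"role": "user", "content": user_input})
            self._update_user_history(msg.user_id, {"role": "assistant", "content": reply})
    except Exception as e:
        reply = f"An error occurred: {str(e)}"
    return reply
```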
When something goes wrong at runtime, errors are reported back to the user through a unified return path. This removes a lot of scattered error handling during development and clearly surfaces the common problems caused by configuration mistakes: a single set of rules maps runtime failures to user-facing messages.
if "401" in str(fallback_error) or "Unauthorized" in str(fallback_error):
raise Exception("Model API authentication failed, please check the configuration file")
raise Exception(f"Image recognition error: {str(e)}, fallback method also failed: {str(fallback_error)}")
After a result is returned, the current round of Q&A is synchronized into the user's history cache and saved to the local file, so the context can be retrieved normally in the next round. This reduces reliance on in-memory state, improves cache hit rates, and provides a basis for later debugging and behavior reproduction.
5. Local Model Call (Ollama): Lightweight Inference and Unified Interface#
Local model calls are completed through ollama.chat(), reusing the context construction logic of _build_messages() to ensure consistency with cloud calls and maintain interface uniformity.
The advantages of this local inference mechanism are obvious: the plugin can still offer intelligent dialogue in environments without network access or in private deployments, which greatly improves deployment flexibility and security. Even privacy-sensitive scenarios can be served with a fully local deployment.
The design maintains consistency between local and cloud call interfaces (both encapsulated as use*Model()), so external callers do not need to determine the source of the model, thereby reducing complexity. In addition, it also implements history record updates and exception capture mechanisms, allowing the local model to have the same functional completeness and stability as the cloud.
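In other words, the caller only needs a thin switch on the configuration, something like the minimal sketch below. The chat_once name is made up for illustration; the use_local_model flag is the same one checked in the main-program excerpt earlier.

```python
# Illustrative dispatcher: the method name chat_once is hypothetical,
# the use_local_model flag mirrors the check done in main.py.
async def chat_once(self, msg: BaseMessage, user_input: str) -> str:
    if self.config.get('use_local_model', False):
        return await self.useLocalModel(msg, user_input)
    return await self.useCloudModel(msg, user_input)
```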
6. Image Recognition Processing Logic: Semantic Enhancement Strategy for Multimodal Input#
The image recognition feature is a major highlight of this plugin. The plugin supports recognizing image messages and processing them through the OpenAI visual model. The entire process is as follows:
1. Extract the image URL from the image message;

```python
for segment in msg.message:
    if isinstance(segment, dict) and segment.get("type") == "image":
        image_url = segment.get("data", {}).get("url")
        break
```

2. Retrieve the image content via an HTTP request and Base64-encode it;

```python
response = requests.get(image_url)
response.raise_for_status()
return base64.b64encode(response.content).decode('utf-8')
```

3. Construct the visual input format, including the image_url and a text prompt (a sketch of _build_vision_messages follows this list);

4. Call the visual model to produce the image description;

```python
# Get and encode the image
image_data = self._encode_image_from_url(image_url)
# Construct messages
messages = self._build_vision_messages(image_data, prompt)
# Call the visual model
response = self.vision_client.chat.completions.create(
    model=self.config.get('vision_model'),
    messages=messages,
    temperature=self.config.get('model_temperature', 0.6),
    stream=False,
    max_tokens=2048
)
```

5. Append the image description to the user input to enhance the semantic integrity of the context.
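The article does not show _build_vision_messages itself; one common shape for such a method, using the standard OpenAI multimodal content format and assuming the Base64 image data produced in step 2, would be:

```python
# Assumed implementation sketch of _build_vision_messages using the standard
# OpenAI multimodal content format; the data-URL MIME type is a guess.
def _build_vision_messages(self, image_data: str, prompt: str):
    """Build a multimodal message list from Base64 image data and a text prompt"""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
            ],
        }
    ]
```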
This mechanism addresses the information-asymmetry problem in mixed text-and-image conversations, and it also enables tiered routing of calls: the cloud's high computing power handles the complex part (recognition), while the simpler follow-up work is handed to local processing, greatly reducing token usage.
In terms of error handling, we have designed a two-level fallback strategy: if the main call fails, it attempts a pure text fallback prompt; if it still fails, it prompts the user to check the API key or model status. This fault-tolerant design allows the plugin to maintain service continuity even during partial failures.
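Combined with the error-mapping snippet from section 4, the fallback structure presumably looks roughly like the sketch below. The recognize_image name, vision_client, and the final 401/Unauthorized check come from the article; the default prompt, the fallback prompt text, and the variable names are assumptions.

```python
# Rough sketch of the two-level fallback inside recognize_image; not verbatim project code.
async def recognize_image(self, image_url: str, prompt: str = "Describe this image"):
    try:
        # Primary path: full multimodal call to the vision model
        image_data = self._encode_image_from_url(image_url)
        messages = self._build_vision_messages(image_data, prompt)
        response = self.vision_client.chat.completions.create(
            model=self.config.get('vision_model'),
            messages=messages,
            max_tokens=2048,
        )
        return response.choices[0].message.content
    except Exception as e:
        try:
            # Level 1 fallback: a pure text prompt without the image payload
            fallback = self.vision_client.chat.completions.create(
                model=self.config.get('vision_model'),
                messages=[{"role": "user", "content": "The image could not be processed; reply with a short notice."}],
            )
            return fallback.choices[0].message.content
        except Exception as fallback_error:
            # Level 2 fallback: surface a clear hint about the likely configuration problem
            if "401" in str(fallback_error) or "Unauthorized" in str(fallback_error):
                raise Exception("Model API authentication failed, please check the configuration file")
            raise Exception(f"Image recognition error: {str(e)}, fallback method also failed: {str(fallback_error)}")
```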
7. Chat History System: Memory Window Control#
Chat history is stored in the cache/history.json file, managed by user dimensions. This design allows the system to serve multiple users simultaneously while maintaining independent contexts for each user. Through the _get_user_history and _update_user_history methods, the plugin can automatically inject historical information into each round of dialogue, achieving a quasi-"memory" question-and-answer experience.
A window limit on the history length (10 rounds by default) keeps the context size under control, avoiding excessive pressure on the model and over-consumption of tokens. Cache updates are synchronous writes, so information is much less likely to be lost in a crash, power outage, or other anomaly.
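A minimal sketch of what the window maintenance in _update_user_history could look like is given below; the method name comes from the plugin, while the max_history_rounds key and the exact trimming rule are assumptions. The clear_user_history method shown after it handles the complementary deletion path.

```python
# Sketch only: the trimming rule and the config key name are assumptions.
def _update_user_history(self, user_id, message: dict):
    """Append one message to a user's history and trim it to the configured window."""
    user_id = str(user_id)
    history = self.history.setdefault(user_id, [])
    history.append(message)
    # One round = one user message + one assistant reply, so keep 2 * N entries
    max_rounds = self.config.get('max_history_rounds', 10)
    if len(history) > max_rounds * 2:
        del history[:len(history) - max_rounds * 2]
    # Synchronous write so the latest turns survive an unexpected shutdown
    self._save_history()
```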
```python
async def clear_user_history(self, user_id: str):
    """Clear the history of the specified user"""
    user_id = str(user_id)
    if user_id in self.history:
        del self.history[user_id]
        self._save_history()
        reply = "Chat history has been cleared"
    else:
        reply = "No chat history found for the user"
    return reply
```
Additionally, the user history can be cleared on demand with the /clear chat_history command, which is convenient for privacy or for starting a fresh conversation. This gives the plugin persistence while leaving control in the user's hands.
8. DEBUG & LOG#
Setting breakpoints and using print statements are good debugging habits. I also picked up a trick from WeChat development: print("FUCK"). Crashes occasionally happen during long-term operation, and printing a distinctive marker string to the log means you can locate the problem quickly just by searching the logs for that string. "FUCK" is undoubtedly a memorable choice for it.
This article is synchronized and updated to xLog by Mix Space. The original link is https://fmcf.cc/posts/technology/ncatbot-multimodal-plugin