MEGVII Technology has released Vary-toy, a compact vision-vocabulary large language model that runs on standard GPUs. By optimizing how its visual vocabulary is created, the model significantly improves image perception and achieves strong results on multiple benchmarks, including DocVQA, ChartQA, and RefCOCO. Its small size makes it a practical baseline for researchers with limited compute, and the team plans to release the code publicly to encourage further research and adoption.
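Since the code has not yet been published, the following is only a minimal sketch of how a checkpoint like this might be loaded once available, assuming it follows the common Hugging Face transformers loading convention; the model id `megvii/vary-toy`, the prompt, and the generation settings are placeholders, not the official interface.

```python
# Hypothetical loading sketch; the model id is a placeholder, not an official release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "megvii/vary-toy"  # assumption: replace with the official id once released

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision helps fit a standard consumer GPU
    trust_remote_code=True,
).to("cuda")

prompt = "Describe the chart in the image."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```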
The release of Vary-toy demonstrates MEGVII Technology's strength in computer vision and provides a valuable resource for both academia and industry. The forthcoming code release should further accelerate progress and adoption in this field.