A Chinese test set for large language models, made by Muggles
https://docs.qq.com/sheet/DTEFsdkNERVVtR3BX
Since the release of ChatGPT, we have often found ourselves exclaiming while using it: "Ah, it can actually answer this!" At the same time, we are pleased to see more and more large-model teams and products springing up.
As early-stage investors, we often need to try out and evaluate newly released conversational AI products. The most common approach is to run a set of prompts and compare the output side by side with that of ChatGPT, the reference product. Along the way, we gradually recorded a number of problems that large language models cannot yet handle well, as well as many interesting prompts.
So, what prompts do we use for testing? OpenAI has demonstrated 48 basic capabilities of ChatGPT on its official website. The NLP field also has widely used test sets such as SuperGLUE, MMLU, and Google's BIG-bench. In addition, because new capabilities emerge in large models as parameter counts and data scale increase, the number of test sets targeting these new capabilities is also growing.
However, in practice, we found that the current NLP test sets have the following problems:
Therefore, several of us VC Muggles, as heavy users of conversational AI, drew on our own needs to compile and launch "Z-Bench", a test set that lets non-technical people qualitatively test large-model conversational products (ChatGPT-like products).
"Z-Bench v1.0" provides a total of 300 prompts from three perspectives: basic capabilities, advanced capabilities, and vertical capabilities. Our starting point is to cover as many types of NLP tasks as possible. Our goal is not to provide an academically rigorous and complete test set. Instead, we combine existing academic test sets, interesting cases we collect day to day, and the emergent and "epiphany" capabilities that the academic community has identified since large models appeared, in order to provide a large-model capability test set suitable for non-technical users. We will inevitably have missed some scenarios, and some of the content will look amateurish from a professional perspective. We will continue to supplement and improve the test set based on the feedback we collect, and publish updates promptly.
© 2023 ZhenFund