In a growing technology sector where companies like OpenAI and Google dominate, NExT-GPT, an open source multimodal AI language model, is positioning itself as a serious contender. A collaboration between the National University of Singapore and Tsinghua University, NExT-GPT is capable of processing and generating combinations of text, images, audio and video.
The team behind NExT-GPT describes it as an "any-to-any" system, able to accept input in any combination of modalities and respond in whichever form is appropriate. Because the model is open source, community contributions could improve it considerably, much as Stable Diffusion has advanced since its initial release.
How it works: conversion and adaptation
According to the research paper describing the model, separate modules encode inputs such as images and audio into text-like representations that the core language model can process. The language model then generates response text along with special signal tokens that trigger generation in other modalities.
Using specific tokens for each input and output modality allows flexible conversion between all combinations of modalities. Different decoders are used to generate outputs for each modality: Stable Diffusion for images, AudioLDM for audio and Zeroscope for video.
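The routing idea described above can be illustrated with a short Python sketch. It is not the NExT-GPT implementation: the tag syntax (`<IMG>`, `<AUD>`, `<VID>`), the `route_response` function and the stub decoders are all hypothetical names made up for illustration, and the real system passes learned signal-token representations to its decoders rather than plain text prompts. The sketch only shows how a language model's output could be split into text and per-modality generation requests.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical signal-token syntax, chosen for this sketch only:
# <IMG>prompt</IMG>, <AUD>prompt</AUD>, <VID>prompt</VID>
SIGNAL_PATTERN = re.compile(r"<(IMG|AUD|VID)>(.*?)</\1>", re.DOTALL)


@dataclass
class Output:
    modality: str  # "text", "IMG", "AUD" or "VID"
    payload: str   # the text itself, or a placeholder for generated media


def stub_decoder(name: str) -> Callable[[str], str]:
    """Return a stand-in for a real decoder; it only echoes the request."""
    def decode(prompt: str) -> str:
        return f"[{name} output for: {prompt}]"
    return decode


# One decoder per output modality, mirroring the pairing named in the article.
DECODERS: Dict[str, Callable[[str], str]] = {
    "IMG": stub_decoder("Stable Diffusion"),
    "AUD": stub_decoder("AudioLDM"),
    "VID": stub_decoder("Zeroscope"),
}


def route_response(llm_text: str,
                   decoders: Dict[str, Callable[[str], str]] = DECODERS
                   ) -> List[Output]:
    """Split model output into plain text and decoder calls, in order."""
    outputs: List[Output] = []
    cursor = 0
    for match in SIGNAL_PATTERN.finditer(llm_text):
        # Any text before the signal token is passed through unchanged.
        plain = llm_text[cursor:match.start()].strip()
        if plain:
            outputs.append(Output("text", plain))
        tag, prompt = match.group(1), match.group(2).strip()
        # The tag selects which modality-specific decoder handles the request.
        outputs.append(Output(tag, decoders[tag](prompt)))
        cursor = match.end()
    tail = llm_text[cursor:].strip()
    if tail:
        outputs.append(Output("text", tail))
    return outputs


if __name__ == "__main__":
    demo = "Here is a cat. <IMG>a cat on a sofa</IMG> And its purr: <AUD>cat purring</AUD>"
    for item in route_response(demo):
        print(item.modality, "->", item.payload)
```

Keeping the decoders in a lookup table keyed by token type means a new output modality could be added by registering one more entry, which is the flexibility the token scheme is meant to provide.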
NExT-GPT versus other multimodal products
As major companies like Google and OpenAI launch their own multimodal AI products, NExT-GPT offers an open source alternative that allows creators to build on top of it. By making NExT-GPT freely available, researchers are providing a platform for the community to advance AI capabilities even further.
Advantages of the NExT-GPT model:
- Open source: offers the possibility of customizing and adapting the model according to the specific needs of each project
- Multimodal: supports a variety of formats such as text, images, audio and video
- Flexible: facilitates conversion between different combinations of modalities thanks to a token system
- Efficient: relies on a separate, specialized decoder for each output type instead of a single all-purpose generator
- Innovative: contributes to advancing AI research by providing an open platform to the scientific community
Potential and future prospects of NExT-GPT
By combining multimodal capabilities with an open source release, NExT-GPT could have a significant impact on the artificial intelligence market. The possible applications are broad, ranging from automatic media processing and interactive content generation to improved user interfaces.
The researchers hope that this initiative will encourage increased collaboration between universities and companies in the exploration of open source multimodal AI.
NExT-GPT represents a notable step forward in multimodal AI. Its open source nature should allow researchers and developers around the world to push the limits of artificial intelligence further and contribute to the improvement and democratization of these technologies.
Going beyond ChatGPT, NExT-GPT handles text, images, audio and video in a single model, and it promises to enrich innovative applications in the field.