ETRI Unveils Ultra-Fast Generative Visual Intelligence Model: Creates Images in Just 2 Seconds

ETRI researchers have unveiled a technology that combines generative AI and visual intelligence to create images from text inputs in just 2 seconds, advancing the field of ultra-fast generative visual intelligence.

The Electronics and Telecommunications Research Institute (ETRI) announced the public release of five models: three ‘KOALA’ models, which generate images from text inputs five times faster than existing methods, and two ‘Ko-LLaVA’ conversational visual-language models, which can answer questions about images or videos.

The ‘KOALA’ model uses the knowledge distillation technique** to reduce the parameter count from the 2.56B (2.56 billion) of the public SW model* to 700M (700 million). A high parameter count typically means more computation, leading to longer processing times and higher operational costs. The researchers shrank the model to less than a third of its original size, making high-resolution image generation twice as fast as before and five times faster than DALL-E 3.
* Public SW model: Stable Diffusion XL (SDXL), from which the lightweight KOALA models were distilled.
** Knowledge distillation technique: a model compression method that transfers information from a larger model to a smaller one.
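In its classic output-matching form, knowledge distillation trains the small “student” model to reproduce the temperature-softened outputs of the large “teacher.” The sketch below shows that general principle only; it is not ETRI’s specific training recipe (KOALA distills a diffusion model, where the matching happens on internal features rather than class logits), and all names here are illustrative:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature T > 1 softens the distribution, exposing the teacher's
    # "dark knowledge" about relative similarities between outputs.
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    # KL divergence between softened teacher and student distributions;
    # minimizing this transfers the teacher's behavior to the student.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * np.log(p / q))) * T * T

# A student that already matches the teacher incurs (near-)zero loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # ~0.0
```

The `T * T` factor is the standard gradient-scale correction from the original distillation formulation, so the loss magnitude stays comparable across temperatures.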

Amid intense domestic and international competition in text-to-image generation, ETRI has reduced the models to three sizes (1.7B Large, 1B Base, 700M Small) and cut generation time to around 2 seconds, enabling operation on low-cost GPUs with only 8GB of memory.
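A rough back-of-the-envelope check (my own arithmetic, not a figure from ETRI) illustrates why models of this size fit on an 8GB GPU: stored in half precision (2 bytes per parameter), even the largest variant’s weights occupy only a few gigabytes:

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    # Half-precision (fp16) weights take 2 bytes per parameter;
    # activations and framework overhead add to this at inference time.
    return num_params * bytes_per_param / 1024**3

# Weight-only memory for the three released sizes:
for name, n in [("Small", 700e6), ("Base", 1.0e9), ("Large", 1.7e9)]:
    print(f"KOALA {name}: ~{weight_memory_gb(n):.1f} GB of fp16 weights")
```

Even the 1.7B Large model needs only about 3.2 GB for its weights in fp16, leaving headroom for activations on an 8GB card.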

ETRI’s three ‘KOALA’ models, developed in-house, have been released in the Hugging Face* environment.
* Hugging Face: An ecosystem facilitating the easy sharing, deployment, usage, and training of machine learning technologies, particularly deep learning models.

In practice, when the research team input the sentence “a picture of an astronaut reading a book under the moon on Mars,” the ETRI-developed KOALA 700M model created the image in just 1.6 seconds, significantly faster than Kakao Brain’s Karlo (3.8 seconds), OpenAI’s DALL-E 2 (12.3 seconds), and DALL-E 3 (13.7 seconds).

ETRI also launched a website where users can directly compare and experience a total of 9 models: the two publicly available Stable Diffusion models, BK-SDM, Karlo, DALL-E 2, DALL-E 3, and the three KOALA models.
* Model comparison experience site: [Hugging Face KOALA](https://huggingface.co/spaces/etri-vilab/KOALA)
** Model provision site: [Hugging Face ETRI-VILAB](https://huggingface.co/etri-vilab)

Furthermore, the research team unveiled the conversational visual-language model ‘Ko-LLaVA,’ which adds visual intelligence to conversational AI like ChatGPT. The model takes images or videos as input and answers questions about them in Korean.

Ko-LLaVA builds on the open-source ‘LLaVA’ (Large Language and Vision Assistant) model, which offers image interpretation capabilities at the level of GPT-4 and was developed through an international joint research project between ETRI and the University of Wisconsin-Madison, presented at the prestigious AI conference NeurIPS’23.

The researchers are extending the LLaVA model, which is emerging as an alternative among image-capable multimodal models, to improve its Korean language understanding and to introduce unprecedented video interpretation capabilities.

Additionally, ETRI pre-released its own Korean-based compact language understanding-generation model (KEByT5). The released models (330M (Small), 580M (Base), 1.23B (Large)) apply token-free technology capable of handling neologisms and untrained words. Training speed was enhanced by more than 2.7 times, and inference speed by more than 1.4 times.
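The token-free idea can be illustrated in general terms (a generic byte-level sketch, not KEByT5’s actual tokenizer): a model that operates directly on UTF-8 bytes has a fixed vocabulary of 256 values, so any neologism or untrained word decomposes into known bytes rather than an out-of-vocabulary token:

```python
def byte_ids(text):
    # Byte-level models use a fixed vocabulary of 256 byte values,
    # so no input string can ever be out-of-vocabulary.
    return list(text.encode("utf-8"))

print(byte_ids("cat"))       # [99, 97, 116]
print(len(byte_ids("새말")))  # Korean neologism: 6 UTF-8 bytes, no <unk>
```

The trade-off is longer input sequences (each Hangul syllable is 3 bytes in UTF-8), which is why byte-level models like KEByT5 emphasize training and inference speed improvements.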

The research team anticipates a gradual shift in the generative AI market from text-centric generative models to multimodal generative models, with an emerging trend towards smaller, more efficient models in the competitive landscape of model sizes.

ETRI is releasing these models publicly to foster an ecosystem in the related market: by shrinking models that would traditionally require thousands of servers to run, it makes the technology practical for small and medium-sized enterprises.

In the future, the research team expects high demand for Korean cross-modal models* that integrate visual intelligence technology into prominent open language models of generative AI.
* Capabilities include generating images/videos/3D from text, and generating summaries from images/videos.

The team highlighted that the core patent of this technology is based on knowledge distillation, which enables small models to perform the role of large models by transferring the knowledge accumulated in large AI models to smaller ones.

After making this technology public, ETRI plans to transfer it to businesses in image generation services, creative education services, and content production.

Lee Yong-Ju, director of ETRI’s Visual Intelligence Research Section, stated, “Through various endeavors in generative AI technology, we plan to release a range of models that are small in size but excel in performance. Our global research aims to break the dependency on existing large models and provide domestic small and medium-sized enterprises with the opportunity to effectively utilize AI technology.”

Professor Lee Yong-Jae from the University of Wisconsin-Madison, who oversees the LLaVA project, said, “In leading the LLaVA project, we conducted research on open-source visual-language models that compete with GPT-4, to make the technology accessible to more people. We plan to continue our research on multimodal generative models through international joint research with ETRI.”

The research team aims to showcase world-class research capabilities, moving beyond the conventional types of generative AI that convert text inputs into textual responses. They plan to extend their research to types that respond with sentences to images or videos, and types that respond with images or videos to sentences.

###

This achievement by ETRI is part of the Korean large language model technology development project, “Development of Pre-training Techniques for Efficient Large-Scale Korean Language Models,” supported by the Ministry of Science and ICT and the Institute for Information & Communications Technology Planning & Evaluation (IITP).

ETRI is a non-profit government-funded research institute. Since its foundation in 1976, ETRI, a global ICT research institute, has been dedicated to driving Korea’s remarkable growth in the ICT industry, helping establish Korea as one of the world’s top ICT nations by continually developing world-first and world-best technologies.
