ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

ECCV 2024
1 University of Science and Technology of China
2 Shanghai AI Laboratory

* Equal contribution. Corresponding authors.
§ Work done during an internship in Shanghai AI Laboratory.
🔥What's New
  • [2024.07.01] ShareGPT4V is accepted by ECCV 2024!
  • [2024.05.08] ShareGPT4Video is released! It contains 40K detailed video captions (291 hours of video) constructed using GPT4V and 4.8M high-quality video captions (3,000 hours of video) constructed using ShareCaptioner-Video.
  • [2023.12.13] The code and checkpoint of Share-Captioner are available!
  • [2023.11.23] The Share-Captioner demo is available!
  • [2023.11.22] The code and checkpoint of ShareGPT4V-7B are available!
  • [2023.11.22] The ShareGPT4V-7B demo is available!
  • [2023.11.21] ShareGPT4V Dataset is available!
  • [2023.11.21] The paper and project page are released!

  • 🚀 A large-scale highly descriptive image-text dataset.
  • 🚀 100K GPT4-Vision-generated captions, 1.2M high-quality captions.
  • 🚀 A general image captioner, approaching GPT4-Vision's caption capability.
  • 🚀 A superior large multi-modal model, ShareGPT4V-7B.

Abstract

In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from 100K curated high-quality captions collected from GPT4-Vision and has been expanded to 1.2 million captions with a superb caption model trained on this subset. ShareGPT4V first demonstrates its effectiveness in the Supervised Fine-Tuning (SFT) phase: substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions significantly improves LMMs such as LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple architecture that achieves remarkable performance across a majority of the multi-modal benchmarks. We will release this project to serve as a pivotal resource for advancing the LMM community.

ShareGPT4V Dataset (100K)

Dataset Name    Image Source      Visible   Captioned by      Samples   Avg.
COCO-Caption    COCO              ✔︎         Human             118K      52
BLIP-LCS        LCS               ✔︎         BLIP              558K      54
LLaVA-23K       COCO              ✗         GPT4              23K       609
ShareGPT4V      LCS, COCO, etc.   ✔︎         GPT4-Vision       100K      942
ShareGPT4V-PT   LCS, COCO, etc.   ✔︎         Share-Captioner   1,246K    826

Comparison of widely used caption datasets and ShareGPT4V. 'LCS' abbreviates the LAION, CC, and SBU datasets. The 'Visible' column denotes whether the image was visible to the captioner during captioning, and the 'Avg.' column shows the average number of characters per caption.
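As a rough illustration of how a figure like the 'Avg.' column could be computed, the sketch below averages caption lengths in characters. The function name and the toy captions are assumptions made for this example, and Python's `len` counts Unicode code points, which may differ from the counting method used for the table.

```python
from statistics import mean
from typing import Sequence


def average_caption_length(captions: Sequence[str]) -> float:
    """Return the mean number of characters per caption.

    Hypothetical helper: the page does not specify how characters
    were counted, so plain len() on each string is an assumption.
    """
    if not captions:
        raise ValueError("captions must be non-empty")
    return mean(len(caption) for caption in captions)


# Toy captions invented for illustration; they come from no dataset.
toy_captions = [
    "A dog on a beach.",
    "Two people riding bikes.",
]
print(average_caption_length(toy_captions))
```

Applied to each dataset's caption list, the same function would give one number per row, which makes the gap between short human or BLIP captions and long GPT4-Vision captions easy to compare.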

We illustrate the procedure for collecting highly descriptive captions from GPT4-Vision via various image sources and data-specific prompts, resulting in 100K high-quality captions that encapsulate a wide array of information conveyed by the images.

A comparison between the captions in our proposed ShareGPT4V dataset and those utilized by recent large multi-modal models (LMMs).

ShareGPT4V-PT Dataset (1.2M)

We delineate the process of utilizing the seed captions to train a general captioner (Share-Captioner) and then employing this captioner to generate 1.2M high-quality captions for pre-training usage.
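The two-stage process described above can be sketched in outline. This is only a structural sketch under stated assumptions: `train_captioner` and `caption_images` are hypothetical names, and the "training" step is a stub that memorises seed captions so the example stays runnable. A real pipeline would fine-tune a vision-language model at that step.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A captioner maps an image path to a caption string.
Captioner = Callable[[str], str]


def train_captioner(seed_pairs: List[Tuple[str, str]]) -> Captioner:
    """Stand-in for training a captioner on (image, caption) seed pairs.

    Hypothetical stub: it memorises the seed captions and returns a
    placeholder string for unseen images instead of running a model.
    """
    lookup = dict(seed_pairs)

    def captioner(image_path: str) -> str:
        return lookup.get(image_path, f"[generated caption for {image_path}]")

    return captioner


def caption_images(captioner: Captioner, image_paths: Iterable[str]) -> Dict[str, str]:
    """Apply the trained captioner to a larger pool of images."""
    return {path: captioner(path) for path in image_paths}


# Stage 1: seed captions (toy stand-ins for the GPT4-Vision captions).
seed = [("img_001.jpg", "A detailed seed caption.")]
share_captioner = train_captioner(seed)

# Stage 2: caption a larger image pool for pre-training use.
pretraining_captions = caption_images(share_captioner, ["img_001.jpg", "img_002.jpg"])
print(pretraining_captions)
```

Separating the two stages behind a plain callable keeps the expansion step independent of how the captioner was trained.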

A qualitative comparison of caption quality from various sources.

📊 Performance

We compare the performance of various large multi-modal models before and after replacing a corresponding portion of their SFT captions with those generated by GPT4-Vision.
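The replacement step can be illustrated with a minimal sketch that swaps the detailed-caption samples of an SFT mix for an equal number of new captions. The record layout (`image`, `task`, `text`), the `detailed_caption` task label, and the function name are assumptions made for this example, not the format of any released dataset.

```python
import random
from typing import Dict, List

Sample = Dict[str, str]


def replace_detailed_captions(
    sft_samples: List[Sample], new_captions: List[Sample], seed: int = 0
) -> List[Sample]:
    """Swap detailed-caption samples for an equal number of new captions.

    sft_samples: dicts with hypothetical keys "image", "task", "text".
    new_captions: dicts with hypothetical keys "image", "text".
    Samples with any other task label are kept unchanged, so the total
    dataset size stays the same.
    """
    kept = [s for s in sft_samples if s["task"] != "detailed_caption"]
    n_replaced = len(sft_samples) - len(kept)
    if n_replaced > len(new_captions):
        raise ValueError("not enough replacement captions")

    # Draw a reproducible random subset of the new captions.
    rng = random.Random(seed)
    chosen = rng.sample(new_captions, n_replaced)
    replacements = [
        {"image": c["image"], "task": "detailed_caption", "text": c["text"]}
        for c in chosen
    ]
    return kept + replacements
```

Keeping the sample count fixed means any change in benchmark scores is attributable to caption quality and not to a larger training set.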


The remarkable performance of the proposed LMM, ShareGPT4V-7B, developed with the assistance of the ShareGPT4V dataset.

Captioning and Conversation Examples

BibTeX

If you find our work helpful for your research, please consider citing it 📃


@article{chen2023sharegpt4v,
  title={ShareGPT4V: Improving Large Multi-Modal Models with Better Captions},
  author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and He, Conghui and Wang, Jiaqi and Zhao, Feng and Lin, Dahua},
  journal={arXiv preprint arXiv:2311.12793},
  year={2023}
}

Acknowledgement

This website is adapted from Nerfies and LLaVA, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and open-source projects, including Alpaca and Vicuna.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.