Paper-Weekly12-Demystifying OpenAI CLIP data

While having a great time in Hong Kong over the National Day holiday (thanks, JS), I had dinner with Lei, and he mentioned this paper that had recently left a strong impression on him. Although I've mostly been doing NLP these past few years, keeping up with CV/multimodal progress might still spark some ideas(?)

The paper was posted to arXiv in late September and updated with a new version in early October, so it's very fresh. Below is a brief summary of the original:

CLIP is an approach that has driven advanced research and applications in computer vision, powering modern recognition systems and generative models. The authors believe the main factor behind CLIP's success is its data, not the model architecture or the pre-training objective. However, OpenAI has released only very limited information about its data and how it was collected, which has even led to work trying to reconstruct the training data from the model's parameters (a generative-model angle: very interesting, but very hard).

The authors build MetaCLIP, which takes raw data and metadata (derived from CLIP's concepts) and produces a subset that is balanced over the metadata distribution. Their experimental study strictly isolates the model and training settings and focuses on the data alone. Applied to CommonCrawl with 400M image-text pairs, MetaCLIP outperforms CLIP's data on multiple standard benchmarks. On zero-shot ImageNet classification, MetaCLIP reaches 70.8% accuracy with a ViT-B model, surpassing CLIP's 68.3%. Scaling to 1B data while keeping the same training budget reaches 72.4%.

In the method section, the authors first quote the original CLIP paper, which claims to have constructed 500k queries and collected up to 20k image-text pairs per query. The term this paper uses is "metadata", which is essentially the "query" of the original CLIP paper.

The base query list is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams with high pointwise mutual information as well as the names of all Wikipedia articles above a certain search volume. Finally all WordNet synsets not already in the query list are added.
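The four sources quoted above can be sketched as follows. This is a hedged illustration, not the paper's actual pipeline: the input corpora, the `min_count=100` word threshold (the only number the quote gives), and the `pmi_threshold` value are placeholders.

```python
from collections import Counter

def build_metadata(wiki_tokens, wiki_bigram_pmi, article_titles, wordnet_synsets,
                   min_count=100, pmi_threshold=30.0):
    """Assemble a metadata (query) list from the four sources in the CLIP quote.

    wiki_tokens: list of word tokens from English Wikipedia
    wiki_bigram_pmi: dict mapping bi-gram -> pointwise mutual information
    article_titles: Wikipedia article names already filtered by search volume
    wordnet_synsets: lemma names of WordNet synsets
    """
    entries = set()
    # 1) All words occurring at least 100 times in English Wikipedia.
    counts = Counter(wiki_tokens)
    entries.update(w for w, c in counts.items() if c >= min_count)
    # 2) Bi-grams with high pointwise mutual information (threshold assumed).
    entries.update(bg for bg, pmi in wiki_bigram_pmi.items() if pmi >= pmi_threshold)
    # 3) Names of Wikipedia articles above a search-volume cutoff (pre-filtered here).
    entries.update(article_titles)
    # 4) WordNet synsets not already in the list.
    entries.update(s for s in wordnet_synsets if s not in entries)
    return sorted(entries)
```

Deduplication falls out of using a set, matching the quote's "not already in the query list" condition.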

The next step is to match the text to metadata. We ended with a pool of 1.6B image-text pairs (5.6B counts of sub-string matches). Note that one text can have multiple matches of entries and we have 3.5 matches per text on average.

As an analysis, we count the number of matches for each entry and summarize that in Table 2. The counts exhibit a long-tailed distribution. Out of the 500k entries, 114k entries have no matches. This signifies the importance of knowing the training data distribution since it is very likely the training data does not have certain visual concepts. We observed that only 16k entries had counts higher than 20k, accounting for only 3.2% (16k/500k) of the entries, but their counts made up 94.5% (5.35B/5.6B) of the total counts of all entries.

After this matching step there is a large amount of long-tail data, and many texts even get matched to near-meaningless queries like "photo".
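The sub-string matching step itself is simple: each text keeps every metadata entry that occurs in it as a substring, which is why one text averages 3.5 matches. A brute-force sketch (a real pipeline over billions of texts would use something like an Aho-Corasick automaton instead):

```python
def match_entries(text, metadata):
    """Return the metadata entries that occur in the text as substrings.

    Illustrative only: scans every entry per text, which does not scale
    to a 500k-entry list over billions of captions.
    """
    lowered = text.lower()
    return [m for m in metadata if m in lowered]

metadata = ["photo", "golden retriever", "beach", "sunset"]
print(match_entries("A photo of a golden retriever on the beach", metadata))
# -> ['photo', 'golden retriever', 'beach']
```

Note how a generic word like "photo" matches a huge share of captions, which is exactly the imbalance the balancing step below has to fix.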

Now comes the key point: "The key secret behind OpenAI CLIP's curation is to balance the counts of matched entries."

For each metadata entry, the associated list of texts (or image-text pairs) is sub-sampled, ensuring that the resulting data distribution is more balanced. This step aims to mitigate noise and diversify the distribution of data points, making the data more task-agnostic as foundation data for pre-training.

Sure enough, it comes down to data engineering: sample the data within each entry once more so that the entries are roughly balanced. So what counts as balanced? OpenAI's approach is to pick a magic number t = 20k: if an entry has fewer than t pairs, all of them are kept; if it has more, they are sub-sampled. The authors plot the cumulative sum of counts with entries sorted from smallest to largest, and it turns out 20k sits right at the boundary between head and tail: past 20k, the growth of the cumulative sum shifts roughly from linear to exponential.

The authors also scale the experiment up. 20k is the hyperparameter found on a pool of 1.6B pairs; if the data pool is expanded to 10.7B, the corresponding t should be 170k. The scaling rule is to keep the tail pairs at roughly 6% of the total.
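One way to re-derive t for a new pool under that rule: pick the smallest threshold at which the tail entries (those whose counts do not exceed it) account for about 6% of all matched pairs. The 6% target comes from the summary above; the search procedure itself is my assumption.

```python
def scale_threshold(entry_counts, tail_fraction=0.06):
    """Find the smallest count value t such that entries with count <= t
    ("tail" entries) contribute at least tail_fraction of all matches.

    entry_counts: iterable of per-entry match counts.
    """
    counts = sorted(entry_counts)
    total = sum(counts)
    cum = 0
    for c in counts:
        cum += c
        if cum / total >= tail_fraction:
            return c
    return counts[-1]
```

Applied to the 10.7B-pair pool's count histogram, a search like this would land near the paper's t = 170k.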

Finally, across a range of experiments, the authors show that with this full data-processing pipeline the model improves substantially over CLIP on various zero-shot image classification tasks.

陈沁宇
Master Student@PKU

My research interests include natural language processing, machine learning and recommender systems.