Learning Transferable Visual Models From Natural Language Supervision

CLIP (Contrastive Language-Image Pre-training) is trained to predict the most relevant text snippet for a given image.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

In other words, given a set of text labels and an image, the model returns a relevance score for each image-text pair.

This is a classic paper: published in 2021, its results are still very useful today.

Author: OpenAI
Code: https://github.com/openai/CLIP
Venue: ICML 2021 (International Conference on Machine Learning)

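Abstract

Current computer vision (CV) models are typically trained to predict a fixed set of object categories. This restricted form of supervision limits their generality and usability, because additional labeled data is needed to handle any visual "concept" that was not seen during training.

Learning directly from text that describes images is a promising alternative. With a simple pre-training task, predicting which text in a set goes with the current image, training on 400 million image-text pairs yields a strong pre-trained model. For downstream tasks, natural language is used to refer to visual concepts, so the model can be applied without additional training.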

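Two routes for representation learning:

Build an image-text pair dataset
Gather more data for weakly supervised pre-training, without requiring high quality or full labels

In the end, the authors chose to crawl image-text pairs from the web, 400 million in total.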

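Method

This is the fairly long-established dual-encoder approach, in which text and images are encoded independently of each other. Each encoder can be chosen from several architectures, as long as both produce deep features of the same dimension; the final output is then just a similarity matrix between the two sets of features.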

image_features = self.encode_image(image)
text_features = self.encode_text(text)

# normalized features
image_features = image_features / image_features.norm(dim=1, keepdim=True)
text_features = text_features / text_features.norm(dim=1, keepdim=True)

# cosine similarity as logits
logit_scale = self.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()
logits_per_text = logits_per_image.t()

# shape = [global_batch_size, global_batch_size]
return logits_per_image, logits_per_text

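The loss is a plain cross-entropy (CE) loss, applied symmetrically to the rows (image to text) and columns (text to image) of the similarity matrix, with the matching pairs on the diagonal as targets.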
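The symmetric loss can be sketched as follows. This is a minimal illustration that follows the pseudocode in the paper; the function name `clip_contrastive_loss`, the toy feature sizes, and the fixed scale of 100.0 are assumptions made for the example, and the official repository does not ship training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(logits_per_image: torch.Tensor) -> torch.Tensor:
    """Symmetric cross-entropy over an [N, N] image-text similarity matrix (hypothetical helper)."""
    n = logits_per_image.shape[0]
    # The i-th image matches the i-th text, so the targets lie on the diagonal.
    targets = torch.arange(n, device=logits_per_image.device)
    loss_image = F.cross_entropy(logits_per_image, targets)      # image -> text
    loss_text = F.cross_entropy(logits_per_image.t(), targets)   # text -> image
    return (loss_image + loss_text) / 2

torch.manual_seed(0)
# Toy unit-norm features: 4 pairs, 8 dimensions (assumed sizes, not CLIP's).
image_features = F.normalize(torch.randn(4, 8), dim=-1)
text_features = F.normalize(torch.randn(4, 8), dim=-1)
logits = 100.0 * image_features @ text_features.t()  # 100.0 stands in for logit_scale.exp()
loss = clip_contrastive_loss(logits)
print(loss.item())
```

Averaging the two directions means each image has to pick out its own text and each text its own image, which is what pushes matching pairs onto the diagonal of the similarity matrix.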

Text processing

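Step	Operation	Example input	Example output	Notes
1. Clean	Collapse extra whitespace	`"a   photo "`	`"a photo"`	Basic preprocessing
2. Tokenize	BPE encoding	`"a photo"`	`["a", "photo"]`	Split into subwords
3. Format	Add special tokens / pad	`["a", "photo"]`	`[49406, 320, 1125, 49407, 0...]`	Fixed length 77; shorter inputs are zero-padded; longer inputs raise an error in `clip.tokenize` unless `truncate=True` is passed, in which case they are cut to 77
4. Encode	Transformer	`[IDs]`	`[77, 512]` vector sequence	12-layer Transformer
5. Pool	EOT pooling	`[77, 512]`	`[512]` single vector	Take the end-of-text position
6. Output	Projection + normalization	`[512]`	`[512]` final feature	Used to compute similarity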

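In short, this pipeline brings every text to a uniform length and maps it into the same feature space as the image features.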
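The formatting step (step 3 in the table) can be sketched in plain Python. This is only an illustration, not the library's tokenizer: `format_tokens` is a hypothetical helper, the IDs are the ones shown in the table, and the `truncate` flag is modelled on the argument of the same name in `clip.tokenize`.

```python
from typing import List

SOT_ID = 49406        # start-of-text ID, as shown in the table
EOT_ID = 49407        # end-of-text ID, as shown in the table
CONTEXT_LENGTH = 77   # fixed sequence length

def format_tokens(bpe_ids: List[int], truncate: bool = False) -> List[int]:
    """Wrap BPE IDs in SOT/EOT and zero-pad to CONTEXT_LENGTH (hypothetical helper)."""
    ids = [SOT_ID] + list(bpe_ids) + [EOT_ID]
    if len(ids) > CONTEXT_LENGTH:
        if not truncate:
            raise ValueError(f"input is too long for context length {CONTEXT_LENGTH}")
        # Cut to the context length and keep EOT as the final token.
        ids = ids[:CONTEXT_LENGTH]
        ids[-1] = EOT_ID
    return ids + [0] * (CONTEXT_LENGTH - len(ids))

row = format_tokens([320, 1125])  # IDs for "a photo", taken from the table
print(row[:6], len(row))
```

Keeping EOT as the last token when truncating matters because the text encoder later pools the feature at the EOT position.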

def encode_text(self, text):
    x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]

    x = x + self.positional_embedding.type(self.dtype)
    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD
    x = self.ln_final(x).type(self.dtype)

    # x.shape = [batch_size, n_ctx, transformer.width]
    # take features from the eot embedding (eot_token is the highest number in each sequence)
    x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection

    return x

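# token_embedding is the embedding layer: it maps each token ID to a vector of size transformer_width.
# The fixed length of 77 comes from tokenization and padding, not from this step.
self.token_embedding = nn.Embedding(vocab_size, transformer_width)
# A learned position encoding lets the model simply "memorize" a feature vector for each position;
# it is heavier than a fixed scheme, but very effective when data is plentiful.
positional_embedding # the position encoding; sinusoidal encodings or RoPE could do the same job, but here it is a learned parameter, initialized to small random values and added to the token embeddings to carry position information
transformer # a standard Transformer module; note that the tensor layout must be adjusted before calling it
# NLD -> LND: nn.MultiheadAttention expects (sequence length, batch, dim) by default,
# so the batch dimension is moved to second place and swapped back afterwards
self.ln_final = LayerNorm(transformer_width) # a layer normalization
# The last step turns [batch_size, n_ctx, transformer.width] into
# [batch_size, embed_dim]
x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection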

The indexing in this last line means: "for the i-th sentence in the batch, take the vector at its EOT position."

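Input: [batch_size, 77, 512]
Output: [batch_size, 512]
Interpretation: the EOT token has the largest ID in each sequence (49407), so argmax over the token IDs returns its position, and the EOT vector of every sentence can be picked out. All intermediate token vectors are discarded; only the EOT vector at the end of the sentence is kept. Because of the Transformer's causal mask, this EOT vector has "seen" every preceding token, so it carries the semantics of the whole sentence.
Matrix multiplication with text_projection:
[batch_size, 512] @ [512, 512] = [batch_size, 512].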
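The EOT pooling and projection can be checked on a toy tensor. The sketch below uses small made-up sizes instead of 77 and 512, and random tensors as stand-ins for the Transformer output and the learned `text_projection`; only the indexing pattern is taken from `encode_text`.

```python
import torch

torch.manual_seed(0)
batch_size, n_ctx, width, embed_dim = 2, 6, 4, 3   # toy sizes, not CLIP's 77 / 512

# Two zero-padded token sequences; 49407 (EOT) is the largest ID in each row.
text = torch.tensor([
    [49406, 320, 1125, 49407, 0, 0],
    [49406, 320, 49407, 0, 0, 0],
])
x = torch.randn(batch_size, n_ctx, width)          # stand-in for the Transformer output
text_projection = torch.randn(width, embed_dim)    # stand-in for the learned projection

eot_positions = text.argmax(dim=-1)                    # position of EOT in each row
pooled = x[torch.arange(batch_size), eot_positions]    # [batch_size, width]
features = pooled @ text_projection                    # [batch_size, embed_dim]
print(eot_positions.tolist(), tuple(features.shape))
```

Pairing `torch.arange(batch_size)` with `eot_positions` selects exactly one vector per sentence, which is why the sequence dimension disappears from the result.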

Image processing

The image side can be understood as simply feeding the image into a ResNet or a ViT.
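For the ViT option, the first step is to cut the image into patches and embed each one as a token. The sketch below shows only that step; the patch size of 32, the width of 768, and the 224x224 input are the values usually quoted for ViT-B/32 and should be treated as assumptions here. The class token, position embeddings, Transformer layers, and final projection are left out.

```python
import torch
import torch.nn as nn

patch_size, width = 32, 768   # assumed ViT-B/32 values, for illustration only
# A strided convolution embeds each non-overlapping 32x32 patch as one vector.
patch_embed = nn.Conv2d(3, width, kernel_size=patch_size, stride=patch_size, bias=False)

image = torch.randn(1, 3, 224, 224)               # stand-in for one preprocessed RGB image
with torch.no_grad():
    grid = patch_embed(image)                     # [1, 768, 7, 7]
    tokens = grid.flatten(2).permute(0, 2, 1)     # [1, 49, 768], one token per patch
print(tuple(tokens.shape))
```

From here the image tokens are handled much like the text tokens: a Transformer processes the sequence, and a single pooled vector is projected into the shared feature space.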