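Vision Transformer - Pytorch
Install
Usage
Parameters
Simple ViT
NaViT
Distillation
Deep ViT
CaiT
Token-to-Token ViT
CCT
Cross ViT
PiT
LeViT
CvT
Twins SVT
CrossFormer
RegionViT
ScalableViT
SepViT
MaxViT
NesT
MobileViT
XCiT
Masked Autoencoder
Simple Masked Image Modeling
Masked Patch Prediction
Masked Position Prediction
Adaptive Token Sampling
Patch Merger
Vision Transformer for Small Datasets
3D ViT
ViViT
Parallel ViT
Learnable Memory ViT
Dino
EsViT
Accessing Attention
Research Ideas
- Efficient Attention
- Combining with other Transformer improvements
FAQ
Resources
Citations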

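Vision Transformer - Pytorch

Implementation of Vision Transformer in Pytorch, a simple way to achieve SOTA in vision classification with only a single transformer encoder. Its significance is explained further in Yannic Kilcher's video. There is really not much to code here, but it may as well be laid out for everyone so we can expedite the attention revolution.

For a Pytorch implementation with pretrained models, please see Ross Wightman's repository.

There is also an official Jax repository.

A tensorflow2 translation also exists, created by research scientist Junho Kim!

Flax translation by Enrico Shippole!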

Install

$ pip install vit-pytorch

Usage

import torch
from vit_pytorch import ViT

v = ViT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 16 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1
)

img = torch . randn ( 1 , 3 , 256 , 256 )

preds = v ( img ) # (1, 1000)

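Parameters

image_size: int.
Image size. If you have rectangular images, make sure your image size is the maximum of the width and height.
patch_size: int.
Size of the patches. image_size must be divisible by patch_size.
The number of patches is n = (image_size // patch_size) ** 2, and n must be greater than 16.
num_classes: int.
Number of classes to classify.
dim: int.
Last dimension of the output tensor after the linear transformation nn.Linear(..., dim).
depth: int.
Number of Transformer blocks.
heads: int.
Number of heads in the multi-head attention layer.
mlp_dim: int.
Dimension of the MLP (feedforward) layer.
channels: int, default 3.
Number of image channels.
dropout: float in [0, 1], default 0.
Dropout rate.
emb_dropout: float in [0, 1], default 0.
Embedding dropout rate.
pool: string, either cls token pooling or mean pooling.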
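
As a quick check of the patch_size constraint described above, the following minimal sketch computes the number of patches for the configuration used in the Usage example. The helper name num_patches is hypothetical and not part of vit_pytorch, and the extra CLS token assumes cls pooling.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of patches for a square image (hypothetical helper, not a vit_pytorch API)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2

n = num_patches(256, 32)   # 8 x 8 grid of patches
seq_len = n + 1            # one extra token, assuming a CLS token is prepended
print(n, seq_len)
```

With image_size = 256 and patch_size = 32 this gives 64 patches, which satisfies the requirement that n be greater than 16.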

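Simple ViT

An update from some of the same authors of the original paper proposes simplifications to ViT that allow it to train faster and better.

These simplifications include 2D sinusoidal positional embeddings, global average pooling (no CLS token), no dropout, a batch size of 1024 rather than 4096, and the use of RandAugment and MixUp augmentations. They also show that a simple linear layer at the end is not significantly worse than the original MLP head.

You can use it by importing SimpleViT as shown below:

import torch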
from vit_pytorch import SimpleViT

v = SimpleViT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 16 ,
    mlp_dim = 2048
)

img = torch . randn ( 1 , 3 , 256 , 256 )

preds = v ( img ) # (1, 1000)
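
To make the "2D sinusoidal positional embedding" mentioned above concrete, here is a small pure-Python sketch. It assumes one common layout (a quarter of the channels each for sin(x), cos(x), sin(y), cos(y)) and a temperature of 10000; the function name sincos_2d is hypothetical, and the layout inside SimpleViT may differ in detail.

```python
import math

def sincos_2d(height, width, dim, temperature=10000.0):
    """Fixed 2D sin/cos position table with height * width rows of length dim.

    Illustrative only: the channel layout and temperature are assumptions.
    """
    if dim % 4 != 0 or dim < 8:
        raise ValueError("dim must be a multiple of 4 and at least 8")
    quarter = dim // 4
    # Geometrically spaced frequencies from 1 down to 1 / temperature
    omega = [1.0 / temperature ** (i / (quarter - 1)) for i in range(quarter)]
    table = []
    for y in range(height):
        for x in range(width):
            row = [math.sin(x * w) for w in omega]
            row += [math.cos(x * w) for w in omega]
            row += [math.sin(y * w) for w in omega]
            row += [math.cos(y * w) for w in omega]
            table.append(row)
    return table

pos = sincos_2d(8, 8, 16)  # 8 x 8 patch grid, 16 channels per position
```

Because the table is fixed rather than learned, it adds no parameters.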

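NaViT

This paper proposes to leverage the flexibility of attention and masking for variable-length sequences in order to train on images of multiple resolutions, packed into a single batch. The authors demonstrate much faster training and improved accuracy, with the only cost being extra complexity in the architecture and data loading. They use factorized 2D positional encodings, token dropping, and query-key normalization.

You can use it as follows:

import torch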
from vit_pytorch . na_vit import NaViT

v = NaViT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 16 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1 ,
    token_dropout_prob = 0.1  # token dropout of 10% (keep 90% of tokens)
)

# 5 images of different resolutions - List[List[Tensor]]

# for now, you'll have to correctly place images in same batch element as to not exceed maximum allowed sequence length for self-attention w/ masking

images = [
    [ torch . randn ( 3 , 256 , 256 ), torch . randn ( 3 , 128 , 128 )],
    [ torch . randn ( 3 , 128 , 256 ), torch . randn ( 3 , 256 , 128 )],
    [ torch . randn ( 3 , 64 , 256 )]
]

preds = v ( images ) # (5, 1000) - 5, because 5 images of different resolution above

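Or, if you would rather have the framework automatically group the images into variable-length sequences that do not exceed a certain maximum length:

images = [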
    torch . randn ( 3 , 256 , 256 ),
    torch . randn ( 3 , 128 , 128 ),
    torch . randn ( 3 , 128 , 256 ),
    torch . randn ( 3 , 256 , 128 ),
    torch . randn ( 3 , 64 , 256 )
]

preds = v (
    images ,
    group_images = True ,
    group_max_seq_len = 64
) # (5, 1000)
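
The grouping idea can be sketched in plain Python. This is only an illustration of greedy packing under a maximum sequence length. The helper names are hypothetical, token dropout is ignored, and the strategy NaViT actually uses for group_images may differ.

```python
def tokens_per_image(height, width, patch_size):
    """Number of patch tokens an image contributes (hypothetical helper)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)

def group_by_max_seq_len(sizes, patch_size, max_seq_len):
    """Greedily pack images, in order, into groups of at most max_seq_len tokens."""
    groups, current, current_len = [], [], 0
    for index, (height, width) in enumerate(sizes):
        n = tokens_per_image(height, width, patch_size)
        if n > max_seq_len:
            raise ValueError(f"image {index} alone exceeds max_seq_len")
        if current and current_len + n > max_seq_len:
            groups.append(current)          # close the full group
            current, current_len = [], 0
        current.append(index)
        current_len += n
    if current:
        groups.append(current)
    return groups

# (height, width) of the five images from the example above
sizes = [(256, 256), (128, 128), (128, 256), (256, 128), (64, 256)]
groups = group_by_max_seq_len(sizes, patch_size=32, max_seq_len=64)
print(groups)
```

With a patch size of 32 the five images contribute 64, 16, 32, 32, and 16 tokens, so under this greedy rule the first image fills a group by itself and the remaining four are packed in pairs.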

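Finally, if you would like to use a flavor of NaViT built on nested tensors (which omits much of the masking and padding altogether), make sure you are on version 2.5 and import as follows:

import torch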
from vit_pytorch . na_vit_nested_tensor import NaViT

v = NaViT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 16 ,
    mlp_dim = 2048 ,
    dropout = 0. ,
    emb_dropout = 0. ,
    token_dropout_prob = 0.1
)

# 5 images of different resolutions - List[Tensor]

images = [
    torch . randn ( 3 , 256 , 256 ), torch . randn ( 3 , 128 , 128 ),
    torch . randn ( 3 , 128 , 256 ), torch . randn ( 3 , 256 , 128 ),
    torch . randn ( 3 , 64 , 256 )
]

preds = v ( images )

assert preds . shape == ( 5 , 1000 )

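Distillation

A recent paper has shown that using a distillation token to distill knowledge from convolutional nets into a vision transformer can yield small and efficient vision transformers. This repository offers the means to do distillation easily.

For example, distilling from Resnet50 (or any teacher) to a vision transformer:

import torch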
from torchvision . models import resnet50

from vit_pytorch . distill import DistillableViT , DistillWrapper

teacher = resnet50 ( pretrained = True )

v = DistillableViT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 8 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1
)

distiller = DistillWrapper (
    student = v ,
    teacher = teacher ,
    temperature = 3 ,           # temperature of distillation
    alpha = 0.5 ,               # trade between main loss and distillation loss
    hard = False               # whether to use soft or hard distillation
)

img = torch . randn ( 2 , 3 , 256 , 256 )
labels = torch . randint ( 0 , 1000 , ( 2 ,))

loss = distiller ( img , labels )
loss . backward ()

# after lots of training above ...

pred = v ( img ) # (2, 1000)

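The DistillableViT class is identical to ViT except for how the forward pass is handled, so you should be able to load the parameters back into ViT after you have completed distillation training.

You can also use the handy .to_vit method on the DistillableViT instance to get back a ViT instance.

v = v . to_vit ()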
type ( v ) # <class 'vit_pytorch.vit_pytorch.ViT'>
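
To illustrate what the temperature and alpha arguments of the wrapper trade off, here is a pure-Python sketch of a soft distillation loss for a single example. It assumes the common formulation (cross entropy on the labels plus a temperature-scaled KL divergence to the teacher, blended by alpha); all function names are hypothetical, and the exact loss inside DistillWrapper may differ.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def soft_distillation_loss(class_logits, distill_logits, teacher_logits,
                           label, temperature=3.0, alpha=0.5):
    """(1 - alpha) * CE(labels) + alpha * T^2 * KL(teacher || student)."""
    main = cross_entropy(class_logits, label)
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(distill_logits, temperature)
    distill = kl_divergence(teacher_probs, student_probs) * temperature ** 2
    return (1 - alpha) * main + alpha * distill

teacher = [2.0, 0.5, -1.0]
matched = soft_distillation_loss([1.0, 0.0, 0.0], teacher, teacher, label=0)
mismatched = soft_distillation_loss([1.0, 0.0, 0.0], [-1.0, 0.5, 2.0], teacher, label=0)
```

When the distillation logits match the teacher exactly, the KL term vanishes and only the label loss, weighted by (1 - alpha), remains.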

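Deep ViT

This paper notes that ViT struggles to attend at greater depths (past 12 layers), and suggests mixing the attention of each head post-softmax as a solution, dubbed Re-attention. The results line up with the Talking Heads paper from NLP.

You can use it as follows:

import torch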
from vit_pytorch . deepvit import DeepViT

v = DeepViT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 16 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1
)

img = torch . randn ( 1 , 3 , 256 , 256 )

preds = v ( img ) # (1, 1000)
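
The head-mixing step behind Re-attention can be sketched without any framework. The sketch below only shows the post-softmax mixing across heads with a given matrix. In a real model that matrix would be learned, and any normalization applied afterwards is omitted here. The function name mix_heads is hypothetical.

```python
def mix_heads(attn, mix):
    """Mix post-softmax attention maps across heads.

    attn: [heads][tokens][tokens] attention weights
    mix:  [heads][out_heads] mixing weights (learned in a real model)
    """
    heads, tokens = len(attn), len(attn[0])
    out_heads = len(mix[0])
    return [
        [
            [sum(attn[h][i][j] * mix[h][g] for h in range(heads)) for j in range(tokens)]
            for i in range(tokens)
        ]
        for g in range(out_heads)
    ]

attn = [
    [[1.0, 0.0], [0.0, 1.0]],   # head 0: each token attends only to itself
    [[0.5, 0.5], [0.5, 0.5]],   # head 1: uniform attention
]
identity = [[1.0, 0.0], [0.0, 1.0]]
average = [[0.5, 0.5], [0.5, 0.5]]

unchanged = mix_heads(attn, identity)   # identity mixing leaves the heads as they were
mixed = mix_heads(attn, average)        # every output head becomes the average of both
```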

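CaiT

This paper also notes the difficulty of training vision transformers at greater depths and proposes two solutions. First, it proposes per-channel multiplication of the output of each residual block. Second, it proposes to have the patches attend to one another, and to only allow the CLS token to attend to the patches in the last few layers.

They also add Talking Heads, noting improvements.

You can use this scheme as follows:

import torch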
from vit_pytorch . cait import CaiT

v = CaiT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 12 ,             # depth of transformer for patch to patch attention only
    cls_depth = 2 ,          # depth of cross attention of CLS tokens to patch
    heads = 16 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1 ,
    layer_dropout = 0.05    # randomly dropout 5% of the layers
)

img = torch . randn ( 1 , 3 , 256 , 256 )

preds = v ( img ) # (1, 1000)

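Token-to-Token ViT

This paper proposes that the first couple of layers should downsample the image sequence by unfolding, leading to overlapping image data in each token, as shown in the figure above. You can use this variant of ViT as follows.

import torch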
from vit_pytorch . t2t import T2TViT

v = T2TViT (
    dim = 512 ,
    image_size = 224 ,
    depth = 5 ,
    heads = 8 ,
    mlp_dim = 512 ,
    num_classes = 1000 ,
    t2t_layers = (( 7 , 4 ), ( 3 , 2 ), ( 3 , 2 )) # tuples of the kernel size and stride of each consecutive layers of the initial token to token module
)

img = torch . randn ( 1 , 3 , 224 , 224 )

preds = v ( img ) # (1, 1000)
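
To see how the (kernel size, stride) pairs in t2t_layers shrink the token grid, the sketch below applies the standard output-size formula for an unfold or convolution at each step. The padding of stride // 2 is an assumption made for this illustration, and the helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output length of a sliding window along one spatial dimension."""
    return (size - kernel + 2 * padding) // stride + 1

def t2t_grid_size(image_size, t2t_layers):
    """Side length of the token grid after all token-to-token layers."""
    size = image_size
    for kernel, stride in t2t_layers:
        size = conv_output_size(size, kernel, stride, padding=stride // 2)  # assumed padding
    return size

grid = t2t_grid_size(224, ((7, 4), (3, 2), (3, 2)))
num_tokens = grid * grid
print(grid, num_tokens)
```

Under these assumptions the 224 x 224 image goes from a 56 x 56 grid to 28 x 28 and finally 14 x 14, so the transformer sees 196 tokens, each built from overlapping windows.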

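Cross ViT

This paper proposes to have two vision transformers process the image at different scales, cross-attending to one another every so often. They show improvements on top of the base vision transformer.

import torch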
from vit_pytorch . cross_vit import CrossViT

v = CrossViT (
    image_size = 256 ,
    num_classes = 1000 ,
    depth = 4 ,               # number of multi-scale encoding blocks
    sm_dim = 192 ,            # high res dimension
    sm_patch_size = 16 ,      # high res patch size (should be smaller than lg_patch_size)
    sm_enc_depth = 2 ,        # high res depth
    sm_enc_heads = 8 ,        # high res heads
    sm_enc_mlp_dim = 2048 ,   # high res feedforward dimension
    lg_dim = 384 ,            # low res dimension
    lg_patch_size = 64 ,      # low res patch size
    lg_enc_depth = 3 ,        # low res depth
    lg_enc_heads = 8 ,        # low res heads
    lg_enc_mlp_dim = 2048 ,   # low res feedforward dimensions
    cross_attn_depth = 2 ,    # cross attention rounds
    cross_attn_heads = 8 ,    # cross attention heads
    dropout = 0.1 ,
    emb_dropout = 0.1
)

img = torch . randn ( 1 , 3 , 256 , 256 )

pred = v ( img ) # (1, 1000)

PiT

This paper proposes to downsample the tokens through a pooling procedure that uses depth-wise convolutions.

import torch
from vit_pytorch . pit import PiT

v = PiT (
    image_size = 224 ,
    patch_size = 14 ,
    dim = 256 ,
    num_classes = 1000 ,
    depth = ( 3 , 3 , 3 ),     # list of depths, indicating the number of rounds of each stage before a downsample
    heads = 16 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1
)

# forward pass returns the class predictions

img = torch . randn ( 1 , 3 , 224 , 224 )

preds = v ( img ) # (1, 1000)

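LeViT

This paper proposes a number of changes, including (1) convolutional embedding instead of patch-wise projection, (2) downsampling in stages, (3) extra non-linearity in attention, (4) 2D relative positional biases instead of the initial absolute positional bias, and (5) batchnorm in place of layernorm.

Official repository

import torch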
from vit_pytorch . levit import LeViT

levit = LeViT (
    image_size = 224 ,
    num_classes = 1000 ,
    stages = 3 ,             # number of stages
    dim = ( 256 , 384 , 512 ),  # dimensions at each stage
    depth = 4 ,              # transformer of depth 4 at each stage
    heads = ( 4 , 6 , 8 ),      # heads at each stage
    mlp_mult = 2 ,
    dropout = 0.1
)

img = torch . randn ( 1 , 3 , 224 , 224 )

levit ( img ) # (1, 1000)

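CvT

This paper proposes mixing convolutions and attention. Specifically, convolutions are used to embed and downsample the image / feature map in three stages. Depthwise convolution is also used to project the queries, keys, and values for attention.

import torch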
from vit_pytorch . cvt import CvT

v = CvT (
    num_classes = 1000 ,
    s1_emb_dim = 64 ,        # stage 1 - dimension
    s1_emb_kernel = 7 ,      # stage 1 - conv kernel
    s1_emb_stride = 4 ,      # stage 1 - conv stride
    s1_proj_kernel = 3 ,     # stage 1 - attention ds-conv kernel size
    s1_kv_proj_stride = 2 ,  # stage 1 - attention key / value projection stride
    s1_heads = 1 ,           # stage 1 - heads
    s1_depth = 1 ,           # stage 1 - depth
    s1_mlp_mult = 4 ,        # stage 1 - feedforward expansion factor
    s2_emb_dim = 192 ,       # stage 2 - (same as above)
    s2_emb_kernel = 3 ,
    s2_emb_stride = 2 ,
    s2_proj_kernel = 3 ,
    s2_kv_proj_stride = 2 ,
    s2_heads = 3 ,
    s2_depth = 2 ,
    s2_mlp_mult = 4 ,
    s3_emb_dim = 384 ,       # stage 3 - (same as above)
    s3_emb_kernel = 3 ,
    s3_emb_stride = 2 ,
    s3_proj_kernel = 3 ,
    s3_kv_proj_stride = 2 ,
    s3_heads = 4 ,
    s3_depth = 10 ,
    s3_mlp_mult = 4 ,
    dropout = 0.
)

img = torch . randn ( 1 , 3 , 224 , 224 )

pred = v ( img ) # (1, 1000)

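Twins SVT

This paper proposes mixing local and global attention, along with a position encoding generator (proposed in CPVT) and global average pooling, to achieve the same results as Swin without the extra complexity of shifted windows, CLS tokens, or positional embeddings.

import torch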
from vit_pytorch . twins_svt import TwinsSVT

model = TwinsSVT (
    num_classes = 1000 ,       # number of output classes
    s1_emb_dim = 64 ,          # stage 1 - patch embedding projected dimension
    s1_patch_size = 4 ,        # stage 1 - patch size for patch embedding
    s1_local_patch_size = 7 ,  # stage 1 - patch size for local attention
    s1_global_k = 7 ,          # stage 1 - global attention key / value reduction factor, defaults to 7 as specified in paper
    s1_depth = 1 ,             # stage 1 - number of transformer blocks (local attn -> ff -> global attn -> ff)
    s2_emb_dim = 128 ,         # stage 2 (same as above)
    s2_patch_size = 2 ,
    s2_local_patch_size = 7 ,
    s2_global_k = 7 ,
    s2_depth = 1 ,
    s3_emb_dim = 256 ,         # stage 3 (same as above)
    s3_patch_size = 2 ,
    s3_local_patch_size = 7 ,
    s3_global_k = 7 ,
    s3_depth = 5 ,
    s4_emb_dim = 512 ,         # stage 4 (same as above)
    s4_patch_size = 2 ,
    s4_local_patch_size = 7 ,
    s4_global_k = 7 ,
    s4_depth = 4 ,
    peg_kernel_size = 3 ,      # positional encoding generator kernel size
    dropout = 0.              # dropout
)

img = torch . randn ( 1 , 3 , 224 , 224 )

pred = model ( img ) # (1, 1000)

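RegionViT

This paper proposes to divide the feature map into local regions, within which the local tokens attend to each other. Each local region has its own regional token, which attends to all of its local tokens as well as to the other regional tokens.

You can use it as follows:

import torch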
from vit_pytorch . regionvit import RegionViT

model = RegionViT (
    dim = ( 64 , 128 , 256 , 512 ),      # tuple of size 4, indicating dimension at each stage
    depth = ( 2 , 2 , 8 , 2 ),           # depth of the region to local transformer at each stage
    window_size = 7 ,                # window size, which should be either 7 or 14
    num_classes = 1000 ,             # number of output classes
    tokenize_local_3_conv = False ,  # whether to use a 3 layer convolution to encode the local tokens from the image. the paper uses this for the smaller models, but uses only 1 conv (set to False) for the larger models
    use_peg = False ,                # whether to use positional generating module. they used this for object detection for a boost in performance
)

img = torch . randn ( 1 , 3 , 224 , 224 )

pred = model ( img ) # (1, 1000)

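CrossFormer

This paper beats PVT and Swin using alternating local and global attention. The global attention is done across the windowing dimension for reduced complexity, much like the scheme used for axial attention.

They also have a cross-scale embedding layer, which they show to be a generic layer that can improve all vision transformers. A dynamic relative positional bias was also formulated to allow the net to generalize to images of greater resolution.

import torch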
from vit_pytorch . crossformer import CrossFormer

model = CrossFormer (
    num_classes = 1000 ,                # number of output classes
    dim = ( 64 , 128 , 256 , 512 ),         # dimension at each stage
    depth = ( 2 , 2 , 8 , 2 ),              # depth of transformer at each stage
    global_window_size = ( 8 , 4 , 2 , 1 ), # global window sizes at each stage
    local_window_size = 7 ,             # local window size (can be customized for each stage, but in paper, held constant at 7 for all stages)
)

img = torch . randn ( 1 , 3 , 224 , 224 )

pred = model ( img ) # (1, 1000)

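ScalableViT

This Bytedance AI paper proposes the Scalable Self-Attention (SSA) and the Interactive Windowed Self-Attention (IWSA) modules. The SSA alleviates the computation needed at earlier stages by reducing the key / value feature map by some factor (reduction_factor), while modulating the dimension of the queries and keys (ssa_dim_key). The IWSA performs self-attention within local windows, similar to other vision transformer papers. However, they add a residual of the values, passed through a convolution of kernel size 3, which they named the Local Interactive Module (LIM).

They claim in the paper that this scheme outperforms Swin Transformer, and also demonstrate competitive performance against Crossformer.

You can use it as follows (ex. ScalableViT-S):

import torch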
from vit_pytorch . scalable_vit import ScalableViT

model = ScalableViT (
    num_classes = 1000 ,
    dim = 64 ,                               # starting model dimension. at every stage, dimension is doubled
    heads = ( 2 , 4 , 8 , 16 ),                  # number of attention heads at each stage
    depth = ( 2 , 2 , 20 , 2 ),                  # number of transformer blocks at each stage
    ssa_dim_key = ( 40 , 40 , 40 , 32 ),         # the dimension of the attention keys (and queries) for SSA. in the paper, they represented this as a scale factor on the base dimension per key (ssa_dim_key / dim_key)
    reduction_factor = ( 8 , 4 , 2 , 1 ),        # downsampling of the key / values in SSA. in the paper, this was represented as (reduction_factor ** -2)
    window_size = ( 64 , 32 , None , None ),     # window size of the IWSA at each stage. None means no windowing needed
    dropout = 0.1 ,                          # attention and feedforward dropout
)

img = torch . randn ( 1 , 3 , 256 , 256 )

preds = model ( img ) # (1, 1000)

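SepViT

Another Bytedance AI paper, it proposes a depthwise-pointwise self-attention layer that seems largely inspired by mobilenet's depthwise-separable convolution. The most interesting aspect is the reuse of the feature map from the depthwise self-attention stage as the values for the pointwise self-attention, as shown in the diagram above.

I have decided to include only the version of SepViT with this specific self-attention layer, as the grouped attention layers are neither remarkable nor novel, and the authors were not clear on how they treated the window tokens for the group self-attention layer. Besides, it seems that with the DSSA layer alone they were able to beat Swin.

ex. SepViT-Lite

import torch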
from vit_pytorch . sep_vit import SepViT

v = SepViT (
    num_classes = 1000 ,
    dim = 32 ,               # dimensions of first stage, which doubles every stage (32, 64, 128, 256) for SepViT-Lite
    dim_head = 32 ,          # attention head dimension
    heads = ( 1 , 2 , 4 , 8 ),   # number of heads per stage
    depth = ( 1 , 2 , 6 , 2 ),   # number of transformer blocks per stage
    window_size = 7 ,        # window size of DSS Attention block
    dropout = 0.1           # dropout
)

img = torch . randn ( 1 , 3 , 224 , 224 )

preds = v ( img ) # (1, 1000)

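MaxViT

This paper proposes a hybrid convolutional / attention network, using MBConv from the convolution side, followed by block / grid axial sparse attention.

They also claim this specific vision transformer is good for generative models (GANs).

ex. MaxViT-S

import torch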
from vit_pytorch . max_vit import MaxViT

v = MaxViT (
    num_classes = 1000 ,
    dim_conv_stem = 64 ,               # dimension of the convolutional stem, would default to dimension of first layer if not specified
    dim = 96 ,                         # dimension of first layer, doubles every layer
    dim_head = 32 ,                    # dimension of attention heads, kept at 32 in paper
    depth = ( 2 , 2 , 5 , 2 ),             # number of MaxViT blocks per stage, which consists of MBConv, block-like attention, grid-like attention
    window_size = 7 ,                  # window size for block and grids
    mbconv_expansion_rate = 4 ,        # expansion rate of MBConv
    mbconv_shrinkage_rate = 0.25 ,     # shrinkage rate of squeeze-excitation in MBConv
    dropout = 0.1                     # dropout
)

img = torch . randn ( 2 , 3 , 224 , 224 )

preds = v ( img ) # (2, 1000)

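NesT

This paper processes the image in hierarchical stages, with attention only among the tokens of local blocks, which are aggregated while moving up the hierarchy. The aggregation is done in the image plane, and contains a convolution and a subsequent maxpool to allow information to pass across block boundaries.

You can use it with the following code (ex. NesT-T):

import torch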
from vit_pytorch . nest import NesT

nest = NesT (
    image_size = 224 ,
    patch_size = 4 ,
    dim = 96 ,
    heads = 3 ,
    num_hierarchies = 3 ,        # number of hierarchies
    block_repeats = ( 2 , 2 , 8 ),  # the number of transformer blocks at each hierarchy, starting from the bottom
    num_classes = 1000
)

img = torch . randn ( 1 , 3 , 224 , 224 )

pred = nest ( img ) # (1, 1000)

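MobileViT

This paper introduces MobileViT, a light-weight, general-purpose vision transformer for mobile devices. MobileViT presents a different perspective on the global processing of information with transformers.

You can use it with the following code (ex. mobilevit_xs):

import torch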
from vit_pytorch . mobile_vit import MobileViT

mbvit_xs = MobileViT (
    image_size = ( 256 , 256 ),
    dims = [ 96 , 120 , 144 ],
    channels = [ 16 , 32 , 48 , 48 , 64 , 64 , 80 , 80 , 96 , 96 , 384 ],
    num_classes = 1000
)

img = torch . randn ( 1 , 3 , 256 , 256 )

pred = mbvit_xs ( img ) # (1, 1000)

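XCiT

This paper introduces cross-covariance attention (abbreviated XCA). One can think of it as doing attention across the feature dimension rather than the spatial one (another perspective would be a dynamic 1x1 convolution, the kernel being the attention map defined by spatial correlations).

Technically, this amounts to simply transposing the queries, keys, and values before executing cosine-similarity attention with a learned temperature.

import torch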
from vit_pytorch . xcit import XCiT

v = XCiT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 12 ,                     # depth of xcit transformer
    cls_depth = 2 ,                  # depth of cross attention of CLS tokens to patch, attention pool at end
    heads = 16 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1 ,
    layer_dropout = 0.05 ,           # randomly dropout 5% of the layers
    local_patch_kernel_size = 3     # kernel size of the local patch interaction module (depthwise convs)
)

img = torch . randn ( 1 , 3 , 256 , 256 )

preds = v ( img ) # (1, 1000)
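
The transposition described above can be written out explicitly. The formulas below contrast standard token attention with cross-covariance attention for a single head, with N tokens and d channels. The notation (hats for L2-normalized matrices, tau for the learned temperature) is introduced here for illustration and follows the usual presentation of XCA.

```latex
% Standard token attention: the attention map is N x N (token to token)
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
\qquad Q, K, V \in \mathbb{R}^{N \times d}

% Cross-covariance attention: the attention map is d x d (channel to channel)
\mathrm{XCA}(Q, K, V)
  = V \, \mathrm{softmax}\!\left(\frac{\hat{K}^{\top} \hat{Q}}{\tau}\right)
```

Here $\hat{K}$ and $\hat{Q}$ are normalized to unit L2 norm along the token dimension, which makes the product a cosine similarity between channels. Because the map is d x d, its cost grows linearly with the number of tokens rather than quadratically.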

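Simple Masked Image Modeling

This paper proposes a simple masked image modeling (SimMIM) scheme, using only a linear projection from the masked tokens into pixel space, followed by an L1 loss against the pixel values of the masked patches. Results are competitive with other, more complicated approaches.

You can use it as follows:

import torch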
from vit_pytorch import ViT
from vit_pytorch . simmim import SimMIM

v = ViT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 8 ,
    mlp_dim = 2048
)

mim = SimMIM (
    encoder = v ,
    masking_ratio = 0.5  # they found 50% to yield the best results
)

images = torch . randn ( 8 , 3 , 256 , 256 )

loss = mim ( images )
loss . backward ()

# that's all!
# do the above in a for loop many times with a lot of images and your vision transformer will learn

torch . save ( v . state_dict (), './trained-vit.pt' )

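Masked Autoencoder

A new Kaiming He paper proposes a simple autoencoder scheme in which the vision transformer attends to a set of unmasked patches, and a smaller decoder tries to reconstruct the masked pixel values.

DeepReader quick paper review

AI Coffeebreak with Letitia

You can use it with the following code:

import torch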
from vit_pytorch import ViT , MAE

v = ViT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 8 ,
    mlp_dim = 2048
)

mae = MAE (
    encoder = v ,
    masking_ratio = 0.75 ,   # the paper recommended 75% masked patches
    decoder_dim = 512 ,      # paper showed good results with just 512
    decoder_depth = 6       # anywhere from 1 to 8
)

images = torch . randn ( 8 , 3 , 256 , 256 )

loss = mae ( images )
loss . backward ()

# that's all!
# do the above in a for loop many times with a lot of images and your vision transformer will learn

# save your improved vision transformer
torch . save ( v . state_dict (), './trained-vit.pt' )
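
To make masking_ratio concrete, the sketch below splits patch indices into a masked set and a visible set for the 64 patches of the example above. It is a plain-Python illustration of random masking with hypothetical names, not the sampling code used inside MAE.

```python
import random

def split_masked_visible(num_patches, masking_ratio, seed=0):
    """Randomly choose which patch indices are masked and which stay visible."""
    rng = random.Random(seed)                 # seeded only to make the sketch reproducible
    num_masked = int(masking_ratio * num_patches)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return masked, visible

# 256 / 32 = 8 patches per side, so 64 patches in total
masked, visible = split_masked_visible(num_patches=64, masking_ratio=0.75)
print(len(masked), len(visible))
```

With a ratio of 0.75 the encoder only has to process 16 of the 64 patches, which is where much of the efficiency of this pretraining scheme comes from.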

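Masked Patch Prediction

Thanks to Zach, you can train using the original masked patch prediction task presented in the paper, with the following code.

import torch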
from vit_pytorch import ViT
from vit_pytorch . mpp import MPP

model = ViT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 8 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1
)

mpp_trainer = MPP (
    transformer = model ,
    patch_size = 32 ,
    dim = 1024 ,
    mask_prob = 0.15 ,          # probability of using token in masked prediction task
    random_patch_prob = 0.30 ,  # probability of randomly replacing a token being used for mpp
    replace_prob = 0.50 ,       # probability of replacing a token being used for mpp with the mask token
)

opt = torch . optim . Adam ( mpp_trainer . parameters (), lr = 3e-4 )

def sample_unlabelled_images ():
    return torch . FloatTensor ( 20 , 3 , 256 , 256 ). uniform_ ( 0. , 1. )

for _ in range ( 100 ):
    images = sample_unlabelled_images ()
    loss = mpp_trainer ( images )
    opt . zero_grad ()
    loss . backward ()
    opt . step ()

# save your improved network
torch . save ( model . state_dict (), './pretrained-net.pt' )

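Masked Position Prediction

A new paper introduces the masked position prediction pre-training criterion. This strategy is more efficient than the Masked Autoencoder strategy and has comparable performance.

import torch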
from vit_pytorch . mp3 import ViT , MP3

v = ViT (
    num_classes = 1000 ,
    image_size = 256 ,
    patch_size = 8 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 8 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
)

mp3 = MP3 (
    vit = v ,
    masking_ratio = 0.75
)

images = torch . randn ( 8 , 3 , 256 , 256 )

loss = mp3 ( images )
loss . backward ()

# that's all!
# do the above in a for loop many times with a lot of images and your vision transformer will learn

# save your improved vision transformer
torch . save ( v . state_dict (), './trained-vit.pt' )

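Adaptive Token Sampling

This paper proposes to use the CLS attention scores, re-weighted by the norms of the value heads, as a means to discard unimportant tokens at different layers.

import torch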
from vit_pytorch . ats_vit import ViT

v = ViT (
    image_size = 256 ,
    patch_size = 16 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    max_tokens_per_depth = ( 256 , 128 , 64 , 32 , 16 , 8 ), # a tuple that denotes the maximum number of tokens that any given layer should have. if the layer has greater than this amount, it will undergo adaptive token sampling
    heads = 16 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1
)

img = torch . randn ( 4 , 3 , 256 , 256 )

preds = v ( img ) # (4, 1000)

# you can also get a list of the final sampled patch ids
# a value of -1 denotes padding

preds , token_ids = v ( img , return_sampled_token_ids = True ) # (4, 1000), (4, <=8)

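Patch Merger

This paper proposes a simple module (Patch Merger) for reducing the number of tokens at any layer of a vision transformer without sacrificing performance.

import torch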
from vit_pytorch . vit_with_patch_merger import ViT

v = ViT (
    image_size = 256 ,
    patch_size = 16 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 12 ,
    heads = 8 ,
    patch_merge_layer = 6 ,        # at which transformer layer to do patch merging
    patch_merge_num_tokens = 8 ,   # the output number of tokens from the patch merge
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1
)

img = torch . randn ( 4 , 3 , 256 , 256 )

preds = v ( img ) # (4, 1000)

One can also use the PatchMerger module by itself:

import torch
from vit_pytorch . vit_with_patch_merger import PatchMerger

merger = PatchMerger (
    dim = 1024 ,
    num_tokens_out = 8   # output number of tokens
)

features = torch . randn ( 4 , 256 , 1024 ) # (batch, num tokens, dimension)

out = merger ( features ) # (4, 8, 1024)

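Vision Transformer for Small Datasets

This paper proposes a new image-to-patch function that incorporates shifts of the image before normalizing and dividing the image into patches. I have found shifting to be extremely helpful in some other transformer work, so I decided to include it for further exploration. It also includes the LSA with the learned temperature and masking out of a token's attention to itself.

You can use it as follows:

import torch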
from vit_pytorch . vit_for_small_dataset import ViT

v = ViT (
    image_size = 256 ,
    patch_size = 16 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 16 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1
)

img = torch . randn ( 4 , 3 , 256 , 256 )

preds = v ( img ) # (4, 1000)

You can also use the SPT from this paper as a standalone module:

import torch
from vit_pytorch . vit_for_small_dataset import SPT

spt = SPT (
    dim = 1024 ,
    patch_size = 16 ,
    channels = 3
)

img = torch . randn ( 4 , 3 , 256 , 256 )

tokens = spt ( img ) # (4, 256, 1024)

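3D ViT

By popular request, I will start extending a few of the architectures in this repository to 3D ViTs, for use with video, medical imaging, etc.

You will need to pass in two additional hyperparameters: (1) the number of frames, frames, and (2) the patch size along the frame dimension, frame_patch_size.

For starters, 3D ViT:

import torch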
from vit_pytorch . vit_3d import ViT

v = ViT (
    image_size = 128 ,          # image size
    frames = 16 ,               # number of frames
    image_patch_size = 16 ,     # image patch size
    frame_patch_size = 2 ,      # frame patch size
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 8 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1
)

video = torch . randn ( 4 , 3 , 16 , 128 , 128 ) # (batch, channels, frames, height, width)

preds = v ( video ) # (4, 1000)
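
The effect of the two extra hyperparameters on sequence length can be checked with a few lines of arithmetic. The helper name below is hypothetical, and the count excludes any CLS token.

```python
def num_video_tokens(image_size, frames, image_patch_size, frame_patch_size):
    """Number of space-time patches for a square video clip (hypothetical helper)."""
    if image_size % image_patch_size != 0:
        raise ValueError("image_size must be divisible by image_patch_size")
    if frames % frame_patch_size != 0:
        raise ValueError("frames must be divisible by frame_patch_size")
    per_frame_group = (image_size // image_patch_size) ** 2   # spatial patches
    frame_groups = frames // frame_patch_size                 # temporal patches
    return per_frame_group * frame_groups

tokens = num_video_tokens(image_size=128, frames=16, image_patch_size=16, frame_patch_size=2)
print(tokens)
```

For the configuration above this is 8 x 8 spatial patches times 8 temporal patches, or 512 tokens per clip.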

3D Simple ViT

import torch
from vit_pytorch . simple_vit_3d import SimpleViT

v = SimpleViT (
    image_size = 128 ,          # image size
    frames = 16 ,               # number of frames
    image_patch_size = 16 ,     # image patch size
    frame_patch_size = 2 ,      # frame patch size
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 8 ,
    mlp_dim = 2048
)

video = torch . randn ( 4 , 3 , 16 , 128 , 128 ) # (batch, channels, frames, height, width)

preds = v ( video ) # (4, 1000)

3D version of CCT

import torch
from vit_pytorch . cct_3d import CCT

cct = CCT (
    img_size = 224 ,
    num_frames = 8 ,
    embedding_dim = 384 ,
    n_conv_layers = 2 ,
    frame_kernel_size = 3 ,
    kernel_size = 7 ,
    stride = 2 ,
    padding = 3 ,
    pooling_kernel_size = 3 ,
    pooling_stride = 2 ,
    pooling_padding = 1 ,
    num_layers = 14 ,
    num_heads = 6 ,
    mlp_ratio = 3. ,
    num_classes = 1000 ,
    positional_embedding = 'learnable'
)

video = torch . randn ( 1 , 3 , 8 , 224 , 224 ) # (batch, channels, frames, height, width)
pred = cct ( video )

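ViViT

This paper offers 3 different types of architectures for efficient attention over videos, with the main theme being factorizing the attention across space and time. This repository includes the factorized encoder and the factorized self-attention variants. The factorized encoder variant is a spatial transformer followed by a temporal one. The factorized self-attention variant is a spatio-temporal transformer with alternating spatial and temporal self-attention layers.

import torch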
from vit_pytorch . vivit import ViT

v = ViT (
    image_size = 128 ,          # image size
    frames = 16 ,               # number of frames
    image_patch_size = 16 ,     # image patch size
    frame_patch_size = 2 ,      # frame patch size
    num_classes = 1000 ,
    dim = 1024 ,
    spatial_depth = 6 ,         # depth of the spatial transformer
    temporal_depth = 6 ,        # depth of the temporal transformer
    heads = 8 ,
    mlp_dim = 2048 ,
    variant = 'factorized_encoder' , # or 'factorized_self_attention'
)

video = torch . randn ( 4 , 3 , 16 , 128 , 128 ) # (batch, channels, frames, height, width)

preds = v ( video ) # (4, 1000)

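Parallel ViT

This paper proposes parallelizing multiple attention and feedforward blocks per layer (2 blocks), claiming that this is easier to train without loss of performance.

You can try this variant as follows:

import torch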
from vit_pytorch . parallel_vit import ViT

v = ViT (
    image_size = 256 ,
    patch_size = 16 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 8 ,
    mlp_dim = 2048 ,
    num_parallel_branches = 2 ,  # in paper, they claimed 2 was optimal
    dropout = 0.1 ,
    emb_dropout = 0.1
)

img = torch . randn ( 4 , 3 , 256 , 256 )

preds = v ( img ) # (4, 1000)

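Learnable Memory ViT

This paper shows that adding learnable memory tokens at each layer of a vision transformer can greatly enhance fine-tuning results (in addition to a learnable task-specific CLS token and adapter head).

You can use this with a specially modified ViT as follows:

import torch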
from vit_pytorch . learnable_memory_vit import ViT , Adapter

# normal base ViT

v = ViT (
    image_size = 256 ,
    patch_size = 16 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 8 ,
    mlp_dim = 2048 ,
    dropout = 0.1 ,
    emb_dropout = 0.1
)

img = torch . randn ( 4 , 3 , 256 , 256 )
logits = v ( img ) # (4, 1000)

# do your usual training with ViT
# ...


# then, to finetune, just pass the ViT into the Adapter class
# you can do this for multiple Adapters, as shown below

adapter1 = Adapter (
    vit = v ,
    num_classes = 2 ,               # number of output classes for this specific task
    num_memories_per_layer = 5     # number of learnable memories per layer, 10 was sufficient in paper
)

logits1 = adapter1 ( img ) # (4, 2) - predict 2 classes off frozen ViT backbone with learnable memories and task specific head

# yet another task to finetune on, this time with 4 classes

adapter2 = Adapter (
    vit = v ,
    num_classes = 4 ,
    num_memories_per_layer = 10
)

logits2 = adapter2 ( img ) # (4, 4) - predict 4 classes off frozen ViT backbone with learnable memories and task specific head

Dino

You can train ViT with the recent SOTA self-supervised learning technique, Dino, using the following code.

Yannic Kilcher video

import torch
from vit_pytorch import ViT , Dino

model = ViT (
    image_size = 256 ,
    patch_size = 32 ,
    num_classes = 1000 ,
    dim = 1024 ,
    depth = 6 ,
    heads = 8 ,
    mlp_dim = 2048
)

learner = Dino (
    model ,
    image_size = 256 ,
    hidden_layer = 'to_latent' ,        # hidden layer name or index, from which to extract the embedding
    projection_hidden_size = 256 ,      # projector network hidden dimension
    projection_layers = 4 ,             # number of layers in projection network
    num_classes_K = 65336 ,             # output logits dimensions (referenced as K in paper)
    student_temp = 0.9 ,                # student temperature
    teacher_temp = 0.04 ,               # teacher temperature, needs to be annealed from 0.04 to 0.07 over 30 epochs
    local_upper_crop_scale = 0.4 ,      # upper bound for local crop - 0.4 was recommended in the paper 
    global_lower_crop_scale = 0.5 ,     # lower bound for global crop - 0.5 was recommended in the paper
    moving_average_decay = 0.9 ,        # moving average of encoder - paper showed anywhere from 0.9 to 0.999 was ok
    center_moving_average_decay = 0.9 , # moving average of teacher centers - paper showed anywhere from 0.9 to 0.999 was ok
)

opt = torch . optim . Adam ( learner . parameters (), lr = 3e-4 )

def sample_unlabelled_images ():
    return torch . randn ( 20 , 3 , 256 , 256 )

for _ in range ( 100 ):
    images = sample_unlabelled_images ()
    loss = learner ( images )
    opt . zero_grad ()
    loss . backward ()
    opt . step ()
    learner . update_moving_average () # update moving average of teacher encoder and teacher centers

# save your improved network
torch . save ( model . state_dict (), './pretrained-net.pt' )
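
Two of the comments above describe schedules rather than fixed values: the teacher temperature should be annealed from 0.04 to 0.07 over 30 epochs, and the teacher is a moving average of the student. The sketch below shows one way to express both in plain Python. The linear shape of the schedule and the helper names are assumptions, and how the resulting temperature is passed to the learner is not shown here.

```python
def teacher_temperature(epoch, start=0.04, end=0.07, warmup_epochs=30):
    """Linearly anneal the teacher temperature, then hold it (assumed schedule shape)."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs

def ema_update(teacher_params, student_params, decay):
    """Exponential moving average: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher_params, student_params)]

temps = [teacher_temperature(epoch) for epoch in (0, 15, 30, 100)]
updated = ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9)
```

A decay closer to 1 makes the teacher change more slowly, which matches the 0.9 to 0.999 range mentioned in the comments above.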

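EsViT

EsViT is a variant of Dino (from above) re-engineered to support efficient ViTs with patch merging / downsampling, by taking into account an extra regional loss between the augmented views. To quote the abstract, it "outperforms its supervised counterpart on 17 out of 18 datasets" at 3 times higher throughput.

Even though it is named as though it were a new ViT variant, it is actually just a strategy for training any multistage ViT (in the paper, they focused on Swin). The example below shows how to use it with CvT. You will need to set hidden_layer to the name of the layer within your efficient ViT that outputs the non-average-pooled visual representations, just before the global pooling and projection to logits.

import torch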
from vit_pytorch.cvt import CvT
from vit_pytorch.es_vit import EsViTTrainer

cvt = CvT(
    num_classes = 1000,
    s1_emb_dim = 64,
    s1_emb_kernel = 7,
    s1_emb_stride = 4,
    s1_proj_kernel = 3,
    s1_kv_proj_stride = 2,
    s1_heads = 1,
    s1_depth = 1,
    s1_mlp_mult = 4,
    s2_emb_dim = 192,
    s2_emb_kernel = 3,
    s2_emb_stride = 2,
    s2_proj_kernel = 3,
    s2_kv_proj_stride = 2,
    s2_heads = 3,
    s2_depth = 2,
    s2_mlp_mult = 4,
    s3_emb_dim = 384,
    s3_emb_kernel = 3,
    s3_emb_stride = 2,
    s3_proj_kernel = 3,
    s3_kv_proj_stride = 2,
    s3_heads = 4,
    s3_depth = 10,
    s3_mlp_mult = 4,
    dropout = 0.
)

learner = EsViTTrainer(
    cvt,
    image_size = 256,
    hidden_layer = 'layers',           # hidden layer name or index, from which to extract the embedding
    projection_hidden_size = 256,      # projector network hidden dimension
    projection_layers = 4,             # number of layers in projection network
    num_classes_K = 65336,             # output logits dimensions (referenced as K in paper)
    student_temp = 0.9,                # student temperature
    teacher_temp = 0.04,               # teacher temperature, needs to be annealed from 0.04 to 0.07 over 30 epochs
    local_upper_crop_scale = 0.4,      # upper bound for local crop - 0.4 was recommended in the paper
    global_lower_crop_scale = 0.5,     # lower bound for global crop - 0.5 was recommended in the paper
    moving_average_decay = 0.9,        # moving average of encoder - paper showed anywhere from 0.9 to 0.999 was ok
    center_moving_average_decay = 0.9, # moving average of teacher centers - paper showed anywhere from 0.9 to 0.999 was ok
)

opt = torch.optim.AdamW(learner.parameters(), lr = 3e-4)

def sample_unlabelled_images():
    return torch.randn(8, 3, 256, 256)

for _ in range(1000):
    images = sample_unlabelled_images()
    loss = learner(images)
    opt.zero_grad()
    loss.backward()
    opt.step()
    learner.update_moving_average() # update moving average of teacher encoder and teacher centers

# save your improved network
torch.save(cvt.state_dict(), './pretrained-net.pt')
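The `teacher_temp` comment above says the teacher temperature needs to be annealed from 0.04 to 0.07 over 30 epochs, but the training loop keeps it fixed. Below is a minimal sketch of such a schedule, assuming a linear warmup. The linear shape and the helper name `teacher_temp_at` are assumptions for illustration, and how the value is handed to the trainer depends on its API, which is not shown here.

```python
def teacher_temp_at(epoch, start=0.04, end=0.07, warmup_epochs=30):
    """Hypothetical helper: linearly anneal the teacher temperature.

    Returns `start` at epoch 0, rises linearly, and stays at `end`
    from `warmup_epochs` onward.
    """
    if epoch >= warmup_epochs:
        return end
    progress = epoch / warmup_epochs
    return start + (end - start) * progress

# temperatures at the start, midway, at the end of warmup, and long after
schedule = [teacher_temp_at(e) for e in (0, 15, 30, 100)]
```

A linear ramp is the simplest choice; any monotone schedule between the two values would fit the comment equally well.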

Accessing Attention

If you would like to visualize the attention weights (post-softmax) for your research, just follow the procedure below

import torch
from vit_pytorch.vit import ViT

v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

# import Recorder and wrap the ViT

from vit_pytorch.recorder import Recorder
v = Recorder(v)

# forward pass now returns predictions and the attention maps

img = torch.randn(1, 3, 256, 256)
preds, attns = v(img)

# there is one extra patch due to the CLS token

attns # (1, 6, 16, 65, 65) - (batch x layers x heads x patch x patch)
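One common way to look at these weights is to take the row of the CLS token and lay its attention over the patches back out on the patch grid. The sketch below uses a random post-softmax tensor as a stand-in for `attns` so that it runs on its own. It assumes the CLS token is the first token, and the helper name `cls_attention_map` is hypothetical.

```python
import torch

# Stand-in for the `attns` tensor returned by the wrapped ViT:
# (batch, layers, heads, tokens, tokens), where tokens = 64 patches + 1 CLS.
attns = torch.randn(1, 6, 16, 65, 65).softmax(dim=-1)

def cls_attention_map(attns, layer=-1, grid_size=8):
    """Return a (batch, grid, grid) map of how much CLS attends to each patch."""
    layer_attn = attns[:, layer]         # (batch, heads, tokens, tokens)
    head_avg = layer_attn.mean(dim=1)    # average over heads -> (batch, tokens, tokens)
    cls_to_patches = head_avg[:, 0, 1:]  # CLS row without the CLS column -> (batch, 64)
    return cls_to_patches.reshape(-1, grid_size, grid_size)

attn_map = cls_attention_map(attns)      # (1, 8, 8), ready for an image plot
```

Averaging over heads is only one choice; plotting each head separately often shows more structure.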

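To clean up the class and the hooks once you have collected enough data

v = v.eject()  # wrapper is discarded and original ViT instance is returned

Accessing Embeddings

You can similarly access the embeddings with the Extractor wrapper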

import torch
from vit_pytorch.vit import ViT

v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

# import Extractor and wrap the ViT

from vit_pytorch.extractor import Extractor
v = Extractor(v)

# forward pass now returns predictions and the embeddings

img = torch.randn(1, 3, 256, 256)
logits, embeddings = v(img)

# there is one extra token due to the CLS token

embeddings # (1, 65, 1024) - (batch x patches x model dim)

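Or say for CrossViT, which has a multi-scale encoder that outputs two sets of embeddings, one for the 'large' scale and one for the 'small' scale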

import torch
from vit_pytorch.cross_vit import CrossViT

v = CrossViT(
    image_size = 256,
    num_classes = 1000,
    depth = 4,
    sm_dim = 192,
    sm_patch_size = 16,
    sm_enc_depth = 2,
    sm_enc_heads = 8,
    sm_enc_mlp_dim = 2048,
    lg_dim = 384,
    lg_patch_size = 64,
    lg_enc_depth = 3,
    lg_enc_heads = 8,
    lg_enc_mlp_dim = 2048,
    cross_attn_depth = 2,
    cross_attn_heads = 8,
    dropout = 0.1,
    emb_dropout = 0.1
)

# wrap the CrossViT

from vit_pytorch.extractor import Extractor
v = Extractor(v, layer_name = 'multi_scale_encoder') # take embedding coming from the output of multi-scale-encoder

# forward pass now returns predictions and the embeddings

img = torch.randn(1, 3, 256, 256)
logits, embeddings = v(img)

# there is one extra token per scale due to the CLS token

embeddings # ((1, 257, 192), (1, 17, 384)) - (batch x patches x dimension) <- small and large scales respectively
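The token counts in the shapes above (65, 257 and 17) follow directly from the image and patch sizes. A small sketch of that arithmetic, assuming square images and patches and one CLS token per sequence; the helper name `num_tokens` is hypothetical.

```python
def num_tokens(image_size, patch_size, cls_token=True):
    """Number of tokens a square image produces: one per patch, plus CLS."""
    assert image_size % patch_size == 0, "image_size must be divisible by patch_size"
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if cls_token else 0)

vit_tokens = num_tokens(256, 32)    # ViT example: 8 x 8 patches + CLS
small_tokens = num_tokens(256, 16)  # CrossViT branch with sm_patch_size = 16
large_tokens = num_tokens(256, 64)  # CrossViT branch with lg_patch_size = 64
```

So the 257-token embedding (dimension 192) comes from the 16-pixel patches, and the 17-token embedding (dimension 384) from the 64-pixel patches.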

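Research Ideas

Efficient Attention

Some people coming from computer vision may think attention still suffers from quadratic costs. Fortunately, we have a lot of new techniques that may help. This repository offers a way for you to plug in your own sparse attention transformer.

An example with Nystromformer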

$ pip install nystrom-attention

import torch
from vit_pytorch.efficient import ViT
from nystrom_attention import Nystromformer

efficient_transformer = Nystromformer(
    dim = 512,
    depth = 12,
    heads = 8,
    num_landmarks = 256
)

v = ViT(
    dim = 512,
    image_size = 2048,
    patch_size = 32,
    num_classes = 1000,
    transformer = efficient_transformer
)

img = torch.randn(1, 3, 2048, 2048) # your high resolution picture
v(img) # (1, 1000)
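To get a feel for why the quadratic cost matters at this resolution, the sketch below counts attention-matrix entries per head and per layer for the 2048 x 2048 example. It is a back-of-the-envelope estimate only: the helper name `attention_entries` is hypothetical, one CLS token is assumed, and the landmark count models a Nystrom-style approximation as three matrices of size n x m, m x m and m x n, ignoring constants and other overheads.

```python
def attention_entries(image_size, patch_size, num_landmarks=None, cls_token=True):
    """Rough count of attention-matrix entries per head, per layer.

    Full attention stores an n x n matrix. With `num_landmarks` = m, this
    sketch counts the three smaller matrices (n x m, m x m, m x n) of a
    Nystrom-style approximation.
    """
    n = (image_size // patch_size) ** 2 + (1 if cls_token else 0)
    if num_landmarks is None:
        return n * n
    m = num_landmarks
    return 2 * n * m + m * m

full = attention_entries(2048, 32)                       # 4097 tokens -> 4097 ** 2 entries
approx = attention_entries(2048, 32, num_landmarks=256)  # grows linearly in the token count
```

With these settings the full matrix has roughly 16.8 million entries against roughly 2.2 million for the landmark version, and the gap widens as the image grows.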

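Other sparse attention frameworks I would highly recommend are Routing Transformer or Sinkhorn Transformer

Combining with other Transformer improvements

This paper purposely used the most vanilla of attention networks to make a statement. If you would like to use some of the latest improvements for attention nets, please use the Encoder from the x-transformers repository.

ex.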

$ pip install x-transformers

import torch
from vit_pytorch.efficient import ViT
from x_transformers import Encoder

v = ViT(
    dim = 512,
    image_size = 224,
    patch_size = 16,
    num_classes = 1000,
    transformer = Encoder(
        dim = 512,                  # set to be the same as the wrapper
        depth = 12,
        heads = 8,
        ff_glu = True,              # ex. feed forward GLU variant https://arxiv.org/abs/2002.05202
        residual_attn = True        # ex. residual attention https://arxiv.org/abs/2012.11747
    )
)

img = torch.randn(1, 3, 224, 224)
v(img) # (1, 1000)

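FAQ

How do I pass in non-square images?

You can already pass in non-square images - you just have to make sure your height and width are less than or equal to image_size, and both are divisible by patch_size

ex.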

import torch
from vit_pytorch import ViT

v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 128) # <-- not a square

preds = v(img) # (1, 1000)
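The rule for non-square images can be checked before calling the model. A minimal sketch, assuming a square `image_size` and `patch_size` as in the example; both helper names are hypothetical.

```python
def fits_vit(height, width, image_size, patch_size):
    """Check the non-square rule for a square `image_size` and `patch_size`."""
    within_bounds = height <= image_size and width <= image_size
    divisible = height % patch_size == 0 and width % patch_size == 0
    return within_bounds and divisible

def num_patches(height, width, patch_size):
    """Patches produced by an image that passed `fits_vit`."""
    return (height // patch_size) * (width // patch_size)

ok = fits_vit(256, 128, image_size=256, patch_size=32)  # the 256 x 128 example
patches = num_patches(256, 128, patch_size=32)          # 8 rows x 4 columns
```

The 256 x 128 image therefore yields 32 patches instead of the 64 a full 256 x 256 image would give.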

How do I pass in non-square patches?

import torch
from vit_pytorch import ViT

v = ViT(
    num_classes = 1000,
    image_size = (256, 128),  # image size is a tuple of (height, width)
    patch_size = (32, 16),    # patch size is a tuple of (height, width)
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 128)

preds = v(img)
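With tuple sizes, the patch grid is computed per axis. The sketch below works out the grid and the number of raw values in each flattened patch for the example above; the helper name `patch_grid` is hypothetical and three image channels are assumed.

```python
def patch_grid(image_size, patch_size, channels=3):
    """Return (rows, cols, values_per_patch) for (height, width) tuples."""
    image_h, image_w = image_size
    patch_h, patch_w = patch_size
    assert image_h % patch_h == 0 and image_w % patch_w == 0, \
        "image dimensions must be divisible by the patch dimensions"
    rows, cols = image_h // patch_h, image_w // patch_w
    return rows, cols, channels * patch_h * patch_w

rows, cols, patch_dim = patch_grid((256, 128), (32, 16))
```

Here the 256 x 128 image splits into an 8 x 8 grid of 32 x 16 patches, each holding 1536 values before it is projected to the model dimension.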

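Resources

Coming from computer vision and new to transformers? Here are some resources that greatly accelerated my learning.

Illustrated Transformer - Jay Alammar
Transformers from Scratch - Peter Bloem
The Annotated Transformer - Harvard NLP

Citations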

 @article { hassani2021escaping ,
    title   = { Escaping the Big Data Paradigm with Compact Transformers } ,
    author  = { Ali Hassani and Steven Walton and Nikhil Shah and Abulikemu Abuduweili and Jiachen Li and Humphrey Shi } ,
    year    = 2021 ,
    url     = { https://arxiv.org/abs/2104.05704 } ,
    eprint  = { 2104.05704 } ,
    archiveprefix = { arXiv } ,
    primaryclass = { cs.CV }
}

 @misc { dosovitskiy2020image ,
    title   = { An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale } ,
    author  = { Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby } ,
    year    = { 2020 } ,
    eprint  = { 2010.11929 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { touvron2020training ,
    title   = { Training data-efficient image transformers & distillation through attention } , 
    author  = { Hugo Touvron and Matthieu Cord and Matthijs Douze and Francisco Massa and Alexandre Sablayrolles and Hervé Jégou } ,
    year    = { 2020 } ,
    eprint  = { 2012.12877 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { yuan2021tokenstotoken ,
    title   = { Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet } ,
    author  = { Li Yuan and Yunpeng Chen and Tao Wang and Weihao Yu and Yujun Shi and Francis EH Tay and Jiashi Feng and Shuicheng Yan } ,
    year    = { 2021 } ,
    eprint  = { 2101.11986 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { zhou2021deepvit ,
    title   = { DeepViT: Towards Deeper Vision Transformer } ,
    author  = { Daquan Zhou and Bingyi Kang and Xiaojie Jin and Linjie Yang and Xiaochen Lian and Qibin Hou and Jiashi Feng } ,
    year    = { 2021 } ,
    eprint  = { 2103.11886 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { touvron2021going ,
    title   = { Going deeper with Image Transformers } , 
    author  = { Hugo Touvron and Matthieu Cord and Alexandre Sablayrolles and Gabriel Synnaeve and Hervé Jégou } ,
    year    = { 2021 } ,
    eprint  = { 2103.17239 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { chen2021crossvit ,
    title   = { CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification } ,
    author  = { Chun-Fu Chen and Quanfu Fan and Rameswar Panda } ,
    year    = { 2021 } ,
    eprint  = { 2103.14899 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { wu2021cvt ,
    title   = { CvT: Introducing Convolutions to Vision Transformers } ,
    author  = { Haiping Wu and Bin Xiao and Noel Codella and Mengchen Liu and Xiyang Dai and Lu Yuan and Lei Zhang } ,
    year    = { 2021 } ,
    eprint  = { 2103.15808 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { heo2021rethinking ,
    title   = { Rethinking Spatial Dimensions of Vision Transformers } , 
    author  = { Byeongho Heo and Sangdoo Yun and Dongyoon Han and Sanghyuk Chun and Junsuk Choe and Seong Joon Oh } ,
    year    = { 2021 } ,
    eprint  = { 2103.16302 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { graham2021levit ,
    title   = { LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference } ,
    author  = { Ben Graham and Alaaeldin El-Nouby and Hugo Touvron and Pierre Stock and Armand Joulin and Hervé Jégou and Matthijs Douze } ,
    year    = { 2021 } ,
    eprint  = { 2104.01136 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { li2021localvit ,
    title   = { LocalViT: Bringing Locality to Vision Transformers } ,
    author  = { Yawei Li and Kai Zhang and Jiezhang Cao and Radu Timofte and Luc Van Gool } ,
    year    = { 2021 } ,
    eprint  = { 2104.05707 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { chu2021twins ,
    title   = { Twins: Revisiting Spatial Attention Design in Vision Transformers } ,
    author  = { Xiangxiang Chu and Zhi Tian and Yuqing Wang and Bo Zhang and Haibing Ren and Xiaolin Wei and Huaxia Xia and Chunhua Shen } ,
    year    = { 2021 } ,
    eprint  = { 2104.13840 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { su2021roformer ,
    title   = { RoFormer: Enhanced Transformer with Rotary Position Embedding } , 
    author  = { Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu } ,
    year    = { 2021 } ,
    eprint  = { 2104.09864 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CL }
}

 @misc { zhang2021aggregating ,
    title   = { Aggregating Nested Transformers } ,
    author  = { Zizhao Zhang and Han Zhang and Long Zhao and Ting Chen and Tomas Pfister } ,
    year    = { 2021 } ,
    eprint  = { 2105.12723 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { chen2021regionvit ,
    title   = { RegionViT: Regional-to-Local Attention for Vision Transformers } , 
    author  = { Chun-Fu Chen and Rameswar Panda and Quanfu Fan } ,
    year    = { 2021 } ,
    eprint  = { 2106.02689 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { wang2021crossformer ,
    title   = { CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention } , 
    author  = { Wenxiao Wang and Lu Yao and Long Chen and Binbin Lin and Deng Cai and Xiaofei He and Wei Liu } ,
    year    = { 2021 } ,
    eprint  = { 2108.00154 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { caron2021emerging ,
    title   = { Emerging Properties in Self-Supervised Vision Transformers } ,
    author  = { Mathilde Caron and Hugo Touvron and Ishan Misra and Hervé Jégou and Julien Mairal and Piotr Bojanowski and Armand Joulin } ,
    year    = { 2021 } ,
    eprint  = { 2104.14294 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { he2021masked ,
    title   = { Masked Autoencoders Are Scalable Vision Learners } , 
    author  = { Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick } ,
    year    = { 2021 } ,
    eprint  = { 2111.06377 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { xie2021simmim ,
    title   = { SimMIM: A Simple Framework for Masked Image Modeling } , 
    author  = { Zhenda Xie and Zheng Zhang and Yue Cao and Yutong Lin and Jianmin Bao and Zhuliang Yao and Qi Dai and Han Hu } ,
    year    = { 2021 } ,
    eprint  = { 2111.09886 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { fayyaz2021ats ,
    title   = { ATS: Adaptive Token Sampling For Efficient Vision Transformers } ,
    author  = { Mohsen Fayyaz and Soroush Abbasi Kouhpayegani and Farnoush Rezaei Jafari and Eric Sommerlade and Hamid Reza Vaezi Joze and Hamed Pirsiavash and Juergen Gall } ,
    year    = { 2021 } ,
    eprint  = { 2111.15667 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { mehta2021mobilevit ,
    title   = { MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer } ,
    author  = { Sachin Mehta and Mohammad Rastegari } ,
    year    = { 2021 } ,
    eprint  = { 2110.02178 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { lee2021vision ,
    title   = { Vision Transformer for Small-Size Datasets } , 
    author  = { Seung Hoon Lee and Seunghyun Lee and Byung Cheol Song } ,
    year    = { 2021 } ,
    eprint  = { 2112.13492 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { renggli2022learning ,
    title   = { Learning to Merge Tokens in Vision Transformers } ,
    author  = { Cedric Renggli and André Susano Pinto and Neil Houlsby and Basil Mustafa and Joan Puigcerver and Carlos Riquelme } ,
    year    = { 2022 } ,
    eprint  = { 2202.12015 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { yang2022scalablevit ,
    title   = { ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer } , 
    author  = { Rui Yang and Hailong Ma and Jie Wu and Yansong Tang and Xuefeng Xiao and Min Zheng and Xiu Li } ,
    year    = { 2022 } ,
    eprint  = { 2203.10790 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @inproceedings { Touvron2022ThreeTE ,
    title   = { Three things everyone should know about Vision Transformers } ,
    author  = { Hugo Touvron and Matthieu Cord and Alaaeldin El-Nouby and Jakob Verbeek and Herv{\'e} J{\'e}gou } ,
    year    = { 2022 }
}

 @inproceedings { Sandler2022FinetuningIT ,
    title   = { Fine-tuning Image Transformers using Learnable Memory } ,
    author  = { Mark Sandler and Andrey Zhmoginov and Max Vladymyrov and Andrew Jackson } ,
    year    = { 2022 }
}

 @inproceedings { Li2022SepViTSV ,
    title   = { SepViT: Separable Vision Transformer } ,
    author  = { Wei Li and Xing Wang and Xin Xia and Jie Wu and Xuefeng Xiao and Minghang Zheng and Shiping Wen } ,
    year    = { 2022 }
}

 @inproceedings { Tu2022MaxViTMV ,
    title   = { MaxViT: Multi-Axis Vision Transformer } ,
    author  = { Zhengzhong Tu and Hossein Talebi and Han Zhang and Feng Yang and Peyman Milanfar and Alan Conrad Bovik and Yinxiao Li } ,
    year    = { 2022 }
}

 @article { Li2021EfficientSV ,
    title   = { Efficient Self-supervised Vision Transformers for Representation Learning } ,
    author  = { Chunyuan Li and Jianwei Yang and Pengchuan Zhang and Mei Gao and Bin Xiao and Xiyang Dai and Lu Yuan and Jianfeng Gao } ,
    journal = { ArXiv } ,
    year    = { 2021 } ,
    volume  = { abs/2106.09785 }
}

 @misc { Beyer2022BetterPlainViT ,
    title     = { Better plain ViT baselines for ImageNet-1k } ,
    author    = { Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander } ,
    publisher = { arXiv } ,
    year      = { 2022 }
}

 @article { Arnab2021ViViTAV ,
    title   = { ViViT: A Video Vision Transformer } ,
    author  = { Anurag Arnab and Mostafa Dehghani and Georg Heigold and Chen Sun and Mario Lucic and Cordelia Schmid } ,
    journal = { 2021 IEEE/CVF International Conference on Computer Vision (ICCV) } ,
    year    = { 2021 } ,
    pages   = { 6816-6826 }
}

 @article { Liu2022PatchDropoutEV ,
    title   = { PatchDropout: Economizing Vision Transformers Using Patch Dropout } ,
    author  = { Yue Liu and Christos Matsoukas and Fredrik Strand and Hossein Azizpour and Kevin Smith } ,
    journal = { ArXiv } ,
    year    = { 2022 } ,
    volume  = { abs/2208.07220 }
}

 @misc { https://doi.org/10.48550/arxiv.2302.01327 ,
    doi     = { 10.48550/ARXIV.2302.01327 } ,
    url     = { https://arxiv.org/abs/2302.01327 } ,
    author  = { Kumar, Manoj and Dehghani, Mostafa and Houlsby, Neil } ,
    title   = { Dual PatchNorm } ,
    publisher = { arXiv } ,
    year    = { 2023 } ,
    copyright = { Creative Commons Attribution 4.0 International }
}

 @inproceedings { Dehghani2023PatchNP ,
    title   = { Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution } ,
    author  = { Mostafa Dehghani and Basil Mustafa and Josip Djolonga and Jonathan Heek and Matthias Minderer and Mathilde Caron and Andreas Steiner and Joan Puigcerver and Robert Geirhos and Ibrahim M. Alabdulmohsin and Avital Oliver and Piotr Padlewski and Alexey A. Gritsenko and Mario Lu{\v{c}}i{\'c} and Neil Houlsby } ,
    year    = { 2023 }
}

 @misc { vaswani2017attention ,
    title   = { Attention Is All You Need } ,
    author  = { Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin } ,
    year    = { 2017 } ,
    eprint  = { 1706.03762 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CL }
}

 @inproceedings { dao2022flashattention ,
    title   = { Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness } ,
    author  = { Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher } ,
    booktitle = { Advances in Neural Information Processing Systems } ,
    year    = { 2022 }
}

 @inproceedings { Darcet2023VisionTN ,
    title   = { Vision Transformers Need Registers } ,
    author  = { Timoth{\'e}e Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski } ,
    year    = { 2023 } ,
    url     = { https://api.semanticscholar.org/CorpusID:263134283 }
}

 @inproceedings { ElNouby2021XCiTCI ,
    title   = { XCiT: Cross-Covariance Image Transformers } ,
    author  = { Alaaeldin El-Nouby and Hugo Touvron and Mathilde Caron and Piotr Bojanowski and Matthijs Douze and Armand Joulin and Ivan Laptev and Natalia Neverova and Gabriel Synnaeve and Jakob Verbeek and Herv{\'e} J{\'e}gou } ,
    booktitle = { Neural Information Processing Systems } ,
    year    = { 2021 } ,
    url     = { https://api.semanticscholar.org/CorpusID:235458262 }
}

 @inproceedings { Koner2024LookupViTCV ,
    title   = { LookupViT: Compressing visual information to a limited number of tokens } ,
    author  = { Rajat Koner and Gagan Jain and Prateek Jain and Volker Tresp and Sujoy Paul } ,
    year    = { 2024 } ,
    url     = { https://api.semanticscholar.org/CorpusID:271244592 }
}

 @article { Bao2022AllAW ,
    title   = { All are Worth Words: A ViT Backbone for Diffusion Models } ,
    author  = { Fan Bao and Shen Nie and Kaiwen Xue and Yue Cao and Chongxuan Li and Hang Su and Jun Zhu } ,
    journal = { 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) } ,
    year    = { 2022 } ,
    pages   = { 22669-22679 } ,
    url     = { https://api.semanticscholar.org/CorpusID:253581703 }
}

 @misc { Rubin2024 ,
    author  = { Ohad Rubin } ,
    url     = { https://medium.com/@ohadrubin/exploring-weight-decay-in-layer-normalization-challenges-and-a-reparameterization-solution-ad4d12c24950 }
}

 @inproceedings { Loshchilov2024nGPTNT ,
    title   = { nGPT: Normalized Transformer with Representation Learning on the Hypersphere } ,
    author  = { Ilya Loshchilov and Cheng-Ping Hsieh and Simeng Sun and Boris Ginsburg } ,
    year    = { 2024 } ,
    url     = { https://api.semanticscholar.org/CorpusID:273026160 }
}

 @inproceedings { Liu2017DeepHL ,
    title   = { Deep Hyperspherical Learning } ,
    author  = { Weiyang Liu and Yanming Zhang and Xingguo Li and Zhen Liu and Bo Dai and Tuo Zhao and Le Song } ,
    booktitle = { Neural Information Processing Systems } ,
    year    = { 2017 } ,
    url     = { https://api.semanticscholar.org/CorpusID:5104558 }
}

 @inproceedings { Zhou2024ValueRL ,
    title   = { Value Residual Learning For Alleviating Attention Concentration In Transformers } ,
    author  = { Zhanchao Zhou and Tianyi Wu and Zhiyun Jiang and Zhenzhong Lan } ,
    year    = { 2024 } ,
    url     = { https://api.semanticscholar.org/CorpusID:273532030 }
}

 @article { Zhu2024HyperConnections ,
    title   = { Hyper-Connections } ,
    author  = { Defa Zhu and Hongzhi Huang and Zihao Huang and Yutao Zeng and Yunyao Mao and Banggu Wu and Qiyang Min and Xun Zhou } ,
    journal = { ArXiv } ,
    year    = { 2024 } ,
    volume  = { abs/2409.19606 } ,
    url     = { https://api.semanticscholar.org/CorpusID:272987528 }
}