Performance Results
This page provides initial performance results for the new Bow Pod platforms, our MLPerf Training v1.1 submission results, and our own benchmarks across a broader range of models for both training and inference.
Bow Platforms - Training
This section provides initial training performance results for the new Bow Pod platforms. Throughput here is defined as the number of input data points (sequences, images or rows) processed by the model per second.
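As a rough, framework-agnostic illustration of how a figure like this is obtained, throughput can be estimated by timing a training loop after a warmup period. In the sketch below, `train_step` and the batch iterator are hypothetical placeholders, not anything from the Graphcore SDK:

```python
import time

def measure_throughput(train_step, batches, warmup=10):
    """Rough training-throughput estimate in items/sec.

    train_step : callable running one training step on a batch (hypothetical placeholder)
    batches    : iterable of (batch, batch_size) pairs
    warmup     : initial steps excluded so graph compilation/caching is not timed
    """
    items = 0
    start = None
    for i, (batch, batch_size) in enumerate(batches):
        train_step(batch)
        if i == warmup:
            start = time.perf_counter()   # start the clock once warmup is done
        elif i > warmup:
            items += batch_size           # only count items processed while timing
    return items / (time.perf_counter() - start)
```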
The results below detail the throughput obtained for each referenced model in the specified configuration.
Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) |
---|---|---|---|---|---|---|---|---|
ResNet-50 v1.5 | | Bow Pod16 | Pre-SDK2.5 | TensorFlow1 | ImageNet2012 | 1,920 | 16.16 | 42029
ResNet-50 v1.5 | | Bow Pod64 | Pre-SDK2.5 | TensorFlow1 | ImageNet2012 | 2,560 | 16.16 | 145287
ResNet-50 v1.5 | | Bow Pod256 | Pre-SDK2.5 | TensorFlow1 | ImageNet2012 | 10,240 | 16.16 | 425514
EfficientNet-B4 | G16-EfficientNet | Bow Pod16 | Pre-SDK2.5 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 8936 |
EfficientNet-B4 | G16-EfficientNet | Bow Pod64 | Pre-SDK2.5 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 34943 |
EfficientNet-B4 | G16-EfficientNet | Bow Pod256 | Pre-SDK2.5 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 117641 |
ResNeXt101 | | Bow Pod16 | Pre-SDK2.5 | TensorFlow1 | ImageNet2012 | 768 | 16.16 | 12221
ViT | Vision Transformer | Bow Pod64 | Pre-SDK2.5 | PyTorch | ImageNet1k | 65,536 | 16.16 | 31200 |
Mini DALL-E | | Bow Pod16 | Pre-SDK2.5 | PyTorch | COCO 2017 | 6,144 | 16.16 | 1855
GraphSage | | Bow Pod16 | Pre-SDK2.5 | TensorFlow2 | COCO 2017 | | 16.16 | 1.95s epoch time
BERT Large | Ph1 Pre-Training (SL128) | Bow Pod16 | Pre-SDK2.5 | PopART | Wikipedia | 65,536 | 16.16 | 5179 |
BERT Large | Ph1 Pre-Training (SL128) | Bow Pod16 | Pre-SDK2.5 | TensorFlow1 | Wikipedia | 65,600 | 16.16 | 5125 |
BERT Large | Ph1 Pre-Training (SL128) | Bow Pod64 | Pre-SDK2.5 | PopART | Wikipedia | 65,536 | 16.16 | 19353 |
BERT Large | Ph1 Pre-Training (SL128) | Bow Pod64 | Pre-SDK2.5 | TensorFlow1 | Wikipedia | 66,560 | 16.16 | 18907 |
BERT Large | Ph2 Pre-Training (SL384) | Bow Pod16 | Pre-SDK2.5 | PopART | Wikipedia | 16,384 | 16.16 | 1470 |
BERT Large | Ph2 Pre-Training (SL384) | Bow Pod16 | Pre-SDK2.5 | TensorFlow1 | Wikipedia | 16,400 | 16.16 | 1420 |
BERT Large | Ph2 Pre-Training (SL384) | Bow Pod64 | Pre-SDK2.5 | PopART | Wikipedia | 16,384 | 16.16 | 5444 |
BERT Large | Ph2 Pre-Training (SL384) | Bow Pod64 | Pre-SDK2.5 | TensorFlow1 | Wikipedia | 16,400 | 16.16 | 5340 |
BERT Base | Ph1 Pre-Training (SL128) | Bow Pod16 | Pre-SDK2.5 | PopART | Wikipedia | 65,536 | 16.16 | 16508 |
GPT2 | GPT2-Large | Bow Pod64 | Pre-SDK2.5 | PyTorch | Wikipedia | | | 1316
GPT2 | GPT2-Medium (SL1024) | Bow Pod64 | Pre-SDK2.5 | PyTorch | Wikipedia | 65,536 | 16.16 | 347 |
Conformer-Large | | Bow Pod64 | Pre-SDK2.5 | PyTorch | AiShell1 | | 16.16 | 8157
FastSpeech2 | | Bow Pod16 | Pre-SDK2.5 | TensorFlow2 | LJ Speech | 64 | 16.16 | 1569
Bow Platforms - Inference
Model inference here means running a trained model on input data to infer an output. Inference performance in real commercial applications is typically measured with two metrics: throughput (as defined above) and latency, which in this context is the time taken for the model to produce an output for a given input.
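As an informal illustration of how these two metrics differ, latency is measured per call rather than in aggregate. A minimal sketch, where `infer` and `requests` are hypothetical placeholders for a compiled inference call and a stream of inputs:

```python
import time
import statistics

def measure_latency(infer, requests, warmup=10):
    """Per-request latency in milliseconds (mean and 99th percentile).

    infer    : callable running the trained model on one input (hypothetical placeholder)
    requests : iterable of model inputs
    """
    latencies_ms = []
    for i, request in enumerate(requests):
        start = time.perf_counter()
        infer(request)
        if i >= warmup:  # ignore warmup/compilation calls
            latencies_ms.append((time.perf_counter() - start) * 1000)
    return statistics.mean(latencies_ms), statistics.quantiles(latencies_ms, n=100)[98]
```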
The results below show throughput and latency on the Bow-2000 platform at the specified batch sizes.
Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) | Latency (ms) |
---|---|---|---|---|---|---|---|---|---|
BERT-Large | SL128 | Bow-2000 | Pre-SDK2.5 | PopART | SQuAD | 4 | 16.16 | 2877 | 1.37 |
BERT-Large | SL128 | Bow-2000 | Pre-SDK2.5 | PopART | SQuAD | 16 | 16.16 | 5180 | 3.07 |
BERT-Base | SL128 | Bow-2000 | Pre-SDK2.5 | PopART | SQuAD | 4 | 16.16 | 6235 | 0.62 |
BERT-Base | SL128 | Bow-2000 | Pre-SDK2.5 | PopART | SQuAD | 320 | 16.16 | 28892 | 11.07 |
ResNet-50v1.5 | lowest latency config | Bow-2000 | Pre-SDK2.5 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 9570 | 0.4 |
ResNet-50v1.5 | higher throughput config | Bow-2000 | Pre-SDK2.5 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 12720 | 1.46 |
ResNet-50v1.5 | | Bow-2000 | Pre-SDK2.5 | PyTorch | Synthetic (host-generated) | 320 | 16.16 | 63774 | 22.79
EfficientNet-B0 | | Bow-2000 | Pre-SDK2.5 | PyTorch | Synthetic (host-generated) | 192 | 16.16 | 89513 | 9.16
EfficientNet-B4 | lowest latency config | Bow-2000 | Pre-SDK2.5 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 4210 | 0.91 |
EfficientNet-B4 | higher throughput config | Bow-2000 | Pre-SDK2.5 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 5094 | 1.46 |
EfficientNet-B4 | | Bow-2000 | Pre-SDK2.5 | PyTorch | Synthetic (host-generated) | 40 | 16.16 | 13807 | 4.96
ResNeXt101 | | Bow-2000 | Pre-SDK2.5 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 5948 | 2
ResNeXt101 | | Bow-2000 | Pre-SDK2.5 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 23195 | 8.26
Yolo v4 | image size 896 | Bow-2000 | Pre-SDK2.5 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 980 | 6.92 |
EfficientDet-D3 | | Bow-2000 | Pre-SDK2.5 | TF2 w/Keras | Synthetic (host-generated) | 4 | 16.16 | 1004 | 3.78
MLPerf Training v1.1 Performance
For our MLPerf Training v1.1 submission, we chose to submit to the popular application benchmark categories of image classification (ResNet-50) and natural language processing (BERT).
Submissions are made in two divisions. The Closed Division requires submitters to use exactly the same model and optimizer implementation, including defined hyperparameter state and training epochs. There is also an Open Division, which encourages innovation by allowing different model implementations better suited to different processor capabilities, while requiring exactly the same model accuracy and quality as the Closed Division.


Division | Model | MLPerf Quality Target | Platform | SDK Version | Framework | MLPerf ID | Dataset | Precision | Time to Train (mins) |
---|---|---|---|---|---|---|---|---|---|
Closed | ResNet50 v1.5 | 75.90% classification | IPU-POD16 | SDK 2.3.0 | TensorFlow | 1.1-2040 | ImageNet2012 | 16.16 | 28.33 |
Closed | ResNet50 v1.5 | 75.90% classification | IPU-POD64 | SDK 2.3.0 | TensorFlow | 1.1-2042 | ImageNet2012 | 16.16 | 8.50 |
Closed | ResNet50 v1.5 | 75.90% classification | IPU-POD128 | SDK 2.3.0 | TensorFlow | 1.1-2044 | ImageNet2012 | 16.16 | 5.67 |
Closed | ResNet50 v1.5 | 75.90% classification | IPU-POD256 | SDK 2.3.0 | TensorFlow | 1.1-2045 | ImageNet2012 | 16.16 | 3.79 |
Closed | BERT | 0.72 Mask-LM accuracy | IPU-POD16 | SDK 2.3.0 | PopART | 1.1-2039 | Wikipedia | 16.16 | 32.70 |
Closed | BERT | 0.72 Mask-LM accuracy | IPU-POD64 | SDK 2.3.0 | PopART | 1.1-2041 | Wikipedia | 16.16 | 10.56 |
Closed | BERT | 0.72 Mask-LM accuracy | IPU-POD128 | SDK 2.3.0 | PopART | 1.1-2043 | Wikipedia | 16.16 | 6.86 |
Open | BERT | 0.72 Mask-LM accuracy | IPU-POD16 | SDK 2.3.0 | PopART | 1.1-2088 | Wikipedia | 16.16 | 26.05 |
Open | BERT | 0.72 Mask-LM accuracy | IPU-POD64 | SDK 2.3.0 | PopART | 1.1-2089 | Wikipedia | 16.16 | 8.25 |
Open | BERT | 0.72 Mask-LM accuracy | IPU-POD128 | SDK 2.3.0 | PopART | 1.1-2087 | Wikipedia | 16.16 | 5.88 |
The MLPerf name and logo are trademarks of the MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlperf.org for more information.
IPU-POD Classic - Training
Training a machine learning model involves running an algorithm on an input dataset (the training data) until the model converges, meaning it has learned to produce the desired output to a specified accuracy. In this context, throughput is defined as the number of input data points (sequences, images or rows) processed by the model per second. Throughput is frequently used as a measure of hardware performance because it correlates directly with the time taken to train a model to a specified accuracy.
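To make the link between throughput and time-to-train concrete, a back-of-the-envelope estimate simply divides the total number of items processed by the measured throughput. The figures in the sketch below are illustrative assumptions, not measured results:

```python
def estimated_time_to_train_hours(dataset_size, epochs, throughput_items_per_sec):
    """Back-of-the-envelope estimate: total items processed divided by throughput."""
    return dataset_size * epochs / throughput_items_per_sec / 3600

# Illustrative numbers only: the ImageNet-2012 train split (~1.28M images),
# an assumed 40 epochs to convergence, and a throughput of 100,000 images/sec.
print(estimated_time_to_train_hours(1_281_167, 40, 100_000))  # ~0.14 hours
```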
The results below detail the throughput values obtained for each reference model in the specified configuration. All configurations run on real data have been verified for convergence.
Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) |
---|---|---|---|---|---|---|---|---|
BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 3738 |
BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 65,600 | 16.16 | 3704 |
BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 3582 |
BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 14189 |
BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 66,560 | 16.16 | 13917 |
BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 12251 |
BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 24424 |
BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 66,560 | 16.16 | 24900 |
BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 22402 |
BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 1063 |
BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,400 | 16.16 | 1025 |
BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 1012 |
BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 4003 |
BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,640 | 16.16 | 3938 |
BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 3611 |
BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 7127 |
BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,640 | 16.16 | 7292 |
BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 6500 |
BERT Large | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.4.0 | PopART | SQuAD | 256 | 16.16 | 884 |
BERT Large | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.4.0 | PyTorch | SQuAD | 256 | 16.16 | 744 |
BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 11991 |
BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 65,280 | 16.16 | 11647 |
BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | Wikipedia | 65,280 | 16.16 | 11035 |
BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 11184 |
BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 3545 |
BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,320 | 16.16 | 3288 |
BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | Wikipedia | 16,320 | 16.16 | 3155 |
BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 3334 |
BERT Base - HuggingFace | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | SQuAD | 320 | 16.16 | 375 |
GPT2 | GPT2-medium | IPU-POD16 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 2540 |
GPT2 | GPT2-medium | IPU-POD64 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 9870 |
GPT2 | GPT2-medium | IPU-POD128 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 18842 |
GPT2 | GPT2-medium | IPU-POD256 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 31025 |
ResNet-50 v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 1,920 | 16.16 | 7864
ResNet-50 v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 7303
ResNet-50 v1.5 | | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 1,920 | 16.16 | 30690
ResNet-50 v1.5 | | IPU-POD16 | SDK 2.4.0 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 25534
ResNet-50 v1.5 | | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 2,560 | 16.16 | 108566
ResNet-50 v1.5 | | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 5,120 | 16.16 | 205006
ResNet-50 v1.5 | | IPU-POD256 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 10,240 | 16.16 | 365040
ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 768 | 16.16 | 2514
ResNeXt101 | | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 768 | 16.16 | 9023
EfficientNet-B4 | G16-EfficientNet | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 800 | 16.16 | 1618 |
EfficientNet-B4 | G16-EfficientNet | IPU-M2000 | SDK 2.4.0 | PyTorch | ImageNet2012 | 1,024 | 16.32 | 1400 |
EfficientNet-B4 | G16-EfficientNet | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 6379 |
EfficientNet-B4 | G16-EfficientNet | IPU-POD16 | SDK 2.4.0 | PyTorch | ImageNet2012 | 1,024 | 16.32 | 4311 |
EfficientNet-B4 | G16-EfficientNet | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 24946 |
EfficientNet-B4 | G16-EfficientNet | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 48015 |
EfficientNet-B4 | G16-EfficientNet | IPU-POD256 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 87968 |
ViT | Vision Transformer | IPU-POD16 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 6535 |
ViT | Vision Transformer | IPU-POD64 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 25080 |
ViT | Vision Transformer | IPU-POD128 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 46320 |
ViT | Vision Transformer | IPU-POD256 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 68800 |
UNet (Medical) | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | EM segmentation | 24 | 16.16 | 139
Mini DALL-E | | IPU-M2000 | SDK 2.4.0 | PyTorch | COCO 2017 | 1,536 | 16.16 | 319
Mini DALL-E | | IPU-POD16 | SDK 2.4.0 | PyTorch | COCO 2017 | 6,144 | 16.16 | 815
DeepVoice3 | | IPU-M2000 | SDK 2.4.0 | PopART | VCTK Corpus | 128 | 32.32 | 8496
FastSpeech2 | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | LJ Speech | 32 | 16.16 | 406
FastSpeech2 | | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | LJ Speech | 64 | 16.16 | 1141
Conformer | | IPU-M2000 | SDK 2.4.0 | PyTorch | AiShell1 | 96 | 16.16 | 1030
Conformer | | IPU-POD16 | SDK 2.4.0 | PyTorch | AiShell1 | 96 | 16.16 | 3395
TGN | Temporal Graph Network | GC200 IPU | SDK 2.4.0 | TensorFlow1 | JODIE Wikipedia | 200 | 16.32 | 190472 |
IPU-POD Classic (Time to Result)
Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Time To Result (secs) |
---|---|---|---|---|---|---|---|---|
MCMC TFP | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Proprietary | | 32.32 | 49
IPU-POD Classic - Inference
Model inference in this context means running a trained model on input data to infer an output. Inference performance in production settings is typically measured with two metrics: throughput (as defined above) and latency, defined as the time taken to perform an inference.
Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) | Latency (ms) |
---|---|---|---|---|---|---|---|---|---|
BERT-Large | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 4 | 16.16 | 2071 | 1.92 |
BERT-Large | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 8 | 16.16 | 2911 | 2.73 |
BERT-Large | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 12 | 16.16 | 3303 | 3.62 |
BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 4 | 16.16 | 4580 | 0.86 |
BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 8 | 16.16 | 7069 | 1.11 |
BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 16 | 16.16 | 9687 | 1.65 |
BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 32 | 16.16 | 12584 | 2.53 |
BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 64 | 16.16 | 15346 | 4.16 |
BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 128 | 16.16 | 17972 | 7.11 |
BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 256 | 16.16 | 19484 | 13.11 |
BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 320 | 16.16 | 20803 | 15.36 |
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 7152 | 1.66
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 10515 | 2.27
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 16207 | 2.95
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 32 | 16.16 | 22544 | 4.24
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 64 | 16.16 | 28762 | 6.66
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 128 | 16.16 | 35155 | 10.91
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 256 | 16.16 | 40085 | 19.14
ResNet-50v1.5 | lowest latency config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 7397 | 0.52 |
ResNet-50v1.5 | higher throughput config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 9404 | 2.04 |
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 14321 | 2.69
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 20927 | 3.7
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 36193 | 8.62
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 128 | 16.16 | 43472 | 14.38
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 256 | 16.16 | 49816 | 25.13
ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 360 | 16.16 | 50883 | 30.68
ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 4483 | 2.66
ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 6435 | 3.71
ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 9705 | 4.93
ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 32 | 16.16 | 13693 | 6.99
ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 64 | 16.16 | 17176 | 11.16
ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 3395 | 1.14
ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 4840 | 1.62
ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 6483 | 2.43
ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 11320 | 27.83
EfficientNet-B0 | lowest latency config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 8686 | 0.44 |
EfficientNet-B0 | higher throughput config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 10907 | 1.69 |
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 50510 | 3.05
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 71839 | 4.26
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 128 | 16.16 | 86986 | 6.77
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 144 | 16.16 | 69852 | 9.15
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 196 | 16.16 | 61714 | 13.38
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 8289 | 1.43
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 13056 | 1.82
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 22217 | 2.15
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 32 | 16.16 | 34448 | 2.77
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 64 | 16.16 | 43351 | 4.41
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 128 | 16.16 | 53256 | 7.19
EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 160 | 16.16 | 55169 | 8.68
EfficientNet-B4 | lowest latency config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 3539 | 1.09 |
EfficientNet-B4 | higher throughput config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 4081 | 1.85 |
EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 8299 | 3.5
EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 24 | 16.16 | 9874 | 4.37
EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 10753 | 5.3
EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 40 | 16.16 | 11578 | 6.22
EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 3718 | 3.21
EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 5514 | 4.34
EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 7959 | 6.01
EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 20 | 16.16 | 8958 | 6.68
EfficientNet-B7 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 1407 | 8.52
EfficientNet-B7 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 1869 | 12.82
Yolo v4 | image 896, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 690 | 9.4 |
Yolo v4 | image 896, bps 10, max det 300 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 722 | 9.74 |
Yolo v4 | image 640, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1306 | 10.03 |
Yolo v4 | image 640, bps 10, max det 300 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1364 | 10.39 |
Yolo v4 | image 512, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1772 | 7.25 |
Yolo v4 | image 512, bps 10, max det 300 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1915 | 7.31 |
Yolo v4 | image 416, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 2195 | 5.88 |
Yolo v4 | image 416, bps 10, max det 100 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 2994 | 9.42 |
Unet (Medical) | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | Synthetic (host-generated) | 4 | 16.16 | 1144 |
Unet (Medical) | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | Synthetic (host-generated) | 8 | 16.16 | 1190 |
Precision terminology: X.Y is defined as follows: X is the precision in which activations and gradients are stored, and Y is the precision in which the weights are stored. When training with 16.16 weights, we may still use FP32 for other variables (for example norms or momentum) and include stochastic rounding.
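For readers unfamiliar with stochastic rounding: instead of always rounding to the nearest representable value, a value is rounded up or down at random with probability proportional to its distance from the neighbouring grid points, so the rounding error is zero in expectation. A minimal numpy sketch on a uniform grid (actual FP16 rounding operates on a non-uniform floating-point grid and is performed in hardware):

```python
import numpy as np

def stochastic_round(x, step=2.0 ** -10, rng=np.random.default_rng(0)):
    """Round x onto a uniform grid of spacing `step`, randomly up or down,
    so that the expected value of the result equals x."""
    scaled = np.asarray(x, dtype=np.float64) / step
    lower = np.floor(scaled)
    frac = scaled - lower                          # distance to the lower grid point, in [0, 1)
    round_up = rng.random(size=scaled.shape) < frac
    return (lower + round_up) * step

# Averaged over many draws, the rounded values converge to the input value.
print(stochastic_round(np.full(100_000, 0.30049)).mean())  # ~0.30049
```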
The benchmarks were generated using our examples on the Graphcore GitHub.
This page was last updated on 3 March 2022.