Performance Results

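Here we present the results of our recent submission to MLPerf Training v1.1, along with results from our own benchmarking across a wider range of models for both training and inference.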

MLPerf Training v1.1 Performance

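For our submission to MLPerf Training v1.1, we chose the popular application benchmark categories of image classification (ResNet-50) and natural language processing (BERT).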

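Submissions fall into two divisions. The Closed Division requires submitters to use exactly the same model and optimizer implementation, including the defined hyperparameter state and training epochs. The Open Division fosters innovation by allowing different model implementations that are better suited to the capabilities of different processors, while still requiring exactly the same model accuracy and quality as the Closed Division.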

| Division | Model | MLPerf Quality Target | Platform | SDK Version | Framework | MLPerf ID | Dataset | Precision | Time to Train (mins) |
|---|---|---|---|---|---|---|---|---|---|
| Closed | ResNet50 v1.5 | 75.90% classification | IPU-POD16 | SDK 2.3.0 | TensorFlow | 1.1-2040 | ImageNet2012 | 16.16 | 28.33 |
| Closed | ResNet50 v1.5 | 75.90% classification | IPU-POD64 | SDK 2.3.0 | TensorFlow | 1.1-2042 | ImageNet2012 | 16.16 | 8.50 |
| Closed | ResNet50 v1.5 | 75.90% classification | IPU-POD128 | SDK 2.3.0 | TensorFlow | 1.1-2044 | ImageNet2012 | 16.16 | 5.67 |
| Closed | ResNet50 v1.5 | 75.90% classification | IPU-POD256 | SDK 2.3.0 | TensorFlow | 1.1-2045 | ImageNet2012 | 16.16 | 3.79 |
| Closed | BERT | 0.72 Mask-LM accuracy | IPU-POD16 | SDK 2.3.0 | PopART | 1.1-2039 | Wikipedia | 16.16 | 32.70 |
| Closed | BERT | 0.72 Mask-LM accuracy | IPU-POD64 | SDK 2.3.0 | PopART | 1.1-2041 | Wikipedia | 16.16 | 10.56 |
| Closed | BERT | 0.72 Mask-LM accuracy | IPU-POD128 | SDK 2.3.0 | PopART | 1.1-2043 | Wikipedia | 16.16 | 6.86 |
| Open | BERT | 0.72 Mask-LM accuracy | IPU-POD16 | SDK 2.3.0 | PopART | 1.1-2088 | Wikipedia | 16.16 | 26.05 |
| Open | BERT | 0.72 Mask-LM accuracy | IPU-POD64 | SDK 2.3.0 | PopART | 1.1-2089 | Wikipedia | 16.16 | 8.25 |
| Open | BERT | 0.72 Mask-LM accuracy | IPU-POD128 | SDK 2.3.0 | PopART | 1.1-2087 | Wikipedia | 16.16 | 5.88 |

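The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries.
All rights reserved. Unauthorized use is strictly prohibited. See www.mlperf.org for more information.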
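
One way to read the time-to-train figures above is to ask how close each larger system comes to ideal linear scaling. The sketch below is an illustration only: it assumes that the number in each platform name (IPU-POD16, IPU-POD64) is the IPU count, and the helper `scaling_efficiency` is a hypothetical name, not part of any MLPerf or Graphcore tooling.

```python
def scaling_efficiency(base_minutes, base_ipus, scaled_minutes, scaled_ipus):
    """Return (speedup, efficiency) of a larger system relative to a baseline.

    speedup    = baseline time / scaled time
    efficiency = speedup / ideal speedup, where the ideal speedup is the
                 ratio of IPU counts (assumed from the platform names).
    """
    if min(base_minutes, scaled_minutes) <= 0 or min(base_ipus, scaled_ipus) <= 0:
        raise ValueError("times and IPU counts must be positive")
    speedup = base_minutes / scaled_minutes
    ideal_speedup = scaled_ipus / base_ipus
    return speedup, speedup / ideal_speedup


# Closed Division ResNet50 v1.5: IPU-POD16 (28.33 min) vs IPU-POD64 (8.50 min).
speedup, efficiency = scaling_efficiency(28.33, 16, 8.50, 64)
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```

Time to train also includes fixed costs that do not shrink with system size, so a figure below 100% is expected and should not be read as a defect on its own.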

Training: Throughput

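Training a machine learning model involves running an algorithm over an input dataset (the training data) until the model converges, meaning that it has learned to produce the desired output to a specified accuracy. In this context, throughput is defined as the number of input data points (sequences, images, or rows) that the model processes per second. Throughput is often used as a measure of hardware performance because it is directly related to the time needed to train a model to a specified accuracy.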
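
The definition above can be written as a small calculation. This is a minimal sketch, not the measurement code used for these benchmarks: the function names are hypothetical, and the ImageNet2012 training-set size (1,281,167 images) is supplied here as an outside assumption rather than taken from this page.

```python
def throughput(items_processed: int, elapsed_seconds: float) -> float:
    """Items (sequences, images, or rows) processed per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return items_processed / elapsed_seconds


def seconds_per_epoch(dataset_size: int, items_per_second: float) -> float:
    """Time for one pass over the dataset at a steady throughput."""
    if items_per_second <= 0:
        raise ValueError("items_per_second must be positive")
    return dataset_size / items_per_second


# Example: 1,000 images processed in 2 seconds.
print(throughput(1_000, 2.0))

# Example: one ImageNet2012 epoch at 29565 items/sec, the ResNet-50 v1.5
# TensorFlow figure reported for IPU-POD16 on this page.
IMAGENET_TRAIN_IMAGES = 1_281_167  # assumption: standard ILSVRC-2012 train split
print(round(seconds_per_epoch(IMAGENET_TRAIN_IMAGES, 29565), 2))
```

Multiplying the per-epoch time by the number of epochs needed to converge gives only a lower bound on time to train, since evaluation, checkpointing, and start-up are not counted.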

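The results below detail the throughput obtained for each reference model in the specified configuration. All configurations running on real data were verified for convergence.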

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) |
|---|---|---|---|---|---|---|---|---|
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.3.0 | PopART | Wikipedia | 65,536 | 16.16 | 3713 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.3.0 | TensorFlow | Wikipedia | 65,600 | 16.16 | 3664 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 3562 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.3.0 | PopART | Wikipedia | 65,536 | 16.16 | 14122 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.3.0 | TensorFlow | Wikipedia | 66,560 | 16.16 | 12903 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 12528 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.3.0 | PopART | Wikipedia | 65,536 | 16.16 | 24850 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.3.0 | TensorFlow | Wikipedia | 66,560 | 16.16 | 25097 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.3.0 | PyTorch | Wikipedia | | 16.16 | 22971 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.3.0 | PopART | Wikipedia | 16,384 | 16.16 | 1055 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.3.0 | TensorFlow | Wikipedia | 16,400 | 16.16 | 1012 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.3.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 984 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.3.0 | PopART | Wikipedia | 16,384 | 16.16 | 3989.9 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.3.0 | TensorFlow | Wikipedia | 16,640 | 16.16 | 3539 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.3.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 3623 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.3.0 | PopART | Wikipedia | 16,384 | 16.16 | 7082 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.3.0 | TensorFlow | Wikipedia | 16,640 | 16.16 | 7366 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.3.0 | PyTorch | Wikipedia | | 16.16 | 6579 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.3.0 | PopART | SQuAD | 256 | 16.16 | 851 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.3.0 | PyTorch | SQuAD | 256 | 16.16 | 731 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.3.0 | PopART | Wikipedia | 65,536 | 16.16 | 11763 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.3.0 | TensorFlow | Wikipedia | 65,280 | 16.16 | 11549 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 11116 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.3.0 | PopART | Wikipedia | 16,384 | 16.16 | 3491 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.3.0 | TensorFlow | Wikipedia | 16,320 | 16.16 | 3250 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.3.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 3265 |
| ResNet-50 v1.5 | | IPU-M2000 | SDK 2.3.0 | TensorFlow | ImageNet2012 | 1,920 | 16.16 | 7738 |
| ResNet-50 v1.5 | | IPU-M2000 | SDK 2.3.0 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 7316 |
| ResNet-50 v1.5 | | IPU-POD16 | SDK 2.3.0 | TensorFlow | ImageNet2012 | 1,920 | 16.16 | 29565 |
| ResNet-50 v1.5 | | IPU-POD16 | SDK 2.3.0 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 25382 |
| ResNet-50 v1.5 | | IPU-POD64 | SDK 2.3.0 | TensorFlow | ImageNet2012 | 2,560 | 16.16 | 102320 |
| ResNet-50 v1.5 | | IPU-POD128 | SDK 2.3.0 | TensorFlow | ImageNet2012 | | 16.16 | 186553 |
| ResNet-50 v1.5 | | IPU-POD256 | SDK 2.3.0 | TensorFlow | ImageNet2012 | | 16.16 | 355021 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD16 | SDK 2.3.0 | TensorFlow | ImageNet2012 | 800 | 16.32 | 5369 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD16 | SDK 2.3.0 | PyTorch | ImageNet2012 | 1,024 | 16.32 | 4334 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD64 | SDK 2.3.0 | TensorFlow | ImageNet2012 | 3,200 | 16.32 | 20951 |
| DeepVoice3 | | IPU-M2000 | SDK 2.3.0 | PopART | VCTK Corpus | 128 | 32.32 | 8371 |

Training: Time to Result

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Time To Result (secs) |
|---|---|---|---|---|---|---|---|---|
| MCMC TFP | | IPU-M2000 | SDK 2.3.0 | TensorFlow | Proprietary | | 32.32 | 52.2 |

Inference

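Model inference in this context means running a model on input data to infer an output. Inference performance in production settings is typically measured by two metrics: throughput (as defined above) and latency, defined as the time taken to execute an inference.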
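
The two metrics are related but not interchangeable. As a rough sanity check, the sketch below computes the throughput implied by a batch size and a latency, under the assumption that exactly one batch is in flight at a time. That assumption is ours, not something this page states, and the helper name `implied_throughput` is hypothetical.

```python
def implied_throughput(batch_size: int, latency_ms: float) -> float:
    """Items per second if one batch of `batch_size` completes every `latency_ms`.

    Assumes a single batch in flight with no overlap between batches; real
    deployments may pipeline or replicate work, so reported throughput can
    differ from this value in either direction.
    """
    if batch_size <= 0 or latency_ms <= 0:
        raise ValueError("batch_size and latency_ms must be positive")
    return batch_size / (latency_ms / 1000.0)


# BERT-Large SL128 at batch size 4: this page reports 1.95 ms latency
# and 2024 items/sec.
estimate = implied_throughput(4, 1.95)
print(f"implied: {estimate:.0f} items/sec (reported: 2024)")
```

For this row the estimate lands close to the reported figure, but other rows in the inference results diverge from this simple model, so it should not be used to fill in latencies that the page leaves blank.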

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) | Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|
| BERT-Large | SL128 | IPU-M2000 | SDK 2.3.0 | PopART | SQuAD | 4 | 16.16 | 2024 | 1.95 |
| BERT-Large | SL128 | IPU-M2000 | SDK 2.3.0 | PopART | SQuAD | 8 | 16.16 | 2840 | 2.78 |
| BERT-Large | SL128 | IPU-M2000 | SDK 2.3.0 | PopART | SQuAD | 12 | 16.16 | 3281 | 3.64 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.3.0 | PopART | SQuAD | 4 | 16.16 | 4358 | 0.89 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.3.0 | PopART | SQuAD | 8 | 16.16 | 6882 | 1.14 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.3.0 | PopART | SQuAD | 16 | 16.16 | 9509 | 1.66 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.3.0 | PopART | SQuAD | 32 | 16.16 | 12387 | 2.56 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.3.0 | PopART | SQuAD | 64 | 16.16 | 15211 | 4.2 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.3.0 | PopART | SQuAD | 128 | 16.16 | 17854 | 7.17 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.3.0 | PopART | SQuAD | 256 | 16.16 | 19346 | 13.22 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.3.0 | PopART | SQuAD | 320 | 16.16 | 20653 | 15.48 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 4 | 16.16 | 6816 | |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 16 | 16.16 | 16652 | |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 64 | 16.16 | 28695 | |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 128 | 16.16 | 34700 | |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 256 | 16.16 | 39359 | |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 320 | 16.16 | 40341 | |
| ResNet-50 | v1.5, lowest latency config | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 6756 | 0.55 |
| ResNet-50 | v1.5, higher throughput config | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 7669 | 0.95 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 11951 | 1.22 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 17550 | 1.66 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 29570 | 3.91 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 128 | 16.16 | 35388 | 6.44 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 256 | 16.16 | 39260 | 11.46 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 360 | 16.16 | 40891 | 15.27 |
| ResNeXt101 | | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 4 | 16.16 | 3991 | |
| ResNeXt101 | | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 8 | 16.16 | 6140 | |
| ResNeXt101 | | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 16 | 16.16 | 9324 | |
| ResNeXt101 | | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 32 | 16.16 | 13227 | |
| ResNeXt101 | | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 64 | 16.16 | 16808 | |
| ResNeXt101 | | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 3295 | 1.17 |
| EfficientNet-B0 | lowest latency config | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 7697 | 0.47 |
| EfficientNet-B0 | higher throughput config | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 9157 | 0.78 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 37525 | 1.45 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 50009 | 2.12 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 128 | 16.16 | 56009 | 3.61 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 144 | 16.16 | 55551 | 4.05 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 196 | 16.16 | 62085 | 5.2 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 4 | 16.16 | 9130 | |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 32 | 16.16 | 39088 | |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 64 | 16.16 | 46755 | |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.3.0 | TensorFlow | Synthetic (host-generated) | 128 | 16.16 | 53994 | |
| EfficientNet-B4 | lowest latency config | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 3402 | 1.13 |
| EfficientNet-B4 | higher throughput config | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 3873 | 1.91 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 8186 | 3.57 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 24 | 16.16 | 9802 | 4.43 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 10752 | 5.35 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 40 | 16.16 | 11308 | 6.29 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.3.0 | PyTorch | Synthetic (host-generated) | 48 | 16.16 | 12091 | 7.08 |

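Precision terminology: X.Y is defined as follows. X is the precision used to store activations and gradients, and Y is the precision used to store weights. When training with 16.16 weights, we may still use FP32 for other variables (such as norms or momentum), and we include stochastic rounding.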
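
For readers who process these tables programmatically, the X.Y labels can be split into their two parts. This is a minimal sketch; the `Precision` type and `parse_precision` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Precision(NamedTuple):
    activation_bits: int  # X: precision for storing activations and gradients
    weight_bits: int      # Y: precision for storing weights


def parse_precision(label: str) -> Precision:
    """Split a label such as '16.32' into its X and Y bit widths."""
    parts = label.strip().split(".")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a label of the form X.Y, got {label!r}")
    return Precision(int(parts[0]), int(parts[1]))


# EfficientNet-B4 training rows on this page use 16.32:
# 16-bit activations and gradients, 32-bit weights.
p = parse_precision("16.32")
print(p.activation_bits, p.weight_bits)
```

Keeping the label as a string until it is parsed avoids treating it as a decimal number, which would make 16.16 and 16.160 indistinguishable.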

Benchmarks were generated using our examples on the Graphcore GitHub.

This page was last updated on December 1, 2021.
