Performance Results

Here we provide the results of our recent MLPerf Training v1.0 submission, together with results from our own benchmarking activities across a wider range of models for both training and inference.

MLPerf Training v1.0 Performance

For our first submission to MLPerf Training v1.0, we chose to submit to the popular application benchmark categories of image classification (ResNet-50) and natural language processing (BERT).

Submissions fall into two divisions. The Closed Division requires submitters to use exactly the same model and optimizer implementation, including defined hyperparameter state and training epochs. There is also an Open Division, which fosters and supports innovation by allowing different model implementations better suited to different processor capabilities, while ensuring that exactly the same model accuracy and quality as the Closed Division is reached.

[Charts: BERT MLPerf and ResNet-50 v1.5 MLPerf time to train]
| Division | Model | MLPerf Quality Target | Platform | SDK Version | Framework | MLPerf ID | Dataset | Precision | Time to Train (mins) |
|---|---|---|---|---|---|---|---|---|---|
| Closed | ResNet50 v1.5 | 75.90% classification | IPU-POD16 | SDK 2.1.0 | TensorFlow | 1.0-1026 | ImageNet2012 | 16.16 | 37.12 |
| Closed | ResNet50 v1.5 | 75.90% classification | IPU-POD64 | SDK 2.1.0 | TensorFlow | 1.0-1028 | ImageNet2012 | 16.16 | 14.48 |
| Closed | BERT | 0.712 Mask-LM accuracy | IPU-POD16 | SDK 2.1.0 | PopART | 1.0-1025 | Wikipedia | 16.16 | 34.49 |
| Closed | BERT | 0.712 Mask-LM accuracy | IPU-POD64 | SDK 2.1.0 | PopART | 1.0-1027 | Wikipedia | 16.16 | 11.96 |
| Open | BERT | 0.712 Mask-LM accuracy | IPU-POD16 | SDK 2.1.0 | PopART | 1.0-1098 | Wikipedia | 16.16 | 27.75 |
| Open | BERT | 0.712 Mask-LM accuracy | IPU-POD64 | SDK 2.1.0 | PopART | 1.0-1099 | Wikipedia | 16.16 | 9.39 |

The MLPerf name and logo are trademarks of the MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlperf.org for more information.

Training: Throughput

Training a machine learning model involves running an algorithm over an input dataset (the training data) until the model converges, meaning it has learned to produce the desired output to a specified accuracy. Throughput in this context is defined as the number of input data points (sequences, images or rows) processed by the model per second. Throughput is frequently used as a measure of hardware performance because it correlates directly with the time taken to train a model to a specified accuracy.
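
For reference, a throughput measurement of this kind reduces to timing a fixed number of batches and dividing the number of items processed by the elapsed time. The sketch below is purely illustrative and is not the harness used to produce these figures; `train_step`, `num_batches` and `batch_size` are hypothetical placeholders:

```python
import time

def measure_throughput(train_step, num_batches, batch_size):
    """Illustrative only: time `num_batches` training steps and
    report the number of items processed per second."""
    start = time.perf_counter()
    for _ in range(num_batches):
        train_step()  # placeholder for one framework-specific training step
    elapsed = time.perf_counter() - start
    return (num_batches * batch_size) / elapsed
```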

The results below detail the throughput values obtained for each reference model in the specified configuration. All configurations running on real data have been validated for convergence.

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) |
|---|---|---|---|---|---|---|---|---|
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.2.0 | PopART | Wikipedia | 65,536 | 16.16 | 3727 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.2.0 | TensorFlow | Wikipedia | 65,600 | 16.16 | 3447 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.2.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 3612 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.2.0 | PopART | Wikipedia | 65,536 | 16.16 | 14218 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.2.0 | TensorFlow | Wikipedia | 66,560 | 16.16 | 12902 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.2.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 12809 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.2.0 | PopART | Wikipedia | 16,384 | 16.16 | 1059 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.2.0 | TensorFlow | Wikipedia | 16,400 | 16.16 | 989 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.2.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 1003 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.2.0 | PopART | Wikipedia | 16,384 | 16.16 | 4009 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.2.0 | TensorFlow | Wikipedia | 16,640 | 16.16 | 3524 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.2.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 3702 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.2.0 | PopART | SQuAD | 256 | 16.16 | 876 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.2.0 | PyTorch | SQuAD | 256 | 16.16 | 762 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.2.0 | PopART | Wikipedia | 65,536 | 16.16 | 11930 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.2.0 | TensorFlow | Wikipedia | 65,280 | 16.16 | 11302 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.2.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 11135 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.2.0 | PopART | Wikipedia | 16,384 | 16.16 | 3451 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.2.0 | TensorFlow | Wikipedia | 16,320 | 16.16 | 3187 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.2.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 3311 |
| ResNet-50 v1.5 | | IPU-M2000 | SDK 2.2.0 | TensorFlow | ImageNet2012 | 1,920 | 16.16 | 7758 |
| ResNet-50 v1.5 | | IPU-M2000 | SDK 2.2.0 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 4578 |
| ResNet-50 v1.5 | | IPU-POD16 | SDK 2.2.0 | TensorFlow | ImageNet2012 | 1,920 | 16.16 | 29590 |
| ResNet-50 v1.5 | | IPU-POD64 | SDK 2.2.0 | TensorFlow | ImageNet2012 | 2,560 | 16.16 | 102119 |
| ResNeXt101 | | IPU-M2000 | SDK 2.2.0 | TensorFlow | ImageNet2012 | 768 | 16.16 | 2471 |
| ResNeXt101 | | IPU-POD16 | SDK 2.2.0 | TensorFlow | ImageNet2012 | 768 | 16.16 | 8841 |
| EfficientNet-B4 | G16-EfficientNet | IPU-M2000 | SDK 2.2.0 | TensorFlow | ImageNet2012 | 800 | 16.32 | 1485 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD16 | SDK 2.2.0 | TensorFlow | ImageNet2012 | 800 | 16.32 | 5352 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD64 | SDK 2.2.0 | TensorFlow | ImageNet2012 | 3,200 | 16.32 | 20918 |
| DeepVoice3 | | IPU-M2000 | SDK 2.2.0 | PopART | VCTK Corpus | 128 | 32.32 | 9049 |

Training: Time to Result

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Time To Result (secs) |
|---|---|---|---|---|---|---|---|---|
| MCMC | TFP | IPU-M2000 | SDK 2.2.0 | TensorFlow | Proprietary | | 32.32 | 51.88 |

Inference

Model inference in this context refers to running a model on input data to infer an output. Inference performance in production settings is typically measured by two metrics: throughput (as defined above) and latency, which is defined as the time taken to perform an inference.
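
As a purely illustrative sketch (not the harness used for these results), latency and throughput for a given batch size can be estimated by timing repeated inference calls; `run_batch` and `num_iters` below are hypothetical placeholders:

```python
import time

def measure_inference(run_batch, batch_size, num_iters=100):
    """Illustrative only: return (items/sec, mean latency in ms)
    for a single batch size."""
    latencies = []
    for _ in range(num_iters):
        start = time.perf_counter()
        run_batch()  # placeholder for one framework-specific inference call
        latencies.append(time.perf_counter() - start)
    mean_latency = sum(latencies) / len(latencies)
    return batch_size / mean_latency, mean_latency * 1000.0
```

As the table below shows, larger batch sizes generally increase throughput at the cost of higher latency.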

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) | Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|
| BERT-Large | SL128 | IPU-M2000 | SDK 2.2.0 | PopART | SQuAD | 4 | 16.16 | 2003 | 1.96 |
| BERT-Large | SL128 | IPU-M2000 | SDK 2.2.0 | PopART | SQuAD | 8 | 16.16 | 2866 | 2.76 |
| BERT-Large | SL128 | IPU-M2000 | SDK 2.2.0 | PopART | SQuAD | 12 | 16.16 | 3281 | 3.63 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.2.0 | PopART | SQuAD | 4 | 16.16 | 4284 | 0.9 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.2.0 | PopART | SQuAD | 8 | 16.16 | 7034 | 1.12 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.2.0 | PopART | SQuAD | 16 | 16.16 | 9527 | 1.66 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.2.0 | PopART | SQuAD | 32 | 16.16 | 12473 | 2.55 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.2.0 | PopART | SQuAD | 64 | 16.16 | 15234 | 4.19 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.2.0 | PopART | SQuAD | 128 | 16.16 | 17835 | 7.16 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.2.0 | PopART | SQuAD | 256 | 16.16 | 19283 | 13.28 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.2.0 | PopART | SQuAD | 320 | 16.16 | 20004 | 15.98 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 4 | 16.16 | 6948 | 0.58 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 16 | 16.16 | 16862 | 0.95 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 64 | 16.16 | 29215 | 2.19 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 128 | 16.16 | 34767 | 3.68 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 256 | 16.16 | 39025 | 6.56 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 320 | 16.16 | 39965 | 8.01 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 7811 | 0.51 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 17647 | 0.91 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 30012 | 2.13 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 128 | 16.16 | 35815 | 3.57 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 256 | 16.16 | 40387 | 6.34 |
| ResNet-50 | v1.5 | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 360 | 16.16 | 40548 | 8.88 |
| ResNeXt101 | | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 4 | 16.16 | 4076 | 0.98 |
| ResNeXt101 | | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 8 | 16.16 | 6255 | 1.28 |
| ResNeXt101 | | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 16 | 16.16 | 9479 | 1.69 |
| ResNeXt101 | | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 32 | 16.16 | 13420 | 2.38 |
| ResNeXt101 | | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 48 | 16.16 | 16974 | 2.83 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 8411 | 0.48 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 35555 | 0.9 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 46866 | 1.37 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 128 | 16.16 | 57690 | 2.22 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 4 | 16.16 | 9103 | 0.44 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 32 | 16.16 | 39377 | 0.81 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 64 | 16.16 | 47423 | 1.35 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.2.0 | TensorFlow | Synthetic (host-generated) | 128 | 16.16 | 53817 | 2.38 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 3597 | 1.11 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 5355 | 1.49 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 7868 | 2.03 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 20 | 16.16 | 8392 | 2.38 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.2.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 10309 | 3.1 |

Precision terminology: X.Y is defined as follows: X is the precision in which activations and gradients are stored, and Y is the precision in which weights are stored. When training with 16.16 weights, we may still use FP32 for other variables (for example norms or momentum) and include stochastic rounding.
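
As a rough, generic illustration of the X.Y convention (not the benchmark code itself, and not the Graphcore SDK API; stochastic rounding is performed in IPU hardware and is not shown), a 16.16 configuration keeps weights and activations in FP16 while auxiliary state such as momentum can remain in FP32:

```python
import numpy as np

# "16.16": activations/gradients stored in FP16 (X = 16) and
# weights stored in FP16 (Y = 16); auxiliary optimizer state
# such as momentum may still be kept in FP32.
weights     = np.random.randn(1024, 256).astype(np.float16)  # Y = 16
activations = np.random.randn(8, 1024).astype(np.float16)    # X = 16
momentum    = np.zeros(weights.shape, dtype=np.float32)      # FP32 state

output = activations @ weights       # compute path stays in FP16
print(output.dtype, momentum.dtype)  # float16 float32
```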

The benchmarks were generated using our examples on the Graphcore GitHub.

This page was last updated on Tuesday, 29 June 2021.
