PyTorch Profiler 1 & TensorFlow Profiler 2

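This post mainly covers the Profiler designs of PyTorch and TensorFlow. Both are implemented in C++, so they can use OO features and certain design techniques to implement a Profiler with little effort (for example, aggregating operator computation). Other DL frameworks that are not implemented this way may be unable to adopt a similar design.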

1. PyTorch Profiler 1

The design is Event-driven. Put simply, there are two hooks, one before forward and one after it.
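As a rough illustration of the "two hooks" idea, here is a minimal pure-Python sketch. It is not PyTorch's implementation: `record_range`, the `events` list, and the stand-in `forward` function are all hypothetical names invented for this example.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory event log; each entry is (kind, name, timestamp_ns).
events = []

@contextmanager
def record_range(name):
    """Record a 'push' event before the wrapped code and a 'pop' event after it."""
    events.append(("push", name, time.perf_counter_ns()))
    try:
        yield
    finally:
        events.append(("pop", name, time.perf_counter_ns()))

def forward(x):
    # Stand-in for an operator's forward computation (hypothetical).
    return [v * 2 for v in x]

with record_range("forward"):
    result = forward([1, 2, 3])
```

Note that the hooks only append raw events; any aggregation is left to a later stage.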

Note: the profiler in autograd is the first generation; the one under profiler is the second generation.

# torch/autograd/profiler_legacy.py
def __enter__(self):
    # ... (some code omitted)
    self._start_trace()
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    # ... (some code omitted)
    records = _disable_profiler_legacy()
    parsed_results = _parse_legacy_records(records)
    self.function_events = EventList(
        parsed_results,
        use_cuda=self.use_cuda,
        profile_memory=self.profile_memory,
        with_flops=self.with_flops)
    self.function_events._build_tree()
    return False

_start_trace in turn calls _enable_profiler_legacy; both it and _disable_profiler_legacy are defined on the C++ side. torch/csrc/autograd/profiler_legacy.cpp defines enableProfilerLegacy and disableProfilerLegacy. Here is disableProfilerLegacy:

thread_event_lists disableProfilerLegacy(
    c10::optional<ProfilerDisableOptions> profilerDisableOptions) {
    // ... (some code omitted)
    return state_ptr->consolidate();
}

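thread_event_lists stores individual events (event type, timestamp, id), which are later converted into records. No processing such as averaging has been done at this point; thread_event_lists only stores events.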

The returned thread_event_lists is made up of two parts: the events in the event lists held by event_lists_map_ (unordered_map<uint64_t, RangeEventList>), and the events in remoteProfiledEvents_.

The former (RangeEventList) appends an event evt through record.
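The consolidation step described above can be sketched in Python as follows. This is only an analogy under stated assumptions: the class and attribute names mirror the C++ ones for readability, but `ProfilerState`, its methods, and the sample events are hypothetical and do not reproduce the real C++ logic.

```python
from collections import defaultdict

class RangeEventList:
    """Hypothetical per-thread list; record() appends one event."""

    def __init__(self):
        self._events = []

    def record(self, evt):
        self._events.append(evt)

    def consolidate(self):
        # Hand over the collected events and start a fresh list.
        out, self._events = self._events, []
        return out

class ProfilerState:
    """Hypothetical stand-in for the profiler state object."""

    def __init__(self):
        self.event_lists_map = defaultdict(RangeEventList)  # thread id -> events
        self.remote_profiled_events = []  # event lists received from other workers

    def record(self, thread_id, evt):
        self.event_lists_map[thread_id].record(evt)

    def consolidate(self):
        # Part 1: one list per local thread. Part 2: remotely profiled lists.
        result = [lst.consolidate() for lst in self.event_lists_map.values()]
        result.extend(self.remote_profiled_events)
        return result

state = ProfilerState()
state.record(1, ("push", "add", 10))
state.record(1, ("pop", "add", 25))
state.record(2, ("push", "mul", 12))
state.remote_profiled_events.append([("push", "remote_op", 5)])
thread_event_lists = state.consolidate()
```

The result is a list of event lists, still raw and unaggregated, which matches the point that averaging happens later.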

Different kinds of events are recorded at different sites: memory-related events in the Allocator, compute-related events at forward.

Back on the Python side, class EventList is clearly where the table output is produced:

class EventList(list):
    """A list of Events (for pretty printing)"""
    def __init__(self, *args, **kwargs):
        ...  # (code omitted)

    # Note: table() below calls this as a plain function, _build_table(self, ...).
    def _build_table(events, sort_by=None, header=None, row_limit=100,
        max_src_column_width=75, max_name_column_width=55,
        max_shapes_column_width=80, with_flops=False,
        profile_memory=False, top_level_events_only=False):
        """Prints a summary of events (which can be a list of FunctionEvent or FunctionEventAvg)."""
        # This function builds the table that gets printed.
        ...

    def table(self, sort_by=None, row_limit=100, max_src_column_width=75,
            max_name_column_width=55, max_shapes_column_width=80,
            header=None, top_level_events_only=False):
        return _build_table(self, sort_by=sort_by, row_limit=row_limit,
            max_src_column_width=max_src_column_width, max_name_column_width=max_name_column_width,
            max_shapes_column_width=max_shapes_column_width, header=header,
            profile_memory=self._profile_memory, with_flops=self._with_flops,
            top_level_events_only=top_level_events_only)

    def __str__(self):
        return self.table()

In other words, profiling is enabled on enter; on exit, the event list on the C++ side is returned to the Python side and printed as a table.
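To make the "raw events first, summary later" flow concrete, here is a small sketch of the Python-side post-processing. It is a simplification, not the logic of _parse_legacy_records or _build_table: the function names `parse_records` and `build_table`, the record layout, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

def parse_records(records):
    """Pair each 'push' with its matching 'pop' and return (name, duration) tuples."""
    stack, function_events = [], []
    for kind, name, ts in records:
        if kind == "push":
            stack.append((name, ts))
        elif kind == "pop":
            start_name, start_ts = stack.pop()
            function_events.append((start_name, ts - start_ts))
    return function_events

def build_table(function_events):
    """Aggregate durations per name and format a plain-text summary table."""
    totals = defaultdict(lambda: [0, 0])  # name -> [call count, total time]
    for name, duration in function_events:
        totals[name][0] += 1
        totals[name][1] += duration
    lines = [f"{'Name':<10}{'Calls':>6}{'Total':>8}{'Avg':>8}"]
    # Rows are sorted by total time, largest first.
    for name, (calls, total) in sorted(totals.items(), key=lambda kv: -kv[1][1]):
        lines.append(f"{name:<10}{calls:>6}{total:>8}{total / calls:>8.1f}")
    return "\n".join(lines)

# Hypothetical raw records: (kind, name, timestamp).
records = [
    ("push", "linear", 0), ("push", "addmm", 2), ("pop", "addmm", 9),
    ("pop", "linear", 12), ("push", "addmm", 20), ("pop", "addmm", 24),
]
function_events = parse_records(records)
table = build_table(function_events)
print(table)
```

Averages only appear in the last step, which is why the C++ side can stay a simple event store.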

2. TensorFlow Profiler 2

It is also Event-based. Below are some of the metrics visible in TensorBoard.

// tensorflow/tensorflow/core/profiler/utils/event_span.cc
static const auto* generic_event_type_str_map = new GenericEventTypeStrMap({
    {kDeviceCompute, "Device compute"},
    {kDeviceToDevice, "Device to device"},
    {kDeviceCollectives, "Device collective communication"},
    {kHostCompute, "Host compute"},
    {kHostPrepare, "Kernel launch"},
    {kInput, "Input"},
    {kOutput, "Output"},
    {kCompile, "Compilation"},
    {kAllOthers, "All others"},
});

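Take "Device to device" above as an example. Its corresponding kDeviceToDevice is of type GenericEventType, one of the event types the Profiler shows to users, and represents device-to-device communication time. This time consists of two parts: the actual communication time and the device wait time. As shown below, CreatePodStatsRecord binds DEVICE_TO_DEVICE and DEVICE_WAIT_DEVICE to kDeviceToDevice.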

// tensorflow/tensorflow/core/profiler/convert/op_stats_to_pod_stats.cc
PodStatsRecord CreatePodStatsRecord(absl::string_view host_name,
                                    const StepInfoResult& step_info) {
    PodStatsRecord record;
    GenericStepBreakdown generic;
    bool success = step_info.step_breakdown().UnpackTo(&generic);
    DCHECK(success);
    record.set_host_name(string(host_name));
    record.set_step_num(step_info.step_num());
    record.set_total_duration_us(PicoToMicro(step_info.duration_ps()));
    auto& step_breakdown_map = *record.mutable_step_breakdown_us();
    std::vector<std::pair<uint64, absl::string_view>> metrics;
    auto add_event = [&](GenericEventType type,
                        std::initializer_list<EventType> event_list) {
        uint64 ps = 0;
        for (const auto& event_type : event_list) {
            ps += gtl::FindWithDefault(generic.type_ps(), event_type, /*value=*/0);
        }
        step_breakdown_map[type] = PicoToMicro(ps);
        // metrics is a vector of pair<time_consuming, generic_event_type_str>.
        metrics.emplace_back(ps, GetGenericEventTypeStr(type));
    };

    add_event(kDeviceCompute, {DEVICE_COMPUTE_32, DEVICE_COMPUTE_16});
    add_event(kDeviceToDevice, {DEVICE_TO_DEVICE, DEVICE_WAIT_DEVICE});
    add_event(kDeviceCollectives, {DEVICE_COLLECTIVES});
    add_event(kHostCompute, {HOST_COMPUTE});
    add_event(kHostPrepare, {HOST_PREPARE});
    add_event(kInput, {HOST_WAIT_INPUT, HOST_TO_DEVICE, DEVICE_WAIT_HOST});
    add_event(kOutput, {DEVICE_TO_HOST});
    add_event(kCompile, {HOST_COMPILE});
    add_event(kAllOthers, {UNKNOWN_TIME});
    std::sort(metrics.begin(), metrics.end());
    record.set_bottleneck(metrics.back().second.data(),
                            metrics.back().second.size());
    return record;
}

CreatePodStatsRecord produces the metrics information from a StepInfoResult.

Note: DEVICE_TO_DEVICE is likewise an enum value.
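The grouping and bottleneck selection done by add_event can be restated as a short Python sketch. It covers only a subset of the categories, and the timing values, `GROUPS`, `pico_to_micro`, and `create_record` are hypothetical names and data invented for this example rather than TensorFlow APIs.

```python
# Hypothetical per-step timings in picoseconds, keyed by low-level event type.
type_ps = {
    "DEVICE_COMPUTE_32": 4_000_000,
    "DEVICE_COMPUTE_16": 1_000_000,
    "DEVICE_TO_DEVICE": 2_000_000,
    "DEVICE_WAIT_DEVICE": 6_000_000,
    "HOST_WAIT_INPUT": 3_000_000,
}

# Low-level event types grouped into user-facing categories (subset of the C++ code).
GROUPS = {
    "Device compute": ["DEVICE_COMPUTE_32", "DEVICE_COMPUTE_16"],
    "Device to device": ["DEVICE_TO_DEVICE", "DEVICE_WAIT_DEVICE"],
    "Input": ["HOST_WAIT_INPUT", "HOST_TO_DEVICE", "DEVICE_WAIT_HOST"],
}

def pico_to_micro(ps):
    # 1 microsecond = 1,000,000 picoseconds.
    return ps / 1_000_000

def create_record(type_ps):
    breakdown_us, metrics = {}, []
    for label, event_types in GROUPS.items():
        # Missing event types count as zero, like FindWithDefault in the C++ code.
        ps = sum(type_ps.get(t, 0) for t in event_types)
        breakdown_us[label] = pico_to_micro(ps)
        metrics.append((ps, label))
    metrics.sort()
    # The category with the largest total time is reported as the bottleneck.
    return {"step_breakdown_us": breakdown_us, "bottleneck": metrics[-1][1]}

record = create_record(type_ps)
```

With these made-up numbers, the wait time dominates the "Device to device" category, which shows why the two low-level types are summed before the bottleneck is picked.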

3. Summary

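Intuitively, a profiler is implemented either by "inserting code before and after key operations" or by "monitoring the system according to predefined rules". Setting the latter aside (it was not covered above), the former can be further divided into static and dynamic instrumentation. For in-house profilers such as those of PyTorch and TensorFlow, the developers have easy access to the source code, can modify it directly, and can get their changes merged quickly; I suspect this was part of their consideration. For external developers who find it harder to work with the source, the better way to build a profiler without relying on the community is dynamic instrumentation.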

That said, it is possible that neither the language itself nor the community provides the necessary support.
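Where the language does allow it, dynamic instrumentation can be as simple as wrapping a function at runtime without editing its source. The following is a minimal Python sketch under that assumption; `third_party`, `instrument`, and `timings` are hypothetical names, and the lambda stands in for library code that cannot be modified.

```python
import functools
import time
import types

# Stand-in for a third-party module whose source we cannot edit (hypothetical).
third_party = types.SimpleNamespace(matmul=lambda a, b: a * b)

timings = []  # (function name, elapsed nanoseconds)

def instrument(obj, attr):
    """Replace obj.attr with a timing wrapper at runtime."""
    original = getattr(obj, attr)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        try:
            return original(*args, **kwargs)
        finally:
            timings.append((attr, time.perf_counter_ns() - start))

    setattr(obj, attr, wrapper)
    return original  # returned so the caller can restore it later

original = instrument(third_party, "matmul")
value = third_party.matmul(3, 4)
third_party.matmul = original  # remove the instrumentation
```

The wrapper is installed and removed while the program runs, which is the property that separates this from the static approach of editing the framework's own source.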