Review of "Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs"

Summary of the work

Accelerators such as GPUs have limited memory capacity compared to CPUs. Applications with a large memory footprint have relied on various workarounds such as the following.

Scale out to many GPUs
Orchestrate data movement between the CPU and GPU
Off-GPU memory access or Unified Memory (oversubscribing device memory)

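This paper tries to solve the capacity problem through memory compression. Memory compression has already been studied extensively for CPUs. The schemes proposed for CPUs are hard to apply to GPUs because of 1) the architectural differences between CPUs and GPUs and 2) frequent changes in compressibility.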

The key idea of Buddy Compression is to use both GPU device memory and Buddy memory (a larger-capacity but somewhat slower memory) to store compressed data. That is, if a cache line (128 bytes) compresses well enough, it is stored entirely in GPU device memory. If it does not compress enough, part of it is stored in device memory and the rest in Buddy memory. The authors claim that this design avoids page movement and allocation.
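To make the split concrete, here is a minimal sketch of how one compressed cache line could be divided between the two memories. The function name `split_line` and the integer target ratios used below are assumptions for illustration, not names or values taken from the paper.

```python
from typing import Tuple

CACHE_LINE_BYTES = 128  # cache line size stated in the review


def split_line(compressed_bytes: int, target_ratio: int) -> Tuple[int, int]:
    """Return (device_bytes, buddy_bytes) for one compressed cache line.

    target_ratio is a hypothetical integer ratio (e.g. 2 means the line is
    given 128 / 2 = 64 bytes of device memory).
    """
    if not 0 < compressed_bytes <= CACHE_LINE_BYTES:
        raise ValueError("compressed size must be in 1..128 bytes")
    if target_ratio < 1:
        raise ValueError("target ratio must be at least 1")
    # Bytes reserved for this line in device memory.
    device_slot = CACHE_LINE_BYTES // target_ratio
    # Whatever fits stays on the device; the overflow goes to Buddy memory.
    device_bytes = min(compressed_bytes, device_slot)
    buddy_bytes = compressed_bytes - device_bytes
    return device_bytes, buddy_bytes
```

With a target ratio of 2, a line that compresses to 48 bytes stays entirely in device memory, while a line that only compresses to 100 bytes keeps 64 bytes on the device and spills 36 bytes to Buddy memory.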

Evaluation Results:

Implementation Details

Buddy memory is reserved separately at boot time. This region cannot be accessed directly because of coherence issues. Reserving it up front is a reasonable approach because the host system can have plenty of memory (e.g., DGX-2 with 1.5TB of CPU memory).

The design follows the linearly compressed page approach and stores several pieces of metadata in the page table entry.

Per cache line, 3 bits are used for the target compression ratio.
Per cache line, a flag (1 bit) indicates whether the line is compressed.
Per page, an offset into Buddy memory (16bit or ...) must be stored.
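The per-cache-line metadata listed above adds up to 4 bits. The sketch below packs it into a single nibble and estimates the resulting overhead. The bit layout (ratio code in the upper 3 bits, flag in the lowest bit) and the function names are assumptions for illustration only; the review does not specify the actual encoding.

```python
from typing import Tuple


def pack_line_meta(ratio_code: int, is_compressed: bool) -> int:
    """Pack a 3-bit ratio code and a 1-bit compressed flag into 4 bits.

    Layout is an assumption: bits 3..1 hold the ratio code, bit 0 the flag.
    """
    if not 0 <= ratio_code < 8:
        raise ValueError("ratio code must fit in 3 bits")
    return (ratio_code << 1) | int(is_compressed)


def unpack_line_meta(nibble: int) -> Tuple[int, bool]:
    """Inverse of pack_line_meta."""
    if not 0 <= nibble < 16:
        raise ValueError("metadata must fit in 4 bits")
    return nibble >> 1, bool(nibble & 1)


def metadata_overhead(line_bytes: int = 128, meta_bits: int = 4) -> float:
    """Fraction of extra storage spent on per-line metadata."""
    return meta_bits / (line_bytes * 8)
```

Under these assumptions, 4 bits per 128-byte line come to 4 / 1024, or roughly 0.4% of the data size.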

The Global Buddy Address Register (GBAR) stores the start address of Buddy memory.
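One way to picture how GBAR and the per-page offset might combine is the address calculation sketched below. Everything beyond "GBAR is the base and each page has an offset" is an assumption: the unit of the page offset, the linear per-line layout inside a page's Buddy region, and all parameter names are hypothetical.

```python
def buddy_address(
    gbar: int,
    page_offset: int,
    line_index: int,
    buddy_slot_bytes: int,
    offset_unit_bytes: int,
) -> int:
    """Compute a Buddy-memory address for one cache line (illustrative only).

    gbar              -- start address of Buddy memory (from GBAR)
    page_offset       -- per-page offset stored in the page table entry
    offset_unit_bytes -- hypothetical granularity of page_offset
    line_index        -- index of the cache line within its page
    buddy_slot_bytes  -- hypothetical fixed Buddy slot size per cache line
    """
    if min(gbar, page_offset, line_index) < 0:
        raise ValueError("addresses and indices must be non-negative")
    page_base = gbar + page_offset * offset_unit_bytes
    # Fixed-size slots mean the line's location follows from its index alone.
    return page_base + line_index * buddy_slot_bytes
```

Because every line has a fixed slot, no per-line pointer needs to be stored in this sketch, which is the property that makes a linear layout attractive.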

The major overhead in this system is access to Buddy memory, so an appropriate target compression ratio must be found to keep such accesses to a minimum.
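The trade-off can be sketched as a simple selection rule: pick the most aggressive ratio for which only a small fraction of cache lines overflow into Buddy memory. The candidate ratios, the 30% threshold, and the function name are assumptions made for this sketch, not values confirmed by the review.

```python
from typing import Iterable, Sequence


def pick_target_ratio(
    compressed_sizes: Sequence[int],
    candidate_ratios: Iterable[int] = (4, 2, 1),
    max_buddy_fraction: float = 0.3,
    line_bytes: int = 128,
) -> int:
    """Return the largest candidate ratio whose overflow rate is acceptable.

    compressed_sizes holds the compressed size in bytes of each profiled
    cache line. A line overflows when it does not fit its device slot.
    """
    if not compressed_sizes:
        raise ValueError("need at least one profiled cache line")
    for ratio in sorted(candidate_ratios, reverse=True):
        slot = line_bytes // ratio
        overflowing = sum(1 for size in compressed_sizes if size > slot)
        if overflowing / len(compressed_sizes) <= max_buddy_fraction:
            return ratio
    # No candidate met the threshold: fall back to no compression.
    return 1


profile = [20, 30, 60, 64, 100, 128, 40, 50, 32, 16]
chosen = pick_target_ratio(profile)
```

For the sample profile, a ratio of 4 would send 6 of 10 lines to Buddy memory, while a ratio of 2 sends only 2 of 10, so the sketch settles on 2.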

Strengths & Weaknesses

Comments

What data format and layout were used in "Compressibility of GPU Workloads"?

The paper says that the reason CPU memory compression schemes are hard to apply to GPUs is, rather, frequent compressibility changes. This differs from what we had expected.
