The Architecture and Evolution of CPU-GPU Systems for General Purpose Computing
Manish Arora
Computer Science and Engineering, University of California, San Diego
From GPU to GPGPU

[Figure: a classic GPU pipeline (Input Assembly → Vertex Processing → Geometry Processing → Frame Buffer Operations) backed by an L2, memory controller, and off-chip memory, alongside a GPGPU built from SMs with per-SM shared memory, an L2, a memory controller, and off-chip memory]

Widespread adoption (300M devices). First with NVIDIA Tesla in 2006-2007.
2006 – 2010: Previous Generation Consumer Hardware

[Figure: a discrete GPGPU (SMs with shared memory, L2, memory controller, off-chip memory) attached through a PCI bridge to a CPU (cores with private cache hierarchies, a last level cache, memory controller, and off-chip memory)]
2011 – 2012: Current Consumer Hardware (Intel Sandy Bridge, AMD Fusion APUs)

[Figure: a chip-integrated design in which CPU cores and GPGPU SMs share an on-chip last level cache, memory controller, and off-chip memory]
Our Goals Today

[Figure: energy-efficient GPUs and throughput applications drive GPGPU; lower costs/overheads and CPU-only workloads drive chip-integrated CPU-GPU systems]
Outline

Part 1: GPGPU
Part 2: GPGPU Evolution
Part 3: Holistic Optimizations (CPU Core Optimization, Redundancy Elimination)
Part 4: Shared Components
Part 5: Opportunistic Optimizations
Part 6 (Future Work): Emerging Technologies; Power, Temperature, Reliability; Tools

Together these build towards next generation CPU-GPU architectures.
Part 1: Progression of GPGPU Architectures

GPGPUs - 1
GPGPUs - 2
GPGPUs - 3
Contemporary GPU Architecture (Lindholm et al. IEEE Micro 2007 / Wittenbrink et al. IEEE Micro 2011)

[Figure: the discrete GPGPU sits across a PCI bridge from the CPU; inside the GPU, many SMs connect through an interconnect to multiple L2 cache slices, each with its own memory controller and DRAM channel]
SM Architecture (Lindholm et al. IEEE Micro 2007 / Wittenbrink et al. IEEE Micro 2011)

[Figure: an SM: a warp scheduler feeds a banked register file with operand buffering into the SIMT lanes (ALUs, SFUs, MEM, and TEX units), backed by a shared memory / L1 cache]
Multi-threading and Warp Scheduling: Example of Warp Scheduling (Lindholm et al. IEEE Micro 2007)

[Figure: the SM multithreaded instruction scheduler interleaves warps over time: Warp 1 Instruction 1, Warp 2 Instruction 1, Warp 3 Instruction 1, ..., Warp 2 Instruction 2, Warp 3 Instruction 2, ..., Warp 1 Instruction 2, ...]
Design for Efficiency and Scalability (Nickolls et al. IEEE Micro 2010 / Keckler et al. IEEE Micro 2011)

Scalability (Lee et al. ISCA 2010 / Nickolls et al. IEEE Micro 2010 / Keckler et al. IEEE Micro 2011, and other public sources)
Part 2: Towards Better GPGPU (GPGPU Evolution)
Control-flow Divergence Losses (Fung et al. Micro 2007)

[Figure: a warp with mask 1111 reaches a divergent branch between code A and code B; over time it serially executes Path A instructions, then Path B instructions, each under a partial mask, until the converge/merge point restores mask 1111; utilization is low while diverged]
Dynamic Warp Formation (Fung et al. Micro 2007)

[Figure: in the original scheme, Warp 0 and Warp 1 each serially execute Path A and then Path B under partial masks after the divergent branch; with DWF, threads on the same path are regrouped into Warp 0+1 : Path A and Warp 0+1 : Path B, dynamically forming 2 new warps from 4 original warps]
Dynamic Warp Formation Intricacies (Fung et al. Micro 2007)

[Figure: register file accesses for static warps vs. lane-aware dynamic warp formation vs. formation without lane awareness; the register file is split into banks (Bank 1 ... Bank N), one per ALU lane; lane-aware formation keeps each thread in its home lane so accesses stay spread across banks, while lane-oblivious formation can direct multiple accesses to the same bank]
Large Warp Microarchitecture (Narasiman et al. Micro 2011)

[Figure: a large warp's activity mask over time; at T = 0 the original large warp holds the full activity mask; at T = 1, T = 2, T = 3 a sub-warp is packed each cycle by selecting at most one active thread per SIMD lane (column) and clearing those bits, so each issued sub-warp is nearly fully populated]
Two-level Scheduling (Narasiman et al. Micro 2011)

Dynamic Warps vs Large Warp + 2-Level Scheduling (Fung et al. Micro 2007 vs Narasiman et al. Micro 2011)
Part 3: Holistically Optimized CPU Designs (CPU Core Optimization, Redundancy Elimination)
Motivation to Rethink CPU Design (Arora et al., in submission to IEEE Micro 2012)

Benchmarks
Methodology
Results – CPU Time
Results – Instruction Level Parallelism
Results – Branch Characterization
Results – Loads
Results – Vector Instructions
Results – Thread Level Parallelism
Impact: CPU Core Directions
Impact: Redundancy Elimination
Part 4: Shared Component Designs

Optimization of Shared Structures
[Figure: the integrated chip again: latency-sensitive CPU cores and potentially latency-insensitive but bandwidth-hungry GPGPU SMs share the on-chip last level cache, memory controller, and off-chip memory]
TAP: TLP-Aware Shared LLC Management (Lee et al. HPCA 2012)

TAP Design - 1
TAP Design - 2

QoS-Aware Memory Bandwidth Partitioning (Jeong et al. DAC 2012)
Part 5: Opportunistic Optimizations
Idle GPU Shader based Prefetching (Woo et al. ASPLOS 2010)

Miss Address Provider (MAP)

[Figure: CPU cores and GPU SMs share the on-chip last level cache; the MAP holds the miss PC, miss address, a shader pointer, and a command buffer; the OS allocates an idle GPU core, miss info is forwarded to it, the GPU core stores and processes the miss stream, and data is prefetched into the shared LLC]
CPU-assisted GPGPU processing (Yang et al. HPCA 2012)

Example GPU Kernel and CPU program
__global__ void VecAdd (float *A, float *B, float *C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

float mem_fetch (float *A, float *B, float *C, int N) {
    return A[N] + B[N] + C[N];
}

void cpu_prefetching (…) {
    unroll_factor = 8
    // traverse through all thread blocks (TB)
    for (j = 0; j < N_TB; j += Concurrent_TB)
        // loop to traverse concurrent threads
        for (i = 0; i < Concurrent_TB*TB_Size;
             i += skip_factor*batch_size*unroll_factor) {
            for (k = 0; k < batch_size; k++) {
                id = i + skip_factor*k*unroll_factor + j*TB_Size
                // unrolled loop
                float a0 = mem_fetch (id + skip_factor*0)
                float a1 = mem_fetch (id + skip_factor*1)
                . . .
                sum += a0 + a1 + . . .
            }
            update skip_factor
        }
}

GPU kernel: the memory requests of a single thread.
CPU program: iterates over all concurrent thread blocks.
- skip_factor controls CPU timing
- batch_size controls how often skip_factor is updated
- unroll_factor artificially boosts CPU requests
Drawbacks: CPU-assisted GPGPU processing
Part 6: Future Work (Emerging Technologies; Power, Temperature, Reliability; Tools)
Continued System Optimizations
Research Tools
Power, Temperature and Reliability
Emerging Technologies
Conclusions
Questions?

Backup Slides

Results – Stores
Results – Branch Prediction Rates