DL system

DL system stack

http://dlsys.cs.washington.edu/schedule

lec3

User API
- Programming API
- Gradient Calculation (Differentiation API)
System Components
- Computational Graph Optimization and Execution
- Runtime Parallel Scheduling
Architecture
- GPU Kernels, Optimizing Device Code
- Accelerators and Hardware

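tinyflow: declarative; the program first defines a computational graph and runs it afterwards.

Computational graph: nodes represent computations (operations), edges represent data dependencies.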
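A minimal sketch of the declarative style in plain Python, to make "nodes are operations, edges are data dependencies" concrete. This is not tinyflow's API; `Node`, `placeholder`, `add`, `mul`, and `run` are hypothetical names chosen for illustration.

```python
class Node:
    """One operation plus the nodes it depends on (its incoming edges)."""
    def __init__(self, op, inputs=(), name=None):
        self.op = op                # callable computing the value; None for placeholders
        self.inputs = list(inputs)  # data dependencies
        self.name = name

def placeholder(name):
    return Node(None, name=name)

def add(a, b):
    return Node(lambda x, y: x + y, [a, b])

def mul(a, b):
    return Node(lambda x, y: x * y, [a, b])

def topo_order(out):
    """Post-order DFS: every node appears after all of its inputs."""
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for i in n.inputs:
            visit(i)
        order.append(n)
    visit(out)
    return order

def run(out, feed):
    """Evaluate `out` given a {placeholder: value} mapping."""
    vals = dict(feed)
    for n in topo_order(out):
        if n not in vals:
            vals[n] = n.op(*(vals[i] for i in n.inputs))
    return vals[out]

# Declare first, execute later: y = (a + b) * b
a, b = placeholder("a"), placeholder("b")
y = mul(add(a, b), b)
result = run(y, {a: 2.0, b: 3.0})
```

Nothing is computed while `y` is being built; only `run` walks the graph, which is what leaves room for graph-level optimization before execution.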

Code: https://github.com/tqchen/tinyflow

https://github.com/dlsys-course/tinyflow

lec4

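Backpropagation (backprop) vs. automatic differentiation (autodiff)?

Surprisingly, the two do not mean the same thing.

Backpropagation needs the intermediate values saved during the forward pass, but it is a local process: each step only uses nearby values.

Automatic differentiation instead builds a computational graph for the derivative computation itself, which makes system-level optimization easier.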
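A rough sketch of the autodiff idea: the gradient is produced as new graph nodes, so it can be evaluated (and optimized) like any other graph. This is a toy under stated assumptions, supporting only `add` and `mul`; all names (`Node`, `gradients`, `evaluate`) are hypothetical and not taken from the course code.

```python
class Node:
    def __init__(self, kind, inputs=(), value=None):
        self.kind, self.inputs, self.value = kind, list(inputs), value

def var():       return Node("var")
def const(v):    return Node("const", value=v)
def add(a, b):   return Node("add", [a, b])
def mul(a, b):   return Node("mul", [a, b])

def topo_order(out):
    """Post-order DFS: inputs come before the nodes that use them."""
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for i in n.inputs:
            visit(i)
        order.append(n)
    visit(out)
    return order

def gradients(out, wrt):
    """Return NEW graph nodes that compute d(out)/d(w) for each w in wrt."""
    grad = {out: const(1.0)}
    for n in reversed(topo_order(out)):      # consumers before producers
        g = grad[n]
        if n.kind == "add":
            contribs = [g, g]
        elif n.kind == "mul":
            a, b = n.inputs
            contribs = [mul(g, b), mul(g, a)]
        else:                                # var / const: nothing to propagate
            contribs = []
        for i, c in zip(n.inputs, contribs):
            grad[i] = add(grad[i], c) if i in grad else c
    return [grad.get(w, const(0.0)) for w in wrt]

def evaluate(out, feed):
    vals = dict(feed)
    for n in topo_order(out):
        if n in vals:
            continue
        if n.kind == "const":
            vals[n] = n.value
        elif n.kind == "add":
            vals[n] = vals[n.inputs[0]] + vals[n.inputs[1]]
        elif n.kind == "mul":
            vals[n] = vals[n.inputs[0]] * vals[n.inputs[1]]
    return vals[out]

# y = (a + b) * b, so dy/da = b and dy/db = (a + b) + b
a, b = var(), var()
y = mul(add(a, b), b)
da, db = gradients(y, [a, b])
feed = {a: 2.0, b: 3.0}
```

`da` and `db` are ordinary nodes; calling `evaluate(da, feed)` runs the derivative graph with the same executor as the forward graph.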

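Reading: http://cs231n.github.io/optimization-2/

Although we usually differentiate with respect to the parameters w, the gradient with respect to the input x is sometimes useful too, e.g. for visualizing and understanding neural networks.

Intuition for the derivative: in the neighborhood of a point, how strongly a change in an input variable affects the output.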
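The intuition can be checked numerically with a centered finite difference. A small sketch, assuming f(x, y) = x * y at the point (4, -3) as the example function; `numerical_partial` is a hypothetical helper name.

```python
def f(x, y):
    return x * y

def numerical_partial(func, args, i, h=1e-5):
    """Centered difference: (f(.., x_i + h, ..) - f(.., x_i - h, ..)) / (2h)."""
    hi, lo = list(args), list(args)
    hi[i] += h
    lo[i] -= h
    return (func(*hi) - func(*lo)) / (2 * h)

point = (4.0, -3.0)
df_dx = numerical_partial(f, point, 0)  # analytic value: y = -3
df_dy = numerical_partial(f, point, 1)  # analytic value: x = 4
```

Nudging x by a small h changes the output by roughly -3h, and nudging y changes it by roughly 4h; that sensitivity is what the derivative measures.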

Extending to vector operations: derive the gradient expressions by matching dimensions.
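A worked instance of the dimension-matching trick for a matrix product, assuming a scalar loss L and the shapes stated below (the symbols m, n, p are introduced here for illustration):

```latex
\begin{aligned}
D &= W X, \qquad W \in \mathbb{R}^{m \times n},\; X \in \mathbb{R}^{n \times p},\; D \in \mathbb{R}^{m \times p} \\
\frac{\partial L}{\partial W} &= \frac{\partial L}{\partial D}\, X^{\top}
  \qquad (m \times p)(p \times n) = m \times n \\
\frac{\partial L}{\partial X} &= W^{\top}\, \frac{\partial L}{\partial D}
  \qquad (n \times m)(m \times p) = n \times p
\end{aligned}
```

Each gradient must have the same shape as the variable it belongs to, and these are the only products of the available factors whose shapes work out.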

Erik Learned-Miller's notes on matrix/vector derivatives, linked from the cs231n page above.

Reading: Automatic differentiation in machine learning: a survey

Matrix derivatives.

lec5

GPU compute resources: each SM contains a large number of compute cores.

Memory hierarchy, Titan X (Pascal):

SMs: 28
Cores / SM: 128
Reg / SM: 256 KB (R: 0 cycles, R-after-W: ~20 cycles)
(L1/texture) / SM: 48 KB, 92 cycles
- constant L1 cache: 28 cycles
Shared mem / SM: 64 KB, 28 cycles
L2 cache: 3 MB, 200 cycles
GPU DRAM: 12 GB, 350 cycles

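Programming model: SIMT (single instruction, multiple threads)

The programmer writes a single-thread program; many threads execute the same code, and each thread may take a different path.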
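- threads are grouped into a block; threads in the same block can synchronize with each other
- blocks are grouped into a grid; blocks are scheduled independently by the GPU and may execute in any order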
- a kernel is executed as a grid of blocks of threads.
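Kernel execution
- each block is executed by one SM and does not migrate
- several blocks can run concurrently on the same SM, depending on the blocks' memory requirements and the SM's memory resources
- a warp consists of 32 threads and is the basic scheduling unit of kernel execution
- a thread block consists of multiple 32-thread warps
- each cycle, a warp scheduler selects one ready warp and dispatches it to the CUDA cores
Thread hierarchy & memory hierarchy
- thread -> registers, local memory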
- thread block -> shared memory
- grid -> global memory
CUDA programming, examples: vector add, sliding window sum, GEMM
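To illustrate how a kernel runs as a grid of blocks of threads, here is a sequential Python emulation of the vector-add example. It is only a sketch of the indexing logic, not CUDA: `launch` is a hypothetical stand-in for a kernel launch, and the real hardware would run the threads in parallel.

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out, n):
    """Body executed by ONE thread; every thread runs this same code."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < n:                               # guard: the last block may be partial
        out[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Sequential emulation: visit every thread of every block.
    Block order is arbitrary on a GPU, so kernels must not rely on it."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

n = 10
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
out = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim)
launch(vector_add_kernel, grid_dim, block_dim, a, b, out, n)
```

With n = 10 and 4 threads per block, three blocks are launched and the last two threads of the final block do nothing because of the `i < n` guard.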

lec6

Goal: compile a high-level program into efficient low-level bare-metal code.

Example: optimizing matrix multiplication with the memory hierarchy in mind

Tiled (blocked)
cache line aware

Key idea: memory reuse.
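A minimal pure-Python sketch of tiled matrix multiplication, to show the loop structure behind "memory reuse". The tile size and function names are assumptions for illustration; in Python the tiling gives no speedup, it only shows which elements are touched together.

```python
def matmul_naive(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][kk] * B[kk][j] for kk in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(A, B, tile=2):
    """Same result as matmul_naive, but computed one tile x tile block at a time.
    Within a block, the same small pieces of A and B are reused many times,
    which on real hardware keeps them in cache / shared memory / registers."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # accumulate the contribution of one (i0, j0, k0) tile
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += A[i][kk] * B[kk][j]
                        C[i][j] = acc
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, 1], [2, 3]]
C = matmul_tiled(A, B, tile=2)
```

The `min(...)` bounds handle matrices whose sizes are not multiples of the tile size.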

The optimization space is too large:

tiling patterns
fuse patterns
data layout
hardware backends

TVM

lec7

TVM: automatic code generation.

Related work:

TensorFlow XLA
- high-level / low-level optimization
Intel nGraph
Nvidia TensorRT
- rule-based fusion

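TVM's argument: if the computational graph is the IR and optimization happens on the graph, operators still have to be optimized separately for every kind of hardware (different layouts, precisions, thread counts). (I think the common parts could be factored out, though, so this should not really count as a drawback.)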

TVM uses a Tensor Expression Language (other systems using a TEL: Halide, Loopy, TACO, Tensor Comprehensions)
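A toy sketch of the tensor-expression idea: describe what each output element is, and decide separately how the loops run. This is plain Python and deliberately not TVM's API; `compute`, `run_rowmajor`, and `run_tiled` are hypothetical names made up for illustration.

```python
def compute(shape, fn):
    """Hypothetical helper: a tensor described only by its per-element rule."""
    return {"shape": shape, "fn": fn}

def run_rowmajor(t):
    """One 'schedule': plain row-major loops."""
    rows, cols = t["shape"]
    return [[t["fn"](i, j) for j in range(cols)] for i in range(rows)]

def run_tiled(t, tile=2):
    """Another 'schedule': the same elements, visited tile by tile."""
    rows, cols = t["shape"]
    out = [[None] * cols for _ in range(rows)]
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    out[i][j] = t["fn"](i, j)
    return out

A = [[1, 2, 3], [4, 5, 6]]
B = [[10, 20, 30], [40, 50, 60]]

# The expression says WHAT C is; it does not fix a loop order.
C = compute((2, 3), lambda i, j: A[i][j] + B[i][j])
```

Both executors produce the same values from the same expression, which is the property that lets a compiler search over loop orders, tilings, and layouts per hardware target.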
