Large-Scale Deep Learning
- Fast CPU implementation
 
- GPU implementation
 
- Large-scale distributed implementation
- Data parallelism: multiple machines each process a different subset of the examples, e.g., by averaging gradients across workers (see the sketch after this list)
 
- Model parallelism: multiple machines each compute a different part of the model for the same example
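
A minimal sketch of synchronous data parallelism on a toy linear model, assuming NumPy; the worker count, learning rate, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression gradient; w is the model, (X, y) a data batch.
def grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

w = np.zeros(3)
X, y = rng.normal(size=(64, 3)), rng.normal(size=64)

# Data parallelism: each "worker" gets a shard of the batch, computes a
# gradient on its shard, and the gradients are averaged (synchronous SGD).
n_workers = 4
shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
g = np.mean([grad(w, Xs, ys) for Xs, ys in shards], axis=0)
w -= 0.01 * g  # one synchronous parameter update
```

Because the shards are equal-sized, the averaged gradient equals the full-batch gradient; a real system runs the per-shard computations on separate machines rather than in a Python loop.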
 
 
- Model compression: replace a large, accurate model with a smaller model that needs less memory and runtime, e.g., by training the small model to match the large model's outputs
 
- Dynamic structure. Data-processing systems can dynamically determine which subset of many neural networks should be run on a given input.
- Gater to pick an expert network (see the sketch after this list)
 
- Switch formed from a hidden unit
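
Below is a minimal sketch of the gater idea, assuming linear "expert" networks and hard (argmax) selection so that only the chosen expert is evaluated; all sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_in, d_out, n_experts = 4, 2, 3  # illustrative sizes

# Each expert is a small linear network; the gater is another one
# whose output scores the experts for the current input.
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
gater_w = rng.normal(size=(d_in, n_experts))

def forward(x):
    scores = softmax(x @ gater_w)    # gater's confidence in each expert
    winner = int(np.argmax(scores))  # hard selection: run only one expert
    return x @ experts[winner], winner

y, chosen = forward(rng.normal(size=d_in))
print(f"gater chose expert {chosen}; output {y}")
```

Hard selection keeps the computation proportional to one expert; a soft gater would instead mix all experts' outputs, which is more expressive but forgoes the speedup.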
 
 
- Specialized Hardware Implementation of Deep Networks
 
Computer Vision
- Preprocessing
- Contrast normalization: standardize the amount of contrast across samples.
- Global contrast normalization (GCN) prevents images from having varying amounts of contrast by subtracting the mean from each image, then rescaling it so that the standard deviation across its pixels equals some constant s (a code sketch follows this list).
 
- Local contrast normalization: normalize contrast within each small window of the image rather than over the image as a whole
 
 
- Dataset augmentation
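
A minimal sketch of GCN, assuming NumPy; the parameter names s, lam, and eps are illustrative (a scale constant, a regularizer for near-constant images, and a guard against division by zero):

```python
import numpy as np

def global_contrast_normalize(img, s=1.0, lam=0.0, eps=1e-8):
    """Subtract the image's mean, then rescale so the standard
    deviation across its pixels equals the constant s."""
    img = img.astype(np.float64) - img.mean()
    norm = np.sqrt(lam + (img ** 2).mean())  # = pixel std when lam == 0
    return s * img / max(eps, norm)

rng = np.random.default_rng(0)
x = rng.uniform(0, 255, size=(32, 32, 3))    # a random stand-in "image"
y = global_contrast_normalize(x)
print(y.mean(), y.std())                     # approximately 0 and s
```

With lam = 0 the pixel standard deviation becomes exactly s; a small positive lam keeps low-contrast, near-constant images from being amplified into noise.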
 
 
Speech Recognition
Modern speech recognition systems largely use deep networks, e.g., convolutional networks that treat the input spectrogram as a two-dimensional image.
Natural Language Processing
N-Grams
$$
P(x_1, \ldots, x_\tau) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{\tau} P(x_t \mid x_{t-n+1}, \ldots, x_{t-1})
$$
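
The conditionals are typically estimated by counting occurrences in a training corpus. A minimal maximum-likelihood sketch for n = 2 (bigrams), using a made-up toy corpus:

```python
from collections import Counter

# Toy corpus; real models are trained on much larger text collections.
corpus = "the dog ran away the dog ran home the cat ran away".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # counts of (x_{t-1}, x_t) pairs
prefixes = Counter(corpus[:-1])             # counts of x_{t-1}

def p(word, prev):
    """Maximum-likelihood estimate of P(x_t = word | x_{t-1} = prev)."""
    return bigrams[(prev, word)] / prefixes[prev]

print(p("ran", "dog"))    # 1.0 -- "dog" is always followed by "ran" here
print(p("away", "ran"))   # 2/3
```

In practice such estimates are combined with smoothing or back-off so that n-grams unseen in training do not receive probability zero.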
Neural Language Models
A key ingredient is word embeddings: each word is mapped to a learned dense vector, so words used in similar contexts end up with nearby representations (sketch below).
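
A minimal sketch of an embedding lookup, assuming NumPy; the vocabulary and dimension are illustrative, and the matrix here is random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "dog": 1, "cat": 2}  # illustrative vocabulary
E = rng.normal(size=(len(vocab), 8))    # one 8-d vector per word; learned in practice

def embed(word):
    return E[vocab[word]]               # embedding = row lookup

print(embed("dog").shape)               # (8,)
```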
High-Dimensional Outputs
- Use of a short list: restrict the neural net's softmax to the 10,000-20,000 most frequent words; the remaining rare words are handled separately (e.g., by an n-gram model)
 
- Hierarchical softmax: place the words at the leaves of a binary tree and compute a word's probability as the product of binary decisions along its path, reducing the cost of a prediction from O(|V|) to O(log |V|); see the sketch below.
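
A minimal forward-pass sketch of hierarchical softmax over a four-word vocabulary, assuming NumPy and a hand-built balanced tree; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Four words at the leaves of a balanced binary tree with three internal
# nodes; each internal node owns one parameter vector.
d = 5
node_w = rng.normal(size=(3, d))

# Path to each leaf as (internal-node index, go-right bit).
paths = {
    "dog":  [(0, 0), (1, 0)],
    "cat":  [(0, 0), (1, 1)],
    "ran":  [(0, 1), (2, 0)],
    "away": [(0, 1), (2, 1)],
}

def p_word(word, h):
    """P(word | context) as a product of binary decisions along the
    tree path: O(log V) sigmoids instead of one O(V) softmax."""
    prob = 1.0
    for node, right in paths[word]:
        p_right = sigmoid(node_w[node] @ h)
        prob *= p_right if right else 1.0 - p_right
    return prob

h = rng.normal(size=d)                   # context vector from the network
print(sum(p_word(w, h) for w in paths))  # sums to 1.0 over the vocabulary
```

Each internal node contributes one sigmoid, so scoring a word costs O(log |V|) dot products; the price is that the tree structure itself must be chosen well for good statistical performance.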