$$ f(x; W, c, w, b) = w^\top \max\{0, W^\top x + c\} + b $$
$$ W=\begin{bmatrix} 1 & 1\\ 1 & 1 \end{bmatrix} $$
$$ c=\begin{bmatrix} 0\\ -1\end{bmatrix} $$
$$ w=\begin{bmatrix} 1\\ -2 \end{bmatrix} $$
$$ b=0 $$
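As a quick sanity check, here is a minimal NumPy sketch (the array names simply mirror the symbols above) that evaluates this rectifier network on all four binary inputs; with these parameters it reproduces the XOR truth table.

```python
import numpy as np

# Parameters of the one-hidden-layer rectifier network above
W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.

def f(x):
    # f(x; W, c, w, b) = w^T max{0, W^T x + c} + b
    h = np.maximum(0., W.T @ x + c)  # ReLU hidden layer
    return w @ h + b

for x in ([0., 0.], [0., 1.], [1., 0.], [1., 1.]):
    print(x, f(np.array(x)))  # prints 0, 1, 1, 0: the XOR outputs
```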
$$ J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x) $$
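A minimal sketch of this cost, assuming the model outputs a per-class probability vector for each example and approximating the expectation over $\hat{p}_{\text{data}}$ by the empirical mean over a batch; the function name and array shapes are my own choices for illustration.

```python
import numpy as np

def nll(probs, y):
    # Mean negative log-likelihood of labels y under the predicted class probabilities.
    # probs: (N, K) array, each row is p_model(y | x) over K classes
    # y:     (N,) integer labels drawn together with x from the data distribution
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

# Tiny example: two samples, two classes
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
y = np.array([0, 1])
print(nll(probs, y))  # -(log 0.9 + log 0.8) / 2 ≈ 0.164
```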
Output Units:
Hidden Units:
Logistic Sigmoid and Hyperbolic Tangent. These activation functions are closely related because $\tanh(z) = 2\sigma(2z) - 1$.
Radial basis function (RBF) unit: $h_i = \exp\left(-\frac{1}{\sigma_i^2}\,\|W_{:,i} - x\|^2\right)$. This function becomes more active as $x$ approaches a template $W_{:,i}$. Because it saturates to 0 for most $x$, it can be difficult to optimize.
Softplus: $g(a) = \zeta(a) = \log(1 + e^a)$. This is a smooth version of the rectifier. The use of the softplus is generally discouraged. The softplus demonstrates that the performance of hidden unit types can be very counterintuitive: one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.
Hard tanh. This is shaped similarly to the tanh and the rectifier, but unlike the latter, it is bounded: $g(a) = \max(-1, \min(1, a))$. Reference sketches of these hidden unit nonlinearities follow below.
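Here is a minimal NumPy sketch of these hidden unit nonlinearities; the function names and the explicit `sigma` argument of the RBF unit are assumptions for illustration, not a fixed API.

```python
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def tanh(z):
    # Related to the sigmoid by tanh(z) = 2*sigmoid(2z) - 1
    return 2. * sigmoid(2. * z) - 1.

def rbf_unit(x, template, sigma):
    # h_i = exp(-||W_{:,i} - x||^2 / sigma_i^2); most active when x is near the template column W_{:,i}
    return np.exp(-np.sum((template - x) ** 2) / sigma ** 2)

def softplus(a):
    # Smooth version of the rectifier: zeta(a) = log(1 + e^a)
    return np.log1p(np.exp(a))

def hard_tanh(a):
    # Bounded piecewise-linear unit: g(a) = max(-1, min(1, a))
    return np.clip(a, -1., 1.)

z = np.linspace(-3., 3., 7)
print(np.allclose(tanh(z), np.tanh(z)))  # True: checks the tanh/sigmoid identity
```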


