0709 KISS Summer school DL Lecture
Introduction to DL
Traditional linear models suffer at specific tasks (e.g., MNIST).
- Why: they are designed for low-dimensional data
- In higher dimensions (e.g., image or video data), the curse of dimensionality occurs
- How about principal components? The first PC usually captures the overall mean and cannot pick up small-scale features

How to overcome: use a basis function $f:\mathbb{R}^{p}\to \mathbb{R}^{d}$ (kernel function).
Neural Network
1-layer Neural Network
\[\begin{aligned} H^{(1)} &= g^{(1)}(W^{(1)}x + w_{0}^{(1)})\\ \Lambda&= g^{(\lambda)}(W^{(\lambda)}H^{(1)}+w_{0}^{(\lambda)})\\ Y &\sim p(y;\Lambda) \end{aligned}\]The output activation $g^{(\lambda)}$ is determined by the distribution of $Y$.
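As a concrete illustration, here is a minimal sketch of this 1-layer network in PyTorch; the hidden width, the ReLU choice for $g^{(1)}$, and the identity output activation (Gaussian $Y$) are assumptions for illustration, not from the lecture.

```python
import torch
import torch.nn as nn

class OneLayerNet(nn.Module):
    """H = g1(W1 x + w0_1); Lambda = g_lambda(W_lambda H + w0_lambda)."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.hidden = nn.Linear(d_in, d_hidden)   # W^{(1)}, w_0^{(1)}
        self.output = nn.Linear(d_hidden, d_out)  # W^{(lambda)}, w_0^{(lambda)}

    def forward(self, x):
        h = torch.relu(self.hidden(x))   # g^{(1)}: ReLU (assumed)
        lam = self.output(h)             # identity g^{(lambda)} for Gaussian Y
        return lam
```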
A neural network is equivalent to a GLM with a data-driven basis function. PCA is also data-driven, but it is (1) linear and (2) does not take $Y$ into account.
- $g$ is nonlinear (e.g., ReLU is piecewise linear)
Deep Neural Network : Multilayer Neural Network
- Universal Approximation Theorem
- Model fitting: MLE. Let $W$ be the vector containing every weight matrix, and let $(x_{1},y_{1}),\ldots,(x_{n}, y_{n})$ be the training data.
Then the cost function for $Y\sim N(\Lambda,\sigma^{2})$ is
\[C(W)=-l(W)\propto \frac{1}{n}\sum_{i=1}^{n}\big(y_{i}-\Lambda(x_{i};W)\big)^{2}\]If $Y\sim \text{Multi}(\Lambda)$, the negative log-likelihood becomes
\[C(W) \propto -\sum_{i}\sum_{k}y_{ik}\log p_{k}(x_{i};W)\]How to optimize? (See the sketch after the list below.)
- For neural nets, Newton-Raphson is infeasible (the Hessian is impractical to compute)
- Backpropagation: the gradient is easy to compute and easy to code
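A minimal sketch of both cost functions and one backpropagation-based gradient step using PyTorch autograd; the toy data shapes, network sizes, and learning rate are assumed values.

```python
import torch
import torch.nn.functional as F

# toy data (assumed shapes, for illustration only)
x = torch.randn(32, 10)
y = torch.randn(32, 1)                 # continuous response
y_cat = torch.randint(0, 3, (32,))     # categorical response, 3 classes

net = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
clf = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 3))

# Gaussian Y: cost proportional to the mean squared error
cost = ((y - net(x)) ** 2).mean()

# Multinomial Y: cost is the negative log-likelihood (cross-entropy)
cost_cat = F.cross_entropy(clf(x), y_cat)

# backpropagation gives dC/dW for every weight; one gradient-descent step:
cost.backward()
with torch.no_grad():
    for w in net.parameters():
        w -= 0.1 * w.grad              # assumed learning rate 0.1
```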
Gradient Descent
- Adam, AdaGrad, RMSprop
- Neural nets depend on the gradient, i.e., the activation function should have good gradient properties
- Gradient Saturation Problem
	- With sigmoid or hyperbolic tangent activations, the product of partial derivatives converges to 0 (illustrated in the sketch below)
	- Solution: Rectified Linear Unit (ReLU)
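A minimal sketch of the saturation effect, assuming a stack of 20 identical scalar layers with the weight fixed at 1 (arbitrary choices for illustration): the sigmoid gradient shrinks toward 0, while ReLU passes it through.

```python
import torch

def input_gradient(act, depth=20):
    """Push a scalar through `depth` identical layers and return d(out)/d(x)."""
    x = torch.tensor(2.0, requires_grad=True)
    h = x
    for _ in range(depth):
        h = act(1.0 * h)          # weight fixed at 1.0 for illustration
    h.backward()
    return x.grad.item()

print(input_gradient(torch.sigmoid))  # product of small derivatives -> ~0
print(input_gradient(torch.relu))     # ReLU keeps the gradient at 1 here
```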
What kind of Hidden Layer then?
- Image, video: CNN
- Text: LSTM, Transformer
- Density estimation: Normalizing Flow
Lecture 2
Overfitting Problem
Neural networks are highly flexible but prone to overfitting. Solutions:
- Shrinkage penalty
- Dropout
- Batch training
- Early stopping
Penalization
L1, L2 regularization (Lasso, Ridge): add a regularization term to the negative log-likelihood, e.g., an L2 term with a categorical response
\[C(W) \propto -\sum_{i}\sum_{k}y_{ik}\log p_{k}(x_{i};W)+\lambda\Vert W\Vert^{2}\]: works as a shrinkage estimator
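A minimal sketch of adding the L2 (ridge) penalty to the cross-entropy cost; the penalty weight `lam` and the network sizes are assumed. In PyTorch the same effect is usually obtained through the optimizer's `weight_decay` argument.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 3))
x = torch.randn(32, 10)
y = torch.randint(0, 3, (32,))

lam = 1e-3                                                 # assumed penalty weight
nll = F.cross_entropy(model(x), y)                         # negative log-likelihood part
penalty = sum((w ** 2).sum() for w in model.parameters())  # ||W||^2
cost = nll + lam * penalty
cost.backward()
```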
Dropout
- Spike and slab
- Attach a Bernoulli random variable to every layer, multiplying each hidden unit by it (see the sketch below)
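A minimal sketch of dropout as a Bernoulli mask; the keep probability 0.5 is an assumed value, and `torch.nn.Dropout` implements the same idea (including the rescaling).

```python
import torch

h = torch.randn(4, 8)                        # activations of one hidden layer
keep_prob = 0.5                              # assumed keep probability
mask = torch.bernoulli(torch.full_like(h, keep_prob))
h_dropped = h * mask / keep_prob             # rescale so E[h_dropped] = h
```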
Batch Training
Split the training data into $B$ batches and take a gradient-descent step on each batch (see the sketch after this list).
- Epoch: one cycle over all batches
- Batch: a sampling idea » familiar to statisticians?
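A minimal sketch of batch training; the batch size, learning rate, number of epochs, and toy data are assumed values.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

x = torch.randn(1000, 10)
y = torch.randint(0, 3, (1000,))
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 3))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):                    # one epoch = one cycle over all batches
    for xb, yb in loader:                 # one gradient step per batch
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(xb), yb)
        loss.backward()
        opt.step()
```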
Stochastic Gradient Descent(SGD)
Because of dropout and batch training, the likelihood and cost function change randomly at every training step, which makes the gradient descent stochastic.
Data Split
- Training data : Used to calculate gradient
- Validation data: not used to compute the gradient; instead used to evaluate the cost function and detect overfitting
Early Stopping
- If the validation cost does not improve for a specified number of epochs (the patience), stop training
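A minimal sketch of early stopping with a patience counter; the patience of 5 epochs and the helpers `train_one_epoch` / `validation_cost` are hypothetical placeholders, not from the lecture.

```python
best_cost, patience, wait = float("inf"), 5, 0

for epoch in range(200):
    train_one_epoch(model)                 # hypothetical: one pass of SGD over all batches
    val_cost = validation_cost(model)      # hypothetical: cost on the validation split
    if val_cost < best_cost:
        best_cost, wait = val_cost, 0      # improvement: reset the counter
    else:
        wait += 1
        if wait >= patience:               # no improvement for `patience` epochs
            break
```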
Algorithm Summary
- Initialize: generate $W_{0}$ from a probability distribution (a prior)
Validation Method for Binary Y
- Classification rule: map predictions to 0 or 1
- From the model we obtain $P(Y=1\vert X)$
- How do we turn this into a 0/1 prediction? By setting a threshold
	- A larger threshold lowers both TPR and FPR

We should not fix the threshold; instead, observe how the ROC curve changes as the threshold varies (see the sketch below).
- The ROC curve is important: summarize it with AUC-ROC
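A minimal sketch of the ROC curve and AUC computation with scikit-learn; the test labels and predicted probabilities here are random placeholders.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# placeholders: true labels and model probabilities P(Y=1|X) on test data
y_test = np.random.randint(0, 2, size=200)
p_hat = np.random.rand(200)

fpr, tpr, thresholds = roc_curve(y_test, p_hat)   # TPR/FPR at every threshold
auc = roc_auc_score(y_test, p_hat)
print(f"AUC-ROC = {auc:.3f}")
```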
For Continuous Y
- MSE alone cannot summarize the full result
- Draw a scatter plot of predicted vs. observed values on the test data (see the sketch below)
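A minimal sketch of the predicted-versus-observed scatter plot on test data; `y_test` and `y_pred` are random placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

# placeholders: observed and predicted responses on the test set
y_test = np.random.randn(200)
y_pred = y_test + 0.3 * np.random.randn(200)

plt.scatter(y_test, y_pred, s=10)
plt.axline((0, 0), slope=1, color="red")   # 45-degree reference line
plt.xlabel("observed y")
plt.ylabel("predicted y")
plt.show()
```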
Lecture 3. CNN and Transformers
Convolutional Neural Network
- Used when the input data $X$ are spatial
- Feedforward DNN: Input > Feature extraction layers > Fully connected (dense) layers > Output
Convolution
: Taking local weighted averages to create a summary image
Preserves spatial structure with far fewer parameters.
Pooling
: Downsampling, for (approximate) translational invariance. A traditional kernel (e.g., Gaussian) only smooths the image, so it may not be useful for image classification and similar tasks.
Channel and Filter
- Input channels: e.g., an RGB image gives 3 channels to the input layer
- A filter's kernel values are also estimated during training
- $N$ output channels produce $N$ output images from a convolutional layer
Stride, Filter(kernel size), Padding
- Stride : Step size for each slide
- Filter or kernel size : width and height of kernel
- Padding : Additional rows and columns to adjust the size of the resulting images
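A minimal sketch that puts channels, kernel size, stride, padding, and pooling together in PyTorch; the channel counts, 3x3 kernel, stride 1, padding 1, and 32x32 input size are assumed values.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    # 3 input channels (RGB) -> 16 output channels; 3x3 kernel, stride 1, padding 1
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),          # pooling: downsample by a factor of 2
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),          # fully connected (dense) layer -> 10 classes
)

x = torch.randn(8, 3, 32, 32)             # batch of 8 RGB images, 32x32
print(cnn(x).shape)                       # torch.Size([8, 10])
```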
Transformer
- Setting: $\mathbf{x} = [x_{1},\ldots,x_{T}]$, where $T$ is the number of words in the text $\mathbf{x}$
- Embedding: transform each word into a vector
- In deep learning, the embedding vectors are treated as parameters of the model and learned with plain SGD (see the sketch after this list)
- Text data analysis is just a special case of time series analysis.
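A minimal sketch of a learned embedding layer whose entries are ordinary model parameters updated by SGD; the vocabulary size and embedding dimension are assumed values.

```python
import torch
import torch.nn as nn

vocab_size, d = 10_000, 64                 # assumed vocabulary size and dimension
embed = nn.Embedding(vocab_size, d)        # the embedding matrix is a parameter

tokens = torch.tensor([[5, 42, 7, 999]])   # word indices x_1, ..., x_T (T = 4)
vectors = embed(tokens)                    # shape (1, 4, 64); learned by SGD
```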
Self-Attention Layer
- $x_{t}$ is the $d$-dimensional vector of the $t$-th word
- Linear transformation of $x_{t}$: value $V_{t} = W_{v}x_{t} + w_{v}$
- Self-attention: $H_{t} = \sum_{u} a(x_{t}, x_{u})\,V_{u}$,
where $a(x_{t},x_{u})>0$ and $\sum_{u}a(x_{t},x_{u})=1$; $a(x_{t},x_{u})$ is the attention that word $t$ gives to word $u$.
- Calculating $a(x_{t},x_{u})$
Query $Q_{t} = W_{q}x_{t}+ w_{q}$, key $K_{t}=W_{k}x_{t}+w_{k}$. Then, as in the standard scaled dot-product attention,
\[a(x_{t},x_{u}) = \frac{\exp\big(Q_{t}^{\top}K_{u}/\sqrt{d}\big)}{\sum_{u'}\exp\big(Q_{t}^{\top}K_{u'}/\sqrt{d}\big)}\]
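A minimal sketch of this single-head self-attention computation; the sequence length and dimension are assumed values, and the bias vectors are folded into the `nn.Linear` layers.

```python
import math
import torch
import torch.nn as nn

T, d = 4, 8                                   # assumed sequence length and dimension
x = torch.randn(T, d)                         # x_1, ..., x_T

W_q, W_k, W_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
Q, K, V = W_q(x), W_k(x), W_v(x)              # queries, keys, values

scores = Q @ K.T / math.sqrt(d)               # Q_t . K_u / sqrt(d)
a = torch.softmax(scores, dim=-1)             # a(x_t, x_u): each row sums to 1
H = a @ V                                     # H_t = sum_u a(x_t, x_u) V_u
```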
Position Encoding
: The positional information of the words is not contained in $x_{t}$ itself.
- Absolute position embedding: an embedding matrix indexed by word order (see the sketch after this list)
- Relative position embedding
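A minimal sketch of a learned absolute position embedding added to the word embeddings; the maximum length, vocabulary size, and dimension are assumed values.

```python
import torch
import torch.nn as nn

vocab_size, d, max_len = 10_000, 64, 512       # assumed sizes
word_emb = nn.Embedding(vocab_size, d)
pos_emb = nn.Embedding(max_len, d)             # embedding matrix over positions 0..T-1

tokens = torch.tensor([[5, 42, 7, 999]])       # (batch, T)
positions = torch.arange(tokens.size(1)).unsqueeze(0)
x = word_emb(tokens) + pos_emb(positions)      # word meaning + word order
```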
Multi-head Self Attention
Define $H$ separate self-attention heads, $h=1,\cdots,H$.
This makes it possible to extract various kinds of relationships in the text.
Transformer Layer
- Residual multi-head attention > Layer Normalization > Residual Dense Layers > Layer Normalization
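A minimal sketch of one such transformer layer built from `nn.MultiheadAttention`; the dimension, number of heads, and feed-forward width are assumed values, and the layer normalization is placed after each residual connection as written in the bullet above.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d=64, n_heads=8, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x):
        h, _ = self.attn(x, x, x)          # multi-head self-attention
        x = self.norm1(x + h)              # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))     # residual dense layers + layer normalization
        return x

x = torch.randn(1, 4, 64)                  # (batch, T, d)
print(TransformerLayer()(x).shape)         # torch.Size([1, 4, 64])
```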