Deep Learning
0709 KISS Summer school DL Lecture
Introduction to DL
Traditional linear models > suffer at specific tasks (e.g., MNIST)
- Why: they are designed for low-dimensional data
- In higher dimensions (e.g., image, video data), the curse of dimensionality occurs
- How about principal components? The first PC usually captures the overall mean and cannot capture small-scale features
How to overcome > use basis functions (kernel functions), as in the sketch below
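A minimal sketch of the basis-function idea (the polynomial basis and toy data are my own illustration, not from the lecture): a linear model fit on transformed features captures a nonlinear relationship.

```python
import numpy as np

# Nonlinear target: y = sin(2*pi*x) + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)

# Polynomial basis expansion: phi(x) = (1, x, x^2, ..., x^5)
Phi = np.vander(x, N=6, increasing=True)

# Ordinary least squares on the expanded features: still a linear model
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Prediction at x = 0.25; true value is sin(pi/2) = 1
x_new = np.array([0.25])
print(np.vander(x_new, N=6, increasing=True) @ beta)  # roughly 1
```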
Neural Network
1-layer Neural Network
The output layer is determined by the distribution of $Y$ (e.g., Gaussian for regression, Bernoulli for binary classification).
A neural network is equivalent to a GLM with data-driven basis functions. PCA is also data-driven, but it (1) is linear and (2) does not consider $Y$.
- The activation function is nonlinear (e.g., ReLU: piecewise linear)
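A minimal NumPy sketch of the 1-layer network as a GLM with data-driven basis functions (all shapes and names are illustrative): the hidden layer plays the role of the basis, and the output layer is a logistic model on top of it.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def one_layer_net(x, W, b, beta, c):
    """Forward pass: the hidden layer acts as data-driven basis functions."""
    h = relu(W @ x + b)                   # nonlinear, learned basis phi(x)
    logit = beta @ h + c                  # GLM (here: logistic regression) on h
    return 1.0 / (1.0 + np.exp(-logit))   # P(Y = 1 | x)

rng = np.random.default_rng(0)
p, m = 4, 8                               # input dim, number of hidden units
W, b = rng.standard_normal((m, p)), rng.standard_normal(m)
beta, c = rng.standard_normal(m), 0.0
print(one_layer_net(rng.standard_normal(p), W, b, beta, c))
```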
Deep Neural Network : Multilayer Neural Network
- Universal Approximation Theorem: a sufficiently wide network can approximate any continuous function
- Model fitting: MLE. Let the vector $\theta$ contain every weight matrix, and let $(x_i, y_i)$, $i = 1, \dots, n$, be the training data.
Then the cost function is the negative log-likelihood $C(\theta) = -\sum_{i=1}^n \log p(y_i \mid x_i; \theta)$.
If $Y \mid X$ is Gaussian, the negative log-likelihood becomes (up to constants) the sum of squared errors.
How to optimize?
- For neural nets, Newton-Raphson is infeasible (the Hessian cannot realistically be computed)
- Backpropagation: makes the gradient easy to compute and easy to code
Gradient Descent
- Variants: Adam, AdaGrad, RMSprop
- Neural-net training depends on the gradient, i.e., the activation function should have good gradient properties
- Gradient saturation problem
- With sigmoid or hyperbolic-tangent activations, the product of partial derivatives across layers converges to 0
- Solution: Rectified Linear Unit (ReLU)
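A small numerical illustration of the saturation problem (my own sketch, not from the lecture): the sigmoid derivative is at most 0.25, so its product over many layers shrinks to 0, while the ReLU derivative stays 1 on the active region.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)           # at most 0.25

def relu_grad(z):
    return float(z > 0)            # exactly 1 where the unit is active

z = 2.0  # pre-activation reused at every layer, for illustration
for depth in (5, 20, 50):
    print(depth,
          np.prod([sigmoid_grad(z)] * depth),   # -> 0 (vanishing gradient)
          np.prod([relu_grad(z)] * depth))      # stays 1
```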
What kind of Hidden Layer then?
- Image, video: CNN
- Text: LSTM, Transformer
- Density estimation: Normalizing Flow
Lecture 2
Overfitting Problem
Neural nets are highly flexible, but they easily overfit. Solutions:
- Shrinkage penalty
- Dropout
- Batch training
- Early stopping
Penalization
L1 and L2 regularization (lasso and ridge): add a regularization term to the negative log-likelihood, e.g. an L1 term with a categorical response, $-\sum_i \log p(y_i \mid x_i; \theta) + \lambda \sum_j |\theta_j|$.
The penalty acts as a shrinkage estimator (sketch below).
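A hedged sketch of the penalized cost (the names `penalized_cost` and `lam` are mine; the L1 form is the standard lasso-type penalty):

```python
import numpy as np

def penalized_cost(theta, neg_loglik, lam=0.01):
    """Negative log-likelihood plus an L1 (lasso-type) penalty."""
    return neg_loglik + lam * np.sum(np.abs(theta))

theta = np.array([0.5, -2.0, 0.0, 1.5])
nll = 3.2   # suppose this is -sum_i log p(y_i | x_i; theta)
print(penalized_cost(theta, nll))   # 3.2 + 0.01 * 4.0 = 3.24
```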
Dropout
- Related to the spike-and-slab prior
- Attach a Bernoulli random variable to every unit in each layer (units are randomly dropped)
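A minimal dropout sketch (the keep rate and the rescaling convention are the standard "inverted dropout" choice, which may differ from the lecture's exact formulation): each unit is multiplied by an independent Bernoulli variable during training.

```python
import numpy as np

def dropout(h, keep_prob=0.8, training=True, rng=None):
    """Multiply each unit by an independent Bernoulli(keep_prob) variable."""
    if not training:
        return h                         # no randomness at test time
    rng = rng or np.random.default_rng(0)
    mask = rng.binomial(1, keep_prob, size=h.shape)
    return h * mask / keep_prob          # rescale so the expectation is unchanged

h = np.ones(10)
print(dropout(h))   # roughly 80% of units survive, scaled by 1/0.8 = 1.25
```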
Batch Training
Split the whole training data into batches and perform a gradient-descent step on each batch.
- Epoch: one cycle over all batches
- Batching is a sampling idea, familiar ground for statisticians
Stochastic Gradient Descent(SGD)
Due to dropout and batch training, the likelihood and cost function change randomly at every training step; hence the descent is "stochastic".
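A sketch of minibatch SGD on a toy least-squares problem (the batch size, learning rate, and quadratic loss are my illustrative choices): each step uses the gradient of a random batch, so the cost itself is random.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.standard_normal((n, p))
theta_true = np.arange(1.0, p + 1.0)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

theta, lr, batch_size = np.zeros(p), 0.05, 32
for epoch in range(20):
    perm = rng.permutation(n)                    # reshuffle each epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]     # one random batch
        resid = X[idx] @ theta - y[idx]
        grad = X[idx].T @ resid / len(idx)       # batch gradient of MSE/2
        theta -= lr * grad                       # SGD update
print(theta.round(2))   # close to [1, 2, 3, 4, 5]
```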
Data Split
- Training data : Used to calculate gradient
- Validation data: not used to compute gradients; used instead to evaluate the cost function and check whether overfitting occurs
Early Stopping
- If the validation cost does not improve for a given number of epochs (the patience), stop training
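A sketch of the patience rule (the names `patience`, `step_fn`, `val_cost_fn` and the loop structure are illustrative, not from the lecture): stop once the validation cost has not improved for `patience` consecutive epochs, and remember the best epoch.

```python
import numpy as np

def train_with_early_stopping(step_fn, val_cost_fn, max_epochs=200, patience=10):
    """step_fn() runs one epoch of SGD; val_cost_fn() returns validation cost."""
    best_cost, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        step_fn()
        cost = val_cost_fn()
        if cost < best_cost:
            best_cost, best_epoch = cost, epoch   # also checkpoint weights here
        elif epoch - best_epoch >= patience:
            print(f"stopping at epoch {epoch}, best was {best_epoch}")
            break
    return best_epoch

# Toy validation curve: improves, then plateaus at a worse value
costs = iter(list(1 / np.arange(1, 50)) + [0.03] * 100)
train_with_early_stopping(lambda: None, lambda: next(costs))
```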
Algorithm Summary
- Initialize: generate the initial weights $\theta^{(0)}$ from a probability distribution (a prior)
Validation Method for Binary Y
- Classification rule: map predictions to 0 or 1
- From the model we obtain $\hat p(x) = \hat P(Y = 1 \mid X = x)$
- How do we turn this into a 0/1 prediction of $Y$?
- By setting a threshold: predict 1 when $\hat p(x)$ exceeds it. A larger threshold lowers both the TPR and the FPR.
We should not fix the threshold at a single value; instead, observe how (FPR, TPR) changes as the threshold varies, i.e., the ROC curve (sketch below).
- The ROC curve is the key summary: report AUC-ROC
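A small sketch computing TPR and FPR across thresholds (the data and the threshold grid are made up for illustration): raising the threshold lowers both rates, and sweeping it traces the ROC curve.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.5, size=500)
# fake predicted probabilities: informative but noisy
p_hat = np.clip(0.3 * y + 0.35 + 0.2 * rng.standard_normal(500), 0, 1)

for thr in (0.3, 0.5, 0.7):
    pred = (p_hat >= thr).astype(int)
    tpr = np.mean(pred[y == 1])    # true positive rate
    fpr = np.mean(pred[y == 0])    # false positive rate
    print(f"threshold={thr}: TPR={tpr:.2f}, FPR={fpr:.2f}")
# sweeping thr over [0, 1] traces the ROC curve; its area is the AUC
```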
For Continuous Y
- MSE alone cannot summarize the full result
- Draw a scatter plot of predicted vs. observed values on the test data
Lecture 3. CNN and Transformers
Convolutional Neural Network
- Used when the input is spatial data (e.g., images)
- Feedforward DNN : Input > Feature Extraction Layer > Fully connected(Dense) Layer > Output
Convolution
: Taking local weighted averages to create a summary image
Preserves spatial structure with far fewer parameters.
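A minimal 2D convolution in NumPy (really cross-correlation, as in most DL libraries; the function name and toy inputs are mine): the same small kernel slides over the image, so the weights are shared across locations.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation: local weighted averages with shared weights."""
    H, W = image.shape
    k, _ = kernel.shape
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.arange(25.0).reshape(5, 5)
kernel = np.full((3, 3), 1 / 9)   # 3x3 local average
print(conv2d(image, kernel))      # 3x3 summary image
# only 9 weights regardless of image size, vs. a dense layer's 25 x 9
```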
Pooling
: Downsampling for (approximate) translational invariance. A traditional kernel (e.g., Gaussian) performs smoothing, which may not be useful for tasks like image classification.
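A sketch of 2x2 max pooling (the window size is my choice): unlike Gaussian smoothing, max pooling keeps the strongest local response, which gives some translational invariance.

```python
import numpy as np

def max_pool(x, size=2):
    """Non-overlapping max pooling: downsample by taking block maxima."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]   # trim to a multiple of size
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))

x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 1, 5, 6],
              [1, 0, 7, 8]], dtype=float)
print(max_pool(x))   # [[4, 1], [1, 8]]
```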
Channel and Filter
- Input channels: e.g., an RGB image gives 3 channels in the input layer
- The filter's kernel values are also estimated during training
- N output channels produce N output images in a convolutional layer
Stride, Filter(kernel size), Padding
- Stride : Step size for each slide
- Filter or kernel size : width and height of kernel
- Padding : Additional rows and columns added to adjust the size of the resulting image
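These three settings determine the output size via the standard relation (the formula is the usual one, not lecture-specific): $\text{out} = \lfloor (W - K + 2P)/S \rfloor + 1$.

```python
def conv_output_size(width, kernel, stride=1, padding=0):
    """Standard formula: floor((W - K + 2P) / S) + 1."""
    return (width - kernel + 2 * padding) // stride + 1

print(conv_output_size(28, kernel=5))                       # 24 (MNIST-sized input)
print(conv_output_size(28, kernel=5, padding=2))            # 28 ("same" padding)
print(conv_output_size(28, kernel=5, stride=2, padding=2))  # 14 (downsampled)
```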
Transformer
- Setting: $T$ is the number of words in the text
- Embedding: transform each word into a vector
- In deep learning, the embedding vectors are themselves parameters of the model, learned with plain SGD
- Text data analysis is just a special case of time-series analysis
Self-Attention Layer
- $e_t$ is the $d$-dimensional embedding vector of word $t$
- Linear transformations of $e_t$: $q_t = W_q e_t$ (query), $k_t = W_k e_t$ (key), $v_t = W_v e_t$ (value)
- Self-attention: $z_t = \sum_{s=1}^{T} a_{ts} v_s$,
where $a_{ts}$ is the attention which word $t$ gives to word $s$.
- Calculating $a_{ts}$:
using query $q_t$ and key $k_s$, we calculate $a_{ts} = \operatorname{softmax}_s\big(q_t^\top k_s / \sqrt{d}\big)$.
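A NumPy sketch of single-head self-attention (dimensions and weight names are illustrative; rows of `E` are the embeddings, so the linear maps act on the right): queries and keys give attention weights via a softmax, which then mix the value vectors.

```python
import numpy as np

def self_attention(E, Wq, Wk, Wv):
    """E: (T, d) word embeddings. Returns (T, d) attended representations."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv      # linear transformations (row-vector form)
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)       # (T, T) similarity between words
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)  # softmax: row t = attention word t gives
    return A @ V                          # weighted average of value vectors

rng = np.random.default_rng(0)
T, d = 6, 8                               # 6 words, 8-dim embeddings
E = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(self_attention(E, Wq, Wk, Wv).shape)   # (6, 8)
```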
Position Encoding
: The positional information of the words is not contained in the embeddings $e_t$.
- Absolute position embedding: an embedding matrix indexed by the order of the words
- Relative position embedding
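A sketch of the absolute position embedding (a learned lookup table in practice; here randomly initialized purely for illustration): a position-indexed row is added to each word embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, max_len = 6, 8, 512
E = rng.standard_normal((T, d))               # word embeddings (no order info)
P = 0.02 * rng.standard_normal((max_len, d))  # position table, learned in practice
E_pos = E + P[:T]    # row t now encodes "this word sits at position t"
print(E_pos.shape)   # (6, 8)
```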
Multi-head Self Attention
Define multiple self-attention heads for the same input.
This makes it possible to extract various kinds of relationships in the text.
Transformer Layer
- Residual multi-head attention > Layer Normalization > Residual Dense Layers > Layer Normalization
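A sketch of this wiring (the dense sizes are illustrative; the identity function stands in for multi-head attention, where the `self_attention` sketch above would plug in):

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=-1, keepdims=True)
    sd = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sd + eps)

def transformer_layer(X, attention_fn, W1, b1, W2, b2):
    """Residual attention -> LayerNorm -> residual dense layers -> LayerNorm."""
    X = layer_norm(X + attention_fn(X))   # residual multi-head attention
    H = np.maximum(X @ W1 + b1, 0.0)      # position-wise dense layer, ReLU
    return layer_norm(X + H @ W2 + b2)    # second residual + normalization

rng = np.random.default_rng(0)
T, d, d_ff = 6, 8, 32
X = rng.standard_normal((T, d))
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)
print(transformer_layer(X, lambda Z: Z, W1, b1, W2, b2).shape)   # (6, 8)
```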