Coursera Deep Learning Specialization

Self-study
Published

March 17, 2023

Study questions

  • How do I input images of different dimensions into a neural net?
  • I don’t quite understand yet how convolutions eventually integrate more than just local information. In my understanding this can only happen efficiently in the fully connected layers?
  • Understand the cost function and its derivatives
  • Keras works with float32 (single precision) - is double precision ever used? Is single precision sufficient?
  • Read a little about Boltzmann machines

Course 2, Week 2

Mini-batch Gradient Descent

  • how do I control this in Keras / PyTorch / whatever? (see the sketch after this list)
  • mini-batch size = 1 -> stochastic gradient descent
  • mini-batch size is typically a power of 2 (e.g. 64, 128, 256, 512)
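A minimal sketch of where the mini-batch size is set: in Keras it is the `batch_size` argument of `model.fit`, in PyTorch it is set on the `DataLoader`. The dataset below is a random placeholder of mine.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 1000 examples with 20 features each.
x = torch.randn(1000, 20)
y = torch.randint(0, 2, (1000,))
dataset = TensorDataset(x, y)

# The mini-batch size is chosen here; batch_size=1 would give stochastic gradient descent.
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for xb, yb in loader:
    print(xb.shape)  # torch.Size([64, 20]) for every full batch
    break
```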

Exponentially moving averages:

\[V_t = \beta V_{t-1} + (1 - \beta) \theta_t\]

  • this corresponds approximately to averaging over the last \(\frac{1}{1-\beta}\) entries
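A minimal numpy sketch of this running average (function name and data are my own, not from the course):

```python
import numpy as np

def ema(theta, beta=0.9):
    """Exponentially weighted moving average: V_t = beta * V_{t-1} + (1 - beta) * theta_t."""
    v, out = 0.0, []
    for value in theta:
        v = beta * v + (1 - beta) * value
        out.append(v)
    return np.array(out)

# beta = 0.9 smooths roughly over the last 1 / (1 - 0.9) = 10 values.
noisy = 5.0 + np.random.randn(100)
smoothed = ema(noisy, beta=0.9)
```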

Gradient descent with momentum

  • average the gradients with an exponentially weighted moving average
  • if the learning rate is too large there is a risk of diverging
  • local minima are not the main problem - the problem is saddle points!
  • ball rolling down a hill: the velocity is stored in the average, the new gradient acts as the acceleration
  • \(\beta = 0.9\): averaging over approximately the last 10 iterations
  • Update rule (see the sketch after this list): \[ V_{dW} = \beta V_{dW} + (1-\beta)\, dW, \qquad W := W - \alpha V_{dW}\]
  • often approximated by: \[ V_{dW} = \beta V_{dW} + dW\] (the factor \(1-\beta\) is then absorbed into \(\alpha\))
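A minimal numpy sketch of one momentum step for a single parameter array (the toy problem is my own example):

```python
import numpy as np

def momentum_step(w, dw, v, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update."""
    v = beta * v + (1 - beta) * dw  # exponentially weighted average of the gradients
    w = w - alpha * v               # the averaged gradient drives the parameter update
    return w, v

# Toy usage: minimize f(w) = w^2, whose gradient is 2w.
w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, v = momentum_step(w, 2 * w, v)
```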

Root Mean Square Propagation

  • in the example \(dW\) is small and \(db\) is large (is this true in general?); the hyperparameter is called \(\beta_2\) here, and \(\epsilon = 10^{-8}\) avoids division by zero (see the sketch below) \[ S_{dW} := \beta_2 S_{dW} + (1 - \beta_2)\, dW^2, \qquad W := W - \alpha \frac{dW}{\sqrt{S_{dW}} + \epsilon}\] \[ S_{db} := \beta_2 S_{db} + (1 - \beta_2)\, db^2, \qquad b := b - \alpha \frac{db}{\sqrt{S_{db}} + \epsilon}\]
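A minimal numpy sketch of one RMSProp step (names and defaults are my own):

```python
import numpy as np

def rmsprop_step(w, dw, s, alpha=0.001, beta2=0.999, eps=1e-8):
    """One RMSProp update: scale each gradient by the root of its running mean square."""
    s = beta2 * s + (1 - beta2) * dw ** 2
    w = w - alpha * dw / (np.sqrt(s) + eps)
    return w, s
```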

Adaptive moment estimation optimization algorithm (Adam)

Initialization: \(V_{dW} = 0\), \(S_{dW} = 0\), \(V_{db} = 0\), \(S_{db} = 0\)

On each iteration \(t\): \[ V_{dW} = \beta_1 V_{dW} + (1 - \beta_1)\, dW \qquad V_{db} = \beta_1 V_{db} + (1 - \beta_1)\, db\] \[ S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\, dW^2 \qquad S_{db} = \beta_2 S_{db} + (1 - \beta_2)\, db^2\]

Typically implemented with bias correction: \[ V_{dW}^{corrected} = V_{dW} / (1 - \beta_1^t) \qquad V_{db}^{corrected} = V_{db} / (1 - \beta_1^t)\] \[ S_{dW}^{corrected} = S_{dW} / (1 - \beta_2^t) \qquad S_{db}^{corrected} = S_{db} / (1 - \beta_2^t)\]

Update rule (a numpy sketch follows the hyperparameter list): \[ W := W - \alpha \frac{V_{dW}^{corrected}}{\sqrt{S^{corrected}_{dW}} + \epsilon} \qquad b := b - \alpha \frac{V_{db}^{corrected}}{\sqrt{S^{corrected}_{db}} + \epsilon}\]

Hyperparameters:

  • \(\alpha\) needs to be tuned
  • \(\beta_1\): usually 0.9
  • \(\beta_2\): usually 0.999
  • \(\epsilon\): usually \(10^{-8}\)
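A minimal numpy sketch of one Adam step combining the equations above; the toy usage is my own example, not from the course:

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array; t is the 1-based iteration count."""
    v = beta1 * v + (1 - beta1) * dw        # momentum term
    s = beta2 * s + (1 - beta2) * dw ** 2   # RMSProp term
    v_corr = v / (1 - beta1 ** t)           # bias correction
    s_corr = s / (1 - beta2 ** t)
    w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return w, v, s

# Toy usage: a few hundred Adam steps on f(w) = w^2 (gradient 2w).
w = np.array([5.0])
v, s = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 501):
    w, v, s = adam_step(w, 2 * w, v, s, t, alpha=0.1)
```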

Learning rate decay

\[ \alpha = \frac{1}{1 + \text{decay rate} \times \text{epoch number}}\; \alpha_0\]
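As a quick check of the formula (the numbers are my own example):

```python
def decayed_lr(alpha0, decay_rate, epoch):
    """Learning rate after a given epoch: alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)

# With alpha0 = 0.2 and decay_rate = 1.0 the rate falls as 0.2, 0.1, 0.067, 0.05, ...
rates = [decayed_lr(0.2, 1.0, epoch) for epoch in range(4)]
```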

Softmax activation function

\[ a_i = \frac{\exp(z_i)}{\sum_{j=1}^{n} \exp(z_j)}\]
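A minimal numpy implementation (subtracting the max for numerical stability is my addition, not from the notes above):

```python
import numpy as np

def softmax(z):
    """Softmax activation; subtracting max(z) keeps exp() from overflowing without changing the result."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

a = softmax(np.array([5.0, 2.0, -1.0, 3.0]))  # entries are positive and sum to 1
```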

Softmax regression

  • generalization of logistic regression to \(n\) classes
  • Loss (negative log-likelihood): \[ L(\hat{y}, y) = - \sum^C_{j = 1} y_j \log \hat{y}_j \] (see the sketch below)
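A minimal numpy sketch of this loss for a single example (the label and prediction are my own placeholders):

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Softmax-regression loss: L = -sum_j y_j * log(y_hat_j) for a one-hot label y."""
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0, 0.0])        # one-hot true label
y_hat = np.array([0.1, 0.7, 0.1, 0.1])    # predicted class probabilities
loss = cross_entropy(y_hat, y)            # = -log(0.7) ≈ 0.36
```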

Course 3, Week 2

  • Avoidable bias: discrepancy between Bayes’ error and training error

Bias and variance with mismatched data distributions

  • potential help: add a Training-dev set
  • training error, training-dev error, dev error
  • data mismatch problem
  • Data augmentation: when the augmented data samples only a small subset of all possible examples, overfitting to that subset may occur (e.g. car images rendered from a video game)
  • Transfer learning: example from image recognition
    • train the network on a large set of images
    • now x-rays: re-initialize the weights of the last layer of the network and retrain only them (few x-rays available for training) or all parameters of the network (a lot of x-rays available for training) - see the sketch after this list
  • Multi-task learning: the output is a vector of labels and doesn’t have to be fully labeled
  • Transfer learning is currently used more frequently than multi-task learning
  • Multi-task learning: each class should have a similar number of examples
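A minimal Keras sketch of the "few x-rays" variant: freeze a pretrained base and retrain only a new final layer. The choice of MobileNetV2, the input shape, and the binary output are placeholders of mine, not from the course.

```python
import tensorflow as tf

# Pretrained base network (ImageNet weights as a stand-in for "a lot of images").
base = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(224, 224, 3), weights="imagenet"
)
base.trainable = False  # few x-rays available: keep all pretrained weights fixed

# Fresh last layer for the new task, trained from scratch on the x-ray data.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# With a lot of x-rays available, set base.trainable = True to retrain all parameters.
```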

Course 4

  • strided and padded convolution
  • cross-correlation (without flipping) vs. convolution (flipping around the vertical and horizontal axes) - in ML, cross-correlation is by convention also called convolution
  • convolutions over volumes (\(n_c\) = number of channels), see the sketch below: \[ n \times n \times n_c \;*\; f \times f \times n_c \;\rightarrow\; (n - f + 1) \times (n - f + 1) \times n_c'\]
  • \(n_c'\) is the number of channels in the next layer, i.e. the number of filters applied
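A quick helper to check the output shape, with padding and stride included as well (the general formula \(\lfloor (n + 2p - f)/s \rfloor + 1\) is from the course; the function itself is mine):

```python
def conv_output_shape(n, f, n_filters, padding=0, stride=1):
    """Output shape of a convolution over an n x n x n_c volume with f x f x n_c filters."""
    size = (n + 2 * padding - f) // stride + 1
    return (size, size, n_filters)

# 6x6x3 input, 3x3x3 filters, 2 filters, no padding, stride 1 -> (4, 4, 2)
print(conv_output_shape(n=6, f=3, n_filters=2))
```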

Pooling

  • max pooling: the only hyperparameters are the filter size f, stride, and padding (usually zero) - see the sketch after this list
  • average pooling: rarely used, sometimes for example to collapse the spatial dimensions (e.g. 7x7 with 1000 channels to 1x1 with 1000 channels)
  • no parameters to learn
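A minimal numpy sketch of max pooling on a single channel, illustrating that nothing is learned (input and sizes are my own example):

```python
import numpy as np

def max_pool2d(x, f=2, stride=2):
    """Max pooling over a square single-channel input; there are no parameters to learn."""
    n = (x.shape[0] - f) // stride + 1
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = x[i * stride:i * stride + f, j * stride:j * stride + f].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))  # [[ 5.  7.] [13. 15.]]
```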