<h1>Sayan’s Blog</h1>
<p>Dotting the I's in Aİ. This is my blog focussing on machine learning, computer science and other topics that catch my fancy.</p>
<h1>Law of Iterated Expectation</h1>
<p>Sayan Goswami, 2020-12-12, https://blog.sayan.page/lie</p>
<h2 id="what-is-lie">What is LIE</h2>
<p>The law of iterated expectation or the law of total expectation states that</p>
\[E[X] = E[E[X \mid Y]]\]
<p>or, in other words, the expected value of the conditional expected value of \(X\) given \(Y\) is the same as the expected value of \(X\).</p>
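<p>As a quick sanity check, the identity can be verified on a small discrete example. The joint distribution below is made up purely for illustration, and the function names are hypothetical.</p>

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X = x, Y = y); the numbers are made up.
joint = {
    (1, 0): F(1, 8), (2, 0): F(3, 8),
    (1, 1): F(1, 4), (2, 1): F(1, 4),
}

def expectation_x(joint):
    """E[X], computed directly from the joint distribution."""
    return sum(x * p for (x, _), p in joint.items())

def iterated_expectation_x(joint):
    """E[E[X | Y]]: average the conditional means of X, weighted by P(Y = y)."""
    total = F(0)
    for y in {y for _, y in joint}:
        p_y = sum(p for (_, yy), p in joint.items() if yy == y)
        e_x_given_y = sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y
        total += p_y * e_x_given_y
    return total

print(expectation_x(joint), iterated_expectation_x(joint))
```

<p>Exact fractions are used so that both sides can be compared without floating point error.</p>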
<h2 id="why-should-i-care">Why should I care</h2>
<p>The LIE is used in the derivation of the Bellman equations of reinforcement learning. Let me show you how.
From the definition of the value function (the expected value of the discounted return), we have</p>
\[\begin{aligned}
V^\pi (x) & = E_\pi \left[\sum_{t=0}^{\infty} {\gamma ^ t r(x_t, a_t) \mid x_0 = x} \right]\\
& = r(x, \pi(x)) + E_\pi \left[ \sum_{t=1}^{\infty} {\gamma ^ t r(x_t, a_t) \mid x_0 = x} \right]\\
& = r(x, \pi(x)) + \gamma E_y \left[ E_\pi \left[ \sum_{t=1}^{\infty} {\gamma ^ {t - 1} r(x_t, a_t) \mid x_0 = x, x_1 = y} \right] \right] \quad \color{CornflowerBlue}{\text{(LIE used here)}}\\
& = r(x, \pi(x)) + \gamma \sum_y {P(y \mid x, \pi (x))} E_\pi \left[ \sum_{t=1}^{\infty} {\gamma ^ {t - 1} r(x_t, a_t) \mid x_1 = y} \right] \quad \color{CornflowerBlue}{\text{(by Markov property)}}\\
& = r(x, \pi(x)) + \gamma \sum_y {P(y \mid x, \pi (x))} V^\pi (y) \qquad \blacksquare
\end{aligned}\]
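<p>The last line of the derivation can be checked numerically. The sketch below assumes a made-up two-state MDP under a fixed policy (the rewards, transition probabilities and discount factor are all hypothetical), computes \(V^\pi\) from the definition as a truncated discounted sum, and compares it with the right-hand side \(r(x, \pi(x)) + \gamma \sum_y P(y \mid x, \pi(x)) V^\pi(y)\).</p>

```python
# Hypothetical 2-state MDP under a fixed policy pi (all numbers are made up).
GAMMA = 0.9
REWARD = [1.0, 0.0]      # r(x, pi(x)) for x = 0, 1
P = [[0.5, 0.5],         # P(y | x, pi(x)); each row sums to 1
     [0.2, 0.8]]

def value_by_definition(horizon=500):
    """Truncated expected discounted return, sum_t gamma^t E[r(x_t, a_t) | x_0 = x]."""
    n = len(REWARD)
    values = []
    for start in range(n):
        dist = [1.0 if s == start else 0.0 for s in range(n)]  # distribution of x_t
        total = 0.0
        for t in range(horizon):
            total += GAMMA ** t * sum(d * r for d, r in zip(dist, REWARD))
            # Push the state distribution one step forward.
            dist = [sum(dist[x] * P[x][y] for x in range(n)) for y in range(n)]
        values.append(total)
    return values

def bellman_rhs(v):
    """Right-hand side of the Bellman equation evaluated at v."""
    n = len(v)
    return [REWARD[x] + GAMMA * sum(P[x][y] * v[y] for y in range(n)) for x in range(n)]

v = value_by_definition()
rhs = bellman_rhs(v)
print(v, rhs)
```

<p>With a horizon of 500 steps the truncation error is far below floating point precision, so the two lists agree to many decimal places.</p>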
<p>These are the Bellman equations for the value function!</p>
<h1>How many NaNs is too many?</h1>
<p>Sayan Goswami, 2020-10-26, https://blog.sayan.page/how-many-nans</p>
<p>Every person who has ever trained a machine/deep learning model has at one point in their lives come across a NaN (acronym for <strong>N</strong>ot <strong>a</strong> <strong>N</strong>umber). But, did you know that the internal representation of one NaN may be different from another?</p>
<p>Well hold on, let’s take a ride!</p>
<p>Computers don’t understand the decimal number system, so, everything needs to be represented in terms of binary numbers. One such representation of floating point numbers is the <a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE-754</a> standard.</p>
<p>In this standard, a floating point number is represented by a \(1\)-bit sign (\(S\)), a \(w\)-bit biased exponent (\(E\)) and a \(t = (p-1)\)-bit trailing significand \(T=d_1d_2...d_t\). \(d_0\) is implicitly encoded in the biased exponent. \(p\) is the precision of the significand.</p>
<p>For a \(k\)-bit (\(k \ge 128\)) IEEE-754 representation of a floating point number, we have</p>
<ol>
<li>\(k\) is a multiple of 32.</li>
<li>Precision, \(p\) is \(k - \lfloor 4 \times \log_2 (k) \rceil + 13\).</li>
<li>Maximum exponent, \(\text{emax}\) is \(2 ^ {k - p - 1} - 1\).</li>
<li>Bias is \(\text{emax}\).</li>
<li>Width of exponent field, \(w\) is \(\lfloor 4 \times \log_2 (k) \rceil - 13\)</li>
</ol>
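<p>These parameters can be computed in a few lines. The sketch below is only an illustration: the function name is hypothetical, it takes \(\text{emax} = 2^{k-p-1} - 1\) (which gives the familiar \(16383\) for the 128-bit format), and it only accepts widths that are multiples of 32 and at least 128.</p>

```python
import math

def interchange_format_params(k):
    """Parameters of a k-bit binary format, for k a multiple of 32 and k >= 128."""
    if k < 128 or k % 32 != 0:
        raise ValueError("the formulas apply to multiples of 32 that are at least 128")
    w = round(4 * math.log2(k)) - 13   # width of the exponent field
    p = k - w                          # precision: k - round(4 * log2(k)) + 13
    emax = 2 ** (k - p - 1) - 1        # maximum exponent
    return {"p": p, "w": w, "emax": emax, "bias": emax}

print(interchange_format_params(128))
print(interchange_format_params(256))
```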
<p>The bit representation of a 32-bit floating point number is</p>
<p style="text-align: center;"><img src="/images/float32.svg" alt="By Vectorization: Stannered - Own work based on: Float example.PNG, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=3357169" title="By Vectorization: Stannered - Own work based on: Float example.PNG, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=3357169" /></p>
<p>For the number shown above, the steps for converting the IEEE-754 representation to the decimal system are</p>
<ol>
<li>The sign bit is \(0\), so, the sign of the number is \((-1)^0 = 1\).</li>
<li>The exponent is \(01111100_2 = 124_{10}\).</li>
<li>The fraction part is \(0.01000000000000000000000_2 = 1 \times 2^{-2}\).</li>
</ol>
<p>Putting it all together, the decimal value of the number is obtained as</p>
\[(-1)^\text{\colorbox{#c3fbfd}{sign bit}} \times (1 + \colorbox{#feacac}{\text{fraction}}) \times 2 ^ {\colorbox{#9ffeab}{\text{exponent}} - \text{bias}}\]
<p>For a 32-bit floating point number, the bias is 127.
Thus, we get \(1 \times (1+\frac{1}{2^2}) \times \frac{1}{2^3} = \frac{5}{2^5}= 0.15625\).</p>
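<p>The same conversion can be sketched in Python. The helper below is hypothetical and only handles normal numbers (biased exponent strictly between 0 and 255); the result is cross-checked against the standard library's <code>struct</code> module.</p>

```python
import struct

def decode_float32(bits):
    """Decode a normal 32-bit float from its bit pattern (0 < exponent < 255)."""
    sign = bits >> 31                        # 1 bit
    exponent = (bits >> 23) & 0xFF           # 8 bits, biased by 127
    fraction = (bits & 0x7FFFFF) / 2 ** 23   # 23 bits, read as 0.d1 d2 ... d23
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)

# sign = 0, exponent = 01111100, fraction = 0100...0 (the example from the text)
bits = 0b0_01111100_01000000000000000000000
value = decode_float32(bits)
reference = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
print(hex(bits), value, reference)
```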
<p>Similarly, a 64-bit floating point number is represented as</p>
<p style="text-align: center;"><img src="/images/float64.svg" alt="By Codekaizen - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=3595583" title="By Codekaizen - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=3595583" /></p>
<p>The representation \(r\) and the value \(v\) of the floating point datum are inferred from the constituent fields of the floating point representation as follows.</p>
<ul>
<li>If \(E = 2^w -1\) and \(T \ne 0\) then \(v\) is NaN and \(r\) is either qNaN or sNaN.</li>
<li>If \(E = 2^w -1\) and \(T = 0\) then \(r = v = (-1)^S \times \infty\).</li>
<li>Signed floating point numbers are represented by all other \(E \in [0, 2^w - 2]\).</li>
<li>Note that zeros can be signed!</li>
</ul>
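<p>The special cases above can be observed for 32-bit floats by reinterpreting hand-picked bit patterns with <code>struct</code>. This is a small sketch; the helper name is hypothetical.</p>

```python
import math
import struct

def float32_from_bits(bits):
    """Reinterpret a 32-bit integer as an IEEE-754 single precision value."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]

pos_inf = float32_from_bits(0x7F800000)   # S = 0, E all ones, T = 0
neg_inf = float32_from_bits(0xFF800000)   # S = 1, E all ones, T = 0
a_nan = float32_from_bits(0x7FC00000)     # E all ones, T != 0
neg_zero = float32_from_bits(0x80000000)  # S = 1, E = 0, T = 0

print(pos_inf, neg_inf, a_nan, neg_zero)
```

<p>Negative zero compares equal to zero, so its sign has to be inspected with something like <code>math.copysign</code>.</p>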
<h3 id="types-of-nan">Types of NaN</h3>
<p>IEEE-754 defines two types of NaN, quiet NaN (qNaN) and signalling NaN (sNaN).</p>
<p>Quiet NaNs are implementation specific and can offer diagnostic information from the underlying data for a particular implementation.
The first bit of the trailing significand (\(d_1\)) is set to \(1\) for a quiet NaN.</p>
<p>Signalling NaNs are reserved operands that are used to signal the invalid operation exception.
The first bit of the trailing significand (\(d_1\)) is set to \(0\) for a signalling NaN.
Of the remaining \(t-1\) bits, at least 1 should be non-zero to distinguish the NaN from infinity.</p>
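<p>A minimal sketch of this distinction for 32-bit patterns follows. It works on raw integers rather than Python floats, and assumes the convention described here: \(d_1 = 1\) marks a quiet NaN and \(d_1 = 0\) a signalling NaN. The function name is hypothetical.</p>

```python
def nan_kind(bits):
    """Return 'qNaN' or 'sNaN' for a 32-bit NaN pattern, or None if it is not a NaN."""
    exponent = (bits >> 23) & 0xFF
    trailing = bits & 0x7FFFFF
    if exponent != 0xFF or trailing == 0:
        return None                     # finite number or infinity
    d1 = trailing >> 22                 # first bit of the trailing significand
    return "qNaN" if d1 == 1 else "sNaN"

for pattern in (0x7FC00000, 0x7F800001, 0x7F800000, 0x3E200000):
    print(hex(pattern), nan_kind(pattern))
```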
<h3 id="how-many-nans-are-there">How many NaNs are there</h3>
<p>We know that for a number to be NaN, it must have its exponent \(E\) set to \(2^w -1\) and \(T \ne 0\).
So, for a \(k\)-bit representation, we have</p>
\[\text{Total number of NaNs} = 2 ^ {k - w} - 2\]
\[\text{Number of quiet NaNs} = 2 ^ {k - (w + 1)}\]
\[\text{Number of signalling NaNs} = 2 ^ {k - w} - 2 ^ {k - (w + 1)} - 2\]
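<p>These counts can be cross-checked by brute force on a format small enough to enumerate. The sketch below uses the 16-bit format (\(k = 16\), \(w = 5\)) for the enumeration and then evaluates the formulas for 32 bits; the function names are hypothetical.</p>

```python
def nan_counts(k, w):
    """(total, quiet, signalling) NaN counts from the closed-form expressions."""
    total = 2 ** (k - w) - 2
    quiet = 2 ** (k - (w + 1))
    signalling = 2 ** (k - w) - 2 ** (k - (w + 1)) - 2
    return total, quiet, signalling

def brute_force_counts(k, w):
    """Count NaN bit patterns by enumerating all 2^k patterns (small k only)."""
    t = k - w - 1                                # width of the trailing significand
    total = quiet = 0
    for bits in range(2 ** k):
        exponent = (bits >> t) & (2 ** w - 1)
        trailing = bits & (2 ** t - 1)
        if exponent == 2 ** w - 1 and trailing != 0:
            total += 1
            quiet += trailing >> (t - 1)         # d_1 = 1 marks a quiet NaN
    return total, quiet, total - quiet

print(nan_counts(16, 5), brute_force_counts(16, 5))
print(nan_counts(32, 8))
```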
<p>The fraction of NaNs, compared to the number of valid floating point representations, follows the trend below.</p>
<p style="text-align: center;"><img src="/images/nanpercent.svg" alt="" /></p>
<p><strong>Note</strong>: <em>The y-axis of the graph uses log scale.</em></p>
<h1>How does the Perceptron Learning Algorithm work?</h1>
<p>Sayan Goswami, 2020-10-10, https://blog.sayan.page/how-does-pla-work</p>
<h2 id="perceptron-model">Perceptron Model</h2>
<p>The <a href="https://doi.org/10.1037/h0042519">perceptron</a> is a binary classification algorithm with target set
\(\mathcal{Y} = \{-1, +1\}\), domain set \(\mathcal{X} \in \mathbb{R}^{d \times m}\),
where \(m\) is the number of data points in the domain set.</p>
<p>For \(x \in \mathcal{X}\), a weighted score is computed and the predicted output
is \(+1\) if \(\sum_{i=1}^{d} {w_i x_i} > \theta\) otherwise, \(-1\). Here,
\(x \in \mathbb{R}^{d \times 1}\) and \(w \in \mathbb{R}^{d \times 1}\).</p>
<p>Thus, \(h(x) = sgn(\sum_{i=1}^{d} {w_i x_i} - \theta)\). Letting \(x_0 = -\theta\) be
a dummy feature and \(w_0 = 1\) be its corresponding weight, we can rewrite \(h(x)\) as:</p>
\[\begin{aligned}
h(x) = sgn(\sum_{\textcolor{orange}{i=0}}^{d} {w_i x_i}) &= sgn(w^\top x)\\
&= sgn(w \cdot x)
\end{aligned}\]
<p><strong>Note</strong>: Here \(x_i\) is the i-th element of the vector \(x\).</p>
<p><em>For the model to work, i.e. for \(h(x)\) to produce the correct label, when \(y= +1\),
we want \(w \cdot x > 0\), i.e. the angle between \(w\) and \(x\) should lie in \([0, \pi/2)\).
Similarly, when \(y = -1\), we want \(w \cdot x < 0\), i.e. the angle between \(w\)
and \(x\) should lie in \((\pi/2, \pi]\).</em></p>
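<p>The equivalence between the thresholded form and the dummy-feature form can be sketched as follows. The weights, threshold and inputs are made up, the function names are hypothetical, and \(sgn(0)\) is mapped to \(-1\) to match the rule that the output is \(+1\) only when the score strictly exceeds \(\theta\).</p>

```python
def sgn(z):
    """Sign function that maps 0 to -1, matching 'output +1 only if the score > theta'."""
    return 1 if z > 0 else -1

def predict_with_threshold(w, x, theta):
    """h(x) = sgn(sum_i w_i x_i - theta)."""
    return sgn(sum(wi * xi for wi, xi in zip(w, x)) - theta)

def predict_augmented(w, x, theta):
    """Same prediction, with the dummy feature x_0 = -theta and weight w_0 = 1."""
    w_aug = [1.0] + list(w)
    x_aug = [-theta] + list(x)
    return sgn(sum(wi * xi for wi, xi in zip(w_aug, x_aug)))

w, theta = [2.0, -1.0], 0.5                      # made-up weights and threshold
inputs = [[1.0, 1.0], [0.0, 1.0], [0.25, 0.0]]   # made-up data points
for x in inputs:
    print(x, predict_with_threshold(w, x, theta), predict_augmented(w, x, theta))
```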
<h2 id="perceptron-learning-algorithm">Perceptron Learning Algorithm</h2>
<p>The perceptron learning algorithm can be briefly stated as follows.</p>
<p>Let \(x_i \in \mathcal{X}\) be the inputs, \(y_i \in \mathcal{Y}\) their labels and
\(h \in \mathcal{H}\) a hypothesis, where \(\mathcal{H}\) is the hypothesis class.</p>
<ol>
<li>Initialize the weight vector \(w^t\) to a zero vector (\(t=0\)).</li>
<li>Find a mistake \((x_i, y_i)\) such that \(h(x_i) = sgn({w^t}^\top x_i) \ne y_i\).</li>
<li>Update the weight vector such that \(w^{t+1} \leftarrow w^{t} + y_i x_i\).</li>
<li>Repeat from step 2 until convergence.</li>
</ol>
<p><strong>Note</strong>: Here \(x_i\), \(y_i\) are the i-th data points from the training data set
and \(w^t\) is the weight vector at iteration \(t\).
Also, \(x_i \in \mathbb{R}^{d \times 1}\), \(y_i \in \mathbb{R}^{1 \times 1}\) and
\(w \in \mathbb{R}^{d \times 1}\).</p>
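<p>The four steps can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the toy data set is made up, each point carries a constant first feature so that the threshold can be learned as a weight (an assumption of this sketch), \(sgn(0)\) is treated as \(-1\), and a cap on the number of updates guards against data that is not linearly separable.</p>

```python
def sgn(z):
    """Sign function with sgn(0) = -1 (an assumption of this sketch)."""
    return 1 if z > 0 else -1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_learning(points, labels, max_updates=1000):
    """Run the perceptron learning algorithm and return the final weight vector."""
    w = [0.0] * len(points[0])                       # step 1: zero weight vector
    for _ in range(max_updates):
        # Step 2: find a misclassified point, if any.
        mistake = next(((x, y) for x, y in zip(points, labels)
                        if sgn(dot(w, x)) != y), None)
        if mistake is None:
            return w                                 # no mistakes left: converged
        x, y = mistake
        w = [wi + y * xi for wi, xi in zip(w, x)]    # step 3: w <- w + y_i x_i
    raise RuntimeError("no convergence within max_updates; data may not be separable")

# Made-up, linearly separable toy data; the first feature is a constant 1.
points = [(1.0, 2.0, 2.0), (1.0, 3.0, 1.0), (1.0, -1.0, -2.0), (1.0, -2.0, -1.0)]
labels = [1, 1, -1, -1]
weights = perceptron_learning(points, labels)
print(weights)
```

<p>On this data set a single update is enough, since the first positive point already separates the two classes.</p>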
<p>Now, why does this work? I am glad you asked!</p>
<p>Let us assume that \((x_i, y_i)\) was a misclassification. We have
\(w^{t+1} \leftarrow w^{t} + y_i x_i\).</p>
\[\begin{aligned}
w^{t+1} \cdot x_i &= {w^{t+1}}^\top x_i \\
&= ({w^{t} + y_i x_i})^\top x_i \\
&= {w^{t}}^\top x_i + y_i x_i^\top x_i \\
&= {w^{t}}^\top x_i + y_i \|x_i\|^2 \text{\hspace{2em}(1)}
\end{aligned}\]
<p>Now, let \(\alpha^{t+1}\) be the angle between the weight vector \(w^{t+1}\) and the input
vector \(x_i\).</p>
\[\begin{aligned}
cos(\alpha^{t+1}) &= \frac{w^{t+1} \cdot x_i}{\|w^{t+1}\|\|x_i\|}\\
&\propto w^{t+1} \cdot x_i
\end{aligned}\]
<p>From Equation \((1)\), and treating the positive norms as scaling factors, we get</p>
\[cos(\alpha^{t+1}) \propto cos(\alpha^t) + y_i \|x_i\|^2\]
<p>or,</p>
\[\begin{aligned}
cos(\alpha^{t+1}) < cos(\alpha^t) &\quad \text{if } y_i = -1\\
cos(\alpha^{t+1}) > cos(\alpha^t) &\quad \text{if } y_i = +1
\end{aligned}\]
<p>Thus, when \(y_i = -1\), \(cos(\alpha^{t+1}) < cos(\alpha^t) \implies \alpha^{t+1} > \alpha^t\).
The angle between a negative data point and the weight vector increases.</p>
<p>Whereas, when \(y_i = 1\), \(cos(\alpha^{t+1}) > cos(\alpha^t) \implies \alpha^{t+1} < \alpha^t\).
The angle between a positive data point and the weight vector decreases.</p>
<p>We can hereby conclude that the weight vector \(w^{t+1}\) is more aligned (tending towards \(0\) rad) to positive
data points and less aligned (tending towards \(\pi\) rad) to negative data points than the weight vector \(w^t\).
This is the intended behaviour!</p>
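<p>The effect of a single update on the dot product and on the angle can be checked numerically. The vectors below are made up, and the helper names are hypothetical; one case is a misclassified positive point and the other a misclassified negative point.</p>

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def angle(u, v):
    """Angle between two non-zero vectors, in radians."""
    cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    return math.acos(max(-1.0, min(1.0, cosine)))   # clamp against rounding error

def update(w, x, y):
    """One perceptron update: w <- w + y * x."""
    return [wi + y * xi for wi, xi in zip(w, x)]

# Misclassified positive point: w . x < 0 although y = +1.
w_pos, x_pos, y_pos = [1.0, -1.0], [1.0, 2.0], 1
w_pos_new = update(w_pos, x_pos, y_pos)

# Misclassified negative point: w . x > 0 although y = -1.
w_neg, x_neg, y_neg = [1.0, 1.0], [1.0, 2.0], -1
w_neg_new = update(w_neg, x_neg, y_neg)

print(angle(w_pos, x_pos), angle(w_pos_new, x_pos))   # the angle shrinks
print(angle(w_neg, x_neg), angle(w_neg_new, x_neg))   # the angle grows
```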