Jekyll2020-09-14T19:52:23+00:00https://dasayan05.github.io/feed.xmlAyan Das<b>Deep Learning</b> enthusiast; <b>Ph.D. Student</b> @ <a href="https://www.surrey.ac.uk/">University of Surrey</a>, United KingdomAyan Dasa.das@surrey.ac.ukEnergy Based Models (EBMs): A comprehensive introduction2020-08-13T00:00:00+00:002020-08-13T00:00:00+00:00https://dasayan05.github.io/blog-tut/2020/08/13/energy-based-models-one<p>We talked extensively about <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGMs</a> in my earlier article and also described <a href="/blog-tut/2020/01/01/variational-autoencoder.html">one particular model</a> following the principles of Variational Inference (VI). There exists another class of models conveniently represented by <em>Undirected</em> Graphical Models, which are practiced relatively less in the research community than modern methods of Deep Learning (DL). They are also characterized as <strong>Energy Based Models (EBM)</strong> because, as we shall see, they rely on something called <em>Energy Functions</em>. In the early days of this Deep Learning <em>renaissance</em>, we discovered a few extremely powerful models which helped DL gain momentum. The class of models we are going to discuss has far more theoretical support than modern-day Deep Learning, which, as we know, largely relies on intuition and trial-and-error. In this article, I will introduce you to the general concept of Energy Based Models (EBMs), their difficulties and how we can get over them. Also, we will look at a specific family of EBM known as <strong>Boltzmann Machines (BM)</strong>, which are very well known in the literature.</p>
<h2 id="undirected-graphical-models">Undirected Graphical Models</h2>
<p>The story starts when we try to model a number of Random Variables (RVs) as a graph, but we only have weak knowledge about which variables are related, not the direction of influence. Direction is a necessary requirement for <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGMs</a>. For example, consider a lattice of atoms (Fig.1(a)) where only neighbouring atoms influence each other’s spins, but it is unclear what the direction of the influence is. For simplicity, we will use a smaller model (Fig.1(b)) for demonstration purposes.</p>
<center>
<figure>
<img width="65%" style="padding-top: 20px;" src="/public/posts_res/17/undir_models.png" />
<figcaption>Fig.1: (a) An atom lattice model. (b) An arbitrary undirected model.</figcaption>
</figure>
</center>
<p>We model a set of random variables \(\mathbf{X}\) (in our example, \(\{ A,B,C,D \}\)) whose connections are defined by graph \(\mathcal{G}\) and have <em>“potential functions”</em> defined on each of its maximal <a href="https://en.wikipedia.org/wiki/Clique_(graph_theory)">cliques</a> \(\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})\). The total potential of the graph is defined as</p>
<p>\[
\Phi(\mathbf{x}) = \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \phi_{\mathcal{Q}}(q)
\]</p>
<p>\(q\) is an arbitrary instantiation of the set of RVs denoted by \(\mathcal{Q}\). The potential functions \(\phi_{\mathcal{Q}}(q)\in\mathbb{R}_{>0}\) are basically “affinity” functions on the state space of the cliques, i.e. given a state \(q\) of a clique \(\mathcal{Q}\), the corresponding potential function \(\phi_{\mathcal{Q}}(q)\) returns the <em>viability of that state</em> OR how likely that state is. Potential functions are somewhat analogous to conditional densities from Directed PGMs, except for the fact that potentials are <em>arbitrary positive values</em>; they don’t necessarily sum to one. As a concrete example, the graph in Fig.1(b) can thus be factorized as \(\displaystyle{ \Phi(a,b,c,d) = \phi_{\{A,B,C\}}(a,b,c)\cdot \phi_{\{A,D\}}(a,d) }\). If we assume the variables \(\{ A,D \}\) are binary RVs, the potential function corresponding to that clique, at its simplest, could be a table like this:</p>
\[\phi_{\{A,D\}}(a=0,d=0) = +4.00 \\
\phi_{\{A,D\}}(a=0,d=1) = +0.23 \\
\phi_{\{A,D\}}(a=1,d=0) = +5.00 \\
\phi_{\{A,D\}}(a=1,d=1) = +9.45\]
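<p>To make the clique factorization concrete, here is a minimal Python/NumPy sketch. The \(\{A,D\}\) table comes from above; the \(\{A,B,C\}\) potential values are made up purely for illustration:</p>

```python
import numpy as np

# Potential table for the clique {A, D} (values from the example above).
# phi_AD[a, d] is the affinity of state (A=a, D=d): positive, but the
# entries need not sum to one.
phi_AD = np.array([[4.00, 0.23],
                   [5.00, 9.45]])

# A made-up potential table for the clique {A, B, C} (illustrative only).
phi_ABC = np.ones((2, 2, 2))
phi_ABC[1, 1, 1] = 7.0   # e.g. the "all on" state is highly viable

def total_potential(a, b, c, d):
    """Phi(a,b,c,d) = phi_{A,B,C}(a,b,c) * phi_{A,D}(a,d)."""
    return phi_ABC[a, b, c] * phi_AD[a, d]

print(total_potential(1, 1, 1, 1))  # 7.0 * 9.45
```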
<p>Just like every other model in machine learning, the potential functions can be parameterized, leading to</p>
<p>\[ \tag{1}
\Phi(\mathbf{x}; \Theta) = \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \phi_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})
\]</p>
<p>Semantically, a potential denotes how likely a given state is: the higher the potential, the more likely the state.</p>
<h2 id="reparameterizing-in-terms-of-energy">Reparameterizing in terms of Energy</h2>
<p>When we are defining a model, however, it is inconvenient to choose a constrained (non-negative valued) parameterized function. We can easily reparameterize each potential function in terms of <strong>energy</strong> functions \(E_{\mathcal{Q}}\) where</p>
<p>\[\tag{2}
\phi_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}}) = \exp(-E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}}))
\]</p>
<p>The \(\exp(\cdot)\) enforces the potentials to always be positive, so we are free to choose an <em>unconstrained</em> energy function. One question you might ask is: why the negative sign? Frankly, there is no functional purpose to it. All it does is <em>revert the semantic meaning</em> of the parameterized function. When we were dealing with potentials, a state which was more likely had a higher potential. Now it’s the opposite: states that are more likely have lower energy. Does this semantic sound familiar? It actually comes from Physics, where energies (potential, kinetic, etc.) are <em>bad</em>, i.e. lower energy means a more stable system.</p>
<p>Such reparameterization affects the semantics of Eq.1 as well. Putting Eq.2 into Eq.1 yields</p>
\[\begin{align}
\Phi(\mathbf{x}; \Theta) &= \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \exp(-E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})) \\
\tag{3}
&= \exp\left(-\sum_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})\right) =
\exp(-E_{\mathcal{G}}(\mathbf{x}; \Theta))
\end{align}\]
<p>Here we defined \({ E_{\mathcal{G}}(\mathbf{x}; \Theta) \triangleq \sum_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}}) }\) to be the energy of the whole model. Please note that the reparameterization helped us convert the relationship between the individual cliques and the whole graph <em>from multiplicative (Eq.1) to additive (Eq.3)</em>. This implies that when we design energy functions for such undirected models, we can design an energy function for each individual clique and simply add them.</p>
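<p>As a quick numeric sanity check, summing hypothetical per-clique energies and exponentiating recovers the product of the corresponding potentials (the energy values below are made up):</p>

```python
import numpy as np

# Hypothetical per-clique energies E_Q(q) for one particular state x.
clique_energies = [0.7, -1.2, 2.5]

# Additive form (Eq.3): the total energy is just the sum ...
E_total = sum(clique_energies)

# ... and exponentiating recovers the multiplicative form (Eq.1).
potentials = [np.exp(-E) for E in clique_energies]
Phi = np.prod(potentials)

assert np.isclose(Phi, np.exp(-E_total))
```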
<p>All this is fine .. well .. unless we need to do things like <em>sampling</em>, <em>computing the log-likelihood</em>, etc. Then our energy-based parameterization fails, because it’s not easy to incorporate an un-normalized function into probabilistic frameworks. So we need a way to get back to probabilities.</p>
<h2 id="back-to-probabilities">Back to Probabilities</h2>
<p>The obvious way to convert the un-normalized potentials of the model into a normalized distribution is to explicitly normalize Eq.3 over its domain</p>
\[\begin{align}
p(\mathbf{x}; \Theta) &= \frac{\Phi(\mathbf{x}; \Theta)}{
\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \Phi(\mathbf{x}'; \Theta)
} \\ \\
\tag{4}
&= \frac{\exp(-E_{\mathcal{G}}(\mathbf{x}; \Theta)/\tau)}{\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \exp(-E_{\mathcal{G}}(\mathbf{x}'; \Theta)/\tau)}\text{ (using Eq.3)}
\end{align}\]
<p>This is the probabilistic model implicitly defined by the energy functions over the whole state-space. [We will discuss \(\tau\) shortly; consider it to be 1 for now.] If the reader is familiar with Statistical Mechanics at all, they might find it extremely similar to the <code class="language-plaintext highlighter-rouge">Boltzmann Distribution</code>. Here’s what <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Wikipedia</a> says:</p>
<blockquote>
<p>In statistical mechanics and mathematics, a Boltzmann distribution (also called Gibbs distribution) is a probability distribution or probability measure that gives the probability that a system will be in a certain state as a function of that state’s energy …</p>
</blockquote>
<p>From now on, Eq.4 will be the sole connection between energy-space and probability-space. We can now forget about potential functions. A 1-D example of an energy function and the corresponding probability distribution is shown below:</p>
<center>
<figure>
<img width="75%" style="padding-top: 20px;" src="/public/posts_res/17/energy_prob.png" />
<figcaption>Fig.2: An energy function and its corresponding probability distribution.</figcaption>
</figure>
</center>
<p>The denominator of Eq.4 is often known as the “Partition Function” (denoted as \(Z\)). Whatever the name, it is quite difficult to compute in general, because the number of terms in the summation grows exponentially with the dimensionality of \(\mathbf{X}\).</p>
<p>A hyper-parameter called “temperature” (denoted as \(\tau\)) is often introduced in Eq.4 which also has its roots in the original <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Boltzmann Distribution from Physics</a>. A decrease in temperature gathers the probability mass near the lowest energy regions. If not specified, consider \(\tau=1\) for the rest of the article.</p>
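<p>Here is a tiny sketch of Eq.4 on a hypothetical 3-state energy function, showing how the temperature \(\tau\) controls the sharpness of the distribution:</p>

```python
import numpy as np

def boltzmann(E, tau=1.0):
    """Eq.4: p(x) proportional to exp(-E(x)/tau), normalized by Z."""
    unnorm = np.exp(-np.asarray(E) / tau)
    return unnorm / unnorm.sum()          # divide by the partition function Z

E = np.array([1.0, 0.0, 3.0])             # toy energies; state 1 has lowest energy

p_hot  = boltzmann(E, tau=10.0)           # high temperature -> nearly uniform
p_cold = boltzmann(E, tau=0.1)            # low temperature  -> mass on argmin E

print(p_hot, p_cold)
```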
<h2 id="a-general-learning-algorithm">A general learning algorithm</h2>
<p>The question now is: how do we learn the model given a dataset? Let’s say our dataset has \(N\) samples: \(\mathcal{D} = \{ x^{(i)} \}_{i=1}^N\). An obvious way to derive a learning algorithm is to minimize the Negative Log-Likelihood (NLL) loss of the model over our dataset</p>
\[\begin{align}
\mathcal{L}(\Theta; \mathcal{D}) = - \log \prod_{i=1}^N p(x^{(i)}; \Theta) &= \sum_{i=1}^N -\log p(x^{(i)}; \Theta) \\
&= \underbrace{\frac{1}{N}\sum_{i=1}^N}_{\text{expectation}} \left[ E_{\mathcal{G}}(x^{(i)}; \Theta) \right] + \log Z\\
&\text{(putting Eq.4 followed by trivial calculations, and}\\
&\text{ dividing loss by constant N doesn't affect optima)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\bigl[ E_{\mathcal{G}}(x; \Theta) \bigr] + \log Z
\end{align}\]
<p>Computing gradient w.r.t. parameters yields</p>
\[\begin{align}
\frac{\partial \mathcal{L}}{\partial \Theta} &= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \frac{\partial}{\partial \Theta} \log Z \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \frac{1}{Z} \frac{\partial}{\partial \Theta} \left[ \sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \exp(-E_{\mathcal{G}}) \right]\text{ (using definition of Z)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \underbrace{\frac{1}{Z} \exp(-E_{\mathcal{G}})}_{\text{RHS of Eq.4}} \cdot \frac{\partial (-E_{\mathcal{G}})}{\partial \Theta}\\
&\text{ (Both Z and the partial operator are independent}\\
&\text{ of x and can be pushed inside the summation)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] - \underbrace{\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} p(\mathbf{x}'; \Theta)}_{\text{expectation}} \cdot \frac{\partial E_{\mathcal{G}}}{\partial \Theta}\\
\tag{5}
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] - \mathbb{E}_{x\sim\mathcal{p_{\Theta}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]
\end{align}\]
<p>Take a few minutes to digest Eq.5; it is a very important result and worth discussing a bit further. The first term in Eq.5 is often known as the “Positive Phase” and the second term as the “Negative Phase”. The only difference between them, as you can see, is the distribution over which the expectation is taken. The first expectation is over the <em>data distribution</em>, i.e. picking samples from our dataset. The second expectation is over the <em>model distribution</em>, i.e. sampling from the model with its current parameters. To understand their semantic interpretation, we need to see them in isolation. For the sake of explanation, consider the two terms separately, yielding two parameter update rules</p>
\[\Theta := \Theta - \lambda\cdot\mathbb{E}_{x\sim\mathcal{D}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]\text{, and }
\Theta := \Theta + \lambda\cdot\mathbb{E}_{x\sim\mathcal{p_{\Theta}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]\]
<p>The first update rule basically tries to change the parameters in a way that minimizes the energy function at points <em>coming from the data</em>. The second one tries to maximize (notice the difference in sign) the energy function at points <em>coming from the model</em>. The original update rule (combining both of them) has both of these effects working simultaneously. The minimum of the loss landscape occurs when our model discovers the data distribution, i.e. \(p_{\Theta} \approx p_{\mathcal{D}}\). At this point, the positive and negative phases are approximately the same and the gradient becomes zero (i.e., no more progress). Below is a clearer picture of the update process: the algorithm <em>pushes the energy down</em> at places where real data lies, and <em>pulls the energy up</em> at places where the <em>model thinks</em> real data lies.</p>
<center>
<figure>
<img width="95%" style="padding-top: 20px;" src="/public/posts_res/17/pos_neg_phase_diagram.png" />
<figcaption>Fig.3: (a) Model is being optimized. The arrows depict the "pulling up" and "pushing down" of energy landscape. (b) Model has converged to an optimum.</figcaption>
</figure>
</center>
<p>Interpretation aside, since (as I mentioned before) the denominator of \(p(\mathbf{x}; \Theta)\) (see Eq.4) is intractable in the general case, computing the expectation in the negative phase is extremely hard. In fact, that is the one difficulty that makes this algorithm practically challenging.</p>
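<p>Since Eq.5 is so central, it is worth verifying numerically. On a toy model with a tiny state space, \(Z\) is tractable. The sketch below (all quantities hypothetical) uses a tabular energy with one free parameter per state, for which \(\partial E/\partial\Theta\) is a one-hot vector; the positive phase then reduces to the empirical state frequency and the negative phase to the model probability:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

theta = rng.normal(size=4)                 # tabular energy: E(x) = theta[x]
data = rng.integers(0, 4, size=100)        # toy "dataset" of state indices

def nll(theta):
    logZ = np.log(np.exp(-theta).sum())
    return np.mean(theta[data]) + logZ     # average energy + log Z

# Eq.5: positive phase (data expectation) minus negative phase (model expectation)
pos = np.bincount(data, minlength=4) / len(data)   # E_{x~p_D}[dE/dtheta]
neg = np.exp(-theta) / np.exp(-theta).sum()        # E_{x~p_theta}[dE/dtheta]
grad = pos - neg

# Check against a finite-difference gradient of the NLL.
eps, fd = 1e-6, np.zeros(4)
for i in range(4):
    e = np.zeros(4); e[i] = eps
    fd[i] = (nll(theta + e) - nll(theta - e)) / (2 * eps)

print(np.abs(grad - fd).max())             # tiny: Eq.5 matches the exact gradient
```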
<h2 id="gibbs-sampling">Gibbs Sampling</h2>
<p>As we saw in the last section, the only difficulty we have in implementing Eq.5 is not being able to sample from an intractable density (Eq.4). It turns out, however, that the <em>conditional densities</em> of a small subset of variables given the others are indeed tractable in most cases. That is because, for conditionals, \(Z\) cancels out. The conditional density of one variable (say \(X_j\)) given the others (denoted \(X_{-j}\)) is:</p>
\[\tag{6}
p(x_j\vert \mathbf{x}_{-j}) = \frac{p(\mathbf{x})}{p(\mathbf{x}_{-j})}
= \frac{\exp(-E_{\mathcal{G}}(\mathbf{x}))}{\sum_{x_j} \exp(-E_{\mathcal{G}}(\mathbf{x}))}
\text{ (using Eq.4)}\]
<p>I excluded the parameter symbols for notational brevity. The summation in the denominator is not as scary as the one in Eq.4; it runs over just one variable. We take advantage of this and wisely choose a sampling algorithm that uses conditional densities: <a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs Sampling</a>. I am not going to prove that it works; you either have to take my word for it OR read about it in the link provided. For the sake of this article, just believe that the following works.</p>
<p>To sample \(\mathbf{x}\sim p_{\Theta}(\mathbf{x})\), we iteratively execute the following for \(T\) iterations</p>
<ol>
<li>We have a sample from last iteration \(t-1\) as \(\mathbf{x}^{(t-1)}\)</li>
<li>We then pick one variable \(X_j\) (in some order) at a time and sample from its conditional given the others: \(x_j^{(t)}\sim p(x_j\vert \underbrace{x_1^{(t)}, \cdots, x_{j-1}^{(t)}}_{\text{current iteration}}, \underbrace{x_{j+1}^{(t-1)}, \cdots, x_D^{(t-1)}}_{\text{previous iteration}})\). Please note that once we have sampled a variable, we fix its value to the latest sample; for variables not yet visited in this iteration, we keep their values from the previous iteration.</li>
</ol>
<p>We can start this process by setting \(\mathbf{x}^{(0)}\) to anything. If \(T\) is sufficiently large, the samples towards the end are true samples from the density \(p_{\Theta}\). To understand it a bit more rigorously, I <strong>highly recommend</strong> that you <a href="https://en.wikipedia.org/wiki/Gibbs_sampling#Implementation">go through this</a>. You might be curious as to why this algorithm is iterative. That’s because Gibbs sampling belongs to the <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">MCMC family</a> of algorithms, which have a so-called “burn-in period”.</p>
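<p>The two steps above can be sketched as a minimal Gibbs sampler for a toy two-variable distribution. The joint table below is made up, and kept tractable only so that we can check the sampler against it:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up target joint over two binary RVs (rows: x1, cols: x2).
joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])

def gibbs(T=20000, burn_in=1000):
    x1, x2 = 0, 0                               # arbitrary starting state x^(0)
    samples = []
    for t in range(T):
        # p(x1 | x2): a one-variable conditional, where Z cancels out (Eq.6)
        p = joint[:, x2] / joint[:, x2].sum()
        x1 = rng.choice(2, p=p)
        # p(x2 | x1), conditioning on the freshly sampled x1
        p = joint[x1, :] / joint[x1, :].sum()
        x2 = rng.choice(2, p=p)
        if t >= burn_in:                        # discard the burn-in period
            samples.append((x1, x2))
    return np.array(samples)

s = gibbs()
est = np.array([[np.mean((s[:, 0] == i) & (s[:, 1] == j))
                 for j in range(2)] for i in range(2)])
print(est)   # empirical joint, close to the target table
```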
<p>Now that we have pretty much everything needed, let’s explore some popular models based on the general principles of EBMs.</p>
<h2 id="boltzmann-machine">Boltzmann Machine</h2>
<p>Boltzmann Machine (BM) is one particular model that has been in the literature for a long time. BM is the simplest one in its family and is used for modelling a binary random vector \(\mathbf{X}\in\{0,1\}^D\) with \(D\) components \([ X_1, X_2, \cdots, X_D ]\). All \(D\) RVs are connected to all others by an undirected graph \(\mathcal{G}\).</p>
<center>
<figure>
<img width="30%" style="padding-top: 20px;" src="/public/posts_res/17/bm_diagram.png" />
<figcaption>Fig.4: Undirected graph representing Boltzmann Machine</figcaption>
</figure>
</center>
<p>By design, BM has a fully connected graph and hence only one maximal clique (containing all RVs). The energy function used in BM is possibly the simplest one you can imagine:</p>
\[\tag{7}
E_{\mathcal{G}}(\mathbf{x}; W) = - \frac{1}{2} \mathbf{x}^T W \mathbf{x}\]
<p>Upon expanding the vectorized form (the reader is encouraged to try), we can see each term \(x_i\cdot W_{ij}\cdot x_j\) for all \(i\lt j\) as the contribution of the pair of RVs \((X_i, X_j)\) to the whole energy function, with \(W_{ij}\) as the “connection strength” between them. If a pair of RVs \((X_i, X_j)\) turns on together more often, a high value of \(W_{ij}\) reduces the total energy. So through learning, we expect to see \(W_{ij}\) going up if \((X_i, X_j)\) fire together. This phenomenon is the founding idea of a closely related learning strategy called <a href="https://en.wikipedia.org/wiki/Hebbian_theory">Hebbian Learning</a>, proposed by Donald Hebb. Hebbian theory basically says:</p>
<blockquote>
<p>Neurons that fire together, wire together</p>
</blockquote>
<p>How do we learn this model then? We have already seen the general way of computing the gradient: \(\displaystyle{ \frac{\partial E_{\mathcal{G}}}{\partial W} = -\mathbf{x}\mathbf{x}^T }\). So let’s use Eq.5 to derive a learning rule:</p>
\[W := W - \lambda \cdot \left( \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{x}^T ] - \mathbb{E}_{\mathbf{x}\sim \mathrm{Gibbs}(p_{W})}[ -\mathbf{x}\mathbf{x}^T ] \right)\]
<p>Equipped with Gibbs sampling, this is pretty easy to implement now. But my description of the Gibbs sampling algorithm was very general; we have to specialize it for the BM. Remember the conditional density we talked about (Eq.6)? For the specific energy function of the BM (Eq.7), it has a very convenient and tractable form:</p>
\[p(x_j = 1\vert \mathbf{x}_{-j}; W) = \sigma\left(W_{-j}^T\cdot \mathbf{x}_{-j}\right)\]
<p>where \(\sigma(\cdot)\) is the Sigmoid function and \(W_{-j}\in\mathbb{R}^{D-1}\) denotes the vector of parameters connecting \(x_j\) with the rest of the variables \(\mathbf{x}_{-j}\in\mathbb{R}^{D-1}\). I am leaving the proof to the readers; it’s not hard, just a bit lengthy [Hint: put the BM energy function into Eq.6 and keep simplifying]. This particular form makes the nodes behave somewhat like computation units (i.e., neurons), as shown in Fig.5 below:</p>
<center>
<figure>
<img width="25%" style="padding-top: 20px;" src="/public/posts_res/17/bm_conditional.png" />
<figcaption>Fig.5: The computational view of BM showing its dependencies by arrows.</figcaption>
</figure>
</center>
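<p>This conditional makes one Gibbs sweep over a BM straightforward to write down. A sketch with hypothetical random weights, assuming (as is standard for a BM) a symmetric \(W\) with zero diagonal, so that the \(j\)-th row of \(W\) dotted with \(\mathbf{x}\) equals \(W_{-j}^T\cdot\mathbf{x}_{-j}\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D = 5
W = rng.normal(size=(D, D))
W = (W + W.T) / 2                      # symmetric weights
np.fill_diagonal(W, 0)                 # no self-connections

def gibbs_sweep(x, W):
    """One full sweep: resample each x_j from p(x_j=1 | x_{-j}) = sigmoid(W_{-j}^T x_{-j})."""
    x = x.copy()
    for j in range(len(x)):
        # with a zero diagonal, W[j] @ x equals W_{-j}^T x_{-j}
        p_on = sigmoid(W[j] @ x)
        x[j] = rng.random() < p_on
    return x

x = rng.integers(0, 2, size=D).astype(float)
for _ in range(100):                   # iterate towards the model distribution
    x = gibbs_sweep(x, W)
print(x)
```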
<h2 id="boltzmann-machine-with-latent-variables">Boltzmann Machine with latent variables</h2>
<p>To add more expressiveness to the model, we can introduce latent/hidden variables. They are not observed, but help <em>explain</em> the visible variables (see my <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGM</a> article). However, all variables are still fully connected to each other (shown below in Fig.6(a)).</p>
<p><strong>[ A little disclaimer: as we have already covered a lot of the founding ideas, I am going to go over this a bit faster. You may have to look back and find analogies with our previous formulations. ]</strong></p>
<center>
<figure>
<img width="70%" style="padding-top: 20px;" src="/public/posts_res/17/hbm_diagram.png" />
<figcaption>Fig.6: (a) Undirected graph of BM with hidden units (shaded ones are visible). (b) Computational view of the model while computing conditionals. </figcaption>
</figure>
</center>
<p>Suppose we have \(K\) hidden units and \(D\) visible ones. The energy function is defined very similarly to that of the normal BM, but now contains separate terms for visible-hidden (\(W\in\mathbb{R}^{D\times K}\)), visible-visible (\(V\in\mathbb{R}^{D\times D}\)) and hidden-hidden (\(U\in\mathbb{R}^{K\times K}\)) interactions. We compactly represent them as \(\Theta \triangleq \{ W, U, V \}\).</p>
\[E_{\mathcal{G}}(\mathbf{x}, \mathbf{h}; \Theta) = -\mathbf{x}^T W \mathbf{h} - \frac{1}{2} \mathbf{x}^T V \mathbf{x} - \frac{1}{2} \mathbf{h}^T U \mathbf{h}\]
<p>The motivation for such an energy function is very similar to the original BM. However, the probabilistic form of our model is no longer Eq.4, but the marginalized joint distribution:</p>
\[p(\mathbf{x}; \Theta) = \sum_{\mathbf{h}\in\mathrm{Dom}(\mathbf{H})} p(\mathbf{x}, \mathbf{h}; \Theta)
= \sum_{\mathbf{h}\in\mathrm{Dom}(\mathbf{H})} \frac{\exp(-E_{\mathcal{G}}(\mathbf{x}, \mathbf{h}))}{\sum_{\mathbf{x}',\mathbf{h}'\in\mathrm{Dom}(\mathbf{X}, \mathbf{H})} \exp(-E_{\mathcal{G}}(\mathbf{x}', \mathbf{h}'))}\]
<p>It might look a bit scary, but it’s just the joint marginalized over the hidden state-space. Very surprisingly though, the conditionals have forms pretty similar to the original BM:</p>
\[\begin{align}
p(h_k = 1\vert \mathbf{x}, \mathbf{h}_{-k}) = \sigma( W_{[:,k]}\cdot\mathbf{x} + U_{-k}\cdot\mathbf{h}_{-k} ) \\
p(x_j = 1\vert \mathbf{h}, \mathbf{x}_{-j}) = \sigma( W_{[j,:]}\cdot\mathbf{h} + V_{-j}\cdot\mathbf{x}_{-j} )
\end{align}\]
<p>Hopefully the notations are clear; if not, try comparing them with the ones we used before. I recommend the reader try proving these as an exercise. The diagram in Fig.6(b) hopefully adds a bit more clarity: it shows a computation graph for the conditionals, similar to the one we saw in Fig.5.</p>
<p>Coming to the gradients, they also take forms similar to the original BM; the only difference is that now we have more parameters</p>
\[\begin{align}
W &:= W - \lambda \cdot \left( \mathbb{E}_{\mathbf{x,h}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{h}^T ] - \mathbb{E}_{\mathbf{x,h}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{x}\mathbf{h}^T ] \right)\\
V &:= V - \lambda \cdot \left( \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{x}^T ] - \mathbb{E}_{\mathbf{x}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{x}\mathbf{x}^T ] \right)\\
U &:= U - \lambda \cdot \left( \mathbb{E}_{\mathbf{h}\sim p_{\mathcal{D}}}[ -\mathbf{h}\mathbf{h}^T ] - \mathbb{E}_{\mathbf{h}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{h}\mathbf{h}^T ] \right)
\end{align}\]
<p>If you are paying attention, you might notice something strange: how do we compute the expectations \(\mathbb{E}_{\mathbf{h}\sim p_{\mathcal{D}}}\) in the positive phase? We don’t have hidden vectors in our dataset, right? Actually, we do have visible vectors \(\mathbf{x}^{(i)}\) in the dataset, and we can get an approximate <em>complete data</em> (visible plus hidden) density as</p>
\[p_{\mathcal{D}}(\mathbf{x}^{(i)}, \mathbf{h}) = p_{\mathcal{D}}(\mathbf{x}^{(i)}) \cdot p_{\Theta}(\mathbf{h}\vert \mathbf{x}^{(i)})\]
<p>Basically, we sample a visible vector from our dataset and use the conditional to sample a hidden vector. We fix the visible vector and then sample the hidden vector one component at a time (using Gibbs sampling).</p>
<p>For jointly sampling a visible and a hidden vector from the model (for the negative phase), we also use Gibbs sampling just as before: we sample all the visible and hidden RVs component by component, starting the iteration from random values. <strong>There is a clever hack though.</strong> We can start the Gibbs iteration by fixing the visible vector to real data from our dataset (and not something random). It turns out this is extremely useful and efficient for quickly getting samples from the model distribution. This algorithm is famously known as “<a href="https://www.robots.ox.ac.uk/~ojw/files/NotesOnCD.pdf">Contrastive Divergence</a>” and has long been used in practical implementations.</p>
<h2 id="restricted-boltzmann-machine-rbm">“Restricted” Boltzmann Machine (RBM)</h2>
<p>Here comes the all-important RBM, which is probably one of the most famous energy based models of all time. But, guess what, I am not going to describe it bit by bit. We have already covered enough that we can quickly build on top of it.</p>
<p>The RBM is basically the same as the Boltzmann Machine with hidden units, but with <em>one big difference</em>: it has no visible-visible and no hidden-hidden interactions, i.e.</p>
\[U = \mathbf{0}, V = \mathbf{0}\]
<p>If you do just that, boom! You get the RBM (see its graphical diagram in Fig.7(a)). It makes the formulation much simpler. I am leaving the majority of the math entirely to the reader: just remove \(U\) and \(V\) from all the formulations in the last section and you are done. Fig.7(b) shows the computational view of the RBM while computing conditionals.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/17/rbm_diag_and_cond.png" />
<figcaption>Fig.7: (a) Graphical diagram of RBM. (b) Arrows just show computation deps</figcaption>
</figure>
</center>
<p>Let me point out one nice consequence of this model: the conditional for each visible node is independent of the other visible nodes, and the same holds for the hidden nodes.</p>
\[\begin{align}
p(h_k = 1\vert \mathbf{x}) = \sigma( W_{[:,k]}\cdot\mathbf{x} )\\
p(x_j = 1\vert \mathbf{h}) = \sigma( W_{[j,:]}\cdot\mathbf{h} )
\end{align}\]
<p>That means they can be computed in parallel:</p>
\[\begin{align}
p(\mathbf{h}\vert \mathbf{x}) = \prod_{k=1}^K p(h_k\vert \mathbf{x}) = \sigma( W^T\cdot\mathbf{x} )\\
p(\mathbf{x}\vert \mathbf{h}) = \prod_{j=1}^D p(x_j\vert \mathbf{h}) = \sigma( W\cdot\mathbf{h} )
\end{align}\]
<p>Moreover, the Gibbs sampling steps become super easy to compute. We just have to iterate the following steps:</p>
<ol>
<li>Sample a hidden vector \(\mathbf{h}^{(t)}\sim p(\mathbf{h}\vert \mathbf{x}^{(t-1)})\)</li>
<li>Sample a visible vector \(\mathbf{x}^{(t)}\sim p(\mathbf{x}\vert \mathbf{h}^{(t)})\)</li>
</ol>
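<p>Putting it together, the two alternating steps above give a toy sketch of RBM training with block Gibbs and the Contrastive Divergence hack from the previous section. Everything here is hypothetical for illustration: made-up sizes, no bias terms (matching the simplified energy above), and a single repeated training pattern instead of a real dataset:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, K, lr = 6, 3, 0.1
W = 0.01 * rng.normal(size=(D, K))       # visible-hidden weights only (U = V = 0)

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

def cd1_update(W, x):
    """One CD-1 step: start the chain at the data, do one block-Gibbs round trip."""
    ph = sigmoid(x @ W)                  # p(h | x), all hidden units in parallel
    h = sample(ph)
    px = sigmoid(h @ W.T)                # p(x | h), all visible units in parallel
    x_neg = sample(px)
    ph_neg = sigmoid(x_neg @ W)
    # positive phase minus negative phase (Eq.5, with dE/dW = -x h^T)
    return W + lr * (np.outer(x, ph) - np.outer(x_neg, ph_neg))

# Toy "dataset": one repeated binary pattern the RBM should assign low energy to.
pattern = np.array([1., 1., 0., 0., 1., 0.])
for _ in range(200):
    W = cd1_update(W, pattern)
print(W.shape)
```

<p>A quick sanity check on the trained weights: the (bias-free) free energy \(F(\mathbf{x}) = -\sum_k \log(1 + e^{\mathbf{x}^T W_{[:,k]}})\) of the training pattern should end up lower than that of its complement.</p>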
<p>This makes RBM an attractive choice for practical implementation.</p>
<hr />
<p>Whoahh! That was a heck of an article. I encourage everyone to try working out the RBM math more rigorously by themselves, and also to implement it in a familiar framework. Alright, that’s all for this article.</p>
<h4 id="references">References</h4>
<ol>
<li><a href="https://www.cs.toronto.edu/~hinton/csc321/readings/boltz321.pdf">Boltzmann Machine, by G. Hinton, 2007</a></li>
<li><a href="https://www.crim.ca/perso/patrick.kenny/BMNotes.pdf">Notes on Boltzmann Machine, by Patrick Kenny</a></li>
<li><a href="http://deeplearning.net/tutorial/rbm.html">deeplearning.net documentation</a></li>
<li><a href="https://www.youtube.com/watch?v=2fRnHVVLf1Y&list=PLiPvV5TNogxKKwvKb1RKwkq2hm7ZvpHz0">Hinton’s coursera course</a></li>
<li><a href="https://www.deeplearningbook.org/">Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville</a></li>
</ol>Ayan DasPixelor: A Competitive Sketching AI Agent. So you think you can beat me?2020-07-30T00:00:00+00:002020-07-30T00:00:00+00:00https://dasayan05.github.io/pubs/2020/07/30/pub-8<p>We present the first competitive drawing agent Pixelor that exhibits human-level performance at a Pictionary-like sketching game, where the participant whose sketch is recognized first is the winner. Our AI agent can autonomously sketch a given visual concept, and achieve a recognizable rendition as quickly as or faster than a human competitor. The key to victory for the agent is to learn the optimal stroke sequencing strategies that generate the most recognizable and distinguishable strokes first. Training Pixelor is done in two steps. First, we infer the optimal stroke order that maximizes early recognizability of human training sketches. Second, this order is used to supervise the training of a sequence-to-sequence stroke generator.
Our key technical contributions are a tractable search of the exponential space of orderings using neural sorting; and an improved Seq2Seq Wasserstein (S2S-WAE) generator that uses an optimal-transport loss to accommodate the multi-modal nature of the optimal stroke distribution. Our analysis shows that Pixelor is better than the human players of the Quick, Draw! game, under both AI and human judging of early recognition. To analyze the impact of human competitors’ strategies, we conducted a further human study with participants being given unlimited thinking time and training in early recognizability by feedback from an AI judge. The study shows that humans do gradually improve their strategies with training, but overall Pixelor still matches human performance. We will release the code and the dataset, optimized for the task of early recognition, upon acceptance.</p>
<center>
<a target="_blank" class="pubicon" href="https://drive.google.com/file/d/1Y91VLgrAveMpEu99RTk1g60bw51XvcUn/view?usp=sharing"><i class="fa fa-file fa-3x"></i>Paper</a>
<a target="_blank" class="pubicon" href="https://drive.google.com/file/d/1JnFPBM_AYx15dWiwTKaIb9MMy3nr_gwE/view?usp=sharing"><i class="fa fa-file fa-3x"></i>Suppl.</a>
<a target="_blank" class="pubicon" href="https://github.com/dasayan05/neuralsort-siggraph"><i class="fa fa-file fa-3x"></i>Code</a>
</center>
<p><br /></p>
<center>
<h2>Explanation Video (Suppl. material)</h2>
<iframe width="800" height="450" src="https://www.youtube-nocookie.com/embed/E_Aclms4g-w" frameborder="1" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</center>
<p><br /></p>
<h2 id="want-to-cite-this-paper-">Want to cite this paper ?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Not available as of now. Will be updated later.
</code></pre></div></div>Ayan Dasrlx: A modular Deep RL library for research2020-06-27T00:00:00+00:002020-06-27T00:00:00+00:00https://dasayan05.github.io/projs/2020/06/27/rlx-deep-rl-library<p><code class="language-plaintext highlighter-rouge">rlx</code> is a Deep RL library written on top of PyTorch & built for educational and research purposes.
The majority of Deep RL libraries/codebases are geared towards reproducing state-of-the-art algorithms on very specific tasks (e.g., Atari games), but <code class="language-plaintext highlighter-rouge">rlx</code> is NOT. It aims to be more expressive and modular. Rather than treating RL algorithms as black boxes, <code class="language-plaintext highlighter-rouge">rlx</code> adopts an API that exposes more granular operations to the user, which makes writing new algorithms easier. It is also useful for implementing task-specific engineering in a known algorithm (as we know, RL is very sensitive to small implementation details).</p>
<p>If this page doesn’t redirect automatically, <a href="https://github.com/dasayan05/rlx">click here</a></p>Ayan Dasrlx is a Deep RL library written on top of PyTorch & built for educational and research purpose. Majority of the libraries/codebases for Deep RL are geared more towards reproduction of state-of-the-art algorithms on very specific tasks (e.g. Atari games etc.), but rlx is NOT. It is supposed to be more expressive and modular. Rather than making RL algorithms as black-boxes, rlx adopts an API that tries to expose more granular operation to the users which makes writing new algorithms easier. It is also useful for implementing task specific engineering into a known algorithm (as we know RL is very sensitive to small implementation engineerings).BézierSketch: A generative model for scalable vector sketches2020-05-22T00:00:00+00:002020-05-22T00:00:00+00:00https://dasayan05.github.io/pubs/2020/05/22/pub-7<p>The study of neural generative models of human sketches is a fascinating contemporary modeling problem due to the links between sketch image generation and the human drawing process. The landmark SketchRNN provided breakthrough by sequentially generating sketches as a sequence of waypoints. However this leads to low-resolution image generation, and failure to model long sketches. In this paper we present BézierSketch, a novel generative model for fully vector sketches that are automatically scalable and high-resolution. To this end, we first introduce a novel inverse graphics approach to stroke embedding that trains an encoder to embed each stroke to its best fit Bézier curve. This enables us to treat sketches as short sequences of paramaterized strokes and thus train a recurrent sketch generator with greater capacity for longer sketches, while producing scalable high-resolution results. We report qualitative and quantitative results on the Quick, Draw! benchmark.</p>
<center>
<a target="_blank" class="pubicon" href="https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123710630.pdf"><i class="fa fa-file fa-3x"></i>Paper</a>
<a target="_blank" class="pubicon" href="https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123710630-supp.pdf"><i class="fa fa-file fa-3x"></i>Suppl.</a>
<a target="_blank" class="pubicon" href="https://github.com/dasayan05/stroke-ae"><i class="fa fa-file fa-3x"></i>Code</a>
</center>
<p><br /></p>
<center>
<h2>Poster presentation video (at ECCV 2020)</h2>
<iframe width="800" height="450" src="https://www.youtube-nocookie.com/embed/g2zzaLr2VfQ" frameborder="1" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</center>
<p><br /></p>
<h2 id="want-to-cite-this-paper-">Want to cite this paper ?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@InProceedings{das2020bziersketch,
title = {BézierSketch: A generative model for scalable vector sketches},
author = {Ayan Das and Yongxin Yang and Timothy Hospedales and Tao Xiang and Yi-Zhe Song},
booktitle = {The European Conference on Computer Vision (ECCV)},
year = {2020}
}
</code></pre></div></div>Ayan DasThe study of neural generative models of human sketches is a fascinating contemporary modeling problem due to the links between sketch image generation and the human drawing process. The landmark SketchRNN provided breakthrough by sequentially generating sketches as a sequence of waypoints. However this leads to low-resolution image generation, and failure to model long sketches. In this paper we present BézierSketch, a novel generative model for fully vector sketches that are automatically scalable and high-resolution. To this end, we first introduce a novel inverse graphics approach to stroke embedding that trains an encoder to embed each stroke to its best fit Bézier curve. This enables us to treat sketches as short sequences of paramaterized strokes and thus train a recurrent sketch generator with greater capacity for longer sketches, while producing scalable high-resolution results. We report qualitative and quantitative results on the Quick, Draw! benchmark.Introduction to Probabilistic Programming2020-05-05T00:00:00+00:002020-05-05T00:00:00+00:00https://dasayan05.github.io/blog-tut/2020/05/05/probabilistic-programming<p>Welcome to another tutorial about probabilistic models, after <a href="https://dasayan05.github.io/blog-tut/2019/11/20/inference-in-pgm.html">a primer on PGMs</a> and <a href="https://dasayan05.github.io/blog-tut/2020/01/01/variational-autoencoder.html">VAE</a>. However, I am particularly excited to discuss a topic that doesn’t get as much attention as traditional Deep Learning does. The idea of <strong>Probabilistic Programming</strong> has long been there in the ML literature and got enriched over time. 
Before it creates confusion, let’s declutter it right now - it’s not really about writing traditional “programs”; rather, it’s about building <a href="https://dasayan05.github.io/blog-tut/2019/11/20/inference-in-pgm.html">Probabilistic Graphical Models</a> (PGMs), but <em>equipped with imperative programming style</em> (i.e., iteration, branching, recursion, etc.). Just like Automatic Differentiation allowed us to compute derivatives of arbitrary computation graphs (in PyTorch, TensorFlow), black-box methods have been developed to “solve” probabilistic programs. In this post, I will provide a generic view of why such a language is indeed possible and how such black-box solvers are materialized. At the end, I will also introduce you to one such <em>Universal</em> Probabilistic Programming Language, <a href="http://pyro.ai/">Pyro</a>, which came out of <a href="https://www.uber.com/us/en/uberai/">Uber’s AI lab</a> and has been gaining popularity.</p>
<h1 id="overview">Overview</h1>
<p>Before I dive into details, let’s get the bigger picture clear. It is highly advisable to read any good reference about PGMs before you proceed - my <a href="https://dasayan05.github.io/blog-tut/2019/11/20/inference-in-pgm.html">previous article</a> for example.</p>
<h3 id="generative-view--execution-trace">Generative view & Execution trace</h3>
<p>Probabilistic Programming is NOT really what we usually think of as <em>programming</em> - i.e., completely deterministic execution of hard-coded instructions which does exactly what it’s told and nothing more.
Rather it is about building PGMs (must read <a href="https://dasayan05.github.io/blog-tut/2019/11/20/inference-in-pgm.html">this</a>) which models our belief about the data generation process. We, as users of such language, would express a model in an imperative form which would encode all our uncertainties in the way we want. Here is a (Toy) example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">model</span><span class="p">(</span><span class="n">theta</span><span class="p">):</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">Bernoulli</span><span class="p">([</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">];</span> <span class="n">theta</span><span class="p">)</span>
<span class="n">P</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">A</span>
<span class="k">if</span> <span class="n">A</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">Uniform</span><span class="p">(</span><span class="n">P</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">Uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">P</span><span class="p">)</span>
<span class="n">C</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">A</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span>
</code></pre></div></div>
<p>If you assume this to be a valid program (for now), this is what we are talking about here - all our traditional “variables” become “random variables” (RVs) and carry uncertainty in the form of probability distributions. Just to give you a taste of its flexibility, here are the constituent elements we encountered:</p>
<ol>
<li>Various distributions are available (e.g., Normal, Bernoulli, Uniform, etc.)</li>
<li>We can do deterministic computation (i.e., \(P = 2 * A\))</li>
<li>Condition RVs on another RVs (i.e., \(C\vert B \sim \mathcal{N}(B, 1)\))</li>
<li>Imperative style branching allows dynamic structure of the model …</li>
</ol>
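<p>Since the snippet above is only pseudocode, here is a minimal runnable sketch of the same toy model in plain Python/NumPy. It assumes <code class="language-plaintext highlighter-rouge">Bernoulli([-1, 1]; theta)</code> denotes a coin over \(\{-1, +1\}\) with success probability \(\theta\) (my reading of the pseudocode, not a definitive implementation):</p>

```python
import numpy as np

def model(theta, rng=np.random.default_rng()):
    # A ~ Bernoulli over {-1, +1} with P(A = +1) = theta
    A = 1 if rng.random() < theta else -1
    P = 2 * A                      # deterministic computation
    if A == -1:
        B = rng.uniform(P, 0)      # B ~ Uniform(-2, 0)
    else:
        B = rng.uniform(0, P)      # B ~ Uniform(0, 2)
    C = rng.normal(B, 1)           # C | B ~ Normal(B, 1)
    return A, P, B, C
```

<p>Calling <code class="language-plaintext highlighter-rouge">model(0.5)</code> repeatedly produces execution traces like the ones shown further down.</p>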
<p>Below is a graphical representation of the model defined by the above program.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/16/example_exectrace.png" />
</figure>
</center>
<p>Just like the invocation of a traditional compiler on a traditional program produces the desired output, this (probabilistic) program can be executed by means of “ancestral sampling”. I ran the program 5 times and each time I got samples from all my RVs. Each such “forward” run is often called an <em>execution trace</em> of the model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="p">(</span><span class="mf">0.5</span><span class="p">))</span>
<span class="p">(</span><span class="mf">1.000</span><span class="p">,</span> <span class="mf">2.000</span><span class="p">,</span> <span class="mf">0.318</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.069</span><span class="p">)</span>
<span class="p">(</span><span class="o">-</span><span class="mf">1.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.156</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.822</span><span class="p">)</span>
<span class="p">(</span><span class="mf">1.000</span><span class="p">,</span> <span class="mf">2.000</span><span class="p">,</span> <span class="mf">0.594</span><span class="p">,</span> <span class="mf">0.865</span><span class="p">)</span>
<span class="p">(</span><span class="mf">1.000</span><span class="p">,</span> <span class="mf">2.000</span><span class="p">,</span> <span class="mf">1.100</span><span class="p">,</span> <span class="mf">1.079</span><span class="p">)</span>
<span class="p">(</span><span class="o">-</span><span class="mf">1.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.262</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.403</span><span class="p">)</span>
</code></pre></div></div>
<p>This is the so-called “generative view” of a model. We typically use the leaf-nodes of a PGM as our observed data, and the rest of the graph can be the “latent factors” of the model, which we either know or want to estimate. In general, a practical PGM can often be encapsulated as a set of latent nodes \(\mathbf{Z} \triangleq \{ Z_1, Z_2, \cdots, Z_H \}\) and visible nodes \(\mathbf{X} \triangleq \{ X_1, X_2, \cdots, X_V \}\) related probabilistically as
<br />
\[
\mathbf{Z} \rightarrow \mathbf{X}
\]</p>
<h3 id="training-and-inference">Training and Inference</h3>
<p>From now on, we’ll use the general notation rather than the specific example. The model may be parametric; for example, we had the Bernoulli success probability \(\theta\) in our toy example. The full joint probability is given as</p>
<p>\[
\mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) = \mathbb{P}_{\theta}(\mathbf{Z}) \cdot \mathbb{P}_{\theta}(\mathbf{X}\vert \mathbf{Z})
\]</p>
<p>We would like to do two things:</p>
<ol>
<li>Estimate model parameters \(\theta\) from data</li>
<li>Compute the posterior, i.e., infer latent variables given data</li>
</ol>
<p>As discussed in my <a href="https://dasayan05.github.io/blog-tut/2019/11/20/inference-in-pgm.html">PGM article</a>, both of them are infeasible due to the fact that</p>
<ol>
<li>Log-likelihood maximization is not possible because of the presence of latent variables</li>
<li>For continuous distributions on latent variables, the posterior is intractable</li>
</ol>
<p>The way forward is to take the help of <em>Variational Inference</em> and maximize our very familiar <strong>E</strong>vidence <strong>L</strong>ower <strong>BO</strong>und (ELBO) objective to estimate the model parameters, along with a set of variational parameters which help build a proxy for the original posterior \(\mathbb{P}_{\theta}(\mathbf{Z}\vert \mathbf{X})\). Mathematically, we choose a known and tractable family of distributions \(\mathbb{Q}_{\phi}(\mathbf{Z})\) (parameterized by variational parameters \(\phi\)) to approximate the posterior. The learning process is facilitated by maximizing the following</p>
<p>\[
\mathrm{ELBO}(\theta, \phi) \triangleq \mathbb{E}_{\mathbb{Q}_{\phi}} \bigl[\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z}) \bigr]
\]</p>
<p>by estimating gradients w.r.t all its parameters</p>
<p>\[\tag{1}
\nabla_{[\theta, \phi]} \mathrm{ELBO}(\theta, \phi)
\]</p>
<h1 id="black-box-variational-inference">Black-Box Variational Inference</h1>
<p><br />
If you have gone through my <a href="https://dasayan05.github.io/blog-tut/2019/11/20/inference-in-pgm.html">PGM article</a>, you might think you’ve seen these before. Actually, you’re right! There is really nothing new here. What we really need for establishing a Probabilistic Programming framework is <strong>a unified way to implement the ELBO optimization for ANY given problem</strong>. And by “problem” I mean the following:</p>
<ol>
<li>A model specification \(\mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X})\) written in a probabilistic language (like we saw before)</li>
<li>An optional (parameterized) “Variational Model” \(\mathbb{Q}_{\phi}(\mathbf{Z})\), famously known as a “Guide”</li>
<li>And .. the observed data \(\mathcal{D}\), of course</li>
</ol>
<!-- Very importantly, we CAN NOT make any *assumptions* about the inner structure of either the "model" or the "guide". This motivated the research on a "Black-box" method for solving such probabilistic programs. Please realize that this is exactly how "traditional compilers" (like C, Python) are built - they make no assumption about the symantic meaning/structure of your program .. they just check for syntactic validity. -->
<p>But how do we compute (1)? The apparent problem is that the gradient w.r.t. \(\phi\) is required, but \(\phi\) appears in the expectation itself. To mitigate this, we make use of the famous “log-derivative” trick (it has many other names, e.g., REINFORCE). For notational simplicity, let’s denote \(f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \triangleq \log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z})\) and continue from (1)</p>
\[\sum_{\mathbf{Z}} \nabla_{[\theta, \phi]} \bigg[ \mathbb{Q}_{\phi}(\mathbf{Z}) \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]
=\sum_{\mathbf{Z}} \bigg[ \nabla_{\phi} \mathbb{Q}_{\phi}(\mathbf{Z}) \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi)
+\mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \nabla_{[\theta, \phi]}f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]\]
\[=\sum_{\mathbf{Z}} \bigg[ \color{red}{\mathbb{Q}_{\phi}(\mathbf{Z})} \cdot \frac{\nabla_{\phi} \mathbb{Q}_{\phi}(\mathbf{Z})}{\color{red}{\mathbb{Q}_{\phi}(\mathbf{Z})}} \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi)
+\mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \nabla_{[\theta, \phi]}f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]\]
\[=\sum_{\mathbf{Z}} \mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \bigg[ \color{red}{\nabla_{\phi} \log\mathbb{Q}_{\phi}(\mathbf{Z})} \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi)
+\nabla_{[\theta, \phi]}f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]\]
\[\tag{2}
= \mathbb{E}_{\mathbb{Q}_{\phi}} \bigg[ \nabla_{[\theta, \phi]} \bigg( \underbrace{\log\mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \overline{f(\mathbf{Z}, \mathbf{X}; \theta, \phi)}
+f(\mathbf{Z}, \mathbf{X}; \theta, \phi)}_\text{Surrogate Objective} \bigg) \bigg]\]
<p>Eq. (2) shows that the trick helped \(\nabla_{[\theta, \phi]}\) penetrate the \(\mathbb{E}[\cdot]\), but in the process, it replaced the original \(f\) with a “<a href="https://arxiv.org/abs/1506.05254">surrogate</a> function” \(f_{surr} \triangleq \overline{f}\cdot\log\mathbb{Q}+f\), where the <em>bar</em> protects a quantity from differentiation. Equation (2) is all we need - it provides an insight on how to make the gradient estimation practical. In fact, it can be proven theoretically that this gradient is an unbiased estimate of the true gradient in Equation (1).</p>
<p>Succinctly, we run the Guide \(L\) times to record a set of \(L\) execution-traces (i.e., samples \(\mathbf{\widehat{Z}}\sim\mathbb{Q}_{\phi}\)) and compute the following Monte-Carlo approximation to Equation (2)</p>
<p>\[\tag{3}
\nabla_{[\theta, \phi]} \mathrm{ELBO}(\theta, \phi) \approx \frac{1}{L} \sum_{\mathbf{\widehat{Z}}\sim\mathbb{Q}_{\phi}} \left[ \nabla_{[\theta, \phi]} f_{surr}(\mathbf{\widehat{Z}}, \mathcal{D}) \right]_{\theta=\theta_{old}, \phi=\phi_{old}}
\]</p>
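<p>To make the estimator concrete, here is a tiny self-contained sanity check (a hypothetical toy example, not from the original post): we estimate \(\nabla_\phi\, \mathbb{E}_{z \sim \mathrm{Bernoulli}(\phi)}[f(z)]\) with the score-function (log-derivative) estimator and compare against the analytic answer. With \(f(z) = z^2\), the expectation equals \(\phi\), so the true gradient is \(1\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.3
f = lambda z: z ** 2                            # E[f(z)] = phi, so d/dphi E[f(z)] = 1

z = (rng.random(200_000) < phi).astype(float)   # samples z ~ Bernoulli(phi)
# score function: d/dphi log Q_phi(z) = z/phi - (1 - z)/(1 - phi)
score = z / phi - (1.0 - z) / (1.0 - phi)
grad_est = np.mean(f(z) * score)                # Monte-Carlo estimate, close to 1.0
```

<p>The estimate is unbiased but, as the post notes, its variance can be large - hence the variance-reduction tricks discussed next.</p>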
<p>The nice thing about Equation (2) (or equivalently Equation (3)) is that the differentiation operator sits right on top of a deterministic function (i.e., \(f_{surr}\)). It means we can construct \(f_{surr}\) as a computation graph and take advantage of modern automatic differentiation engines. Here’s how the computation graph and the graphical model are linked</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/16/gm_cg.png" />
</figure>
</center>
<p>Last but not least, let’s look at the function \(f_{surr}\), which is basically built on the log-density terms \(\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X})\) and \(\log \mathbb{Q}_{\phi}(\mathbf{Z})\). We need a way to compute them flexibly. Remember that the model and guide are written in a <em>language</em>, and hence we have access to their graph structure. A clever software implementation can harness this structure to compute the log-densities (and eventually \(f_{surr}\)).</p>
<p>I claimed before that the gradient estimates are unbiased. However, such a generic way of computing the gradient introduces high variance in the estimate and makes things unstable for complex models. There are a few widely used tricks to get around this. But please note that such tricks always exploit model-specific structure. Three such tricks are presented below.</p>
<h3 id="i-re-parameterization">I. Re-parameterization</h3>
<p>We might get lucky that \(\mathbb{Q}_{\phi}(\mathbf{Z})\) is <a href="https://arxiv.org/abs/1312.6114">re-parameterizable</a>. What that means is that the expectation w.r.t. \(\mathbb{Q}_{\phi}(\mathbf{Z})\) can be made free of its parameters, and by doing so the gradient operator can be pushed inside without going through the log-derivative trick.
So, let’s step back a bit and consider the original ELBO gradient in (1). Assuming re-parameterizability, the following can be done
\[
\nabla_{[\theta, \phi]} \mathbb{E}_{\mathbb{Q}_{\phi}} \bigg[\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z}) \bigg] = \nabla_{[\theta, \phi]} \mathbb{E}_{Q(\mathbf{\epsilon})} \bigg[\log \mathbb{P}_{\theta}(G_{\phi}(\epsilon), \mathbf{X}) - \log \mathbb{Q}_{\phi}(G_{\phi}(\epsilon)) \bigg]
\]
\[
= \mathbb{E}_{Q(\mathbf{\epsilon})} \bigg[\nabla_{[\theta, \phi]} \bigg( \log \mathbb{P}_{\theta}(G_{\phi}(\mathbf{\epsilon}), \mathbf{X}) - \log \mathbb{Q}_{\phi}(G_{\phi}(\epsilon)) \bigg) \bigg]
\]</p>
<p>where \(Q(\mathbf{\epsilon})\) is an independent source of randomness and \(G_{\phi}\) is a deterministic transform. Computing this expectation with an empirical average (just like Eq. 2) gives us a better (variance-reduced) estimate of the true gradient of the ELBO.</p>
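<p>As a quick illustration (again a hypothetical toy, not from the post): for \(\mathbb{Q}_{\phi} = \mathcal{N}(\mu, 1)\) we can write \(z = G_{\mu}(\epsilon) = \mu + \epsilon\) with \(\epsilon \sim \mathcal{N}(0, 1)\). Estimating \(\nabla_{\mu}\, \mathbb{E}[z^2]\) this way should recover the analytic value \(2\mu\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.5
eps = rng.standard_normal(200_000)   # independent noise, Q(eps) = N(0, 1)
z = mu + eps                         # reparameterized sample: z ~ N(mu, 1)
# d/dmu E[z^2] = E[d/dmu (mu + eps)^2] = E[2 z]; analytic answer: 2 * mu = 3.0
grad_est = np.mean(2 * z)
```

<p>Compared with the score-function estimator, the same number of samples gives a noticeably tighter estimate here, which is why re-parameterization is preferred whenever it is available.</p>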
<h3 id="ii-rao-blackwellization">II. Rao-Blackwellization</h3>
<p>This is another well-known variance reduction technique. It is mathematically a bit involved, so I will explain it simply without making it confusing. It requires the full variational distribution to have some kind of factorization. A specific case is when we make the mean-field assumption, i.e.,</p>
<p>\[
\mathbb{Q}_{\phi}(\mathbf{Z}) = \prod_i Q_{\phi_i}(Z_i)
\]</p>
<p>With a little effort, we can pull out the gradient estimator for each of these \(\phi_i\) parameters from (2). They look something like this</p>
\[\nabla_{\phi_i} \mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{\mathbb{Q}_{\phi}} \bigg[ \nabla_{\phi_i} \log\mathbb{Q}_{\phi_i}(Z_i) \cdot \bigg( \overline{\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z})} \bigg)
+\cdots \bigg]\]
<p>The quantity under the bar still contains all the factors because it is immune to the gradient operator; and since the expectation sits outside the gradient operator, it too involves all the factors. At this point, Rao-Blackwellization offers a variance-reduced estimate of the above gradient, i.e.,</p>
\[\nabla_{\phi_i} \mathrm{ELBO}(\theta, \phi) \approx \mathbb{E}_{\mathbb{Q}_{\phi}^{(i)}} \bigg[ \nabla_{\phi_i} \log\mathbb{Q}_{\phi_i}(Z_i) \cdot \bigg( \overline{\log \mathbb{P}^{(i)}_{\theta}(\mathbf{Z}^{(i)}, \mathbf{X}) - \log \mathbb{Q}_{\phi_i}(Z_i)} \bigg)
+\cdots \bigg]\]
<p>where \(\mathbf{Z}^{(i)}\) is the set of variables that forms the “markov blanket” of \(Z_i\) w.r.t to the structure of guide, \(\mathbb{Q}_{\phi}^{(i)}\) is the part of the variational distribution that depends on \(\mathbf{Z}^{(i)}\) and \(\mathbb{P}^{(i)}_{\theta}(\mathbf{Z}^{(i)}, \cdot)\) is the factors of the model that involves \(\mathbf{Z}^{(i)}\).</p>
<h3 id="iii-explicit-enumeration-for-discrete-rvs">III. Explicit enumeration for Discrete RVs</h3>
<p>While exploiting the graph structure of the guide to simplify (1), we might end up with a term like this due to factorization of the guide density</p>
<p>\[
\mathbb{E}_{Z_i\sim\mathbb{Q}_{\phi_i}(Z_i)} \bigl[ f(\cdot) \bigr]
\]</p>
<p>If it happens that the variable \(Z_i\) is discrete with a reasonably small state space (e.g., a \(d=5\) dimensional binary RV has \(2^5 = 32\) states), we can replace the sampling-based empirical expectation with the true expectation, i.e., a sum over its entire state space</p>
<p>\[
\sum_{Z_i} \mathbb{Q}_{\phi_i}(Z_i)\cdot f(\cdot)
\]</p>
<p>So make sure the state-space is reasonable in size. This helps reduce the variance quite a bit.</p>
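<p>For instance (a hypothetical sketch, not from the post), the exact expectation of some \(f\) over a \(d = 5\) dimensional binary RV with independent \(\mathrm{Bernoulli}(0.5)\) coordinates can be computed by enumerating all \(2^5 = 32\) states:</p>

```python
import itertools

p = 0.5                                         # per-coordinate success probability
f = lambda z: float(sum(z))                     # example f: number of ones in z
exact = 0.0
for z in itertools.product([0, 1], repeat=5):   # all 2^5 = 32 states
    q = p ** sum(z) * (1 - p) ** (5 - sum(z))   # Q(z) under independence
    exact += q * f(z)
# exact expectation E[#ones] = 5 * 0.5 = 2.5, with zero sampling variance
```

<p>Unlike a Monte-Carlo estimate, this sum is exact, so it contributes no variance at all - at the cost of work exponential in \(d\).</p>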
<p>Whew! That’s a lot of maths. But the good thing is, you hardly ever have to think about it in detail, because software engineers have put tremendous effort into making these algorithms easily accessible via libraries. We are now going to take a brief look at one of them.</p>
<h1 id="pyro-universal-probabilistic-programming"><code class="language-plaintext highlighter-rouge">Pyro</code>: Universal Probabilistic Programming</h1>
<p><a href="http://pyro.ai/">Pyro</a> is a probabilistic programming framework that allows users to write flexible models in terms of a simple API. Pyro is written in Python and uses the popular PyTorch library for its internal representation of the computation graph and as its automatic differentiation engine. Pyro is quite expressive, since it allows the model/guide to have fully imperative control flow. Its core API consists of these functionalities:</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">pyro.param()</code> for defining learnable parameters.</li>
<li><code class="language-plaintext highlighter-rouge">pyro.distributions</code> (imported as <code class="language-plaintext highlighter-rouge">dist</code> below) contains a large collection of probability distributions.</li>
<li><code class="language-plaintext highlighter-rouge">pyro.sample()</code> for sampling from a given distribution.</li>
</ol>
<p>Let’s take a concrete example and work it out.</p>
<h4 id="problem-mixture-of-gaussian">Problem: Mixture of Gaussian</h4>
<p>MoG (Mixture of Gaussians) is a relatively simple but widely studied probabilistic model. It has an important application in soft clustering. For the sake of simplicity, we assume only two mixture components. The generative view of the model is basically this: we flip a (latent) coin with bias \(\rho\) and, depending on the outcome \(C\in \{ 0, 1 \}\), we sample data from one of the two Gaussians \(\mathcal{N}(\mu_0, \sigma_0)\) and \(\mathcal{N}(\mu_1, \sigma_1)\)</p>
\[C_i \sim Bernoulli(\rho) \\
X_i \sim \mathcal{N}(\mu_{C_i}, \sigma_{C_i})\]
<p>where \(i = 1 \cdots N\) is the data index and \(\theta \triangleq \{ \rho, \mu_0, \sigma_0, \mu_1, \sigma_1 \}\) is the set of model parameters we need to learn. This is how you write that in Pyro:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">model</span><span class="p">(</span><span class="n">data</span><span class="p">):</span> <span class="c1"># Take the observation
</span> <span class="c1"># Define coin bias as parameter. That's what 'pyro.param' does
</span> <span class="n">rho</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"rho"</span><span class="p">,</span> <span class="c1"># Give it a name for Pyro to track properly
</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">0.5</span><span class="p">]),</span> <span class="c1"># Initial value
</span> <span class="n">constraint</span><span class="o">=</span><span class="n">dist</span><span class="p">.</span><span class="n">constraints</span><span class="p">.</span><span class="n">unit_interval</span><span class="p">)</span> <span class="c1"># Has to be in [0, 1]
</span> <span class="c1"># Define both means and std with random initial values
</span> <span class="n">means</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"M"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.5</span><span class="p">,</span> <span class="mf">3.</span><span class="p">]))</span>
<span class="n">stds</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"S"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">]),</span>
<span class="n">constraint</span><span class="o">=</span><span class="n">dist</span><span class="p">.</span><span class="n">constraints</span><span class="p">.</span><span class="n">positive</span><span class="p">)</span> <span class="c1"># std deviation cannot be negative
</span>
<span class="k">with</span> <span class="n">pyro</span><span class="p">.</span><span class="n">plate</span><span class="p">(</span><span class="s">"data"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span> <span class="c1"># Mark conditional independence
</span> <span class="c1"># construct a Bernoulli and sample from it.
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">dist</span><span class="p">.</span><span class="n">Bernoulli</span><span class="p">(</span><span class="n">rho</span><span class="p">))</span> <span class="c1"># c \in {0, 1}
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span><span class="p">.</span><span class="nb">type</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">dist</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="n">means</span><span class="p">[</span><span class="n">c</span><span class="p">],</span> <span class="n">stds</span><span class="p">[</span><span class="n">c</span><span class="p">])</span> <span class="c1"># pick a mean as per 'c'
</span> <span class="n">pyro</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="s">"x"</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">obs</span><span class="o">=</span><span class="n">data</span><span class="p">)</span> <span class="c1"># sample data (also mark it as observed)
</span></code></pre></div></div>
<p>Due to the discrete and low-dimensional nature of the latent variable \(C\), computing the posterior for this problem is in general tractable. But let’s assume it is not. The true posterior \(\mathbb{P}(C_i\vert X_i)\) is the quantity known as the “assignment”, which reveals the latent factor, i.e., the result of the coin toss when a given \(X_i\) was sampled. We define a guide on \(C\), parameterized by variational parameters \(\phi \triangleq \{ \lambda_i \}_{i=1}^N\)</p>
\[C_i \sim Bernoulli(\lambda_i)\]
<p>In Pyro, we define a guide that encodes this</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">guide</span><span class="p">(</span><span class="n">data</span><span class="p">):</span> <span class="c1"># Guide doesn't require data; just need the value of N
</span> <span class="k">with</span> <span class="n">pyro</span><span class="p">.</span><span class="n">plate</span><span class="p">(</span><span class="s">"data"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span> <span class="c1"># conditional independence
</span> <span class="c1"># Define variational parameters \lambda_i (one for every data point)
</span> <span class="n">lam</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"lam"</span><span class="p">,</span>
<span class="n">torch</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)),</span> <span class="c1"># randomly initiallized
</span> <span class="n">constraint</span><span class="o">=</span><span class="n">dist</span><span class="p">.</span><span class="n">constraints</span><span class="p">.</span><span class="n">unit_interval</span><span class="p">)</span> <span class="c1"># \in [0, 1]
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="c1"># Careful, this name HAS TO BE same to match the model
</span> <span class="n">dist</span><span class="p">.</span><span class="n">Bernoulli</span><span class="p">(</span><span class="n">lam</span><span class="p">))</span>
</code></pre></div></div>
<p>We generate some synthetic data from the following simulator to train our model on.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">getdata</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">mean1</span><span class="o">=</span><span class="mf">2.0</span><span class="p">,</span> <span class="n">mean2</span><span class="o">=-</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">std1</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">std2</span><span class="o">=</span><span class="mf">0.5</span><span class="p">):</span>
<span class="n">D1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">N</span><span class="o">//</span><span class="mi">2</span><span class="p">,)</span> <span class="o">*</span> <span class="n">std1</span> <span class="o">+</span> <span class="n">mean1</span>
<span class="n">D2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">N</span><span class="o">//</span><span class="mi">2</span><span class="p">,)</span> <span class="o">*</span> <span class="n">std2</span> <span class="o">+</span> <span class="n">mean2</span>
<span class="n">D</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">D1</span><span class="p">,</span> <span class="n">D2</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">D</span><span class="p">)</span>
<span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">from_numpy</span><span class="p">(</span><span class="n">D</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">))</span>
</code></pre></div></div>
<p>Finally, Pyro requires a bit of boilerplate to setup the optimization</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">getdata</span><span class="p">(</span><span class="mi">200</span><span class="p">)</span> <span class="c1"># 200 data points
</span><span class="n">pyro</span><span class="p">.</span><span class="n">clear_param_store</span><span class="p">()</span>
<span class="n">optim</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">({})</span>
<span class="n">svi</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">infer</span><span class="p">.</span><span class="n">SVI</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">guide</span><span class="p">,</span> <span class="n">optim</span><span class="p">,</span> <span class="n">infer</span><span class="p">.</span><span class="n">Trace_ELBO</span><span class="p">())</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10000</span><span class="p">):</span>
<span class="n">svi</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre></div></div>
<p>That’s pretty much all we need. I have plotted (1) the ELBO loss, (2) the variational parameter \(\lambda_i\) for every data point, (3) the two Gaussians in the model, and (4) the coin bias as the training progresses.</p>
<center>
<figure>
<img width="100%" style="padding-top: 20px;" src="/public/posts_res/16/example_loss.gif" />
</figure>
</center>
<p>The full code is available in this gist: <a href="https://gist.github.com/dasayan05/aca3352cd00058511e8372912ff685d8">https://gist.github.com/dasayan05/aca3352cd00058511e8372912ff685d8</a>.</p>
<hr />
<p>That’s all for today. Hopefully I was able to convey the bigger picture of probabilistic programming, which is quite useful for modelling lots of problems. The following are the sources of information used while writing this article; interested readers are encouraged to check them out.</p>
<ol>
<li><a href="http://pyro.ai/examples/svi_part_iii.html">Pyro’s VI tutorial</a></li>
<li><a href="https://arxiv.org/abs/1401.0118">Black Box variational inference</a></li>
<li><a href="https://arxiv.org/abs/1506.05254">Gradient Estimation Using Stochastic Computation Graphs</a></li>
<li><a href="https://arxiv.org/abs/1701.03757">Deep Probabilistic Programming</a></li>
<li><a href="https://arxiv.org/abs/1810.09538">Pyro: Deep Universal Probabilistic Programming</a></li>
</ol>Ayan DasWelcome to another tutorial about probabilistic models, after a primer on PGMs and VAE. However, I am particularly excited to discuss a topic that doesn’t get as much attention as traditional Deep Learning does. The idea of Probabilistic Programming has long been there in the ML literature and got enriched over time. Before it creates confusion, let’s declutter it right now - it’s not really writing traditional “programs”, rather it’s building Probabilistic Graphical Models (PGMs), but equipped with imperative programming style (i.e., iterations, branching, recursion etc). Just like Automatic Differentiation allowed us to compute derivative of arbitrary computation graphs (in PyTorch, TensorFlow), Black-box methods have been developed to “solve” probabilistic programs. In this post, I will provide a generic view on why such a language is indeed possible and how such black-box solvers are materialized. At the end, I will also introduce you to one such Universal Probabilistic Programming Language, Pyro, that came out of Uber’s AI lab and started gaining popularity.Patterns of Randomness2020-04-15T00:00:00+00:002020-04-15T00:00:00+00:00https://dasayan05.github.io/blog-tut/2020/04/15/patterns-of-randomness<p>Welcome folks ! This is an article I was planning to write for a long time. I finally managed to get it done while locked at home due to the global COVID-19 situation. So, its basically something fun, interesting, attractive and hopefully understandable to most readers. To be specific, my plan is to dive into the world of finding visually appealing patterns in different sections of mathematics. I am gonna introduce you to four distinct mathematical concepts by means of which we can generate artistic patterns that are very soothing to human eyes. Most of these use random number as the underlying principle of generation. These are not necessarily very useful in real life problem solving but widely loved by artists as a tool for content creation. 
They are sometimes referred to as <em>Mathematical Art</em>. I will deliberately keep the fine-grained details out of the way so that the post remains accessible to a larger audience. In case you want to reproduce the content in this post, here is the <a href="https://github.com/dasayan05/patterns-of-randomness">code</a>. <strong>Warning: This post contains quite heavy images which may take some time to load in your browser; so be patient</strong>.</p>
<h1 id="-random-walk--brownian-motion-">[ Random Walk & Brownian Motion ]</h1>
<p>Let’s start with something simple. Consider a Random Variable \(\mathbf{R}_t\) (\(t\) being time) with support \(\{ -1, +1\}\) and equal probability on both of its possible values. Think of it as a <em>score</em> you get at time \(t\) which can be either \(+1\) or \(-1\) as a result of an unbiased coin-flip. In terms of probability:</p>
<p>\[
\mathbb{P}\bigl[ \mathbf{R}_t = +1 \bigr] = \mathbb{P}\bigl[ \mathbf{R}_t = -1 \bigr] = \frac{1}{2}
\]</p>
<p>Realization (samples) of \(\mathbf{R}_t\) for \(t=0 \rightarrow T (=10)\) would look like
\[
\bigl[ +1, -1, -1, +1, -1, -1, -1, +1, +1, -1, +1 \bigr]
\]</p>
<p>Let us define another Random Variable \(\mathbf{S}_t\) which is nothing but an accumulator of \(\mathbf{R}_t\) till time \(t\). So, by definition</p>
<p>\[
\mathbf{S}_t = \sum_{i=0}^t \mathbf{R}_i
\]
Realization of \(\mathbf{S}_t\) corresponding to above \(\mathbf{R}_t\) sequence would look like
\[
\bigl[ +1, 0, -1, 0, -1, -2, -3, -2, -1, -2, -1 \bigr]
\]</p>
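<p>As a quick sanity check (a small sketch using only the Python standard library), the realization of \(\mathbf{S}_t\) above is just the running sum of the \(\mathbf{R}_t\) realization:</p>

```python
from itertools import accumulate

# The realization of R_t quoted above
R = [+1, -1, -1, +1, -1, -1, -1, +1, +1, -1, +1]

# S_t is the accumulator of R_t up to time t
S = list(accumulate(R))
print(S)  # [1, 0, -1, 0, -1, -2, -3, -2, -1, -2, -1]
```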
<p>This is popularly known as the <strong>Random Walk</strong>. With the basics ready, let us have two such random walks namely \(\mathbf{S}^x_t\) and \(\mathbf{S}^y_t\) and treat them as \(X\) and \(Y\) coordinates of a <em>Random Vector</em> namely \(\displaystyle{ \bar{\mathbf{S}}_t \triangleq \begin{bmatrix} \mathbf{S}^x_t \\ \mathbf{S}^y_t \end{bmatrix} }\).</p>
<p>As of now, it all looks nice and mathy, right! Here’s the fun part: let me keep the time (i.e., \(t\)) running and keep track of the path that the vector \(\bar{\mathbf{S}}_t\) traces on a 2D plane.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/15/2d_disc_brown.gif" />
</figure>
</center>
<p>It will create a cool random checkerboard-like pattern as time goes on. Looking at the tip (the ‘dot’), you might see it as a tiny particle. It so happens that this is a discretized version of a continuous <a href="http://www1.lsbu.ac.uk/water/Brownian.html">phenomenon observed in real microscopic particles in fluid</a>, famously known as <strong>Brownian Motion</strong>.</p>
<p>Real Brownian Motion is continuous. Let’s work it out, but very briefly. We divide an arbitrary time interval \([0, T]\) into \(N\) small intervals of length \(\displaystyle{ \Delta t = \frac{T}{N} }\) and have a modified score Random Variable \(\mathbf{R}_t\) with support \(\displaystyle{ \left\{ +\sqrt{\frac{T}{N}}, -\sqrt{\frac{T}{N}} \right\} }\), again with equal probability. We still have the same definition \(\mathbf{S}_t = \sum_{i=0}^t \mathbf{R}_i\). It so happens that as we approach the limiting case of</p>
<p>\[
N \rightarrow \infty,\text{ and consequently } \sqrt{\frac{T}{N}} \rightarrow 0\text{ and } \Delta t\rightarrow 0
\]</p>
<p>it gives us the continuous analogue of <strong>Brownian Motion</strong>. Similar to the discrete case, if we trace the path of \(\displaystyle{ \bar{\mathbf{S}}_t \triangleq \begin{bmatrix} \mathbf{S}^x_t \\ \mathbf{S}^y_t \end{bmatrix} }\) with large \(N\) (yes, in practice we cannot go to infinity, sorry), patterns like this will emerge</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/15/brown.gif" />
</figure>
</center>
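<p>For reference, a simulation along these lines can be sketched in a few lines (the horizon \(T\), the value of \(N\) and the random seed here are my own choices, not the exact settings behind the figures):</p>

```python
import random
from itertools import accumulate

random.seed(0)
T, N = 1.0, 10_000
step = (T / N) ** 0.5  # the score is +sqrt(T/N) or -sqrt(T/N)

# Two independent walks: the x and y coordinates of the random vector S_t
Sx = list(accumulate(random.choice((+step, -step)) for _ in range(N)))
Sy = list(accumulate(random.choice((+step, -step)) for _ in range(N)))
path = list(zip(Sx, Sy))  # the 2D path traced over time
```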
<p>To make it more artistic, I took an even bigger \(N\) and ran the simulation for a while, which produced beautiful jittery patterns. Random numbers being at the heart of the phenomenon, we get different patterns in different runs. Here are two such simulation results:</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/brownian_full.png" />
</figure>
</center>
<p><strong>Want to learn more ?</strong></p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Brownian_motion">Wikipedia</a></li>
<li><a href="https://en.wikipedia.org/wiki/Geometric_Brownian_motion">Geometric BM</a></li>
<li><a href="https://en.wikipedia.org/wiki/It%C3%B4_calculus">Stochastic Calculus</a></li>
</ol>
<h1 id="-dynamical-systems--chaos-">[ Dynamical Systems & Chaos ]</h1>
<p>Dynamical Systems are defined by a state space \(\mathbb{R}^n\) and a system dynamics (a function \(\mathbf{F}\)). A state \(\mathbf{x}\in\mathbb{R}^n\) is a specific (abstract) configuration of the system, and the dynamics determines how the state “evolves” over time. The dynamics is often represented by a <a href="https://en.wikipedia.org/wiki/Differential_equation">differential equation</a> that specifies the change of state over time. So,</p>
<p>\[
\mathbf{F}(\mathbf{x}, t) \triangleq \frac{d\mathbf{x}}{dt}
\]</p>
<p>The true state of the system at any point in time is determined by solving an Initial Value Problem (IVP) starting from an initial state \(\mathbf{x}_0\). We then compute consecutive states for \(t\gt 0\) as</p>

<p>\[
\mathbf{x}_{t+\Delta t} = \mathbf{x}_t + \Delta t \cdot \mathbf{F}(\mathbf{x}_t, t)
\]</p>
<p>Having a sufficiently small \(\Delta t\) ensures proper evolution of the states.</p>
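<p>This iterative update is all a basic solver needs. Here is a sketch (with the toy dynamics \(f(x,t)=x\), whose exact solution is \(x(t) = x_0 e^t\); the step size is an arbitrary choice of mine):</p>

```python
import math

def euler_solve(f, x0, t1, dt=1e-4):
    """Trace the IVP dx/dt = f(x, t) from t=0 to t=t1 with forward iteration."""
    x, t = x0, 0.0
    while t < t1:
        x = x + dt * f(x, t)  # x_{t+dt} = x_t + dt * f(x_t, t)
        t += dt
    return x

# dx/dt = x with x(0)=1 has the exact solution x(t) = e^t
x1 = euler_solve(lambda x, t: x, x0=1.0, t1=1.0)
```

With a smaller <code>dt</code>, <code>x1</code> approaches \(e \approx 2.71828\), illustrating how the step size controls the discretization error.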
<p>Now this may seem quite trivial, at least to those who have studied Differential Equations. But there are specific choices of \(\mathbf{F}\) which lead to an evolution of states whose trajectory is surprisingly beautiful. For reasons that are beyond the scope of this article, these are called <strong>Chaos</strong>. There is a specific branch of dynamical systems (named “<a href="https://en.wikipedia.org/wiki/Chaos_theory">Chaos Theory</a>”) that deals with the characteristics of such chaotic systems. Below are three such chaotic systems with their trajectories visualized in 3D state space. To be specific, we take each system with an initial state (they are very sensitive to initial states), compute successive states with a small enough \(\Delta t\), and visualize them as a continuous path in 3D. The corresponding figures depict an animation of the evolution of states over time as well as the whole trajectory all at once.</p>
<h3 id="lorentz-system">Lorentz System</h3>
<p>\[
\frac{d\mathbf{x}}{dt} = \bigl[ \sigma (y-x), x(\rho - z) - y, xy - \beta z \bigr]^T
\]
\[
\text{with }\sigma = 10, \beta = \frac{8}{3}, \rho = 28 \text{, and } \mathbf{x}_0 = \bigl[ 1,1,1 \bigr]
\]</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/lorentz.gif" />
</figure>
</center>
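<p>A minimal sketch of how such a trajectory can be produced with the iterative update described earlier (the step size and number of steps are my own choices, not the exact settings behind the animation):</p>

```python
def lorenz(x, y, z, sigma=10.0, beta=8.0/3.0, rho=28.0):
    # The Lorenz dynamics dx/dt, term by term as in the equation above
    return sigma * (y - x), x * (rho - z) - y, x * y - beta * z

dt, steps = 1e-3, 50_000
x, y, z = 1.0, 1.0, 1.0          # initial state x_0 = [1, 1, 1]
traj = [(x, y, z)]
for _ in range(steps):
    dx, dy, dz = lorenz(x, y, z)
    x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
    traj.append((x, y, z))       # the path visualized in the figure
```

The other two systems only need their own dynamics function swapped in; the integration loop stays identical.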
<h3 id="rössler-system">Rössler System</h3>
<p>\[
\frac{d\mathbf{x}}{dt} = \bigl[ -(y+z), x+Ay, B+xz-Cz \bigr]^T
\]
\[
\text{with }A=0.2, B=0.2, C=5.7 \text{, and } \mathbf{x}_0 = \bigl[ 1,1,1 \bigr]
\]</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/roseller.gif" />
</figure>
</center>
<h3 id="halvorsen-system">Halvorsen System</h3>
<p>\[
\frac{d\mathbf{x}}{dt} = \bigl[ -ax-4y-4z-y^2, -ay-4z-4x-z^2, -az-4x-4y-x^2 \bigr]^T
\]
\[
\text{with }a=1.89 \text{, and } \mathbf{x}_0 = \bigl[ -1.48, -1.51, 2.04 \bigr]
\]</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/helvorsen.gif" />
</figure>
</center>
<p><strong>Want to learn more ?</strong></p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Differential_equation">Differential Equation</a>, <a href="https://en.wikipedia.org/wiki/Dynamical_system">Dynamical System</a></li>
<li><a href="https://en.wikipedia.org/wiki/Chaos_theory">Chaos Theory</a></li>
<li><a href="https://en.wikipedia.org/wiki/Attractor">Attractors</a>, <a href="http://www.stsci.edu/~lbradley/seminar/attractors.html">Strange Attractors</a></li>
<li><a href="https://en.wikipedia.org/wiki/Lorenz_system">Lorentz System</a>, <a href="https://en.wikipedia.org/wiki/R%C3%B6ssler_attractor">Rössler System</a>, <a href="https://www.dynamicmath.xyz/calculus/velfields/Halvorsen/">Halvorsen System</a></li>
</ol>
<h1 id="-complex-fourier-series-">[ Complex Fourier Series ]</h1>
<p>We all know about Fourier Series, right! But I am sure not all of you have seen this artistic side of it. Well, the patterns themselves aren’t really a property of the Fourier series; rather, the Fourier series simply helps in creating them.</p>
<p>We know the following to be the “synthesis equation” of complex fourier series</p>
<p>\[
f(t) = \sum_{n=-\infty}^{+\infty} c_n e^{j \frac{2\pi n}{T} t} \in \mathbb{C}
\]</p>
<p>which represents the synthesis of a periodic function \(f(t)\) of period \(T\) from its frequency components \(\mathbf{C} \triangleq \left[ c_{-\infty}, \cdots, c_{-2}, c_{-1}, c_{0}, c_{+1}, c_{+2}, \cdots, c_{+\infty} \right]\). Often, as a practical measure, we truncate the infinite summation to a limited range \([ -N, N ]\). Furthermore, let’s consider \(T=1\) without loss of generality. So, we see \(f(t)\) as a function parameterized by the frequency components \(\mathbf{C} \in \mathbb{C}^{2N+1}\)</p>
<p>\[
f(t, \mathbf{C}) \approx \sum_{n=-N}^{+N} c_n e^{j 2\pi n t} \in \mathbb{C}
\]</p>
<p>By doing this, we can make complex-valued functions by plugging in different \(\mathbf{C}\) and running \(t=0\rightarrow 1\). However, not every \(\mathbf{C}\) leads to something visually appealing. A particular feature of an object that appeals to the human eye is “symmetry”, and we are going to exploit it here. A little refresher on Fourier series will make you realize that if the coefficients are real-valued, then \(f(t, \mathbf{C})\) has a symmetry property. And that’s all we need.</p>
<p>We pick a random \(\mathbf{C} \in \mathbb{R}^{2N+1}\) (see, it’s real numbers now), run the clock \(t=0\rightarrow 1\), and trace the path travelled by the complex point \(f(t, \mathbf{C}) \in \mathbb{C}\) as time progresses. It creates patterns like the ones shown below.</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/fourier_6.gif" />
</figure>
</center>
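<p>The generation procedure can be sketched as follows (the value of \(N\) and the seed are arbitrary choices of mine). Note that with real coefficients, \(f(-t, \mathbf{C}) = \overline{f(t, \mathbf{C})}\), which is exactly the mirror symmetry about the real axis visible in these patterns:</p>

```python
import cmath, math, random

random.seed(42)
N = 8
idx = list(range(-N, N + 1))              # frequency indices n = -N..N
C = [random.gauss(0, 1) for _ in idx]     # random real-valued coefficients

def f(t):
    # Truncated complex Fourier synthesis with period T = 1
    return sum(c * cmath.exp(2j * math.pi * n * t) for c, n in zip(C, idx))

path = [f(k / 1000) for k in range(1001)]  # complex points over one period
```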
<p>There is one way to customize these: the value of \(N\). As we know, \(c_n\) has the interpretation of the magnitude of the \(n^{th}\) frequency component. A large value of \(N\) implies the introduction of more high-frequency content into the time-domain signal, which visually leads to \(f(t)\) having finer details (i.e., more curves and bends). Lowering the value of \(N\) clears out these fine details and the path becomes more and more flat. The image below shows \(N\) decreasing from \(10\) to \(6\) along the columns; you can see the patterns losing detail as we go right. And just like before, every run creates different patterns, as they are solely controlled by the randomly chosen coefficients.</p>
<center>
<figure>
<img width="100%" style="padding-top: 20px;" src="/public/posts_res/15/fourier_10_6.png" />
</figure>
</center>
<p><strong>Want to learn more ?</strong></p>
<ol>
<li><a href="http://www.ee.ic.ac.uk/hp/staff/dmb/courses/E1Fourier/00300_ComplexFourier.pdf">Complex Fourier Series</a></li>
<li><a href="http://www.jezzamon.com/fourier/">Fourier patterns</a></li>
<li><a href="https://www.youtube.com/watch?v=ds0cmAV-Yek">Visualizing fourier series</a></li>
<li><a href="https://www.youtube.com/watch?v=r6sGWTCMz2k&t=725s">Amazing Video by 3Blue1Brown</a></li>
</ol>
<h1 id="-mandelbrot--julia-set-">[ Mandelbrot & Julia set ]</h1>
<p>These two sets are very important in the study of “Fractals” - objects with self-repeating patterns. Fractals are extremely popular concepts in certain branches of mathematics, but they are mostly famous for their eye-catching visual appearance. If you ever come across an article about fractals, you are likely to see some of the most artistic patterns you’ve ever seen in the context of mathematics. Diving into the details of fractals and self-repeating patterns would open up a vast world of “Mathematical Art”; in this article, I can only show you a tiny bit of it - two sets, namely the “Mandelbrot” and “Julia” sets. Let’s start with the <em>all-important function</em></p>
<p>\[
f_C(z) = z^2 + C
\]</p>
<p>where \(C, f_C(z), z \in \mathbb{C}\) are complex numbers. This apparently simple complex-valued function is at the heart of these sets. All it does is square its argument and add the complex number that the function is parameterized with. Also, we denote by \(f^{(k)}_C(z)\) the \(k\)-times repeated application of the function on a given \(z\), i.e.</p>
<p>\[
f^{(k)}_C(z) = f_C(\cdots f_C(f_C(z)))
\]</p>
<h3 id="mandelbrot-set">Mandelbrot Set</h3>
<p>With these basic definitions in hand, the <strong>Mandelbrot set</strong> (named after mathematician <a href="https://en.wikipedia.org/wiki/Benoit_Mandelbrot">Benoit Mandelbrot</a>) is the set of all \(C\in\mathbb{C}\) for which
\[
\lim_{k\rightarrow\infty} \vert f^{(k)}_C(0+0j) \vert < \infty
\]</p>
<p>Simply put, there is a set of values of \(C\) for which, if you repeatedly apply \(f_C\) on zero (i.e., \(0+0j\)), the output <em>does not diverge</em>. All such values of \(C\) make up the so-called “Mandelbrot Set”. The values of \(C\) that do not diverge can further be characterized by how many repeated applications of \(f_C(\cdot)\) they can tolerate before their absolute value exceeds a predefined “<em>escape radius</em>”, let’s call it \(r\in\mathbb{R}\). This creates a loose sense of “strength” of a certain \(C\), which can be written as</p>
<p>\[
\mathbb{K}(C) = \max_{\vert f^{(k)}_C(0+0j) \vert \leq r} k
\]</p>
<p>It might look strange, but if you treat the integer \(\mathbb{K}(C)\) as a grayscale intensity value for a grid of points on the 2D complex plane (i.e., an image), you will get a picture similar to this (don’t get confused, the picture is indeed grayscale; I added PyPlot’s <a href="https://matplotlib.org/tutorials/colors/colormaps.html"><code class="language-plaintext highlighter-rouge">plt.cm.twilight_shifted</code></a> colormap to enhance the visual appeal). The grid is in the range \((-2.5+1.5j) \rightarrow (1.5-1.5j)\) and the escape radius is \(r=2.5\).</p>
<center>
<figure>
<img width="100%" style="padding-top: 20px;" src="/public/posts_res/15/mandelbrot_thumbnail.png" />
</figure>
</center>
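<p>The “strength” \(\mathbb{K}(C)\) is what escape-time renderers compute per pixel; here is a sketch (the radius \(r=2.5\) follows the text, while the cap on \(k\), unavoidable in practice, is my own choice):</p>

```python
def K(C, r=2.5, k_max=50):
    """Largest k for which |f_C applied k times to 0| stays within radius r."""
    z = 0 + 0j
    for k in range(k_max):
        z = z * z + C
        if abs(z) > r:
            return k            # escaped after k+1 applications
    return k_max                # never escaped: (likely) in the Mandelbrot set

print(K(0 + 0j))  # 50: the origin never diverges
print(K(2 + 0j))  # 1:  0 -> 2 -> 6, escapes quickly
```

Evaluating <code>K</code> over a grid of complex points and treating the result as pixel intensities gives an image like the one above.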
<p>What is so fascinating about this pattern is the fact that it is self-repeating: if you zoom into a small portion of the image, you will see the same pattern again.</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/mandelbrot_zoom.png" />
</figure>
</center>
<h3 id="julia-set">Julia Set</h3>
<p>Another very similar concept exists, called the “Julia Set”, which exhibits a similar visual \(\mathbb{K}\) diagram. Unlike the Mandelbrot set, we consider a \(z\in\mathbb{C}\) to be in the Julia set \(\mathbf{J}_C\) if</p>
<p>\[
\lim_{k\rightarrow\infty} \vert f^{(k)}_C(z) \vert < \infty
\]</p>
<p>Please note that this time the set is parameterized by \(C\), and we are interested in how the <em>argument of the function</em> behaves under repeated application of \(f_C(\cdot)\). Things from here on are similar: we define an analogous “strength” for every \(z\in\mathbb{C}\) as</p>
<p>\[
\mathbb{K}_C(z) = \max_{\vert f^{(k)}_C(z) \vert \leq r} k
\]</p>
<p>Please note that, as a result of this new definition, the \(\mathbb{K}\) diagram is parameterized by \(C\), i.e., we get a different image for different \(C\). In principle, we could visualize such images for a few fixed \(C\) (they are indeed pretty cool), but let’s go a bit further than that: we vary \(C\) along a trajectory, produce the \(\mathbb{K}\) diagram for each \(C\), and view them as an animation. This creates an amazing visual effect. Technically, I varied \(C\) along a circle of radius \(R = 0.75068\), i.e., \(C = R e^{j\theta}\) with \(\theta = 0\rightarrow 2\pi\).</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/julia1.gif" />
</figure>
</center>
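<p>The corresponding sketch only swaps the roles of \(z\) and \(C\) in the escape-time loop (radius and iteration cap are again my own choices):</p>

```python
def K_C(z, C, r=2.5, k_max=50):
    """Escape-time 'strength' of z under repeated application of f_C."""
    for k in range(k_max):
        z = z * z + C
        if abs(z) > r:
            return k
    return k_max

# For C = 0, f_C is pure squaring: points inside the unit disk never
# escape, while points outside blow up almost immediately.
print(K_C(0.5 + 0j, 0 + 0j))  # 50
print(K_C(2.0 + 0j, 0 + 0j))  # 0
```

Rendering <code>K_C</code> over a grid of \(z\) values, once per value of \(C\) on the circle, yields the frames of the animation above.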
<p><strong>Want to know more ?</strong></p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Mandelbrot_set">Mandelbrot set</a></li>
<li><a href="https://en.wikipedia.org/wiki/Julia_set">Julia set</a></li>
<li><a href="https://en.wikipedia.org/wiki/Fractal">Fractals</a></li>
</ol>
<hr />
<p>Alright then! That is pretty much it. Due to constraints of time, space and scope, it’s not possible to explain everything in detail in one article. There are plenty of resources available online (I have already provided some links) which might be useful in case you are interested. Feel free to explore the details of whatever new things you learnt today. If you would like to reproduce the diagrams and images, please use the code here: <a href="https://github.com/dasayan05/patterns-of-randomness">https://github.com/dasayan05/patterns-of-randomness</a> (sorry, the code is a bit messy; you will have to figure it out).</p>Ayan DasWelcome folks ! This is an article I was planning to write for a long time. I finally managed to get it done while locked at home due to the global COVID-19 situation. So, its basically something fun, interesting, attractive and hopefully understandable to most readers. To be specific, my plan is to dive into the world of finding visually appealing patterns in different sections of mathematics. I am gonna introduce you to four distinct mathematical concepts by means of which we can generate artistic patterns that are very soothing to human eyes. Most of these use random number as the underlying principle of generation. These are not necessarily very useful in real life problem solving but widely loved by artists as a tool for content creation. They are sometimes referred to as Mathematical Art. I will deliberately keep the fine-grained details out of the way so that it is reachable to a larger audience. In case you want to reproduce the content in this post, here is the code. 
Warning: This post contains quite heavily sized images which may take some time to load in your browser; so be patient.Neural Ordinary Differential Equation (Neural ODE)2020-03-20T00:00:00+00:002020-03-20T00:00:00+00:00https://dasayan05.github.io/blog-tut/2020/03/20/neural-ode<p>Neural Ordinary Differential Equation (Neural ODE) is a very recent and first-of-its-kind idea that emerged in <a href="https://nips.cc/Conferences/2018">NeurIPS 2018</a>. The authors, four researchers from the University of Toronto, reformulated the parameterization of deep networks with differential equations, particularly first-order ODEs. The idea evolved from the fact that ResNet, a very popular deep network, possesses quite a bit of similarity with ODEs in its core structure. <a href="https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations">The paper</a> also offered an efficient algorithm to train such ODE structures as a part of a larger computation graph. The architecture is flexible and memory efficient for learning. Being a bit non-trivial from a deep network standpoint, I decided to dedicate this article to explaining it in detail, making it easier for everyone to understand. Understanding the whole algorithm requires a fair bit of rigorous mathematics, especially ODEs and their algebraic understanding, which I will try to cover at the beginning of the article. I also provide a (simplified) PyTorch implementation that is easy to follow.</p>
<h2 id="ordinary-differential-equations-ode">Ordinary Differential Equations (ODE)</h2>
<p><br />
\(\mathbf{Definition}\): Let’s put Neural ODEs aside for a moment and take a refresher on ODEs themselves. Because of their unpopularity in the deep learning community, chances are that you haven’t looked at them since high school. We will focus our discussion on first-order linear ODEs, which take the generic form</p>
<p>\[
\frac{dx}{dt} = f(x, t)
\]</p>
<p>where \(\displaystyle{ x,t,\frac{dx}{dt} \in \mathbb{R} }\). Please recall that ODEs are differential equations that involve only one independent variable, which in our case is \(t\). Geometrically, such an ODE represents a <em>family of curves/functions</em> \(x(t)\), also called the <em>solutions</em> of the ODE. The function \(f(x, t)\), often called the <em>dynamics of the system</em>, denotes a common characteristic of all the solutions; specifically, it denotes the first derivative (slope) of every solution. An example will make things clear: let’s say the dynamics of an ODE is \(\displaystyle{ f(x, t) = 2xt }\). With the help of basic calculus, we can see that the family of solutions is \(\displaystyle{ x(t) = k\cdot e^{t^2} }\) for any value of \(k\).</p>
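<p>A quick numerical check of this claim (a sketch; the constant \(k\), the test point and the finite-difference step are arbitrary choices): the derivative of \(x(t) = k\cdot e^{t^2}\) should match \(2xt\).</p>

```python
import math

k, t0, h = 1.7, 0.9, 1e-6          # arbitrary constant, test point, FD step

def x(t):
    return k * math.exp(t ** 2)     # a member of the claimed family of solutions

# Central finite-difference approximation of dx/dt at t0
dxdt = (x(t0 + h) - x(t0 - h)) / (2 * h)

residual = abs(dxdt - 2 * x(t0) * t0)  # ~0: x(t) satisfies dx/dt = 2xt
```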
<p><br />
\(\mathbf{System\ of\ ODEs}\): Just like with other algorithms in Deep Learning, we can (and we have to) go beyond the \(\mathbb{R}\) space and establish similar ODEs in higher dimensions. A <em>system of ODEs</em> with dependent variables \(x_1, x_2, \cdots x_d \in \mathbb{R}\) and independent variable \(t \in \mathbb{R}\) can be written as</p>
<p>\[
\frac{dx_1}{dt} = f_1(x_1,x_2,\cdots,x_d,t); \frac{dx_2}{dt} = f_2(x_1,x_2,\cdots,x_d,t); \cdots
\]</p>
<p>With a vectorized notation of \(\mathbf{x} \triangleq [ x_1, x_2, \cdots, x_d ]^T \in \mathbb{R}^d\) and \(\mathbf{f}(\mathbf{x}) \triangleq [ f_1, f_2, \cdots, f_d ]^T \in \mathbb{R}^d\), we can write</p>
<p>\[
\frac{d\mathbf{x}}{dt} = \mathbf{f}(\mathbf{x}, t)
\]</p>
<p>The dynamics \(\mathbf{f}(\mathbf{x}, t)\) can be seen as a <strong>Vector Field</strong>: given any \(\mathbf{x} \in \mathbb{R}^d\), \(\mathbf{f} \in \mathbb{R}^d\) denotes its gradient with respect to \(t\). The independent variable \(t\) can often be regarded as <strong>time</strong>. For example, Fig.1 shows the \(\mathbb{R}^2\) space and a dynamics \(\mathbf{f}(\mathbf{x}, t) = tanh(W\mathbf{x} + b)\) defined on it. Please note that it is a time-invariant system, i.e., the dynamics is independent of \(t\). A system with time-dependent dynamics would have a different gradient at a given \(\mathbf{x}\) depending on the time at which you visit it.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/14/vector_field.png" />
<figcaption>Fig.1: A vector field in 2D space denoting the dynamics of an ODE</figcaption>
</figure>
</center>
<p><br />
\(\mathbf{Initial\ Value\ Problem}\): Although I showed the solution of an extremely simple system with dynamics \(f(x, t) = 2xt\), most practical systems are far from it. Systems with higher dimension and complicated dynamics are very difficult to solve analytically. This is when we resort to <em>numerical methods</em>. A specific way of solving any ODE numerically is to solve an <strong>Initial Value Problem</strong> where given the system (dynamics) and an <em>initial condition</em>, one can iteratively “trace” the solution. I emphasized the term <em>trace</em> because that’s what it is. Think of it as dropping a small particle on the vector field at some point and let it <em>flow according to the gradients</em> at any point.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/14/trace.png" />
<figcaption>Fig.2: Solving for two solutions with two different initial condition</figcaption>
</figure>
</center>
<p>Fig.2 shows how two different initial conditions (red dots) lead to two different curves/solutions (a small segment of each curve is shown). These curves/solutions are from the family of curves represented by the system whose dynamics is shown with black arrows. Different numerical methods are available, differing in how well they do the “tracing” and how much error we tolerate. Beyond the naive ones, we have modern numerical solvers to tackle initial value problems. For the sake of simplicity, we will focus on one of the simplest yet most popular methods, known as the <strong>forward Euler method</strong>. The algorithm simply does the following: it starts from a given initial state \(\mathbf{x}_0\) at \(t=0\), literally goes in the direction of the gradient at that point, i.e., \(\mathbf{f}(\mathbf{x}=\mathbf{x}_0, t=0)\), and keeps doing so for \(N\) steps using a small step size \(\Delta t \triangleq t_{i+1} - t_i\). The following iterative update rule summarizes everything</p>
<p>\[
\mathbf{x}_{t+1} = \mathbf{x}_t + \Delta t \cdot \mathbf{f}(\mathbf{x}_t, t)
\]</p>
<p>In case you haven’t noticed, the formula can be obtained trivially from the discretized version of analytic derivative</p>
<p>\[
\mathbf{f}(\mathbf{x}, t) = \frac{d\mathbf{x}}{dt} \approx \frac{\mathbf{x}_{t+1} - \mathbf{x}_t}{\Delta t}
\]</p>
<p>If you look at Fig.2 closely enough, you will see that the red curves are made up of discrete segments, which is a result of solving an initial value problem using the Forward Euler method.</p>
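<p>To make the update rule concrete, here is a minimal pure-Python sketch (function and variable names are mine, purely illustrative) of the Forward Euler method applied to the toy dynamics \(f(x, t) = 2xt\) from earlier, whose exact solution through \(x(0) = 1\) is \(x(t) = e^{t^2}\):</p>

```python
import math

def forward_euler(f, x0, t_start, t_end, n_steps):
    """Trace the solution of dx/dt = f(x, t) from x(t_start) = x0."""
    dt = (t_end - t_start) / n_steps
    x, t = x0, t_start
    for _ in range(n_steps):
        x = x + dt * f(x, t)  # the update rule: x_{t+1} = x_t + dt * f(x_t, t)
        t = t + dt
    return x

# toy dynamics f(x, t) = 2xt; the exact solution through x(0) = 1 is x(t) = exp(t^2)
x_numeric = forward_euler(lambda x, t: 2 * x * t, 1.0, 0.0, 1.0, 10000)
x_exact = math.exp(1.0)  # value of exp(t^2) at t = 1
```

<p>With a sufficiently small step size, the traced value at \(t = 1\) lands very close to the exact \(e \approx 2.718\).</p>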
<h2 id="motivation-of-neural-ode">Motivation of Neural ODE</h2>
<p>Let’s look at the core structure of <a href="https://arxiv.org/abs/1512.03385">ResNet</a>, an extremely popular deep network that almost revolutionized deep network architecture. The most distinctive structural component of ResNet is its residual blocks, which compute “increments” on top of the previous layer’s activation instead of computing activations directly. If the activation of layer \(t\) is \(\mathbf{h}_t\), then</p>
<p>\[ \tag{1}
\mathbf{h}_{t+1} = \mathbf{h}_t + \mathbf{F}(\mathbf{h}_t; \theta_t)
\]</p>
<p>where \(\mathbf{F}(\cdot)\) is the residual function (the increment on top of the last layer). I am pretty sure the reader can feel where this is going. Yes, the residual architecture resembles the Forward Euler method applied to an ODE with dynamics \(\mathbf{F}(\cdot)\). Having \(N\) such residual layers is similar to executing \(N\) steps of the Forward Euler method with step size \(\Delta t = 1\). The idea of Neural ODE is to “<em>parameterize the dynamics of this ODE explicitly rather than parameterizing every layer</em>”. So we can have</p>
<p>\[
\frac{d\mathbf{h}_t}{dt} = \mathbf{F}(\mathbf{h}_t, t; \theta)
\]</p>
<p>and \(N\) successive layers can be realized by \(N\)-step Forward Euler evaluations. As you can guess, we can choose \(N\) as per our requirement, and in the limiting case (\(N \rightarrow \infty\)) we can think of it as an infinite-layer network. You must understand, though, that such a parameterization cannot provide infinite capacity, as the parameters are shared and finite in number. Fig.3 below depicts the resemblance of ResNet to the Forward Euler iteration.</p>
<p><br /></p>
<center>
<figure>
<img width="75%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/14/ode_core_idea.png" />
<figcaption>Fig.3: Resemblance of ResNet and the Forward Euler method.</figcaption>
</figure>
</center>
<h2 id="parameterization-and-forward-pass">Parameterization and Forward pass</h2>
<p>Although we already went over this in the last section, let me put it more formally one more time. An “ODE Layer” is basically characterized by its dynamics function \(\mathbf{F}(\mathbf{h}_t, t; \theta)\), which can be realized by a (deep) neural network. This network takes as input the “current state” \(\mathbf{h}_t\) (basically an activation) and time \(t\), and produces the “direction” (i.e., \(\displaystyle{ \mathbf{F}(\mathbf{h}_t, t; \theta) = \frac{d\mathbf{h}_t}{dt} }\)) in which the state should go next. A full forward pass through this layer is essentially executing an \(N\)-step Forward Euler on the ODE with an “initial state” (aka “input”) \(\mathbf{h}_0\). \(N\) is a hyperparameter we choose and can be compared to “depth” in a standard deep neural network. Following the original paper’s convention (with a bit of Python-style syntax), we write the forward pass as</p>
<p>\[ \tag{2}
\mathbf{h}_N = \mathrm{ODESolve}(start\_state=\mathbf{h}_0, dynamics=\mathbf{F}, t\_start=0, t\_end=N; \theta)
\]</p>
<p>where “ODESolve” is <em>any</em> iterative ODE solver algorithm, not just Forward Euler. By the end of this article you’ll understand why the specific machinery of Euler’s method is not essential.</p>
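<p>As a hedged sketch of this forward pass (pure Python, with a toy scalar dynamics of my own choosing rather than a neural network): increasing the “depth” \(N\) simply refines the trace toward the true solution of the underlying ODE.</p>

```python
import math

def ode_layer_forward(h0, dynamics, n_steps, t_end=1.0):
    """'Forward pass' of an ODE layer: N-step Forward Euler from the input state h0."""
    dt = t_end / n_steps
    h, t = h0, 0.0
    for _ in range(n_steps):
        h = h + dt * dynamics(h, t)
        t += dt
    return h

F = lambda h, t: -h          # toy dynamics; the exact end state for h(0) = 1 is exp(-1)
shallow = ode_layer_forward(1.0, F, n_steps=5)     # small "depth" N, coarse trace
deep = ode_layer_forward(1.0, F, n_steps=5000)     # large "depth" N, refined trace
exact = math.exp(-1.0)
```

<p>The larger \(N\) is, the closer the layer's output gets to the exact solution, while the number of parameters of \(\mathbf{F}\) stays fixed.</p>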
<p>Coming to the backward pass, a naive solution you might be tempted to offer is to back-propagate through the operations of the solver. I mean, look at the iterative update equation (Eq.1) of an ODE solver (for now, just Forward Euler) - everything is indeed differentiable! But then it is no better than ResNet, at least not from a memory-cost point of view. Note that backpropagating through a ResNet (and likewise any standard deep network) requires storing the intermediate activations to be used later in the backward pass. This is what makes the memory complexity of backpropagation linear in the number of layers (i.e., \(\mathcal{O}(L)\)). This is where the authors proposed a brilliant idea to make it \(\mathcal{O}(1)\): not storing the intermediate states at all.</p>
<p><br /></p>
<center>
<figure>
<img width="45%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/14/ode_block.png" />
<figcaption>Fig.3: Block diagram of ODE Layer.</figcaption>
</figure>
</center>
<h2 id="adjoint-method-and-the-backward-pass">“Adjoint method” and the backward pass</h2>
<p>Just like any other computational graph associated with a deep network, we get a gradient signal coming from the loss. Let’s denote the incoming gradient at the end of the ODE layer as \(\displaystyle{ \frac{d \mathcal{L}}{d \mathbf{h}_N} }\), where \(\mathcal{L}\) is a scalar loss. All we have to do is use this incoming gradient to compute \(\displaystyle{ \frac{d \mathcal{L}}{d\theta} }\) and perform an SGD (or any variant) step. A few parameter updates in the right direction would cause the dynamics to change, and consequently the whole trajectory (i.e., the trace) except the input. Fig.4 shows a graphical representation of this. Please note that, for simplicity, the loss has been calculated using \(\mathbf{h}_N\) itself. To be specific, the loss (green dotted line) is the Euclidean distance between \(\mathbf{h}_N\) and its (available) ground truth \(\mathbf{\widehat{h}}_N\).</p>
<p><br /></p>
<center>
<figure>
<img width="75%" style="padding-top: 20px; border: 0px solid black;" src="/public/posts_res/14/optim_goal.png" />
<figcaption>Fig.4: Effect of updating parameters of the dynamics.</figcaption>
</figure>
</center>
<p>In order to accomplish our goal of computing the parameter gradients, we define a quantity \(\mathbf{a}_t\), called the “Adjoint state”</p>
<p>\[
\mathbf{a}_t \triangleq \frac{d\mathcal{L}}{d\mathbf{h}_t}
\]</p>
<p>Compared with a standard neural network, this is basically the gradient of the loss \(\mathcal{L}\) w.r.t. all intermediate activations (states of the ODE). It is indeed a generalization of a quantity I mentioned earlier, i.e., the incoming gradient into the layer \(\displaystyle{ \frac{d\mathcal{L}}{d\mathbf{h}_N} = \mathbf{a}_N }\). Although we cannot compute this quantity independently for every timestep, a bit of rigorous mathematics (refer to Appendix B.1 of the <a href="https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations">original paper</a>) can show that the adjoint state follows a differential equation with dynamics function</p>
<p>\[ \tag{3}
\mathbf{F}_a(\mathbf{a}_t, \mathbf{h}_t, t, \theta) \triangleq \frac{d\mathbf{a}_t}{dt} = -\mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \mathbf{h}_t}
\]</p>
<p>and that’s good news! We now have the dynamics that \(\mathbf{a}_t\) follows and an initial value \(\mathbf{a}_N\) (its value at the extreme end \(t = N\)). That means we can run an ODE solver backward in time from \(t = N \rightarrow 0\) and calculate all \(\mathbf{a}_t\) in succession, like this</p>
<p>\[
\mathbf{a}_{N-1}, \cdots, \mathbf{a}_0 = \mathrm{ODESolve}(\mathbf{a}_N, \mathbf{F}_a, N, 0; \theta)
\]</p>
<p>Please look at Eq.2 for the signature of the “ODESolve” function. This time we also produce all intermediate states of the solver as output. An intuitive visualization of the adjoint state and its dynamics is given in Fig.5 below.</p>
<center>
<figure>
<img width="75%" style="padding-top: 20px; border: 0px solid black;" src="/public/posts_res/14/adj_viz.png" />
<figcaption>Fig.5: An intuitive visualization of the adjoint state and its dynamics.</figcaption>
</figure>
</center>
<p>The quantity on the right-hand side of Eq.3 is a vector-Jacobian product, where \(\displaystyle{ \frac{\partial \mathbf{F}}{\partial \mathbf{h}_t} }\) is the Jacobian matrix. Given the functional form of \(\mathbf{F}\), this can be readily computed using the current state \(\mathbf{h}_t\) and the latest parameter values. But wait a minute! I said before that we are not storing the intermediate \(\mathbf{h}_t\) values. Where do we get them now? The answer is - we can compute them again. Remember that we still have \(\mathbf{F}\) with us, along with an extreme value \(\mathbf{h}_N\) (the output of the forward pass). We can run another ODE backwards in time starting from \(t=N\rightarrow 0\). Essentially, we can fuse the two ODEs together</p>
<p>\[
[ \mathbf{a}_{N-1}; \mathbf{h}_{N-1} ], \cdots, [ \mathbf{a}_0; \mathbf{h}_0 ] = \mathrm{ODESolve}([ \mathbf{a}_N; \mathbf{h}_N ], [ \mathbf{F}_a; \mathbf{F} ], N, 0; \theta)
\]</p>
<p>It’s basically executing two update equations for two ODEs in one “for loop” traversing from \(N\rightarrow 0\). The intermediate values of \(\mathbf{h}_t\) won’t be exactly the same as what we got in the forward pass (because no numerical solver has infinite precision), but they are good approximations.</p>
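<p>This claim is easy to check numerically. A small sketch (with a toy scalar dynamics \(\mathbf{F}(h, t) = -h\), my choice purely for illustration): running Forward Euler forward and then backward in time recovers the initial state only approximately.</p>

```python
def euler(h, t0, t1, n_steps, f):
    """Forward Euler from t0 to t1; dt is negative when t1 < t0 (backward in time)."""
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        h = h + dt * f(h, t)
        t += dt
    return h

f = lambda h, t: -h
h0 = 1.0
hN = euler(h0, 0.0, 1.0, 1000, f)       # forward pass: trace h_0 -> h_N
h0_rec = euler(hN, 1.0, 0.0, 1000, f)   # backward in time: re-trace h_N -> h_0
```

<p>Here <code>h0_rec</code> differs from the true \(\mathbf{h}_0\) only by a tiny discretization error, which shrinks further as the step size decreases.</p>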
<p>Okay, what about the parameters of the model (dynamics) ? How do we get to our ultimate goal, \(\displaystyle{ \frac{d\mathcal{L}}{d\theta} }\) ?</p>
<p>Let’s define another quantity very similar to the adjoint state, i.e., the parameter gradient of the loss at every step of the ODE solver</p>
<p>\[
\mathbf{a}^{\theta}_t \triangleq \frac{d\mathcal{L}}{d\mathbf{\theta}_t}
\]</p>
<p>The point to note here is that \(\theta_t = \theta\), as the parameters do not change during a trajectory. Instead, these quantities signify the <em>local influence</em> of the parameters at each step of the computation. This is very similar to a roll-out of an RNN in time, where parameters are shared across time steps. With a proof very similar to that of the adjoint state, it can be shown that</p>
<p>\[
\mathbf{a}^{\theta}_t = \mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \theta}
\]</p>
<p>just like in shared-weight RNNs, we can compute the full parameter gradient as a combination of local influences</p>
<p>\[\tag{4}
\frac{d\mathcal{L}}{d\theta} = \int_{0}^{N} \mathbf{a}^{\theta}_t dt = \int_{0}^{N} \mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \theta} dt
\]</p>
<p>The quantity \(\displaystyle{ \mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \theta} }\) is another vector-Jacobian product and can be evaluated using the values of \(\mathbf{h}_t\), \(\mathbf{a}_t\) and the latest parameters \(\theta\). So do we need yet another pass over the whole trajectory, since Eq.4 contains an integral? <strong>Fortunately, NO</strong>. Let me bring your attention to the fact that everything needed to compute this vector-Jacobian product is already being computed in the fused ODE we saw before. Furthermore, we can rewrite Eq.4 as</p>
<p>\[\tag{5}
\frac{d\mathcal{L}}{d\theta} = \mathbf{0} - \int_{N}^{0} \mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \theta} dt
\]</p>
<p>I hope you are seeing what I am seeing. This is equivalent to solving yet another ODE (backwards in time, again!) with dynamics \(\displaystyle{ \mathbf{F}_{\theta}(\mathbf{a}_t, \mathbf{h}_t, \theta, t) \triangleq -\mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \theta} }\) and initial state \(\mathbf{a}^{\theta}_N = \mathbf{0}\). The end state \(\mathbf{a}^{\theta}_0\) of this ODE completes the whole integral in Eq.5 and is therefore equal to \(\displaystyle{\frac{d\mathcal{L}}{d\theta}}\). Just like last time, we can fuse this ODE with the two previous ones</p>
<p>\[
[ \mathbf{a}_{N-1}; \mathbf{h}_{N-1}; \_ ], \cdots, [ \mathbf{a}_0; \mathbf{h}_0; \mathbf{a}^{\theta}_0 ] = \mathrm{ODESolve}([ \mathbf{a}_N; \mathbf{h}_N; \mathbf{0} ], [ \mathbf{F}_a; \mathbf{F}; \mathbf{F}_{\theta} ], N, 0; \theta)
\]</p>
<center>
<figure>
<img width="85%" style="padding-top: 20px; border: 0px solid black;" src="/public/posts_res/14/full_diagram.png" />
<figcaption>Fig.6: A pictorial representation of the forward and backward pass with all its ODEs.</figcaption>
</figure>
</center>
<p>Take some time to digest the final 3-way ODE and make sure you get it, because that is pretty much it. Once we have the parameter gradient, we can continue with the normal stochastic gradient update rule (SGD or its family). Additionally, you may want to pass \(\mathbf{a}_0\) on to the computation graph that comes before our ODE layer. A representative diagram containing a clear picture of all the ODEs and their interdependencies is shown above.</p>
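<p>Before moving to the implementation, the whole backward pass can be sanity-checked on a hypothetical scalar ODE \(dh/dt = \theta h\) (so \(\partial \mathbf{F}/\partial \mathbf{h} = \theta\) and \(\partial \mathbf{F}/\partial \theta = h\)) with loss \(\mathcal{L} = \frac{1}{2}(h_N - \widehat{h})^2\). The adjoint-computed gradient should approximately match (not exactly, due to discretization) a finite-difference estimate. All names below are illustrative:</p>

```python
N, dt, theta, h0, target = 100, 0.01, 0.5, 1.0, 2.0

def forward(th):
    """Forward pass: Forward Euler on dh/dt = th * h."""
    h = h0
    for _ in range(N):
        h = h + dt * th * h
    return h

hN = forward(theta)

# backward-in-time fused ODEs: adjoint a, re-traced state h, parameter-gradient accumulator
a, h, dLdtheta = hN - target, hN, 0.0   # a_N = dL/dh_N for L = 0.5*(h_N - target)^2
for _ in range(N):
    dLdtheta += dt * a * h              # accumulate a_t * dF/dtheta (the integral in Eq.4/5)
    a += dt * a * theta                 # da/dt = -a * dF/dh, run backwards in time
    h -= dt * theta * h                 # re-trace the state backwards in time

# compare against a finite-difference gradient of the loss w.r.t. theta
eps = 1e-6
loss = lambda th: 0.5 * (forward(th) - target) ** 2
fd = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
```

<p>On this toy problem the two gradients agree to well under one percent; the small residual gap comes from the approximate backward re-tracing of \(\mathbf{h}_t\) discussed earlier.</p>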
<h2 id="pytorch-implementation">PyTorch Implementation</h2>
<p>Implementing this algorithm is a bit tricky due to its non-conventional approach to gradient computation, especially if you are using a library like PyTorch, which adheres to a specific model of computation. I am providing a very simplified implementation of the ODE Layer as a PyTorch <code class="language-plaintext highlighter-rouge">nn.Module</code>. Because this post has already become quite long and stuffed with maths and new concepts, I am leaving it here. I am putting the core part of the code (well commented) here just for reference, but a complete application can be found in this <a href="https://github.com/dasayan05/neuralode-pytorch">GitHub repo of mine</a>. My implementation is quite simplified, as I have hard-coded the Forward Euler method as the only choice of ODE solver. Feel free to contribute to my repo.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#############################################################
# Full code at https://github.com/dasayan05/neuralode-pytorch
#############################################################
</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="k">class</span> <span class="nc">ODELayerFunc</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">Function</span><span class="p">):</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">context</span><span class="p">,</span> <span class="n">z0</span><span class="p">,</span> <span class="n">t_range_forward</span><span class="p">,</span> <span class="n">dynamics</span><span class="p">,</span> <span class="o">*</span><span class="n">theta</span><span class="p">):</span>
<span class="n">delta_t</span> <span class="o">=</span> <span class="n">t_range_forward</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">t_range_forward</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># get the step size
</span>
<span class="n">zt</span> <span class="o">=</span> <span class="n">z0</span><span class="p">.</span><span class="n">clone</span><span class="p">()</span>
<span class="k">for</span> <span class="n">tf</span> <span class="ow">in</span> <span class="n">t_range_forward</span><span class="p">:</span> <span class="c1"># Forward eular's method
</span> <span class="n">f</span> <span class="o">=</span> <span class="n">dynamics</span><span class="p">(</span><span class="n">zt</span><span class="p">,</span> <span class="n">tf</span><span class="p">)</span>
<span class="n">zt</span> <span class="o">=</span> <span class="n">zt</span> <span class="o">+</span> <span class="n">delta_t</span> <span class="o">*</span> <span class="n">f</span> <span class="c1"># update
</span>
<span class="n">context</span><span class="p">.</span><span class="n">save_for_backward</span><span class="p">(</span><span class="n">zt</span><span class="p">,</span> <span class="n">t_range_forward</span><span class="p">,</span> <span class="n">delta_t</span><span class="p">,</span> <span class="o">*</span><span class="n">theta</span><span class="p">)</span>
<span class="n">context</span><span class="p">.</span><span class="n">dynamics</span> <span class="o">=</span> <span class="n">dynamics</span> <span class="c1"># 'save_for_backwards() won't take it, so..
</span>
<span class="k">return</span> <span class="n">zt</span> <span class="c1"># final evaluation of 'zt', i.e., zT
</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="n">context</span><span class="p">,</span> <span class="n">adj_end</span><span class="p">):</span>
<span class="c1"># Unpack the stuff saved in forward pass
</span> <span class="n">zT</span><span class="p">,</span> <span class="n">t_range_forward</span><span class="p">,</span> <span class="n">delta_t</span><span class="p">,</span> <span class="o">*</span><span class="n">theta</span> <span class="o">=</span> <span class="n">context</span><span class="p">.</span><span class="n">saved_tensors</span>
<span class="n">dynamics</span> <span class="o">=</span> <span class="n">context</span><span class="p">.</span><span class="n">dynamics</span>
<span class="n">t_range_backward</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">flip</span><span class="p">(</span><span class="n">t_range_forward</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">,])</span> <span class="c1"># Time runs backward
</span>
<span class="n">zt</span> <span class="o">=</span> <span class="n">zT</span><span class="p">.</span><span class="n">clone</span><span class="p">().</span><span class="n">requires_grad_</span><span class="p">()</span>
<span class="n">adjoint</span> <span class="o">=</span> <span class="n">adj_end</span><span class="p">.</span><span class="n">clone</span><span class="p">()</span>
<span class="n">dLdp</span> <span class="o">=</span> <span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">theta</span><span class="p">]</span> <span class="c1"># Parameter grads (an accumulator)
</span>
<span class="k">for</span> <span class="n">tb</span> <span class="ow">in</span> <span class="n">t_range_backward</span><span class="p">:</span>
<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">set_grad_enabled</span><span class="p">(</span><span class="bp">True</span><span class="p">):</span>
<span class="c1"># above 'set_grad_enabled()' is required for the graph to be created ...
</span> <span class="n">f</span> <span class="o">=</span> <span class="n">dynamics</span><span class="p">(</span><span class="n">zt</span><span class="p">,</span> <span class="n">tb</span><span class="p">)</span>
<span class="c1"># ... and be able to compute all vector-jacobian products
</span> <span class="n">adjoint_dynamics</span><span class="p">,</span> <span class="o">*</span><span class="n">dldp_</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">([</span><span class="o">-</span><span class="n">f</span><span class="p">],</span> <span class="p">[</span><span class="n">zt</span><span class="p">,</span> <span class="o">*</span><span class="n">theta</span><span class="p">],</span> <span class="n">grad_outputs</span><span class="o">=</span><span class="p">[</span><span class="n">adjoint</span><span class="p">])</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dldp_</span><span class="p">):</span>
<span class="n">dLdp</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">dLdp</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">delta_t</span> <span class="o">*</span> <span class="n">p</span> <span class="c1"># update param grads
</span> <span class="n">adjoint</span> <span class="o">=</span> <span class="n">adjoint</span> <span class="o">-</span> <span class="n">delta_t</span> <span class="o">*</span> <span class="n">adjoint_dynamics</span> <span class="c1"># update the adjoint
</span> <span class="n">zt</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">zt</span><span class="p">.</span><span class="n">data</span> <span class="o">-</span> <span class="n">delta_t</span> <span class="o">*</span> <span class="n">f</span><span class="p">.</span><span class="n">data</span> <span class="c1"># Forward eular's (backward in time)
</span>
<span class="k">return</span> <span class="p">(</span><span class="n">adjoint</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="o">*</span><span class="n">dLdp</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">ODELayer</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">dynamics</span><span class="p">,</span> <span class="n">t_start</span> <span class="o">=</span> <span class="mf">0.</span><span class="p">,</span> <span class="n">t_end</span> <span class="o">=</span> <span class="mf">1.</span><span class="p">,</span> <span class="n">granularity</span> <span class="o">=</span> <span class="mi">25</span><span class="p">):</span>
<span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="p">.</span><span class="n">dynamics</span> <span class="o">=</span> <span class="n">dynamics</span>
<span class="bp">self</span><span class="p">.</span><span class="n">t_start</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">t_end</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">granularity</span> <span class="o">=</span> <span class="n">t_start</span><span class="p">,</span> <span class="n">t_end</span><span class="p">,</span> <span class="n">granularity</span>
<span class="bp">self</span><span class="p">.</span><span class="n">t_range</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">t_start</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">t_end</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">granularity</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">input</span><span class="p">):</span>
<span class="k">return</span> <span class="n">ODELayerFunc</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">t_range</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">dynamics</span><span class="p">,</span> <span class="o">*</span><span class="bp">self</span><span class="p">.</span><span class="n">dynamics</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>
</code></pre></div></div>
<p>That’s all for today. See you.</p>Ayan DasNeural Ordinary Differential Equation (Neural ODE) is a very recent and first-of-its-kind idea that emerged in NeurIPS 2018. The authors, four researchers from the University of Toronto, reformulated the parameterization of deep networks with differential equations, particularly first-order ODEs. The idea evolved from the fact that ResNet, a very popular deep network, possesses quite a bit of similarity with ODEs in its core structure. The paper also offered an efficient algorithm to train such ODE structures as part of a larger computation graph. The architecture is flexible and memory efficient for learning. Being a bit non-trivial from a deep network standpoint, I decided to dedicate this article to explaining it in detail, making it easier for everyone to understand. Understanding the whole algorithm requires a fair bit of rigorous mathematics, especially ODEs and their algebraic understanding, which I will try to cover at the beginning of the article. I also provided a (simplified) PyTorch implementation that is easy to follow.Foundation of Variational Autoencoder (VAE)2020-01-01T00:00:00+00:002020-01-01T00:00:00+00:00https://dasayan05.github.io/blog-tut/2020/01/01/variational-autoencoder<p>In the <a href="https://dasayan05.github.io/blog-tut/2019/11/20/inference-in-pgm.html">previous article</a>, I started with Directed Probabilistic Graphical Models (PGMs) and a family of algorithms to do efficient approximate inference on them. Inference problems in Directed PGMs with continuous latent variables are intractable in general and require special attention. The family of algorithms, namely <strong>Variational Inference (VI)</strong>, introduced in the last article is a general formulation for approximating the intractable posterior in such models. 
The <strong>Variational Autoencoder</strong>, famously known as <strong>VAE</strong>, is an algorithm based on the principles of VI that has gained a lot of attention in the past few years for being extremely efficient. With a few more approximations/assumptions, VAE established a clean mathematical formulation which has later been extended by researchers and used in numerous applications. In this article, I will explain the intuition as well as the mathematical formulation of Variational Autoencoders.</p>
<h2 id="variational-inference-a-recap">Variational Inference: A recap</h2>
<p>A quick recap would make going forward easier.</p>
<p>Given a Directed PGM with continuous latent variable \(Z\) and observed variable \(X\), the inference problem for \(Z\) turns out to be intractable because of the form of its posterior</p>
<p>\[
\mathbb{P}(Z|X) = \frac{\mathbb{P}(X,Z)}{\mathbb{P}(X)} = \frac{\mathbb{P}(X,Z)}{\int_Z \mathbb{P}(X,Z)\, dZ}
\]</p>
<p>To solve this problem, VI defines a <em>parameterized approximation</em> of \(\mathbb{P}(Z\vert X)\), i.e., \(\mathbb{Q}(Z;\phi)\) and formulates it as an optimization problem</p>
<p>\[
\mathbb{Q}^*(Z) = arg\min_{\phi}\ \mathbb{K}\mathbb{L}[\mathbb{Q}(Z;\phi)\ ||\ \mathbb{P}(Z|X)]
\]</p>
<p>The objective can further be simplified as</p>
<p>\[
\mathbb{K}\mathbb{L}[\mathbb{Q}(Z;\phi)\ \vert\vert \ \mathbb{P}(Z\vert X)]
\]
\[
\let\sb_
= \mathbb{E}_{\mathbb{Q}} [\log \mathbb{Q}(Z;\phi)] - \mathbb{E}\sb{\mathbb{Q}} [\log \mathbb{P}(X, Z)]
\triangleq - ELBO(\mathbb{Q})
\]</p>
<p>\(ELBO(\mathbb{Q})\) is precisely the objective we maximize. The \(ELBO(\cdot)\) can best be explained by decomposing it into two terms. One of them takes care of maximizing the expected conditional log-likelihood (of the data given the latent) and the other arranges the latent space in a way that it matches a predefined distribution.</p>
<p>\[
ELBO(\mathbb{Q}) = \mathbb{E}\sb{\mathbb{Q}} [\log \mathbb{P}(X\vert Z)] - \mathbb{K}\mathbb{L}[\mathbb{Q}(Z;\phi)\ ||\ \mathbb{P}(Z)]
\]</p>
<p>For a detailed explanation, go through the previous article.</p>
<h2 id="variational-autoencoder">Variational Autoencoder</h2>
<p>The Variational Autoencoder (VAE) was first proposed in the <a href="https://arxiv.org/pdf/1312.6114.pdf">paper</a> titled “Auto-Encoding Variational Bayes” by D. P. Kingma & Max Welling. The paper proposes two things:</p>
<ol>
<li>A parameterized <em>inference model</em> instead of just \(\mathbb{Q}(Z;\phi)\)</li>
<li>The reparameterization trick to achieve efficient training</li>
</ol>
<p>As we go along, I will try to convey the fact that these are essentially developments on top of the general VI framework we learnt earlier. I will focus on how each of them is related to VI in the following (sub)sections.</p>
<h3 id="the-inference-model">The “Inference Model”</h3>
<center>
<figure>
<img width="60%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/13/model.PNG" />
<figcaption>Fig.1. Subfig.1: The Bayesian Network defining VAE. Subfig.2: The forward pass (abstract) of VAE. Subfig.3: The forward pass of VAE with explicit sampling shown at the end of the encoder </figcaption>
</figure>
</center>
<p>The idea is to replace the generically parameterized \(\mathbb{Q}(Z;\phi)\) in the VI framework with a data-driven model \(\mathbb{Q}(Z\vert X; \phi)\), named the <em>Inference model</em>. What does this mean? It basically means that we are no longer interested in an unconditional distribution over \(Z\); instead, we want a conditional distribution over \(Z\) given the observed data. Please recall our “generative view” of the model</p>
<p>\[z^{(i)} \sim \mathbb{P}(Z)\]
\[x^{(i)} \sim \mathbb{P}(X|Z=z^{(i)})\]</p>
<p>With the inference model in hand, we now have an “inference view” as follows</p>
<p>\[z^{(i)} \sim \mathbb{P}(Z\vert X=x^{(i)})\]</p>
<p>It means we can do inference just by ancestral sampling once our model is trained. Of course, we don’t know the real \(\mathbb{P}(Z\vert X)\), so we consider a parameterized approximation \(\mathbb{Q}(Z\vert X; \phi)\), as I already mentioned.</p>
<p>These two “views”, when combined, form the basis of the Variational Autoencoder (See <em>Fig.1: Subfig.1</em>).</p>
<p>\[z^{(i)} \sim \mathbb{P}(Z\vert X=x^{(i)})\]
\[x^{(i)} \sim \mathbb{P}(X\vert Z=z^{(i)})\]</p>
<p>The “combined model” shown above gives us insight into the training process. Please note that the model starts from \(x^{(i)}\) (a data sample from our dataset), generates \(z^{(i)}\) via the Inference model, and then maps it back to \(x^{(i)}\) again using the Generative model (See <em>Fig.1: Subfig.2</em>). I hope the reader can now guess why it’s called an <a href="https://en.wikipedia.org/wiki/Autoencoder">Autoencoder</a>! So we clearly have a computational advantage here: we can perform training on a per-sample basis, just like inference. This is not true for many of the approximate inference algorithms of the pre-VAE era.</p>
<p>So, succinctly, all we have to do is a “forward pass” through the model (yes, the two sampling equations above) and maximize \(\log \mathbb{P}(X=x^{(i)}\vert Z=z^{(i)}; \theta)\), where \(z^{(i)}\) is a sample we got from the Inference model. Note that we need to parameterize the generative model as well (with \(\theta\)). In general, we almost always choose \(\mathbb{Q}(\cdot;\phi)\) and \(\mathbb{P}(\cdot;\theta)\) to be fully-differentiable functions such as neural networks (See <em>Fig.1: Subfig.3</em> for a cleaner diagram).
Now we go back to our objective function from the VI framework. To formalize the training objective of VAE, we just need to replace \(\mathbb{Q}(Z; \phi)\) with \(\mathbb{Q}(Z\vert X; \phi)\) in the VI framework (please compare the equations with the recap section)</p>
<p>\[
\mathbb{Q}^*(Z\color{red}{\vert X}) = arg\min_{\phi}\ \mathbb{K}\mathbb{L}[\mathbb{Q}(Z\color{red}{\vert X};\phi)\ ||\ \mathbb{P}(Z|X)]
\]</p>
<p>And the objective</p>
<p>\[
\mathbb{K}\mathbb{L}[\mathbb{Q}(Z\color{red}{\vert X};\phi)\ \vert\vert \ \mathbb{P}(Z\vert X)]
\]
\[
\let\sb_
= \mathbb{E}_{\mathbb{Q}} [\log \mathbb{Q}(Z\color{red}{\vert X};\phi)] - \mathbb{E}\sb{\mathbb{Q}} [\log \mathbb{P}(X, Z)]
\triangleq - ELBO(\mathbb{Q})
\]</p>
<p>Then,</p>
<p>\[
ELBO(\mathbb{Q}) = \mathbb{E}\sb{\mathbb{Q}} [\log \mathbb{P}(X\vert Z; \theta)] - \mathbb{K}\mathbb{L}[\mathbb{Q}(Z\color{red}{\vert X};\phi)\ ||\ \mathbb{P}(Z)]
\]</p>
<p>As usual, \(\mathbb{P}(Z)\) is a chosen distribution that we want \(\mathbb{Q}(Z\vert X; \phi)\) to resemble; it is often the <em>standard Gaussian/Normal</em> (i.e., \(\mathbb{P}(Z) = \mathcal{N}(0, I)\))</p>
<p>\[
\mathbb{Q}(Z\vert X; [ \phi_1, \phi_2 ]) = \mathcal{N}(Z; \mu (X; \phi_1), \sigma (X; \phi_2))
\]</p>
<p>The specific parameterization of \(\mathbb{Q}(Z\vert X; \left[ \phi_1, \phi_2 \right])\) reveals that we predict a distribution in the forward pass just by predicting its parameters.</p>
<p>The first term of \(ELBO(\cdot)\) is relatively easy; it’s a loss function that we have used a lot in machine learning - the <em>log-likelihood</em>. Very often it is just the MSE loss between the predicted \(\hat{X}\) and the original data \(X\). What about the second term? It turns out that it has a closed-form solution. Because I don’t want unnecessary maths to clutter this post, I am just putting the formula here for the readers to look at. But I would highly recommend looking at the proof in Appendix B of the <a href="https://arxiv.org/pdf/1312.6114.pdf">original VAE paper</a>. It’s not hard, believe me. So, putting the proper values of \(\mathbb{Q}(Z\vert \cdot)\) and \(\mathbb{P}(Z)\) into the KL term, we get</p>
<p>\[
\mathbb{K}\mathbb{L}\bigl[\mathcal{N}(\mu (X; \phi_1), \sigma (X; \phi_2))\ ||\ \mathcal{N}(0, I)\bigr]
\]</p>
<p>\[
= -\frac{1}{2} \sum_j \bigl( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \bigr)
\]</p>
<p>Please note that \(\mu_j, \sigma_j\) are the individual dimensions of the predicted mean and std vectors. We can easily compute this term in the forward pass and combine it with the log-likelihood (first term) to get the full (ELBO) loss.</p>
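For readers who prefer code, here is a tiny NumPy sketch of that closed-form KL term (the function name is my own):

```python
import numpy as np

def kl_diag_gaussian_vs_standard(mu, sigma):
    """Closed-form KL[ N(mu, diag(sigma^2)) || N(0, I) ]
    = -0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)."""
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    return -0.5 * np.sum(1.0 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2)

# Zero exactly when Q(Z|X) already equals the prior N(0, I) ...
print(kl_diag_gaussian_vs_standard([0.0, 0.0], [1.0, 1.0]))
# ... and strictly positive otherwise.
print(kl_diag_gaussian_vs_standard([1.0, 0.0], [1.0, 1.0]))  # 0.5
```

Note that, unlike the first term, this KL term needs no sampling at all: it is a deterministic function of the predicted \(\mu\) and \(\sigma\).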
<p>Okay. Let’s talk about the forward pass in a bit more detail. Believe me, it’s not as easy as it looks. You may have noticed (<em>Fig.1: Subfig.3</em>) that the forward pass contains a sampling operation (sampling \(z^{(i)}\) from \(\mathbb{P}(Z\vert X=x^{(i)})\)) which is <em>NOT differentiable</em>. What do we do now?</p>
<h3 id="the-reparameterization-trick">The reparameterization trick</h3>
<center>
<figure>
<img width="60%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/13/reparam.JPG" />
<figcaption>Fig.1. Subfig.1: The full forward pass. Subfig.2: The full forward pass with reparameterized sampling. </figcaption>
</figure>
</center>
<p>I showed before that in the forward pass, we get \(z^{(i)}\) by sampling from our parameterized inference model. Now that we know the exact form of the inference model, the sampling will look something like this</p>
<p>\[
z^{(i)} \sim \mathcal{N}(Z\vert \mu (X; \phi_1), \sigma (X; \phi_2))
\]</p>
<p>The idea is basically to make this sampling operation differentiable w.r.t \(\mu\) and \(\sigma\). In order to do this, we pull a trick like this</p>
<p>\[
z^{(i)} = \mu^{(i)} + \epsilon^{(i)} * \sigma^{(i)}\text{ , where } \epsilon^{(i)} \sim \mathcal{N}(0, I)
\]</p>
<p>This is known as the “reparameterization”. We basically rewrite the sampling operation in a way that <em>separates the source of randomness</em> (i.e., \(\epsilon^{(i)}\)) from the deterministic quantities (i.e., \(\mu\) and \(\sigma\)). This allows the backpropagation algorithm to flow derivatives into \(\mu\) and \(\sigma\). However, please note that it is still not differentiable w.r.t \(\epsilon\) but .. guess what .. we don’t need that! Just having derivatives w.r.t \(\mu\) and \(\sigma\) is enough to flow them backwards and pass them to the parameters of the inference model (i.e., \(\phi\)). <em>Fig.1: Subfig.2</em> should make everything clear if it isn’t already.</p>
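A minimal NumPy sketch of reparameterized sampling (the helper name is mine) makes the separation of randomness concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterized_sample(mu, sigma, rng):
    # Separate the randomness (eps) from the deterministic quantities (mu, sigma):
    # z = mu + eps * sigma is a differentiable function of mu and sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + eps * sigma

mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 1.5])
z = np.stack([reparameterized_sample(mu, sigma, rng) for _ in range(100_000)])
print(z.mean(axis=0))  # close to mu
print(z.std(axis=0))   # close to sigma
```

The empirical mean and standard deviation of the samples match \(\mu\) and \(\sigma\), confirming that the rewritten sampler draws from the same \(\mathcal{N}(\mu, \sigma^2)\).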
<hr />
<h3 id="wrap-up">Wrap up</h3>
<p>That’s pretty much it. To wrap up, here is the full forward-backward algorithm for training VAE:</p>
<ol>
<li>Given \(x^{(i)}\) from the dataset, compute \(\mu(x^{(i)}, \phi_1), \sigma(x^{(i)}, \phi_2)\).</li>
<li>Compute a latent sample as \(z^{(i)} = \mu^{(i)} + \epsilon^{(i)} * \sigma^{(i)}\text{ , where } \epsilon^{(i)} \sim \mathcal{N}(0, I)\)</li>
<li>Compute the full loss (the negative ELBO) as \(L = -\log \mathbb{P}(x^{(i)}\vert Z = z^{(i)}) - \frac{1}{2} \sum_j \bigl( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \bigr)\).</li>
<li>Update parameters as \(\left\{ \phi, \theta \right\} := \left\{ \phi, \theta \right\} - \eta \frac{\delta L}{\delta \left\{ \phi, \theta \right\}}\)</li>
<li>Repeat.</li>
</ol>
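The steps above can be sketched as a single forward pass computing the per-sample loss. This is a toy of my own with linear maps as encoder/decoder (a real VAE uses neural networks, and step 4 would be done by backpropagation, which is omitted here); the loss is written as the negative ELBO, the quantity one minimizes by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 4, 2                               # data / latent dimensionality
phi = 0.1 * rng.normal(size=(2 * H, D))   # inference params (toy linear encoder)
theta = 0.1 * rng.normal(size=(D, H))     # generative params (toy linear decoder)

def vae_loss(x):
    # Step 1: inference model predicts mu(x; phi) and (log-variance of) sigma(x; phi).
    h = phi @ x
    mu, log_var = h[:H], h[H:]
    sigma = np.exp(0.5 * log_var)
    # Step 2: reparameterized latent sample z = mu + eps * sigma.
    z = mu + rng.standard_normal(H) * sigma
    # Step 3: loss = reconstruction term + KL term (the negative ELBO).
    x_hat = theta @ z
    recon = 0.5 * np.sum((x - x_hat) ** 2)               # -log P(x|z) up to a constant
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - sigma ** 2)
    return recon + kl

print(vae_loss(rng.normal(size=D)))   # a finite, non-negative scalar
```

Predicting the log-variance instead of \(\sigma\) directly is a common practical choice: it keeps the predicted standard deviation positive by construction.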
<hr />
<p>That’s all for this article. Wait for more probabilistic models .. umm, maybe the next one is <strong>Normalizing Flow</strong>. See you.</p>Ayan DasIn the previous article, I started with Directed Probabilistic Graphical Models (PGMs) and a family of algorithms to do efficient approximate inference on them. Inference problems in Directed PGMs with continuous latent variables are intractable in general and require special attention. The family of algorithms, namely Variational Inference (VI), introduced in the last article is a general formulation for approximating the intractable posterior in such models. The Variational Autoencoder, famously known as VAE, is an algorithm based on the principles of VI and has gained a lot of attention in the past few years for being extremely efficient. With a few more approximations/assumptions, VAE established a clean mathematical formulation which has later been extended by researchers and used in numerous applications. In this article, I will explain the intuition as well as the mathematical formulation of Variational Autoencoders.Directed Graphical Models & Variational Inference2019-11-20T00:00:00+00:002019-11-20T00:00:00+00:00https://dasayan05.github.io/blog-tut/2019/11/20/inference-in-pgm<p>Welcome to the first part of a series of tutorials about Directed Probabilistic Graphical Models (PGMs) & Variational methods. Directed PGMs (OR, Bayesian Networks) are very powerful probabilistic modelling techniques in the machine learning literature and have been studied rigorously by researchers over the years. Variational methods are a family of algorithms that arise in the context of Directed PGMs when they involve solving intractable integrals. Doing inference on a set of latent variables (given a set of observed variables) involves such an intractable integral. Variational Inference (VI) is a specialised form of variational method that handles this situation.
This tutorial is NOT for absolute beginners as I assume the reader to have basic-to-moderate knowledge about Random Variables, probability theory and PGMs. The next tutorial in this series will cover one particular VI method, namely “Variational Autoencoder (VAE)” built on top of VI.</p>
<center>
<figure>
<img width="35%" style="padding-top: 20px;" src="/public/posts_res/12/prob_thumbnail.jpeg" />
<figcaption>Fig.1: An example of Directed PGM (Bayesian Network)</figcaption>
</figure>
</center>
<h2 id="a-review-of-directed-pgms">A review of Directed PGMs</h2>
<p>A Directed PGM, also known as a <a href="https://en.wikipedia.org/wiki/Bayesian_network">Bayesian Network</a>, is a set of random variables (RVs) associated with a graph structure (DAG) expressing conditional independence (CI assumptions) among them. Without the CI assumptions, one would have to model the joint distribution over all the RVs, which would’ve been difficult.
Fig. 1 shows a typical DAG expressing the conditional independence among the set of participating RVs \(\{X, Y, Z\}\). With the CI assumptions in place, we can write the joint distribution over \(X, Y, Z\) as
\[
\mathbb{P}(X,Y,Z) = \mathbb{P}(Z|X,Y)\cdot \mathbb{P}(Y|X)\cdot \mathbb{P}(X)
\]</p>
\[
\mathbb{P}(X,Y,Z) = \mathbb{P}(Z|X,Y)\cdot \mathbb{P}(Y|X)\cdot \mathbb{P}(X)
\]</p>
<p>In general, joint distribution over a set of RVs \({X_1, X_2, \cdots, X_i, \cdots, X_N}\) with CI assumptions encoded in a graph \(\mathbb{G}\) can be written/factorized as</p>
<p>\[
\mathbb{P}({X_1, X_2, \cdots, X_N}) = \prod_{i=1}^N \mathbb{P}(X_i | Pa_{\mathbb{G}}(X_i))
\]</p>
<p>Where \(Pa_{\mathbb{G}}(X_i)\) denotes the set of parents of node \(X_i\) according to graph \(\mathbb{G}\). One can easily verify that the factorization of \(\mathbb{P}(X,Y,Z)\) above resembles the general formula.</p>
<h4 id="ancestral-sampling">Ancestral sampling</h4>
<center>
<figure>
<img width="35%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/12/anc_sampling.JPG" />
<figcaption>Fig.2: Ancestral Sampling</figcaption>
</figure>
</center>
<p>A key idea in Directed PGMs is the way we sample from them. We use something known as <strong>Ancestral Sampling</strong>. Unlike joint distributions over all random variables (\(\mathbb{P}(\cdots, X_i, \cdots)\)), a graph structure (i.e., \(\mathbb{G}\)) breaks it down into multiple factors, which then need to be synchronized to get a full sample from the graph. Here’s how we do it:</p>
<ol>
<li>We start with RVs with no parent (according to \(\mathbb{G}\)). We sample from them as usual.</li>
<li>Plug the samples in the conditionals involving those RVs. Sample the new RVs from those conditionals.</li>
<li>Plug the samples from step 2 further and keep sampling until all variables are sampled.</li>
</ol>
<p>So, in the example depicted in Fig. 1</p>
<p>\[
x \sim \mathbb{P}(X) \]
\[
y \sim \mathbb{P}(Y|X=x) \]
\[
z \sim \mathbb{P}(Z|Y=y,X=x)
\]</p>
<p>So, we get one sample as \((x, y, z)\). In contrast, a full joint distribution needs to be sampled all at once</p>
<p>\[
(x, y, z) \sim \mathbb{P}(X,Y,Z)
\]</p>
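The three steps above, applied to the graph of Fig. 1, might look like this in NumPy (the Gaussian conditionals here are my own illustrative choices; any valid conditionals would do):

```python
import numpy as np

rng = np.random.default_rng(0)

def ancestral_sample(rng):
    # Step 1: X has no parents -- sample it directly.
    x = rng.normal(0.0, 1.0)
    # Step 2: plug x into the conditional P(Y | X = x) and sample.
    y = rng.normal(0.5 * x, 1.0)
    # Step 3: plug (x, y) into the conditional P(Z | X = x, Y = y) and sample.
    z = rng.normal(x + y, 1.0)
    return x, y, z

samples = [ancestral_sample(rng) for _ in range(5)]
for s in samples:
    print(s)
```

Each call walks the DAG from parentless nodes downwards, so every returned triple \((x, y, z)\) is one full joint sample.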
<h4 id="parameterization-and-learning">Parameterization and Learning</h4>
<p>Now, as we have a clean way of representing a complicated distribution in the form of a graph structure, we can parameterize each individual distribution in the factorized form to parameterize the whole joint distribution. Parameterizing the distribution in the example in Fig. 1</p>
<p>\[
\mathbb{P}(Z|X,Y; \theta_1)\cdot \mathbb{P}(Y|X; \theta_2)\cdot \mathbb{P}(X; \theta_3)
\]</p>
<p>For convenience, we will write the above factorization as
\[
p_{model}(X, Y, Z; \theta)
\]
where \(\theta = \{\theta_1, \theta_2, \theta_3\}\) is the set of all parameters.</p>
<p>For learning, we require a set of data samples (i.e., a dataset) collected from an unknown data generating distribution \(p_{data}(X, Y, Z)\). So, a dataset \(D = \{ x^{(i)}, y^{(i)}, z^{(i)} \}_{i=1}^M\) where each data sample is drawn as</p>
<p>\[
x^{(i)}, y^{(i)}, z^{(i)} \sim p_{data}(X, Y, Z)
\]</p>
<p>The likelihood of the data samples under our model signifies the probability that the samples came from our model. It is simply</p>
<p>\[
\mathbb{L}(D;\theta) = \prod_{x^{(i)}, y^{(i)}, z^{(i)} \sim p_{data}}\ p_{model}(x^{(i)}, y^{(i)}, z^{(i)}; \theta)
\]</p>
<p>The goal of learning is to get a good estimate of \(\theta\). We do it by maximizing the likelihood (or, often log-likelihood) w.r.t \(\theta\), which we call <strong>Maximum Likelihood Estimation</strong> or <strong>MLE</strong> in short</p>
<p>\[
\hat{\theta} = arg\max_{\theta}\ \log\mathbb{L}(D;\theta)
\]</p>
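MLE in miniature: for a single Bernoulli RV (a toy setup of my own, standing in for \(p_{model}\)), a brute-force grid search over \(\theta\) recovers the analytic MLE, which for a Bernoulli is just the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(1000) < 0.7          # observations drawn from a Bernoulli(0.7) p_data

def log_likelihood(theta, data):
    # log L(D; theta) = sum_i log p_model(x_i; theta), Bernoulli model
    return np.sum(np.where(data, np.log(theta), np.log(1.0 - theta)))

thetas = np.linspace(0.01, 0.99, 99)
theta_hat = thetas[np.argmax([log_likelihood(t, data) for t in thetas])]
print(theta_hat)      # grid MLE, close to the analytic MLE below
print(data.mean())    # analytic Bernoulli MLE: the sample mean
```

In real models the maximization is of course done with gradients rather than a grid, but the objective is exactly this log-likelihood.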
<h4 id="inference">Inference</h4>
<p>Inference in a Directed PGM refers to estimating a set of RVs given another set of RVs in a graph \(\mathbb{G}\). To do inference, we need an already-learnt model. Inference is like “answering queries” after gathering knowledge (i.e., learning). In our running example, one may ask, “What is the value of \(X\) given \(Y=y_0\) and \(Z=z_0\)?”. The question can be answered by constructing the conditional distribution using the definition of conditional probability</p>
<p>\[
\mathbb{P}(X|Y=y_0,Z=z_0) = \frac{\mathbb{P}(X,Y=y_0,Z=z_0)}{\int_x \mathbb{P}(X,Y=y_0,Z=z_0)}
\]</p>
<p>In case a deterministic answer is desired, one can figure out the expected value of \(X\) under the above distribution</p>
<p>\[
\hat{x_0} = \mathbb{E}_{\mathbb{P}(X|Y=y_0,Z=z_0)}[X|Y=y_0,Z=z_0]
\]</p>
<h2 id="a-generative-view-of-data">A generative view of data</h2>
<p>It’s quite important to understand this point. In this section, I won’t tell you anything new per se, but rather repeat some of the things I explained in the <strong>Ancestral Sampling</strong> subsection above.</p>
<p>Given a finite set of data (i.e., a dataset), we start our model building process by asking ourselves one question: “<em>How could my data have been generated?</em>”. The answer to this question is precisely “<em>our model</em>”. The model (i.e., the graph structure) we build is essentially our belief of how the data was generated. <strong>Well, we might be wrong</strong>. The data may have been generated in some other way, but we always start with a belief - our model. The reason I started modelling my data with the graph structure shown in Fig.1 is that I believe all my data (i.e, \(\{ x^{(i)}, y^{(i)}, z^{(i)} \}_{i=1}^M\)) was generated as follows:</p>
<p>\[
x^{(i)} \sim \mathbb{P}(X) \]
\[
y^{(i)} \sim \mathbb{P}(Y|X=x^{(i)}) \]
\[
z^{(i)} \sim \mathbb{P}(Z|Y=y^{(i)},X=x^{(i)})
\]</p>
<p>Or, equivalently</p>
<p>\[
(x^{(i)}, y^{(i)}, z^{(i)}) \sim p_{model}(X, Y, Z)
\]</p>
<h2 id="latent-variables-a-model-with-hidden-factor">Latent variables: A model with “hidden factor”</h2>
<p>Equipped with the knowledge of general Directed PGMs, we are now ready to look at one particular model (or rather family of models) that is extremely important and used heavily everywhere in practice. The idea of a <strong>latent variable</strong> is basically <em>our belief</em> that there is a <em>hidden factor</em> behind the generation of our data. But the hidden factor is (of course) not available in our dataset. Fig.3 shows the structure of the model (i.e., our belief about the data generation process), which has the variable \(Z\) that we believe to be a latent/hidden factor contributing to the generation of \(X\) (i.e., the observed variable). So, my model is as follows:</p>
<p>\[z^{(i)} \sim \mathbb{P}(Z)\]
\[x^{(i)} \sim \mathbb{P}(X|Z=z^{(i)})\]</p>
<p>Or, equivalently</p>
<p>\[ (z^{(i)}, x^{(i)}) \sim p_{model}(X, Z) \]</p>
<p>But unfortunately, the dataset \(D = \{x^{(i)}\}_{i=1}^M\) does not contain \(z^{(i)}\)</p>
<p>\[ x^{(i)} \sim p_{data}(X) \]</p>
<p>An example will clear any doubts:</p>
<center>
<figure>
<img width="40%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/12/latentvar_model.JPG" />
<figcaption>Fig.3: Latent factor responsible for data generation</figcaption>
</figure>
</center>
<p>Think of our dataset as facial images of \(K\) persons, but without identifying them with any labels. So, \(D = \{x^{(i)}\}_{i=1}^M\) where \(x^{(i)}\) is a facial image. But our model may contain a hidden factor, namely, the “identity” of the person in a given image \(x^{(i)}\). We can model this with a latent (discrete) variable having \(K\) states.</p>
<p>\[ z^{(i)} \sim \mathbb{P}(Z \in \{ 1, 2, \cdots, K \} ) \]
\[ x^{(i)} \sim \mathbb{P}(X|Z=z^{(i)}) \]</p>
<p>Let’s see if we can do MLE on this. The likelihood is</p>
<p>\[
\mathbb{L}(D;\theta) = \prod_{x^{(i)} \sim p_{data}} p_{model}(x^{(i)}, \color{red}{z^{(i)}}; \theta)
\]</p>
<p>Wait a minute! We don’t have \(z^{(i)}\) available in our dataset. This is why MLE won’t work here.</p>
<h2 id="expectation-maximization-em-algorithm">Expectation-Maximization (EM) Algorithm</h2>
<p>The EM algorithm solves the above problem. Although this tutorial is not focused on the EM algorithm, I will give a brief idea about how it works. Remember where we got stuck last time? We didn’t have \(z^{(i)}\) in our dataset and so couldn’t perform normal MLE on the model. That’s literally the only thing that stopped us. The core idea of the EM algorithm is to estimate \(Z\) using the model and the \(X\) we have in our dataset, and then use that estimate to perform normal MLE.</p>
<p>The <strong>Expectation (E) step</strong> estimates \(z^{(i)}\) from a given \(x^{(i)}\) using the model</p>
<p>\[
\hat{z}^{(i)} = \mathbb{E}_{\mathbb{P}(Z|X=x^{(i)})}[Z | X]
\]</p>
<p>where
\[
\mathbb{P}(Z|X=x^{(i)}) = \frac{p_{model}(x, z)}{p_{model}(x)} = \frac{p_{model}(x, z)}{\sum_z p_{model}(x, z)}
\]</p>
<p>And then, the <strong>Maximization (M) step</strong> plugs that \(\hat{z}^{(i)}\) into the likelihood and performs standard MLE. The likelihood looks like</p>
<p>\[
\mathbb{L}(D;\theta) = \prod_{x^{(i)} \sim p_{data}} p_{model}(x^{(i)}, \hat{z}^{(i)}; \theta)
\]</p>
<p>By repeating the <em>E & M steps iteratively</em>, we can get an optimal solution for the parameters and eventually discover the latent factors in the data.</p>
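A minimal EM sketch, in the spirit of the “identity” example above: a two-component Gaussian mixture (variances fixed at 1 for brevity; all numbers are my own). For a binary \(Z\), the E-step expectation \(\mathbb{E}[Z\vert x]\) is exactly the posterior “responsibility” \(\mathbb{P}(Z=k\vert x)\) computed below:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 points from two hidden "identities" (Z in {0, 1}); Z itself is never observed.
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

mu = np.array([-1.0, 1.0])       # initial guesses for the component means
pi = np.array([0.5, 0.5])        # initial guesses for P(Z)

for _ in range(50):
    # E step: soft estimate of Z, i.e. responsibilities P(Z = k | x_i).
    # (Unnormalized Gaussian densities: the 1/sqrt(2*pi) constant cancels below.)
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: plug the estimates into the likelihood and do standard MLE.
    pi = resp.mean(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # close to the true means (-2, 3)
print(pi)   # close to the true mixing weights (0.5, 0.5)
```

The alternation is exactly the E/M loop described above: estimate the hidden \(Z\), then maximize the resulting likelihood, and repeat.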
<h2 id="the-intractable-inference-problem">The intractable inference problem</h2>
<p>Apart from the learning problem, which involves estimating the whole joint distribution, there exists another problem that is worth solving on its own - the <strong>inference problem</strong>, i.e., estimating the latent factor given an observation. For example, we may want to estimate the “pose” of an object given its image in an unsupervised way, OR, estimate the identity of a person given his/her facial photograph (our last example). Although we have seen how to perform inference in the EM algorithm, I am rewriting it here for convenience.</p>
<p>Taking up the same example of latent variable (i.e., \(Z \rightarrow X\)), we <em>infer</em> \(Z\) as</p>
<p>\[
\mathbb{P}(Z|X) = \frac{\mathbb{P}(X,Z)}{\mathbb{P}(X)} = \frac{\mathbb{P}(X,Z)}{\sum_Z \mathbb{P}(X,Z)}
\]</p>
<p>This quantity is also called the <strong>posterior</strong>.</p>
<p>For continuous \(Z\), we have an integral instead of a summation</p>
<p>\[
\mathbb{P}(Z|X) = \frac{\mathbb{P}(X,Z)}{\mathbb{P}(X)} = \frac{\mathbb{P}(X,Z)}{\int_Z \mathbb{P}(X,Z)}
\]</p>
<p>If you are a keen observer, you might notice an apparent problem with the inference - it will be computationally intractable as it involves a <em>summation/integration over a high dimensional vector with potentially unbounded support</em>. For example, if the latent variable denotes a continuous “pose” vector of length \(d\), the denominator will contain a \(d\)-dimensional integral over \((-\infty, \infty)^d\). At this point, you might realize that even the EM algorithm suffers from this intractability problem.</p>
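To see the contrast, here is the posterior computation for a small <em>discrete</em> \(Z\), where the normalizing sum in the denominator is trivial to enumerate (all numbers are made up); it is the continuous, high-dimensional case that breaks:

```python
import numpy as np

# A made-up discrete model: Z takes 3 values, X takes 2 values.
p_z = np.array([0.5, 0.3, 0.2])              # prior P(Z)
p_x_given_z = np.array([[0.9, 0.1],          # P(X | Z=0)
                        [0.4, 0.6],          # P(X | Z=1)
                        [0.2, 0.8]])         # P(X | Z=2)

x = 1                                        # the observation
joint = p_z * p_x_given_z[:, x]              # P(X=x, Z) for every value of Z
posterior = joint / joint.sum()              # P(Z | X=x): the sum over Z is easy here
print(posterior)
```

With a continuous \(d\)-dimensional \(Z\), the one-line `joint.sum()` becomes a \(d\)-dimensional integral with no closed form in general - hence the need for approximation.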
<h2 id="variational-inference-vi-comes-to-rescue">Variational Inference (VI) comes to rescue</h2>
<p>Finally, here we are. This is the one algorithm I was most excited to explain, because this is where some of the ground-breaking ideas of this field were born. Variational Inference (VI), although present in the literature for a long time, has recently shown very promising results on problems involving latent variables and deep structure. In the next post, I will go into some of those specific algorithms, but not today. In this article, I will go over the basic framework of VI and how it works.</p>
<p>The idea is really simple: <strong>If we can’t get a tractable closed-form solution for \(\mathbb{P}(Z\vert X)\), we’ll approximate it</strong>.</p>
<p>Let the approximation be \(\mathbb{Q}(Z;\phi)\) and we can now form this as an optimization problem:</p>
<p>\[
\mathbb{Q}^*(Z) = arg\min_{\phi}\ \mathbb{K}\mathbb{L}[\mathbb{Q}(Z;\phi)\ ||\ \mathbb{P}(Z|X)]
\]</p>
<p>By choosing a family of distributions \(\mathbb{Q}(Z;\phi)\) flexible enough to model \(\mathbb{P}(Z\vert X)\) and optimizing over \(\phi\), we can push the approximation towards the real posterior. \(\mathbb{K}\mathbb{L}(\cdot \|\cdot)\) is the KL-divergence, a measure of dissimilarity between two probability distributions.</p>
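A quick numeric sketch of KL between discrete distributions (the helper name is mine). Note that KL is not a true distance: it is zero only for identical distributions, but it is not symmetric, which is why “dissimilarity” is the safer word:

```python
import numpy as np

def kl_divergence(q, p):
    """KL[q || p] for discrete distributions given as probability vectors."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return np.sum(q * np.log(q / p))

q = [0.2, 0.5, 0.3]
p = [0.1, 0.6, 0.3]
print(kl_divergence(q, q))                          # 0.0: identical distributions
print(kl_divergence(q, p))                          # > 0 whenever they differ
print(kl_divergence(q, p) == kl_divergence(p, q))   # False: KL is not symmetric
```

The same non-negativity property is what will later give us the lower-bound interpretation of the ELBO.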
<center>
<figure>
<img width="50%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/12/vi.JPG" />
<figcaption>Fig.4: Variational approximation of the true posterior</figcaption>
</figure>
</center>
<p>Now let’s expand the KL-divergence term</p>
<p>\[
\mathbb{K}\mathbb{L}[\mathbb{Q}(Z;\phi)\ \vert\vert \ \mathbb{P}(Z\vert X)]
\]</p>
<p>\[
\let\sb_
= \mathbb{E}_{\mathbb{Q}} [\log \mathbb{Q}(Z;\phi)] - \mathbb{E}\sb{\mathbb{Q}} [\log \mathbb{P}(Z\vert X)]
\]</p>
<p>\[
\let\sb_
= \mathbb{E}_{\mathbb{Q}} [\log \mathbb{Q}(Z;\phi)] - \mathbb{E}\sb{\mathbb{Q}} [\log \frac{\mathbb{P}(X,Z)}{\mathbb{P}(X)}]
\]</p>
<p>\[
\let\sb_
= \mathbb{E}_{\mathbb{Q}} [\log \mathbb{Q}(Z;\phi)] - \mathbb{E}\sb{\mathbb{Q}} [\log \mathbb{P}(X, Z)] + \log \mathbb{P}(X)
\]</p>
<p>We can compute the first two terms in the above expansion, but oh lord! The third term is the <em>same annoying (intractable) integral</em> we were avoiding before. What do we do now? This seems to be a deadlock!</p>
<h2 id="the-evidence-lower-bound-elbo">The Evidence Lower BOund (ELBO)</h2>
<p>Please recall that our original objective was a minimization problem over \(\mathbb{Q}(\cdot;\phi)\). We can pull a little trick here - <strong>we can optimize only the first two terms and ignore the third term</strong>. How ?</p>
<p>Because the third term is independent of \(\mathbb{Q}(\cdot;\phi)\). So, we just need to minimize</p>
<p>\[
\let\sb_
\mathbb{E}_{\mathbb{Q}} [\log \mathbb{Q}(Z;\phi)] - \mathbb{E}\sb{\mathbb{Q}} [\log \mathbb{P}(X, Z)]
\]</p>
<p>Or equivalently, maximize (just flip the two terms)</p>
<p>\[
\let\sb_
ELBO(\mathbb{Q}) \triangleq \mathbb{E}\sb{\mathbb{Q}} [\log \mathbb{P}(X, Z)] - \mathbb{E}_{\mathbb{Q}} [\log \mathbb{Q}(Z;\phi)]
\]</p>
<p>This term, usually defined as the ELBO, is quite famous in the VI literature and you have just witnessed what it looks like and where it came from. Taking a deeper look into the \(ELBO(\cdot)\) yields even further insight</p>
<p>\[
\let\sb_
ELBO(\mathbb{Q}) = \mathbb{E}\sb{\mathbb{Q}} [\log \mathbb{P}(X\vert Z)] + \mathbb{E}\sb{\mathbb{Q}} [\log \mathbb{P}(Z)] - \mathbb{E}_{\mathbb{Q}} [\log \mathbb{Q}(Z;\phi)]
\]</p>
<p>\[
\let\sb_
= \mathbb{E}\sb{\mathbb{Q}} [\log \mathbb{P}(X\vert Z)] - \mathbb{K}\mathbb{L}[\mathbb{Q}(Z;\phi)\ ||\ \mathbb{P}(Z)]
\]</p>
<p>Now, please consider looking at the last equation for a while because that is what all our efforts led us to. The last equation is totally tractable and also solves our problem. What it basically says is that maximizing \(ELBO(\cdot)\) (which is a proxy objective for our original optimization problem) is equivalent to maximizing the conditional data likelihood (which we can choose in our graphical model design) and simultaneously pushing our approximate posterior (i.e., \(\mathbb{Q}(;\phi)\)) towards a prior over \(Z\). The prior \(\mathbb{P}(Z)\) is basically how the true latent space is organized. Now the immediate question might arise: “Where do we get \(\mathbb{P}(Z)\) from?”. The answer is, we can just choose any distribution as a hypothesis. It will be our belief of how the \(Z\) space is organized.</p>
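The equivalence of the two ELBO forms above can be checked numerically on a tiny discrete toy model (all numbers below are arbitrary, chosen only to verify the algebra):

```python
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])              # prior P(Z)
p_x_given_z = np.array([0.1, 0.6, 0.8])      # P(X = x | Z) for one observed x
q = np.array([0.2, 0.5, 0.3])                # some approximate posterior Q(Z; phi)

log_joint = np.log(p_z * p_x_given_z)        # log P(X = x, Z)

# E_Q[log P(X,Z)] - E_Q[log Q(Z)]   vs.   E_Q[log P(X|Z)] - KL[Q || P(Z)]
elbo_form1 = np.sum(q * log_joint) - np.sum(q * np.log(q))
elbo_form2 = np.sum(q * np.log(p_x_given_z)) - np.sum(q * np.log(q / p_z))
print(np.isclose(elbo_form1, elbo_form2))    # True
```

The two expressions agree for any valid \(\mathbb{Q}\), since the rewriting is pure algebra: \(\log \mathbb{P}(X,Z) = \log \mathbb{P}(X\vert Z) + \log \mathbb{P}(Z)\).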
<center>
<figure>
<img width="50%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/12/elbo_px_gap.JPG" />
<figcaption>Fig.5: Interpretation of ELBO</figcaption>
</figure>
</center>
<p>There is one more interpretation (see figure 5) of the KL-divergence expansion that is interesting to us. Rewriting the KL-expansion and substituting \(ELBO(\cdot)\) definition, we get</p>
<p>\[
\log \mathbb{P}(X) = ELBO(\mathbb{Q}) + \mathbb{K}\mathbb{L}[\mathbb{Q}(Z;\phi)\ \vert\vert \ \mathbb{P}(Z\vert X)]
\]</p>
<p>As we know that \(\mathbb{K}\mathbb{L}(\cdot\vert\vert \cdot) \geq 0\) for any two distributions, the following inequality holds</p>
<p>\[
\log \mathbb{P}(X) \geq ELBO(\mathbb{Q})
\]</p>
<p>So, the \(ELBO(\cdot)\) that we vowed to maximize is a <strong>lower bound</strong> on the observed data log-likelihood. That’s amazing, isn’t it! Just by maximizing the \(ELBO(\cdot)\), we can implicitly get closer to our dream of maximum (log-)likelihood estimation - <em>the tighter the bound, the better the approximation</em>.</p>
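We can even verify this bound numerically on a tiny discrete toy model (numbers are mine): every valid \(\mathbb{Q}\) obeys the inequality, and the bound becomes tight exactly when \(\mathbb{Q}\) equals the true posterior:

```python
import numpy as np

rng = np.random.default_rng(0)
p_z = np.array([0.5, 0.3, 0.2])              # prior P(Z) (made-up toy numbers)
p_x_given_z = np.array([0.1, 0.6, 0.8])      # P(X = x | Z) for one observed x
evidence = np.sum(p_z * p_x_given_z)         # P(X = x), easy to enumerate here
log_px = np.log(evidence)                    # exact log evidence

def elbo(q):
    return np.sum(q * np.log(p_z * p_x_given_z)) - np.sum(q * np.log(q))

for _ in range(1000):                        # ELBO(Q) <= log P(X) for any valid Q
    q = rng.random(3)
    q /= q.sum()
    assert elbo(q) <= log_px + 1e-12

posterior = p_z * p_x_given_z / evidence     # the true posterior P(Z | X = x)
print(np.isclose(elbo(posterior), log_px))   # True: the bound is tight at Q = P(Z|X)
```

The gap between \(\log \mathbb{P}(X)\) and the ELBO is precisely the KL term of Fig.5, which vanishes at the true posterior.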
<hr />
<p>Okay! Way too much math for today. This is overall what Variational Inference looks like. Numerous directions of research have emerged from this point onwards. It’s impossible to talk about all of them, but a few directions that succeeded in grabbing the attention of the community with their amazing formulations and results will be discussed in later parts of the tutorial series. One of them is the “Variational AutoEncoder” (VAE). Stay tuned.</p>
<h4 id="references">References</h4>
<ol>
<li>“Variational Inference: A Review for Statisticians”, David M. Blei, Alp Kucukelbir, Jon D. McAuliffe</li>
<li>“Pattern Recognition and Machine Learning”, C.M. Bishop</li>
<li>“Machine Learning: A Probabilistic Perspective”, Kevin P. Murphy</li>
</ol>Ayan DasWelcome to the first part of a series of tutorials about Directed Probabilistic Graphical Models (PGMs) & Variational methods. Directed PGMs (OR, Bayesian Networks) are very powerful probabilistic modelling techniques in the machine learning literature and have been studied rigorously by researchers over the years. Variational methods are a family of algorithms that arise in the context of Directed PGMs when they involve solving intractable integrals. Doing inference on a set of latent variables (given a set of observed variables) involves such an intractable integral. Variational Inference (VI) is a specialised form of variational method that handles this situation. This tutorial is NOT for absolute beginners as I assume the reader to have basic-to-moderate knowledge about Random Variables, probability theory and PGMs. The next tutorial in this series will cover one particular VI method, namely “Variational Autoencoder (VAE)” built on top of VI.TeX & family : The Typesetting ecosystem2019-05-29T00:00:00+00:002019-05-29T00:00:00+00:00https://dasayan05.github.io/blog-tut/2019/05/29/tex-and-family<p>Welcome to the very first and an introductory article on <code class="language-plaintext highlighter-rouge">typesetting</code>. If you happen to be from the scientific community, you must have gone through at least one document (maybe in the form of <code class="language-plaintext highlighter-rouge">.pdf</code> or a printed paper) which is the result of years of developments in typesetting. If you are from a technical/research background, chances are that you have even <em>typeset</em> a document before using something called <code class="language-plaintext highlighter-rouge">LaTeX</code>. Let me assure you that <code class="language-plaintext highlighter-rouge">LaTeX</code> is neither the beginning nor the end of the entire typesetting ecosystem.
In this article, I will provide a brief introduction to what typesetting is and what modern tools are available for use. Specifically, the most popular members of the <code class="language-plaintext highlighter-rouge">TeX</code> family will be introduced, including <code class="language-plaintext highlighter-rouge">LaTeX</code>.</p>
<p>Although many people (including the ones who use it one way or the other) do not recognize this, typesetting is an <strong>art</strong>. Technically it’s defined as the process of arranging various symbols (letters, numbers & special characters) called <a href="https://en.wikipedia.org/wiki/Glyph">glyphs</a> on a physical paper or in a digital medium in a way that is appealing as <em>reading</em> material. I emphasized the word “reading” because that’s the key - typesetting aims to produce documents that are <em>pleasant to the human eye</em>. You might ask, “So, what is the big deal here?”. The answer is a set of (technical) terms/phrases which, I am pretty sure, you haven’t even heard of: “Optimal line length”, “<a href="https://en.wikipedia.org/wiki/Typographic_ligature">Ligatures</a>”, “Italic correction”, “<a href="https://en.wikipedia.org/wiki/Hyphenation_algorithm">Hyphenation</a>”, “Optimal spacing” etc., which, if not done right, may result in fatigue while reading. Please believe me at this point that there indeed is a science behind deciding what exactly is pleasant to the human eye and what’s not. I will try to illustrate a few of them here:</p>
<center>
<figure>
<img width="50%" style="padding-top: 20px;" src="/public/posts_res/11/uERdv.png" />
<figcaption>Fig.1: Difference in output with Typesetting and Word processors</figcaption>
</figure>
</center>
<p>Have a look at the example above: two lines with identical content, one (the top one) produced by a <em>typesetting</em> system and the other by a word processor. Did you notice any visual difference? Let me help you.</p>
<ol>
<li>
<p><strong>Ligature</strong> is a special <em>glyph</em> formed by joining the <em>glyphs</em> of two letters (or <a href="https://en.wikipedia.org/wiki/Grapheme">grapheme</a>s). Canonical examples of ligatures are the <em>grapheme</em> pairs “ff” and “fi”, both of which happen to be present in the content of the example sentence. The typeset sentence uses just one <em>glyph</em> for each of “ff” and “fi”, which is not the case with word processors.</p>
</li>
<li>
<p><strong>(Non-)optimal spacing</strong> does affect the appeal of the text when read by the human eye. The word “AVAST” clearly has more space between its letters in the latter case, which is quite awful. If you want more of it, look at the character-pairs “Fe” and “Ta” in the words “Feline” and “Table” respectively.</p>
</li>
<li>
<p><strong>Hyphenation</strong> is defined as the process of breaking words between lines. A better hyphenation algorithm produces much less breakage of words in a given paragraph. The reason word processors are not so good at it is that their hyphenation algorithms work on a single line and not on the entire paragraph. The example below (Fig.2; taken from <a href="http://www.rtznet.nl/zink/latex.php?lang=en">here</a>) should be self-explanatory.</p>
</li>
<li>
<p><strong>Typesetting mathematics</strong> is crucial when dealing with scientific documents. Scientific engineers/researchers will not be happy if their complicated equations look ugly. Refer to Fig.3 for a visual comparison of equations.</p>
</li>
</ol>
<center>
<figure>
<img width="50%" style="padding-top: 20px;" src="/public/posts_res/11/hyphenation.png" />
<figcaption>Fig.2: Effect of (im)proper Hyphenation</figcaption>
</figure>
</center>
<center>
<figure>
<img width="56%" style="padding-top: 20px; margin: 0px;" src="/public/posts_res/11/eq_word.PNG" />
<img width="60%" style="padding:0px;margin:0px;" src="/public/posts_res/11/eq_latex.PNG" />
<figcaption>Fig.3: First one is produced using MS Word's equation feature; Second one is typeset with LaTeX</figcaption>
</figure>
</center>
<p>There are many more of these. It’s difficult to discuss all of them here. If you are interested, read articles like <a href="http://www.rtznet.nl/zink/latex.php?lang=en">this</a> by expert typographers. In this article, I would rather focus on the tools available for digital typesetting.</p>
<p>Before we begin, it’s good to have an idea about how exactly these typeset documents are produced <em>digitally</em>. This is achieved by means of some specially crafted file formats:</p>
<ol>
<li>
<p><strong>Device Independent</strong> (<code class="language-plaintext highlighter-rouge">.dvi</code>): DVI is a format created by <strong>David R. Fuchs</strong> and implemented by <strong>Donald Knuth</strong> as the primary output format of <code class="language-plaintext highlighter-rouge">TeX</code>. DVIs are binary (encoded) files and are not intended to be readable as text. DVI viewers (e.g. <code class="language-plaintext highlighter-rouge">xdvi</code>) can recognize and display them.</p>
</li>
<li>
<p><strong>PostScript</strong> (<code class="language-plaintext highlighter-rouge">.ps</code>): PostScript (in short, “PS”) is a very popular format used heavily in the publishing industry, created by Adobe. PS is a page description language (yes, it’s a full-fledged programming language) that describes a page by means of its commands. It is readable as text because it is source code in a programming language.</p>
</li>
<li>
<p><strong>Portable Document Format</strong> (<code class="language-plaintext highlighter-rouge">.pdf</code>): Here comes the beast. PDF is a widely used document format used .. well .. everywhere. Created by Adobe, this format is intended to be dependency-free and a complete description of the document including text, images (raster/vector), fonts and other assets.</p>
</li>
</ol>
<p><br /></p>
<h2 id="digital-typesetting-and-history-of-tex-">Digital typesetting and history of <code class="language-plaintext highlighter-rouge">TeX</code>:</h2>
<p><strong>Digital typesetting</strong> refers to typesetting in a digital medium to produce high-quality printing material. A digital typesetting system must therefore consume the <em>content</em> and <em>formatting</em> of the material we want to print and produce a DVI/PS/PDF, which can then be handed to traditional printers.</p>
<center>
<img width="30%" style="padding-top: 20px; padding-bottom: 20px;" src="/public/posts_res/11/knuth.jpg" />
</center>
<p>It all started when this guy, <a href="https://en.wikipedia.org/wiki/Donald_Knuth">Donald E. Knuth</a>, felt the need for a reliable typesetting system after a bad experience typesetting his book (The Art of Computer Programming). Around 1977, while at Stanford, Knuth developed the very first version of <code class="language-plaintext highlighter-rouge">TeX</code> - a digital typesetting engine that lets users describe the <em>content</em> and <em>formatting</em> of a printing material by means of text files, and produces <code class="language-plaintext highlighter-rouge">.dvi</code>s. TeX (pronounced <em>tech</em>) is much like a programming language: it takes source code as input and produces a beautifully typeset document. Although <code class="language-plaintext highlighter-rouge">TeX</code> is a Turing-complete programming language, it is mostly used as a <em>description language</em> flexible enough to describe not only the content of the document but also granular formatting details. TeX is highly popular in academia because of its ability to beautifully typeset mathematical notations and symbols. The core TeX engine uses quite sophisticated algorithms to address the problems (the ones I described before, like “optimal spacing” and “italic correction”) that make a document unpleasant to the human eye. Although many things are automatic, TeX gives users <em>granular control</em> over formatting details.</p>
<hr />
<h2 id="tex--the-core-typesetting-engine"><code class="language-plaintext highlighter-rouge">TeX</code> : The core typesetting engine</h2>
<p>Okay, enough of history and vague descriptions. Let me introduce you to the language of <code class="language-plaintext highlighter-rouge">TeX</code>. In case you want to follow along, please install any complete TeX distribution (<code class="language-plaintext highlighter-rouge">MikTeX</code>, <code class="language-plaintext highlighter-rouge">TeXLive</code> etc.) and you’ll get all the required tools ready for use. Here’s a simple TeX program (adapted from the <a href="https://www.amazon.com/TeXbook-Donald-Knuth/dp/0201134489">TeXBook</a>):</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">% filename: story.tex</span>
<span class="k">\hsize</span>=3in
<span class="k">\centerline</span><span class="p">{</span><span class="k">\bf</span> A Short Story<span class="p">}</span>
<span class="k">\vskip</span> 6pt
<span class="k">\centerline</span><span class="p">{</span><span class="k">\sl</span> Ayan Das<span class="p">}</span>
<span class="k">\vskip</span> .5cm
<span class="p">{</span><span class="k">\parindent</span>=1em
<span class="k">\indent</span> Once upon a time, in a distant galaxy called <span class="k">\"</span>O<span class="k">\"</span>o<span class="k">\c</span> c, there lived a computer named R.~J. Drofnats.<span class="p">}</span>
<span class="p">{</span><span class="k">\parindent</span>=2em
<span class="k">\indent</span> Mr.~Drofnats---or ``R. J.,'' as he preferred to be called---was happiest when he was at work typesetting beautiful documents using <span class="k">\TeX</span>.<span class="p">}</span>
<span class="k">\bye</span>
</code></pre></div></div>
<p>if <em>compiled</em> with the TeX engine</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prompt<span class="nv">$ </span>tex story.tex
This is TeX, Version 3.14159265 <span class="o">(</span>MiKTeX 2.9.7000 64-bit<span class="o">)</span>
<span class="o">(</span>story.tex <span class="o">[</span>1] <span class="o">)</span>
Output written on story.dvi <span class="o">(</span>1 page, 680 bytes<span class="o">)</span><span class="nb">.</span>
Transcript written on story.log.
</code></pre></div></div>
<p>produces a <code class="language-plaintext highlighter-rouge">.dvi</code> which when opened with a DVI viewer, will look like this:</p>
<center>
<img width="70%" style="padding-top: 20px; padding-bottom: 20px;" src="/public/posts_res/11/tex_example1.PNG" />
</center>
<p>Although the whole point of this tutorial is not to teach you TeX in detail, I do want you to get a feel for how TeX accomplishes fine-quality typesetting programmatically. Here goes the explanation of the source code:</p>
<ol>
<li>The very first control sequence (yes, that’s what they are called) <code class="language-plaintext highlighter-rouge">\hsize</code> determines the width of the text area. Look how narrow the text is; it’s just 3 inches!</li>
<li>Control sequence <code class="language-plaintext highlighter-rouge">\centerline</code>, as you can guess, centers a line of text. That <code class="language-plaintext highlighter-rouge">\bf</code> makes all text inside its <em>enclosing braces</em> <strong>boldface</strong>. Try to guess what that <code class="language-plaintext highlighter-rouge">\sl</code> in the next-to-next line is for.</li>
<li>A couple of <code class="language-plaintext highlighter-rouge">\vskip</code>s are there to create <em>vertical gaps</em>. We can use several units of length (inches, points etc.) as per our convenience.</li>
<li><code class="language-plaintext highlighter-rouge">\parindent</code> decides how much space to put for <em>paragraph indentation</em>. The first paragraph has a <code class="language-plaintext highlighter-rouge">\parindent=1em</code> and the next one has <code class="language-plaintext highlighter-rouge">\parindent=2em</code> which is quite evident from the output.</li>
<li>Did you notice the <em>accents</em> and how they are written in the source code (<code class="language-plaintext highlighter-rouge">\"O\"o\c c</code>)?</li>
<li>The <code class="language-plaintext highlighter-rouge">~</code> sign represents a single <em>space</em> with an extra instruction telling TeX not to <em>break the line at that point</em> while running its optimal line-breaking algorithm.</li>
</ol>
<p>I seriously have no intention to make this any longer, but I can’t resist showing you the level of <em>granularity</em> TeX offers:</p>
<ol>
<li>The control sequence <code class="language-plaintext highlighter-rouge">\centerline</code> is not a primary one - it is defined using a more fundamental concept called <strong>Glue</strong>. Think of glues as <em>virtual springs</em> which have <em>stretchability</em> and <em>shrinkability</em>. Think of <em>centering a text</em> as putting two identical springs of <em>infinite stretchability</em> horizontally on both sides of the text. At equilibrium, the text will be centered. Seems strange, right? That <code class="language-plaintext highlighter-rouge">\centerline</code> can then be (roughly) defined as
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">% This is how you define a control sequence with one argument</span>
<span class="k">\def\centerline</span>#1<span class="p">{</span>
<span class="k">\hskip</span>0pt plus 1fil #1<span class="k">\hskip</span>0pt plus 1fil
<span class="p">}</span>
</code></pre></div> </div>
<p>where those <code class="language-plaintext highlighter-rouge">\hskip0pt plus 1fil</code>s are the two glues/springs I mentioned earlier. Try to figure out what it exactly means (Hint: <code class="language-plaintext highlighter-rouge">1fil</code> means a length of “Infinity with strength 1”).</p>
</li>
<li>Another control sequence I want to bring your attention to is that <code class="language-plaintext highlighter-rouge">\TeX</code> at the very end of the second paragraph that produces the special <code class="language-plaintext highlighter-rouge">TeX</code>-logo. It might seem like a primary command but it’s not - proper placing of that ‘E’ can be done with more fundamental commands like <code class="language-plaintext highlighter-rouge">\lower</code> and <code class="language-plaintext highlighter-rouge">\kern</code>:
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\def\TeX</span><span class="p">{</span>
T<span class="k">\kern</span>-.2em<span class="k">\lower</span>.5ex<span class="k">\hbox</span><span class="p">{</span>E<span class="p">}</span><span class="k">\kern</span>-.13em X
<span class="p">}</span>
</code></pre></div> </div>
<p><code class="language-plaintext highlighter-rouge">\kern</code> is there to produce a given amount of horizontal space. A negative number will cause the next character to overlap. <code class="language-plaintext highlighter-rouge">\lower</code>, as you can guess, is for <em>lowering</em> the following <em>box</em> from its horizontal baseline by the given amount.</p>
</li>
</ol>
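<p>To make the glue idea more concrete, here is a rough plain-TeX sketch. The macro name <code class="language-plaintext highlighter-rouge">\myrightline</code> is my own invention (plain TeX already ships a similar <code class="language-plaintext highlighter-rouge">\rightline</code>); the point is that a single infinitely stretchable glue on the left pushes the text all the way to the right margin:</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% a sketch, not plain TeX's actual definition:
% one spring on the left only, so the text ends up at the right margin
\hsize=3in
\def\myrightline#1{\hbox to \hsize{\hskip0pt plus 1fil #1}}
\myrightline{pushed right by a single glue}
\bye
</code></pre></div></div>
<p>Two such springs, one on each side, recover the centering behaviour described above.</p>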
<p>I hope I have successfully conveyed the essence of <code class="language-plaintext highlighter-rouge">TeX</code> and the granularity/flexibility it offers. We will now move on to other members of <code class="language-plaintext highlighter-rouge">TeX</code> family.</p>
<hr />
<h2 id="latex--a-layer-of-abstraction"><code class="language-plaintext highlighter-rouge">LaTeX</code> : A layer of abstraction</h2>
<p>If you have ever heard about or worked with any one member of the TeX family, chances are it is <code class="language-plaintext highlighter-rouge">LaTeX</code>. <code class="language-plaintext highlighter-rouge">LaTeX</code> was designed by <strong>Leslie B. Lamport</strong> (<code class="language-plaintext highlighter-rouge">LaTeX</code> happens to be an abbreviation of <code class="language-plaintext highlighter-rouge">Lamport TeX</code>) around 1983 as a document management system. It focuses heavily on <em>separating content from formatting</em>, which helps users concentrate on the content. <code class="language-plaintext highlighter-rouge">LaTeX</code> is technically a gigantic <em>macro package</em> for <code class="language-plaintext highlighter-rouge">TeX</code> whose primary motive is to provide users with <em>document management</em> capabilities like “Automated page numbering”, “Automatic (sub)section formatting”, “Automatic Table-of-Content generation”, “Easy referencing mechanism” etc. Users thus need not worry about putting a proper page number on every new page, or adding a new Table-of-Content entry every time they add a section to the document.</p>
<p>Plain TeX would require you to do at least this much to produce a numbered list:</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="k">\parindent</span>=2em
<span class="k">\indent</span> 1. First point<span class="p">}</span>
<span class="k">\vskip</span>1em
<span class="p">{</span><span class="k">\parindent</span>=2em
<span class="k">\indent</span> 2. Second point<span class="p">}</span>
<span class="k">\vskip</span>1em
<span class="p">{</span><span class="k">\parindent</span>=2em
<span class="k">\indent</span> 3. Third point<span class="p">}</span>
<span class="k">\bye</span>
</code></pre></div></div>
<p>whereas <code class="language-plaintext highlighter-rouge">LaTeX</code>’s abstraction allows you to <em>focus much more on the content</em> rather than on formatting details. Here’s how you would do it in LaTeX:</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">\begin{enumerate}</span>
<span class="k">\item</span> First point
<span class="k">\item</span> Second point
<span class="k">\item</span> Third point
<span class="nt">\end{enumerate}</span>
</code></pre></div></div>
<p>Apart from being more readable and content-focused, the LaTeX version is much more feature-complete, as it handles numbering, spacing and nesting automatically. Similarly, (sub-)sectioning is just as easy:</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\documentclass</span><span class="p">{</span>article<span class="p">}</span>
<span class="nt">\begin{document}</span>
<span class="k">\section</span><span class="p">{</span>Introduction<span class="p">}</span>
<span class="k">\subsection</span><span class="p">{</span>Problem Statement<span class="p">}</span>
Content for 'Problem Statement'
<span class="k">\subsection</span><span class="p">{</span>History<span class="p">}</span>
Content for 'History'
<span class="k">\subsection</span><span class="p">{</span>Motivation<span class="p">}</span>
Content for 'Motivation'
<span class="k">\section</span><span class="p">{</span>Details<span class="p">}</span>
<span class="k">\subsection</span><span class="p">{</span>Analysis<span class="p">}</span>
Content for 'Analysis'
<span class="k">\subsection</span><span class="p">{</span>Experiments<span class="p">}</span>
Content for 'Experiments'
<span class="nt">\end{document}</span>
</code></pre></div></div>
<p>will produce</p>
<center>
<img width="40%" style="padding-top: 20px; padding-bottom: 20px;" src="/public/posts_res/11/sectioning.PNG" />
</center>
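<p>The document-management capabilities listed earlier work just as automatically. As a sketch (the section names and label are made up for illustration), a single <code class="language-plaintext highlighter-rouge">\tableofcontents</code> plus a <code class="language-plaintext highlighter-rouge">\label</code>/<code class="language-plaintext highlighter-rouge">\ref</code> pair is all it takes; note that you typically need to compile twice so the cross-references resolve:</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code>\documentclass{article}
\begin{document}
\tableofcontents   % the ToC is generated and numbered automatically
\section{Introduction}\label{sec:intro}
Content for 'Introduction'
\section{Details}
As promised in Section~\ref{sec:intro} on page~\pageref{sec:intro}.
\end{document}
</code></pre></div></div>
<p>If a section is added, removed or renamed, every number in the ToC and in the references is updated on the next compilation - exactly the kind of bookkeeping LaTeX takes off your hands.</p>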
<hr />
<h2 id="pdflatex--the-pdf-variants"><code class="language-plaintext highlighter-rouge">pdf(La)TeX</code> : The <strong>PDF</strong> variants</h2>
<p>There exist two members of the TeX family, namely <code class="language-plaintext highlighter-rouge">pdfTeX</code> and <code class="language-plaintext highlighter-rouge">pdfLaTeX</code>, which are essentially the same engines as <code class="language-plaintext highlighter-rouge">TeX</code> and <code class="language-plaintext highlighter-rouge">LaTeX</code> respectively, but produce <code class="language-plaintext highlighter-rouge">.pdf</code>s directly instead of <code class="language-plaintext highlighter-rouge">.dvi</code>s. They may use modern/advanced features that only PDFs offer. These two are extremely popular, as the demand for <code class="language-plaintext highlighter-rouge">.pdf</code>s is significantly higher than that for <code class="language-plaintext highlighter-rouge">.dvi</code>s. <code class="language-plaintext highlighter-rouge">pdf(La)TeX</code> is a separate program, implemented independently from <code class="language-plaintext highlighter-rouge">(La)TeX</code>. They can be accessed via the command line:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prompt<span class="nv">$ </span>pdftex file.tex
<span class="c"># produces file.pdf</span>
</code></pre></div></div>
<p>and</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prompt<span class="nv">$ </span>pdflatex file.tex
<span class="c"># produces file.pdf</span>
</code></pre></div></div>
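<p>One example of a feature that really shines in PDF output is hyperlinking, provided by the standard <code class="language-plaintext highlighter-rouge">hyperref</code> package. A minimal sketch (the URL is illustrative):</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code>\documentclass{article}
% hyperref adds clickable links and PDF bookmarks;
% compile with a PDF-producing engine, e.g.: pdflatex file.tex
\usepackage{hyperref}
\begin{document}
\section{Links}
See the \href{https://tug.org}{TeX Users Group} website.
\end{document}
</code></pre></div></div>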
<hr />
<h2 id="lualatex--when-latex-meets-lua"><code class="language-plaintext highlighter-rouge">LuaLaTeX</code> : When <code class="language-plaintext highlighter-rouge">LaTeX</code> meets <code class="language-plaintext highlighter-rouge">Lua</code></h2>
<p>A successful attempt at extending the <code class="language-plaintext highlighter-rouge">pdfTeX</code> engine by embedding <code class="language-plaintext highlighter-rouge">Lua</code> in it was <code class="language-plaintext highlighter-rouge">LuaTeX</code> (beware of the spelling; it’s not <code class="language-plaintext highlighter-rouge">LuaLaTeX</code>). This engine, when used with the <code class="language-plaintext highlighter-rouge">LaTeX</code> format, assumes the name <code class="language-plaintext highlighter-rouge">LuaLaTeX</code>. <code class="language-plaintext highlighter-rouge">Lua(La)TeX</code> is primarily used when a little more <em>dynamicity/flexibility</em> is required in the <em>source code</em>. Now that we understand how <code class="language-plaintext highlighter-rouge">(La)TeX</code> programs look and how they work, I will go straight to showing some code rather than beating around the bush.</p>
<p>Before that, I would like to bring your attention to something (La)TeX is not so good at: general-purpose programming. To express even basic programming logic, TeX needs a lot of verbose commands which are neither convenient nor readable as source code. One very important logical block that almost every sensible program contains is a <strong>for loop</strong>. Here’s what TeX and LaTeX need, respectively, in order to accomplish it.</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\newcount\myvar</span> <span class="c">% a command to define a variable</span>
<span class="k">\myvar</span>=1
<span class="k">\loop</span>
<span class="k">\the\myvar</span> <span class="c">% a command to access a variable, I mean seriously !</span>
<span class="k">\advance\myvar</span>1
<span class="k">\ifnum\myvar</span><5
<span class="k">\repeat</span>
<span class="k">\bye</span>
</code></pre></div></div>
<p>and</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code>..
<span class="k">\usepackage</span><span class="p">{</span>forloop<span class="p">}</span>
..
<span class="k">\newcounter</span><span class="p">{</span>ct<span class="p">}</span>
<span class="k">\forloop</span><span class="p">{</span>ct<span class="p">}{</span>1<span class="p">}{</span><span class="k">\value</span><span class="p">{</span>ct<span class="p">}</span> < 5<span class="p">}</span>
<span class="p">{</span>
<span class="k">\thect\
</span><span class="p">}</span>
</code></pre></div></div>
<p>If interested, you may try to read and understand it line by line. But we can agree on one thing - it’s nowhere near convenient or readable. Although the latter is somewhat easier to interpret, it takes a separate package (called <code class="language-plaintext highlighter-rouge">forloop</code>) just to get there.</p>
<p>Now, here’s how Lua helps.</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\documentclass</span><span class="p">{</span>article<span class="p">}</span>
<span class="nt">\begin{document}</span>
<span class="k">\directlua</span><span class="p">{</span>
for i = 1, 10
do
tex.print(i .. ' ')
end
<span class="p">}</span>
<span class="nt">\end{document}</span>
</code></pre></div></div>
<p>The very basic <code class="language-plaintext highlighter-rouge">LuaLaTeX</code> command that bridges LaTeX and Lua is at work here: <code class="language-plaintext highlighter-rouge">\directlua</code> enables users to write arbitrary Lua code inside it. Here’s how it works:</p>
<ol>
<li>The engine halts interpreting the usual LaTeX commands (i.e., stops typesetting) once it encounters a <code class="language-plaintext highlighter-rouge">\directlua</code> block.</li>
<li>The code inside this block is then fed into a special Lua interpreter for execution.</li>
<li>The special <code class="language-plaintext highlighter-rouge">tex.print(..)</code> function (analogous to standard Lua’s <code class="language-plaintext highlighter-rouge">print()</code>) injects the characters into a special output stream.</li>
<li>The <code class="language-plaintext highlighter-rouge">\directlua</code> block is then replaced by the content of the output stream.</li>
<li>LaTeX engine starts its typesetting again from where it was halted.</li>
</ol>
<p>Take a moment to digest this. Hopefully the explanation is clear enough to understand why the output looks like this:</p>
<center>
<img width="40%" style="padding-top: 20px; padding-bottom: 20px;" src="/public/posts_res/11/lualatex_out1.PNG" />
</center>
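<p>This replace-by-output-stream mechanism also makes <code class="language-plaintext highlighter-rouge">\directlua</code> handy for floating-point arithmetic, which plain TeX macros make painful. A small sketch (compile with <code class="language-plaintext highlighter-rouge">lualatex</code>); note that a literal percent sign inside <code class="language-plaintext highlighter-rouge">\directlua</code> would be eaten by TeX as a comment, so we round numerically instead of using <code class="language-plaintext highlighter-rouge">string.format</code>:</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code>\documentclass{article}
\begin{document}
% Lua's math library does the computation;
% tex.print feeds the rounded result back into the typesetting stream
The square root of 2 is approximately
\directlua{tex.print(math.floor(math.sqrt(2)*10000)/10000)}.
\end{document}
</code></pre></div></div>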
<p>Constructs are also available to define pure Lua functions, and convenient mechanisms exist to translate arguments given to a LaTeX command into equivalent Lua objects. A concrete example is shown below:</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\documentclass</span><span class="p">{</span>article<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>luacode<span class="p">}</span>
<span class="nt">\begin{luacode*}</span>
function intro<span class="p">_</span>helper<span class="p">_</span>tex(name)
tex.print('Hello, my name is ' .. name .. '. I love <span class="k">\\</span>TeX')
end
function intro<span class="p">_</span>helper<span class="p">_</span>latex(name)
tex.print('Hello, my name is ' .. name .. '. I love <span class="k">\\</span>LaTeX')
end
<span class="nt">\end{luacode*}</span>
<span class="k">\newcommand</span><span class="p">{</span><span class="k">\intro</span><span class="p">}</span>[2]<span class="p">{</span>
<span class="k">\directlua</span> <span class="p">{</span>
if <span class="k">\luastring</span><span class="p">{</span>#2<span class="p">}</span> == 'tex' then
intro<span class="p">_</span>helper<span class="p">_</span>tex(<span class="k">\luastring</span><span class="p">{</span>#1<span class="p">}</span>)
elseif <span class="k">\luastring</span><span class="p">{</span>#2<span class="p">}</span> == 'latex' then
intro<span class="p">_</span>helper<span class="p">_</span>latex(<span class="k">\luastring</span><span class="p">{</span>#1<span class="p">}</span>)
end
<span class="p">}</span>
<span class="p">}</span>
<span class="nt">\begin{document}</span>
<span class="k">\intro</span><span class="p">{</span>Ayan Das<span class="p">}{</span>latex<span class="p">}</span>
<span class="nt">\end{document}</span>
</code></pre></div></div>
<ol>
<li><code class="language-plaintext highlighter-rouge">\begin{luacode*} .. \end{luacode*}</code> is an environment to put pure Lua definitions. In our example, there are two functions namely <code class="language-plaintext highlighter-rouge">intro_helper_tex(..)</code> and <code class="language-plaintext highlighter-rouge">intro_helper_latex(..)</code>.</li>
<li>To understand the reason for <em>escaping</em> the backslash, go and read the 4th point of the earlier explanation very carefully. The output stream generated by the <code class="language-plaintext highlighter-rouge">tex.print(..)</code>s has to be <strong>valid (La)TeX code</strong> in order to be successfully parsed subsequently by the LaTeX engine. Escaping the backslash produces “\TeX” as a string in the output stream which is a valid (La)TeX command.</li>
<li>Coming to the custom command named <code class="language-plaintext highlighter-rouge">\intro</code>, it takes 2 inputs - your name and favorite TeX format. They are <em>translated</em> to Lua strings via <code class="language-plaintext highlighter-rouge">\luastring{#x}</code> where <code class="language-plaintext highlighter-rouge">x</code> is the argument number of <code class="language-plaintext highlighter-rouge">\intro</code>.</li>
<li>Depending on the second argument, the <code class="language-plaintext highlighter-rouge">if .. elseif .. end</code> block chooses one of the two Lua functions defined earlier.</li>
</ol>
<p>The output of the above program, if compiled like this</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prompt<span class="nv">$ </span>lualatex funcarg.tex
This is LuaTeX, Version 1.10.0 <span class="o">(</span>MiKTeX 2.9.7000 64-bit<span class="o">)</span>
restricted system commands enabled.
<span class="o">(</span>./funcarg.tex
LaTeX2e <2018-12-01>
...
Output written on funcarg.pdf <span class="o">(</span>1 page, 7187 bytes<span class="o">)</span><span class="nb">.</span>
Transcript written on funcarg.log.
</code></pre></div></div>
<p>is</p>
<center>
<img width="40%" style="padding-top: 20px; padding-bottom: 20px;" src="/public/posts_res/11/lualatex_out2.PNG" />
</center>
<hr />
<p>Phew! That was a hell of a lengthy tutorial, but it hopefully conveys the essence of typesetting and the <code class="language-plaintext highlighter-rouge">TeX</code> family of tools. Each member of the <code class="language-plaintext highlighter-rouge">TeX</code> family is a huge system in its own right. With the introductory ideas given in this tutorial, it will be easier to read their official documentation available online.</p>