Proceedings of Machine Learning Research

Borrowing From the Future: An Attempt to Address Double Sampling

Sun, 16 Aug 2020 00:00:00 +0000

For model-free reinforcement learning, one of the main challenges of stochastic Bellman residual minimization is the double sampling problem, i.e., while only one single sample for the next state is available in the model-free setting, two independent samples for the next state are required in order to perform unbiased stochastic gradient descent. We propose new algorithms for addressing this problem based on the idea of borrowing extra randomness from the future. When the transition kernel varies slowly with respect to the state, it is shown that the training trajectory of new algorithms is close to the one of unbiased stochastic gradient descent. Numerical results for policy evaluation in both tabular and neural network settings are provided to confirm the theoretical findings.

A type of generalization error induced by initialization in deep neural networks

Sun, 16 Aug 2020 00:00:00 +0000

How initialization and loss function affect the learning of a deep neural network (DNN), specifically its generalization error, is an important problem in practice. In this work, by exploiting the linearity of DNN training dynamics in the NTK regime \citep{jacot2018neural,lee2019wide}, we provide an explicit and quantitative answer to this problem. Focusing on regression problem, we prove that, in the NTK regime, for any loss in a general class of functions, the DNN finds the same \emph{global} minima—the one that is nearest to the initial value in the parameter space, or equivalently, the one that is closest to the initial DNN output in the corresponding reproducing kernel Hilbert space. Using these optimization problems, we quantify the impact of initial output and prove that a random non-zero one increases the generalization error. We further propose an antisymmetrical initialization (ASI) trick that eliminates this type of error and accelerates the training. To understand whether the above results hold in general, we also perform experiments for DNNs in the non-NTK regime, which demonstrate the effectiveness of our theoretical results and the ASI trick in a qualitative sense. Overall, our work serves as a baseline for the further investigation of the impact of initialization and loss function on the generalization of DNNs, which can potentially guide and improve the training of DNNs in practice.

Deep learning interpretation: Flip points and homotopy methods

Sun, 16 Aug 2020 00:00:00 +0000

Deep learning models are complicated mathematical functions, and their interpretation remains a challenging research question. We formulate and solve optimization problems to answer questions about the models and their outputs. Specifically, we develop methods to study the decision boundaries of classification models using {\em flip points}. A flip point is any point that lies on the boundary between two output classes: e.g. for a neural network with a binary yes/no output, a flip point is any input that generates equal scores for “yes” and “no”. The flip point closest to a given input is of particular importance, and this point is the solution to a well-posed optimization problem. To compute the closest flip point, we develop a homotopy algorithm to overcome the issues of vanishing and exploding gradients and to find a feasible solution for our optimization problem. We show that computing closest flip points allows us to systematically investigate the model, identify decision boundaries, interpret and audit the model with respect to individual inputs and entire datasets, and find vulnerability against adversarial attacks. We demonstrate that flip points can help identify mistakes made by a model, improve the model’s accuracy, and reveal the most influential features for classifications.

Policy Gradient based Quantum Approximate Optimization Algorithm

Sun, 16 Aug 2020 00:00:00 +0000

The quantum approximate optimization algorithm (QAOA), as a hybrid quantum/classical algorithm, has received much interest recently. QAOA can also be viewed as a variational ansatz for quantum control. However, its direct application to emergent quantum technology encounters additional physical constraints: (i) the states of the quantum system are not observable; (ii) obtaining the derivatives of the objective function can be computationally expensive or even inaccessible in experiments, and (iii) the values of the objective function may be sensitive to various sources of uncertainty, as is the case for noisy intermediate-scale quantum (NISQ) devices. Taking such constraints into account, we show that policy-gradient-based reinforcement learning (RL) algorithms are well suited for optimizing the variational parameters of QAOA in a noise-robust fashion, opening up the way for developing RL techniques for continuous quantum control. This is advantageous to help mitigate and monitor the potentially unknown sources of errors in modern quantum simulators. We analyze the performance of the algorithm for quantum state transfer problems in single- and multi-qubit systems, subject to various sources of noise such as error terms in the Hamiltonian, or quantum uncertainty in the measurement process. We show that, in noisy setups, it is capable of outperforming state-of-the-art existing optimization algorithms.

Non-Gaussian processes and neural networks at finite widths

Sun, 16 Aug 2020 00:00:00 +0000

Gaussian processes are ubiquitous in nature and engineering. A case in point is a class of neural networks in the infinite-width limit, whose priors correspond to Gaussian processes. Here we perturbatively extend this correspondence to finite-width neural networks, yielding non-Gaussian processes as priors. The methodology developed herein allows us to track the flow of preactivation distributions by progressively integrating out random variables from lower to higher layers, reminiscent of renormalization-group flow. We further develop a perturbative procedure to perform Bayesian inference with weakly non-Gaussian priors.

Butterfly-Net2: Simplified Butterfly-Net and Fourier Transform Initialization

Sun, 16 Aug 2020 00:00:00 +0000

Structured CNN designed using the prior information of problems potentially improves efficiency over conventional CNNs in various tasks in solving PDEs and inverse problems in signal processing. This paper introduces BNet2, a simplified Butterfly-Net and inline with the conventional CNN. Moreover, a Fourier transform initialization is proposed for both BNet2 and CNN with guaranteed approximation power to represent the Fourier transform operator. Experimentally, BNet2 and the Fourier transform initialization strategy are tested on various tasks, including approximating Fourier transform operator, end-to-end solvers of linear and nonlinear PDEs, and denoising and deblurring of 1D signals. On all tasks, under the same initialization, BNet2 achieves similar accuracy as CNN but has fewer parameters. And Fourier transform initialized BNet2 and CNN consistently improve the training and testing accuracy over the randomly initialized CNN.

Calibrating Multivariate Lévy Processes with Neural Networks

Sun, 16 Aug 2020 00:00:00 +0000

Calibrating a Lévy process usually requires characterizing its jump distribution. Traditionally this problem can be solved with nonparametric estimation using the empirical characteristic functions (ECF), assuming certain regularity, and results to date are mostly in 1D. For multivariate Lévy processes and less smooth Lévy densities, the problem becomes challenging as ECFs decay slowly and have large uncertainty because of limited observations. We solve this problem by approximating the Lévy density with a parametrized functional form; the characteristic function is then estimated using numerical integration. In our benchmarks, we used deep neural networks and found that they are robust and can capture sharp transitions in the Lévy density compared to piecewise linear functions and radial basis functions.

DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM

Sun, 16 Aug 2020 00:00:00 +0000

Machine learning (ML) models trained by differentially private stochastic gradient descent (DP-SGD) have much lower utility than the non-private ones. To mitigate this degradation, we propose a DP Laplacian smoothing SGD (DP-LSSGD) to train ML models with differential privacy (DP) guarantees. At the core of DP-LSSGD is the Laplacian smoothing, which smooths out the Gaussian noise used in the Gaussian mechanism. Under the same amount of noise used in the Gaussian mechanism, DP-LSSGD attains the same DP guarantee, but in practice, DP-LSSGD makes training both convex and nonconvex ML models more stable and enables the trained models to generalize better. The proposed algorithm is simple to implement and the extra computational complexity and memory overhead compared with DP-SGD are negligible. DP-LSSGD is applicable to train a large variety of ML models, including DNNs. The code is available at \url{https://github.com/BaoWangMath/DP-LSSGD}.

NeuPDE: Neural Network Based Ordinary and Partial Differential Equations for Modeling Time-Dependent Data

Sun, 16 Aug 2020 00:00:00 +0000

We propose a neural network based approach for extracting models from dynamic data using ordinary and partial differential equations. In particular, given a time-series or spatio-temporal dataset, we seek to identify an accurate governing system which respects the intrinsic differential structure. The unknown governing model is parameterized by using both (shallow) multilayer perceptrons and nonlinear differential terms, in order to incorporate relevant correlations between spatio-temporal samples. We demonstrate the approach on several examples where the data is sampled from various dynamical systems and give a comparison to recurrent networks and other data-discovery methods. In addition, we show that for SVHN, MNIST, Fashion MNIST, and CIFAR10/100, our approach lowers the parameter cost as compared to other deep neural networks.

Neural network integral representations with the ReLU activation function

Sun, 16 Aug 2020 00:00:00 +0000

In this effort, we derive a formula for the integral representation of a shallow neural network with the ReLU activation function. We assume that the outer weighs admit a finite $L_1$-norm with respect to Lebesgue measure on the sphere. For univariate target functions we further provide a closed-form formula for all possible representations. Additionally, in this case our formula allows one to explicitly solve the least $L_1$-norm neural network representation for a given function.

Geometric Wavelet Scattering Networks on Compact Riemannian Manifolds

Sun, 16 Aug 2020 00:00:00 +0000

The Euclidean scattering transform was introduced nearly a decade ago to improve the mathematical understanding of convolutional neural networks. Inspired by recent interest in geometric deep learning, which aims to generalize convolutional neural networks to manifold and graph-structured domains, we define a geometric scattering transform on manifolds. Similar to the Euclidean scattering transform, the geometric scattering transform is based on a cascade of wavelet filters and pointwise nonlinearities. It is invariant to local isometries and stable to certain types of diffeomorphisms. Empirical results demonstrate its utility on several geometric learning tasks. Our results generalize the deformation stability and local translation invariance of Euclidean scattering, and demonstrate the importance of linking the used filter structures to the underlying geometry of the data.

SchrödingerRNN: Generative modeling of raw audio as a continuously observed quantum state

Sun, 16 Aug 2020 00:00:00 +0000

We introduce SchrödingeRNN, a quantum-inspired generative model for raw audio. Audio data is wave-like and is sampled from a continuous signal. Although generative modeling of raw audio has made great strides lately, relational inductive biases relevant to these two characteristics are mostly absent from models explored to date. Quantum Mechanics is a natural source of probabilistic models of wave behavior. Our model takes the form of a stochastic Schrödinger equation describing the continuous time measurement of a quantum system, and is equivalent to the continuous Matrix Product State (cMPS) representation of wavefunctions in one dimensional many-body systems. This constitutes a deep autoregressive architecture in which the system’s state is a latent representation of the past observations. We test our model on synthetic data sets of stationary and non-stationary signals. This is the first time cMPS are used in machine learning.

Deep learning Markov and Koopman models with physical constraints

Sun, 16 Aug 2020 00:00:00 +0000

The long-timescale behavior of complex dynamical systems can be described by linear Markov or Koopman models in a suitable latent space. Recent variational approaches allow the latent space representation and the linear dynamical model to be optimized via unsupervised machine learning methods. Incorporation of physical constraints such as time-reversibility or stochasticity into the dynamical model has been established for a linear, but not for arbitrarily nonlinear (deep learning) representations of the latent space. Here we develop theory and methods for deep learning Markov and Koopman models that can bear such physical constraints. We prove that the model is an universal approximator for reversible Markov processes and that it can be optimized with either maximum likelihood or the variational approach of Markov processes (VAMP). We demonstrate that the model performs equally well for equilibrium and systematically better for biased data compared to existing approaches, thus providing a tool to study the long-timescale processes of dynamical systems.

On the stable recovery of deep structured linear networks under sparsity constraints

Sun, 16 Aug 2020 00:00:00 +0000

We consider a deep structured linear network under sparsity constraints. We study sharp conditions guaranteeing the stability of the optimal parameters defining the network. More precisely, we provide sharp conditions on the network architecture and the sample under which the error on the parameters defining the network scales linearly with the reconstruction error (i.e. the risk). Therefore, under these conditions, the weights obtained with a successful algorithms are well defined and only depend on the architecture of the network and the sample. The features in the latent spaces are stably defined. The stability property is required in order to interpret the features defined in the latent spaces. It can also lead to a guarantee on the statistical risk. This is what motivates this study. The analysis is based on the recently proposed Tensorial Lifting. The particularity of this paper is to consider a sparsity prior. This leads to a better stability constant. As an illustration, we detail the analysis and provide sharp stability guarantees for convolutional linear network under sparsity prior. In this analysis, we distinguish the role of the network architecture and the sample input. This highlights the requirements on the data in connection to parameter stability.

Landscape Complexity for the Empirical Risk of Generalized Linear Models

Sun, 16 Aug 2020 00:00:00 +0000

We present a method to obtain the average and the typical value of the number of critical points of the empirical risk landscape for generalized linear estimation problems and variants. This represents a substantial extension of previous applications of the Kac-Rice method since it allows to analyze the critical points of high dimensional non-Gaussian random functions. We obtain a rigorous explicit variational formula for the \emph{annealed complexity}, which is the logarithm of the average number of critical points at fixed value of the empirical risk. This result is simplified, and extended, using the non-rigorous Kac-Rice replicated method from theoretical physics. In this way we find an explicit variational formula for the \emph{quenched complexity}, which is generally different from its annealed counterpart, and allows to obtain the number of critical points for typical instances up to exponential accuracy.

The Slow Deterioration of the Generalization Error of the Random Feature Model

Sun, 16 Aug 2020 00:00:00 +0000

The random feature model exhibits a kind of resonance behavior when the number of parameters is close to the training sample size. This behavior is characterized by the appearance of large generalization gap, and is due to the occurrence of very small eigenvalues for the associated Gram matrix. In this paper, we examine the dynamic behavior of the gradient descent algorithm in this regime. We show, both theoretically and experimentally, that there is a dynamic self-correction mechanism at work: The larger the eventual generalization gap, the slower it develops, both because of the small eigenvalues. This gives us ample time to stop the training process and obtain solutions with good generalization property.

SelectNet: Learning to Sample from the Wild for Imbalanced Data Training

Sun, 16 Aug 2020 00:00:00 +0000

Supervised learning from training data with imbalanced class sizes, a commonly encountered scenario in real applications such as anomaly/fraud detection, has long been considered a significant challenge in machine learning. Motivated by recent progress in curriculum and self-paced learning, we propose to adopt a semi-supervised learning paradigm by training a deep neural network, referred to as SelectNet, to selectively add unlabelled data together with their predicted labels to the training dataset. Unlike existing techniques designed to tackle the difficulty in dealing with class imbalanced training data such as resampling, cost-sensitive learning, and margin-based learning, SelectNet provides an end-to-end approach for learning from important unlabelled data “in the wild” that most likely belong to the under-sampled classes in the training data, thus gradually mitigates the imbalance in the data used for training the classifier. We demonstrate the efficacy of SelectNet through extensive numerical experiments on standard datasets in computer vision.

Deep Domain Decomposition Method: Elliptic Problems

Sun, 16 Aug 2020 00:00:00 +0000

This paper proposes a deep-learning-based domain decomposition method (DeepDDM), which leverages deep neural networks (DNN) to discretize the subproblems divided by domain decomposition methods (DDM) for solving partial differential equations (PDE). Using DNN to solve PDE is a physics-informed learning problem with the objective involving two terms, domain term and boundary term, which respectively make the desired solution satisfy the PDE and corresponding boundary conditions. DeepDDM will exchange the subproblem information across the interface in DDM by adjusting the boundary term for solving each subproblem by DNN. Benefiting from the simple implementation and mesh-free strategy of using DNN for PDE, DeepDDM will simplify the implementation of DDM and make DDM more flexible for complex PDE, e.g., those with complex interfaces in the computational domain. This paper will firstly investigate the performance of using DeepDDM for elliptic problems, including a model problem and an interface problem. The numerical examples demonstrate that DeepDDM exhibits behaviors consistent with conventional DDM: the number of iterations by DeepDDM is independent of network architecture and decreases with increasing overlapping size. The performance of DeepDDM on elliptic problems will encourage us to further investigate its performance for other kinds of PDE and may provide new insights for improving the PDE solver by deep learning.

New Potential-Based Bounds for the Geometric-Stopping Version of Prediction with Expert Advice

Sun, 16 Aug 2020 00:00:00 +0000

This work addresses the classic machine learning problem of online prediction with expert advice. A new potential-based framework for the fixed horizon version of this problem has been recently developed using verification arguments from optimal control theory. This paper extends this framework to the random (geometric) stopping version. To obtain explicit bounds, we construct potentials for the geometric version from potentials used for the fixed horizon version of the problem. This construction leads to new explicit lower and upper bounds associated with specific adversary and player strategies. While there are several known lower bounds in the fixed horizon setting, our lower bounds appear to be the first such results in the geometric stopping setting with an arbitrary number of experts. Our framework also leads in some cases to improved upper bounds. For two and three experts, our bounds are optimal to leading order.

Deep Fictitious Play for Finding Markovian Nash Equilibrium in Multi-Agent Games

Sun, 16 Aug 2020 00:00:00 +0000

We propose a deep neural network-based algorithm to identify the Markovian Nash equilibrium of general large $N$-player stochastic differential games. Following the idea of fictitious play, we recast the $N$-player game into $N$ decoupled decision problems (one for each player) and solve them iteratively. The individual decision problem is characterized by a semilinear Hamilton-Jacobi-Bellman equation, to solve which we employ the recently developed deep BSDE method. The resulted algorithm can solve large $N$-player games for which conventional numerical methods would suffer from the curse of dimensionality. Multiple numerical examples involving identical or heterogeneous agents, with risk-neutral or risk-sensitive objectives, are tested to validate the accuracy of the proposed algorithm in large group games. Even for a fifty-player game with the presence of common noise, the proposed algorithm still finds the approximate Nash equilibrium accurately, which, to our best knowledge, is difficult to achieve by other numerical algorithms.

Robust Training and Initialization of Deep Neural Networks: An Adaptive Basis Viewpoint

Sun, 16 Aug 2020 00:00:00 +0000

Motivated by the gap between theoretical optimal approximation rates of deep neural networks (DNNs) and the accuracy realized in practice, we seek to improve the training of DNNs. The adoption of an adaptive basis viewpoint of DNNs leads to novel initializations and a hybrid least squares/gradient descent optimizer. We provide analysis of these techniques and illustrate via numerical examples dramatic increases in accuracy and convergence rate for benchmarks characterizing scientific applications where DNNs are currently used, including regression problems and physics-informed neural networks for the solution of partial differential equations.

Large deviations for the perceptron model and consequences for active learning

Sun, 16 Aug 2020 00:00:00 +0000

Active learning is a branch of machine learning that deals with problems where unlabeled data is abundant yet obtaining labels is expensive. The learning algorithm has the possibility of querying a limited number of samples to obtain the corresponding labels, subsequently used for supervised learning. In this work, we consider the task of choosing the subset of samples to be labeled from a fixed finite pool of samples. We assume the pool of samples to be a random matrix and the ground truth labels to be generated by a single-layer teacher random neural network. We employ replica methods to analyze the large deviations for the accuracy achieved after supervised learning on a subset of the original pool. These large deviations then provide optimal achievable performance boundaries for any active learning algorithm. We show that the optimal learning performance can be efficiently approached by simple message-passing active learning algorithms. We also provide a comparison with the performance of some other popular active learning strategies.

Gating creates slow modes and controls phase-space complexity in GRUs and LSTMs

Sun, 16 Aug 2020 00:00:00 +0000

Recurrent neural networks (RNNs) are powerful dynamical models for data with complex temporal structure. However, training RNNs has traditionally proved challenging due to exploding or vanishing of gradients. RNN models such as LSTMs and GRUs (and their variants) significantly mitigate these issues associated with training by introducing various types of {\it gating} units into the architecture. While these gates empirically improve performance, how the addition of gates influences the dynamics and trainability of GRUs and LSTMs is not well understood. Here, we take the perspective of studying randomly initialized LSTMs and GRUs as dynamical systems, and ask how the salient dynamical properties are shaped by the gates. We leverage tools from random matrix theory and mean-field theory to study the state-to-state Jacobians of GRUs and LSTMs. We show that the update gate in the GRU and the forget gate in the LSTM can lead to an accumulation of slow modes in the dynamics. Moreover, the GRU update gate can poise the system at a marginally stable point. The reset gate in the GRU and the output and input gates in the LSTM control the spectral radius of the Jacobian, and the GRU reset gate also modulates the complexity of the landscape of fixed-points. Furthermore, for the GRU we obtain a phase diagram describing the statistical properties of fixed-points. We also provide a preliminary comparison of training performance to the various dynamical regimes realized by varying hyperparameters. Looking to the future, we have introduced a powerful set of techniques which can be adapted to a broad class of RNNs, to study the influence of various architectural choices on dynamics, and potentially motivate the principled discovery of novel architectures.

Quantum Ground States from Reinforcement Learning

Sun, 16 Aug 2020 00:00:00 +0000

Finding the ground state of a quantum mechanical system can be formulated as an optimal control problem. In this formulation, the drift of the optimally controlled process is chosen to match the distribution of paths in the Feynman–Kac (FK) representation of the solution of the imaginary time Schrödinger equation. This provides a variational principle that can be used for reinforcement learning of a neural representation of the drift. Our approach is a drop-in replacement for path integral Monte Carlo, learning an optimal importance sampler for the FK trajectories. We demonstrate the applicability of our approach to several problems of one-, two-, and many-particle physics.

Exact asymptotics for phase retrieval and compressed sensing with random generative priors

Sun, 16 Aug 2020 00:00:00 +0000

We consider the problem of compressed sensing and of (real-valued) phase retrieval with random measurement matrix. We derive sharp asymptotics for the information-theoretically optimal performance and for the best known polynomial algorithm for an ensemble of generative priors consisting of fully connected deep neural networks with random weight matrices and arbitrary activations. We compare the performance to sparse separable priors and conclude that in all cases analysed generative priors have a smaller statistical-to-algorithmic gap than sparse priors, giving theoretical support to previous experimental observations that generative priors might be advantageous in terms of algorithmic performance. In particular, while sparsity does not allow to perform compressive phase retrieval efficiently close to its information-theoretic limit, it is found that under the random generative prior compressed phase retrieval becomes tractable.

Rademacher complexity and spin glasses: A link between the replica and statistical theories of learning

Sun, 16 Aug 2020 00:00:00 +0000

Statistical learning theory provides bounds of the generalization gap, using in particular the Vapnik-Chervonenkis dimension and the Rademacher complexity. An alternative approach, mainly studied in the statistical physics literature, is the study of generalization in simple synthetic-data models. Here we discuss the connections between these approaches and focus on the link between the Rademacher complexity in statistical learning and the theories of generalization for \emph{typical-case} synthetic models from statistical physics, involving quantities known as \emph{Gardner capacity} and \emph{ground state energy}. We show that in these models the Rademacher complexity is closely related to the ground state energy computed by replica theories. Using this connection, one may reinterpret many results of the literature as rigorous Rademacher bounds in a variety of models in the high-dimensional statistics limit. Somewhat surprisingly, we also show that statistical learning theory provides predictions for the behavior of the ground-state energies in some full replica-symmetry breaking models.

Data-driven Compact Models for Circuit Design and Analysis

Sun, 16 Aug 2020 00:00:00 +0000

Compact semiconductor device models are essential for efficiently designing and analyzing large circuits. However, traditional compact model development requires a large amount of manual effort and can span many years. Moreover, inclusion of new physics (\eg{}, radiation effects) into an existing model is not trivial and may require redevelopment from scratch. Machine Learning (ML) techniques have the potential to automate and significantly speed up the development of compact models. In addition, ML provides a range of modeling options that can be used to develop hierarchies of compact models tailored to specific circuit design stages. In this paper, we explore three such options: (1) table-based interpolation, (2) Generalized Moving Least-Squares, and (3) feed-forward Deep Neural Networks, to develop compact models for a p-n junction diode. We evaluate the performance of these “data-driven” compact models by (1) comparing their voltage-current characteristics against laboratory data, and (2) building a bridge rectifier circuit using these devices, predicting the circuit’s behavior using SPICE-like circuit simulations, and then comparing these predictions against laboratory measurements of the same circuit.