We have performed a detailed analysis of the fast multipole method (FMM) in the adaptive case, in which the depth of the FMM tree is nonuniform. We show that O(N) complexity is achievable for any distribution of particles when a modified adaptive FMM is used, and we analyze how the FMM performs on fractal point distributions. A new double-threshold subdivision method is introduced and shown to deliver better performance. A three-dimensional, kernel-independent, black-box adaptive FMM is implemented and used for all calculations.
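The adaptive tree construction underlying such methods can be sketched with a standard particle-count subdivision rule (a minimal illustrative quadtree; the double-threshold criterion of the paper is more elaborate and is not reproduced here, and all names are illustrative):

```python
import random

def build_quadtree(points, xlo, ylo, size, max_pts=4, depth=0, max_depth=16):
    """Adaptively subdivide [xlo, xlo+size) x [ylo, ylo+size):
    a cell splits while it holds more than max_pts particles."""
    node = {"points": points, "children": []}
    if len(points) <= max_pts or depth == max_depth:
        return node  # leaf: few enough particles (or depth cap reached)
    half = size / 2.0
    for dx in (0, 1):
        for dy in (0, 1):
            cx, cy = xlo + dx * half, ylo + dy * half
            sub = [p for p in points
                   if cx <= p[0] < cx + half and cy <= p[1] < cy + half]
            if sub:  # adaptive: empty cells are never created
                node["children"].append(
                    build_quadtree(sub, cx, cy, half, max_pts, depth + 1, max_depth))
    return node

def leaves(node):
    if not node["children"]:
        yield node
    else:
        for child in node["children"]:
            yield from leaves(child)

random.seed(0)
# nonuniform (clustered) distribution -> nonuniform tree depth
pts = [(0.1 * random.random(), 0.1 * random.random()) for _ in range(200)]
pts += [(random.random(), random.random()) for _ in range(50)]
tree = build_quadtree(pts, 0.0, 0.0, 1.0)
leaf_sizes = [len(leaf["points"]) for leaf in leaves(tree)]
```

Because empty children are skipped and splitting stops as soon as a cell is sparse, the tree is deep only where particles cluster, which is the regime the adaptive complexity analysis addresses.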

Inverting sparse matrices with standard direct solvers is robust but computationally expensive. Iterative solvers, on the other hand, scale better but must be paired with an appropriate preconditioner, and the choice of an effective preconditioner is highly problem dependent. We propose a novel, fully algebraic sparse solver whose complexity is linear in the problem size. The method can be used as a stand-alone direct solver with linear complexity and tunable accuracy, or as a black-box preconditioner in conjunction with iterative methods. It is based on low-rank approximation of the fill-in generated during elimination.

In this paper, we propose a preconditioner with broad applicability and O(N) cost for dense matrices arising from a smooth kernel. The preconditioner has controllable accuracy. Linear scaling is achieved by means of two key ideas. First, the H2 structure of the dense matrix is exploited to obtain an extended sparse system of equations. Second, fill-in arising during elimination is compressed into low-rank form whenever it corresponds to well-separated interactions. Numerical examples demonstrate the linear scaling of the method and illustrate its effectiveness as a preconditioner.
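The second ingredient, low-rank compression of well-separated interactions, can be illustrated with a simple adaptive cross approximation (ACA) of a smooth-kernel block between two distant clusters. This is a generic sketch, not the compression scheme of the paper; the 1/|x - y| kernel and all names are illustrative:

```python
def kernel(x, y):
    # smooth kernel evaluated between two well-separated 1-D clusters
    return 1.0 / abs(x - y)

xs = [0.01 * i for i in range(50)]          # sources in [0, 0.5)
ys = [10.0 + 0.01 * j for j in range(50)]   # targets in [10, 10.5)

def aca(m, n, entry, tol=1e-10, max_rank=25):
    """Partially pivoted ACA: build A ~= sum_k u_k v_k^T from matrix crosses."""
    us, vs = [], []
    used_rows = {0}
    i = 0
    for _ in range(max_rank):
        # residual of row i under the current approximation
        row = [entry(i, j) - sum(u[i] * v[j] for u, v in zip(us, vs))
               for j in range(n)]
        jp = max(range(n), key=lambda j: abs(row[j]))
        piv = row[jp]
        if abs(piv) < tol:
            break  # residual negligible: numerical rank reached
        col = [(entry(p, jp) - sum(u[p] * v[jp] for u, v in zip(us, vs))) / piv
               for p in range(m)]
        us.append(col)
        vs.append(row)
        # next pivot row: largest entry of the new column among unused rows
        i = max((p for p in range(m) if p not in used_rows),
                key=lambda p: abs(col[p]))
        used_rows.add(i)
    return us, vs

us, vs = aca(len(xs), len(ys), lambda i, j: kernel(xs[i], ys[j]))
rank = len(us)
err = max(abs(kernel(xs[i], ys[j]) - sum(u[i] * v[j] for u, v in zip(us, vs)))
          for i in range(len(xs)) for j in range(len(ys)))
```

Because the clusters are well separated, the 50 x 50 block is compressed to a rank far below 50 at near machine precision, which is exactly why storing such fill-in in low-rank form preserves the linear complexity.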

Approximate factorization preconditioners, such as incomplete LU factorization, provide cheap approximations to the system matrix. However, even a highly accurate preconditioner can deteriorate as the condition number of the system matrix grows. By improving the accuracy on low-frequency error components, we obtain a novel hierarchical solver with improved robustness with respect to the condition number of the linear system, while retaining the linear computational cost and memory footprint of the original algorithm.

We have developed a parallel version of the LoRaSp algorithm to solve large sparse systems on distributed-memory machines. The factorization time of our parallel solver scales almost linearly with the problem size for three-dimensional problems, as opposed to the quadratic scaling of many existing sparse direct solvers. Moreover, when used as a preconditioner for Poisson problems, our solver yields an almost constant iteration count. The parallel algorithm also achieves significant speed-ups on multiple processors and, as demonstrated by our numerical experiments, can solve large problems much faster than many existing packages.

The interest in and demand for training deep neural networks have been growing rapidly, spanning a wide range of applications in both academia and industry. However, training them at scale in a distributed setting remains difficult due to the complex ecosystem of tools and hardware involved. One consequence is that the responsibility of orchestrating these components is often left to one-off scripts and glue code customized for specific problems. To address these restrictions, we introduce Alchemist, an internal service built at Apple from the ground up for easy, fast, and scalable distributed training. We discuss its design and implementation, with examples of running different flavors of distributed training. We also present case studies of its internal adoption in the development of autonomous systems, where training times have been reduced by 10x to keep up with ever-growing data collection.

Quantizing the weights and activations of deep neural networks yields significant improvements in inference efficiency at the cost of lower accuracy, and a major source of the accuracy gap between full-precision and quantized models is the quantization error. In this work, we focus on binary quantization, in which values are mapped to -1 and 1, and provide a unified framework for analyzing different scaling strategies. Inspired by the Pareto optimality of 2-bit versus 1-bit quantization, we introduce a novel 2-bit quantization with provably minimal squared error. Our quantization algorithms can be implemented efficiently in hardware using bitwise operations. We present proofs that the proposed methods are optimal, provide an empirical error analysis, and show through experiments on the ImageNet dataset that the proposed least-squares quantization algorithms reduce the accuracy gap.
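The least-squares principle in the 1-bit case admits a closed-form scale: minimizing the squared error of w ~ alpha * sign(w) gives alpha equal to the mean absolute weight (a classical result; the paper's 2-bit construction is not reproduced here, and the weight vector below is an illustrative stand-in):

```python
def binarize_least_squares(w):
    """1-bit quantization w ~= alpha * sign(w).
    For fixed signs s = sign(w), the squared error sum((w_i - a*s_i)^2)
    is a convex quadratic in a, minimized at a = mean(|w_i|)."""
    s = [1.0 if x >= 0 else -1.0 for x in w]
    alpha = sum(abs(x) for x in w) / len(w)
    return alpha, s

w = [0.7, -1.2, 0.1, -0.4, 0.9]       # illustrative weights
alpha, s = binarize_least_squares(w)

def sq_err(a):
    return sum((x - a * si) ** 2 for x, si in zip(w, s))
```

Since sign and scale factor separate, inference can keep the bitwise +-1 arithmetic and apply the single scalar alpha afterwards.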

We use 8 NVIDIA V100 GPUs to train CIFAR-10 in under 11 seconds using mini-batches of 2048 elements, with 256 elements per GPU. Our code modifies David C. Page's bag-of-tricks implementation to take advantage of NVIDIA's NCCL through PyTorch's distributed training framework.
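The actual training uses PyTorch with the NCCL backend; as a library-free sketch of what the gradient all-reduce accomplishes, averaging the per-GPU gradients of equal-size local mini-batches reproduces the gradient of the full global batch (toy scalar "gradients" stand in for real per-example gradients):

```python
import random

# 8 workers, 256 examples each -> global batch of 2048, as in the setup above
n_workers, per_worker = 8, 256

random.seed(1)
# toy per-example gradient values (stand-ins for dL_i/dw under a mean loss)
examples = [random.uniform(-1, 1) for _ in range(n_workers * per_worker)]

# each "GPU" computes the mean gradient over its local shard
shards = [examples[k * per_worker:(k + 1) * per_worker] for k in range(n_workers)]
local_grads = [sum(shard) / len(shard) for shard in shards]

# all-reduce (average) across workers, the step NCCL performs on device
allreduced = sum(local_grads) / n_workers

# reference: gradient of the full 2048-element batch on a single worker
full_batch = sum(examples) / len(examples)
```

Equality of the two quantities (up to rounding) is what makes data-parallel training with synchronized all-reduce mathematically equivalent to large-batch training.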

In this project, we predict whether a given stock's price will rise on the day following an earnings announcement. We cast this as a binary classification problem, which can be tackled using the large amount of publicly available data. The project consists of two major tasks: data collection and the application of machine learning algorithms. We discuss our efforts to collect the required data and compare several classification algorithms.

We implemented the game 2048 in Python and applied several AI techniques to it. We found that pruning some of the computer's moves accelerates the search with little loss in playing performance. In addition, we introduced an evaluation function based on our domain-specific knowledge of the game.
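A common search for 2048 is expectimax, where the computer's random tile spawn is a chance node; pruning its low-probability branches mirrors the speed-up described above. The sketch below runs on an abstract toy tree, not our actual 2048 implementation:

```python
def expectimax(node, prune_below=0.0):
    """node is ("leaf", value), ("max", [children]), or
    ("chance", [(prob, child), ...]). Chance branches with probability
    below prune_below are dropped and the rest renormalized."""
    kind = node[0]
    if kind == "leaf":
        return node[1]
    if kind == "max":  # player chooses the best move
        return max(expectimax(c, prune_below) for c in node[1])
    # chance node: random tile spawn, expectation over surviving branches
    branches = [(p, c) for p, c in node[1] if p >= prune_below]
    total = sum(p for p, _ in branches)
    return sum(p * expectimax(c, prune_below) for p, c in branches) / total

# toy position: two moves, each followed by a random event
tree = ("max", [
    ("chance", [(0.9, ("leaf", 10.0)), (0.1, ("leaf", 2.0))]),
    ("chance", [(0.95, ("leaf", 8.0)), (0.05, ("leaf", 100.0))]),
])
exact = expectimax(tree)                     # evaluates every branch
pruned = expectimax(tree, prune_below=0.5)   # skips the rare outcomes
```

Note that pruning can change the chosen move (here it trades a rare jackpot branch for speed), which is exactly the accuracy/cost trade-off observed in practice.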

In this work, we use convolutional neural networks (CNNs) trained on GPUs to classify images in the Tiny ImageNet dataset. Specifically, we pursue two different goals. First, we train a relatively deep network with a large number of filters per convolutional layer to achieve high accuracy on the test dataset. Second, we train several instances of a slightly shallower classifier with fewer parameters to build a dataset that allows a thorough study of ensemble techniques. We introduce several notions, namely “model confidence”, “class-specific accuracy”, and “prediction frequency”, use these quantities to study various ensemble methods, and provide some insight into the behavior of the CNN classifier for certain input images.
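Two of these ingredients can be sketched concretely: taking model confidence as the top softmax probability, and comparing majority vote against probability averaging across ensemble members (the three probability vectors are mock stand-ins for trained CNN outputs, not our actual models):

```python
# mock softmax outputs of three ensemble members over three classes
probs = [
    [0.6, 0.3, 0.1],   # member 1: confident in class 0
    [0.2, 0.5, 0.3],   # member 2: leans toward class 1
    [0.1, 0.7, 0.2],   # member 3: confident in class 1
]

def confidence(p):
    # "model confidence": probability assigned to the predicted class
    return max(p)

# majority vote over per-member argmax predictions
votes = [max(range(3), key=lambda c: p[c]) for p in probs]
majority = max(set(votes), key=votes.count)

# probability averaging: argmax of the mean softmax vector
avg = [sum(p[c] for p in probs) / len(probs) for c in range(3)]
avg_pick = max(range(3), key=lambda c: avg[c])
```

The two combination rules can disagree when a single member is very confident; tracking per-member confidence is one way to diagnose such cases.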

In this study, we explore various natural language processing (NLP) methods for sentiment analysis. We look at two datasets, one with binary labels and one with multi-class labels. For binary classification, we applied bag-of-words and skip-gram word2vec models followed by various classifiers, including random forests, SVMs, and logistic regression. For the multi-class case, we implemented recursive neural tensor networks (RNTNs). To overcome the high computational cost of training the standard RNTN, we introduce the low-rank RNTN, in which the matrices involved in the quadratic term are replaced by symmetric low-rank matrices. We show that the low-rank RNTN leads to significant savings in computational cost.
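The source of the savings can be sketched directly: each tensor slice of the RNTN quadratic term evaluates x^T M x, and replacing M with a symmetric low-rank factor U U^T turns this into ||U^T x||^2, cutting both parameters and work from d^2 to d*r per slice. The sizes and values below are illustrative, not taken from the trained model:

```python
d, r = 6, 2  # illustrative: embedding dimension d, rank r << d

def full_quadratic(M, x):
    # standard RNTN slice: x^T M x, about d*d multiply-adds
    return sum(x[i] * M[i][j] * x[j] for i in range(d) for j in range(d))

def lowrank_quadratic(U, x):
    # low-rank slice: x^T (U U^T) x = ||U^T x||^2, about d*r multiply-adds
    return sum(sum(U[i][k] * x[i] for i in range(d)) ** 2 for k in range(r))

# build a symmetric rank-r matrix M = U U^T from an arbitrary factor U
U = [[0.1 * (i + 1), 0.05 * (i - 2)] for i in range(d)]
M = [[sum(U[i][k] * U[j][k] for k in range(r)) for j in range(d)]
     for i in range(d)]
x = [1.0, -0.5, 0.25, 0.0, 2.0, -1.0]

full_val = full_quadratic(M, x)
low_val = lowrank_quadratic(U, x)
```

With d around the usual embedding size and r small, the per-slice parameter count drops from d^2 to d*r, which is where the reported computational savings come from.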

We have developed a parallel C++/MPI-based simulation code for variable-density particle-laden turbulent flows. The fluid is represented on a uniform Eulerian staggered grid, while particles are modeled in a Lagrangian point-particle framework. The spatial discretization is second-order accurate and the time integration fourth-order accurate. Two-way coupling of the particles with the background flow is included in both the momentum and energy equations. The code is fully modular and abstracted, and can easily be extended or modified. We have considered two different boundary conditions, and have also developed a novel parallel linear solver for the variable-density Poisson equation that arises in the calculation.

The working principle of particle-based solar receivers is to utilize the absorptivity of a dispersed particle phase in an otherwise optically transparent carrier fluid. In comparison to their traditional counterparts, which use a solid surface for radiation absorption, particle-based receivers offer a number of opportunities for improved efficiency and heat transfer uniformity. The physical phenomena at the core of such receivers involve coupling between particle transport, fluid turbulence, and radiative heat transfer. We have performed three-dimensional direct numerical simulations of turbulent flows coupled with radiative heating and particle transport over a range of particle Stokes numbers. Our study demonstrates that particle preferential concentration has strong implications for the heat transfer statistics: in a typical setting, the preferential concentration of particles reduces the effective heat transfer between particles and the gas by as much as 25%.

In this study, we consider particle-laden turbulent flows with significant heat transfer between the two phases due to sustained heating of the particle phase. Our objective is to investigate the effects of fluid heating by a dispersed phase on the turbulence evolution. We considered decaying homogeneous isotropic turbulence laden with heated particles over a wide range of particle Stokes numbers, and applied a high-fidelity framework to perform spectral analysis of kinetic energy in a variable-density fluid. Our results indicate that particle heating can considerably influence the turbulence cascade. We show that the pressure-dilatation term introduces turbulent kinetic energy at a range of scales consistent with the scales observed in particle clusters.

We study the case of inertial particles heated by thermal radiation while settling by gravity through a turbulent transparent gas. We consider dilute and optically thin regimes in which each particle receives the same heat flux. Numerical simulations of forced homogeneous turbulence are performed taking into account the two-way coupling of both momentum and temperature between the dispersed and continuous phases. Particles much smaller than the smallest flow scales are considered and the point-particle approximation is adopted. The particle Stokes number (based on the Kolmogorov time scale) is of order unity, while the nominal settling velocity is up to an order of magnitude larger than the Kolmogorov velocity, marking a critical difference from previous two-way coupled simulations. It is found that non-heated particles enhance turbulence when their settling velocity is sufficiently high compared to the Kolmogorov velocity. Energy spectra show that non-heated particle settling impacts both the very small and very large flow scales, while the intermediate scales are weakly affected. When heated, particles shed plumes of buoyant gas, further modifying the turbulence structure. At the considered radiation intensities, clustering is strong but the classic mechanism of preferential concentration is modified, while preferential sweeping is eliminated or even reversed. Particle heating also produces a significant reduction of the mean settling velocity, driven by rising buoyant plumes in the vicinity of particle clusters. The turbulent kinetic energy is affected non-monotonically as the radiation intensity is increased, due to the competing effects of the downward gravitational force and the upward buoyancy force. The thermal radiation influences all scales of the turbulence. The effects of settling and buoyancy on the turbulence anisotropy are also discussed.

The goal of the present work is to assess the ability of Eulerian moment methods to reproduce the physics of two-way coupled particle-laden turbulent flows. We show that Eulerian methods need resolutions finer than the nominal Kolmogorov scale to capture particle segregation statistics, whereas gas- and disperse-phase velocity variances can be captured with resolutions comparable to the Kolmogorov length. The work is then extended to address whether Eulerian methods are suitable in scenarios where the continuum field of interest (temperature or momentum) is itself primarily driven by the particles. For each case, corresponding Lagrangian calculations are developed and the convergence of statistics with respect to the number of particles is established; the statistically converged Lagrangian and Eulerian results are then compared. The results show that accurately capturing segregation with Eulerian methods always requires resolutions much finer than the nominal Kolmogorov scale.

Preferential concentration of inertial particles by turbulence is a well-recognized phenomenon. This study investigates how it affects the mean heat transfer between the fluid and particle phases. Using direct numerical simulation of turbulent flows with Lagrangian point-particle tracking, we explore this phenomenon over a wide range of input parameters. Among the nine independent dimensionless numbers defining the problem, we show that the particle Stokes number and a newly identified group, which we call the heat mixing parameter, have the most significant effect on particle-to-gas heat transfer, while variations in the other dimensionless numbers can be neglected. Based on our numerical results, we propose an algebraic reduced-order model for heat transfer in particle-laden turbulence.