# TAG-ML 2023 Papers and Posters

Abstract:

In this paper, we introduce a novel unsupervised, graph-based filter feature selection technique which exploits the power of topologically constrained network representations. We model dependency structures among features using a family of chordal graphs (i.e. the Triangulated Maximally Filtered Graph), and we maximise the likelihood of features' relevance by studying their relative position inside the network. Such an approach presents three aspects that are particularly satisfactory compared to its alternatives: (i) it is highly tunable and easily adaptable to the nature of input data; (ii) it is fully explainable, maintaining, at the same time, a remarkable level of simplicity; (iii) it is computationally cheap. We test our algorithm on 16 benchmark datasets from different application domains showing that it outperforms or matches the current state-of-the-art under heterogeneous evaluation conditions.

### Abstract:

Few Shot Class Incremental Learning (FSCIL) with few examples per class for each incremental session is the realistic setting of continual learning since obtaining large number of annotated samples is not feasible and cost effective. We present the framework MASIL as a step towards learning the maximal separable classifier. It addresses the common problem i.e forgetting of old classes and over-fitting to novel classes by learning the classifier weights to be maximally separable between classes forming a simplex Equiangular Tight Frame. We propose the idea of concept factorization explaining the collapsed features for base session classes in terms of concept basis and use these to induce classifier simplex for few shot classes. We further adds fine tuning to reduce any error occurred during factorization and train the classifier jointly on base and novel classes without retaining any base class samples in memory. Experimental results on miniImageNet, CIFAR-100 and CUB-200 demonstrate that MASIL outperforms all the benchmarks.

Abstract:

We propose a family of curvature-based regularization terms for deep generative model learning. Explicit coordinate-invariant formulas for both intrinsic and extrinsic curvature measures are derived for the case of arbitrary data manifolds embedded in higher-dimensional Euclidean space. Because computing the curvature is a highly computation-intensive process involving the evaluation of second-order derivatives, efficient formulas are derived for approximately evaluating intrinsic and extrinsic curvatures. Comparative studies are conducted that compare the relative efficacy of intrinsic versus extrinsic curvature-based regularization measures, as well as performance comparisons against existing autoencoder training methods. Experiments involving noisy motion capture data confirm that curvature-based methods outperform existing autoencoder regularization methods, with intrinsic curvature measures slightly more effective than extrinsic curvature measures.

Abstract:

Data sets of multivariate normal distributions abound in many scientific areas like diffusion tensor medical imaging, structure tensor computer vision, radar signal processing, machine learning, etc. In order to process those data sets for downstream tasks like filtering, classification or clustering, one needs to define proper notions of dissimilarities and paths joining normal distributions. The Fisher-Rao distance defined as the Riemannian geodesic distance induced by the Fisher information is such a principled distance which however is not known in closed-form excepts on a few particular cases. We first report a fast and robust method to approximate arbitrarily finely the Fisher-Rao distance between normal distributions. Second, we introduce a distance based on a diffeomorphic embedding of the Gaussian manifold into a submanifold of the higher-dimensional symmetric positive-definite cone. We show that the projective Hilbert distance on the cone is a metric on the embedded Gaussian submanifold and pullback that distance with the straight line Hilbert cone geodesics to obtain a distance and paths between normal distributions. Compared to the Fisher-Rao distance approximation, the pullback Hilbert cone distance is computationally light since it requires to compute only extreme eigenvalues of matrices. Finally, we show how to use those distances in clustering tasks.

Abstract:

The solution set of a system of polynomial equations typically contains ill-behaved, singular points. Resolution is a fundamental process in geometry in which we replace singular points with smooth points, while keeping the rest of the solution set unchanged. Resolutions are not unique: the usual way to describe them involves repeatedly performing a fundamental operation known as "blowing-up", and the complexity of the resolution highly depends on certain choices. The process can be translated into various versions of a 2-player game, the so-called Hironaka game, and a winning strategy for the first player provides a solution to the resolution problem. In this paper we introduce a new approach to the Hironaka game that uses reinforcement learning agents to find optimal resolutions of singularities. In certain domains, the trained model outperforms state-of-the-art selection heuristics in total number of polynomial additions performed, which provides a proof-of-concept that recent developments in machine learning have the potential to improve performance of algorithms in symbolic computation.

Abstract:

Learning to predict properties of large graphs is challenging because each prediction requires the knowledge of an entire graph, while the amount of memory available during training is bounded. Here we propose Graph Segment Training (GST), a general framework that utilizes a divide-and-conquer approach to allow learning large graph property prediction with a constant memory footprint. GST first divides a large graph into segments and then backpropagates through only a few segments sampled per training iteration. We refine the GST paradigm by introducing a historical embedding table to efficiently obtain embeddings for segments not sampled for backpropagation. To mitigate the staleness of historical embeddings, we design two novel techniques. First, we finetune the prediction head to fix the input distribution shift. Second, we introduce Stale Embedding Dropout to drop some stale embeddings during training to reduce bias. We evaluate our complete method GST-EFD (with all the techniques together) on two large graph property prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST-EFD is both memory-efficient and fast, while offering a slight boost on test accuracy over a typical full graph training regime.

Abstract:

A ReLU neural network leads to a finite polyhedral decomposition of input space and a corresponding finite dual graph. We show that while this dual graph is a coarse quantization of input space, it is sufficiently robust that it can be combined with persistent homology to detect homological signals of manifolds in the input space from samples. This property holds for a wide range of networks trained for a wide range of purposes that have nothing to do with this topological application. We found this feature to be surprising and interesting; we hope it will also be useful.

Abstract:

Topology describes the essential structure of a space, and in 4D, a larger variety of topologically distinct manifolds can be embedded versus 2D or 3D. The present study investigates an end-to-end visual approach, which couples data generation software and convolutional neural networks (CNNs) to estimate the topology of 4D data. A synthetic 4D training data set is generated with the use of several manifolds, and then labelled with their associated Betti numbers by using techniques from algebraic topology. Several approaches to implementing a 4D convolution layer are compared. Experiments demonstrate that already a basic CNN can be trained to provide estimates for the Betti numbers associated with the number of one-, two-, and three-dimensional holes in the data. Some of the intricacies of topological data analysis in the 4D setting are also put on view, including aspects of persistent homology.

Abstract:

This work proposes a geometric insight into equivariant message passing on Riemannian manifolds. As previously proposed, numerical features on Riemannian manifolds are represented as coordinate-independent feature fields on the manifold. To any coordinate-independent feature field on a manifold comes attached an equivariant embedding of the principal bundle to the space of numerical features. We argue that the metric this embedding induces on the numerical feature space should optimally preserve the principal bundle’s original metric. This optimality criterion leads to the minimization of a twisted form of the Polyakov action with respect to the graph of this embedding, yielding an equivariant diffusion process on the associated vector bundle. We obtain a message passing scheme on the manifold by discretizing the diffusion equation flow for a fixed time step. We propose a higher- order equivariant diffusion process equivalent to diffusion on the cartesian product of the base manifold. The discretization of the higher-order diffusion process on a graph yields a new general class of equivariant GNN, generalizing the ACE and MACE formalism to data on Riemannian manifolds.

Abstract:

In this paper, we present a novel interpretation of the Weisfeiler-Lehman (WL) distance introduced by \cite{chen2022weisfeilerlehman} using concepts from stochastic processes. The WL distance aims compares graphs with node features, has the same discriminative power as the classic Weisfeiler-Lehman graph isomorphism test and has deep connections to the Gromov-Wasserstein distance. Our interpretation connects the WL distance to the literature on distances for stochastic processes, which also makes the interpretation of the distance more accessible and intuitive. We further explore the connections between the WL distance and certain Message Passing Neural Networks, and discuss the implications of the WL distance for understanding the Lipschitz property and the universal approximation results for these networks.

Abstract:

Graph neural networks (GNNs) are widely used for modeling complex interactions between entities represented as vertices of a graph. Despite recent efforts to theoretically analyze the expressive power of GNNs, a formal characterization of their ability to model interactions is lacking. The current paper aims to address this gap. Formalizing strength of interactions through an established measure known as separation rank, we quantify the ability of certain GNNs to model interaction between a given subset of vertices and its complement, i.e. between the sides of a given partition of input vertices. Our results reveal that the ability to model interaction is primarily determined by the partition's walk index --- a graph-theoretical characteristic defined by the number of walks originating from the boundary of the partition. Experiments with common GNN architectures corroborate this finding. As a practical application of our theory, we design an edge sparsification algorithm named Walk Index Sparsification (WIS), which preserves the ability of a GNN to model interactions when input edges are removed. WIS is simple, computationally efficient, and in our experiments has markedly outperformed alternative methods in terms of induced prediction accuracy. More broadly, it showcases the potential of improving GNNs by theoretically analyzing the interactions they can model.

Abstract:

Wasserstein dictionary learning is an unsupervised approach to learning a collection of probability distributions that generate observed distributions as Wasserstein barycentric combinations. Existing methods solve an optimization problem that only seeks a dictionary and weights that minimize the reconstruction accuracy. However, there is no a priori reason to believe there are unique solutions in general to this problem. Moreover, the learned dictionary is, by design, optimized to represent the observed data set, and may not be useful for classification tasks or generative modeling. Just as regularization plays a key role in linear dictionary learning, we propose a geometric regularizer for Wasserstein space that promotes representations of a data distributions using nearby dictionary elements. We show that this regularizer leads to barycentric weights that concentrate on dictionary atoms local to each data distribution. When data are generated as Wasserstein barycenters of fixed distributions, this regularizer facilitates the recovery of the generating distributions in cases that are ill-posed for unregularized Wasserstein dictionary learning. Through experimentation on synthetic and real data, we show that our geometrically regularized approach yields more interpretable dictionaries in Wasserstein space which perform better in downstream applications.

Abstract:

Geometric deep learning enables the encoding of physical symmetries in modeling 3D objects. Despite rapid progress in encoding 3D symmetries into Graph Neural Networks (GNNs), a comprehensive evaluation of the expressiveness of these networks through a local-to-global analysis lacks today. In this paper, we propose a local hierarchy of 3D isomorphism to evaluate the expressive power of equivariant GNNs and investigate the process of representing global geometric information from local patches. Our work leads to two crucial modules for designing expressive and efficient geometric GNNs; namely local substructure encoding (\textbf{LSE}) and frame transition encoding (\textbf{FTE}). To demonstrate the applicability of our theory, we propose LEFTNet which effectively implements these modules and achieves state-of-the-art performance on both scalar-valued and vector-valued molecular property prediction tasks. We further point out the design space for future developments of equivariant graph neural networks.

Abstract:

Recurrent neural networks (RNNs) have emerged as powerful models capable of storing and manipulating external information over long periods in various domains. Yet, the mechanisms that underly this behavior remain a mystery due to the black-box nature of these models. This paper addresses this question by proposing an episodic memory theory of RNN dynamics, enabling a more comprehensive understanding of the RNN weights as memories and inter-memory interactions. This approach sheds light on the inner workings of RNNs and connects to existing research on memory representation and organization. The theory extends the current linearization approaches by providing alternative interpretations of the eigenspectrum and its connection to the long-term storage and manipulation of information. We discuss how the segregation, representation, and composition of the variable binding problem—a fundamental question in cognitive science and artificial intelligence—can be mechanistically interpreted within the theory. Using an elementary task - repeat copy, we demonstrate the validity of the theory in experimental settings. Our work represents a step towards opening the black box of RNNs, offering new insights into their functionality and bridging the gap between recurrent neural networks and memory models.

Abstract:

Hyperbolic space embeddings have been shown beneficial for many learning tasks where data have an underlying hierarchical structure. Consequently, many machine learning tools were extended to such spaces, but only few discrepancies to compare probability distributions defined over those spaces exist. Among the possible candidates, optimal transport distances are well defined on such Riemannian manifolds and enjoy strong theoretical properties, but suffer from high computational cost. On Euclidean spaces, sliced-Wasserstein distances, which leverage a closed-form solution of the Wasserstein distance in one dimension, are more computationally efficient, but are not readily available on hyperbolic spaces. In this work, we propose to derive novel hyperbolic sliced-Wasserstein discrepancies. These constructions use projections on the underlying geodesics either along horospheres or geodesics. We study and compare them on different tasks where hyperbolic representations are relevant, such as sampling or image classification.

Abstract:

1-parameter persistent homology, a cornerstone in Topological Data Analysis (TDA), studies the evolution of topological features such as connected components and cycles hidden in data. It has been applied to enhance the representation power of deep learning models, such as Graph Neural Networks (GNNs). To enrich the representations of topological features, here we propose to study 2-parameter persistence modules induced by bi-filtration functions. In order to incorporate these representations into machine learning models, we introduce a novel vector representation called Generalized Rank Invariant Landscape (GRIL) for 2-parameter persistence modules. We show that this vector representation is 1-Lipschitz stable and differentiable with respect to underlying filtration functions and can be easily integrated into machine learning models to augment encoding topological features. We present an algorithm to compute the vector representation efficiently. We also test our methods on synthetic and benchmark graph datasets, and compare the results with previous vector representations of 1-parameter and 2-parameter persistence modules. Further, we augment GNNs with GRIL features and observe an increase in performance indicating that GRIL can capture additional features enriching GNNs. We make the complete code for the proposed method available at https://github.com/soham0209/mpml-graph.

Abstract:

Clustering problems (such as k-means and k-median) are fundamental unsupervised machine learning primitives. Recently, these problems have been subject to large interest in the privacy literature. All prior work on private clustering, however, has been devoted to the offline case where the entire dataset is known in advance. In this work, we focus on the more challenging private data stream setting where the aim is to design memory-efficient algorithms that process a large stream incrementally as points arrive in a private way. Our main contribution is to provide the first differentially private algorithms for k-means and k-median clustering in data streams. In particular, our algorithms are the first to guarantee differential privacy both in the continual release and in the one-shot setting while achieving space sublinear in the stream size. We complement our theoretical results with an empirical analysis of our algorithms on real data.

Abstract:

In this paper, we initiate the study of Euclidean clustering with Distance-based differential privacy. Distance-based privacy is motivated by the fact that it is often only needed to protect the privacy of exact, rather than approximate, locations. We provide constant-approximate algorithms for k-means and k-median clustering, with additive error depending only on the attacker's precision bound p, rather than the radius Λ of the space. In addition, we empirically demonstrate that our algorithm performs significantly better than previous differentially private clustering algorithms, as well as naive distance-based private clustering baselines.

Abstract:

Homology-based invariants can be used to characterize the geometry of datasets and thereby gain some understanding of the processes generating those datasets. In this work we investigate how the geometry of a dataset changes when it is subsampled in various ways. In our framework the dataset serves as a reference object; we then consider different points in the ambient space and endow them with a geometry defined in relation to the reference object, for instance by subsampling the dataset proportionally to the distance between its elements and the point under consideration. We illustrate how this process can be used to extract rich geometrical information, allowing for example to classify points coming from different data distributions.

### Abstract:

In this paper, we study linear regression applied to data structured on a manifold. We assume that the data manifold is smooth and is embedded in a Euclidean space, and our objective is to reveal the impact of the data manifold's extrinsic geometry on the regression. Specifically, we analyze the impact of the manifold's curvatures (or higher order nonlinearity in the parameterization when the curvatures are locally zero) on the uniqueness of the regression solution. Our findings suggest that the corresponding linear regression does not have a unique solution when the manifold is flat. Otherwise, the manifold's curvature (or higher order nonlinearity in the embedding) may contribute significantly, particularly in the solution associated with the normal directions of the manifold. Our findings thus reveal the role of data manifold geometry in ensuring the stability of regression models for out-of-distribution inferences.

Abstract:

Recent advances in neural network (NN) architectures have demonstrated that complex topologies possess the potential to surpass the performance of conventional feedforward networks. Nonetheless, previous studies investigating the relationship between network topology and model performance have yielded inconsistent results, complicating their applicability in contexts beyond those scrutinized. In this study, we examine the utility of directed acyclic graphs (DAGs) for modeling intricate relationships among neurons within NNs. We introduce a novel algorithm for the efficient training of DAG-based networks and assess their performance relative to multilayer perceptrons (MLPs). Through experimentation on synthetic datasets featuring varying levels of difficulty and noise, we observe that complex networks founded on pertinent graphs outperform MLPs in terms of accuracy, particularly within high-difficulty scenarios. Additionally, we explore the theoretical underpinnings of these observations and explore the potential trade-offs associated with employing complex networks. Our research offers valuable insights into the capabilities and constraints of complex NN architectures, thus contributing to the ongoing pursuit of designing more potent and efficient deep learning models.

Abstract:

The most prevalent class of neural networks operating on graphs are message passing neural networks (MPNNs), in which the representation of a node is updated iteratively by aggregating information in the 1-hop neighborhood. Since this paradigm for computing node embeddings may prevent the model from learning coarse topological structures, the initial features are often augmented with structural information of the graph, typically in the form of Laplacian eigenvectors or Random Walk transition probabilities. In this work, we explore the contribution of message passing when strong structural encodings are provided. We introduce a novel way of modeling the interaction between feature and structural information based on their tensor product rather than the standard concatenation. The choice of interaction is compared in common scenarios and in settings where the capacity of the message-passing layer is severely reduced and ultimately the message-passing phase is removed altogether. Our results indicate that using tensor-based encodings is always at least on par with the concatenation-based encoding and that it makes the model much more robust when the message passing layers are removed, on some tasks incurring almost no drop in performance. This suggests that the importance of message passing is limited when the model can construct strong structural encodings.

Abstract:

In many dimensionality reduction tasks, we wish to identify the constituent components that explain our observations. For manifold learning, this can be formalized as factoring a Riemannian product manifold. Recovering this factorization, however, may suffer from certain difficulties in practice, especially when data is sparse or noisy, or when one factor is distorted by the other. To address these limitations, we propose identifying non-redundant coordinates on the product manifold before applying product manifold learning to identify which coordinates correspond to different factor manifolds. We demonstrate our approach on both synthetic and real-world data.

Abstract:

A key technique of machine learning and computer vision is to embed discrete weighted graphs into continuous spaces for further downstream analysis. Embedding discrete hierarchical structures in hyperbolic geometry has proven very successful since it was shown that any weighted tree can be embedded in that geometry with arbitrary low distortion. Various optimization methods for hyperbolic embeddings based on common models of hyperbolic geometry have been studied. In this paper, we consider Hilbert geometry for the standard simplex which is isometric to a vector space equipped with the variation polytope norm. We study the representation power of this Hilbert simplex geometry by embedding distance matrices of graphs. Our findings demonstrate that Hilbert simplex geometry is competitive to alternative geometries such as the Poincar'e hyperbolic ball or the Euclidean geometry for embedding tasks while being fast and numerically robust.

Abstract:

Deep learning models have seen significant successes in numerous applications, but their inner workings remain elusive. The purpose of this work is to quantify the learning process of deep neural networks through the lens of a novel topological invariant called magnitude. Magnitude is an isometry invariant; its properties are an active area of research as it encodes many known invariants of a metric space. We use magnitude to study the internal representations of neural networks and propose a new method for determining their generalisation capabilities. Moreover, we theoretically connect magnitude dimension and the generalisation error, and demonstrate experimentally that the proposed framework can be a good indicator of the latter.

Abstract:

The rapid progress of Artificial Intelligence research came with the development of increasingly complex deep learning models, leading to growing challenges in terms of computational complexity, energy efficiency and interpretability. In this study, we apply advanced network-based information filtering techniques to design a novel deep neural network unit characterized by a sparse higher-order graphical architecture built over the homological structure of underlying data. We demonstrate its effectiveness in two application domains which are traditionally challenging for deep learning: tabular data and time series regression problems. Results demonstrate the advantages of this novel design which can tie or overcome the results of state-of-the-art machine learning and deep learning models using only a fraction of parameters.

Abstract:

Graph convolutional networks (GCNs) suffer from the curse of depth, a phenomenon where performance degrades significantly as network depth increases. While over-smoothing has been considered the primary cause of this issue, we discover that gradient vanishing or exploding under commonly-used initialization methods also contributes to the curse of depth. To this end, we propose to evaluate GCN initialization quality from three aspects: forward-propagation, backward-propagation, and output diversity. We theoretically prove that conventional initialization methods fail to simultaneously maintain reasonable forward propagation and output diversity. To tackle this problem, We develop a new GCN initialization method called Signal Propagation on Graph (SPoGInit). By carefully designing and optimizing initial weight metrics, SPoGInit effectively alleviates performance degradation in deep GCNs. We further introduce a new architecture termed ReZeroGCN, which simultaneously addresses the three aspects at initialization. This architecture achieves performance gains on node classification tasks when increasing the depth from 4 to 64, e.g., 10% gain in training and 3% gain in test accuracy on OGBN-Arxiv. To the best of our knowledge, this is the first result to fully resolve the curse of depth on OGBN-Arxiv over such a range of depths.

### Abstract:

Hypergraphs are expressive structures for describing higher-order relationships among entities, with widespread applications across biology and drug discovery. Hypergraph neural networks (HGNNs) have recently emerged as a promising representation learning approach on these structures for clustering, classification, and more. However, despite their promising performance, HGNNs remain a black box, and explaining how they make predictions remains an open challenge. To address this problem, we propose HyperEX, a post-hoc explainability framework for hypergraphs that can be applied to any trained HGNN. HyperEX computes node-hyperedge pair importance to identify sub-hypergraphs as explanations. Our experiments demonstrate how HyperEX learns important sub-hypergraphs responsible for driving node classification to give useful insight into HGNNs.

Abstract:

In this paper, we investigate properties and limitations of invariance learned by neural networks from the data compared to the genuine invariance achieved through invariant weight-tying. To do so, we adopt a group theoretical perspective and analyze invariance learning in neural networks without weight-tying constraints. We demonstrate that even when a network learns to correctly classify samples on a group orbit, the underlying decision-making in such a model does not attain genuine invariance. Instead, learned invariance is strongly conditioned on the input data, rendering it unreliable if the input distribution shifts. We next demonstrate how to guide invariance learning toward genuine invariance by regularizing the invariance of a model at the training. To this end, we propose several metrics to quantify learned invariance: (i) predictive distribution invariance, (ii) logit invariance, and (iii) saliency invariance similarity. We show that the invariance learned with the invariance error regularization closely reassembles the genuine invariance of weight-tying models and reliably holds even under a severe input distribution shift. Closer analysis of the learned invariance also reveals the spectral decay phenomenon, when a network chooses to achieve the invariance to a specific transformation group by reducing the sensitivity to any input perturbation.

Abstract:

We explore the equivalence between neural networks and kernel methods by deriving the first exact representation of any finite-size parametric classification model trained with gradient descent as a kernel machine. We compare our exact representation to the well-known Neural Tangent Kernel (NTK) and discuss approximation error relative to the NTK and other non-exact path kernel formulations. We experimentally demonstrate that the kernel can be computed for realistic networks up to machine precision. We use this exact kernel to show that our theoretical contribution can provide useful insights into the predictions made by neural networks, particularly the way in which they generalize.

Abstract:

Protein pockets are essential for many proteins to carry out their functions. Locating and measuring protein pockets as well as studying the anatomy of pockets helps us further understand protein function. Most research studies focus on learning either local or global information from protein structures. However, there is a lack of studies that leverage the power of integrating both local and global representations of these structures. In this work, we combine topological data analysis (TDA) and geometric deep learning (GDL) to analyze the putative protein pockets of enzymes. TDA captures blueprints of the global topological invariant of protein pockets, whereas GDL decomposes the fingerprints to building blocks of these pockets. This integration of local and global views provides a comprehensive and complementary understanding of the protein structural motifs (niches for short) within protein pockets. We also analyze the distribution of the building blocks making up the pocket and profile the predictive power of coupling local and global representations for the task of discriminating between enzymes and non-enzymes. We demonstrate that our representation learning framework for macromolecules is particularly useful when the structure is known, and the scenarios heavily rely on local and global information.

classes.

Abstract:

There has been considerable effort to better understand the generalization capabilities of deep neural networks both as a means to unlock a theoretical understanding of their success as well as providing directions for further improvements. In this paper we investigate margin-based multiclass generalization bounds for neural networks which rely on a recent complexity measure, the geometric complexity, developed for neural networks and which measures the variability of the model function.

We derive a new upper bound on the generalization error which scale with the margin-normalized geometric complexity of the network and which hold for a broad family of data distributions and model classes. Our generalization bound is empirically investigated for a ResNet-18 model trained with SGD on the CIFAR-10 and CIFAR-100 datasets with both original and random labels.

Abstract:

Unsupervised learning has recently significantly gained in popularity, especially with deep learning-based approaches. Despite numerous successes and approaching supervised-level performance on a variety of academic benchmarks, it is still hard to train and evaluate SSL models in practice due to the unsupervised nature of the problem. Even with networks trained in a supervised fashion, it is often unclear whether they will perform well when transferred to another domain.

Past works are generally limited to assessing the amount of information contained in embeddings, which is most relevant for self-supervised learning of deep neural networks. This works chooses to follow a different approach: can we quantify how easy it is to linearly separate the data in a stable way? We survey the literature and uncover three methods that could be potentially used for evaluating quality of representations. We also introduce one novel method based on recent advances in understanding the high-dimensional geometric structure self-supervised learning.

We conduct extensive experiments and study the properties of these metrics and ones introduced in the previous work. Our results suggest that while there is no free lunch, there are metrics that can robustly estimate embedding quality in an unsupervised way.

Abstract:

Nonnegative matrix factorization (NMF) is widely used for clustering with strong interpretability. Among general NMF problems, symmetric NMF (SymNMF) is a special one that plays an important role in graph clustering where each element measures the similarity between data points. In this paper, we explore factorizing a symmetric matrix that does not have to be nonnegative, presenting an efficient factorization algorithm which is applicable and scalable. To our best knowledge, for the first time we prove there is no spurious local minimal solution for SymNMF, which indicates our algorithm will always yield global optimal solution. We also leave remaining open problems in the discussion part, which may inspire new research direction and topic.

Abstract:

AI-assisted solutions have recently proven successful when applied to Mathematics and have opened new possibilities for exploring unsolved problems that have eluded traditional approaches for years or even centuries. Following this direction, this paper presents an innovative approach aiming at establishing correlations between equational properties of algebraic structures that can be represented through graphs and specific sub-portions of their topological representation. The methodology incorporates the utilization of graph neural architectures to validate theorems or conjectures, complemented by Explainability (XAI) metrics that lend support to these statements. In particular, we examine the distributive and modular properties of algebraic lattices, whose characterization is well-known in universal algebra, hence using these properties as an experimental test bench. The findings of this study demonstrate the effectiveness of the proposed approach in identifying and retrieving established subpatterns that characterize the equational properties under investigation. Moreover, the approach exhibits the capability to generate novel and noteworthy candidates as theorem suggesters, thereby offering valuable prospects for further exploration by mathematicians.

gates, identifying relevant nodes.

Abstract:

We propose an interpretable local surrogate (ILS) method for understanding the predictions of black-box graph models. Explainability methods are commonly employed to gain insights into black-box models and, given the widespread adoption of GNNs in diverse applications, understanding the underlying reasoning behind their decision-making processes becomes crucial. Our ILS method approximates the behavior of a black-box graph model by fitting a simple surrogate model in the local neighborhood of a given input example. Leveraging the interpretability of the surrogate, ILS is able to identify the most relevant nodes contributing to a specific prediction. To efficiently identify these nodes, we utilize group sparse linear models as local surrogates. Through empirical evaluations on explainability benchmarks, our method consistently outperforms state-of-the-art graph explainability methods. This demonstrates the effectiveness of our approach in providing enhanced interpretability for GNN predictions.

Abstract:

Deep models are known to be vulnerable to data adversarial attacks, and many adversarial training techniques have been developed to improve their adversarial robustness. While data adversaries attack model predictions through modifying data, little is known about their impact on the neuron activations produced by the model, which play a crucial role in determining the model's predictions and interpretability. In this work, we aim to develop a topological understanding of adversarial training to enhance its interpretability. We analyze the topological structure-in particular, mapper graphs-of neuron activations of data samples produced by deep adversarial training. Each node of a mapper graph represents a cluster of activations, and two nodes are connected by an edge if their corresponding clusters have a nonempty intersection. We provide an interactive visualization tool that demonstrates the utility of our topological framework in exploring the activation space. We found that stronger attacks make the data samples more indistinguishable in the neuron activation space that leads to a lower accuracy. Our tool also provides a natural way to identify the vulnerable data samples that may be useful in improving model robustness.

Abstract:

Deep neural networks implement a sequence of layer-by-layer operations that are each relatively easy to understand, but the resulting overall computation is generally difficult to understand. An intuitive hypothesis is that the role of each layer is to reformat information to reduce the "distance" to the desired outputs. With this spatial analogy, the layer-wise computation implemented by a deep neural network can be viewed as a path along a high-dimensional manifold of neural representations. With this framework, each hidden layer transforms its inputs by taking a step of a particular size and direction along the manifold, ideally moving towards the desired network outputs. We formalize this intuitive idea by leveraging recent advances in metric representational similarity. We extend existing representational distance methods by defining and characterizing the manifold that neural representations live on, allowing us to calculate quantities like the shortest path or tangent direction separating representations between hidden layers of a network or across different networks. We then demonstrate these tools by visualizing and comparing the paths taken by a collection of trained neural networks with a variety of architectures, finding systematic relationships between model depth and width, and properties of their paths.

Abstract:

In this paper we introduce a novel family of attributed graphs for the purpose of shape discrimination. Our graphs typically arise from variations on the Mapper graph construction, which is an an approximation of the Reeb graph for point cloud data. Our attributions enrich these constructions with (persistent) homology in ways that are provably stable, thereby recording extra topological information that is typically lost in these graph constructions. We provide experiments which illustrate the use of these invariants for shape representation and classification. In particular, we obtain competitive shape classification results when using our topologically attributed graphs as inputs to a simple graph neural network classifier.

Abstract:

Groups with complex set intersection relations are a natural way to model a wide array of data, from the formation of social groups to the complex protein interactions which form the basis of biological life. One approach to representing such “higher order” relationships is as a hypergraph. However, efforts to apply machine learning techniques to hypergraph structured datasets have been limited thus far. In this paper, we address the problem of link prediction in knowledge hypergraphs as well as simple hypergraphs and develop a novel, simple, and effective optimization architecture that addresses both tasks. Additionally, we introduce a novel feature extraction technique using node level clustering and we show how integrating data from node-level labels can improve system performance. Our self-supervised approach achieves significant improvement over state of the art baselines on several hyperedge prediction and knowledge hypergraph completion benchmarks.

Abstract:

Contrastive representation learning has been a long-standing research area due to its versatility and importance in learning representations. Recent works have shown improved results if the representations are constrained to be on a hypersphere. In this work, we propose a geometric interpretation of contrastive learning by making use of geodesic distances on the hypersphere to learn contrasts between representations. Through empirical results, we show that this contrastive learning approach improves performance on several downstream tasks.

Abstract:

Natural language processing (NLP) made an impressive jump with the introduction of Transformers. ChatGPT is one of the most famous examples, changing the perception of the possibilities of AI even outside the research community. However, besides the impressive performance, the quadratic time and space complexity of Transformers with respect to sequence length pose significant limitations for handling long sequences. While efficient Transformer architectures like Linformer and Performer with linear complexity have emerged as promising solutions, their theoretical understanding remains limited. In this paper, we introduce Sumformer, a novel and simple architecture capable of universally approximating equivariant sequence-to-sequence functions. We use Sumformer to give the first universal approximation results for Linformer and Performer. Moreover, we derive a new proof for Transformers, showing that just one attention layer is sufficient for universal approximation.

Abstract:

Neural network pruning has gained significant attention for its potential to reduce computational resources required for training and inference. A large body of research has shown that networks can be pruned both after training and at initialisation, while maintaining competitive accuracy compared to dense networks. However, current methods rely on iteratively pruning or repairing the network to avoid over-pruning and layer collapse. Recent work has found that by treating neural networks as a sequence of bipartite graphs, pruning can be studied through the lens of spectral graph theory. Therefore, in this work, we propose a novel pruning approach using spectral sparsification, which aims to preserve meaningful properties of a dense graph with a sparse subgraph, by preserving the spectrum of the dense graph's adjacency matrix. We empirically validate and investigate our method, and show that one-shot pruning using spectral sparsification preserves performance at higher levels of sparsity compared to its one-shot counterparts. Additionally, we theoretically analyse our method with respect to local and global connectivity.

Abstract:

The problem of detecting and quantifying the presence of symmetries in datasets is useful for model selection, generative modeling, and data analysis, amongst others. While existing methods for hard-coding transformations in neural networks require prior knowledge of the symmetries of the task at hand, this work focuses on discovering and characterising unknown symmetries present in the dataset, namely, Lie group symmetry transformations beyond the traditional ones usually considered in the field (rotation, scaling, and translation). Specifically, we consider a scenario in which a dataset has been transformed by a one-parameter subgroup of transformations with different parameter values for each data point. Our goal is to characterise the transformation group and the distribution of the parameter values, even when they aren’t small or the transformation group isn’t one of the traditional ones. The results showcase the effectiveness of the approach in both these settings.

Abstract:

Flatness of the loss curve around a model at hand has been shown to empirically correlate with its generalization ability. Optimizing for flatness has been proposed as early as 1994 by Hochreiter and Schmidthuber, and was followed by more recent successful sharpness-aware optimization techniques. Their widespread adoption in practice, though, is dubious because of the lack of theoretically grounded connection between flatness and generalization, in particular in light of the reparameterization curse—certain reparameterizations of a neural network change most flatness measures but do not change generalization. Recent theoretical work suggests that a particular relative flatness measure can be connected to generalization and solves the reparameterization curse. In this paper, we derive a regularizer based on this relative flatness that is easy to compute, fast, efficient, and works with arbitrary loss functions. It requires computing the Hessian only of a single layer of the network, which makes it applicable to large neural networks, and with it avoids an expensive mapping of the loss surface in the vicinity of the model. In an extensive empirical evaluation we show that this relative flatness aware minimization (FAM) improves generalization in a multitude of applications and models, both in finetuning and standard training. We make the code available at github.

Abstract:

Reconstructing the 3D volume of a molecule from its differently oriented 2D projections is the central problem of cryo-EM, one of the main techniques for macro-molecule imaging. Because the orientations are unknown, the estimation of the images' poses is essential to solve this inverse problem. Typical methods either rely on synchronization, which leverages the estimated relative poses of the images to constrain their absolute ones, or jointly estimate the poses and the 3D density of the molecule in an iterative fashion. Unfortunately, synchronization methods don't account for the complete images' generative process and, therefore, achieve lower noise robustness. In the second case, the iterative joint optimization suffers from convergence issues and a higher computational cost, due to the 3D reconstruction steps. In this work, we directly estimate individual poses with equivariant deep graph networks trained using a self-supervised loss, which enforces agreement in Fourier domain of images pairs along the common lines defined by their poses. In particular, the equivariant design turns out essential for the proper convergence. As a result, our method can leverage the synchronization constraints - encoded by the synchronization graph structure - to improve convergence as well as the images generative process - via the common lines loss -, with no need to perform intermediate reconstructions.

Abstract:

Neural operations that rely on neighborhood information are much more expensive when deployed on point clouds than on grid data due to the irregular distances between points in a point cloud. For example, in a convolution, the convolutional kernel must be recomputed for every point in a point cloud to consider the distances to all other points in its neighbourhood. In a grid, on the other hand, we can compute the kernel only once and reuse it for all query positions. As a result, operations that rely on neighborhood information scale much worse for point clouds than for grid data, specially for large inputs and large neighborhoods. In this work, we address the scalability issue of point cloud methods by tackling its root cause: the irregularity of the data. To this end, we propose learnable gridification as the first step in a point cloud processing pipeline to transform the point cloud into a compact, regular grid. Thanks to gridification, subsequent layers can use operations defined on regular grids, e.g., Conv3D, which scale much better than native point cloud methods. We then extend gridification to point cloud to point cloud tasks, e.g., segmentation, by adding a earnable de-gridification step at the end of the point cloud processing pipeline to map the compact, regular grid back to its original point cloud form. Through theoretical and empirical analysis, we show that gridified networks scale better in terms of memory and time than networks directly applied on raw point cloud data, while being able to achieve competitive results.

Abstract:

Physics-informed neural networks (PINNs) are computationally efficient alternatives to traditional partial differential equation (PDE) solvers. However, their reliability is dependent on the accuracy of the trained neural network. In this work, we introduce a mechanism for leveraging the symmetries of a given PDE to improve PINN performance. In particular, we propose a loss function that informs the network about Lie point symmetries, similar to how traditional PINN models try to enforce the underlying PDE. Intuitively, our symmetry loss ensures that infinitesimal generators of the Lie group preserve solutions of the PDE. Effectively, this means that once the network learns a solution, it also learns the neighbouring solutions generated by Lie point symmetries. Our results confirm that Lie point symmetries of the respective PDEs are an effective inductive bias for PINNs and can lead to a significant increase in sample efficiency.

Abstract:

Here we present information theoretic measures based on the data diffusion operator as characterisations of the representations learned by neural networks. Specifically, we define diffusion spectral entropy (DSE), i.e., entropy of the diffusion operator computed on the neural representation of a dataset as well as diffusion spectral mutual information (DSMI), which assesses the relationship between different sets of variables representing data. First, we show that these definitions form robust measures of intrinsic dimensionality and relationship strength respectively on toy data, outperforming binned Shannon entropy in terms of accuracy. Then we study the evolution of representations within classification networks and networks with self-supervised losses. In both cases, we see that generalizable training results in decrease in DSE over epochs --- starting from a random initialization. We also see that there is an increase in DSMI with the class label over time. On the other hand, training with corrupt labels results in a maintenance or increase in entropy and near-zero DSMI with labels. We also assess DSMI with the input and observe differing trends. On MNIST it grows until plateaus, whereas on CIFAR it increases and then decreases. Overall results show that these measures can elucidate characteristics of network performance as well as data complexity. Code is available at https://github.com/ChenLiu-1996/DiffusionSpectralEntropy.

Abstract:

In robotic tasks, changes of reference frames typically do not affect the underlying physical meaning. These are isometric transformations, including translations, rotations, and reflections, called Euclidean group. In this work, we study reinforcement learning and planning tasks that have Euclidean group symmetry. We provide a theory that extends prior work (on symmetry in reinforcement learning, planning, and optimal control) to compact Lie groups and covers them as special cases, and show examples to explain the benefits of equivariance to Euclidean symmetry. We extend the 2D path planning with value-based planning to continuous MDPs and propose a pipeline for equivariant sampling-based planning algorithm with empirical evidence.

Abstract:

Existing equivariant neural networks require explicit knowledge of the symmetry group before model implementation. Various symmetry discovery methods have been developed to learn invariance and equivariance from data, but their search spaces are limited to linear symmetries. We propose to discover arbitrary nonlinear symmetries by factorizing the group action into nonlinear transformations parameterized by an autoencoder network and linear symmetries generated by an existing symmetry discovery framework, LieGAN. Our method can capture the intrinsic symmetry in high-dimensional observations, which also results in a well-structured latent space that is useful for other downstream tasks, including long-term prediction and latent space equation discovery.

Abstract:

Discrete curvature has recently been used in graph machine learning to improve performance, understand message-passing and assess structural differences between graphs. Despite these advancements, the theoretical properties of discrete curvature measures, such as their representational power and their relationship to graph features is yet to be fully explored. This paper studies Ollivier--Ricci curvature on graphs, providing both a discussion and empirical analysis of its expressivity, i.e. the ability to distinguish non-isomorphic graphs.

Abstract:

Recent studies propose enhancing machine learning models by aligning the geometric characteristics of the latent space with the underlying data structure. Instead of relying solely on Euclidean space, researchers suggest using hyperbolic and spherical spaces with constant curvature, or their combinations (product manifolds), to improve model performance. However, determining the best latent product manifold signature, which refers to the choice and dimensionality of manifold components, lacks a principled technique. To address this, we introduce a novel notion of distance between candidate latent geometries using Gromov-Hausdorff from metric geometry. We propose using a graph search space that utilizes computed Gromov-Hausdorff distances to search for the optimal latent geometry. In this work we focus on providing a description of an algorithm to compute the Gromov-Hausdorff distance between model spaces and its computational implementation.

Abstract:

Learning policies that are robust to changes in the environment are critical for real world deployment of Reinforcement Learning (RL) agents. They are also necessary for achieving good generalization across environment shifts. Bisimulation provides a powerful means for abstracting task relevant components of the observation and learning a succinct representation space for training the RL agent in high dimensional spaces by exploiting the rich metric structure induced by the RL dynamics. In this work, we extend the bisimulation framework to also account for context dependent observation shifts. We use simulator based learning as an exemplary setting to demonstrate the use alternate observations to learn a representation space which is invariant to observation shifts using a novel bisimulation based objective. This allows us to deploy the agent to varying observation settings during test time and generalize to unseen scenarios. Empirical analysis on the high-dimensional image based control domains demonstrates the efficacy of our method.

Abstract:

Persistent Homology is a widely used topological data analysis tool that creates a concise description of the topological properties of a point cloud based on a specified filtration. Most of these filtrations used for persistent homology depend (implicitly) on a chosen metric, which is typically agnostically chosen as the standard euclidean metric on R^n. Recent work has tried to uncover the "true" metric on the point cloud using distance-to-measure functions, in order to obtain more meaningful persistent homology results. Here we propose an alternative look at this problem: we posit that information on the point cloud is lost when restricting persistent homology to a single (correct) distance function. Instead, we show how by varying the distance function on the underlying space and analysing the corresponding shifts in the persistence diagrams, we can extract additional topological and geometrical information. Finally, we show in synthetic experiments that non-isotropic persistent homology (NIPH) can extract information on orientation, orientational variance, and scaling of randomly generated point clouds with good accuracy.

transformer = general-purpose architecture for 3D data

Abstract:

Problems involving geometric data arise in a variety of fields, including computer vision, robotics, chemistry, and physics. Such data can take numerous forms, such as points, direction vectors, planes, or transformations, but to date there is no single architecture that can be applied to such a wide variety of geometric types while respecting their symmetries. In this paper we introduce the Geometric Algebra Transformer (GATr), a general-purpose architecture for geometric data. GATr represents inputs, outputs, and hidden states in the projective geometric algebra, which offers an efficient 16-dimensional vector space representation of common geometric objects and operators acting on them. GATr is equivariant with respect to E(3), the symmetry group of 3D Euclidean space. As a transformer, GATr is scalable, expressive, and versatile. In experiments with n-body modeling and robotic planning, GATr shows strong improvements over non-geometric baselines.

Abstract:

Self-supervised learning converts raw perceptual data such as images to a compact space where simple Euclidean distances measure meaningful variations in data. In this paper, we extend this formulation by adding additional geometric structure to the embedding space by enforcing transformations of input space to correspond to simple (i.e., linear) transformations of embedding space. Specifically, in the contrastive learning setting, we introduce an equivariance objective and theoretically prove that its minima forces augmentations on input space to correspond to rotations on the spherical embedding space. We show that merely combining our equivariant loss with a non-collapse term results in non-trivial representations, without requiring invariance to data augmentations. Optimal performance is achieved by also encouraging approximate invariance, where input augmentations correspond to small rotations. Our method, CARE: Contrastive Augmentation-induced Rotational Equivariance, leads to improved performance on downstream tasks and ensures sensitivity in embedding space to important variations in data (e.g., color) that standard contrastive methods do not achieve.

Abstract:

A multitude of metrics for learning and evaluating disentangled representations has been proposed. However, it remains unclear what these metrics truly quantify and how to compare them. To solve this problem, we introduce a systematic approach for transforming an equational definition into a quantitative metric via enriched category theory. We show how to replace (i) equality with metric, (ii) logical connectives with order operations, (iii) universal quantifier with aggregation, and (iv) existential quantifier with the best approximation. Using this approach, we can derive useful metrics for measuring the modularity and informativeness of a disentangled representation extractor.

Abstract:

Can deep learning models generalize if their problem's underlying structure is unknown a priori? We analyze this theoretically and empirically in an idealistic setting for linear regression with invariant/equivariant data. We prove that linear regression models learn to become invariant/equivariant, with their weights being decomposed into a component that respects the symmetry and one that does not. These two components evolve independently over time, with the asymmetric component decaying exponentially given sufficient data. Extending these results to complex systems will be pursued in future work.

Abstract:

We introduce a set of polynomial learning problems that are equivariant to the non-compact group SL(2,R). SL(2,R) consists of area-preserving linear transformations, and captures the symmetries of a variety of polynomial-based problems not previously studied in the machine learning community, such as verifying positivity (for e.g. sum-of-squares optimization) and minimization. While compact groups admit many architectural building blocks, such as group convolutions, non-compact groups do not fit within this paradigm and are therefore more challenging. We consider several equivariance-based learning approaches for solving polynomial problems, including both data augmentation and a fully SL(2,R)-equivariant architecture for solving polynomial problems. In experiments, we broadly demonstrate that machine learning provides a promising alternative to traditional SDP-based baselines, achieving tenfold speedups while retaining high accuracy. Surprisingly, the most successful approaches incorporate only a well-conditioned subset of SL(2,R), rather than the entire group. This provides a rare example of a symmetric problem where data augmentation outperforms full equivariance, and provides interesting lessons for other problems with non-compact symmetries.

Abstract:

In practice, encoding invariances into models improves sample complexity. In this work, we study this phenomenon from a theoretical perspective. In particular, we provide minimax optimal rates for kernel ridge regression on compact manifolds, with a target function that is invariant to a group action on the manifold. Our results hold for any smooth compact Lie group action, even groups of positive dimension. For a finite group, the gain effectively multiplies the number of samples by the group size. For groups of positive dimension, the gain is observed by a reduction in the manifold's dimension, in addition to a factor proportional to the volume of the quotient space. Our proof takes the viewpoint of differential geometry, in contrast to the more common strategy of using invariant polynomials. This new geometric viewpoint on learning with invariances may be of independent interest.

Abstract:

Multi-condition single-cell data reveals expression differences between corresponding cell subpopulations in different conditions. Here, we propose to use regression on latent spaces to simultaneously account for variance from known and latent factors. Our approach is built around multivariate regression on Grassmann manifolds. We use the method to analyze a drug treatment experiment on brain tumor biopsies. The method is a versatile new approach for identifying differentially expressed genes from single-cell data of heterogeneous cell subpopulations under arbitrary experimental designs without clustering.

Abstract:

A NN consisting of piecewise affine building blocks, such as fully-connected layers and ReLU activations, is itself a piecewise affine function supported on a polyhedral complex. This complex has been studied to characterize theoretical properties of NNs and linked to geometry representations, but, in practice, extracting it remains a challenge. Previous works subdivide the regions via intersections with hyperplanes induced by each neuron. Instead, we propose to subdivide the edges, leading to a novel method for polyhedral complex extraction. This alleviates computational redundancy and affords efficient data-structures. A key to this are sign-vectors, which encode the combinatorial structure of the complex. Our implementation (available on GitHub) uses standard tensor operations and can run exclusively on the GPU, taking seconds for millions of cells on a consumer grade machine. Motivated by the growing interest in neural shape representation, we use the speed and differentiablility of our method to optimize geometric properties of the complex. Our code is available at https://github.com/arturs-berzins/relu_edge_subdivision.

Abstract:

Much work has been devoted to devising architectures that build group-equivariant representations, while invariance is often induced using simple global pooling mechanisms. Little work has been done on creating expressive layers that are invariant to given symmetries, despite the success of permutation invariant pooling in various tasks. In this work, we present Group Invariant Global Pooling (GIGP), an invariant pooling layer that is provably sufficiently expressive to represent a large class of invariant functions. We validate GIGP on rotated MNIST and QM9, showing improvements for the latter while attaining identical results for the former. By making the pooling process over group orbits, this invariant aggregation method leads to improved performance, while performing well-principled group aggregations.

Abstract:

The Earth Mover's Distance (EMD) is the measure of choice to assess similarity between point clouds. However the computational cost of standard algorithms to compute it makes it prohibitive as a training loss, and the standard approach is to use a surrogate such as the Chamfer distance. We propose instead to use a deep model dubbed DeepEMD to directly get an estimate of the EMD. We formulate casting the prediction of the bipartite matching as that of an attention matrix, from which we get an accurate estimate of both the EMD, and its gradient. Experiments demonstrate not only the accuracy of this model, in particular even when test and train data are from different origins. Moreover, in our experiments, the model performs accurately when processing point clouds which are several times larger than those seen during training. Computation-wise, while the complexity of the exact Hungarian algorithm is $O(N^3)$, DeepEMD scales as $O(N^2)$, where $N$ is the total number of points. This leads to a $100\times$ wall-clock speed-up with $1024$ points. DeepEMD also achieves better performance than the standard Sinkhorn algorithm, with about $40\times$ speed-up. The availability of gradients allows DeepEMD to be used for training a VAE, leading to a model with lower reconstruction EMD than a standard baseline trained with Chamfer distance.

Abstract:

Neural Persistence is a prominent measure for quantifying neural network complexity, proposed in the emerging field of topological data analysis in deep learning. In this work, however, we find both theoretically and empirically that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence. First, we prove tighter bounds on neural persistence that motivate this claim theoretically. Then, we confirm that our interpretation holds in practise by calculating neural persistence for synthetic weight matrices and for trained deep neural networks. This raises the question if the benefits of neural persistence can be achieved by simpler means, since already calculating 0-order persistent homology for large matrices is costly.

Abstract:

State-of-the-art neural algorithmic reasoners make use of message passing in graph neural networks (GNNs). But typical GNNs blur the distinction between the definition and invocation of the message function, forcing a node to send messages to its neighbours at every layer, synchronously. When applying GNNs to learn to execute dynamic programming algorithms, however, on most steps only a handful of the nodes would have meaningful updates to send. One, hence, runs the risk of inefficiencies by sending too much irrelevant data across the graph---with many intermediate GNN steps having to learn identity functions. In this work, we explicitly separate the concepts of node state update and message function invocation. With this separation, we obtain a mathematical formulation that allows us to reason about asynchronous computation in both algorithms and neural networks.

Abstract:

Graph neural networks (GNNs) demonstrate outstanding performance in a broad range of applications. While the majority of GNN applications assume that a graph structure is given, some recent methods substantially expanded the applicability of GNNs by showing that they may be effective even when no graph structure is explicitly provided. The GNN parameters and a graph structure are jointly learned. Previous studies adopt different experimentation setups, making it difficult to compare their merits. In this paper, we propose a benchmarking strategy for graph structure learning using a unified framework. Our framework, called Unified Graph Structure Learning (UGSL), reformulates existing models into a single model. We implement a wide range of existing models in our framework and conduct extensive analyses of the effectiveness of different components in the framework. Our results provide a clear and concise understanding of the different methods in this area as well as their strengths and weaknesses.

Abstract:

In this work we study adversarial examples in deep neural networks through the lens of a predefined data manifold. By forcing certain geometric properties of this manifold, we are able to analyze the behavior of the learned decision boundaries. It has been shown previously that training to be robust against adversarial attacks produces models with gradients aligned to a small set of principal variations in the data. We demonstrate the converse of this statement; aligning model gradients with a select set of principal variations improves robustness against gradient based adversarial attacks. Our analysis shows that this also makes data more orthogonal to decision boundaries. We conclude that robust training methods make the problem better posed by focusing the model on more important dimensions of variation.

Abstract:

Diffusion-based manifold learning methods have proven useful in representation learning and dimensionality reduction of modern high dimensional, high throughput, noisy datasets. Such datasets are especially present in fields like biology and physics. While it is thought that these methods preserve underlying manifold structure of data by learning a proxy for geodesic distances, no specific theoretical links have been established. Here, we establish such a link via results in Riemannian geometry explicitly connecting heat diffusion to manifold distances. In this process, we also formulate a more general heat kernel based manifold embedding method that we call heat geodesic embeddings. This novel perspective makes clearer the choices available in manifold learning and denoising. Results show that our method outperforms existing state of the art in preserving ground truth manifold distances, and preserving cluster structure in toy datasets. We also showcase our method on single cell RNA-sequencing datasets with both continuum and cluster structure, where our method enables interpolation of withheld timepoints of data.

Abstract:

Convolutional Kernel Networks (CKNs) were proposed as multilayered representation learning models that are based on stacking multiple Reproducing Kernel Hilbert Spaces (RKHSs) in a hierarchical manner. CKN has been studied to understand the (near) group invariance and (geometric) deformation stability properties of deep convolutional representations by exploiting the geometry of corresponding RKHSs. The objective of this paper is two-fold: (1) Analyzing the construction of group equivariant Convolutional Kernel Networks (equiv-CKNs) that induce in the model symmetries like translation, rotation etc., (2) Understanding the deformation stability of equiv-CKNs that takes into account the geometry of inductive biases and that of RKHSs. Multiple kernel based construction of equivariant representations might be helpful in understanding the geometric model complexity of equivariant CNNs as well as shed lights on the construction practicalities of robust equivariant networks.

Abstract:

Graph neural networks (GNNs) have demonstrated success in modeling relational data, especially for data that exhibits homophily: when a connection between nodes tends to imply that they belong to the same class. However, while this assumption is true in many relevant situations, there are important real-world scenarios that violate this assumption. In this work, we propose Evolving Computation Graphs (ECGs), a novel method for enhancing GNNs on heterophilic datasets without requiring prior domain knowledge. Our approach builds on prior theoretical insights linking node degree, high homophily, and inter vs intra-class embedding similarity by rewiring the GNNs’ computation graph towards adding edges that connect nodes that are likely to be in the same class. We utilise weaker classifiers to identify these edges and evaluate ECGs on a diverse set of recently-proposed heterophilous datasets, demonstrating improvements over 95% of the relevant baselines.

Abstract:

We propose Higher-Order Networks (HONs) for historically challenging problems for Graph Neural Networks (GNNs), such as Constraint Satisfaction Problems (CSPs). We apply a simple extension of GNNs to HONs and show its advantages for solving 3-coloring.

Abstract:

Linear neural network layers that are either equivariant or invariant to permutations of their inputs form core building blocks of modern deep learning architectures. Examples include the layers of DeepSets, as well as linear layers occurring in attention blocks of transformers and some graph neural networks. The space of permutation equivariant linear layers can be identified as the invariant subspace of a certain symmetric group representation, and recent work parameterized this space by exhibiting a basis whose vectors are sums over orbits of standard basis elements with respect to the symmetric group action. A parameterization opens up the possibility of learning the weights of permutation equivariant linear layers via gradient descent. The space of permutation equivariant linear layers is a generalization of the partition algebra, an object first discovered in statistical physics with deep connections to the representation theory of the symmetric group, and the basis described above generalizes the so-called orbit basis of the partition algebra. We exhibit an alternative basis, generalizing the diagram basis of the partition algebra, with computational benefits stemming from the fact that the tensors making up the basis are low rank in the sense that they naturally factorize into Kronecker products. Just as multiplication by a rank one matrix is far less expensive than multiplication by an arbitrary matrix, multiplication with these low rank tensors is far less expensive than multiplication with elements of the orbit basis. Finally, we describe an algorithm implementing multiplication with these basis elements.

Abstract:

Contrastive learning (CL) has emerged as a powerful technique for representation learning, with or without label supervision. However, supervised CL is prone to collapsing representations of subclasses within a class by not capturing all their features, and unsupervised CL may suppress harder class-relevant features by focusing on learning easy class-irrelevant features; both significantly compromise representation quality. Yet, there is no theoretical understanding of \textit{class collapse} or \textit{feature suppression} at \textit{test} time. We provide the first unified theoretically rigorous framework to determine \textit{which} features are learnt by CL. Our analysis indicate that, perhaps surprisingly, bias of (stochastic) gradient descent towards finding simpler solutions is a key factor in collapsing subclass representations and suppressing harder class-relevant features. We also provide the first theoretical explanation for why employing supervised and unsupervised CL together yields higher-quality representations, even when using commonly-used stochastic gradient methods.

Abstract:

Positional encodings are ubiquitous as an input featurization tool in language modeling, computer vision, and graph representation learning, enabling neural networks to capture important geometric structure of the input. Traditionally, positional encodings have been defined anew for each data domain. In this work, we reinterpret positional encodings for disparate data types --- including sequences, grids, graphs, and manifolds --- in the unifying framework of group representations. We show how to express existing positional encodings as group representations, and conversely, propose new positional encodings by choosing suitable groups and representations. We validate our framework with experiments on implicit neural representations of images and vector fields, highlighting the practical utility of such positional encodings for encouraging approximate equivariance and capturing geometric structure.

Abstract:

Group-invariant probability distributions appear in many data-generative models in machine learning, such as graphs, point clouds, and images. In practice, one often needs to estimate divergences between such distributions. In this work, we study how the inherent invariances with respect to any smooth action of a Lie group on a manifold improve sample complexity when estimating the Wasserstein distance. Our result indicates a two-fold gain: (1) reducing the sample complexity by a multiplicative factor corresponding to the group size (for finite groups) or the normalized volume of the quotient space (for groups of positive dimension), (2) improving the exponent in the convergence rate (for groups of positive dimension). These results are completely new for groups of positive dimension and tighten recent bounds for finite group actions.

Abstract:

We propose a Gaussian manifold variational auto-encoder (GM-VAE) whose latent space consists of a set of Gaussian distributions. It is known that the set of the univariate Gaussian distributions with the Fisher information metric form a hyperbolic space, which we call a Gaussian manifold. To learn the VAE endowed with the Gaussian manifolds, we propose a pseudo-Gaussian manifold normal distribution based on the Kullback-Leibler divergence, a local approximation of the squared Fisher-Rao distance, to define a density over the latent space. In experiments, we demonstrate the efficacy of GM-VAE on two different tasks: density estimation of image datasets and environment modeling in model-based reinforcement learning. GM-VAE outperforms the other variants of hyperbolic- and Euclidean-VAEs on density estimation tasks and shows competitive performance in model-based reinforcement learning. We observe that our model provides strong numerical stability, addressing a common limitation reported in previous hyperbolic-VAEs.

Abstract:

Real-world graphs naturally exhibit hierarchical or cyclical structures that are unfit for the typical Euclidean space. While there exist graph neural networks that leverage non-Euclidean spaces to embed such structures more accurately, these methods are confined under the message-passing paradigm, making the models vulnerable against side-effects such as oversmoothing. More recent work have proposed attention-based graph Transformers that can easily model long-range interactions, but their extensions towards non-Euclidean geometry are yet unexplored. To bridge this gap, we propose Fully Product-Stereographic Transformer, a generalization of Transformers towards operating entirely on the product of constant curvature spaces. Our model can learn the curvature appropriate for the input graph in an end-to-end fashion, without the need of additional tuning on different curvature initializations. We also provide a kernelized approach to non-Euclidean attention, which enables our model to run in cost linear to the number of nodes and edges while respecting the underlying geometry. Experiments on graph reconstruction and node classification demonstrate the benefits of our approach.

Abstract:

Neural networks that can process the parameters of other neural networks find applications in diverse domains, including processing implicit neural representations, domain adaptation of pretrained networks, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the weight space or rely on intricate weight-sharing patterns to achieve equivariance. In this work, we propose representing neural networks as computation graphs, enabling the use of standard graph neural networks to preserve permutation symmetry. We also introduce probe features computed from the forward pass of the input neural network. Our proposed solution improves over prior methods from 86% to 97% accuracy on the challenging MNIST INR classification benchmark, showcasing the effectiveness of our approach.

Abstract:

Machine learning subfields define useful representations differently: disentanglement strives for semantic meaning and symmetries, identifiability for recovering the ground-truth factors of the (unobservable) data generating process, group-structured representations for equivariance. We demonstrate that despite their merits, each approach has shortcomings. Surprisingly, joining forces helps overcome the limitations: we use insights from latent space statistics, geometry, and topology in our examples to elucidate how combining the desiderata of identifiability, disentanglement, and group structure yields more useful representations.

Abstract:

Recent work has shown the utility of developing machine learning models that respect the structure and symmetries of eigenvectors. These works promote sign invariance, since for any eigenvector v the negation −v is also an eigenvector. However, we show that sign invariance is theoretically limited for tasks such as building orthogonally equivariant models and learning node positional encodings for link prediction in graphs. In this work, we demonstrate the benefits of sign equivariance for these tasks. To obtain these benefits, we develop novel sign equivariant neural network architectures. Our models are based on a new analytic characterization of sign equivariant polynomials and thus inherit provable expressiveness properties. Controlled synthetic experiments show that our networks can achieve the theoretically predicted benefits of sign equivariant models.

Abstract:

It has been observed that inner representations learned by different neural networks conceal structural similarities when the networks are trained under similar inductive biases. Exploring the geometric structure of latent spaces within these networks offers insights into the underlying similarity among different neural models and facilitates reasoning about the transformations that connect them. Identifying and estimating these transformations presents a challenging task, but it holds significant potential for various downstream tasks, including merging and stitching different neural architectures for model reuse. In this study, drawing on the geometrical structure of latent spaces, we show how it is possible to define representations that incorporate invariances to the targeted transformations in a single framework. We experimentally analyze how inducing different invariances in the representations affects downstream performances on classification and reconstruction tasks, suggesting that the classes of transformations that relate independent latent spaces depend on the task at hand. We analyze models in a variety of settings including different initializations, architectural changes, and trained on multiple modalities (e.g., text, images), testing our framework on 8 different benchmarks.

Abstract:

Graph neural networks have shown promising performance on graph analytical tasks such as node classification and link prediction, contributing to great advances in many graph-based applications. Despite the success of graph neural networks, most of them lack fairness considerations. Consequently, they could yield discriminatory results towards certain populations when such algorithms are exploited in high stakes applications. In this work, we study the problem of predictive bias propagated by relational information, and subsequently propose an in-training edge editing approach to promote fairness. We introduce the notions of faithfulness and unfairness for an edge in a graph, and use it as prior knowledge to edit graph topology and improve fairness.