<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Proceedings of Machine Learning Research</title>
    <description>Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models
  Held in Ernest N. Morial Convention Center, New Orleans, USA on 15 December 2023

Published as Volume 243 by the Proceedings of Machine Learning Research on 14 May 2024.

Volume Edited by:
  Marco Fumero
  Emanuele Rodolà
  Clementine Domine
  Francesco Locatello
  Karolina Dziugaite
  Caron Mathilde

Series Editors:
  Neil D. Lawrence
</description>
    <link>https://proceedings.mlr.press/v243/</link>
    <atom:link href="https://proceedings.mlr.press/v243/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 14 May 2024 09:27:10 +0000</pubDate>
    <lastBuildDate>Tue, 14 May 2024 09:27:10 +0000</lastBuildDate>
    <generator>Jekyll v3.9.5</generator>
    
      <item>
        <title>WavSpA: Wavelet Space Attention for Boosting Transformers’ Long Sequence Learning Ability</title>
        <description>The Transformer and its variants are fundamental neural architectures in deep learning. Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers. We argue that the wavelet transform should be a better choice because it captures both position and frequency information with linear time complexity. Therefore, in this paper, we systematically study the synergy between the wavelet transform and Transformers. We propose Wavelet Space Attention (WavSpA), which facilitates attention learning in a learnable wavelet coefficient space and replaces the attention in Transformers by (1) applying a forward wavelet transform to project the input sequences onto multi-resolution bases, (2) conducting attention learning in the wavelet coefficient space, and (3) reconstructing the representation in the input space via a backward wavelet transform. Extensive experiments on the Long Range Arena demonstrate that learning attention in the wavelet space, using either fixed or adaptive wavelets, consistently improves the Transformer’s performance and significantly outperforms learning in the Fourier space. We further show our method can enhance the Transformer’s reasoning extrapolation capability over distance on the LEGO chain-of-reasoning task.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/zhuang24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/zhuang24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Comparing neural models using their perceptual discriminability predictions</title>
        <description>A variety of methods have been developed to compare models of visual representation. However, internal representations are not uniquely identifiable from perceptual measurements: different representations can generate identical perceptual predictions, and dissimilar model representations (according to existing model comparison methods) do not guarantee dissimilar perceptual predictions. Here, we generalize a previous method (“eigendistortions” - Berardino et al, 2017) to compare models based on their metric tensors. Metric tensors characterize a model’s sensitivity to stimulus perturbations, reflecting both the geometric and stochastic properties of the representation, and providing an explicit prediction of perceptual discriminability. Brute force comparison of model-predicted metric tensors using human perceptual thresholds would require an impossibly large set of measurements, since one needs to perturb a stimulus in all possible orthogonal directions. To circumvent this “perceptual curse of dimensionality”, we compute and measure discrimination capabilities for a small set of most-informative perturbations, reducing the measurement cost from thousands of hours (a conservative estimate) to a single trial. We show that this single measurement, made for a variety of different test stimuli, is sufficient to differentiate models, select models that better match human perception, or generate new models that combine the advantages of both. We demonstrate the power of this method in assessing two examples: 1) comparing models for color discrimination; 2) comparing autoencoders trained with different regularizers.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/zhou24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/zhou24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Role Taxonomy of Units in Deep Neural Networks</title>
        <description>Identifying the role of network units in deep neural networks (DNNs) is critical in many respects, including understanding the mechanisms of DNNs and building basic connections between deep learning and neuroscience. However, it remains unclear which roles units present in DNNs with different generalization abilities. To this end, we give a role taxonomy of units in DNNs, in which units are categorized into four types according to their functional preference on the training set and the testing set separately. We show that the ratios of the four categories are highly associated with the generalization ability of DNNs from two distinct perspectives, based on which we identify signs of DNNs with good generalization.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/zhao24b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/zhao24b.html</guid>
        
        
      </item>
    
      <item>
        <title>NEUCORE: Neural Concept Reasoning for Composed Image Retrieval</title>
        <description>Composed image retrieval, which combines a reference image and a text modifier to identify the desired target image, is a challenging task that requires the model to comprehend both the vision and language modalities and their interactions. Existing approaches focus on holistic multi-modal interaction modeling and ignore the composed and complementary relationship between the reference image and the text modifier. To better utilize the complementarity of multi-modal inputs for effective information fusion and retrieval, we move multi-modal understanding to fine granularity at the concept level, and learn multi-modal concept alignment to identify the visual locations in the reference or target images that correspond to the text modifier. To this end, we propose a NEUral COncept REasoning (NEUCORE) model that incorporates multi-modal concept alignment and progressive multi-modal fusion over the aligned concepts. Specifically, considering that the text modifier may refer to semantic concepts that do not exist in the reference image and need to be added to the target image, we learn multi-modal concept alignment between the text modifier and the concatenation of the reference and target images, under a multiple-instance learning framework with image- and sentence-level weak supervision. Furthermore, based on the aligned concepts, to form discriminative fusion features of the input modalities for accurate target image retrieval, we propose a progressive fusion strategy with a unified execution architecture instantiated by the attended language semantic concepts. Our proposed approach is evaluated on three datasets and achieves state-of-the-art results.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/zhao24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/zhao24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Object-Centric Semantic Vector Quantization</title>
        <description>Neural discrete representations are crucial components of modern neural networks. However, their main limitation is that the primary strategies such as VQ-VAE can only provide representations at the patch level. Therefore, one of the main goals of representation learning, acquiring conceptual, semantic, and compositional abstractions such as the color and shape of an object, remains elusive. In this paper, we present the first approach to semantic neural discrete representation learning. The proposed model, called Semantic Vector-Quantized Variational Autoencoder (SVQ), leverages recent advances in unsupervised object-centric learning to address this limitation. Specifically, we observe that a simple approach quantizing at the object level poses a significant challenge and propose constructing scene representations hierarchically, from low-level discrete concept schemas to object representations. Additionally, we suggest a novel method for training a prior over these semantic representations, enabling the ability to generate images following the underlying data distribution, which is lacking in most object-centric models. In experiments on various 2D and 3D object-centric datasets, we find that our model achieves superior generation performance compared to non-semantic vector quantization methods such as VQ-VAE and previous object-centric generative models. Furthermore, we find that the semantic discrete representations can solve downstream scene understanding tasks that require reasoning about the properties of different objects in the scene.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/wu24b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/wu24b.html</guid>
        
        
      </item>
    
      <item>
        <title>What Mechanisms Does Knowledge Distillation Distill?</title>
        <description>Knowledge distillation is a commonly-used compression method in ML due to the popularity of increasingly large-scale models, but it is unclear if all the information a teacher model contains is distilled into the smaller student model. We aim to formalize the concept of ‘knowledge’ to investigate how knowledge is transferred during distillation, focusing on shared invariant outputs to counterfactual changes of dataset latent variables (we call these latents mechanisms). We define a student model to be a good stand-in model for a teacher if it shares the teacher’s learned mechanisms, and find that Jacobian matching and contrastive representation learning are viable methods by which to train such models. While these methods do not result in perfect transfer of mechanisms, we show they often improve student fidelity or mitigate simplicity bias (as measured by the teacher-to-student KL divergence and accuracy on various out-of-distribution test datasets), especially on datasets with spurious statistical correlations.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/wu24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/wu24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Bio-inspired parameter reuse: Exploiting inter-frame representation similarity with recurrence for accelerating temporal visual processing</title>
        <description>Feedforward neural networks are the dominant approach in current computer vision research. They typically do not incorporate recurrence, which is a prominent feature of the brain circuitry underlying biological vision. Inspired by biological findings, we introduce $\textbf{RecSlowFast}$, a recurrent slow-fast framework aimed at showing how recurrence can be useful for temporal visual processing. We perform a variable number of recurrent steps of certain layers in a network receiving input video frames, where each recurrent step is equivalent to a feedforward layer with weight reuse. By harnessing the hidden states extracted from the previous input frame, we reduce the computation cost by executing fewer recurrent steps on temporally correlated consecutive frames, while keeping good task accuracy. The early termination of the recurrence can be dynamically determined through newly introduced criteria based on the distance between hidden states, without using any auxiliary scheduler network. Unlike previous work, which requires one computationally heavy network and one light network to achieve the speed versus accuracy trade-off, RecSlowFast $\textbf{reuses a single set of parameters}$. Using a new $\textit{Temporal Pathfinder}$ dataset proposed in this work, we evaluate RecSlowFast on the task of continuously detecting the longest evolving contour in a video. The slow-fast inference mechanism speeds up the average frames per second by 279% on this dataset with comparable task accuracy using a desktop GPU. We further demonstrate a similar trend on CamVid, a video semantic segmentation dataset.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/wang24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/wang24a.html</guid>
        
        
      </item>
    
      <item>
        <title>NoPose-NeuS: Jointly Optimizing Camera Poses with Neural Implicit Surfaces for Multi-view Reconstruction</title>
        <description>Learning neural implicit surfaces from volume rendering has become popular for multi-view reconstruction. Neural surface reconstruction approaches can recover complex 3D geometry that is difficult for classical Multi-view Stereo (MVS) approaches, such as non-Lambertian surfaces and thin structures. However, one key assumption of these methods is that accurate camera parameters are known for the input multi-view images, which is not always the case. In this paper, we present NoPose-NeuS, a neural implicit surface reconstruction method that extends NeuS to jointly optimize camera poses with the geometry and color networks. We encode the camera poses as a multi-layer perceptron (MLP) and introduce two additional losses, a multi-view feature consistency loss and a rendered depth loss, to constrain the learned geometry for better-estimated camera poses and scene surfaces. Extensive experiments on the DTU dataset show that the proposed method can estimate relatively accurate camera poses while maintaining high surface reconstruction quality, with a mean Chamfer distance of 0.89.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/sabae24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/sabae24a.html</guid>
        
        
      </item>
    
      <item>
        <title>A sparse null code emerges in deep neural networks</title>
        <description>The internal representations of deep vision models are often assumed to encode specific image features, such as contours, textures, and object parts. However, it is possible for deep networks to learn highly abstract representations that may not be linked to any specific image feature. Here we present evidence for one such abstract representation in transformers and modern convolutional architectures that appears to serve as a null code, indicating image regions that are non-diagnostic for the object class. These null codes are both statistically and qualitatively distinct from the more commonly reported feature-related codes of vision models. Specifically, these null codes have several distinct characteristics: they are highly sparse, they have a single unique activation pattern for each network, they emerge abruptly at intermediate network depths, and they are activated in a feature-independent manner by weakly informative image regions, such as backgrounds. Together, these findings reveal a new class of highly abstract representations in deep vision models: sparse null codes that seem to indicate the absence of relevant features.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/robinson24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/robinson24a.html</guid>
        
        
      </item>
    
      <item>
        <title>DisCoV: Disentangling Time Series Representations via Contrastive based $l$-Variational Inference</title>
        <description>Learning disentangled representations is crucial for time series, offering benefits like feature derivation and improved interpretability, thereby enhancing task performance. We focus on disentangled representation learning for home appliance electricity usage, enabling users to understand and optimize their consumption for a reduced carbon footprint. Our approach frames the problem as disentangling each attribute’s role in total consumption (e.g., dishwashers, fridges, …). Unlike existing methods that assume attribute independence, we acknowledge real-world correlations between time series attributes, such as dishwashers and washing machines operating together during the winter season. To tackle this, we employ weakly supervised contrastive disentanglement, facilitating representation generalization across diverse correlated scenarios and new households. Our method utilizes innovative $l$-variational inference layers with self-attention, effectively addressing temporal dependencies across bottom-up and top-down networks. We find that DisCoV (Disentangling via Contrastive $l$-Variational) can enhance the task of reconstructing electricity consumption for individual appliances. We introduce TDS (Time Disentangling Score) to gauge disentanglement quality. TDS reliably reflects disentanglement performance, making it a valuable metric for evaluating time series representations. Code available at https://anonymous.4open.science/r/DisCo.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/oublal24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/oublal24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Supervising Variational Autoencoder Latent Representations with Language</title>
        <description>Supervising latent representations of data is of great interest for modern multi-modal generative machine learning.  In this work, we propose two new methods to use text to condition the latent representations of a VAE, and evaluate them on a novel conditional image-generation benchmark task. We find that the applied methods can be used to generate highly accurate reconstructed images through language querying with minimal compute resources. Our methods are quantitatively successful at conforming to textually-supervised attributes of an image while keeping unsupervised attributes constant. At large, we present critical observations on disentanglement between supervised and unsupervised properties of images and identify common barriers to effective disentanglement.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/lu24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/lu24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Unsupervised learning on spontaneous retinal activity leads to efficient neural representation geometry</title>
        <description>Prior to the onset of vision, neurons in the developing mammalian retina spontaneously fire in correlated activity patterns known as retinal waves. Experimental evidence suggests that retinal waves strongly influence the emergence of sensory representations before visual experience. We aim to model this early stage of functional development by using movies of neurally active developing retinas as pre-training data for neural networks. Specifically, we pre-train a ResNet-18 with an unsupervised contrastive learning objective (SimCLR) on both simulated and experimentally-obtained movies of retinal waves, then evaluate its performance on image classification tasks. We find that pre-training on retinal waves significantly improves performance on tasks that test object invariance to spatial translation, while slightly improving performance on more complex tasks like image classification. Notably, these performance boosts are realized on held-out natural images even though the pre-training procedure does not include any natural image data. We then propose a geometrical explanation for the increase in network performance, namely that the spatiotemporal characteristics of retinal waves facilitate the formation of separable feature representations. In particular, we demonstrate that networks pre-trained on retinal waves are more effective at separating image manifolds than randomly initialized networks, especially for manifolds defined by sets of spatial translations. These findings indicate that the broad spatiotemporal properties of retinal waves prepare networks for higher order feature extraction.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/ligeralde24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/ligeralde24a.html</guid>
        
        
      </item>
    
      <item>
        <title>A General Method for Testing Bayesian Models using Neural Data</title>
        <description>Bayesian models have been successful in explaining human and animal behavior, but the extent to which they can also explain neural activity is still an open question. A major obstacle to answering this question is that current methods for generating neural predictions require detailed and specific assumptions about the encoding of posterior beliefs in neural responses, with no consensus or decisive data about the nature of this encoding. Here, we present a new method that overcomes these challenges for a wide class of probabilistic encodings – including the two major classes of neural sampling and distributed distributional codes – and prove conditions for its validity. Our method tests whether the relationships between the model posteriors for different stimuli match the relationships between the corresponding neural responses – akin to representational similarity analysis (RSA), a widely used method for nonprobabilistic models. Finally, we present a new model comparison diagnostic for our method, based not on the agreement of the model with the data directly, but on the alignment of the model and data when injecting noise into our neural prediction generation method. We illustrate our method using simulated V1 data and compare two Bayesian models that are practically indistinguishable using behavior alone. Our results show a powerful new way to rigorously test Bayesian models on neural data.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/lengyel24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/lengyel24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Semi-Ensemble: A Simple Approach to Over-parameterized Model Interpolation</title>
        <description>We develop a unified framework for interpolating two models with various degrees of over-parameterization, with model merging and model ensembling as special cases. Instead of directly interpolating models in their original parameter space, the proposed Semi-Ensemble interpolates the over-parameterized versions of the models in a higher-dimensional joint parameter space. Here, the over-parameterizations recover each endpoint model when projected onto some low-dimensional subspace spanned by a fraction of the bases. By carefully constructing the joint parameter space, the interpolated model can achieve a smooth tradeoff between the total number of parameters and the model accuracy, outperforming existing baselines. Intriguingly, we show that Semi-Ensembles can sometimes achieve better performance than vanilla ensembles, even with a slightly smaller number of parameters.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/lee24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/lee24a.html</guid>
        
        
      </item>
    
      <item>
        <title>On the Direct Alignment of Latent Spaces</title>
        <description>With the wide adoption of deep learning and pre-trained models arises the question of how to effectively reuse existing latent spaces for new applications. One important question is how the geometry of the latent space changes between different training runs of the same architecture and between different architectures trained for the same task. Previous works proposed that the latent spaces for similar tasks are approximately isometric. However, in this work we show that methods restricted to this assumption perform worse than simply using a linear transformation to align the latent spaces. We propose directly computing a transformation between the latent codes of different architectures, which is more efficient than previous approaches and flexible with respect to the type of transformation used. Our experiments show that aligning the latent spaces with a linear transformation performs best while requiring no additional prior knowledge.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/lahner24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/lahner24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Soft Matching Distance: A metric on neural representations that captures single-neuron tuning</title>
        <description>Common measures of neural representational (dis)similarity are designed to be insensitive to rotations and reflections of the neural activation space. Motivated by the premise that the tuning of individual units may be important, there has been recent interest in developing stricter notions of representational (dis)similarity that require neurons to be individually matched across networks. When two networks have the same size (i.e. same number of neurons), a distance metric can be formulated by optimizing over neuron index permutations to maximize tuning curve alignment. However, it is not clear how to generalize this metric to measure distances between networks with different sizes. Here, we leverage a connection to optimal transport theory to derive a natural generalization based on “soft” permutations. The resulting metric is symmetric, satisfies the triangle inequality, and can be interpreted as a Wasserstein distance between two empirical distributions. Further, our proposed metric avoids counter-intuitive outcomes suffered by alternative approaches, and captures complementary geometric insights into neural representations that are entirely missed by rotation-invariant metrics.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/khosla24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/khosla24a.html</guid>
        
        
      </item>
    
      <item>
        <title>MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation</title>
        <description>Pre-training has been an important ingredient in developing strong monocular depth estimation models in recent years. For instance, self-supervised learning (SSL) is particularly effective by alleviating the need for large datasets with dense ground-truth depth maps. However, despite these improvements, our study reveals that the later layers of the SOTA SSL method are actually suboptimal. By examining the layer-wise representations, we demonstrate significant changes in these later layers during fine-tuning, indicating the ineffectiveness of their pre-trained features for depth estimation. To address these limitations, we propose MeSa, a unified framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. Hence, MeSa benefits from not only general-purpose representations learnt via masked pre-training but also specialized depth-specific features acquired via geometric and supervised pre-training. Our CKA layer-wise analysis confirms that our pre-training strategy indeed produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, via experiments on the NYUv2 and IBims-1 datasets, we demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa also outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/khan24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/khan24a.html</guid>
        
        
      </item>
    
      <item>
        <title>On Transferring Expert Knowledge from Tabular Data to Images</title>
        <description>Transferring knowledge across modalities has garnered significant attention in the field of machine learning as it enables the utilization of expert knowledge from diverse domains. In particular, the representation of expert knowledge in tabular form, commonly found in fields such as medicine, can greatly enhance the comprehensiveness and accuracy of image-based learning. However, the transfer of knowledge from tabular to image data presents unique challenges due to the distinct characteristics of these data types, making it difficult to determine &quot;how to reuse&quot; and &quot;which subset to reuse&quot;. To address this, we propose a novel method called CHannel tAbulaR alignment with optiMal tranSport (CHARMS) that automatically and effectively transfers relevant tabular knowledge. Specifically, by maximizing the mutual information between a group of channels and tabular features, our method modifies the visual embedding and captures the semantics of tabular knowledge. The alignment between channels and attributes helps select the subset of tabular data that contains knowledge relevant to images. Experimental results demonstrate that CHARMS effectively reuses tabular knowledge to improve the performance and interpretability of visual classifiers.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/jiang24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/jiang24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Linearly Structured World Representations in Maze-Solving Transformers</title>
        <description>The emergence of seemingly similar representations across tasks and neural architectures suggests that convergent properties may underlie sophisticated behavior. One form of representation that seems particularly fundamental to reasoning in many artificial (and perhaps natural) networks is the formation of world models, which decompose observed task structures into re-usable perceptual primitives and task-relevant relations. In this work, we show that auto-regressive transformers tasked with solving mazes learn to linearly represent the structure of mazes, and that the formation of these representations coincides with a sharp increase in generalization performance. Furthermore, we find preliminary evidence for Adjacency Heads which may play a role in computing valid paths through mazes.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/ivanitskiy24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/ivanitskiy24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Randomly Weighted Neuromodulation in Neural Networks Facilitates Learning of Manifolds Common Across Tasks</title>
        <description>Geometric Sensitive Hashing functions, a family of Locality-Sensitive Hashing functions, are neural network models that learn class-specific manifold geometry in supervised learning. However, given a set of supervised learning tasks, understanding the manifold geometries that can represent each task, and the kinds of relationships between the tasks based on them, has received little attention. We explore a formalization of this question by considering a generative process where each task is associated with a high-dimensional manifold, which can be done in brain-like models with neuromodulatory systems. Following this formulation, we define Task-specific Geometric Sensitive Hashing and show that a randomly weighted neural network with a neuromodulation system can realize this function.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/hong24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/hong24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Duality of Bures and Shape Distances with Implications for Comparing Neural Representations</title>
        <description>A multitude of (dis)similarity measures between neural network representations have been proposed, resulting in a fragmented research landscape. Most (dis)similarity measures fall into one of two categories. First, measures such as linear regression, canonical correlation analysis (CCA), and shape distances all learn explicit mappings between neural units to quantify similarity while accounting for expected invariances. Second, measures such as representational similarity analysis (RSA), centered kernel alignment (CKA), and normalized Bures similarity (NBS) all quantify similarity in summary statistics that are already invariant to such symmetries (e.g. by comparing stimulus-by-stimulus kernel matrices). Here, we take steps towards unifying these two broad categories of methods by observing that the cosine of the Riemannian shape distance (from category 1) is equal to NBS (from category 2). We explore how this connection leads to new interpretations of shape distances and NBS, and draw contrasts of these measures with CKA, a popular similarity measure in the deep learning literature.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/harvey24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/harvey24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Visual Expertise Explains Image Inversion Effects</title>
        <description>We present an anatomically-inspired neurocomputational model, including a foveated retina and the log-polar mapping from the visual field to the primary visual cortex, that recreates image inversion effects long seen in psychophysical studies. We show that visual expertise, the ability to discriminate between subordinate-level categories, changes the performance of the model on inverted images. We first explore face discrimination, which, in humans, relies on configural information. The log-polar transform disrupts configural information in an inverted image while leaving featural information relatively unaffected. We suggest this is responsible for the degradation of performance with inverted faces. We then recreate the effect with other subordinate-level category discriminators and show that the inversion effect arises as a result of visual expertise, where configural information becomes relevant as more identities are learned at the subordinate level. Our model matches the classic result: faces suffer more from inversion than mono-oriented objects, which are in turn more disrupted than non-mono-oriented objects when objects are familiar only at a basic level. It simultaneously shows that expert-level discrimination of other subordinate-level categories responds to inversion similarly to that of face experts.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/gahl24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/gahl24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Preface of UniReps: the First Workshop on Unifying Representations in Neural Models</title>
        <description>Discover why, when and how distinct learning processes yield similar representations, and the degree to which these can be unified.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/fumero24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/fumero24a.html</guid>
        
        
      </item>
    
      <item>
        <title>Multimodal decoding of human brain activity into images and text</title>
        <description>Every day, the human brain processes an immense volume of visual information, relying on intricate neural mechanisms to perceive and interpret these stimuli. Recent breakthroughs in functional magnetic resonance imaging (fMRI) have enabled scientists to extract visual information from human brain activity patterns. In this study, we present an innovative method for decoding brain activity into meaningful images and captions, with a specific focus on brain captioning due to its enhanced flexibility compared to brain decoding into images. Our approach takes advantage of cutting-edge image captioning models and incorporates a unique image reconstruction pipeline that utilizes latent diffusion models and depth estimation. We utilized the Natural Scenes Dataset, a comprehensive fMRI dataset from eight subjects who viewed images from the COCO dataset. We employed the Generative Image-to-text Transformer (GIT) as our backbone for captioning and propose a new image reconstruction pipeline based on latent diffusion models. The method involves training regularized linear regression models between brain activity and extracted features. Additionally, we incorporated depth maps from the ControlNet model to further guide the reconstruction process. We propose a multimodal approach that leverages similarities between neural and deep learning representations; by learning alignments between these spaces, we produce textual descriptions and image reconstructions from brain activity. We evaluate our methods using quantitative metrics for both generated captions and images. Our brain captioning approach outperforms existing methods, while our image reconstruction pipeline generates plausible images with improved spatial relationships. In conclusion, we demonstrate significant progress in brain decoding, showcasing the enormous potential of integrating vision and language to better understand human cognition.
Our approach provides a flexible platform for future research, with potential applications based on a combination of high-level semantic information coming from text and low-level image shape information coming from depth maps and initial guess images.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/ferrante24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/ferrante24a.html</guid>
        
        
      </item>
    
      <item>
        <title>ReWaRD: Retinal Waves for Pre-Training Artificial Neural Networks Mimicking Real Prenatal Development</title>
        <description>Computational models trained on large amounts of natural images are the state of the art for studying human vision – usually adult vision. Computational models of infant vision and its further development are gaining more and more attention in the community. In this work we target the very beginning of our visual experience – pre- and post-natal retinal waves, which are suggested to be a pre-training mechanism for the human visual system at a very early stage of development. We see this approach as an instance of a biologically plausible, data-driven inductive bias introduced through pre-training. We built a computational model that mimics this developmental mechanism by pre-training different artificial convolutional neural networks with simulated retinal wave images. The resulting features of this biologically plausible pre-training closely match the V1 features of the human visual system. We show that the performance gain from pre-training with retinal waves is similar to that of a state-of-the-art pre-training pipeline. Our framework contains the retinal wave generator as well as a training strategy, which can be a first step in a curriculum-learning-based training diet for various models of development. We release code, data, and trained networks to lay the groundwork for future work on visual development based on a curriculum learning approach that includes prenatal development, supporting studies of innate vs. learned properties of the human visual system. An additional benefit of our pre-trained networks for neuroscience or computer vision applications is the absence of biases inherited from datasets like ImageNet.</description>
        <pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v243/cappell24a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v243/cappell24a.html</guid>
        
        
      </item>
    
  </channel>
</rss>
