2024-10-10

Title: CaLMFlow: Volterra Flow Matching using Causal Language Models

Authors: Sizhuang He, Daniel Levine, Ivan Vrkic, Marco Francesco Bressana, David Zhang, Syed Asad Rizvi, Yangtian Zhang, Emanuele Zappala, David van Dijk
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2410.05292
Pdf URL: https://arxiv.org/pdf/2410.05292
Copy Paste: [[2410.05292]] CaLMFlow: Volterra Flow Matching using Causal Language Models(https://arxiv.org/abs/2410.05292)
Keywords: generative
Abstract: We introduce CaLMFlow (Causal Language Models for Flow Matching), a novel framework that casts flow matching as a Volterra integral equation (VIE), leveraging the power of large language models (LLMs) for continuous data generation. CaLMFlow enables the direct application of LLMs to learn complex flows by formulating flow matching as a sequence modeling task, bridging discrete language modeling and continuous generative modeling. Our method implements tokenization across space and time, thereby solving a VIE over these domains. This approach enables efficient handling of high-dimensional data and outperforms ODE solver-dependent methods like conditional flow matching (CFM). We demonstrate CaLMFlow's effectiveness on synthetic and real-world data, including single-cell perturbation response prediction, showcasing its ability to incorporate textual context and generalize to unseen conditions. Our results highlight LLM-driven flow matching as a promising paradigm in generative modeling, offering improved scalability, flexibility, and context-awareness.

Title: ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning

Authors: Dong Han, Salaheldin Mohamed, Yong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05309
Pdf URL: https://arxiv.org/pdf/2410.05309
Copy Paste: [[2410.05309]] ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning(https://arxiv.org/abs/2410.05309)
Keywords: diffusion, generative
Abstract: With the advance of generative AI, the text-to-image (T2I) model has the ability to generate various contents. However, the generated contents cannot be fully controlled. There is a potential risk that T2I model can generate unsafe images with uncomfortable contents. In our work, we focus on eliminating the NSFW (not safe for work) content generation from T2I model while maintaining the high quality of generated images by fine-tuning the pre-trained diffusion model via reinforcement learning by optimizing the well-designed content-safe reward function. The proposed method leverages a customized reward function consisting of the CLIP (Contrastive Language-Image Pre-training) and nudity rewards to prune the nudity contents that adhere to the pret-rained model and keep the corresponding semantic meaning on the safe side. In this way, the T2I model is robust to unsafe adversarial prompts since unsafe visual representations are mitigated from latent space. Extensive experiments conducted on different datasets demonstrate the effectiveness of the proposed method in alleviating unsafe content generation while preserving the high-fidelity of benign images as well as images generated by unsafe prompts. We compare with five existing state-of-the-art (SOTA) methods and achieve competitive performance on sexual content removal and image quality retention. In terms of robustness, our method outperforms counterparts under the SOTA black-box attacking model. Furthermore, our constructed method can be a benchmark for anti-NSFW generation with semantically-relevant safe alignment.

Title: PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms

Authors: Yilong Li, Jingyu Liu, Hao Zhang, M Badri Narayanan, Utkarsh Sharma, Shuai Zhang, Pan Hu, Yijing Zeng, Jayaram Raghuram, Suman Banerjee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05315
Pdf URL: https://arxiv.org/pdf/2410.05315
Copy Paste: [[2410.05315]] PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms(https://arxiv.org/abs/2410.05315)
Keywords: generative
Abstract: Deploying large language models (LLMs) locally on mobile devices is advantageous in scenarios where transmitting data to remote cloud servers is either undesirable due to privacy concerns or impractical due to network connection. Recent advancements (MLC, 2023a; Gerganov, 2023) have facilitated the local deployment of LLMs. However, local deployment also presents challenges, particularly in balancing quality (generative performance), latency, and throughput within the hardware constraints of mobile devices. In this paper, we introduce our lightweight, all-in-one automated benchmarking framework that allows users to evaluate LLMs on mobile devices. We provide a comprehensive benchmark of various popular LLMs with different quantization configurations (both weights and activations) across multiple mobile platforms with varying hardware capabilities. Unlike traditional benchmarks that assess full-scale models on high-end GPU clusters, we focus on evaluating resource efficiency (memory and power consumption) and harmful output for compressed models on mobile devices. Our key observations include i) differences in energy efficiency and throughput across mobile platforms; ii) the impact of quantization on memory usage, GPU execution time, and power consumption; and iii) accuracy and performance degradation of quantized models compared to their non-quantized counterparts; and iv) the frequency of hallucinations and toxic content generated by compressed LLMs on mobile devices.

Title: Accelerating Diffusion Transformers with Token-wise Feature Caching

Authors: Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, Linfeng Zhang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2410.05317
Pdf URL: https://arxiv.org/pdf/2410.05317
Copy Paste: [[2410.05317]] Accelerating Diffusion Transformers with Token-wise Feature Caching(https://arxiv.org/abs/2410.05317)
Keywords: diffusion
Abstract: Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching, and feature caching on some tokens may lead to 10$\times$ more destruction to the overall generation quality compared with other tokens. In this paper, we introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching, and further enable us to apply different caching ratios to neural layers in different types and depths. Extensive experiments on PixArt-$\alpha$, OpenSora, and DiT demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36$\times$ and 1.93$\times$ acceleration are achieved on OpenSora and PixArt-$\alpha$ with almost no drop in generation quality.

Title: Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models

Authors: Muhammad Haaris Khan, Hadrien Reynaud, Bernhard Kainz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05322
Pdf URL: https://arxiv.org/pdf/2410.05322
Copy Paste: [[2410.05322]] Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models(https://arxiv.org/abs/2410.05322)
Keywords: diffusion
Abstract: Although powerful for image generation, consistent and controllable video is a longstanding problem for diffusion models. Video models require extensive training and computational resources, leading to high costs and large environmental impacts. Moreover, video models currently offer limited control of the output motion. This paper introduces a novel approach to video generation by augmenting image diffusion models to create sequential animation frames while maintaining fine detail. These techniques can be applied to existing image models without training any video parameters (zero-shot) by altering the input noise in a latent diffusion model. Two complementary methods are presented. Noise crystallization ensures consistency but is limited to large movements due to reduced latent embedding sizes. Liquid noise trades consistency for greater flexibility without resolution limitations. The core concepts also allow other applications such as relighting, seamless upscaling, and improved video style transfer. Furthermore, an exploration of the VAE embedding used for latent diffusion models is performed, resulting in interesting theoretical insights such as a method for human-interpretable latent spaces.

Title: From Incomplete Coarse-Grained to Complete Fine-Grained: A Two-Stage Framework for Spatiotemporal Data Reconstruction

Authors: Ziyu Sun, Haoyang Su, En Wang, Funing Yang, Yongjian Yang, Wenbin Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05323
Pdf URL: https://arxiv.org/pdf/2410.05323
Copy Paste: [[2410.05323]] From Incomplete Coarse-Grained to Complete Fine-Grained: A Two-Stage Framework for Spatiotemporal Data Reconstruction(https://arxiv.org/abs/2410.05323)
Keywords: diffusion
Abstract: With the rapid development of various sensing devices, spatiotemporal data is becoming increasingly important nowadays. However, due to sensing costs and privacy concerns, the collected data is often incomplete and coarse-grained, limiting its application to specific tasks. To address this, we propose a new task called spatiotemporal data reconstruction, which aims to infer complete and fine-grained data from sparse and coarse-grained observations. To achieve this, we introduce a two-stage data inference framework, DiffRecon, grounded in the Denoising Diffusion Probabilistic Model (DDPM). In the first stage, we present Diffusion-C, a diffusion model augmented by ST-PointFormer, a powerful encoder designed to leverage the spatial correlations between sparse data points. Following this, the second stage introduces Diffusion-F, which incorporates the proposed T-PatternNet to capture the temporal pattern within sequential data. Together, these two stages form an end-to-end framework capable of inferring complete, fine-grained data from incomplete and coarse-grained observations. We conducted experiments on multiple real-world datasets to demonstrate the superiority of our method.

Title: Generating CAD Code with Vision-Language Models for 3D Designs

Authors: Kamel Alrashedy, Pradyumna Tambwekar, Zulfiqar Zaidi, Megan Langwasser, Wei Xu, Matthew Gombolay
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05340
Pdf URL: https://arxiv.org/pdf/2410.05340
Copy Paste: [[2410.05340]] Generating CAD Code with Vision-Language Models for 3D Designs(https://arxiv.org/abs/2410.05340)
Keywords: generative
Abstract: Generative AI has transformed the fields of Design and Manufacturing by providing efficient and automated methods for generating and modifying 3D objects. One approach involves using Large Language Models (LLMs) to generate Computer- Aided Design (CAD) scripting code, which can then be executed to render a 3D object; however, the resulting 3D object may not meet the specified requirements. Testing the correctness of CAD generated code is challenging due to the complexity and structure of 3D objects (e.g., shapes, surfaces, and dimensions) that are not feasible in code. In this paper, we introduce CADCodeVerify, a novel approach to iteratively verify and improve 3D objects generated from CAD code. Our approach works by producing ameliorative feedback by prompting a Vision-Language Model (VLM) to generate and answer a set of validation questions to verify the generated object and prompt the VLM to correct deviations. To evaluate CADCodeVerify, we introduce, CADPrompt, the first benchmark for CAD code generation, consisting of 200 natural language prompts paired with expert-annotated scripting code for 3D objects to benchmark progress. Our findings show that CADCodeVerify improves VLM performance by providing visual feedback, enhancing the structure of the 3D objects, and increasing the success rate of the compiled program. When applied to GPT-4, CADCodeVerify achieved a 7.30% reduction in Point Cloud distance and a 5.0% improvement in success rate compared to prior work

Title: AnyAttack: Towards Large-scale Self-supervised Generation of Targeted Adversarial Examples for Vision-Language Models

Authors: Jiaming Zhang, Junhong Ye, Xingjun Ma, Yige Li, Yunfan Yang, Jitao Sang, Dit-Yan Yeung
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05346
Pdf URL: https://arxiv.org/pdf/2410.05346
Copy Paste: [[2410.05346]] AnyAttack: Towards Large-scale Self-supervised Generation of Targeted Adversarial Examples for Vision-Language Models(https://arxiv.org/abs/2410.05346)
Keywords: self-supervised
Abstract: Due to their multimodal capabilities, Vision-Language Models (VLMs) have found numerous impactful applications in real-world scenarios. However, recent studies have revealed that VLMs are vulnerable to image-based adversarial attacks, particularly targeted adversarial images that manipulate the model to generate harmful content specified by the adversary. Current attack methods rely on predefined target labels to create targeted adversarial attacks, which limits their scalability and applicability for large-scale robustness evaluations. In this paper, we propose AnyAttack, a self-supervised framework that generates targeted adversarial images for VLMs without label supervision, allowing any image to serve as a target for the attack. To address the limitation of existing methods that require label supervision, we introduce a contrastive loss that trains a generator on a large-scale unlabeled image dataset, LAION-400M dataset, for generating targeted adversarial noise. This large-scale pre-training endows our method with powerful transferability across a wide range of VLMs. Extensive experiments on five mainstream open-source VLMs (CLIP, BLIP, BLIP2, InstructBLIP, and MiniGPT-4) across three multimodal tasks (image-text retrieval, multimodal classification, and image captioning) demonstrate the effectiveness of our attack. Additionally, we successfully transfer AnyAttack to multiple commercial VLMs, including Google's Gemini, Claude's Sonnet, and Microsoft's Copilot. These results reveal an unprecedented risk to VLMs, highlighting the need for effective countermeasures.

Title: LLMs Are In-Context Reinforcement Learners

Authors: Giovanni Monea, Antoine Bosselut, Kianté Brantley, Yoav Artzi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05362
Pdf URL: https://arxiv.org/pdf/2410.05362
Copy Paste: [[2410.05362]] LLMs Are In-Context Reinforcement Learners(https://arxiv.org/abs/2410.05362)
Keywords: in-context
Abstract: Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL). This work studies if this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards. We show that a naive application of ICRL fails miserably, and identify the root cause as a fundamental deficiency at exploration, which leads to quick model degeneration. We propose an algorithm to address this deficiency by increasing test-time compute, as well as a compute-bound approximation. We use several challenging classification tasks to empirically show that our ICRL algorithms lead to effective learning from rewards alone, and analyze the characteristics of this ability and our methods. Overall, our results reveal remarkable ICRL abilities in LLMs.

Title: Diffusion Model Predictive Control

Authors: Guangyao Zhou, Sivaramakrishnan Swaminathan, Rajkumar Vasudeva Raju, J. Swaroop Guntupalli, Wolfgang Lehrach, Joseph Ortiz, Antoine Dedieu, Miguel Lázaro-Gredilla, Kevin Murphy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05364
Pdf URL: https://arxiv.org/pdf/2410.05364
Copy Paste: [[2410.05364]] Diffusion Model Predictive Control(https://arxiv.org/abs/2410.05364)
Keywords: diffusion
Abstract: We propose Diffusion Model Predictive Control (D-MPC), a novel MPC approach that learns a multi-step action proposal and a multi-step dynamics model, both using diffusion models, and combines them for use in online MPC. On the popular D4RL benchmark, we show performance that is significantly better than existing model-based offline planning methods using MPC and competitive with state-of-the-art (SOTA) model-based and model-free reinforcement learning methods. We additionally illustrate D-MPC's ability to optimize novel reward functions at run time and adapt to novel dynamics, and highlight its advantages compared to existing diffusion-based planning baselines.

Title: STOP! Camera Spoofing via the in-Vehicle IP Network

Authors: Dror Peri, Avishai Wool
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2410.05417
Pdf URL: https://arxiv.org/pdf/2410.05417
Copy Paste: [[2410.05417]] STOP! Camera Spoofing via the in-Vehicle IP Network(https://arxiv.org/abs/2410.05417)
Keywords: anomaly
Abstract: Autonomous driving and advanced driver assistance systems (ADAS) rely on cameras to control the driving. In many prior approaches an attacker aiming to stop the vehicle had to send messages on the specialized and better-defended CAN bus. We suggest an easier alternative: manipulate the IP-based network communication between the camera and the ADAS logic, inject fake images of stop signs or red lights into the video stream, and let the ADAS stop the car safely. We created an attack tool that successfully exploits the GigE Vision protocol. Then we analyze two classes of passive anomaly detectors to identify such attacks: protocol-based detectors and video-based detectors. We implemented multiple detectors of both classes and evaluated them on data collected from our test vehicle and also on data from the public BDD corpus. Our results show that such detectors are effective against naive adversaries, but sophisticated adversaries can evade detection. Finally, we propose a novel class of active defense mechanisms that randomly adjust camera parameters during the video transmission, and verify that the received images obey the requested adjustments. Within this class we focus on a specific implementation, the width-varying defense, which randomly modifies the width of every frame. Beyond its function as an anomaly detector, this defense is also a protective measure against certain attacks: by distorting injected image patches it prevents their recognition by the ADAS logic. We demonstrate the effectiveness of the width-varying defense through theoretical analysis and by an extensive evaluation of several types of attack in a wide range of realistic road driving conditions. The best the attack was able to achieve against this defense was injecting a stop sign for a duration of 0.2 seconds, with a success probability of 0.2%, whereas stopping a vehicle requires about 2.5 seconds.

Title: Diffusion Imitation from Observation

Authors: Bo-Ruei Huang, Chun-Kai Yang, Chun-Mao Lai, Dai-Jie Wu, Shao-Hua Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05429
Pdf URL: https://arxiv.org/pdf/2410.05429
Copy Paste: [[2410.05429]] Diffusion Imitation from Observation(https://arxiv.org/abs/2410.05429)
Keywords: diffusion, generative
Abstract: Learning from observation (LfO) aims to imitate experts by learning from state-only demonstrations without requiring action labels. Existing adversarial imitation learning approaches learn a generator agent policy to produce state transitions that are indistinguishable to a discriminator that learns to classify agent and expert state transitions. Despite its simplicity in formulation, these methods are often sensitive to hyperparameters and brittle to train. Motivated by the recent success of diffusion models in generative modeling, we propose to integrate a diffusion model into the adversarial imitation learning from observation framework. Specifically, we employ a diffusion model to capture expert and agent transitions by generating the next state, given the current state. Then, we reformulate the learning objective to train the diffusion model as a binary classifier and use it to provide "realness" rewards for policy learning. Our proposed framework, Diffusion Imitation from Observation (DIFO), demonstrates superior performance in various continuous control domains, including navigation, locomotion, manipulation, and games. Project page: this https URL

Title: Continuous Ensemble Weather Forecasting with Diffusion models

Authors: Martin Andrae, Tomas Landelius, Joel Oskarsson, Fredrik Lindsten
Subjects: cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2410.05431
Pdf URL: https://arxiv.org/pdf/2410.05431
Copy Paste: [[2410.05431]] Continuous Ensemble Weather Forecasting with Diffusion models(https://arxiv.org/abs/2410.05431)
Keywords: diffusion
Abstract: Weather forecasting has seen a shift in methods from numerical simulations to data-driven systems. While initial research in the area focused on deterministic forecasting, recent works have used diffusion models to produce skillful ensemble forecasts. These models are trained on a single forecasting step and rolled out autoregressively. However, they are computationally expensive and accumulate errors for high temporal resolution due to the many rollout steps. We address these limitations with Continuous Ensemble Forecasting, a novel and flexible method for sampling ensemble forecasts in diffusion models. The method can generate temporally consistent ensemble trajectories completely in parallel, with no autoregressive steps. Continuous Ensemble Forecasting can also be combined with autoregressive rollouts to yield forecasts at an arbitrary fine temporal resolution without sacrificing accuracy. We demonstrate that the method achieves competitive results for global weather forecasting with good probabilistic properties.

Title: Can LLMs Understand Time Series Anomalies?

Authors: Zihao Zhou, Rose Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05440
Pdf URL: https://arxiv.org/pdf/2410.05440
Copy Paste: [[2410.05440]] Can LLMs Understand Time Series Anomalies?(https://arxiv.org/abs/2410.05440)
Keywords: anomaly
Abstract: Large Language Models (LLMs) have gained popularity in time series forecasting, but their potential for anomaly detection remains largely unexplored. Our study investigates whether LLMs can understand and detect anomalies in time series data, focusing on zero-shot and few-shot scenarios. Inspired by conjectures about LLMs' behavior from time series forecasting research, we formulate key hypotheses about LLMs' capabilities in time series anomaly detection. We design and conduct principled experiments to test each of these hypotheses. Our investigation reveals several surprising findings about LLMs for time series: 1. LLMs understand time series better as *images* rather than as text 2. LLMs did not demonstrate enhanced performance when prompted to engage in *explicit reasoning* about time series analysis 3. Contrary to common beliefs, LLM's understanding of time series *do not* stem from their repetition biases or arithmetic abilities 4. LLMs' behaviors and performance in time series analysis *vary significantly* across different model architectures This study provides the first comprehensive analysis of contemporary LLM capabilities in time series anomaly detection. Our results suggest that while LLMs can understand time series anomalies, many common conjectures based on their reasoning capabilities do not hold. These insights pave the way for more effective LLM-based approaches in time series analysis, bridging the gap between forecasting and anomaly detection applications.

Title: Task Diversity Shortens the ICL Plateau

Authors: Jaeyeon Kim, Sehyun Kwon, Joo Young Choi, Jongho Park, Jaewoong Cho, Jason D. Lee, Ernest K. Ryu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2410.05448
Pdf URL: https://arxiv.org/pdf/2410.05448
Copy Paste: [[2410.05448]] Task Diversity Shortens the ICL Plateau(https://arxiv.org/abs/2410.05448)
Keywords: in-context
Abstract: In-context learning (ICL) describes a language model's ability to generate outputs based on a set of input demonstrations and a subsequent query. To understand this remarkable capability, researchers have studied simplified, stylized models. These studies have consistently observed long loss plateaus, during which models exhibit minimal improvement, followed by a sudden, rapid surge of learning. In this work, we reveal that training on multiple diverse ICL tasks simultaneously shortens the loss plateaus, making each task easier to learn. This finding is surprising as it contradicts the natural intuition that the combined complexity of multiple ICL tasks would lengthen the learning process, not shorten it. Our result suggests that the recent success in large-scale training of language models may be attributed not only to the richness of the data at scale but also to the easier optimization (training) induced by the diversity of natural language training data.

Title: Image Watermarks are Removable Using Controllable Regeneration from Clean Noise

Authors: Yepeng Liu, Yiren Song, Hai Ci, Yu Zhang, Haofan Wang, Mike Zheng Shou, Yuheng Bu
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2410.05470
Pdf URL: https://arxiv.org/pdf/2410.05470
Copy Paste: [[2410.05470]] Image Watermarks are Removable Using Controllable Regeneration from Clean Noise(https://arxiv.org/abs/2410.05470)
Keywords: diffusion, generative
Abstract: Image watermark techniques provide an effective way to assert ownership, deter misuse, and trace content sources, which has become increasingly essential in the era of large generative models. A critical attribute of watermark techniques is their robustness against various manipulations. In this paper, we introduce a watermark removal approach capable of effectively nullifying the state of the art watermarking techniques. Our primary insight involves regenerating the watermarked image starting from a clean Gaussian noise via a controllable diffusion model, utilizing the extracted semantic and spatial features from the watermarked image. The semantic control adapter and the spatial control network are specifically trained to control the denoising process towards ensuring image quality and enhancing consistency between the cleaned image and the original watermarked image. To achieve a smooth trade-off between watermark removal performance and image consistency, we further propose an adjustable and controllable regeneration scheme. This scheme adds varying numbers of noise steps to the latent representation of the watermarked image, followed by a controlled denoising process starting from this noisy latent representation. As the number of noise steps increases, the latent representation progressively approaches clean Gaussian noise, facilitating the desired trade-off. We apply our watermark removal methods across various watermarking techniques, and the results demonstrate that our methods offer superior visual consistency/quality and enhanced watermark removal performance compared to existing regeneration approaches.

Title: fPLSA: Learning Semantic Structures in Document Collections Using Foundation Models

Authors: Weijia Xu, Nebojsa Jojic, Nicolas Le Roux
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05481
Pdf URL: https://arxiv.org/pdf/2410.05481
Copy Paste: [[2410.05481]] fPLSA: Learning Semantic Structures in Document Collections Using Foundation Models(https://arxiv.org/abs/2410.05481)
Keywords: foundation model
Abstract: Humans have the ability to learn new tasks by inferring high-level concepts from existing solution, then manipulating these concepts in lieu of the raw data. Can we automate this process by deriving latent semantic structures in a document collection using foundation models? We introduce fPLSA, a foundation-model-based Probabilistic Latent Semantic Analysis (PLSA) method that iteratively clusters and tags document segments based on document-level contexts. These tags can be used to model the structure of given documents and for hierarchical sampling of new texts. Our experiments on story writing, math, and multi-step reasoning datasets demonstrate that fPLSA tags help reconstruct the original texts better than existing tagging methods. Moreover, when used for hierarchical sampling, fPLSA produces more diverse outputs with a higher likelihood of hitting the correct answer than direct sampling and hierarchical sampling with existing tagging methods.

Title: Transformers learn variable-order Markov chains in-context

Authors: Ruida Zhou, Chao Tian, Suhas Diggavi
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2410.05493
Pdf URL: https://arxiv.org/pdf/2410.05493
Copy Paste: [[2410.05493]] Transformers learn variable-order Markov chains in-context(https://arxiv.org/abs/2410.05493)
Keywords: in-context
Abstract: Large language models have demonstrated impressive in-context learning (ICL) capability. However, it is still unclear how the underlying transformers accomplish it, especially in more complex scenarios. Toward this goal, several recent works studied how transformers learn fixed-order Markov chains (FOMC) in context, yet natural languages are more suitably modeled by variable-order Markov chains (VOMC), i.e., context trees (CTs). In this work, we study the ICL of VOMC by viewing language modeling as a form of data compression and focus on small alphabets and low-order VOMCs. This perspective allows us to leverage mature compression algorithms, such as context-tree weighting (CTW) and prediction by partial matching (PPM) algorithms as baselines, the former of which is Bayesian optimal for a class of CTW priors. We empirically observe a few phenomena: 1) Transformers can indeed learn to compress VOMC in-context, while PPM suffers significantly; 2) The performance of transformers is not very sensitive to the number of layers, and even a two-layer transformer can learn in-context quite well; and 3) Transformers trained and tested on non-CTW priors can significantly outperform the CTW algorithm. To explain these phenomena, we analyze the attention map of the transformers and extract two mechanisms, on which we provide two transformer constructions: 1) A construction with $D+2$ layers that can mimic the CTW algorithm accurately for CTs of maximum order $D$, 2) A 2-layer transformer that utilizes the feed-forward network for probability blending. One distinction from the FOMC setting is that a counting mechanism appears to play an important role. We implement these synthetic transformer layers and show that such hybrid transformers can match the ICL performance of transformers, and more interestingly, some of them can perform even better despite the much-reduced parameter sets.

Title: Toward General Object-level Mapping from Sparse Views with 3D Diffusion Priors

Authors: Ziwei Liao, Binbin Xu, Steven L. Waslander
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2410.05514
Pdf URL: https://arxiv.org/pdf/2410.05514
Copy Paste: [[2410.05514]] Toward General Object-level Mapping from Sparse Views with 3D Diffusion Priors(https://arxiv.org/abs/2410.05514)
Keywords: diffusion, generative
Abstract: Object-level mapping builds a 3D map of objects in a scene with detailed shapes and poses from multi-view sensor observations. Conventional methods struggle to build complete shapes and estimate accurate poses due to partial occlusions and sensor noise. They require dense observations to cover all objects, which is challenging to achieve in robotics trajectories. Recent work introduces generative shape priors for object-level mapping from sparse views, but is limited to single-category objects. In this work, we propose a General Object-level Mapping system, GOM, which leverages a 3D diffusion model as shape prior with multi-category support and outputs Neural Radiance Fields (NeRFs) for both texture and geometry for all objects in a scene. GOM includes an effective formulation to guide a pre-trained diffusion model with extra nonlinear constraints from sensor measurements without finetuning. We also develop a probabilistic optimization formulation to fuse multi-view sensor observations and diffusion priors for joint 3D object pose and shape estimation. Our GOM system demonstrates superior multi-category mapping performance from sparse views, and achieves more accurate mapping results compared to state-of-the-art methods on the real-world benchmarks. We will release our code: this https URL.

Title: Generative Portrait Shadow Removal

Authors: Jae Shin Yoon, Zhixin Shu, Mengwei Ren, Xuaner Zhang, Yannick Hold-Geoffroy, Krishna Kumar Singh, He Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05525
Pdf URL: https://arxiv.org/pdf/2410.05525
Copy Paste: [[2410.05525]] Generative Portrait Shadow Removal(https://arxiv.org/abs/2410.05525)
Keywords: diffusion, generative
Abstract: We introduce a high-fidelity portrait shadow removal model that can effectively enhance the image of a portrait by predicting its appearance under disturbing shadows and highlights. Portrait shadow removal is a highly ill-posed problem where multiple plausible solutions can be found based on a single image. While existing works have solved this problem by predicting the appearance residuals that can propagate local shadow distribution, such methods are often incomplete and lead to unnatural predictions, especially for portraits with hard shadows. We overcome the limitations of existing local propagation methods by formulating the removal problem as a generation task where a diffusion model learns to globally rebuild the human appearance from scratch as a condition of an input portrait image. For robust and natural shadow removal, we propose to train the diffusion model with a compositional repurposing framework: a pre-trained text-guided image generation model is first fine-tuned to harmonize the lighting and color of the foreground with a background scene by using a background harmonization dataset; and then the model is further fine-tuned to generate a shadow-free portrait image via a shadow-paired dataset. To overcome the limitation of losing fine details in the latent diffusion model, we propose a guided-upsampling network to restore the original high-frequency details (wrinkles and dots) from the input image. To enable our compositional training framework, we construct a high-fidelity and large-scale dataset using a lightstage capturing system and synthetic graphics simulation. Our generative framework effectively removes shadows caused by both self and external occlusions while maintaining original lighting distribution and high-frequency details. Our method also demonstrates robustness to diverse subjects captured in real environments.

Title: Rethinking Weak-to-Strong Augmentation in Source-Free Domain Adaptive Object Detection

Authors: Jiuzheng Yang, Song Tang, Yangkuiyi Zhang, Shuaifeng Li, Mao Ye, Jianwei Zhang, Xiatian Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05557
Pdf URL: https://arxiv.org/pdf/2410.05557
Copy Paste: [[2410.05557]] Rethinking Weak-to-Strong Augmentation in Source-Free Domain Adaptive Object Detection(https://arxiv.org/abs/2410.05557)
Keywords: self-supervised
Abstract: Source-Free domain adaptive Object Detection (SFOD) aims to transfer a detector (pre-trained on source domain) to new unlabelled target domains. Current SFOD methods typically follow the Mean Teacher framework, where weak-to-strong augmentation provides diverse and sharp contrast for self-supervised learning. However, this augmentation strategy suffers from an inherent problem called crucial semantics loss: Due to random, strong disturbance, strong augmentation is prone to losing typical visual components, hindering cross-domain feature extraction. To address this thus-far ignored limitation, this paper introduces a novel Weak-to-Strong Contrastive Learning (WSCoL) approach. The core idea is to distill semantics lossless knowledge in the weak features (from the weak/teacher branch) to guide the representation learning upon the strong features (from the strong/student branch). To achieve this, we project the original features into a shared space using a mapping network, thereby reducing the bias between the weak and strong features. Meanwhile, a weak features-guided contrastive learning is performed in a weak-to-strong manner alternatively. Specifically, we first conduct an adaptation-aware prototype-guided clustering on the weak features to generate pseudo labels for corresponding strong features matched through proposals. Sequentially, we identify positive-negative samples based on the pseudo labels and perform cross-category contrastive learning on the strong features where an uncertainty estimator encourages adaptive background contrast. Extensive experiments demonstrate that WSCoL yields new state-of-the-art performance, offering a built-in mechanism mitigating crucial semantics loss for traditional Mean Teacher framework. The code and data will be released soon.

Title: TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation

Authors: Gihyun Kwon, Jong Chul Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05591
Pdf URL: https://arxiv.org/pdf/2410.05591
Copy Paste: [[2410.05591]] TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation(https://arxiv.org/abs/2410.05591)
Keywords: diffusion
Abstract: Despite significant advancements in customizing text-to-image and video generation models, generating images and videos that effectively integrate multiple personalized concepts remains a challenging task. To address this, we present TweedieMix, a novel method for composing customized diffusion models during the inference phase. By analyzing the properties of reverse diffusion sampling, our approach divides the sampling process into two stages. During the initial steps, we apply a multiple object-aware sampling technique to ensure the inclusion of the desired target objects. In the later steps, we blend the appearances of the custom concepts in the de-noised image space using Tweedie's formula. Our results demonstrate that TweedieMix can generate multiple personalized concepts with higher fidelity than existing methods. Moreover, our framework can be effortlessly extended to image-to-video diffusion models, enabling the generation of videos that feature multiple personalized concepts. Results and source code are in our anonymous project page.

Title: When Graph Neural Networks Meet Dynamic Mode Decomposition

Authors: Dai Shi, Lequan Lin, Andi Han, Zhiyong Wang, Yi Guo, Junbin Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05593
Pdf URL: https://arxiv.org/pdf/2410.05593
Copy Paste: [[2410.05593]] When Graph Neural Networks Meet Dynamic Mode Decomposition(https://arxiv.org/abs/2410.05593)
Keywords: diffusion
Abstract: Graph Neural Networks (GNNs) have emerged as fundamental tools for a wide range of prediction tasks on graph-structured data. Recent studies have drawn analogies between GNN feature propagation and diffusion processes, which can be interpreted as dynamical systems. In this paper, we delve deeper into this perspective by connecting the dynamics in GNNs to modern Koopman theory and its numerical method, Dynamic Mode Decomposition (DMD). We illustrate how DMD can estimate a low-rank, finite-dimensional linear operator based on multiple states of the system, effectively approximating potential nonlinear interactions between nodes in the graph. This approach allows us to capture complex dynamics within the graph accurately and efficiently. We theoretically establish a connection between the DMD-estimated operator and the original dynamic operator between system states. Building upon this foundation, we introduce a family of DMD-GNN models that effectively leverage the low-rank eigenfunctions provided by the DMD algorithm. We further discuss the potential of enhancing our approach by incorporating domain-specific constraints such as symmetry into the DMD computation, allowing the corresponding GNN models to respect known physical properties of the underlying system. Our work paves the path for applying advanced dynamical system analysis tools via GNNs. We validate our approach through extensive experiments on various learning tasks, including directed graphs, large-scale graphs, long-range interactions, and spatial-temporal graphs. We also empirically verify that our proposed models can serve as powerful encoders for link prediction tasks. The results demonstrate that our DMD-enhanced GNNs achieve state-of-the-art performance, highlighting the effectiveness of integrating DMD into GNN frameworks.

Title: Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning

Authors: Ming Shan Hee, Aditi Kumaresan, Roy Ka-Wei Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05600
Pdf URL: https://arxiv.org/pdf/2410.05600
Copy Paste: [[2410.05600]] Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning(https://arxiv.org/abs/2410.05600)
Keywords: in-context
Abstract: The widespread presence of hate speech on the internet, including formats such as text-based tweets and vision-language memes, poses a significant challenge to digital platform safety. Recent research has developed detection models tailored to specific modalities; however, there is a notable gap in transferring detection capabilities across different formats. This study conducts extensive experiments using few-shot in-context learning with large language models to explore the transferability of hate speech detection between modalities. Our findings demonstrate that text-based hate speech examples can significantly enhance the classification accuracy of vision-language hate speech. Moreover, text-based demonstrations outperform vision-language demonstrations in few-shot learning settings. These results highlight the effectiveness of cross-modality knowledge transfer and offer valuable insights for improving hate speech detection systems.

Title: ReFIR: Grounding Large Restoration Models with Retrieval Augmentation

Authors: Hang Guo, Tao Dai, Zhihao Ouyang, Taolin Zhang, Yaohua Zha, Bin Chen, Shu-tao Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05601
Pdf URL: https://arxiv.org/pdf/2410.05601
Copy Paste: [[2410.05601]] ReFIR: Grounding Large Restoration Models with Retrieval Augmentation(https://arxiv.org/abs/2410.05601)
Keywords: diffusion
Abstract: Recent advances in diffusion-based Large Restoration Models (LRMs) have significantly improved photo-realistic image restoration by leveraging the internal knowledge embedded within model weights. However, existing LRMs often suffer from the hallucination dilemma, i.e., producing incorrect contents or textures when dealing with severe degradations, due to their heavy reliance on limited internal knowledge. In this paper, we propose an orthogonal solution called the Retrieval-augmented Framework for Image Restoration (ReFIR), which incorporates retrieved images as external knowledge to extend the knowledge boundary of existing LRMs in generating details faithful to the original scene. Specifically, we first introduce the nearest neighbor lookup to retrieve content-relevant high-quality images as reference, after which we propose the cross-image injection to modify existing LRMs to utilize high-quality textures from retrieved images. Thanks to the additional external knowledge, our ReFIR can well handle the hallucination challenge and facilitate faithfully results. Extensive experiments demonstrate that ReFIR can achieve not only high-fidelity but also realistic restoration results. Importantly, our ReFIR requires no training and is adaptable to various LRMs.

Title: Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition

Authors: Zheyang Xiong, Ziyang Cai, John Cooper, Albert Ge, Vasilis Papageorgiou, Zack Sifakis, Angeliki Giannou, Ziqian Lin, Liu Yang, Saurabh Agarwal, Grigorios G Chrysos, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2410.05603
Pdf URL: https://arxiv.org/pdf/2410.05603
Copy Paste: [[2410.05603]] Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition(https://arxiv.org/abs/2410.05603)
Keywords: in-context
Abstract: Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computationally distinct ICL tasks simultaneously, during a single inference call, a capability we term "task superposition". We provide empirical evidence of this phenomenon across various LLM families and scales and show that this phenomenon emerges even if we train the model to in-context learn one task at a time. We offer theoretical explanations that this capability is well within the expressive power of transformers. We also explore how LLMs internally compose task vectors during superposition. Furthermore, we show that larger models can solve more ICL tasks in parallel, and better calibrate their output distribution. Our findings offer insights into the latent capabilities of LLMs, further substantiate the perspective of "LLMs as superposition of simulators", and raise questions about the mechanisms enabling simultaneous task execution.

Title: Leveraging free energy in pretraining model selection for improved fine-tuning

Authors: Michael Munn, Susan Wei
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05612
Pdf URL: https://arxiv.org/pdf/2410.05612
Copy Paste: [[2410.05612]] Leveraging free energy in pretraining model selection for improved fine-tuning(https://arxiv.org/abs/2410.05612)
Keywords: foundation model
Abstract: Recent advances in artificial intelligence have been fueled by the development of foundation models such as BERT, GPT, T5, and Vision Transformers. These models are first pretrained on vast and diverse datasets and then adapted to specific downstream tasks, often with significantly less data. However, the mechanisms behind the success of this ubiquitous pretrain-then-adapt paradigm remain underexplored, particularly the characteristics of pretraining checkpoints that lend themselves to good downstream adaptation. We introduce a Bayesian model selection criterion, called the downstream free energy, which quantifies a checkpoint's adaptability by measuring the concentration of nearby favorable parameters for the downstream task. We demonstrate that this free energy criterion can be effectively implemented without access to the downstream data or prior knowledge of the downstream task. Furthermore, we provide empirical evidence that the free energy criterion reliably correlates with improved fine-tuning performance, offering a principled approach to predicting model adaptability.

Title: Vector-ICL: In-context Learning with Continuous Vector Representations

Authors: Yufan Zhuang, Chandan Singh, Liyuan Liu, Jingbo Shang, Jianfeng Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05629
Pdf URL: https://arxiv.org/pdf/2410.05629
Copy Paste: [[2410.05629]] Vector-ICL: In-context Learning with Continuous Vector Representations(https://arxiv.org/abs/2410.05629)
Keywords: in-context
Abstract: Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities on textual data. We explore whether these capabilities can be extended to continuous vectors from diverse domains, obtained from black-box pretrained encoders. By aligning input data with an LLM's embedding space through lightweight projectors, we observe that LLMs can effectively process and learn from these projected vectors, which we term Vector-ICL. In particular, we find that pretraining projectors with general language modeling objectives enables Vector-ICL, while task-specific finetuning further enhances performance. In our experiments across various tasks and modalities, including text reconstruction, numerical function regression, text classification, summarization, molecule captioning, time-series classification, graph classification, and fMRI decoding, Vector-ICL often surpasses both few-shot ICL and domain-specific model or tuning. We further conduct analyses and case studies, indicating the potential of LLMs to process vector representations beyond traditional token-based paradigms.

Title: Score-Based Variational Inference for Inverse Problems

Authors: Zhipeng Xue, Penghao Cai, Xiaojun Yuan, Xiqi Gao
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2410.05646
Pdf URL: https://arxiv.org/pdf/2410.05646
Copy Paste: [[2410.05646]] Score-Based Variational Inference for Inverse Problems(https://arxiv.org/abs/2410.05646)
Keywords: diffusion
Abstract: Existing diffusion-based methods for inverse problems sample from the posterior using score functions and accept the generated random samples as solutions. In applications that posterior mean is preferred, we have to generate multiple samples from the posterior which is time-consuming. In this work, by analyzing the probability density evolution of the conditional reverse diffusion process, we prove that the posterior mean can be achieved by tracking the mean of each reverse diffusion step. Based on that, we establish a framework termed reverse mean propagation (RMP) that targets the posterior mean directly. We show that RMP can be implemented by solving a variational inference problem, which can be further decomposed as minimizing a reverse KL divergence at each reverse step. We further develop an algorithm that optimizes the reverse KL divergence with natural gradient descent using score functions and propagates the mean at each reverse step. Experiments demonstrate the validity of the theory of our framework and show that our algorithm outperforms state-of-the-art algorithms on reconstruction performance with lower computational complexity in various inverse problems.

Title: ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler

Authors: Serin Yang, Taesung Kwon, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05651
Pdf URL: https://arxiv.org/pdf/2410.05651
Copy Paste: [[2410.05651]] ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler(https://arxiv.org/abs/2410.05651)
Keywords: diffusion
Abstract: Recent progress in large-scale text-to-video (T2V) and image-to-video (I2V) diffusion models has greatly enhanced video generation, especially in terms of keyframe interpolation. However, current image-to-video diffusion models, while powerful in generating videos from a single conditioning frame, need adaptation for two-frame (start & end) conditioned generation, which is essential for effective bounded interpolation. Unfortunately, existing approaches that fuse temporally forward and backward paths in parallel often suffer from off-manifold issues, leading to artifacts or requiring multiple iterative re-noising steps. In this work, we introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning. Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames. Additionally, we incorporate advanced guidance techniques, CFG++ and DDS, to further enhance the interpolation process. By integrating these, our method achieves state-of-the-art performance, efficiently generating high-quality, smooth videos between keyframes. On a single 3090 GPU, our method can interpolate 25 frames at 1024 x 576 resolution in just 195 seconds, establishing it as a leading solution for keyframe interpolation.

Title: Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning

Authors: Saemi Moon, Minjong Lee, Sangdon Park, Dongwoo Kim
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05664
Pdf URL: https://arxiv.org/pdf/2410.05664
Copy Paste: [[2410.05664]] Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning(https://arxiv.org/abs/2410.05664)
Keywords: diffusion
Abstract: As text-to-image diffusion models become advanced enough for commercial applications, there is also increasing concern about their potential for malicious and harmful use. Model unlearning has been proposed to mitigate the concerns by removing undesired and potentially harmful information from the pre-trained model. So far, the success of unlearning is mainly measured by whether the unlearned model can generate a target concept while maintaining image quality. However, unlearning is typically tested under limited scenarios, and the side effects of unlearning have barely been studied in the current literature. In this work, we thoroughly analyze unlearning under various scenarios with five key aspects. Our investigation reveals that every method has side effects or limitations, especially in more complex and realistic situations. By releasing our comprehensive evaluation framework with the source codes and artifacts, we hope to inspire further research in this area, leading to more reliable and effective unlearning methods.

Title: T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

Authors: Jiachen Li, Qian Long, Jian Zheng, Xiaofeng Gao, Robinson Piramuthu, Wenhu Chen, William Yang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05677
Pdf URL: https://arxiv.org/pdf/2410.05677
Copy Paste: [[2410.05677]] T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design(https://arxiv.org/abs/2410.05677)
Keywords: diffusion
Abstract: In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the post-training phase by distilling a highly capable consistency model from a pretrained T2V model. Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals, including high-quality training data, reward model feedback, and conditional guidance, into the consistency distillation process. Through comprehensive ablation studies, we highlight the crucial importance of tailoring datasets to specific learning objectives and the effectiveness of learning from diverse reward models for enhancing both the visual quality and text-video alignment. Additionally, we highlight the vast design space of conditional guidance strategies, which centers on designing an effective energy function to augment the teacher ODE solver. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver, showcasing its effectiveness in improving the motion quality of the generated videos with the improved motion-related metrics from VBench and T2V-CompBench. Empirically, our T2V-Turbo-v2 establishes a new state-of-the-art result on VBench, with a Total score of 85.13, surpassing proprietary systems such as Gen-3 and Kling.

Title: Extreme Value Modelling of Feature Residuals for Anomaly Detection in Dynamic Graphs

Authors: Sevvandi Kandanaarachchi, Conrad Sanderson, Rob J. Hyndman
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05687
Pdf URL: https://arxiv.org/pdf/2410.05687
Copy Paste: [[2410.05687]] Extreme Value Modelling of Feature Residuals for Anomaly Detection in Dynamic Graphs(https://arxiv.org/abs/2410.05687)
Keywords: anomaly
Abstract: Detecting anomalies in a temporal sequence of graphs can be applied is areas such as the detection of accidents in transport networks and cyber attacks in computer networks. Existing methods for detecting abnormal graphs can suffer from multiple limitations, such as high false positive rates as well as difficulties with handling variable-sized graphs and non-trivial temporal dynamics. To address this, we propose a technique where temporal dependencies are explicitly modelled via time series analysis of a large set of pertinent graph features, followed by using residuals to remove the dependencies. Extreme Value Theory is then used to robustly model and classify any remaining extremes, aiming to produce low false positives rates. Comparative evaluations on a multitude of graph instances show that the proposed approach obtains considerably better accuracy than TensorSplat and Laplacian Anomaly Detection.

Title: DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing

Authors: June Suk Choi, Kyungmin Lee, Jongheon Jeong, Saining Xie, Jinwoo Shin, Kimin Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05694
Pdf URL: https://arxiv.org/pdf/2410.05694
Copy Paste: [[2410.05694]] DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing(https://arxiv.org/abs/2410.05694)
Keywords: diffusion
Abstract: Recent advances in diffusion models have introduced a new era of text-guided image manipulation, enabling users to create realistic edited images with simple textual prompts. However, there is significant concern about the potential misuse of these methods, especially in creating misleading or harmful content. Although recent defense strategies, which introduce imperceptible adversarial noise to induce model failure, have shown promise, they remain ineffective against more sophisticated manipulations, such as editing with a mask. In this work, we propose DiffusionGuard, a robust and effective defense method against unauthorized edits by diffusion-based image editing models, even in challenging setups. Through a detailed analysis of these models, we introduce a novel objective that generates adversarial noise targeting the early stage of the diffusion process. This approach significantly improves the efficiency and effectiveness of adversarial noises. We also introduce a mask-augmentation technique to enhance robustness against various masks during test time. Finally, we introduce a comprehensive benchmark designed to evaluate the effectiveness and robustness of methods in protecting against privacy threats in realistic scenarios. Through extensive experiments, we show that our method achieves stronger protection and improved mask robustness with lower computational costs compared to the strongest baseline. Additionally, our method exhibits superior transferability and better resilience to noise removal techniques compared to all baseline methods. Our source code is publicly available at this https URL.

Title: Diffusing to the Top: Boost Graph Neural Networks with Minimal Hyperparameter Tuning

Authors: Lequan Lin, Dai Shi, Andi Han, Zhiyong Wang, Junbin Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05697
Pdf URL: https://arxiv.org/pdf/2410.05697
Copy Paste: [[2410.05697]] Diffusing to the Top: Boost Graph Neural Networks with Minimal Hyperparameter Tuning(https://arxiv.org/abs/2410.05697)
Keywords: diffusion
Abstract: Graph Neural Networks (GNNs) are proficient in graph representation learning and achieve promising performance on versatile tasks such as node classification and link prediction. Usually, a comprehensive hyperparameter tuning is essential for fully unlocking GNN's top performance, especially for complicated tasks such as node classification on large graphs and long-range graphs. This is usually associated with high computational and time costs and careful design of appropriate search spaces. This work introduces a graph-conditioned latent diffusion framework (GNN-Diff) to generate high-performing GNNs based on the model checkpoints of sub-optimal hyperparameters selected by a light-tuning coarse search. We validate our method through 166 experiments across four graph tasks: node classification on small, large, and long-range graphs, as well as link prediction. Our experiments involve 10 classic and state-of-the-art target models and 20 publicly available datasets. The results consistently demonstrate that GNN-Diff: (1) boosts the performance of GNNs with efficient hyperparameter tuning; and (2) presents high stability and generalizability on unseen data across multiple generation runs. The code is available at this https URL.

Title: PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM

Authors: Stefan Stefanache, Lluís Pastor Pérez, Julen Costa Watanabe, Ernesto Sanchez Tejedor, Thomas Hofmann, Enis Simsar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05710
Pdf URL: https://arxiv.org/pdf/2410.05710
Copy Paste: [[2410.05710]] PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM(https://arxiv.org/abs/2410.05710)
Keywords: diffusion, generative
Abstract: Evaluating diffusion-based image-editing models is a crucial task in the field of Generative AI. Specifically, it is imperative to assess their capacity to execute diverse editing tasks while preserving the image content and realism. While recent developments in generative models have opened up previously unheard-of possibilities for image editing, conducting a thorough evaluation of these models remains a challenging and open task. The absence of a standardized evaluation benchmark, primarily due to the inherent need for a post-edit reference image for evaluation, further complicates this issue. Currently, evaluations often rely on established models such as CLIP or require human intervention for a comprehensive understanding of the performance of these image editing models. Our benchmark, PixLens, provides a comprehensive evaluation of both edit quality and latent representation disentanglement, contributing to the advancement and refinement of existing methodologies in the field.

Title: Diffusion Auto-regressive Transformer for Effective Self-supervised Time Series Forecasting

Authors: Daoyu Wang, Mingyue Cheng, Zhiding Liu, Qi Liu, Enhong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05711
Pdf URL: https://arxiv.org/pdf/2410.05711
Copy Paste: [[2410.05711]] Diffusion Auto-regressive Transformer for Effective Self-supervised Time Series Forecasting(https://arxiv.org/abs/2410.05711)
Keywords: diffusion, self-supervised, generative
Abstract: Self-supervised learning has become a popular and effective approach for enhancing time series forecasting, enabling models to learn universal representations from unlabeled data. However, effectively capturing both the global sequence dependence and local detail features within time series data remains challenging. To address this, we propose a novel generative self-supervised method called TimeDART, denoting Diffusion Auto-regressive Transformer for Time series forecasting. In TimeDART, we treat time series patches as basic modeling units. Specifically, we employ an self-attention based Transformer encoder to model the dependencies of inter-patches. Additionally, we introduce diffusion and denoising mechanisms to capture the detail locality features of intra-patch. Notably, we design a cross-attention-based denoising decoder that allows for adjustable optimization difficulty in the self-supervised task, facilitating more effective self-supervised pre-training. Furthermore, the entire model is optimized in an auto-regressive manner to obtain transferable representations. Extensive experiments demonstrate that TimeDART achieves state-of-the-art fine-tuning performance compared to the most advanced competitive methods in forecasting tasks. Our code is publicly available at this https URL.

Title: CUBE360: Learning Cubic Field Representation for Monocular 360 Depth Estimation for Virtual Reality

Authors: Wenjie Chang, Hao Ai, Tianzhu Zhang, Lin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05735
Pdf URL: https://arxiv.org/pdf/2410.05735
Copy Paste: [[2410.05735]] CUBE360: Learning Cubic Field Representation for Monocular 360 Depth Estimation for Virtual Reality(https://arxiv.org/abs/2410.05735)
Keywords: self-supervised
Abstract: Panoramic images provide comprehensive scene information and are suitable for VR applications. Obtaining corresponding depth maps is essential for achieving immersive and interactive experiences. However, panoramic depth estimation presents significant challenges due to the severe distortion caused by equirectangular projection (ERP) and the limited availability of panoramic RGB-D datasets. Inspired by the recent success of neural rendering, we propose a novel method, named $\mathbf{CUBE360}$, that learns a cubic field composed of multiple MPIs from a single panoramic image for $\mathbf{continuous}$ depth estimation at any view direction. Our CUBE360 employs cubemap projection to transform an ERP image into six faces and extract the MPIs for each, thereby reducing the memory consumption required for MPI processing of high-resolution data. Additionally, this approach avoids the computational complexity of handling the uneven pixel distribution inherent to equirectangular projectio. An attention-based blending module is then employed to learn correlations among the MPIs of cubic faces, constructing a cubic field representation with color and density information at various depth levels. Furthermore, a novel sampling strategy is introduced for rendering novel views from the cubic field at both cubic and planar scales. The entire pipeline is trained using photometric loss calculated from rendered views within a self-supervised learning approach, enabling training on 360 videos without depth annotations. Experiments on both synthetic and real-world datasets demonstrate the superior performance of CUBE360 compared to prior SSL methods. We also highlight its effectiveness in downstream applications, such as VR roaming and visual effects, underscoring CUBE360's potential to enhance immersive experiences.

Title: Training-free Diffusion Model Alignment with Sampling Demons

Authors: Po-Hung Yeh, Kuang-Huei Lee, Jun-Cheng Chen
Subjects: cs.CV, cs.AI, cs.LG, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2410.05760
Pdf URL: https://arxiv.org/pdf/2410.05760
Copy Paste: [[2410.05760]] Training-free Diffusion Model Alignment with Sampling Demons(https://arxiv.org/abs/2410.05760)
Keywords: diffusion
Abstract: Aligning diffusion models with user preferences has been a key challenge. Existing methods for aligning diffusion models either require retraining or are limited to differentiable reward functions. To address these limitations, we propose a stochastic optimization approach, dubbed Demon, to guide the denoising process at inference time without backpropagation through reward functions or model retraining. Our approach works by controlling noise distribution in denoising steps to concentrate density on regions corresponding to high rewards through stochastic optimization. We provide comprehensive theoretical and empirical evidence to support and validate our approach, including experiments that use non-differentiable sources of rewards such as Visual-Language Model (VLM) APIs and human judgements. To the best of our knowledge, the proposed approach is the first inference-time, backpropagation-free preference alignment method for diffusion models. Our method can be easily integrated with existing diffusion models without further training. Our experiments show that the proposed approach significantly improves the average aesthetics scores for text-to-image generation.

Title: ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition

Authors: Mohammadreza Salehi, Jae Sung Park, Tanush Yadav, Aditya Kusupati, Ranjay Krishna, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05774
Pdf URL: https://arxiv.org/pdf/2410.05774
Copy Paste: [[2410.05774]] ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition(https://arxiv.org/abs/2410.05774)
Keywords: foundation model
Abstract: Our world is full of varied actions and moves across specialized domains that we, as humans, strive to identify and understand. Within any single domain, actions can often appear quite similar, making it challenging for deep models to distinguish them accurately. To evaluate the effectiveness of multimodal foundation models in helping us recognize such actions, we present ActionAtlas v1.0, a multiple-choice video question answering benchmark featuring short videos across various sports. Each video in the dataset is paired with a question and four or five choices. The question pinpoints specific individuals, asking which choice "best" describes their action within a certain temporal context. Overall, the dataset includes 934 videos showcasing 580 unique actions across 56 sports, with a total of 1896 actions within choices. Unlike most existing video question answering benchmarks that only cover simplistic actions, often identifiable from a single frame, ActionAtlas focuses on intricate movements and rigorously tests the model's capability to discern subtle differences between moves that look similar within each domain. We evaluate open and proprietary foundation models on this benchmark, finding that the best model, GPT-4o, achieves a maximum accuracy of 45.52%. Meanwhile, Non-expert crowd workers, provided with action description for each choice, achieve 61.64% accuracy, where random chance is approximately 21%. Our findings with state-of-the-art models indicate that having a high frame sampling rate is important for accurately recognizing actions in ActionAtlas, a feature that some leading proprietary video models, such as Gemini, do not include in their default configuration.

Title: SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution

Authors: Qi Tang, Yao Zhao, Meiqin Liu, Chao Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05799
Pdf URL: https://arxiv.org/pdf/2410.05799
Copy Paste: [[2410.05799]] SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution(https://arxiv.org/abs/2410.05799)
Keywords: diffusion
Abstract: Diffusion-based Video Super-Resolution (VSR) is renowned for generating perceptually realistic videos, yet it grapples with maintaining detail consistency across frames due to stochastic fluctuations. The traditional approach of pixel-level alignment is ineffective for diffusion-processed frames because of iterative disruptions. To overcome this, we introduce SeeClear--a novel VSR framework leveraging conditional video generation, orchestrated by instance-centric and channel-wise semantic controls. This framework integrates a Semantic Distiller and a Pixel Condenser, which synergize to extract and upscale semantic details from low-resolution frames. The Instance-Centric Alignment Module (InCAM) utilizes video-clip-wise tokens to dynamically relate pixels within and across frames, enhancing coherency. Additionally, the Channel-wise Texture Aggregation Memory (CaTeGory) infuses extrinsic knowledge, capitalizing on long-standing semantic textures. Our method also innovates the blurring diffusion process with the ResShift mechanism, finely balancing between sharpness and diffusion effects. Comprehensive experiments confirm our framework's advantage over state-of-the-art diffusion-based VSR techniques. The code is available: this https URL.

Title: CASA: Class-Agnostic Shared Attributes in Vision-Language Models for Efficient Incremental Object Detection

Authors: Mingyi Guo, Yuyang Liu, Zongying Lin, Peixi Peng, Yonghong Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05804
Pdf URL: https://arxiv.org/pdf/2410.05804
Copy Paste: [[2410.05804]] CASA: Class-Agnostic Shared Attributes in Vision-Language Models for Efficient Incremental Object Detection(https://arxiv.org/abs/2410.05804)
Keywords: foundation model
Abstract: Incremental object detection (IOD) is challenged by background shift, where background categories in sequential data may include previously learned or future classes. Inspired by the vision-language foundation models such as CLIP, these models capture shared attributes from extensive image-text paired data during pre-training. We propose a novel method utilizing attributes in vision-language foundation models for incremental object detection. Our method constructs a Class-Agnostic Shared Attribute base (CASA) to capture common semantic information among incremental classes. Specifically, we utilize large language models to generate candidate textual attributes and select the most relevant ones based on current training data, recording their significance in an attribute assignment matrix. For subsequent tasks, we freeze the retained attributes and continue selecting from the remaining candidates while updating the attribute assignment matrix accordingly. Furthermore, we employ OWL-ViT as our baseline, preserving the original parameters of the pre-trained foundation model. Our method adds only 0.7% to parameter storage through parameter-efficient fine-tuning to significantly enhance the scalability and adaptability of IOD. Extensive two-phase and multi-phase experiments on the COCO dataset demonstrate the state-of-the-art performance of our proposed method.

Title: PostCast: Generalizable Postprocessing for Precipitation Nowcasting via Unsupervised Blurriness Modeling

Authors: Junchao Gong, Siwei Tu, Weidong Yang, Ben Fei, Kun Chen, Wenlong Zhang, Xiaokang Yang, Wanli Ouyang, Lei Bai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05805
Pdf URL: https://arxiv.org/pdf/2410.05805
Copy Paste: [[2410.05805]] PostCast: Generalizable Postprocessing for Precipitation Nowcasting via Unsupervised Blurriness Modeling(https://arxiv.org/abs/2410.05805)
Keywords: diffusion, generative
Abstract: Precipitation nowcasting plays a pivotal role in socioeconomic sectors, especially in severe convective weather warnings. Although notable progress has been achieved by approaches mining the spatiotemporal correlations with deep learning, these methods still suffer severe blurriness as the lead time increases, which hampers accurate predictions for extreme precipitation. To alleviate blurriness, researchers explore generative methods conditioned on blurry predictions. However, the pairs of blurry predictions and corresponding ground truth need to be generated in advance, making the training pipeline cumbersome and limiting the generality of generative models within blur modes that appear in training data. By rethinking the blurriness in precipitation nowcasting as a blur kernel acting on predictions, we propose an unsupervised postprocessing method to eliminate the blurriness without the requirement of training with the pairs of blurry predictions and corresponding ground truth. Specifically, we utilize blurry predictions to guide the generation process of a pre-trained unconditional denoising diffusion probabilistic model (DDPM) to obtain high-fidelity predictions with eliminated blurriness. A zero-shot blur kernel estimation mechanism and an auto-scale denoise guidance strategy are introduced to adapt the unconditional DDPM to any blurriness modes varying from datasets and lead times in precipitation nowcasting. Extensive experiments are conducted on 7 precipitation radar datasets, demonstrating the generality and superiority of our method.

Title: CAP: Detecting Unauthorized Data Usage in Generative Models via Prompt Generation

Authors: Daniela Gallo, Angelica Liguori, Ettore Ritacco, Luca Caviglione, Fabrizio Durante, Giuseppe Manco
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05819
Pdf URL: https://arxiv.org/pdf/2410.05819
Copy Paste: [[2410.05819]] CAP: Detecting Unauthorized Data Usage in Generative Models via Prompt Generation(https://arxiv.org/abs/2410.05819)
Keywords: generative
Abstract: To achieve accurate and unbiased predictions, Machine Learning (ML) models rely on large, heterogeneous, and high-quality datasets. However, this could raise ethical and legal concerns regarding copyright and authorization aspects, especially when information is gathered from the Internet. With the rise of generative models, being able to track data has become of particular importance, especially since they may (un)intentionally replicate copyrighted contents. Therefore, this work proposes Copyright Audit via Prompts generation (CAP), a framework for automatically testing whether an ML model has been trained with unauthorized data. Specifically, we devise an approach to generate suitable keys inducing the model to reveal copyrighted contents. To prove its effectiveness, we conducted an extensive evaluation campaign on measurements collected in four IoT scenarios. The obtained results showcase the effectiveness of CAP, when used against both realistic and synthetic datasets.

Title: A noise-corrected Langevin algorithm and sampling by half-denoising

Authors: Aapo Hyvärinen
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2410.05837
Pdf URL: https://arxiv.org/pdf/2410.05837
Copy Paste: [[2410.05837]] A noise-corrected Langevin algorithm and sampling by half-denoising(https://arxiv.org/abs/2410.05837)
Keywords: diffusion
Abstract: The Langevin algorithm is a classic method for sampling from a given pdf in a real space. In its basic version, it only requires knowledge of the gradient of the log-density, also called the score function. However, in deep learning, it is often easier to learn the so-called "noisy score function", i.e. the gradient of the log-density of noisy data, more precisely when Gaussian noise is added to the data. Such an estimate is biased and complicates the use of the Langevin method. Here, we propose a noise-corrected version of the Langevin algorithm, where the bias due to noisy data is removed, at least regarding first-order terms. Unlike diffusion models, our algorithm needs to know the noisy score function for one single noise level only. We further propose a simple special case which has an interesting intuitive interpretation of iteratively adding noise the data and then attempting to remove half of that noise.

Title: Unobserved Object Detection using Generative Models

Authors: Subhransu S. Bhattacharjee, Dylan Campbell, Rahul Shome
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2410.05869
Pdf URL: https://arxiv.org/pdf/2410.05869
Copy Paste: [[2410.05869]] Unobserved Object Detection using Generative Models(https://arxiv.org/abs/2410.05869)
Keywords: diffusion, generative
Abstract: Can we detect an object that is not visible in an image? This study introduces the novel task of 2D and 3D unobserved object detection for predicting the location of objects that are occluded or lie outside the image frame. We adapt several state-of-the-art pre-trained generative models to solve this task, including 2D and 3D diffusion models and vision--language models, and show that they can be used to infer the presence of objects that are not directly observed. To benchmark this task, we propose a suite of metrics that captures different aspects of performance. Our empirical evaluations on indoor scenes from the RealEstate10k dataset with COCO object categories demonstrate results that motivate the use of generative models for the unobserved object detection task. The current work presents a promising step towards compelling applications like visual search and probabilistic planning that can leverage object detection beyond what can be directly observed.

Title: MTFL: Multi-Timescale Feature Learning for Weakly-Supervised Anomaly Detection in Surveillance Videos

Authors: Yiling Zhang, Erkut Akdag, Egor Bondarev, Peter H. N. De With
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05900
Pdf URL: https://arxiv.org/pdf/2410.05900
Copy Paste: [[2410.05900]] MTFL: Multi-Timescale Feature Learning for Weakly-Supervised Anomaly Detection in Surveillance Videos(https://arxiv.org/abs/2410.05900)
Keywords: anomaly
Abstract: Detection of anomaly events is relevant for public safety and requires a combination of fine-grained motion information and contextual events at variable time-scales. To this end, we propose a Multi-Timescale Feature Learning (MTFL) method to enhance the representation of anomaly features. Short, medium, and long temporal tubelets are employed to extract spatio-temporal video features using a Video Swin Transformer. Experimental results demonstrate that MTFL outperforms state-of-the-art methods on the UCF-Crime dataset, achieving an anomaly detection performance 89.78% AUC. Moreover, it performs complementary to SotA with 95.32% AUC on the ShanghaiTech and 84.57% AP on the XD-Violence dataset. Furthermore, we generate an extended dataset of the UCF-Crime for development and evaluation on a wider range of anomalies, namely Video Anomaly Detection Dataset (VADD), involving 2,591 videos in 18 classes with extensive coverage of realistic anomalies.

Title: MedUniSeg: 2D and 3D Medical Image Segmentation via a Prompt-driven Universal Model

Authors: Yiwen Ye, Ziyang Chen, Jianpeng Zhang, Yutong Xie, Yong Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.05905
Pdf URL: https://arxiv.org/pdf/2410.05905
Copy Paste: [[2410.05905]] MedUniSeg: 2D and 3D Medical Image Segmentation via a Prompt-driven Universal Model(https://arxiv.org/abs/2410.05905)
Keywords: self-supervised
Abstract: Universal segmentation models offer significant potential in addressing a wide range of tasks by effectively leveraging discrete annotations. As the scope of tasks and modalities expands, it becomes increasingly important to generate and strategically position task- and modal-specific priors within the universal model. However, existing universal models often overlook the correlations between different priors, and the optimal placement and frequency of these priors remain underexplored. In this paper, we introduce MedUniSeg, a prompt-driven universal segmentation model designed for 2D and 3D multi-task segmentation across diverse modalities and domains. MedUniSeg employs multiple modal-specific prompts alongside a universal task prompt to accurately characterize the modalities and tasks. To generate the related priors, we propose the modal map (MMap) and the fusion and selection (FUSE) modules, which transform modal and task prompts into corresponding priors. These modal and task priors are systematically introduced at the start and end of the encoding process. We evaluate MedUniSeg on a comprehensive multi-modal upstream dataset consisting of 17 sub-datasets. The results demonstrate that MedUniSeg achieves superior multi-task segmentation performance, attaining a 1.2% improvement in the mean Dice score across the 17 upstream tasks compared to nnUNet baselines, while using less than 1/10 of the parameters. For tasks that underperform during the initial multi-task joint training, we freeze MedUniSeg and introduce new modules to re-learn these tasks. This approach yields an enhanced version, MedUniSeg*, which consistently outperforms MedUniSeg across all tasks. Moreover, MedUniSeg surpasses advanced self-supervised and supervised pre-trained models on six downstream tasks, establishing itself as a high-quality, highly generalizable pre-trained segmentation model.

Title: TIMBA: Time series Imputation with Bi-directional Mamba Blocks and Diffusion models

Authors: Javier Solís-García, Belén Vega-Márquez, Juan A. Nepomuceno, Isabel A. Nepomuceno-Chamorro
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05916
Pdf URL: https://arxiv.org/pdf/2410.05916
Copy Paste: [[2410.05916]] TIMBA: Time series Imputation with Bi-directional Mamba Blocks and Diffusion models(https://arxiv.org/abs/2410.05916)
Keywords: diffusion
Abstract: The problem of imputing multivariate time series spans a wide range of fields, from clinical healthcare to multi-sensor systems. Initially, Recurrent Neural Networks (RNNs) were employed for this task; however, their error accumulation issues led to the adoption of Transformers, leveraging attention mechanisms to mitigate these problems. Concurrently, the promising results of diffusion models in capturing original distributions have positioned them at the forefront of current research, often in conjunction with Transformers. In this paper, we propose replacing time-oriented Transformers with State-Space Models (SSM), which are better suited for temporal data modeling. Specifically, we utilize the latest SSM variant, S6, which incorporates attention-like mechanisms. By embedding S6 within Mamba blocks, we develop a model that integrates SSM, Graph Neural Networks, and node-oriented Transformers to achieve enhanced spatiotemporal representations. Implementing these architectural modifications, previously unexplored in this field, we present Time series Imputation with Bi-directional mamba blocks and diffusion models (TIMBA). TIMBA achieves superior performance in almost all benchmark scenarios and performs comparably in others across a diverse range of missing value situations and three real-world datasets. We also evaluate how the performance of our model varies with different amounts of missing values and analyse its performance on downstream tasks. In addition, we provide the original code to replicate the results.

Title: Fortify Your Foundations: Practical Privacy and Security for Foundation Model Deployments In The Cloud

Authors: Marcin Chrapek, Anjo Vahldiek-Oberwagner, Marcin Spoczynski, Scott Constable, Mona Vij, Torsten Hoefler
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05930
Pdf URL: https://arxiv.org/pdf/2410.05930
Copy Paste: [[2410.05930]] Fortify Your Foundations: Practical Privacy and Security for Foundation Model Deployments In The Cloud(https://arxiv.org/abs/2410.05930)
Keywords: foundation model
Abstract: Foundation Models (FMs) display exceptional performance in tasks such as natural language processing and are being applied across a growing range of disciplines. Although typically trained on large public datasets, FMs are often fine-tuned or integrated into Retrieval-Augmented Generation (RAG) systems, which rely on private data. This access, along with their size and costly training, heightens the risk of intellectual property theft. Moreover, multimodal FMs may expose sensitive information. In this work, we examine the FM threat model and discuss the practicality and comprehensiveness of various approaches for securing against them, such as ML-based methods and trusted execution environments (TEEs). We demonstrate that TEEs offer an effective balance between strong security properties, usability, and performance. Specifically, we present a solution achieving less than 10\% overhead versus bare metal for the full Llama2 7B and 13B inference pipelines running inside \intel\ SGX and \intel\ TDX. We also share our configuration files and insights from our implementation. To our knowledge, our work is the first to show the practicality of TEEs for securing FMs.

Title: Pyramidal Flow Matching for Efficient Video Generative Modeling

Authors: Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05954
Pdf URL: https://arxiv.org/pdf/2410.05954
Copy Paste: [[2410.05954]] Pyramidal Flow Matching for Efficient Video Generative Modeling(https://arxiv.org/abs/2410.05954)
Keywords: diffusion, generative
Abstract: Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models will be open-sourced at this https URL.

Title: FLOPS: Forward Learning with OPtimal Sampling

Authors: Tao Ren, Zishi Zhang, Jinyang Jiang, Guanghao Li, Zeliang Zhang, Mingqian Feng, Yijie Peng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05966
Pdf URL: https://arxiv.org/pdf/2410.05966
Copy Paste: [[2410.05966]] FLOPS: Forward Learning with OPtimal Sampling(https://arxiv.org/abs/2410.05966)
Keywords: foundation model
Abstract: Given the limitations of backpropagation, perturbation-based gradient computation methods have recently gained focus for learning with only forward passes, also referred to as queries. Conventional forward learning consumes enormous queries on each data point for accurate gradient estimation through Monte Carlo sampling, which hinders the scalability of those algorithms. However, not all data points deserve equal queries for gradient estimation. In this paper, we study the problem of improving the forward learning efficiency from a novel perspective: how to reduce the gradient estimation variance with minimum cost? For this, we propose to allocate the optimal number of queries over each data in one batch during training to achieve a good balance between estimation accuracy and computational efficiency. Specifically, with a simplified proxy objective and a reparameterization technique, we derive a novel plug-and-play query allocator with minimal parameters. Theoretical results are carried out to verify its optimality. We conduct extensive experiments for fine-tuning Vision Transformers on various datasets and further deploy the allocator to two black-box applications: prompt tuning and multimodal alignment for foundation models. All findings demonstrate that our proposed allocator significantly enhances the scalability of forward-learning algorithms, paving the way for real-world applications.

Title: ConML: A Universal Meta-Learning Framework with Task-Level Contrastive Learning

Authors: Shiguang Wu, Yaqing Wang, Yatao Bian, Quanming Yao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05975
Pdf URL: https://arxiv.org/pdf/2410.05975
Copy Paste: [[2410.05975]] ConML: A Universal Meta-Learning Framework with Task-Level Contrastive Learning(https://arxiv.org/abs/2410.05975)
Keywords: in-context
Abstract: Meta-learning enables learning systems to adapt quickly to new tasks, similar to humans. To emulate this human-like rapid learning and enhance alignment and discrimination abilities, we propose ConML, a universal meta-learning framework that can be applied to various meta-learning algorithms without relying on specific model architectures nor target models. The core of ConML is task-level contrastive learning, which extends contrastive learning from the representation space in unsupervised learning to the model space in meta-learning. By leveraging task identity as an additional supervision signal during meta-training, we contrast the outputs of the meta-learner in the model space, minimizing inner-task distance (between models trained on different subsets of the same task) and maximizing inter-task distance (between models from different tasks). We demonstrate that ConML integrates seamlessly with optimization-based, metric-based, and amortization-based meta-learning algorithms, as well as in-context learning, resulting in performance improvements across diverse few-shot learning tasks.

Title: Generalizing to any diverse distribution: uniformity, gentle finetuning and rebalancing

Authors: Andreas Loukas, Karolis Martinkus, Ed Wagstaff, Kyunghyun Cho
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.05980
Pdf URL: https://arxiv.org/pdf/2410.05980
Copy Paste: [[2410.05980]] Generalizing to any diverse distribution: uniformity, gentle finetuning and rebalancing(https://arxiv.org/abs/2410.05980)
Keywords: foundation model
Abstract: As training datasets grow larger, we aspire to develop models that generalize well to any diverse test distribution, even if the latter deviates significantly from the training data. Various approaches like domain adaptation, domain generalization, and robust optimization attempt to address the out-of-distribution challenge by posing assumptions about the relation between training and test distribution. Differently, we adopt a more conservative perspective by accounting for the worst-case error across all sufficiently diverse test distributions within a known domain. Our first finding is that training on a uniform distribution over this domain is optimal. We also interrogate practical remedies when uniform samples are unavailable by considering methods for mitigating non-uniformity through finetuning and rebalancing. Our theory provides a mathematical grounding for previous observations on the role of entropy and rebalancing for o.o.d. generalization and foundation model training. We also provide new empirical evidence across tasks involving o.o.d. shifts which illustrate the broad applicability of our perspective.

Title: Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

Authors: Cabrel Teguemne Fokam, Khaleelulla Khan Nazeer, Lukas König, David Kappel, Anand Subramoney
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2410.05985
Pdf URL: https://arxiv.org/pdf/2410.05985
Copy Paste: [[2410.05985]] Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates(https://arxiv.org/abs/2410.05985)
Keywords: diffusion
Abstract: The increasing size of deep learning models has created the need for more efficient alternatives to the standard error backpropagation algorithm, that make better use of asynchronous, parallel and distributed computing. One major shortcoming of backpropagation is the interlocking between the forward phase of the algorithm, which computes a global loss, and the backward phase where the loss is backpropagated through all layers to compute the gradients, which are used to update the network parameters. To address this problem, we propose a method that parallelises SGD updates across the layers of a model by asynchronously updating them from multiple threads. Furthermore, since we observe that the forward pass is often much faster than the backward pass, we use separate threads for the forward and backward pass calculations, which allows us to use a higher ratio of forward to backward threads than the usual 1:1 ratio, reducing the overall staleness of the parameters. Thus, our approach performs asynchronous stochastic gradient descent using separate threads for the loss (forward) and gradient (backward) computations and performs layer-wise partial updates to parameters in a distributed way. We show that this approach yields close to state-of-the-art results while running up to 2.97x faster than Hogwild! scaled on multiple devices (Locally-Partitioned-Asynchronous-Parallel SGD). We theoretically prove the convergence of the algorithm using a novel theoretical framework based on stochastic differential equations and the drift diffusion process, by modeling the asynchronous parameter updates as a stochastic process.

Title: Vector Grimoire: Codebook-based Shape Generation under Raster Image Supervision

Authors: Moritz Feuerpfeil, Marco Cipriano, Gerard de Melo
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2410.05991
Pdf URL: https://arxiv.org/pdf/2410.05991
Copy Paste: [[2410.05991]] Vector Grimoire: Codebook-based Shape Generation under Raster Image Supervision(https://arxiv.org/abs/2410.05991)
Keywords: generative
Abstract: Scalable Vector Graphics (SVG) is a popular format on the web and in the design industry. However, despite the great strides made in generative modeling, SVG has remained underexplored due to the discrete and complex nature of such data. We introduce GRIMOIRE, a text-guided SVG generative model that is comprised of two modules: A Visual Shape Quantizer (VSQ) learns to map raster images onto a discrete codebook by reconstructing them as vector shapes, and an Auto-Regressive Transformer (ART) models the joint probability distribution over shape tokens, positions and textual descriptions, allowing us to generate vector graphics from natural language. Unlike existing models that require direct supervision from SVG data, GRIMOIRE learns shape image patches using only raster image supervision which opens up vector generative modeling to significantly more data. We demonstrate the effectiveness of our method by fitting GRIMOIRE for closed filled shapes on the MNIST and for outline strokes on icon and font data, surpassing previous image-supervised methods in generative quality and vector-supervised approach in flexibility.

Title: Sparse Repellency for Shielded Generation in Text-to-image Diffusion Models

Authors: Michael Kirchhof, James Thornton, Pierre Ablin, Louis Béthune, Eugene Ndiaye, Marco Cuturi
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2410.06025
Pdf URL: https://arxiv.org/pdf/2410.06025
Copy Paste: [[2410.06025]] Sparse Repellency for Shielded Generation in Text-to-image Diffusion Models(https://arxiv.org/abs/2410.06025)
Keywords: diffusion
Abstract: The increased adoption of diffusion models in text-to-image generation has triggered concerns on their reliability. Such models are now closely scrutinized under the lens of various metrics, notably calibration, fairness, or compute efficiency. We focus in this work on two issues that arise when deploying these models: a lack of diversity when prompting images, and a tendency to recreate images from the training set. To solve both problems, we propose a method that coaxes the sampled trajectories of pretrained diffusion models to land on images that fall outside of a reference set. We achieve this by adding repellency terms to the diffusion SDE throughout the generation trajectory, which are triggered whenever the path is expected to land too closely to an image in the shielded reference set. Our method is sparse in the sense that these repellency terms are zero and inactive most of the time, and even more so towards the end of the generation trajectory. Our method, named SPELL for sparse repellency, can be used either with a static reference set that contains protected images, or dynamically, by updating the set at each timestep with the expected images concurrently generated within a batch. We show that adding SPELL to popular diffusion models improves their diversity while impacting their FID only marginally, and performs comparatively better than other recent training-free diversity methods. We also demonstrate how SPELL can ensure a shielded generation away from a very large set of protected images by considering all 1.2M images from ImageNet as the protected set.

Title: Block Induced Signature Generative Adversarial Network (BISGAN): Signature Spoofing Using GANs and Their Evaluation

Authors: Haadia Amjad, Kilian Goeller, Steffen Seitz, Carsten Knoll, Naseer Bajwa, Muhammad Imran Malik, Ronald Tetzlaff
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06041
Pdf URL: https://arxiv.org/pdf/2410.06041
Copy Paste: [[2410.06041]] Block Induced Signature Generative Adversarial Network (BISGAN): Signature Spoofing Using GANs and Their Evaluation(https://arxiv.org/abs/2410.06041)
Keywords: generative
Abstract: Deep learning is actively being used in biometrics to develop efficient identification and verification systems. Handwritten signatures are a common subset of biometric data for authentication purposes. Generative adversarial networks (GANs) learn from original and forged signatures to generate forged signatures. While most GAN techniques create a strong signature verifier, which is the discriminator, there is a need to focus more on the quality of forgeries generated by the generator model. This work focuses on creating a generator that produces forged samples that achieve a benchmark in spoofing signature verification systems. We use CycleGANs infused with Inception model-like blocks with attention heads as the generator and a variation of the SigCNN model as the base Discriminator. We train our model with a new technique that results in 80% to 100% success in signature spoofing. Additionally, we create a custom evaluation technique to act as a goodness measure of the generated forgeries. Our work advocates generator-focused GAN architectures for spoofing data quality that aid in a better understanding of biometric data generation and evaluation.

Title: HyperDet: Generalizable Detection of Synthesized Images by Generating and Merging A Mixture of Hyper LoRAs

Authors: Huangsen Cao, Yongwei Wang, Yinfeng Liu, Sixian Zheng, Kangtao Lv, Zhimeng Zhang, Bo Zhang, Xin Ding, Fei Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06044
Pdf URL: https://arxiv.org/pdf/2410.06044
Copy Paste: [[2410.06044]] HyperDet: Generalizable Detection of Synthesized Images by Generating and Merging A Mixture of Hyper LoRAs(https://arxiv.org/abs/2410.06044)
Keywords: generative
Abstract: The emergence of diverse generative vision models has recently enabled the synthesis of visually realistic images, underscoring the critical need for effectively detecting these generated images from real photos. Despite advances in this field, existing detection approaches often struggle to accurately identify synthesized images generated by different generative models. In this work, we introduce a novel and generalizable detection framework termed HyperDet, which innovatively captures and integrates shared knowledge from a collection of functionally distinct and lightweight expert detectors. HyperDet leverages a large pretrained vision model to extract general detection features while simultaneously capturing and enhancing task-specific features. To achieve this, HyperDet first groups SRM filters into five distinct groups to efficiently capture varying levels of pixel artifacts based on their different functionality and complexity. Then, HyperDet utilizes a hypernetwork to generate LoRA model weights with distinct embedding parameters. Finally, we merge the LoRA networks to form an efficient model ensemble. Also, we propose a novel objective function that balances the pixel and semantic artifacts effectively. Extensive experiments on the UnivFD and Fake2M datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance. Moreover, our work paves a new way to establish generalizable domain-specific fake image detectors based on pretrained large vision models.

Title: AP-LDM: Attentive and Progressive Latent Diffusion Model for Training-Free High-Resolution Image Generation

Authors: Boyuan Cao, Jiaxin Ye, Yujie Wei, Hongming Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06055
Pdf URL: https://arxiv.org/pdf/2410.06055
Copy Paste: [[2410.06055]] AP-LDM: Attentive and Progressive Latent Diffusion Model for Training-Free High-Resolution Image Generation(https://arxiv.org/abs/2410.06055)
Keywords: diffusion
Abstract: Latent diffusion models (LDMs), such as Stable Diffusion, often experience significant structural distortions when directly generating high-resolution (HR) images that exceed their original training resolutions. A straightforward and cost-effective solution is to adapt pre-trained LDMs for HR image generation; however, existing methods often suffer from poor image quality and long inference time. In this paper, we propose an Attentive and Progressive LDM (AP-LDM), a novel, training-free framework aimed at enhancing HR image quality while accelerating the generation process. AP-LDM decomposes the denoising process of LDMs into two stages: (i) attentive training-resolution denoising, and (ii) progressive high-resolution denoising. The first stage generates a latent representation of a higher-quality training-resolution image through the proposed attentive guidance, which utilizes a novel parameter-free self-attention mechanism to enhance the structural consistency. The second stage progressively performs upsampling in pixel space, alleviating the severe artifacts caused by latent space upsampling. Leveraging the effective initialization from the first stage enables denoising at higher resolutions with significantly fewer steps, enhancing overall efficiency. Extensive experimental results demonstrate that AP-LDM significantly outperforms state-of-the-art methods, delivering up to a 5x speedup in HR image generation, thereby highlighting its substantial advantages for real-world applications. Code is available at this https URL.

Title: Diversity-Rewarded CFG Distillation

Authors: Geoffrey Cideron, Andrea Agostinelli, Johan Ferret, Sertan Girgin, Romuald Elie, Olivier Bachem, Sarah Perrin, Alexandre Ramé
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.06084
Pdf URL: https://arxiv.org/pdf/2410.06084
Copy Paste: [[2410.06084]] Diversity-Rewarded CFG Distillation(https://arxiv.org/abs/2410.06084)
Keywords: generative
Abstract: Generative models are transforming creative domains such as music generation, with inference-time strategies like Classifier-Free Guidance (CFG) playing a crucial role. However, CFG doubles inference cost while limiting originality and diversity across generated contents. In this paper, we introduce diversity-rewarded CFG distillation, a novel finetuning procedure that distills the strengths of CFG while addressing its limitations. Our approach optimises two training objectives: (1) a distillation objective, encouraging the model alone (without CFG) to imitate the CFG-augmented predictions, and (2) an RL objective with a diversity reward, promoting the generation of diverse outputs for a given prompt. By finetuning, we learn model weights with the ability to generate high-quality and diverse outputs, without any inference overhead. This also unlocks the potential of weight-based model merging strategies: by interpolating between the weights of two models (the first focusing on quality, the second on diversity), we can control the quality-diversity trade-off at deployment time, and even further boost performance. We conduct extensive experiments on the MusicLM (Agostinelli et al., 2023) text-to-music generative model, where our approach surpasses CFG in terms of quality-diversity Pareto optimality. According to human evaluators, our finetuned-then-merged model generates samples with higher quality-diversity than the base model augmented with CFG. Explore our generations at this https URL.

Title: Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA

Authors: Wenyu Huang, Guancheng Zhou, Hongru Wang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06121
Pdf URL: https://arxiv.org/pdf/2410.06121
Copy Paste: [[2410.06121]] Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA(https://arxiv.org/abs/2410.06121)
Keywords: generative
Abstract: Retrieval-Augmented Generation (RAG) is widely used to inject external non-parametric knowledge into large language models (LLMs). Recent works suggest that Knowledge Graphs (KGs) contain valuable external knowledge for LLMs. Retrieving information from KGs differs from extracting it from document sets. Most existing approaches seek to directly retrieve relevant subgraphs, thereby eliminating the need for extensive SPARQL annotations, traditionally required by semantic parsing methods. In this paper, we model the subgraph retrieval task as a conditional generation task handled by small language models. Specifically, we define a subgraph identifier as a sequence of relations, each represented as a special token stored in the language models. Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models relying on 7B parameters, demonstrating that small language models are capable of performing the subgraph retrieval task. Furthermore, our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks. Our model and data will be made available online: this https URL.

Title: Zero-Shot Learning of Causal Models

Authors: Divyat Mahajan, Jannes Gladrow, Agrin Hilmkil, Cheng Zhang, Meyer Scetbon
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2410.06128
Pdf URL: https://arxiv.org/pdf/2410.06128
Copy Paste: [[2410.06128]] Zero-Shot Learning of Causal Models(https://arxiv.org/abs/2410.06128)
Keywords: generative
Abstract: With the increasing acquisition of datasets over time, we now have access to precise and varied descriptions of the world, capturing all sorts of phenomena. These datasets can be seen as empirical observations of unknown causal generative processes, which can commonly be described by Structural Causal Models (SCMs). Recovering these causal generative processes from observations poses formidable challenges, and often require to learn a specific generative model for each dataset. In this work, we propose to learn a \emph{single} model capable of inferring in a zero-shot manner the causal generative processes of datasets. Rather than learning a specific SCM for each dataset, we enable the Fixed-Point Approach (FiP) proposed in~\cite{scetbon2024fip}, to infer the generative SCMs conditionally on their empirical representations. More specifically, we propose to amortize the learning of a conditional version of FiP to infer generative SCMs from observations and causal structures on synthetically generated datasets. We show that our model is capable of predicting in zero-shot the true generative SCMs, and as a by-product, of (i) generating new dataset samples, and (ii) inferring intervened ones. Our experiments demonstrate that our amortized procedure achieves performances on par with SoTA methods trained specifically for each dataset on both in and out-of-distribution problems. To the best of our knowledge, this is the first time that SCMs are inferred in a zero-shot manner from observations, paving the way for a paradigmatic shift towards the assimilation of causal knowledge across datasets.

Title: Towards Unsupervised Eye-Region Segmentation for Eye Tracking

Authors: Jiangfan Deng, Zhuang Jia, Zhaoxue Wang, Xiang Long, Daniel K. Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06131
Pdf URL: https://arxiv.org/pdf/2410.06131
Copy Paste: [[2410.06131]] Towards Unsupervised Eye-Region Segmentation for Eye Tracking(https://arxiv.org/abs/2410.06131)
Keywords: foundation model
Abstract: Finding the eye and parsing out the parts (e.g. pupil and iris) is a key prerequisite for image-based eye tracking, which has become an indispensable module in today's head-mounted VR/AR devices. However, a typical route for training a segmenter requires tedious handlabeling. In this work, we explore an unsupervised way. First, we utilize priors of human eye and extract signals from the image to establish rough clues indicating the eye-region structure. Upon these sparse and noisy clues, a segmentation network is trained to gradually identify the precise area for each part. To achieve accurate parsing of the eye-region, we first leverage the pretrained foundation model Segment Anything (SAM) in an automatic way to refine the eye indications. Then, the learning process is designed in an end-to-end manner following progressive and prior-aware principle. Experiments show that our unsupervised approach can easily achieve 90% (the pupil and iris) and 85% (the whole eye-region) of the performances under supervised learning.

Title: Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Authors: Sha Guo, Zhuo Chen, Yang Zhao, Ning Zhang, Xiaotong Li, Lingyu Duan
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2410.06149
Pdf URL: https://arxiv.org/pdf/2410.06149
Copy Paste: [[2410.06149]] Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach(https://arxiv.org/abs/2410.06149)
Keywords: diffusion
Abstract: Traditional image codecs emphasize signal fidelity and human perception, often at the expense of machine vision tasks. Deep learning methods have demonstrated promising coding performance by utilizing rich semantic embeddings optimized for both human and machine vision. However, these compact embeddings struggle to capture fine details such as contours and textures, resulting in imperfect reconstructions. Furthermore, existing learning-based codecs lack scalability. To address these limitations, this paper introduces a content-adaptive diffusion model for scalable image compression. The proposed method encodes fine textures through a diffusion process, enhancing perceptual quality while preserving essential features for machine vision tasks. The approach employs a Markov palette diffusion model combined with widely used feature extractors and image generators, enabling efficient data compression. By leveraging collaborative texture-semantic feature extraction and pseudo-label generation, the method accurately captures texture information. A content-adaptive Markov palette diffusion model is then applied to represent both low-level textures and high-level semantic content in a scalable manner. This framework offers flexible control over compression ratios by selecting intermediate diffusion states, eliminating the need for retraining deep learning models at different operating points. Extensive experiments demonstrate the effectiveness of the proposed framework in both image reconstruction and downstream machine vision tasks such as object detection, segmentation, and facial landmark detection, achieving superior perceptual quality compared to state-of-the-art methods.

Title: AgentSquare: Automatic LLM Agent Search in Modular Design Space

Authors: Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, Yong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06153
Pdf URL: https://arxiv.org/pdf/2410.06153
Copy Paste: [[2410.06153]] AgentSquare: Automatic LLM Agent Search in Modular Design Space(https://arxiv.org/abs/2410.06153)
Keywords: in-context
Abstract: Recent advancements in Large Language Models (LLMs) have led to a rapid growth of agentic systems capable of handling a wide range of complex tasks. However, current research largely relies on manual, task-specific design, limiting their adaptability to novel tasks. In this paper, we introduce a new research problem: Modularized LLM Agent Search (MoLAS). We propose a modular design space that abstracts existing LLM agent designs into four fundamental modules with uniform IO interface: Planning, Reasoning, Tool Use, and Memory. Building on this design space, we present a novel LLM agent search framework called AgentSquare, which introduces two core mechanisms, i.e., module evolution and recombination, to efficiently search for optimized LLM agents. To further accelerate the process, we design a performance predictor that uses in-context surrogate models to skip unpromising agent designs. Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand-crafted agents, achieving an average performance gain of 17.2% against best-known human designs. Moreover, AgentSquare can generate interpretable design insights, enabling a deeper understanding of agentic architecture and its impact on task performance. We believe that the modular design space and AgentSquare search framework offer a platform for fully exploiting the potential of prior successful designs and consolidating the collective efforts of research community. Code repo is available at this https URL.

Title: GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Authors: M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James Glass
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06154
Pdf URL: https://arxiv.org/pdf/2410.06154
Copy Paste: [[2410.06154]] GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models(https://arxiv.org/abs/2410.06154)
Keywords: in-context
Abstract: In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs) to act as implicit Optimizers for Vision-Langugage Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to a purity measure obtained through a fitness function. In each respective optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with the knowledge of the type of text prompts preferred by the downstream VLM. Furthermore, we also explicitly steer the LLM generation process in each optimization step by specifically adding an offset difference vector of the embeddings from the positive and negative solutions found by the LLM, in previous optimization steps, to the intermediate layer of the network for the next generation step. This offset vector steers the LLM generation toward the type of language preferred by the downstream VLM, resulting in enhanced performance on the downstream vision tasks. We comprehensively evaluate our GLOV on 16 diverse datasets using two families of VLMs, i.e., dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVa) models -- showing that the discovered solutions can enhance the recognition performance by up to 15.0% and 57.5% (3.8% and 21.6% on average) for these models.

Title: Prompting DirectSAM for Semantic Contour Extraction in Remote Sensing Images

Authors: Shiyu Miao, Delong Chen, Fan Liu, Chuanyi Zhang, Yanhui Gu, Shengjie Guo, Jun Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06194
Pdf URL: https://arxiv.org/pdf/2410.06194
Copy Paste: [[2410.06194]] Prompting DirectSAM for Semantic Contour Extraction in Remote Sensing Images(https://arxiv.org/abs/2410.06194)
Keywords: foundation model
Abstract: The Direct Segment Anything Model (DirectSAM) excels in class-agnostic contour extraction. In this paper, we explore its use by applying it to optical remote sensing imagery, where semantic contour extraction-such as identifying buildings, road networks, and coastlines-holds significant practical value. Those applications are currently handled via training specialized small models separately on small datasets in each domain. We introduce a foundation model derived from DirectSAM, termed DirectSAM-RS, which not only inherits the strong segmentation capability acquired from natural images, but also benefits from a large-scale dataset we created for remote sensing semantic contour extraction. This dataset comprises over 34k image-text-contour triplets, making it at least 30 times larger than individual dataset. DirectSAM-RS integrates a prompter module: a text encoder and cross-attention layers attached to the DirectSAM architecture, which allows flexible conditioning on target class labels or referring expressions. We evaluate the DirectSAM-RS in both zero-shot and fine-tuning setting, and demonstrate that it achieves state-of-the-art performance across several downstream benchmarks.

Title: RelitLRM: Generative Relightable Radiance for Large Reconstruction Models

Authors: Tianyuan Zhang, Zhengfei Kuang, Haian Jin, Zexiang Xu, Sai Bi, Hao Tan, He Zhang, Yiwei Hu, Milos Hasan, William T. Freeman, Kai Zhang, Fujun Luan
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06231
Pdf URL: https://arxiv.org/pdf/2410.06231
Copy Paste: [[2410.06231]] RelitLRM: Generative Relightable Radiance for Large Reconstruction Models(https://arxiv.org/abs/2410.06231)
Keywords: diffusion, generative
Abstract: We propose RelitLRM, a Large Reconstruction Model (LRM) for generating high-quality Gaussian splatting representations of 3D objects under novel illuminations from sparse (4-8) posed images captured under unknown static lighting. Unlike prior inverse rendering methods requiring dense captures and slow optimization, often causing artifacts like incorrect highlights or shadow baking, RelitLRM adopts a feed-forward transformer-based model with a novel combination of a geometry reconstructor and a relightable appearance generator based on diffusion. The model is trained end-to-end on synthetic multi-view renderings of objects under varying known illuminations. This architecture design enables to effectively decompose geometry and appearance, resolve the ambiguity between material and lighting, and capture the multi-modal distribution of shadows and specularity in the relit appearance. We show our sparse-view feed-forward RelitLRM offers competitive relighting results to state-of-the-art dense-view optimization-based baselines while being significantly faster. Our project page is available at: this https URL.

Title: EVOLvE: Evaluating and Optimizing LLMs For Exploration

Authors: Allen Nie, Yi Su, Bo Chang, Jonathan N. Lee, Ed H. Chi, Quoc V. Le, Minmin Chen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2410.06238
Pdf URL: https://arxiv.org/pdf/2410.06238
Copy Paste: [[2410.06238]] EVOLvE: Evaluating and Optimizing LLMs For Exploration(https://arxiv.org/abs/2410.06238)
Keywords: in-context
Abstract: Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference; and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conducted an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of the LLM's exploration efficiency using the concept of regret, linking its ability to explore to the model size and underlying algorithm.

Title: Unsupervised Model Diagnosis

Authors: Yinong Oliver Wang, Eileen Li, Jinqi Luo, Zhaoning Wang, Fernando De la Torre
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06243
Pdf URL: https://arxiv.org/pdf/2410.06243
Copy Paste: [[2410.06243]] Unsupervised Model Diagnosis(https://arxiv.org/abs/2410.06243)
Keywords: generative
Abstract: Ensuring model explainability and robustness is essential for reliable deployment of deep vision systems. Current methods for evaluating robustness rely on collecting and annotating extensive test sets. While this is common practice, the process is labor-intensive and expensive with no guarantee of sufficient coverage across attributes of interest. Recently, model diagnosis frameworks have emerged leveraging user inputs (e.g., text) to assess the vulnerability of the model. However, such dependence on human can introduce bias and limitation given the domain knowledge of particular users. This paper proposes Unsupervised Model Diagnosis (UMO), that leverages generative models to produce semantic counterfactual explanations without any user guidance. Given a differentiable computer vision model (i.e., the target model), UMO optimizes for the most counterfactual directions in a generative latent space. Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources, such as dictionaries or language models. We validate the framework on multiple vision tasks (e.g., classification, segmentation, keypoint detection). Extensive experiments show that our unsupervised discovery of semantic directions can correctly highlight spurious correlations and visualize the failure mode of target models without any human intervention.

Title: Story-Adapter: A Training-free Iterative Framework for Long Story Visualization

Authors: Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Yuyin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06244
Pdf URL: https://arxiv.org/pdf/2410.06244
Copy Paste: [[2410.06244]] Story-Adapter: A Training-free Iterative Framework for Long Story Visualization(https://arxiv.org/abs/2410.06244)
Keywords: diffusion, generative
Abstract: Story visualization, the task of generating coherent images based on a narrative, has seen significant advancements with the emergence of text-to-image models, particularly diffusion models. However, maintaining semantic consistency, generating high-quality fine-grained interactions, and ensuring computational feasibility remain challenging, especially in long story visualization (i.e., up to 100 frames). In this work, we propose a training-free and computationally efficient framework, termed Story-Adapter, to enhance the generative capability of long stories. Specifically, we propose an iterative paradigm to refine each generated image, leveraging both the text prompt and all generated images from the previous iteration. Central to our framework is a training-free global reference cross-attention module, which aggregates all generated images from the previous iteration to preserve semantic consistency across the entire story, while minimizing computational costs with global embeddings. This iterative process progressively optimizes image generation by repeatedly incorporating text constraints, resulting in more precise and fine-grained interactions. Extensive experiments validate the superiority of Story-Adapter in improving both semantic consistency and generative capability for fine-grained interactions, particularly in long story scenarios. The project page and associated code can be accessed via this https URL .

Title: SymDiff: Equivariant Diffusion via Stochastic Symmetrisation

Authors: Leo Zhang, Kianoosh Ashouritaklimi, Yee Whye Teh, Rob Cornish
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2410.06262
Pdf URL: https://arxiv.org/pdf/2410.06262
Copy Paste: [[2410.06262]] SymDiff: Equivariant Diffusion via Stochastic Symmetrisation(https://arxiv.org/abs/2410.06262)
Keywords: diffusion, generative
Abstract: We propose SymDiff, a novel method for constructing equivariant diffusion models using the recently introduced framework of stochastic symmetrisation. SymDiff resembles a learned data augmentation that is deployed at sampling time, and is lightweight, computationally efficient, and easy to implement on top of arbitrary off-the-shelf models. Notably, in contrast to previous work, SymDiff typically does not require any neural network components that are intrinsically equivariant, avoiding the need for complex parameterizations and the use of higher-order geometric features. Instead, our method can leverage highly scalable modern architectures as drop-in replacements for these more constrained alternatives. We show that this additional flexibility yields significant empirical benefit on $\mathrm{E}(3)$-equivariant molecular generation. To the best of our knowledge, this is the first application of symmetrisation to generative modelling, suggesting its potential in this domain more generally.

Title: Think While You Generate: Discrete Diffusion with Planned Denoising

Authors: Sulin Liu, Juno Nam, Andrew Campbell, Hannes Stärk, Yilun Xu, Tommi Jaakkola, Rafael Gómez-Bombarelli
Subjects: cs.LG, cs.AI, cs.CL, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2410.06264
Pdf URL: https://arxiv.org/pdf/2410.06264
Copy Paste: [[2410.06264]] Think While You Generate: Discrete Diffusion with Planned Denoising(https://arxiv.org/abs/2410.06264)
Keywords: diffusion, generative
Abstract: Discrete diffusion has achieved state-of-the-art performance, outperforming or approaching autoregressive models on standard benchmarks. In this work, we introduce Discrete Diffusion with Planned Denoising (DDPD), a novel framework that separates the generation process into two models: a planner and a denoiser. At inference time, the planner selects which positions to denoise next by identifying the most corrupted positions in need of denoising, including both initially corrupted and those requiring additional refinement. This plan-and-denoise approach enables more efficient reconstruction during generation by iteratively identifying and denoising corruptions in the optimal order. DDPD outperforms traditional denoiser-only mask diffusion methods, achieving superior results on language modeling benchmarks such as text8, OpenWebText, and token-based generation on ImageNet $256 \times 256$. Notably, in language modeling, DDPD significantly reduces the performance gap between diffusion-based and autoregressive methods in terms of generative perplexity. Code is available at this https URL.

Title: The Mystery of Compositional Generalization in Graph-based Generative Commonsense Reasoning

Authors: Xiyan Fu, Anette Frank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06272
Pdf URL: https://arxiv.org/pdf/2410.06272
Copy Paste: [[2410.06272]] The Mystery of Compositional Generalization in Graph-based Generative Commonsense Reasoning(https://arxiv.org/abs/2410.06272)
Keywords: generative, in-context
Abstract: While LLMs have emerged as performant architectures for reasoning tasks, their compositional generalization capabilities have been questioned. In this work, we introduce a Compositional Generalization Challenge for Graph-based Commonsense Reasoning (CGGC) that goes beyond previous evaluations that are based on sequences or tree structures - and instead involves a reasoning graph: It requires models to generate a natural sentence based on given concepts and a corresponding reasoning graph, where the presented graph involves a previously unseen combination of relation types. To master this challenge, models need to learn how to reason over relation tupels within the graph, and how to compose them when conceptualizing a verbalization. We evaluate seven well-known LLMs using in-context learning and find that performant LLMs still struggle in compositional generalization. We investigate potential causes of this gap by analyzing the structures of reasoning graphs, and find that different structures present varying levels of difficulty for compositional generalization. Arranging the order of demonstrations according to the structures' difficulty shows that organizing samples in an easy-to-hard schema enhances the compositional generalization ability of LLMs.

Title: Non-Halting Queries: Exploiting Fixed Points in LLMs

Authors: Ghaith Hammouri, Kemal Derya, Berk Sunar
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2410.06287
Pdf URL: https://arxiv.org/pdf/2410.06287
Copy Paste: [[2410.06287]] Non-Halting Queries: Exploiting Fixed Points in LLMs(https://arxiv.org/abs/2410.06287)
Keywords: anomaly
Abstract: We introduce a new vulnerability that exploits fixed points in autoregressive models and use it to craft queries that never halt, i.e. an LLM output that does not terminate. More precisely, for what we call non-halting queries, the LLM never samples the end-of-string token (). We rigorously analyze the conditions under which the non-halting anomaly presents itself. In particular, at temperature zero, we prove that if a repeating (cyclic) sequence of tokens is observed at the output beyond the context size, then the LLM does not halt. We demonstrate the non-halting anomaly in a number of experiments performed in base (unaligned) models where repeating tokens immediately lead to a non-halting cyclic behavior as predicted by the analysis. Further, we develop a simple recipe that takes the same fixed points observed in the base model and creates a prompt structure to target aligned models. We study the recipe behavior in bypassing alignment in a number of LLMs including GPT-4o, llama-3-8b-instruct, and gemma-2-9b-it where all models are forced into a non-halting state. Further, we demonstrate the recipe's success in sending most major models released over the past year into a non-halting state with the same simple prompt even at higher temperatures. Further, we study direct inversion based techniques to craft new short prompts to induce the non-halting state. Our experiments with the gradient search based inversion technique ARCA show that non-halting is prevalent across models and may be easily induced with a few input tokens. While its impact on the reliability of hosted systems can be mitigated by configuring a hard maximum token limit in the sampler, the non-halting anomaly still manages to break alignment. This underlines the need for further studies and stronger forms of alignment against non-halting anomalies.

Title: Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content?

Authors: Shenbin Qian, Constantin Orăsan, Diptesh Kanojia, Félix do Carmo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06338
Pdf URL: https://arxiv.org/pdf/2410.06338
Copy Paste: [[2410.06338]] Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content?(https://arxiv.org/abs/2410.06338)
Keywords: in-context
Abstract: This paper investigates whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC) that contains emotional expressions, without the use of reference translations. To achieve this, we employ an existing emotion-related dataset with human-annotated errors and calculate quality evaluation scores based on the Multi-dimensional Quality Metrics. We compare the accuracy of several LLMs with that of our fine-tuned baseline models, under in-context learning and parameter-efficient fine-tuning (PEFT) scenarios. We find that PEFT of LLMs leads to better performance in score prediction with human interpretable explanations than fine-tuned models. However, a manual analysis of LLM outputs reveals that they still have problems such as refusal to reply to a prompt and unstable output while evaluating machine translation of UGC.

Title: MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks

Authors: Mirelle Bueno, Roberto Lotufo, Rodrigo Nogueira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06396
Pdf URL: https://arxiv.org/pdf/2410.06396
Copy Paste: [[2410.06396]] MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks(https://arxiv.org/abs/2410.06396)
Keywords: in-context
Abstract: Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens. However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training. For example, state-of-the-art LLMs can find common items in two lists with up to 20 items but fail when lists have 80 items. In this paper, we introduce MLissard, a multilingual benchmark designed to evaluate models' abilities to process and generate texts of varied lengths and offers a mechanism for controlling sequence complexity. Our evaluation of open-source and proprietary models show a consistent decline in performance across all models and languages as the complexity of the sequence increases. Surprisingly, the use of in-context examples in languages other than English helps increase extrapolation performance significantly. The datasets and code are available at this https URL

Title: Provable Accuracy Bounds for Hybrid Dynamical Optimization and Sampling

Authors: Matthew X. Burns, Qingyuan Hou, Michael C. Huang
Subjects: cs.LG, cs.DS, math.ST
Abstract URL: https://arxiv.org/abs/2410.06397
Pdf URL: https://arxiv.org/pdf/2410.06397
Copy Paste: [[2410.06397]] Provable Accuracy Bounds for Hybrid Dynamical Optimization and Sampling(https://arxiv.org/abs/2410.06397)
Keywords: diffusion
Abstract: Analog dynamical accelerators (DXs) are a growing sub-field in computer architecture research, offering order-of-magnitude gains in power efficiency and latency over traditional digital methods in several machine learning, optimization, and sampling tasks. However, limited-capacity accelerators require hybrid analog/digital algorithms to solve real-world problems, commonly using large-neighborhood local search (LNLS) frameworks. Unlike fully digital algorithms, hybrid LNLS has no non-asymptotic convergence guarantees and no principled hyperparameter selection schemes, particularly limiting cross-device training and inference. In this work, we provide non-asymptotic convergence guarantees for hybrid LNLS by reducing to block Langevin Diffusion (BLD) algorithms. Adapting tools from classical sampling theory, we prove exponential KL-divergence convergence for randomized and cyclic block selection strategies using ideal DXs. With finite device variation, we provide explicit bounds on the 2-Wasserstein bias in terms of step duration, noise strength, and function parameters. Our BLD model provides a key link between established theory and novel computing platforms, and our theoretical results provide a closed-form expression linking device variation, algorithm hyperparameters, and performance.

Title: ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments

Authors: Sourjyadip Ray, Kushal Gupta, Soumi Kundu, Payal Arvind Kasat, Somak Aditya, Pawan Goyal
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.06420
Pdf URL: https://arxiv.org/pdf/2410.06420
Copy Paste: [[2410.06420]] ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments(https://arxiv.org/abs/2410.06420)
Keywords: in-context
Abstract: The global shortage of healthcare workers has demanded the development of smart healthcare assistants, which can help monitor and alert healthcare workers when necessary. We examine the healthcare knowledge of existing Large Vision Language Models (LVLMs) via the Visual Question Answering (VQA) task in hospital settings through expert annotated open-ended questions. We introduce the Emergency Room Visual Question Answering (ERVQA) dataset, consisting of triplets covering diverse emergency room scenarios, a seminal benchmark for LVLMs. By developing a detailed error taxonomy and analyzing answer trends, we reveal the nuanced nature of the task. We benchmark state-of-the-art open-source and closed LVLMs using traditional and adapted VQA metrics: Entailment Score and CLIPScore Confidence. Analyzing errors across models, we infer trends based on properties like decoder type, model size, and in-context examples. Our findings suggest the ERVQA dataset presents a highly complex task, highlighting the need for specialized, domain-specific solutions.

Title: MaD-Scientist: AI-based Scientist solving Convection-Diffusion-Reaction Equations Using Massive PINN-Based Prior Data

Authors: Mingu Kang, Dongseok Lee, Woojin Cho, Jaehyeon Park, Kookjin Lee, Anthony Gruber, Youngjoon Hong, Noseong Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06442
Pdf URL: https://arxiv.org/pdf/2410.06442
Copy Paste: [[2410.06442]] MaD-Scientist: AI-based Scientist solving Convection-Diffusion-Reaction Equations Using Massive PINN-Based Prior Data(https://arxiv.org/abs/2410.06442)
Keywords: diffusion, foundation model, in-context
Abstract: Large language models (LLMs), like ChatGPT, have shown that even trained with noisy prior data, they can generalize effectively to new tasks through in-context learning (ICL) and pre-training techniques. Motivated by this, we explore whether a similar approach can be applied to scientific foundation models (SFMs). Our methodology is structured as follows: (i) we collect low-cost physics-informed neural network (PINN)-based approximated prior data in the form of solutions to partial differential equations (PDEs) constructed through an arbitrary linear combination of mathematical dictionaries; (ii) we utilize Transformer architectures with self and cross-attention mechanisms to predict PDE solutions without knowledge of the governing equations in a zero-shot setting; (iii) we provide experimental evidence on the one-dimensional convection-diffusion-reaction equation, which demonstrate that pre-training remains robust even with approximated prior data, with only marginal impacts on test accuracy. Notably, this finding opens the path to pre-training SFMs with realistic, low-cost data instead of (or in conjunction with) numerical high-cost data. These results support the conjecture that SFMs can improve in a manner similar to LLMs, where fully cleaning the vast set of sentences crawled from the Internet is nearly impossible.

Title: HFH-Font: Few-shot Chinese Font Synthesis with Higher Quality, Faster Speed, and Higher Resolution

Authors: Hua Li, Zhouhui Lian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06488
Pdf URL: https://arxiv.org/pdf/2410.06488
Copy Paste: [[2410.06488]] HFH-Font: Few-shot Chinese Font Synthesis with Higher Quality, Faster Speed, and Higher Resolution(https://arxiv.org/abs/2410.06488)
Keywords: diffusion, generative
Abstract: The challenge of automatically synthesizing high-quality vector fonts, particularly for writing systems (e.g., Chinese) consisting of huge amounts of complex glyphs, remains unsolved. Existing font synthesis techniques fall into two categories: 1) methods that directly generate vector glyphs, and 2) methods that initially synthesize glyph images and then vectorize them. However, the first category often fails to construct complete and correct shapes for complex glyphs, while the latter struggles to efficiently synthesize high-resolution (i.e., 1024 $\times$ 1024 or higher) glyph images while preserving local details. In this paper, we introduce HFH-Font, a few-shot font synthesis method capable of efficiently generating high-resolution glyph images that can be converted into high-quality vector glyphs. More specifically, our method employs a diffusion model-based generative framework with component-aware conditioning to learn different levels of style information adaptable to varying input reference sizes. We also design a distillation module based on Score Distillation Sampling for 1-step fast inference, and a style-guided super-resolution module to refine and upscale low-resolution synthesis results. Extensive experiments, including a user study with professional font designers, have been conducted to demonstrate that our method significantly outperforms existing font synthesis approaches. Experimental results show that our method produces high-fidelity, high-resolution raster images which can be vectorized into high-quality vector fonts. Using our method, for the first time, large-scale Chinese vector fonts of a quality comparable to those manually created by professional font designers can be automatically generated.

Title: Chemistry-Inspired Diffusion with Non-Differentiable Guidance

Authors: Yuchen Shen, Chenhao Zhang, Sijie Fu, Chenghui Zhou, Newell Washburn, Barnabás Póczos
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06502
Pdf URL: https://arxiv.org/pdf/2410.06502
Copy Paste: [[2410.06502]] Chemistry-Inspired Diffusion with Non-Differentiable Guidance(https://arxiv.org/abs/2410.06502)
Keywords: diffusion
Abstract: Recent advances in diffusion models have shown remarkable potential in the conditional generation of novel molecules. These models can be guided in two ways: (i) explicitly, through additional features representing the condition, or (ii) implicitly, using a property predictor. However, training property predictors or conditional diffusion models requires an abundance of labeled data and is inherently challenging in real-world applications. We propose a novel approach that attenuates the limitations of acquiring large labeled datasets by leveraging domain knowledge from quantum chemistry as a non-differentiable oracle to guide an unconditional diffusion model. Instead of relying on neural networks, the oracle provides accurate guidance in the form of estimated gradients, allowing the diffusion process to sample from a conditional distribution specified by quantum chemistry. We show that this results in more precise conditional generation of novel and stable molecular structures. Our experiments demonstrate that our method: (1) significantly reduces atomic forces, enhancing the validity of generated molecules when used for stability optimization; (2) is compatible with both explicit and implicit guidance in diffusion models, enabling joint optimization of molecular properties and stability; and (3) generalizes effectively to molecular optimization tasks beyond stability optimization.

Title: DiffGAD: A Diffusion-based Unsupervised Graph Anomaly Detector

Authors: Jinghan Li, Yuan Gao, Jinda Lu, Junfeng Fang, Congcong Wen, Hui Lin, Xiang Wang
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2410.06549
Pdf URL: https://arxiv.org/pdf/2410.06549
Copy Paste: [[2410.06549]] DiffGAD: A Diffusion-based Unsupervised Graph Anomaly Detector(https://arxiv.org/abs/2410.06549)
Keywords: diffusion, anomaly
Abstract: Graph Anomaly Detection (GAD) is crucial for identifying abnormal entities within networks, garnering significant attention across various fields. Traditional unsupervised methods, which decode encoded latent representations of unlabeled data with a reconstruction focus, often fail to capture critical discriminative content, leading to suboptimal anomaly detection. To address these challenges, we present a Diffusion-based Graph Anomaly Detector (DiffGAD). At the heart of DiffGAD is a novel latent space learning paradigm, meticulously designed to enhance its proficiency by guiding it with discriminative content. This innovative approach leverages diffusion sampling to infuse the latent space with discriminative content and introduces a content-preservation mechanism that retains valuable information across different scales, significantly improving its adeptness at identifying anomalies with limited time and space complexity. Our comprehensive evaluation of DiffGAD, conducted on six real-world and large-scale datasets with various metrics, demonstrated its exceptional performance.

Title: InstantIR: Blind Image Restoration with Instant Generative Reference

Authors: Jen-Yuan Huang, Haofan Wang, Qixun Wang, Xu Bai, Hao Ai, Peng Xing, Jen-Tse Huang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06551
Pdf URL: https://arxiv.org/pdf/2410.06551
Copy Paste: [[2410.06551]] InstantIR: Blind Image Restoration with Instant Generative Reference(https://arxiv.org/abs/2410.06551)
Keywords: diffusion, generative
Abstract: Handling test-time unknown degradation is the major challenge in Blind Image Restoration (BIR), necessitating high model generalization. An effective strategy is to incorporate prior knowledge, either from human input or generative model. In this paper, we introduce Instant-reference Image Restoration (InstantIR), a novel diffusion-based BIR method which dynamically adjusts generation condition during inference. We first extract a compact representation of the input via a pre-trained vision encoder. At each generation step, this representation is used to decode current diffusion latent and instantiate it in the generative prior. The degraded image is then encoded with this reference, providing robust generation condition. We observe the variance of generative references fluctuate with degradation intensity, which we further leverage as an indicator for developing a sampling algorithm adaptive to input quality. Extensive experiments demonstrate InstantIR achieves state-of-the-art performance and offering outstanding visual quality. Through modulating generative references with textual description, InstantIR can restore extreme degradation and additionally feature creative restoration.

Title: On The Relationship between Visual Anomaly-free and Anomalous Representations

Authors: Riya Sadrani, Hrishikesh Sharma, Ayush Bachan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06576
Pdf URL: https://arxiv.org/pdf/2410.06576
Copy Paste: [[2410.06576]] On The Relationship between Visual Anomaly-free and Anomalous Representations(https://arxiv.org/abs/2410.06576)
Keywords: anomaly
Abstract: Anomaly Detection is an important problem within computer vision, having variety of real-life applications. Yet, the current set of solutions to this problem entail known, systematic shortcomings. Specifically, contemporary surface Anomaly Detection task assumes the presence of multiple specific anomaly classes e.g. cracks, rusting etc., unlike one-class classification model of past. However, building a deep learning model in such setup remains a challenge because anomalies arise rarely, and hence anomaly samples are quite scarce. Transfer learning has been a preferred paradigm in such situations. But the typical source domains with large dataset sizes e.g. ImageNet, JFT-300M, LAION-2B do not correlate well with the domain of surfaces and materials, an important premise of transfer learning. In this paper, we make an important hypothesis and show, by exhaustive experimentation, that the space of anomaly-free visual patterns of the normal samples correlates well with each of the various spaces of anomalous patterns of the class-specific anomaly samples. The first results of using this hypothesis in transfer learning have indeed been quite encouraging. We expect that finding such a simple closeby domain that readily entails large number of samples, and which also oftentimes shows interclass separability though with narrow margins, will be a useful discovery. Especially, it is expected to improve domain adaptation for anomaly detection, and few-shot learning for anomaly detection, making in-the-wild anomaly detection realistically possible in future.

Title: DDRN:a Data Distribution Reconstruction Network for Occluded Person Re-Identification

Authors: Zhaoyong Wang, Yujie Liu, Mingyue Li, Wenxin Zhang, Zongmin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06600
Pdf URL: https://arxiv.org/pdf/2410.06600
Copy Paste: [[2410.06600]] DDRN:a Data Distribution Reconstruction Network for Occluded Person Re-Identification(https://arxiv.org/abs/2410.06600)
Keywords: generative
Abstract: In occluded person re-identification(ReID), severe occlusions lead to a significant amount of irrelevant information that hinders the accurate identification of individuals. These irrelevant cues primarily stem from background interference and occluding interference, adversely affecting the final retrieval results. Traditional discriminative models, which rely on the specific content and positions of the images, often misclassify in cases of occlusion. To address these limitations, we propose the Data Distribution Reconstruction Network (DDRN), a generative model that leverages data distribution to filter out irrelevant details, enhancing overall feature perception ability and reducing irrelevant feature interference. Additionally, severe occlusions lead to the complexity of the feature space. To effectively handle this, we design a multi-center approach through the proposed Hierarchical SubcenterArcface (HS-Arcface) loss function, which can better approximate complex feature spaces. On the Occluded-Duke dataset, we achieved a mAP of 62.4\% (+1.1\%) and a rank-1 accuracy of 71.3\% (+0.6\%), surpassing the latest state-of-the-art methods(FRT) significantly.

Title: $\beta$-calibration of Language Model Confidence Scores for Generative QA

Authors: Putra Manggala, Atalanti Mastakouri, Elke Kirschbaum, Shiva Prasad Kasiviswanathan, Aaditya Ramdas
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06615
Pdf URL: https://arxiv.org/pdf/2410.06615
Copy Paste: [[2410.06615]] $\beta$-calibration of Language Model Confidence Scores for Generative QA(https://arxiv.org/abs/2410.06615)
Keywords: generative
Abstract: To use generative question-and-answering (QA) systems for decision-making and in any critical application, these systems need to provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is on average indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce $\beta$-calibration, which ensures calibration holds across different question-and-answer groups. We then propose discretized posthoc calibration schemes for achieving $\beta$-calibration.

Title: ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time

Authors: Yi Ding, Bolian Li, Ruqi Zhang
Subjects: cs.CV, cs.CL, cs.PL
Abstract URL: https://arxiv.org/abs/2410.06625
Pdf URL: https://arxiv.org/pdf/2410.06625
Copy Paste: [[2410.06625]] ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time(https://arxiv.org/abs/2410.06625)
Keywords: generative
Abstract: Vision Language Models (VLMs) have become essential backbones for multimodal intelligence, yet significant safety challenges limit their real-world application. While textual inputs are often effectively safeguarded, adversarial visual inputs can easily bypass VLM defense mechanisms. Existing defense methods are either resource-intensive, requiring substantial data and compute, or fail to simultaneously ensure safety and usefulness in responses. To address these limitations, we propose a novel two-phase inference-time alignment framework, Evaluating Then Aligning (ETA): 1) Evaluating input visual contents and output responses to establish a robust safety awareness in multimodal settings, and 2) Aligning unsafe behaviors at both shallow and deep levels by conditioning the VLMs' generative distribution with an interference prefix and performing sentence-level best-of-N to search the most harmless and helpful generation paths. Extensive experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency, reducing the unsafe rate by 87.5% in cross-modality attacks and achieving 96.6% win-ties in GPT-4 helpfulness evaluation. The code is publicly available at this https URL.

Title: Tree of Problems: Improving structured problem solving with compositionality

Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06634
Pdf URL: https://arxiv.org/pdf/2410.06634
Copy Paste: [[2410.06634]] Tree of Problems: Improving structured problem solving with compositionality(https://arxiv.org/abs/2410.06634)
Keywords: in-context
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across multiple tasks through in-context learning. For complex reasoning tasks that require step-by-step thinking, Chain-of-Thought (CoT) prompting has given impressive results, especially when combined with self-consistency. Nonetheless, some tasks remain particularly difficult for LLMs to solve. Tree of Thoughts (ToT) and Graph of Thoughts (GoT) emerged as alternatives, dividing the complex problem into paths of subproblems. In this paper, we propose Tree of Problems (ToP), a simpler version of ToT, which we hypothesise can work better for complex tasks that can be divided into identical subtasks. Our empirical results show that our approach outperforms ToT and GoT, and in addition performs better than CoT on complex reasoning tasks. All code for this paper is publicly available here: this https URL.

Title: Task-oriented Time Series Imputation Evaluation via Generalized Representers

Authors: Zhixian Wang, Linxiao Yang, Liang Sun, Qingsong Wen, Yi Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06652
Pdf URL: https://arxiv.org/pdf/2410.06652
Copy Paste: [[2410.06652]] Task-oriented Time Series Imputation Evaluation via Generalized Representers(https://arxiv.org/abs/2410.06652)
Keywords: anomaly
Abstract: Time series analysis is widely used in many fields such as power energy, economics, and transportation, including different tasks such as forecasting, anomaly detection, classification, etc. Missing values are widely observed in these tasks, and often leading to unpredictable negative effects on existing methods, hindering their further application. In response to this situation, existing time series imputation methods mainly focus on restoring sequences based on their data characteristics, while ignoring the performance of the restored sequences in downstream tasks. Considering different requirements of downstream tasks (e.g., forecasting), this paper proposes an efficient downstream task-oriented time series imputation evaluation approach. By combining time series imputation with neural network models used for downstream tasks, the gain of different imputation strategies on downstream tasks is estimated without retraining, and the most favorable imputation value for downstream tasks is given by combining different imputation strategies according to the estimated gain.

Title: Decouple-Then-Merge: Towards Better Training for Diffusion Models

Authors: Qianli Ma, Xuefei Ning, Dongrui Liu, Li Niu, Linfeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06664
Pdf URL: https://arxiv.org/pdf/2410.06664
Copy Paste: [[2410.06664]] Decouple-Then-Merge: Towards Better Training for Diffusion Models(https://arxiv.org/abs/2410.06664)
Keywords: diffusion
Abstract: Diffusion models are trained by learning a sequence of models that reverse each step of noise corruption. Typically, the model parameters are fully shared across multiple timesteps to enhance training efficiency. However, since the denoising tasks differ at each timestep, the gradients computed at different timesteps may conflict, potentially degrading the overall performance of image generation. To solve this issue, this work proposes a Decouple-then-Merge (DeMe) framework, which begins with a pretrained model and finetunes separate models tailored to specific timesteps. We introduce several improved techniques during the finetuning stage to promote effective knowledge sharing while minimizing training interference across timesteps. Finally, after finetuning, these separate models can be merged into a single model in the parameter space, ensuring efficient and practical inference. Experimental results show significant generation quality improvements upon 6 benchmarks including Stable Diffusion on COCO30K, ImageNet1K, PartiPrompts, and DDPM on LSUN Church, LSUN Bedroom, and CIFAR10.

Title: Revisiting Multi-Permutation Equivariance through the Lens of Irreducible Representations

Authors: Yonatan Sverdlov, Ido Springer, Nadav Dym
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06665
Pdf URL: https://arxiv.org/pdf/2410.06665
Copy Paste: [[2410.06665]] Revisiting Multi-Permutation Equivariance through the Lens of Irreducible Representations(https://arxiv.org/abs/2410.06665)
Keywords: anomaly
Abstract: This paper explores the characterization of equivariant linear layers for representations of permutations and related groups. Unlike traditional approaches, which address these problems using parameter-sharing, we consider an alternative methodology based on irreducible representations and Schur's lemma. Using this methodology, we obtain an alternative derivation for existing models like DeepSets, 2-IGN graph equivariant networks, and Deep Weight Space (DWS) networks. The derivation for DWS networks is significantly simpler than that of previous results. Next, we extend our approach to unaligned symmetric sets, where equivariance to the wreath product of groups is required. Previous works have addressed this problem in a rather restrictive setting, in which almost all wreath equivariant layers are Siamese. In contrast, we give a full characterization of layers in this case and show that there is a vast number of additional non-Siamese layers in some settings. We also show empirically that these additional non-Siamese layers can improve performance in tasks like graph anomaly detection, weight space alignment, and learning Wasserstein distances. Our code is available at \href{this https URL}{GitHub}.

Title: Guaranteed Generation from Large Language Models

Authors: Minbeom Kim, Thibaut Thonet, Jos Rozen, Hwaran Lee, Kyomin Jung, Marc Dymetman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06716
Pdf URL: https://arxiv.org/pdf/2410.06716
Copy Paste: [[2410.06716]] Guaranteed Generation from Large Language Models(https://arxiv.org/abs/2410.06716)
Keywords: generative
Abstract: As large language models (LLMs) are increasingly used across various applications, there is a growing need to control text generation to satisfy specific constraints or requirements. This raises a crucial question: Is it possible to guarantee strict constraint satisfaction in generated outputs while preserving the distribution of the original model as much as possible? We first define the ideal distribution - the one closest to the original model, which also always satisfies the expressed constraint - as the ultimate goal of guaranteed generation. We then state a fundamental limitation, namely that it is impossible to reach that goal through autoregressive training alone. This motivates the necessity of combining training-time and inference-time methods to enforce such guarantees. Based on this insight, we propose GUARD, a simple yet effective approach that combines an autoregressive proposal distribution with rejection sampling. Through GUARD's theoretical properties, we show how controlling the KL divergence between a specific proposal and the target ideal distribution simultaneously optimizes inference speed and distributional closeness. To validate these theoretical concepts, we conduct extensive experiments on two text generation settings with hard-to-satisfy constraints: a lexical constraint scenario and a sentiment reversal scenario. These experiments show that GUARD achieves perfect constraint satisfaction while almost preserving the ideal distribution with highly improved inference efficiency. GUARD provides a principled approach to enforcing strict guarantees for LLMs without compromising their generative capabilities.

Title: Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques

Authors: Benyuan Meng, Qianqian Xu, Zitai Wang, Zhiyong Yang, Xiaochun Cao, Qingming Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06719
Pdf URL: https://arxiv.org/pdf/2410.06719
Copy Paste: [[2410.06719]] Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques(https://arxiv.org/abs/2410.06719)
Keywords: diffusion, generative
Abstract: Diffusion models are powerful generative models, and this capability can also be applied to discrimination. The inner activations of a pre-trained diffusion model can serve as features for discriminative tasks, namely, diffusion feature. We discover that diffusion feature has been hindered by a hidden yet universal phenomenon that we call content shift. To be specific, there are content differences between features and the input image, such as the exact shape of a certain object. We locate the cause of content shift as one inherent characteristic of diffusion models, which suggests the broad existence of this phenomenon in diffusion feature. Further empirical study also indicates that its negative impact is not negligible even when content shift is not visually perceivable. Hence, we propose to suppress content shift to enhance the overall quality of diffusion features. Specifically, content shift is related to the information drift during the process of recovering an image from the noisy input, pointing out the possibility of turning off-the-shelf generation techniques into tools for content shift suppression. We further propose a practical guideline named GATE to efficiently evaluate the potential benefit of a technique and provide an implementation of our methodology. Despite the simplicity, the proposed approach has achieved superior results on various tasks and datasets, validating its potential as a generic booster for diffusion features. Our code is available at this https URL.

Title: MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes

Authors: Zhenhui Ye, Tianyun Zhong, Yi Ren, Ziyue Jiang, Jiawei Huang, Rongjie Huang, Jinglin Liu, Jinzheng He, Chen Zhang, Zehan Wang, Xize Chen, Xiang Yin, Zhou Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06734
Pdf URL: https://arxiv.org/pdf/2410.06734
Copy Paste: [[2410.06734]] MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes(https://arxiv.org/abs/2410.06734)
Keywords: in-context
Abstract: Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos. Personalized TFG is a variant that emphasizes the perceptual identity similarity of the synthesized result (from the perspective of appearance and talking style). While previous works typically solve this problem by learning an individual neural radiance field (NeRF) for each identity to implicitly store its static and dynamic information, we find it inefficient and non-generalized due to the per-identity-per-training framework and the limited training data. To this end, we propose MimicTalk, the first attempt that exploits the rich knowledge from a NeRF-based person-agnostic generic model for improving the efficiency and robustness of personalized TFG. To be specific, (1) we first come up with a person-agnostic 3D TFG model as the base model and propose to adapt it into a specific identity; (2) we propose a static-dynamic-hybrid adaptation pipeline to help the model learn the personalized static appearance and facial dynamic features; (3) To generate the facial motion of the personalized talking style, we propose an in-context stylized audio-to-motion model that mimics the implicit talking style provided in the reference video without information loss by an explicit style representation. The adaptation process to an unseen identity can be performed in 15 minutes, which is 47 times faster than previous person-dependent methods. Experiments show that our MimicTalk surpasses previous baselines regarding video quality, efficiency, and expressiveness. Source code and video samples are available at this https URL .

Title: Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance?

Authors: Fumiya Uchiyama, Takeshi Kojima, Andrew Gambardella, Qi Cao, Yusuke Iwasawa, Yutaka Matsuo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06735
Pdf URL: https://arxiv.org/pdf/2410.06735
Copy Paste: [[2410.06735]] Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance?(https://arxiv.org/abs/2410.06735)
Keywords: in-context
Abstract: Recent large language models (LLMs) have demonstrated remarkable generalization abilities in mathematics and logical reasoning tasks. Prior research indicates that LLMs pre-trained with programming language data exhibit high mathematical and reasoning abilities; however, this causal relationship has not been rigorously tested. Our research aims to verify which programming languages and features during pre-training affect logical inference performance. Specifically, we pre-trained decoder-based language models from scratch using datasets from ten programming languages (e.g., Python, C, Java) and three natural language datasets (Wikipedia, Fineweb, C4) under identical conditions. Thereafter, we evaluated the trained models in a few-shot in-context learning setting on logical reasoning tasks: FLD and bAbi, which do not require commonsense or world knowledge. The results demonstrate that nearly all models trained with programming languages consistently outperform those trained with natural languages, indicating that programming languages contain factors that elicit logic inference performance. In addition, we found that models trained with programming languages exhibit a better ability to follow instructions compared to those trained with natural languages. Further analysis reveals that the depth of Abstract Syntax Trees representing parsed results of programs also affects logical reasoning performance. These findings will offer insights into the essential elements of pre-training for acquiring the foundational abilities of LLMs.

Title: DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation

Authors: Zhiqi Li, Yiming Chen, Peidong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06756
Pdf URL: https://arxiv.org/pdf/2410.06756
Copy Paste: [[2410.06756]] DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation(https://arxiv.org/abs/2410.06756)
Keywords: generative
Abstract: Recent advancements in 2D/3D generative techniques have facilitated the generation of dynamic 3D objects from monocular videos. Previous methods mainly rely on the implicit neural radiance fields (NeRF) or explicit Gaussian Splatting as the underlying representation, and struggle to achieve satisfactory spatial-temporal consistency and surface appearance. Drawing inspiration from modern 3D animation pipelines, we introduce DreamMesh4D, a novel framework combining mesh representation with geometric skinning technique to generate high-quality 4D object from a monocular video. Instead of utilizing classical texture map for appearance, we bind Gaussian splats to triangle face of mesh for differentiable optimization of both the texture and mesh vertices. In particular, DreamMesh4D begins with a coarse mesh obtained through an image-to-3D generation procedure. Sparse points are then uniformly sampled across the mesh surface, and are used to build a deformation graph to drive the motion of the 3D object for the sake of computational efficiency and providing additional constraint. For each step, transformations of sparse control points are predicted using a deformation network, and the mesh vertices as well as the surface Gaussians are deformed via a novel geometric skinning algorithm, which is a hybrid approach combining LBS (linear blending skinning) and DQS (dual-quaternion skinning), mitigating drawbacks associated with both approaches. The static surface Gaussians and mesh vertices as well as the deformation network are learned via reference view photometric loss, score distillation loss as well as other regularizers in a two-stage manner. Extensive experiments demonstrate superior performance of our method. Furthermore, our method is compatible with modern graphic pipelines, showcasing its potential in the 3D gaming and film industry.

Title: Mind Your Questions Towards Backdoor Attacks on Text-to-Visualization Models

Authors: Shuaimin Li, Yuanfeng Song, Xuanang Chen, Anni Peng, Zhuoyue Wan, Chen Jason Zhang, Raymond Chi-Wing Wong
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2410.06782
Pdf URL: https://arxiv.org/pdf/2410.06782
Copy Paste: [[2410.06782]] Mind Your Questions Towards Backdoor Attacks on Text-to-Visualization Models(https://arxiv.org/abs/2410.06782)
Keywords: in-context
Abstract: Text-to-visualization (text-to-vis) models have become valuable tools in the era of big data, enabling users to generate data visualizations and make informed decisions through natural language queries (NLQs). Despite their widespread application, the security vulnerabilities of these models have been largely overlooked. To address this gap, we propose VisPoison, a novel framework designed to identify these vulnerabilities of current text-to-vis models systematically. VisPoison introduces two types of triggers that activate three distinct backdoor attacks, potentially leading to data exposure, misleading visualizations, or denial-of-service (DoS) incidents. The framework features both proactive and passive attack mechanisms: proactive attacks leverage rare-word triggers to access confidential data, while passive attacks, triggered unintentionally by users, exploit a first-word trigger method, causing errors or DoS events in visualizations. Through extensive experiments on both trainable and in-context learning (ICL)-based text-to-vis models, \textit{VisPoison} achieves attack success rates of over 90\%, highlighting the security problem of current text-to-vis models. Additionally, we explore two types of defense mechanisms against these attacks, but the results show that existing countermeasures are insufficient, underscoring the pressing need for more robust security solutions in text-to-vis systems.

Title: Diffuse or Confuse: A Diffusion Deepfake Speech Dataset

Authors: Anton Firc, Kamil Malinka, Petr Hanáček
Subjects: cs.CR, cs.AI, cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2410.06796
Pdf URL: https://arxiv.org/pdf/2410.06796
Copy Paste: [[2410.06796]] Diffuse or Confuse: A Diffusion Deepfake Speech Dataset(https://arxiv.org/abs/2410.06796)
Keywords: diffusion
Abstract: Advancements in artificial intelligence and machine learning have significantly improved synthetic speech generation. This paper explores diffusion models, a novel method for creating realistic synthetic speech. We create a diffusion dataset using available tools and pretrained models. Additionally, this study assesses the quality of diffusion-generated deepfakes versus non-diffusion ones and their potential threat to current deepfake detection systems. Findings indicate that the detection of diffusion-based deepfakes is generally comparable to non-diffusion deepfakes, with some variability based on detector architecture. Re-vocoding with diffusion vocoders shows minimal impact, and the overall speech quality is comparable to non-diffusion methods.

Title: Seg2Act: Global Context-aware Action Generation for Document Logical Structuring

Authors: Zichao Li, Shaojie He, Meng Liao, Xuanang Chen, Yaojie Lu, Hongyu Lin, Yanxiong Lu, Xianpei Han, Le Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06802
Pdf URL: https://arxiv.org/pdf/2410.06802
Copy Paste: [[2410.06802]] Seg2Act: Global Context-aware Action Generation for Document Logical Structuring(https://arxiv.org/abs/2410.06802)
Keywords: generative
Abstract: Document logical structuring aims to extract the underlying hierarchical structure of documents, which is crucial for document intelligence. Traditional approaches often fall short in handling the complexity and the variability of lengthy documents. To address these issues, we introduce Seg2Act, an end-to-end, generation-based method for document logical structuring, revisiting logical structure extraction as an action generation task. Specifically, given the text segments of a document, Seg2Act iteratively generates the action sequence via a global context-aware generative model, and simultaneously updates its global context and current logical structure based on the generated actions. Experiments on ChCatExt and HierDoc datasets demonstrate the superior performance of Seg2Act in both supervised and transfer learning settings.

Title: Boosting Few-Shot Detection with Large Language Models and Layout-to-Image Synthesis

Authors: Ahmed Abdullah, Nikolas Ebert, Oliver Wasenmüller
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06841
Pdf URL: https://arxiv.org/pdf/2410.06841
Copy Paste: [[2410.06841]] Boosting Few-Shot Detection with Large Language Models and Layout-to-Image Synthesis(https://arxiv.org/abs/2410.06841)
Keywords: diffusion, generative
Abstract: Recent advancements in diffusion models have enabled a wide range of works exploiting their ability to generate high-volume, high-quality data for use in various downstream tasks. One subclass of such models, dubbed Layout-to-Image Synthesis (LIS), learns to generate images conditioned on a spatial layout (bounding boxes, masks, poses, etc.) and has shown a promising ability to generate realistic images, albeit with limited layout-adherence. Moreover, the question of how to effectively transfer those models for scalable augmentation of few-shot detection data remains unanswered. Thus, we propose a collaborative framework employing a Large Language Model (LLM) and an LIS model for enhancing few-shot detection beyond state-of-the-art generative augmentation approaches. We leverage LLM's reasoning ability to extrapolate the spatial prior of the annotation space by generating new bounding boxes given only a few example annotations. Additionally, we introduce our novel layout-aware CLIP score for sample ranking, enabling tight coupling between generated layouts and images. Significant improvements on COCO few-shot benchmarks are observed. With our approach, a YOLOX-S baseline is boosted by more than 140%, 50%, 35% in mAP on the COCO 5-,10-, and 30-shot settings, respectively.

Title: Generative Model for Less-Resourced Language with 1 billion parameters

Authors: Domen Vreš, Martin Božič, Aljaž Potočnik, Tomaž Martinčič, Marko Robnik-Šikonja
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06898
Pdf URL: https://arxiv.org/pdf/2410.06898
Copy Paste: [[2410.06898]] Generative Model for Less-Resourced Language with 1 billion parameters(https://arxiv.org/abs/2410.06898)
Keywords: generative, in-context
Abstract: Large language models (LLMs) are a basic infrastructure for modern natural language processing. Many commercial and open-source LLMs exist for English, e.g., ChatGPT, Llama, Falcon, and Mistral. As these models are trained on mostly English texts, their fluency and knowledge of low-resource languages and societies are superficial. We present the development of large generative language models for a less-resourced language. GaMS 1B - Generative Model for Slovene with 1 billion parameters was created by continuing pretraining of the existing English OPT model. We developed a new tokenizer adapted to Slovene, Croatian, and English languages and used embedding initialization methods FOCUS and WECHSEL to transfer the embeddings from the English OPT model. We evaluate our models on several classification datasets from the Slovene suite of benchmarks and generative sentence simplification task SENTA. We only used a few-shot in-context learning of our models, which are not yet instruction-tuned. For classification tasks, in this mode, the generative models lag behind the existing Slovene BERT-type models fine-tuned for specific tasks. On a sentence simplification task, the GaMS models achieve comparable or better performance than the GPT-3.5-Turbo model.

Title: Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Authors: Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, Saining Xie
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06940
Pdf URL: https://arxiv.org/pdf/2410.06940
Copy Paste: [[2410.06940]] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think(https://arxiv.org/abs/2410.06940)
Keywords: diffusion, self-supervised, generative
Abstract: Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5$\times$, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.

Title: CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages

Authors: Pretam Ray, Jivnesh Sandhan, Amrith Krishna, Pawan Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06944
Pdf URL: https://arxiv.org/pdf/2410.06944
Copy Paste: [[2410.06944]] CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages(https://arxiv.org/abs/2410.06944)
Keywords: self-supervised
Abstract: Neural dependency parsing has achieved remarkable performance for low resource morphologically rich languages. It has also been well-studied that morphologically rich languages exhibit relatively free word order. This prompts a fundamental investigation: Is there a way to enhance dependency parsing performance, making the model robust to word order variations utilizing the relatively free word order nature of morphologically rich languages? In this work, we examine the robustness of graph-based parsing architectures on 7 relatively free word order languages. We focus on scrutinizing essential modifications such as data augmentation and the removal of position encoding required to adapt these architectures accordingly. To this end, we propose a contrastive self-supervised learning method to make the model robust to word order variations. Furthermore, our proposed modification demonstrates a substantial average gain of 3.03/2.95 points in 7 relatively free word order languages, as measured by the UAS/LAS Score metric when compared to the best performing baseline.

Title: Self-Boosting Large Language Models with Synthetic Preference Data

Authors: Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06961
Pdf URL: https://arxiv.org/pdf/2410.06961
Copy Paste: [[2410.06961]] Self-Boosting Large Language Models with Synthetic Preference Data(https://arxiv.org/abs/2410.06961)
Keywords: generative
Abstract: Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.

Title: Bridge the Points: Graph-based Few-shot Segment Anything Semantically

Authors: Anqi Zhang, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06964
Pdf URL: https://arxiv.org/pdf/2410.06964
Copy Paste: [[2410.06964]] Bridge the Points: Graph-based Few-shot Segment Anything Semantically(https://arxiv.org/abs/2410.06964)
Keywords: foundation model
Abstract: The recent advancements in large-scale pre-training techniques have significantly enhanced the capabilities of vision foundation models, notably the Segment Anything Model (SAM), which can generate precise masks based on point and box prompts. Recent studies extend SAM to Few-shot Semantic Segmentation (FSS), focusing on prompt generation for SAM-based automatic semantic segmentation. However, these methods struggle with selecting suitable prompts, require specific hyperparameter settings for different scenarios, and experience prolonged one-shot inference times due to the overuse of SAM, resulting in low efficiency and limited automation ability. To address these issues, we propose a simple yet effective approach based on graph analysis. In particular, a Positive-Negative Alignment module dynamically selects the point prompts for generating masks, especially uncovering the potential of the background context as the negative reference. Another subsequent Point-Mask Clustering module aligns the granularity of masks and selected points as a directed graph, based on mask coverage over points. These points are then aggregated by decomposing the weakly connected components of the directed graph in an efficient manner, constructing distinct natural clusters. Finally, the positive and overshooting gating, benefiting from graph-based granularity alignment, aggregate high-confident masks and filter out the false-positive masks for final prediction, reducing the usage of additional hyperparameters and redundant mask generation. Extensive experimental analysis across standard FSS, One-shot Part Segmentation, and Cross Domain FSS datasets validate the effectiveness and efficiency of the proposed approach, surpassing state-of-the-art generalist models with a mIoU of 58.7% on COCO-20i and 35.2% on LVIS-92i. The code is available in this https URL.

Title: Structure-Centric Robust Monocular Depth Estimation via Knowledge Distillation

Authors: Runze Chen, Haiyong Luo, Fang Zhao, Jingze Yu, Yupeng Jia, Juan Wang, Xuepeng Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.06982
Pdf URL: https://arxiv.org/pdf/2410.06982
Copy Paste: [[2410.06982]] Structure-Centric Robust Monocular Depth Estimation via Knowledge Distillation(https://arxiv.org/abs/2410.06982)
Keywords: self-supervised
Abstract: Monocular depth estimation, enabled by self-supervised learning, is a key technique for 3D perception in computer vision. However, it faces significant challenges in real-world scenarios, which encompass adverse weather variations, motion blur, as well as scenes with poor lighting conditions at night. Our research reveals that we can divide monocular depth estimation into three sub-problems: depth structure consistency, local texture disambiguation, and semantic-structural correlation. Our approach tackles the non-robustness of existing self-supervised monocular depth estimation models to interference textures by adopting a structure-centered perspective and utilizing the scene structure characteristics demonstrated by semantics and illumination. We devise a novel approach to reduce over-reliance on local textures, enhancing robustness against missing or interfering patterns. Additionally, we incorporate a semantic expert model as the teacher and construct inter-model feature dependencies via learnable isomorphic graphs to enable aggregation of semantic structural knowledge. Our approach achieves state-of-the-art out-of-distribution monocular depth estimation performance across a range of public adverse scenario datasets. It demonstrates notable scalability and compatibility, without necessitating extensive model engineering. This showcases the potential for customizing models for diverse industrial applications.

Title: Jointly Generating Multi-view Consistent PBR Textures using Collaborative Control

Authors: Shimon Vainer, Konstantin Kutsy, Dante De Nigris, Ciara Rowles, Slava Elizarov, Simon Donné
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2410.06985
Pdf URL: https://arxiv.org/pdf/2410.06985
Copy Paste: [[2410.06985]] Jointly Generating Multi-view Consistent PBR Textures using Collaborative Control(https://arxiv.org/abs/2410.06985)
Keywords: diffusion
Abstract: Multi-view consistency remains a challenge for image diffusion models. Even within the Text-to-Texture problem, where perfect geometric correspondences are known a priori, many methods fail to yield aligned predictions across views, necessitating non-trivial fusion methods to incorporate the results onto the original mesh. We explore this issue for a Collaborative Control workflow specifically in PBR Text-to-Texture. Collaborative Control directly models PBR image probability distributions, including normal bump maps; to our knowledge, the only diffusion model to directly output full PBR stacks. We discuss the design decisions involved in making this model multi-view consistent, and demonstrate the effectiveness of our approach in ablation studies, as well as practical applications.

Title: Diffusion Density Estimators

Authors: Akhil Premkumar
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2410.06986
Pdf URL: https://arxiv.org/pdf/2410.06986
Copy Paste: [[2410.06986]] Diffusion Density Estimators(https://arxiv.org/abs/2410.06986)
Keywords: diffusion, generative
Abstract: We investigate the use of diffusion models as neural density estimators. The current approach to this problem involves converting the generative process to a smooth flow, known as the Probability Flow ODE. The log density at a given sample can be obtained by solving the ODE with a black-box solver. We introduce a new, highly parallelizable method that computes log densities without the need to solve a flow. Our approach is based on estimating a path integral by Monte Carlo, in a manner identical to the simulation-free training of diffusion models. We also study how different training parameters affect the accuracy of the density calculation, and offer insights into how these models can be made more scalable and efficient.

Title: Efficient Distribution Matching of Representations via Noise-Injected Deep InfoMax

Authors: Ivan Butakov, Alexander Sememenko, Alexander Tolmachev, Andrey Gladkov, Marina Munkhoeva, Alexey Frolov
Subjects: cs.LG, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2410.06993
Pdf URL: https://arxiv.org/pdf/2410.06993
Copy Paste: [[2410.06993]] Efficient Distribution Matching of Representations via Noise-Injected Deep InfoMax(https://arxiv.org/abs/2410.06993)
Keywords: self-supervised, generative
Abstract: Deep InfoMax (DIM) is a well-established method for self-supervised representation learning (SSRL) based on maximization of the mutual information between the input and the output of a deep neural network encoder. Despite the DIM and contrastive SSRL in general being well-explored, the task of learning representations conforming to a specific distribution (i.e., distribution matching, DM) is still under-addressed. Motivated by the importance of DM to several downstream tasks (including generative modeling, disentanglement, outliers detection and other), we enhance DIM to enable automatic matching of learned representations to a selected prior distribution. To achieve this, we propose injecting an independent noise into the normalized outputs of the encoder, while keeping the same InfoMax training objective. We show that such modification allows for learning uniformly and normally distributed representations, as well as representations of other absolutely continuous distributions. Our approach is tested on various downstream tasks. The results indicate a moderate trade-off between the performance on the downstream tasks and quality of DM.

Title: A Gentle Introduction and Tutorial on Deep Generative Models in Transportation Research

Authors: Seongjin Choi, Zhixiong Jin, Seungwoo Ham, Jiwon Kim, Lijun Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.07066
Pdf URL: https://arxiv.org/pdf/2410.07066
Copy Paste: [[2410.07066]] A Gentle Introduction and Tutorial on Deep Generative Models in Transportation Research(https://arxiv.org/abs/2410.07066)
Keywords: generative
Abstract: Deep Generative Models (DGMs) have rapidly advanced in recent years, becoming essential tools in various fields due to their ability to learn complex data distributions and generate synthetic data. Their importance in transportation research is increasingly recognized, particularly for applications like traffic data generation, prediction, and feature extraction. This paper offers a comprehensive introduction and tutorial on DGMs, with a focus on their applications in transportation. It begins with an overview of generative models, followed by detailed explanations of fundamental models, a systematic review of the literature, and practical tutorial code to aid implementation. The paper also discusses current challenges and opportunities, highlighting how these models can be effectively utilized and further developed in transportation research. This paper serves as a valuable reference, guiding researchers and practitioners from foundational knowledge to advanced applications of DGMs in transportation research.

Title: Retrieval-Augmented Decision Transformer: External Memory for In-context RL

Authors: Thomas Schmied, Fabian Paischer, Vihang Patil, Markus Hofmarcher, Razvan Pascanu, Sepp Hochreiter
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2410.07071
Pdf URL: https://arxiv.org/pdf/2410.07071
Copy Paste: [[2410.07071]] Retrieval-Augmented Decision Transformer: External Memory for In-context RL(https://arxiv.org/abs/2410.07071)
Keywords: in-context
Abstract: In-context learning (ICL) is the ability of a model to learn a new task by observing a few exemplars in its context. While prevalent in NLP, this capability has recently also been observed in Reinforcement Learning (RL) settings. Prior in-context RL methods, however, require entire episodes in the agent's context. Given that complex environments typically lead to long episodes with sparse rewards, these methods are constrained to simple environments with short episodes. To address these challenges, we introduce Retrieval-Augmented Decision Transformer (RA-DT). RA-DT employs an external memory mechanism to store past experiences from which it retrieves only sub-trajectories relevant for the current situation. The retrieval component in RA-DT does not require training and can be entirely domain-agnostic. We evaluate the capabilities of RA-DT on grid-world environments, robotics simulations, and procedurally-generated video games. On grid-worlds, RA-DT outperforms baselines, while using only a fraction of their context length. Furthermore, we illuminate the limitations of current in-context RL methods on complex environments and discuss future directions. To facilitate future research, we release datasets for four of the considered environments.

Title: Let's Ask GNN: Empowering Large Language Model for Graph In-Context Learning

Authors: Zhengyu Hu, Yichuan Li, Zhengyu Chen, Jingang Wang, Han Liu, Kyumin Lee, Kaize Ding
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2410.07074
Pdf URL: https://arxiv.org/pdf/2410.07074
Copy Paste: [[2410.07074]] Let's Ask GNN: Empowering Large Language Model for Graph In-Context Learning(https://arxiv.org/abs/2410.07074)
Keywords: in-context
Abstract: Textual Attributed Graphs (TAGs) are crucial for modeling complex real-world systems, yet leveraging large language models (LLMs) for TAGs presents unique challenges due to the gap between sequential text processing and graph-structured data. We introduce AskGNN, a novel approach that bridges this gap by leveraging In-Context Learning (ICL) to integrate graph data and task-specific information into LLMs. AskGNN employs a Graph Neural Network (GNN)-powered structure-enhanced retriever to select labeled nodes across graphs, incorporating complex graph structures and their supervision signals. Our learning-to-retrieve algorithm optimizes the retriever to select example nodes that maximize LLM performance on graph. Experiments across three tasks and seven LLMs demonstrate AskGNN's superior effectiveness in graph task performance, opening new avenues for applying LLMs to graph-structured data without extensive fine-tuning.

Title: FaceVid-1K: A Large-Scale High-Quality Multiracial Human Face Video Dataset

Authors: Donglin Di, He Feng, Wenzhang Sun, Yongjia Ma, Hao Li, Wei Chen, Xiaofei Gou, Tonghua Su, Xun Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.07151
Pdf URL: https://arxiv.org/pdf/2410.07151
Copy Paste: [[2410.07151]] FaceVid-1K: A Large-Scale High-Quality Multiracial Human Face Video Dataset(https://arxiv.org/abs/2410.07151)
Keywords: generative
Abstract: Generating talking face videos from various conditions has recently become a highly popular research area within generative tasks. However, building a high-quality face video generation model requires a well-performing pre-trained backbone, a key obstacle that universal models fail to adequately address. Most existing works rely on universal video or image generation models and optimize control mechanisms, but they neglect the evident upper bound in video quality due to the limited capabilities of the backbones, which is a result of the lack of high-quality human face video datasets. In this work, we investigate the unsatisfactory results from related studies, gather and trim existing public talking face video datasets, and additionally collect and annotate a large-scale dataset, resulting in a comprehensive, high-quality multiracial face collection named \textbf{FaceVid-1K}. Using this dataset, we craft several effective pre-trained backbone models for face video generation. Specifically, we conduct experiments with several well-established video generation models, including text-to-video, image-to-video, and unconditional video generation, under various settings. We obtain the corresponding performance benchmarks and compared them with those trained on public datasets to demonstrate the superiority of our dataset. These experiments also allow us to investigate empirical strategies for crafting domain-specific video generation tasks with cost-effective settings. We will make our curated dataset, along with the pre-trained talking face video generation models, publicly available as a resource contribution to hopefully advance the research field.

Title: Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis

Authors: Bohan Zeng, Ling Yang, Siyu Li, Jiaming Liu, Zixiang Zhang, Juanxi Tian, Kaixin Zhu, Yongzhen Guo, Fu-Yun Wang, Minkai Xu, Stefano Ermon, Wentao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.07155
Pdf URL: https://arxiv.org/pdf/2410.07155
Copy Paste: [[2410.07155]] Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis(https://arxiv.org/abs/2410.07155)
Keywords: diffusion
Abstract: Recent advances in diffusion models have demonstrated exceptional capabilities in image and video generation, further improving the effectiveness of 4D synthesis. Existing 4D generation methods can generate high-quality 4D objects or scenes based on user-friendly conditions, benefiting the gaming and video industries. However, these methods struggle to synthesize significant object deformation of complex 4D transitions and interactions within scenes. To address this challenge, we propose Trans4D, a novel text-to-4D synthesis framework that enables realistic complex scene transitions. Specifically, we first use multi-modal large language models (MLLMs) to produce a physic-aware scene description for 4D scene initialization and effective transition timing planning. Then we propose a geometry-aware 4D transition network to realize a complex scene-level 4D transition based on the plan, which involves expressive geometrical object deformation. Extensive experiments demonstrate that Trans4D consistently outperforms existing state-of-the-art methods in generating 4D scenes with accurate and high-quality transitions, validating its effectiveness. Code: this https URL

Title: AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation

Authors: Yukang Cao, Liang Pan, Kai Han, Kwan-Yee K. Wong, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.07164
Pdf URL: https://arxiv.org/pdf/2410.07164
Copy Paste: [[2410.07164]] AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation(https://arxiv.org/abs/2410.07164)
Keywords: diffusion
Abstract: Recent advancements in diffusion models have led to significant improvements in the generation and animation of 4D full-body human-object interactions (HOI). Nevertheless, existing methods primarily focus on SMPL-based motion generation, which is limited by the scarcity of realistic large-scale interaction data. This constraint affects their ability to create everyday HOI scenes. This paper addresses this challenge using a zero-shot approach with a pre-trained diffusion model. Despite this potential, achieving our goals is difficult due to the diffusion model's lack of understanding of ''where'' and ''how'' objects interact with the human body. To tackle these issues, we introduce AvatarGO, a novel framework designed to generate animatable 4D HOI scenes directly from textual inputs. Specifically, 1) for the ''where'' challenge, we propose LLM-guided contact retargeting, which employs Lang-SAM to identify the contact body part from text prompts, ensuring precise representation of human-object spatial relations. 2) For the ''how'' challenge, we introduce correspondence-aware motion optimization that constructs motion fields for both human and object models using the linear blend skinning function from SMPL-X. Our framework not only generates coherent compositional motions, but also exhibits greater robustness in handling penetration issues. Extensive experiments with existing methods validate AvatarGO's superior generation and animation capabilities on a variety of human-object pairs and diverse poses. As the first attempt to synthesize 4D avatars with object interactions, we hope AvatarGO could open new doors for human-centric 4D content creation.

Title: Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

Authors: Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2410.07167
Pdf URL: https://arxiv.org/pdf/2410.07167
Copy Paste: [[2410.07167]] Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate(https://arxiv.org/abs/2410.07167)
Keywords: in-context
Abstract: We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs). Large-scale pre-training plays a critical role in building capable LVLMs, while evaluating its training quality without the costly supervised fine-tuning stage is under-explored. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), while we observed that these metrics are less indicative when aligning a well-trained LLM with a new modality. Due to the lack of proper metrics, the research of LVLMs in the critical pre-training stage is hindered greatly, including the training data choice, efficient module design, etc. In this paper, we propose evaluating the pre-training quality from the inter-modal distribution distance perspective and present MIR, the Modality Integration Rate, which is 1) \textbf{Effective} to represent the pre-training quality and show a positive relation with the benchmark performance after supervised fine-tuning. 2) \textbf{Robust} toward different training/evaluation data. 3) \textbf{Generalize} across training configurations and architecture choices. We conduct a series of pre-training experiments to explore the effectiveness of MIR and observe satisfactory results that MIR is indicative about training data selection, training strategy schedule, and model architecture design to get better pre-training results. We hope MIR could be a helpful metric for building capable LVLMs and inspire the following research about modality alignment in different areas. Our code is at: this https URL.

Title: Sylber: Syllabic Embedding Representation of Speech from Raw Audio

Authors: Cheol Jun Cho, Nicholas Lee, Akshat Gupta, Dhruv Agarwal, Ethan Chen, Alan W Black, Gopala K. Anumanchipalli
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.07168
Pdf URL: https://arxiv.org/pdf/2410.07168
Copy Paste: [[2410.07168]] Sylber: Syllabic Embedding Representation of Speech from Raw Audio(https://arxiv.org/abs/2410.07168)
Keywords: self-supervised, generative
Abstract: Syllables are compositional units of spoken language that play a crucial role in human speech perception and production. However, current neural speech representations lack structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model which is an exponential moving average of the model in training. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding. We also train token-to-speech generative models with our syllabic units and show that fully intelligible speech can be reconstructed from these tokens. Lastly, we observe that categorical perception, a linguistic phenomenon of speech perception, emerges naturally in our model, making the embedding space more categorical and sparse than previous self-supervised learning approaches. Together, we present a novel self-supervised approach for representing speech as syllables, with significant potential for efficient speech tokenization and spoken language modeling.

Title: One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

Authors: Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Peter Deisenroth, Sepp Hochreiter
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2410.07170
Pdf URL: https://arxiv.org/pdf/2410.07170
Copy Paste: [[2410.07170]] One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation(https://arxiv.org/abs/2410.07170)
Keywords: foundation model
Abstract: Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned on a downstream task for a specific application. The most successful and most commonly used fine-tuning method is to update the pre-trained weights via a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are usually initialized at random with a uniform rank distribution across model weights. Recent works focus on weight-driven initialization or learning of adaptive ranks during training. Both approaches have only been investigated in isolation, resulting in slow convergence or a uniform rank distribution, in turn leading to sub-optimal performance. We propose to enhance LoRA by initializing the new weights in a data-driven manner by computing singular value decomposition on minibatches of activation vectors. Then, we initialize the LoRA matrices with the obtained right-singular vectors and re-distribute ranks among all weight matrices to explain the maximal amount of variance and continue the standard LoRA fine-tuning procedure. This results in our new method Explained Variance Adaptation (EVA). We apply EVA to a variety of fine-tuning tasks ranging from language generation and understanding to image classification and reinforcement learning. EVA exhibits faster convergence than competitors and attains the highest average score across a multitude of tasks per domain.

Title: IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

Authors: Xinchen Zhang, Ling Yang, Guohao Li, Yaqi Cai, Jiake Xie, Yong Tang, Yujiu Yang, Mengdi Wang, Bin Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2410.07171
Pdf URL: https://arxiv.org/pdf/2410.07171
Copy Paste: [[2410.07171]] IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation(https://arxiv.org/abs/2410.07171)
Keywords: diffusion
Abstract: Advanced diffusion models like RPG, Stable Diffusion 3 and FLUX have made notable strides in compositional text-to-image generation. However, these methods typically exhibit distinct strengths for compositional generation, with some excelling in handling attribute binding and others in spatial relationships. This disparity highlights the need for an approach that can leverage the complementary strengths of various models to comprehensively improve the composition capability. To this end, we introduce IterComp, a novel framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation. Specifically, we curate a gallery of six powerful open-source diffusion models and evaluate their three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. Based on these metrics, we develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models. Then, we propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and reward models over multiple iterations. Theoretical proof demonstrates the effectiveness and extensive experiments show our significant superiority over previous SOTA methods (e.g., Omost and FLUX), particularly in multi-category object composition and complex semantic alignment. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation. Code: this https URL

Title: MM-Ego: Towards Building Egocentric Multimodal LLMs

Authors: Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, Yinfei Yang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.07177
Pdf URL: https://arxiv.org/pdf/2410.07177
Copy Paste: [[2410.07177]] MM-Ego: Towards Building Egocentric Multimodal LLMs(https://arxiv.org/abs/2410.07177)
Keywords: foundation model
Abstract: This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. This is currently the largest egocentric QA dataset. Second, we contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths. We introduce a new de-biasing evaluation method to help mitigate the unavoidable language bias present in the models being evaluated. Third, we propose a specialized multimodal architecture featuring a novel "Memory Pointer Prompting" mechanism. This design includes a global glimpse step to gain an overarching understanding of the entire video and identify key visual information, followed by a fallback step that utilizes the key visual information to generate responses. This enables the model to more effectively comprehend extended video content. With the data, benchmark, and model, we successfully build MM-Ego, an egocentric multimodal LLM that shows powerful performance on egocentric video understanding.