TL;DR: This article surveys state-of-the-art techniques for training DL models under three challenges: small datasets, imbalanced datasets, and lack of generalization.
Abstract: Data scarcity is a major challenge when training deep learning (DL) models. DL demands a large amount of data to achieve exceptional performance. Unfortunately, many applications have small or inadequate data to train DL frameworks. Usually, manual labeling is needed to provide labeled data, which typically involves human annotators with a vast background of knowledge. This annotation process is costly, time-consuming, and error-prone. Usually, every DL framework is fed by a significant amount of labeled data to automatically learn representations. Ultimately, a larger amount of data would generate a better DL model, although its performance is also application dependent. This issue is the main barrier that leads many applications to dismiss the use of DL. Having sufficient data is the first step toward any successful and trustworthy DL application. This paper presents a holistic survey on state-of-the-art techniques to deal with training DL models to overcome three challenges: small datasets, imbalanced datasets, and lack of generalization. This survey starts by listing the learning techniques. Next, the types of DL architectures are introduced. After that, state-of-the-art solutions to address the issue of lack of training data are listed, such as Transfer Learning (TL), Self-Supervised Learning (SSL), Generative Adversarial Networks (GANs), Model Architecture (MA), Physics-Informed Neural Network (PINN), and Deep Synthetic Minority Oversampling Technique (DeepSMOTE). These solutions are followed by some related tips about data acquisition prior to training, as well as recommendations for ensuring the trustworthiness of the training dataset. The survey ends with a list of applications that suffer from data scarcity; for each application, several alternatives are proposed to generate more data. The applications include Electromagnetic Imaging (EMI), Civil Structural Health Monitoring, Medical Imaging, Meteorology, Wireless Communications, Fluid Mechanics, Microelectromechanical Systems, and Cybersecurity. To the best of the authors’ knowledge, this is the first review that offers a comprehensive overview of strategies to tackle data scarcity in DL.
Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, С. В. Захаров, Carl Vondrick
1 Oct 2023
TL;DR: Zero-1-to-3 is a framework for changing the camera viewpoint of an object from a single image, leveraging geometric priors learned by large-scale diffusion models.
Abstract: We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this under-constrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our conditional diffusion model uses a synthetic dataset to learn controls of the relative camera viewpoint, which allow new images to be generated of the same object under a specified camera transformation. Even though it is trained on a synthetic dataset, our model retains a strong zero-shot generalization ability to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. Our viewpoint-conditioned diffusion approach can further be used for the task of 3D reconstruction from a single image. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models by leveraging Internet-scale pre-training.
Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman A. Khan, Fahad Shahbaz Khan
1 Jun 2023
TL;DR: MaPLe improves alignment between vision and language representations by learning multi-modal prompts across both vision and language branches, promoting strong coupling and discouraging learning independent uni-modal solutions.
Abstract: Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model the stage-wise feature relationships to allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks of generalization to novel classes, new target datasets and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes and 2.72% on overall harmonic-mean, averaged over 11 diverse image recognition datasets. Our code and pre-trained models are available at https://github.com/muzairkhattak/multimodal-prompt-learning.
Niklas Muennighoff, Thomas J. Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel
1 Jan 2023
TL;DR: Multitask finetuning of large multilingual language models generalizes well to unseen tasks and languages, improving performance on both English and non-English tasks.
Abstract: Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are freely available at https://github.com/bigscience-workshop/xmtf.
Feng Li, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yue Zhao, Hang Zhang, Peizhao Zhang, P. Vajda, Diana Marculescu
1 Jun 2023
TL;DR: Open-vocabulary semantic segmentation with Mask-adapted CLIP achieves state-of-the-art performance by finetuning CLIP on a collection of masked image regions and their corresponding text descriptions.
Abstract: Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the “blank” areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of supervised specialist models in 2017 without dataset specific adaptations.
TL;DR: This paper proposes a health indicator (HI), derived from spectral correlation, Wasserstein distance, and linear rectification, that reflects the changes in the probability distribution of all cyclic power-spectra over time.
Abstract: The prognosis of bearings is vital for condition-based maintenance of rotating machinery. This article proposes a systematic prognostic scheme for rolling element bearings. The proposed scheme infers the degradation progression by developing a novel health indicator (HI). This novel HI, derived from the spectral correlation, Wasserstein distance, and linear rectification, can reflect the changes in the probability distribution of all cyclic power-spectra over time. In other words, any form of variation in modulation characteristics can be revealed through the proposed novel indicator, even for the weak information buried by the internal or external noise. Furthermore, the developed HI can eliminate random fluctuations that often impair the remaining useful life (RUL) prediction accuracy. Then, a $3\sigma$ criterion-based technique is introduced to divide health stages. After that, the gated recurrent unit network is employed to predict the RUL of the bearing system, integrated with the Bayesian optimization algorithm to tune the optimal hyperparameters adaptively. This renders the establishment of an intelligent prognosis model with high prediction accuracy and generalization ability. Finally, experimental validations are conducted using the run-to-failure datasets of bearings. The obtained results demonstrate that the proposed HI has better monotonicity, and the proposed prognostic scheme can predict the RUL with high accuracy.
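The health-stage division step can be illustrated with a small sketch. The abstract does not give the exact rule, so the following assumes one common reading of a 3σ criterion: estimate the mean and standard deviation of the HI over an initial healthy window, then mark the start of degradation at the first point where the HI stays above mean + 3σ for several consecutive samples. The function name, the `baseline_len` and `consecutive` parameters, and the toy HI values are all hypothetical.

```python
from statistics import mean, stdev

def first_degradation_index(hi, baseline_len, consecutive=3):
    """Return the index where the HI first stays above the baseline
    mean + 3*sigma for `consecutive` samples in a row, or None."""
    baseline = hi[:baseline_len]
    upper = mean(baseline) + 3 * stdev(baseline)  # assumed 3-sigma threshold
    run = 0
    for i in range(baseline_len, len(hi)):
        # Count consecutive exceedances; a single outlier resets to zero.
        run = run + 1 if hi[i] > upper else 0
        if run == consecutive:
            return i - consecutive + 1
    return None

# Toy HI series (hypothetical): 8 healthy samples, one isolated spike,
# then a sustained rise.
hi = [1.00, 1.02, 0.98, 1.01, 0.99, 1.00, 1.03, 0.97,
      1.50, 1.02, 1.20, 1.35, 1.60, 1.90]
start = first_degradation_index(hi, baseline_len=8)
print(start)
```

Requiring several consecutive exceedances is a design choice in this sketch, so that an isolated spike does not trigger the stage change; it also assumes the HI increases as the bearing degrades.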
TL;DR: LLaMA-Adapter prepends a set of learnable adaption prompts to the word tokens at higher transformer layers and uses a zero-initialized attention mechanism with zero gating, which adaptively injects new instructional cues into LLaMA while preserving its pre-trained knowledge.
Abstract: We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts, and prepend them to the word tokens at higher transformer layers. Then, a zero-initialized attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA while effectively preserving its pre-trained knowledge. With our efficient training, LLaMA-Adapter can generate high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Besides language commands, our approach can be simply extended to multi-modal instructions to learn an image-conditioned LLaMA model, which achieves superior reasoning performance on ScienceQA and COCO Caption benchmarks. Furthermore, we also evaluate the zero-initialized attention mechanism for fine-tuning other pre-trained models (ViT, RoBERTa) on traditional vision and language tasks, demonstrating the superior generalization capacity of our approach. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.
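The idea of zero gating can be sketched for a single attention query. The abstract does not spell out the mechanism, so this is only an illustration under stated assumptions: attention weights over the prepended prompts and over the word tokens are normalized separately, and the prompt weights are multiplied by a scalar gate that starts at zero. The function names and toy numbers are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_attention(prompt_scores, word_scores, prompt_values, word_values, gate):
    """Attend over word tokens as usual and add a gated contribution from
    the adaption prompts. With gate == 0 the prompts have no effect."""
    w_word = softmax(word_scores)                            # ordinary attention
    w_prompt = [gate * w for w in softmax(prompt_scores)]    # assumed gating form
    dim = len(word_values[0])
    out = [0.0] * dim
    for w, v in zip(w_prompt + w_word, prompt_values + word_values):
        for d in range(dim):
            out[d] += w * v[d]
    return out

word_scores = [0.0, 0.0]
word_values = [[1.0, 0.0], [0.0, 1.0]]
prompt_scores = [0.0]
prompt_values = [[2.0, 2.0]]

at_init = gated_attention(prompt_scores, word_scores, prompt_values, word_values, gate=0.0)
after_training = gated_attention(prompt_scores, word_scores, prompt_values, word_values, gate=0.5)
print(at_init, after_training)
```

With the gate at zero the output equals plain attention over the word tokens, which is one way to see why the pre-trained behavior is preserved at the start of fine-tuning; a nonzero gate then lets the prompt values contribute.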
TL;DR: This article proposes a memristor-based multimode generalization and differentiation circuit for the Pavlov associative memory, mainly composed of voltage control modules, synaptic neuron modules, and inhibition modules.
Abstract: Most of the classical conditioning laws implemented by existing circuits involve learning and forgetting between only three neurons, and the problems between multiple neurons are not considered. In this article, a multimode generalization and differentiation circuit for the Pavlov associative memory is proposed based on memristors. The designed circuit is mainly composed of voltage control modules, synaptic neuron modules, and inhibition modules. The secondary differentiation is accomplished through the process of associative learning and forgetting among multiple neurons. The process of multiple generalization and differentiation is realized based on the nonvolatility and thresholding properties of memristors. The extinction inhibition and differentiation inhibition in forgetting are considered through the inhibition modules. The Pavlov associative memory neural network with multimodal generalization and differentiation may provide a reference for the further development of brain-like intelligence.
Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He
14 Sep 2023
TL;DR: TALLRec is an effective and efficient tuning framework to align large language models with recommendation tasks. It significantly enhances the recommendation capabilities of LLMs and exhibits robust cross-domain generalization.
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across diverse domains, thereby prompting researchers to explore their potential for use in recommendation systems. Initial attempts have leveraged the exceptional capabilities of LLMs, such as rich knowledge and strong generalization through In-context Learning, which involves phrasing the recommendation task as prompts. Nevertheless, the performance of LLMs in recommendation tasks remains suboptimal due to a substantial disparity between the training tasks for LLMs and recommendation tasks, as well as inadequate recommendation data during pre-training. To bridge the gap, we consider building a Large Recommendation Language Model by tuning LLMs with recommendation data. To this end, we propose an efficient and effective Tuning framework for Aligning LLMs with Recommendations, namely TALLRec. We have demonstrated that the proposed TALLRec framework can significantly enhance the recommendation capabilities of LLMs in the movie and book domains, even with a limited dataset of fewer than 100 samples. Additionally, the proposed framework is highly efficient and can be executed on a single RTX 3090 with LLaMA-7B. Furthermore, the fine-tuned LLM exhibits robust cross-domain generalization. Our code and data are available at https://github.com/SAI990323/TALLRec.
TL;DR: An attention-based deep meta-transfer learning (ADMTL) method is proposed for the few-shot fine-grained fault diagnosis (FSFGFD) problem, which aims to identify novel fine-grained faults under different working conditions using only a few samples from each class.
Abstract: Deep learning-based fault diagnosis methods have made tremendous progress in recent years; however, most of these methods are coarse-grained and data-demanding, and cannot find the root causes of mechanical system failures at a finer granularity with limited fault data. Therefore, in this study, we first investigate the few-shot fine-grained fault diagnosis (FSFGFD) problem, with the aim of identifying novel fine-grained faults under different working conditions using only a few samples from each class. To address the difficulties of fine-grained fault feature extraction and poor model generalization to unseen few-shot faults in FSFGFD tasks, a novel attention-based deep meta-transfer learning (ADMTL) method is proposed. First, the failure modes under different working conditions are considered as fine-grained faults, and their raw signals are transformed into time–frequency images. Based on this, an attention mechanism is introduced to guide the feature extractor of the ADMTL on what information to learn. The ADMTL then follows a three-stage learning process of pre-training, meta-transfer, and meta-adaptation to achieve fast adaptation to new fine-grained faults using a priori knowledge gained from known faults. Furthermore, a parameter modulation strategy is employed to adaptively update the pre-trained network during the meta-transfer process. The comprehensive experimental results of three case studies demonstrate the superiority of our method over state-of-the-art methods. The proposed method achieves excellent performance with an average accuracy of 99.08%, 95.86%, and 77.74% for FSFGFD tasks when performing meta-transfer within the same machine and between different machines, respectively.
TL;DR: ZoeD-M12-NK combines relative and metric depth estimation, leading to a model with excellent generalization performance while maintaining metric scale; it uses a lightweight head with a novel bin adjustment design, called the metric bins module, for each domain.
Abstract: This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance disregarding metric scale, i.e. relative depth estimation, or state-of-the-art results on specific datasets, i.e. metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we can further improve SOTA for a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance to eight unseen datasets from both indoor and outdoor domains. The code and pre-trained models are publicly available at https://github.com/isl-org/ZoeDepth .
TL;DR: Federated learning with Adaptive Local Aggregation (FedALA) is proposed; its Adaptive Local Aggregation (ALA) module adaptively aggregates the downloaded global model and the local model towards the local objective on each client.
Abstract: A key challenge in federated learning (FL) is the statistical heterogeneity that impairs the generalization of the global model on each client. To address this, we propose a method Federated learning with Adaptive Local Aggregation (FedALA) by capturing the desired information in the global model for client models in personalized FL. The key component of FedALA is an Adaptive Local Aggregation (ALA) module, which can adaptively aggregate the downloaded global model and local model towards the local objective on each client to initialize the local model before training in each iteration. To evaluate the effectiveness of FedALA, we conduct extensive experiments with five benchmark datasets in computer vision and natural language processing domains. FedALA outperforms eleven state-of-the-art baselines by up to 3.27% in test accuracy. Furthermore, we also apply ALA module to other federated learning methods and achieve up to 24.19% improvement in test accuracy. Code is available at https://github.com/TsingZ0/FedALA.
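A minimal sketch of what adaptively aggregating the global and local models can look like. It assumes an element-wise blend in which each parameter has its own weight in [0, 1]; the abstract does not give the exact update rule, and in FedALA the weights would be learned against the local objective, whereas here they are fixed by hand. All names and values are hypothetical.

```python
def adaptive_local_aggregation(local, global_, weights):
    """Element-wise blend used to initialize the local model:
    weight 0 keeps the local value, weight 1 takes the global value."""
    if not (len(local) == len(global_) == len(weights)):
        raise ValueError("parameter and weight lists must have equal length")
    # Clip weights to [0, 1] so the result stays between the two models.
    clipped = [min(1.0, max(0.0, w)) for w in weights]
    return [l + (g - l) * w for l, g, w in zip(local, global_, clipped)]

local_params = [1.0, 2.0, 3.0]
global_params = [3.0, 2.0, -1.0]
init = adaptive_local_aggregation(local_params, global_params, [0.0, 0.5, 1.0])
print(init)  # [1.0, 2.0, -1.0]
```

Setting every weight to 1 recovers the usual practice of overwriting the local model with the downloaded global model, so the per-element weights are what make the initialization client-specific.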
TL;DR: DeepOPF trains a DNN to predict a set of independent operating variables and reconstructs the remaining ones from the power flow equations, which reduces the number of variables to be predicted, and employs a penalty approach with a zero-order gradient estimation technique during training toward guaranteeing the inequality constraints.
Abstract: To cope with increasing uncertainty from renewable generation and flexible load, grid operators need to solve alternating current optimal power flow (AC-OPF) problems more frequently for efficient and reliable operation. In this article, we develop a deep neural network (DNN) approach, called DeepOPF, for solving AC-OPF problems in a fraction of the time used by conventional iterative solvers. A key difficulty in applying machine learning techniques to AC-OPF problems lies in ensuring that the obtained solutions respect the equality and inequality physical and operational constraints. Generalizing the prediction-and-reconstruction procedure from our previous studies, DeepOPF first trains a DNN model to predict a set of independent operating variables and then directly computes the remaining ones by solving the power flow equations. Such an approach not only preserves the power-flow balance equality constraints but also reduces the number of variables to be predicted by the DNN, cutting down the number of neurons and training data needed. DeepOPF then employs a penalty approach with a zero-order gradient estimation technique in the training process toward guaranteeing the inequality constraints. We also derive a condition for tuning the DNN size according to the desired approximation accuracy, which measures its generalization capability. It provides theoretical justification for using DNNs to solve AC-OPF problems. Simulation results for IEEE 30/118/300-bus and synthetic 2000-bus test cases demonstrate the effectiveness of the penalty approach. They also show that DeepOPF speeds up the computing time by up to two orders of magnitude compared to a state-of-the-art iterative solver, at the expense of a $<0.2\%$ cost difference.
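The combination of a penalty term and a zero-order gradient estimate can be sketched on a toy problem. The abstract does not give DeepOPF's penalty form or estimator, so the following assumes a quadratic penalty on box-constraint violations and a coordinate-wise central-difference estimate, which uses function values only. The function names, the bounds, and the step size `delta` are hypothetical.

```python
def penalty(x, lower, upper):
    """Quadratic penalty for violating the box constraints lower <= x <= upper."""
    total = 0.0
    for xi, lo, hi in zip(x, lower, upper):
        total += max(0.0, lo - xi) ** 2 + max(0.0, xi - hi) ** 2
    return total

def zero_order_gradient(f, x, delta=1e-4):
    """Estimate the gradient of f at x from function evaluations only,
    using central differences along each coordinate."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += delta
        minus[i] -= delta
        grad.append((f(plus) - f(minus)) / (2 * delta))
    return grad

lower, upper = [0.0, 0.0], [1.0, 1.0]
x = [1.5, 0.5]  # first component violates its upper bound by 0.5
grad = zero_order_gradient(lambda v: penalty(v, lower, upper), x)
print(penalty(x, lower, upper), grad)
```

A zero-order estimate is useful when the penalized quantities come out of a procedure that is awkward to differentiate through, such as a numerical solve; a gradient step along `-grad` here would push the violating component back toward its bound and leave the feasible one alone.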
TL;DR: Wang et al. proposed a Slide-Patch and Whole-Face Attention model with SE blocks (SPWFA-SE), which jointly perceives the discriminative local characteristics and informative global features of the face for effective facial expression recognition.
Abstract: Learning discriminative features is of vital importance for automatic facial expression recognition (FER) in the wild. In this article, we propose a novel Slide-Patch and Whole-Face Attention model with SE blocks (SPWFA-SE), which jointly perceives the discriminative locality characteristics and informative global features of the face for effective FER. Specifically, the well-designed slide patches are proposed to extract local features. Different from the existing methods, our slide patches not only can maintain the information at the edge area of patches, but also do not need to detect facial landmarks. Moreover, to make the model adaptively focus on the distinguishable regions, an attention module is proposed in the patch level to learn the weight of each patch. Furthermore, squeeze-and-excitation blocks are explored in the channel level to learn the weight of each channel. As such, the proposed multi-level feature extraction and attention mechanisms can enhance the representative ability of the learned features. Extensive experiments on five challenging datasets demonstrate that our method can achieve state-of-the-art performance. Cross database experiments on another three databases show the superior generalization performance of our model. Furthermore, complexity analysis results show that our model contains fewer parameters with fast training advantages than other competing models.
TL;DR: Prompt-aligned Gradient (ProGrad) prevents prompt tuning from forgetting general knowledge learned from VLMs. ProGrad updates the prompt whose gradient is aligned to the general knowledge.
Abstract: Thanks to the large pre-trained vision-language models (VLMs) like CLIP [37], we can craft a zero-shot classifier by discrete prompt design, e.g., the confidence score of an image being "[CLASS]" can be obtained by using the VLM provided similarity between the image and the prompt sentence "a photo of a [CLASS]". Furthermore, prompting shows great potential for fast adaptation of VLMs to downstream tasks if we fine-tune the soft prompts with few samples. However, we find a common failure that improper fine-tuning or learning with extremely few-shot samples may even under-perform the zero-shot prediction. Existing methods still address this problem by using traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled solution specific to prompting. In this paper, we present Prompt-aligned Gradient, dubbed ProGrad to prevent prompt tuning from forgetting the general knowledge learned from VLMs. In particular, ProGrad only updates the prompt whose gradient is aligned (or non-conflicting) to the general knowledge, which is represented as the optimization direction offered by the pre-defined prompt predictions. Extensive experiments under the few-shot learning, domain generalization, base-to-new generalization and cross-dataset transfer settings demonstrate the stronger few-shot generalization ability of ProGrad over state-of-the-art prompt tuning methods.
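The "aligned (or non-conflicting)" update can be illustrated with a small gradient-projection sketch. The abstract does not state the exact rule, so this assumes one natural reading: keep the task gradient when its inner product with the general-knowledge direction is non-negative, and otherwise remove the component that points against that direction. Function names and vectors are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def aligned_gradient(task_grad, general_grad):
    """Return task_grad unchanged when it does not conflict with general_grad;
    otherwise remove its component along general_grad (assumed projection rule)."""
    inner = dot(task_grad, general_grad)
    norm_sq = dot(general_grad, general_grad)
    if inner >= 0 or norm_sq == 0:
        return list(task_grad)
    scale = inner / norm_sq
    return [t - scale * g for t, g in zip(task_grad, general_grad)]

general = [1.0, 0.0]
print(aligned_gradient([1.0, 1.0], general))   # no conflict, unchanged
print(aligned_gradient([-1.0, 1.0], general))  # conflicting part removed
```

After projection the update is orthogonal to the general direction, so to first order it no longer moves the prompt against the general knowledge while still following the part of the few-shot signal that does not conflict.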
TL;DR: This paper presents an overview of existing DTL- and DDA-based video surveillance systems to shed light on their benefits, discuss their challenges, and highlight their future perspectives.
TL;DR: This paper introduces the concept of $(a, b)$-fuzzy soft set, shortened as $(a, b)$-FSS, which is a generalization of the orthopair fuzzy soft set.
Abstract: Many models of uncertain knowledge have been designed that combine expanded views of fuzziness (expressions of partial memberships) with parameterization (multiple subsethood indexed by a parameter set). The standard orthopair fuzzy soft set is a very general example of this successful blend initiated by fuzzy soft sets. It is a mapping from a set of parameters to the family of all orthopair fuzzy sets (which allow for a very general view of acceptable membership and non-membership evaluations). To expand the scope of application of fuzzy soft set theory, the restriction of orthopair fuzzy sets that membership and non-membership must be calibrated with the same power should be removed. To this purpose we introduce the concept of $(a, b)$-fuzzy soft set, shortened as $(a, b)$-FSS. They enable us to address situations that impose evaluations with different importances for membership and non-membership degrees, a problem that cannot be modeled by the existing generalizations of intuitionistic fuzzy soft sets. We establish the fundamental set of arithmetic operations for $(a, b)$-FSSs and explore their main characteristics. Then we define aggregation operators for $(a, b)$-FSSs and discuss their main properties and the relationships between them. Finally, with the help of suitably defined scores and accuracies we design a multi-criteria decision-making strategy that operates in this novel framework. We also analyze a decision-making problem to endorse the validity of $(a, b)$-FSSs for decision-making purposes.
TL;DR: Encoder-based domain tuning accelerates text-to-image personalization by underfitting on a large set of concepts from a given domain.
Abstract: Text-to-image personalization aims to teach a pre-trained diffusion model to reason about novel, user-provided concepts, embedding them into new scenes guided by natural language prompts. However, current personalization approaches struggle with lengthy training times, high storage requirements or loss of identity. To overcome these limitations, we propose an encoder-based domain-tuning approach. Our key insight is that by underfitting on a large set of concepts from a given domain, we can improve generalization and create a model that is more amenable to quickly adding novel concepts from the same domain. Specifically, we employ two components: First, an encoder that takes as an input a single image of a target concept from a given domain, e.g. a specific face, and learns to map it into a word-embedding representing the concept. Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts. Together, these components are used to guide the learning of unseen concepts, allowing us to personalize a model using only a single image and as few as 5 training steps, accelerating personalization from dozens of minutes to seconds while preserving quality. Code and trained encoders will be available at our project page.
TL;DR: This survey article summarizes the research direction of dataset quality in machine learning, including the definition of related concepts, analysis of quality issues and risks, and a review of data quality dimensions and metrics throughout the dataset lifecycle.
Abstract: With the rise of big data, the quality of datasets has become a crucial factor affecting the performance of machine learning models. High-quality datasets are essential for the realization of data value. This survey article summarizes the research direction of dataset quality in machine learning, including the definition of related concepts, analysis of quality issues and risks, and a review of dataset quality dimensions and metrics, analyzed from a dataset lifecycle perspective and summarized from the literature. Furthermore, this article introduces a comprehensive quality evaluation process, which includes a framework for dataset quality evaluation with dimensions and metrics, computation methods for quality metrics, and assessment models. These studies provide valuable guidance for evaluating dataset quality in the field of machine learning, which can help improve the accuracy, efficiency, and generalization ability of machine learning models, and promote the development and application of artificial intelligence technology.
TL;DR: Explainable Artificial Intelligence (XAI) is an emerging research field bringing transparency to highly complex and opaque machine learning (ML) models; however, XAI tools are seldom used beyond visualization purposes.
TL;DR: The existing paradigm for fake image detection fails to generalize across generative models. A new approach is proposed that performs real-vs-fake classification without learning, using a feature space not explicitly trained to distinguish real from fake images.
Abstract: With generative models proliferating at a rapid rate, there is a growing need for general purpose fake image detectors. In this work, we first show that the existing paradigm, which consists of training a deep network for real-vs-fake classification, fails to detect fake images from newer breeds of generative models when trained to detect GAN fake images. Upon analysis, we find that the resulting classifier is asymmetrically tuned to detect patterns that make an image fake. The real class becomes a ‘sink’ class holding anything that is not fake, including generated images from models not accessible during training. Building upon this discovery, we propose to perform real-vs-fake classification without learning; i.e., using a feature space not explicitly trained to distinguish real from fake images. We use nearest neighbor and linear probing as instantiations of this idea. When given access to the feature space of a large pretrained vision-language model, the very simple baseline of nearest neighbor classification has surprisingly good generalization ability in detecting fake images from a wide variety of generative models; e.g., it improves upon the SoTA [50] by +15.07 mAP and +25.90% acc when tested on unseen diffusion and autoregressive models. Our code, models, and data can be found at https://github.com/Yuheng-Li/UniversalFakeDetect
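Nearest-neighbor classification in a fixed feature space can be sketched as follows. The three-dimensional vectors are toy stand-ins for features that would come from a frozen pretrained encoder; the bank contents, the function names, and the use of cosine similarity are assumptions for illustration, not details confirmed by the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor_label(query, bank):
    """Return the label of the bank entry most similar to the query.
    `bank` is a list of (feature_vector, label) pairs; nothing is trained."""
    best_features, best_label = max(bank, key=lambda item: cosine(query, item[0]))
    return best_label

# Toy feature bank (hypothetical): labeled real and fake examples.
bank = [
    ([1.0, 0.1, 0.0], "real"),
    ([0.9, 0.2, 0.1], "real"),
    ([0.1, 1.0, 0.3], "fake"),
    ([0.0, 0.8, 0.5], "fake"),
]
print(nearest_neighbor_label([0.95, 0.15, 0.05], bank))
print(nearest_neighbor_label([0.05, 0.90, 0.40], bank))
```

Because the feature space is never tuned on the real-vs-fake task, the decision depends only on which stored examples a query resembles, rather than on a learned boundary fitted to one family of generators.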
TL;DR: In this paper, the authors propose a unified framework that is applicable to general design problems in wireless networks, which includes graph modeling, neural architecture design, and theory-guided performance enhancement.
Abstract: Deep learning-based approaches have been developed to solve challenging problems in wireless communications, leading to promising results. Early attempts adopted neural network architectures inherited from applications such as computer vision. They often yield poor performance in large scale networks (i.e., poor scalability) and unseen network settings (i.e., poor generalization). To resolve these issues, graph neural networks (GNNs) have been recently adopted, as they can effectively exploit the domain knowledge, i.e., the graph topology in wireless communications problems. GNN-based methods can achieve near-optimal performance in large-scale networks and generalize well under different system settings, but the theoretical underpinnings and design guidelines remain elusive, which may hinder their practical implementations. This paper endeavors to fill both the theoretical and practical gaps. For theoretical guarantees, we prove that GNNs achieve near-optimal performance in wireless networks with much fewer training samples than traditional neural architectures. Specifically, to solve an optimization problem on an $n$-node graph (where the nodes may represent users, base stations, or antennas), GNNs’ generalization error and required number of training samples are $\mathcal{O}(n)$ and $\mathcal{O}(n^{2})$ times lower than the unstructured multi-layer perceptrons. For design guidelines, we propose a unified framework that is applicable to general design problems in wireless networks, which includes graph modeling, neural architecture design, and theory-guided performance enhancement. Extensive simulations, which cover a variety of important problems and network settings, verify our theory and the effectiveness of the proposed design framework.
Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman A. Khan, Fahad Shahbaz Khan
1 Jun 2023
TL;DR: Fine-tuned CLIP models are efficient video learners, but lack generalization ability. ViFi-CLIP baseline is sufficient to bridge the domain gap from images to videos.
Abstract: Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships, which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit on the given task distribution and lack generalization. This begs the following question: How to effectively transfer image-level CLIP representations to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that the frame-level processing from the CLIP image encoder followed by feature pooling and similarity matching with corresponding text embeddings helps in implicitly modeling the temporal cues within ViFi-CLIP. Such fine-tuning helps the model to focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a ‘bridge and prompt’ approach that first uses fine-tuning to bridge the domain gap and then learns prompts on the language and vision sides to adapt CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks. Our code and pre-trained models are available at https://github.com/muzairkhattak/ViFi-CLIP.
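The pipeline described in the abstract above (frame-level encoding, feature pooling, similarity matching with text embeddings) can be sketched as follows. This is a toy illustration under stated assumptions: the 2-dimensional vectors stand in for encoder outputs, average pooling is assumed as the pooling operator, and the function names and class labels are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def pool_frames(frame_embeddings):
    """Average per-frame embeddings into one video-level embedding (assumed pooling)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(f[i] for f in frame_embeddings) / n for i in range(dim)]

def classify_video(frame_embeddings, text_embeddings):
    """Pick the class whose text embedding is most similar to the pooled video embedding."""
    video = l2_normalize(pool_frames(frame_embeddings))
    scores = {
        label: sum(a * b for a, b in zip(video, l2_normalize(t)))
        for label, t in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical embeddings: three frames of one clip and two class prompts.
frames = [[0.9, 0.1], [0.8, 0.3], [1.0, 0.0]]
prompts = {"running": [1.0, 0.1], "cooking": [0.0, 1.0]}

label, scores = classify_video(frames, prompts)
print(label)
```

The sketch contains no temporal module at all; any temporal cue has to come from how the per-frame features are fine-tuned, which is the point the abstract makes about implicit temporal modeling.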
TL;DR: Gao et al. developed a model transfer paradigm, SyntheX, to train deep networks on synthetic X-ray data and corresponding labels generated using simulation techniques from CT scans; it can even outperform real-data-trained models due to the effectiveness of training on a larger dataset.
Abstract: Artificial intelligence (AI) now enables automated interpretation of medical images. However, AI’s potential use for interventional image analysis remains largely untapped. This is because the post hoc analysis of data collected during live procedures has fundamental and practical limitations, including ethical considerations, expense, scalability, data integrity and a lack of ground truth. Here we demonstrate that creating realistic simulated images from human models is a viable alternative and complement to large-scale in situ data collection. We show that training AI image analysis models on realistically synthesized data, combined with contemporary domain generalization techniques, results in machine learning models that on real data perform comparably to models trained on a precisely matched real data training set. We find that our model transfer paradigm for X-ray image analysis, which we refer to as SyntheX, can even outperform real-data-trained models due to the effectiveness of training on a larger dataset. SyntheX provides an opportunity to markedly accelerate the conception, design and evaluation of X-ray-based intelligent systems. In addition, SyntheX provides the opportunity to test novel instrumentation, design complementary surgical approaches, and envision novel techniques that improve outcomes, save time or mitigate human error, free from the ethical and practical considerations of live human data collection. Simulated data is an alternative to real data for medical applications where interventional data are needed to train AI-based systems. Gao and colleagues develop a model transfer paradigm to train deep networks on synthetic X-ray data and corresponding labels generated using simulation techniques from CT scans. The approach establishes synthetic data as a viable resource for developing machine learning models that apply to real clinical data.
TL;DR: In this article, a deep learning-based extended dynamic mode decomposition algorithm is presented to learn a finite-dimensional approximation of the Koopman operator for path tracking control of autonomous vehicles.
Abstract: Autonomous driving technologies have received notable attention in the past decades. In autonomous driving systems, identifying a precise dynamical model for motion control is nontrivial due to the strong nonlinearity and uncertainty in vehicle dynamics. Recent efforts have resorted to machine learning techniques for building vehicle dynamical models, but the generalization ability and interpretability of existing methods still need to be improved. In this paper, we propose a data-driven vehicle modeling approach based on deep neural networks with an interpretable Koopman operator. The main advantage of using the Koopman operator is to represent the nonlinear dynamics in a linear lifted feature space. In the proposed approach, a deep learning-based extended dynamic mode decomposition algorithm is presented to learn a finite-dimensional approximation of the Koopman operator. Furthermore, a data-driven model predictive controller with the learned Koopman model is designed for path tracking control of autonomous vehicles. Simulation results in a high-fidelity CarSim environment show that our approach exhibits high modeling precision over a wide operating range and outperforms previously developed methods in terms of modeling performance. Path tracking tests of the autonomous vehicle are also performed in the CarSim environment and the results show the effectiveness of the proposed approach.
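To make the "linear lifted feature space" idea concrete, here is a minimal extended dynamic mode decomposition (EDMD) sketch. It is not the paper's method: the paper learns the lifting functions with a deep network, whereas this sketch assumes a fixed, hand-picked dictionary `lift` and a hypothetical toy system `step` (not a vehicle model) chosen so that the lifted dynamics are exactly linear.

```python
import numpy as np

def lift(x):
    """Hypothetical fixed dictionary of observables: [x1, x2, x1^2]."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2])

def step(x, a=0.9, b=0.5, c=0.2):
    """Toy nonlinear dynamics (an assumption for illustration only)."""
    x1, x2 = x
    return np.array([a * x1, b * x2 + c * x1 ** 2])

# Collect snapshot pairs (x_k, x_{k+1}) from random initial states.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
Y = np.array([step(x) for x in X])

# Lift both sides into the feature space; each row is one snapshot.
Psi_X = np.array([lift(x) for x in X])
Psi_Y = np.array([lift(y) for y in Y])

# EDMD: least-squares fit of Psi_Y ≈ Psi_X @ K.T, i.e. lift(x_{k+1}) ≈ K @ lift(x_k).
K_T, *_ = np.linalg.lstsq(Psi_X, Psi_Y, rcond=None)
K = K_T.T

print(np.round(K, 3))
```

For this toy system the nonlinear term `c * x1**2` becomes a single off-diagonal entry of `K`, so a linear model in the lifted coordinates reproduces the nonlinear dynamics; a linear predictor of this kind is what makes a model predictive controller tractable.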
TL;DR: In this paper, a human preference classifier is trained on a collected dataset of human choices on generated images, and a Human Preference Score (HPS) derived from the classifier is used to adapt Stable Diffusion to better align with human aesthetic preferences.
Abstract: Recent years have witnessed a rapid growth of deep generative models, with text-to-image models gaining significant attention from the public. However, existing models often generate images that do not align well with human aesthetic preferences, such as awkward combinations of limbs and facial expressions. To address this issue, we collect a dataset of human choices on generated images from the Stable Foundation Discord channel. Our experiments demonstrate that current evaluation metrics for generative models do not correlate well with human choices. Thus, we train a human preference classifier with the collected dataset and derive a Human Preference Score (HPS) based on the classifier. Using the HPS, we propose a simple yet effective method to adapt Stable Diffusion to better align with human aesthetic preferences. Our experiments show that the HPS outperforms CLIP in predicting human choices and has good generalization capability towards images generated from other models. By tuning Stable Diffusion with the guidance of the HPS, the adapted model is able to generate images that are more preferred by human users.
TL;DR: The authors introduce LLM+P, a framework that incorporates the strengths of classical planners into LLMs by first converting the language description into a file written in the planning domain definition language (PDDL), then leveraging classical planners to quickly find a solution, and then translating the found solution back into natural language.
Abstract: Large language models (LLMs) have demonstrated remarkable zero-shot generalization abilities: state-of-the-art chatbots can provide plausible answers to many common questions that arise in daily life. However, so far, LLMs cannot reliably solve long-horizon planning problems. By contrast, classical planners, once a problem is given in a formatted way, can use efficient search algorithms to quickly identify correct, or even optimal, plans. In an effort to get the best of both worlds, this paper introduces LLM+P, the first framework that incorporates the strengths of classical planners into LLMs. LLM+P takes in a natural language description of a planning problem, then returns a correct (or optimal) plan for solving that problem in natural language. LLM+P does so by first converting the language description into a file written in the planning domain definition language (PDDL), then leveraging classical planners to quickly find a solution, and then translating the found solution back into natural language. Along with LLM+P, we define a diverse set of different benchmark problems taken from common planning scenarios. Via a comprehensive set of experiments on these benchmark problems, we find that LLM+P is able to provide optimal solutions for most problems, while LLMs fail to provide even feasible plans for most problems. The code and results are publicly available at https://github.com/Cranial-XIX/llm-pddl.git.
Shichao Dong, Jin Wang, Renhe Ji, Jiajun Liang, Fan Huang, Gengfeng Zheng
1 Jun 2023
TL;DR: Implicit Identity Leakage hinders the generalization ability of deepfake detection models. The phenomenon is caused by the learned identity representation on images. A method named ID-unaware Deepfake Detection Model is proposed to reduce the influence of this phenomenon and achieve improved generalization performance.
Abstract: In this paper, we analyse the generalization ability of binary classifiers for the task of deepfake detection. We find that the stumbling block to their generalization is caused by the unexpected learned identity representation on images. Termed as the Implicit Identity Leakage, this phenomenon has been qualitatively and quantitatively verified among various DNNs. Furthermore, based on such understanding, we propose a simple yet effective method named the ID-unaware Deepfake Detection Model to reduce the influence of this phenomenon. Extensive experimental results demonstrate that our method outperforms the state-of-the-art in both in-dataset and cross-dataset evaluation. The code is available at https://github.com/megvii-research/CADDM.
TL;DR: In this article, an adversarial mutual information-guided single domain generalization network for machinery fault diagnosis is proposed, where a domain generation module is designed to generate fake target domains that have significant distribution discrepancies with the source domain, and an iterative min-max game of mutual information is implemented to learn generalized features for resisting the unknown domain shift.
Abstract: Domain generalization-based fault diagnosis has recently emerged to address domain shift problems. Most existing methods learn domain-invariant representations from multiple source domains. However, valuable fault samples from polytropic working conditions are difficult to collect, and it is quite common that available data are from a single working condition. Therefore, this article proposes an adversarial mutual information-guided single domain generalization network for machinery fault diagnosis. To enhance the model generalization ability, a domain generation module is designed to generate fake target domains that have significant distribution discrepancies with the source domain. Then, an iterative min–max game of mutual information between the domain generation module and task diagnosis module is implemented to learn generalized features for resisting the unknown domain shift. Extensive diagnosis experiments conducted on two mechanical rigs validated the effectiveness of the proposed method.
TL;DR: Wang et al. proposed a residual attention network (RAN) that employs the attention mechanism and residual learning to improve classification efficiency and accuracy. With the help of mutual information correlation analysis, the hourly segmented measured data is transformed into matrix form as the input of the deep learning model for training.