Research

Artificial General Intelligence (AGI) & Multimodal Question Answering (QA)

SugaFormer: Super-class guided Transformer for Zero-Shot Attribute Classification (AAAI '25)

LLaMo: Large Language Model-based Molecular Graph Assistant (NeurIPS'24)

VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning (AAAI '25)

DialogGSR: Generative Subgraph Retrieval for Knowledge Graph–Grounded Dialog Generation (EMNLP'24)

A superintelligence is a hypothetical agent whose intelligence far surpasses that of the brightest and most gifted human minds. This hypothetical capability is often discussed under the name Artificial General Intelligence (AGI). There have been many milestones toward this goal, including GPT-4, ChatGPT, CLIP, and Flamingo. These models perform remarkably well on diverse tasks compared with task-specific weak AI models, even without task-specific training. More recently, such AGI/foundation models have become multimodal (images, videos, knowledge graphs, etc.), achieving a deeper understanding of the world. Our overarching goal is to develop a general-purpose learning system that can learn and perform unseen tasks using every modality available to it.

Our related publications

[AAAI '25] Super-class guided Transformer for Zero-Shot Attribute Classification
[AAAI '25] VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning
[NeurIPS '24] LLaMo: Large Language Model-based Molecular Graph Assistant
[EMNLP '24] Generative Subgraph Retrieval for Knowledge Graph–Grounded Dialog Generation
[ECCV '24] Understanding Multi-compositional Learning in Vision and Language Models via Category Theory
[CVPR '24] vid-TLDR: Training Free Token merging for Light-weight Video Transformer
[CVPR '24] Prompt Learning via Meta-Regularization
[CVPR '24] Retrieval-Augmented Open-Vocabulary Object Detection
[EMNLP '23] Large Language Models are Temporal and Causal Reasoners for Video Question Answering
[NeurIPS '23] NuTrea: Neural Tree Search for Context-guided Multi-hop KGQA
[ICCV '23] Open-Vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models
[ICCV '23] Distribution-Aware Prompt Tuning for Vision-Language Models
[ICCV '23] Read-only Prompt Optimization for Vision-Language Few-shot Learning
[CVPR '23] MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
[AAAI '23] Relation-aware Language-Graph Transformer for Question Answering
[MedAGI '23] Concept Bottleneck with Visual Concept Filtering for Explainable Medical Image Classification
[CVPR '22] Video-Text Representation Learning via Differentiable Weak Temporal Alignment

AI for Science

DGT: Deformable Graph Transformer (TPAMI)

InvBO: Inversion-based Latent Bayesian Optimization (NeurIPS '24)

NF-BO: Latent Bayesian Optimization via Autoregressive Normalizing Flows (ICLR '25, Oral presentation)

LLaMo: Large Language Model-based Molecular Graph Assistant (NeurIPS'24)

In modern data analysis, highly structured data occur frequently and can be viewed as data on non-Euclidean spaces (e.g., graphs, Riemannian manifolds, data manifolds, and function spaces). We focus on AI for Science, applying graph neural networks (GNNs) to areas such as molecular language models, molecule optimization for drug discovery, and weather forecasting. Our goal is to use these techniques to model complex scientific data and drive innovation in scientific research.

Our related publications

[TPAMI] Deformable Graph Transformer
[ICLR '25] Latent Bayesian Optimization via Autoregressive Normalizing Flows (Oral presentation, top 1.8%)
[NeurIPS '24] LLaMo: Large Language Model-based Molecular Graph Assistant
[NeurIPS '24] Inversion-based Latent Bayesian Optimization
[EMNLP '24] Generative Subgraph Retrieval for Knowledge Graph–Grounded Dialog Generation
[NeurIPS '23] Advancing Bayesian Optimization via Learning Smooth Latent Spaces
[AAAI '22] Deformable Graph Convolutional Networks
[NN '22] Graph Transformer Networks: Learning Meta-path Graphs to Improve GNNs
[NeurIPS '21] Metropolis-Hastings Data Augmentation for Graph Neural Networks
[NeurIPS '21] Neighborhood Overlap-aware Graph Neural Networks for Link Prediction
[NeurIPS '20] Self-supervised Auxiliary Learning with Meta-paths for Heterogeneous Graphs
[NeurIPS '19] Graph Transformer Networks
[ICML '15] Manifold-valued Dirichlet Processes
[CVPR '16] Latent Variable Graphical Model Selection using Harmonic Analysis: Applications to the HCP
[Quarterly of Applied Math] Localizing differentially evolving covariance structures via scan statistics
[ICCV '15] Interpolation on the manifold of k component Gaussian Mixture Models

Deep Generative Models

NF-BO: Latent Bayesian Optimization via Autoregressive Normalizing Flows (ICLR '25, Oral presentation)

CAF: Constant Acceleration Flow (NeurIPS '24)

InvBO: Inversion-based Latent Bayesian Optimization (NeurIPS '24)

DAVI: Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems (ECCV '24, Oral presentation)

Generative models are a cornerstone of artificial intelligence, serving as powerful engines for innovation in drug discovery, image generation, and video generation. In drug discovery, these models are combined with machine learning techniques such as Bayesian optimization to design novel molecules, accelerating the identification of potential therapeutic compounds. In image generation, techniques such as diffusion models produce realistic images, enabling creative expression and practical applications across diverse fields. By generating new data samples, generative models continue to reshape industries and drive progress in science and technology.

Our related publications

[ICLR '25] Latent Bayesian Optimization via Autoregressive Normalizing Flows (Oral presentation, top 1.8%)
[NeurIPS '24] Constant Acceleration Flow
[NeurIPS '24] Inversion-based Latent Bayesian Optimization
[ECCV '24] Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems (Oral presentation, top 2.3%)
[ICML '24] Stochastic Conditional Diffusion Models for Robust Semantic Image Synthesis
[ICLR '24] Domain-agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations
[NeurIPS '23] Advancing Bayesian Optimization via Learning Smooth Latent Spaces
[NeurIPS '22] Invertible Monotone Operators for Normalizing Flows
[ECCV '22] k-SALSA: k-anonymous synthetic averaging of retinal images via local style alignment

Deep Understanding of Visual World

EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality (CVPR'25)

SpeaQ: Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection (CVPR'24)

MCTF: Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision (CVPR'24)

TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers (NeurIPS '22)

High-level computer vision enables a deeper understanding of the visual world. Object recognition systems detect objects in images and videos. They offer basic information on whether certain objects are in the scene and how many instances there are. But this information may not be sufficient for building personalized and automated systems for smart cities, such as smart homes, smart offices, and hospitals. Without a deep understanding of the interaction between humans and objects, it is hard to understand the context of the scene and what kind of services are needed. "Scene Understanding" studies such interactions and generates metadata such as scene graphs, which in turn enables "Visual Question Answering (VQA)". Security cameras are pervasive in modern cities, and computer vision helps them detect anomalies (floods, wildfires, dangerous wild animals) and estimate traffic and even temperature. We study algorithms that offer a more accurate and deeper understanding of the visual world and help people live safer and smarter lives.

Our related publications

[CVPR '25] EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
[CVPR '24] Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision
[CVPR '24] Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
[NeurIPS '22] TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers
[CVPR '22] Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection
[CVPR '21] HOTR: End-to-End Human-Object Interaction Detection with Transformers
[ECCV '20] UnionDet: Union-Level Detector Towards Real-Time Human-Object Interaction Detection
[CVPR '18] Tensorize, Factorize and Regularize: Robust Visual Relationship Learning
[ECCV '16] Abundant Inverse Regression using Sufficient Reduction and its Applications
[ECCV '18] Efficient Relative Attribute Learning using Graph Neural Networks

Implicit Neural Representation and 3D Computer Vision

DDMI: Domain-agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations (ICLR'24)

UP-NeRF: Unconstrained Pose-Prior-Free Neural Radiance Fields (NeurIPS '23)

Semantic-Aware Implicit Template Learning via Part Deformation Consistency (ICCV '23)

3D data (e.g., point clouds, voxels, polygonal meshes) are crucial to diverse fields such as robotics, autonomous driving, AI drones, medical data analysis, and scene reconstruction. We are interested in 3D computer vision and 3D deep learning on such data, which has more complex geometry than 2D data. Shape classification, indoor/outdoor scene semantic segmentation, and shape correspondence/registration are representative tasks for point cloud data. We are also interested in Implicit Neural Representation (INR), an emerging paradigm that offers a novel approach to representing complex geometric shapes and scenes.

Our related publications

[ICLR '24] Domain-agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations
[NeurIPS '23] Unconstrained Pose-Prior-Free Neural Radiance Fields
[ICCV '23] Semantic-Aware Implicit Template Learning via Part Deformation Consistency
[ICML '23] Robust Camera Pose Refinement for Multi-Resolution Hash Encoding
[CVPR '23] Self-positioning Point-based Transformer for Point Cloud Understanding
[NeurIPS '22] SageMix: Saliency-Guided Mixup for Point Clouds
[ICCV '21] Point Cloud Augmentation with Weighted Local Transformations

Safe AI, Adversarial Examples, and Uncertainty

Machine learning models (in particular, deep neural networks) have been used in a variety of applications, including autonomous robots, vehicles, and drones. When AI systems are deployed in the physical world, the reliability of their algorithms is crucial for safety. Guaranteeing such safety involves specification, robustness, and assurance. Given a concrete purpose of the system (specification), the AI system should be robust to perturbations and attacks (adversarial examples). Further, the uncertainty of model predictions helps monitor and control the AI system's activity. In this line of thought, we study model uncertainty (e.g., Bayesian Neural Networks) and adversarial examples from both attacker and defender perspectives. This topic lies at the intersection of AI and security.

Our related publications

[IEEE ACCESS '21] Search-and-Attack: Temporally Sparse Adversarial Perturbations on Videos
[ECCV '20] Robust Neural Networks inspired by Strong Stability Preserving Runge-Kutta methods
[UAI '19] Sampling-free Uncertainty Estimation in Gated Recurrent Units with Applications to Normative Modeling in Neuroimaging
[arxiv '18] Sampling-free Uncertainty Estimation in Gated Recurrent Units with Exponential Families

Medical Imaging

Riemannian MGLM (CVPR)

Medical imaging, and brain imaging in particular, inherently involves many structured measurements such as diffusion tensor images (DTI), high angular resolution diffusion images (HARDI), and ensemble average propagators (EAPs). Common goals in medical imaging are to identify important regions related to a certain disease, detect diseases at an early stage, and model disease progression. To provide predictions and findings that are rigorously tested by statistics, more powerful pipelines are needed. We study more powerful representations of medical images and models (mixed effects models for structured data, filtering, dimensionality reduction, etc.). We also research few-shot detection, domain adaptation, and contrastive learning to deal with limited samples and labels in the medical domain.

Our related publications

[CVPR '17] Riemannian Nonlinear Mixed Effects Models: Analyzing Longitudinal Deformations in Neuroimaging
[CVPRW '17] Riemannian Variance Filtering: An Independent Filtering Scheme for Statistical Tests on Manifold-valued Data
[ECCV '15] Canonical Correlation Analysis on Riemannian Manifolds and its Applications
[CVPR '14] MGLM on Riemannian Manifolds with Applications to Statistical Analysis of Diffusion Weighted Images
