Brandon Jinxiu Liu (刘锦绣)
“Stay hungry, stay foolish.” — Steve Jobs
Hi there! I am a junior undergraduate student at South China University of Technology, advised by Prof. Qi Liu (IEEE Senior Member). I am currently a research intern
at the Stanford Vision and Learning Lab,
focusing on 4D dynamic generation, advised by Prof. Jiajun Wu.
I am also working with Westlake University & OPPO Research Institute on multi-modal-LLM-enhanced, diffusion-based image/video generation, advised by Prof. Guo-Jun Qi (IEEE Fellow).
I am sincerely looking for PhD positions for Fall 2025 admission!
Email: jinxiuliu0628@foxmail.com  /  branodnjinxiuliu@cs.stanford.edu
Tel/Wechat: +86-13951891694
CV / Google Scholar
Sidelights: click the portrait 👆 and enjoy the animation magic from my project Prompt image to Life.
Stanford Vision and Learning Lab, Stanford University, Research Intern 03/24 – present
4D Scene Generation, advised by Prof. Jiajun Wu and Hong-Xing "Koven" Yu.
Westlake University & OPPO Research Institute, Research Intern 09/23 – present
Text-driven Video Generation, advised by Prof. Guo-Jun Qi (IEEE Fellow).
School of Future Technology, SCUT, Research Intern 12/22 – present
Text-driven Image Generation, advised by Prof. Qi Liu (IEEE Senior Member).
Education
South China University of Technology (SCUT), Guangzhou, China 09/21 – 06/25(expected)
B.Eng. in Artificial Intelligence
Main courses: Deep Learning and Computer Vision (4.0/4.0), Course Design of Deep Learning and Computer Vision (4.0/4.0, best project),
C++ Programming Foundations (4.0/4.0), Python Programming (4.0/4.0), Data Structures (4.0/4.0),
Advanced Language Programming Training (4.0/4.0), Artificial Intelligence and 3D Vision (4.0/4.0), Calculus (4.0/4.0), ...
News
One paper accepted at AAAI 2024.
One paper accepted at VDU@CVPR 2024 as an oral presentation.
One paper accepted at IJCAI 2024.
R3CD: Scene Graph to Image Generation with Relation-aware Compositional Contrastive Control Diffusion
Jinxiu Liu,
Qi Liu,
In this paper, we introduce R3CD, a new framework for generating images from scene graphs with large-scale diffusion models and contrastive control mechanisms.
R3CD can handle complex or ambiguous relations in scene graphs and produce realistic, diverse images that match the scene-graph specifications.
R3CD consists of two main components: (1) SGFormer, a transformer-based encoder that captures both local and global information from scene graphs; and (2) relation-aware diffusion contrastive control, a contrastive learning module that aligns relation features and image features across different levels of abstraction.
AAAI 2024, 4 positive reviews
Paper
Maple: Multi-modal Pre-training for Contextual Instance-aware Visual Generation
Jinxiu Liu,
Jinjin Cao,
Zilyu Ye,
Zhiyang Chen,
Ziwei Xuan,
Zemin Huang,
Mingyuan Zhou,
Xiaoqian Shen,
Qi Liu,
Mohamed Elhoseiny,
Guo-Jun Qi
Generative multi-modal models have recently drawn increasing attention, driven by their critical impact on granularity-diverse conversations.
Despite the high quality of images generated from short captions, most models deteriorate drastically when long contexts with multiple instances are provided:
the instances in the generated images are often inconsistent with those in earlier images.
This stems from the models' weakness in capturing the features of instances sparsely located in long input sequences.
To address this issue, we propose Maple,
a large-scale open-domain generative multi-modal model.
Maple takes long contexts with interleaved images and text as guidance to generate images while keeping the instances in the generated image consistent with the given inputs.
This is realized by a training schedule focused on instance-level consistency and a long-context decoupling mechanism that balances contextual information of different importance.
Specifically, (1) we introduce a new image generation approach that uses instances as extra inputs to emphasize their features in the multi-modal context, enhancing instance-level consistency.
To train this model, we provide an extensive open-domain dataset with recurring instances across image sequences, and propose a multi-stage training strategy
that evolves from instance-grounded single-round generation to multi-modal context-aware multi-turn generation.
(2) We decouple the multi-modal context into two inputs, the highly related current caption and the broader multi-modal context with low information density,
allowing differential prioritization through a tailored attention mechanism within the diffusion model's UNet. This mechanism refines attention to focus on instance-level features relevant
to the current instructions when perceiving the multi-modal context.
Remarkably, Maple generates high-quality images that are more coherent with the inputs, and it outperforms current state-of-the-art multi-modal visual generators in contextual visual consistency.
NeurIPS 2024, under review
Prompt image to Life: Training-free Text-driven Image-to-video Generation
Jinxiu Liu,
Yuan Yao,
Bingwen Zhu,
Fanyi Wang,
Weijian Luo,
Jingwen Su,
Yanhao Zhang,
Yuxiao Wang,
Liyuan Ma,
Qi Liu,
Jiebo Luo,
Guo-Jun Qi
Image-to-video (I2V) generation is a challenging task that requires transforming a static image into a dynamic video according to a text prompt,
demanding both subject consistency and text-semantic alignment.
Moreover, existing I2V generators require expensive training on large video datasets.
To address these issues, we propose PiLife, a novel training-free I2V framework that leverages a pre-trained text-to-image diffusion model. PiLife generates videos that are coherent with a given image and aligned with the semantics of a given text, and consists of three main components: (i) a motion-aware diffusion inversion module that embeds motion semantics into the inverted images used as initial frames; (ii) a motion-aware noise initialization module that employs a motion-text attention map to modulate the diffusion process and adjust the motion intensity of different regions with spatial noise; (iii) a probabilistic cross-frame attention module that uses a geometric distribution to randomly sample a frame and compute attention with it, thereby enhancing motion diversity. Experiments show that PiLife significantly outperforms training-free baselines and is comparable or even superior to some training-based I2V methods.
ECCV 2024, under review
Paper /
project page
OpenStory: A Large-Scale Open-Domain Dataset for Subject-Driven Visual Storytelling
Zilyu Ye*,
Jinxiu Liu*,
Jinjin Cao,
Zhiyang Chen,
Ziwei Xuan,
Mingyuan Zhou,
Qi Liu,
Guo-Jun Qi   (* equal contribution)
In this paper, we present OpenStory, a large-scale dataset tailored for training subject-focused story visualization models to
generate coherent and contextually relevant visual narratives. Addressing the challenges of maintaining subject continuity across frames
and capturing compelling narratives, we propose an innovative pipeline that automates the extraction of keyframes from open-domain videos.
It employs vision-language models to generate descriptive captions, which are then refined by a large language model to
ensure narrative flow and coherence. Furthermore, advanced subject masking techniques are applied to isolate and segment the primary subjects.
Derived from diverse video sources, including YouTube and existing datasets, OpenStory offers a comprehensive open-domain resource, surpassing prior datasets confined to
specific scenarios. With automated captioning instead of manual annotation, high-resolution imagery optimized for subject count per frame, and extensive frame sequences
ensuring consistent subjects for temporal modeling, OpenStory establishes itself as an invaluable benchmark. It facilitates advancements in subject-focused story visualization,
enabling the training of models capable of comprehending and generating intricate multi-modal narratives from extensive visual and textual inputs.
CVPR 2024@VDU, Oral Presentation
🐖PiGIE: Proximal Policy Optimization Guided Diffusion for Fine-Grained Image Editing
Tiancheng Li*,
Jinxiu Liu*,
William Luo,
Huajun Chen,
Qi Liu,
(* equal contribution)
Instruction-based image editing is a challenging task, since it requires manipulating the visual content of images according to complex human language instructions.
When editing an image with tiny objects and complex positional relationships, existing image editing methods cannot locate the precise region in which to execute the edit.
To address this issue, we introduce Proximal Policy Optimization Guided Image Editing (PiGIE), a diffusion model that can accurately edit tiny objects in images with complex scenes.
PiGIE incorporates proper noise masks to edit images guided by the target object's attention maps.
Unlike traditional image editing approaches based on supervised learning, PiGIE uses the cosine similarity between the UNet's attention map
and human feedback as a reward function and employs Proximal Policy Optimization (PPO) to fine-tune the diffusion model, so that PiGIE can locate the editing
regions precisely based on human instructions. On multiple image editing benchmarks, PiGIE exhibits remarkable improvements in both
image quality and generalization capability. In particular, PiGIE sets a new baseline for editing fine-grained images with multiple tiny objects,
shedding light on future studies of text-guided image editing for tiny objects.
ACM MM 2024, under review
PoseAnimate: Zero-shot High Fidelity Pose Controllable Character Animation
Bingwen Zhu,
Fanyi Wang,
Peng Liu,
Jingwen Su,
Jinxiu Liu,
Yanhao Zhang,
Zuxuan Wu,
Yu-Gang Jiang,
Guo-Jun Qi,
In this paper, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains
three key components: 1) a Pose-Aware Control Module (PACM) that incorporates diverse pose signals into conditional embeddings
to preserve character-independent content and maintain precise alignment of actions; 2) a Dual Consistency Attention Module
(DCAM) that enhances temporal consistency and retains character identity and intricate background details; 3) a Mask-Guided
Decoupling Module (MGDM) that refines distinct feature perception, improving animation fidelity by decoupling the character
and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transitions. Extensive
experimental results demonstrate that our approach
outperforms state-of-the-art training-based methods in terms of character consistency and detail fidelity.
IJCAI 2024
Paper /
project page
Deep Neural Network Compression by Spatial-wise Low-rank Decomposition
Xiaoye Zhu*,
Jinxiu Liu*,
Ye Liu,
Michael NG,
Zihan Ji,
(* equal contribution)
In this paper, we introduce a new method for compressing convolutional neural networks (CNNs) based on spatial-wise low-rank decomposition (SLR). The method preserves the higher-order structure of the filter weights and exploits their local low-rankness at different spatial resolutions; it can be implemented as a 1x1 convolution layer and achieves significant reductions in model size and computation cost with minimal accuracy loss. We show the method's superior performance over state-of-the-art low-rank compression methods and network pruning methods on several popular CNNs and datasets.
Applied Intelligence, under review, positive reviews
MiniHuggingGPT: A mini multi-modal application in the style of HuggingGPT and MiniGPT-4
Course Design of Deep Learning and Computer Vision, mentored by
Prof. Mingkui Tan and
Prof. Huiping Zhuang.
- Used an LLM as a controller that calls other large-scale models based on natural-language instructions.
- Developed a text-based dialogue system based on ChatGLM that, via instruction fine-tuning, can invoke three large-scale models for image captioning, image generation, and text conversation using natural-language commands.
- Provided a web UI based on Gradio for easy interaction with the system and showcased various examples of its capabilities.
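To illustrate the "LLM as an API" idea, the controller's routing step can be sketched as a minimal dispatcher that maps a natural-language instruction to one of the three tools. The task labels and keywords below are hypothetical placeholders; in the actual project this decision is made by the instruction-finetuned ChatGLM, not a keyword heuristic.

```python
# Minimal sketch of LLM-as-controller routing: an instruction is mapped to one
# of three downstream tools. A keyword heuristic stands in for the real
# instruction-finetuned ChatGLM router.

TASKS = {
    "image_generation": ("draw", "generate an image", "paint", "picture of"),
    "image_captioning": ("describe", "caption", "what is in"),
}

def route(instruction: str) -> str:
    """Return the task label for an instruction; default to text conversation."""
    text = instruction.lower()
    for task, keywords in TASKS.items():
        if any(kw in text for kw in keywords):
            return task
    return "text_conversation"

def dispatch(instruction: str) -> str:
    """Report which (pretend) model would handle the instruction."""
    task = route(instruction)
    return f"[{task}] handled: {instruction}"
```

In the real system, the dispatcher is replaced by the LLM itself, which is prompted to emit a tool name and arguments; the selected model's output is then fed back into the dialogue.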
Awarded Best Course Design (1/39).
project report
Introduction to few-shot learning
Honored to be invited by Xinjie Shen, Chairman of the AIA (Artificial Intelligence Association) at SCUT.
In this talk, I explore the use of few-shot learning techniques for RE based on my research experience. I introduce the basic concepts and principles of few-shot learning, such as the meta-learning framework, the episodic training strategy, and the evaluation metrics. I also discuss some recent advances and applications of few-shot learning for RE, such as the use of pre-trained language models, graph neural networks, contrastive learning, and data augmentation. I demonstrate how few-shot learning can improve the performance and robustness of RE models on different datasets and scenarios. I also share some of the challenges and future directions of few-shot learning for RE.
Slides
I have been fortunate to work as a research intern with these wonderful people who generously provided me with guidance and mentorship.
© Brandon Jinxiu Liu