Brandon Jinxiu Liu (刘锦绣)

“Stay hungry, stay foolish.” -- Steve Jobs

Hi there! I am a junior undergraduate student at South China University of Technology, advised by Prof. Qi Liu (IEEE Senior Member). I am currently a research intern at the Stanford Vision and Learning Lab, working on 4D dynamic generation, advised by Prof. Jiajun Wu. I am also working with Westlake University & OPPO Research Institute on multi-modal-LLM-enhanced, diffusion-based image/video generation, advised by Prof. Guo-Jun Qi (IEEE Fellow).

I am sincerely looking for PhD positions for Fall 2025 admission!

Email: jinxiuliu0628@foxmail.com  /  branodnjinxiuliu@cs.stanford.edu
Tel/WeChat: +86-13951891694

Email  /  CV  /  Google Scholar

profile photo
Sidelights!!! Click the portrait 👆 and enjoy the animation magic from my project Prompt image to Life.
Research Experience
Stanford Vision and Learning Lab, Stanford University, Research Intern   03/24 – present
4D Scene Generation, advised by Prof. Jiajun Wu and Hong-Xing "Koven" Yu.

Westlake University & OPPO Research Institute, Research Intern   09/23 – present
Text-driven Video Generation, advised by Prof. Guo-Jun Qi (IEEE Fellow).

School of Future Technology, SCUT, Research Intern   12/22 – present
Text-driven Image Generation, advised by Prof. Qi Liu (IEEE Senior Member).

Education Experience

South China University of Technology (SCUT), Guangzhou, China   09/21 – 06/25 (expected)
B.Eng. (majoring in Artificial Intelligence)

Main courses: Deep Learning and Computer Vision (4.0/4.0), Course Design of Deep Learning and Computer Vision (4.0/4.0, Best Project), C++ Programming Foundations (4.0/4.0), Python Programming (4.0/4.0), Data Structure (4.0/4.0), Advanced Language Programming Training (4.0/4.0), Artificial Intelligence and 3D Vision (4.0/4.0), Calculus (4.0/4.0), ...

News

One paper accepted to AAAI 2024.

One paper accepted to VDU@CVPR 2024 as an Oral Presentation.

One paper accepted to IJCAI 2024.

Publications
R3CD: Scene Graph to Image Generation with Relation-aware Compositional Contrastive Control Diffusion

Jinxiu Liu, Qi Liu

In this paper, we introduce R3CD, a new framework for generating images from scene graphs with large-scale diffusion models and contrastive control mechanisms. R3CD can handle complex or ambiguous relations in scene graphs and produce realistic, diverse images that match the scene-graph specifications. R3CD consists of two main components: (1) SGFormer, a transformer-based encoder that captures both local and global information from scene graphs; and (2) relation-aware diffusion contrastive control, a contrastive learning module that aligns relation features and image features across different levels of abstraction.
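To give a feel for the alignment idea, here is a minimal, hedged sketch of a relation-to-image contrastive objective: relation embeddings are pulled toward the image features of their own sample and pushed away from the rest, InfoNCE-style. The function name, shapes, temperature, and symmetric loss form are illustrative assumptions, not the exact R3CD objective.

```python
import torch
import torch.nn.functional as F

def relation_image_contrastive_loss(rel_feats, img_feats, temperature=0.07):
    """rel_feats: (B, D) relation embeddings from a scene-graph encoder.
       img_feats: (B, D) pooled image features at some diffusion level.
       Positive pairs share a batch index; all other pairs act as negatives."""
    rel = F.normalize(rel_feats, dim=-1)
    img = F.normalize(img_feats, dim=-1)
    logits = rel @ img.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(rel.size(0), device=rel.device)
    # Symmetric InfoNCE: relation -> image and image -> relation.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example usage with random features:
loss = relation_image_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```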

AAAI 2024, 4 positive reviews
Paper

Maple: Multi-modal Pre-training for Contextual Instance-aware Visual Generation

Jinxiu Liu, Jinjin Cao, Zilyu Ye, Zhiyang Chen, Ziwei Xuan, Zemin Huang, Mingyuan Zhou, Xiaoqian Shen, Qi Liu, Mohamed Elhoseiny, Guo-Jun Qi

Generative multi-modal models have recently drawn increasing attention, driven by their critical impact on granularity-diverse conversations. Despite the high quality of images generated from short captions, most models deteriorate drastically when long contexts with multiple instances are provided: the instances in the generated images are hard to keep consistent with those in earlier images. We attribute this to the models' weakness in capturing the features of instances sparsely located in the long input sequence. To address this issue, we propose Maple, a large-scale open-domain generative multi-modal model. Maple takes long contexts with interleaved images and text as guidance to generate images while keeping the instances in the generated image consistent with the given inputs. This is realized by a training schedule focused on instance-level consistency and a long-context decoupling mechanism that balances contextual information of different importance. Specifically, (1) we introduce a new image generation approach that utilizes instances as extra inputs to emphasize their features in the multi-modal context, enhancing instance-level consistency. To train this model, we provide an extensive open-domain dataset with recurring instances across image sequences, and we propose a multi-stage training strategy that evolves from instance-grounded single-round generation to multi-modal context-aware multi-turn generation. (2) We decouple the multi-modal context into two inputs, the highly related current caption and the broader multi-modal context with low information density, allowing for differential prioritization through a tailored attention mechanism within the UNet of the diffusion model. This mechanism refines the attention to focus on instance-level features relevant to the current instructions when perceiving the multi-modal context. Remarkably, Maple generates high-quality images that are more coherent with the inputs and outperforms current state-of-the-art multi-modal visual generators in contextual visual consistency.
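The sketch below illustrates the "decoupled context" idea only: cross-attention is run separately over the highly related current caption and over the broader, lower-density multi-modal context, and the two results are mixed with a lower weight on the broad context. The split into two attention layers, the fixed gating scalar, and all shapes are assumptions for illustration, not Maple's actual attention design.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8, context_weight=0.3):
        super().__init__()
        self.attn_caption = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_context = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.context_weight = context_weight  # lower priority for the broad context

    def forward(self, x, caption_tokens, context_tokens):
        # x: (B, N, D) UNet feature tokens; caption/context: (B, L, D) embeddings.
        out_cap, _ = self.attn_caption(x, caption_tokens, caption_tokens)
        out_ctx, _ = self.attn_context(x, context_tokens, context_tokens)
        return x + out_cap + self.context_weight * out_ctx

# Toy forward pass: 64 UNet tokens, a short caption, and a long interleaved context.
x = torch.randn(2, 64, 320)
layer = DecoupledCrossAttention(320)
y = layer(x, torch.randn(2, 16, 320), torch.randn(2, 128, 320))
```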

NeurIPS 2024, under review

Prompt image to Life: Training-free Text-driven Image-to-video Generation

Jinxiu Liu,   Yuan Yao,   Bingwen Zhu,   Fanyi Wang,   Weijian Luo,   Jingwen Su,   Yanhao Zhang,   Yuxiao Wang,   Liyuan Ma,   Qi Liu,   Jiebo Luo,   Guo-Jun Qi  

Image-to-video (I2V) generation requires transforming a static image into a dynamic video according to a text prompt. It has long been a challenging task that demands both subject consistency and text-semantic alignment, and existing I2V generators require expensive training on large video datasets. To address this, we propose PiLife, a novel training-free I2V framework that leverages a pre-trained text-to-image diffusion model. PiLife generates videos that are coherent with a given image and aligned with the semantics of a given text, and consists of three main components: (i) a motion-aware diffusion inversion module that embeds motion semantics into the inverted images used as initial frames; (ii) a motion-aware noise initialization module that employs a motion-text attention map to modulate the diffusion process and adjust the motion intensity of different regions with spatial noise; and (iii) a probabilistic cross-frame attention module that leverages a geometric distribution to randomly sample a reference frame and compute attention with it, thereby enhancing motion diversity. Experiments show that PiLife significantly outperforms training-free baselines and is comparable or even superior to some training-based I2V methods.
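A minimal sketch of the probabilistic cross-frame attention idea: for each frame, a reference frame is sampled at a geometrically distributed offset and attention is computed against it, so motion diversity comes from the random reference rather than a fixed anchor frame. The parameterization (the success probability p and the backward-offset convention) is an assumption for illustration, not PiLife's exact sampling rule.

```python
import torch

def sample_reference_indices(num_frames, p=0.5):
    """For frame t, pick reference t - k with k ~ Geometric(p), clamped to >= 0."""
    k = torch.distributions.Geometric(probs=p).sample((num_frames,)).long() + 1
    t = torch.arange(num_frames)
    return (t - k).clamp(min=0)

def probabilistic_cross_frame_attention(q, k, v, ref_idx):
    """q, k, v: (F, N, D) per-frame token features; ref_idx: (F,) reference frame ids.
       Each frame's queries attend to the keys/values of its sampled reference frame."""
    k_ref, v_ref = k[ref_idx], v[ref_idx]                               # (F, N, D)
    attn = torch.softmax(q @ k_ref.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v_ref

frames, tokens, dim = 16, 64, 320
q = k = v = torch.randn(frames, tokens, dim)
out = probabilistic_cross_frame_attention(q, k, v, sample_reference_indices(frames))
```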

ECCV 2024, under review
Paper / project page

OpenStory: A Large-Scale Open-Domain Dataset for Subject-Driven Visual Storytelling

Zilyu Ye*, Jinxiu Liu*, Jinjin Cao, Zhiyang Chen, Ziwei Xuan, Mingyuan Zhou, Qi Liu, Guo-Jun Qi   (* equal contribution)

In this paper, we present OpenStory, a large-scale dataset tailored for training subject-focused story visualization models to generate coherent and contextually relevant visual narratives. Addressing the challenges of maintaining subject continuity across frames and capturing compelling narratives, we propose an innovative pipeline that automates the extraction of keyframes from open-domain videos. It employs vision-language models to generate descriptive captions, which are then refined by a large language model to ensure narrative flow and coherence. Furthermore, advanced subject masking techniques are applied to isolate and segment the primary subjects. Derived from diverse video sources, including YouTube and existing datasets, OpenStory offers a comprehensive open-domain resource, surpassing prior datasets confined to specific scenarios. With automated captioning instead of manual annotation, high-resolution imagery optimized for subject count per frame, and extensive frame sequences ensuring consistent subjects for temporal modeling, OpenStory establishes itself as an invaluable benchmark. It facilitates advancements in subject-focused story visualization, enabling the training of models capable of comprehending and generating intricate multi-modal narratives from extensive visual and textual inputs.
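As a concrete illustration of the first pipeline stage, here is a tiny keyframe-selection sketch that keeps a frame whenever its mean difference from the last kept keyframe exceeds a threshold. This is a generic heuristic assumed for illustration; it is not the exact extraction rule used to build OpenStory.

```python
import numpy as np

def select_keyframes(frames, threshold=0.15):
    """frames: (T, H, W, 3) uint8 video frames. Returns indices whose mean absolute
       difference from the previously kept keyframe exceeds `threshold` (0-1 scale);
       the first frame is always kept."""
    keyframes = [0]
    last = frames[0].astype(np.float32) / 255.0
    for t in range(1, len(frames)):
        cur = frames[t].astype(np.float32) / 255.0
        if np.abs(cur - last).mean() > threshold:
            keyframes.append(t)
            last = cur
    return keyframes

video = (np.random.rand(30, 64, 64, 3) * 255).astype(np.uint8)  # dummy clip
print(select_keyframes(video))
```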

CVPR 2024@VDU, Oral Presentation

🐖 PiGIE: Proximal Policy Optimization Guided Diffusion for Fine-Grained Image Editing

Tiancheng Li*, Jinxiu Liu*, William Luo, Huajun Chen, Qi Liu   (* equal contribution)

Instruction-based image editing is a challenging task, since it requires manipulating the visual content of images according to complex human-language instructions. When editing an image with tiny objects and complex positional relationships, existing image editing methods cannot locate the accurate region in which to execute the edit. To address this issue, we introduce Proximal Policy Optimization Guided Image Editing (PiGIE), a diffusion model that can accurately edit tiny objects in images with complex scenes. PiGIE incorporates proper noise masks to edit images guided by the target object's attention maps. Unlike traditional image editing approaches based on supervised learning, PiGIE leverages the cosine similarity between the UNet's attention map and human feedback as a reward function and employs Proximal Policy Optimization (PPO) to fine-tune the diffusion model, so that PiGIE can locate the editing regions precisely based on human instructions. On multiple image editing benchmarks, PiGIE exhibits remarkable improvements in both image quality and generalization capability. In particular, PiGIE sets a new baseline for editing fine-grained images with multiple tiny objects, shedding light on future studies of text-guided image editing for tiny objects.
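An illustrative sketch of the reward described above: a denoising step is scored by the cosine similarity between the UNet's cross-attention map for the target object and a human-provided region mask, which a PPO loop can then maximize. The shapes and flattening convention are assumptions; this is not the full PiGIE training code.

```python
import torch
import torch.nn.functional as F

def attention_alignment_reward(attn_map, feedback_mask):
    """attn_map: (H, W) cross-attention weights for the edited object's token.
       feedback_mask: (H, W) binary mask of where the human wants the edit.
       Returns a scalar reward in [-1, 1]."""
    a = attn_map.flatten().float()
    m = feedback_mask.flatten().float()
    return F.cosine_similarity(a.unsqueeze(0), m.unsqueeze(0)).squeeze(0)

attn = torch.rand(32, 32)
mask = (torch.rand(32, 32) > 0.7).float()
print(attention_alignment_reward(attn, mask))  # higher when attention covers the mask
```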

ACM MM 2024, under review

PoseAnimate: Zero-shot High Fidelity Pose Controllable Character Animation

Bingwen Zhu, Fanyi Wang, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Yu-Gang Jiang, Guo-Jun Qi

In this paper, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains three key components: 1) a Pose-Aware Control Module (PACM) that incorporates diverse pose signals into conditional embeddings, preserving character-independent content and maintaining precise alignment of actions; 2) a Dual Consistency Attention Module (DCAM) that enhances temporal consistency and retains character identity and intricate background details; and 3) a Mask-Guided Decoupling Module (MGDM) that refines distinct feature perception, improving animation fidelity by decoupling the character and the background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transitions. Extensive experimental results demonstrate that our approach outperforms state-of-the-art training-based methods in terms of character consistency and detail fidelity.

IJCAI 2024
Paper / project page

Deep Neural Network Compression by Spatial-wise Low-rank Decomposition

Xiaoye Zhu*, Jinxiu Liu*, Ye Liu, Michael Ng, Zihan Ji   (* equal contribution)

In this paper, we introduce a new method for compressing convolutional neural networks (CNNs) based on spatial-wise low-rank decomposition (SLR). The method preserves the higher-order structure of the filter weights and exploits their local low-rankness at different spatial resolutions; it can be implemented as a 1x1 convolution layer and achieves significant reductions in model size and computational cost with minimal accuracy loss. The paper shows the superior performance of the method over state-of-the-art low-rank compression and network pruning methods on several popular CNNs and datasets.
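To illustrate the kind of parameter and FLOP saving that low-rank decomposition of convolutions buys, here is a generic sketch that factorizes a KxK convolution into a (Kx1) and a (1xK) convolution through an inner rank r. The actual SLR scheme decomposes spatial structure differently (and admits a 1x1-convolution implementation); treat this as a hedged stand-in, not the paper's method.

```python
import torch
import torch.nn as nn

def low_rank_conv(in_ch, out_ch, k, rank, stride=1, padding=None):
    """Factorize a KxK conv into (Kx1) -> (1xK) with an inner rank bottleneck."""
    padding = k // 2 if padding is None else padding
    return nn.Sequential(
        nn.Conv2d(in_ch, rank, kernel_size=(k, 1), stride=(stride, 1),
                  padding=(padding, 0), bias=False),
        nn.Conv2d(rank, out_ch, kernel_size=(1, k), stride=(1, stride),
                  padding=(0, padding), bias=True),
    )

full = nn.Conv2d(256, 256, 3, padding=1)
lowrank = low_rank_conv(256, 256, k=3, rank=64)
params = lambda m: sum(p.numel() for p in m.parameters())
print(params(full), params(lowrank))          # the factorized version is ~6x smaller
x = torch.randn(1, 256, 32, 32)
assert full(x).shape == lowrank(x).shape      # same output resolution and channels
```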

Applied Intelligence, under review, positive reviews

Projects
MiniHuggingGPT: A mini multi-modal application like HuggingGPT and MiniGPT-4

Course Design of Deep Learning and Computer Vision, mentored by Prof. Mingkui Tan and Prof. Huiping Zhuang

- Used an LLM as a controller that calls other large-scale models based on natural-language instructions (a minimal routing sketch follows this list).
- Developed a text-based dialogue system built on ChatGLM that can call three large-scale models for image captioning, image generation, and text conversation from natural-language commands via instruction fine-tuning.
- Provided a web UI based on Gradio for easy interaction with the system and showcased various examples of its capabilities.
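A minimal sketch of the routing idea: the LLM is asked which tool a user instruction needs, and the controller dispatches to that tool. The tool names and the stubbed `ask_llm` below are hypothetical placeholders so the sketch runs standalone; the real project used ChatGLM and instruction fine-tuning for this decision.

```python
# Hypothetical tool registry; the real system wires these to captioning,
# generation, and chat models.
TOOLS = {
    "image_captioning": lambda instruction: f"[caption for image in: {instruction!r}]",
    "image_generation": lambda instruction: f"[generated image for: {instruction!r}]",
    "text_conversation": lambda instruction: f"[chat reply to: {instruction!r}]",
}

def ask_llm(instruction):
    """Placeholder for the LLM call that picks a tool; here, a keyword heuristic."""
    text = instruction.lower()
    if "draw" in text or "generate an image" in text:
        return "image_generation"
    if "describe" in text or "caption" in text:
        return "image_captioning"
    return "text_conversation"

def dispatch(instruction):
    tool = ask_llm(instruction)
    return tool, TOOLS[tool](instruction)

print(dispatch("Please draw a corgi surfing a wave"))
```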

Awarded Best Course Design (1/39)

project report

Talk
Introduction to few-shot learning

Honored to be invited by Xinjie Shen, Chairman of the AIA (Artificial Intelligence Association) at SCUT.

In this talk, I explore the use of few-shot learning techniques for relation extraction (RE), based on my research experience. I introduce the basic concepts and principles of few-shot learning, such as the meta-learning framework, the episodic training strategy, and the evaluation metrics. I also discuss recent advances and applications of few-shot learning for RE, such as the use of pre-trained language models, graph neural networks, contrastive learning, and data augmentation. I demonstrate how few-shot learning can improve the performance and robustness of RE models across different datasets and scenarios, and share some of the challenges and future directions of few-shot learning for RE.
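A small sketch of the episodic training setup mentioned above: each episode samples N classes with K support and Q query examples, and the model is scored on classifying the queries from the support set (here with a nearest-centroid, prototypical-style rule). The random features and labels are stand-ins for illustration; they are not tied to any specific dataset from the talk.

```python
import torch

def sample_episode(features, labels, n_way=5, k_shot=1, q_query=5):
    """features: (N, D); labels: (N,). Returns support/query tensors for one episode."""
    classes = torch.randperm(int(labels.max()) + 1)[:n_way]
    support, query, query_y = [], [], []
    for i, c in enumerate(classes):
        idx = torch.nonzero(labels == c, as_tuple=False).squeeze(1)
        idx = idx[torch.randperm(len(idx))[: k_shot + q_query]]
        support.append(features[idx[:k_shot]])        # (k_shot, D) per class
        query.append(features[idx[k_shot:]])          # (q_query, D) per class
        query_y.append(torch.full((q_query,), i))
    return torch.stack(support), torch.cat(query), torch.cat(query_y)

def nearest_centroid_accuracy(support, query, query_y):
    prototypes = support.mean(dim=1)                  # (n_way, D) class centroids
    dists = torch.cdist(query, prototypes)            # (n_query, n_way)
    return (dists.argmin(dim=1) == query_y).float().mean()

feats = torch.randn(600, 64)                          # 20 classes x 30 examples
labels = torch.arange(20).repeat_interleave(30)
s, q, y = sample_episode(feats, labels)
print(nearest_centroid_accuracy(s, q, y))
```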

Slides

Acknowledgement
I have been fortunate to work as a research intern with these wonderful people who generously provided me with guidance and mentorship.

@ South China University of Technology

Prof. Qi Liu
Prof. Ye Liu
Prof. Ziqian Zeng

@ Westlake University

Prof. Guo-Jun Qi
Dr. Liyuan Ma

@ OPPO Research

Prof. Guo-Jun Qi
Dr. Yanhao Zhang
Dr. Fanyi Wang
Dr. Jingwen Su

@ University of Rochester

Prof. Jiebo Luo
Yuan Yao

@ Peking University

Dr. Weijian Luo

@ Fudan University

Bingwen Zhu



© Brandon Jinxiu Liu

Template from Jon Barron.