Jinxiu Liu

Brandon Jinxiu Liu (刘锦绣)

“Stay hungry, Stay foolish” -- Steve Jobs

Hi there! I am a senior undergraduate student at South China University of Technology. I am currently working closely with Prof. Ming-Hsuan Yang and Dr. Yinxiao Li from Google DeepMind. I spent a wonderful summer at the Stanford Vision and Learning Lab, working with Kyle Sargent and Hong-Xing "Koven" Yu under the supervision of Prof. Jiajun Wu. I have also worked at Westlake University & OPPO Research Institute, focusing on image/video generation with diffusion models enhanced by multi-modal LLMs, advised by Prof. Guo-Jun Qi (IEEE Fellow).

My long-term research goal is to build an AI system capable of controllably creating immersive, interactive, and physically grounded 2D/3D/4D virtual worlds with the power of foundation models (LLM/MLLM, Image/Video/3D Diffusion), drawing inspiration especially from human nature and the real needs of designers and artists. I envision my upcoming PhD journey as an entrepreneurial venture, similar to founding a startup, driven by my "North Star" and focused on producing impactful, practical research.

Sincerely looking for PhD positions for fall 2025 admission!

Email: jinxiuliu0628@gmail.com / branodnjinxiuliu@cs.stanford.edu
Tel/Wechat: +86-13951891694

Email / CV / Google Scholar

Sidelights!!! Click the portrait 👆 and enjoy the animation magic from my project Prompt image to Life.

Research Experience

	Stanford Vision and Learning Lab, Stanford University, Research Intern, 03/24 – present. 4D Scene Generation, advised by Prof. Jiajun Wu and Hong-Xing "Koven" Yu.
	Westlake University & OPPO Research Institute, Research Intern, 09/23 – present. Text-driven Video Generation, advised by Prof. Guo-Jun Qi (IEEE Fellow).
	School of Future Technology, SCUT, Research Intern, 12/22 – present. Text-driven Image Generation, advised by Prof. Qi Liu (IEEE Senior Member).

Education Experience

South China University of Technology (SCUT), Guangzhou, China 09/21 – 06/25 (expected)
B.Eng (Majoring in Artificial Intelligence)

Main courses: Deep Learning and Computer Vision (4.0/4.0), Course Design of Deep Learning and Computer Vision (4.0/4.0, Best project), C++ Programming Foundations (4.0/4.0), Python Programming (4.0/4.0), Data Structure (4.0/4.0), Advanced Language Programming Training (4.0/4.0), Artificial Intelligence and 3D Vision (4.0/4.0), Calculus (4.0/4.0), ...

News

One paper is accepted by AAAI 2024

One paper is accepted by VDU@CVPR 2024 as an Oral Presentation

One paper is accepted by IJCAI 2024

One paper is featured in Hugging Face Daily Papers and reposted by AK.

Publications

R3CD: Scene Graph to Image Generation with Relation-aware Compositional Contrastive Control Diffusion

Jinxiu Liu, Qi Liu

In this paper, we introduce R3CD, a new framework for image generation from scene graphs that combines large-scale diffusion models with contrastive control mechanisms. R3CD handles complex or ambiguous relations in multi-object scene graphs and produces realistic, diverse images that match the scene graph specifications. R3CD consists of two main components: (1) SGFormer, a transformer-based encoder that captures both local and global information from scene graphs; and (2) relation-aware diffusion contrastive control, a contrastive learning module that aligns relation features and image features across different levels of abstraction.

AAAI 2024, 4 positive reviews

Paper

Maple: Multi-modal Pre-training for Contextual Instance-aware Visual Generation

Jinxiu Liu, Jinjin Cao, Zilyu Ye, Zhiyang Chen, Ziwei Xuan, Zemin Huang, Mingyuan Zhou, Xiaoqian Shen, Qi Liu, Mohamed Elhoseiny, Guo-Jun Qi

Although current models generate high-quality images from short captions, most of them deteriorate drastically when given long contexts with multiple instances. The instances in the generated images are hard to keep consistent with those in preceding images. To address this issue, we propose Maple, a large-scale open-domain generative multi-modal model. Maple takes a long context of interleaved images and text as guidance to generate images, and keeps the instances in the generated image consistent with the given inputs. In this paper, we propose a training schedule focused on instance-level consistency and a long-context decoupling mechanism that balances contextual information of differing importance.

Tech Report, MLLM Pretraining for Interleaved Image-text Generation based on Openstory++

Prompt image to Life: Training-free Text-driven Image-to-video Generation

Jinxiu Liu, Yuan Yao, Bingwen Zhu, Fanyi Wang, Weijian Luo, Jingwen Su, Yanhao Zhang, Yuxiao Wang, Liyuan Ma, Qi Liu, Jiebo Luo, Guo-Jun Qi

We propose PiLife, a novel training-free I2V framework that leverages only a pre-trained text-to-image diffusion model. PiLife generates videos that are coherent with a given image and aligned with the semantics of a given text. It consists of three main components: (i) a motion-aware diffusion inversion module that embeds motion semantics into the inverted images used as the initial frames; (ii) a motion-aware noise initialization module that employs a motion text attention map to modulate the diffusion process and adjust the motion intensity of different regions with spatial noise; and (iii) a probabilistic cross-frame attention module that uses a geometric distribution to randomly sample a frame and compute attention with it, thereby enhancing motion diversity. Experiments show that PiLife significantly outperforms training-free baselines and is comparable or even superior to some training-based I2V methods.

Tech Report, Best Project in "MetaVerse and VR Course Project"

Paper / project page

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

Zilyu Ye, Jinxiu Liu ‡, Ruotian Peng *, Jinjin Cao, Zhiyang Chen, Ziwei Xuan, Mingyuan Zhou, Xiaoqian Shen, Mohamed Elhoseiny, Qi Liu, Guo-Jun Qi (* contribute equally, ‡ Project Lead)

Recent image generation models excel at creating high-quality images from brief captions. However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts. This inconsistency is largely due to the absence of granular instance feature labeling in existing training datasets. To tackle this issue, we introduce Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text. Furthermore, we develop a training methodology that emphasizes entity-centric image-text generation, ensuring that the models learn to effectively interweave visual and textual information. Specifically, Openstory++ streamlines the process of keyframe extraction from open-domain videos, employing vision-language models to generate captions that are then polished by a large language model for narrative continuity. It surpasses previous datasets by offering a more expansive open-domain resource, which incorporates automated captioning, high-resolution imagery tailored for instance count, and extensive frame sequences for temporal consistency. Additionally, we present Cohere-Bench, a pioneering benchmark framework for evaluating image generation tasks when long multimodal context is provided, including the ability to keep the background, style, and instances in the given context coherent. Compared to existing benchmarks, our work fills critical gaps in multi-modal generation, propelling the development of models that can adeptly generate and interpret complex narratives in open-domain environments. Experiments conducted within Cohere-Bench confirm the superiority of Openstory++ in nurturing high-quality visual storytelling models, enhancing their ability to address open-domain generation tasks.

Tech Report, 🏆 Hugging Face Daily Papers

Paper / Project Page

OpenStory: A Large-Scale Open-Domain Dataset for Subject-Driven Visual Storytelling

Zilyu Ye, Jinxiu Liu ‡, Jinjin Cao, Zhiyang Chen, Ziwei Xuan, Mingyuan Zhou, Qi Liu, Guo-Jun Qi (* contribute equally, ‡ Project Lead)

In this paper, we present OpenStory, a large-scale dataset tailored for training subject-focused story visualization models to generate coherent and contextually relevant visual narratives. Addressing the challenges of maintaining subject continuity across frames and capturing compelling narratives, we propose an innovative pipeline that automates the extraction of keyframes from open-domain videos. It employs vision-language models to generate descriptive captions, which are then refined by a large language model to ensure narrative flow and coherence. Furthermore, advanced subject masking techniques are applied to isolate and segment the primary subjects. Derived from diverse video sources, including YouTube and existing datasets, OpenStory offers a comprehensive open-domain resource, surpassing prior datasets confined to specific scenarios. With automated captioning instead of manual annotation, high-resolution imagery optimized for subject count per frame, and extensive frame sequences ensuring consistent subjects for temporal modeling, OpenStory establishes itself as an invaluable benchmark. It facilitates advancements in subject-focused story visualization, enabling the training of models capable of comprehending and generating intricate multi-modal narratives from extensive visual and textual inputs.

VDU@CVPR 2024, Oral Presentation

Paper

🐖PiGIE: Proximal Policy Optimization Guided Diffusion for Fine-Grained Image Editing

Tiancheng Li, Jinxiu Liu ‡, William Luo, Huajun Chen, Qi Liu (* contribute equally, ‡ Project Lead)

Instruction-based image editing is a challenging task because it requires manipulating the visual content of images according to complex human language instructions. When editing an image with tiny objects and complex positional relationships, existing image editing methods cannot accurately locate the region to edit. To address this issue, we introduce Proximal Policy Optimization Guided Image Editing (PiGIE), a diffusion model that can accurately edit tiny objects in images with complex scenes. PiGIE incorporates proper noise masks to edit images under the guidance of the target object's attention maps. Unlike traditional image editing approaches based on supervised learning, PiGIE uses the cosine similarity between the UNet's attention map and simulated human feedback as a reward function and employs Proximal Policy Optimization (PPO) to fine-tune the diffusion model, so that PiGIE can locate the editing regions precisely based on human instructions. On multiple image editing benchmarks, PiGIE exhibits remarkable improvements in both image quality and generalization capability. In particular, PiGIE sets a new baseline for editing fine-grained images with multiple tiny objects, shedding light on future studies of text-guided image editing for tiny objects.

AI4CC@CVPR 2024

Extended Paper / Original Paper

PoseAnimate: Zero-shot High Fidelity Pose Controllable Character Animation

Bingwen Zhu, Fanyi Wang, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Yu-Gang Jiang, Guo-Jun Qi

In this paper, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains three key components: (1) the Pose-Aware Control Module (PACM) incorporates diverse pose signals into conditional embeddings to preserve character-independent content and maintain precise alignment of actions; (2) the Dual Consistency Attention Module (DCAM) enhances temporal consistency and retains character identity and intricate background details; and (3) the Mask-Guided Decoupling Module (MGDM) refines distinct feature perception, improving animation fidelity by decoupling the character and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transitions. Extensive experimental results demonstrate that our approach outperforms state-of-the-art training-based methods in terms of character consistency and detail fidelity.

IJCAI 2024

Paper / project page

Deep Neural Network Compression by Spatial-wise Low-rank Decomposition

Xiaoye Zhu*, Jinxiu Liu*, Ye Liu, Michael Ng, Zihan Ji (* contribute equally)

In this paper, we introduce a new method for compressing convolutional neural networks (CNNs) based on spatial-wise low-rank decomposition (SLR). The method preserves the higher-order structure of the filter weights and exploits their local low-rankness at different spatial resolutions. It can be implemented as a 1x1 convolution layer and achieves significant reductions in model size and computation cost with minimal accuracy loss. The paper shows that the method outperforms state-of-the-art low-rank compression and network pruning methods on several popular CNNs and datasets.

Best Project (1/96) in "Optimization Method Course Project"

Projects

MiniHuggingGPT: A mini multi-modal application like HuggingGPT and MiniGPT-4

Course Design of Deep Learning and Computer Vision, mentored by Prof. Mingkui Tan and Prof. Huiping Zhuang

- Used an LLM as the interface that calls other large-scale models based on natural language instructions.
- Developed a text-based dialogue system on top of ChatGLM that, through instruction fine-tuning, can call three large-scale models (for image captioning, image generation, and text conversation) using natural language commands.
- Provided a Gradio-based web UI for easy interaction with the system and showcased various examples of its capabilities.

Awarded Best Course Design (1/39)

project report

Talk

Introduction to few-shot learning

Honored to be invited by Xinjie Shen, Chairman of AIA (Artificial Intelligence Association) at SCUT.

In this talk, I explore the use of few-shot learning techniques for relation extraction (RE) based on my research experience. I introduce the basic concepts and principles of few-shot learning, such as the meta-learning framework, the episodic training strategy, and the evaluation metrics. I also discuss some recent advances and applications of few-shot learning for RE, such as the use of pre-trained language models, graph neural networks, contrastive learning, and data augmentation. I demonstrate how few-shot learning can improve the performance and robustness of RE models across different datasets and scenarios, and I share some of the challenges and future directions of few-shot learning for RE.

Slides

Awards

Template from Jon Barron.