|
Brandon Jinxiu Liu (刘锦绣)
“Stay hungry, stay foolish.” -- Steve Jobs
Hi there! I am a senior undergraduate student at South China University of Technology. I am currently working closely with Prof. Ming-Hsuan Yang and
Dr. Yinxiao Li from Google DeepMind. I am working as a research intern at ByteDance, with Dr. Xuefeng Xiao and Dr. Yuxi Ren from the Seed Team.
I was a visiting student at the Stanford AI Lab.
I have also worked at Westlake University & the OPPO Research Institute, focusing on unified multi-modal understanding and generation models,
advised by Prof. Guo-Jun Qi (IEEE Fellow).
My long-term research goal is to build an AI system capable of controllably creating immersive, interactive, and physically grounded 2D/3D/4D virtual worlds with the power of foundation models (LLMs/MLLMs and image/video/3D diffusion models), drawing inspiration in particular from human nature and the real needs of designers and artists.
I envision my upcoming PhD journey as an entrepreneurial venture—similar to founding a startup—driven by my "North Star" and focused on producing impactful, practical research.
Email: jinxiuliu0628@gmail.com  /  branodnjinxiuliu@cs.stanford.edu
Tel/Wechat: +86-13951891694
Email / CV / Google Scholar
|
Sidelights!!! Click the portrait 👆 and enjoy the animation magic from my project Prompt image to Life.
|
Research Statement
Recently, I have been exploring how to develop an efficient and interactive world simulator with versatile applications in gaming, robotics, architecture, and physical simulation. This research is structured around three sub-goals, culminating in a dual-engine-driven simulator designed to optimize computational efficiency and interaction capabilities, even in resource-constrained and large-scale scenarios.
Sub-Goal 1: Efficient Large Foundation Model
A key challenge in generative models lies in their high computational demands. My research aims to address this by creating efficient foundation models that deliver fast inference, use fewer parameters, and maintain high-quality outputs. Techniques such as distillation, pruning, and quantization will be employed to optimize autoregressive video generation models. These advancements are particularly crucial for real-time applications in AR and XR, where low latency and scalability across resolutions are essential for delivering immersive experiences across diverse devices.
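To make the distillation piece of this sub-goal concrete, here is a minimal, generic knowledge-distillation objective in PyTorch: a classification-style blend of a soft teacher-matching term and a hard-label term, shown only to illustrate the mechanics. The video-generation work described above instead distills few-step diffusion/autoregressive samplers, where the objective matches outputs or trajectories rather than class logits.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Classic knowledge distillation: KL between temperature-scaled student and
    teacher distributions, blended with the usual hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random tensors standing in for model outputs.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
```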
Sub-Goal 2: World Simulator based Generative Engine
This sub-goal focuses on building a world simulator capable of multi-modal, controllable visual generation while simulating complex world dynamics. The simulator will enable adaptive, interactive, and procedurally generated environments tailored to applications in gaming, robotics, and scientific research (AI4SCI). By integrating symbolic control and spatial reasoning, the system will dynamically adapt to user inputs and real-world constraints, enhancing interactivity, realism, and user engagement.
Sub-Goal 3: Inverse Rendering and Physics Engine Based Generative Engine
Bridging the gap between generative diversity and structural precision is critical for creating high-quality, editable 3D and 4D assets. My work will leverage inverse rendering and integrate physics engines, such as Blender, to produce assets with exceptional realism and precision. Furthermore, combining large language models (LLMs), vision-language models (VLMs), and agent-based symbolic frameworks will enable a flexible pipeline for asset generation, significantly reducing manual labor and enhancing creative workflows in industries like gaming and filmmaking.
The Final Goal and Vision: Interactive Dual-Engine-Driven World Simulator and AIGC System
By synthesizing these three sub-goals, my research aims to create a dual-engine-driven simulator that merges generative diversity with structural precision. This simulator will support dynamic, interactive, and highly realistic virtual environments, offering scalable solutions for applications in gaming, virtual testing, robotics, and AI-driven simulations. Ultimately, this work aspires to push the boundaries of interactive world simulation, driving both academic advancements and transformative real-world innovations.
|
|
Nex-AGI TEAM, Research Scientist 2025/07 – Present
Focusing on efficient post-training algorithms for LLMs/MLLMs and efficient agentic systems.
|
|
ByteDance (VideoGen Efficient Post-training Team), Research Intern 2025/01 – 2025/07
Efficient video generation.
|
|
Google DeepMind / UC Merced, Research Collaborator, supervised by Prof. Ming-Hsuan Yang and Dr. Yinxiao Li from Google DeepMind 2024/08 – 2024/11
High-quality video generation.
|
|
Stanford AI Lab, Stanford University, Research Intern 2024/06 – 2024/08
Generative Spatial Intelligence.
|
|
Westlake University & OPPO Research Institute, Research Intern 2023/09 – 2024/06
Text-driven video generation, advised by Prof. Guo-Jun Qi (IEEE Fellow).
|
|
Education Experience
South China University of Technology (SCUT), Guangzhou, China 2021/09/21 – 2025/06/25
B.Eng (Majoring in Artificial Intelligence)
Main courses: Deep Learning and Computer Vision (4.0/4.0), Course Design of Deep Learning and Computer Vision (4.0/4.0, Best Project),
C++ Programming Foundations (4.0/4.0), Python Programming (4.0/4.0), Data Structure (4.0/4.0),
Advanced Language Programming Training (4.0/4.0), Artificial Intelligence and 3D Vision (4.0/4.0), Calculus (4.0/4.0), ...
|
|
News
One paper is accepted by AAAI 2024
One paper is accepted by VDU@CVPR 2024 as Oral Presentation
One paper is accepted by IJCAI 2024
One paper is featured in Hugging Face Daily Papers and reposted by AK.
One paper is accepted by CVPR 2025. See you in Nashville!
|
|
Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction
Nex-AGI TEAM (including Jinxiu Liu),
The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms -- from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environments for grounded trajectory synthesis. We train Nex-N1 on the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.
Tech Report and Open Source Ecosystem Project
Project page /
Paper
|
|
Streaming Autoregressive Video Generation via Diagonal Distillation
Jinxiu Liu*,
Xuanming Liu*,
Kangfu Mei,
Yandong Wen,
Ming-Hsuan Yang,
Weiyang Liu,
Large-scale pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency–quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in just 2.61 seconds (up to 31 FPS), achieving a 277.3× speedup over the undistilled model and doubling the acceleration ratio of the state-of-the-art (140×) without sacrificing visual quality.
ICLR 2026 under review
Project page /
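A minimal sketch of the "more steps early, fewer steps later" schedule described in the Diagonal Distillation abstract above, assuming a generic chunk-wise denoiser; the linear schedule, the conditioning interface, and the chunk shapes are placeholders of my own, not the paper's actual implementation.

```python
import torch

def diagonal_step_schedule(num_chunks, first_steps=8, last_steps=2):
    """Asymmetric step allocation: earlier chunks get more denoising steps,
    later chunks fewer, interpolated linearly (my reading of the design;
    the exact schedule in the paper may differ)."""
    if num_chunks == 1:
        return [first_steps]
    return [
        round(first_steps + (last_steps - first_steps) * i / (num_chunks - 1))
        for i in range(num_chunks)
    ]

def generate_streaming(denoise_step, num_chunks, chunk_shape):
    """Sequentially synthesize chunks; each chunk starts from noise, is denoised
    for its allotted steps, and conditions the next chunk so appearance and
    motion context propagate down the 'diagonal'."""
    prev_chunk, video = None, []
    for steps in diagonal_step_schedule(num_chunks):
        x = torch.randn(chunk_shape)           # start from pure noise
        for t in reversed(range(steps)):       # few-step denoising loop
            x = denoise_step(x, t, steps, context=prev_chunk)
        prev_chunk = x                         # condition the next chunk on this one
        video.append(x)
    return torch.cat(video, dim=0)             # concatenate along the time axis

# Placeholder denoiser: a real run would call the distilled video model here.
fake_denoiser = lambda x, t, total, context=None: x * 0.9
clip = generate_streaming(fake_denoiser, num_chunks=4, chunk_shape=(4, 3, 64, 64))
```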
|
|
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
Jinxiu Liu*,
Shaoheng Lin*,
Yinxiao Li,
Ming-Hsuan Yang,
The increasing demand for immersive AR/VR applications and spatial intelligence has heightened the need to generate high-quality scene-level and 360° panoramic video. However, most video diffusion models are constrained by limited resolution and aspect ratio, which restricts their applicability to scene-level dynamic content synthesis. In this work, we propose DynamicScaler, which addresses these challenges by enabling spatially scalable and panoramic dynamic scene synthesis that preserves coherence across panoramic scenes of arbitrary size. Specifically, we introduce an Offset Shifting Denoiser that enables efficient, synchronous, and coherent denoising of panoramic dynamic scenes with a fixed-resolution diffusion model through a seamlessly rotating window, ensuring smooth boundary transitions and consistency across the entire panoramic space while accommodating varying resolutions and aspect ratios. Additionally, we employ a Global Motion Guidance mechanism to ensure both local detail fidelity and global motion continuity. Extensive experiments demonstrate that our method achieves superior content and motion quality in panoramic scene-level video generation, offering a training-free, efficient, and scalable solution for immersive dynamic scene creation with constant VRAM consumption regardless of the output video resolution.
CVPR 2025
Paper /
Code /
Project page /
Huggingface Daily Papers
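The sketch below illustrates the rotating-window idea behind DynamicScaler's Offset Shifting Denoiser as I would summarize it: a fixed-resolution denoiser is slid over the panoramic latent with horizontal wrap-around, and overlapping predictions are averaged. The window size, shift, and aggregation weights here are illustrative assumptions, not the released code.

```python
import torch

def shifted_window_denoise(panorama_latent, denoise_fn, window_w=64, shift=16):
    """One denoising pass over a panoramic latent using a rotating window:
    the fixed-size window slides with horizontal wrap-around so the left/right
    seam is treated like any interior region, and overlapping predictions are
    averaged (a simplified reading of the Offset Shifting Denoiser)."""
    _, _, H, W = panorama_latent.shape
    out = torch.zeros_like(panorama_latent)
    count = torch.zeros_like(panorama_latent)
    for start in range(0, W, shift):
        cols = [(start + i) % W for i in range(window_w)]   # wrap-around indices
        window = panorama_latent[..., cols]
        denoised = denoise_fn(window)                        # fixed-resolution model call
        out[..., cols] += denoised
        count[..., cols] += 1.0
    return out / count.clamp(min=1.0)

# Toy usage: an identity "denoiser" stands in for the pretrained diffusion model.
latent = torch.randn(1, 4, 32, 256)
denoised = shifted_window_denoise(latent, denoise_fn=lambda w: w)
```

In practice the window offset would also rotate across denoising timesteps, so every column of the panorama is eventually covered from several offsets.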
|
|
R3CD: Scene Graph to Image Generation with Relation-aware Compositional Contrastive Control Diffusion
Jinxiu Liu,
Qi Liu,
In this paper, we introduce R3CD, a new framework for generating images from scene graphs with large-scale diffusion models and contrastive control mechanisms.
R3CD can handle complex or ambiguous relations in multi-object scene graphs and produce realistic and diverse images that match the scene graph specifications.
R3CD consists of two main components: (1) SGFormer, a transformer-based encoder that captures both local and global information from scene graphs; and (2) relation-aware diffusion contrastive control, a contrastive learning module that aligns relation features and image features across different levels of abstraction.
AAAI 2024, 4 positive reviews
Paper
|
|
XYZFlow: Scaling Multidimensional Shortcut Flows for Efficient Generative Modeling
Jinxiu Liu*,
Xuanming Liu*,
Kangfu Mei,
Yandong Wen,
Weiyang Liu,
The pursuit of high-fidelity image generation faces a fundamental trade-off between sampling speed and output quality. While diffusion models excel in quality, their iterative nature incurs high computational costs. Current efficient methods primarily focus on distilling pre-trained models into few-step samplers; however, this distillation process is challenging and heavily reliant on teacher model quality. In this paper, we introduce XYZFlow, a novel framework that rethinks this paradigm through multidimensional scaling of flow matching. Unlike MeanFlow's single-step deterministic mapping, our approach intensively scales the expressive power of generative models by enhancing the uniqueness and learnability of probability paths through structured, multidimensional conditioning. Theoretically, we frame autoregressive modeling as an implicit flow straightening mechanism, where expanding contextual constraints reduce trajectory ambiguity. XYZFlow implements this via two orthogonal scaling dimensions: (1) temporal scaling through non-Markovian conditioning on the full denoising history, and (2) spatial scaling through next-shortcut prediction, where patches are generated sequentially using the complete denoising trajectories of preceding patches as priors. This multidimensional conditioning constructs a high-dimensional coordinate system for probability flows, enforcing mapping uniqueness. Extensive evaluations demonstrate that XYZFlow achieves state-of-the-art performance, with a 7.2–8.5× speedup over teachers while maintaining competitive FID. Notably, XYZFlow-B (172M) outperforms the one-step model MeanFlow-XL/2+ (676M), demonstrating that our structured shortcut design establishes a more parameter-efficient scaling dimension and achieves superior quality-latency trade-offs compared to simply enlarging models or compressing sampling steps.
CVPR 2026 under review
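For readers unfamiliar with flow matching, the toy PyTorch snippet below shows the standard linear-path objective and where XYZFlow-style extra conditioning would enter; TinyVelocityNet and the hist embedding are placeholders of my own, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Toy velocity field v_theta(x_t, t, history): a stand-in backbone that
    shows where additional (e.g., history) conditioning enters."""
    def __init__(self, dim, hist_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + hist_dim, 128), nn.SiLU(), nn.Linear(128, dim)
        )

    def forward(self, x_t, t, hist):
        return self.net(torch.cat([x_t, t, hist], dim=-1))

def flow_matching_loss(model, x1, hist):
    """Standard linear-path flow matching: x_t = (1 - t) * x0 + t * x1 with
    x0 ~ N(0, I); the regression target is the velocity x1 - x0. The only
    non-standard part here is the extra `hist` input, my shorthand for
    non-Markovian conditioning on the denoising history."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = model(x_t, t, hist)
    return ((v_pred - v_target) ** 2).mean()

# Toy usage with random data standing in for image latents and history features.
model = TinyVelocityNet(dim=16, hist_dim=8)
loss = flow_matching_loss(model, x1=torch.randn(32, 16), hist=torch.randn(32, 8))
loss.backward()
```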
|
|
SymbOmni: Evolving Agentic Omni Models via Symbolic Concept Learning
Jinxiu Liu*,
Jianru Li*,
Tanqing Kuang*,
Xuanming Liu,
Kangfu Mei,
Yandong Wen,
Weiyang Liu,
Visual generation has witnessed rapidly expanding applications across diverse domains, ranging from text-to-image/video synthesis to interactive creation. However, prevailing monolithic models are fundamentally limited by an inability for cumulative learning and self-evolution (i.e., the "perpetual novice" problem), as they lack mechanisms to structure experiences into reusable knowledge and must instead rely on brittle, "from-scratch" reasoning, leading to poor compositional generalization and inefficient knowledge retention. To address these limitations, we introduce SymbOmni, an agentic omni model designed for cumulative evolution through Symbolic Concept Learning. Our architecture centers on the Symbolic Concept Box (CB)---an optimizable memory space that abstracts low-level operations into reusable Symbolic Workflow Instructions (SWIs). SymbOmni operates through an induction-transduction cycle: experiences are abstracted into symbolic concepts (induction), which are then strategically composed to solve novel tasks (transduction). This cycle is advanced by verbalized backpropagation, where a dual-verifier system generates language-based optimization signals to drive continuous self-improvement without gradient-based fine-tuning. Extensive evaluations demonstrate that SymbOmni achieves: (I) Superior Performance, significantly outperforming agent systems for interactive creation and leading closed-source models in both image quality and task success rates; (II) Enhanced Efficiency, reducing token consumption by over 30% while maintaining competitive generation quality; (III) Continuous Learning, demonstrating genuine cumulative improvement in online learning scenarios on ComfyBench, establishing a new state-of-the-art.
CVPR 2026 under review
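A deliberately tiny, dependency-free sketch of the induction/transduction loop around a concept memory, using keyword overlap in place of the learned retrieval and verifier-driven optimization described above; all names and the matching rule are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptBox:
    """Toy stand-in for the Symbolic Concept Box: a keyword-indexed store of
    reusable workflow instructions. Retrieval here is naive word overlap; the
    real system optimizes this memory with verifier feedback (verbalized
    backpropagation), which is not modeled in this sketch."""
    concepts: dict = field(default_factory=dict)

    def induce(self, task, workflow):
        # Induction: abstract a solved task into a reusable symbolic workflow.
        self.concepts[task.lower()] = workflow

    def transduce(self, new_task):
        # Transduction: reuse the stored workflow whose key best overlaps
        # with the new task description.
        if not self.concepts:
            return None
        words = set(new_task.lower().split())
        best = max(self.concepts, key=lambda k: len(words & set(k.split())))
        return self.concepts[best]

box = ConceptBox()
box.induce("inpaint object removal", "segment -> mask -> diffusion inpaint")
print(box.transduce("remove the object and inpaint the hole"))
```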
|
|
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling
Zilyu Ye *,
Jinxiu Liu * ‡ ,
Ruotian Peng *
Jinjin Cao
Zhiyang Chen
Ziwei Xuan
Mingyuan Zhou
Xiaoqian Shen
Mohamed Elhoseiny
Qi Liu,
Guo-Jun Qi   (* contribute equally, ‡ Project Lead)
Recent image generation models excel at creating high-quality images from brief captions.
However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts.
This inconsistency is largely due to the absence of granular instance feature labeling in existing training datasets.
To tackle these issues, we introduce Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text.
Furthermore, we develop a training methodology that emphasizes entity-centric image-text generation, ensuring that the models learn to effectively interweave visual and textual information.
Specifically, Openstory++ streamlines the process of keyframe extraction from open-domain videos,
employing vision-language models to generate captions that are then polished by a large language model for narrative continuity.
It surpasses previous datasets by offering a more expansive open-domain resource, which incorporates automated captioning,
high-resolution imagery tailored for instance count, and extensive frame sequences for temporal consistency.
Additionally, we present Cohere-Bench, a pioneering benchmark framework for evaluating image generation tasks when a long multimodal context is provided,
including the ability to keep the background, style, and instances in the given context coherent.
Compared to existing benchmarks, our work fills critical gaps in multi-modal generation,
propelling the development of models that can adeptly generate and interpret complex narratives in open-domain environments.
Experiments conducted within Cohere-Bench confirm the superiority of Openstory++ in nurturing high-quality visual storytelling models,
enhancing their ability to address open-domain generation tasks.
Tech Report, 🏆 Hugging Face Daily Papers
Paper /
Project Page (Code & Dataset) /
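A schematic of the kind of curation pipeline the Openstory++ abstract describes (keyframe extraction, VLM captioning, LLM polishing for narrative continuity, instance segmentation), with every callable left as a hypothetical placeholder rather than the released tooling.

```python
def build_story_sample(video_path, extract_keyframes, caption_with_vlm,
                       polish_with_llm, segment_instances):
    """Sketch of an Openstory++-style curation flow: keyframes are pulled from
    an open-domain video, captioned by a vision-language model, the captions
    are rewritten by an LLM for narrative continuity, and instance-level masks
    are attached. All four callables are hypothetical placeholders."""
    frames = extract_keyframes(video_path)                 # e.g. shot-boundary sampling
    raw_captions = [caption_with_vlm(f) for f in frames]   # per-frame descriptions
    story_captions = polish_with_llm(raw_captions)         # enforce a coherent narrative
    return [
        {"image": f, "caption": c, "instances": segment_instances(f)}
        for f, c in zip(frames, story_captions)
    ]

# Toy run with stand-in callables.
sample = build_story_sample(
    "clip.mp4",
    extract_keyframes=lambda p: ["frame0", "frame1"],
    caption_with_vlm=lambda f: f"caption of {f}",
    polish_with_llm=lambda caps: caps,
    segment_instances=lambda f: [],
)
```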
|
|
OpenStory: A Large-Scale Open-Domain Dataset for Subject-Driven Visual Storytelling
Zilyu Ye *,
Jinxiu Liu * ‡ ,
Jinjin Cao
Zhiyang Chen
Ziwei Xuan
Mingyuan Zhou
Qi Liu,
Guo-Jun Qi   (* contribute equally, ‡ Project Lead)
In this paper, we present OpenStory, a large-scale dataset tailored for training subject-focused story visualization models to
generate coherent and contextually relevant visual narratives. Addressing the challenges of maintaining subject continuity across frames
and capturing compelling narratives, we propose an innovative pipeline that automates the extraction of keyframes from open-domain videos.
It ingeniously employs vision-language models to generate descriptive captions, which are then refined by a large language model to
ensure narrative flow and coherence. Furthermore, advanced subject masking techniques are applied to isolate and segment the primary subjects.
Derived from diverse video sources, including YouTube and existing datasets, OpenStory offers a comprehensive open-domain resource, surpassing prior datasets confined to
specific scenarios. With automated captioning instead of manual annotation, high-resolution imagery optimized for subject count per frame, and extensive frame sequences
ensuring consistent subjects for temporal modeling, OpenStory establishes itself as an invaluable benchmark. It facilitates advancements in subject-focused story visualization,
enabling the training of models capable of comprehending and generating intricate multi-modal narratives from extensive visual and textual inputs.
CVPR 2024@VDU, Oral Presentation
Paper /
|
|
🐖PiGIE: Proximal Policy Optimization Guided Diffusion for Fine-Grained Image Editing
Tiancheng Li*,
Jinxiu Liu* ‡,
William Luo,
Huajun Chen,
Qi Liu,
(* contribute equally, ‡ Project Lead)
Instruction-based image editing is a challenging task, since it requires manipulating the visual content of images according to complex human language instructions.
When editing an image with tiny objects and complex positional relationships, existing image editing methods cannot locate the accurate region in which to execute the edit.
To address this issue, we introduce Proximal Policy Optimization Guided Image Editing (PiGIE), a diffusion model that can accurately edit tiny objects in images with complex scenes.
PiGIE incorporates proper noise masks to edit images under the guidance of the target object's attention maps.
Unlike traditional image editing approaches based on supervised learning, PiGIE uses the cosine similarity between the UNet's attention map
and simulated human feedback as a reward function and employs Proximal Policy Optimization (PPO) to fine-tune the diffusion model, so that PiGIE can locate the editing
regions precisely based on human instructions. On multiple image editing benchmarks, PiGIE exhibits remarkable improvements in both
image quality and generalization capability. In particular, PiGIE sets a new baseline for editing fine-grained images with multiple tiny objects,
shedding light on future studies of text-guided image editing for tiny objects.
AI4CC@CVPR 2024
Extended Paper /
Original Paper
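A toy PyTorch version of the reward signal described in the PiGIE abstract above: cosine similarity between a cross-attention map and a mask over the intended edit region. Shapes and normalization are assumptions for illustration, and the PPO update that consumes this reward is not shown.

```python
import torch
import torch.nn.functional as F

def attention_region_reward(attn_map, target_mask):
    """Score how well the UNet attention for the edited concept lands on the
    (simulated) human-feedback region: flatten both maps and take their
    cosine similarity, one scalar reward per sample in [-1, 1]."""
    a = attn_map.flatten(start_dim=1)
    m = target_mask.flatten(start_dim=1).float()
    return F.cosine_similarity(a, m, dim=1)

# Toy usage: a 16x16 attention map against a mask covering the target object.
attn = torch.rand(2, 16, 16)
mask = torch.zeros(2, 16, 16)
mask[:, 4:8, 4:8] = 1.0
reward = attention_region_reward(attn, mask)   # higher when attention hits the region
```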
|
|
PoseAnimate: Zero-shot High Fidelity Pose Controllable Character Animation
Bingwen Zhu,
Fanyi Wang,
Peng Liu,
Jingwen Su,
Jinxiu Liu,
Yanhao Zhang,
Zuxuan Wu,
Yu-Gang Jiang,
Guo-Jun Qi,
In this paper, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains
three key components: 1) a Pose-Aware Control Module (PACM) that incorporates diverse pose signals into conditional embeddings
to preserve character-independent content and maintain precise alignment of actions; 2) a Dual Consistency Attention Module
(DCAM) that enhances temporal consistency and retains character identity and intricate background details; and 3) a Mask-Guided
Decoupling Module (MGDM) that refines distinct feature perception, improving animation fidelity by decoupling the character
and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transitions. Extensive
experimental results demonstrate that our approach
outperforms state-of-the-art training-based methods in terms of character consistency and detail fidelity.
IJCAI 2024
Paper /
Project page
|
|
Deep Neural Network Compression by Spatial-wise Low-rank Decomposition
Xiaoye Zhu*,
Jinxiu Liu*,
Ye Liu,
Michael Ng,
Zihan Ji,
(* contribute equally)
In this paper, we introduce a new method for compressing convolutional neural networks (CNNs) based on spatial-wise low-rank decomposition (SLR). The method preserves the higher-order structure of the filter weights and exploits their local low-rankness at different spatial resolutions; it can be implemented as a 1x1 convolution layer and achieves significant reductions in model size and computational cost with minimal accuracy loss. The paper shows the superior performance of the method over state-of-the-art low-rank compression and network pruning methods on several popular CNNs and datasets.
Best Project (1/96) in "Optimization Method Course Project"
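For context, the snippet below shows the textbook low-rank factorization of a convolution (truncated SVD realized as a k x k conv into a low-rank bottleneck followed by a 1x1 conv) that SLR builds on; it is a generic construction, not the spatial-wise variant itself.

```python
import torch
import torch.nn as nn

def low_rank_factorize(conv: nn.Conv2d, rank: int) -> nn.Sequential:
    """Generic low-rank factorization of a conv layer: truncated SVD of the
    (C_out, C_in*k*k) weight matrix, realized as a k x k conv into `rank`
    channels followed by a 1x1 recombination conv."""
    W = conv.weight.data                                   # (C_out, C_in, k, k)
    C_out, C_in, kh, kw = W.shape
    U, S, Vh = torch.linalg.svd(W.reshape(C_out, -1), full_matrices=False)
    A = Vh[:rank].reshape(rank, C_in, kh, kw)              # basis filters
    B = (U[:, :rank] * S[:rank]).reshape(C_out, rank, 1, 1)  # 1x1 recombination
    first = nn.Conv2d(C_in, rank, (kh, kw), stride=conv.stride,
                      padding=conv.padding, bias=False)
    second = nn.Conv2d(rank, C_out, 1, bias=conv.bias is not None)
    first.weight.data.copy_(A)
    second.weight.data.copy_(B)
    if conv.bias is not None:
        second.bias.data.copy_(conv.bias.data)
    return nn.Sequential(first, second)

# The factorized pair approximates the original layer with far fewer parameters.
conv = nn.Conv2d(64, 128, 3, padding=1)
compressed = low_rank_factorize(conv, rank=16)
x = torch.randn(1, 64, 32, 32)
print(conv(x).shape, compressed(x).shape)   # both (1, 128, 32, 32)
```

With rank 16, the factorized pair uses roughly 11K weights versus about 74K for the original 3x3 layer, which is the kind of reduction that spatial-wise decomposition then refines by exploiting local low-rankness across spatial resolutions.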
|
|
MiniHuggingGPT: A mini multi-modal application like HuggingGPT and MiniGPT-4
Course Design of Deep Learning and Computer Vision, mentored by
Prof. Mingkui Tan and
Prof. Huiping Zhuang,
- Used an LLM as a controller that calls other large-scale models based on natural-language instructions (a toy dispatcher sketch follows below).
- Developed a text-based dialogue system based on ChatGLM that can invoke three large-scale models for image captioning, image generation, and text conversation via natural-language commands, enabled by instruction fine-tuning.
- Provided a web UI based on Gradio for easy interaction with the system and showcased various examples of its capabilities.
Awarded the Best Course Design (1/39)
project report
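The toy router below mirrors the dispatch idea from the bullets above: a controller decides which of three capabilities to invoke from a natural-language command. Simple keyword rules stand in for the instruction-tuned ChatGLM controller, and the tool functions are placeholders for the real captioning, text-to-image, and dialogue models.

```python
def route_command(command, tools):
    """Toy stand-in for the LLM controller: pick one of three tools from a
    natural-language command. The real project lets the fine-tuned LLM make
    this decision; here keyword rules approximate it."""
    text = command.lower()
    if any(w in text for w in ("draw", "generate an image", "paint")):
        return tools["image_generation"](command)
    if any(w in text for w in ("describe", "caption", "what is in")):
        return tools["image_captioning"](command)
    return tools["chat"](command)

# Stand-in tools; the course project wired these to real captioning,
# text-to-image, and dialogue models behind a Gradio web UI.
tools = {
    "image_generation": lambda c: f"[image generated for: {c}]",
    "image_captioning": lambda c: "[caption for the uploaded image]",
    "chat": lambda c: f"[chat reply to: {c}]",
}
print(route_command("Please draw a cat surfing a wave", tools))
```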
|
|
Introduction to few-shot learning
Honored to be invited by Xinjie Shen, Chairman of the AIA (Artificial Intelligence Association) at SCUT.
In this talk, I explore the use of few-shot learning techniques for RE based on my research experience. I introduce the basic concepts and principles of few-shot learning, such as the meta-learning framework, the episodic training strategy, and the evaluation metrics. I also discuss some recent advances and applications of few-shot learning for RE, such as the use of pre-trained language models, graph neural networks, contrastive learning, and data augmentation. I demonstrate how few-shot learning can improve the performance and robustness of RE models on different datasets and scenarios. I also share some of the challenges and future directions of few-shot learning for RE.
Slides
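As a concrete companion to the episodic training strategy mentioned in the talk summary above, here is a minimal N-way K-shot episode sampler; the dataset format and split sizes are assumptions for illustration only.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=5):
    """Sample one N-way K-shot episode, the basic unit of episodic training in
    meta-learning: pick N classes, then K support and Q query examples per
    class. `dataset` is assumed to be a list of (example, label) pairs."""
    by_class = defaultdict(list)
    for example, label in dataset:
        by_class[label].append(example)
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for label in classes:
        examples = random.sample(by_class[label], k_shot + q_queries)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Toy dataset: 20 classes with 10 examples each.
toy = [(f"item_{c}_{i}", c) for c in range(20) for i in range(10)]
support, query = sample_episode(toy, n_way=5, k_shot=1, q_queries=5)
```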
|
© Brandon Jinxiu Liu
|