Overview
In this assignment, you will design, implement, and evaluate a Personalised Multimodal Agent System built on your own
knowledge base.
The goal is not simply to create a chatbot that runs, but to investigate how system design choices
affect retrieval quality, reasoning ability, and user interaction.
The system must integrate:
- a personalised knowledge base
- at least two modalities
- a retrieval/indexing pipeline
- an agent framework for tool orchestration
- a quantitative evaluation of system design choices
This assignment emphasises original system thinking. High marks will be awarded for clear design
hypotheses, meaningful technical decisions, strong comparisons between alternative system variants, and
evidence-based analysis.
Learning Objectives
- Construct a personalised multimodal knowledge base.
- Design and justify retrieval and indexing strategies.
- Build an agent workflow that uses tools and state.
- Compare alternative system designs through evaluation.
- Analyse trade-offs between retrieval quality and efficiency.
- Communicate system design and findings clearly.
Task Description
You will build a personalised multimodal agent system that answers questions over a knowledge base
derived from your own curated data.
Your system must go beyond a basic chatbot. It should be framed around a clear technical question,
design hypothesis, or innovation point.
Examples include:
- Is text-only indexing sufficient for multimodal QA?
- Can image-only embeddings support retrieval?
- Does hybrid multimodal retrieval outperform single-space retrieval?
- Does agentic routing improve complex queries?
- Does memory help multi-turn personalised interactions?
- What is gained by separating retrieval, planning, and answering?
Treat the assignment as a mini systems research project.
Minimum Technical Requirements
01. Personalised Knowledge Base
Construct a knowledge base using curated, genuinely personalised content.
- Personal study materials
- Course notes
- Research papers & figures
- Travel memories
- Recipe collections
- Shopping records
- Project documents
- Hobby collections
02. At Least Two Modalities
Integrate multiple types of data into the pipeline.
- Text + Image
- Text + Audio transcript
- Text + Chart
- Image + Metadata
- Doc text + Figures
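A knowledge-base entry pairing two modalities can be as simple as a record that keeps the text, a pointer to the image, and metadata together. A minimal sketch (field names, file paths, and values here are illustrative, not a required schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KBEntry:
    """One knowledge-base item pairing text with an image and metadata.

    Field and file names are illustrative assumptions, not a required schema.
    """
    doc_id: str
    text: str                  # e.g. a recipe or a course note
    image_path: Optional[str]  # path to an associated photo or figure
    metadata: dict = field(default_factory=dict)

entry = KBEntry(
    doc_id="miso_soup",
    text="Miso soup: simmer dashi, add tofu and wakame (about 15 minutes).",
    image_path="images/miso_soup.jpg",
    metadata={"tags": ["recipe", "japanese"], "minutes": 15},
)
```

Keeping metadata alongside both modalities makes later choices (separate indices, caption indexing, filtering by tags) easier to implement and ablate.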
03. Retrieval Component
A structured approach to fetching relevant context.
- Vector databases
- Multimodal embeddings
- Separate indices
- OCR / caption indexing
- Hybrid retrieval
- Ranking or fusion
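As one illustration of the "ranking or fusion" option (a sketch, not a required design), reciprocal rank fusion (RRF) can merge ranked lists from, say, a text index and an image index; the document ids below are hypothetical:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids.

    Each document scores the sum of 1 / (k + rank) across every list it
    appears in; higher is better. k = 60 is a commonly used default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 hits from a text index and an image index.
text_hits = ["miso_soup", "ramen", "tofu_salad"]
image_hits = ["tofu_salad", "miso_soup", "curry"]
fused = rrf_fuse([text_hits, image_hits])
print(fused)  # "miso_soup" ranks first: it scores well in both lists
```

Documents ranked highly in both lists rise to the top, which is the behaviour a hybrid-retrieval ablation would compare against single-index baselines.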
04. Agent Framework
Orchestrate steps logically and effectively.
- Query routing
- Retrieval planning
- Memory & state
- Tool selection
- Task decomposition
- Verification stages
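To make "query routing" concrete, here is a minimal keyword-based sketch; a real router might instead be an LLM classifier, and the tool names returned here are hypothetical:

```python
def route_query(query: str) -> str:
    """Choose a tool for a query with simple keyword rules.

    Purely illustrative: the keyword lists and tool names are assumptions,
    and a production router would likely use an LLM or learned classifier.
    """
    q = query.lower()
    if any(word in q for word in ("image", "photo", "picture", "shows")):
        return "image_retriever"   # cross-modal lookups
    if any(word in q for word in ("fastest", "compare", "combine")):
        return "planner"           # multi-hop questions need decomposition
    return "text_retriever"        # default: plain text retrieval

print(route_query("Find a recipe whose image shows a soup with tofu"))
# image_retriever
```

A routing ablation then simply swaps this function for a pass-through that always returns the default tool.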
05. Quantitative Evaluation
Rigorous comparisons using defined metrics.
- Plain LLM baselines
- Agent-based tracking
- Retrieval ablations
Evaluation Requirements
Your evaluation must include a benchmark suite covering different query types. Rather than showing a few
ad hoc examples, you should design a small but structured test set.
Required Query Families
You must evaluate at least four query families:
- Factual Retrieval: direct retrieval of stored knowledge (e.g., "How long does the miso soup recipe take?")
- Cross-Modal Retrieval: queries requiring information from different modalities (e.g., "Find a recipe whose image shows a soup with tofu.")
- Analytical / Multi-Hop Synthesis: questions that require combining multiple pieces of evidence (e.g., "I have eggs, tomatoes, and onions. What can I cook, and which option is the fastest?")
- Conversational Follow-Up / Personalised Context: multi-turn or memory-sensitive queries (e.g., "I'm allergic to peanuts and I want something quick tonight.")
For each family, you should provide at least one test case and analyse where each system variant
succeeds or fails.
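One way (among many) to organise such a benchmark suite is a list of labelled test cases, one or more per family. The file names and expected documents below are hypothetical:

```python
# Hypothetical benchmark suite: each case records its query family, the
# query itself, and the document an answer should be grounded in.
BENCHMARK = [
    {"family": "factual",
     "query": "How long does the miso soup recipe take?",
     "expected_doc": "miso_soup.md"},
    {"family": "cross_modal",
     "query": "Find a recipe whose image shows a soup with tofu.",
     "expected_doc": "tofu_soup.jpg"},
    {"family": "multi_hop",
     "query": "I have eggs, tomatoes, and onions. What can I cook, "
              "and which option is the fastest?",
     "expected_doc": "tomato_omelette.md"},
    {"family": "conversational",
     "query": "I'm allergic to peanuts and I want something quick tonight.",
     "expected_doc": "quick_stirfry.md"},
]

families = {case["family"] for case in BENCHMARK}
assert len(families) >= 4  # covers the four required query families
```

Storing cases this way lets every system variant run over exactly the same queries, which is what the per-family success/failure analysis requires.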
Required Metrics
Your evaluation must include at least:
- one quality metric, either:
  - a retrieval-oriented metric such as Recall@k, top-k retrieval accuracy, or MRR; or
  - an answer-quality measure such as task success rate, keyword match, groundedness, human judgement, or LLM-as-judge scoring;
- and one efficiency or systems metric, such as latency, number of tool calls, or token usage.
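For the retrieval-oriented options, Recall@k and MRR can be computed directly from ranked results. A minimal sketch with toy, hypothetical doc ids:

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant doc ids that appear in the top-k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(ranked_lists, relevant_per_query):
    """Mean reciprocal rank of the first relevant hit, averaged over queries."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_per_query):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Toy example: one query, three ranked results.
ranked = ["a", "b", "c"]
print(recall_at_k(ranked, {"b", "d"}, k=2))  # 0.5: one of two relevant docs in top-2
print(mrr([ranked], [{"b"}]))                # 0.5: first relevant hit at rank 2
```

Efficiency metrics (latency, tool calls, tokens) are simplest to capture by logging counters inside the agent loop and averaging over the same benchmark queries.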
Required Comparisons
You must compare:
- plain LLM/VLM vs final agent system;
- at least one ablation on your final design (e.g., with and without indexing).
Example ablations:
- text-only index vs image-only index
- caption-only vs caption+image embeddings
- no memory vs memory
- no router vs router
- fixed pipeline vs tool-based agent
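Whatever the ablation, both variants should be scored on the same test set. A skeleton comparison loop (the two variants and the test case below are stubs for illustration only):

```python
def evaluate(system, test_cases):
    """Success rate of one system variant over a shared test set."""
    hits = sum(1 for case in test_cases
               if system(case["query"]) == case["answer"])
    return hits / len(test_cases)

# Two hypothetical variants: without and with a (stub) index.
def no_index_variant(query):
    return "unknown"                 # stub: never retrieves anything

KB = {"How long does the miso soup recipe take?": "15 minutes"}

def indexed_variant(query):
    return KB.get(query, "unknown")  # stub: exact-match lookup as an "index"

cases = [{"query": "How long does the miso soup recipe take?",
          "answer": "15 minutes"}]
print(evaluate(no_index_variant, cases))  # 0.0
print(evaluate(indexed_variant, cases))   # 1.0
```

The point is structural: each ablation is just a different `system` callable passed through the same `evaluate` function, so the comparison stays fair.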
Originality & Use of Teaching Demo
The teaching demo illustrates LangGraph usage but is not a project template. Your submission must include original:
- problem framing
- knowledge base design
- multimodal representation
- retrieval strategy
- agent workflow
- evaluation methodology
⚠️ NO CREDIT RULE: Projects that simply copy the teaching demo may receive zero marks.
Deliverables
You must submit the following:
- Source code as a zip and report as a PDF, submitted separately, named [StudentID_Name.xxx] (e.g., Sxxxxxxx_NAME.zip).
- Report: maximum 4 pages (appendix allowed). Your report should be written like a short systems paper and include the sections below.
Source Code Repository
- 📁 source code
- 📄 installation instructions
- ⚙️ dependencies
- ▶️ run instructions
Report (Maximum 4 pages)
- 📝 problem statement
- 📚 knowledge base description
- 🔎 retrieval design
- 🤖 agent workflow
- 🔬 experiments & ablation studies
- 📊 results & failure analysis
Marking Criteria (20 Marks)
The detailed rubric below describes, for each category, the expectations at different mark bands.
Assesses the originality and clarity of the design question. Higher marks present a clear
technical question with a strong, meaningful innovation point.
- A clear and compelling design hypothesis is articulated. The project demonstrates substantial originality and a strong independent technical contribution. The innovation is meaningful, well-motivated, and clearly distinct from the teaching demo.
- A reasonable design question is presented, with some original thinking. The system extends beyond the demo in meaningful ways, though the innovation may be narrower or less fully justified.
- The project has limited originality. The framing is weak, vague, or mostly implementation-driven. Some independent effort is visible, but the technical contribution is modest.
- The work is largely derivative. The framing is minimal or unclear. The system appears close to the demo or relies on superficial modifications only.
Assesses the quality of multimodal data design and retrieval/indexing decisions. Higher marks
use a well-constructed personalised knowledge base with meaningful multimodal integration.
- The knowledge base is well curated, coherent, and genuinely personalised. At least two modalities are integrated meaningfully. Retrieval/indexing choices are well justified and compared through sound experiments.
- The knowledge base is appropriate and multimodal, with mostly sensible retrieval design. Some comparison or justification is provided, though the design space explored may be limited.
- The knowledge base is basic or weakly personalised. Multimodal use is present but shallow. Retrieval design is underdeveloped or insufficiently justified.
- The data setup is minimal, poorly explained, or not meaningfully multimodal. Retrieval design is simplistic or largely inherited from template code.
Assesses the design and implementation of the agent workflow. Higher marks feature a
well-structured workflow correctly utilising tools, routing, or state.
- The agent workflow is well designed and clearly useful. Tool usage, routing, memory, or state handling are thoughtful and task-appropriate. The framework demonstrates clear added value over a simple pipeline.
- The workflow is functional and mostly appropriate. Some non-trivial orchestration is present, though the design may be less sophisticated or less well analysed.
- A basic framework is implemented, but orchestration is limited. The agent adds only modest value beyond a linear retrieval-answer pipeline.
- The workflow is superficial, generic, or minimally adapted. It resembles the teaching demo closely.
Assesses the rigour of evaluation. Higher marks rigorously evaluate variants against multiple
baselines and include impactful discussions on trade-offs.
- Evaluation is rigorous and insightful. Multiple baselines and ablations are compared. Metrics are appropriate and clearly reported. The analysis explains why some designs perform better than others.
- Evaluation is solid and includes the required comparisons. Metrics are mostly appropriate. Some useful analysis is provided.
- Evaluation is limited, incomplete, or weakly structured. Comparisons exist but are shallow or insufficiently analysed.
- Very limited evaluation. Mostly anecdotal examples or screenshots. Little or no meaningful comparison across variants.
Assesses clarity, professionalism, and reproducibility. Higher marks exhibit high structural
quality with failure analyses and thorough documentation allowing code reproduction.
- The report is clear, well structured, and professionally presented. Code and documentation are reproducible. Results, diagrams, and failure analysis are strong. Evidence of independent development is provided.
- The report and code are generally clear and usable. Most required materials are present. Some aspects of clarity or reproducibility could be improved.
- The report or repository is incomplete, difficult to follow, or weakly documented. Reproducibility is limited.
- Poorly documented submission. Major missing details. Limited evidence of understanding or independent implementation.
Academic Integrity Note
Students may consult public documentation, tutorials, or framework examples for learning purposes.
However, submitted work must reflect their own design,
implementation, and analysis.
Using the provided teaching demo as a starting scaffold is acceptable only if the final submission shows
substantial redesign, independent implementation, and meaningful evaluation.
The following may be treated as evidence of non-original work:
- ⊗ code similarity
- ⊗ identical workflow structure
- ⊗ copied report language
- ⊗ copied prompts
- ⊗ copied tool definitions
- ⊗ missing originality explanations
FAQ
(Updated: 13 March 2026)
Can we use the provided teaching demo?
Yes, but only as a learning
scaffold. It is not a submission template.
What counts as enough originality?
Your submission should clearly
differ in task framing, knowledge base, retrieval design, workflow logic, and evaluation. If your
system still looks essentially like the demo with minor edits, it will not receive credit.
Do we need quantitative evaluation?
Yes. Showing screenshots or
example conversations alone is not sufficient.
Do we need to compare multiple system designs?
Yes. Comparison is a required
part of the assignment.
Team: Danny Wang*, Yadan Luo†,
Zhuoxiao Chen, Yan Jiang, Xiangyu Sun, Xuwei Xu, Fengyi Zhang, Zhizhen Zhang.
* Project Credit, † Coordinator.
This content is created based on
publicly available sources. All original copyrights remain with their respective
owners.
© 2026 INFS4205/7205. The University of Queensland.