Overview
In this assignment, you will design, implement, and evaluate a Personalised Multimodal Agent System built on your own
knowledge base.
The goal is not simply to create a chatbot that runs, but to investigate how system design choices
affect retrieval quality, reasoning ability, and user interaction.
The system must integrate:
- a personalised knowledge base
- at least two modalities
- a retrieval/indexing pipeline
- an agent framework for tool orchestration
- a quantitative evaluation of system design choices
This assignment emphasises original system thinking. High marks will be awarded for clear design
hypotheses, meaningful technical decisions, strong comparisons between alternative system variants, and
evidence-based analysis.
Learning Objectives
- Construct a personalised multimodal knowledge base.
- Design and justify retrieval and indexing strategies.
- Build an agent workflow that uses tools and state.
- Compare alternative system designs through evaluation.
- Analyse trade-offs between retrieval quality and efficiency.
- Communicate system design and findings clearly.
Task Description
You will build a personalised multimodal agent system that answers questions over a knowledge base
derived from your own curated data.
Your system must go beyond a basic chatbot. It should be framed around a clear technical question,
design hypothesis, or innovation point.
Examples include:
- Is text-only indexing sufficient for multimodal QA?
- Can image-only embeddings support retrieval?
- Does hybrid multimodal retrieval outperform single-space retrieval?
- Does agentic routing improve complex queries?
- Does memory help multi-turn personalised interactions?
- What is gained by separating retrieval, planning, and answering?
Treat the assignment as a mini systems research project.
Minimum Technical Requirements
01. Personalised Knowledge Base
Construct a knowledge base using curated, genuinely personalised content.
- Personal study materials
- Course notes
- Research papers & figures
- Travel memories
- Recipe collections
- Shopping records
- Project documents
- Hobby collections
02. At Least Two Modalities
Integrate multiple types of data into the pipeline.
- Text + Image
- Text + Audio transcript
- Text + Chart
- Image + Metadata
- Doc text + Figures
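A knowledge-base entry pairing two modalities can be as simple as a record that keeps the text, a pointer to the image, and metadata together. A minimal sketch (field names, file paths, and values here are illustrative, not a required schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KBEntry:
    """One knowledge-base item pairing text with an image and metadata.

    Field and file names are illustrative assumptions, not a required schema.
    """
    doc_id: str
    text: str                  # e.g. a recipe or a course note
    image_path: Optional[str]  # path to an associated photo or figure
    metadata: dict = field(default_factory=dict)

entry = KBEntry(
    doc_id="miso_soup",
    text="Miso soup: simmer dashi, add tofu and wakame (about 15 minutes).",
    image_path="images/miso_soup.jpg",
    metadata={"tags": ["recipe", "japanese"], "minutes": 15},
)
```

Keeping metadata alongside both modalities makes later choices (separate indices, caption indexing, filtering by tags) easier to implement and ablate.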
03. Retrieval Component
A structured approach to fetching relevant context.
- Vector databases
- Multimodal embeddings
- Separate indices
- OCR / caption indexing
- Hybrid retrieval
- Ranking or fusion
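As one illustration of the "ranking or fusion" option (a sketch, not a required design), reciprocal rank fusion (RRF) can merge ranked lists from, say, a text index and an image index; the document ids below are hypothetical:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids.

    Each document scores the sum of 1 / (k + rank) across every list it
    appears in; higher is better. k = 60 is a commonly used default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 hits from a text index and an image index.
text_hits = ["miso_soup", "ramen", "tofu_salad"]
image_hits = ["tofu_salad", "miso_soup", "curry"]
fused = rrf_fuse([text_hits, image_hits])
print(fused)  # "miso_soup" ranks first: it scores well in both lists
```

Documents ranked highly in both lists rise to the top, which is the behaviour a hybrid-retrieval ablation would compare against single-index baselines.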
04. Agent Framework
Orchestrate steps logically and effectively.
- Query routing
- Retrieval planning
- Memory & state
- Tool selection
- Task decomposition
- Verification stages
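To make "query routing" concrete, here is a minimal keyword-based sketch; a real router might instead be an LLM classifier, and the tool names returned here are hypothetical:

```python
def route_query(query: str) -> str:
    """Choose a tool for a query with simple keyword rules.

    Purely illustrative: the keyword lists and tool names are assumptions,
    and a production router would likely use an LLM or learned classifier.
    """
    q = query.lower()
    if any(word in q for word in ("image", "photo", "picture", "shows")):
        return "image_retriever"   # cross-modal lookups
    if any(word in q for word in ("fastest", "compare", "combine")):
        return "planner"           # multi-hop questions need decomposition
    return "text_retriever"        # default: plain text retrieval

print(route_query("Find a recipe whose image shows a soup with tofu"))
# image_retriever
```

A routing ablation then simply swaps this function for a pass-through that always returns the default tool.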
05. Quantitative Evaluation
Rigorous comparisons using defined metrics.
- Plain LLM baselines
- Agent-based tracking
- Retrieval ablations
Evaluation Requirements
Your evaluation must include a benchmark suite covering different query types. Rather than showing a few
ad hoc examples, you should design a small but structured test set.
Required Query Families
You must evaluate at least four query families:
- Factual Retrieval: direct retrieval of stored knowledge (e.g., "How long does the miso soup recipe take?")
- Cross-Modal Retrieval: queries requiring information from different modalities (e.g., "Find a recipe whose image shows a soup with tofu.")
- Analytical / Multi-Hop Synthesis: questions that require combining multiple pieces of evidence (e.g., "I have eggs, tomatoes, and onions. What can I cook, and which option is the fastest?")
- Conversational Follow-Up / Personalised Context: multi-turn or memory-sensitive queries (e.g., "I'm allergic to peanuts and I want something quick tonight.")
For each family, you should provide at least one test case and analyse where each system variant
succeeds or fails.
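One way (among many) to organise such a benchmark suite is a list of labelled test cases, one or more per family. The file names and expected documents below are hypothetical:

```python
# Hypothetical benchmark suite: each case records its query family, the
# query itself, and the document an answer should be grounded in.
BENCHMARK = [
    {"family": "factual",
     "query": "How long does the miso soup recipe take?",
     "expected_doc": "miso_soup.md"},
    {"family": "cross_modal",
     "query": "Find a recipe whose image shows a soup with tofu.",
     "expected_doc": "tofu_soup.jpg"},
    {"family": "multi_hop",
     "query": "I have eggs, tomatoes, and onions. What can I cook, "
              "and which option is the fastest?",
     "expected_doc": "tomato_omelette.md"},
    {"family": "conversational",
     "query": "I'm allergic to peanuts and I want something quick tonight.",
     "expected_doc": "quick_stirfry.md"},
]

families = {case["family"] for case in BENCHMARK}
assert len(families) >= 4  # covers the four required query families
```

Storing cases this way lets every system variant run over exactly the same queries, which is what the per-family success/failure analysis requires.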
Required Metrics
Your evaluation must include at least:
- one quality metric, either:
  - a retrieval-oriented metric such as Recall@k, top-k retrieval accuracy, or MRR; or
  - an answer-quality measure such as task success rate, keyword match, groundedness, human judgement, or LLM-as-judge scoring;
- and one efficiency or systems metric, such as latency, number of tool calls, or token usage.
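For the retrieval-oriented options, Recall@k and MRR can be computed directly from ranked results. A minimal sketch with toy, hypothetical doc ids:

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant doc ids that appear in the top-k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(ranked_lists, relevant_per_query):
    """Mean reciprocal rank of the first relevant hit, averaged over queries."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_per_query):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Toy example: one query, three ranked results.
ranked = ["a", "b", "c"]
print(recall_at_k(ranked, {"b", "d"}, k=2))  # 0.5: one of two relevant docs in top-2
print(mrr([ranked], [{"b"}]))                # 0.5: first relevant hit at rank 2
```

Efficiency metrics (latency, tool calls, tokens) are simplest to capture by logging counters inside the agent loop and averaging over the same benchmark queries.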
Required Comparisons
You must compare:
- plain LLM/VLM vs final agent system;
- at least one ablation on your final design (e.g., with and without indexing).
Example ablations:
- text-only index vs image-only index
- caption-only vs caption+image embeddings
- no memory vs memory
- no router vs router
- fixed pipeline vs tool-based agent
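Whatever the ablation, both variants should be scored on the same test set. A skeleton comparison loop (the two variants and the test case below are stubs for illustration only):

```python
def evaluate(system, test_cases):
    """Success rate of one system variant over a shared test set."""
    hits = sum(1 for case in test_cases
               if system(case["query"]) == case["answer"])
    return hits / len(test_cases)

# Two hypothetical variants: without and with a (stub) index.
def no_index_variant(query):
    return "unknown"                 # stub: never retrieves anything

KB = {"How long does the miso soup recipe take?": "15 minutes"}

def indexed_variant(query):
    return KB.get(query, "unknown")  # stub: exact-match lookup as an "index"

cases = [{"query": "How long does the miso soup recipe take?",
          "answer": "15 minutes"}]
print(evaluate(no_index_variant, cases))  # 0.0
print(evaluate(indexed_variant, cases))   # 1.0
```

The point is structural: each ablation is just a different `system` callable passed through the same `evaluate` function, so the comparison stays fair.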
Originality & Use of Teaching Demo
The teaching demo illustrates LangGraph usage but is not a project template. Your submission must include original:
- problem framing
- knowledge base design
- multimodal representation
- retrieval strategy
- agent workflow
- evaluation methodology
⚠️ NO CREDIT RULE: Projects that simply copy the teaching demo may receive zero marks.
Deliverables
You must submit the following:
- Source code as a zip and report as a PDF, submitted separately, named [StudentID_Name.xxx] (e.g., Sxxxxxxx_NAME.zip).
- Report: maximum 4 pages (appendix allowed). Your report should be written like a short systems paper and include the sections below.
Source Code Repository
- 📁 source code
- 📄 installation instructions
- ⚙️ dependencies
- ▶️ run instructions
Report (Maximum 4 pages)
- 📝 problem statement
- 📚 knowledge base description
- 🔎 retrieval design
- 🤖 agent workflow
- 🔬 experiments & ablation studies
- 📊 results & failure analysis
Marking Criteria (20 Marks)
The detailed rubric below describes, for each category, the expectations at different mark bands.
Assesses the originality and clarity of the design question. Higher marks present a clear
technical question with a strong, meaningful innovation point.
- A clear and compelling design hypothesis is articulated. The project demonstrates substantial originality and a strong independent technical contribution. The innovation is meaningful, well-motivated, and clearly distinct from the teaching demo.
- A reasonable design question is presented, with some original thinking. The system extends beyond the demo in meaningful ways, though the innovation may be narrower or less fully justified.
- The project has limited originality. The framing is weak, vague, or mostly implementation-driven. Some independent effort is visible, but the technical contribution is modest.
- The work is largely derivative. The framing is minimal or unclear. The system appears close to the demo or relies on superficial modifications only.
Assesses the quality of multimodal data design and retrieval/indexing decisions. Higher marks
use a well-constructed personalised knowledge base with meaningful multimodal integration.
- The knowledge base is well curated, coherent, and genuinely personalised. At least two modalities are integrated meaningfully. Retrieval/indexing choices are well justified and compared through sound experiments.
- The knowledge base is appropriate and multimodal, with mostly sensible retrieval design. Some comparison or justification is provided, though the design space explored may be limited.
- The knowledge base is basic or weakly personalised. Multimodal use is present but shallow. Retrieval design is underdeveloped or insufficiently justified.
- The data setup is minimal, poorly explained, or not meaningfully multimodal. Retrieval design is simplistic or largely inherited from template code.
Assesses the design and implementation of the agent workflow. Higher marks feature a
well-structured workflow correctly utilising tools, routing, or state.
- The agent workflow is well designed and clearly useful. Tool usage, routing, memory, or state handling are thoughtful and task-appropriate. The framework demonstrates clear added value over a simple pipeline.
- The workflow is functional and mostly appropriate. Some non-trivial orchestration is present, though the design may be less sophisticated or less well analysed.
- A basic framework is implemented, but orchestration is limited. The agent adds only modest value beyond a linear retrieval-answer pipeline.
- The workflow is superficial, generic, or minimally adapted. It resembles the teaching demo closely.
Assesses the rigour of evaluation. Higher marks rigorously evaluate variants against multiple
baselines and include impactful discussions on trade-offs.
- Evaluation is rigorous and insightful. Multiple baselines and ablations are compared. Metrics are appropriate and clearly reported. The analysis explains why some designs perform better than others.
- Evaluation is solid and includes the required comparisons. Metrics are mostly appropriate. Some useful analysis is provided.
- Evaluation is limited, incomplete, or weakly structured. Comparisons exist but are shallow or insufficiently analysed.
- Very limited evaluation. Mostly anecdotal examples or screenshots. Little or no meaningful comparison across variants.
Assesses clarity, professionalism, and reproducibility. Higher marks exhibit high structural
quality with failure analyses and thorough documentation allowing code reproduction.
- The report is clear, well structured, and professionally presented. Code and documentation are reproducible. Results, diagrams, and failure analysis are strong. Evidence of independent development is provided.
- The report and code are generally clear and usable. Most required materials are present. Some aspects of clarity or reproducibility could be improved.
- The report or repository is incomplete, difficult to follow, or weakly documented. Reproducibility is limited.
- Poorly documented submission. Major missing details. Limited evidence of understanding or independent implementation.
Academic Integrity Note
Students may consult public documentation, tutorials, or framework examples for learning purposes.
However, submitted work must reflect their own design,
implementation, and analysis.
Using the provided teaching demo as a starting scaffold is acceptable only if the final submission shows
substantial redesign, independent implementation, and meaningful evaluation.
The following may be treated as evidence of non-original work:
- ⊗ code similarity
- ⊗ identical workflow structure
- ⊗ copied report language
- ⊗ copied prompts
- ⊗ copied tool definitions
- ⊗ missing originality explanations
FAQ
(Updated: 13 March 2026)
Can we use the provided teaching demo?
Yes, but only as a learning
scaffold. It is not a submission template.
What counts as enough originality?
Your submission should clearly
differ in task framing, knowledge base, retrieval design, workflow logic, and evaluation. If your
system still looks essentially like the demo with minor edits, it will not receive credit.
Do we need quantitative evaluation?
Yes. Showing screenshots or
example conversations alone is not sufficient.
Do we need to compare multiple system designs?
Yes. Comparison is a required
part of the assignment.
Team: Danny Wang*, Yadan Luo†,
Zhuoxiao Chen, Yan Jiang, Xiangyu Sun, Xuwei Xu, Fengyi Zhang, Zhizhen Zhang.
* Project Credit, † Coordinator.
This content is created based on
publicly available sources. All original copyrights remain with their respective
owners.
© 2026 INFS4205/7205. The University of Queensland.