UQ INFS4205/7205 - A3:

Personalised Multimodal Agent System

Build an intelligent agent backed by your own multimodal knowledge base. Utilise LangGraph and Large Language Models (LLMs) to produce reliable, grounded, and domain-specific answers.

System overview:

  • Framework: LangGraph Agent
  • Knowledge Base: Text + Media
  • Retrieval: Vector DB + Metadata
  • Inference: LLM (Ollama / API)

Overview

In this assignment, you will design, implement, and evaluate a Personalised Multimodal Agent System built on your own knowledge base.

The goal is not simply to create a chatbot that runs, but to investigate how system design choices affect retrieval quality, reasoning ability, and user interaction.

The system must integrate a personalised knowledge base, at least two modalities, a retrieval component, an agent framework, and quantitative evaluation, as detailed under Minimum Technical Requirements below.

This assignment emphasises original system thinking. High marks will be awarded for clear design hypotheses, meaningful technical decisions, strong comparisons between alternative system variants, and evidence-based analysis.

Learning Objectives

Task Description

You will build a personalised multimodal agent system that answers questions over a knowledge base derived from your own curated data.

Your system must go beyond a basic chatbot. It should be framed around a clear technical question, design hypothesis, or innovation point.

Examples include:

Treat the assignment as a mini systems research project.

Minimum Technical Requirements

01. Personalised Knowledge Base

Construct a knowledge base using curated, genuinely personalised content.

  • Personal study materials
  • Course notes
  • Research papers & figures
  • Travel memories
  • Recipe collections
  • Shopping records
  • Project documents
  • Hobby collections
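Whatever content you choose, each item in the knowledge base needs a consistent shape that records its text, its modality, and any metadata you will filter or rank on. The sketch below is one illustrative way to structure this; the field names and the example recipe are invented for illustration, not prescribed by the assignment.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical schema for one knowledge-base item; the field names and
# the example recipe below are illustrative, not a required format.
@dataclass
class KBItem:
    item_id: str
    text: str                          # caption, note, or transcript
    modality: str                      # e.g. "text", "image", "audio"
    media_path: Optional[str] = None   # path to the media file, if any
    metadata: dict = field(default_factory=dict)

recipe = KBItem(
    item_id="recipe-001",
    text="Miso soup with tofu and wakame; ready in 15 minutes.",
    modality="image",
    media_path="images/miso_soup.jpg",
    metadata={"cuisine": "japanese", "prep_minutes": 15},
)
```

Keeping metadata structured from the start makes later requirements (hybrid retrieval, ablations over indices) much easier to implement.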
02. At Least Two Modalities

Integrate multiple types of data into the pipeline.

  • Text + Image
  • Text + Audio transcript
  • Text + Chart
  • Image + Metadata
  • Doc text + Figures
03. Retrieval Component

A structured approach to fetch relevant context.

  • Vector databases
  • Multimodal embeddings
  • Separate indices
  • OCR / caption indexing
  • Hybrid retrieval
  • Ranking or fusion
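If you maintain separate indices per modality, the "ranking or fusion" step combines their ranked results into one list. A common, simple choice is reciprocal rank fusion (RRF); the sketch below is a minimal stdlib-only version, with invented item ids, shown here only to illustrate the idea.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item ids into one ranking.

    `rankings` is a list of ranked id lists (e.g. one from a text
    index, one from an image index); k=60 is the conventional RRF
    smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 hits from two separate indices:
text_hits = ["recipe-001", "recipe-007", "recipe-003"]
image_hits = ["recipe-007", "recipe-001", "recipe-009"]
fused = reciprocal_rank_fusion([text_hits, image_hits])
```

Items ranked highly by both indices rise to the top, which is exactly the behaviour a hybrid text+image retriever needs.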
04. Agent Framework

Orchestrate steps logically and effectively.

  • Query routing
  • Retrieval planning
  • Memory & state
  • Tool selection
  • Task decomposition
  • Verification stages
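Query routing, the first item above, simply means deciding which branch of the pipeline should handle a query before retrieval runs. The sketch below shows the idea in plain Python with invented keyword rules; a LangGraph implementation would express the same branching via conditional edges between nodes. The rules and node names here are illustrative placeholders, not a recommended routing policy.

```python
# Minimal routing sketch. The keyword lists and branch names are
# invented for illustration; a real router might instead use an LLM
# call or a classifier over the query.
def route(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("image", "photo", "picture", "shows")):
        return "image_retrieval"
    if any(w in q for w in ("compare", "fastest", "which")):
        return "multi_hop"
    return "text_retrieval"

def answer(query: str) -> str:
    node = route(query)
    # Each branch would invoke its own retriever or tool; stubbed here.
    return f"[{node}] handling: {query}"
```

Even a simple router like this gives you a ready-made ablation: compare the routed system against a fixed single-branch pipeline.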
05. Quantitative Evaluation

Rigorous comparisons using defined metrics.

  • Plain LLM baselines
  • Agent-based tracking
  • Retrieval ablations

Evaluation Requirements

Your evaluation must include a benchmark suite covering different query types. Rather than showing a few ad hoc examples, you should design a small but structured test set.

Required Query Families

You must evaluate at least four query families:

  • Factual Retrieval
    direct retrieval of stored knowledge (e.g., How long does the miso soup recipe take?)
  • Cross-Modal Retrieval
    queries requiring information from different modalities (e.g., Find a recipe whose image shows a soup with tofu.)
  • Analytical / Multi-Hop Synthesis
    questions that require combining multiple pieces of evidence (e.g., I have eggs, tomatoes, and onions. What can I cook, and which option is the fastest?)
  • Conversational Follow-Up / Personalised Context
    multi-turn or memory-sensitive queries (e.g., I’m allergic to peanuts and I want something quick tonight.)
For each family, you should provide at least one test case and analyse where each system variant succeeds or fails.
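One lightweight way to keep the benchmark structured rather than ad hoc is to store each labelled test case in a single list. The layout below is a hypothetical example; the queries echo the examples above, but the expected item ids are invented placeholders for a recipe knowledge base.

```python
# Hypothetical benchmark layout: one labelled case per query family.
# Expected ids are invented placeholders, not real data.
BENCHMARK = [
    {"family": "factual",
     "query": "How long does the miso soup recipe take?",
     "expected_ids": ["recipe-001"]},
    {"family": "cross_modal",
     "query": "Find a recipe whose image shows a soup with tofu.",
     "expected_ids": ["recipe-001"]},
    {"family": "multi_hop",
     "query": "I have eggs, tomatoes, and onions. What can I cook, "
              "and which option is the fastest?",
     "expected_ids": ["recipe-014", "recipe-022"]},
    {"family": "conversational",
     "query": "I'm allergic to peanuts and I want something quick tonight.",
     "expected_ids": ["recipe-009"]},
]

families = {case["family"] for case in BENCHMARK}
```

With this layout, every system variant can be run over the same suite and scored per family, which makes the required success/failure analysis straightforward.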

Required Metrics

Your evaluation must include at least:

  • One quality metric, either:
    • a retrieval-oriented metric, such as Recall@k, top-k retrieval accuracy, or MRR; or
    • an answer-quality measure, such as task success rate, keyword match, groundedness, human judgement, or LLM-as-judge scoring;
  • and one efficiency or systems metric, such as latency, number of tool calls, or token usage.
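The two retrieval metrics named above are both a few lines to compute. The following is a minimal reference implementation under the usual definitions: Recall@k as the fraction of relevant ids that appear in the top-k retrieved list, and MRR as the mean reciprocal rank of the first relevant hit per query.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids found among the top-k retrieved ids."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(all_retrieved, all_relevant):
    """Mean reciprocal rank of the first relevant hit, over queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, item_id in enumerate(retrieved, start=1):
            if item_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(all_retrieved)
```

For example, `recall_at_k(["a", "b", "c"], ["b", "d"], 2)` is 0.5 (one of two relevant ids retrieved in the top 2).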

Required Comparisons

You must compare:

  • plain LLM/VLM vs final agent system;
  • at least one ablation on your final design (e.g., with and without indexing).

Example ablations:
  • text-only index vs image-only index
  • caption-only vs caption+image embeddings
  • no memory vs memory
  • no router vs router
  • fixed pipeline vs tool-based agent
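Running each variant over the same query set with a shared timing harness keeps the comparison fair and gives you the required efficiency metric for free. The sketch below is one possible harness; the `variants` mapping is a placeholder where you would plug in your own pipeline functions (e.g. with and without the image index).

```python
import time

def timed(fn, *args):
    """Run fn once and return (result, wall-clock latency in seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def compare(variants, queries):
    """Average per-query latency for each labelled variant.

    `variants` maps a label to a query-answering function; both are
    placeholders for your own system variants.
    """
    rows = []
    for label, fn in variants.items():
        latencies = [timed(fn, q)[1] for q in queries]
        rows.append((label, sum(latencies) / len(latencies)))
    return rows
```

The same loop can accumulate quality metrics (e.g. Recall@k per query) alongside latency, so one run produces the full comparison table for the report.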

Originality & Use of Teaching Demo

The teaching demo illustrates LangGraph usage but is not a project template. Your submission must include original:

  • problem framing
  • knowledge base design
  • multimodal representation
  • retrieval strategy
  • agent workflow
  • evaluation methodology
⚠️ NO CREDIT RULE: Projects that simply copy the teaching demo may receive zero marks.

Deliverables

You must submit the following:

source_code.zip

Source Code Repository

  • 📁 source code
  • 📄 installation instructions
  • ⚙️ dependencies
  • ▶️ run instructions
report.pdf

Report (Maximum 4 pages)

  • 📝 problem statement
  • 📚 knowledge base description
  • 🔎 retrieval design
  • 🤖 agent workflow
  • 🔬 experiments & ablation studies
  • 📊 results & failure analysis

Marking Criteria (20 Marks)

Each category below details the rubric and the expectations for each mark band.

1. Problem Framing & Innovation

4 Marks

Assesses the originality and clarity of the design question. Higher marks present a clear technical question with a strong, meaningful innovation point.

4.0 – 3.5 Marks

A clear and compelling design hypothesis is articulated. The project demonstrates substantial originality and a strong independent technical contribution. The innovation is meaningful, well-motivated, and clearly distinct from the teaching demo.

3.4 – 2.5 Marks

A reasonable design question is presented, with some original thinking. The system extends beyond the demo in meaningful ways, though the innovation may be narrower or less fully justified.

2.4 – 1.5 Marks

The project has limited originality. The framing is weak, vague, or mostly implementation-driven. Some independent effort is visible, but the technical contribution is modest.

1.4 – 0.0 Marks

The work is largely derivative. The framing is minimal or unclear. The system appears close to the demo or relies on superficial modifications only.

2. Knowledge Base & Retrieval Design

4 Marks

Assesses the quality of multimodal data design and retrieval/indexing decisions. Higher marks use a well-constructed personalised knowledge base with meaningful multimodal integration.

4.0 – 3.5 Marks

The knowledge base is well curated, coherent, and genuinely personalised. At least two modalities are integrated meaningfully. Retrieval/indexing choices are well justified and compared through sound experiments.

3.4 – 2.5 Marks

The knowledge base is appropriate and multimodal, with mostly sensible retrieval design. Some comparison or justification is provided, though the design space explored may be limited.

2.4 – 1.5 Marks

The knowledge base is basic or weakly personalised. Multimodal use is present but shallow. Retrieval design is underdeveloped or insufficiently justified.

1.4 – 0.0 Marks

The data setup is minimal, poorly explained, or not meaningfully multimodal. Retrieval design is simplistic or largely inherited from template code.

3. Agent Framework & Tool Orchestration

4 Marks

Assesses the design and implementation of the agent workflow. Higher marks feature a well-structured workflow correctly utilising tools, routing, or state.

4.0 – 3.5 Marks

The agent workflow is well designed and clearly useful. Tool usage, routing, memory, or state handling are thoughtful and task-appropriate. The framework demonstrates clear added value over a simple pipeline.

3.4 – 2.5 Marks

The workflow is functional and mostly appropriate. Some non-trivial orchestration is present, though the design may be less sophisticated or less well analysed.

2.4 – 1.5 Marks

A basic framework is implemented, but orchestration is limited. The agent adds only modest value beyond a linear retrieval-answer pipeline.

1.4 – 0.0 Marks

The workflow is superficial, generic, or minimally adapted. It resembles the teaching demo closely.

4. Quantitative Evaluation & Ablation

4 Marks

Assesses the rigour of evaluation. Higher marks rigorously evaluate variants against multiple baselines and include impactful discussions on trade-offs.

4.0 – 3.5 Marks

Evaluation is rigorous and insightful. Multiple baselines and ablations are compared. Metrics are appropriate and clearly reported. The analysis explains why some designs perform better than others.

3.4 – 2.5 Marks

Evaluation is solid and includes the required comparisons. Metrics are mostly appropriate. Some useful analysis is provided.

2.4 – 1.5 Marks

Evaluation is limited, incomplete, or weakly structured. Comparisons exist but are shallow or insufficiently analysed.

1.4 – 0.0 Marks

Very limited evaluation. Mostly anecdotal examples or screenshots. Little or no meaningful comparison across variants.

5. Report, Code & Reproducibility

4 Marks

Assesses clarity, professionalism, and reproducibility. Higher marks exhibit high structural quality with failure analyses and thorough documentation allowing code reproduction.

4.0 – 3.5 Marks

The report is clear, well structured, and professionally presented. Code and documentation are reproducible. Results, diagrams, and failure analysis are strong. Evidence of independent development is provided.

3.4 – 2.5 Marks

The report and code are generally clear and usable. Most required materials are present. Some aspects of clarity or reproducibility could be improved.

2.4 – 1.5 Marks

The report or repository is incomplete, difficult to follow, or weakly documented. Reproducibility is limited.

1.4 – 0.0 Marks

Poorly documented submission. Major missing details. Limited evidence of understanding or independent implementation.

Academic Integrity Note

Students may consult public documentation, tutorials, or framework examples for learning purposes. However, submitted work must reflect their own design, implementation, and analysis.

Using the provided teaching demo as a starting scaffold is acceptable only if the final submission shows substantial redesign, independent implementation, and meaningful evaluation.

The following may be treated as evidence of non-original work:

  • ⊗ code similarity
  • ⊗ identical workflow structure
  • ⊗ copied report language
  • ⊗ copied prompts
  • ⊗ copied tool definitions
  • ⊗ missing originality explanations

FAQ

(Updated: 13 March 2026)

Can we use the provided teaching demo?

Yes, but only as a learning scaffold. It is not a submission template.

What counts as enough originality?

Your submission should clearly differ in task framing, knowledge base, retrieval design, workflow logic, and evaluation. If your system still looks essentially like the demo with minor edits, it will not receive credit.

Do we need quantitative evaluation?

Yes. Showing screenshots or example conversations alone is not sufficient.

Do we need to compare multiple system designs?

Yes. Comparison is a required part of the assignment.

Team: Danny Wang*, Yadan Luo, Zhuoxiao Chen, Yan Jiang, Xiangyu Sun, Xuwei Xu, Fengyi Zhang, Zhizhen Zhang.

* Project Credit, † Coordinator.

This content is created based on publicly available sources. All original copyrights remain with their respective owners.
© 2026 INFS4205/7205. The University of Queensland.