Designing multimodal systems

Executive summary

This paper is written for teams that already know how to ship text-only models and are now being asked to “make it multimodal”.

1. Designing outcomes for your multimodal model

Why multimodal isn’t just “more data” and how to optimize for tasks, not benchmarks.

2. From “one big model” to decomposed systems

Multimodal perception is harder than it looks. Decompose the task, don’t rely on one world model, and beware the “just add vision” reflex.

3. Data pipelines as engineered systems

Make the end-to-end flow explicit, from single scores to slice-aware evaluation.

4. Technical challenges: latent spaces, evolution, and compute

Designing latent spaces for multimodal work, and designing for evolution, not a one-off launch.

5. Evaluation and benchmarking: beyond leaderboards

Use metrics as diagnostics, not just scores.

Designing multimodal systems

Adding vision or audio to a text model does not, by itself, give you a useful multimodal system. You need perception, alignment, and decision-making over messy, synchronized streams that don’t behave like tokens. Invisible’s latest paper sketches a practical approach to multimodal system design for researchers used to working with text-only models.

Going beyond chatbots: Designing multimodal systems

A practical pathway into multimodal
