January 18, 2025

Segment Anything

Segment Anything is a foundation model that, as the name says, segments all kinds of objects. Give it an image and it returns segmented regions, with no additional training needed. You can also steer it toward the result you want by supplying prompts. Prompts can be points, bounding boxes, or text (the paper treats text prompts as an early proof of concept).

Let’s take a look at the following example. From left to right, more prompts are added (here, the light-blue points). Placing prompts on the bears lets us segment them. The more prompts you add, the closer the result gets to the region you want.
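To make the prompt idea concrete, here is a minimal Python sketch of how a growing set of point prompts could be represented. `PointPrompt` and `to_coords_and_labels` are hypothetical names of mine and the coordinates are made up. The only thing borrowed from the official repo is, as far as I understand it, the convention that label 1 marks a foreground point and label 0 a background point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointPrompt:
    """One click on the image, in pixel coordinates (hypothetical helper type)."""
    x: float
    y: float
    is_foreground: bool = True  # False marks a region to exclude


def to_coords_and_labels(prompts):
    """Split clicks into two parallel lists: [[x, y], ...] and [1 or 0, ...].

    Assumed convention: label 1 = foreground point, label 0 = background point.
    """
    coords = [[p.x, p.y] for p in prompts]
    labels = [1 if p.is_foreground else 0 for p in prompts]
    return coords, labels


# Clicks accumulate as the user refines the selection (made-up coordinates).
clicks = []
for click in [
    PointPrompt(120, 80),
    PointPrompt(200, 95),
    PointPrompt(40, 30, is_foreground=False),
]:
    clicks.append(click)
    coords, labels = to_coords_and_labels(clicks)
    print(len(clicks), coords, labels)
```

Each pass through the loop corresponds to one step to the right in the example: the prompt list only grows, and the full list is handed to the model again.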

But how does the model pull this off? Roughly speaking, it computes an embedding of the image and then applies the prompts to that embedding to produce the segmentation.

Segment Anything uses a Transformer to compute the image embedding, and this step is quite computationally intensive. Yet on the Segment Anything demo site, results appear in near real time. Why is that?

Oops, the demo page is here: https://segment-anything.com/demo

The GitHub repo is here: https://github.com/facebookresearch/segment-anything

And the paper: https://arxiv.org/abs/2304.02643

Why is Segment Anything so fast?

Let’s look at the model architecture in the figure above, taken from the paper. The model has two encoders, an Image Encoder and a Prompt Encoder, plus a mask decoder that combines their outputs. The Image Encoder computes the image embedding, and it is computationally intensive. The Prompt Encoder, on the other hand, turns the prompts into embeddings, and it is computationally lightweight, as is the mask decoder that applies them to the image embedding. Note that prompt encoding has to be computed interactively, after the user gives an instruction, while image encoding can be computed before any instruction arrives. The authors say that “Given an image embedding, the prompt encoder and mask decoder predict a mask from a prompt in ∼50ms in a web browser”. How smart they are!

In other words, they separated the interactive computation from the non-interactive computation and showed that the interactive part does not need much compute.
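The split can be sketched in a few lines of Python. This is a toy stand-in under my own assumptions, not SAM's actual code: `ToySegmenter` is a hypothetical class, the "encoder" just sums pixel rows, and the "decoder" is a trivial lookup. The only point is that the expensive step runs once per image and every prompt afterwards reuses the cached result.

```python
class ToySegmenter:
    """Toy illustration of encode-once, prompt-many-times (not SAM itself)."""

    def __init__(self):
        self.encoder_calls = 0   # how often the heavy step actually ran
        self._embedding = None   # cached result of the heavy step

    def _encode_image(self, image):
        # Placeholder for the heavy image encoder (a Transformer in SAM).
        self.encoder_calls += 1
        return [sum(row) for row in image]

    def set_image(self, image):
        """Non-interactive part: run once, before any prompt arrives."""
        self._embedding = self._encode_image(image)

    def predict(self, point):
        """Interactive part: cheap, reuses the cached embedding."""
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        x, y = point
        # Placeholder for the prompt encoder plus mask decoder.
        return {"point": (x, y), "score": self._embedding[y] + x}


segmenter = ToySegmenter()
segmenter.set_image([[0, 1, 2], [3, 4, 5]])   # heavy step, done up front

# Several prompts on the same image: none of them triggers the encoder again.
results = [segmenter.predict(p) for p in [(1, 0), (2, 1), (0, 1)]]
print(results)
print("encoder calls:", segmenter.encoder_calls)
```

The method names mirror, as far as I understand it, the `set_image` and `predict` methods of `SamPredictor` in the official repo, but everything inside them here is made up for illustration.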