Multimodal Embeddings API

The Multimodal Embeddings API generates vectors based on the input you provide, which can include a combination of image, text, and video data. The embedding vectors can then be used for downstream tasks such as image classification or video content moderation.

Supported models:

  • multimodalembedding@001

Syntax

  • PROJECT_ID = PROJECT_ID
  • REGION = us-central1

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
  "instances": [
    ...
  ]
}'

Python

from vertexai.vision_models import MultiModalEmbeddingModel

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")
model.get_embeddings(...)
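
A fuller sketch of the SDK call, assuming the vertexai.vision_models Image helper and the image/contextual_text keyword arguments of get_embeddings; verify the exact signature against the SDK reference:

import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="PROJECT_ID", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")
image = Image.load_from_file("flower.jpg")  # local image file
embeddings = model.get_embeddings(
    image=image,
    contextual_text="white shoes",  # optional text paired with the image
)
print(embeddings.image_embedding)  # sequence of floats
print(embeddings.text_embedding)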

Parameter list

Request body

Parameters

image

Optional. Image

The image to generate an embedding for.

text

Optional. String

The text to generate an embedding for.

video

Optional. Video

The video segment to generate an embedding for.

dimension

Optional. Int

This parameter accepts one of the following values: 128, 256, 512, or 1408. The response includes embeddings of that dimension. This applies to text and image input only.
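
Note that dimension goes in the top-level parameters object, not inside an instance, as the advanced example later on this page also shows. A minimal body:

{
  "instances": [
    { "text": "white shoes" }
  ],
  "parameters": { "dimension": 128 }
}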

Image

Parameters

bytesBase64Encoded

Optional. String

The image bytes encoded as a base64 string. One of bytesBase64Encoded or gcsUri must be set.

gcsUri

Optional. String

The Cloud Storage location of the image to generate an embedding for. One of bytesBase64Encoded or gcsUri must be set.

mimeType

Optional. String

The MIME type of the image content. image/jpeg and image/png are supported.

VideoSegmentConfig

Parameters

startOffsetSec

Optional. Int

The start offset of the video segment, in seconds. If the start offset is not specified, it is computed as max(0, endOffsetSec - 120).

endOffsetSec

Optional. Int

The end offset of the video segment, in seconds. If the end offset is not specified, it is computed as min(video length, startOffsetSec + 120). If both startOffsetSec and endOffsetSec are specified, endOffsetSec is adjusted to min(startOffsetSec + 120, endOffsetSec).

intervalSec

Optional. Int

The interval of the video over which embeddings are generated. The minimum value of intervalSec is 4. If the interval is less than 4, an InvalidArgumentError is returned. There is no maximum limit on the interval, but an interval greater than min(video length, 120s) affects the quality of the generated embeddings. The default value is 16.
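
The offset rules above interact; here is a minimal Python sketch of how the documented defaults combine (an illustration of the stated rules, not the service's implementation):

def effective_segment(video_length_sec, start=None, end=None):
    """Apply the documented startOffsetSec/endOffsetSec defaults."""
    if start is None:
        # An unspecified start backs up at most 120s from the end.
        start = max(0, end - 120) if end is not None else 0
    if end is None:
        # An unspecified end runs at most 120s past the start.
        end = min(video_length_sec, start + 120)
    else:
        # When both are given, the end is clamped to startOffsetSec + 120.
        end = min(start + 120, end)
    return start, end

# e.g. only endOffsetSec=200 given on a 300s video -> segment (80, 200)
print(effective_segment(300, end=200))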

Video

Parameters

bytesBase64Encoded

Optional. String

The video bytes encoded as a base64 string. One of bytesBase64Encoded or gcsUri must be set.

gcsUri

Optional. String

The Cloud Storage location of the video to generate an embedding for. One of bytesBase64Encoded or gcsUri must be set.

videoSegmentConfig

Optional. VideoSegmentConfig

The video segment configuration.

Examples

  • PROJECT_ID = PROJECT_ID
  • REGION = us-central1
  • MODEL_ID = multimodalembedding@001

Basic use case

The multimodal embedding model generates vectors based on the input you provide, which can include a combination of image, text, and video data.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
  "instances": [
    {
      "image": {
        "gcsUri": "gs://your-public-uri-test/flower.jpg"
      },
      "text": "white shoes",
      "video": {
        "gcsUri": "gs://your-public-uri-test/Okabashi.mp4"
      }
    }
  ]
}'
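
The response contains one prediction per instance. Judging from the fields the Python client below reads, its shape is roughly the following (embedding values are illustrative):

{
  "predictions": [
    {
      "textEmbedding": [0.0103, -0.0041, ...],
      "imageEmbedding": [0.0026, -0.0019, ...],
      "videoEmbeddings": [
        {
          "startOffsetSec": 0,
          "endOffsetSec": 16,
          "embedding": [0.0104, 0.0153, ...]
        }
      ]
    }
  ]
}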

Python

# @title Client for multimodal embedding
import time
import typing
from dataclasses import dataclass

# Need to do pip install google-cloud-aiplatform for the following two imports.
# Also run: gcloud auth application-default login.
from google.cloud import aiplatform
from google.protobuf import struct_pb2

PROJECT_ID = "PROJECT_ID" # @param {type:"string"}
IMAGE_URI = "gs://your-public-uri-test/flower.jpg" # @param {type:"string"}
TEXT = "white shoes" # @param {type:"string"}
VIDEO_URI = "gs://your-public-uri-test/Okabashi.mp4" # @param {type:"string"}
VIDEO_START_OFFSET_SEC = 0
VIDEO_END_OFFSET_SEC = 120
VIDEO_EMBEDDING_INTERVAL_SEC = 16

# Inspired by http://stackoverflow.com/questions/34269772/type-hints-in-namedtuple.
class EmbeddingResponse(typing.NamedTuple):
    @dataclass
    class VideoEmbedding:
        start_offset_sec: int
        end_offset_sec: int
        embedding: typing.Sequence[float]

    text_embedding: typing.Sequence[float]
    image_embedding: typing.Sequence[float]
    video_embeddings: typing.Sequence[VideoEmbedding]

class EmbeddingPredictionClient:
    """Wrapper around Prediction Service Client."""

    def __init__(self, project: str,
                 location: str = "us-central1",
                 api_regional_endpoint: str = "us-central1-aiplatform.googleapis.com"):
        client_options = {"api_endpoint": api_regional_endpoint}
        # Initialize client that will be used to create and send requests.
        # This client only needs to be created once, and can be reused for multiple requests.
        self.client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
        self.location = location
        self.project = project

    def get_embedding(self, text: str = None, image_uri: str = None, video_uri: str = None,
                      start_offset_sec: int = 0, end_offset_sec: int = 120, interval_sec: int = 16):
        if not text and not image_uri and not video_uri:
            raise ValueError('At least one of text or image_uri or video_uri must be specified.')

        instance = struct_pb2.Struct()
        if text:
            instance.fields['text'].string_value = text

        if image_uri:
            image_struct = instance.fields['image'].struct_value
            image_struct.fields['gcsUri'].string_value = image_uri

        if video_uri:
            video_struct = instance.fields['video'].struct_value
            video_struct.fields['gcsUri'].string_value = video_uri
            video_config_struct = video_struct.fields['videoSegmentConfig'].struct_value
            video_config_struct.fields['startOffsetSec'].number_value = start_offset_sec
            video_config_struct.fields['endOffsetSec'].number_value = end_offset_sec
            video_config_struct.fields['intervalSec'].number_value = interval_sec

        instances = [instance]
        endpoint = (f"projects/{self.project}/locations/{self.location}"
                    "/publishers/google/models/multimodalembedding@001")
        response = self.client.predict(endpoint=endpoint, instances=instances)

        text_embedding = None
        if text:
            text_emb_value = response.predictions[0]['textEmbedding']
            text_embedding = [v for v in text_emb_value]

        image_embedding = None
        if image_uri:
            image_emb_value = response.predictions[0]['imageEmbedding']
            image_embedding = [v for v in image_emb_value]

        video_embeddings = None
        if video_uri:
            video_emb_values = response.predictions[0]['videoEmbeddings']
            video_embeddings = [
                EmbeddingResponse.VideoEmbedding(start_offset_sec=v['startOffsetSec'], end_offset_sec=v['endOffsetSec'],
                                                 embedding=[x for x in v['embedding']])
                for v in
                video_emb_values]

        return EmbeddingResponse(
            text_embedding=text_embedding,
            image_embedding=image_embedding,
            video_embeddings=video_embeddings)

# client can be reused.
client = EmbeddingPredictionClient(project=PROJECT_ID)
start = time.time()
response = client.get_embedding(text=TEXT, image_uri=IMAGE_URI, video_uri=VIDEO_URI,
                                    start_offset_sec=VIDEO_START_OFFSET_SEC,
                                    end_offset_sec=VIDEO_END_OFFSET_SEC,
                                    interval_sec=VIDEO_EMBEDDING_INTERVAL_SEC)
end = time.time()

print(response)
print('Time taken: ', end - start)

Advanced use case

You can specify the dimension for text and image embeddings. For video embeddings, you can specify the video segment and the embedding density (interval).

curl - image

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
  "instances": [
    {
      "image": {
        "gcsUri": "gs://your-public-uri-test/flower.jpg"
      },
      "text": "white shoes",
    }
  ],
  "parameters": {
    "dimension": 128
  }
}'

curl - video

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
  "instances": [
    {
      "video": {
        "gcsUri": "gs://your-public-uri-test/Okabashi.mp4",
        "videoSegmentConfig": {
          "startOffsetSec": 10,
          "endOffsetSec": 60,
          "intervalSec": 10
        }
      }
    }
  ]
}'

Python

# @title Client for multimodal embedding
import time
import typing
from dataclasses import dataclass

# Need to do pip install google-cloud-aiplatform for the following two imports.
# Also run: gcloud auth application-default login.
from google.cloud import aiplatform
from google.protobuf import struct_pb2

PROJECT_ID = "PROJECT_ID" # @param {type:"string"}
IMAGE_URI = "gs://your-public-uri-test/flower.jpg"
TEXT = "white shoes"
VIDEO_URI = "gs://your-public-uri-test/brahms.mp4"
VIDEO_START_OFFSET_SEC = 10
VIDEO_END_OFFSET_SEC = 60
VIDEO_EMBEDDING_INTERVAL_SEC = 10
DIMENSION = 128

# Inspired by http://stackoverflow.com/questions/34269772/type-hints-in-namedtuple.
class EmbeddingResponse(typing.NamedTuple):
    @dataclass
    class VideoEmbedding:
        start_offset_sec: int
        end_offset_sec: int
        embedding: typing.Sequence[float]

    text_embedding: typing.Sequence[float]
    image_embedding: typing.Sequence[float]
    video_embeddings: typing.Sequence[VideoEmbedding]

class EmbeddingPredictionClient:
    """Wrapper around Prediction Service Client."""

    def __init__(self, project: str,
                 location: str = "us-central1",
                 api_regional_endpoint: str = "us-central1-aiplatform.googleapis.com"):
        client_options = {"api_endpoint": api_regional_endpoint}
        # Initialize client that will be used to create and send requests.
        # This client only needs to be created once, and can be reused for multiple requests.
        self.client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
        self.location = location
        self.project = project

    def get_embedding(self, text: str = None, image_uri: str = None, video_uri: str = None,
                      start_offset_sec: int = 0, end_offset_sec: int = 120, interval_sec: int = 16, dimension=1408):
        if not text and not image_uri and not video_uri:
            raise ValueError('At least one of text or image_uri or video_uri must be specified.')

        instance = struct_pb2.Struct()
        if text:
            instance.fields['text'].string_value = text

        if image_uri:
            image_struct = instance.fields['image'].struct_value
            image_struct.fields['gcsUri'].string_value = image_uri

        if video_uri:
            video_struct = instance.fields['video'].struct_value
            video_struct.fields['gcsUri'].string_value = video_uri
            video_config_struct = video_struct.fields['videoSegmentConfig'].struct_value
            video_config_struct.fields['startOffsetSec'].number_value = start_offset_sec
            video_config_struct.fields['endOffsetSec'].number_value = end_offset_sec
            video_config_struct.fields['intervalSec'].number_value = interval_sec

        parameters = struct_pb2.Struct()
        parameters.fields['dimension'].number_value = dimension

        instances = [instance]
        endpoint = (f"projects/{self.project}/locations/{self.location}"
                    "/publishers/google/models/multimodalembedding@001")
        response = self.client.predict(endpoint=endpoint, instances=instances, parameters=parameters)

        text_embedding = None
        if text:
            text_emb_value = response.predictions[0]['textEmbedding']
            text_embedding = [v for v in text_emb_value]

        image_embedding = None
        if image_uri:
            image_emb_value = response.predictions[0]['imageEmbedding']
            image_embedding = [v for v in image_emb_value]

        video_embeddings = None
        if video_uri:
            video_emb_values = response.predictions[0]['videoEmbeddings']
            video_embeddings = [
                EmbeddingResponse.VideoEmbedding(start_offset_sec=v['startOffsetSec'], end_offset_sec=v['endOffsetSec'],
                                                 embedding=[x for x in v['embedding']])
                for v in
                video_emb_values]

        return EmbeddingResponse(
            text_embedding=text_embedding,
            image_embedding=image_embedding,
            video_embeddings=video_embeddings)

# client can be reused.
client = EmbeddingPredictionClient(project=PROJECT_ID)
start = time.time()
response = client.get_embedding(text=TEXT, image_uri=IMAGE_URI, video_uri=VIDEO_URI,
                                    start_offset_sec=VIDEO_START_OFFSET_SEC,
                                    end_offset_sec=VIDEO_END_OFFSET_SEC,
                                    interval_sec=VIDEO_EMBEDDING_INTERVAL_SEC,
                                    dimension=DIMENSION)
end = time.time()

print(response)
print('Time taken: ', end - start)

Learn more

For detailed documentation, see the following.