To build a new index or update an existing index, provide vectors to Vector Search in the format and structure described in the following sections.
Input data storage and file organization
Prerequisite
Store your input data in a Cloud Storage bucket in your Google Cloud project.
Input data files should be organized as follows:
- Each batch of input data files should be under a single Cloud Storage directory, referred to here as `batch_root`.
- Data files should be placed directly under `batch_root` and named with one of the following suffixes: `.csv`, `.json`, or `.avro`.
- There is a limit of 5,000 objects (files) in the batch root directory.
- Each data file is interpreted as a set of records. The record format is determined by the filename suffix; the format requirements are described in Data file formats.
- Each record should have an `id`, a feature vector, and any optional fields supported by Vector Search, such as restricts and crowding.
- A subdirectory named `delete` may be present. Each file directly under `batch_root/delete` is taken as a text file of IDs, with one `id` per line.
- All other directories and files are ignored.
Input data processing
- All records from all data files, including those under `delete`, constitute a single batch of input.
- The relative ordering of records within a data file is not significant.
- A single ID should appear only once in a batch. If two records share the same ID, only one vector is counted.
- An ID cannot appear in both a regular data file and a delete data file.
- Every ID listed in a data file under `delete` is removed from the next index version.
- Records from regular data files are included in the next version, overwriting values from older index versions.
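The ID rules above can be checked before upload. The following is a minimal sketch of such a pre-flight check; `validate_batch` is a hypothetical helper (not part of any Vector Search SDK) that assumes JSON data files with one record per line and plain-text delete files with one ID per line.

```python
import json

def validate_batch(data_lines, delete_lines):
    """Enforce the batch ID rules: unique IDs, and no ID present in
    both a regular data file and a delete data file."""
    seen = set()
    for line in data_lines:
        rec_id = json.loads(line)["id"]
        if rec_id in seen:
            raise ValueError(f"duplicate ID in batch: {rec_id}")
        seen.add(rec_id)
    delete_ids = {line.strip() for line in delete_lines if line.strip()}
    overlap = seen & delete_ids
    if overlap:
        raise ValueError(f"IDs in both data and delete files: {overlap}")
    return seen, delete_ids
```

A check like this is cheap compared with a failed index build, so it is worth running over the whole batch before copying it to Cloud Storage.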
Here is a JSON example:

```json
{"id": "1", "embedding": [1,1,1]}
{"id": "2", "embedding": [2,2,2]}
```
The following is an example of a valid input data file organization:

```
batch_root/
  feature_file_1.csv
  feature_file_2.csv
  delete/
    delete_file.txt
```

The `feature_file_1.csv` and `feature_file_2.csv` files contain records in CSV format. The `delete_file.txt` file contains a list of record IDs to be deleted from the next index version.
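A batch with this structure can be assembled locally before being copied to the Cloud Storage bucket (for example with `gsutil -m cp -r`). The sketch below, using only the standard library, writes a JSON data file and a delete file in the layout shown above; `write_batch` and the record contents are illustrative, not part of any SDK.

```python
import json
import pathlib
import tempfile

def write_batch(root, records, delete_ids):
    """Write one batch: data files directly under root, delete IDs
    under root/delete, matching the layout Vector Search expects."""
    root = pathlib.Path(root)
    (root / "delete").mkdir(parents=True, exist_ok=True)
    with open(root / "feature_file_1.json", "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    with open(root / "delete" / "delete_file.txt", "w", encoding="utf-8") as f:
        f.writelines(i + "\n" for i in delete_ids)

# Example usage with a throwaway directory:
with tempfile.TemporaryDirectory() as d:
    batch_root = pathlib.Path(d) / "batch_root"
    write_batch(batch_root, [{"id": "1", "embedding": [1, 1, 1]}], ["2"])
    print(sorted(p.name for p in batch_root.rglob("*")))
```

This example uses the `.json` suffix; the same layout applies to `.csv` and `.avro` data files.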
Data file formats
JSON

- Encode the JSON file using UTF-8.
- Each line of the JSON file is interpreted as a separate JSON object.
- Each record must contain an `id` field that specifies the ID of the vector.
- Each record must contain an `embedding` field that is an array of `N` floating-point numbers representing the feature vector, where `N` is the dimension of the feature vector that was configured when the index was created.
- An optional `restricts` field can be included that specifies an array of `TokenNamespace` objects. For each object:
  - Specify a `namespace` field, which is the `TokenNamespace.namespace`.
  - An optional `allow` field can be set to an array of strings, which are the `TokenNamespace.string_tokens`.
  - An optional `deny` field can be set to an array of strings, which are the `TokenNamespace.string_blacklist_tokens`.
- The value of the `crowding_tag` field, if present, must be a string.
- An optional `numeric_restricts` field can be included that specifies an array of `NumericRestrictNamespace` objects. For each object:
  - Specify a `namespace` field, which is the `NumericRestrictNamespace.namespace`.
  - Specify exactly one of the value fields `value_int`, `value_float`, or `value_double`.
  - The object must not have a field named `op`; that field is used only in queries.
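Putting the JSON rules together, one record can be built and serialized as a single line. `make_record` below is a hypothetical helper; `dimensions` stands for the dimension configured when the index was created.

```python
import json

def make_record(rec_id, embedding, dimensions, restricts=None,
                numeric_restricts=None, crowding_tag=None):
    """Serialize one record as a JSON line, enforcing the dimension."""
    if len(embedding) != dimensions:
        raise ValueError(f"expected {dimensions} values, got {len(embedding)}")
    record = {"id": rec_id, "embedding": embedding}
    if restricts:
        record["restricts"] = restricts
    if numeric_restricts:
        record["numeric_restricts"] = numeric_restricts
    if crowding_tag is not None:
        record["crowding_tag"] = crowding_tag
    return json.dumps(record)

line = make_record(
    "42", [0.1, 0.2, 0.3], dimensions=3,
    restricts=[{"namespace": "color", "allow": ["red"], "deny": ["blue"]}],
    numeric_restricts=[{"namespace": "size", "value_int": 3}],
    crowding_tag="group-a",
)
```

Each such line is then appended to a `.json` data file under `batch_root`, one record per line.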
- Specify a
Avro
- Use a valid Avro file.
- Make records that conform to the following schema:

```json
{
  "type": "record",
  "name": "FeatureVector",
  "fields": [
    { "name": "id", "type": "string" },
    {
      "name": "embedding",
      "type": { "type": "array", "items": "float" }
    },
    {
      "name": "restricts",
      "type": [
        "null",
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "Restrict",
            "fields": [
              { "name": "namespace", "type": "string" },
              {
                "name": "allow",
                "type": [ "null", { "type": "array", "items": "string" } ]
              },
              {
                "name": "deny",
                "type": [ "null", { "type": "array", "items": "string" } ]
              }
            ]
          }
        }
      ]
    },
    {
      "name": "numeric_restricts",
      "type": [
        "null",
        {
          "type": "array",
          "items": {
            "name": "NumericRestrict",
            "type": "record",
            "fields": [
              { "name": "namespace", "type": "string" },
              { "name": "value_int", "type": [ "null", "int" ], "default": null },
              { "name": "value_float", "type": [ "null", "float" ], "default": null },
              { "name": "value_double", "type": [ "null", "double" ], "default": null }
            ]
          }
        }
      ],
      "default": null
    },
    { "name": "crowding_tag", "type": [ "null", "string" ] }
  ]
}
```
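Writing the `.avro` file itself is best left to an Avro library such as fastavro, but the schema's top-level shape can be checked with the standard library before serialization. The sketch below is a hypothetical pre-flight check, not a full Avro validator:

```python
def check_feature_vector(record):
    """Check a record against the FeatureVector schema's top-level shape
    and fill in the nullable union fields with their null default."""
    if not isinstance(record.get("id"), str):
        raise ValueError("id must be a string")
    emb = record.get("embedding")
    if not isinstance(emb, list) or not all(
            isinstance(x, (int, float)) for x in emb):
        raise ValueError("embedding must be an array of floats")
    for field in ("restricts", "numeric_restricts", "crowding_tag"):
        # These fields are nullable unions in the schema, so null is valid.
        record.setdefault(field, None)
    return record

rec = check_feature_vector({"id": "1", "embedding": [1.0, 2.0]})
```

Records that pass this check can then be handed to the Avro writer together with the schema above.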
CSV
- Encode the CSV file using UTF-8.
- Each line of the CSV must contain exactly one record.
- The first value in each line must be the vector ID, which must be a valid UTF-8 string.
- Following the ID, the next `N` values represent the feature vector, where `N` is the dimension of the feature vector that was configured when the index was created.
- Feature vector values must be floating-point literals as defined in the Java language specification.
- Additional values may be in the form `name=value`.
- The name `crowding_tag` is interpreted as the crowding tag and may appear only once in a record.
- All other `name=value` pairs are interpreted as token namespace restricts. The same name may be repeated if a namespace has multiple values. For example, `color=red,color=blue` represents this `TokenNamespace`:

  ```json
  { "namespace": "color", "string_tokens": ["red", "blue"] }
  ```

- If a value starts with `!`, the rest of the string is interpreted as an excluded value. For example, `color=!red` represents this `TokenNamespace`:

  ```json
  { "namespace": "color", "string_blacklist_tokens": ["red"] }
  ```

- `#name=numericValue` pairs with a number-type suffix are interpreted as numeric namespace restricts. The suffix is `i` for int, `f` for float, and `d` for double. The same name shouldn't be repeated, because there is a single value per namespace. For example:
  - `#size=3i` represents this `NumericRestrictNamespace`: `{ "namespace": "size", "value_int": 3 }`
  - `#ratio=0.1f` represents this `NumericRestrictNamespace`: `{ "namespace": "ratio", "value_float": 0.1 }`
  - `#weight=0.3d` represents this `NumericRestrictNamespace`: `{ "namespace": "weight", "value_double": 0.3 }`
What's next
- Learn how to Create and manage your index