Calculating Embeddings Locally (Using ONNX Runtime and onnx-models/all-MiniLM-L6-v2-onnx)
January 31, 2026
OctoRouter
I recently built OctoRouter to see how LLM routing works and to understand the techniques used to build such infrastructure. I decided to implement a semantic routing strategy and faced three options for approaching the task:
- Get prompt classification from an LLM (requires network request)
- Use a classification model (requires training a model)
- Use the ONNX runtime with a sentence embedding model (calculations happen locally, might not be very accurate due to limited model size)
I chose the third approach; little did I know how much I was about to learn.
Let's get into the details of what I've been up to.
Installations
You can get started with the ONNX runtime here: https://onnxruntime.ai/docs/install/
Download the all-MiniLM-L6-v2-onnx embedding model and its tokenizer from https://huggingface.co/onnx-models/all-MiniLM-L6-v2-onnx/tree/main
Also, install the tokenizer and onnxruntime_go packages:
go get github.com/sugarme/tokenizer
go get github.com/yalue/onnxruntime_go
Semantic calculator
package semantics

import (
	"fmt"
	"math"
	"os"
	"path/filepath"
	"sync"

	"github.com/sugarme/tokenizer"
	"github.com/sugarme/tokenizer/pretrained"
	ort "github.com/yalue/onnxruntime_go"
	"go.uber.org/zap"
)

const (
	ModelPath    = "assets/models/embedding.onnx"
	TokenPath    = "assets/models/tokenizer.json"
	MaxSeqLength = 128
	EmbeddingDim = 384
)

type SemanticCalculator struct {
	tokenizer     *tokenizer.Tokenizer
	session       *ort.AdvancedSession
	mu            sync.Mutex
	logger        *zap.Logger
	inputIds      []int64
	attentionMask []int64
	tokenTypeIds  []int64
	outputData    []float32
}
- ModelPath is the path to the all-MiniLM-L6-v2-onnx embedding model.
- TokenPath is the path to the tokenizer for the embedding model.
- MaxSeqLength is the maximum number of tokens this model can process at once.
- EmbeddingDim is the number of dimensions this model embeds each token into.
func NewSemanticCalculator() (*SemanticCalculator, error) {
	if !ort.IsInitialized() {
		ort.SetSharedLibraryPath("/opt/homebrew/lib/libonnxruntime.dylib")
		err := ort.InitializeEnvironment()
		if err != nil {
			return nil, err
		}
	}
	sc := &SemanticCalculator{
		inputIds:      make([]int64, MaxSeqLength),
		attentionMask: make([]int64, MaxSeqLength),
		tokenTypeIds:  make([]int64, MaxSeqLength),
		outputData:    make([]float32, MaxSeqLength*EmbeddingDim),
		logger:        zap.NewNop(), // a zero-value &zap.Logger{} would panic when used
	}
	err := sc.loadModel()
	if err != nil {
		fmt.Println("Failed to load the embedding model")
		return nil, err
	}
	return sc, nil
}
/opt/homebrew/lib/libonnxruntime.dylib is the path to the ONNX Runtime shared library on macOS when installed via Homebrew; adjust this path for your platform.
The NewSemanticCalculator function returns a new SemanticCalculator struct, initializing inputIds, attentionMask, and tokenTypeIds to int64 slices of length MaxSeqLength.
- inputIds contains tokens (numerical representations of individual words).
- attentionMask contains 1s and 0s (1 marks real tokens, 0 marks empty/padding positions).
- tokenTypeIds is really meant for when we're processing multiple sentences at a time.
The outputData slice is initialized to length MaxSeqLength * EmbeddingDim (each of the 128 tokens is embedded in 384 dimensions).
Loading the Model
func (sc *SemanticCalculator) loadModel() error {
	tokenizerPath := resolveFinalPath(TokenPath)
	if _, err := os.Stat(tokenizerPath); err == nil {
		tk, err := pretrained.FromFile(tokenizerPath)
		if err != nil {
			fmt.Printf("Error loading tokenizer: %v\n", err)
		}
		if tk != nil {
			sc.tokenizer = tk
		}
	}
	if sc.tokenizer == nil {
		// Fall back to the stock BERT tokenizer if tokenizer.json is missing or invalid.
		sc.tokenizer = pretrained.BertBaseUncased()
	}
	modelPath := resolveFinalPath(ModelPath)
	if _, err := os.Stat(modelPath); err != nil {
		return err
	}
	inputShape := ort.NewShape(1, MaxSeqLength)
	inputIdsTensor, _ := ort.NewTensor(inputShape, sc.inputIds)
	maskTensor, _ := ort.NewTensor(inputShape, sc.attentionMask)
	typeIdsTensor, _ := ort.NewTensor(inputShape, sc.tokenTypeIds)
	outputShape := ort.NewShape(1, MaxSeqLength, EmbeddingDim)
	outputDataTensor, _ := ort.NewTensor(outputShape, sc.outputData)
	session, err := ort.NewAdvancedSession(
		modelPath,
		[]string{"input_ids", "attention_mask", "token_type_ids"},
		[]string{"token_embeddings"},
		[]ort.Value{inputIdsTensor, maskTensor, typeIdsTensor},
		[]ort.Value{outputDataTensor},
		nil,
	)
	if err != nil {
		return err
	}
	sc.session = session
	return nil
}
The initial part of the loadModel function just loads tokenizer.json from its location; if it's unsuccessful, it falls back to pretrained.BertBaseUncased().
The remaining part of the function is interesting. We're creating input and output tensors to feed into the model.
- inputShape is [1, 128]: a single sentence input, padded or truncated to 128 tokens.
- inputIdsTensor creates a tensor from inputIds.
- maskTensor creates a tensor from attentionMask.
- typeIdsTensor creates a tensor from tokenTypeIds.
- outputShape is [1, 128, 384].
- outputDataTensor creates a tensor backed by outputData, which will receive an output array of length 128 * 384.
Finally, we create a new session with these arguments. One thing to look out for is "token_embeddings", which is the output name of the embedding model. From my experience, it's different with every model.
Just in case you decide to try out a different embedding model, an easy method to get the output name is by using Netron.
Calculating Embeddings
Here comes the fun & interesting part: calculating embeddings...
func (sc *SemanticCalculator) CalculateEmbeddings(text string) ([]float32, error) {
	sc.mu.Lock()
	defer sc.mu.Unlock()
	en, err := sc.tokenizer.EncodeSingle(text, true)
	if err != nil {
		return nil, err
	}
	ids := en.GetIds()
	mask := en.GetAttentionMask()
	typeIds := en.GetTypeIds()
	if len(ids) > 0 && ids[0] != 101 {
		newIds := []int{101}
		newIds = append(newIds, ids...)
		if newIds[len(newIds)-1] != 102 {
			newIds = append(newIds, 102)
		}
		ids = newIds
		// Resync mask
		mask = make([]int, len(ids))
		for i := range mask {
			mask[i] = 1
		}
		typeIds = make([]int, len(ids))
	}
	for i := range MaxSeqLength {
		if i < len(ids) {
			sc.inputIds[i] = int64(ids[i])
			sc.attentionMask[i] = int64(mask[i])
			sc.tokenTypeIds[i] = int64(typeIds[i])
		} else {
			sc.inputIds[i] = 0
			sc.attentionMask[i] = 0
			sc.tokenTypeIds[i] = 0
		}
	}
	err = sc.session.Run()
	if err != nil {
		return nil, err
	}
	embedding := make([]float32, EmbeddingDim)
	var validTokens float32
	for i := range MaxSeqLength {
		if i < len(ids) {
			validTokens++
			for d := range EmbeddingDim {
				embedding[d] += sc.outputData[i*EmbeddingDim+d]
			}
		}
	}
	if validTokens > 0 {
		for d := range EmbeddingDim {
			embedding[d] /= validTokens
		}
	}
	return embedding, nil
}
The sc.tokenizer.EncodeSingle(text, true) call encodes a single text (sentence). The second argument tells the tokenizer to add the [CLS] and [SEP] special tokens, which mark the beginning and end of the sentence respectively.
We get the ids (inputIds), mask (attentionMask), and typeIds from methods on the resulting encoding.
In the if statement, we check whether ids starts with 101, the token ID for [CLS], and ends with 102, the token ID for [SEP]. If it doesn't, we add those values and set all the mask entries to 1.
In the for loop, we iterate MaxSeqLength (128) times. While there are still valid token values, we copy them into the input slices we initialized in our SemanticCalculator struct; once the input is exhausted, the remaining positions are filled with 0 (padding).
That might be a lot to take in at once, so I'll use a real world example here.
user prompt is "hello world"
tokenizer.EncodeSingle("hello world", true) will return the following
- `ids`: "[101, 7592, 2088, 102]"
- `masks`: "[1, 1, 1, 1]"
So we first check if the 101 and 102 are present at both ends.
Remember our model requires an input length of 128.
We fill in the rest values as follows, before running the model.
sc.inputIds = "[101, 7592, 2088, 102, 0, ..., 0]"
sc.attentionMask = "[1, 1, 1, 1, 0, ..., 0]"
These are the inputs to the model that we provided in the load-model section.
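The wrap-and-pad steps above can be sketched as a standalone function (a simplified version of the logic inside CalculateEmbeddings; padInputs is a hypothetical helper, and the token IDs for "hello world" are taken from the example above):

```go
package main

import "fmt"

const maxSeqLength = 128

// padInputs wraps raw token IDs with [CLS] (101) and [SEP] (102) if they are
// missing, then pads the IDs and the attention mask out to maxSeqLength.
func padInputs(ids []int64) (inputIds, attentionMask []int64) {
	if len(ids) == 0 || ids[0] != 101 {
		ids = append([]int64{101}, ids...)
	}
	if ids[len(ids)-1] != 102 {
		ids = append(ids, 102)
	}
	inputIds = make([]int64, maxSeqLength)
	attentionMask = make([]int64, maxSeqLength)
	for i := 0; i < maxSeqLength && i < len(ids); i++ {
		inputIds[i] = ids[i]
		attentionMask[i] = 1 // 1 marks a real token; padding positions stay 0
	}
	return inputIds, attentionMask
}

func main() {
	// "hello world" tokenizes to [7592, 2088] before special tokens are added.
	ids, mask := padInputs([]int64{7592, 2088})
	fmt.Println(ids[:6])  // [101 7592 2088 102 0 0]
	fmt.Println(mask[:6]) // [1 1 1 1 0 0]
}
```

The remaining 122 positions of both slices stay zero, matching the padded inputs shown above.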
Next, we call sc.session.Run(). This is the line that runs the model.
With the results in sc.outputData we can calculate the average of the entire sentence; this will give us the final embedding.
We loop through the output, skipping the padded positions, and sum the values at each embedding dimension across all the valid tokens. Finally, we divide by the number of valid tokens to get the average. I think another example is needed here, because I might have botched that explanation. 😂
Every token is embedded as a 384-dimensional vector:
hello [7592] - "[455, 589, ... 1427]"
world [2088] - "[123, 345 ... 4566]"
[
"[455, 589, ... 1427]",
"[123, 345 ... 4566]"
]
when we perform this addition we have just one embedding as follows
"[455 + 123, 589 + 345, ... + ..., 1427 + 4566]"
which evaluates to a single array with the following values
"[578, 934, ..., 5993]"
finally we divide through by the number of valid tokens, in this case 2
"[578/2, 934/2, ..., 5993/2]" = "[289, 467, ..., 2996.5]"
The final embedding is "[289, 467, ..., 2996.5]"
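This mean-pooling step can be sketched as a small function over the flat output buffer (meanPool is a hypothetical helper; I use a 4-dimensional embedding instead of 384 so the numbers stay readable):

```go
package main

import "fmt"

// meanPool averages the per-token embeddings stored in a flat
// [seqLen * dim] buffer, summing only the first validTokens rows
// and skipping the padded rows entirely.
func meanPool(output []float32, dim, validTokens int) []float32 {
	embedding := make([]float32, dim)
	for i := 0; i < validTokens; i++ {
		for d := 0; d < dim; d++ {
			embedding[d] += output[i*dim+d]
		}
	}
	for d := 0; d < dim; d++ {
		embedding[d] /= float32(validTokens)
	}
	return embedding
}

func main() {
	// Two valid tokens with dim 4; the third (padded) row is ignored.
	output := []float32{
		455, 589, 12, 1427, // token "hello"
		123, 345, 8, 4566, // token "world"
		99, 99, 99, 99, // padding row - never summed
	}
	fmt.Println(meanPool(output, 4, 2)) // [289 467 10 2996.5]
}
```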
That's the entirety of the magic; that's how to calculate a sentence embedding. Interestingly simple, right?
Cosine Similarity
We use cosine similarity to measure how close two sentence embeddings are. The output ranges from -1 to 1, though for these embeddings it typically falls between 0 and 1: values near 0 mean the sentences are unrelated, and values near 1 mean they are nearly identical in meaning.
func (sc *SemanticCalculator) CosineSimilarity(a, b []float32) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}
That is the standard formula for cosine similarity: the dot product of the two vectors divided by the product of their magnitudes.
With that, this article comes to an end. Isn't it amazing that you might not need to make a network request to get embeddings for user prompts? This opens a whole world for devices and programs where latency is of the utmost importance.
The complete code for this project is on GitHub.