beyondwatts

Creating a local search engine with typesense

Nov 2024

Having migrated my pocket bookmarks to readeck, I wanted to create a unified local search that indexes these bookmarks and my internal documentation, which is stored in outline. Fortunately both readeck and outline have APIs, so it was ‘just’ a case of finding a suitable search application. Enter typesense.

Typesense is an open-source, typo-tolerant search engine optimized for instant (typically sub-50ms) search-as-you-type experiences and developer productivity.

Getting typesense up and running

Ultimately, I deployed typesense in kubernetes but for a quick start, the documentation provides a docker compose file. With this, we can have typesense up and running easily:

services:
  typesense:
    image: typesense/typesense:27.1
    restart: on-failure
    ports:
      - "8108:8108"
    volumes:
      - ./typesense-data:/data
    command: '--data-dir /data --api-key=xyz --enable-cors'

Create the typesense-data directory and off we go:

mkdir "$(pwd)"/typesense-data
docker-compose up
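
Before creating any collections, it can be worth confirming that the container is answering. The following is a minimal sketch, assuming Node 18+ (for the built-in fetch) and the 8108 port mapping from the compose file above. checkHealth is a hypothetical helper name; it relies on typesense's /health endpoint, which reports {"ok": true} when the server is ready:

```javascript
// Hypothetical helper: resolves to true when typesense reports healthy.
// fetchImpl is injectable so the function can be exercised without a server.
async function checkHealth(baseUrl, fetchImpl = fetch) {
  try {
    const response = await fetchImpl(`${baseUrl}/health`);
    if (!response.ok) return false;
    const body = await response.json();
    return body.ok === true;
  } catch (error) {
    // Connection refused, DNS failure and similar: treat as not healthy.
    return false;
  }
}

// Usage (assumes the docker compose port mapping 8108:8108):
// checkHealth('http://localhost:8108').then((ok) => console.log('healthy:', ok));
```

The health endpoint does not need the API key, so this works before any keys have been sorted out.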

Creating a typesense collection and adding data

Within typesense, data is stored in collections. Before we can start searching we need to create a collection and add some data to it:

Create a typesense collection

My collection is quite simple with just three fields (title, text and url) to start with. In the snippet below, I am using https and an internal subdomain, but we could also just connect to localhost with port 8108 and http.

const Typesense = require('typesense')

let client = new Typesense.Client({
  'nodes': [{
    'host': 'typesense.example.com', // For Typesense Cloud use xxx.a1.typesense.net
    'port': 443,      // For Typesense Cloud use 443
    'protocol': 'https'   // For Typesense Cloud use https
  }],
  'apiKey': '<API_KEY>',
  'connectionTimeoutSeconds': 2
})

let homeSchema = {
  'name': 'home',
  'fields': [
    {'name': 'title', 'type': 'string' },
    {'name': 'text', 'type': 'string'},
    {'name': 'url', 'type': 'string' },
  ]
}

client.collections().create(homeSchema)
.then(function (data) {
  console.log(data)
})

At this point, I have not set a default_sorting_field as I haven’t yet decided how I want the results to be sorted.
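
If a default order is wanted later, one option is to add a numeric field and name it in default_sorting_field, since typesense expects that setting to point at a numeric field. This is a sketch only: the updated_at field and the homeSortedSchema name are assumptions for illustration and are not part of the collection used in this post.

```javascript
// Hypothetical variant of the schema with a numeric field to sort by.
// 'updated_at' is an assumed field (Unix timestamp in seconds); it is not in the original schema.
const homeSortedSchema = {
  name: 'home',
  fields: [
    { name: 'title', type: 'string' },
    { name: 'text', type: 'string' },
    { name: 'url', type: 'string' },
    { name: 'updated_at', type: 'int32' },
  ],
  default_sorting_field: 'updated_at',
};

console.log(homeSortedSchema.default_sorting_field); // prints "updated_at"
```

Every document added to such a collection would then need to carry an updated_at value.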

Adding data to the home collection

The typesense documentation provides example code to load a json file. I experimented with this initially, but for the actual search I wanted to load data from the outline and readeck APIs. Luckily, the two APIs are very similar, each with a list endpoint and a content endpoint, so all that is needed is to loop through the list and post each document to typesense. Here’s the code for outline. It uses an API token for access; these are created in the outline UI. I’m also stripping the markdown syntax from the outline documents so they index better and display cleanly in the typesense search results:

const axios = require('axios');
const rateLimit = require('axios-rate-limit');
const axiosRateLimited = rateLimit(axios.create(), { maxRequests: 1, perMilliseconds: 1000});
const removeMarkdown = require('remove-markdown');

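let bearerOutline = "outline_bearer_code"
let urlOutline = "https://outline.example.com"
// Page size for documents.list; the value is an assumption (it was not defined in the original snippet)
let apiListLimit = 25
// 'client' below is the typesense client created in the previous snippet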

async function getOutlineDocumentList(offset)
{
  let apiUrl = urlOutline + "/api/documents.list"
  try {
    let data = {
      'limit': apiListLimit,
      'offset': offset
    };
    let config = {
      headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + bearerOutline
    }}
    let r = await axiosRateLimited.post(apiUrl, data, config)
    return r.data.data;
  } catch (error) {
    console.error('Error getting document list')
  }
}

async function getOutlineDocument(d)
{
  let apiUrl = urlOutline + "/api/documents.info"
  try {
    let data = {
      'id': d.id
    };
    let config = {
      headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + bearerOutline
    }}
    let r = await axiosRateLimited.post(apiUrl, data, config)
    return r.data.data;
  } catch (error) {
    console.error('Error getting document')
  }
}

async function loadOutline()
{
  let offset = 0;
  while (true) {
    let docs = await getOutlineDocumentList(offset * apiListLimit);
    // getOutlineDocumentList returns undefined on error, so stop in that case too
    if (!docs || docs.length == 0) {
      console.log('Finished getting outline documents');
      return;
    }
    for (const d of docs)
    {
      console.log(`Getting ${d.id}`)
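      let outlineDocument = await getOutlineDocument(d);
      // Skip documents that could not be fetched (getOutlineDocument returns undefined on error)
      if (!outlineDocument) continue;
      let newSearchEntry = {};
      newSearchEntry.id = outlineDocument.id;
      newSearchEntry.title = outlineDocument.title;
      newSearchEntry.url = urlOutline + outlineDocument.url;
      newSearchEntry.text = removeMarkdown(outlineDocument.text);
      // Wait for the upsert so a failure surfaces here instead of as an unhandled rejection
      await client.collections('home').documents().upsert(newSearchEntry);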
    }
    offset++;
  }
}

loadOutline();

Amazingly, there is only one line for typesense:

      await client.collections('home').documents().upsert(newSearchEntry);
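
The list-then-fetch loop in loadOutline follows a pattern that would repeat for readeck, so it could be pulled into a small helper. This is a minimal sketch, assuming each list call accepts an offset and a limit and returns an array (empty once the results run out). forEachPaged and its parameters are hypothetical names, not part of either API:

```javascript
// Hypothetical helper: calls fetchPage(offset, limit) until it returns an empty page,
// passing each item to handleItem in order. Resolves to the number of items handled.
async function forEachPaged(fetchPage, handleItem, limit = 25) {
  let offset = 0;
  let count = 0;
  while (true) {
    const page = await fetchPage(offset, limit);
    // Stop on an empty page, or when the fetch failed and returned nothing
    if (!page || page.length === 0) {
      return count;
    }
    for (const item of page) {
      await handleItem(item);
      count++;
    }
    offset += limit;
  }
}

// Usage with an in-memory list standing in for an API (illustrative data only):
// const items = ['a', 'b', 'c', 'd', 'e'];
// forEachPaged(async (offset, limit) => items.slice(offset, offset + limit),
//              async (item) => console.log(item), 2);
```

Keeping the paging separate from the per-document work means the upsert step stays the only typesense-specific line.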

Searching the home collection from the command line

Before we create a search UI application, we can check everything is working from the command line.

curl "https://typesense.example.com/collections/home/documents/search?q=typesense&query_by=title&x-typesense-api-key=<API_KEY>"
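
The same check can be made from Node. The sketch below only builds the request URL, so it can be tried without a server; buildSearchUrl is a hypothetical helper name, and the commented usage assumes Node 18+ for the built-in fetch. It sends the key in the X-TYPESENSE-API-KEY header rather than in the URL, which keeps it out of shell history and access logs:

```javascript
// Hypothetical helper: builds the documents/search URL for a collection.
// URLSearchParams takes care of encoding spaces and special characters in the query.
function buildSearchUrl(baseUrl, collection, params) {
  const query = new URLSearchParams(params).toString();
  return `${baseUrl}/collections/${encodeURIComponent(collection)}/documents/search?${query}`;
}

// Usage:
// const url = buildSearchUrl('https://typesense.example.com', 'home',
//                            { q: 'local search', query_by: 'title,text' });
// fetch(url, { headers: { 'X-TYPESENSE-API-KEY': '<API_KEY>' } })
//   .then((r) => r.json())
//   .then((result) => console.log(result.found, result.hits.map((h) => h.document.title)));
```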

Creating a search UI application

The typesense documentation is pretty helpful, and again I’m following it here. The folks over at Algolia have built and open-sourced Instantsearch.js, a collection of out-of-the-box UI components that you can use to build interactive search experiences quickly. Typesense have built an adapter that uses the same Instantsearch widgets but sends the queries to Typesense instead. There is a basic javascript example (without using any package managers) available at https://github.com/typesense/typesense-instantsearch-demo-no-npm-yarn . I cloned this repo as the basis of a simple app. I haven’t yet even changed the title or page header…

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <meta name="theme-color" content="#000000">
    <link rel="manifest" href="./manifest.webmanifest">
    <link rel="shortcut icon" href="./favicon.png">
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/instantsearch.css@7/themes/algolia-min.css">
    <link rel="stylesheet" href="index.css">
    <title>Typesense InstantSearch.js Demo</title>
</head>

<body>
<header class="header">
    <h1 class="header-title">
        <a href="/">Instant Search Demo</a>
    </h1>
    <p class="header-subtitle">
        using
        <a href="https://github.com/algolia/instantsearch.js">
            Typesense + InstantSearch.js
        </a>
    </p>
</header>

<div class="container">
    <div class="search-panel">
        <div class="search-panel__results">
            <div id="searchbox"></div>
            <div id="hits"></div>
        </div>
    </div>

    <div id="pagination"></div>
</div>

<script src="https://cdn.jsdelivr.net/npm/instantsearch.js@4.44.0"></script>
<script src="https://cdn.jsdelivr.net/npm/typesense-instantsearch-adapter@2/dist/typesense-instantsearch-adapter.min.js"></script>

<script>
    // Adapted from https://github.com/typesense/typesense-instantsearch-demo-no-npm-yarn
    // Search API parameters are at https://www.algolia.com/doc/api-reference/search-api-parameters/
    function getQueryParam(param) {
      const urlParams = new URLSearchParams(window.location.search);
      return urlParams.get(param);
    }

    function getSearchParam() {
      // URLSearchParams.get() returns null (not undefined) when the parameter is absent
      const param = getQueryParam('q');
      return param !== null ? param : '';
    }

    const typesenseInstantsearchAdapter = new TypesenseInstantSearchAdapter({
        server: {
            apiKey: 'API-KEY', // Be sure to use an API key that only allows searches, in production
            nodes: [
                {
                    host: 'typesense.example.com',
                    port: '443',
                    protocol: 'https',
                },
            ],
        },
        // The following parameters are directly passed to Typesense's search API endpoint.
        //  So you can pass any parameters supported by the search endpoint below.
        //  queryBy is required.
        //  filterBy is managed and overridden by InstantSearch.js. To set it, you want to use one of the filter widgets like refinementList or use the `configure` widget.
        additionalSearchParameters: {
            queryBy: 'title,text,url',
        },
    });
    const searchClient = typesenseInstantsearchAdapter.searchClient;

    const search = instantsearch({
        searchClient,
        indexName: 'home',
    });

    search.addWidgets([
        instantsearch.widgets.searchBox({
            container: '#searchbox',
        }),
        instantsearch.widgets.configure({
            query: getQueryParam('q'),
            distinct: 1,
            attributeForDistinct: 'url',
            hitsPerPage: 16,
        }),
        instantsearch.widgets.hits({
            container: '#hits',
            templates: {
                item(item) {
                    return `
                        <div>
                          <div class="hit-name">
                            ${item._highlightResult.title.value}
                          </div>
                          <div class="hit-text">
                            ${item._snippetResult.text.value}
                          </div>
                          <a href="${item._highlightResult.url.value}">${item._highlightResult.url.value}</a>
                        </div>
                      `;
                },
            },
        }),
        instantsearch.widgets.pagination({
            container: '#pagination',
        }),
    ]);

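    window.onload = () => {
      const searchParam = getSearchParam();
      // '#searchbox' is the widget's container div; the text field is the input rendered inside it
      const input = document.querySelector('#searchbox input');
      if (searchParam && input) {
        input.value = searchParam;
      }
    };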

search.start();
</script>
</body>
</html>

There are a couple of points to call out - firstly, we update the config to point to our server:

        server: {
            apiKey: 'API-KEY', // Be sure to use an API key that only allows searches, in production
            nodes: [
                {
                    host: 'typesense.example.com',
                    port: '443',
                    protocol: 'https',
                },
            ],
        },
        additionalSearchParameters: {
            queryBy: 'title,text,url',
        },

And our collection:

    const search = instantsearch({
        searchClient,
        indexName: 'home',
    });

We use our field names for the results (hits) page:

        instantsearch.widgets.hits({
            container: '#hits',
            templates: {
                item(item) {
                    return `
                        <div>
                          <div class="hit-name">
                            ${item._highlightResult.title.value}
                          </div>
                          <div class="hit-text">
                            ${item._snippetResult.text.value}
                          </div>
                          <a href="${item._highlightResult.url.value}">${item._highlightResult.url.value}</a>
                        </div>
                      `;
                },
            },
        }),

There’s one last change in the code, for integrating with homepage: I wanted to be able to pass a search query using the standard ?q=search_term page url. To do this we need to get the q parameter from the URL, if it exists:

    function getQueryParam(param) {
      const urlParams = new URLSearchParams(window.location.search);
      return urlParams.get(param);
    }

    function getSearchParam() {
      // URLSearchParams.get() returns null (not undefined) when the parameter is absent
      const param = getQueryParam('q');
      return param !== null ? param : '';
    }
    
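    window.onload = () => {
      const searchParam = getSearchParam();
      // '#searchbox' is the widget's container div; the text field is the input rendered inside it
      const input = document.querySelector('#searchbox input');
      if (searchParam && input) {
        input.value = searchParam;
      }
    };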

We can then use it to drive the query:

        instantsearch.widgets.configure({
            query: getQueryParam('q'),
            distinct: 1,
            attributeForDistinct: 'url',
            hitsPerPage: 16,
        }),
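
One detail worth knowing when reading ?q= from the URL: URLSearchParams.get() returns null, not undefined, when the parameter is missing. The sketch below is a standalone version of the lookup that takes the query string as an argument, so it can be tried in Node or a browser console. getQueryParamFrom and searchTermFrom are hypothetical names used only for this illustration:

```javascript
// Hypothetical standalone lookup: 'search' is a query string such as window.location.search.
function getQueryParamFrom(search, name) {
  return new URLSearchParams(search).get(name);
}

// Returns the search term, or '' when the parameter is absent.
function searchTermFrom(search) {
  const value = getQueryParamFrom(search, 'q');
  return value === null ? '' : value;
}

console.log(searchTermFrom('?q=typesense')); // prints "typesense"
console.log(searchTermFrom('?page=2'));      // prints an empty line
```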

Integrating with homepage

I use homepage as my default browser start page and homepage has a search widget. We can easily enable our search engine with a little yaml:

  widgets.yaml: |
    - search:
        provider: custom
        focus: true
        url: http://search.example.com/?q=
        target: _blank
        suggestionUrl: http://search.example.com/search/?q= # Optional
        showSearchSuggestions: true # Optional

The homepage search widget also supports suggestions, which we can enable with a small express json server. The homepage documentation provides the format needed in the response body for the URL given in suggestionUrl: the first entry of the array contains the search query, and the second is an array of the suggestions. In the example below, the search query was home.

[
  "home",
  [
    "home depot",
    "home depot near me",
    "home equity loan",
    "homeworkify",
    "homedepot.com",
    "homebase login",
    "home depot credit card",
    "home goods"
  ]
]

The code is essentially the typesense documentation javascript example with a little extra code to handle the ?q= url parameter and create the response json in the correct format.

const express = require('express');
const url = require('url');
const typesense = require('typesense')

const app = express();
const router = express.Router();
const port = 8080;

app.use('/', express.static('search'))

// set the server to listen on the port configured above (8080)
app.listen(port, () => console.log(`Listening on port ${port}`));

app.get('/search', function (req, res) {
  var urlParts = url.parse(req.url, true);
  var parameters = urlParts.query;
  var q = '';
  if (typeof(parameters.q) == 'undefined')
  {
    res.json({text: 'search parameter is required'});
    return;
  } else {
    q = parameters.q;
  }
  
  let searchParameters = {
    'q'         : q,
    'query_by'  : 'title,text'
  }

  let client = new typesense.Client({
    'nodes': [{
      'host': 'typesense.example.com', // For Typesense Cloud use xxx.a1.typesense.net
      'port': 443,      // For Typesense Cloud use 443
      'protocol': 'https'   // For Typesense Cloud use https
    }],
    'apiKey': 'API-KEY',
    'connectionTimeoutSeconds': 2
  }); 

  let searchResult = [];
  let searchResultReturn = [];
  client.collections('home')
    .documents()
    .search(searchParameters)
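    .then(function (searchResults) {
      for (const h of searchResults.hits)
      {
        searchResult.push(
          h.document.title
        );
      }
      searchResultReturn.push(q);
      searchResultReturn.push(searchResult);
      res.json(searchResultReturn);
    })
    .catch(function (error) {
      // Without this, a failed search would leave the request hanging
      console.error('Search failed', error);
      res.status(500).json({text: 'search failed'});
    })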
});
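
The step that shapes the response can be isolated from express and typesense, which makes it easy to check against the format homepage expects. This is a minimal sketch: toSuggestionResponse is a hypothetical helper name, and dropping duplicate titles is an extra assumption on my part rather than something homepage requires:

```javascript
// Hypothetical helper: turns a typesense search response into the
// [query, [suggestions]] array that homepage expects, skipping duplicate titles.
function toSuggestionResponse(q, searchResults) {
  const titles = [];
  for (const hit of searchResults.hits || []) {
    const title = hit.document && hit.document.title;
    if (title && !titles.includes(title)) {
      titles.push(title);
    }
  }
  return [q, titles];
}

// Usage with a made-up response (illustrative data only):
const exampleResults = {
  hits: [
    { document: { title: 'Home network notes' } },
    { document: { title: 'Home network notes' } },
    { document: { title: 'Homepage widgets' } },
  ],
};
console.log(JSON.stringify(toSuggestionResponse('home', exampleResults)));
// prints ["home",["Home network notes","Homepage widgets"]]
```

Inside the route handler, the body of the .then callback would then reduce to res.json(toSuggestionResponse(q, searchResults)).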

Deploying typesense with kubernetes

Using the docker compose example above, it was straightforward to deploy a basic typesense kubernetes instance.

The only tricky part was getting the correct statement for the container command. This seems to be working for me:

          command: ['/opt/typesense-server', '--data-dir', '/data', '--api-key', 'API-KEY', '--enable-cors']

Full yaml:

---
apiVersion: v1
kind: Service
metadata:
  name: typesense
  namespace: search
spec:
  selector:
    app: typesense
  ports:
  - protocol: TCP
    port: 8108
    targetPort: 8108
  type: ClusterIP

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: typesense
  namespace: search
  labels:
    app: typesense
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: typesense
  template:
    metadata:
      labels:
        app: typesense
    spec:
      containers:
        - name: typesense
          image: typesense/typesense:27.1
          command: ['/opt/typesense-server', '--data-dir', '/data', '--api-key', 'xyz', '--enable-cors']
          ports:
            - containerPort: 8108
          volumeMounts:
            - name: typesense
              mountPath: /data
      volumes:
        - name: typesense
          persistentVolumeClaim:
            claimName: typesense

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: typesense
  namespace: search
  labels:
    app: typesense
spec:
  storageClassName: ceph-block
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
      
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: typesense
  namespace: search
spec:
  entryPoints:
    - websecure
  routes:
  - match: Host(`typesense.example.com`)
    kind: Rule
    services:
    - name: typesense
      port: 8108

That’s it for now. I have a few improvements to make, such as: creating a CronJob to update the search data, tidying up the search UI results page, adding more content to the search index and investigating a high-availability deployment for typesense, but this is good for now!