Searching Manuals with ElasticSearch

Continuing from the last post on searching the Linux manual (“man”) pages, this week I’m going to try ElasticSearch and see how well it works.

Why ElasticSearch? Well, as much as I have an issue with its licensing, it’s nearly synonymous with searching text documents.

Ironically, for such a widely deployed application, it goes on my list of “software whose quick start guide doesn’t work”. After a couple of hours of debugging, the best I can tell is that the Docker image now defaults to starting in clustered mode instead of single-node mode.

To remedy this, I took the Docker Compose file from the multi-node quick start, deleted the other nodes, and set discovery.type=single-node on the remaining node.

We can reuse most of last time’s code to load the man pages into the database:

Python

from elasticsearch import Elasticsearch
import gzip
import os

# Configuration
manpath = "/usr/share/man/"

# Global vars
es = Elasticsearch(
  "https://127.0.0.1:9200",
  verify_certs=False,
  basic_auth=("elastic","changeme")
)

# Print connection info to check it's working
print(es.info())

# ... (same get_* functions as last post)

# Loop through all sections and get pages
for section in get_sections():
  # Loop through pages & index each one as its own document
  for page in get_section_pages(section):
    content = get_page_contents(section, page)
    es.index(
      index='man',
      document={
        'section': section,
        'page': page,
        'content': content
      }
    )

# Refresh the index so the new documents are visible to searches
es.indices.refresh(index='man')
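
Indexing one page per request works, but it is slow for thousands of man pages. A minimal sketch of the same load using the client's bulk helper follows. The function names, the (section, page, content) tuple shape, and the deterministic document ID are assumptions for illustration and are not part of the original loader.

```python
from typing import Iterable, Iterator, Tuple


def man_page_actions(
    pages: Iterable[Tuple[str, str, str]], index: str = "man"
) -> Iterator[dict]:
    """Turn (section, page, content) tuples into bulk-index actions."""
    for section, page, content in pages:
        yield {
            "_index": index,
            # Assumption: a deterministic ID, so re-running the loader
            # overwrites existing pages instead of adding duplicates.
            "_id": f"{section}:{page}",
            "_source": {
                "section": section,
                "page": page,
                "content": content,
            },
        }


def load_pages(es, pages: Iterable[Tuple[str, str, str]], index: str = "man") -> int:
    """Send every page through helpers.bulk and return the success count."""
    # Imported here so the action builder above works without the client installed.
    from elasticsearch import helpers

    indexed, _errors = helpers.bulk(es, man_page_actions(pages, index))
    # Refresh so the new documents are visible to searches.
    es.indices.refresh(index=index)
    return indexed
```

With the get_* functions from the last post, the tuples could be produced by a generator that walks the same two loops and yields (section, page, get_page_contents(section, page)).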

Searching is also fairly easy:

Python

from elasticsearch import Elasticsearch

# Global vars
es = Elasticsearch(
  "https://127.0.0.1:9200",
  verify_certs=False,
  basic_auth=("elastic","changeme")
)

# Print connection info to check it's working
#print(es.info())

results = es.search(index="man",q="password requirements")

for doc in results["hits"]["hits"]:
  print(doc["_score"], doc["_source"]["section"], doc["_source"]["page"])

Let’s see the results!

$ python3 search.py 
8.162808 man3 getpass.3
8.019694 man3 endspent.3
8.019694 man3 lckpwdf.3
8.019694 man3 ulckpwdf.3
8.019694 man3 fgetspent.3
8.019694 man3 setspent.3
8.019694 man3 sgetspent.3
8.019694 man3 sgetspent_r.3
8.019694 man3 getspent.3
8.019694 man3 getspnam.3

These results look about the same as the plain word-match search in SQLite. The documentation for the q parameter of search() seems to explain why:

“q – Query in the Lucene query string syntax using query parameter search. Query parameter searches do not support the full Elasticsearch Query DSL but are handy for testing.”

Since it looks like the Python library is not the easiest way to query, let’s jump over to Kibana since it was included in the Docker compose file.

Here’s what the same query looks like in the Kibana console:

HTTP

POST /man/_search?pretty
  {
    "query": {
      "query_string": {
        "query": "password requirements"
      }
    },
    "_source": ["section","page"]
  }
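
For anyone who would rather stay in one script, the same body can probably be sent from Python as well, since the 8.x client's search() takes a query argument holding Query DSL. This is a sketch under that assumption; build_query_string and search_pages are hypothetical helper names, and es is a client created the same way as in the scripts above.

```python
def build_query_string(text: str) -> dict:
    """Query DSL equivalent of the "query" part of the console request."""
    return {"query_string": {"query": text}}


def search_pages(es, text: str, index: str = "man"):
    """Run the query and return (score, section, page) tuples."""
    response = es.search(
        index=index,
        query=build_query_string(text),
        # Only return the two small fields, not the full page content.
        source_includes=["section", "page"],
    )
    return [
        (hit["_score"], hit["_source"]["section"], hit["_source"]["page"])
        for hit in response["hits"]["hits"]
    ]
```

Calling search_pages(es, "password requirements") should then yield the same kind of rows that search.py prints.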

Using the Standard analyzer appears to get the same results:

HTTP

POST /man/_search?pretty
  {
    "query": {
      "match": {
        "content":{
          "query": "password requirements",
          "analyzer": "standard"
        }
      }
    },
    "_source": ["section","page"]
  }

Using the English text analyzer is promising though!

HTTP

POST /man/_search?pretty
  {
    "query": {
      "match": {
        "content":{
          "query": "password requirements",
          "analyzer": "english"
        }
      }
    },
    "_source": ["section","page"]
  }
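
One caveat that may be worth checking: as far as I can tell, the analyzer option in a match query only controls how the query text is tokenized. The loader never defines a mapping, so the content field was presumably indexed with the default standard analyzer, and stemmed query terms would then be compared against unstemmed index terms. A sketch of declaring the English analyzer at index time instead is below. The index name man_english and the keyword types for section and page are assumptions, and the pages would need to be loaded again into the new index.

```python
def english_content_mapping() -> dict:
    """Mapping that analyzes page content with the built-in english analyzer."""
    return {
        "properties": {
            # Exact-match fields; no text analysis needed for these.
            "section": {"type": "keyword"},
            "page": {"type": "keyword"},
            # Stemming and stop-word removal applied when indexing, and
            # by default also when searching this field.
            "content": {"type": "text", "analyzer": "english"},
        }
    }


def create_english_index(es, index: str = "man_english") -> None:
    """Create the index with the explicit mapping before loading any pages."""
    es.indices.create(index=index, mappings=english_content_mapping())
```

With a mapping like this, a plain match query on content should pick up the English analyzer without needing the analyzer option in the request.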

Since the results are a giant JSON document, let’s use a little jq to simplify the results:

$ jq '.hits.hits[]._source' en_analyzer.json
{
  "section": "man1",
  "page": "passwd.1"
}
{
  "section": "man5",
  "page": "shadow.5"
}
{
  "section": "man8",
  "page": "pam_pwquality.8"
}
{
  "section": "man8",
  "page": "pam_unix.8"
}
{
  "section": "man8",
  "page": "pam_extrausers.8"
}
{
  "section": "man1",
  "page": "apg.1"
}
{
  "section": "man1",
  "page": "systemd-ask-password.1"
}
{
  "section": "man1",
  "page": "systemd-tty-ask-password-agent.1"
}
{
  "section": "man5",
  "page": "pwquality.conf.5"
}
{
  "section": "man8",
  "page": "systemd-ask-password-console.service.8"
}

pwquality.conf is the 9th result, but it’s not as bad as it looks – pam_pwquality is the 3rd result! This is the PAM module rather than the actual configuration, but it will explain a lot of the options and the “see also” section will send the user to the correct place:

SEE ALSO
pwscore(1), pwquality.conf(5), pam_pwquality(8), pam.conf(5), PAM(8)

This still doesn’t beat SQLite’s 2nd place for the correct page, but there’s a bit more to look at in the results. For the sake of simplicity, I’m going to put the page names for the top results side-by-side:

Rank	ElasticSearch	SQLite Full-Text Search
1	passwd.1	getpass.3
2	shadow.5	pwquality.conf.5
3	pam_pwquality.8	putspent.3
4	pam_unix.8	endspent.3
5	pam_extrausers.8	lckpwdf.3
6	apg.1	ulckpwdf.3
7	systemd-ask-password.1	fgetspent.3
8	systemd-tty-ask-password-agent.1	getspent.3
9	pwquality.conf.5	getspnam.3
10	systemd-ask-password-console.service.8	setspent.3

Top 10 results: ElasticSearch English analyzer vs. SQLite FTS

Reading down the ElasticSearch column, we get passwd (the command for changing passwords), shadow (where passwords are stored on modern Linux distributions), the PAM module for password requirements (with its config file further down), other PAM modules, a password-generating command, and a few ways systemd can prompt the user for a password. In the SQLite results, the password requirements config is second, but every other top 10 result is a C library function from section 3 of the manual. So, while the page we wanted ranked lower on ElasticSearch, the results overall were much closer to what we’d want.
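
The comparison above can be put into numbers with a few lines of Python. This is only a sketch over the two top 10 lists from the table; the helper names are made up for illustration, and the manual section is read from the suffix of each page name.

```python
from collections import Counter

# Top 10 page names copied from the table above.
ES_TOP10 = [
    "passwd.1", "shadow.5", "pam_pwquality.8", "pam_unix.8",
    "pam_extrausers.8", "apg.1", "systemd-ask-password.1",
    "systemd-tty-ask-password-agent.1", "pwquality.conf.5",
    "systemd-ask-password-console.service.8",
]
SQLITE_TOP10 = [
    "getpass.3", "pwquality.conf.5", "putspent.3", "endspent.3",
    "lckpwdf.3", "ulckpwdf.3", "fgetspent.3", "getspent.3",
    "getspnam.3", "setspent.3",
]


def rank_of(page, results):
    """1-based rank of page in results, or None if it is absent."""
    return results.index(page) + 1 if page in results else None


def section_counts(results):
    """Count results per manual section, taken from the name suffix."""
    return Counter(name.rsplit(".", 1)[1] for name in results)


print(rank_of("pwquality.conf.5", ES_TOP10))       # 9
print(rank_of("pwquality.conf.5", SQLITE_TOP10))   # 2
print(sorted(set(ES_TOP10) & set(SQLITE_TOP10)))   # ['pwquality.conf.5']
print(dict(section_counts(ES_TOP10)))
print(dict(section_counts(SQLITE_TOP10)))
```

The two lists share only the config page itself, and nine of the ten SQLite results come from section 3, while the ElasticSearch results are spread across sections 1, 5, and 8.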

That’s going to be it for this time; keep an eye out since sometime in the future I intend to try this with a vector database.