Searching Linux “man” pages

Linux distributions come with a built-in documentation function through what are called “man” (manual) pages. However, reading the manual generally requires knowing the name of the program or function you’re working with. So, let’s see if we can do a little better.

The example I’m going to run with for this mini-series of posts is password requirements. I’m picking this example because it’s built-in, involves a complex part of the Linux ecosystem, and isn’t readily findable in the manual using common terms. We’re also going to assume that you don’t have access to a search engine, or that searching the manual locally would simply be faster.

Specifically, password requirements on Linux (assuming the system is managing its own passwords) involve an area called “Pluggable Authentication Modules” (or PAM for short). PAM controls many aspects of logins, including things like two-factor authentication, but one of its functions is changing passwords. The module handling the policy aspect is pam_pwquality, and its settings live in a file called pwquality.conf (“password quality configuration”). To access the manual page for that file, you’d run the man pwquality.conf command.

Now, let’s take a look at why this isn’t easy to find (beyond needing to know the name to access the manual). Here are some of the word counts in the man pwquality.conf page, using common words associated with password complexity:

Word                       Count
“password”                 23
“requirement”              1
“complex”                  0
“policy”                   0
“strong” and “strength”    0

Word count for common terms related to password requirements

Now, why do these word counts matter? Because apart from a single use of “requirement”, none of the other common terms appear. If you’re looking at organizational password policies, or coming from the Microsoft Active Directory world, you’re not going to find it easily in the manual; meanwhile, if you search for the Active Directory equivalent, you’ll get a page called “Password must meet complexity requirements.”

Conversely, the manual page has the word “quality” 9 times.
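
If you want to reproduce counts like these yourself, here is a minimal sketch. It assumes case-insensitive substring matching (so “requirement” also matches “requirements”), which may not be exactly how the numbers above were produced. The sample text is made up; for a real page you would pass in the page’s text, and counting against the raw nroff source may give slightly different numbers than counting against the rendered page.

```python
import re

def count_terms(text, terms):
    """Count case-insensitive substring occurrences of each term in text."""
    lowered = text.lower()
    return {
        term: len(re.findall(re.escape(term.lower()), lowered))
        for term in terms
    }

# Hypothetical sample text standing in for a manual page
sample = "Password quality checks. The password must satisfy the quality requirement."
counts = count_terms(sample, ["password", "requirement", "complex", "quality"])
print(counts)
```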

So, to try to make this better, let’s first figure out how the manual pages are stored. If we run man man it will tell us:

Manual pages are normally stored in nroff(1) format under a directory such as /usr/share/man. In some installations, there may also be preformatted cat pages to improve performance. See man‐path(5) for details of where these files are stored.

(And, for a touch of irony, man man-path gives “No manual entry for man-path”)

Editorial note: while the output I was given said the page was man-path(5), the correct name is “manpath” (e.g. man manpath). The manpath command returns all of the locations the system has been configured to search, similar to the PATH environment variable, and may include locations other than /usr/share/man.
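
As a rough sketch of how a script could pick up those locations instead of hard-coding one, the snippet below runs the manpath command and splits its colon-separated output. The helper names and the /usr/share/man fallback are my own assumptions for illustration; the rest of this post sticks to a single directory.

```python
import shutil
import subprocess

def split_manpath(output):
    """Split colon-separated manpath output into a list of directories."""
    return [entry for entry in output.strip().split(":") if entry]

def get_manpath_dirs(default="/usr/share/man"):
    """Return the configured man directories, or a default if manpath is unavailable."""
    if shutil.which("manpath") is None:
        return [default]
    result = subprocess.run(["manpath"], capture_output=True, text=True, check=False)
    # Fall back to the default if the command printed nothing usable
    return split_manpath(result.stdout) or [default]

print(get_manpath_dirs())
```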

However, there’s also one other important piece of information we’ll need in a minute – that the manual has sections:

1 Executable programs or shell commands
2 System calls (functions provided by the kernel)
3 Library calls (functions within program libraries)
4 Special files (usually found in /dev)
5 File formats and conventions, e.g. /etc/passwd
6 Games
7 Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7), man-pages(7)
8 System administration commands (usually only for root)
9 Kernel routines [Non standard]

Manual page for man

So, let’s take a look at the filesystem:

$ ls /usr/share/man
cs  da  de  es  fi  fr  fr.ISO8859-1  fr.UTF-8  hu  id  it  ja  ko  man1  man2  man3  man4  man5  man6  man7  man8  nl  pl  pt  pt_BR  ro  ru  sl  sr  sv  tr  uk  zh_CN  zh_TW

Ok, so we can see a lot of localization directories and then some directories matching man* – since there’s no en directory (and Linux development is primarily in English), we can assume that the man directories are in English by default (and we’ll find this to be true in a bit).

Now, if we look in the directories, we’ll find that each file is named for its page, followed by the section, followed by a .gz extension (the pages are gzipped):

$ ls /usr/share/man/man1
'[.1.gz'                                   docker-service-scale.1.gz           gnome-extensions.1.gz

The first file name is a bit odd, but it’s the page for the [ command (a synonym for test), and since it’s just a string we can handle it fine. However, as we start digging deeper we’ll notice that many files carry longer section suffixes:

openssl-rsa.1ssl.gz
Git.3pm.gz
rwarray.3am.gz

The easiest way to deal with this will be simply including it in the page name. So, pulling all of this together, here’s what some Python code looks like:

Python
import gzip
import os
import sqlite3

# Configuration
manpath = "/usr/share/man/"

# Get list of (English) sections
def get_sections():
	# Array for returning
	sections = []

	# Loop through man sections
	for file in os.listdir(manpath):
		filename = os.fsdecode(file)

		# Use English files only
		if filename.startswith("man"):
			sections.append(filename)

	return sections

# Get pages in an individual section
def get_section_pages(section):
	# Array for returning
	pages = []

	# Path of man section
	path = os.path.join(manpath, section)

	for file in os.listdir(path):
		filename = os.fsdecode(file)

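		# On Ubuntu all but a couple misc files are gzipped
		# Ignore files that aren't .gz
		if filename.endswith(".gz"):
			# Keep the section suffix in the page name (e.g. "ls.1",
			# "openssl-rsa.1ssl") so suffixes like 1ssl or 3pm need no special handling
			pagename = filename.removesuffix('.gz')
			pages.append(pagename)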
	
	return pages

def get_page_contents(section, page):
	# Page names keep their section suffix (e.g. "ls.1"),
	# so the filename is just the page name plus ".gz"
	filename = page + ".gz"
	path = os.path.join(manpath, section, filename)

	# Decompress and decode; replace any bytes that aren't valid UTF-8
	# instead of failing on pages stored in older encodings
	with open(path, 'rb') as file:
		content = gzip.decompress(file.read()).decode(errors="replace")
		return content

data = []

# Loop through all sections and get pages
for section in get_sections():
	# Loop through pages & add content
	for page in get_section_pages(section):
		content = get_page_contents(section, page)
		data.append((section, page, content))
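
As an aside, if you would rather store the page name and its section suffix separately, the split could be sketched like this. This is not what the code above does (it keeps the suffix in the page name), and split_page_filename is a hypothetical helper. It assumes every file name has the form name.section, optionally followed by .gz.

```python
def split_page_filename(filename):
    """Split 'name.section.gz' into (name, section); the .gz suffix is optional."""
    stem = filename.removesuffix(".gz")
    # Split on the last dot so names containing dots (pwquality.conf) stay intact
    name, _, section = stem.rpartition(".")
    return name, section

for example in ["pwquality.conf.5.gz", "openssl-rsa.1ssl.gz", "[.1.gz"]:
    print(split_page_filename(example))
```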

Now, let’s load all of this into a format that’s easy to work with. I’m going to use a SQLite database for a couple reasons:

  1. It’s a well-supported format
  2. It allows easy text matching
  3. To take this a little further, it supports full-text search

So, here’s some Python to load all of that into SQLite:

Python
# Configuration vars
con = sqlite3.connect("man.db")
cur = con.cursor()

# Create data table
con.execute("CREATE TABLE manpages(section text, page text, content text)")
# Create full-text search table
# (has different indexing that's focused on search)
con.execute("CREATE VIRTUAL TABLE man_fts USING FTS5(section, page, content)")

# Add the data
cur.executemany("INSERT INTO manpages VALUES(?, ?, ?)", data)
cur.executemany("INSERT INTO man_fts VALUES(?, ?, ?)", data)
con.commit()

Now that that’s in SQL, let’s try the simplest sort of searching: matching specific words.

Let’s again assume that I’m searching based on a set of organizational standards and I’m not familiar with PAM. The first search I might try is “password” and “complexity”:

SQL
select * from manpages where content like '%password%' and content like '%complexity%'
"password" "complexity" word matching results

Uh oh, none of that is what we want. passwd is close since that’s the command for changing passwords, but we’ll find no mention of pwquality.conf in there. Stemming the word “complexity” to “complex” will give more results but still not what we’re looking for.

If we change the query to look for “requirements” instead of “complexity”, we’ll get the correct file, but it’s the 51st result. If we exclude section 3, it will become the 11th result.
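
For reference, here is a sketch of running that kind of word-matching query from Python with bound parameters, including the “exclude section 3” variation. The rows are made up so that the example runs on its own, like_search is a hypothetical helper, and it assumes at least one search term is given. Section values follow the directory names used earlier (man3, man5, and so on).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE manpages(section text, page text, content text)")
# Made-up rows standing in for real manual pages
con.executemany("INSERT INTO manpages VALUES(?, ?, ?)", [
    ("man3", "getpass.3", "read a password; no special requirements"),
    ("man5", "pwquality.conf.5", "password quality requirements configuration"),
    ("man1", "ls.1", "list directory contents"),
])

def like_search(con, terms, exclude_sections=()):
    """Return (section, page) rows whose content contains every term."""
    where = " AND ".join("content LIKE ?" for _ in terms)
    params = [f"%{term}%" for term in terms]
    if exclude_sections:
        placeholders = ", ".join("?" for _ in exclude_sections)
        where += f" AND section NOT IN ({placeholders})"
        params += list(exclude_sections)
    sql = f"SELECT section, page FROM manpages WHERE {where} ORDER BY rowid"
    return con.execute(sql, params).fetchall()

all_matches = like_search(con, ["password", "requirements"])
without_section_3 = like_search(con, ["password", "requirements"], exclude_sections=["man3"])
print(all_matches)
print(without_section_3)
```

Note that LIKE has no notion of relevance, so the order of results here is simply insertion order.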

However, we should be able to do better. Let’s look at the Full-Text Search (FTS) table we created previously. The full-text search approach here is based on a TIL (“Today I Learned”) post by Simon Willison.

SQL
select * from man_fts where man_fts match 'password requirements' order by man_fts.rank
Query results for full-text search for "password requirements"

Second result, that’s a massive improvement!
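
To make it a bit clearer what the FTS query is doing, here is a small self-contained sketch against an in-memory table with made-up rows. It assumes your Python’s SQLite build includes FTS5. Two things worth knowing: space-separated terms in an FTS5 query are implicitly ANDed, and ordering by rank sorts by the built-in BM25 relevance score by default. The snippet() call is an FTS5 auxiliary function that returns a short excerpt with the matches marked.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE man_fts USING FTS5(section, page, content)")
# Made-up rows standing in for real manual pages
con.executemany("INSERT INTO man_fts VALUES(?, ?, ?)", [
    ("man5", "pwquality.conf.5", "configuration for password quality requirements"),
    ("man1", "passwd.1", "change user password"),
    ("man1", "ls.1", "list directory contents"),
])

# Column index 2 is "content"; matches are wrapped in [ ], with up to 8 tokens shown
rows = con.execute(
    "SELECT page, snippet(man_fts, 2, '[', ']', '...', 8) "
    "FROM man_fts WHERE man_fts MATCH ? ORDER BY rank",
    ("password requirements",),
).fetchall()

for page, excerpt in rows:
    print(page, "|", excerpt)
```

Only the pwquality.conf.5 row matches here, because passwd.1 contains “password” but not “requirements”.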

For convenience, here’s how we can turn this into a Bash function (for example, to include in a .bashrc file):

Bash
# Creates a command called "manfts" that searches our manual database
function manfts {
        # (Also, yes, this is vulnerable to a SQL injection and you shouldn't do this in production)
        sqlite3 ~/man.db "select page from man_fts where man_fts match '$*' order by man_fts.rank limit 5"
}
$ manfts "password requirements"
getpass.3
pwquality.conf.5
putspent.3
endspent.3
lckpwdf.3

For this week, this is as far as I got. However, we’ll pick this up again to see how more powerful and feature-rich databases fare, starting with Elasticsearch.