Related Posts in WordPress with Vector Embedding
Mar 21, 2024 · How To

Since my 10-year-old related posts plugin can’t even be downloaded anymore because of a security vulnerability, I figure it’s time to bring related posts into the ✨ AI era ✨ with vector embedding. Surprisingly, I didn’t find any WordPress plugins that do that, so – inspired by TomBot2000, who did this for a static site – I’m going to do it for WordPress.

What Is Vector Embedding?

In contrast with a lot of old-style related posts plugins, which compare similarity at the word level, vector embedding uses neural networks to plot a piece of text in semantic space – meaning you’re going to get much more meaningful recommendations.

An embedding model, like a GPT, is trained on a large corpus of text. But unlike a GPT, which then predicts text, an embedding model just plots the input in some n-dimensional space of meaning. For example, “cat” and “feline” would occupy nearly the same point; “cat” and “dog” would sit somewhat nearby; but “cat” and – say – “vestibule” would be very far apart. These are the sorts of distinctions that an old-style word-level plugin isn’t able to make.

The result of generating an embedding is a point in n-dimensional semantic space: a list of n numbers, each between -1 and 1, that locates the meaning of your text. The nice part is that once you have the embeddings, you can calculate the semantic distance between two pieces of text very easily, as the distance between two points in that n-dimensional space[1] – and this is how Twitter’s “See similar posts” button works so well and so quickly.
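To make that concrete, here’s a toy sketch in Python. The three-dimensional vectors are made up for illustration (real embedding models use hundreds or thousands of dimensions), but the similarity calculation is the real thing:

```python
import math

def cosine_similarity(a, b):
	#Cosine of the angle between two vectors: 1 = same direction, ~0 = unrelated
	dot = sum(x * y for x, y in zip(a, b))
	norm_a = math.sqrt(sum(x * x for x in a))
	norm_b = math.sqrt(sum(x * x for x in b))
	return dot / (norm_a * norm_b)

#Toy 3-dimensional "embeddings"
cat       = [0.90, 0.80, 0.10]
feline    = [0.88, 0.82, 0.12]
vestibule = [0.10, 0.20, 0.95]

print(cosine_similarity(cat, feline))     #~0.9996 - nearly the same point
print(cosine_similarity(cat, vestibule))  #~0.29 - far apart
```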

Generating Vector Embeddings For New Posts

To make this work for us, we’ll have to:

  1. Write a function to retrieve embeddings from the OpenAI API
  2. Hook it into WordPress to execute every time we publish or update a post
  3. Cache the resulting embedding
  4. Calculate distances to other posts and display the top three
  5. Go back and generate embeddings for existing posts (next section)

In principle we could do this all with PHP. However, OpenAI has a limit on input length in terms of tokens, so we’ll have to check the token length of our posts. There are compatible PHP tokenizers, but they require Composer, and frankly,

Also, there’s a very nice OpenAI Python library that makes querying the API very easy (no futzing around with curl). So what I’ll do is call a Python script from PHP. A little annoying, but not as annoying as Composer and curl.

Get WordPress Ready

First of all, you’ll need an OpenAI platform account and a little credit. Vector embeddings are pretty cheap to generate – I was able to embed my entire 103 posts (223,000 words) twice, because I screwed up the SQL the first time, and on the more expensive large model, for 7¢ total. If you have substantially more content, there’s also a small model you can use for another order of magnitude less money.
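If you want to estimate the cost for your own archive first, here’s a back-of-the-envelope calculation – assuming the prices as of this writing ($0.13 per million tokens for text-embedding-3-large, $0.02 for -small) and the usual rule of thumb of roughly 4 tokens per 3 words of English:

```python
words = 223_000
tokens = words * 4 / 3  #~297k tokens
runs = 2                #Embedded everything twice
cost_large = tokens * runs * 0.13 / 1_000_000
cost_small = tokens * runs * 0.02 / 1_000_000
print(f'${cost_large:.3f}')  #$0.077 - in line with the 7 cents above
print(f'${cost_small:.3f}')  #$0.012
```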

Second, you’ll need to generate a secret key to authorize your use of the API. You can do this in the sidebar of OpenAI’s platform page. Be sure to copy it down.

Third, we’ll add a setting to WordPress for the API key so we can avoid hardcoding it. Add the following to your theme’s functions.php:

//For entering the OpenAI key in the settings page
add_action('admin_init', function() {
	add_settings_section('openai', 'OpenAI API', function() {}, 'reading');
	
	$var = 'openaikey';
	register_setting('reading', $var);
	add_settings_field($var, 'OpenAI API Key', function() use ($var) {
		$default = esc_attr(get_option($var));
		echo "<input type=\"text\" value=\"{$default}\"
			class=\"regular-text ltr\" name=\"{$var}\" id=\"{$var}\" />";
	}, 'reading', 'openai');
});

This adds an ‘OpenAI’ section to the Reading Settings page in the WordPress admin, and adds a text field where we can enter our API key. We can get this later with get_option('openaikey').

OpenAI Settings

Get the Embeddings with Python

There are two Python packages that’ll be useful for us: openai – which lets us bypass all the manual curl querying – and tiktoken – which lets us make sure we don’t run over the input token limit. So the first thing we’ll have to do is SSH into the webserver[2] and install them:

pip install tiktoken
pip install openai

Next we’ll write a Python script to which we can pass the API key and the text content. We’ll call it embeddings.py and put it in the theme folder.

import os, json, argparse, sys
import tiktoken, openai

parser = argparse.ArgumentParser(description='Submit post content to OpenAI to generate embeddings.')
parser.add_argument('--key', type=str, help='The OpenAI private key.')
args = parser.parse_args()
os.environ['OPENAI_API_KEY'] = args.key

These argparse lines let us pass the key in as an argument by calling embeddings.py --key your_openai_key, which is what we’ll do from our PHP script. We’ll then set it as an environment variable that the openai library will access.

The post content, unfortunately, is too long to pass as a command line argument, so we’ll have PHP pass it in with stdin. We’ll also go ahead and strip out line breaks here.

content = sys.stdin.read().replace("\n", ' ') #Strip line breaks

Next we’ll pick a model and tokenize the content to make sure we’re under the 8,192-token input limit. You can set model='text-embedding-3-small' if you have a lot of content and need it to be very, very cheap (62,500 pages/$ vs. 9,615 pages/$), but as I said, the large model only cost me 7¢ to run twice over all my content.

model='text-embedding-3-large'
ttk = tiktoken.encoding_for_model(model)
tokens = ttk.encode(content)

def remove_stopwords(text):
	for word in ['a', 'an', 'the', 'as', 'be', 'are', 'is', 'were', 'that',
		 'then', 'there', 'these', 'to', 'have', 'has', 'by', 'for', 'with'
	]:
		text = text.replace(f' {word} ', ' ')
	return text

#Shorten if necessary
if len(tokens)>8191:
	content = remove_stopwords(content)
	tokens = ttk.encode(content)
if len(tokens)>8191:
	del tokens[8191:]
	content = ttk.decode(tokens)

This converts the content into a list of integer tokens so you can tell how close you are to the limit. In this example, I remove semantically unimportant words from any content that comes in over the token limit, and then if it’s still over the limit, just truncate it and reconstitute the text. My long papers are on the order of 11,000-12,000 tokens, so if your posts are all short you can skip everything except the first line, noting that the API will reject anything over the limit.
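To see the shortening logic in isolation – with a stand-in whitespace “tokenizer” instead of tiktoken, so this runs without the library – the control flow is:

```python
LIMIT = 8191

def encode(text):
	#Stand-in for ttk.encode: one "token" per whitespace-separated word
	return text.split()

def decode(tokens):
	#Stand-in for ttk.decode
	return ' '.join(tokens)

def shorten(content, stopword_filter=lambda t: t):
	tokens = encode(content)
	if len(tokens) > LIMIT:  #First pass: drop low-content words
		content = stopword_filter(content)
		tokens = encode(content)
	if len(tokens) > LIMIT:  #Still too long: hard-truncate and reconstitute
		tokens = tokens[:LIMIT]
		content = decode(tokens)
	return content

short = shorten('a short post')
long_text = shorten('word ' * 10_000)
print(short)                  #Unchanged: 'a short post'
print(len(encode(long_text))) #8191
```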

Finally, we query OpenAI and print the embeddings, which we’ll pull into our PHP file.

client = openai.OpenAI()
embedding = client.embeddings.create(input=[content], model=model).data[0].embedding
print(json.dumps(embedding))

This returns a 3,072-dimensional vector and outputs it as a JSON array. Computing 3,072-dimensional cosine-distances on the fly is computationally cheap enough that it doesn’t increase page load significantly for 100 posts, but if you’re worried about storage space or have lots of posts, you can reduce the number of dimensions by passing a dimensions parameter to embeddings.create.
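If you’d rather shrink embeddings you’ve already stored than re-query the API, OpenAI’s docs note that you can truncate a v3 embedding and re-normalize it back to unit length – roughly what the dimensions parameter does server-side. A sketch, with a made-up 6-dimensional vector standing in for a real 3,072-dimensional one:

```python
import math

def shrink(embedding, dims):
	#Truncate an embedding, then rescale it back to unit length
	truncated = embedding[:dims]
	norm = math.sqrt(sum(x * x for x in truncated))
	return [x / norm for x in truncated]

vec = [0.5, 0.5, 0.5, 0.3, 0.3, 0.2]  #Made-up stand-in embedding
small = shrink(vec, 3)
print(len(small))                                    #3
print(math.isclose(sum(x * x for x in small), 1.0))  #True: unit length again
```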

Storing the Embedding

This function, which also goes in functions.php, hooks to save_post_post, which fires every time we save or publish a post. We’ll limit it to posts (not pages), and skip revisions and anything unpublished.

First we generate the content by concatenating the title and the content and stripping tags. Then we call the Python script with the API key as an argument using proc_open, and pipe the content into stdin. If all that succeeds, we take the embedding printed by the Python script, and store it in the post’s meta with the _embedding key.

add_action('save_post_post', function($post_id, $post=null, $update=false) {
	$post = get_post($post_id);
	if (wp_is_post_revision($post_id) || $post->post_status != 'publish') return;
	$content = get_the_title($post_id).' '.strip_tags(get_the_content($post_id));
	
	$descriptorspec = [['pipe','r'], ['pipe','w'], ['pipe','w']]; //stdin, stdout, stderr
	$py = proc_open(
		'python3 '.__DIR__.'/embeddings.py --key '.escapeshellarg(get_option('openaikey')),
		$descriptorspec, $pipes
	);
	if (is_resource($py)) {
		fwrite($pipes[0], $content);
		fclose($pipes[0]);
		$embedding = stream_get_contents($pipes[1]);
		fclose($pipes[1]);

		$error = stream_get_contents($pipes[2]);
		fclose($pipes[2]);
		if (!proc_close($py)) //Returns -1 on error, 0 on success
			update_post_meta($post_id, '_embedding', $embedding);
	}
});

Calculating Related Posts

Generating the embeddings themselves is the computationally hard work. Once we’ve located our posts in semantic space, we just need to calculate the cosine similarity between them. To do that, we add one more function to functions.php:

function cos_sim($a, $b) {
	$n = 0; $d1 = 0; $d2 = 0;
	foreach (array_map(null, $a, $b) as $i) {
		$n += $i[0] * $i[1];
		$d1 += $i[0]**2;
		$d2 += $i[1]**2;
	}
	return $n/(sqrt($d1)*sqrt($d2));
}

With this, we can calculate the related posts on any single.php page:

<?php $embeddings = $wpdb->get_results(
	"SELECT post_id, meta_value
	FROM {$wpdb->prefix}postmeta
	WHERE meta_key='_embedding'",
	OBJECT_K
);
$own = isset($embeddings[get_the_ID()])
	? json_decode($embeddings[get_the_ID()]->meta_value) : null;
if ($own) {
	$similarities = [];
	foreach ($embeddings as $id => $e) {
		if ($id == get_the_ID()) continue;
		$embedding = json_decode($e->meta_value);
		$similarities[$id] = cos_sim($own, $embedding);
	}
	arsort($similarities); ?>

	<h3>Articles Similar To This One</h3>
	<ul><?php foreach (array_keys(array_slice($similarities, 0, 3, true)) as $item) { ?>
		<li>
			<?php $related = get_post($item);
			//Style your related post here ?>
		</li>
	<?php } ?></ul>
<?php } //if ($own) ?>

Essentially, we query all the embeddings from the post meta, pull out the current page’s, calculate the cosine similarity to every page other than itself, and generate related-posts markup for the top three. Again, with 100 posts this isn’t intensive enough to noticeably affect page load time, but if you have a lot of posts, you can reduce the dimensionality or implement caching.
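For reference, the same selection logic sketched in Python, with hypothetical post IDs and 2-dimensional stand-in vectors:

```python
import math

def cos_sim(a, b):
	#Same formula as the PHP cos_sim() above
	n = sum(x * y for x, y in zip(a, b))
	return n / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

#Hypothetical cached embeddings keyed by post ID (real ones are 3,072-d)
embeddings = {
	1: [0.9, 0.1],
	2: [0.85, 0.2],
	3: [0.1, 0.9],
	4: [0.7, 0.4],
	5: [0.0, 1.0],
}

current = 1
sims = {pid: cos_sim(embeddings[current], e)
	for pid, e in embeddings.items() if pid != current}
top3 = sorted(sims, key=sims.get, reverse=True)[:3]
print(top3)  #[2, 4, 3] - the posts pointing in nearly the same direction win
```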

And voilà, a script to automatically generate embeddings and search for related posts every time you publish or update a page. See it in action by scrolling down!

Generating Previous Embeddings

With the above, you could in principle just go into every post and hit “Update” to generate embeddings. But that’s tedious, and shouldn’t we be able to do it all at once?

In fact, I started with this step and then worked backwards to everything above. In the interest of keeping things simple – since it’s a one-time, self-contained script – I’ll write the whole thing in Python. It’ll use the same two libraries as before (which you’ll need to install on your local machine now, if you’re running it there), plus cymysql, and will reuse much of the same code. Be sure, of course, to modify the table prefix as appropriate if it’s anything other than wp_.

The connection information will be the same as in your WordPress install’s wp-config.php file, and besides that, the only thing you’ll need to add is your OpenAI API key.

import os, re, json
import cymysql, tiktoken, openai

conn=cymysql.connect(host='mysql.website.com', user='user', passwd='pw',db='wpdb')
sql=conn.cursor()
sql.execute('''SELECT ID, post_content, post_title
	FROM wp_posts
	WHERE post_type="post" AND post_status="publish"'''
)

os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_KEY'
client = openai.OpenAI()

def remove_stopwords(text):
	for word in ['a', 'an', 'the', 'as', 'be', 'are', 'is', 'were', 'that',
		'then', 'there', 'these', 'to', 'have', 'has', 'by', 'for', 'with'
	]:
		text = text.replace(f' {word} ', ' ')
	return text

model='text-embedding-3-large'
ttk = tiktoken.encoding_for_model(model) #Uses cl100k_base
for post in sql.fetchall():
	#Strip HTML tags and line breaks
	content = re.sub('<[^<]+?>', '', f'{post[2]} {post[1]}'.replace("\n", ' '))
	tokens = ttk.encode(content)

	#Shorten if necessary
	if len(tokens)>8191:
		content = remove_stopwords(content)
		tokens = ttk.encode(content)
	if len(tokens)>8191:
		del tokens[8191:]
		content = ttk.decode(tokens)

	embedding = client.embeddings.create(input=[content], model=model).data[0].embedding
	sql.execute('''INSERT INTO wp_postmeta (post_id, meta_key, meta_value)
		VALUES (%s, %s, %s)''',
		(post[0], '_embedding', json.dumps(embedding))
	)
	print('Embedded post:', post[2])
conn.commit()

Run this once to fill in the embeddings on your old posts, and there’s your AI-powered related posts system!

Footnotes

  1. Strictly speaking, ‘distance’ corresponds to Euclidean distance, which is somewhat more computationally intensive for high-dimensional and sparse vectors. Cosine similarity isn’t exactly the same, but it can be interpreted similarly as semantic closeness.
  2. Generally the credentials will be the same as for your FTP client. The command will be ssh username@host.
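On footnote 1: since OpenAI’s embeddings come back normalized to unit length, squared Euclidean distance and cosine similarity are directly related – ‖a − b‖² = 2 − 2·cos(a, b) – so ranking by one gives the same order as ranking by the other. A quick numerical check on made-up unit vectors:

```python
import math

def unit(v):
	#Normalize a vector to unit length
	m = math.sqrt(sum(x * x for x in v))
	return [x / m for x in v]

a, b = unit([3.0, 4.0]), unit([5.0, 12.0])
cos = sum(x * y for x, y in zip(a, b))
euclid_sq = sum((x - y) ** 2 for x, y in zip(a, b))
print(math.isclose(euclid_sq, 2 - 2 * cos))  #True
```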


2 Comments

    Colin

    Mar 22, 2024 at 8:01

    I might’ve missed something, but is OpenAI needed here? Maybe for computing costs/time concerns? Could you just use word2vec and some other embedding package that wouldn’t require a subscription and payment?

    Is it to leverage their corpus? I’m more familiar with embedding within a corpus so we’d get similarities within your posts but the idea is to get similarities to general language given relatively few posts?

      Cameron Harwick

      Mar 22, 2024 at 9:03

      Mainly I wanted to try it on as many dimensions as possible at first (300 for Word2vec vs 3000 for OAI’s), but they say you can get good results from 256 dimensions, so I’m sure it doesn’t matter too much.
