Since my 10-year-old related posts plugin can't even be downloaded anymore because of a security vulnerability, I figure it's time to bring related posts into the ✨ AI era ✨ with vector embeddings. Surprisingly, I didn't find any WordPress plugins that do this, so – inspired by TomBot2000, who did this for a static site – I'm going to do it for WordPress.
By contrast with a lot of old-style related posts plugins, which compare similarity at the word level, vector embedding uses neural networks to plot a piece of text in semantic space, meaning you’re going to get much more meaningful recommendations.
An embedding model, like a GPT, is trained on a large corpus of text. But unlike GPTs, which then go on to predict text, an embedding model just plots the input in some n-dimensional space of meaning. For example, "cat" and "feline" would occupy nearly the same point; "cat" and "dog" would sit somewhat nearby; but "cat" and – say – "vestibule" would be very far apart. These are the sorts of distinctions that an old-style word-level plugin isn't able to make.
The result of generating an embedding is a point in n-dimensional semantic space, a list of n numbers normalized between -1 and 1 that locate the meaning of your text. The nice part is that once you have the embeddings, you can calculate semantic distance between two pieces of text very easily as the distance between two points in that n-dimensional space – and this is how Twitter's "See similar posts" button works so well and so quickly.
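To make the idea concrete, here's a toy sketch in Python. The three-dimensional vectors and their values are made up purely for illustration (real embeddings have thousands of dimensions), but the cosine-similarity calculation is the same one we'll use later on:

#Toy illustration: cosine similarity between made-up "embedding" vectors
def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

cat       = [0.9, 0.1, 0.2]  #hypothetical point for "cat"
feline    = [0.8, 0.2, 0.2]  #points in nearly the same direction
vestibule = [0.1, 0.9, 0.1]  #points somewhere else entirely

print(cos_sim(cat, feline))     #~0.99: nearly the same meaning
print(cos_sim(cat, vestibule))  #~0.24: not very related

The closer the similarity is to 1, the closer the two texts are in meaning; unrelated texts land near 0.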
To make this work for us, we'll have to: (1) generate an embedding for each post through OpenAI's API, (2) store it in the post meta, and (3) calculate the cosine similarity between posts at display time to pick out the most closely related ones.
In principle we could do this all with PHP. However, OpenAI has a limit on input length in terms of tokens, so we'll have to check the token length of our posts. There are compatible PHP tokenizers, but they require Composer, and frankly, I'd rather not go down that road.
Also there’s a very nice OpenAI Python library that makes querying the API very easy (no futzing around with CURL). So what I’ll do is call a Python script from PHP. A little annoying, but not as annoying as Composer and CURL.
First of all, you'll need an OpenAI platform account and a little credit. Vector embeddings are pretty cheap to generate – I was able to do all 103 of my posts (223,000 words), twice because I screwed up the SQL the first time, and on the more expensive large model, for 7¢ total. If you have substantially more content, there's also a small model you can use for another order of magnitude less money.
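If you want to estimate your own cost before committing, the back-of-the-envelope math is simple. The figures below are assumptions, not exact numbers: roughly 1.3 tokens per English word, and roughly $0.13 per million tokens for the large embedding model (which is about what the pages-per-dollar figures later in this post work out to):

#Rough cost estimate for embedding a whole blog (assumed figures, not exact)
words = 223_000            #total words across all posts
tokens = words * 1.3       #~1.3 tokens per English word
price_per_million = 0.13   #assumed price for the large embedding model, in dollars
print(f"${tokens / 1_000_000 * price_per_million:.3f} per pass")

That works out to roughly 4¢ per pass, or 7–8¢ for two passes – in line with the figure above.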
Second, you’ll need to generate a secret key to authorize your use of the API. You can do this in the sidebar of OpenAI’s platform page. Be sure to copy it down.
Third, we'll add a WordPress setting for the API key so we don't have to hardcode it. Add the following to your theme's functions.php:
//For entering the OpenAI key in the settings page
add_action('admin_init', function() {
    add_settings_section('openai', 'OpenAI API', function() {}, 'reading');
    $var = 'openaikey';
    register_setting('reading', $var);
    add_settings_field($var, 'OpenAI API Key', function() use ($var) {
        $default = get_option($var);
        echo "<input type=\"text\" value=\"{$default}\"
            class=\"regular-text ltr\" name=\"{$var}\" id=\"{$var}\" />";
    }, 'reading', 'openai');
});
This adds an 'OpenAI' section to the Reading Settings page in the WordPress admin, and adds a text field where we can enter our API key. We can get this later with get_option('openaikey').
There are two Python packages that'll be useful for us: openai – which lets us bypass all the CURL querying – and tiktoken – which lets us make sure we don't run over the input token limit. So the first thing we'll have to do is SSH into the webserver (ssh username@host) and install them:
pip install tiktoken
pip install openai
Next we'll write a Python script to which we can pass the API key and the text content. We'll call it embeddings.py and put it in the theme folder.
import os, json, argparse, sys
import tiktoken, openai
parser = argparse.ArgumentParser(description='Submit post content to OpenAI to generate embeddings.')
parser.add_argument('--key', type=str, help='The OpenAI private key.')
args = parser.parse_args()
os.environ['OPENAI_API_KEY'] = args.key
These argparse lines let us pass the key in as an argument by calling embeddings.py --key your_openai_key, which is what we'll do from our PHP script. We'll then set it as an environment variable that the openai library will access.
The post content, unfortunately, is too long to pass as a command line argument, so we'll have PHP pass it in with stdin. We'll also go ahead and strip out line breaks here.
content = sys.stdin.read().replace("\n", ' ') #Strip line breaks
Next we'll initialize the OpenAI API and tokenize the content to make sure we're under the 8,192-token input limit. You can set model='text-embedding-3-small' if you have a lot of content and need it to be very, very cheap (62,500 pages/$ vs. 9,615 pages/$), but as I said, the large model only cost me 7¢ to run twice on all my content.
model = 'text-embedding-3-large'
ttk = tiktoken.encoding_for_model(model)
tokens = ttk.encode(content)

def remove_stopwords(text):
    for word in ['a', 'an', 'the', 'as', 'be', 'are', 'is', 'were', 'that',
                 'then', 'there', 'these', 'to', 'have', 'has', 'by', 'for', 'with']:
        text = text.replace(f' {word} ', ' ')
    return text

#Shorten if necessary
if len(tokens) > 8191:
    content = remove_stopwords(content)
    tokens = ttk.encode(content)
    if len(tokens) > 8191:
        del tokens[8191:]
        content = ttk.decode(tokens)
This converts the content into a list of integer tokens so you can tell how close you are to the limit. In this example, I remove semantically unimportant words from any content that comes in over the token limit, and then, if it's still over the limit, just truncate it and reconstitute the text. My long papers are on the order of 11,000–12,000 tokens, so if your posts are all short you can skip everything except the first line (setting the model), noting that the API will reject anything over the limit.
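If you just want to eyeball how many tokens a piece of text uses, a quick check looks like this (the sentence is made up; the encoding call is the same one used above):

#Quick token-count check on a made-up sentence
ttk = tiktoken.encoding_for_model('text-embedding-3-large')
tokens = ttk.encode('The cat sat in the vestibule.')
print(len(tokens))         #how many of the 8,192 allowed tokens this would use
print(ttk.decode(tokens))  #decoding reconstitutes the original text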
Finally, we query OpenAI and print the embeddings, which we’ll pull into our PHP file.
client = openai.OpenAI()
embedding = client.embeddings.create(input=[content], model=model).data[0].embedding
print(json.dumps(embedding))
This returns a 3,072-dimensional vector and outputs it as a JSON array. Computing 3,072-dimensional cosine distances on the fly is computationally cheap enough that it doesn't increase page load significantly for 100 posts, but if you're worried about storage space or have lots of posts, you can reduce the number of dimensions by passing a dimensions parameter to embeddings.create.
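For example, something like this – a sketch, where 256 is just an illustrative value; the API shortens the vector for you when you pass dimensions:

#Request a shorter embedding to save storage and computation (256 is illustrative)
embedding = client.embeddings.create(
    input=[content], model=model, dimensions=256
).data[0].embedding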
This function, which also goes in functions.php, hooks to save_post_post, which fires every time we save or publish a post. We'll limit it to posts (not pages), and skip revisions and anything unpublished.
First we generate the content by concatenating the title and the content and stripping tags. Then we call the Python script with the API key as an argument using proc_open, and pipe the content into stdin. If all that succeeds, we take the embedding printed by the Python script, and store it in the post's meta with the _embedding key.
add_action('save_post_post', function($post_id, $post=null, $update=false) {
    $post = get_post($post_id);
    if (wp_is_post_revision($post_id) || $post->post_status != 'publish') return;
    $content = get_the_title($post_id).' '.strip_tags(get_the_content(null, false, $post_id));
    $descriptorspec = [['pipe','r'], ['pipe','w'], ['pipe','w']]; //stdin, stdout, stderr
    $py = proc_open(
        'python3 '.__DIR__.'/embeddings.py --key '.escapeshellarg(get_option('openaikey')),
        $descriptorspec, $pipes
    );
    if (is_resource($py)) {
        fwrite($pipes[0], $content);
        fclose($pipes[0]);
        $embedding = stream_get_contents($pipes[1]);
        fclose($pipes[1]);
        $error = stream_get_contents($pipes[2]);
        fclose($pipes[2]);
        if (!proc_close($py)) //Returns -1 on error, 0 on success
            update_post_meta($post_id, '_embedding', $embedding);
    }
});
Generating the embeddings themselves is the computationally hard work. Once we've located our posts in semantic space, we just need to calculate the cosine distances between them. To do that, we add one more function to functions.php:
function cos_sim($a, $b) {
    $n = 0; $d1 = 0; $d2 = 0;
    foreach (array_map(null, $a, $b) as $i) {
        $n += $i[0] * $i[1];
        $d1 += $i[0]**2;
        $d2 += $i[1]**2;
    }
    return $n/(sqrt($d1)*sqrt($d2));
}
With this, we can calculate the related posts on any single.php page:
<?php $embeddings = $wpdb->get_results(
    "SELECT post_id, meta_value
     FROM {$wpdb->prefix}postmeta
     WHERE meta_key='_embedding'",
    OBJECT_K
);
$own = isset($embeddings[get_the_ID()]) ? json_decode($embeddings[get_the_ID()]->meta_value) : null;
if ($own) {
    $similarities = [];
    foreach ($embeddings as $id => $e) {
        if ($id == get_the_ID()) continue;
        $embedding = json_decode($e->meta_value);
        $similarities[$id] = cos_sim($own, $embedding);
    }
    arsort($similarities); ?>
    <h3>Articles Similar To This One</h3>
    <ul><?php foreach (array_keys(array_slice($similarities, 0, 3, true)) as $item) { ?>
        <li>
            <?php get_post($item);
            //Style your post here ?>
        </li>
    <?php } ?></ul>
<?php } //if ($own) ?>
Essentially, we query all the embeddings from the post meta, pull out the current page's, calculate the cosine similarity to every post other than itself, and generate related-posts markup for the top three. Again, with 100 posts this isn't intensive enough to noticeably affect page load time, but if you have a lot of posts, you can reduce the dimensionality or implement caching.
And voilà, a script to automatically generate embeddings and search for related posts every time you publish or update a page. See it in action by scrolling down!
The save_post_post hook above will generate an embedding every time a post is published or updated. Sometimes this is what we want, if there's a major content update. But for small edits, since it costs us money, we'd like an option to edit posts without regenerating the embedding.
To do this, we'll add a box to the bottom of the Post Settings sidebar in the editor, fire the save_post_post hook when it's checked, and return otherwise.
We'll add this box with the add_meta_box function, and attach some checkbox markup to it by referring to the embeddings_checkbox function.
<?php //Checkbox to avoid double generating embeddings unless we want to
add_action('add_meta_boxes', function() {
    add_meta_box(
        'gen_embeddings',
        'Embeddings',
        'embeddings_checkbox',
        'post', 'side', 'core'
    );
});

function embeddings_checkbox($post) {
    $value = get_post_meta($post->ID, '_embedding', true); ?>
    <label for="gen_embeddings">
        <input type="checkbox" id="gen_embeddings" name="gen_embeddings" <?php if (!$value) echo 'checked'; ?> />
        <?php echo $value ? 'Reg' : 'G'; ?>enerate embeddings
    </label>
<?php }
The second function checks whether the post has an existing embedding. If it doesn’t, it checks the box by default and prompts ‘Generate embeddings’. If it does, it unchecks the box, and gives you the option to ‘Regenerate embeddings’.
Then we only need to add one more line to the beginning of the save_post_post hook above, to abort the function if the checkbox isn't checked.
if (!isset($_POST['gen_embeddings']) || !$_POST['gen_embeddings']) return;
With the above, you could in principle just go into every post and hit “Update” to generate embeddings. But that’s tedious, and shouldn’t we be able to do it all at once?
In fact, I started with this step and then worked on everything previous. In the interest of keeping things simple, since it's a one-time, self-contained script, I'll write the whole thing in Python. It'll use the same two libraries as before (which you'll need to install on your local machine now, if you're running it there), plus cymysql, as well as much of the same code. Be sure, of course, to modify the table prefix as appropriate if it's anything other than wp_.
The connection information will be the same as in your WordPress install's wp-config.php file, and besides that the only thing you'll need to add is your OpenAI API key.
import os, re, json
import cymysql, tiktoken, openai

conn = cymysql.connect(host='mysql.website.com', user='user', passwd='pw', db='wpdb')
sql = conn.cursor()
sql.execute('''SELECT ID, post_content, post_title
    FROM wp_posts
    WHERE post_type="post" AND post_status="publish"''')

os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_KEY'
client = openai.OpenAI()

def remove_stopwords(text):
    for word in ['a', 'an', 'the', 'as', 'be', 'are', 'is', 'were', 'that',
                 'then', 'there', 'these', 'to', 'have', 'has', 'by', 'for', 'with']:
        text = text.replace(f' {word} ', ' ')
    return text

model = 'text-embedding-3-large'
ttk = tiktoken.encoding_for_model(model) #Uses cl100k_base

for post in sql.fetchall():
    #Strip HTML tags and line breaks
    content = re.sub('<[^<]+?>', '', f'{post[2]} {post[1]}'.replace("\n", ' '))
    tokens = ttk.encode(content)
    #Shorten if necessary
    if len(tokens) > 8191:
        content = remove_stopwords(content)
        tokens = ttk.encode(content)
        if len(tokens) > 8191:
            del tokens[8191:]
            content = ttk.decode(tokens)
    embedding = client.embeddings.create(input=[content], model=model).data[0].embedding
    sql.execute('''INSERT INTO wp_postmeta (post_id, meta_key, meta_value)
        VALUES (%s, %s, %s)''',
        (post[0], '_embedding', json.dumps(embedding)))
    print('Embedded post:', post[2])

conn.commit()
Run this once to fill in the embeddings on your old posts, and there’s your AI-powered related posts system!
Principal component analysis is a way to summarize as much of the variation in many-dimensional data as you can, using fewer dimensions. For example, a genome is many-dimensional, but since much of the variation in different genes is correlated, a good chunk of the total variation can be captured in just a few axes.
Conveniently, the vector embeddings we used for related posts locate each post at a point in 3,072-dimensional semantic space – just the sort of thing we can run a PCA on. This will find the two-dimensional plane in that 3,072-dimensional space that captures the most possible variation among the points corresponding to posts.
Below is an interactive PCA plot of all the posts on the website that you can filter by title. Color is category, size is length, and the location is the two semantic principal components. PCAs have no intrinsic meaning besides whatever two factors capture the most total variation, but since this is in semantic space, it looks like PC1 on the horizontal axis is something like Culture and Religion on the left to Macroeconomics and Crypto on the right, and PC2 on the vertical axis is something like Technical at the top to Popular at the bottom. Hover over each dot to see the post.
To generate the PCA, you’ll need Pandas and Scikit-Learn.
import cymysql, json, pandas as pd, os.path
from sklearn.decomposition import PCA

conn = cymysql.connect(host='...', user='...', passwd='...', db='...') #Copy from wp-config.php
sql = conn.cursor()
#This assumes one category per post. If you have more, just delete all but the first LEFT JOIN line or it'll screw it up.
sql.execute("""SELECT ID, post_title, post_name, post_date, meta_value AS embedding, name AS category, LENGTH(post_content) AS length
    FROM wp_posts
    LEFT JOIN wp_postmeta ON post_id=ID AND meta_key="_embedding"
    LEFT JOIN wp_term_relationships ON object_id=ID
    LEFT JOIN wp_term_taxonomy ON wp_term_taxonomy.term_taxonomy_id=wp_term_relationships.term_taxonomy_id AND taxonomy='category'
    LEFT JOIN wp_terms ON wp_terms.term_id=wp_term_taxonomy.term_id
    WHERE post_type="post" AND post_status='publish' AND meta_key IS NOT NULL AND name IS NOT NULL""")

embeddings = []
posts = []
for post in sql.fetchall():
    post = list(post)
    embeddings.append(json.loads(post[4]))
    posts.append([post[0], post[1], post[2], post[3].strftime('%Y-%m-%d'), post[5], post[6]])

embeddings = pd.DataFrame(embeddings)
posts = pd.DataFrame(posts, columns=['id', 'Title', 'Slug', 'Date', 'Category', 'Length'])
pca = PCA(n_components=2).fit_transform(embeddings)
components = pd.concat([posts, pd.DataFrame(data=pca, columns=['PC1', 'PC2'])], axis=1)

with open(os.path.dirname(__file__)+'/pca.json', 'w') as f:
    json.dump(components.values.tolist(), f)
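If you're curious how much of the total variation those two axes actually capture, scikit-learn will tell you – a quick, optional add-on to the script above (re-fitting so we keep the PCA object around):

#Optional: fraction of total variance captured by each of the two components
pca_model = PCA(n_components=2).fit(embeddings)
print(pca_model.explained_variance_ratio_)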
I won’t go into how to turn the JSON into an interactive chart like the above, but that’s all open to be seen in the web inspector. :)
Colin (Mar 22, 2024 at 8:01): I might've missed something, but is OpenAI needed here? Maybe for computing costs/time concerns? Could you just use word2vec and some other embedding package that wouldn't require a subscription and payment? Is it to leverage their corpus? I'm more familiar with embedding within a corpus, so we'd get similarities within your posts – but the idea is to get similarities to general language, given relatively few posts?
Cameron Harwick (Mar 22, 2024 at 9:03): Mainly I wanted to try it on as many dimensions as possible at first (300 for Word2vec vs 3,000 for OAI's), but they say you can get good results from 256 dimensions, so I'm sure it doesn't matter too much.