Since my 10-year-old related posts plugin can't even be downloaded anymore because of a security vulnerability, I figure it's time to bring related posts into the ✨ AI era ✨ with vector embeddings. Surprisingly, I didn't find any WordPress plugins that do this, so – inspired by TomBot2000, who did this for a static site – I'm going to do it for WordPress.
By contrast with a lot of old-style related posts plugins, which compare similarity at the word level, vector embedding uses neural networks to plot a piece of text in semantic space, meaning you’re going to get much more meaningful recommendations.
An embedding model, like a GPT, is trained on a large corpus of text. But unlike GPTs, which then go on to predict text, an embedding model just plots the input in some n-dimensional space of meaning. For example, "cat" and "feline" would occupy nearly the same point; "cat" and "dog" would sit somewhat nearby; but "cat" and – say – "vestibule" would be very far apart. These are the sorts of distinctions that an old-style word-level plugin isn't able to make.
The result of generating an embedding is a point in n-dimensional semantic space, a list of n numbers normalized between -1 and 1 that locate the meaning of your text. The nice part is that once you have the embeddings, you can calculate semantic distance between two pieces of text very easily as the distance between two points in that n-dimensional space – and this is how Twitter's "See similar posts" button works so well and so quickly.
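To make the idea concrete, here's a toy sketch in Python. The three-dimensional vectors and their values are made up purely for illustration (real embeddings have thousands of dimensions), but the cosine-similarity calculation is the same one we'll use later on:

#Toy illustration: cosine similarity between made-up "embedding" vectors
def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

cat       = [0.9, 0.1, 0.2]  #hypothetical point for "cat"
feline    = [0.8, 0.2, 0.2]  #points in nearly the same direction
vestibule = [0.1, 0.9, 0.1]  #points somewhere else entirely

print(cos_sim(cat, feline))     #~0.99: nearly the same meaning
print(cos_sim(cat, vestibule))  #~0.24: not very related

The closer the similarity is to 1, the closer the two texts are in meaning; unrelated texts land near 0.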
To make this work for us, we'll have to: (1) generate an embedding for each post through OpenAI's API, (2) store it in the post meta, and (3) calculate the cosine similarity between posts at display time to pick out the most closely related ones.
In principle we could do this all with PHP. However, OpenAI has a limit on input length in terms of tokens, so we'll have to check the token length of our posts. There are compatible PHP tokenizers, but they require Composer, and frankly, I'd rather not go down that road.
Also there’s a very nice OpenAI Python library that makes querying the API very easy (no futzing around with CURL). So what I’ll do is call a Python script from PHP. A little annoying, but not as annoying as Composer and CURL.
First of all, you'll need an OpenAI platform account and a little credit. Vector embeddings are pretty cheap to generate – I was able to do all 103 of my posts (223,000 words), twice because I screwed up the SQL the first time, and on the more expensive large model, for 7¢ total. If you have substantially more content, there's also a small model you can use for another order of magnitude less money.
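If you want to estimate your own cost before committing, the back-of-the-envelope math is simple. The figures below are assumptions, not exact numbers: roughly 1.3 tokens per English word, and roughly $0.13 per million tokens for the large embedding model (which is about what the pages-per-dollar figures later in this post work out to):

#Rough cost estimate for embedding a whole blog (assumed figures, not exact)
words = 223_000            #total words across all posts
tokens = words * 1.3       #~1.3 tokens per English word
price_per_million = 0.13   #assumed price for the large embedding model, in dollars
print(f"${tokens / 1_000_000 * price_per_million:.3f} per pass")

That works out to roughly 4¢ per pass, or 7–8¢ for two passes – in line with the figure above.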
Second, you’ll need to generate a secret key to authorize your use of the API. You can do this in the sidebar of OpenAI’s platform page. Be sure to copy it down.
Third, we'll add a WordPress setting for the API key so we don't have to hardcode it. Add the following to your theme's functions.php:
//For entering the OpenAI key in the settings page
add_action('admin_init', function() {
    add_settings_section('openai', 'OpenAI API', function() {}, 'reading');
    $var = 'openaikey';
    register_setting('reading', $var);
    add_settings_field($var, 'OpenAI API Key', function() use ($var) {
        $default = get_option($var);
        echo "<input type=\"text\" value=\"{$default}\"
            class=\"regular-text ltr\" name=\"{$var}\" id=\"{$var}\" />";
    }, 'reading', 'openai');
});
This adds an 'OpenAI' section to the Reading Settings page in the WordPress admin, and adds a text field where we can enter our API key. We can get this later with get_option('openaikey').
There are two Python packages that'll be useful for us: openai – which lets us bypass all the CURL querying – and tiktoken – which lets us make sure we don't run over the input token limit. So the first thing we'll have to do is SSH into the webserver (ssh username@host) and install them:
pip install tiktoken
pip install openai
Next we'll write a Python script to which we can pass the API key and the text content. We'll call it embeddings.py and put it in the theme folder.
import os, json, argparse, sys
import tiktoken, openai
parser = argparse.ArgumentParser(description='Submit post content to OpenAI to generate embeddings.')
parser.add_argument('--key', type=str, help='The OpenAI private key.')
args = parser.parse_args()
os.environ['OPENAI_API_KEY'] = args.key
These argparse lines let us pass the key in as an argument by calling embeddings.py --key your_openai_key, which is what we'll do from our PHP script. We'll then set it as an environment variable that the openai library will access.
The post content, unfortunately, is too long to pass as a command line argument, so we'll have PHP pass it in with stdin. We'll also go ahead and strip out line breaks here.
content = sys.stdin.read().replace("\n", ' ') #Strip line breaks
Next we'll initialize the OpenAI API and tokenize the content to make sure we're under the 8,192-token input limit. You can set model='text-embedding-3-small' if you have a lot of content and need it to be very, very cheap (62,500 pages/$ vs. 9,615 pages/$), but as I said, the large model only cost me 7¢ to run twice on all my content.
model = 'text-embedding-3-large'
ttk = tiktoken.encoding_for_model(model)
tokens = ttk.encode(content)

def remove_stopwords(text):
    for word in ['a', 'an', 'the', 'as', 'be', 'are', 'is', 'were', 'that',
                 'then', 'there', 'these', 'to', 'have', 'has', 'by', 'for', 'with']:
        text = text.replace(f' {word} ', ' ')
    return text

#Shorten if necessary
if len(tokens) > 8191:
    content = remove_stopwords(content)
    tokens = ttk.encode(content)
    if len(tokens) > 8191:
        del tokens[8191:]
        content = ttk.decode(tokens)
This converts the content into a list of integer tokens so you can tell how close you are to the limit. In this example, I remove semantically unimportant words from any content that comes in over the token limit, and then, if it's still over the limit, just truncate it and reconstitute the text. My long papers are on the order of 11,000–12,000 tokens, so if your posts are all short you can skip everything except the first line (setting the model), noting that the API will reject anything over the limit.
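If you just want to eyeball how many tokens a piece of text uses, a quick check looks like this (the sentence is made up; the encoding call is the same one used above):

#Quick token-count check on a made-up sentence
ttk = tiktoken.encoding_for_model('text-embedding-3-large')
tokens = ttk.encode('The cat sat in the vestibule.')
print(len(tokens))         #how many of the 8,192 allowed tokens this would use
print(ttk.decode(tokens))  #decoding reconstitutes the original text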
Finally, we query OpenAI and print the embeddings, which we’ll pull into our PHP file.
client = openai.OpenAI()
embedding = client.embeddings.create(input=[content], model=model).data[0].embedding
print(json.dumps(embedding))
This returns a 3,072-dimensional vector and outputs it as a JSON array. Computing 3,072-dimensional cosine distances on the fly is computationally cheap enough that it doesn't increase page load significantly for 100 posts, but if you're worried about storage space or have lots of posts, you can reduce the number of dimensions by passing a dimensions parameter to embeddings.create.
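For example, something like this – a sketch, where 256 is just an illustrative value; the API shortens the vector for you when you pass dimensions:

#Request a shorter embedding to save storage and computation (256 is illustrative)
embedding = client.embeddings.create(
    input=[content], model=model, dimensions=256
).data[0].embedding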
This function, which also goes in functions.php, hooks to save_post_post, which fires every time we save or publish a post. We'll limit it to posts (not pages), and skip revisions and anything unpublished.
First we generate the content by concatenating the title and the content and stripping tags. Then we call the Python script with the API key as an argument using proc_open, and pipe the content into stdin. If all that succeeds, we take the embedding printed by the Python script, and store it in the post's meta with the _embedding key.
add_action('save_post_post', function($post_id, $post=null, $update=false) {
    $post = get_post($post_id);
    if (wp_is_post_revision($post_id) || $post->post_status != 'publish') return;
    $content = get_the_title($post_id).' '.strip_tags(get_the_content(null, false, $post_id));
    $descriptorspec = [['pipe','r'], ['pipe','w'], ['pipe','w']]; //stdin, stdout, stderr
    $py = proc_open(
        'python3 '.__DIR__.'/embeddings.py --key '.escapeshellarg(get_option('openaikey')),
        $descriptorspec, $pipes
    );
    if (is_resource($py)) {
        fwrite($pipes[0], $content);
        fclose($pipes[0]);
        $embedding = stream_get_contents($pipes[1]);
        fclose($pipes[1]);
        $error = stream_get_contents($pipes[2]);
        fclose($pipes[2]);
        if (!proc_close($py)) //Returns -1 on error, 0 on success
            update_post_meta($post_id, '_embedding', $embedding);
    }
});
Generating the embeddings themselves is the computationally hard work. Once we've located our posts in semantic space, we just need to calculate the cosine distances between them. To do that, we add one more function to functions.php:
function cos_sim($a, $b) {
    $n = 0; $d1 = 0; $d2 = 0;
    foreach (array_map(null, $a, $b) as $i) {
        $n += $i[0] * $i[1];
        $d1 += $i[0]**2;
        $d2 += $i[1]**2;
    }
    return $n/(sqrt($d1)*sqrt($d2));
}
With this, we can calculate the related posts on any single.php page:
<?php $embeddings = $wpdb->get_results(
    "SELECT post_id, meta_value
     FROM {$wpdb->prefix}postmeta
     WHERE meta_key='_embedding'",
    OBJECT_K
);
$own = isset($embeddings[get_the_ID()]) ? json_decode($embeddings[get_the_ID()]->meta_value) : null;
if ($own) {
    $similarities = [];
    foreach ($embeddings as $id => $e) {
        if ($id == get_the_ID()) continue;
        $embedding = json_decode($e->meta_value);
        $similarities[$id] = cos_sim($own, $embedding);
    }
    arsort($similarities); ?>
    <h3>Articles Similar To This One</h3>
    <ul><?php foreach (array_keys(array_slice($similarities, 0, 3, true)) as $item) { ?>
        <li>
            <?php get_post($item);
            //Style your post here ?>
        </li>
    <?php } ?></ul>
<?php } //if ($own) ?>
Essentially, we query all the embeddings from the post meta, pull out the current page's, calculate the cosine similarity to every post other than itself, and generate related-posts markup for the top three. Again, with 100 posts this isn't intensive enough to noticeably affect page load time, but if you have a lot of posts, you can reduce the dimensionality or implement caching.
And voilà, a script to automatically generate embeddings and search for related posts every time you publish or update a page. See it in action by scrolling down!
The save_post_post hook above will generate an embedding every time a post is published or updated. Sometimes this is what we want, if there's a major content update. But for small edits, since it costs us money, we'd like an option to edit posts without regenerating the embedding.
To do this, we'll add a box to the bottom of the Post Settings sidebar in the editor, fire the save_post_post hook when it's checked, and return otherwise.
We'll add this box with the add_meta_box function, and attach some checkbox markup to it by referring to the embeddings_checkbox function.
<?php //Checkbox to avoid double generating embeddings unless we want to
add_action('add_meta_boxes', function() {
    add_meta_box(
        'gen_embeddings',
        'Embeddings',
        'embeddings_checkbox',
        'post', 'side', 'core'
    );
});

function embeddings_checkbox($post) {
    $value = get_post_meta($post->ID, '_embedding', true); ?>
    <label for="gen_embeddings">
        <input type="checkbox" id="gen_embeddings" name="gen_embeddings" <?php if (!$value) echo 'checked'; ?> />
        <?php echo $value ? 'Reg' : 'G'; ?>enerate embeddings
    </label>
<?php }
The second function checks whether the post has an existing embedding. If it doesn’t, it checks the box by default and prompts ‘Generate embeddings’. If it does, it unchecks the box, and gives you the option to ‘Regenerate embeddings’.
Then we only need to add one more line to the beginning of the save_post_post hook above, to abort the function if the checkbox isn't checked.
if (!isset($_POST['gen_embeddings']) || !$_POST['gen_embeddings']) return;
With the above, you could in principle just go into every post and hit “Update” to generate embeddings. But that’s tedious, and shouldn’t we be able to do it all at once?
In fact, I started with this step and then worked on everything previous. In the interest of keeping things simple, since it's a one-time, self-contained script, I'll write the whole thing in Python. It'll use the same two libraries as before (which you'll need to install on your local machine now, if you're running it there), plus cymysql, as well as much of the same code. Be sure, of course, to modify the table prefix as appropriate if it's anything other than wp_.
The connection information will be the same as in your WordPress install's wp-config.php file, and besides that the only thing you'll need to add is your OpenAI API key.
import os, re, json
import cymysql, tiktoken, openai

conn = cymysql.connect(host='mysql.website.com', user='user', passwd='pw', db='wpdb')
sql = conn.cursor()
sql.execute('''SELECT ID, post_content, post_title
    FROM wp_posts
    WHERE post_type="post" AND post_status="publish"''')

os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_KEY'
client = openai.OpenAI()

def remove_stopwords(text):
    for word in ['a', 'an', 'the', 'as', 'be', 'are', 'is', 'were', 'that',
                 'then', 'there', 'these', 'to', 'have', 'has', 'by', 'for', 'with']:
        text = text.replace(f' {word} ', ' ')
    return text

model = 'text-embedding-3-large'
ttk = tiktoken.encoding_for_model(model) #Uses cl100k_base

for post in sql.fetchall():
    #Strip HTML tags and line breaks
    content = re.sub('<[^<]+?>', '', f'{post[2]} {post[1]}'.replace("\n", ' '))
    tokens = ttk.encode(content)
    #Shorten if necessary
    if len(tokens) > 8191:
        content = remove_stopwords(content)
        tokens = ttk.encode(content)
        if len(tokens) > 8191:
            del tokens[8191:]
            content = ttk.decode(tokens)
    embedding = client.embeddings.create(input=[content], model=model).data[0].embedding
    sql.execute('''INSERT INTO wp_postmeta (post_id, meta_key, meta_value)
        VALUES (%s, %s, %s)''',
        (post[0], '_embedding', json.dumps(embedding)))
    print('Embedded post:', post[2])

conn.commit()
Run this once to fill in the embeddings on your old posts, and there’s your AI-powered related posts system!
Principal component analysis is a way to summarize as much of the variation in many-dimensional data as you can, using fewer dimensions. For example, a genome is many-dimensional, but since much of the variation in different genes is correlated, a good chunk of the total variation can be captured in just a few axes.
Conveniently, the vector embeddings we used for related posts locate each post at a point in 3,072-dimensional semantic space – just the sort of thing we can run a PCA on. This will find the two-dimensional plane in that 3,072-dimensional space that captures the most possible variation among the points corresponding to posts.
Below is an interactive PCA plot of all the posts on the website that you can filter by title. Color is category, size is length, and the location is the two semantic principal components. PCAs have no intrinsic meaning besides whatever two factors capture the most total variation, but since this is in semantic space, it looks like PC1 on the horizontal axis is something like Culture and Religion on the left to Macroeconomics and Crypto on the right, and PC2 on the vertical axis is something like Technical at the top to Popular at the bottom. Hover over each dot to see the post.
To generate the PCA, you’ll need Pandas and Scikit-Learn.
import cymysql, json, pandas as pd, os.path
from sklearn.decomposition import PCA

conn = cymysql.connect(host='...', user='...', passwd='...', db='...') #Copy from wp-config.php
sql = conn.cursor()
#This assumes one category per post. If you have more, just delete all but the first LEFT JOIN line or it'll screw it up.
sql.execute("""SELECT ID, post_title, post_name, post_date, meta_value AS embedding, name AS category, LENGTH(post_content) AS length
    FROM wp_posts
    LEFT JOIN wp_postmeta ON post_id=ID AND meta_key="_embedding"
    LEFT JOIN wp_term_relationships ON object_id=ID
    LEFT JOIN wp_term_taxonomy ON wp_term_taxonomy.term_taxonomy_id=wp_term_relationships.term_taxonomy_id AND taxonomy='category'
    LEFT JOIN wp_terms ON wp_terms.term_id=wp_term_taxonomy.term_id
    WHERE post_type="post" AND post_status='publish' AND meta_key IS NOT NULL AND name IS NOT NULL""")

embeddings = []
posts = []
for post in sql.fetchall():
    post = list(post)
    embeddings.append(json.loads(post[4]))
    posts.append([post[0], post[1], post[2], post[3].strftime('%Y-%m-%d'), post[5], post[6]])

embeddings = pd.DataFrame(embeddings)
posts = pd.DataFrame(posts, columns=['id', 'Title', 'Slug', 'Date', 'Category', 'Length'])
pca = PCA(n_components=2).fit_transform(embeddings)
components = pd.concat([posts, pd.DataFrame(data=pca, columns=['PC1', 'PC2'])], axis=1)

with open(os.path.dirname(__file__)+'/pca.json', 'w') as f:
    json.dump(components.values.tolist(), f)
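If you're curious how much of the total variation those two axes actually capture, scikit-learn will tell you – a quick, optional add-on to the script above (re-fitting so we keep the PCA object around):

#Optional: fraction of total variance captured by each of the two components
pca_model = PCA(n_components=2).fit(embeddings)
print(pca_model.explained_variance_ratio_)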
I won’t go into how to turn the JSON into an interactive chart like the above, but that’s all open to be seen in the web inspector. :)
Colin (Mar 22, 2024 at 8:01): I might've missed something, but is OpenAI needed here? Maybe for computing costs/time concerns? Could you just use word2vec and some other embedding package that wouldn't require a subscription and payment? Is it to leverage their corpus? I'm more familiar with embedding within a corpus, so we'd get similarities within your posts – but the idea is to get similarities to general language, given relatively few posts?
Cameron Harwick (Mar 22, 2024 at 9:03): Mainly I wanted to try it on as many dimensions as possible at first (300 for Word2vec vs 3,000 for OAI's), but they say you can get good results from 256 dimensions, so I'm sure it doesn't matter too much.