Aug20
How To

Interactive PCA of Vector Embeddings

Principal component analysis is a way to summarize as much of the variation in many-dimensional data as you can, using fewer dimensions. For example, a genome is many-dimensional, but since much of the variation in different genes is correlated, a good chunk of the total variation can be captured in just a few axes.
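To make that concrete, here is a minimal sketch of the idea in two dimensions, using only the standard library. The data is made up (a hypothetical y that mostly follows x), and the closed-form 2x2 eigenvalue formula stands in for what a real PCA library does in many dimensions, so treat it as an illustration rather than how you'd do it in practice.

```python
import math
import random

def principal_axes_2d(points):
    """Return ((lam1, lam2), unit PC1 direction) for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + spread, half_trace - spread
    # Eigenvector belonging to lam1 (the first principal component)
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)

random.seed(0)
# Hypothetical correlated data: y is roughly 2x plus a little noise
points = []
for _ in range(500):
    x = random.gauss(0, 1)
    points.append((x, 2 * x + random.gauss(0, 0.3)))

(lam1, lam2), axis = principal_axes_2d(points)
explained = lam1 / (lam1 + lam2)
print(f"PC1 direction: ({axis[0]:.2f}, {axis[1]:.2f})")
print(f"Share of variance captured by PC1: {explained:.1%}")
```

Because the two variables are so strongly correlated, a single axis pointing along the trend captures nearly all of the variation, which is the whole trick.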

Conveniently, each of the vector embeddings that we used for related posts is a 3,072-dimensional vector in semantic space, just the sort of thing we can use for a PCA. The PCA finds the two-dimensional plane in that 3,072-dimensional space that captures as much of the variation as possible among the points corresponding to posts.

PCA Visualization
A visualization of how a PCA from three to two dimensions works. Imagine this, but from 3,072 dimensions to 2. Image from NLPCA.org.

Below is an interactive PCA plot of all the posts on the website, which you can filter by title. Color is category, size is length, and location is given by the two semantic principal components. Principal components have no intrinsic meaning; they are just whichever two axes capture the most total variation. But since this is semantic space, PC1 on the horizontal axis looks something like Culture and Religion on the left to Macroeconomics and Crypto on the right, and PC2 on the vertical axis looks something like Technical at the top to Popular at the bottom. Hover over each dot to see the post.

[Interactive scatter plot: Principal Component 1 on the horizontal axis, Principal Component 2 on the vertical axis]

How To Do The PCA

Once you’ve gotten vector embeddings for all the posts you want, you can use this Python script to calculate a PCA and generate a JSON list with each post’s ID, title, slug, date, category, and length, plus its two principal component coordinates. This script assumes you’ve stored the embeddings as a meta item as in the previous post, but you can easily modify it to pull embeddings from wherever you have them. You’ll need CyMySQL, Pandas, and Scikit-Learn.

import json
import os.path

import cymysql
import pandas as pd
from sklearn.decomposition import PCA

conn=cymysql.connect(host='...', user='...', passwd='...',db='...') #Copy from wp-config.php
sql=conn.cursor()
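#This assumes one category per post. A post in several categories comes back once per category,
#so if that applies to you, delete all but the first LEFT JOIN line (and the category column and
#name filter that depend on them) or it'll screw it up.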
sql.execute("""SELECT ID, post_title, post_name, post_date, meta_value AS embedding, name AS category, LENGTH(post_content) AS length
            FROM wp_posts
            LEFT JOIN wp_postmeta ON post_id=ID AND meta_key="_embedding"
            LEFT JOIN wp_term_relationships ON object_id=ID
            LEFT JOIN wp_term_taxonomy ON wp_term_taxonomy.term_taxonomy_id=wp_term_relationships.term_taxonomy_id AND taxonomy='category'
            LEFT JOIN wp_terms ON wp_terms.term_id=wp_term_taxonomy.term_id
            WHERE post_type="post" AND post_status='publish' AND meta_key IS NOT NULL AND name IS NOT NULL""")

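embeddings = []
posts = []
#Row columns: 0 ID, 1 post_title, 2 post_name, 3 post_date, 4 embedding, 5 category, 6 length
for post in sql.fetchall():
    embeddings.append(json.loads(post[4])) #The embedding is stored as a JSON array string
    posts.append([post[0], post[1], post[2], post[3].strftime('%Y-%m-%d'), post[5], post[6]])
conn.close()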

embeddings = pd.DataFrame(embeddings)
posts = pd.DataFrame(posts, columns=['id', 'Title', 'Slug', 'Date', 'Category', 'Length'])
pca = PCA(n_components=2).fit_transform(embeddings)
components = pd.concat([posts, pd.DataFrame(data=pca, columns=['PC1', 'PC2'])], axis=1)

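#Write pca.json next to this script. abspath covers the case where __file__ is a bare filename,
#which would otherwise make dirname empty and send the file to /pca.json.
with open(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'pca.json'), 'w') as f:
    json.dump(components.values.tolist(), f)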
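
The JSON that comes out is a list of plain lists with no column names, so whatever reads it has to know the order. As a rough sketch of consuming it, assuming the column order from the script above, this turns each row into a named record. The two sample rows are invented for illustration and stand in for the real pca.json.

```python
import json

# Column order as written by the script above
COLUMNS = ["id", "Title", "Slug", "Date", "Category", "Length", "PC1", "PC2"]

def rows_to_records(rows):
    """Turn each positional row into a dict keyed by column name."""
    return [dict(zip(COLUMNS, row)) for row in rows]

# Hypothetical sample in the same shape as pca.json (all values are made up)
sample_json = """[
  [101, "Example Post A", "example-post-a", "2024-01-05", "Meta", 5200, -0.12, 0.31],
  [102, "Example Post B", "example-post-b", "2024-02-10", "How To", 1800, 0.27, -0.08]
]"""

records = rows_to_records(json.loads(sample_json))

# Group slugs by category, e.g. to assign one color per category in a chart
by_category = {}
for record in records:
    by_category.setdefault(record["Category"], []).append(record["Slug"])
print(by_category)
```

To use it on the real file, swap `json.loads(sample_json)` for `json.load` on an open handle to pca.json.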

I won’t go into how to turn the JSON into an interactive chart like the above, but that’s all open to be seen in the web inspector. :)

Topics

Meta
