From 35e8ba6b33a18ec2e1c64ccf4d52d3b0dcc9fe17 Mon Sep 17 00:00:00 2001 From: roseline1 Date: Wed, 5 Nov 2025 18:46:21 +0000 Subject: [PATCH 1/2] fix top 10 activations --- tutorials/tutorial_2_0.ipynb | 3636 +++++++++++++++++----------------- 1 file changed, 1818 insertions(+), 1818 deletions(-) diff --git a/tutorials/tutorial_2_0.ipynb b/tutorials/tutorial_2_0.ipynb index 39956e8aa..3cfb2d590 100644 --- a/tutorials/tutorial_2_0.ipynb +++ b/tutorials/tutorial_2_0.ipynb @@ -1,1820 +1,1820 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "MNk7IylTv610" - }, - "source": [ - "# SAE Lens + Neuronpedia Tutorial\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This tutorial is an introduction to analysis of neural networks using sparse autoencoders (SAEs), a new and popular technique in mechanistic interpretability. For more context, we refer you to [this post](https://transformer-circuits.pub/2023/monosemantic-features).\n", - "\n", - "However, we will explain what SAE features are, how to load SAEs into SAELens and find/identify features, and how to do steering, ablation, and attribution with them.\n", - "\n", - "This tutorial covers:\n", - "\n", - "- A basic introduction to SAEs.\n", - " - What is SAE Lens?\n", - " - Choosing an SAE to analyse and loading it with [SAE Lens](https://github.com/decoderesearch/SAELens).\n", - " - The SAE Class and it's config.\n", - "- SAE Features.\n", - " - What is a feature dashboard?\n", - " - Loading feature dashboards on [Neuronpedia](https://www.neuronpedia.org/).\n", - " - Downloading Autointerp and searching via explanations.\n", - "- Feature inference\n", - " - Using the HookedSAE Transformer Class to decompose activations into features.\n", - " - Comparing Features accross related prompts.\n", - "- Making Feature Dashboards\n", - " - Max Activating Examples\n", - " - Feature Activation Histograms\n", - " - Logit Weight Distributions.\n", - " - Extension: Reproducing `Not all 
language model features are linear`\n", - "- SAE based Analysis Methods (Advanced)\n", - " - Steering model generation with SAE Features\n", - " - Ablating SAE Features\n", - " - Gradient-based Attribution for Circuit Detection\n", - "\n", - "**Warning:** This tutorial is a rough initial draft, prepared in a fairly short timeframe, and we expect to make many improvements in the future. Nevertheless, we hope this initial version is useful for those looking to get started.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "i_DusoOvwV0M" - }, - "source": [ - "## Set Up (Just Run / Not Important)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "yfDUxRx0wSRl" - }, - "outputs": [], - "source": [ - "try:\n", - " import google.colab # type: ignore\n", - " from google.colab import output\n", - "\n", - " COLAB = True\n", - " %pip install sae-lens transformer-lens sae-dashboard\n", - "except:\n", - " COLAB = False\n", - " from IPython import get_ipython # type: ignore\n", - "\n", - " ipython = get_ipython()\n", - " assert ipython is not None\n", - " ipython.run_line_magic(\"load_ext\", \"autoreload\")\n", - " ipython.run_line_magic(\"autoreload\", \"2\")\n", - "\n", - "# Standard imports\n", - "import os\n", - "import torch\n", - "from tqdm.auto import tqdm\n", - "import plotly.express as px\n", - "import pandas as pd\n", - "\n", - "# Imports for displaying vis in Colab / notebook\n", - "\n", - "torch.set_grad_enabled(False)\n", - "\n", - "# For the most part I'll try to import functions and classes near where they are used\n", - "# to make it clear where they come from.\n", - "\n", - "if torch.backends.mps.is_available():\n", - " device = \"mps\"\n", - "else:\n", - " device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", - "\n", - "print(f\"Device: {device}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XoMx3VZpv611" - }, - "source": [ - "# Loading a pretrained Sparse Autoencoder\n", - 
"\n", - "As a first step, we will actually load an SAE! But before we do so, it can be useful to see which are available. The following snippet shows the currently available SAE releases in SAELens, and will remain up-to-date as we continue to add more SAEs.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sae_lens.loading.pretrained_saes_directory import get_pretrained_saes_directory\n", - "\n", - "# TODO: Make this nicer.\n", - "df = pd.DataFrame.from_records(\n", - " {k: v.__dict__ for k, v in get_pretrained_saes_directory().items()}\n", - ").T\n", - "df.drop(\n", - " columns=[\n", - " \"expected_var_explained\",\n", - " \"expected_l0\",\n", - " \"config_overrides\",\n", - " \"conversion_func\",\n", - " ],\n", - " inplace=True,\n", - ")\n", - "df # Each row is a \"release\" which has multiple SAEs which may have different configs / match different hook points in a model." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In practice, SAEs can be of varying usefulness for general use cases. To start with, we recommend the following:\n", - "\n", - "- Joseph's Open Source GPT2 Small Residual (gpt2-small-res-jb)\n", - "- Joseph's Feature Splitting (gpt2-small-res-jb-feature-splitting)\n", - "- Gemma SAEs (gemma-2b-res-jb) (0,6) <- on Neuronpedia and good. (12 / 17 aren't very good currently).\n", - "\n", - "Other SAEs have various issues--e.g., too dense or not dense enough, or designed for special use cases, or initial drafts of what we hope will be better versions later. Decode Research / Neuronpedia are working on making all SAEs on Neuronpedia loadable in SAE Lens and vice versa, as well as providing public benchmarking stats to help people choose which SAEs to work with.\n", - "\n", - "To see all the SAEs contained in a specific release (named after the part of the model they apply to), simply run the below. 
Each hook point corresponds to a layer or module of the model.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# show the contents of the saes_map column for a specific row\n", - "print(\"SAEs in the GTP2 Small Resid Pre release\")\n", - "for k, v in df.loc[df.release == \"gpt2-small-res-jb\", \"saes_map\"].values[0].items():\n", - " print(f\"SAE id: {k} for hook point: {v}\")\n", - "\n", - "print(\"-\" * 50)\n", - "print(\"SAEs in the feature splitting release\")\n", - "for k, v in (\n", - " df.loc[df.release == \"gpt2-small-res-jb-feature-splitting\", \"saes_map\"]\n", - " .values[0]\n", - " .items()\n", - "):\n", - " print(f\"SAE id: {k} for hook point: {v}\")\n", - "\n", - "print(\"-\" * 50)\n", - "print(\"SAEs in the Gemma base model release\")\n", - "for k, v in df.loc[df.release == \"gemma-2b-res-jb\", \"saes_map\"].values[0].items():\n", - " print(f\"SAE id: {k} for hook point: {v}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next we'll load a specific SAE, as well as a copy of GPT-2 Small to attach it to. To load the model, we'll use the HookedSAETransformer class, which is adapted from the TransformerLens HookedTransformer.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "sNSfL80Uv611" - }, - "outputs": [], - "source": [ - "# from transformer_lens import HookedTransformer\n", - "from sae_lens import SAE, HookedSAETransformer\n", - "\n", - "model = HookedSAETransformer.from_pretrained(\"gpt2-small\", device=device)\n", - "\n", - "sae = SAE.from_pretrained(\n", - " release=\"gpt2-small-res-jb\", # <- Release name\n", - " sae_id=\"blocks.7.hook_resid_pre\", # <- SAE id (not always a hook point!)\n", - " device=device,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The \"sae\" object is an instance of the SAE (Sparse Autoencoder class). 
There are many different SAE architectures which may have different weights or activation functions. In order to simplify working with SAEs, SAE Lens handles most of this complexity for you.\n", - "\n", - "Let's look at the SAE config and understand each of the parameters:\n", - "\n", - "1. `architecture`: Specifies the type of SAE architecture being used, in this case, the standard architecture (encoder and decoder with hidden activations, as opposed to a gated SAE).\n", - "2. `d_in`: Defines the input dimension of the SAE, which is 768 in this configuration.\n", - "3. `d_sae`: Sets the dimension of the SAE's hidden layer, which is 24576 here. This represents the number of possible feature activations.\n", - "4. `activation_fn_str`: Specifies the activation function used in the SAE, which is ReLU in this case. TopK is another option that we will not cover here.\n", - "5. `apply_b_dec_to_input`: Determines whether to apply the decoder bias to the input, set to True here.\n", - "6. `finetuning_scaling_factor`: Indicates whether to use a scaling factor to weight initialization and the forward pass. This is not usually used and was introduced to support a [solution for shrinkage](https://www.lesswrong.com/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes).\n", - "7. `context_size`: Defines the size of the context window, which is 128 tokens in this case. In turns out SAEs trained on small activations from small prompts [often don't perform well on longer prompts](https://www.lesswrong.com/posts/baJyjpktzmcmRfosq/stitching-saes-of-different-sizes).\n", - "8. `model_name`: Specifies the name of the model being used, which is 'gpt2-small' here. [This is a valid model name in TransformerLens](https://transformerlensorg.github.io/TransformerLens/generated/model_properties_table.html).\n", - "9. `hook_name`: Indicates the specific hook in the model where the SAE is applied.\n", - "10. 
`hook_head_index`: Defines which attention head to hook into; not relevant here since we are looking at a residual stream SAE.\n", - "11. `prepend_bos`: Determines whether to prepend the beginning-of-sequence token, set to True.\n", - "12. `dataset_path`: Specifies the path to the dataset used for training or evaluation. (Can be local or a huggingface dataset.)\n", - "13. `dataset_trust_remote_code`: Indicates whether to trust remote code (from HuggingFace) when loading the dataset, set to True.\n", - "14. `normalize_activations`: Specifies how to normalize activations, set to 'none' in this config.\n", - "15. `dtype`: Defines the data type for tensor operations, set to 32-bit floating point.\n", - "16. `device`: Specifies the computational device to use.\n", - "17. `sae_lens_training_version`: Indicates the version of SAE Lens used for training, set to None here.\n", - "18. `activation_fn_kwargs`: Allows for additional keyword arguments for the activation function. This would be used if e.g. the `activation_fn_str` was set to `topk`, so that `k` could be specified.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(sae.cfg.__dict__)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next we need to load in a dataset to work with. 
We'll just use a sample of the Pile.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from datasets import load_dataset\n", - "from transformer_lens.utils import tokenize_and_concatenate\n", - "\n", - "dataset = load_dataset(\n", - " path=\"NeelNanda/pile-10k\",\n", - " split=\"train\",\n", - " streaming=False,\n", - ")\n", - "\n", - "token_dataset = tokenize_and_concatenate(\n", - " dataset=dataset, # type: ignore\n", - " tokenizer=model.tokenizer, # type: ignore\n", - " streaming=True,\n", - " max_length=sae.cfg.metadata.context_size,\n", - " add_bos_token=sae.cfg.metadata.prepend_bos,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Basics: What are SAE Features?\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Opening a feature dashboard on Neuronpedia\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Before we dive too deep into the various things you can do with SAEs, let's address a basic question: What is an SAE feature?\n", - "\n", - "An SAE feature represents a pattern or concept that the autoencoder has learned to detect in the input data. These features often correspond to meaningful semantic, syntactic, or otherwise interpretable elements of text, and correspond to linear directions in activation space. SAEs are trained on the activations of a specific part of the model, and after training, these features show up as activations in the hidden layer of the SAE (which is much wider than the source activation vector, and produces one hidden activation per feature). As such, the hidden activations represent a decomposition of the entangled/superimposed features found in the original model activations. Ideally, these activations are sparse: Only a few of the many possible hidden activations actually activate for a given piece of input. 
This sparseness tends to correspond to ease of interpretability.\n", - "\n", - "The dashboard shown here provides a detailed view of a single SAE feature. (Refresh the cell to see more examples). Let's break down its components:\n", - "\n", - "1. Feature Description: At the top, we see an auto-interp-sourced description of the feature.\n", - "\n", - "2. Logit Plots: The top positive and negative logits for the feature. The values indicate the strength of the association.\n", - "\n", - "3. Activations Density Plot: This histogram shows the distribution of activation values for this feature across a randomly sampled dataset. The x-axis represents activation strength, and the y-axis shows frequency. The top chart is simply the distribution of non-zero activations, and the second plot shows the density of negative and positive logits.\n", - "\n", - "4. Test Activation: You can use this feature within the notebook or Neuronpedia itself--simply enter text to see how the feature is activated across the text.\n", - "\n", - "5. Top Activations: Below the plots, we see max-activating examples of text snippets that strongly activate this feature. 
Each snippet is highlighted where the activation appears.\n", - "\n", - "See this section of [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features#setup-interface) for more information.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import IFrame\n", - "\n", - "# get a random feature from the SAE\n", - "feature_idx = torch.randint(0, sae.cfg.d_sae, (1,)).item()\n", - "\n", - "html_template = \"https://neuronpedia.org/{}/{}/{}?embed=true&embedexplanation=true&embedplots=true&embedtest=true&height=300\"\n", - "\n", - "\n", - "def get_dashboard_html(sae_release=\"gpt2-small\", sae_id=\"7-res-jb\", feature_idx=0):\n", - " return html_template.format(sae_release, sae_id, feature_idx)\n", - "\n", - "\n", - "html = get_dashboard_html(\n", - " sae_release=\"gpt2-small\", sae_id=\"7-res-jb\", feature_idx=feature_idx\n", - ")\n", - "IFrame(html, width=1200, height=600)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For the randomly selected feature above, can you predict which text will make it fire? Can you test your theory?\n", - "\n", - "Eg: Imagine it seemed to fire on pokemon. Testing whether the feature fires on Digimon (a similar game with pet monsters) may suggest a different story.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Downloading / Searching Autointerp\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "What if we wanted to search for a feature relating to a specific thing? Then we could use the explanation search API. Let's just download all the [autointerp explanations](https://openai.com/index/language-models-can-explain-neurons-in-language-models/) for these SAE features and load them in as a Pandas dataframe. 
The Neuronpedia API docs will be useful here: https://www.neuronpedia.org/api-doc#tag/explanations/GET/api/explanation/export.\n", - "\n", - "_Note: not every SAE in SAE Lens is on Neuronpedia and not all SAEs on Neuronpedia have autointerp for all features. This is a work in progress_.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import requests\n", - "\n", - "url = \"https://www.neuronpedia.org/api/explanation/export?modelId=gpt2-small&saeId=7-res-jb\"\n", - "headers = {\"Content-Type\": \"application/json\"}\n", - "\n", - "response = requests.get(url, headers=headers)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# convert to pandas\n", - "data = response.json()\n", - "explanations_df = pd.DataFrame(data)\n", - "# rename index to \"feature\"\n", - "explanations_df.rename(columns={\"index\": \"feature\"}, inplace=True)\n", - "# explanations_df[\"feature\"] = explanations_df[\"feature\"].astype(int)\n", - "explanations_df[\"description\"] = explanations_df[\"description\"].apply(\n", - " lambda x: x.lower()\n", - ")\n", - "explanations_df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's search for a feature related to the Bible.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "bible_features = explanations_df.loc[explanations_df.description.str.contains(\" bible\")]\n", - "bible_features" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Let's get the dashboard for this feature.\n", - "html = get_dashboard_html(\n", - " sae_release=\"gpt2-small\",\n", - " sae_id=\"7-res-jb\",\n", - " feature_idx=bible_features.feature.values[0],\n", - ")\n", - "IFrame(html, width=1200, height=600)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 
Basics: Getting Features Using SAEs\n", - "\n", - "Autointerp is such a bad way to find features. We really care about understanding model predictions on real prompts using SAEs. Let's check for features used in completing this bible verse. Will we see a bible feature?\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from transformer_lens.utils import test_prompt\n", - "\n", - "prompt = \"In the beginning, God created the heavens and the\"\n", - "answer = \"earth\"\n", - "\n", - "# Show that the model can confidently predict the next token.\n", - "test_prompt(prompt, answer, model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Using a HookedSAETransformer\n", - "\n", - "We have a whole tutorial on running models with SAEs using the HookedSAE Transformer class -> \n", - "\"Open\n", - "\n", - "\n", - "Here we'll just demonstrate how to get features using the class.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# SAEs don't reconstruct activation perfectly, so if you attach an SAE and want the model to stay performant, you need to use the error term.\n", - "# This is because the SAE will be used to modify the forward pass, and if it doesn't reconstruct the activations well, the outputs may be effected.\n", - "# Good SAEs have small error terms but it's something to be mindful of.\n", - "\n", - "sae.use_error_term # If use error term is set to false, we will modify the forward pass by using the sae." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Below, we'll use the `run_with_cache_with_saes` function of the HookedSAETransformer, which will give us all the cached activations (including those from the SAE that we've specified in the arguments). 
Running our prompt through the model gets us activation tensors as follows:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# hooked SAE Transformer will enable us to get the feature activations from the SAE\n", - "_, cache = model.run_with_cache_with_saes(prompt, saes=[sae])\n", - "\n", - "print([(k, v.shape) for k, v in cache.items() if \"sae\" in k])\n", - "\n", - "# note there were 11 tokens in our prompt, the residual stream dimension is 768, and the number of SAE features is 768" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, we'll visualize the activations of the hidden layer of the SAE at the final token position of the prompt. Each of these vertical lines correspond to a feature activation. We can also plot the dashboards for each of these activated features, using their position in the activation cache as an index to pull data from Neuronpedia. We'll do this for the top features only.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# let's look at which features fired at layer 8 at the final token position\n", - "\n", - "# hover over lines to see the Feature ID.\n", - "px.line(\n", - " cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][0, -1, :].cpu().numpy(),\n", - " title=\"Feature activations at the final token position\",\n", - " labels={\"index\": \"Feature\", \"value\": \"Activation\"},\n", - ").show()\n", - "\n", - "# let's print the top 5 features and how much they fired\n", - "vals, inds = torch.topk(\n", - " cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][0, -1, :], 5\n", - ")\n", - "for val, ind in zip(vals, inds):\n", - " print(f\"Feature {ind} fired {val:.2f}\")\n", - " html = get_dashboard_html(\n", - " sae_release=\"gpt2-small\", sae_id=\"7-res-jb\", feature_idx=ind\n", - " )\n", - " display(IFrame(html, width=1200, height=300))" - ] - }, - { - "cell_type": 
"markdown", - "metadata": {}, - "source": [ - "### The Contrast Pairs Trick\n", - "\n", - "Sometimes we may be interested in which features fire differently between two prompts. Let's investigate this question by comparing the resultant activations. As we can see, using the prompt below changes the logit prediction considerably:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from transformer_lens.utils import test_prompt\n", - "\n", - "prompt = \"In the beginning, God created the cat and the\"\n", - "answer = \"earth\"\n", - "\n", - "# here we see that removing the word \"Heavens\" is very effective at making the model no longer predict \"earth\".\n", - "# instead the model predicts a bunch of different animals.\n", - "# Can we work out which features fire differently which might explain this? (This is a toy example not meant to be super interesting)\n", - "test_prompt(prompt, answer, model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's plot the two activation vectors.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "prompt = [\n", - " \"In the beginning, God created the heavens and the\",\n", - " \"In the beginning, God created the cat and the\",\n", - "]\n", - "_, cache = model.run_with_cache_with_saes(prompt, saes=[sae])\n", - "print([(k, v.shape) for k, v in cache.items() if \"sae\" in k])\n", - "\n", - "feature_activation_df = pd.DataFrame(\n", - " cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][0, -1, :].cpu().numpy(),\n", - " index=[f\"feature_{i}\" for i in range(sae.cfg.d_sae)],\n", - ")\n", - "feature_activation_df.columns = [\"heavens_and_the\"]\n", - "feature_activation_df[\"cat_and_the\"] = (\n", - " cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][1, -1, :].cpu().numpy()\n", - ")\n", - "feature_activation_df[\"diff\"] = (\n", - " 
feature_activation_df[\"heavens_and_the\"] - feature_activation_df[\"cat_and_the\"]\n", - ")\n", - "\n", - "fig = px.line(\n", - " feature_activation_df,\n", - " title=\"Feature activations for the prompt\",\n", - " labels={\"index\": \"Feature\", \"value\": \"Activation\"},\n", - ")\n", - "\n", - "# hide the x-ticks\n", - "fig.update_xaxes(showticklabels=False)\n", - "fig.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can see that there are differences, but let's plot the feature dashboards for the features with the biggest diffs to see what they are. We can see that the biggest difference is that there is now an active \"animal\" feature.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# let's look at the biggest features in terms of absolute difference\n", - "\n", - "diff = (\n", - " cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][1, -1, :].cpu()\n", - " - cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][0, -1, :].cpu()\n", - ")\n", - "vals, inds = torch.topk(torch.abs(diff), 5)\n", - "for val, ind in zip(vals, inds):\n", - " print(f\"Feature {ind} had a difference of {val:.2f}\")\n", - " html = get_dashboard_html(\n", - " sae_release=\"gpt2-small\", sae_id=\"7-res-jb\", feature_idx=ind\n", - " )\n", - " display(IFrame(html, width=1200, height=300))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "So we see that with cats, there is now an animal predicting feature that fires quite strongly, and a feature that fires on \"and\" and promotes \"valleys\" and other geological terms no longer fires.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Making Feature Dashboards (Optional)\n", - "\n", - "For those interested, we provide a section showing how to generate the components of feature dashboards.\n", - "\n", - "We've covered what the feature dashboards are displaying, but let's dive into 
this in more detail so that we fully understand what the plots signify. To repeat the explanation above and provide more detail, basic feature dashboards have 4 main components:\n", - "\n", - "1. Feature Activation Distribution. We report the proportion of tokens a feature fires on, usually between 1 in every 100 and 1 in every 10,000 tokens activations, and show the distribution of positive activations.\n", - "2. Logit weight distribution. This is the projection of the decoder weight onto the unembed and roughly gives us a sense of the tokens promoted by a feature. It's less useful in big models / middle layers.\n", - "3. The top 10 and bottom 10 tokens in the logit weight distribution (positive/negative logits).\n", - "4. **Max Activating Examples**. These are examples of text where the feature fires and usually provide the most information for helping us work out what a feature means.\n", - "\n", - "**Bonus Section: Reproducing circular subspace geometry from [Not all Language Model Features are Linear](https://arxiv.org/abs/2405.14860)**\n", - "\n", - "_Neuronpedia_ is a website that hosts feature dashboards and which runs servers that can run the model and check feature activations. This makes it very convenient to check that a feature fires on the distribution of text you actually think it should fire on. We've been downloading data from Neuronpedia for some of the plots above.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Local: Finding Max Activating Examples\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We'll start by finding the max-activating examples--the prompts that show the highest level of activation from a feature. 
First, we'll prepare a feature store, which streams a sample of text from an SAE's orginal training dataset and creates activations for them.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# instantiate an object to hold activations from a dataset\n", - "from sae_lens import ActivationsStore\n", - "\n", - "# a convenient way to instantiate an activation store is to use the from_sae method\n", - "activation_store = ActivationsStore.from_sae(\n", - " model=model,\n", - " dataset=sae.cfg.metadata.dataset_path,\n", - " sae=sae,\n", - " streaming=True,\n", - " # fairly conservative parameters here so can use same for larger\n", - " # models without running out of memory.\n", - " store_batch_size_prompts=8,\n", - " train_batch_size_tokens=4096,\n", - " n_batches_in_buffer=32,\n", - " device=device,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def list_flatten(nested_list):\n", - " return [x for y in nested_list for x in y]\n", - "\n", - "\n", - "# A very handy function Neel wrote to get context around a feature activation\n", - "def make_token_df(tokens, len_prefix=5, len_suffix=3, model=model):\n", - " str_tokens = [model.to_str_tokens(t) for t in tokens]\n", - " unique_token = [\n", - " [f\"{s}/{i}\" for i, s in enumerate(str_tok)] for str_tok in str_tokens\n", - " ]\n", - "\n", - " context = []\n", - " prompt = []\n", - " pos = []\n", - " label = []\n", - " for b in range(tokens.shape[0]):\n", - " for p in range(tokens.shape[1]):\n", - " prefix = \"\".join(str_tokens[b][max(0, p - len_prefix) : p])\n", - " if p == tokens.shape[1] - 1:\n", - " suffix = \"\"\n", - " else:\n", - " suffix = \"\".join(\n", - " str_tokens[b][p + 1 : min(tokens.shape[1] - 1, p + 1 + len_suffix)]\n", - " )\n", - " current = str_tokens[b][p]\n", - " context.append(f\"{prefix}|{current}|{suffix}\")\n", - " prompt.append(b)\n", - " 
pos.append(p)\n", - " label.append(f\"{b}/{p}\")\n", - " # print(len(batch), len(pos), len(context), len(label))\n", - " return pd.DataFrame(\n", - " dict(\n", - " str_tokens=list_flatten(str_tokens),\n", - " unique_token=list_flatten(unique_token),\n", - " context=context,\n", - " prompt=prompt,\n", - " pos=pos,\n", - " label=label,\n", - " )\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we'll generate examples for a random set of features.\n", - "\n", - "The following code does the following (for a randomly selected set of 100 features):\n", - "\n", - "1. Samples tokens from the dataset, prepending a bos if the SAE was trained with that and making sure prompts are the correct size for the SAE.\n", - "2. Generates activations, tracking which tokens a feature fired on.\n", - "3. (Just for `Not all language model features are linear`) Keeps track of the subspace geneated by those features.\n", - "4. Make a dataframe with all the tokens in all the prompts where at least one feature fired.\n", - "\n", - "\\*Note: this code is fairly slow in part due to the dataframe concat and in part because we actually have to run the model rather than using cached activations. SAE Lens officially recommends [SAE Dashboard](https://github.com/jbloomAus/SAEDashboard) for dashboard generation in practice.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# finding max activating examples is a bit harder. 
To do this we need to calculate feature activations for a large number of tokens\n", - "feature_list = torch.randint(0, sae.cfg.d_sae, (100,))\n", - "examples_found = 0\n", - "all_fired_tokens = []\n", - "all_feature_acts = []\n", - "all_reconstructions = []\n", - "all_token_dfs = []\n", - "\n", - "total_batches = 100\n", - "batch_size_prompts = activation_store.store_batch_size_prompts\n", - "batch_size_tokens = activation_store.context_size * batch_size_prompts\n", - "pbar = tqdm(range(total_batches))\n", - "for i in pbar:\n", - " tokens = activation_store.get_batch_tokens()\n", - " tokens_df = make_token_df(tokens)\n", - " tokens_df[\"batch\"] = i\n", - "\n", - " flat_tokens = tokens.flatten()\n", - "\n", - " _, cache = model.run_with_cache(tokens, names_filter=[sae.cfg.metadata.hook_name])\n", - " sae_in = cache[sae.cfg.metadata.hook_name]\n", - " feature_acts = sae.encode(sae_in).squeeze()\n", - "\n", - " feature_acts = feature_acts.flatten(0, 1)\n", - " fired_mask = (feature_acts[:, feature_list]).sum(dim=-1) > 0\n", - " fired_tokens = model.to_str_tokens(flat_tokens[fired_mask])\n", - " reconstruction = feature_acts[fired_mask][:, feature_list] @ sae.W_dec[feature_list]\n", - "\n", - " token_df = tokens_df.iloc[fired_mask.cpu().nonzero().flatten().numpy()]\n", - " all_token_dfs.append(token_df)\n", - " all_feature_acts.append(feature_acts[fired_mask][:, feature_list])\n", - " all_fired_tokens.append(fired_tokens)\n", - " all_reconstructions.append(reconstruction)\n", - "\n", - " examples_found += len(fired_tokens)\n", - " # print(f\"Examples found: {examples_found}\")\n", - " # update description\n", - " pbar.set_description(f\"Examples found: {examples_found}\")\n", - "\n", - "# flatten the list of lists\n", - "all_token_dfs = pd.concat(all_token_dfs)\n", - "all_fired_tokens = list_flatten(all_fired_tokens)\n", - "all_reconstructions = torch.cat(all_reconstructions)\n", - "all_feature_acts = torch.cat(all_feature_acts)" - ] - }, - { - "cell_type": 
"markdown", - "metadata": {}, - "source": [ - "## Getting Feature Activation Histogram\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, we can generate the feature activation histogram (just as we saw on the dashboards above) and display the list of max-activating examples we just generated. We'll just do this for the first feature in our random set (index 0).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "feature_acts_df = pd.DataFrame(\n", - " all_feature_acts.detach().cpu().numpy(),\n", - " columns=[f\"feature_{i}\" for i in feature_list],\n", - ")\n", - "feature_acts_df.shape" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "feature_idx = 0\n", - "# get non-zero activations\n", - "\n", - "all_positive_acts = all_feature_acts[all_feature_acts[:, feature_idx] > 0][\n", - " :, feature_idx\n", - "].detach()\n", - "prop_positive_activations = (\n", - " 100 * len(all_positive_acts) / (total_batches * batch_size_tokens)\n", - ")\n", - "\n", - "px.histogram(\n", - " all_positive_acts.cpu(),\n", - " nbins=50,\n", - " title=f\"Histogram of positive activations - {prop_positive_activations:.3f}% of activations were positive\",\n", - " labels={\"value\": \"Activation\"},\n", - " width=800,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "top_10_activations = feature_acts_df.sort_values(\n", - " f\"feature_{feature_list[0]}\", ascending=False\n", - ").head(10)\n", - "all_token_dfs.iloc[\n", - " top_10_activations.index\n", - "] # TODO: double check this is working correctly" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Getting the Top 10 Logit Weights\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As a final step, we'll generate the top 10 logit weights--that is, 
we'll see what tokens each of the features in our set is promoting most strongly.\n", - "\n", - "Note it's important to fold layer norm (by default SAE Lens loads Transformers with folded layer norm, but sometimes we turn preprocessing off to save GPU RAM, and this would affect the logit weight histograms a little bit).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(f\"Shape of the decoder weights {sae.W_dec.shape}\")\n", - "print(f\"Shape of the model unembed {model.W_U.shape}\")\n", - "projection_matrix = sae.W_dec @ model.W_U\n", - "print(f\"Shape of the projection matrix {projection_matrix.shape}\")\n", - "\n", - "# then we take the top_k tokens per feature and decode them\n", - "top_k = 10\n", - "# let's do this for 100 random features\n", - "_, top_k_tokens = torch.topk(projection_matrix[feature_list], top_k, dim=1)\n", - "\n", - "\n", - "feature_df = pd.DataFrame(\n", - " top_k_tokens.cpu().numpy(), index=[f\"feature_{i}\" for i in feature_list]\n", - ").T\n", - "feature_df.index = [f\"token_{i}\" for i in range(top_k)]\n", - "feature_df.applymap(lambda x: model.tokenizer.decode(x))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Putting it all together: Compare against the Neuronpedia Dashboard\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "How does this compare to the dashboard data pulled from Neuronpedia? 
Let's take a look:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sae_lens.util import extract_layer_from_tlens_hook_name\n", - "\n", - "\n", - "html = get_dashboard_html(\n", - " sae_release=\"gpt2-small\",\n", - " sae_id=f\"{extract_layer_from_tlens_hook_name(sae.cfg.metadata.hook_name)}-res-jb\",\n", - " feature_idx=feature_list[0],\n", - ")\n", - "IFrame(html, width=1200, height=600)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It seems to replicate! We now see how the dashboard values are created.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Optional: Co-occurrence Networks and Irreducible Subspaces\n", - "\n", - "Since we just wrote code very similar to the code we need for reproducing some of the analysis from [\"Not All Language Model Features are Linear\"](https://arxiv.org/abs/2405.14860), we show below how to regenerate their awesome circular representation (demonstrating a geometric relationship between related features, like days of the week).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# only valid for res-jb resid_pre 7.\n", - "# Josh Engels emailed us these lists.\n", - "day_of_the_week_features = [2592, 4445, 4663, 4733, 6531, 8179, 9566, 20927, 24185]\n", - "# months_of_the_year = [3977, 4140, 5993, 7299, 9104, 9401, 10449, 11196, 12661, 14715, 17068, 17528, 19589, 21033, 22043, 23304]\n", - "# years_of_10th_century = [1052, 2753, 4427, 6382, 8314, 9576, 9606, 13551, 19734, 20349]\n", - "\n", - "feature_list = day_of_the_week_features\n", - "\n", - "examples_found = 0\n", - "all_fired_tokens = []\n", - "all_feature_acts = []\n", - "all_reconstructions = []\n", - "all_token_dfs = []\n", - "\n", - "total_batches = 100\n", - "batch_size_prompts = activation_store.store_batch_size_prompts\n", - "batch_size_tokens = 
activation_store.context_size * batch_size_prompts\n", - "pbar = tqdm(range(total_batches))\n", - "for i in pbar:\n", - " tokens = activation_store.get_batch_tokens()\n", - " tokens_df = make_token_df(tokens)\n", - " tokens_df[\"batch\"] = i\n", - "\n", - " flat_tokens = tokens.flatten()\n", - "\n", - " _, cache = model.run_with_cache(tokens, names_filter=[sae.cfg.metadata.hook_name])\n", - " sae_in = cache[sae.cfg.metadata.hook_name]\n", - " feature_acts = sae.encode(sae_in).squeeze()\n", - "\n", - " feature_acts = feature_acts.flatten(0, 1)\n", - " fired_mask = (feature_acts[:, feature_list]).sum(dim=-1) > 0\n", - " fired_tokens = model.to_str_tokens(flat_tokens[fired_mask])\n", - " reconstruction = feature_acts[fired_mask][:, feature_list] @ sae.W_dec[feature_list]\n", - "\n", - " token_df = tokens_df.iloc[fired_mask.cpu().nonzero().flatten().numpy()]\n", - " all_token_dfs.append(token_df)\n", - " all_feature_acts.append(feature_acts[fired_mask][:, feature_list])\n", - " all_fired_tokens.append(fired_tokens)\n", - " all_reconstructions.append(reconstruction)\n", - "\n", - " examples_found += len(fired_tokens)\n", - " # print(f\"Examples found: {examples_found}\")\n", - " # update description\n", - " pbar.set_description(f\"Examples found: {examples_found}\")\n", - "\n", - "# flatten the list of lists\n", - "all_token_dfs = pd.concat(all_token_dfs)\n", - "all_fired_tokens = list_flatten(all_fired_tokens)\n", - "all_reconstructions = torch.cat(all_reconstructions)\n", - "all_feature_acts = torch.cat(all_feature_acts)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Using PCA, we can see that these features do indeed lie in a circle!\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# do PCA on reconstructions\n", - "from sklearn.decomposition import PCA\n", - "import plotly.express as px\n", - "\n", - "pca = PCA(n_components=3)\n", - "pca_embedding = 
pca.fit_transform(all_reconstructions.detach().cpu().numpy())\n", - "\n", - "pca_df = pd.DataFrame(pca_embedding, columns=[\"PC1\", \"PC2\", \"PC3\"])\n", - "pca_df[\"tokens\"] = all_fired_tokens\n", - "pca_df[\"context\"] = all_token_dfs.context.values\n", - "\n", - "\n", - "px.scatter(\n", - " pca_df,\n", - " x=\"PC2\",\n", - " y=\"PC3\",\n", - " hover_data=[\"context\"],\n", - " hover_name=\"tokens\",\n", - " height=800,\n", - " width=1200,\n", - " color=\"tokens\",\n", - " title=\"PCA Subspace Reconstructions\",\n", - ").show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You should be able to see a circular subspace where the order of days of the week is preserved correctly.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Basics: Intervening on SAE Features\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Feature Steering\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "One fun (and sometimes useful) thing we can do once we've found a feature is to use it to steer a model. To do this, we find the maximum activation of a feature in a set of text (using the activation store above), use this as the default scale, multiply it by the vector representing the feature (as extracted from the decoder weights), and finally multiply this by a parameter that we control. This can be varied to see its effect on the text. Below, we'll try steering with a feature that often fires on religious or philosophical statements (feature [20115](https://www.neuronpedia.org/gpt2-small/7-res-jb/20115)). 
Note that sometimes steering can get GPT2 into a loop, so it's worth running this more than once.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from tqdm.auto import tqdm\n", - "from functools import partial\n", - "import re\n", - "\n", - "\n", - "def find_max_activation(model, sae, activation_store, feature_idx, num_batches=100):\n", - " \"\"\"\n", - " Find the maximum activation for a given feature index. This is useful for\n", - " calibrating the right amount of the feature to add.\n", - " \"\"\"\n", - " max_activation = 0.0\n", - "\n", - " pbar = tqdm(range(num_batches))\n", - " for _ in pbar:\n", - " tokens = activation_store.get_batch_tokens()\n", - "\n", - " layer = int(re.search(r\"\\.(\\d+)\\.\", sae.cfg.metadata.hook_name).group(1)) # type: ignore\n", - " _, cache = model.run_with_cache(\n", - " tokens,\n", - " stop_at_layer=layer + 1,\n", - " names_filter=[sae.cfg.metadata.hook_name],\n", - " )\n", - " sae_in = cache[sae.cfg.metadata.hook_name]\n", - " feature_acts = sae.encode(sae_in).squeeze()\n", - "\n", - " feature_acts = feature_acts.flatten(0, 1)\n", - " batch_max_activation = feature_acts[:, feature_idx].max().item()\n", - " max_activation = max(max_activation, batch_max_activation)\n", - "\n", - " pbar.set_description(f\"Max activation: {max_activation:.4f}\")\n", - "\n", - " return max_activation\n", - "\n", - "\n", - "def steering(\n", - " activations, hook, steering_strength=1.0, steering_vector=None, max_act=1.0\n", - "):\n", - " # Note if the feature fires anyway, we'd be adding to that here.\n", - " return activations + max_act * steering_strength * steering_vector\n", - "\n", - "\n", - "def generate_with_steering(\n", - " model,\n", - " sae,\n", - " prompt,\n", - " steering_feature,\n", - " max_act,\n", - " steering_strength=1.0,\n", - " max_new_tokens=95,\n", - "):\n", - " input_ids = model.to_tokens(prompt, prepend_bos=sae.cfg.metadata.prepend_bos)\n", - "\n", - " 
steering_vector = sae.W_dec[steering_feature].to(model.cfg.device)\n", - "\n", - " steering_hook = partial(\n", - " steering,\n", - " steering_vector=steering_vector,\n", - " steering_strength=steering_strength,\n", - " max_act=max_act,\n", - " )\n", - "\n", - " # standard transformerlens syntax for a hook context for generation\n", - " with model.hooks(fwd_hooks=[(sae.cfg.metadata.hook_name, steering_hook)]):\n", - " output = model.generate(\n", - " input_ids,\n", - " max_new_tokens=max_new_tokens,\n", - " temperature=0.7,\n", - " top_p=0.9,\n", - " stop_at_eos=False if device == \"mps\" else True,\n", - " prepend_bos=sae.cfg.metadata.prepend_bos,\n", - " )\n", - "\n", - " return model.tokenizer.decode(output[0])\n", - "\n", - "\n", - "# Choose a feature to steer\n", - "steering_feature = steering_feature = 20115 # Choose a feature to steer towards\n", - "\n", - "# Find the maximum activation for this feature\n", - "max_act = find_max_activation(model, sae, activation_store, steering_feature)\n", - "print(f\"Maximum activation for feature {steering_feature}: {max_act:.4f}\")\n", - "\n", - "# note we could also get the max activation from Neuronpedia (https://www.neuronpedia.org/api-doc#tag/lookup/GET/api/feature/{modelId}/{layer}/{index})\n", - "\n", - "# Generate text without steering for comparison\n", - "prompt = \"Once upon a time\"\n", - "normal_text = model.generate(\n", - " prompt,\n", - " max_new_tokens=95,\n", - " stop_at_eos=False if device == \"mps\" else True,\n", - " prepend_bos=sae.cfg.metadata.prepend_bos,\n", - ")\n", - "\n", - "print(\"\\nNormal text (without steering):\")\n", - "print(normal_text)\n", - "\n", - "# Generate text with steering\n", - "steered_text = generate_with_steering(\n", - " model, sae, prompt, steering_feature, max_act, steering_strength=2.0\n", - ")\n", - "print(\"Steered text:\")\n", - "print(steered_text)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# 
Experiment with different steering strengths\n", - "print(\"\\nExperimenting with different steering strengths:\")\n", - "for strength in [-4.0, -2.0, 0.5, 2.0, 4.0]:\n", - " steered_text = generate_with_steering(\n", - " model, sae, prompt, steering_feature, max_act, steering_strength=strength\n", - " )\n", - " print(f\"\\nSteering strength {strength}:\")\n", - " print(steered_text)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can also do this via the Neuronpedia API or on the website [here](https://www.neuronpedia.org/steer/). The example below steers for just a few tokens with a feature that does something very specific. Can you work out what it's doing?\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import requests\n", - "import numpy as np\n", - "\n", - "url = \"https://www.neuronpedia.org/api/steer\"\n", - "\n", - "payload = {\n", - " # \"prompt\": \"A knight in shining\",\n", - " # \"prompt\": \"He had to fight back in self-\",\n", - " \"prompt\": \"In the middle of the universe is the galactic\",\n", - " # \"prompt\": \"Oh no. We're running on empty. Its time to fill up the car with\",\n", - " # \"prompt\": \"Sure, I'm happy to pay. 
I don't have any cash on me but let me write you a\",\n", - " \"modelId\": \"gpt2-small\",\n", - " \"features\": [\n", - " {\"modelId\": \"gpt2-small\", \"layer\": \"7-res-jb\", \"index\": 6770, \"strength\": 8}\n", - " ],\n", - " \"temperature\": 0.2,\n", - " \"n_tokens\": 2,\n", - " \"freq_penalty\": 1,\n", - " \"seed\": np.random.randint(100),\n", - " \"strength_multiplier\": 4,\n", - "}\n", - "headers = {\"Content-Type\": \"application/json\"}\n", - "\n", - "response = requests.post(url, json=payload, headers=headers)\n", - "\n", - "print(response.json())" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import requests\n", - "import numpy as np\n", - "\n", - "url = \"https://www.neuronpedia.org/api/steer\"\n", - "\n", - "payload = {\n", - " \"prompt\": 'I wrote a letter to my girlfiend. It said \"',\n", - " \"modelId\": \"gpt2-small\",\n", - " \"features\": [\n", - " {\"modelId\": \"gpt2-small\", \"layer\": \"7-res-jb\", \"index\": 20115, \"strength\": 4}\n", - " ],\n", - " \"temperature\": 0.7,\n", - " \"n_tokens\": 120,\n", - " \"freq_penalty\": 1,\n", - " \"seed\": np.random.randint(100),\n", - " \"strength_multiplier\": 4,\n", - "}\n", - "headers = {\"Content-Type\": \"application/json\"}\n", - "\n", - "response = requests.post(url, json=payload, headers=headers)\n", - "\n", - "print(response.json())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Feature Ablation\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Feature ablation is also worth looking at. In a way, it's a special case of steering where the value of the feature is always zeroed out.\n", - "\n", - "Here we do the following:\n", - "\n", - "1. Use test prompt rather than generate to get more nuance.\n", - "2. attach a hook to the SAE feature activations.\n", - "3. 0 out a feature at all positions (we know that the default feature fires at the final position.)\n", - "4. 
Check whether this ablation is more / less effective if we include the error term (info our SAE isn't capturing).\n", - "\n", - "Note that the existence of [The Hydra Effect](https://arxiv.org/abs/2307.15771) can make reasoning about ablation experiments difficult.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from transformer_lens.utils import test_prompt\n", - "from functools import partial\n", - "\n", - "\n", - "def test_prompt_with_ablation(model, sae, prompt, answer, ablation_features):\n", - " def ablate_feature_hook(feature_activations, hook, feature_ids, position=None):\n", - " if position is None:\n", - " feature_activations[:, :, feature_ids] = 0\n", - " else:\n", - " feature_activations[:, position, feature_ids] = 0\n", - "\n", - " return feature_activations\n", - "\n", - " ablation_hook = partial(ablate_feature_hook, feature_ids=ablation_features)\n", - "\n", - " model.add_sae(sae)\n", - " hook_point = sae.cfg.metadata.hook_name + \".hook_sae_acts_post\"\n", - " model.add_hook(hook_point, ablation_hook, \"fwd\")\n", - "\n", - " test_prompt(prompt, answer, model)\n", - "\n", - " model.reset_hooks()\n", - " model.reset_saes()\n", - "\n", - "\n", - "# Example usage in a notebook:\n", - "\n", - "# Assume model and sae are already defined\n", - "\n", - "# Choose a feature to ablate\n", - "\n", - "model.reset_hooks(including_permanent=True)\n", - "prompt = \"In the beginning, God created the heavens and the\"\n", - "answer = \"earth\"\n", - "test_prompt(prompt, answer, model)\n", - "\n", - "\n", - "# Generate text with feature ablation\n", - "print(\"Test Prompt with feature ablation and no error term\")\n", - "ablation_feature = 16873 # Replace with any feature index you're interested in. 
We use the religion feature\n", - "sae.use_error_term = False\n", - "test_prompt_with_ablation(model, sae, prompt, answer, ablation_feature)\n", - "\n", - "print(\"Test Prompt with feature ablation and error term\")\n", - "ablation_feature = 16873 # Replace with any feature index you're interested in. We use the religion feature\n", - "sae.use_error_term = True\n", - "test_prompt_with_ablation(model, sae, prompt, answer, ablation_feature)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Feature Attribution\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from dataclasses import dataclass\n", - "from functools import partial\n", - "from typing import Any, Literal, NamedTuple, Callable\n", - "\n", - "import torch\n", - "from sae_lens import SAE\n", - "from transformer_lens import HookedTransformer\n", - "from transformer_lens.hook_points import HookPoint\n", - "\n", - "\n", - "class SaeReconstructionCache(NamedTuple):\n", - " sae_in: torch.Tensor\n", - " feature_acts: torch.Tensor\n", - " sae_out: torch.Tensor\n", - " sae_error: torch.Tensor\n", - "\n", - "\n", - "def track_grad(tensor: torch.Tensor) -> None:\n", - " \"\"\"wrapper around requires_grad and retain_grad\"\"\"\n", - " tensor.requires_grad_(True)\n", - " tensor.retain_grad()\n", - "\n", - "\n", - "@dataclass\n", - "class ApplySaesAndRunOutput:\n", - " model_output: torch.Tensor\n", - " model_activations: dict[str, torch.Tensor]\n", - " sae_activations: dict[str, SaeReconstructionCache]\n", - "\n", - " def zero_grad(self) -> None:\n", - " \"\"\"Helper to zero grad all tensors in this object.\"\"\"\n", - " self.model_output.grad = None\n", - " for act in self.model_activations.values():\n", - " act.grad = None\n", - " for cache in self.sae_activations.values():\n", - " cache.sae_in.grad = None\n", - " cache.feature_acts.grad = None\n", - " cache.sae_out.grad = None\n", - " cache.sae_error.grad = None\n", - 
"\n", - "\n", - "def apply_saes_and_run(\n", - " model: HookedTransformer,\n", - " saes: dict[str, SAE],\n", - " input: Any,\n", - " include_error_term: bool = True,\n", - " track_model_hooks: list[str] | None = None,\n", - " return_type: Literal[\"logits\", \"loss\"] = \"logits\",\n", - " track_grads: bool = False,\n", - ") -> ApplySaesAndRunOutput:\n", - " \"\"\"\n", - " Apply the SAEs to the model at the specific hook points, and run the model.\n", - " By default, this will include a SAE error term which guarantees that the SAE\n", - " will not affect model output. This function is designed to work correctly with\n", - " backprop as well, so it can be used for gradient-based feature attribution.\n", - "\n", - " Args:\n", - " model: the model to run\n", - " saes: the SAEs to apply\n", - " input: the input to the model\n", - " include_error_term: whether to include the SAE error term to ensure the SAE doesn't affect model output. Default True\n", - " track_model_hooks: a list of hook points to record the activations and gradients. Default None\n", - " return_type: this is passed to the model.run_with_hooks function. Default \"logits\"\n", - " track_grads: whether to track gradients. Default False\n", - " \"\"\"\n", - "\n", - " fwd_hooks = []\n", - " bwd_hooks = []\n", - "\n", - " sae_activations: dict[str, SaeReconstructionCache] = {}\n", - " model_activations: dict[str, torch.Tensor] = {}\n", - "\n", - " # this hook just track the SAE input, output, features, and error. 
If `track_grads=True`, it also ensures\n", - " # that requires_grad is set to True and retain_grad is called for intermediate values.\n", - " def reconstruction_hook(sae_in: torch.Tensor, hook: HookPoint, hook_point: str): # noqa: ARG001\n", - " sae = saes[hook_point]\n", - " feature_acts = sae.encode(sae_in)\n", - " sae_out = sae.decode(feature_acts)\n", - " sae_error = (sae_in - sae_out).detach().clone()\n", - " if track_grads:\n", - " track_grad(sae_error)\n", - " track_grad(sae_out)\n", - " track_grad(feature_acts)\n", - " track_grad(sae_in)\n", - " sae_activations[hook_point] = SaeReconstructionCache(\n", - " sae_in=sae_in,\n", - " feature_acts=feature_acts,\n", - " sae_out=sae_out,\n", - " sae_error=sae_error,\n", - " )\n", - "\n", - " if include_error_term:\n", - " return sae_out + sae_error\n", - " return sae_out\n", - "\n", - " def sae_bwd_hook(output_grads: torch.Tensor, hook: HookPoint): # noqa: ARG001\n", - " # this just passes the output grads to the input, so the SAE gets the same grads despite the error term hackery\n", - " return (output_grads,)\n", - "\n", - " # this hook just records model activations, and ensures that intermediate activations have gradient tracking turned on if needed\n", - " def tracking_hook(hook_input: torch.Tensor, hook: HookPoint, hook_point: str): # noqa: ARG001\n", - " model_activations[hook_point] = hook_input\n", - " if track_grads:\n", - " track_grad(hook_input)\n", - " return hook_input\n", - "\n", - " for hook_point in saes.keys():\n", - " fwd_hooks.append(\n", - " (hook_point, partial(reconstruction_hook, hook_point=hook_point))\n", - " )\n", - " bwd_hooks.append((hook_point, sae_bwd_hook))\n", - " for hook_point in track_model_hooks or []:\n", - " fwd_hooks.append((hook_point, partial(tracking_hook, hook_point=hook_point)))\n", - "\n", - " # now, just run the model while applying the hooks\n", - " with model.hooks(fwd_hooks=fwd_hooks, bwd_hooks=bwd_hooks):\n", - " model_output = model(input, 
return_type=return_type)\n", - "\n", - " return ApplySaesAndRunOutput(\n", - " model_output=model_output,\n", - " model_activations=model_activations,\n", - " sae_activations=sae_activations,\n", - " )" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from dataclasses import dataclass\n", - "from transformer_lens.hook_points import HookPoint\n", - "from dataclasses import dataclass\n", - "from functools import partial\n", - "from typing import Any, Literal, NamedTuple\n", - "\n", - "import torch\n", - "from sae_lens import SAE\n", - "from transformer_lens import HookedTransformer\n", - "from transformer_lens.hook_points import HookPoint\n", - "\n", - "EPS = 1e-8\n", - "\n", - "torch.set_grad_enabled(True)\n", - "\n", - "\n", - "@dataclass\n", - "class AttributionGrads:\n", - " metric: torch.Tensor\n", - " model_output: torch.Tensor\n", - " model_activations: dict[str, torch.Tensor]\n", - " sae_activations: dict[str, SaeReconstructionCache]\n", - "\n", - "\n", - "@dataclass\n", - "class Attribution:\n", - " model_attributions: dict[str, torch.Tensor]\n", - " model_activations: dict[str, torch.Tensor]\n", - " model_grads: dict[str, torch.Tensor]\n", - " sae_feature_attributions: dict[str, torch.Tensor]\n", - " sae_feature_activations: dict[str, torch.Tensor]\n", - " sae_feature_grads: dict[str, torch.Tensor]\n", - " sae_errors_attribution_proportion: dict[str, float]\n", - "\n", - "\n", - "def calculate_attribution_grads(\n", - " model: HookedSAETransformer,\n", - " prompt: str,\n", - " metric_fn: Callable[[torch.Tensor], torch.Tensor],\n", - " track_hook_points: list[str] | None = None,\n", - " include_saes: dict[str, SAE] | None = None,\n", - " return_logits: bool = True,\n", - " include_error_term: bool = True,\n", - ") -> AttributionGrads:\n", - " \"\"\"\n", - " Wrapper around apply_saes_and_run that calculates gradients wrt to the metric_fn.\n", - " Tracks grads for both SAE feature and model 
neurons, and returns them in a structured format.\n", - " \"\"\"\n", - " output = apply_saes_and_run(\n", - " model,\n", - " saes=include_saes or {},\n", - " input=prompt,\n", - " return_type=\"logits\" if return_logits else \"loss\",\n", - " track_model_hooks=track_hook_points,\n", - " include_error_term=include_error_term,\n", - " track_grads=True,\n", - " )\n", - " metric = metric_fn(output.model_output)\n", - " output.zero_grad()\n", - " metric.backward()\n", - " return AttributionGrads(\n", - " metric=metric,\n", - " model_output=output.model_output,\n", - " model_activations=output.model_activations,\n", - " sae_activations=output.sae_activations,\n", - " )\n", - "\n", - "\n", - "def calculate_feature_attribution(\n", - " model: HookedSAETransformer,\n", - " input: Any,\n", - " metric_fn: Callable[[torch.Tensor], torch.Tensor],\n", - " track_hook_points: list[str] | None = None,\n", - " include_saes: dict[str, SAE] | None = None,\n", - " return_logits: bool = True,\n", - " include_error_term: bool = True,\n", - ") -> Attribution:\n", - " \"\"\"\n", - " Calculate feature attribution for SAE features and model neurons following\n", - " the procedure in https://transformer-circuits.pub/2024/march-update/index.html#feature-heads.\n", - " This include the SAE error term by default, so inserting the SAE into the calculation is\n", - " guaranteed to not affect the model output. This can be disabled by setting `include_error_term=False`.\n", - "\n", - " Args:\n", - " model: The model to calculate feature attribution for.\n", - " input: The input to the model.\n", - " metric_fn: A function that takes the model output and returns a scalar metric.\n", - " track_hook_points: A list of model hook points to track activations for, if desired\n", - " include_saes: A dictionary of SAEs to include in the calculation. The key is the hook point to apply the SAE to.\n", - " return_logits: Whether to return the model logits or loss. 
This is passed to TLens, so should match whatever the metric_fn expects (probably logits)\n", - " include_error_term: Whether to include the SAE error term in the calculation. This is recommended, as it ensures that the SAE will not affecting the model output.\n", - " \"\"\"\n", - " # first, calculate gradients wrt to the metric_fn.\n", - " # these will be multiplied with the activation values to get the attributions\n", - " outputs_with_grads = calculate_attribution_grads(\n", - " model,\n", - " input,\n", - " metric_fn,\n", - " track_hook_points,\n", - " include_saes=include_saes,\n", - " return_logits=return_logits,\n", - " include_error_term=include_error_term,\n", - " )\n", - " model_attributions = {}\n", - " model_activations = {}\n", - " model_grads = {}\n", - " sae_feature_attributions = {}\n", - " sae_feature_activations = {}\n", - " sae_feature_grads = {}\n", - " sae_error_proportions = {}\n", - " # this code is long, but all it's doing is multiplying the grads by the activations\n", - " # and recording grads, acts, and attributions in dictionaries to return to the user\n", - " with torch.no_grad():\n", - " for name, act in outputs_with_grads.model_activations.items():\n", - " assert act.grad is not None\n", - " raw_activation = act.detach().clone()\n", - " model_attributions[name] = (act.grad * raw_activation).detach().clone()\n", - " model_activations[name] = raw_activation\n", - " model_grads[name] = act.grad.detach().clone()\n", - " for name, act in outputs_with_grads.sae_activations.items():\n", - " assert act.feature_acts.grad is not None\n", - " assert act.sae_out.grad is not None\n", - " raw_activation = act.feature_acts.detach().clone()\n", - " sae_feature_attributions[name] = (\n", - " (act.feature_acts.grad * raw_activation).detach().clone()\n", - " )\n", - " sae_feature_activations[name] = raw_activation\n", - " sae_feature_grads[name] = act.feature_acts.grad.detach().clone()\n", - " if include_error_term:\n", - " assert act.sae_error.grad is 
not None\n", - " error_grad_norm = act.sae_error.grad.norm().item()\n", - " else:\n", - " error_grad_norm = 0\n", - " sae_out_norm = act.sae_out.grad.norm().item()\n", - " sae_error_proportions[name] = error_grad_norm / (\n", - " sae_out_norm + error_grad_norm + EPS\n", - " )\n", - " return Attribution(\n", - " model_attributions=model_attributions,\n", - " model_activations=model_activations,\n", - " model_grads=model_grads,\n", - " sae_feature_attributions=sae_feature_attributions,\n", - " sae_feature_activations=sae_feature_activations,\n", - " sae_feature_grads=sae_feature_grads,\n", - " sae_errors_attribution_proportion=sae_error_proportions,\n", - " )\n", - "\n", - "\n", - "# prompt = \" Tiger Woods plays the sport of\"\n", - "# pos_token = model.tokenizer.encode(\" golf\")[0]\n", - "prompt = \"In the beginning, God created the heavens and the\"\n", - "pos_token = model.tokenizer.encode(\" earth\")\n", - "neg_token = model.tokenizer.encode(\" sky\")\n", - "\n", - "\n", - "def metric_fn(\n", - " logits: torch.tensor,\n", - " pos_token: torch.tensor = pos_token,\n", - " neg_token: torch.Tensor = neg_token,\n", - ") -> torch.Tensor:\n", - " return logits[0, -1, pos_token] - logits[0, -1, neg_token]\n", - "\n", - "\n", - "feature_attribution_df = calculate_feature_attribution(\n", - " input=prompt,\n", - " model=model,\n", - " metric_fn=metric_fn,\n", - " include_saes={sae.cfg.metadata.hook_name: sae},\n", - " include_error_term=True,\n", - " return_logits=True,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from transformer_lens.utils import test_prompt\n", - "\n", - "test_prompt(prompt, model.to_string(pos_token), model)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tokens = model.to_str_tokens(prompt)\n", - "unique_tokens = [f\"{i}/{t}\" for i, t in enumerate(tokens)]\n", - "\n", - "px.bar(\n", - " 
x=unique_tokens,\n", - " y=feature_attribution_df.sae_feature_attributions[sae.cfg.metadata.hook_name][0]\n", - " .sum(-1)\n", - " .detach()\n", - " .cpu()\n", - " .numpy(),\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def convert_sparse_feature_to_long_df(sparse_tensor: torch.Tensor) -> pd.DataFrame:\n", - " \"\"\"\n", - " Convert a sparse tensor to a long format pandas DataFrame.\n", - " \"\"\"\n", - " df = pd.DataFrame(sparse_tensor.detach().cpu().numpy())\n", - " df_long = df.melt(ignore_index=False, var_name=\"column\", value_name=\"value\")\n", - " df_long.columns = [\"feature\", \"attribution\"]\n", - " df_long_nonzero = df_long[df_long[\"attribution\"] != 0]\n", - " df_long_nonzero = df_long_nonzero.reset_index().rename(\n", - " columns={\"index\": \"position\"}\n", - " )\n", - " return df_long_nonzero\n", - "\n", - "\n", - "df_long_nonzero = convert_sparse_feature_to_long_df(\n", - " feature_attribution_df.sae_feature_attributions[sae.cfg.metadata.hook_name][0]\n", - ")\n", - "df_long_nonzero.sort_values(\"attribution\", ascending=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for i, v in (\n", - " df_long_nonzero.query(\"position==8\")\n", - " .groupby(\"feature\")\n", - " .attribution.sum()\n", - " .sort_values(ascending=False)\n", - " .head(5)\n", - " .items()\n", - "):\n", - " print(f\"Feature {i} had a total attribution of {v:.2f}\")\n", - " html = get_dashboard_html(\n", - " sae_release=\"gpt2-small\",\n", - " sae_id=f\"{extract_layer_from_tlens_hook_name(sae.cfg.metadata.hook_name)}-res-jb\",\n", - " feature_idx=int(i),\n", - " )\n", - " display(IFrame(html, width=1200, height=300))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for i, v in (\n", - " df_long_nonzero.groupby(\"feature\")\n", - " .attribution.sum()\n", - " 
.sort_values(ascending=False)\n", - " .head(5)\n", - " .items()\n", - "):\n", - " print(f\"Feature {i} had a total attribution of {v:.2f}\")\n", - " html = get_dashboard_html(\n", - " sae_release=\"gpt2-small\",\n", - " sae_id=f\"{extract_layer_from_tlens_hook_name(sae.cfg.metadata.hook_name)}-res-jb\",\n", - " feature_idx=int(i),\n", - " )\n", - " display(IFrame(html, width=1200, height=300))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Advanced: Making U-Maps\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "gpuType": "T4", - "provenance": [] - }, - "kernelspec": { - "display_name": "sae-lens-CSfAEFdT-py3.12", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.10" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "MNk7IylTv610" + }, + "source": [ + "# SAE Lens + Neuronpedia Tutorial\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This tutorial is an introduction to analysis of neural networks using sparse autoencoders (SAEs), a new and popular technique in mechanistic interpretability. 
For more context, we refer you to [this post](https://transformer-circuits.pub/2023/monosemantic-features).\n", + "\n", + "However, we will explain what SAE features are, how to load SAEs into SAELens and find/identify features, and how to do steering, ablation, and attribution with them.\n", + "\n", + "This tutorial covers:\n", + "\n", + "- A basic introduction to SAEs.\n", + " - What is SAE Lens?\n", + " - Choosing an SAE to analyse and loading it with [SAE Lens](https://github.com/decoderesearch/SAELens).\n", + " - The SAE Class and its config.\n", + "- SAE Features.\n", + " - What is a feature dashboard?\n", + " - Loading feature dashboards on [Neuronpedia](https://www.neuronpedia.org/).\n", + " - Downloading Autointerp and searching via explanations.\n", + "- Feature inference\n", + " - Using the HookedSAE Transformer Class to decompose activations into features.\n", + " - Comparing Features across related prompts.\n", + "- Making Feature Dashboards\n", + " - Max Activating Examples\n", + " - Feature Activation Histograms\n", + " - Logit Weight Distributions.\n", + " - Extension: Reproducing `Not all language model features are linear`\n", + "- SAE based Analysis Methods (Advanced)\n", + " - Steering model generation with SAE Features\n", + " - Ablating SAE Features\n", + " - Gradient-based Attribution for Circuit Detection\n", + "\n", + "**Warning:** This tutorial is a rough initial draft, prepared in a fairly short timeframe, and we expect to make many improvements in the future. 
Nevertheless, we hope this initial version is useful for those looking to get started.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i_DusoOvwV0M" + }, + "source": [ + "## Set Up (Just Run / Not Important)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yfDUxRx0wSRl" + }, + "outputs": [], + "source": [ + "try:\n", + " import google.colab # type: ignore\n", + " from google.colab import output\n", + "\n", + " COLAB = True\n", + " %pip install sae-lens transformer-lens sae-dashboard\n", + "except:\n", + " COLAB = False\n", + " from IPython import get_ipython # type: ignore\n", + "\n", + " ipython = get_ipython()\n", + " assert ipython is not None\n", + " ipython.run_line_magic(\"load_ext\", \"autoreload\")\n", + " ipython.run_line_magic(\"autoreload\", \"2\")\n", + "\n", + "# Standard imports\n", + "import os\n", + "import torch\n", + "from tqdm.auto import tqdm\n", + "import plotly.express as px\n", + "import pandas as pd\n", + "\n", + "# Imports for displaying vis in Colab / notebook\n", + "\n", + "torch.set_grad_enabled(False)\n", + "\n", + "# For the most part I'll try to import functions and classes near where they are used\n", + "# to make it clear where they come from.\n", + "\n", + "if torch.backends.mps.is_available():\n", + " device = \"mps\"\n", + "else:\n", + " device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", + "\n", + "print(f\"Device: {device}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XoMx3VZpv611" + }, + "source": [ + "# Loading a pretrained Sparse Autoencoder\n", + "\n", + "As a first step, we will actually load an SAE! But before we do so, it can be useful to see which are available. 
The following snippet shows the currently available SAE releases in SAELens, and will remain up-to-date as we continue to add more SAEs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sae_lens.loading.pretrained_saes_directory import get_pretrained_saes_directory\n", + "\n", + "# TODO: Make this nicer.\n", + "df = pd.DataFrame.from_records(\n", + " {k: v.__dict__ for k, v in get_pretrained_saes_directory().items()}\n", + ").T\n", + "df.drop(\n", + " columns=[\n", + " \"expected_var_explained\",\n", + " \"expected_l0\",\n", + " \"config_overrides\",\n", + " \"conversion_func\",\n", + " ],\n", + " inplace=True,\n", + ")\n", + "df # Each row is a \"release\" which has multiple SAEs which may have different configs / match different hook points in a model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In practice, SAEs can be of varying usefulness for general use cases. To start with, we recommend the following:\n", + "\n", + "- Joseph's Open Source GPT2 Small Residual (gpt2-small-res-jb)\n", + "- Joseph's Feature Splitting (gpt2-small-res-jb-feature-splitting)\n", + "- Gemma SAEs (gemma-2b-res-jb) (0,6) <- on Neuronpedia and good. (12 / 17 aren't very good currently).\n", + "\n", + "Other SAEs have various issues--e.g., too dense or not dense enough, or designed for special use cases, or initial drafts of what we hope will be better versions later. Decode Research / Neuronpedia are working on making all SAEs on Neuronpedia loadable in SAE Lens and vice versa, as well as providing public benchmarking stats to help people choose which SAEs to work with.\n", + "\n", + "To see all the SAEs contained in a specific release (named after the part of the model they apply to), simply run the below. 
Each hook point corresponds to a layer or module of the model.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# show the contents of the saes_map column for a specific row\n", + "print(\"SAEs in the GPT2 Small Resid Pre release\")\n", + "for k, v in df.loc[df.release == \"gpt2-small-res-jb\", \"saes_map\"].values[0].items():\n", + " print(f\"SAE id: {k} for hook point: {v}\")\n", + "\n", + "print(\"-\" * 50)\n", + "print(\"SAEs in the feature splitting release\")\n", + "for k, v in (\n", + " df.loc[df.release == \"gpt2-small-res-jb-feature-splitting\", \"saes_map\"]\n", + " .values[0]\n", + " .items()\n", + "):\n", + " print(f\"SAE id: {k} for hook point: {v}\")\n", + "\n", + "print(\"-\" * 50)\n", + "print(\"SAEs in the Gemma base model release\")\n", + "for k, v in df.loc[df.release == \"gemma-2b-res-jb\", \"saes_map\"].values[0].items():\n", + " print(f\"SAE id: {k} for hook point: {v}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we'll load a specific SAE, as well as a copy of GPT-2 Small to attach it to. To load the model, we'll use the HookedSAETransformer class, which is adapted from the TransformerLens HookedTransformer.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sNSfL80Uv611" + }, + "outputs": [], + "source": [ + "# from transformer_lens import HookedTransformer\n", + "from sae_lens import SAE, HookedSAETransformer\n", + "\n", + "model = HookedSAETransformer.from_pretrained(\"gpt2-small\", device=device)\n", + "\n", + "sae = SAE.from_pretrained(\n", + " release=\"gpt2-small-res-jb\", # <- Release name\n", + " sae_id=\"blocks.7.hook_resid_pre\", # <- SAE id (not always a hook point!)\n", + " device=device,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The \"sae\" object is an instance of the SAE (Sparse Autoencoder) class. 
There are many different SAE architectures which may have different weights or activation functions. In order to simplify working with SAEs, SAE Lens handles most of this complexity for you.\n", + "\n", + "Let's look at the SAE config and understand each of the parameters:\n", + "\n", + "1. `architecture`: Specifies the type of SAE architecture being used, in this case, the standard architecture (encoder and decoder with hidden activations, as opposed to a gated SAE).\n", + "2. `d_in`: Defines the input dimension of the SAE, which is 768 in this configuration.\n", + "3. `d_sae`: Sets the dimension of the SAE's hidden layer, which is 24576 here. This represents the number of possible feature activations.\n", + "4. `activation_fn_str`: Specifies the activation function used in the SAE, which is ReLU in this case. TopK is another option that we will not cover here.\n", + "5. `apply_b_dec_to_input`: Determines whether to apply the decoder bias to the input, set to True here.\n", + "6. `finetuning_scaling_factor`: Indicates whether to use a scaling factor to weight initialization and the forward pass. This is not usually used and was introduced to support a [solution for shrinkage](https://www.lesswrong.com/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes).\n", + "7. `context_size`: Defines the size of the context window, which is 128 tokens in this case. It turns out SAEs trained on activations from short prompts [often don't perform well on longer prompts](https://www.lesswrong.com/posts/baJyjpktzmcmRfosq/stitching-saes-of-different-sizes).\n", + "8. `model_name`: Specifies the name of the model being used, which is 'gpt2-small' here. [This is a valid model name in TransformerLens](https://transformerlensorg.github.io/TransformerLens/generated/model_properties_table.html).\n", + "9. `hook_name`: Indicates the specific hook in the model where the SAE is applied.\n", + "10. 
`hook_head_index`: Defines which attention head to hook into; not relevant here since we are looking at a residual stream SAE.\n", + "11. `prepend_bos`: Determines whether to prepend the beginning-of-sequence token, set to True.\n", + "12. `dataset_path`: Specifies the path to the dataset used for training or evaluation. (Can be local or a huggingface dataset.)\n", + "13. `dataset_trust_remote_code`: Indicates whether to trust remote code (from HuggingFace) when loading the dataset, set to True.\n", + "14. `normalize_activations`: Specifies how to normalize activations, set to 'none' in this config.\n", + "15. `dtype`: Defines the data type for tensor operations, set to 32-bit floating point.\n", + "16. `device`: Specifies the computational device to use.\n", + "17. `sae_lens_training_version`: Indicates the version of SAE Lens used for training, set to None here.\n", + "18. `activation_fn_kwargs`: Allows for additional keyword arguments for the activation function. This would be used if e.g. the `activation_fn_str` was set to `topk`, so that `k` could be specified.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(sae.cfg.__dict__)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we need to load in a dataset to work with. 
We'll just use a sample of the Pile.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "from transformer_lens.utils import tokenize_and_concatenate\n", + "\n", + "dataset = load_dataset(\n", + " path=\"NeelNanda/pile-10k\",\n", + " split=\"train\",\n", + " streaming=False,\n", + ")\n", + "\n", + "token_dataset = tokenize_and_concatenate(\n", + " dataset=dataset, # type: ignore\n", + " tokenizer=model.tokenizer, # type: ignore\n", + " streaming=True,\n", + " max_length=sae.cfg.metadata.context_size,\n", + " add_bos_token=sae.cfg.metadata.prepend_bos,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Basics: What are SAE Features?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Opening a feature dashboard on Neuronpedia\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before we dive too deep into the various things you can do with SAEs, let's address a basic question: What is an SAE feature?\n", + "\n", + "An SAE feature represents a pattern or concept that the autoencoder has learned to detect in the input data. These features often correspond to meaningful semantic, syntactic, or otherwise interpretable elements of text, and correspond to linear directions in activation space. SAEs are trained on the activations of a specific part of the model, and after training, these features show up as activations in the hidden layer of the SAE (which is much wider than the source activation vector, and produces one hidden activation per feature). As such, the hidden activations represent a decomposition of the entangled/superimposed features found in the original model activations. Ideally, these activations are sparse: Only a few of the many possible hidden activations actually activate for a given piece of input. 
This sparseness tends to correspond to ease of interpretability.\n", + "\n", + "The dashboard shown here provides a detailed view of a single SAE feature. (Refresh the cell to see more examples). Let's break down its components:\n", + "\n", + "1. Feature Description: At the top, we see an auto-interp-sourced description of the feature.\n", + "\n", + "2. Logit Plots: The top positive and negative logits for the feature. The values indicate the strength of the association.\n", + "\n", + "3. Activations Density Plot: This histogram shows the distribution of activation values for this feature across a randomly sampled dataset. The x-axis represents activation strength, and the y-axis shows frequency. The top chart is simply the distribution of non-zero activations, and the second plot shows the density of negative and positive logits.\n", + "\n", + "4. Test Activation: You can use this feature within the notebook or Neuronpedia itself--simply enter text to see how the feature is activated across the text.\n", + "\n", + "5. Top Activations: Below the plots, we see max-activating examples of text snippets that strongly activate this feature. 
Each snippet is highlighted where the activation appears.\n", + "\n", + "See this section of [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features#setup-interface) for more information.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "\n", + "# get a random feature from the SAE\n", + "feature_idx = torch.randint(0, sae.cfg.d_sae, (1,)).item()\n", + "\n", + "html_template = \"https://neuronpedia.org/{}/{}/{}?embed=true&embedexplanation=true&embedplots=true&embedtest=true&height=300\"\n", + "\n", + "\n", + "def get_dashboard_html(sae_release=\"gpt2-small\", sae_id=\"7-res-jb\", feature_idx=0):\n", + " return html_template.format(sae_release, sae_id, feature_idx)\n", + "\n", + "\n", + "html = get_dashboard_html(\n", + " sae_release=\"gpt2-small\", sae_id=\"7-res-jb\", feature_idx=feature_idx\n", + ")\n", + "IFrame(html, width=1200, height=600)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the randomly selected feature above, can you predict which text will make it fire? Can you test your theory?\n", + "\n", + "Eg: Imagine it seemed to fire on pokemon. Testing whether the feature fires on Digimon (a similar game with pet monsters) may suggest a different story.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Downloading / Searching Autointerp\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What if we wanted to search for a feature relating to a specific thing? Then we could use the explanation search API. Let's just download all the [autointerp explanations](https://openai.com/index/language-models-can-explain-neurons-in-language-models/) for these SAE features and load them in as a Pandas dataframe. 
The Neuronpedia API docs will be useful here: https://www.neuronpedia.org/api-doc#tag/explanations/GET/api/explanation/export.\n", + "\n", + "_Note: not every SAE in SAE Lens is on Neuronpedia and not all SAEs on Neuronpedia have autointerp for all features. This is a work in progress_.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "url = \"https://www.neuronpedia.org/api/explanation/export?modelId=gpt2-small&saeId=7-res-jb\"\n", + "headers = {\"Content-Type\": \"application/json\"}\n", + "\n", + "response = requests.get(url, headers=headers)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert to pandas\n", + "data = response.json()\n", + "explanations_df = pd.DataFrame(data)\n", + "# rename index to \"feature\"\n", + "explanations_df.rename(columns={\"index\": \"feature\"}, inplace=True)\n", + "# explanations_df[\"feature\"] = explanations_df[\"feature\"].astype(int)\n", + "explanations_df[\"description\"] = explanations_df[\"description\"].apply(\n", + " lambda x: x.lower()\n", + ")\n", + "explanations_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's search for a feature related to the Bible.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "bible_features = explanations_df.loc[explanations_df.description.str.contains(\" bible\")]\n", + "bible_features" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Let's get the dashboard for this feature.\n", + "html = get_dashboard_html(\n", + " sae_release=\"gpt2-small\",\n", + " sae_id=\"7-res-jb\",\n", + " feature_idx=bible_features.feature.values[0],\n", + ")\n", + "IFrame(html, width=1200, height=600)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 
Basics: Getting Features Using SAEs\n", + "\n", + "Autointerp is such a bad way to find features. We really care about understanding model predictions on real prompts using SAEs. Let's check for features used in completing this bible verse. Will we see a bible feature?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformer_lens.utils import test_prompt\n", + "\n", + "prompt = \"In the beginning, God created the heavens and the\"\n", + "answer = \"earth\"\n", + "\n", + "# Show that the model can confidently predict the next token.\n", + "test_prompt(prompt, answer, model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using a HookedSAETransformer\n", + "\n", + "We have a whole tutorial on running models with SAEs using the HookedSAE Transformer class -> \n", + "\"Open\n", + "\n", + "\n", + "Here we'll just demonstrate how to get features using the class.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# SAEs don't reconstruct activations perfectly, so if you attach an SAE and want the model to stay performant, you need to use the error term.\n", + "# This is because the SAE will be used to modify the forward pass, and if it doesn't reconstruct the activations well, the outputs may be affected.\n", + "# Good SAEs have small error terms but it's something to be mindful of.\n", + "\n", + "sae.use_error_term # If use error term is set to false, we will modify the forward pass by using the sae." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below, we'll use the `run_with_cache_with_saes` function of the HookedSAETransformer, which will give us all the cached activations (including those from the SAE that we've specified in the arguments). 
Running our prompt through the model gets us activation tensors as follows:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# hooked SAE Transformer will enable us to get the feature activations from the SAE\n", + "_, cache = model.run_with_cache_with_saes(prompt, saes=[sae])\n", + "\n", + "print([(k, v.shape) for k, v in cache.items() if \"sae\" in k])\n", + "\n", + "# note there were 11 tokens in our prompt, the residual stream dimension is 768, and the number of SAE features is 24576" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we'll visualize the activations of the hidden layer of the SAE at the final token position of the prompt. Each of these vertical lines corresponds to a feature activation. We can also plot the dashboards for each of these activated features, using their position in the activation cache as an index to pull data from Neuronpedia. We'll do this for the top features only.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# let's look at which features fired at layer 8 at the final token position\n", + "\n", + "# hover over lines to see the Feature ID.\n", + "px.line(\n", + " cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][0, -1, :].cpu().numpy(),\n", + " title=\"Feature activations at the final token position\",\n", + " labels={\"index\": \"Feature\", \"value\": \"Activation\"},\n", + ").show()\n", + "\n", + "# let's print the top 5 features and how much they fired\n", + "vals, inds = torch.topk(\n", + " cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][0, -1, :], 5\n", + ")\n", + "for val, ind in zip(vals, inds):\n", + " print(f\"Feature {ind} fired {val:.2f}\")\n", + " html = get_dashboard_html(\n", + " sae_release=\"gpt2-small\", sae_id=\"7-res-jb\", feature_idx=ind\n", + " )\n", + " display(IFrame(html, width=1200, height=300))" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "### The Contrast Pairs Trick\n", + "\n", + "Sometimes we may be interested in which features fire differently between two prompts. Let's investigate this question by comparing the resultant activations. As we can see, using the prompt below changes the logit prediction considerably:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformer_lens.utils import test_prompt\n", + "\n", + "prompt = \"In the beginning, God created the cat and the\"\n", + "answer = \"earth\"\n", + "\n", + "# here we see that removing the word \"Heavens\" is very effective at making the model no longer predict \"earth\".\n", + "# instead the model predicts a bunch of different animals.\n", + "# Can we work out which features fire differently which might explain this? (This is a toy example not meant to be super interesting)\n", + "test_prompt(prompt, answer, model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's plot the two activation vectors.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompt = [\n", + " \"In the beginning, God created the heavens and the\",\n", + " \"In the beginning, God created the cat and the\",\n", + "]\n", + "_, cache = model.run_with_cache_with_saes(prompt, saes=[sae])\n", + "print([(k, v.shape) for k, v in cache.items() if \"sae\" in k])\n", + "\n", + "feature_activation_df = pd.DataFrame(\n", + " cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][0, -1, :].cpu().numpy(),\n", + " index=[f\"feature_{i}\" for i in range(sae.cfg.d_sae)],\n", + ")\n", + "feature_activation_df.columns = [\"heavens_and_the\"]\n", + "feature_activation_df[\"cat_and_the\"] = (\n", + " cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][1, -1, :].cpu().numpy()\n", + ")\n", + "feature_activation_df[\"diff\"] = (\n", + " 
feature_activation_df[\"heavens_and_the\"] - feature_activation_df[\"cat_and_the\"]\n", + ")\n", + "\n", + "fig = px.line(\n", + " feature_activation_df,\n", + " title=\"Feature activations for the prompt\",\n", + " labels={\"index\": \"Feature\", \"value\": \"Activation\"},\n", + ")\n", + "\n", + "# hide the x-ticks\n", + "fig.update_xaxes(showticklabels=False)\n", + "fig.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that there are differences, but let's plot the feature dashboards for the features with the biggest diffs to see what they are. We can see that the biggest difference is that there is now an active \"animal\" feature.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# let's look at the biggest features in terms of absolute difference\n", + "\n", + "diff = (\n", + " cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][1, -1, :].cpu()\n", + " - cache[\"blocks.7.hook_resid_pre.hook_sae_acts_post\"][0, -1, :].cpu()\n", + ")\n", + "vals, inds = torch.topk(torch.abs(diff), 5)\n", + "for val, ind in zip(vals, inds):\n", + " print(f\"Feature {ind} had a difference of {val:.2f}\")\n", + " html = get_dashboard_html(\n", + " sae_release=\"gpt2-small\", sae_id=\"7-res-jb\", feature_idx=ind\n", + " )\n", + " display(IFrame(html, width=1200, height=300))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So we see that with cats, there is now an animal predicting feature that fires quite strongly, and a feature that fires on \"and\" and promotes \"valleys\" and other geological terms no longer fires.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Making Feature Dashboards (Optional)\n", + "\n", + "For those interested, we provide a section showing how to generate the components of feature dashboards.\n", + "\n", + "We've covered what the feature dashboards are displaying, but let's dive into 
this in more detail so that we fully understand what the plots signify. To repeat the explanation above and provide more detail, basic feature dashboards have 4 main components:\n", + "\n", + "1. Feature Activation Distribution. We report the proportion of tokens a feature fires on, usually between 1 in every 100 and 1 in every 10,000 tokens, and show the distribution of positive activations.\n", + "2. Logit weight distribution. This is the projection of the decoder weight onto the unembed and roughly gives us a sense of the tokens promoted by a feature. It's less useful in big models / middle layers.\n", + "3. The top 10 and bottom 10 tokens in the logit weight distribution (positive/negative logits).\n", + "4. **Max Activating Examples**. These are examples of text where the feature fires and usually provide the most information for helping us work out what a feature means.\n", + "\n", + "**Bonus Section: Reproducing circular subspace geometry from [Not all Language Model Features are Linear](https://arxiv.org/abs/2405.14860)**\n", + "\n", + "_Neuronpedia_ is a website that hosts feature dashboards and runs servers that can run the model and check feature activations. This makes it very convenient to check that a feature fires on the distribution of text you actually think it should fire on. We've been downloading data from Neuronpedia for some of the plots above.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Local: Finding Max Activating Examples\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll start by finding the max-activating examples--the prompts that show the highest level of activation from a feature. 
First, we'll prepare an activations store, which streams a sample of text from the SAE's original training dataset and computes activations for it.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# instantiate an object to hold activations from a dataset\n", + "from sae_lens import ActivationsStore\n", + "\n", + "# a convenient way to instantiate an activation store is to use the from_sae method\n", + "activation_store = ActivationsStore.from_sae(\n", + " model=model,\n", + " dataset=sae.cfg.metadata.dataset_path,\n", + " sae=sae,\n", + " streaming=True,\n", + " # fairly conservative parameters here so can use same for larger\n", + " # models without running out of memory.\n", + " store_batch_size_prompts=8,\n", + " train_batch_size_tokens=4096,\n", + " n_batches_in_buffer=32,\n", + " device=device,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def list_flatten(nested_list):\n", + " return [x for y in nested_list for x in y]\n", + "\n", + "\n", + "# A very handy function Neel wrote to get context around a feature activation\n", + "def make_token_df(tokens, len_prefix=5, len_suffix=3, model=model):\n", + " str_tokens = [model.to_str_tokens(t) for t in tokens]\n", + " unique_token = [\n", + " [f\"{s}/{i}\" for i, s in enumerate(str_tok)] for str_tok in str_tokens\n", + " ]\n", + "\n", + " context = []\n", + " prompt = []\n", + " pos = []\n", + " label = []\n", + " for b in range(tokens.shape[0]):\n", + " for p in range(tokens.shape[1]):\n", + " prefix = \"\".join(str_tokens[b][max(0, p - len_prefix) : p])\n", + " if p == tokens.shape[1] - 1:\n", + " suffix = \"\"\n", + " else:\n", + " suffix = \"\".join(\n", + " str_tokens[b][p + 1 : min(tokens.shape[1] - 1, p + 1 + len_suffix)]\n", + " )\n", + " current = str_tokens[b][p]\n", + " context.append(f\"{prefix}|{current}|{suffix}\")\n", + " prompt.append(b)\n", + " 
pos.append(p)\n", + " label.append(f\"{b}/{p}\")\n", + " # print(len(batch), len(pos), len(context), len(label))\n", + " return pd.DataFrame(\n", + " dict(\n", + " str_tokens=list_flatten(str_tokens),\n", + " unique_token=list_flatten(unique_token),\n", + " context=context,\n", + " prompt=prompt,\n", + " pos=pos,\n", + " label=label,\n", + " )\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we'll generate examples for a random set of features.\n", + "\n", + "The code below does the following (for a randomly selected set of 100 features):\n", + "\n", + "1. Samples tokens from the dataset, prepending a bos if the SAE was trained with that and making sure prompts are the correct size for the SAE.\n", + "2. Generates activations, tracking which tokens a feature fired on.\n", + "3. (Just for `Not all language model features are linear`) Keeps track of the subspace generated by those features.\n", + "4. Makes a dataframe with all the tokens in all the prompts where at least one feature fired.\n", + "\n", + "\\*Note: this code is fairly slow in part due to the dataframe concat and in part because we actually have to run the model rather than using cached activations. SAE Lens officially recommends [SAE Dashboard](https://github.com/jbloomAus/SAEDashboard) for dashboard generation in practice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# finding max activating examples is a bit harder. 
To do this we need to calculate feature activations for a large number of tokens\n", + "feature_list = torch.randint(0, sae.cfg.d_sae, (100,))\n", + "examples_found = 0\n", + "all_fired_tokens = []\n", + "all_feature_acts = []\n", + "all_reconstructions = []\n", + "all_token_dfs = []\n", + "\n", + "total_batches = 100\n", + "batch_size_prompts = activation_store.store_batch_size_prompts\n", + "batch_size_tokens = activation_store.context_size * batch_size_prompts\n", + "pbar = tqdm(range(total_batches))\n", + "for i in pbar:\n", + " tokens = activation_store.get_batch_tokens()\n", + " tokens_df = make_token_df(tokens)\n", + " tokens_df[\"batch\"] = i\n", + "\n", + " flat_tokens = tokens.flatten()\n", + "\n", + " _, cache = model.run_with_cache(tokens, names_filter=[sae.cfg.metadata.hook_name])\n", + " sae_in = cache[sae.cfg.metadata.hook_name]\n", + " feature_acts = sae.encode(sae_in).squeeze()\n", + "\n", + " feature_acts = feature_acts.flatten(0, 1)\n", + " fired_mask = (feature_acts[:, feature_list]).sum(dim=-1) > 0\n", + " fired_tokens = model.to_str_tokens(flat_tokens[fired_mask])\n", + " reconstruction = feature_acts[fired_mask][:, feature_list] @ sae.W_dec[feature_list]\n", + "\n", + " token_df = tokens_df.iloc[fired_mask.cpu().nonzero().flatten().numpy()]\n", + " all_token_dfs.append(token_df)\n", + " all_feature_acts.append(feature_acts[fired_mask][:, feature_list])\n", + " all_fired_tokens.append(fired_tokens)\n", + " all_reconstructions.append(reconstruction)\n", + "\n", + " examples_found += len(fired_tokens)\n", + " # print(f\"Examples found: {examples_found}\")\n", + " # update description\n", + " pbar.set_description(f\"Examples found: {examples_found}\")\n", + "\n", + "# flatten the list of lists\n", + "all_token_dfs = pd.concat(all_token_dfs)\n", + "all_fired_tokens = list_flatten(all_fired_tokens)\n", + "all_reconstructions = torch.cat(all_reconstructions)\n", + "all_feature_acts = torch.cat(all_feature_acts)" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "## Getting Feature Activation Histogram\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we can generate the feature activation histogram (just as we saw on the dashboards above) and display the list of max-activating examples we just generated. We'll just do this for the first feature in our random set (index 0).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "feature_acts_df = pd.DataFrame(\n", + " all_feature_acts.detach().cpu().numpy(),\n", + " columns=[f\"feature_{i}\" for i in feature_list],\n", + ")\n", + "feature_acts_df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "feature_idx = 0\n", + "# get non-zero activations\n", + "\n", + "all_positive_acts = all_feature_acts[all_feature_acts[:, feature_idx] > 0][\n", + " :, feature_idx\n", + "].detach()\n", + "prop_positive_activations = (\n", + " 100 * len(all_positive_acts) / (total_batches * batch_size_tokens)\n", + ")\n", + "\n", + "px.histogram(\n", + " all_positive_acts.cpu(),\n", + " nbins=50,\n", + " title=f\"Histogram of positive activations - {prop_positive_activations:.3f}% of activations were positive\",\n", + " labels={\"value\": \"Activation\"},\n", + " width=800,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "top_10_activations = (\n", + " feature_acts_df[feature_acts_df[f\"feature_{feature_list[feature_idx]}\"] != 0]\n", + " .sort_values(f\"feature_{feature_list[feature_idx]}\", ascending=False)\n", + " .head(10)\n", + ")\n", + "all_token_dfs.iloc[top_10_activations.index]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Getting the Top 10 Logit Weights\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a final step, we'll generate the top 
10 logit weights--that is, we'll see what tokens each of the features in our set is promoting most strongly.\n", + "\n", + "Note that it's important to fold layer norm (by default SAE Lens loads Transformers with folded layer norm, but sometimes we turn preprocessing off to save GPU RAM, and this would affect the logit weight histograms a little bit).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"Shape of the decoder weights {sae.W_dec.shape}\")\n", + "print(f\"Shape of the model unembed {model.W_U.shape}\")\n", + "projection_matrix = sae.W_dec @ model.W_U\n", + "print(f\"Shape of the projection matrix {projection_matrix.shape}\")\n", + "\n", + "# then we take the top_k tokens per feature and decode them\n", + "top_k = 10\n", + "# let's do this for 100 random features\n", + "_, top_k_tokens = torch.topk(projection_matrix[feature_list], top_k, dim=1)\n", + "\n", + "\n", + "feature_df = pd.DataFrame(\n", + " top_k_tokens.cpu().numpy(), index=[f\"feature_{i}\" for i in feature_list]\n", + ").T\n", + "feature_df.index = [f\"token_{i}\" for i in range(top_k)]\n", + "feature_df.applymap(lambda x: model.tokenizer.decode(x))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Putting it all together: Compare against the Neuronpedia Dashboard\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "How does this compare to the dashboard data pulled from Neuronpedia? 
Let's take a look:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sae_lens.util import extract_layer_from_tlens_hook_name\n", + "\n", + "\n", + "html = get_dashboard_html(\n", + " sae_release=\"gpt2-small\",\n", + " sae_id=f\"{extract_layer_from_tlens_hook_name(sae.cfg.metadata.hook_name)}-res-jb\",\n", + " feature_idx=feature_list[0],\n", + ")\n", + "IFrame(html, width=1200, height=600)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It seems to replicate! We now see how the dashboard values are created.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Optional: Co-occurrence Networks and Irreducible Subspaces\n", + "\n", + "Since we just wrote code very similar to the code we need for reproducing some of the analysis from [\"Not All Language Model Features are Linear\"](https://arxiv.org/abs/2405.14860), we show below how to regenerate their awesome circular representation (demonstrating a geometric relationship between related features, like days of the week).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# only valid for res-jb resid_pre 7.\n", + "# Josh Engels emailed us these lists.\n", + "day_of_the_week_features = [2592, 4445, 4663, 4733, 6531, 8179, 9566, 20927, 24185]\n", + "# months_of_the_year = [3977, 4140, 5993, 7299, 9104, 9401, 10449, 11196, 12661, 14715, 17068, 17528, 19589, 21033, 22043, 23304]\n", + "# years_of_10th_century = [1052, 2753, 4427, 6382, 8314, 9576, 9606, 13551, 19734, 20349]\n", + "\n", + "feature_list = day_of_the_week_features\n", + "\n", + "examples_found = 0\n", + "all_fired_tokens = []\n", + "all_feature_acts = []\n", + "all_reconstructions = []\n", + "all_token_dfs = []\n", + "\n", + "total_batches = 100\n", + "batch_size_prompts = activation_store.store_batch_size_prompts\n", + "batch_size_tokens = 
activation_store.context_size * batch_size_prompts\n", + "pbar = tqdm(range(total_batches))\n", + "for i in pbar:\n", + " tokens = activation_store.get_batch_tokens()\n", + " tokens_df = make_token_df(tokens)\n", + " tokens_df[\"batch\"] = i\n", + "\n", + " flat_tokens = tokens.flatten()\n", + "\n", + " _, cache = model.run_with_cache(tokens, names_filter=[sae.cfg.metadata.hook_name])\n", + " sae_in = cache[sae.cfg.metadata.hook_name]\n", + " feature_acts = sae.encode(sae_in).squeeze()\n", + "\n", + " feature_acts = feature_acts.flatten(0, 1)\n", + " fired_mask = (feature_acts[:, feature_list]).sum(dim=-1) > 0\n", + " fired_tokens = model.to_str_tokens(flat_tokens[fired_mask])\n", + " reconstruction = feature_acts[fired_mask][:, feature_list] @ sae.W_dec[feature_list]\n", + "\n", + " token_df = tokens_df.iloc[fired_mask.cpu().nonzero().flatten().numpy()]\n", + " all_token_dfs.append(token_df)\n", + " all_feature_acts.append(feature_acts[fired_mask][:, feature_list])\n", + " all_fired_tokens.append(fired_tokens)\n", + " all_reconstructions.append(reconstruction)\n", + "\n", + " examples_found += len(fired_tokens)\n", + " # print(f\"Examples found: {examples_found}\")\n", + " # update description\n", + " pbar.set_description(f\"Examples found: {examples_found}\")\n", + "\n", + "# flatten the list of lists\n", + "all_token_dfs = pd.concat(all_token_dfs)\n", + "all_fired_tokens = list_flatten(all_fired_tokens)\n", + "all_reconstructions = torch.cat(all_reconstructions)\n", + "all_feature_acts = torch.cat(all_feature_acts)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using PCA, we can see that these features do indeed lie in a circle!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# do PCA on reconstructions\n", + "from sklearn.decomposition import PCA\n", + "import plotly.express as px\n", + "\n", + "pca = PCA(n_components=3)\n", + "pca_embedding = 
pca.fit_transform(all_reconstructions.detach().cpu().numpy())\n", + "\n", + "pca_df = pd.DataFrame(pca_embedding, columns=[\"PC1\", \"PC2\", \"PC3\"])\n", + "pca_df[\"tokens\"] = all_fired_tokens\n", + "pca_df[\"context\"] = all_token_dfs.context.values\n", + "\n", + "\n", + "px.scatter(\n", + " pca_df,\n", + " x=\"PC2\",\n", + " y=\"PC3\",\n", + " hover_data=[\"context\"],\n", + " hover_name=\"tokens\",\n", + " height=800,\n", + " width=1200,\n", + " color=\"tokens\",\n", + " title=\"PCA Subspace Reconstructions\",\n", + ").show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You should be able to see a circular subspace where the order of days of the week is preserved correctly.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Basics: Intervening on SAE Features\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Feature Steering\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One fun (and sometimes useful) thing we can do once we've found a feature is to use it to steer a model. To do this, we find the maximum activation of a feature in a set of text (using the activation store above), use this as the default scale, multiply it by the vector representing the feature (as extracted from the decoder weights), and finally multiply this by a steering strength parameter that we control. Varying this parameter shows its effect on the generated text. Below, we'll try steering with a feature that often fires on religious or philosophical statements (feature [20115](https://www.neuronpedia.org/gpt2-small/7-res-jb/20115)). 
Note that sometimes steering can get GPT2 into a loop, so it's worth running this more than once.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from tqdm.auto import tqdm\n", + "from functools import partial\n", + "import re\n", + "\n", + "\n", + "def find_max_activation(model, sae, activation_store, feature_idx, num_batches=100):\n", + " \"\"\"\n", + " Find the maximum activation for a given feature index. This is useful for\n", + " calibrating the right amount of the feature to add.\n", + " \"\"\"\n", + " max_activation = 0.0\n", + "\n", + " pbar = tqdm(range(num_batches))\n", + " for _ in pbar:\n", + " tokens = activation_store.get_batch_tokens()\n", + "\n", + " layer = int(re.search(r\"\\.(\\d+)\\.\", sae.cfg.metadata.hook_name).group(1)) # type: ignore\n", + " _, cache = model.run_with_cache(\n", + " tokens,\n", + " stop_at_layer=layer + 1,\n", + " names_filter=[sae.cfg.metadata.hook_name],\n", + " )\n", + " sae_in = cache[sae.cfg.metadata.hook_name]\n", + " feature_acts = sae.encode(sae_in).squeeze()\n", + "\n", + " feature_acts = feature_acts.flatten(0, 1)\n", + " batch_max_activation = feature_acts[:, feature_idx].max().item()\n", + " max_activation = max(max_activation, batch_max_activation)\n", + "\n", + " pbar.set_description(f\"Max activation: {max_activation:.4f}\")\n", + "\n", + " return max_activation\n", + "\n", + "\n", + "def steering(\n", + " activations, hook, steering_strength=1.0, steering_vector=None, max_act=1.0\n", + "):\n", + " # Note if the feature fires anyway, we'd be adding to that here.\n", + " return activations + max_act * steering_strength * steering_vector\n", + "\n", + "\n", + "def generate_with_steering(\n", + " model,\n", + " sae,\n", + " prompt,\n", + " steering_feature,\n", + " max_act,\n", + " steering_strength=1.0,\n", + " max_new_tokens=95,\n", + "):\n", + " input_ids = model.to_tokens(prompt, prepend_bos=sae.cfg.metadata.prepend_bos)\n", + "\n", + " 
steering_vector = sae.W_dec[steering_feature].to(model.cfg.device)\n", + "\n", + " steering_hook = partial(\n", + " steering,\n", + " steering_vector=steering_vector,\n", + " steering_strength=steering_strength,\n", + " max_act=max_act,\n", + " )\n", + "\n", + " # standard transformerlens syntax for a hook context for generation\n", + " with model.hooks(fwd_hooks=[(sae.cfg.metadata.hook_name, steering_hook)]):\n", + " output = model.generate(\n", + " input_ids,\n", + " max_new_tokens=max_new_tokens,\n", + " temperature=0.7,\n", + " top_p=0.9,\n", + " stop_at_eos=False if device == \"mps\" else True,\n", + " prepend_bos=sae.cfg.metadata.prepend_bos,\n", + " )\n", + "\n", + " return model.tokenizer.decode(output[0])\n", + "\n", + "\n", + "# Choose a feature to steer\n", + "steering_feature = 20115 # index of the feature to steer towards\n", + "\n", + "# Find the maximum activation for this feature\n", + "max_act = find_max_activation(model, sae, activation_store, steering_feature)\n", + "print(f\"Maximum activation for feature {steering_feature}: {max_act:.4f}\")\n", + "\n", + "# note we could also get the max activation from Neuronpedia (https://www.neuronpedia.org/api-doc#tag/lookup/GET/api/feature/{modelId}/{layer}/{index})\n", + "\n", + "# Generate text without steering for comparison\n", + "prompt = \"Once upon a time\"\n", + "normal_text = model.generate(\n", + " prompt,\n", + " max_new_tokens=95,\n", + " stop_at_eos=False if device == \"mps\" else True,\n", + " prepend_bos=sae.cfg.metadata.prepend_bos,\n", + ")\n", + "\n", + "print(\"\\nNormal text (without steering):\")\n", + "print(normal_text)\n", + "\n", + "# Generate text with steering\n", + "steered_text = generate_with_steering(\n", + " model, sae, prompt, steering_feature, max_act, steering_strength=2.0\n", + ")\n", + "print(\"Steered text:\")\n", + "print(steered_text)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 
Experiment with different steering strengths\n", + "print(\"\\nExperimenting with different steering strengths:\")\n", + "for strength in [-4.0, -2.0, 0.5, 2.0, 4.0]:\n", + " steered_text = generate_with_steering(\n", + " model, sae, prompt, steering_feature, max_act, steering_strength=strength\n", + " )\n", + " print(f\"\\nSteering strength {strength}:\")\n", + " print(steered_text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also do this via the Neuronpedia API or on the website [here](https://www.neuronpedia.org/steer/). The example below steers for just a few tokens with a feature that does something very specific. Can you work out what it's doing?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import numpy as np\n", + "\n", + "url = \"https://www.neuronpedia.org/api/steer\"\n", + "\n", + "payload = {\n", + " # \"prompt\": \"A knight in shining\",\n", + " # \"prompt\": \"He had to fight back in self-\",\n", + " \"prompt\": \"In the middle of the universe is the galactic\",\n", + " # \"prompt\": \"Oh no. We're running on empty. Its time to fill up the car with\",\n", + " # \"prompt\": \"Sure, I'm happy to pay. 
I don't have any cash on me but let me write you a\",\n", + " \"modelId\": \"gpt2-small\",\n", + " \"features\": [\n", + " {\"modelId\": \"gpt2-small\", \"layer\": \"7-res-jb\", \"index\": 6770, \"strength\": 8}\n", + " ],\n", + " \"temperature\": 0.2,\n", + " \"n_tokens\": 2,\n", + " \"freq_penalty\": 1,\n", + " \"seed\": np.random.randint(100),\n", + " \"strength_multiplier\": 4,\n", + "}\n", + "headers = {\"Content-Type\": \"application/json\"}\n", + "\n", + "response = requests.post(url, json=payload, headers=headers)\n", + "\n", + "print(response.json())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import numpy as np\n", + "\n", + "url = \"https://www.neuronpedia.org/api/steer\"\n", + "\n", + "payload = {\n", + " \"prompt\": 'I wrote a letter to my girlfriend. It said \"',\n", + " \"modelId\": \"gpt2-small\",\n", + " \"features\": [\n", + " {\"modelId\": \"gpt2-small\", \"layer\": \"7-res-jb\", \"index\": 20115, \"strength\": 4}\n", + " ],\n", + " \"temperature\": 0.7,\n", + " \"n_tokens\": 120,\n", + " \"freq_penalty\": 1,\n", + " \"seed\": np.random.randint(100),\n", + " \"strength_multiplier\": 4,\n", + "}\n", + "headers = {\"Content-Type\": \"application/json\"}\n", + "\n", + "response = requests.post(url, json=payload, headers=headers)\n", + "\n", + "print(response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Feature Ablation\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Feature ablation is also worth looking at. In a way, it's a special case of steering where the value of the feature is always zeroed out.\n", + "\n", + "Here we do the following:\n", + "\n", + "1. Use `test_prompt` rather than `generate` to get a more nuanced view of the effect.\n", + "2. Attach a hook to the SAE feature activations.\n", + "3. Zero out a feature at all positions (we know that the default feature fires at the final position).\n", + "4. 
Check whether this ablation is more / less effective if we include the error term (info our SAE isn't capturing).\n", + "\n", + "Note that the existence of [The Hydra Effect](https://arxiv.org/abs/2307.15771) can make reasoning about ablation experiments difficult.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformer_lens.utils import test_prompt\n", + "from functools import partial\n", + "\n", + "\n", + "def test_prompt_with_ablation(model, sae, prompt, answer, ablation_features):\n", + " def ablate_feature_hook(feature_activations, hook, feature_ids, position=None):\n", + " if position is None:\n", + " feature_activations[:, :, feature_ids] = 0\n", + " else:\n", + " feature_activations[:, position, feature_ids] = 0\n", + "\n", + " return feature_activations\n", + "\n", + " ablation_hook = partial(ablate_feature_hook, feature_ids=ablation_features)\n", + "\n", + " model.add_sae(sae)\n", + " hook_point = sae.cfg.metadata.hook_name + \".hook_sae_acts_post\"\n", + " model.add_hook(hook_point, ablation_hook, \"fwd\")\n", + "\n", + " test_prompt(prompt, answer, model)\n", + "\n", + " model.reset_hooks()\n", + " model.reset_saes()\n", + "\n", + "\n", + "# Example usage in a notebook:\n", + "\n", + "# Assume model and sae are already defined\n", + "\n", + "# Choose a feature to ablate\n", + "\n", + "model.reset_hooks(including_permanent=True)\n", + "prompt = \"In the beginning, God created the heavens and the\"\n", + "answer = \"earth\"\n", + "test_prompt(prompt, answer, model)\n", + "\n", + "\n", + "# Generate text with feature ablation\n", + "print(\"Test Prompt with feature ablation and no error term\")\n", + "ablation_feature = 16873 # Replace with any feature index you're interested in. 
We use the religion feature\n", + "sae.use_error_term = False\n", + "test_prompt_with_ablation(model, sae, prompt, answer, ablation_feature)\n", + "\n", + "print(\"Test Prompt with feature ablation and error term\")\n", + "ablation_feature = 16873 # Replace with any feature index you're interested in. We use the religion feature\n", + "sae.use_error_term = True\n", + "test_prompt_with_ablation(model, sae, prompt, answer, ablation_feature)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Feature Attribution\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from dataclasses import dataclass\n", + "from functools import partial\n", + "from typing import Any, Literal, NamedTuple, Callable\n", + "\n", + "import torch\n", + "from sae_lens import SAE\n", + "from transformer_lens import HookedTransformer\n", + "from transformer_lens.hook_points import HookPoint\n", + "\n", + "\n", + "class SaeReconstructionCache(NamedTuple):\n", + " sae_in: torch.Tensor\n", + " feature_acts: torch.Tensor\n", + " sae_out: torch.Tensor\n", + " sae_error: torch.Tensor\n", + "\n", + "\n", + "def track_grad(tensor: torch.Tensor) -> None:\n", + " \"\"\"wrapper around requires_grad and retain_grad\"\"\"\n", + " tensor.requires_grad_(True)\n", + " tensor.retain_grad()\n", + "\n", + "\n", + "@dataclass\n", + "class ApplySaesAndRunOutput:\n", + " model_output: torch.Tensor\n", + " model_activations: dict[str, torch.Tensor]\n", + " sae_activations: dict[str, SaeReconstructionCache]\n", + "\n", + " def zero_grad(self) -> None:\n", + " \"\"\"Helper to zero grad all tensors in this object.\"\"\"\n", + " self.model_output.grad = None\n", + " for act in self.model_activations.values():\n", + " act.grad = None\n", + " for cache in self.sae_activations.values():\n", + " cache.sae_in.grad = None\n", + " cache.feature_acts.grad = None\n", + " cache.sae_out.grad = None\n", + " cache.sae_error.grad = None\n", + 
"\n", + "\n", + "def apply_saes_and_run(\n", + " model: HookedTransformer,\n", + " saes: dict[str, SAE],\n", + " input: Any,\n", + " include_error_term: bool = True,\n", + " track_model_hooks: list[str] | None = None,\n", + " return_type: Literal[\"logits\", \"loss\"] = \"logits\",\n", + " track_grads: bool = False,\n", + ") -> ApplySaesAndRunOutput:\n", + " \"\"\"\n", + " Apply the SAEs to the model at the specific hook points, and run the model.\n", + " By default, this will include a SAE error term which guarantees that the SAE\n", + " will not affect model output. This function is designed to work correctly with\n", + " backprop as well, so it can be used for gradient-based feature attribution.\n", + "\n", + " Args:\n", + " model: the model to run\n", + " saes: the SAEs to apply\n", + " input: the input to the model\n", + " include_error_term: whether to include the SAE error term to ensure the SAE doesn't affect model output. Default True\n", + " track_model_hooks: a list of hook points to record the activations and gradients. Default None\n", + " return_type: this is passed to the model.run_with_hooks function. Default \"logits\"\n", + " track_grads: whether to track gradients. Default False\n", + " \"\"\"\n", + "\n", + " fwd_hooks = []\n", + " bwd_hooks = []\n", + "\n", + " sae_activations: dict[str, SaeReconstructionCache] = {}\n", + " model_activations: dict[str, torch.Tensor] = {}\n", + "\n", + " # this hook just track the SAE input, output, features, and error. 
If `track_grads=True`, it also ensures\n", + " # that requires_grad is set to True and retain_grad is called for intermediate values.\n", + " def reconstruction_hook(sae_in: torch.Tensor, hook: HookPoint, hook_point: str): # noqa: ARG001\n", + " sae = saes[hook_point]\n", + " feature_acts = sae.encode(sae_in)\n", + " sae_out = sae.decode(feature_acts)\n", + " sae_error = (sae_in - sae_out).detach().clone()\n", + " if track_grads:\n", + " track_grad(sae_error)\n", + " track_grad(sae_out)\n", + " track_grad(feature_acts)\n", + " track_grad(sae_in)\n", + " sae_activations[hook_point] = SaeReconstructionCache(\n", + " sae_in=sae_in,\n", + " feature_acts=feature_acts,\n", + " sae_out=sae_out,\n", + " sae_error=sae_error,\n", + " )\n", + "\n", + " if include_error_term:\n", + " return sae_out + sae_error\n", + " return sae_out\n", + "\n", + " def sae_bwd_hook(output_grads: torch.Tensor, hook: HookPoint): # noqa: ARG001\n", + " # this just passes the output grads to the input, so the SAE gets the same grads despite the error term hackery\n", + " return (output_grads,)\n", + "\n", + " # this hook just records model activations, and ensures that intermediate activations have gradient tracking turned on if needed\n", + " def tracking_hook(hook_input: torch.Tensor, hook: HookPoint, hook_point: str): # noqa: ARG001\n", + " model_activations[hook_point] = hook_input\n", + " if track_grads:\n", + " track_grad(hook_input)\n", + " return hook_input\n", + "\n", + " for hook_point in saes.keys():\n", + " fwd_hooks.append(\n", + " (hook_point, partial(reconstruction_hook, hook_point=hook_point))\n", + " )\n", + " bwd_hooks.append((hook_point, sae_bwd_hook))\n", + " for hook_point in track_model_hooks or []:\n", + " fwd_hooks.append((hook_point, partial(tracking_hook, hook_point=hook_point)))\n", + "\n", + " # now, just run the model while applying the hooks\n", + " with model.hooks(fwd_hooks=fwd_hooks, bwd_hooks=bwd_hooks):\n", + " model_output = model(input, 
return_type=return_type)\n", + "\n", + " return ApplySaesAndRunOutput(\n", + " model_output=model_output,\n", + " model_activations=model_activations,\n", + " sae_activations=sae_activations,\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from dataclasses import dataclass\n", + "from transformer_lens.hook_points import HookPoint\n", + "from dataclasses import dataclass\n", + "from functools import partial\n", + "from typing import Any, Literal, NamedTuple\n", + "\n", + "import torch\n", + "from sae_lens import SAE\n", + "from transformer_lens import HookedTransformer\n", + "from transformer_lens.hook_points import HookPoint\n", + "\n", + "EPS = 1e-8\n", + "\n", + "torch.set_grad_enabled(True)\n", + "\n", + "\n", + "@dataclass\n", + "class AttributionGrads:\n", + " metric: torch.Tensor\n", + " model_output: torch.Tensor\n", + " model_activations: dict[str, torch.Tensor]\n", + " sae_activations: dict[str, SaeReconstructionCache]\n", + "\n", + "\n", + "@dataclass\n", + "class Attribution:\n", + " model_attributions: dict[str, torch.Tensor]\n", + " model_activations: dict[str, torch.Tensor]\n", + " model_grads: dict[str, torch.Tensor]\n", + " sae_feature_attributions: dict[str, torch.Tensor]\n", + " sae_feature_activations: dict[str, torch.Tensor]\n", + " sae_feature_grads: dict[str, torch.Tensor]\n", + " sae_errors_attribution_proportion: dict[str, float]\n", + "\n", + "\n", + "def calculate_attribution_grads(\n", + " model: HookedSAETransformer,\n", + " prompt: str,\n", + " metric_fn: Callable[[torch.Tensor], torch.Tensor],\n", + " track_hook_points: list[str] | None = None,\n", + " include_saes: dict[str, SAE] | None = None,\n", + " return_logits: bool = True,\n", + " include_error_term: bool = True,\n", + ") -> AttributionGrads:\n", + " \"\"\"\n", + " Wrapper around apply_saes_and_run that calculates gradients wrt to the metric_fn.\n", + " Tracks grads for both SAE feature and model 
neurons, and returns them in a structured format.\n", + " \"\"\"\n", + " output = apply_saes_and_run(\n", + " model,\n", + " saes=include_saes or {},\n", + " input=prompt,\n", + " return_type=\"logits\" if return_logits else \"loss\",\n", + " track_model_hooks=track_hook_points,\n", + " include_error_term=include_error_term,\n", + " track_grads=True,\n", + " )\n", + " metric = metric_fn(output.model_output)\n", + " output.zero_grad()\n", + " metric.backward()\n", + " return AttributionGrads(\n", + " metric=metric,\n", + " model_output=output.model_output,\n", + " model_activations=output.model_activations,\n", + " sae_activations=output.sae_activations,\n", + " )\n", + "\n", + "\n", + "def calculate_feature_attribution(\n", + " model: HookedSAETransformer,\n", + " input: Any,\n", + " metric_fn: Callable[[torch.Tensor], torch.Tensor],\n", + " track_hook_points: list[str] | None = None,\n", + " include_saes: dict[str, SAE] | None = None,\n", + " return_logits: bool = True,\n", + " include_error_term: bool = True,\n", + ") -> Attribution:\n", + " \"\"\"\n", + " Calculate feature attribution for SAE features and model neurons following\n", + " the procedure in https://transformer-circuits.pub/2024/march-update/index.html#feature-heads.\n", + " This includes the SAE error term by default, so inserting the SAE into the calculation is\n", + " guaranteed to not affect the model output. This can be disabled by setting `include_error_term=False`.\n", + "\n", + " Args:\n", + " model: The model to calculate feature attribution for.\n", + " input: The input to the model.\n", + " metric_fn: A function that takes the model output and returns a scalar metric.\n", + " track_hook_points: A list of model hook points to track activations for, if desired\n", + " include_saes: A dictionary of SAEs to include in the calculation. The key is the hook point to apply the SAE to.\n", + " return_logits: Whether to return the model logits or loss. 
This is passed to TLens, so should match whatever the metric_fn expects (probably logits)\n", + " include_error_term: Whether to include the SAE error term in the calculation. This is recommended, as it ensures that the SAE will not affect the model output.\n", + " \"\"\"\n", + " # first, calculate gradients wrt the metric_fn.\n", + " # these will be multiplied with the activation values to get the attributions\n", + " outputs_with_grads = calculate_attribution_grads(\n", + " model,\n", + " input,\n", + " metric_fn,\n", + " track_hook_points,\n", + " include_saes=include_saes,\n", + " return_logits=return_logits,\n", + " include_error_term=include_error_term,\n", + " )\n", + " model_attributions = {}\n", + " model_activations = {}\n", + " model_grads = {}\n", + " sae_feature_attributions = {}\n", + " sae_feature_activations = {}\n", + " sae_feature_grads = {}\n", + " sae_error_proportions = {}\n", + " # this code is long, but all it's doing is multiplying the grads by the activations\n", + " # and recording grads, acts, and attributions in dictionaries to return to the user\n", + " with torch.no_grad():\n", + " for name, act in outputs_with_grads.model_activations.items():\n", + " assert act.grad is not None\n", + " raw_activation = act.detach().clone()\n", + " model_attributions[name] = (act.grad * raw_activation).detach().clone()\n", + " model_activations[name] = raw_activation\n", + " model_grads[name] = act.grad.detach().clone()\n", + " for name, act in outputs_with_grads.sae_activations.items():\n", + " assert act.feature_acts.grad is not None\n", + " assert act.sae_out.grad is not None\n", + " raw_activation = act.feature_acts.detach().clone()\n", + " sae_feature_attributions[name] = (\n", + " (act.feature_acts.grad * raw_activation).detach().clone()\n", + " )\n", + " sae_feature_activations[name] = raw_activation\n", + " sae_feature_grads[name] = act.feature_acts.grad.detach().clone()\n", + " if include_error_term:\n", + " assert act.sae_error.grad is 
not None\n", + " error_grad_norm = act.sae_error.grad.norm().item()\n", + " else:\n", + " error_grad_norm = 0\n", + " sae_out_norm = act.sae_out.grad.norm().item()\n", + " sae_error_proportions[name] = error_grad_norm / (\n", + " sae_out_norm + error_grad_norm + EPS\n", + " )\n", + " return Attribution(\n", + " model_attributions=model_attributions,\n", + " model_activations=model_activations,\n", + " model_grads=model_grads,\n", + " sae_feature_attributions=sae_feature_attributions,\n", + " sae_feature_activations=sae_feature_activations,\n", + " sae_feature_grads=sae_feature_grads,\n", + " sae_errors_attribution_proportion=sae_error_proportions,\n", + " )\n", + "\n", + "\n", + "# prompt = \" Tiger Woods plays the sport of\"\n", + "# pos_token = model.tokenizer.encode(\" golf\")[0]\n", + "prompt = \"In the beginning, God created the heavens and the\"\n", + "pos_token = model.tokenizer.encode(\" earth\")\n", + "neg_token = model.tokenizer.encode(\" sky\")\n", + "\n", + "\n", + "def metric_fn(\n", + " logits: torch.Tensor,\n", + " pos_token: list[int] = pos_token,\n", + " neg_token: list[int] = neg_token,\n", + ") -> torch.Tensor:\n", + " return logits[0, -1, pos_token] - logits[0, -1, neg_token]\n", + "\n", + "\n", + "feature_attribution_df = calculate_feature_attribution(\n", + " input=prompt,\n", + " model=model,\n", + " metric_fn=metric_fn,\n", + " include_saes={sae.cfg.metadata.hook_name: sae},\n", + " include_error_term=True,\n", + " return_logits=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformer_lens.utils import test_prompt\n", + "\n", + "test_prompt(prompt, model.to_string(pos_token), model)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tokens = model.to_str_tokens(prompt)\n", + "unique_tokens = [f\"{i}/{t}\" for i, t in enumerate(tokens)]\n", + "\n", + "px.bar(\n", + " 
x=unique_tokens,\n", + " y=feature_attribution_df.sae_feature_attributions[sae.cfg.metadata.hook_name][0]\n", + " .sum(-1)\n", + " .detach()\n", + " .cpu()\n", + " .numpy(),\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def convert_sparse_feature_to_long_df(sparse_tensor: torch.Tensor) -> pd.DataFrame:\n", + " \"\"\"\n", + " Convert a sparse tensor to a long format pandas DataFrame.\n", + " \"\"\"\n", + " df = pd.DataFrame(sparse_tensor.detach().cpu().numpy())\n", + " df_long = df.melt(ignore_index=False, var_name=\"column\", value_name=\"value\")\n", + " df_long.columns = [\"feature\", \"attribution\"]\n", + " df_long_nonzero = df_long[df_long[\"attribution\"] != 0]\n", + " df_long_nonzero = df_long_nonzero.reset_index().rename(\n", + " columns={\"index\": \"position\"}\n", + " )\n", + " return df_long_nonzero\n", + "\n", + "\n", + "df_long_nonzero = convert_sparse_feature_to_long_df(\n", + " feature_attribution_df.sae_feature_attributions[sae.cfg.metadata.hook_name][0]\n", + ")\n", + "df_long_nonzero.sort_values(\"attribution\", ascending=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for i, v in (\n", + " df_long_nonzero.query(\"position==8\")\n", + " .groupby(\"feature\")\n", + " .attribution.sum()\n", + " .sort_values(ascending=False)\n", + " .head(5)\n", + " .items()\n", + "):\n", + " print(f\"Feature {i} had a total attribution of {v:.2f}\")\n", + " html = get_dashboard_html(\n", + " sae_release=\"gpt2-small\",\n", + " sae_id=f\"{extract_layer_from_tlens_hook_name(sae.cfg.metadata.hook_name)}-res-jb\",\n", + " feature_idx=int(i),\n", + " )\n", + " display(IFrame(html, width=1200, height=300))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for i, v in (\n", + " df_long_nonzero.groupby(\"feature\")\n", + " .attribution.sum()\n", + " 
.sort_values(ascending=False)\n", + " .head(5)\n", + " .items()\n", + "):\n", + " print(f\"Feature {i} had a total attribution of {v:.2f}\")\n", + " html = get_dashboard_html(\n", + " sae_release=\"gpt2-small\",\n", + " sae_id=f\"{extract_layer_from_tlens_hook_name(sae.cfg.metadata.hook_name)}-res-jb\",\n", + " feature_idx=int(i),\n", + " )\n", + " display(IFrame(html, width=1200, height=300))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Advanced: Making U-Maps\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "sae-lens-yMclHL4--py3.12", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.2" + } + }, + "nbformat": 4, + "nbformat_minor": 0 } From 4718db11ee451001ab5e2f3002fbeeb1eaae004b Mon Sep 17 00:00:00 2001 From: roseline1 Date: Wed, 5 Nov 2025 19:01:20 +0000 Subject: [PATCH 2/2] replace hardcoded index with feature_idx var --- tutorials/tutorial_2_0.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tutorials/tutorial_2_0.ipynb b/tutorials/tutorial_2_0.ipynb index 3cfb2d590..5799c068a 100644 --- a/tutorials/tutorial_2_0.ipynb +++ b/tutorials/tutorial_2_0.ipynb @@ -933,7 +933,7 @@ "html = get_dashboard_html(\n", " sae_release=\"gpt2-small\",\n", " sae_id=f\"{extract_layer_from_tlens_hook_name(sae.cfg.metadata.hook_name)}-res-jb\",\n", - " feature_idx=feature_list[0],\n", + " feature_idx=feature_list[feature_idx],\n", ")\n", "IFrame(html, width=1200, height=600)" ]