dfnt.xyz

Finetuning models for alternative uses

2024-04-10T11:58:00+00:00

The standard use of model finetuning is fairly easy to get started with. If we just want to adjust the parameters of the entire model for a new classification task, we can create a new model that wraps the existing one, change the output dimension and retrain. We can achieve this with the following code:

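import torch
import torchvision
from torchvision.models.googlenet import BasicConv2d, Inception  # Used by the freezing code further down

class NewNet(torch.nn.Module):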
    def __init__(self, new_output_shape):
        super(NewNet, self).__init__()
        self.googlenet = torchvision.models.googlenet(weights='IMAGENET1K_V1')
        output = torch.nn.Linear(self.googlenet.fc.in_features, new_output_shape)
        self.googlenet.fc = output

    def forward(self, x):
        logits = self.googlenet(x)
        return logits

Here we’re taking a GoogLeNet model from torchvision, loading the pre-trained weights and replacing its final fully connected layer (fc) with a new Linear layer so the output shape matches whatever we need for the classification task we’re trying to solve. Then in the forward function we just pass the input to the GoogLeNet and get the output logits. Once this has been set up, we’d just train the network as normal and (provided the classification task isn’t too far from the original task of the pre-trained network) it should converge to a good solution quite quickly, since we’re leveraging the features the network has already learned.
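As a rough illustration of the "swap the head, then train as normal" idea, here is a minimal sketch of a single training step. To keep it self-contained it uses a tiny stand-in network (TinyBackbone, a hypothetical class that is not part of torchvision) instead of downloading GoogLeNet, and random tensors in place of a real dataset; the helper name replace_head and the 37-class output are assumptions for the example.

```python
import torch


class TinyBackbone(torch.nn.Module):
    """Hypothetical stand-in for a pre-trained network with an `fc` head."""

    def __init__(self):
        super().__init__()
        self.features = torch.nn.Sequential(
            torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1),
            torch.nn.Flatten(),
        )
        self.fc = torch.nn.Linear(8, 1000)  # original 1000-class head

    def forward(self, x):
        return self.fc(self.features(x))


def replace_head(model, new_output_shape):
    # Same trick as NewNet: keep the input width of the old head,
    # change only the number of outputs.
    model.fc = torch.nn.Linear(model.fc.in_features, new_output_shape)
    return model


torch.manual_seed(0)
model = replace_head(TinyBackbone(), new_output_shape=37)
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

# Random data standing in for one batch of images and labels
images = torch.rand(4, 3, 32, 32)
labels = torch.randint(0, 37, (4,))

model.train()
optimiser.zero_grad()
logits = model(images)
loss = loss_fn(logits, labels)
loss.backward()
optimiser.step()
```

With the real NewNet the loop looks the same; only the model construction and the data loading change.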

Things get a bit more complicated if we want to freeze a part of the network before retraining! In the previous example we were allowing the entire network to be retrained, so parameters from the very early or very late layers could be adjusted to help us with the current classification task. If we freeze layers, we’re specifically telling the parameters in those layers not to change. The early layers of a network are usually the ones frozen, since they extract the more fundamental aspects of the data, which do not change massively between similar datasets (e.g. vision-based networks usually learn Gabor-like filters in the first few layers). Freezing these layers can speed up network training, which can be critical depending on the amount of compute you have available.

The layer freezing is in the code below:

def freeze_weights(self, threshold_layer, verbose=False, invert=False):
    """
    threshold_layer (str): This and subsequent layers will not be frozen
    invert (bool): Invert the freezing/non-freezing (early layers not frozen, later layers frozen)
    verbose (bool): Tell the user which layers are frozen/not frozen
    """
    freeze_flag = False | invert
    flag_changed = False
    # Layer types to ignore when not freezing/freezing
    ignore_layer_set = (torchvision.models.GoogLeNet, torch.nn.Sequential, BasicConv2d, Inception, # Should be ignored since they're not executable
                        torch.nn.MaxPool2d, torch.nn.AdaptiveAvgPool2d, torch.nn.Dropout) # These don't have trainable parameters

    layers = self.googlenet.named_modules()
    
    for idx, layer_data in enumerate(layers): 
        name, layer = layer_data

        if not flag_changed and name == threshold_layer:
            freeze_flag = not freeze_flag
            flag_changed = True
        
        if not isinstance(layer, ignore_layer_set):
            if not freeze_flag:
                for p in layer.parameters():
                    p.requires_grad = False
                
                if verbose:
                    print("Layer {} <- Frozen".format(name))
            else:
                if verbose:
                    print("Layer {} <- Not frozen".format(name))

The above code allows us to specify a layer in the network by name and freeze all layers before it, and not freeze all layers after it. This is achieved by setting the requires_grad property of the parameters of the layer to False. We also have some extra code here (invert) which allows us to invert the freezing (i.e. un-freeze all layers before the named layer and freeze the named layer and all others after).
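A quick way to sanity-check that freezing did what we expected is to count trainable and frozen parameters, and to hand only the trainable ones to the optimiser. The sketch below does this on a small stand-in model; count_parameters is a hypothetical helper name, and the three-layer Sequential is just an example, not the GoogLeNet used above.

```python
import torch


def count_parameters(model):
    """Return (trainable, frozen) parameter counts for a model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen


model = torch.nn.Sequential(
    torch.nn.Linear(10, 5),   # "early" layer: 10*5 weights + 5 biases = 55
    torch.nn.ReLU(),
    torch.nn.Linear(5, 2),    # "late" layer: 5*2 weights + 2 biases = 12
)

# Freeze the early layer in the same way freeze_weights does
for p in model[0].parameters():
    p.requires_grad = False

trainable, frozen = count_parameters(model)
print("Trainable: {} - Frozen: {}".format(trainable, frozen))

# Only pass parameters that still require gradients to the optimiser
optimiser = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01
)
```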

To get the names of the layers from the model we can use the following helper function:

def view_available_layers(model):
    """
    Print all available layers

    Parameters:
    - model (torch.nn.Module): The model to visualise the layer of
    """
    for name, layer in model.named_modules():
        print("{}".format(name))

Finally, we can do something more interesting! When layers are kept trainable they still start from their pre-trained state. However, it’s interesting to think about what the networks will learn if we keep some pre-trained layers frozen and then reset the other trainable layer parameters to a random initialisation. The effect that this reset has on the model is the focus of our next stage of research (so watch this space!). We can take the freeze_weights code from above and make some small adjustments to reset the parameters.

def freeze_and_init_weights(self, threshold_layer, verbose=False, invert=False):
    """
    threshold_layer (str): This and subsequent layers will be re-initialised
    invert (bool): Invert the freezing/reinitialisation (early layers reinitialised, later layers frozen)
    verbose (bool): Tell the user which layers are frozen/not frozen
    """
    reinit_flag = False | invert
    flag_changed = False
    # Layer types to ignore when resetting/freezing
    ignore_layer_set = (torchvision.models.GoogLeNet, torch.nn.Sequential, BasicConv2d, Inception, # Should be ignored since they're not executable
                        torch.nn.MaxPool2d, torch.nn.AdaptiveAvgPool2d, torch.nn.Dropout) # These don't have trainable parameters

    layers = self.googlenet.named_modules()

    for idx, layer_data in enumerate(layers):
        name, layer = layer_data

        if not flag_changed and name == threshold_layer:
            reinit_flag = not reinit_flag
            flag_changed = True

        if not isinstance(layer, ignore_layer_set):
            if not reinit_flag:
                for p in layer.parameters():
                    p.requires_grad = False

                if verbose:
                    print("Layer {} <- Frozen".format(name))
            else:
                if isinstance(layer, torch.nn.BatchNorm2d):
                    torch.nn.init.ones_(layer.weight)
                    torch.nn.init.zeros_(layer.bias)
                    layer.reset_running_stats()
                elif isinstance(layer, ignore_layer_set):
                    pass
                else:
                    torch.nn.init.xavier_uniform_(layer.weight)
                    if layer.bias is not None:
                        torch.nn.init.zeros_(layer.bias)
                if verbose:
                    print("Layer {} <- Reset".format(name))

This code includes the reinitialisation of the unfrozen layers. We set the weights to a random initialisation and reset the biases to 0. The thing that took me a while to recognise and fix is that if we don’t reset the parameters (weights and biases) correctly, the model will consistently fail to learn anything! The one that caught me out is that the weights of the BatchNorm2d layers need to be set to 1 when re-initialising, which I got to in the end!
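To make the BatchNorm2d point concrete, here is a small sketch that puts a layer into a made-up "trained" state and then resets it in the same way as the code above. The helper name reset_batchnorm and the fill values are assumptions for the example. Note that Xavier initialisation is only defined for tensors with at least two dimensions, so it cannot be applied to BatchNorm’s 1-D weight in any case.

```python
import torch

bn = torch.nn.BatchNorm2d(4)

# Pretend this layer has been trained: arbitrary scale, shift and running stats
with torch.no_grad():
    bn.weight.fill_(0.3)
    bn.bias.fill_(-1.0)
    bn.running_mean.fill_(2.0)
    bn.running_var.fill_(5.0)


def reset_batchnorm(layer):
    """Reset a BatchNorm layer: scale to 1, shift to 0, running stats cleared."""
    torch.nn.init.ones_(layer.weight)
    torch.nn.init.zeros_(layer.bias)
    layer.reset_running_stats()  # running_mean -> 0, running_var -> 1


reset_batchnorm(bn)
print(bn.weight.data, bn.bias.data, bn.running_mean, bn.running_var)
```

PyTorch’s batch norm layers also provide a reset_parameters() method which performs the same three steps, so calling layer.reset_parameters() should be an equivalent shortcut.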

Creating a training and validation split for torchvision datasets

2024-04-09T16:04:00+00:00

I’ve had an annoying issue with some of the torchvision datasets in that they don’t split the training and validation data. I was trying to decide on the best solution to this issue today and decided to ask ChatGPT (since this is something we can verify!).

The solution it proposed is below:

trainval_dataset = torchvision.datasets.OxfordIIITPet(
    root='/tmp', 
    split='trainval', 
    download=True, 
    transform=transform
    )
testing_dataset = torchvision.datasets.OxfordIIITPet(
    root='/tmp', 
    split='test', 
    download=True, 
    transform=transform
    )

val_split = 0.2
from sklearn.model_selection import train_test_split
train_indices, val_indices = train_test_split(range(len(trainval_dataset)), test_size=val_split, random_state=8208)
training_dataset = torch.utils.data.Subset(trainval_dataset, train_indices)
validation_dataset = torch.utils.data.Subset(trainval_dataset, val_indices)

train_loader = torch.utils.data.DataLoader(training_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = torch.utils.data.DataLoader(validation_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = torch.utils.data.DataLoader(testing_dataset, batch_size=BATCH_SIZE, shuffle=False)

Which seems to work quite nicely! I like the fact that you can shuffle the indices to create different training and validation sets each time (and can set it with a seed).
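If you’d rather not pull in scikit-learn just for the split, PyTorch’s own torch.utils.data.random_split can do a similar job. This is an alternative to the approach above rather than what was proposed, and the sketch below uses a small TensorDataset of made-up values as a stand-in for OxfordIIITPet so that it runs without downloading anything.

```python
import torch

# Stand-in dataset: 100 fake samples with binary labels
features = torch.arange(100).float().unsqueeze(1)
labels = torch.arange(100) % 2
trainval_dataset = torch.utils.data.TensorDataset(features, labels)

val_split = 0.2
n_val = int(len(trainval_dataset) * val_split)
n_train = len(trainval_dataset) - n_val

# A seeded generator makes the split reproducible
generator = torch.Generator().manual_seed(8208)
training_dataset, validation_dataset = torch.utils.data.random_split(
    trainval_dataset, [n_train, n_val], generator=generator
)

print(len(training_dataset), len(validation_dataset))
```

Both results are Subset objects, so they can be passed to DataLoader in exactly the same way as before.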

Feature Visualisation Part 2 - Regularised (kind of)

2024-01-04T12:30:00+00:00

In the Feature Visualisation Part 1 - Unregularised post, I discussed the unregularised feature visualisation process, which was a good start when trying to understand what a network is learning but often led to high-frequency patterns, which meant little to us as humans. This was especially obvious at the higher layers of the network, where we’re dealing with more grounded concepts such as individual objects (cats, dogs, cars, planes, etc.). At these higher layers, the high-frequency patterns have little interpretable information we can use to understand what the networks are learning and how classification is being performed.

As a way to remedy this, regularisation approaches have been proposed. These fall into three major categories: Frequency penalisation, Transformation robustness and Learned priors.

Regularisation

Frequency penalisation targets the high-frequency noise we saw in part 1 and reduces it, leading to a ‘less busy’ image (for want of a better phrase). This is achieved most simply through Gaussian blurring, where a Gaussian filter is applied to the image at each optimisation step. Unfortunately, this approach also discourages edges from forming, which can reduce the quality of the generated feature visualisations. Alternatively, a total variation loss can be applied, penalising significant changes over neighbouring pixels across all colour channels. In the feature extraction process detailed here, the anisotropic version of total variation is used:

\[TV(\mathbf{I})= \sum_{i,j} |\mathbf{I}_{i+1, j}-\mathbf{I}_{i,j}| + |\mathbf{I}_{i,j+1}-\mathbf{I}_{i,j}|\]

Where $\mathbf{I}$ represents a single channel of the image matrix (i.e. for a colour image with R, G and B channels, we could express $\mathbf{I}$ as $\mathbf{I}_R$, $\mathbf{I}_G$ or $\mathbf{I}_B$). In addition to reducing high frequencies in the image space, we can also reduce them in the gradient space before they accumulate in the visualisation!
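As a small worked example of the formula above, take a single 2×2 channel with values [[0, 1], [2, 4]]. The differences between vertically neighbouring pixels are |2-0| + |4-1| = 5, the differences between horizontally neighbouring pixels are |1-0| + |4-2| = 3, so the total variation is 8. The sketch below checks this; total_variation is a hypothetical helper name used only for this example.

```python
import torch


def total_variation(channel):
    """Anisotropic total variation of a single 2-D channel."""
    diff_rows = channel[1:, :] - channel[:-1, :]   # I[i+1, j] - I[i, j]
    diff_cols = channel[:, 1:] - channel[:, :-1]   # I[i, j+1] - I[i, j]
    return diff_rows.abs().sum() + diff_cols.abs().sum()


channel = torch.tensor([[0.0, 1.0],
                        [2.0, 4.0]])
tv = total_variation(channel)
print(tv)  # tensor(8.)

# A perfectly flat channel has no variation at all
flat_tv = total_variation(torch.full((4, 4), 0.5))
```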

Transformation robustness provides regularisation by randomly jittering, rotating and scaling the optimised image before applying the optimisation step. These transformations shift the high-frequency patterns and noise around during the optimisation process, which lessens their strength, leading to lower frequencies and more structured outputs.

Learned priors attempt to provide regularisation by learning a model of the real data and enforcing it. As an example, a GAN (Generative Adversarial Network) or VAE (Variational Auto-Encoder) can be trained to map an embedding space to images from the dataset, and then as we optimise the image in the embedding space, this will map to an output image which is related to our dataset (note, this doesn’t mean that we can only recover exact images from our dataset, the output space will be continuous, so we have interpolation between images!).

These approaches lead to an interesting debate around the kind of regularisation performed and the aims of the person implementing it. No or weak regularisation (e.g. frequency penalisation/transformation robustness) cannot extract a lot of human interpretable information, focusing mainly on patterns that can include some recognisable structures. On the other hand, strong regularisation (e.g. learned priors) does allow human interpretable visualisations to be produced, but this can result in misleading correlations where the learned priors in GANs or VAEs force the optimised image to vaguely resemble something learned from the dataset, even though the optimised image may not map nicely to that distribution. In situations with humans in the loop, strong regularisation may lead to better results, for instance, if the model needs to be audited to ensure particular features of an image lead to a certain classification. Alternatively, if humans are not needed for a feature visualisation task (which we may expand on soon…), then weak regularisation may be better, reducing the likelihood of generating misleading correlations.

The regularisation approaches used in the Regularised Feature Extraction colab notebook are frequency penalisation and transformation robustness. As such, this leans towards a weaker form of regularisation. This code uses transformations such as jitter, rotation and scaling (see the ModelWrapper class), Gaussian blurring (see ModelWrapper again), and total variation loss for frequency penalisation.

Within the code, we also include another loss that looks at the diversity of the image we are optimising. When performing feature visualisation, there can be many different ways to maximally activate a neuron, each revealing an interesting thing to which the neuron can react. A diversity loss (reminiscent of artistic style transfer) is added to the optimisation objective to account for the diverse ways the neuron can be activated. The diversity loss is calculated as:

\[\mathbf{G}_{i,j} = \sum_{x,y} \text{layer}_n[x,y,i] \cdot \text{layer}_n[x,y,j]\]
\[D = -\sum_a \sum_{b \neq a} \frac{\text{vec}(\mathbf{G}_a) \cdot \text{vec}(\mathbf{G}_b)}{\|\text{vec}(\mathbf{G}_a)\|\|\text{vec}(\mathbf{G}_b)\|}\]

Where $\mathbf{G}$ is the Gram matrix of the channels, and $\mathbf{G}_{i,j}$ is the dot product between the (flattened) responses of filters $i$ and $j$ in convolutional layer $n$, summed over all spatial positions $x,y$. In the code, torch.matmul(flat_activations, torch.transpose(flat_activations, 1, 2)) computes this for every pair of filters at once, giving one channels × channels Gram matrix per image in the batch. $\mathbf{G}_a$ and $\mathbf{G}_b$ are the Gram matrices of two different visualisations $a$ and $b$ in that batch, and $D$ is the negative pairwise cosine similarity between their vectorised forms, summed over all pairs with $a \neq b$.
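To see how the two formulas fit together, here is a minimal sketch that follows them term by term: one Gram matrix per visualisation in the batch, then the negative sum of cosine similarities between every pair of different visualisations. The function name diversity_term is an assumption for the example, and unlike the notebook code it does not divide by the batch size, since the formula above doesn’t.

```python
import torch


def diversity_term(layer_activations):
    """D = -sum_a sum_{b != a} cos(vec(G_a), vec(G_b)) for a batch of activations."""
    batch, channels, _, _ = layer_activations.shape
    flat = layer_activations.reshape(batch, channels, -1)

    # G[a, i, j] = sum over x, y of activation[a, i, x, y] * activation[a, j, x, y]
    grams = flat @ flat.transpose(1, 2)               # (batch, channels, channels)

    # Vectorise each Gram matrix and scale it to unit length
    vecs = torch.nn.functional.normalize(grams.reshape(batch, -1), p=2, dim=1)

    cosine = vecs @ vecs.T                            # (batch, batch) similarities
    off_diagonal = cosine.sum() - cosine.diagonal().sum()
    return -off_diagonal


# Two identical visualisations: cosine similarity is 1 for both ordered pairs
d_same = diversity_term(torch.ones(2, 3, 4, 4))

# Random activations give a value somewhere between -batch*(batch-1) and 0 or above
torch.manual_seed(0)
d_random = diversity_term(torch.rand(4, 3, 8, 8))
```

Identical visualisations give the lowest (worst) value, here -2 for a batch of two, so minimising the total loss pushes the visualisations in the batch apart. With a batch of one there are no pairs and the term is simply 0.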

As mentioned in the frequency penalisation section, we can also reduce the presence of high frequencies in the gradient space. Transforming the gradient space is called preconditioning, which does not change the minima of the objective but does change the parameterisation of the space using a different distance metric, which can alter the route we take to reach a minimum. With a good preconditioner, this can speed up the optimisation process and lead to better minima. The preconditioner suggested by Olah et al. and used in our code performs gradient descent in the Fourier basis, which makes our data decorrelated and whitened. The decorrelation of colour channels allows us to reduce the linear dependence between them, which reduces the redundant information they store, simplifying the optimisation process. The whitening process also removes redundancy and ensures features have a consistent scale, which helps with convergence. Practically, for the feature visualisation method, this means that we define an optimisation image in a Fourier basis, transform it back to pixel space when we pass it to the model to collect the activation values, calculate the losses (total variation, diversity and activation), then update the image in the Fourier basis.
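The "optimise in the Fourier basis, evaluate in pixel space" loop can be sketched in a few lines. This is a stripped-down illustration under several assumptions: it leaves out the frequency scaling and colour decorrelation, uses a tiny 8×8 image, and swaps the network activation for a toy objective (the mean pixel value). The function name to_image is hypothetical.

```python
import torch

torch.manual_seed(0)
h, w = 8, 8

# The thing we optimise: real and imaginary parts of the half-spectrum
spectrum = (torch.randn(1, 3, h, w // 2 + 1, 2) * 0.01).requires_grad_()
opt = torch.optim.Adam([spectrum], lr=0.05)


def to_image(spectrum):
    """Map the Fourier parameters back to a pixel-space image in (0, 1)."""
    complex_spectrum = torch.view_as_complex(spectrum)
    pixels = torch.fft.irfft2(complex_spectrum, s=(h, w), norm="ortho")
    return torch.sigmoid(pixels)


for _ in range(5):
    opt.zero_grad()
    image = to_image(spectrum)   # pixel space: this is what a model would see
    loss = -image.mean()         # toy objective standing in for an activation
    loss.backward()              # gradients flow back through the inverse FFT
    opt.step()                   # the update happens in the Fourier basis
```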

The code

Looking at the colab notebook, it’s very similar to the code from the previous post.

We have some new losses which implement the total variation and diversity regularisation approaches which were described more mathematically above.

class TotalVariationLoss(torch.nn.Module):
    """
    Define a Total Variation loss function for visualisation
    """
    def forward(self, image):
        """
        Overrides the default forward behaviour of torch.nn.Module
        Parameters:
        - image (torch.Tensor): The image tensor to calculate the Total Variation of
        Returns:
        - (torch.Tensor): The Total Variation loss
        """
        # Assert that we have a single image (no batches)
        image = image[0]
        assert len(image.shape) == 3, "Expected single image not batch of dimension: {}".format(image.shape)
        diff_h = image[:, 1:, :] - image[:, :-1, :]
        diff_w = image[:, :, 1:] - image[:, :, :-1]

        tv = torch.sum(torch.abs(diff_h)) + torch.sum(torch.abs(diff_w))
        return tv # return tv (rather than -tv) since we want to minimise variation


class Diversity(torch.nn.Module):
    def forward(self, layer_activations):
        """
        Operating over layer_n[i,x,y] and layer_n[j,x,y] summing over all x,y
        Taken partly from https://github.com/greentfrapp/lucent/blob/dev/lucent/optvis/objectives.py#L319
        """
        batch, channels, _, _ = layer_activations.shape
        flat_activations = layer_activations.view(batch, channels, -1)
        gram_matrices = torch.matmul(flat_activations, torch.transpose(flat_activations, 1, 2))
        gram_matrices = torch.nn.functional.normalize(gram_matrices, p=2, dim=(1,2))
        reward = sum([sum([(gram_matrices[i]*gram_matrices[j]).sum() for j in range(batch) if j != i]) for i in range(batch)])/batch
        return -reward # We aim to maximise the diversity, so return -ve

We also introduce a ModelWrapper class which applies the transformation regularisations and a Gaussian blur to the input before passing the result to the target model.

from torchvision.transforms import v2

class ModelWrapper(torch.nn.Module):
    def __init__(self, model):
        super(ModelWrapper, self).__init__()
        self.model = model
        self.gaussian_blur = lambda mit, it, st: torchvision.transforms.GaussianBlur(kernel_size=5, sigma=(-1/mit * it + 1)*st)

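    def forward(self, x, jit_amt, scale_amt, rot_amt, it, mit, st):
        # Note: jit_amt, scale_amt and rot_amt are accepted but not used below;
        # each RandomAffine samples its own random amount within the fixed ranges.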
        x = v2.Pad(padding=12, fill=(0.5,0.5,0.5))(x)
        x = v2.RandomAffine(degrees=0, translate=(8/128, 8/128))(x)
        x = v2.RandomAffine(degrees=0, scale=(0.95, 1.05))(x)
        x = v2.RandomAffine(degrees=5)(x)
        x = v2.RandomAffine(degrees=0, translate=(4/128, 4/128))(x)
        x = v2.CenterCrop(size=128)(x)
        x = self.gaussian_blur(mit, it, st)(x)

        return self.model(x)

Then we have an entirely new class for the image transformed into a Fourier basis which includes functions to deprocess back to the standard three channel image:

class OptImage():
    """
    An image for optimisation which includes the colour-decorrelated, Fourier
    transformed image.
    Code from:
    https://github.com/greentfrapp/lucent/blob/dev/lucent/optvis/param/spatial.py
    and
    https://github.com/tensorflow/lucid/blob/master/lucid/optvis/param/spatial.py

    """
    def __init__(self, shape, stdev=0.01, decay=1):
        # Create a colour decorrelated, Fourier transformed image
        self.batch, self.ch, self.h, self.w = shape
        freqs = self.rfft2d_freqs(self.h, self.w)
        init_val_size = (self.batch, self.ch) + freqs.shape + (2,) # 2 for the real and imaginary parts of each FFT coefficient

        self.spectrum_mp = torch.randn(*init_val_size) * stdev # This is what we optimise!
        self.spectrum_mp.requires_grad = True # Really important part!

        self.scale = 1/np.maximum(freqs, 1/max(self.h, self.w)) ** decay
        self.scale = torch.tensor(self.scale).float()[None, None, ..., None]


    # Directly from Lucid
    @staticmethod
    def rfft2d_freqs(h, w):
        """Computes 2D spectrum frequencies."""

        fy = np.fft.fftfreq(h)[:, None]
        # when we have an odd input dimension we need to keep one additional
        # frequency and later cut off 1 pixel
        if w % 2 == 1:
            fx = np.fft.fftfreq(w)[: w // 2 + 2]
        else:
            fx = np.fft.fftfreq(w)[: w // 2 + 1]
        return np.sqrt(fx * fx + fy * fy)

    def deprocess(self):
        # Transform colour-decorrelated, Fourier transformed image back to normal
        scaled_spectrum = self.scale*self.spectrum_mp

        if not torch.is_complex(scaled_spectrum):
            scaled_spectrum = torch.view_as_complex(scaled_spectrum)

        image = torch.fft.irfftn(scaled_spectrum, s=(self.h,self.w), norm='ortho')

        image = image[:self.batch, :self.ch, :self.h, :self.w]
        image = image / 4.0 # MAGIC NUMBER

        image = OptImage.undo_decorrelate(image)

        return image

    @staticmethod
    def undo_decorrelate(image):
        # Undo the colour decorrelation
        color_correlation_svd_sqrt = np.asarray(
            [[0.26, 0.09, 0.02],
             [0.27, 0.00, -0.05],
             [0.27, -0.09, 0.03]]).astype("float32")

        max_norm_svd_sqrt = np.max(np.linalg.norm(color_correlation_svd_sqrt, axis=0))
        color_correlation_normalized = color_correlation_svd_sqrt / max_norm_svd_sqrt

        c_last_img = image.permute(0,2,3,1)
        c_last_img = torch.matmul(c_last_img, torch.tensor(color_correlation_normalized.T))
        image = c_last_img.permute(0,3,1,2)
        image = torch.sigmoid(image) # An important part of the decorrelation it seems!
        return image

We then have a hook_visualise function which looks very similar to the old version:

def hook_visualise(model, target, filter, iterations=30, lr=10.0, gauss_strength=0.5, tv_lr=1e-4, opt_type='channel'):
    """
    Visualise the target layer of the model

    Parameters:
    - model (torch.nn.Module): The model to visualise a layer of
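    - target (torch.nn.Module): The target layer to visualise (a forward hook is registered on it)
    - filter (int): Index of the channel/filter to visualise within the target layer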
    - iterations (int, optional): The number of optimisation iterations to run for (default is 30)
    - lr (float, optional):  The learning rate for image updates (default is 10.0)
    - gauss_strength (float, optional): The strength of the Gaussian blur effect (default is 0.5)
    - tv_lr (float, optional): Strength of total variation parameter (default is 1e-4)
    - opt_type (str, optional): The type of optimisation (neuron, channel, layer/dream) (default is 'channel')
    """
    # Set the model to evaluation mode - SUPER IMPORTANT
    model.eval()

    global activation
    activation = None
    def activation_hook(module, input, output):
        global activation
        activation = output

    hook = target.register_forward_hook(activation_hook)

    image_c = OptImage(shape=(1,3,128,128))

    init_image = image_c.deprocess().clone()
    init_image = init_image.detach().squeeze().cpu()
    init_image = init_image.permute(1,2,0)

    # Define the custom loss functions
    loss_fn = VisLoss()
    tv_loss = TotalVariationLoss()
    diversity_reward = Diversity()

    opt = torch.optim.Adam(params=[image_c.spectrum_mp], lr=lr)

    history = {"mean":[], "max":[], "min":[], "loss":[]}
    start_act = None
    end_act = None
    best_act = None
    grad_res = None
    best_loss = np.inf
    best_image = None
    best_it = 0
    rng = np.random.default_rng()

    wrapped_model = ModelWrapper(model)

    max_iterations = iterations
    for it in range(max_iterations):

        opt.zero_grad() # Clear the gradients left over from the previous iteration before computing new ones

        jitter_vals = [x for x in range(-8, 9)]
        rotate_vals = [x for x in range(-5, 6)]
        scale_vals = [0.95, 0.975, 1, 1.025, 1.05]
        j_id = rng.integers(0, len(jitter_vals), 1)[0]
        r_id = rng.integers(0, len(rotate_vals), 1)[0]
        s_id = rng.integers(0, len(scale_vals), 1)[0]

        jit_amt = jitter_vals[j_id]
        rot_amt = rotate_vals[r_id]
        scale_amt = scale_vals[s_id]

        res = wrapped_model(image_c.deprocess(), jit_amt, scale_amt, rot_amt, it, max_iterations, gauss_strength)

        # Index 0 selects the single image in the batch
        if opt_type == 'layer' or opt_type == 'dream':
            act = activation[0, :, :, :] # Layer (DeepDream)
        elif opt_type == 'channel':
            act = activation[0, filter, :, :] # Channel
        elif opt_type == 'neuron':
            # Select the central neuron by default (TODO: Allow this to be overridden)
            nx, ny = activation.shape[2], activation.shape[3]
            act = activation[0, filter, nx//2, ny//2] # Neuron

        tvl = tv_loss(image_c.deprocess())
        div = diversity_reward(activation)
        loss = loss_fn(act) + tv_lr*tvl + div

        loss.backward()
        opt.step()

        if loss < best_loss:
            best_loss = loss.item() # Store a plain float rather than a tensor attached to the graph
            best_image = image_c.deprocess().clone()
            best_act = act.detach().numpy()
            best_it = it+1


        print("Iteration: {}/{} - Loss: {:.3f}".format(it+1, max_iterations, loss.detach()))
        np_act = act.detach().numpy()
        if it == 0:
            start_act = np_act
        if it == max_iterations-1:
            end_act = np_act
        print("ACT - Mean: {:.4f} - STD: {:.4f} - MAX: {:.4f} - MIN: {:.4f}".format(np.mean(np_act), np.std(np_act), np.max(np_act), np.min(np_act)))
        history["mean"].append(np.mean(np_act))
        history["max"].append(np.max(np_act))
        history["min"].append(np.min(np_act))
        history["loss"].append(loss.detach().numpy())

    # optimized_image = image.detach().squeeze().cpu()
    print("Best loss: {} - Iteration: {}".format(best_loss, best_it))
    optimized_image = best_image.detach().squeeze().cpu()
    optimized_image = optimized_image.permute(1,2,0)

    pre_inv = optimized_image.clone()
    optimized_image = torch.clamp(optimized_image, 0, 1)

    pre_inv = torch.clamp(pre_inv * 255, 0, 255).to(torch.int)

    hook.remove() # Remove the hook so subsequent runs don't use the previously registered hook!

    return init_image, history, start_act, best_act, optimized_image, pre_inv

There are a few pieces of code to point out here!

When performing optimisation, we optimise over the Fourier basis image which is specified by opt = torch.optim.Adam(params=[image_c.spectrum_mp], lr=lr). Before using the model we need to wrap it so the transformations can be applied: wrapped_model = ModelWrapper(model).
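The activations themselves are collected with a forward hook, which is easy to overlook in the long function above. Here is a minimal sketch of that mechanism on a small stand-in model; it stores the output in a dictionary instead of a global variable, which is a design choice for this example rather than what the notebook does.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 4, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(4, 2, kernel_size=3, padding=1),
)
model.eval()

captured = {}


def activation_hook(module, inputs, output):
    # Called automatically every time the hooked layer runs its forward pass
    captured["activation"] = output


target = model[0]  # the layer we want to look inside
hook = target.register_forward_hook(activation_hook)

_ = model(torch.rand(1, 3, 16, 16))
first_shape = captured["activation"].shape
print(first_shape)

# Once the hook is removed, later forward passes no longer trigger it
hook.remove()
captured.clear()
_ = model(torch.rand(1, 3, 16, 16))
```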

The transformations need to be defined and applied on each forward pass of the model, so we have the ability to dynamically change any of the transformations as optimisation progresses. This could be an interesting area to explore:

jitter_vals = [x for x in range(-8, 9)]
rotate_vals = [x for x in range(-5, 6)]
scale_vals = [0.95, 0.975, 1, 1.025, 1.05]
j_id = rng.integers(0, len(jitter_vals), 1)[0]
r_id = rng.integers(0, len(rotate_vals), 1)[0]
s_id = rng.integers(0, len(scale_vals), 1)[0]

# Select a random jitter, rotation and scale value
jit_amt = jitter_vals[j_id]
rot_amt = rotate_vals[r_id]
scale_amt = scale_vals[s_id]

res = wrapped_model(image_c.deprocess(), jit_amt, scale_amt, rot_amt, it, max_iterations, gauss_strength)

Feature visualisations

This more advanced method for feature visualisation leads to more complex images, especially at the higher layers.

We start by looking at the low ResNet layer layer2.0.conv3. The feature visualisations we can see here demonstrate clear patterns which show a wider variety of ‘styles’ compared to the non-regularised versions. Comparing the regularised and non-regularised versions (which we can do because they’re the same filters from the same layer), we can see the similarities between the same features. Compared to the last post, these feature visualisations show less noise, more defined patterns and a greater variety of features!

Here are the old un-regularised feature visualisations (ResNet layer2.0.conv3):

Here are the newly (weakly) regularised feature visualisations (ResNet layer2.0.conv3):

If we analyse a low layer of the GoogLeNet architecture (inception3b), just as we did in the un-regularised approach, we see a similar change in the visualised features as described for ResNet. We have less noise, more defined patterns and a greater variety of features once again!

Here are the old un-regularised feature visualisations (GoogLeNet inception3b):

Here are the newly (weakly) regularised feature visualisations (GoogLeNet inception3b):

Visualising higher level layers shows more complete structures with recognisable objects starting to emerge.

ResNet layer 4 convolution layer 3:

GoogLeNet layer 4e:

Just out of interest I also generated some images which maximise individual neurons rather than channels. For GoogLeNet inception5a.branch4[1].conv we get the following, rather cool, images:

Conclusion

I find feature visualisation a fascinating aspect of convolutional neural nets. Of course, the visual aspect is very cool, but the fact that we can gain a deeper understanding of what the network is learning is a very useful and enticing idea. Feature visualisation can be extended with neural circuits, which look at the connections between neurons and the features the neurons can generate and then try to explain the connections between them. An example from Zoom In: An Introduction to Circuits creates circuits where a dog head detector is built from neurons that detect oriented fur, then oriented dog heads, and the oriented dog heads combine to be orientation invariant!

This research has obvious applications to network interpretability. Seeing what a network is learning makes it possible to determine whether features are being extracted in the way we expect, and to make sure that networks are picking up on important discriminating features in a dataset rather than some unexpected property of a certain class of images. The (I believe debunked) parable of a military project detecting tanks from aerial images comes to mind: all images of tanks were taken on a cloudy day and all images of non-tanks on a sunny day, the model performed poorly on new data, and it turned out they had made a sunny vs cloudy day detector! This deeper look into the inner workings of neural nets is important for systems where safety and security are critical! Any information we can extract about how these systems work allows us to be more confident in the system’s abilities and helps us avoid cases of unintended behaviour.

Resources

Again, this Distill article by Olah et al. is fantastic and is what the initial parts of this work were based on: Feature Visualization.

This is the Lucid library which supports Olah’s article with code. This helped my understanding of the topic and translation of the maths to code: Lucid - GitHub.

This is the Lucent library which is the PyTorch translation of Lucid. This also helped me understand some of the processes: Lucent: Lucid library adapted for PyTorch - GitHub.

Feature Visualisation Part 1 - Unregularised

2024-01-03T16:43:00+00:00

Feature visualisation is a really helpful tool that can be used when trying to interpret what a neural network is actually learning under the hood. The process is incredibly simple: we pick a neuron we want to visualise, pass random noise as input to the network, ask the neuron to maximise its activation, and then backpropagate the changes that cause this maximal activation back to the input image. From this, we can effectively ‘see’ what the neuron is looking at (especially if we work with convolution neurons in the image space), which can help us to interpret what the network strongly responds to. In many image-based applications, this often leads to somewhat recognisable structures, which allow researchers to make conclusions about what a particular neuron activates for.

This simple description focuses on un-regularised features, which often suffer from noise and focus on high-frequency patterns that are often not very interpretable to a human observer (however, the patterns do have interesting links to adversarial noise). This problem has a more complex solution, which I won’t discuss in this post. We’ll keep it simple for now and accept these high-frequency patterns as a stepping stone to a more robust and human-interpretable result.
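Before looking at the full notebook code, the basic loop described above can be sketched in a few lines. This is a toy version under stated assumptions: a single randomly initialised convolution stands in for a layer of a trained network, the objective is the mean activation of one channel, and the update is plain gradient ascent on the image.

```python
import torch

torch.manual_seed(0)

# Stand-in for a target layer of a trained network (weights are kept fixed)
layer = torch.nn.Conv2d(3, 4, kernel_size=3, padding=1)
for p in layer.parameters():
    p.requires_grad = False

# Start from a grey image; this is the only thing we optimise
image = (torch.ones(1, 3, 16, 16) * 0.5).requires_grad_()
channel = 2
lr = 1.0

history = []
for _ in range(10):
    activation = layer(image)[0, channel].mean()  # what we want to maximise
    history.append(activation.item())
    activation.backward()                         # backpropagate to the input
    with torch.no_grad():
        image += lr * image.grad                  # gradient ascent on the image
        image.grad.zero_()

print("Start: {:.4f} - End: {:.4f}".format(history[0], history[-1]))
```

The recorded activation should rise from one iteration to the next, which is the same check that display_activations performs visually in the notebook.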

The code

As is often the case in computer science, conceptual simplicity does not always entail implementation simplicity. This is also the case for feature visualisation. The code has been implemented as a colab notebook and aims to replicate the results found in this Keras tutorial, adapting the code to PyTorch.

The code starts out with a number of helper functions. view_available_layers takes a PyTorch model and gives a list of layers that we have access to and can be used for feature visualisation. It’s worth noting that not all layers are suited to feature visualisation, with convolution or batch norm layers often giving the best results.

We also have a function for generating random images (generate_random_image), which creates a completely grey image and adds some randomised noise in the R, G and B channels. This image acts as our original input to the model and is optimised to maximise the activation of the target.

The display_activations function is a check to make sure that the optimisation is leading to increased activation. This essentially creates an image from the activations of the target before and after we optimise the image.

import matplotlib.pyplot as plt
import numpy as np
import torch


def view_available_layers(model):
    """
    Print all available layers for visualisation selection
    Parameters:
        - model (torch.nn.Module): The model to visualise the layer of
    """
    for name, layer in model.named_modules():
        print("{}".format(name))


def generate_random_image(shape=(3,128,128)) -> torch.Tensor:
    """
    Generate a random image with noise with a given number of channels, width and height
    Parameters:
        - shape (tuple of int, optional) - The channels, width and height of the random image (default is (3, 128, 128))
    Returns:
        - image (torch.Tensor) - The image as a torch tensor
    """
    c,w,h = shape

    image = torch.ones(size=(1,c,w,h)) * 0.5
    noise = (torch.rand(size=(1, c, w, h))-0.5) * 0.005 # Was 0.05
    image += noise
    image.requires_grad = True
    return image


def display_activations(start_act, end_act):
    """
    Display the start and end activations
    Parameters:
        - start_act (numpy.ndarray): The activation matrix at the start of the optimisation process
        - end_act (numpy.ndarray): The activation matrix at the end of the optimisation process
    """
    fig, axes = plt.subplots(nrows=1, ncols=2)

    # Normalise the matrices between 0 and 1 for display
    # We need to do this across the two activations so we can see how they
    # change in relation to one another
    # (use new names rather than shadowing the built-in min/max, and build
    # new arrays rather than modifying the caller's arrays in place)
    act_min = min(np.min(start_act), np.min(end_act))
    act_max = max(np.max(start_act), np.max(end_act))
    scale = act_max - act_min + 1e-5

    # Subtracting the shared minimum also works when all activations are positive
    start_act = (start_act - act_min) / scale
    end_act = (end_act - act_min) / scale

    print("NORMALISED :: SA MAX: {} - SA MIN: {} - SA MEAN: {}".format(np.max(start_act), np.min(start_act), np.mean(start_act)))
    print("NORMALISED :: EA MAX: {} - EA MIN: {} - EA MEAN: {}".format(np.max(end_act), np.min(end_act), np.mean(end_act)))

    axes[0].set_title("Start Activations")
    axes[0].imshow(start_act, vmin=0, vmax=1)
    axes[0].axis('off')
    axes[1].set_title("End Activations")
    axes[1].imshow(end_act, vmin=0, vmax=1)
    axes[1].axis('off')
    plt.show()
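
As a rough illustration of the joint normalisation idea used in display_activations, here is a minimal sketch that scales two arrays with a shared minimum and maximum so they stay comparable. The helper name normalise_pair and the example values are made up for illustration and are not part of the notebook.

```python
import numpy as np


def normalise_pair(start_act, end_act, eps=1e-5):
    """Scale two arrays to [0, 1] using one shared min and max (hypothetical helper)."""
    lo = min(start_act.min(), end_act.min())
    hi = max(start_act.max(), end_act.max())
    scale = hi - lo + eps
    # New arrays are returned, so the inputs are left untouched
    return (start_act - lo) / scale, (end_act - lo) / scale


# Made-up activations: the "end" values are larger than the "start" values
start = np.array([[0.0, 1.0], [2.0, 3.0]])
end = np.array([[4.0, 5.0], [6.0, 8.0]])

start_n, end_n = normalise_pair(start, end)
print(start_n.max(), end_n.max())
```

Because both arrays share the same scale, the start activations end up visibly darker than the end activations when plotted, which is the whole point of normalising them together.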

We then have the visualisation loss VisLoss which provides a metric which guides the optimisation. This loss takes the mean of the activation and returns the negative of the result since we aim to maximise it.

class VisLoss(torch.nn.Module):
    """
    Define a new loss function for visualisation
    """
    def forward(self, activation):
        """
        Overrides the default forward behaviour of torch.nn.Module
        Parameters:
            - activation (torch.Tensor): The activation tensor after the network function has been applied to the image
        Returns:
            - (torch.Tensor): The negative mean of the activation tensor (negated because the optimiser minimises the loss)
        """
        return -activation.mean()
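
To see why returning the negative mean works, here’s a minimal sketch in which a weighted sum stands in for a neuron (that stand-in, the weights and the learning rate are assumptions for illustration only). Minimising the negated activation with Adam pushes the activation up.

```python
import torch


class VisLoss(torch.nn.Module):
    """Same idea as the loss above: negate the mean so minimising maximises it."""
    def forward(self, activation):
        return -activation.mean()


# Stand-in "neuron": a fixed weighted sum of a tiny 4-value "image" (assumption)
weights = torch.tensor([1.0, -2.0, 0.5, 3.0])
image = torch.full((4,), 0.5, requires_grad=True)

opt = torch.optim.Adam([image], lr=0.1)
loss_fn = VisLoss()

before = (image * weights).sum().item()
for _ in range(20):
    opt.zero_grad()
    loss = loss_fn((image * weights).sum())
    loss.backward()
    opt.step()
after = (image * weights).sum().item()

print("activation before: {:.3f} - after: {:.3f}".format(before, after))
```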

Next is the visualisation code hook_visualise. This starts by setting the model to eval mode (which is a really important part of the process and something that I missed and spent longer than I care to admit debugging). We then set up and register the forward hook for the activation, which basically sets a listener at the target layer, so when we pass forward through the model, we grab the activation results at the target layer and can process them later. We then define a whole lot of variables, including the optimisation image, the loss function, the optimiser (Adam), and a number of activations and losses to keep track of the best-performing image. Next, the optimisation process starts. As with all optimisation, we start by zeroing the optimiser gradient. Then, we pass the optimisation image to the model (ignoring the results). We check which type of optimisation we want to perform: layer/dream, where we optimise over the entire layer of the network (i.e. all channels of the convolution); channel (the default), where we optimise over a single channel; or neuron, where we optimise over a single neuron only. All images generated for this post were created using the channel option. We then calculate the loss, backpropagate, and take a step in the direction specified by the optimiser. The rest of the code in this function is just bookkeeping, checking the losses to ensure we return the image with the best loss and updating the activation images and losses we store along the way.

def hook_visualise(model, target, filter, iterations=30, lr=10.0, opt_type='channel'):
    """
    Visualise the target layer of the model
    Parameters:
    - model (torch.nn.Module): The model to visualise a layer of
    - target (str): The target layer to visualise
    - filter (int): The target filter/kernel to visualise
    - iterations (int, optional): The number of optimisation iterations to run for (default is 30)
    - lr (float, optional):  The learning rate for image updates (default is 10.0)
    - opt_type (str, optional): The type of optimisation (neuron, channel, layer/dream) (default is 'channel')
    """
    # Set the model to evaluation mode - SUPER IMPORTANT
    model.eval()

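    # Holds the most recent output of the target layer
    activation = None
    def activation_hook(module, input, output):
        nonlocal activation
        activation = output

    # Accept either a layer name (as printed by view_available_layers) or the module itself
    layer = dict(model.named_modules())[target] if isinstance(target, str) else target
    hook = layer.register_forward_hook(activation_hook)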

    # Create the random starting image and keep a copy of it to return later
    image = generate_random_image(shape=(3,128,128))
    image = image.detach()
    image.requires_grad = True
    init_image = image.detach().clone()

    # Define the custom loss function
    loss_fn = VisLoss()

    opt = torch.optim.Adam(params=[image], lr=lr)

    history = {"mean":[], "max":[], "min":[], "loss":[]}
    start_act = None
    end_act = None
    best_act = None
    best_loss = np.inf
    best_image = None
    best_it = 0

    max_iterations = iterations
    for it in range(max_iterations):

        opt.zero_grad()

        _ = model(image)

        if opt_type == 'layer' or opt_type == 'dream':
            act = activation[0, :, :, :] # Layer (DeepDream)
        elif opt_type == 'channel':
            act = activation[0, filter, :, :] # Channel
        elif opt_type == 'neuron':
            # Select the central neuron by default
            nx, ny = activation.shape[2], activation.shape[3]
            act = activation[0, filter, nx//2, ny//2] # Neuron
        else:
            raise ValueError("Unknown opt_type: {}".format(opt_type))

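        loss = loss_fn(act)

        loss.backward()

        # Record the best image before the optimiser step changes it, so the
        # stored image is the one that actually produced this loss
        if loss.item() < best_loss:
            best_loss = loss.item()
            best_image = image.detach().clone()
            best_act = act.detach().numpy()
            best_it = it+1

        opt.step()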


        print("Iteration: {}/{} - Loss: {:.3f}".format(it+1, max_iterations, loss.detach()))
        np_act = act.detach().numpy()
        if it == 0:
            start_act = np_act
        if it == max_iterations-1:
            end_act = np_act
        print("ACT - Mean: {:.4f} - STD: {:.4f} - MAX: {:.4f} - MIN: {:.4f} - Loss: {:.4f}".format(np.mean(np_act), np.std(np_act), np.max(np_act), np.min(np_act), loss))
        history["mean"].append(np.mean(np_act))
        history["max"].append(np.max(np_act))
        history["min"].append(np.min(np_act))
        history["loss"].append(loss.detach().numpy())

    print("Best loss: {} - Iteration: {}".format(best_loss, best_it))
    optimized_image = best_image.detach().squeeze().cpu()
    optimized_image = optimized_image.permute(1,2,0)

    pre_inv = optimized_image.clone()
    optimized_image = torch.clamp(optimized_image, 0, 1)
    pre_inv = torch.clamp(pre_inv * 255, 0, 255).to(torch.int)

    hook.remove() # Remove the hook so subsequent runs don't use the previously registered hook!

    return init_image, history, start_act, best_act, optimized_image, pre_inv
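
The forward hook is the least familiar part of the function above, so here is a stripped-down sketch of just that mechanism. The tiny Sequential model, the layer name "2" and the channel index 5 are stand-ins chosen for illustration; the same calls apply to a real ResNet or GoogLeNet layer.

```python
import torch

# Toy stand-in for a real network (assumption: hooks behave the same on any torch.nn.Module)
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(8, 16, kernel_size=3, padding=1),
)
model.eval()

captured = {}

def activation_hook(module, inputs, output):
    # Called automatically after the hooked module's forward pass
    captured["activation"] = output

# Look the layer up by the name that named_modules() reports
target = dict(model.named_modules())["2"]
hook = target.register_forward_hook(activation_hook)

image = torch.rand(1, 3, 32, 32)
_ = model(image)  # the return value is ignored, the hook stores what we need
activation = captured["activation"]

# The three optimisation targets are just different slices of the activation
layer_act = activation[0]                          # every channel in the layer
channel_act = activation[0, 5]                     # a single channel
nx, ny = activation.shape[2], activation.shape[3]
neuron_act = activation[0, 5, nx // 2, ny // 2]    # the central neuron of that channel

hook.remove()  # stop listening once we are done
print(layer_act.shape, channel_act.shape, neuron_act.shape)
```

Storing the output in a dictionary avoids the need for a global variable; this is a design choice for the sketch rather than what the notebook does.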

The code in the colab notebook focuses on two architectures, ResNet-50 and GoogLeNet, but can be adapted to any architecture.

Feature visualisations

The following image strongly activates ResNet at the convolution layer layer2.0.conv3, filter 0. The corresponding activation change can be seen below it!

As these images show, the feature visualisation process is able to cause increased activation at the convolution filter specified, which results in a patterned image.

If we expand this to look at the first 64 features, we get the following images which are reminiscent of the Keras tutorial referenced earlier.

If we look at the images produced by GoogLeNet, we can see the features extracted by this method. First, the Inception 3B layer focusing on filter 0 shows:

With the activation images:

Expanding this to the first 64 filters of 3B leads to these features:

Within the features above we can see some grey images where the optimisation process has failed to find a good direction to move in, with some features having blocks of grey which is an interesting effect! These failures can be addressed with more complex regularised feature visualisation processes.

The extracted features above focus on fairly early layers from ResNet and GoogLeNet. What about later (higher level) layers?

The ResNet layer 4 convolution 3 target produces the feature visualisations shown below.

The GoogLeNet layer 4e target produces the ones shown here.

In both instances we see high frequency noisy patterns which don’t really resemble anything that we as humans would be able to identify in relation to the classification task (e.g. there are no images which reference cats, dogs, cars or planes). At the later stages of the networks, which we get these features from, we’d expect to see something at least vaguely recognisable.

This approach to feature visualisation is a good start when we’re trying to determine what networks ‘see’ and gain some insight into the classification process. However, the high frequency patterns leave a lot to be desired when we’re trying to understand the behaviour of a network and how classifications are produced (especially at the higher layers of the network). This leads us to regularised feature visualisation approaches which generate insights which are more human interpretable. Details on regularised approaches can be found in the next post!

In addition, this approach to feature visualisation focuses on individual neurons/channels, and since neural networks consist of many hundreds of thousands/millions of neurons, we only get a small slice of the information. These feature visualisations are also produced with no dependency on other neurons (each neuron in each layer is maximised in isolation), and since neural networks are incredibly connected structures, this may not give the best indication of the relationship of a neuron to others in the network structure. To expand on this feature visualisation technique, neural circuits are used; however, that’s another story.

Resources

The Keras implementation can be found here.

The post by Olah et al. which started all of this (and is a really useful resource for all things interpretability based) can be found here.

Natural adversaries and out-of-distribution detection

2021-10-21T17:16:04+00:00

The research space of adversarial examples is one which is quite counterintuitive (we can easily change the result of a classifier by making invisible changes to an image) so I’m always looking for nice ways to conceptualise what’s going on. I encountered a nice explanation of a potential feature of adversarial examples in a paper by Smith et al. [1]:

Such examples lie off the manifold of natural examples, occupying regions where the model makes unconstrained extrapolations.

This statement (if accurate) is a really interesting one if we want to try and detect adversarial examples. We would be able to add an Out-of-Distribution (OoD) detection step to our system which will flag inputs which lie off the manifold of natural data either because the network was not trained to handle that input data, or because we’ve encountered an adversary (both of which are situations where we should ignore the predicted class). The idea behind OoD is that if we train a Deep Neural Network (DNN) to perform a particular task such as classification we need to provide the DNN with data and the corresponding labels of that data. During the training process we pass this data to the DNN and it builds a mapping from an input space to an output space. This input space has a particular geometry to it (approaching this from a probabilistic stand-point, the input data follows a distribution) which we can learn. If an input is provided to us which is far away from this input geometry (it does not belong to the same distribution) and we can detect this, we can return an error or some warning of uncertainty around the results we provide.

A common method of OoD detection is a Bayesian Neural Network (BNN) which is similar to a standard DNN except we place distributions over the weights and biases, meaning the network is no longer deterministic. To detect OoD data we can query the BNN with the same example multiple times and determine the deviation of the results. The amount of deviation gives us an idea of the uncertainty of the BNN in its classifications and can act as a proxy for how far OoD the input is. This is a very popular approach to OoD detection and has shown some good results in detecting adversarial examples in Smith et al. [1].

This concept of OoD detection pairs nicely with a second paper I found which presents “Natural Adversaries” by Zhao et al. [2]. Natural Adversaries (NAs) aim to create adversarial examples which are more natural to humans. Usually standard attacks such as FGSM [3] or Carlini & Wagner (C&W) [4] create adversaries which have an unnatural perturbation. The pixels that are altered can be from any part of the image which often leads to noticeable patterns emerging, or sharp changes in pixel values. The approach NAs take is to make semantic perturbations which appear more natural (e.g. increasing the thickness of a particular line in the image). An example can be seen in the image below which shows that the perturbation is localised to salient parts of the image.

The image below shows the difference between the perturbations of all of the adversarial attacks which highlights the focus of NAs on the important parts of the image.

Due to the proposed naturalness of the image, a question we can ask is:

Is a Natural Adversary out-of-distribution?

This forms the first part of this exploration.

Are Natural Adversaries OoD?

First, we need to discuss the process of generating NAs (which also raises a host of other interesting questions, but one at a time).

Generating NAs

The NA method is a black-box adversarial generation method (it does not require any access to target model internals, such as weights, biases, training method etc.). All we need is to know what task it performs (which is fairly easy if you have a working system “in the wild”) and then create a dataset which we imagine would be effective at training the model to do that particular job.

As usual, we have a target (black-box) classifier $f$ we want to fool and a dataset $X$ of unlabelled data. The goal is to generate an adversarial input $x^ * $ from a clean image $x$ which causes a misclassification $f(x) \neq f(x^ * )$. In practice it is not necessary that $x \in X$ since this is not the case for inputs when the model is deployed, but we do assume that we’re working with some underlying distribution $\mathcal{P}_x$ that $x$ is sampled from ($x \sim \mathcal{P}_x$). The key part of producing natural adversaries is that we want $x^ * $ to be as close to $x$ as possible in terms of the manifold that defines the data distribution $\mathcal{P}_x$ rather than the original data representation. This also avoids dodgy metrics that are often associated with image similarity, such as $L_2$-norms.

Traditional approaches to adversarial attacks focus on searching for adversaries directly in the input space. Zhao’s method instead searches in a corresponding dense representation, the $z$ space. Therefore, rather than finding an adversary in the input space (giving us $x^ * $ directly) an adversarial $z^ * $ is found in an underlying dense vector space which defines the distribution $\mathcal{P}_x$. This is then mapped back to an image $x^ * $ with a generative model. By using the latent low-dimensional $z$ space adversaries are encouraged to be valid and semantically close to the original image since they stay close to the underlying distribution.

Powerful generative models are required to learn a mapping from a latent low-dimensional representation to the distribution $\mathcal{P}_ x$ which is estimated using samples from $X$. Using a large amount of unlabelled data from $X$ as training data, a generator $\mathcal{G}_ \theta$ learns to map noise with distribution $p_z(z)$ (where $z \in \mathbb{R}^d$) to synthetic data which is as close to the training data as possible. A critic $\mathcal{C}_ {\omega}$ is also trained to discriminate the output of $\mathcal{G}_ \theta$ from the true data of $X$.

The objective function for this GAN (refined using the Wasserstein-1 distance making it a WGAN) is defined as:

\[\underset{\theta}{\text{min}}\; \underset{\omega}{\text{max}} \; \mathbb{E}_{x \sim p_x(x)}[\mathcal{C}_ \omega(x)] - \mathbb{E}_{z \sim p_z(z)}[\mathcal{C}_ \omega(\mathcal{G}_ \theta(z))]\]
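
To make the min/max structure of this objective a little more concrete, here is a minimal sketch that evaluates it from critic scores. The scores are made-up numbers and the function names are mine; a real WGAN also needs a Lipschitz constraint on the critic (e.g. weight clipping or a gradient penalty), which is not shown here.

```python
import numpy as np


def wgan_value(critic_real, critic_fake):
    """E[C(x)] - E[C(G(z))], estimated from batches of critic scores."""
    return np.mean(critic_real) - np.mean(critic_fake)


def critic_loss(critic_real, critic_fake):
    # The critic maximises the value, so it minimises the negative
    return -wgan_value(critic_real, critic_fake)


def generator_loss(critic_fake):
    # Only the second term depends on the generator, so minimising the
    # value is the same as minimising -E[C(G(z))]
    return -np.mean(critic_fake)


# Made-up critic scores for a batch of real and generated samples
real_scores = np.array([1.0, 2.0, 3.0])
fake_scores = np.array([0.0, -1.0, 1.0])

print(wgan_value(real_scores, fake_scores))
print(critic_loss(real_scores, fake_scores), generator_loss(fake_scores))
```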

To represent natural instances of the domain a WGAN is trained on a dataset $X$ which gives us a generator $\mathcal{G}_ \theta$ which maps random dense vectors $z \in \mathbb{R}^d$ to samples $x$ from domain $X$. A matching inverter $\mathcal{I}_ \gamma$ is used to map data instances to corresponding dense representations. We can informally think of these as $\mathcal{G}_ \theta : Z \to X$ and $\mathcal{I}_ \gamma : X \to Z$ where $X$ is the input space and $Z$ is the latent space. The reconstruction error of $x$ is minimised and we also minimise the divergence between sampled $z$ and $\mathcal{I}_ \gamma(\mathcal{G}_ \theta(z))$ to encourage the latent space to be normally distributed. This is described with the following equation:

\[\underset{\gamma}{\text{min}}\; \mathbb{E}_{x \sim p_x(x)} \| \mathcal{G}_ \theta(\mathcal{I}_ \gamma(x))-x \| + \lambda \cdot \mathbb{E}_{z \sim p_z(z)}[\mathcal{L}(z, \mathcal{I}_ \gamma(\mathcal{G}_ \theta(z)))]\]

For images, the divergence $\mathcal{L}$ is the $L_2$-norm, and the constant $\lambda$ is set to 0.1. The aim here is to change $\gamma$ to minimise the sum of the expected values. The first part ($\mathbb{E}_ {x \sim p_x(x)} \| \mathcal{G}_ \theta(\mathcal{I}_ \gamma(x))-x \|$) is the difference between the input $x$ projected to the latent space, then projected back to the input space, and the actual input. The second part ($\lambda \cdot \mathbb{E}_ {z \sim p_z(z)}[\mathcal{L}(z, \mathcal{I}_ \gamma(\mathcal{G}_ \theta(z)))]$) is the weighted expectation of the divergence between a point in the latent space and the result of projecting that point to the input space, then back to the latent space. In essence, we’re minimising the differences between the two projection directions.
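
As a sanity check on the two terms, here is a toy sketch of the inverter objective. Everything in it is an assumption made for illustration: the generator is a fixed linear map, the inverter is another linear map, and both norms are taken to be $L_2$. With the pseudo-inverse as the inverter both terms vanish, while a useless inverter is penalised by both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear generator: latent dimension 2, input dimension 3 (assumption)
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])


def G(z):
    return z @ A


def make_inverter(B):
    # An inverter here is just x -> x @ B for a 3x2 matrix B
    return lambda x: x @ B


def inverter_loss(I, x_batch, z_batch, lam=0.1):
    # First term: input -> latent -> input reconstruction error
    recon = np.linalg.norm(G(I(x_batch)) - x_batch, axis=1).mean()
    # Second term: latent -> input -> latent divergence, weighted by lambda
    latent = np.linalg.norm(z_batch - I(G(z_batch)), axis=1).mean()
    return recon + lam * latent


z_batch = rng.normal(size=(64, 2))
x_batch = G(rng.normal(size=(64, 2)))  # "real" data lying on the generator's manifold

good = make_inverter(np.linalg.pinv(A))  # exact inverse on the manifold
bad = make_inverter(np.zeros((3, 2)))    # maps everything to the origin

good_loss = inverter_loss(good, x_batch, z_batch)
bad_loss = inverter_loss(bad, x_batch, z_batch)
print(good_loss, bad_loss)
```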

With the learned functions $\mathcal{I}_ \gamma$ and $\mathcal{G}_ \theta$ a natural adversarial example $x^ * $ is defined as:

\[x^ * = \mathcal{G}_ \theta(z^ * ) \text{ where } z^ * = \underset{\tilde{z}}{\text{argmin}}\; \| \tilde{z} - \mathcal{I}_ \gamma(x) \| \text{ s.t. } f(\mathcal{G}_ \theta(\tilde{z})) \neq f(x)\]

The difference with traditional adversarial generation techniques is that for this method, the perturbation is performed in the latent space of the input, then projected back into the input space to check if it successfully fools the classifier.

A step-by-step guide:

1. Project the input into the latent space: $z' = \mathcal{I}_ \gamma (x)$
2. Apply perturbations to $z'$ giving us $\tilde{z}$, which aims to generate an adversarial result
3. Project the perturbed $\tilde{z}$ onto the input space: $\tilde{x} = \mathcal{G}_ \theta(\tilde{z})$
4. Check if it fools the classifier: $f(\tilde{x}) \neq f(x)$
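
The steps above can be sketched with a toy setup. This is a simplified random search in the spirit of the method, not the exact search procedure from [2]: the linear generator, the pseudo-inverse inverter, the sign-based classifier and all parameter values are assumptions chosen so the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): linear generator, pseudo-inverse inverter
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
A_pinv = np.linalg.pinv(A)


def G(z):
    return z @ A            # latent space -> input space


def I(x):
    return x @ A_pinv       # input space -> latent space


def f(x):
    return int(x[0] > 0)    # black-box classifier, we only see its output


def natural_adversary(x, n_samples=200, r_step=0.1, max_r=5.0):
    z0 = I(x)                                   # step 1: project into latent space
    label = f(x)
    r = 0.0
    while r < max_r:
        # step 2: sample perturbed latent vectors at distance (r, r + r_step) from z0
        directions = rng.normal(size=(n_samples, z0.size))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = rng.uniform(r, r + r_step, size=(n_samples, 1))
        candidates = z0 + directions * radii
        # steps 3 and 4: map back to input space and query the classifier
        fooled = [z for z in candidates if f(G(z)) != label]
        if fooled:
            z_star = min(fooled, key=lambda z: np.linalg.norm(z - z0))
            return G(z_star), z_star
        r += r_step                              # nothing found, widen the search
    return None, None


x = G(np.array([1.0, 0.5]))
x_adv, z_star = natural_adversary(x)
print(f(x), f(x_adv), np.linalg.norm(z_star - I(x)))
```

Widening the search radius only when nothing is found means the first adversary returned is close (in latent space) to the original input, which is the property the argmin in the definition asks for.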

OoD Detection using BNNs

BNNs allow us to analyse two different types of uncertainty in classification, epistemic (uncertainty because of a lack of knowledge) and aleatoric (uncertainty because of natural noise in the data). For OoD detection we’re more concerned with epistemic uncertainty which can act as a proxy for the distance to the natural data manifold. Since epistemic uncertainty is focused on the information the model does not know, we can easily see that since the natural data manifold is built from the information the model is trained on (and therefore has knowledge of) the epistemic uncertainty can be used for OoD detection.

The two measures used for epistemic uncertainty are Mutual Information (MI) and Softmax Variance (SMV). Both can be used as a proxy for the distance from the natural data manifold. The details of these particular measures can be seen in [1]. I won’t go into them here since they get a bit technical, but all we need to know is that higher MI and SMV mean images are further OoD.
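
For readers who want something concrete anyway, here is a rough sketch of how the two measures can be computed from repeated queries. It assumes the commonly used definitions (MI as the entropy of the mean prediction minus the mean entropy of the individual predictions; SMV as the variance of the softmax outputs across queries, averaged over classes), so check [1] for the exact formulation. The softmax outputs are made up.

```python
import numpy as np


def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)


def mutual_information(probs):
    """probs has shape (num_queries, num_classes), one softmax output per query."""
    mean_p = probs.mean(axis=0)
    return entropy(mean_p) - entropy(probs).mean()


def softmax_variance(probs):
    # Variance across queries for each class, then averaged over classes
    return probs.var(axis=0).mean()


# Made-up outputs for 100 queries of the same input
agree = np.tile([0.9, 0.05, 0.05], (100, 1))                      # every query agrees
disagree = np.array([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]] * 50)  # queries flip between classes

print(mutual_information(agree), softmax_variance(agree))
print(mutual_information(disagree), softmax_variance(disagree))
```

When every query agrees both measures are (numerically) zero; when the queries disagree both rise, which is the behaviour we rely on for OoD detection.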

Are NAs OoD?

Since the generated adversaries aim to be as close to the natural distribution as possible, it’s interesting to consider whether or not NAs are OoD. From some experimentation I’ve carried out, it appears that they are (see the tables below). We pass a clean set of images to a BNN followed by the set of images after having an FGSM attack applied to them, then C&W, then finally the NA attack. The MI and SMV were calculated from the results of 100 queries of the BNN.

	Clean	FGSM	C&W	NA
Accuracy	98.0%	9.0%	5.0%	5.0%

	Clean	FGSM	C&W	NA
Mean MI	0.0245	0.0972	0.0385	0.203
Mean SMV	6.80e-4	3.09e-3	1.00e-3	6.41e-3
Relative MI	0.695	2.76	1.09	5.76
Relative SMV	0.670	3.04	0.983	6.31

This table shows the mean MI and SMV for a number of different adversarial attacks and clean images. Also included is the relative MI and SMV using a separate set of clean images as a baseline. This shows that the MI and SMV of the NA method are far higher than those of the other adversarial attacks (even FGSM), which indicates that this method isn’t particularly effective at creating adversaries which remain in distribution.

The reason for this could be that, when we create a NA we’re using a continuous dense vector space to produce the perturbation (which can be seen through the images of interpolation between two classes, see image below). The image actually starts to look more like the class we target since we can only change the semantic parts of the image. This means that when we apply a perturbation we’re naturally going to be somewhere in between classes which could cause the increased uncertainty which leads to the OoD detection by BNNs.

Correcting Adversaries

The architecture of the NA method is an interesting one. We have created a GAN which allows us to map from an input space to a dense vector space and back again. Can we do anything else with it? We can reconstruct images fairly well and if we take a random vector from the embedding space and map it to the input space we end up with nothing recognisable (which means we have a well defined space, see image below) so what happens if we feed adversaries into it? Can we avoid adversarial attacks by passing them through the GAN, get a reconstruction and perform the classification on that reconstruction?

The next question is:

Can we use the architecture of the NA to correct adversaries?

We start by creating a number of perturbed images (1000) using FGSM and C&W then for each of the images in these sets we map them to the dense vector space then back again. The results are in the table below.
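
The reconstruction step described above can be sketched with a toy example. The generator, inverter and classifier here are hand-written stand-ins (assumptions for illustration, not the trained networks used for the experiments): natural data lives on the x-axis, and the classifier is overly sensitive to the off-manifold direction.

```python
import numpy as np


def G(z):
    # Toy generator: the "natural" data lives on the x-axis
    return np.array([z, 0.0])


def I(x):
    # Toy inverter: keep only the on-manifold coordinate
    return x[0]


def f(x):
    # Toy classifier that is (wrongly) sensitive to the off-manifold direction
    return int(x[0] + 3.0 * x[1] > 0)


def classify_with_reconstruction(x):
    # Map to the latent space and back, then classify the reconstruction
    return f(G(I(x)))


x_clean = np.array([1.0, 0.0])
x_adv = x_clean + np.array([0.0, -0.5])   # perturbation that leaves the manifold
x_shift = np.array([-0.2, 0.0])           # perturbation that moves along the manifold

print(f(x_clean), f(x_adv), classify_with_reconstruction(x_adv))
print(f(x_shift), classify_with_reconstruction(x_shift))
```

In this toy setting the off-manifold perturbation is removed by the round trip, while a change that moves along the manifold survives it untouched, so reconstruction can only help with the former.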

	Clean	FGSM	C&W	Clean Reconstructed	FGSM Reconstructed	C&W Reconstructed
Accuracy	97.0%	11.0%	2.0%	94.0%	79.0%	93.0%

If we look at the FGSM results, we can see that the reconstruction method improves the accuracy of the model by 68 percentage points (from 11.0% to 79.0%), which is a huge improvement. Even more surprisingly, the more advanced C&W attack sees an increase of 91 percentage points (from 2.0% to 93.0%), which is a massive gain. Examples of the clean images, adversaries and the reconstructions can be seen in the image below.

I found the results in the table above quite surprising. FGSM is the simpler adversarial attack which applies more obvious perturbations to an image than C&W, so why is the reconstruction accuracy of FGSM considerably lower than C&W? I think the answer can be found by looking back to the OoD tests we performed. The MI and SMV of C&W were much lower than those of FGSM, which means it is closer to the natural data manifold. This means that when reconstructing C&W images they’re closer to the natural manifold and their original position on it, which is likely to be close to the true class, resulting in a smaller adversarial perturbation and an easier reconstruction (projection back to the natural manifold). On the other hand, FGSM has a higher MI and SMV which means it’s further from the natural data manifold and so projection back to the manifold may result in the image being in a different classification region anyway. Evidence for this can be seen in the image above which shows that the reconstructed image does not necessarily have the same class as the original adversary.

This is backed up by further results which include adversaries generated by the NA method, which lie far from the natural data manifold and are therefore further OoD.

	Clean	NA	NA Reconstructed
Accuracy	97.0%	3.0%	27.0%

This result shows that the reconstruction of NA perturbed images leads to a small improvement, but nothing close to the improvements seen from FGSM or C&W which indicates that the further OoD an image is, the more difficult it is to correct using reconstruction methods. Essentially the perturbed image is so far from the natural data manifold it is a challenge to map it back correctly! Examples of these reconstructions can be seen below.

[1] Smith L, Gal Y (2018) - Understanding Measures of Uncertainty for Adversarial Example Detection

[2] Zhao Z, Dua D, Singh S (2017) - Generating Natural Adversarial Examples

[3] Goodfellow I, Shlens J, Szegedy C (2015) - Explaining and Harnessing Adversarial Examples

[4] Carlini N, Wagner D (2017) - Towards Evaluating the Robustness of Neural Networks