
Cosine Similarity Distillation: Teacher-Free Knowledge Transfer via Random Projection Fingerprints

Knowledge distillation is a well-established technique where a small student model learns from a larger teacher model. The standard approach (Hinton et al., 2015) requires the teacher to be loaded in memory and forwarded for every batch during student training. For large models, this is expensive.

Cosine Similarity Distillation (CSD) is a method I developed that completely removes the need for live teacher forwarding. Instead, the teacher's knowledge is compressed into compact "fingerprints" once, and the student learns from these fingerprints during training. This writeup covers the idea, the implementation, and the results.

The Problem with Standard Distillation

In standard KD, the student loss combines the classification loss with a distillation loss that measures how closely the student's logits match the teacher's. The teacher must therefore be forwarded alongside the student for every batch. For a teacher with billions of parameters, the teacher's weights and forward pass, not the student's, end up dominating the memory and compute cost of training.
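For reference, the usual form of this loss (temperature-softened KL divergence plus hard-label cross-entropy) can be sketched in NumPy as follows. The temperature T = 4 and weight alpha = 0.9 are illustrative assumptions, not values used in this writeup, and the sketch only computes the loss value, not gradients.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the class axis, shifted for stability.
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # T and alpha are assumed, illustrative hyperparameters.
    # Hard-label cross-entropy on the student's own predictions.
    p_s = softmax(student_logits)
    ce = -np.log(p_s[np.arange(len(labels)), labels]).mean()
    # KL(teacher || student) on softened distributions, scaled by T^2.
    p_t = softmax(teacher_logits, T)
    p_s_T = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t) - np.log(p_s_T))).sum(axis=1).mean()
    return (1 - alpha) * ce + alpha * T**2 * kl
```

Note that teacher_logits is an input to every call: this is the per-batch teacher forward that CSD is designed to avoid.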

Methods like FitNet additionally require the teacher's intermediate features at multiple layers, which raises the cost again. The central question I wanted to answer was: can we capture the teacher's knowledge in a compact, precomputed form that the student can learn from without the teacher ever being loaded again?

The CSD Method

CSD works in three phases:

  1. Fingerprint precomputation. The teacher processes the training dataset once. At a chosen intermediate layer, we extract feature vectors, apply random projections through a frozen random matrix, and store the resulting fingerprint vectors. This is done once, and the teacher is never loaded again.
  2. Student training. The student is trained with a combined loss: standard cross-entropy plus a cosine similarity loss between the student's fingerprints and the precomputed teacher fingerprints.
  3. Inference. The student is deployed standalone. The teacher and fingerprints are no longer needed.

The random projection matrix R is a frozen (64 × r) matrix with L2-normalized columns, where 64 is the width of the pooled layer-3 features and r is the fingerprint dimension. It acts as a shared coordinate system that both the teacher and student project into. The cosine similarity loss pushes the student's projections to align directionally with the teacher's.

R     = randn(64, r), columns L2-normalized                            (frozen, no gradients)
phi_T = mean_{aug views}(normalize(pool(layer3_T(x_aug)), dim=1) @ R)  (precomputed once)
phi_S = normalize(pool(layer3_S(x_aug)), dim=1) @ R                    (per batch)
Loss  = CE(logits, labels) + lambda * (1 - cos_sim(phi_S, phi_T))
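
As a minimal NumPy sketch of the formulas above (not the actual training code): r = 128 and lambda = 1.0 are assumed values that this writeup does not state, and random arrays stand in for the pooled layer-3 features of the two networks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 128                      # d is from the text; r = 128 is an assumption

# Frozen random projection with unit-norm columns, shared by teacher and student.
R = rng.standard_normal((d, r))
R /= np.linalg.norm(R, axis=0, keepdims=True)

def fingerprint(pooled):
    # pooled: (batch, d) globally pooled features from the chosen layer.
    unit = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
    return unit @ R                 # (batch, r)

def csd_term(phi_s, phi_t):
    # 1 - cosine similarity between matching rows, averaged over the batch.
    num = (phi_s * phi_t).sum(axis=1)
    den = np.linalg.norm(phi_s, axis=1) * np.linalg.norm(phi_t, axis=1)
    return float((1.0 - num / den).mean())

# Stand-in features; real ones would come from the teacher and the student.
teacher_feat = rng.standard_normal((4, d))
student_feat = rng.standard_normal((4, d))
phi_T = fingerprint(teacher_feat)   # precomputed once and stored
phi_S = fingerprint(student_feat)   # recomputed for every batch

lam = 1.0                           # assumed weight for the fingerprint term
ce = 0.0                            # placeholder for the cross-entropy term
loss = ce + lam * csd_term(phi_S, phi_T)
```

Because the features are normalized before projection and the loss uses cosine similarity, only the direction of the features matters; rescaling the teacher's activations leaves the target unchanged.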

Storage Efficiency

The key advantage of CSD is storage. For CIFAR-100 with a ResNet-56 teacher (~3.29 MB), per-class fingerprints are as small as 50 KB -- a 67x reduction. This means the teacher's knowledge can be distributed alongside the student with negligible overhead, or even regenerated on the fly from the compact fingerprints.

Per-sample fingerprints are much larger (24 MB for 50,000 samples), which is more than the ~3.29 MB teacher itself, so they remove the teacher forward passes but not the storage cost. In practice, per-class fingerprints work nearly as well and are more practical.
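
The sizes quoted above can be checked with a few lines of arithmetic. The fingerprint dimension is not stated here; r = 128 with float32 storage is an assumption, chosen because it reproduces both the 50 KB and the 24 MB figures.

```python
BYTES_PER_FLOAT32 = 4
r = 128                 # assumed fingerprint dimension, inferred from the quoted sizes
num_classes = 100       # CIFAR-100
num_samples = 50_000    # CIFAR-100 training set

per_class_bytes = num_classes * r * BYTES_PER_FLOAT32
per_sample_bytes = num_samples * r * BYTES_PER_FLOAT32

per_class_kib = per_class_bytes / 1024
per_sample_mib = per_sample_bytes / 1024**2
teacher_kib = 3.29 * 1024            # teacher size quoted in the text
ratio = teacher_kib / per_class_kib

print(f"per-class fingerprints:  {per_class_kib:.0f} KiB")    # 50 KiB
print(f"per-sample fingerprints: {per_sample_mib:.1f} MiB")   # 24.4 MiB
print(f"teacher / per-class:     {ratio:.0f}x")               # 67x
```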

Augmentation-Aware Fingerprints

An important detail: the teacher generates fingerprints from the same data augmentations (RandomCrop + RandomHorizontalFlip) that the student sees during training, averaged over multiple augmented views per image. This ensures the student is learning to match a fingerprint target it can actually reach under augmentation, rather than a static target computed from unaugmented images.
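
The averaging step might look roughly like the NumPy sketch below. Several pieces are assumptions for illustration only: the crop padding of 4, the 8 views per image, r = 128, and teacher_features, which is a hypothetical stand-in (a fixed random linear map) for the teacher's pooled layer-3 output.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, views = 64, 128, 8            # r and views are assumed values

R = rng.standard_normal((d, r))
R /= np.linalg.norm(R, axis=0, keepdims=True)

# Hypothetical stand-in for the teacher; a real run would call the trained network.
W_stub = rng.standard_normal((d, 3 * 32 * 32))

def teacher_features(img):
    return W_stub @ img.ravel()     # (d,)

def augment(img, pad=4):
    # img: (C, H, W). Random crop after zero padding, then a random horizontal flip.
    c, h, w = img.shape
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)))
    top, left = rng.integers(0, 2 * pad + 1, size=2)
    out = padded[:, top:top + h, left:left + w]
    if rng.random() < 0.5:
        out = out[:, :, ::-1]
    return out

def teacher_fingerprint(img, n_views=views):
    # Average the projected, normalized features over several augmented views.
    prints = []
    for _ in range(n_views):
        f = teacher_features(augment(img))
        prints.append((f / np.linalg.norm(f)) @ R)
    return np.mean(prints, axis=0)  # (r,)

image = rng.random((3, 32, 32))
phi_T = teacher_fingerprint(image)
```

The averaged vector is what gets stored; the student then sees one fresh augmented view per step and is pulled toward this mean.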

Results on CIFAR-100

I compared CSD against standard KD and FitNet on CIFAR-100 with a ResNet-56 teacher and ResNet-20 student:

CSD achieves roughly the same accuracy gain as KD, but with fingerprints 67x smaller than the teacher and a single offline precomputation pass of the teacher over the training data instead of 78,200 teacher forward passes during training. The exact numbers will be published in the paper, but the trend is consistent across multiple runs.
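
The training schedule behind the 78,200 figure is not spelled out here. One schedule that reproduces it, assuming one teacher forward per training batch, a batch size of 128, and 200 epochs (all three are guesses), is:

```python
import math

num_samples = 50_000   # CIFAR-100 training set
batch_size = 128       # assumed
epochs = 200           # assumed

batches_per_epoch = math.ceil(num_samples / batch_size)   # 391
kd_teacher_forwards = batches_per_epoch * epochs          # 78,200

print(batches_per_epoch, kd_teacher_forwards)
```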

Privacy Implications

An interesting side effect of the random projection approach is that the fingerprints are hard to invert without the projection matrix. The student needs R during training, but once training is finished R can be discarded, and reconstructing the original teacher features from the fingerprints alone then becomes impractical. This makes CSD useful for privacy-sensitive applications where sharing a live teacher model is not desirable.

What's Next

I'm currently working on extending CSD to larger architectures (ImageNet-scale models) and exploring whether the same approach can be applied to transformer attention maps. The paper is a work in progress, and the full implementation is available on GitHub.
