1 Feb 2019 • on scala, java, data structures, phantom types

Phantom Type Safe Map

In this post I’ll take a look at the type-safe map pattern, and how it is used in real world open source libraries. Then I’ll dive into how we can add phantom types to actually track at compile time which values are present in the map.

A Phantom Map?

The Type-Safe Map

I’ve ran into a “type-safe Map” pattern in Java that looks something like this:

abstract class Key<V> {}

public class TypesafeMap {
  private Map<Class<?>, Object> underlying =
    new HashMap<Class<?>, Object>();

  public <V> void set(Class<? extends Key<V>> key, V value) {
    underlying.put(key, value);
  }

  public <V> V get(Class<? extends Key<V>> key) {
    return (V) underlying.get(key);
  }
}

The map is really a Map[Class[_], Any] but the keys tell you what type the value is, so we can do “type-safe” get and set operations while hiding the underlying casting.

An immutable version in Scala might look something like this:

abstract class Key[V] {}

case class TypesafeMap(underlying: Map[Key[_], Any] = Map()) {
  def get[V](key: Key[V]): Option[V] = {
    underlying.get(key).map(_.asInstanceOf[V])
  }

  def updated[V](key: Key[V], value: V): TypesafeMap = {
    copy(underlying + (key -> value))
  }
}

But Why?

A type-safe Map can be used as a sort of anonymous class where we can collect any set of sparse attributes.

For example, this snippet

case class Person(
  name: Option[String] = None,
  age: Option[Int] = None,
  children: Option[List[Person]] = None,
  birthYear: Option[Int] = None,
  address: Option[String] = None
)

val person = Person(
  age = Some(40),
  address = Some("123 Main St")
)

can be expressed as

case object Name extends Key[String]
case object Age extends Key[Int]
case object Children extends Key[List[TypesafeMap]]
case object BirthYear extends Key[Int]
case object Address extends Key[String]

val person = TypesafeMap()
  .updated(Age, 40)
  .updated(Address, "123 Main St")  

Unlike a case class, this gives more flexibility to add arbitrarily more “fields” later on by extending Key.

In the Person example this seems pretty useless, but if the class had dozens or hundreds of possible fields, where most of them are None, you could imagine that this could be quite useful.

A practical example of this might be a library for NLP text annotations. The annotations could apply to documents, sentences, tokens, etc, and new modules and annotations could be added without modifying the core library.

In fact, Stanford’s CoreNLP library uses a TypesafeMap class for exactly this, and defines hundreds of CoreAnnotations which are really just keys.

Grepping for implements CoreAnnotation, we can get a sense of how this is used:

src/edu/stanford/nlp/ling/SegmenterCoreAnnotations.java
 public static class CharactersAnnotation implements CoreAnnotation<List<CoreLabel>> {
 public static class XMLCharAnnotation implements CoreAnnotation<String> {

src/edu/stanford/nlp/ling/CoreAnnotations.java
 public static class TextAnnotation implements CoreAnnotation<String> {
 public static class LemmaAnnotation implements CoreAnnotation<String> {
 public static class PartOfSpeechAnnotation implements CoreAnnotation<String> {
 public static class NamedEntityTagAnnotation implements CoreAnnotation<String> {
 public static class NamedEntityTagProbsAnnotation implements CoreAnnotation<Map<String,Double>> {
 public static class CoarseNamedEntityTagAnnotation implements CoreAnnotation<String> {
 public static class FineGrainedNamedEntityTagAnnotation implements CoreAnnotation<String> {
// ...

src/edu/stanford/nlp/sentiment/SentimentCoreAnnotations.java
 public static class SentimentAnnotatedTree implements CoreAnnotation<Tree> {
 public static class SentimentClass implements CoreAnnotation<String> {

src/edu/stanford/nlp/ie/machinereading/structure/MachineReadingAnnotations.java
 public static class EntityMentionsAnnotation implements CoreAnnotation<List<EntityMention>> {
 public static class RelationMentionsAnnotation implements CoreAnnotation<List<RelationMention>> {
 public static class AllRelationMentionsAnnotation implements CoreAnnotation<List<RelationMention>> {
 public static class EventMentionsAnnotation implements CoreAnnotation<List<EventMention>> {
 public static class DocumentDirectoryAnnotation implements CoreAnnotation<String> {

// ... Many many more ...

In total, they define over 300 CoreAnnotations in thirty different files. This would be pretty hard to accomplish with standarad data classes.

Another real world example that we happen to use at Foursquare can be found in Twitter’s Finagle library. Finagle uses a type-safe map called Stack.Params to configure http clients and servers. In this case, they represent Param as a type class instance that acts as a key and also provides a default value.

Again, grepping for extends Stack.Param gives a similar pattern with dozens of fields across a number of files:

finagle-thrift/src/main/scala/com/twitter/finagle/Thrift.scala
   implicit object ClientId extends Stack.Param[ClientId] {
   implicit object ProtocolFactory extends Stack.Param[ProtocolFactory] {
   implicit object Framed extends Stack.Param[Framed] {
   implicit object AttemptTTwitterUpgrade extends Stack.Param[AttemptTTwitterUpgrade] {
     implicit object MaxReusableBufferSize extends Stack.Param[MaxReusableBufferSize] {

finagle-http/src/main/scala/com/twitter/finagle/Http.scala
   implicit object HttpImpl extends Stack.Param[HttpImpl] {
   implicit object MaxChunkSize extends Stack.Param[MaxChunkSize] {
   implicit object MaxHeaderSize extends Stack.Param[MaxHeaderSize] {
   implicit object MaxInitialLineSize extends Stack.Param[MaxInitialLineSize] {
   implicit object MaxRequestSize extends Stack.Param[MaxRequestSize] {
   implicit object MaxResponseSize extends Stack.Param[MaxResponseSize] {
   implicit object Streaming extends Stack.Param[Streaming] {
   implicit object Decompression extends Stack.Param[Decompression] {
   implicit object CompressionLevel extends Stack.Param[CompressionLevel] {

finagle-core/src/main/scala/com/twitter/finagle/loadbalancer/LoadBalancerFactory.scala
 implicit object EnableProbation extends Stack.Param[EnableProbation] {

finagle-core/src/main/scala/com/twitter/finagle/service/Retries.scala
 object Budget extends Stack.Param[Budget] {

finagle-mysql/src/main/scala/com/twitter/finagle/mysql/Handshake.scala
 implicit object Credentials extends Stack.Param[Credentials] {
 implicit object Database extends Stack.Param[Database] {
 implicit object Charset extends Stack.Param[Charset] {

finagle-netty4/src/main/scala/com/twitter/finagle/netty4/package.scala
   private[netty4] implicit object Allocator extends Stack.Param[Allocator] {
   implicit object WorkerPool extends Stack.Param[WorkerPool] {

finagle-mux/src/main/scala/com/twitter/finagle/mux/lease/exp/Lessor.scala
 implicit object Param extends Stack.Param[Param] {

Making it Safer

While these maps are “type safe” in that the key can guarantee the type of the value, there is no way to tell at compile time which values are actually present in the Map.

So in our text annotation example, suppose we want to compose a pipeline of Annotators, or functions that add new values to our map, but depend on existing values. For illustration, let’s assume we have three keys Text: Key[String], Tokens: Key[List[String]], and PosTags: Key[List[String]] that need to be added in that order.

This will probably manifest itself as an assert or an Option/Either return type, or worse, the method will quietly swallow the problem of missing data:

// Bad! Signatures aren't descriptive

def tokenize(m: TypesafeMap): TyepsafeMap = {
  m.get(Text) match {
    case Some(text) => m.updated(Tokens, text.split(" ")) // We have text, add tokens
    case None => m // Can't do anything
  }
}

def posTag(m: TypesafeMap): TypesafeMap = {
  m.get(Tokens) match {
    case Some(tokens) => m.updated(PosTags, doTagging(tokens)) // We have tokens, do tagging
    case None => m // Can't do anything
  }
}

val text = TypesafeMap().updated(Text, "foo bar")

// This compiles but doesn't do anything,
// because Tokens are needed before posTag is called
val tagged = posTag(text)

But really we want our method signature to give more information. The parameter type should describe the data required, and the return type should describe what is added:

// Hypothetical! Some kind of mix-in traits could tell us what data we have
// Not really possible

// Requires Text, adds Tokens
def withTokens(m: TypesafeMap with HasText): TypesafeMap with HasTokens = {
  // m(Text) is safe because our type requires that text is present
  m.updated(Tokens, m(Text).split(" "))
}

// Requires Tokens, adds PosTags
def withPosTags(m: TypesafeMap with HasTokens): TypesafeMap with HasPosTags = {
  m.updated(PosTags, doTagging(m(Tokens)))
}

This would let us construct a pipeline of text annotators that build off each other in a type-safe way, without worrying about runtime Nones or errors from missing annotations.

Here is where phantom types come in!

Phantom Types to the Rescue

A phantom type is a type parameter that doesn’t correspond to any concrete value.

In our case we can add a phantom type parameter to TypesafeMap that will keep track of what values are in our map so far. It will let us achieve the following actual syntax:

// Requires Text, adds Tokens
def tokenize[T <: Text.Aware](
  m: TypesafeMap[T]
): TypesafeMap[T with Tokens.Aware] = {
  m.updated(Tokens, m(Text).split("""\s*\b\s*""").toList)
}

// Requires Tokens, adds PosTags
def posTag[T <: Tokens.Aware](
  m: TypesafeMap[T]
): TypesafeMap[T with PosTags.Aware] = {
  m.updated(PosTag, doTagging(m(Tokens)))
}

val m = TypesafeMap().updated(Text, "Hello, world!")
posTag(m) // Does not compile, no Tokens present for tagging!
posTag(tokenize(m)) // Compiles!

where TypesafeMap and Key are now defined as

abstract class Key[V] {
  // Each Key has it's own Aware marker trait that can be
  // added to the phantom type parameter as values are added. 
  // Nothing concrete actually extends Aware.
  trait Aware
}

// Private constructor to prevent adding elements outside of safe updated
case class TypesafeMap[+T] private (underlying: Map[Key[_], Any]) {
  // Returns the value associated with this key. Will only compile
  // if the value exists.
  def apply[V](key: Key[V])(implicit has: T <:< key.Aware): V = {
    // Looks dangerous but will never throw an error!
    underlying(key).asInstanceOf[V]
  }

  // Stores the key/value and adds key.Aware to our phantom type
  def updated[V](key: Key[V], value: V): TypesafeMap[T with key.Aware] = {
    TypesafeMap[T with key.Aware](underlying + (key -> value))
  }
}

object TypesafeMap {
  def apply(): TypesafeMap[Any] = TypesafeMap[Any](Map())
}

Notice that we can now use apply[V](key: Key[V]): V instead of get[V](key: Key[V]): Option[V].

But unlike a normal Map, apply will never throw a NoSuchElementException. This is because the only way to add to the map is through updated which appends with key.Aware to the phantom type parameter.

Then the apply method only compiles if there is an implicit T <:< key.Aware, which asserts that there must be proof that T extends key.Aware (ie the value exists).

In summary, our typesafe map gets us:

Ability to store arbitrary sparse data
Safe casting of values, with types provided by the keys
Safe apply method, with no null or Option types, because the code only compiles if the key is present.

You can see the working code and tests for this post on GitHub here.