Replace JSON with Dhall: DynamoDB case study
In this post I will show you how to rewrite a schema-less JSON file in Dhall. As an example I will use the JSON used for creating a DynamoDB table. It was chosen for illustrative purposes only; you don't need to know anything about DynamoDB, and it is not really relevant to the key message of this post.
Do not treat this blog post as either a comprehensive introduction to Dhall or a list of best practices. I am a Dhall beginner and want to present a use case where it is useful. Thus, the code itself might not be of the highest quality.
Before diving into Dhall, let's take a look at how configuration files are typically written today.
Current approach to configuration files
Dhall is advertised as a non-repetitive alternative to YAML, and I think such positioning definitely makes sense. YAML, JSON, and their derivatives have become a de facto standard for many aspects of DevOps and configuration management. Just think how you write your docker-compose file, your Kubernetes files, your OpenAPI specification, your DynamoDB table specification or CI job. All of them are either YAML or JSON. However, not many of their users would actually say they like those formats. Lack of schema, no support for code reuse or even variables, and no type safety are among the biggest problems.
Another language in this domain is HashiCorp Configuration Language, also known simply as HCL, which is used to define Terraform-based infrastructure. To me HCL feels like a language that emerged in an ad-hoc fashion rather than one that was meticulously designed. It lacks basic tools like user-defined functions, so it is hard to structure your code in a lightweight way. The lack of enums is also quite disturbing. Consider the encryption_type attribute of the aws_kinesis_stream resource: even though it is documented that the only acceptable values are NONE and KMS, Terraform will happily accept any other value.
As a person working daily with a strongly, statically typed language (namely Scala) I was struck that crucial parts of code are written in a way where a simple typo will be detected only at runtime. I sighed: if only we had some simple, possibly Turing-incomplete language specialized in configuration. Then a colleague of mine pointed me to Dhall and I realized it was exactly what I was looking for.
We can do better: Dhall
Dhall is a configuration language. You can think of it as JSON that, unlike JSON, is programmable: you can define functions. It is modular: you can extract commonly used functions to a file and import it in many places. It is also statically typed, so you will be notified of type errors ahead of time, and strongly typed, so there is no implicit type casting.
Although Dhall is programmable, it is not Turing complete. That is a conscious design decision: thanks to it, evaluation is always guaranteed to terminate and will never hang. It only means that there is no general recursion in the language; you can still, for example, map over a list.
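As a minimal sketch of that: the built-in List/fold lets you express a map without any recursion (the Dhall Prelude ships a ready-made List/map built on such primitives):

let increment = λ(n: Natural) → n + 1
in
List/fold Natural [1, 2, 3] (List Natural)
  (λ(x: Natural) → λ(acc: List Natural) → [increment x] # acc)
  ([] : List Natural)
-- evaluates to [ 2, 3, 4 ]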
I do not want to describe Dhall in detail in this blog post. If you want to know more, both Dhall's readme and its site are good places to start.
What I want to do instead is to show you an example of how Dhall can be used to simplify a configuration file.
DynamoDB example - original JSON
Before we start refactoring we need to understand the starting point, namely what the JSON used by DynamoDB looks like. We will be working with a made-up example, so there is no need to think too much about the structure of the table; we focus on the way it is specified instead.
A DynamoDB table can be created using the CLI:
aws dynamodb create-table --cli-input-json file:///your/path/table.json
In this text we focus solely on the table.json file, whose syntax is described in the AWS docs. Here is how it may look:
{
"AttributeDefinitions": [
{
"AttributeName": "Id",
"AttributeType": "S"
},
{
"AttributeName": "Artist",
"AttributeType": "S"
},
{
"AttributeName": "Song",
"AttributeType": "S"
},
{
"AttributeName": "Year",
"AttributeType": "N"
}
],
"KeySchema": [
{
"KeyType": "HASH",
"AttributeName": "Id"
}
],
"GlobalSecondaryIndexes": [
{
"IndexName": "ArtistSongIndex",
"Projection": {
"ProjectionType": "ALL"
},
"ProvisionedThroughput": {
"WriteCapacityUnits": 3,
"ReadCapacityUnits": 3
},
"KeySchema": [
{
"KeyType": "HASH",
"AttributeName": "Artist"
},
{
"KeyType": "RANGE",
"AttributeName": "Song"
}
]
},
{
"IndexName": "YearArtistIndex",
"Projection": {
"ProjectionType": "ALL"
},
"ProvisionedThroughput": {
"WriteCapacityUnits": 2,
"ReadCapacityUnits": 2
},
"KeySchema": [
{
"KeyType": "HASH",
"AttributeName": "Year"
},
{
"KeyType": "RANGE",
"AttributeName": "Artist"
}
]
}
],
"ProvisionedThroughput": {
"WriteCapacityUnits": 2,
"ReadCapacityUnits": 2
},
"TableName": "Songs"
}
Problems with the above JSON:
- lack of variables: if you make a typo and refer to "Yearr" instead of "Year" in any index definition, it will be caught as late as when running the AWS request
- lack of types: you can define KeyType as 56 and nothing will complain
- you can forget about TableName, which is a required field
- lack of enums: you can define KeyType as "whatever" even though "HASH" and "RANGE" are the only valid values
- lack of comments: this is a JSON-specific issue; YAML has a way of adding comments
- it's very repetitive: you need to repeat the 4 lines of ProvisionedThroughput over and over although it is basically a function of 2 integer arguments, which makes it cumbersome to write
- due to all this verbosity the signal-to-noise ratio of the file is very low, which makes reading and comprehending the key ideas expressed in the file difficult
Now that we know what we want to fix, let's start doing it with Dhall!
Rewriting DynamoDB example with Dhall
How to run the code
You can find the full code used in the example in the github repository. Its README contains instructions on how to run the code.
File structure
The file structure is as follows:
dhall
├── generic
│ ├── functions.dhall
│ ├── schema.dhall
│ └── types.dhall
└── migration.dhall
The directory generic contains common types and functions useful when working with the DynamoDB create-table JSON format. In an ideal world it would have been written already by someone else and published in some repository; it consists of things that are supposed to be written once and used many times. I cut corners, though, and implemented just the pieces relevant to the example presented in this post. The file migration.dhall is the only one that contains information specific to the example JSON file shown at the beginning of this post.
Given this file structure you can generate JSON out of migration.dhall with:
dhall-to-json --explain --pretty <<< './dhall/migration.dhall : ./dhall/generic/schema.dhall'
Defining types
Let's start by defining types in types.dhall. Here is a fragment of it:
let AttributeDefinition = {
AttributeName: Text,
AttributeType: Text
}
let ProvisionedThroughput = {
WriteCapacityUnits: Natural,
ReadCapacityUnits: Natural
}
-- more types omitted for the sake of readability
As you can see, it is quite straightforward. It also shows the usual pattern of starting a Dhall file with a sequence of let bindings, followed by the in keyword and an expression that uses the definitions created with let. In our case we will use a record with all the defined types in the in section:
in
{
AttributeDefinition = AttributeDefinition,
GlobalSecondaryIndex = GlobalSecondaryIndex,
KeySchemaItem = KeySchemaItem,
ProvisionedThroughput = ProvisionedThroughput
}
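For completeness, here is a sketch of the two omitted types that the schema will reference; this is my reconstruction from the JSON structure above, not necessarily the repository's exact code:

let KeySchemaItem = {
  KeyType: Text,
  AttributeName: Text
}
let GlobalSecondaryIndex = {
  IndexName: Text,
  Projection: { ProjectionType: Text },
  ProvisionedThroughput: ProvisionedThroughput,
  KeySchema: List KeySchemaItem
}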
Let's try it out (I am using the dhall command here, which reads from standard input; ctrl-d signals the end of input):
> dhall
let Types = ./generic/types.dhall in
{
WriteCapacityUnits = 5,
ReadCapacityUnits = 5
} : Types.ProvisionedThroughput
^D
{ ReadCapacityUnits = 5, WriteCapacityUnits = 5 }
It worked as expected. Now let's make a type mistake and see if Dhall will catch it:
> dhall
let Types = ./generic/types.dhall in
{
WriteCapacityUnits = 5,
ReadCapacityUnits = "hello"
} : Types.ProvisionedThroughput
^D
Use "dhall --explain" for detailed errors
Error: Expression doesn't match annotation
{ ReadCapacityUnits : - Natural
+ Text
, …
}
Error caught, success!
Defining schema
Now we can import the types defined in the previous section in schema.dhall:
let Types = ./generic/types.dhall
in {
TableName: Text,
KeySchema: List Types.KeySchemaItem,
AttributeDefinitions: List Types.AttributeDefinition,
GlobalSecondaryIndexes: List Types.GlobalSecondaryIndex,
ProvisionedThroughput: Types.ProvisionedThroughput
}
The split between types.dhall and schema.dhall is arbitrary; they could just as well be a single file. I find it clean to have the top-level type defined in a separate file, but Dhall itself does not enforce any structure.
Using schema
The most straightforward way of using that schema would be:
let Types = ./generic/types.dhall
in
{
AttributeDefinitions = [
  {
    AttributeName = "Id",
    AttributeType = "S"
  }
  -- other attributes omitted
]
-- other fields omitted
}
However, it is similarly verbose to the original JSON, which is what we wanted to avoid. To prevent repetition we will declare a few functions in functions.dhall, creating a nice DSL we can use in migration.dhall.
Here's the fragment of functions.dhall related to AttributeDefinition:
let mkAttribute =
    λ(attributeType: Text)
    → λ(attributeName: Text)
    → {
      AttributeName = attributeName,
      AttributeType = attributeType
    }
-- partially applied functions for each of the types:
let mkStringAttribute = mkAttribute "S"
let mkNumberAttribute = mkAttribute "N"
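The final example below also uses mkThroughput, mkHashIndex, mkRangeIndex and mkIndex from the same file. Here is a sketch of how the throughput and key-schema helpers might look; this is my reconstruction rather than the repository's exact code (mkIndex is analogous, additionally assembling the whole index record):

let mkThroughput =
    λ(writeCapacityUnits: Natural)
    → λ(readCapacityUnits: Natural)
    → {
      WriteCapacityUnits = writeCapacityUnits,
      ReadCapacityUnits = readCapacityUnits
    }
let mkKeySchemaItem =
    λ(keyType: Text)
    → λ(attributeName: Text)
    → {
      KeyType = keyType,
      AttributeName = attributeName
    }
-- partial application again, one helper per key type:
let mkHashIndex = mkKeySchemaItem "HASH"
let mkRangeIndex = mkKeySchemaItem "RANGE"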
As you can see, Dhall incorporates techniques known from functional programming, such as currying and partial application. Thanks to that it gives us a simple and reliable framework for abstraction.
Final form
All the generic functionality is in place; it is time to use it to rewrite the initial example:
let Types = ./generic/types.dhall
let Functions = ./generic/functions.dhall
let id = "Id"
let artist = "Artist"
let song = "Song"
let year = "Year"
let defaultThroughput = Functions.mkThroughput 2 2
in
{
TableName = "Songs",
KeySchema = [Functions.mkHashIndex id],
AttributeDefinitions = [
Functions.mkStringAttribute id,
Functions.mkStringAttribute artist,
Functions.mkStringAttribute song,
Functions.mkNumberAttribute year
],
GlobalSecondaryIndexes = [
Functions.mkIndex [Functions.mkHashIndex artist, Functions.mkRangeIndex song] (Functions.mkThroughput 3 3),
Functions.mkIndex [Functions.mkHashIndex year, Functions.mkRangeIndex artist] defaultThroughput
],
ProvisionedThroughput = defaultThroughput
}
That's it!
DynamoDB example - what was achieved
There is clear progress when you compare the final result with the original JSON example. The general feeling is that the resulting configuration is devoid of noise; it simply conveys the essence of what needs to be expressed.
We were able to:
- eliminate the repetitiveness of the original format
- introduce variables, so we don't have to repeat ourselves when it comes to field names; this also reduces spelling mistakes
- force our configuration to adhere to the defined schema, which protects us from type errors, omitted attribute keys, etc.
You may argue that I had to write the schema and the Dhall functions that allowed me to radically improve the level of expressiveness, so there is some additional code beyond the nice demo at the end.
That's right, but:
- you write your schema and helper functions only once and then use them many times
- once Dhall becomes more popular there will be a lot of schemas and code written by the community. To some extent this is already the case, examples being dhall-nix and dhall-kubernetes.
DynamoDB example - deficiencies
Even though it looks quite good, I must admit that when I first heard about Dhall I had something more powerful in mind. I expected to be able to describe the whole schema with great precision using ADTs. Moreover, I hoped for strong typing in the sense that I would hardly ever use the Text (i.e. Dhall's String) type, and the solution here is full of it.
Take a look at part of the schema:
AttributeDefinitions : List {
AttributeName: Text,
AttributeType: Text
}
While AttributeName is actually quite fine as Text, AttributeType is in substance an enum with only a few valid values, as documented here. You cannot put ABC there, and such a mistake should be caught by the configuration language when checking against the schema. In that regard the mantra should be to check as much as possible as early as possible.
Union types to the rescue?
The good news is that Dhall lets you express enums at the type level using unions. Here we try to be more explicit about which values we expect for AttributeType:
-- There are a few more types supported by DynamoDB; let's stick to these 3 for brevity:
let AttributeType = < Number : {} | Binary : {} | String : {} >
let attributeType = constructors AttributeType
let AttributeDefinition = {
AttributeName: Text,
AttributeType: AttributeType
}
let idAttr = {
AttributeName = "Id",
AttributeType = attributeType.String {=}
}
in
idAttr
We can run it through dhall to prove that Dhall "understands" the meaning of such a configuration:
dhall <<< './unions.dhall'
{ AttributeName =
"Id"
, AttributeType =
< String = {=} | Binary : {} | Number : {} >
}
Now, let's try to generate JSON out of it:
dhall-to-json --pretty <<< './unions.dhall'
{
"AttributeName": "Id",
"AttributeType": {}
}
"AttributeType": {}
is not something we want to achieve. We would like to have "AttributeType": "S"
. It is understandable that dhall-to-json
did not come up with expected result taking into account we have not defined JSON representation for AttributeType
union. We may do that by defining a function attributeTypeToString = λ(t : AttributeType) → Text
in Dhall, which is easy. There is a major problem here though - as return type of that function is Text
we would need to declare AttributeType
field as Text
again negating most of the benefit of introducing union type AttributeType
at first. It still may have some benefit, but only providing you will keep the convention of setting AttributeType
field always by using attributeTypeToString
function. Mind that it would work only by convention and there is nothing in Dhall's type system that will stop you from setting AttributeType
to any, possibly invalid, Text
.
All in all, the problem boils down to this: when using Dhall via dhall-to-json, all types in the leaf nodes of a schema have to be declared as primitive types supported by dhall-to-json.
It is not a problem of dhall-to-json itself; it is clear that it cannot be more precise than the underlying format. Hypothetically it could have some resolution mechanism that would try to find a function of type AttributeType → Text to enable the usage of rich types directly in the schema, but that is not a design goal of dhall-to-json. I have not checked dhall-to-yaml, but I believe it has the same constraint.
Although it may look like an obvious limitation, it took me some time to realize it. I believe it should be taken into account when thinking about potential use cases for dhall-to-json.
Possible solutions
- One apparent solution would be to write our own dhall-to-dynamo using Dhall's Haskell bindings. We would be able to treat DynamoDB-related types differently there. However, in this blog post I am advocating Dhall as a Swiss army knife for configuration formats: we should be able to write a few relatively straightforward .dhall files and simply profit, without caring about Haskell bindings or even knowing Haskell at all, let alone building and distributing binaries.
- We may define AttributeType as < Number : Text | Binary : Text | String : Text >. Then we may create type constructors which will propagate valid Text values, e.g. let mkNumber = attributeType.Number "N". The problem here is that nothing stops the user from bypassing the type constructor and simply specifying attributeType.Number "rubbish". We cannot write < Number : "N" | ... > as "N" is a term as opposed to a type, and Dhall provides no means of restricting the valid values of a type (I would be very happy to be proven wrong here, but I was not able to find anything in that regard).
- We can define two schemas in Dhall: a rich one and a primitive one. The rich one would operate on semantic types while the primitive one would operate on the underlying format's types. A schema developer would need to provide a function transformSchema : RichSchema → Schema, that function being the only gateway from rich to primitive types. A person using the schema would be supposed to work only with rich types and would call the transformSchema function at the very end of the config (see the sketch after this list).
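To make the third approach concrete, here is a minimal sketch limited to a single attribute definition, reusing the merge-based function shown earlier; the names RichAttributeDefinition and transformAttributeDefinition are illustrative rather than the repository's exact code:

let AttributeType = < Number : {} | Binary : {} | String : {} >
let attributeType = constructors AttributeType
-- rich variant: the type a schema user is supposed to work with
let RichAttributeDefinition = {
  AttributeName: Text,
  AttributeType: AttributeType
}
let attributeTypeToString =
    λ(t: AttributeType)
    → merge
      {
        Number = λ(_: {}) → "N",
        Binary = λ(_: {}) → "B",
        String = λ(_: {}) → "S"
      }
      t : Text
-- the single gateway from rich to primitive types; the result contains
-- only Text, so dhall-to-json can serialize it
let transformAttributeDefinition =
    λ(rich: RichAttributeDefinition)
    → {
      AttributeName = rich.AttributeName,
      AttributeType = attributeTypeToString rich.AttributeType
    }
in
transformAttributeDefinition {
  AttributeName = "Id",
  AttributeType = attributeType.String {=}
}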
I implemented the third approach in a very limited scope (I enriched only AttributeType to be a union type) here. In such a limited scope the change looks quite simple, but I am afraid that in even a slightly more advanced case maintaining transformSchema would become a bottleneck. The important factor in that regard is the depth of the schema structure: for really deep structures, tools for working with them, such as optics in FP or the visitor pattern in OOP, would be very useful. As far as I know, Dhall currently does not provide them.
Still, I believe the last approach is the best of the proposed ones and is worth exploring further.
In case you wonder why not simply call transformAttributeType (and transformHashType, and so on), avoiding any necessity of working with nested structures: while it would work, it would go against the whole idea of strong typing. The essence of the proposed solution is to have strictly one place where we translate rich types into primitive ones.
Other use cases / Possible extensions
What I described in this post is using Dhall to generate just one file describing just one piece of an overall architecture. The vision worth pursuing is something I call "Dhall all the way down": the idea is to make Dhall files the only ones a developer of the application should modify.
So instead of setting up a DynamoDB table with Terraform, providing the table schema with JSON, and configuring your Scala application with HOCON (aka typesafe-config), you would configure everything at the Dhall level, only once. Dhall can generate the proper configuration files in the underlying formats, so not every tool needs to understand Dhall. The biggest advantage of such an approach would be referential integrity checking: without Dhall, when changing the table name in JSON it is easy to forget to update the HOCON used by the Scala application.
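A minimal sketch of the idea, with a hypothetical file layout: keep each shared value in a single Dhall record and derive every downstream configuration from it, so renaming the table is a one-place change.

-- in reality `common` would live in its own file (e.g. ./common.dhall)
-- and be imported by each generated configuration
let common = { tableName = "Songs" }
-- rendered with dhall-to-json into the DynamoDB table spec
let tableSpec = { TableName = common.tableName }
-- rendered into the application's config (pending a dhall-to-hocon;
-- a JSON file also works, since HOCON is a superset of JSON)
let appConfig = { dynamo = { tableName = common.tableName } }
in
{ tableSpec = tableSpec, appConfig = appConfig }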
It is a high-level vision and I am not sure how feasible it is right now. One apparent problem in the example described above is the lack of a dhall-to-hocon.
Conclusion
Dhall provides a simple way of defining configuration files that is less verbose and less error-prone than JSON or YAML. Also, writing a schema and helper functions is quite an easy job and can pay off in increased productivity even for small use cases. I would say that if you need to maintain more than 5-10 configuration JSON files similar to the one described in this post, that is already a scale at which you start profiting from Dhall.
Disclaimer: for people not fluent in statically typed functional languages, the learning curve may be steep.
If you hope to define a schema in a super-typesafe and extremely precise way, expressing things like "field A is either a number lower than 5 or a string of length 15", then Dhall itself will not help you to that extent (at least at the moment).
My general feeling is that Dhall's philosophy is to provide a set of clean, thoroughly specified primitives while not caring that much about ergonomics for specific use cases. That goes along with observations gathered in the Dhall survey. It seems that providing the right abstractions and tools for specific use cases is an exercise left for the future. I agree with that on a philosophical level, because it is much easier to provide opinionated solutions on top of clean primitives than the other way round. From a pragmatic point of view, the question is how fast and how big the community and tooling around Dhall will grow. I do not feel entitled to place a bet on this, as I have just started my adventure with Dhall. Personally, I will start using Dhall for simple cases and experiment with more advanced ones, with a grain of evangelism, which I hopefully provided in this post.
Acknowledgements
Thanks to Gabriel Gonzalez and all contributors for the wonderful work on Dhall. The high quality of all the software involved and the clarity of the documentation are stunning.
Thanks to Krzysztof Janosz who introduced me to Dhall.
Github repository
Repository with code used in this article