A tale of two Monads: Free vs MTL

Notice: This website has moved to a new URL. Please visit us at Renaissance Learning R&D.

A tale of two Monads: Free vs MTL

by @eborden on September 28, 2017

Much has been said about the benefits of Free monads. The idea of expressing computation as a data type is very powerful. Separating interpretation from representation gives us a lot of freedom, but what are its costs?

A recent blog post did a fantastic job of summarizing many of the issues with Free, Free Monads Considered Harmful. I don’t want to rehash those details. Instead I want to speak concretely. Why did Front Row build a tool with Free? Why did we subsequently refactor to mtl and how did we make that decision?

Consider your use case

Free is a powerful abstraction, but you really must consider your use case before using it. Like the sage says, “use the right tool for the right job.” At Front Row we utilized Free to design a library for generating arbitrary fixtures, which makes it easy to write integration tests. Free worked perfectly! Until it didn’t.

Why did it work?

Prototyping

Free allowed us to quickly iterate on an implementation without thinking about transformer stacks, associativity of monads in the stack, how best to decompose our type classes, etc. These are noise when you are prototyping. You likely don’t even understand your semantics at first, you just need to get something built and prove your use case.

Small graphs of data

Our initial use case, building graphs of arbitrary relational data, did not require high performance. These graphs were shallow, building fewer than 20 artifacts, so the diminishing returns of Free did not manifest.

Inspecting the graph

We thought we might need to inspect the graph to produce finalizers for unwinding idempotent graphs. We didn’t.

Why didn’t it work?

One simple answer; unintended use cases. An enterprising engineer took this tool and said, “Hey, can I use this to build data for benchmarking?” The answer was obviously YES! Great, our tool has found a new use case that we didn’t anticipate, cheers everyone. What happened next lead us to refactor away from Free.

Generating 300,000 arbitrary entities

Our enterprising engineer quickly spun up a tool for generating 300,000 database entities to test a few new API endpoints. Writing this tool took no time, but running it did. The runtime was 30 minutes! This is unacceptable. Why did it take that long?

Probing the problem

This library is essentially a DSL around Arbitrary and database insertion, so our first stop was the database. Were any queries being run?

postgres=# SELECT datname, query, usename FROM pg_stat_activity ORDER BY query_start DESC LIMIT 1; \watch 1

The above query allowed us to watch for insertions, and we quickly realized that NOTHING was being done. A quick hop over to top showed that memory was climbing. What was going on? Why was nothing happening except generating heat?

Profiling to the rescue

GHC has fantastic profiling tools. Sadly for a long time they were locked away because of terrible work flows. stack (and yes nix) has greatly simplified utilizing these tools. Passing the --profile flag to a stack command will build all your libraries with profiling instrumentation and produce .prof files.

Building a benchmark

First we built a benchmark to get some hard and consistent numbers on speed. Our simple benchmark was generating n nodes without any dependencies. We used criterion to accomplish this.

import Data.Functor.Identity
import Control.Monad (void, replicateM)
import Criterion.Main (defaultMain)
import GHC.Generics
import qualified Graphula.Free as Free
import Test.QuickCheck.Arbitrary

main :: IO ()
main = defaultMain
  [ bgroup "initial algebra"
    [ bench "1"  . nfIO $ replicateNodeInitial 1
    , bench "100"  . nfIO $ replicateNodeInitial 100
    , bench "1000"  . nfIO $ replicateNodeInitial 1000
    ]
  ]

data A
  = A
  { aa :: String
  , ab :: Int
  } deriving (Generic)

instance Arbitrary A where
  arbitrary = A <$> arbitrary <*> arbitrary

instance HasDependencies A

graphIdentity :: Free.Frontend NoConstraint Identity (IO r) -> IO r
graphIdentity f = case f of
  Free.Insert n next ->
    next $ Just $ Identity n
  Free.Remove _ next ->
    next

replicateNodeInitial :: Int -> IO ()
replicateNodeInitial i = void . Free.runGraphula graphIdentity . replicateM i $ node @A

We then ran this script with profiling enabled, stack bench --profile. This produced a little noise from criterion, but we got the information we needed.

Identifying cost centers

GHC *.prof files contain lots of useful information, but we were really only concerned with cost. Where was our program spending most of its time? Cost center reports show a breakdown of functions with their module/source lines and percentage of time/allocation spent in them.

COST CENTRE        MODULE                            SRC                                                    %time %alloc

>>=.\              Control.Monad.Trans.Free          src/Control/Monad/Trans/Free.hs:(229,39)-(231,44)       20.5   27.1
>>=                Control.Monad.Trans.Free          src/Control/Monad/Trans/Free.hs:(229,3)-(231,44)        18.1   16.6
fmap               Graphula.Free                     src/Graphula/Free.hs:124:1-48                           11.8   20.0
getOverhead        Criterion.Monad                   Criterion/Monad.hs:(47,1)-(56,12)                        7.1    0.0
getGCStats         Criterion.Measurement             Criterion/Measurement.hs:(46,1)-(48,16)                  4.7    0.0
>>=                Data.Vector.Fusion.Util           Data/Vector/Fusion/Util.hs:33:3-18                       3.4    3.6
transFreeT         Control.Monad.Trans.Free          src/Control/Monad/Trans/Free.hs:337:1-79                 3.1    4.0
fmap               Graphula.Free                     src/Graphula/Free.hs:118:1-58                            2.8    6.5
fmap               Control.Monad.Trans.Free          src/Control/Monad/Trans/Free.hs:(209,3)-(211,42)         2.2    2.5
fmap               Control.Monad.Trans.Free          src/Control/Monad/Trans/Free.hs:(132,3)-(133,37)         2.2    1.8
iterT              Control.Monad.Trans.Free          src/Control/Monad/Trans/Free.hs:(309,1)-(313,21)         1.5    1.3
basicUnsafeWrite   Data.Vector.Primitive.Mutable     Data/Vector/Primitive/Mutable.hs:114:3-69                1.4    0.7
fmap.f'            Control.Monad.Trans.Free          src/Control/Monad/Trans/Free.hs:(210,5)-(211,42)         1.1    1.6
rSquare.p          Statistics.Regression             Statistics/Regression.hs:109:5-66                        1.0    0.7
nodeEditWith       Graphula                          src/Graphula.hs:(385,1)-(389,10)                         1.0    1.0
innerProduct.\     Statistics.Matrix.Algorithms      Statistics/Matrix/Algorithms.hs:42:3-51                  1.0    0.7
backendArbitrary.\ Graphula.Free                     src/Graphula/Free.hs:(253,24)-(255,10)                   1.0    0.5
fmap               Data.Vector.Fusion.Stream.Monadic Data/Vector/Fusion/Stream/Monadic.hs:(125,3)-(127,20)    1.0    1.0

This report is a dead giveaway. Collapsing stats for >>=.\ and >>= show 40% of time and 43% of allocations are spent in binding Free. Our next top offender is fmap from our data type’s Functor instance. We are spending half of the time in this program just glueing computation together!

Refactoring

Obviously Free was a problem. The primary use case of our library is generating arbitrary nodes, not monadic bind. We need to be spending our resources more wisely. So we refactored to mtl. This was mostly mechanical since we’d already figured out the details in our prototype.

Next step was creating a benchmark to prove our time wasn’t wasted.

import Control.Monad (replicateM, void)
import Control.Monad.Catch
import Control.Monad.Trans
import Criterion.Main
import Data.Functor.Identity
import GHC.Generics
import Graphula
import Test.QuickCheck.Arbitrary

main :: IO ()
main = defaultMain
  [ bgroup "final algebra"
    [ bench "1"  . nfIO $ replicateNodeFinal 1
    , bench "100"  . nfIO $ replicateNodeFinal 100
    , bench "1000"  . nfIO $ replicateNodeFinal 1000
    ]
  ]

newtype GraphulaIdentity a = GraphulaIdentity { runGraphulaIdentity :: IO a }
  deriving (Functor, Applicative, Monad, MonadIO, MonadThrow, MonadCatch, MonadMask)

instance MonadGraphulaFrontend GraphulaIdentity where
  type NodeConstraint GraphulaIdentity = NoConstraint
  type Node GraphulaIdentity = Identity
  insert = pure . Just . Identity
  remove = const (pure ())

replicateNodeFinal :: Int -> IO ()
replicateNodeFinal i = void . runGraphulaIdentity . runGraphulaT . replicateM i $ node @A

New cost centers

Again we ran this benchmark with stack bench --profile and inspected our cost centers.

COST CENTRE            MODULE                            SRC                                                    %time %alloc

nodeEditWith           Graphula                          src/Graphula.hs:(385,1)-(389,10)                        18.0   19.3
attemptsToInsertWith   Graphula                          src/Graphula.hs:(419,1)-(426,60)                        18.0   16.6
replicateNodeFinal     Main                              bench/Main.hs:77:1-89                                   13.0   17.6
getOverhead            Criterion.Monad                   Criterion/Monad.hs:(47,1)-(56,12)                        6.2    0.0
>>=                    Data.Vector.Fusion.Util           Data/Vector/Fusion/Util.hs:33:3-18                       6.0    8.6
getGCStats             Criterion.Measurement             Criterion/Measurement.hs:(46,1)-(48,16)                  4.1    0.0
basicUnsafeWrite       Data.Vector.Primitive.Mutable     Data/Vector/Primitive/Mutable.hs:114:3-69                2.3    1.5
attemptsToInsertWith.\ Graphula                          src/Graphula.hs:425:17-22                                2.3    3.5
newTFGen               System.Random.TF.Init             src/System/Random/TF/Init.hs:82:1-43                     2.1    8.8
logNode                Graphula                          src/Graphula.hs:139:3-21                                 2.1    0.0
rSquare.p              Statistics.Regression             Statistics/Regression.hs:109:5-66                        1.9    1.7
generate               Test.QuickCheck.Gen               Test/QuickCheck/Gen.hs:(101,1)-(103,20)                  1.8    3.5
innerProduct.\         Statistics.Matrix.Algorithms      Statistics/Matrix/Algorithms.hs:42:3-51                  1.8    1.8
qr.\.\                 Statistics.Matrix.Algorithms      Statistics/Matrix/Algorithms.hs:29:34-51                 1.5    1.4
kbnAdd                 Numeric.Sum                       Numeric/Sum.hs:(138,1)-(141,39)                          1.4    0.9
basicUnsafeIndexM      Data.Vector.Primitive             Data/Vector/Primitive.hs:215:3-75                        1.4    2.2
fmap                   Data.Vector.Fusion.Stream.Monadic Data/Vector/Fusion/Stream/Monadic.hs:(125,3)-(127,20)    1.3    2.4
newQCGen               Test.QuickCheck.Random            Test/QuickCheck/Random.hs:80:1-31                        1.1    0.0
basicUnsafeIndexM      Data.Vector.Unboxed.Base          Data/Vector/Unboxed/Base.hs:283:841-899                  1.1    0.0
generateNode           Graphula                          src/Graphula.hs:138:3-44                                 1.0    0.0
transpose              Statistics.Matrix                 Statistics/Matrix.hs:(268,1)-(270,22)                    0.8    1.4
unsafeIndex            Statistics.Matrix                 Statistics/Matrix.hs:242:1-40                            0.7    1.0
basicUnsafeSlice       Data.Vector.Primitive.Mutable     Data/Vector/Primitive/Mutable.hs:(84,3)-(85,25)          0.6    1.3

Now these look a lot better! Our most expensive functions are all associated with generating nodes. Our costs are aligned. We are spending time and allocation where our primary use case lies.

What about timing?

Turns out this implementation was belligerently faster.

The fun part is that we barely changed anything. Some constraints needed to be changed on our functions, but their implementations didn’t need to change because Free allowed us to think about the core semantic abstractions first. If you are curious you can see both implementations on github, free vs mtl.

The actual benchmarks live here.

Even better the runtime of our enterprising engineer’s data generation went from 30 minutes to 3 minutes. An order of magnitude is always a nice performance validation.

What did we learn?

What we learned is utilizing Free vs mtl is not universally good or bad. They are each fantastic abstractions and it really comes down to your use case. If your use case is going to end up building deeply nested graphs of Free data types then it probably isn’t for you. In the end we gained a lot clarity from prototyping with Free and a lot of performance from refactoring to mtl.