Test the Definition of Testing
by @mjgpy3 on October 04, 2023
As tech workers, we draw abstractions daily, generalizing and categorizing our world. These techniques constrain our thinking. We hunt for domain edges and erect fences where we must. These boundaries keep our codebases focused and thus our complexity down.
One area where I’d like to see fewer boundaries drawn, though, is how we think and talk about testing. To illustrate what I mean, have you…
- heard blame ascribed to a QA team/department for “letting a bug get into production”?
- spent time arguing whether a test is a unit test or an integration test?
- had a coworker who identified themselves as a “manual tester”, “automated tester”, or a mix of the two?
- observed a game of “throw-it-over-the-wall-volleyball” between testers and programmers?
- seen a QA Engineer incentivized (often financially) to find more bugs?
If you answered yes, then you are witnessing symptoms of the industry’s myopic view of quality and testing.
Tackling how we think about testing is a proverbial elephant I won’t be able to eat here. But to take a first bite, I’d like to catalog a few ways I’ve seen us test at Freckle that break the typical “testing mold”. The examples will be themed around breaking the “cardinal rule of testing”: don’t test in production.
One feature we recently finished implementing at Freckle is around skill recommendation. The domain idea is simple. If a teacher is about to assign some Math skill, we recommend they assign simpler skills to students who we know are struggling.
On its face this seems easy enough. Yet, when our Product team asked us how many users would receive recommendations, we were stumped!
The recommendation feature would be built off of our proficiency logic which is (1) modeled in our Haskell backend, and (2) known to be a slow calculation!
Determining how many users would be affected would require a big, slow loop that hit our backend for every skill and every teacher. This loop would be a huge performance hit to actual users while it ran, would take a long time to complete, and would require new endpoints with cross-user auth.
After a great deal of back-and-forth, we had the classic eureka moment. We decided to build only the recommendation logic, behind a feature flag, with no UI changes. When a teacher visited the assignment page, we would fetch proficiency data, determine whether a recommendation was necessary, then do nothing. Well, not quite nothing: we used Pendo tracking events to record whether the recommendation would have been made.
This approach allowed us to
- quickly and cheaply determine the answer to Product’s question (the teacher-facing page and proficiency endpoints already existed)
- observe the real performance impact of hitting these endpoints (from real traffic)
- get a head start on implementation, and
- do all this without risk – the feature flag serving as an “off switch” if something went wrong
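A ghost feature like this can be sketched in a few lines of Haskell. Everything below is hypothetical – the types, `wouldRecommend`, and the `IORef` standing in for a Pendo-style event sink are illustrative, not Freckle’s actual code – but it shows the shape of the idea: the flag gates all work, and the only observable effect is a tracking event.

```haskell
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Hypothetical domain types.
newtype Proficiency = Proficiency Double

data Event = WouldHaveRecommended | WouldNotHaveRecommended
  deriving (Eq, Show)

-- The real recommendation logic we want to build and observe.
-- Here (illustratively): recommend simpler skills when any student
-- falls below a proficiency threshold.
wouldRecommend :: [Proficiency] -> Bool
wouldRecommend = any (\(Proficiency p) -> p < 0.5)

-- The ghost feature: if the flag is on, compute the recommendation
-- and record what *would* have happened. No UI ever changes.
ghostRecommendation
  :: Bool          -- feature flag
  -> IORef [Event] -- stand-in for a Pendo-style event sink
  -> [Proficiency] -- fetched from the existing endpoints
  -> IO ()
ghostRecommendation flagOn sink proficiencies
  | not flagOn = pure () -- kill switch: flag off means no work at all
  | otherwise =
      modifyIORef sink $ \events ->
        (if wouldRecommend proficiencies
           then WouldHaveRecommended
           else WouldNotHaveRecommended)
          : events
```

With a sketch like this, answering Product’s question reduces to counting `WouldHaveRecommended` events in the analytics tool.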
Implementing “ghost features” reminds me of a broader (albeit less ghastly) set of tools we often use to test code in production: A/B tests, gradual rollout releases, and dark launches. I won’t belabor these techniques beyond providing basic descriptions and examples – there are tons of resources out there.
A/B tests are a rubber-meets-the-road style of testing. They’re most handy when we have competing solutions (i.e., theories) and we want some form of data as to which is better. We use split.io to perform A/B tests and for feature flagging more generally.
Freckle often A/B tests “by the book”, so I’ll not concern us with questions of how. I’d rather challenge the reader to ask: in what ways can we apply A/B testing practices to things beyond features?
Is it useful to
- open competing pull requests, showing alternative solutions to a problem?
- ask questions in different ways, to different audiences, to inspire different types of answers?
- be agile – trying new things and measuring their impact?
Gradually rolling out a new solution to users over time is another nifty tactic. Rather than going all-in on a risky decision, you can peek under the band-aid before deciding whether to rip it off.
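The mechanics of a gradual rollout can be sketched as deterministic bucketing: each user maps to a stable bucket, and the rollout percentage decides which buckets are in. Everything below is illustrative – real systems (split.io included) use a proper hash, not the toy `ord`-summing shown here.

```haskell
import Data.Char (ord)

-- Deterministically map a user id into one of 100 buckets.
-- A toy "hash" for illustration only; real systems use a real one.
bucket :: String -> Int
bucket userId = sum (map ord userId) `mod` 100

-- A user is in the rollout when their bucket falls under the current
-- percentage. Because buckets are stable, ramping 5% -> 25% -> 100%
-- only ever adds users; nobody flickers in and out.
inRollout :: Int -> String -> Bool
inRollout percent userId = bucket userId < percent
```

The key design property is monotonicity: raising the percentage never removes a user who already had the feature, which keeps the observed behavior consistent as you ramp.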
Dark launches, though similar to ghost features, are the least impactful approach from a testing perspective. This is because dark launches usually hide as much functionality as possible behind a feature flag. Ghost features, on the other hand, intentionally expose risk and use feature flags as a last-resort kill switch.
Dark launches are not totally without testing benefit, though. They show that your code integrates well (e.g. it passes CI). They also show that, should you enable a feature, you can go back (i.e. they inherently test the state in which the feature is off).
Taking a step back from broad, user-facing feature changes, sometimes we need to use these techniques at different levels in the tech stack.
Recently a job in our data ingestion pipeline started dying with out-of-memory errors. The data we ingest isn’t huge (a couple gigabytes) but slight inefficiencies in processing added up and were crashing our handlers.
We got to the bottom of the problem and opened a PR that restructured the ingress using clever streaming patterns. The fix felt good but it entailed a lot of rewriting. Of course, the ingress code that was rewritten was heavily IO-bound and, as such, lacked automated tests.
In the words of @pbrisbin, in his PR review for this change:
> Cool beans. Now how do we know you haven’t broken it?
How We Know
As when Product asked how many users would be affected earlier, @pbrisbin’s question was a real stumper! The hard questions often inspire the most interesting answers. After more of that seemingly magical “deep thought”, we stumbled upon yet another eureka idea.
The solution we came up with actually resembles A/B testing in a way. We built out a little test harness that
- runs the ingestion using the old logic,
- runs it again using the new logic,
- pretty-prints both results (done trivially with pretty-show), and
- uses good old, basic diff to compare the results (using Diff)
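The shape of that harness can be sketched roughly as below. To keep the example self-contained, `show` and a hand-rolled positional line diff stand in for pretty-show and the Diff package, and the old/new logic are hypothetical stand-ins for the real ingestion code.

```haskell
-- Run both versions of the logic on the same input, pretty-print the
-- results, and report any lines that differ. An empty result means the
-- rewrite is (observably) faithful.
compareRuns :: Show a => (i -> a) -> (i -> a) -> i -> [String]
compareRuns oldLogic newLogic input =
  diffLines
    (lines (show (oldLogic input)))
    (lines (show (newLogic input)))

-- Naive positional line diff: pair lines up and keep the mismatches.
-- (The real harness used the Diff package for a proper diff.)
diffLines :: [String] -> [String] -> [String]
diffLines old new =
  [ "- " ++ o ++ "\n+ " ++ n | (o, n) <- zipPad old new, o /= n ]
  where
    zipPad (x : xs) (y : ys) = (x, y) : zipPad xs ys
    zipPad xs [] = map (\x -> (x, "<missing>")) xs
    zipPad [] ys = map (\y -> ("<missing>", y)) ys
```

Pointing `compareRuns` at the old and new ingestion over real production inputs turns “how do we know?” into “read the (hopefully empty) diff”.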
We then ran this test for some big customers, behind the scenes in production, and manually looked over the diffs for regressions in our logic. This allowed us to confidently merge the change and perform continual refactors with ease. The harness even quickly revealed an ingestion filtering error in the old logic!
I hope you’ve enjoyed these unique approaches to testing as much as I enjoyed witnessing them unfold. I’m convinced that these approaches are innovative. I’m also convinced that I would not have witnessed them if, in the mind of a Freckler, testing were just “that activity the QA engineer does” or only “that thing I do sometimes when I write a unit test.”