Test Suite Debugging Time

Created 2016-09-05 / Edited 2016-09-05

One of the codebases I work on regularly has a large and slow test suite. So slow that it typically only runs in totality in a continuous-integration (ci) environment, and there it uses the parallel_tests gem to slice it into pieces and run in parallel. This gets it to run in like 30 minutes instead of 2.5 hours.

But I want to run the whole thing linearly sometimes, especially so I can make sure tests aren't conflicting with each other. So I created a new ci job that runs it linearly ... and got a fail that I didn't get otherwise. Took forever to figure it out!

This was about a worker that I call a "sweeper" -- it's job is to run once a day and make sure nothing was missed by other realtime jobs during the day. Here is a simplified version.

class SweeperWorker
  INTERVAL = 25.hours.ago # Overlap the day by an hour
  def perform
    candidates.each(&:do_work) # something like that
  end
  def candidates
    Thing.where('created_at > ?', INTERVAL)
  end
end

# rspec
describe SweeperWorker do
  let(:new_thing) { Thing.new(created_at: 5.hours.ago) }
  let(:old_thing) { Thing.new(created_at: 26.hours.ago) }

  it "does pick up the new thing" do
    new_thing # force to exist
    expect(SweeperWorker.new.candidates).to include(new_thing)
  end

  it "does not pick up the old thing" do
    old_thing # force to exist
    expect(SweeperWorker.new.candidates).not_to include(old_thing)
  end
end

Not sure that is quite right, but you should get the idea.

The second example was failing -- both the "old thing" and the "new thing" were getting into the sweeper window -- but only when the test suite was run linearly. Run it in ci, no problem. Run just this test/file locally, no problem.

The suspect is thus "time" -- but where? I went down the wrong track for a while, looking into timezones. Rails vs MySQL on ci vs my machine and so on ... some of those things weren't quite matched up, but still. I ran the test over and over, adding more and more debugging so I could peer into the contents of variables and the DB. Everything looked fine -- all the timestamps were what I expect. So it must be the query itself that is somehow asking for a longer timespan.

Finally I got it! Right there in the first line of SweeperWorker:

INTERVAL = 25.hours.ago # Overlap the day by an hour

This constant is set when the class is loaded, and never changed! So by the time this test runs in the 2.5 hours of test time, "25 hours ago" is now "27.5 hours ago" -- so our "old" thing is in our window. The "25.hours.ago" certainly looks relative, but it is a calculation that outputs an absolute time which is then saved for later use.

Move that constant into the method as a variable and it is fixed. And now we can sing our victory song!

youtube:w8KQmps-Sog

Lessons:

  • Beware of calculated constants
  • Make sure relative values are actually relative
  • If the speed of execution (slow vs fast) changes the outcome, look for things that are more or less spread out in the timeline, such as constant definition.