Recommend a Great CI/DevOps Presentation: “Continuous Integration at Facebook”
I started practicing Continuous Integration in late 2005, using CruiseControl (CC for short), the so-called “Grand-daddy” of CI servers. The most challenging part was running automated UI tests; it took me a couple of years to master it (I had to hack the CC code and implement some features myself). Since then, I have attended a number of CI presentations at conferences and meetups. Frankly, none was good. They did not address the real challenges of CI at all, let alone offer practical solutions.
On 20/06/2015, I found an excellent presentation (YouTube) on how Facebook does CI, from F8 (the Facebook Developer Conference), by Katie Coons, a software engineer in Product Stability (a good division name) at Facebook. It was quite short, only 15 minutes, but packed with wisdom. I have watched it many times, and it is still valid. It is still, in my opinion, the best presentation on CI/CD and DevOps. I was very surprised by the low number of views and likes on YouTube.
Katie started the presentation with
“№1 goal of CI at Facebook is developer efficiency”
“We want computers waiting on humans.”
Then the three key goals of CI at Facebook:
“We don’t want our developers to chase down the failures that are not their fault, that’s waste of developers’ time”
“The system must provide rapid feedback to developers because our developers will not wait for it. Moving fast is a really important part of Facebook’s culture”
“The system must provide frequent feedback to developers. Because the sooner we let them know about a problem, the easier and faster it is for them to fix it”
This matches my experience. Ironically, all the other CI presentations I have seen missed these key goals, with the possible exception of rapid feedback, and even then they were often vague. I like that Katie gave a specific target time for a CI build: 10 minutes.
Here is my understanding of these three CI goals:
Reduce false alarms, i.e., tests flagged as failures when nothing is actually wrong. If the team does not trust the results of automated test executions, everything goes out of the window, right? However, we frequently see projects executing automated tests every night in Jenkins with a mere ~40% pass rate. Why bother?
The feedback from a CI build must be very quick. As we all know, the classic software engineering rule says: the earlier a bug is found, the cheaper and quicker it is to fix. However, this is easier said than done. Few companies have a proper parallel testing lab set up to make executing automated functional tests faster.
Once, a manager (of another division) wanted me to review their CI implementation (over the phone). She said: “Our DevOps guys spent a lot of effort, the build runs for 10 hours …”. I replied: “It has already failed. Your DevOps guys have no idea how to reduce the build time”. The result was exactly as I predicted: the whole thing was discarded shortly after.
It should be easy to trigger a CI build run. There is no point in a total execution time of 10 minutes if the preparation for each build takes 2 hours.
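To make the second goal concrete, here is a rough sketch of why a parallel testing lab matters for hitting a 10-minute build. It is my own illustration, not anything shown in the talk: the function name `estimate_wall_clock`, the suite size and the per-test duration are all made-up assumptions, and the assignment strategy (longest test first, always to the least-loaded agent) is just one simple way to estimate the result.

```python
import heapq


def estimate_wall_clock(test_minutes, agents):
    """Estimate build time when tests are spread over parallel build agents.

    Tests are handed out longest-first, each to the currently least-loaded
    agent. The estimate is the load of the busiest agent at the end.
    """
    if agents < 1:
        raise ValueError("need at least one build agent")
    loads = [0.0] * agents  # a list of equal values is already a valid min-heap
    for minutes in sorted(test_minutes, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + minutes)
    return max(loads)


# Hypothetical suite: 60 end-to-end tests taking 1.5 minutes each.
suite = [1.5] * 60
print(estimate_wall_clock(suite, 1))   # 90.0 minutes on a single machine
print(estimate_wall_clock(suite, 10))  # 9.0 minutes across 10 agents
```

With these assumed numbers, one machine needs 90 minutes, while 10 agents bring the same suite under the 10-minute target.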
On developer’s involvement in automated testing and CI:
“It is really important at Facebook I will never, as a developer, write code and toss over the fence and for someone else to write test for it. It is my job as a developer at Facebook to write tests for my code, and my reviewers make sure I do”.
💡Show the above to your programmers who don’t want to write automated tests.
“For all of our end-to-end tests at Facebook we use WebDriver, WebDriver is an open-source JSON wire protocol, I encourage you all check it out if you haven’t already.”
“One of the great advantages of WebDriver is that it gets applications across all platforms.”
💡Show the above section to your team members who want to create their ‘own test framework’ or use other commercial/proprietary ones. If you are interested in why Selenium WebDriver is the best choice for test automation, please read my other articles:
- “Web Automation Framework Trend ⇒ Selenium WebDriver”
- “Please, not another Web Test Automation Framework, just use Selenium WebDriver”
- “Selenium WebDriver is the Easiest-to-Learn Web Test Automation Framework”
- “False ‘Selenium WebDriver Cons’”
- “Advice on Self-Learning Test Automation with Selenium WebDriver”
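For readers wondering what “JSON wire protocol” means in Katie’s quote: WebDriver commands are plain HTTP requests with JSON bodies. The sketch below only builds two such requests (it sends nothing) using Python’s standard library. The endpoints and payload shapes follow the W3C WebDriver standard, the successor of the original JSON Wire Protocol; the server address, the `SESSION_ID` placeholder and the helper name `webdriver_request` are my own assumptions for illustration.

```python
import json
from urllib.request import Request


def webdriver_request(server, path, payload):
    """Build (but do not send) a WebDriver POST request with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url=server.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


SERVER = "http://localhost:4444"  # assumption: a locally running WebDriver server

# "New Session": ask the server to start a browser.
new_session = webdriver_request(
    SERVER, "/session",
    {"capabilities": {"alwaysMatch": {"browserName": "firefox"}}},
)

# "Navigate To": SESSION_ID is a placeholder for the id the server returns.
navigate = webdriver_request(
    SERVER, "/session/SESSION_ID/url", {"url": "https://example.com"}
)

print(new_session.get_method(), new_session.full_url)
print(navigate.data.decode("utf-8"))
```

In real test scripts you would not write these requests by hand; Selenium WebDriver language bindings do it for you.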
On Continuous Testing Server:
Facebook’s own Sandcastle. Of course, not Jenkins/Bamboo/TeamCity. Have you ever seen Level 2 of AgileWay CT Grading (50+ E2E tests) implemented in Jenkins? I mean, reliable test executions with quick feedback. I haven’t.
“> 1000 test results per second every single day”
On addressing flaky automated tests:
“Failures inevitable at Scale”
“Infrastructure failures get retried”
“You would be surprised how many tests cannot run in parallel with themselves”
Please note the test execution retries here. From the context, the retries were handled by the Sandcastle CI server, NOT within the test scripts, as in Cypress. If you are interested in this topic, please read my other article: “Why Auto-Retry of Test Execution in a Test Framework is Wrong?”
On testing infrastructure:
“At Facebook, We have some of our top engineers working on development infrastructure”
And this impressive Facebook build farm.
(the above is pretty much what I got at the moment)
That’s why I gave Level 6 (highest, a handful in the world) to Facebook on AgileWay CT Grading.