January 7, 2018 Update
Last Updated: Jan 16, 2018 11:49AM CST
Status Update: Jan 7th, 2018
Today we experienced an outage to the Church Online Platform that lasted one hour. We know many churches, including our own, rely on the platform each week to share the Gospel. We take this very seriously—and our team feels the weight when churches are offline.
We want nothing more than to see the name of Jesus proclaimed through #churchonline, and today’s outage is not acceptable.
In this status update, we’ll cover in more detail what caused today’s outage. We also want you to know that we are more committed than ever to #churchonline and to serving churches through the Church Online Platform.
What caused today’s outage?
Two specific Events-related issues created a spike in response times and caused the platform outage:
1) The system that runs at regular intervals to gather and report metrics for each Event had too much data to process
2) Three specific back-end queries related to processing data for event times were overloaded
To resolve the outage, the system that gathers and processes metrics was paused. This allowed the platform to come back up, but response times were still relatively high. To bring response times down further, the team added new indexes for two of the three highest-volume queries related to Event times. This improved the response of these queries by 200% and brought response times on the platform down to acceptable levels.
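To illustrate why an index helps a high-volume lookup, here is a minimal sketch. It is not the platform's actual schema or queries: the table `event_times`, its columns, and the index name are hypothetical, and it uses Python's built-in SQLite rather than Postgres so that it runs on its own. It compares the query plan before and after an index is created.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")

# Hypothetical table standing in for "Event times" data.
conn.execute(
    "CREATE TABLE event_times ("
    "id INTEGER PRIMARY KEY, event_id INTEGER, starts_at TEXT)"
)
conn.executemany(
    "INSERT INTO event_times (event_id, starts_at) VALUES (?, ?)",
    [(i % 100, f"2018-01-07T{i % 24:02d}:00") for i in range(10_000)],
)

sql = "SELECT starts_at FROM event_times WHERE event_id = ?"

# Without an index, the database has to scan every row in the table.
plan_before = query_plan(conn, sql, (42,))

# With an index on the filtered column, it can look up matching rows directly.
conn.execute("CREATE INDEX idx_event_times_event_id ON event_times (event_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan reports a full table scan, while the second reports a search that uses the new index. The same idea applies in Postgres, although its planner and plan output differ.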
Since implementing these fixes, platform response times have been good; however, we’re working to address inconsistencies in metrics reporting and summary emails.
How was this outage different from the others?
The platform has experienced significant outages on three of the past four Sunday mornings. Each of the outages has been related to how data is processed on the platform. After the first two weekends, we believed we had fixed the root cause by putting some high-traffic processes behind a dedicated CDN to protect the platform. Unfortunately, the issues we experienced today were masked by the issue we solved following Christmas, and they didn’t show up again until this weekend.
Will there be another outage next week?
In addition to the primary issues, there are other back-end, data-related systems that contributed to this outage. As the platform has experienced significant growth, the servers and systems that could handle the loads from two years ago haven’t been able to handle the increases, and over the last few weeks they’ve revealed stress points in the rest of the platform.
Starting today and continuing in the coming weeks, our team is fully focused on implementing fixes throughout our technical stack to correct how the platform handles data processing and to prevent future outages related to these issues.
We care deeply about the stability of the platform and are working to ensure there are no more outages. God is moving through #churchonline, and we won’t let technology setbacks hold us back from moving the mission of the Church forward.
What are the long-term plans for platform stability?
Beyond data processing improvements, we plan to address multiple other opportunities within the platform to improve performance and resiliency. This includes:
Working with our tech stack vendors (Heroku, Redis, Postgres, PubNub) to ensure we are optimizing their integration with the platform
Implementing additional platform monitoring and “circuit breakers” to make the platform more resilient
Investigating additional platform architecture and product changes to further improve performance and user experience
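For readers unfamiliar with the term, a "circuit breaker" stops calling a struggling dependency after repeated failures, so that one slow system cannot drag down the rest. The sketch below is only an illustration of that pattern, not the platform's implementation. The class name, thresholds, and timeout values are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast instead of hitting the dependency.
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open (or re-open after a failed trial call) and restart the timer.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a dependency, for example `breaker.call(fetch_metrics, event_id)` where `fetch_metrics` is a hypothetical function, and treat `CircuitOpenError` as a signal to serve a fallback rather than wait on a failing system.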
We will continue to communicate as we work diligently to strengthen the platform that helps us all fulfill our calling.
“And then he told them, ‘Go into all the world and preach the Good News to everyone.’” Mark 16:15