It was a fatal error that could have easily destroyed the company, said CloudBees senior consultant Viktor Farcic, but they handled it with full transparency on their blog, updated users constantly on social media, and through that process regained the community's respect.

If a team discovers a fatal outage like GitLab's, Asthana suggests focusing on the impacts for just a minute to help garner the team resources needed to address the issue. When it comes time to debug, you need to stay level-headed.

On the other hand, if a developer discovers a fatal error, that means the company found the issue before the users did, said Farcic. If the team is fast enough, it can fix an internally discovered problem before a user starts experiencing it, but more often than not, fatal errors are not found internally, said Farcic. If a developer manually discovers an error, it usually means the team does not have a good set of automated tests and that its iterations are not small enough, he added.

Tips for developing safe and secure applications
Errors are just like diseases; when ignored, they tend to spread and mutate into bigger problems, said Farcic. Just as doctors tell patients that the best medicine for staying healthy is prevention (followed by early detection of a disease), software errors are the same, he said. Some of the best "medicine" for bugs and exploits is to put a monitoring and alerting solution in place that tries to correct the problem automatically, so the team can self-heal errors. If that doesn't work, a developer can act swiftly and fix it themselves.

Avetisov recommends developing applications with test-driven development, using unit testing, automated regression and boundary checking. That, along with prioritizing non-functional tests equally with functional tests, are "all great ways to help catch exploits before they make it into the wild," he said. Even if you test, test, and test some more, Asthana said things will still go wrong.
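The boundary checking Avetisov mentions can be sketched with a small example. The function below, `clamp_page_size`, is purely illustrative (it is not from any of the companies quoted here); the point is that the tests deliberately exercise the edges of the valid range, where off-by-one bugs and input-handling exploits tend to hide.

```python
# Hypothetical example: boundary checking for a user-supplied pagination value.

def clamp_page_size(requested: int, max_size: int = 100) -> int:
    """Clamp a user-supplied page size to the valid range [1, max_size]."""
    if requested < 1:
        return 1
    return min(requested, max_size)

def test_boundaries():
    # Test exactly at, just inside, and just outside each boundary.
    assert clamp_page_size(0) == 1        # just below the lower bound
    assert clamp_page_size(1) == 1        # exact lower bound
    assert clamp_page_size(100) == 100    # exact upper bound
    assert clamp_page_size(101) == 100    # just past the upper bound
    assert clamp_page_size(-5) == 1       # hostile negative input

test_boundaries()
```

In a real project these assertions would live in a test suite run automatically on every change, which is what lets a team catch an error internally before a user ever sees it.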
The thing to do is be prepared, and before you ever find yourself in crisis mode, set up the right triggers and intervals for monitoring. "Make sure you have the right collaboration and versioning tools in place for all team members to stay apprised of the latest code changes," said Asthana.

What's the first thing emergency personnel ask a group of people to do in the midst of a crisis? Remain calm. For developers who discover an issue or exploit in their applications, staying calm and assessing the situation is a good place to start, but determining the level of severity will dictate what happens next.

Finding an error or exploit
According to Abhinav Asthana, CEO at API development and management company Postman, software teams typically have well-defined protocols for handling "Priority 1" issues. Oftentimes, the best thing to do is to inform clients who rely on the service, in addition to other teammates whose service might be impacted.

The first instinct, said George Avetisov, CEO of HYPR, is to fix the issue right away, but realistically, the developer should identify potential reasons the error occurred and communicate with any end users who fit the profile. "The urgency of a patch depends on how likely it is for a user to encounter the issue, versus what is lost or exposed by the error, so it's important to establish how critical a fix is before undertaking corrective measures," said Avetisov.

When a developer finds a fatal error, Avetisov said, documentation is key. It eases communication for operations teams and with customers or users, and the worst thing a developer can do is "a silent background fix," which leaves no traceability for future debugging or bug tracking. Developers will also be able to debug the issue by first identifying the root cause with some level of certainty, said Asthana. Due to the timing of the issue, the root cause may be mistakenly attributed to something like recently pushed-out code.
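The traceability Avetisov describes can be as simple as emitting a structured record when a fatal error is patched, rather than fixing it silently. The sketch below is a minimal illustration using Python's standard `logging` and `json` modules; the field names and the `report_fatal_error` helper are hypothetical, not part of any real incident-management tool.

```python
# Hypothetical sketch: record enough context when patching a fatal error
# that operations teams and future debuggers can trace what happened.
import json
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("incident")

def report_fatal_error(error: Exception, suspected_cause: str, ticket_id: str) -> str:
    """Emit a structured, searchable record instead of a silent background fix."""
    record = {
        "error_type": type(error).__name__,
        "message": str(error),
        # "suspected" because, as noted above, early root-cause guesses
        # (e.g. blaming recently pushed code) can be wrong.
        "suspected_cause": suspected_cause,
        "ticket": ticket_id,  # links the fix back to the bug tracker
    }
    payload = json.dumps(record)
    logger.error(payload)
    return payload

# Usage: document the fix and the working theory at the same time.
entry = report_fatal_error(ValueError("bad input"), "recent deploy", "OPS-1234")
```

Because the record names a suspected cause and a tracking ticket rather than asserting a definitive diagnosis, it stays useful even if the initial attribution turns out to be wrong.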
Discovering a fatal exploit
Minutes can turn into millions of dollars in lost revenue for organizations that discover a fatal outage. Beyond the organization's lost revenue, the outage could impact customers that rely on the given service, said Asthana. Think back to the GitLab error in February 2017, when an admin accidentally typed a command to delete a primary database. The site had to be taken down for maintenance and wasn't accessible for several hours. According to the company, customers who installed GitLab's software on their own servers weren't affected, since those installations don't connect to GitLab.com. The outage was bad, to say the least, but those paying customers weren't impacted at all.