Category Archives: Incident Management

WordPress website error: site reverting to old version

Over the last six months or so I have noticed an intermittent problem with this website: the site would revert to a very old version that showed my old logo design and only old posts. At first I thought I had a cache problem on my PC and flushed my local DNS cache, hoping that would resolve the issue. The problem showed up across multiple machines, so I quickly realized that was not the solution, but I did not seek a more permanent fix because the problem was so intermittent and I have been extremely busy (not a good excuse). When the problem recurred today I had finally had enough and logged a ticket with my web hosting company's support team to work on a permanent resolution.

Problem: This website was intermittently reverting to an old version of the site (with an old logo design) and showing only posts from 1/2012 and older.

Impact: The site design looked dated, and visitors were not seeing the improved design and layout or the new material posted on the site. I also suspect this hurt the site's search engine standing and cost traffic, since the lack of new content made the site appear stale.

Actions taken to attempt resolution: I thought the problem was DNS related, so I flushed my local DNS cache, but realized something broader was going on when the problem appeared across multiple machines. I tried researching the problem with Google, but most of the guidance concerned webmaster tools options and did not seem applicable. After failing to find a satisfactory fix, I logged a support ticket with my web hosting provider.

Root Cause: I had to give my web hosting provider's technical support staff admin access to the site and specify which database the site used. I created a unique temporary account and password for them, and they completed the analysis and resolution very quickly. The root cause turned out to be a corrupted WordPress table; once this table was repaired using the phpMyAdmin tool, the site displayed as it should.
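For anyone who wants to see what a table check and repair looks like outside the phpMyAdmin interface, here is a minimal sketch. It only builds MySQL CHECK TABLE and REPAIR TABLE statements and does not connect to a database. The table names are assumptions for illustration, since the corrupted table is not named here, and REPAIR TABLE only applies to certain storage engines (MyISAM, for example), so this is not necessarily the exact procedure the support team used.

```python
import re

# Hypothetical table names. WordPress uses a configurable prefix
# (commonly "wp_"); the table that was actually corrupted is not known.
TABLES = ["wp_posts", "wp_options"]

_VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def maintenance_sql(tables, repair=False):
    """Build CHECK TABLE (or REPAIR TABLE) statements for the given tables."""
    verb = "REPAIR" if repair else "CHECK"
    statements = []
    for name in tables:
        # Refuse anything that is not a plain identifier before quoting it.
        if not _VALID_NAME.fullmatch(name):
            raise ValueError(f"unexpected table name: {name!r}")
        statements.append(f"{verb} TABLE `{name}`;")
    return statements


if __name__ == "__main__":
    # Check first; repair only the tables that report a problem.
    for statement in maintenance_sql(TABLES):
        print(statement)
```

The generated statements could be pasted into the SQL tab of phpMyAdmin or a MySQL client. Take a full backup before running any repair.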

Lessons learned: Do not wait extended periods of time to deal with a problem. I could have had this issue resolved much sooner if I had taken immediate action and logged a support ticket. The Lunarpages support team was very helpful and quickly solved the issue once I provided the needed access and confirmed the database ID.

Information Security Implications: As mentioned above, I had to provide site admin credentials to the technical support team to troubleshoot the problem. I followed these security best practices during the interaction:

  • Had a full backup of my site before the work began
  • Created a unique temporary admin account just for this purpose
  • Deleted the account as soon as my support ticket was closed out successfully

This turned out to be a pretty good operational/security case study, so I thought it would be useful to document and share.

Information security issues can lead to bankruptcy

Information security is often an afterthought at best for many small to midsize businesses. DigiNotar, a Dutch certificate authority, is a great case study in what can go wrong when adequate information security controls are not put in place. DigiNotar was severely compromised, which undermined the very core its business was built on: trust and authority. The end result was an information security related bankruptcy that was preventable. What went wrong at DigiNotar, and what can you learn from their experience?

Lessons learned from DigiNotar information security incident

The more your business relies on trust the greater your information security risk and the more controls you need

Trust is based on your reputation, and when you are in a business requiring a high degree of trust it can be game over when a big incident strikes at the core of your model. There is a direct relationship between how much your business relies on trust and how much information security you need. For DigiNotar, the final straw came when the Dutch government lost confidence after inadequate disclosure and revoked their trusted status.

Full, prompt disclosure is the best way to recover your reputation

DigiNotar detected a problem with their certificate authority infrastructure nearly a month before the incident blew their business out of the water. They failed to make adequate disclosure, causing their customers to question the trust they had placed in DigiNotar. What if DigiNotar had come clean in the beginning? Perhaps they would have been able to salvage the company.

A full security audit needs to be conducted after there is reasonable cause to believe a serious security event has occurred

The primary goal should be to determine the method of attack, eliminate sources of vulnerability, and clean affected systems. The security review should be conducted by professionals. It can get quite expensive, but it is necessary to prevent worse outcomes such as total implosion of the business. If a full audit and full disclosure had occurred, the company would likely still exist.

Are you auditing and controlling the right high risk business activities?

DigiNotar’s compromise led to the creation of 531 unauthorized certificates. If certificate issuance had been reviewed more closely, and anomalies followed up with quick terminations and the actions described above, the company would likely still be in business.

Effective information security controls can make the difference between prosperity and bankruptcy. The choice is yours. To help make sure your business is taking information security seriously, be sure to review our information security essentials for small and midsize businesses.

How to perform a root cause analysis?

The purpose of a root cause analysis in either the IT operations or information security fields is to gain insight into the source(s) of a problem with the goal of preventing recurrence. A root cause analysis should be performed after an incident has been responded to, not during it. During the incident, individuals should not be distracted; the primary focus of all involved should be on the restoration of service and the elimination of business impact. Once business operations have returned to normal, the next steps should be to collect any relevant information and hold a debrief in preparation for a formal root cause analysis.

Organizations often use the 5 Whys method to determine how an incident occurred. Asking "why" several times helps drill down to what caused a problem rather than simply restating the problem itself.

Example of the 5 Whys method:

The company file and print server was infected with a worm <- Why?

The server was not patched with the latest Microsoft patches <- Why?

The automated server to deploy the patches has been broken for a month and is not operational <- Why?

The change to upgrade it 5 weeks ago was unsuccessful and no additional action was taken to correct it <- Why?

The change to the server was not properly planned or documented and the engineers were unaware that the upgrade activity had occurred <- Why?

Proper change control processes were not followed

The only time you are likely to hear more whys is in a car with several small children who you are trying to explain something to.
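If you want to record a 5 Whys chain rather than just talk it through, a simple ordered list is enough. The sketch below is one possible representation, not a standard format: the function names are made up for illustration, and the statements are shortened versions of the example chain above.

```python
# The example chain, ordered from the visible symptom down to the root cause.
WHY_CHAIN = [
    "The company file and print server was infected with a worm",
    "The server was not patched with the latest Microsoft patches",
    "The automated patch deployment server has been broken for a month",
    "The upgrade change 5 weeks ago failed and was never corrected",
    "The change was not properly planned or documented",
    "Proper change control processes were not followed",
]


def format_why_chain(chain):
    """Return one indented line per statement, labeling each answer by depth."""
    lines = []
    for depth, statement in enumerate(chain):
        label = "Problem" if depth == 0 else f"Why #{depth}"
        lines.append(f"{'  ' * depth}{label}: {statement}")
    return lines


def root_cause(chain):
    """The last answer in the chain is treated as the root cause."""
    if not chain:
        raise ValueError("the chain is empty")
    return chain[-1]


if __name__ == "__main__":
    print("\n".join(format_why_chain(WHY_CHAIN)))
    print("Root cause:", root_cause(WHY_CHAIN))
```

Keeping the chain in order makes it easy to review later whether each answer really explains the statement before it.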

Tips for conducting an effective root cause analysis

  • A root cause analysis should be performed as soon after an incident as is practical, allowing time for the needed prework and for attendees to be scheduled. Extended delays increase the likelihood that the incident will not be well remembered and that the momentum to correct it will be lost.
  • Conduct sufficient prework to document the incident and actions taken during the event. Review the documentation with those involved for factual accuracy.
  • Schedule the root cause analysis so that key individuals are available to attend
  • A standardized form should be utilized when conducting a root cause analysis. Ideally this information will be stored in an application or database so that metrics are easily generated to allow for long term improvement tracking.

Common errors that occur in the root cause analysis process

  • Failing to properly document the facts around the incident in a timely manner
  • Failing to understand the difference between correlated facts and causation
  • Not driving to a deep enough level and simply recording what happened vs. why it happened and how it can be prevented.
  • Not tracking improvement tasks to make sure they have been completed as expected
  • Not auditing the root cause analysis process for quality

Tips if root cause cannot be determined

  • Determine what additional information should be collected next time and develop a process for collecting the needed information in case the event reoccurs.
  • Do not assign an incorrect root cause just for a false sense of completeness. Recognize that not every incident can be attributed to a root cause on the first pass, and make a plan to be effective if the issue recurs.

Sample Root Cause Analysis Form

Statement of issue: Describe the problem that occurred

Chronology of events: Detail events that occurred with specific timelines and actions taken during the incident

Business Impact: Define and quantify the problem from a business perspective

Participants: Document individuals that participated in the root cause analysis

Corrective Actions: List each action with the name of the individual responsible for completing it and the date completed

Lessons Learned: Document to enable future improvements

Other areas with similar exposure: Document so same incident does not have to be experienced multiple times in different operating areas

Contributing Causes: Items may not be root cause but were contributing factors that need correction

Was the incident a repeat event?
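If the form is going to live in an application rather than on paper, the fields above map naturally onto a small record type. The following is a sketch only: the class and field names are assumptions chosen to mirror the form headings, and a real system would add storage and validation.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CorrectiveAction:
    description: str
    owner: str                            # individual responsible for completing
    completed_on: Optional[date] = None   # None until the action is done

    @property
    def is_open(self) -> bool:
        return self.completed_on is None


@dataclass
class RootCauseAnalysis:
    statement_of_issue: str
    chronology: List[str]                 # timestamped events and actions taken
    business_impact: str
    participants: List[str]
    corrective_actions: List[CorrectiveAction] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    similar_exposure: List[str] = field(default_factory=list)
    contributing_causes: List[str] = field(default_factory=list)
    repeat_event: bool = False

    def open_actions(self) -> List[CorrectiveAction]:
        """Corrective actions that have not been completed yet."""
        return [a for a in self.corrective_actions if a.is_open]
```

The open_actions helper is there because untracked improvement tasks are one of the common errors listed earlier; a report of open actions per analysis is an easy way to follow up.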

Final thoughts on root cause analysis

If you are capturing your root cause analysis in a database it may be useful to track many other items for reporting and improvement metrics. Some of these items might include:

  • Incident # (to link back to your problem management system)
  • Incident Status
  • Incident Start and End Time
  • Location/Country/Region of incident
  • Incident category (application/server/etc.)
  • Service affected
  • Organization owner of incident
  • Type of problem
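Once these fields are captured consistently, simple improvement metrics fall out with very little code. The sketch below assumes each incident is stored as a dictionary; the record layout, field names, and sample values are all hypothetical and only meant to show the idea.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records; a real system would load these from a database.
INCIDENTS = [
    {"incident_id": "INC-001", "category": "server",
     "start": datetime(2012, 1, 5, 9, 0), "end": datetime(2012, 1, 5, 11, 30),
     "repeat": False},
    {"incident_id": "INC-002", "category": "application",
     "start": datetime(2012, 2, 1, 14, 0), "end": datetime(2012, 2, 1, 15, 0),
     "repeat": False},
    {"incident_id": "INC-003", "category": "server",
     "start": datetime(2012, 3, 10, 8, 0), "end": datetime(2012, 3, 10, 9, 30),
     "repeat": True},
]


def incidents_per_category(records):
    """Count incidents by category (application/server/etc.)."""
    return Counter(r["category"] for r in records)


def mean_duration_hours(records):
    """Average time from incident start to end, in hours."""
    if not records:
        raise ValueError("no incident records")
    total_seconds = sum((r["end"] - r["start"]).total_seconds() for r in records)
    return total_seconds / 3600 / len(records)


def repeat_rate(records):
    """Fraction of incidents flagged as repeat events."""
    if not records:
        raise ValueError("no incident records")
    return sum(1 for r in records if r["repeat"]) / len(records)


if __name__ == "__main__":
    print(incidents_per_category(INCIDENTS))
    print(round(mean_duration_hours(INCIDENTS), 2), "hours on average")
    print(f"{repeat_rate(INCIDENTS):.0%} repeat events")
```

Tracked over time, a falling repeat rate is a reasonable sign that corrective actions are actually addressing root causes.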

Effectively performing a root cause analysis is one of the most important things you can do to improve operations and drive a continuous operations improvement mindset.