Category Archives: IT Operations

WordPress website error site reverting to old version

For the last 6 months or so I have been noticing an intermittent problem with this website: the site would revert to a very old version that showed my old logo design and only old posts. At first I thought I had a cache problem on my PC and flushed my local DNS cache, hoping that would resolve the issue. The problem showed up across multiple machines, so I quickly realized that was not the solution, but I did not seek a more permanent fix because the problem was very intermittent and I have been extremely busy (not a good excuse). When the problem recurred today I had finally had enough and logged a ticket with my web hosting company's support team to work on a permanent resolution.

Problem: This website was intermittently reverting to an old version of the site (with an old logo design) and showing only posts from 1/2012 and older.

Impact: The site design looked dated, and visitors were not seeing the improved design/layout or the new material posted on the site. I also suspect this hurt the site from a search engine perspective and cost it traffic, since the lack of new content made the site appear stale.

Actions taken to attempt resolution: I thought the problem was DNS related, so I flushed my local DNS cache, but I realized something broader was going on when the problem appeared across multiple machines. I tried to research the problem using Google, but most of the guidance concerned webmaster tools options and did not seem applicable. After failing to find a satisfactory fix I logged a support ticket with my web hosting provider.

Root Cause: I had to give my web hosting provider's technical support staff admin access to the site and specify which database the site used. I created a unique temporary account/password for them, and they completed the analysis and resolution very quickly. The root cause was found to be a corrupted WordPress table; once this table was repaired using the phpMyAdmin tool, the site displayed as it should.
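
For readers curious what a table repair looks like underneath the phpMyAdmin interface, here is a minimal sketch that builds the MySQL `CHECK TABLE` and `REPAIR TABLE` statements for a list of tables. The table names are assumptions for illustration only (`wp_` is just the default WordPress prefix); I do not know exactly which table or steps the support team used, and `REPAIR TABLE` only applies to certain storage engines such as MyISAM. Take a backup before running anything like this.

```python
import re

# Assumption: illustrative table names only. The post does not record which
# WordPress table was corrupted, and "wp_" is merely the default prefix.
EXAMPLE_TABLES = ["wp_posts", "wp_options"]

# Only allow plain identifiers so a bad name cannot alter the statement.
_VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def maintenance_statements(tables, repair=False):
    """Build CHECK TABLE (or REPAIR TABLE) statements for the given tables."""
    verb = "REPAIR" if repair else "CHECK"
    statements = []
    for name in tables:
        if not _VALID_NAME.match(name):
            raise ValueError(f"unexpected table name: {name!r}")
        statements.append(f"{verb} TABLE `{name}`;")
    return statements


if __name__ == "__main__":
    # Check first; repair only the tables that report a problem.
    for line in maintenance_statements(EXAMPLE_TABLES):
        print(line)
```

The generated statements can be pasted into the SQL tab of phpMyAdmin or any MySQL client. Checking before repairing keeps the change as small as possible.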

Lessons learned: Do not wait extended periods of time to deal with a problem. I could have had this issue resolved much sooner if I had taken immediate action and logged a support ticket. The Lunarpages support team was very helpful and quickly solved this issue once I provided the needed access and confirmed the database ID.

Information Security Implications: As mentioned above, I had to provide site admin credentials to the technical support team to troubleshoot the problem. I followed these security best practices during the interaction:

  • Had a full backup of my site before the work began
  • Created a unique temporary admin account just for this purpose
  • Deleted the account as soon as my support ticket was closed out successfully

This turned out to be a pretty good operational/security case study so I thought it would be useful to document and share.

How to fix a security certificate error while browsing the internet

For the last week or two, the PC used only by the kids had been showing a security certificate error whenever they tried to browse the internet. The browser eventually got where it needed to go, but only after extra clicks to accept the risk of visiting a potentially bad site and to add an exception in the browser. The problem was happening in both Internet Explorer and Firefox, so I assumed that a virus was causing it.

I performed some basic antivirus scans using the free AVG antivirus software installed on the machine as well as Spybot Search and Destroy. Neither scan found anything overly incriminating, only the expected low/mid-risk cookies that are always found. I was a bit surprised at this result, so I started looking for other explanations of what could be wrong.

After a bit of research I found a documented case that closely matched my situation. The advice was to check the date on my PC, because a machine whose clock is set to an incorrect date in the past is known to cause problems with internet security certificates. Sure enough, the machine had been reset to the date it was originally purchased, and the issue went away after the date was corrected.

Quick Summary:

Problem: A common area machine was generating security certificate errors/warnings while browsing the internet with multiple browsers (Firefox, Internet Explorer, etc.).

Solution: Check the date on the machine and make sure it is set to the current calendar day. The PC had somehow been reset to default settings and was dated back to 2007, which was the source of the problem.
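
Why does a wrong date break certificates? Every certificate carries a "not before" and a "not after" date, and the browser compares the machine's clock against that window. A rough sketch of that comparison is below. The certificate dates and the "correct" clock value are made up for illustration; only the 2007 reset comes from my situation.

```python
from datetime import datetime


def clock_within_validity(clock, not_before, not_after):
    """Return True if the machine's clock falls inside the certificate's validity window."""
    return not_before <= clock <= not_after


# Hypothetical certificate, valid for one year (dates are assumptions).
not_before = datetime(2013, 1, 1)
not_after = datetime(2014, 1, 1)

reset_clock = datetime(2007, 6, 1)    # clock reset to a 2007 date, as on the kids' PC
correct_clock = datetime(2013, 6, 1)  # assumed correct date for this example

print(clock_within_validity(reset_clock, not_before, not_after))    # False
print(clock_within_validity(correct_clock, not_before, not_after))  # True
```

With the clock in 2007, every certificate issued after that date looks "not yet valid", which is why the error appeared on many sites and in every browser.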

How to run an incident management process

The primary purpose of an incident management process in the IT operations or security fields is to quickly restore normal service operations to minimize the impact on normal business operations. Here is a rundown of a typical incident response situation:

1. A business-critical operations incident or a security-related incident is reported to the help desk by an end user or an automated monitoring system. It is important for the help desk to get detailed information about the exact nature of the problem, including a clear statement of what is not working. The help desk should record the specifics in the problem log.

2. The help desk reviews the issue and the support scripts, determines the business impact of the issue, and decides whether it should be escalated as a high priority item.

3. The help desk follows the documented escalation process and begins to form a system restoration team (as detailed by the application/system support script).

4. The system restoration team assembles on a designated global phone bridge with the intent of getting all people necessary for system restoration on the line.

5. For a high priority application problem without a clearly defined cause, it is typical to get the end-to-end support team on the line. Typical system restoration participants are:

  • Application support team member
  • Server support team member
  • Database support team member
  • Network/Firewall team member
  • Someone who can test functionality/items as needed (often a business user)
  • Facilitator of the incident response call

6. Facts surrounding the event are discussed with the combined team so everyone is aligned on the problem that needs to be solved. The incident response facilitator should be the primary voice of the system restoration team and keep the team on track with the primary goal of restoring normal business operations.

7. Depending on the severity of the problem, it is important to keep relevant stakeholders updated on the progress and expected duration of the problem (if known). Communicating effectively is one of the most important things to do during an incident: it sets proper expectations, keeps those affected informed, and helps minimize the likelihood of unneeded political escalation of the event.

8. It is best practice to keep the phone bridge open until the problem is resolved to maintain problem solving momentum. If the problem is expected to run too long to make that practical it is good to define the needed update times and schedule the sessions as needed.

9. It is important to validate that the service has been restored to normal prior to disbanding the system restoration team. This is best done by validating with an end user on the bridge.

10. Before terminating the call the team should make sure the incident diary is updated with information about what was done to resolve the problem. In addition, any information needed for the root cause and corrective action (RCCA) analysis should be assembled while the incident is still fresh in everyone’s mind.

Important points about the Incident Management process

  • There is sometimes a tradeoff between restoring service quickly and collecting system logs and other information needed to find a root cause of the problem. This conflict should be managed according to the likelihood of finding a true root cause (which is very desirable to prevent future problems) versus the benefit of faster restoration of the affected service.
  • It is important that the system restoration team facilitator be in charge of leading the assembled resources to maintain an orderly process. Too many chefs in the kitchen will not help restore service in a more timely manner.
  • Documenting the problem ticket regularly through the process is important for tracking status, communicating updates, and as a source of data for the future root cause analysis.
  • Opening a group chat room for the system restoration team is a good way to share technical information without sidetracking the phone bridge, which should stay focused on directing resolution of the problem. It also serves as a nice log for the problem diary and a potential source of information for the root cause analysis.
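
If your problem management tool does not already provide an incident diary, the idea is simple enough to sketch. The class and field names below are my own invention for illustration, not taken from any particular ticketing product: each entry is timestamped, and the restoration is recorded along with who validated it.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class DiaryEntry:
    timestamp: datetime
    author: str
    note: str


@dataclass
class Incident:
    summary: str
    opened_at: datetime
    entries: List[DiaryEntry] = field(default_factory=list)
    restored_at: Optional[datetime] = None

    def log(self, when, author, note):
        """Append a timestamped note to the incident diary."""
        self.entries.append(DiaryEntry(when, author, note))

    def mark_restored(self, when, validated_by):
        """Record who confirmed that service is back to normal, and when."""
        self.log(when, validated_by, "Service restoration validated")
        self.restored_at = when

    def minutes_to_restore(self):
        """Minutes from report to validated restoration, or None if still open."""
        if self.restored_at is None:
            return None
        return (self.restored_at - self.opened_at).total_seconds() / 60


# Made-up example data.
incident = Incident("File server unreachable", datetime(2013, 1, 1, 9, 0))
incident.log(datetime(2013, 1, 1, 9, 10), "help desk", "Escalated as high priority")
incident.mark_restored(datetime(2013, 1, 1, 10, 30), "business user")
print(incident.minutes_to_restore())  # 90.0
```

The diary entries double as the chronology of events for the later root cause analysis.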

How to perform a root cause analysis?

The purpose of a root cause analysis in either the IT operations or information security fields is to gain insight into the source(s) of a problem with the goal of preventing recurrence. A root cause analysis should be performed after an incident has been responded to and not during. During the incident individuals should not be distracted and the primary focus of all involved should be on the restoration of service and the elimination of business impact. Once business operations have returned to normal the next steps should be to collect any relevant information and do a debrief in preparation for a formal root cause analysis.

Organizations often make use of the 5 Whys method to determine how an incident occurred. Asking the question "why" several times helps to drill down to what caused a problem instead of simply restating the problem itself.

Example of the 5 Whys method:

The company file and print server was infected with a worm <- Why?

The server was not patched with the latest Microsoft patches <- Why?

The automated server to deploy the patches has been broken for a month and is not operational <- Why?

The change to upgrade it 5 weeks ago was unsuccessful and no additional action was performed to correct it <- Why?

The change to the server was not properly planned or documented and the engineers were unaware that the upgrade activity had occurred <- Why?

Proper change control processes were not followed

The only time you are likely to hear more whys is in a car with several small children who you are trying to explain something to.
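
If you record your analyses electronically, the chain above maps naturally onto an ordered list in which each statement answers the "why" asked of the one before it. This is just one way to sketch it; the function names are my own.

```python
# Each statement answers "Why?" for the statement before it.
FIVE_WHYS = [
    "The company file and print server was infected with a worm",
    "The server was not patched with the latest Microsoft patches",
    "The automated server to deploy the patches has been broken for a month",
    "The upgrade change 5 weeks ago was unsuccessful and never corrected",
    "The change was not properly planned or documented",
    "Proper change control processes were not followed",
]


def count_whys(chain):
    """Number of times "why" was asked: one fewer than the statements."""
    return len(chain) - 1


def root_cause(chain):
    """The last statement in the chain is the candidate root cause."""
    return chain[-1]


def render(chain):
    """Format the chain with a "<- Why?" marker on every line except the last."""
    last = len(chain) - 1
    return [s if i == last else f"{s} <- Why?" for i, s in enumerate(chain)]


if __name__ == "__main__":
    print("\n".join(render(FIVE_WHYS)))
    print(f"Asked why {count_whys(FIVE_WHYS)} times")
    print(f"Root cause: {root_cause(FIVE_WHYS)}")
```

Note that the first line is only the symptom. The corrective action belongs to the last line, not the first.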

Tips for conducting an effective root cause analysis

  • A root cause analysis should be performed as soon after an incident as is practical, allowing time for the needed prework and for attendees to be scheduled. Extended delays increase the likelihood that the incident will not be well remembered and that the momentum to correct it will be lost.
  • Conduct sufficient prework to document the incident and actions taken during the event. Review the documentation with those involved for factual accuracy.
  • Schedule the root cause analysis so that key individuals are available to attend.
  • A standardized form should be utilized when conducting a root cause analysis. Ideally this information will be stored in an application or database so that metrics are easily generated to allow for long term improvement tracking.

Common errors that occur in the root cause analysis process

  • Failing to properly document the facts around the incident in a timely manner
  • Failing to understand the difference between correlated facts and causation
  • Not driving to a deep enough level and simply recording what happened vs. why it happened and how it can be prevented.
  • Not tracking improvement tasks to make sure they have been completed as expected
  • Not auditing the root cause analysis process for quality

Tips if the root cause cannot be determined

  • Determine what additional information should be collected next time and develop a process for collecting the needed information in case the event recurs.
  • Do not assign an incorrect root cause just for a false sense of completeness. Recognize that not all incidents can be attributed to a root cause on the first pass, and make a plan to be effective if the issue recurs.

Sample Root Cause Analysis Form

Statement of issue: Describe the problem that occurred

Chronology of events: Detail events that occurred with specific timelines and actions taken during the incident

Business Impact: Define and quantify the problem from a business perspective

Participants: Document individuals that participated in the root cause analysis

Corrective Actions: List each action with the name of the individual responsible for completing it and the date completed

Lessons Learned: Document to enable future improvements

Other areas with similar exposure: Document so the same incident does not have to be experienced multiple times in different operating areas

Contributing Causes: Items that may not be the root cause but were contributing factors that need correction

Was the incident a repeat event?

Final thoughts on root cause analysis

If you are capturing your root cause analysis in a database it may be useful to track many other items for reporting and improvement metrics. Some of these items might include:

  • Incident # (to link back to your problem management system)
  • Incident Status
  • Incident Start and End Time
  • Location/Country/Region of incident
  • Incident category (application/server/etc.)
  • Service affected
  • Organization owner of incident
  • Type of problem
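
As a rough sketch of how these fields could feed improvement metrics, the record below mirrors the list above and adds two simple reports. The field names, sample values, and the choice of metrics are all my own assumptions; a real implementation would live in your problem management system or a database.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RcaRecord:
    incident_id: str
    status: str
    start: datetime
    end: datetime
    region: str
    category: str
    service: str
    owner: str
    problem_type: str
    repeat_event: bool = False

    def duration_minutes(self):
        """Length of the incident in minutes."""
        return (self.end - self.start).total_seconds() / 60


def incidents_by_category(records):
    """Count incidents per category (application/server/etc.)."""
    return Counter(r.category for r in records)


def mean_duration_minutes(records):
    """Average incident duration in minutes; 0.0 when there are no records."""
    if not records:
        return 0.0
    return sum(r.duration_minutes() for r in records) / len(records)


# Made-up sample data for illustration only.
SAMPLE = [
    RcaRecord("INC-1", "closed", datetime(2013, 1, 1, 9, 0), datetime(2013, 1, 1, 9, 30),
              "US", "server", "file and print", "infrastructure", "malware"),
    RcaRecord("INC-2", "closed", datetime(2013, 2, 1, 9, 0), datetime(2013, 2, 1, 10, 30),
              "US", "server", "patching", "infrastructure", "failed change"),
    RcaRecord("INC-3", "closed", datetime(2013, 3, 1, 9, 0), datetime(2013, 3, 1, 10, 0),
              "EU", "application", "order entry", "applications", "software defect"),
]

print(incidents_by_category(SAMPLE))
print(mean_duration_minutes(SAMPLE))  # 60.0
```

Tracking even these two numbers over time shows whether a category of incident is becoming more or less frequent and whether restoration is getting faster.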

Effectively performing a root cause analysis is one of the most important things you can do to improve operations and drive a continuous operations improvement mindset.