CS352 Lecture: Crash Control                            Last revised 11/16/02

Materials: Transparencies of Korth & Silberschatz p. 654, 655

I. Introduction
-- ------------

   A. One of the most important functions of a DBMS is ensuring the integrity
      of the data in the face of various unpredictable events such as power
      failures, hardware failures, software failures etc.  In fact, the uniform
      protection of data that a good DBMS provides may be one of its greatest
      advantages over traditional file processing.

   B. In general, it is necessary to protect data against several kinds of
      sources of corruption:

      1. Logical errors in the incoming data, which cause an operation to be
         aborted before it is completed.  (Also in this category is the
         possibility that an interactive system may allow a user to abort
         an operation he has begun before it completes.)  

      2. Failure of a transaction to complete execution due to issues related to
         concurrency (e.g. rollback, deadlock).
         
      3. System crashes that shut down the system unexpectedly.  These can
         arise from:

         a. Power failures.
         b. Hardware failures - e.g. a chip going bad in the CPU.
         c. Software failures - e.g. operating system crashes.
         d. Network communication failures (due to many possible causes)
         e. Human error - an operator pressing a wrong button or issuing a
            command that crashes the system.

         If a system crash occurs in the middle of updating the database, the
         writing of the new data may be only partially completed, resulting
         in corrupted data.
 
      4. Hardware failures that damage the media storing data.  Of these, the
         most potentially catastrophic is a head crash, in which a disk drive
         head comes into contact with the surface of the disk - effectively
         destroying all the data on the platter.

      5. External catastrophes such as fire, flood, etc.

   C. In general, data is stored in three types of storage, each with its own
      degree of security against loss:

      1. Data in VOLATILE STORAGE - e.g. the main memory of the computer - is 
         subject to loss at any time due to any kind of system failure.  In
         particular, power failures, most hardware failures, and many software
         crashes will cause data in volatile storage to be lost.

      2. Data in NON-VOLATILE STORAGE - e.g. disk and tape - is much more
         secure.  Data in non-volatile storage is generally not lost unless
         there is a catastrophic failure of the storage device itself (e.g.
         a head crash on a disk) or an external catastrophe such as fire or
         flooding.  In this regard, tape is much less vulnerable than disk.

      3. Conceptually, STABLE STORAGE is storage that is immune to any kind of 
         loss (short of nuclear war or the like!)  While no storage medium is 
         totally immune to destruction of data, stable storage may be
         approximated in one of two ways:

         a. The use of certain kinds of storage media, such as write-once laser
             disks.  Data written on such media is immune to virtually any
            possible source of damage short of the destruction of the disk
            itself.

         b. The writing of the same data on more than one non-volatile medium,
            so that if one is damaged the other(s) will still retain the data
            intact.
            
            i. Use of on-site mirroring - e.g. through RAID
            
           ii. Use of off-site mirroring (remote backup), which protects against
               physical dangers as well as system errors/crashes.
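
             As a rough sketch of approach (b) - mirrored writes - consider the
             following Python fragment.  (The file names, block size, and
             function name are purely illustrative; real stable-storage
             implementations work at the device level.)  The key point is that
             the write is not considered complete until BOTH copies have been
             forced to their media.

                import os

                # Hypothetical paths standing in for two independent
                # non-volatile devices.
                COPY_A = "stable_copy_a.dat"
                COPY_B = "stable_copy_b.dat"
                BLOCK_SIZE = 512

                def stable_write(block_no, data):
                    assert len(data) == BLOCK_SIZE
                    for path in (COPY_A, COPY_B):    # copy A first, then copy B
                        mode = "r+b" if os.path.exists(path) else "w+b"
                        with open(path, mode) as f:
                            f.seek(block_no * BLOCK_SIZE)
                            f.write(data)
                            f.flush()
                            os.fsync(f.fileno())     # force this copy to the medium
                    # A crash between the two writes leaves at least one copy
                    # intact, so recovery can compare the copies and repair the
                    # damaged one.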

   D. Actually, the protection of data involves two different sorts of
      measures.

      1. Regular system backups are an essential part of protecting data 
         against loss.  Backups are designed to prevent loss of data due to 
         physical damage to non-volatile storage media (e.g. head crashes
         on the disk), and also provide some protection against inadvertent
         erasure of data that is still needed.

          a. Backups are of two general kinds:

            i. A full backup (dump) makes a copy of the complete database to 
               some non-volatile storage medium, which is then removed from the
               system and stored in a safe place. 

           ii. An incremental backup records only the CHANGES made to the
               database since the last backup (full or incremental).  Thus,
               recovering from a disk failure will require mounting the last
               full backup and restoring the data from it, then mounting each
               subsequent incremental backup in turn and recovering the
               changes from it.  (A code sketch of this restore order appears
               after point c below.)

         b. Backup schedules will vary from system to system.  However, in
            most heavy-volume systems, incremental backups are done on a daily
            basis, and full backups are done perhaps once a week.  On lighter
            use systems, the backup intervals may be extended somewhat.

         c. The media used for backup are some sort of non-volatile storage:
             often tape, but sometimes removable disk.  However, the handling
            of the backup media makes them effectively stable storage, since
            they are removed from the system once the data is written to them
            and only remounted if a recovery is necessary.  In addition,
            backup media are often stored in a fireproof safe either in the
            computer center or in a separate building.
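
             As a rough illustration of the restore order described in (ii)
             above, the following sketch (Python, with hypothetical file names
             and a dictionary standing in for the database) reloads the last
             full backup and then replays each incremental backup in the order
             in which it was taken.

                import json

                # Hypothetical backup files; in practice these would be tapes
                # or removable disks mounted one at a time.
                FULL_BACKUP = "full_backup.json"
                INCREMENTALS = ["incr_mon.json", "incr_tue.json"]   # oldest first

                def restore_database():
                    # 1. Reload the most recent full backup.
                    with open(FULL_BACKUP) as f:
                        database = json.load(f)
                    # 2. Replay each incremental backup in turn, oldest first,
                    #    applying the changes it records.
                    for path in INCREMENTALS:
                        with open(path) as f:
                            changes = json.load(f)   # e.g. {record id: new value}
                        database.update(changes)
                    return database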

      2. Measures taken to recover from aborted operations and system crashes.

          a. Note that backup is particularly designed to protect against loss of
             data due to damage to non-volatile storage media - a relatively rare
             occurrence.  Since a backup captures no work done after it was
             taken, other measures are needed to allow a fast restoration to
             normal operation after an aborted operation or a system crash.
             (It is not generally acceptable for a late afternoon power failure
             to destroy all work done since the daily backup was run in the
             morning!)

         b. It is these measures we focus on today.  The measures we will 
            discuss seek to allow a rapid restoration of the system to a 
            consistent state after an aborted operation or a crash that does 
            NOT result in damage to non-volatile storage media (i.e. only the 
            contents of volatile storage are lost.)

      3. The concept of a transaction will be at the heart of our discussion.
         In particular, the measures we will discuss are related to ensuring
         the durability property of a transaction.  We will assume that each
         transaction is assigned a unique identifier (e.g. a serial number),
         and that some record of incoming transactions is kept.  We must
         ensure the durability of every transaction that committed before the
         crash, and must also deal with transactions that were in process at
         the time of the crash.

   E. Crucial to some schemes for guaranteeing the consistency of a database is
      the notion of a processing LOG, stored in "stable" storage.  During the
      processing of each transaction, a series of entries are made in the
      log.  Each entry includes the transaction's serial number, an entry
      type, and possibly other data.

      1. When the transaction begins, a start transaction entry is made 
         in the log.

      2. Appropriate entries are made in the log to record changes that
         the transaction makes to the database (more on these later.)

      3. One of two types of entry is made in the log when the transaction
         completes:

         i. A COMMIT entry indicates that the transaction completed 
            successfully, so that all its changes to the database can now be
            made permanent.

        ii. An ABORT entry would indicate that the transaction failed for some
            internal reason (logical error in the data or user abort), so that 
            none of its changes to the database should be allowed to remain.
            (Such an entry is not necessarily made; the absence of a commit
            entry implies an abort.)

        iii. If a transaction was in process (but not finished) when a system
             crash occurred, then neither of these entries will appear in the
             log.  This implies that:

            - No changes that the transaction has already made to the database
              should be allowed to remain.

            - The transaction can be restarted from the beginning after the
              database is restored to a consistent state.

      4. We suggested that the log should be maintained in stable storage.
         Actually, this is not absolutely necessary.  Non-volatile storage
         can be used for the log (and often is).

         a. One way to achieve a measure of stability is to keep two copies
            of the log on separate media, so that failure of one medium will
            not destroy the log.

         b. Ideally, the second copy of the log is kept at a remote site.

         c. In either case, however, we DO have to ensure that the log data is
            written to the storage medium BEFORE the changes it records are
            actually written.

            i. Since each log entry is relatively short, the system will 
               normally buffer entries in primary storage until a full block of
               entries has accumulated, and then will write that block to the
               log, followed by updating the various database entries on disk.

           ii. When we talked about buffer management, we noted that under
               some circumstances a particular data buffer would have to be 
               pinned so that it can't be written back to disk.  One situation 
               where this happens is the case where a change has been made to an
               in-memory copy of a record, but the corresponding log entry
               recording the change has not yet been physically written to
               the log.   In this case, the data buffer must be pinned until
               the log block has been filled up and written.

               If, for some reason, it becomes necessary to write out the
               data block before the log block recording the change has been
               written, then we will need to FORCE the premature writing of a 
                partially-filled log block.  (A code sketch of this
                write-ahead discipline appears after point d below.)

          iii. If we are using a second copy of the log at a remote site, then
               a very safe strategy would prohibit committing a transaction if
               we cannot write the log entry at BOTH sites due to a
               communication problem or failure of the remote site.  Since this
               could prevent any work from occurring, the rule can be relaxed
               to allow work to proceed with some minimal risk of data loss in
               such cases.

         d. Further, we have to ensure that a system crash during the writing
            of a particular log block cannot corrupt data previously written
            to the log.  This is a particular concern if we have to force
            the output of a log block before it is full.  It would be reasonable
            to consider reading this block back in, adding more data to it,
            and then writing it back out.  But a failure during this rewrite
            could corrupt the previous data.  To prevent this, we can rewrite
            the modified information to a new block, then - after this succeeds
            - reset the file header to point to this block rather than the 
            original block.  Of course, now we have to guard against a crash
            corrupting the file header!
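
             To make the write-ahead rule in (c) and (d) concrete, here is a
             minimal sketch in Python.  (The class and method names, the log
             file format, and the dictionary of dirty pages are illustrative
             assumptions, not an actual DBMS interface.)  Log entries are
             buffered in memory; a modified data page stays pinned until the
             entries describing it have been forced to the log; only then may
             the page itself be written out.

                import json, os

                class LogManager:
                    # Buffers log entries and forces them to the log file on demand.
                    def __init__(self, path="log.jsonl"):    # hypothetical log file
                        self.path = path
                        self.buffer = []          # entries not yet on disk
                        self.next_lsn = 0         # next log sequence number
                        self.last_forced = -1     # highest LSN known to be on disk

                    def append(self, entry):
                        entry["lsn"] = self.next_lsn
                        self.next_lsn += 1
                        self.buffer.append(entry)
                        return entry["lsn"]

                    def force(self, up_to_lsn):
                        # Write all buffered entries to the log file and fsync it.
                        if up_to_lsn <= self.last_forced:
                            return                # already safely on disk
                        with open(self.path, "a") as f:
                            for e in self.buffer:
                                f.write(json.dumps(e) + "\n")
                            f.flush()
                            os.fsync(f.fileno())
                        self.last_forced = self.buffer[-1]["lsn"]
                        self.buffer = []

                class BufferManager:
                    # Keeps a dirty page pinned until the log entries covering
                    # it are known to be on disk.
                    def __init__(self, log):
                        self.log = log
                        self.dirty = {}   # page_no -> (data, LSN of last entry for it)

                    def update_page(self, page_no, data, txn_id):
                        lsn = self.log.append({"txn": txn_id, "page": page_no,
                                               "new": data})
                        self.dirty[page_no] = (data, lsn)  # pinned until log catches up

                    def flush_page(self, page_no):
                        data, lsn = self.dirty.pop(page_no)
                        # Write-ahead rule: force the log first, even if that means
                        # writing out a partially-filled log block.
                        self.log.force(lsn)
                        # ... only now is it safe to write the page itself to the
                        # database file ...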

   F. We will consider three general schemes for crash control:

      1. Incremental Log with Deferred updates: no changes are actually made  
         to the database until AFTER the transaction has committed and the 
         COMMIT entry has been written to the log.

      2. Incremental Log with Immediate updates: changes are made to the
         database during the processing of the transaction; but only AFTER a
         log entry is made indicating the original value so that the change
         can be reversed should the system crash or the transaction fail for    
         any reason.

      3. Shadow paging: two complete copies of the relevant database data are
         kept during the transaction: the original values and the modified
         values.  Only after the transaction commits are the modified values
         made to permanently replace the original ones.

II. Incremental Log with Deferred Updates
--- ----------- --- ---- -------- -------

   A. In this scheme, no actual changes are made to the database until the
      transaction completes.  Instead, the DBMS converts each write operation
      issued by the transaction to a write to the LOG, in which it records
      the item that would be altered and the new value that would be put
      there.

      Example: A transaction to transfer AMOUNT dollars from a user's checking 
               account number CACCOUNT to his savings account number SACCOUNT
               might execute the following SQL program:

                 UPDATE CHECKING_ACCOUNTS
                         SET BALANCE = BALANCE - :AMOUNT
                         WHERE ACCOUNT_NO = :CACCOUNT;
                 UPDATE SAVINGS_ACCOUNTS
                         SET BALANCE = BALANCE + :AMOUNT
                         WHERE ACCOUNT_NO = :SACCOUNT;

               If the initial balances in each account are $1000 and $2000
               respectively, and the amount to be transferred is $50.00, then 
               the following entries would be made in the log.  (Assume the
               transaction has been given the serial number T1234, and that
               the account records in question have internal identifiers
               (record numbers) 127 and 253.)

        T1234 starts
        T1234 writes 950.00 to BALANCE of CHECKING_ACCOUNTS record 127
        T1234 writes 2050.00 to BALANCE of SAVINGS_ACCOUNTS record 253
        T1234 commits
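
       The following is a rough sketch of how a DBMS might process such a
       transaction under deferred update.  (Python; the dictionary standing in
       for the database, the list standing in for the log, and the function
       name are all illustrative.)  Note that the writes issued during the
       transaction go only to the log; the database is touched only after the
       commit entry has been recorded.

          # Toy database: record number -> balance.  Records 127 and 253 are
          # the two account records in the example above.
          database = {127: 1000.00, 253: 2000.00}
          log = []                    # stands in for the log in stable storage

          def run_transfer(txn_id, from_rec, to_rec, amount):
              log.append((txn_id, "start"))
              # Deferred update: the new values go into the log, NOT into the
              # database.
              log.append((txn_id, "write", from_rec, database[from_rec] - amount))
              log.append((txn_id, "write", to_rec, database[to_rec] + amount))
              log.append((txn_id, "commit"))
              # Only after the commit entry is safely recorded are the deferred
              # writes actually applied to the database.
              for entry in log:
                  if entry[0] == txn_id and entry[1] == "write":
                      database[entry[2]] = entry[3]

          run_transfer("T1234", 127, 253, 50.00)
          print(database)             # {127: 950.0, 253: 2050.0}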

   B. Only when the transaction has partially committed are the actual updates 
      to the database made.

      Example: as soon as the T1234 commits record has been written to
               stable storage, the DBMS would proceed to write the two new
               values into the database.

   C. If the transaction is user-aborted or fails for some reason, no harm is 
      done, since no changes have been made to the database.

   D. Should the system crash during the course of execution of the 
      transaction, one of two things will happen:

      1. If the crash occurs BEFORE the commits record has been written, then
         the transaction can either simply be restarted from the beginning
         or ignored (the latter if it was an interactive transaction and the
         system was down long enough for the user to have left.)  Either
         approach will leave the system in a consistent state.

      2. If the crash occurs AFTER the commits record has been written, then
         each of the values specified in the log will be written to the 
         specified locations.  Note that the data may already have been 
         written; but no harm is done writing the same value to the same 
         location a second time.

   E. Under this scheme, the following algorithm is executed whenever the
      system restarts after a crash:

        for each transaction whose start record appears in the log 
          if its commit record has been written to the log then
            write each of the new values specified in the log to the database
          else
            (do nothing)

      1. This is called REDO-ing the transaction.  Note that what is redone is
         not the actual computation, but the writing of the results of the
         computation which have been preserved in the log.

      2. Note that all transactions which have written their commits record
         must be redone - not just the last one.  This is because the actual
         updates to the database may be made to in-memory buffers - not to the
         actual disk - depending on what buffering strategy is used.  Thus, 
         the only way to ensure that a transaction's effect is consistently 
         reflected in the database after a crash is to redo all of its writes.

      3. Because the requirement of redoing all partially-committed transactions
         after a crash can be burdensome, it is common to make use of 
         CHECKPOINTS.

         a. From time to time (perhaps during slack periods), the DBMS will
            suspend processing of transactions and flush all its internal 
            buffers to disk.  This means that all committed transactions are
            now fully reflected in non-volatile storage.

         b. The DBMS now writes a CHECKPOINT entry to the log, indicating that
            all prior transactions have been fully entered into non-volatile
            storage.

         c. If this is done, then when a crash occurs the redoing of 
            transactions can be begun with the first transaction to have
            committed after the most recent checkpoint entry in the log.
            (This transaction may have been in progress when the checkpoint
            was taken.)
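
      Pulling this together, here is a minimal sketch of restart processing
      under deferred update with checkpoints.  (Python; the tuple format of
      the log entries and the dictionary standing in for the database are
      illustrative assumptions.)  Transactions that committed after the most
      recent checkpoint are redone by rewriting their logged values; all
      other transactions are simply ignored.

         def recover_deferred(log, database):
             # log: a list of tuples such as ("T1", "start"),
             # ("T1", "write", rec, new_value), ("T1", "commit"), and
             # ("CHECKPOINT",).  database: dict mapping record -> value.

             # 1. Find the most recent checkpoint; transactions that committed
             #    before it are already fully reflected in non-volatile storage.
             start = 0
             for i, entry in enumerate(log):
                 if entry[0] == "CHECKPOINT":
                     start = i

             # 2. Collect every transaction whose commit record follows it.
             committed = {e[0] for e in log[start:]
                          if len(e) == 2 and e[1] == "commit"}

             # 3. REDO those transactions' writes.  (A transaction that
             #    committed after the checkpoint may have started before it,
             #    so its writes are taken from the whole log.)  Writes by
             #    uncommitted transactions never reached the database and are
             #    simply ignored.
             for entry in log:
                 if len(entry) == 4 and entry[1] == "write" and entry[0] in committed:
                     database[entry[2]] = entry[3]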

III. Incremental Log with Immediate Updates
---- ----------- --- ---- --------- -------

   A. This scheme differs from the previous scheme in that it actually makes
      changes to the database during the course of processing a transaction,
      rather than waiting to do them until after the transaction completes.
      This requires that more information be placed in the LOG.

      1. Each write operation results in an entry to the log, as in the
         previous scheme.  This entry must be written out to stable storage
         BEFORE the corresponding data is written to the database.

      2. The entry includes both the new value and the OLD value.

      Example: For our previous example, the log would now look like this:

        T1234 starts
        T1234 writes 950.00 to BALANCE of CHECKING_ACCOUNTS record 127 (old
                value was 1000.00)
        T1234 writes 2050.00 to BALANCE of SAVINGS_ACCOUNTS record 253 (old
                value was 2000.00)
        T1234 commits

   B. If the transaction is user-aborted or fails, then all of the writes that 
      have already taken place must be undone by writing the original value back
      to the database.

   C. Should a crash occur, the following algorithm is executed when the
      system restarts:

        for each transaction whose start record appears in the log 
          if its commit record has been written to the log then
            write each of the new values specified in the log to the database
          else
            rewrite each of the old values specified in the log to the database

      1. As before, writing the new values specified in the log is called
         REDO-ing the transaction, and may result in writing values that have
         already been written (which is harmless.)

      2. Writing the old values specified in the log back is called UNDO-ing
         the transaction.  This may also result in writing a value that has
         already been written back (due to a prior abort of the transaction
         in question) or in writing a value to a location that has never been
         changed (because the system crashed before the write recorded in the
         log actually took place.)  In either case, no harm is done.

      3. Note that all undo operations must be done FIRST, then all redo
         operations.  This is necessary to prevent undoing a redo!
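
      A minimal sketch of this restart algorithm follows.  (Python; the tuple
      format of the log entries is an illustrative assumption.)  Write entries
      now carry both the old and the new value; uncommitted transactions are
      undone first, and committed transactions are then redone.

         def recover_immediate(log, database):
             # log entries: (txn, "start"), (txn, "write", rec, old, new),
             # (txn, "commit").  database: dict mapping record -> value.
             started   = {e[0] for e in log if e[1] == "start"}
             committed = {e[0] for e in log if e[1] == "commit"}

             # 1. UNDO first: scan the log backwards, restoring the old value
             #    of every write made by a transaction that never committed.
             for entry in reversed(log):
                 if entry[1] == "write" and entry[0] in (started - committed):
                     database[entry[2]] = entry[3]     # entry[3] is the OLD value

             # 2. REDO second: scan the log forwards, rewriting the new value
             #    of every write made by a committed transaction.  (Undo is done
             #    first so that a redo is never subsequently undone.)
             for entry in log:
                 if entry[1] == "write" and entry[0] in committed:
                     database[entry[2]] = entry[4]     # entry[4] is the NEW value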

   D. Checkpointing may be used with this scheme as with the prior one to
      reduce the number of transactions to be processed during crash
      recovery.

   E. In general, this scheme involves somewhat more overhead than the previous
      scheme.

      1. Log entries are longer (two values, not just one, must be recorded)

      2. A user-aborted or failed transaction requires work to clean up its
         effects (not so with the prior scheme.)

      3. After a crash, some action must be taken for EVERY transaction in the
         log, not just for those that committed.

      4. Very important: before every write operation, the log entry must be
         forced out to stable storage.  In the previous scheme, log entries
         need only be forced out to stable storage at commit time.

      5. However, this scheme does allow changes made by a transaction to
         become visible in the database more quickly.  (This is not
         necessarily always an advantage, since it can interfere with
         recoverability or lead to cascading rollback if the transaction
         fails.)  However, one place where it might be useful is a transaction
         that writes a large amount of data before committing - too much data
         to buffer in internal memory - e.g.

                update savings 
                        set balance = balance + interest_rate * balance;

IV. Shadow Paging
--- ------ ------

   A. The final method for handling crash control is quite different from the
      other two, since it does not make use of a log at all.  Instead, it
      works by maintaining two separate versions of the active portion of the
      database: the current version (which reflects all changes made thus far)
      and the shadow version (which reflects the state of the database before
      the transaction started.)

      1. The current transaction "sees" only the current version.  In a
         concurrent environment, other users "see" the shadow version still.

      2. If the current transaction is user aborted or fails, then the current
         version of the database is discarded and the shadow version remains
         as it was.

      3. When the transaction commits, the current version is made permanent
         by a simple pointer readjustment.

   B. The name shadow PAGING comes from the fact that the system views the
      physical storage as a series of pages (blocks) numbered 1 .. n (where
      n can be very large).  No requirement is imposed that consecutive pages
      occupy consecutive physical locations on the disk; instead, a page table
      is used to map page numbers to physical locations.

      TRANSPARENCY: page 654

      Access to the page table is via a pointer to it kept in some known 
      location on disk.

   C. Maintenance of two versions of the active portion of the database is
      accomplished by having two page tables: a shadow page table and the
      current page table.

      1. The shadow page table is the page table as it existed before the
         transaction started, and is never changed during the execution of
         the transaction.

      2. The current page table is created when the transaction starts, by
         copying the page table.  (So, initially, the current and shadow
         page tables are identical.)

      3. During the course of transaction execution, the following protocol
         is used:

         a. All read operations are performed through the current page table.
            Data is always read from the disk in units of one full page.

         b. When a write operation is to be performed, a check is made to see
            whether the current and shadow tables both point to the same
            physical location for the page.  If so, then a free page is
            obtained and the current page table is made to point to it.
            In either case, data is then written to the page pointed to by
            the current page table.  (Data is always written to the disk in
            units of one full page.)  This ensures that the pages pointed to
            by the shadow page table are never altered.

            TRANSPARENCY: page 655

      4. To commit the transaction, the external pointer to the page table
         is modified to make the current page table the permanent page table.
         Then any pages pointed to by the now discarded shadow page table, but
         not by the current page table, can be made available for re-use by
         the free storage manager.
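
      The following is a rough sketch of this protocol.  (Python; the in-memory
      list of pages, the free list, and all names are illustrative assumptions;
      real shadow paging works against pages on disk.)  The current page table
      starts as a copy of the shadow table; the first write to a page allocates
      a fresh physical page and redirects only the current table, so pages
      reachable from the shadow table are never modified; commit is a single
      pointer swap.

         import copy

         class ShadowPagedStore:
             def __init__(self, n_pages, page_size=8):
                 # Physical pages; the second half serves as the free page pool.
                 self.pages = [bytearray(page_size) for _ in range(n_pages * 2)]
                 self.free = list(range(n_pages, n_pages * 2))
                 self.shadow_table = list(range(n_pages))  # the "permanent" table

             def begin(self):
                 # The current table starts as an exact copy of the shadow table.
                 self.current_table = copy.copy(self.shadow_table)

             def read(self, logical):
                 return bytes(self.pages[self.current_table[logical]])

             def write(self, logical, data):
                 # First write to this page in the transaction: copy it to a
                 # free physical page and redirect only the current table, so
                 # the page the shadow table points to is never altered.
                 if self.current_table[logical] == self.shadow_table[logical]:
                     new_phys = self.free.pop()
                     self.pages[new_phys][:] = self.pages[self.shadow_table[logical]]
                     self.current_table[logical] = new_phys
                 self.pages[self.current_table[logical]][:len(data)] = data

             def commit(self):
                 # The "pointer swap": the current table becomes permanent, and
                 # pages reachable only from the old shadow table become free.
                 self.free.extend(set(self.shadow_table) - set(self.current_table))
                 self.shadow_table = self.current_table

             def abort(self):
                 # Discard the current version; the shadow version is untouched.
                 self.free.extend(set(self.current_table) - set(self.shadow_table))
                 self.current_table = copy.copy(self.shadow_table)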

   D. Should a crash occur, recovery is instantaneous: the external pointer
      always points to the page table as it existed before the current
      transaction began to execute.  There are, then, two major advantages to
      this scheme:

      1. The overhead of maintaining a transaction log is eliminated.

      2. No undoing or redoing of transactions is ever needed.  

   E. There are, however, several disadvantages to this scheme, though all can
      be overcome:

      1. There is the overhead of creating a new copy of the page table -
         potentially many thousands of words - for each transaction.  (This
         can be minimized by paging the page table, so that only pages in the
         page table that point to data pages that are changed need to be 
         altered.)

      2. It is no longer possible to achieve physical contiguity of data in 
         units larger than one page - i.e. we can improve access time to related
         data items by putting them on the same page, but we cannot easily force
         related pages to be near each other.

         a. This problem can be partially overcome by sophisticated storage
            allocation techniques.

         b. This is not a major problem when the database is used on a 
            multiuser system, since other users may force the disk heads to
            move between accesses done for a single user anyway.

      3. Another problem is the overhead of garbage collection to recover freed
         pages - especially after a crash.  However, the methodology is
         straightforward: any page not pointed to by a page table entry is
         necessarily free, so a simple scan of the page table will suffice.

      4. Finally, this scheme is a bit harder to adapt to concurrent processing
         than the other schemes - since now each concurrent transaction must
         have its own private current page table and interaction between
         transactions is harder to control.  (This is overcome by continuing to
         make limited use of logging.)

Copyright ©2002 - Russell C. Bjork