Debugging Backwards in Time

Abstract

Some of the difficulties involved with debugging a program using traditional debugging techniques are examined and criticized, then an alternative is introduced. By recording every state change in the run of a program, it is possible to present the programmer everything that might be desired. Essentially, it becomes possible to debug the program by going "backwards in time," vastly simplifying the process of debugging. An implementation of these ideas, the "Omniscient Debugger", is used to demonstrate their viability. New techniques for displaying information in the most intuitive fashion with the least effort are presented. Finally performance issues are discussed along with possible optimizations.

Introduction

Over the past forty years there has been change in way program debuggers work. In 1961 a debugger called "DDT" existed on Digital machines which allowed the programmer to examine, deposit, set break points, set trace points, single step, etc. In 2002 the primary debuggers for Java allows one to perform the identical functions with greater ease, but little more.

A brief survey of the major debuggers for Java: Java Workshop, Forte for Java, JBuilder, Symantic Cafe, Visual Age for Java, and IDEA reveals no major advances in debugging techniques since DDT. Moreover, many of them are more difficult to set up and use(!) Not one of these environments provide a debugger equal to that of the Lisp environments of the late 70s, which at least had trace and break packages that were easily manipulated, and included comprehensible print strings for objects.

(This is not to imply that a great deal of very good work hasn't been done in these products. Especially in the areas of remote debugging and multiple language debugging, some very impressive accomplishments have been made. It's just that no one has looked beyond the breakpoint paradigm.)

I believe that the reason for this is that the designers of debuggers have all concentrated on answering one question: "What information can we provide to the programmers while the program is running?" While this is not at all unreasonable, it's not the right question. The question they should have been asking is "What information will help the programmer most?" The answers are not the same.

Notably, an informal survey (a show of hands in conferences, classes, etc.) of several thousand programmers reveals that roughly 90% of all programmers debug their programs strictly by use of println()! This I take to be a horrendous comment on the state of affairs. As much as I am about criticize tradtional debuggers below, they are still vastly superior to this.

The Wrong Thing

Current debuggers concentrate on debugging running programs. This implies that programs are to be debugged by guessing. They expect the programmer to guess where the problem is and put a breakpoint there. If the guess is wrong, the programmer needs to choose a different place for a breakpoint, or just continue on to the next time the breakpoint is hit. For many simple problems, this is sufficent. For many it is not.

Stopping at a breakpoint is expensive. Choosing the right places to set breakpoints is awkward and error-prone. Once set, it takes time for the program to stop and for the programmer to figure out where the program is. Then the programmer has to figure out if the state of the system is still consistant with his expectations (has it hit the bug yet?). Maybe he has to type in the names of a few variables to examine. Using watchpoints, he may follow a single object as it changes, but doing so is also very slow and awkward. Even just pushing "continue" again and again is slow and error-prone. For something we may be doing thousands of times, every mouse movement and every keystroke is expensive.

Because they are based on "guess and break," traditional debuggers above do not make other techniques of debugging easy. One of the most effective debugging techniques is to watch traces of method class, looking for unusual arguments and return values. (This was the primary method used for debugging Lisp programs twenty years ago, one which has never gained favor in the 'C' world.)

This is unfortunate, because most of us really do think in terms of methods as the primary unit of execution. Indeed, all of debugging could be described as the search for a method which returned a "wrong" value. None of the debuggers supply a good interface for tracing method calls. They make it difficult to: turn on tracing, read trace lines and recognize values, scan large numbers of trace lines, and match up return values with the method call they came from. (Lisp was very good at all but the last.) Moreover, in a multithreaded program, following trace output is generally impossible.

Non-deterministic bugs (multithreaded timing bugs, changing I/O packet sizes, mouse movements, or just unknown parameters) drive programmers crazy. None of these debuggers help with them at all. (Deja Vu (see Alpern et. al.) from IBM does make an attempt at dealing with this by forcing programs to re-execute in the same order.) How many times has a programmer spent hours, carefully narrowing down a bug to one module, only to push "continue" one too many times and end up back where he started.

The Right Thing

The most "obvious" way to debug a system is to put it into action and run it through its paces. When you notice something not working as intended, you should be able to stop the process and go backwards to figure out what went wrong. In most realms, "going backwards" is fundamentally impossible and we are forced to use other means to approximate it. In computing, this is not the case. It is possible for us to record the exact state of the computation at each and every instant and later "revert" the program to any previous state we desire. This is what I call...

Omniscient Debugging

An omniscient debugger works by collecting information about every state change (every variable assignment of any type) and every method call in a program. After the program is finished, it brings up a debugger display and allows the programmer to look at the state of the program at any time desired. The programmer can select any variable and go "backwards in time" to see where it was set, what all of its values were. It will be possible to "single step" the program forwards or backwards, to step to any method call, follow any exception throw, any context switch.

Maintaining State

I say I want to be able to "revert" the program to any previous state. Effectively this means that every change in every accessible object or local variable constitutes a separate state. We need to record each assignment in each thread and define an ordering on them. We'll start with a naive assumption that all such events are linearly ordered and we'll assign a simple time stamp (an integer) to each.

Collecting this data is fairly simple and can be done in a variety of fashions -- "hacking" the JVM, enhancing the compiler, inserting code in the binaries. We'll look at a few examples below.

For the sake of clarity, we are going to assume that the programmer will run the program with recording turned on to collect data. After the program exits, we'll run the debugger. All of the debugging we're about to describe will be done strictly with this data. There will be no "live program" underneath. Later we'll drop this requirement.

The time stamp will be the sole referent to program state. Anything the programmer wishes to see will be related to a time stamp. Usually that time stamp will not be directly evident. The programmer will select a trace line to look at, switch to a different thread, move up on down the stack, or even step through the code. He won't even notice the time stamp, but everything will be based on it, and all display panes will be updated to reflect it.

Performance and space requirements are a common concern, but not central to our objectives. We wish to demonstrate that omniscient debugging is an effective method of finding problems. If it isn't, then performance is meaningless. If it is, then performance is just limit on the set of programs we can apply it to. We'll look at performance in detail later.

Omniscient Debugging is Interesting

There are a number of things that make omniscient debugging worth the effort of exploring. First and foremost, debugging is easier if you can go backwards.

On top of this, it eliminates the worst problems with breakpoint debuggers: no "guessing" where to put breakpoints, no "extra steps" to debugging, no "Whoops, I went too far", no non-deterministic problems. (If the problem occurred under the debugger, you're got it. It can't get away.)

It gives the programmer a unique view of the program. Being able to see the traces of all the method calls for each of the threads is quite simply interesting.

All the data is serializable. It can be analyzed remotely. It can be saved to a file. Several programmers can work with the exact same debugging session. Beta customers can email a debugging session to developers even if they don't have the source.

As a matter of fact, with a good decompiler, a programmer should be able to do a decent job of debugging unknown code to which he doesn't have the source code!

The ODB

For the rest of this paper, we are going to discuss the issues of Omniscient Debugging within the context of the "ODB", an implementation of this concept in Java. It's important to realize that the general concept of Omniscient Debugging is independent of the implementation language and can equally well be done in C or C++.

The ODB is an implementation which collects information by instrumenting the byte code of the target program as its loaded. It uses a single high-level lock to order the events from different threads. It is a proof-of-concept and serves as an upper bound for the cost of collection and display. It is at least an order of magnitude too slow.

It is also a very, very effective debugger.

There are two major issues we need to address, presentation of information and "navigation" -- how does the programmer get the debugger to display the program state/time of interest? These are both computer/human interface issues, and as such they are matters of individual preference and opinion. There are very few "CHI" issues which are agreed upon uniformly. None-the-less, I believe that the choices I have made below are close to optimal, and certainly equal to any I have seen in any other debugger.

Presentation

The only information a debugger should display is that which is part of the changing state of the program. Anything else gets in the way of debugging. We are concerned with the data that makes up the program state and whatever can help the programmer understand that state: object states, method calls, threads, the stacks, variable values, I/O streams, the window system, the code, and any locks in use. All of this information is of immediate interest and should be visable simultaniously with no programmer effort required.

Information such as the compiler used, the current JVM, the graphical construction kit, the list of individual files, the IDE's "project name", etc. is not of immediate interest and should not use up any valuable screen real estate.

The exact layout and appearence of these items is open to debate. The ODB makes reasonable choices. Others are too. The debugger is going to be used to display the state of the program at different times and all of the data displays should be updated simultaniously in a consistant fashion. In particular, the programmer should never be in doubt about what a display is showing.

We are going to use the time stamps to reference program state. Each and every object, variable, I/O stream, etc. will have a known value at each time stamp. When we change the currently displayed time stamp (we will hence forth refer to this as "reverting the debugger" to a given time) all of the data displays will be updated to reflect the values at that time. The line of code which caused the state change will be highlighted, the method trace that it happened in too. The current thread, the current stack, the variable values, etc. will all reflect the selected time stamp.

Things that don't change between the previously selected time stamp and the current one, shouldn't change either. They shouldn't change their appearence, their position, or their highlighting. Things that do change should do so obviously. New variable values should be marked, code lines should be highlighted, etc. The programmer should never wonder "did this change?"

Recognizing State

First we must be able to recognize state. This basically means the programmer should be able to look at an object and know which object it is and what values its instance variables have. In existing debuggers this is highly problematic. Many objects are displayed using their toString() methods and show up as [I@80cf1 or javax.swing.JPanel[,0,0,invalid,...].

These are not very useful. The first is sufficent in the sense that it does supply a unique identifier for the object, but it is very difficult to recognize. The second gives information about the contents of the object, but it's impossible to know if two print strings represent the same object or just two objects with the same contents. Moreover, an object might have different print strings at different times!

We need reasonable print strings for objects. We want strings that tell us what kind of object it is, some kind of unique identifier (so we know if we see it again), and whatever brief internal data (which can't change!) the programmer thinks most useful. We also want the print string to be short enough to fit several on an 80 column display, say a max of 20 characters.

The toString() method should do this. Unfortunately, some methods usurp the toString() methods for their own purposes. Still, the ODB needs such a print string, so it supplies its own format showing the class name and an index number: <MyObject_75>, chopping off the package.

Class objects are displayed as just the class name (e.g., Person ). Boolean values, numbers, and characters are displayed as expected: true, false, 123, 45.678, 'X' (88). Bytes are also displayed like characters with both the ASCII character and the decimal value in parenthesis.

This representation can be easily displayed in a Jlist, or stored in an editor. It can be printed out and it can even be typed back in to the debugger at a later date to reference that object. The precise format of the print string (and if it's even a string as opposed to some more elaborate glyph) is open to debate. It just has to fulfill the requirements above.

The contents of objects will always be displayed with their values current to the selected time stamp. The ODB uses this indented structure, although others would quiet reasonable:
<Person_2 Bob>
  name        "Bob"
  age         23
  home        <Address_3>
  friend      <Person_3 Jim>

int[3]_2
  0          33
  1          66
  2          77

<Person_2 Bob>          (previous to Bob's creation)
  name        --
  age         --
  home        --
  friend      --
There are a few cases where a different print format is more "intutive" for the programmer. Threads are the one object type that often have a reasonable, programmer-supplied name, so we'll use that name along with our brackets for consistancy: <Consumer-15>. If the programmer supplies a number in the name, we'll assume it's unique, otherwise we'll add a unique number ourselves. Threads have a dual nature, they are both normal objects with instance variables and they are "engines of computation". When displayed as "engines" they will include an indication of their status (running, blocked, not-yet-created, etc.). Hence, when displayed as an object, a blocked thread with one instance variable will look like this:
<Consumer_5>
  _blockedOn     <Person_5 Jan>
  language       "French"
and when displayed in the "Threads Pane" as an engine of execution, like this:
<Consumer_5>                   <Person_5 Jan>
-- <Consumer_6> --
<Consumer_7>
<Consumer_8>                   Dead
where 5 is blocked on Jan (we don't know if it's blocked waiting for the lock to become availble or if it's in a call to wait()), 6 has not yet been created, 7 is running, and 8 has already exited.

The collection classes constitue another special case. Our objective is to display the data in the most comprehensible fashion, not to adhear to any particular format. In reality, a Vector comprises an array of type Object, an internal index, and some additional bookkeeping variables (capacityIncrement, modCount, etc.) which are not of any interest to anyone other than the implementor. So the ODB displays a Vector in a format consistant with the way it's used:
<Vector_23>
  6    --
  5    --
  4    --
  3    <Person_15 Epp>
  2    <Person_12 Lovi>
  1    <Person_1 Tarvi>
  0    <Person_1 Tarvi>
where the maximum number of entries in the Vector was seven, and at the current moment there are only four entries. The point is that when the Vector is displayed at different times, it will fill up or empty out, but it will not otherwise change its appearence. In a similar fashion, a Hashtable mapping people to their position in a baseball game will look like this:
<Hashtable_3>
  <Person_15 Epp>     <Position_2 LeftField>
  <Person_12 Lovi>    <Position_3 RightField>
  <Person_1 Tarvi>    <Position_4 FirstBase>
  <Person_2 Tiia>     <Position_6 SecondBase>
Large objects are problematic and remain an open issue. What is the best way to display a 1000x1000x1000 array? Or even just a 10000 element array? Or a large bitmap? Happily, there are very few such objects and they are invariably arrays or large sets of instance variables of identical type.

Displaying Method Traces

In the halcyon days of Lisp, breakpoints in the traditional sense were unknown. You could break at function entry, but not at an individual line. Instead, the most common method of discerning a program's operation was through function tracing. It was a simple and straightforward request which took little time or effort. It was also generally quite effective. Tracing has never gained the same respect in other languages. We reintroduce tracing here as the primary method of identifying where in a program run one is.

Every method that is recorded will be displayed in a "Method Trace" pane. The format of the trace line will be object, method name, arguments, and return value. Each line will be indented according to depth and a matching "return line" will be displayed as method, return value. Return lines which directly follow their trace line will be elieded. Hence a method which prints out siblings and returns the number printed would produce traces like this:
<Person_5 Jan>.printSiblings() -> 3
 <PrintStream_0>.println("Sue") -> void
 <PrintStream_0>.println("Sid") -> void
 <PrintStream_0>.println("Sam") -> void
printSiblings -> 3
In multithreaded programs, each thread will have its own set of trace lines which will only be displayed when the debugger is reverted to a time stamp in that thread. Experience indicates that people think of threads as independent entities and that they just don't want to see different thread traces together.

When the debugger is reverted to a time stamp, the nearest previous trace line will be highlighted. This can be a little unclear when the highlighted line is a return and the actual time stamp is somewhere later, in the calling method. But we haven't found a better way to do it.

To avoid overwhelming the programmer with excessive, uninteresting data, it is possible to filter the trace window to show only those methods calls which the programmer finds interesting. Typically this will mean filtering out lines showing calls to certain classes, packages, specific methods, etc.

Although interesting in itself, the primary purpose of the trace window is not for debugging directly, but rather as a convenient index into the time sequence of the program. In other words, the programmer is going to use the trace to find interesting points in the program and then revert to the program state when that method call was executed.

The ODB Main Debugger Window

Now that we have the components and the print string formats, let's see how the ODB fits them together. Below we see the main debugger window. All of the components mentioned above have a pane, and all are visible simlutaniously. Indeed, the only other windows the ODB uses are popup dialogues for selection.



All of the panes show their respective information updated to the currently select time. Changing the current time by selecting a line in one of the panes, pushing a button, or selecting from a popup will cause the entire debugger to be updated to the new time.

The Display Panes

The ODB attempts to show all the data of interest on the screen at the same time. Popup dialogues are only used for finding files, which is rarely done. The panes are all updated to display the same time stamp, save for the exception noted in the Stack Pane.
Threads Pane
Threads are displayed in the format <Thread-3>, where "Thread-3" is whatever is returned by getName(). If the currently selected time is previous to the creation of the thread (before the thread's first time stamp), it is shown as -- <Thread-3> --, and as <Thread-3> Dead if after the thread has exited (after the last time stamp for a thread whose isAlive() method returns false). If the thread is blocked, waiting for a lock or in a call to wait() or join(), it is shown as <Thread-10> DemoWait. Adding the thread to the objects pane will show this as a pseudo-instance variable _blockedOn.
Stack Pane
All methods on the stack are displayed here. Single-clicking on a line will update the Code pane, the arguments, and the local variables for that stack pane. The current time stamp will not be changed. Pushing any button will act on the actual time, not what is displayed in the code pane. (It's not clear what is most "intuitive" here.) The menu command Code->Goto Stack Frame will revert to that time stamp.
Locals Pane
All arguments and local variables will be displayed here. If the program has been compiled without the -g flag, local variable names will displayed as LocalVariable_1, etc. The Sun compiler never records the proper names for arguments, instead it calls them arg0, arg1, etc. More problematic is that many compilers will optimize out local variables entirely, essentially reusing one variable for another . Hence a variable "i", which is only used in a loop, could be reused for another variable "average" which doesn't overlap with it, as below.

for (int i = 0; index < MAX; i++) {sum+=array[i];}
average = sum/MAX;


Also, if there are two local mutually exclusive variable bindings with the same name, the ODB will only show one, but with the values for both.

for (int i = 0; index1 < MAX1; i++) {sum1+=array1[i];}
for (int i = 0; index2 < MAX2; i++) {sum2+=array2[i];}

'this' Pane
This pane displays the current 'this' object (or the class object for static methods).
Trace Pane
For each thread, this shows the trace of all method calls that were made. Only three arguments are shown. Only five are recorded.
Code Pane
This displays the current line of code. The buttons all move forwards or backwards starting from this line. The two exceptions are: when a higher stack frame has been selected (see Stack Pane) and when a time stamp occurs in a source file that can't be found, in which case the code pane will be empty.
I/O Pane
This shows the all the lines written by System.out.println() (Obviously, only calls from instrumented code.) If the line is preceded by "--", that indicates that the current time is previous to the time that line was written. (Eventually this pane will show calls to print(), write(), etc. going to any I/O stream.)
Objects Pane
This shows whatever objects the user has copied here via double-clicking on objects in other windows. Double-clicking on an instance variable will add that object to the pane. Double-clicking on an object in this pane will "close up" that object, hiding all its instance variables. Double-clicking a second time will open it back up. Individual instance variables may also be removed selectively (see Objects->Remove and Objects->Retain Only).

Double-clicking on an instance variable will expand the value of that instance variable in-place (like the array int[20]_0 in figure 1). If the value is a primitive or is an object with no instance variables, nothing will happen. Pushing the Previous Value button, etc. will advance to the next value of the instance variable, not the next value of the instance variable's instance variable. For example, the prevous value of array in figure 1 is null, so the array will disappear if you go there. If you select an instance variable's instance variable, the commands will work on that, which could cause it to disappear if the instance variable also changed value. (It's a little confusing to explain. Just try it.)

Values which have not yet been set (e.g., instance variables of objects before they have been created) will be displayed as "--". Variables which changed values from the previously selected time stamp will be displayed with a leading "*".

Double-clicking on an instance variable's instance variable will copy that an instance variable to the top level. An instance variable's value may also be copied to the top level via Objects->Add Instance Variable.

The Objects menu allows you to display class object of the selected object (and the static variables thereof) Objects->Add Class.
Windows Pane
This is the one major unimplemented portion of the ODB. The intention is to display the windows created by the application and allow the programmer to observe changes as different times are selected, and to allow the programmer to navigate based on changes to the window.
Minibuffer
Modeled after EMACS, extended commands (such as Evaluate Expression or Search) are displayed here. You may do an incremental text search through the trace pane. <Ctrl-S> will start the search and give focus to the minibuffer . Typing characters will extend the search string. Another <Ctrl-S> will advance to the next match. <Ctrl-R> will reverse the search. <Return> will end the search and revert the debugger to the time of the current match. (The debugger will not revert otherwise.) <Ctrl-G> will abort any command.

Navigation

The primary modes of navigation are:

o Selecting a line in the trace window, which will revert the debugger to the time of the first time stamp recorded in that method. If that method was not instrumented, then the debugger will revert to the line it was called on. You may request the debugger to always to the calling line by turning Trace->Go to first line in method off.

o Selecting a line in the code window, which will revert the debugger to a time stamp on that line (see below).

o Pushing one of the buttons, which will revert the debugger to a time stamp depending on the currently selected object or line in that pane. (See Navigation Buttons below.)

o Selecting a line in the Stack pane, which will not change the current time stamp, but will show you where methods in the stack were called from (reverting the code pane) and the values of the local variables. (See Stack Pane.)

o Selecting a thread, which will revert to the nearest time stamp in that thread, earlier than the current time preferred. (All instrumentation methods are synchronized, so time stamps are strictly sequential.)

o Selecting a line in the I/O window, which will revert to the time when the newly selected line was printed.

The choice of which time stamp on a line in the code pane to select may be controlled by a set of options in the Code menu. The normal direction is to go backwards in time if the newly selected line is above the previously selected line, and forwards if not. Going forwards, you will select the first time stamp (of a contiguous set, e.g., call/return) on a line; going backwards, you will select the last.

There are a number of other possible behaviors that might be desired, such as to allow a time stamp in any thread, to allow a time stamp outside of the current stack frame. These will probably be supported by adding modifier keys to the selection (e.g., <Shift-click>, <Ctrl-click>).

Navigation Buttons

The buttons work as uniformly as possible for the different panes. The four most common buttons revert to the first, previous, next, and last time for selected object in that pane. In the Threads Pane, these execute first/last time stamp for the thread, but previous/next context switch. In the Code Pane, they are first/last line in current method, and "step into forward" and "step out of backwards". The other buttons execute "step over", "return from", and "next iteration on current line".

You may bring up a list of all the values a variable ever had and select one of them. The commands Object->Select IV Value selects a value for an instance variable, and Object->Select Local Value selects a value from a local variable. The ODB will revert to the time when that variable first took that value.



Locks and Wait Sets

When a thread blocks, waiting for a lock or calls wait() , the object on which it blocks will be displayed next to it in the Threads Pane (e.g., <Thread-7> is blocked on the class object DemoWait below). A pseudo-instance variable _blockedOn will be added to the thread and shown in the object pane if the thread is displayed there. Three pseudo-instance variable ( _waiters, _owner, _sleepers) will be added to the object. Sleepers are those threads which are sleeping, waiting to get the object's lock. Waiters are those that called wait(). If a thread blocks in uninstrumented code, the ODB will not notice.





Inter-Thread Relationships
In an uninstrumented program, operations which do not affect shared data have no relationship at all to operations in other threads. Only operations on shared objects (i.e., changing a variable value or I/O) have relationships.

In a properly locked program, relationships are clear. All operations on shared objects that occured before unlocking the lock are ordered prevous to all operations on objects in the next thread that occured after locking that lock. Operations on objects unrelated to locks are unrelated to each other. (Such objects cannot be shared by the threads in question because we just assumed proper locking.)

In a program which is not properly locked, all of this becomes very confusing. Depending on the design of the underlying hardware, a number of different bugs might appear, all of them non-deterministic. The current implementation of the ODB make all of these hardware-dependent bugs impossible.

Evaluating Expressions Interactively

If we relax our requirement on running the debugger on a "dead" program, we can do some more interesting things. These are not dependent on the ODB being an Omniscient debugger, any debugger could do the same.

Arbitrary method calls may be made at any time. You may revert the debugger to any time you wish and then evaluate a method call using the current values of the recorded objects. So if you revert to time 1234, when <Demo_0>.array elements 16 and 17 are 454 and 123 respectively, calling <Demo_0>.quick(16, 17) will sort those two. If you advance to time 1330, when 16 and 17 are already 123 and 454, then sorting them won't change anything.

In order to maintain some sense of consistancy, the methods will be executed in a different "timeline". Basically, the current state will be copied into a new timeline which is initialized with only one time stamp. The method will then be run (with recording turned on and off automatically), populating this new timeline. You can switch back and forth between the two timelines freely. Further method calls will clear the secondary timeline before running.

In the secondary timeline you may change the value of instance variables before running. If the ODB is displaying an array in the Objects Pane, the programmer can change any of those values before executing a method call. <Tab> will do an auto-complete on object and method names. <Return> will execute the command (running the method or setting the instance variable value).

Performance and Memory Requirements

Performance and the amount of memory required is a major issue. In an unoptimized, absolute worst-case scenario, the costs could be enormous. The estimated 10x slow-down combined with writing out upwards of 100MB/sec would be prohibitive for many programs.

A number of factors ameliorate this situation.

The first is clever optimization. Any computation which is provably deterministic (e.g., a loop that just manipulates non-shared variables and makes no non-determinstic method calls) need not be recorded completely. The starting point is sufficent. For a large class of programs, this will radically reduce their CPU time and memory requirements.

The second is judicious instrumentation. A huge percentage of any program runs "safe" code -- code that we trust and that we don't need to record. For example, many graphics programs spend over 90% of their time in the Swing libraries. For such situations, the costs can easily drop by 99%.

A good example of this is the following three loops:
for (int i=0; i<MAX; i++) sum+=array[i];
for (int i=0; i<MAX; i++) x=x*x+x;
for (int i=0; i<MAX; i++) s="Item_"+i;
The best optimized, instrumented version of the first runs about 30x slower than the uninstrumented code (700MHz G3, Apple's JDK 1.3). The second is only about 10x slower, and the third is less than 2x slower.

The third is temporal -- a great many bugs are directly linked to an external event (e.g., a button push) and recording need only be turned on when the button is about to be pushed. Once again, the amount of data recorded may be reduced to trivial numbers.

The fourth is also temporal. If the computation runs for a very long time, producing large numbers of events, it is possible to "garbage collect" old events. The ODB looks at the value of the MEMORY environment variable to determine how much memory to devote to time stamps. Currently it allows a maximum of MEMORY/200 time stamps. When that number is exceeded, the "garbage collector" kicks in and throws away some of the early time stamps.

Of course these are not really garbage. So the ODB attempts to retain the time stamps which are most likely to be interesting (object changes), while throwing away those that aren't (local variables and method calls). In the first mode, the ODB will throw away all time stamps for local variables and method calls that are older than 50% of the entries. If this is insufficent (less than 10% are collected), then the second mode kicks in and entries for objects are also collected.

In actual experience with the ODB, neither CPU overhead nor memory requirements have proven to be a major stumbling block. Debugging the debugger with itself is not a problem on a 110MHz SS4 with 128MB. On a 700MHz iBook it's a pleasure.

Implementation

There's not much of great interest in the implementation. It has proven to be a fine testbed for experimentation on debugging techniques, navigation, and display. It is not terribly clean nor is it terribly efficient. For a "professional" release, I would use a completely different design, now that I know what the issues are.

The ODB keeps a single array of time stamps. Each time stamp is a 32-bit int containing a thread index (8 bits), a source line index (20 bits), and a type index (4 bits). An event for changing an instance variable value requires three words: a time stamp, the variable being changed, and the new value. An event for a method call requires: a time stamp, and a "MethodLine" object containing: the object, the method name, the arguments, and the return value (or exception), along with a small pile of housekeeping variables. This adds up to about 10 words. A matching "ReturnLine" is also required.

Every variable has a "HistoryList" object associated with it which is just a list of time stamp/value pairs. When the ODB wants to know what the value of a variable was at time 102, it just grabs the value at the closest previous time. History lists for local variables and arguments hang off the MethodLine, those for instance variables hang off a "shadow" object which resides in a hashtable. Some clever games are played to make this all as efficent as possible.

Every time an event occurs, it locks the entire debugger, a new time stamp is generated and inserted into the list, and associated structures are built. Return values and exceptions are pasted into the appropriate MethodLines as they are generated. Some more clever games are played to allow uninstrumented code to manipulate instrumented objects (eg, arraycopy() and sort()) and allow us to track the changes.

The code insertion is very simple. The source byte code is scanned, and instrumentation code is inserted before every assignment and around every method call. Instrumenting methods larger than some size wiil exceed the 16-bit limit for class files. Currently that number is about 20,000 bytes. (I have only seen one such method "in the wild". There are also known solutions to this.)

A typical insertion looks like this:
 289 aload 4
 291 astore_1
 292 ldc_w #404 <String "Micro4:Micro4.java:64">
 295 ldc_w #416 <String "ie">
 298 aload_1
 299 aload_2
 300 invokestatic #147 <Method void change(String, String, Object, TraceLine)>
where the original code was lines 289 and 291, assigning a value to the local variable "ex". The instrumentation creates an event that records that on source line 64 of Micro.java (#292), the local variable "ie" (#295), which can be found on the traceline in register 2 (#299), was assigned the value in register 1 (#298). The other kinds of events are similar.

A "professional" version would allow different threads to use different lists of time stamps, synchronizing only at user-declared synchronization points. It would do more efficient code insertion. It would decouple display from collection. It would have selective instrumentation of non-instrumented code (so the programmer could see inside of Vectors, etc.) It would be integrated into an IDE infrastructure. It would allow programmer-created conditions and specialized display. It would be about 10x faster and use less space than the ODB. It would have prettier buttons.

Current Status

The ODB is an experimental, proof-of-concept implementation. It freely available for download from www.LambdaCS.com. A user's manual is included in the jar file along with a few aliases for UNIX and Windows. Instrumentation is normally done automatically and invisibly by a class loader. If you can run your program from the command line, you can run the ODB the same way.
% java  foo.bar.MyProgram
% debug foo.bar.MyProgram
Related Work There have been several previous attempts at doing something along these lines. Starting in 1969 the EXDAMS project (Balzer) was a TTY-based system which retained some amount of program state and allowed the programmer to examine it post-facto. It is difficult to assertain just how much it was able to do. It was clearly hindered by the lack of a display.

"ZStep 95" (Lieberman) is a visualization/debugging tool for Lisp which contained much of the same concept of retaining state and some of the "navigation" concepts. It has a significant emphasis on graphical representations of call graphs and allows the programmer to "play back" the execution of a program, video-style. It does not attempt to do a full-blown "debugger-style" display of data.

"Deja Vu" (Alpern) addresses the non-deterministic problem by recording synchronization information, but then assumes the traditional debugging methodology.

Leonardo (Crescenzi) is a C-based visualization tool under current development which also address some of the issues in debugging, but is not a debugger either.

There is a fairly large, active community working in "program visualization". They often do similar types of instrumentation, and a number of them mention the potential usefulness of this in debugging. Once again, none of them actually implement a full-fledbed debugger.