Debugging Backwards in Time
Abstract
Some of the difficulties involved with debugging a program
using traditional debugging techniques are examined and criticized,
then an alternative is introduced. By recording every state change in the run of a
program, it is possible to present the programmer everything that might
be desired. Essentially, it becomes possible to debug the program by going
"backwards in time," vastly simplifying the process of debugging. An implementation
of these ideas, the "Omniscient Debugger", is used to
demonstrate their viability. New techniques
for displaying information in the most intuitive fashion with the
least effort are presented. Finally performance issues are discussed
along with possible optimizations.
Introduction
Over the past forty years there has been change in way
program debuggers work. In 1961 a debugger called "DDT" existed on Digital machines which
allowed the programmer to examine, deposit, set break points, set trace
points, single step, etc. In 2002 the primary debuggers for Java allows one
to perform the identical functions with greater ease, but little more.
A brief survey of the major debuggers for Java: Java Workshop, Forte for
Java, JBuilder, Symantic Cafe, Visual Age for Java, and IDEA reveals no
major advances in debugging techniques since DDT. Moreover, many of them
are more difficult
to set up and use(!) Not one of these environments provide a debugger equal to
that of the Lisp environments of the late 70s, which at least had trace and
break packages that were easily manipulated, and included comprehensible print
strings for objects.
(This is not to imply that a great deal of very good work hasn't been done
in these products. Especially in the areas of remote debugging and multiple
language debugging, some very impressive accomplishments have been made. It's
just that no one has looked beyond the breakpoint paradigm.)
I believe that the reason for this is that the designers of debuggers have
all concentrated on answering one question: "What information can we provide
to the programmers while the program is running?" While this is not at all
unreasonable, it's not the right question.
The question they should have been asking is "What information will help the
programmer most?" The answers are not the same.
Notably, an informal survey (a show of hands in conferences, classes, etc.)
of several thousand programmers reveals that
roughly 90% of all programmers debug their programs strictly by use of
println()! This I take to be a horrendous comment on the state
of affairs. As much as I am about criticize tradtional debuggers below,
they are still vastly superior to this.
The Wrong Thing
Current debuggers concentrate on debugging running programs. This implies
that programs are to be debugged by guessing. They expect the programmer to
guess where the problem is and put a breakpoint there. If the guess is
wrong, the programmer needs to choose a different place for a breakpoint,
or just continue on to the next time the breakpoint is hit. For many
simple problems, this is sufficent. For many it is not.
Stopping at a breakpoint is expensive. Choosing the right places to set
breakpoints is awkward and error-prone. Once set, it takes time for the
program to stop and for the programmer to figure out where the program is.
Then the programmer has to figure out if the state of the system is still
consistant with his expectations (has it hit the bug yet?). Maybe he
has to type in the names of a few variables to examine. Using watchpoints,
he may follow a single object as it changes, but doing so is also very slow
and awkward. Even just pushing "continue" again and again is slow and
error-prone. For
something we may be doing thousands of times, every mouse movement and
every keystroke is expensive.
Because they are based on "guess and break," traditional debuggers above do not make
other techniques of debugging easy. One of the most effective debugging
techniques is to watch traces of method class, looking for unusual arguments
and return values. (This was the primary method used for debugging Lisp
programs twenty years ago, one which has never gained favor in the 'C'
world.)
This is unfortunate, because most of us really do think in terms of methods
as the primary unit of execution. Indeed, all of debugging could be
described as the search for a method which returned a "wrong" value. None
of the debuggers supply a good interface for tracing method calls. They make
it difficult to: turn on tracing, read trace lines and recognize values,
scan large numbers of trace lines, and match up return values with the method
call they came from. (Lisp was very good at all but the last.) Moreover,
in a multithreaded program, following trace output is generally impossible.
Non-deterministic bugs (multithreaded timing bugs, changing I/O packet
sizes, mouse movements, or just unknown parameters) drive programmers
crazy. None of these debuggers help with them at all. (Deja Vu (see Alpern
et. al.) from IBM does make an attempt at dealing with this by forcing
programs to re-execute in the same order.) How many times has a programmer
spent hours, carefully narrowing down a bug to one module, only to
push "continue" one too many times and end up back where he started.
The Right Thing
The most "obvious" way to debug a system is to put it into action and run it
through its paces. When you notice something not working as intended, you
should be able to stop the process and go backwards to figure out what went
wrong. In most realms, "going backwards" is fundamentally impossible and we
are forced to use other means to approximate it. In computing, this is not
the case. It is possible for us to record the exact state of the computation
at each and every instant and later "revert" the program to any previous
state we desire. This is what I call...
Omniscient Debugging
An omniscient debugger works by collecting information about every state
change (every variable assignment of any type) and every method call in
a program. After the program is finished, it brings up a debugger display
and allows the programmer to look at the state of the program at any
time desired. The programmer can select any variable and go "backwards
in time" to see where it was set, what all of its values were. It will
be possible to "single step" the program forwards or backwards, to step
to any method call, follow any exception throw, any context switch.
Maintaining State
I say I want to be able to "revert" the program to any previous
state. Effectively this means that every change in every accessible object
or local variable constitutes a separate state. We need to
record each assignment in each thread and define an ordering on them.
We'll start with a naive assumption that all such events are linearly ordered and we'll
assign a simple time stamp (an integer) to each.
Collecting this data is fairly simple and can be done in a variety of
fashions -- "hacking" the JVM, enhancing the compiler, inserting code
in the binaries. We'll look at a few examples below.
For the sake of
clarity, we are going to assume that the programmer will run the program
with recording turned on to collect data. After the program exits, we'll run
the debugger. All of the debugging we're about to describe will be done
strictly with this data. There will be no "live program" underneath.
Later we'll drop this requirement.
The time stamp will be the sole referent to program state. Anything the
programmer wishes to see will be related to a time stamp. Usually that time
stamp will not be directly evident. The programmer will select a trace
line to look at, switch to a different thread, move up on down the stack, or
even step through the code. He won't even notice the time stamp, but
everything will be based on it, and all display panes will be updated to
reflect it.
Performance and space requirements are a common concern, but not central
to our objectives. We wish to demonstrate that omniscient debugging is
an effective method of finding problems. If it isn't, then performance
is meaningless. If it is, then performance is just limit on the set of
programs we can apply it to. We'll look at performance in detail later.
Omniscient Debugging is Interesting
There are a number of things that make omniscient debugging worth the effort
of exploring. First and foremost, debugging is easier if you can go
backwards.
On top of this, it eliminates the worst problems with breakpoint debuggers:
no "guessing" where to put breakpoints, no "extra steps" to debugging, no
"Whoops, I went too far", no non-deterministic problems. (If the problem
occurred under the debugger, you're got it. It can't get away.)
It gives the programmer a unique view of the program. Being able to
see the traces of all the method calls for each of the threads is quite
simply interesting.
All the data is serializable. It can be analyzed remotely. It can be saved to a
file. Several programmers can work with the exact same debugging session.
Beta customers can email a debugging session to developers even if they
don't have the source.
As a matter of fact, with a good decompiler, a programmer should be able
to do a decent job of debugging unknown code to which he doesn't have
the source code!
The ODB
For the rest of this paper, we are going to discuss the issues of
Omniscient Debugging within the context of the "ODB", an implementation
of this concept in Java. It's important to realize that the general
concept of Omniscient Debugging is independent of the implementation
language and can equally well be done in C or C++.
The ODB is an implementation which collects information by instrumenting the
byte code of the target program as its loaded. It uses a single high-level
lock to order the events from different threads. It is a proof-of-concept
and serves as an upper bound for the cost of collection and display. It is
at least an order of magnitude too slow.
It is also a very, very effective debugger.
There are two major issues we need to address, presentation of
information and "navigation" -- how does the programmer get the
debugger to display the program state/time of interest? These are
both computer/human interface issues, and as such they are matters
of individual preference and opinion. There are very few "CHI"
issues which are agreed upon uniformly. None-the-less, I believe
that the choices I have made below are close to optimal, and certainly
equal to any I have seen in any other debugger.
Presentation
The only information a debugger should display is that which is part of the
changing state of the program. Anything else gets in the way of debugging.
We are concerned with the data that makes up the program state and whatever
can help the programmer understand that state: object states, method calls,
threads, the stacks, variable values, I/O streams, the window system, the
code, and any locks in use. All of this information is of immediate interest
and should be visable simultaniously with no programmer effort
required.
Information such as the compiler used, the current JVM, the graphical
construction kit, the list of individual files, the IDE's "project name",
etc. is not of immediate interest and should not use up any valuable screen
real estate.
The exact layout and appearence of these items is open to debate. The ODB
makes reasonable choices. Others are too. The debugger is going to be used
to display the state of the program at different times and all of the
data displays should be updated simultaniously in a consistant fashion.
In particular, the programmer should never be in doubt about what a
display is showing.
We are going to use the time stamps to reference program state. Each and
every object, variable, I/O stream, etc. will have a known value at each
time stamp. When we change the currently displayed time stamp (we will hence
forth refer to this as "reverting the debugger" to a given time) all of the
data displays will be updated to reflect the values at that time. The line
of code which caused the state change will be highlighted, the method trace
that it happened in too. The current thread, the current stack, the variable
values, etc. will all reflect the selected time stamp.
Things that don't change between the previously selected time stamp and
the current one, shouldn't change either. They shouldn't change their
appearence, their position, or their highlighting. Things that do change
should do so obviously. New variable values should be marked, code lines
should be highlighted, etc. The programmer should never wonder "did this
change?"
Recognizing State
First we must be able to recognize state. This basically means the
programmer should be able to look at an object and know which object it is
and what values its instance variables have. In existing debuggers this is
highly problematic. Many objects are displayed using their toString()
methods and show up as [I@80cf1
or javax.swing.JPanel[,0,0,invalid,...].
These are not very useful. The first is sufficent in the sense that it
does supply a unique identifier for the object, but it is very difficult
to recognize. The second gives information about the contents of the object,
but it's impossible to know if two print strings represent the same
object or just two objects with the same contents. Moreover, an object
might have different print strings at different times!
We need reasonable print strings for objects. We want strings that
tell us what kind of object it is, some kind of unique identifier (so we
know if we see it again), and whatever brief internal data (which can't
change!) the programmer
thinks most useful. We also want the print string to be short enough to fit
several on an 80 column display, say a max of 20 characters.
The toString() method should do this. Unfortunately, some
methods usurp the toString() methods for their own purposes.
Still, the ODB needs such a print
string, so it supplies its own format showing the class name and an index
number: <MyObject_75>, chopping off the package.
Class objects are displayed as just the class name (e.g.,
Person ). Boolean values, numbers, and characters are displayed
as expected: true, false, 123,
45.678, 'X' (88). Bytes are also displayed like
characters with both the ASCII character and the decimal value in
parenthesis.
This representation can be easily displayed in a Jlist, or
stored in an editor. It can be printed out and it can even be typed back in to the
debugger at a later date to reference that object. The precise format of the
print string (and if it's even a string as opposed to some more elaborate
glyph) is open to debate. It just has to fulfill the requirements above.
The contents of objects will always be displayed with their values current
to the selected time stamp. The ODB uses this indented structure, although
others would quiet reasonable:
<Person_2 Bob>
name "Bob"
age 23
home <Address_3>
friend <Person_3 Jim>
int[3]_2
0 33
1 66
2 77
<Person_2 Bob> (previous to Bob's creation)
name --
age --
home --
friend --
There are a few cases where a different print format is more "intutive"
for the programmer. Threads are the one object type that often have
a reasonable, programmer-supplied name, so we'll use that name along
with our brackets for consistancy: <Consumer-15>. If
the programmer supplies a number in the name, we'll assume it's unique,
otherwise we'll add a unique number ourselves.
Threads have a dual nature, they are both normal objects with instance
variables and they are "engines of computation". When displayed as "engines"
they will include an indication of their status (running, blocked,
not-yet-created, etc.). Hence, when displayed as an object, a blocked thread
with one instance variable will look like this:
<Consumer_5>
_blockedOn <Person_5 Jan>
language "French"
and when displayed in the "Threads Pane" as an engine of execution, like
this:
<Consumer_5> <Person_5 Jan>
-- <Consumer_6> --
<Consumer_7>
<Consumer_8> Dead
where 5 is blocked on Jan (we don't know if it's blocked waiting for the
lock to become availble or if it's in a call to wait()),
6 has not yet been created, 7 is running, and 8 has already exited.
The collection classes constitue another special case. Our objective
is to display the data in the most comprehensible fashion, not to adhear
to any particular format. In reality, a Vector comprises an array of
type Object, an internal index, and some additional bookkeeping variables
(capacityIncrement, modCount, etc.) which are not of any interest to
anyone other than the implementor. So the ODB displays a Vector in a
format consistant with the way it's used:
<Vector_23>
6 --
5 --
4 --
3 <Person_15 Epp>
2 <Person_12 Lovi>
1 <Person_1 Tarvi>
0 <Person_1 Tarvi>
where the maximum number of entries in the Vector was seven, and at the
current moment there are only four entries. The point is that when the
Vector is displayed at different times, it will fill up or empty out,
but it will not otherwise change its appearence. In a similar fashion,
a Hashtable mapping people to their position in a baseball game
will look like this:
<Hashtable_3>
<Person_15 Epp> <Position_2 LeftField>
<Person_12 Lovi> <Position_3 RightField>
<Person_1 Tarvi> <Position_4 FirstBase>
<Person_2 Tiia> <Position_6 SecondBase>
Large objects are problematic and remain an open issue. What is the best
way to display a 1000x1000x1000 array? Or even just a 10000 element array?
Or a large bitmap? Happily, there are very few such objects and they are
invariably arrays or large sets of instance variables of identical type.
Displaying Method Traces
In the halcyon days of Lisp, breakpoints in the traditional sense were
unknown. You could break at function entry, but not at an individual
line. Instead, the most common method of discerning a program's
operation was through function tracing. It was a simple and straightforward
request which took little time or effort. It was also generally
quite effective. Tracing has never gained the same respect in other
languages. We reintroduce tracing here as the primary method
of identifying where in a program run one is.
Every method that is recorded will be displayed in a "Method Trace"
pane. The format of the trace line will be object, method name, arguments,
and return value. Each line will be indented according to depth and a matching "return line"
will be displayed as method, return value. Return lines which directly
follow their trace line will be elieded. Hence a method which prints
out siblings and returns the number printed would produce traces like this:
<Person_5 Jan>.printSiblings() -> 3
<PrintStream_0>.println("Sue") -> void
<PrintStream_0>.println("Sid") -> void
<PrintStream_0>.println("Sam") -> void
printSiblings -> 3
In multithreaded programs, each thread will have its own set of trace lines
which will only be displayed when the debugger is reverted to a time stamp
in that thread. Experience indicates that people think of threads as independent
entities and that they just don't want to see different thread traces together.
When the debugger is reverted to a time stamp, the nearest previous
trace line will be highlighted. This can be a little unclear when the highlighted
line is a return and the actual time stamp is somewhere later, in the calling
method. But we haven't found a better way to do it.
To avoid overwhelming the programmer with excessive, uninteresting data, it
is possible to filter the trace window to show only those methods calls
which the programmer finds interesting. Typically this will mean filtering
out lines showing calls to certain classes, packages, specific methods, etc.
Although interesting in itself, the primary purpose of the trace window is
not for debugging directly, but rather as a convenient index into the time
sequence of the program. In other words, the programmer is going to use the
trace to find interesting points in the program and then revert to the
program state when that method call was executed.
The ODB Main Debugger Window
Now that we have the components and the print string formats, let's see
how the ODB fits them together. Below we see the main debugger window.
All of the components mentioned above have a pane, and all are visible
simlutaniously. Indeed, the only other windows the ODB uses are popup
dialogues for selection.
All of the panes show their respective information updated to the currently
select time. Changing the current time by selecting a line in one of
the panes, pushing a button, or selecting from a popup will cause the
entire debugger to be updated to the new time.
The Display Panes
The ODB attempts to show all the data of interest on the screen at
the same time. Popup dialogues are only used for finding files, which
is rarely done. The panes are all updated to display the same time
stamp, save for the exception noted in the Stack Pane.
Threads Pane
Threads are displayed in the format <Thread-3>, where "Thread-3" is
whatever is returned by getName(). If the currently selected time is
previous to the creation of the thread (before the thread's first time
stamp), it is shown as -- <Thread-3> --, and as <Thread-3> Dead if after
the thread has exited (after the last time stamp for a thread whose
isAlive() method returns false). If the thread is blocked, waiting for a
lock or in a call to wait() or join(), it is shown as <Thread-10> DemoWait.
Adding the thread to the objects pane will show this as a
pseudo-instance variable _blockedOn.
Stack Pane
All methods on the stack are displayed here. Single-clicking on a line will
update the Code pane, the arguments, and the local variables for that stack
pane. The current time stamp will not be changed. Pushing any button will
act on the actual time, not what is displayed in the code pane. (It's not
clear what is most "intuitive" here.) The menu command Code->Goto Stack Frame
will revert to that time stamp.
Locals Pane
All arguments and local variables will be displayed here. If
the program has been compiled without the -g flag, local variable names
will displayed as LocalVariable_1, etc. The Sun compiler never records
the proper names for arguments, instead it calls them arg0, arg1, etc.
More problematic is that many compilers will optimize out local variables
entirely, essentially reusing one variable for another . Hence a variable
"i", which is only used in a loop, could be reused for another variable
"average" which doesn't overlap with it, as below.
for (int i = 0; index < MAX; i++) {sum+=array[i];}
average = sum/MAX;
Also, if there are two local mutually exclusive variable bindings with
the same name, the ODB will only show one, but with the values for both.
for (int i = 0; index1 < MAX1; i++) {sum1+=array1[i];}
for (int i = 0; index2 < MAX2; i++) {sum2+=array2[i];}
'this' Pane
This pane displays the current 'this' object (or the class object for
static methods).
Trace Pane
For each thread, this shows the trace of all method calls
that were made. Only three arguments are shown. Only five are recorded.
Code Pane
This displays the current line of code. The buttons all move
forwards or backwards starting from this line. The two exceptions are:
when a higher stack frame has been selected (see Stack Pane) and when a
time stamp occurs in a source file that can't be found, in which case the code pane
will be empty.
I/O Pane
This shows the all the lines written by System.out.println() (Obviously,
only calls from instrumented code.) If the line is preceded by "--", that
indicates that the current time is previous to the time that line was
written. (Eventually this pane will show calls to print(), write(), etc.
going to any I/O stream.)
Objects Pane
This shows whatever objects the user has copied here via double-clicking on
objects in other windows. Double-clicking on an instance variable will add
that object to the pane. Double-clicking on an object in this pane will
"close up" that object, hiding all its instance variables. Double-clicking
a second time will open it back up. Individual instance variables may also
be removed selectively (see Objects->Remove and Objects->Retain Only).
Double-clicking on an instance variable will expand the value
of that instance variable in-place (like the array int[20]_0 in
figure 1). If the value is a primitive or is an object with no instance
variables, nothing will happen. Pushing the Previous Value button, etc.
will advance to the next value of the instance variable, not the
next value of the instance variable's instance variable. For example, the
prevous value of array in figure 1 is null, so
the array will disappear if you go there. If you select an
instance variable's instance variable, the commands will work on that, which
could cause it to disappear if the instance variable also changed value.
(It's a little confusing to explain. Just try it.)
Values which have not yet been
set (e.g., instance variables of objects before they have been created) will
be displayed as "--". Variables which changed values from the previously
selected time stamp will be displayed with a leading "*".
Double-clicking on an
instance variable's instance variable will copy that an
instance variable to the top level. An instance variable's value may also be
copied to the top level via Objects->Add Instance Variable.
The Objects menu allows you to display class object of the selected object (and the static
variables thereof) Objects->Add Class.
Windows Pane
This is the one major unimplemented portion of the ODB. The intention is to display
the windows created by the application and allow the programmer to observe
changes as different times are selected, and to allow the programmer to
navigate based on changes to the window.
Minibuffer
Modeled after EMACS, extended commands (such as Evaluate Expression or
Search) are displayed here. You may do an incremental text search through
the trace pane. <Ctrl-S> will start the search and give focus to the
minibuffer . Typing characters will extend the search string. Another
<Ctrl-S> will advance to the next match. <Ctrl-R> will reverse the search.
<Return> will end the search and revert the debugger to the time of the
current match. (The debugger will not revert otherwise.) <Ctrl-G> will abort any command.
Navigation
The primary modes of navigation are:
o Selecting a line in the trace window, which will revert the debugger to
the time of the first time stamp recorded in that method. If that method
was not instrumented, then the debugger will revert to the line it was
called on.
You may request the debugger to always to the calling line by turning
Trace->Go to first line in method off.
o Selecting a line in the code window, which will revert the debugger to a
time stamp on that line (see below).
o Pushing one of the buttons, which will revert the debugger to a time stamp
depending on the currently selected object or line in that pane. (See Navigation
Buttons below.)
o Selecting a line in the Stack pane, which will not change the current time
stamp, but will show you where methods in the stack were called from
(reverting the code pane) and the values of the local variables. (See Stack Pane.)
o Selecting a thread, which will revert to the nearest time stamp in that
thread, earlier than the current time preferred. (All instrumentation methods
are synchronized, so time stamps are strictly sequential.)
o Selecting a line in the I/O window, which will revert to the time when the
newly selected line was printed.
The choice of which time stamp on a line in the code pane to select may be
controlled by a set of options in the Code menu. The normal direction is to
go backwards in time if the newly selected line is above the previously
selected line, and forwards if not. Going forwards, you will select the
first time stamp (of a contiguous set, e.g., call/return) on a line; going
backwards, you will select the last.
There are a number of other possible behaviors that might be
desired, such as to allow a time stamp in any thread, to allow a time
stamp outside of the current stack frame.
These will probably be supported by adding modifier keys
to the selection (e.g., <Shift-click>, <Ctrl-click>).
Navigation Buttons
The buttons work as uniformly as possible for the different panes. The four
most common buttons revert to the first, previous, next, and last time for
selected object in that pane. In the Threads Pane, these execute first/last
time stamp for the thread, but previous/next context switch. In the Code
Pane, they are first/last line in current method, and "step into forward"
and "step out of backwards". The other buttons execute "step over", "return from",
and "next iteration on current line".
You may bring up a list of all the values a variable ever had and
select one of them. The commands Object->Select IV Value selects a value
for an instance variable, and Object->Select Local Value
selects a value from a local variable. The ODB will revert to the time
when that variable first took that value.
Locks and Wait Sets
When a thread blocks, waiting for a lock or calls wait() , the object on which it
blocks will be displayed next to it in the Threads Pane (e.g.,
<Thread-7>
is blocked on the class object DemoWait
below). A pseudo-instance variable _blockedOn
will be added to the thread and shown in the object pane if the thread is
displayed there. Three pseudo-instance variable ( _waiters, _owner,
_sleepers) will be added to the object. Sleepers are those threads
which are sleeping, waiting to get the object's lock. Waiters are those
that called wait().
If a thread blocks in
uninstrumented code, the ODB will not notice.
Inter-Thread Relationships
In an uninstrumented program, operations which do not affect shared data have
no relationship at all to operations in other threads. Only operations on
shared objects (i.e., changing a variable value or I/O) have relationships.
In a properly locked program, relationships are clear. All operations on
shared objects that occured before unlocking the lock are ordered prevous
to all operations on objects in the next thread that occured after locking
that lock. Operations on objects unrelated to locks are unrelated to each
other. (Such objects cannot be shared by the threads in question because
we just assumed proper locking.)
In a program which is not properly locked, all of this becomes very confusing.
Depending on the design of the underlying hardware, a number of different
bugs might appear, all of them non-deterministic. The current implementation of the
ODB make all of these hardware-dependent bugs impossible.
Evaluating Expressions Interactively
If we relax our requirement on running the debugger on a "dead" program,
we can do some more interesting things. These are not dependent on the
ODB being an Omniscient debugger, any debugger could do the same.
Arbitrary method calls may be made at any time. You may revert the debugger
to any time you wish and then evaluate a method call using the current
values of the recorded objects. So if you revert to time 1234, when
<Demo_0>.array elements 16 and 17 are 454 and 123 respectively, calling
<Demo_0>.quick(16, 17) will sort those two. If you advance to time 1330,
when 16 and 17 are already 123 and 454, then sorting them won't change
anything.
In order to maintain some sense of consistancy, the methods will be executed
in a different "timeline". Basically, the current state will be copied into
a new timeline which is initialized with only one time stamp. The method
will then be run (with recording turned on and off automatically),
populating this new timeline. You can switch back and forth between the two
timelines freely. Further method calls will clear the secondary timeline
before running.
In the secondary timeline you may change the value of instance variables
before running. If the ODB is displaying an array in the Objects Pane,
the programmer can change any of those values before executing a method
call.
<Tab> will do an auto-complete on object and method
names. <Return> will execute the command (running the method or setting the
instance variable value).
Performance and Memory Requirements
Performance and the amount of memory required is a major issue. In an
unoptimized, absolute worst-case scenario, the costs could be enormous. The
estimated 10x slow-down combined with writing out upwards of 100MB/sec would
be prohibitive for many programs.
A number of factors ameliorate this situation.
The first is clever optimization. Any computation which is provably
deterministic (e.g., a loop that just manipulates non-shared variables and
makes no non-determinstic method calls) need not be recorded completely. The starting point
is sufficent. For a large class of programs, this will radically reduce
their CPU time and memory requirements.
The second is judicious instrumentation. A huge percentage of any program
runs "safe" code -- code that we trust and that we don't need to record. For
example, many graphics programs spend over 90% of their time in the Swing
libraries. For such situations, the costs can easily drop by 99%.
A good example of this is the following three loops:
for (int i=0; i<MAX; i++) sum+=array[i];
for (int i=0; i<MAX; i++) x=x*x+x;
for (int i=0; i<MAX; i++) s="Item_"+i;
The best optimized, instrumented version of the first runs about 30x slower
than the uninstrumented code (700MHz G3, Apple's JDK 1.3).
The second is only about 10x slower, and the third is less than 2x slower.
The third is temporal -- a great many bugs are directly linked to an external event
(e.g., a button push) and recording need only be turned on when the button
is about to be pushed. Once again, the amount of data recorded may be reduced
to trivial numbers.
The fourth is also temporal. If the computation runs for a very long time,
producing large numbers of events, it is possible to "garbage collect" old
events.
The ODB looks at the value of the MEMORY environment variable to determine
how much memory to devote to time stamps. Currently it allows a maximum
of MEMORY/200 time stamps. When that number is exceeded, the
"garbage collector" kicks in and throws away some of the early time
stamps.
Of course these are not really garbage. So the ODB attempts to retain the
time stamps which are most likely to be interesting (object changes), while throwing away
those that aren't (local variables and method calls).
In the first mode, the ODB will throw away all time stamps
for local variables and method calls that are older than 50% of the entries.
If this is insufficent (less than 10% are collected), then the second mode
kicks in and entries for objects are also collected.
In actual experience with the ODB, neither CPU overhead nor memory requirements
have proven to be a major stumbling block. Debugging the debugger with itself is
not a problem on a 110MHz SS4 with 128MB. On a 700MHz iBook it's a pleasure.
Implementation
There's not much of great interest in the implementation. It has proven to
be a fine testbed for experimentation on debugging techniques, navigation,
and display. It is not terribly clean nor is it terribly efficient. For a
"professional" release, I would use a completely different design, now that
I know what the issues are.
The ODB keeps a single array of time stamps. Each time stamp is a 32-bit int
containing a thread index (8 bits), a source line index (20 bits), and a
type index (4 bits). An event for changing an instance variable value
requires three words: a time stamp, the variable being changed, and the new
value. An event for a method call requires: a time stamp, and a "MethodLine"
object containing: the object, the method name, the arguments, and the
return value (or exception), along with a small pile of housekeeping
variables. This adds up to about 10 words. A matching "ReturnLine" is also
required.
Every variable has a "HistoryList" object associated with it which is just a
list of time stamp/value pairs. When the ODB wants to know what the value of
a variable was at time 102, it just grabs the value at the closest previous
time. History lists for local variables and arguments hang off the
MethodLine, those for instance variables hang off a "shadow" object which
resides in a hashtable. Some clever games are played to make this all as
efficent as possible.
Every time an event occurs, it locks the entire debugger, a new time stamp
is generated and inserted into the list, and associated structures are
built. Return values and exceptions are pasted into the appropriate
MethodLines as they are generated. Some more clever games are played to allow
uninstrumented code to manipulate instrumented objects (eg, arraycopy() and
sort()) and allow us to track the changes.
The code insertion is very simple. The source byte code is scanned, and
instrumentation code is inserted before every assignment and around every
method call. Instrumenting methods larger than some size wiil exceed the
16-bit limit for class files. Currently that number is about 20,000 bytes.
(I have only seen one such method "in the wild". There are also known
solutions to this.)
A typical insertion looks like this:
289 aload 4
291 astore_1
292 ldc_w #404 <String "Micro4:Micro4.java:64">
295 ldc_w #416 <String "ie">
298 aload_1
299 aload_2
300 invokestatic #147 <Method void change(String, String, Object, TraceLine)>
where the original code was lines 289 and 291, assigning a value to the
local variable "ex". The instrumentation creates an event that records that on
source line 64 of Micro.java (#292), the local variable "ie" (#295), which can be found on
the traceline in register 2 (#299), was assigned the value in register 1 (#298).
The other kinds of events are similar.
A "professional" version would allow different threads to use different
lists of time stamps, synchronizing only at user-declared synchronization
points. It would do more efficient code insertion. It would decouple display
from collection. It would have selective instrumentation of
non-instrumented code (so the programmer could see inside of Vectors, etc.)
It would be integrated into an IDE infrastructure. It would allow
programmer-created conditions and specialized display. It would be about 10x
faster and use less space than the ODB. It would have prettier buttons.
Current Status
The ODB is an experimental, proof-of-concept implementation. It freely
available for download from www.LambdaCS.com. A user's manual is included
in the jar file along with a few aliases for UNIX and Windows. Instrumentation
is normally done automatically and invisibly by a class loader. If you can
run your program from the command line, you can run the ODB the same way.
% java foo.bar.MyProgram
% debug foo.bar.MyProgram
Related Work
There have been several previous attempts at doing something along these
lines. Starting in 1969 the EXDAMS project (Balzer) was a TTY-based
system which retained some amount of program state and allowed the
programmer to examine it post-facto. It is difficult to assertain
just how much it was able to do. It was clearly hindered by the lack
of a display.
"ZStep 95" (Lieberman) is a visualization/debugging tool for Lisp which
contained much of the same concept of retaining state and some
of the "navigation" concepts. It has a significant emphasis on graphical
representations of call graphs and allows the programmer to "play back"
the execution of a program, video-style. It does not attempt to do a full-blown
"debugger-style" display of data.
"Deja Vu" (Alpern)
addresses the non-deterministic problem by recording synchronization
information, but then assumes the traditional debugging methodology.
Leonardo (Crescenzi) is a C-based visualization tool under current development
which also address some of the issues in debugging, but is not a
debugger either.
There is a fairly large, active community working in "program visualization". They
often do similar types of instrumentation, and a number of them mention the potential
usefulness of this in debugging. Once again, none of them actually implement a
full-fledbed debugger.