Manipulation of Genealogical Data
Nick Ryan
Computing Laboratory,
University of Kent at Canterbury.
The use of graphical representations of information, particularly
in the form of graphs or networks, is widespread throughout the social
sciences. Examples include the study of transportation networks and
the structure of human settlement by geographers (Cliff, 1979), of
many forms of social networks by sociologists and political scientists
(Roberts, 1979), in operational research (Avondo-Bodino, 1979), the
study of kinship by social anthropologists (Buchler, 1968), and of
stratigraphic relationships by archaeologists (Haigh, 1985; Ryan,
1985). Although each of these methods can be reduced to one of representation
and analysis of nodes and connecting links, the practitioners in the
various application fields are invariably accustomed to an established
form of diagrammatic representation which is peculiar to their discipline.
The particular application to be discussed here concerns anthropological
kinship data 1. It arose in response to a need expressed by Professor
John Davis for a system which would aid in the analysis of data concerning
the Zuwaya people of Libya, a recorded population, including deceased
ancestors, of over eleven thousand individuals. Such a data set was
amenable to a limited amount of statistical analysis using a standard
package, but no existing software was suitable for both browsing through
the data and examining alternative views of the structure of relationships.
The use of computers in the analysis of genealogies is not new. Over
the past 25 years a number of systems have been described which allow
kinship data to be sorted and searched for relationships between individuals.
Some of these have been capable of producing output in graphical
form although most have only produced tables of relationships. As
with the style of many computer applications in the social sciences
over this period, these systems have relied on batch processing.
Such a process may be repeated many times during the course of an
analysis with changes to the data or program parameters between each
run before the required results are obtained. In most cases the printed
output of the program will not be in a suitable form or of adequate
quality for publication.
The gtree program described here provides a highly interactive tool
for both thinking about and analysing genealogies, and is capable
of producing publication standard illustrations. The fundamental
characteristic which distinguishes the program from its antecedents
lies in the manner of interaction between user and data. The majority
of interactive software for social science users works in response
to commands typed on the keyboard. Commands may be selected from
a displayed menu, or the user is expected to memorise them. In practice,
most systems which employ the latter approach provide some form of
help command to obviate the need for frequent recourse to a printed
manual. In contrast, the user's interaction with gtree is largely
graphical, use of the keyboard being minimised.
TheGtree program described here provides a highly interactive tool
for both thinking about and analysing genealogies, and is capable
of producing publication standard illustrations. The fundamental
characteristic which distinguishes the program from its antecedents
lies in the manner of interaction between user and data. The majority
of interactive software for social science users works in response
to commands typed on the keyboard. Commands may be selected from
a display menu, or the user is expected to memorise them. In practice,
most systems which employ the latter approach provide some form of
help command to obviate the need for frequent recourse to a printed
manual. In contrast, the user's interaction with gtree is largely
graphical, use of the keyboard being minimised.
Gtree was developed on an ICL Perq computer under PNX 2.0, a version
of the UNIX operating system. This machine has an A4 sized portrait
format screen with a graphics resolution of 768 by 1024 pixels, together
with a digitising tablet and pointing device (referred to as the puck).
The puck and tablet may be used both for direct input of graphical
information in the form of x-y coordinates, and to move a cursor around
the display. Thus the puck may be used to point at and select from
displayed items. A facility is provided which allows the screen to
be subdivided into a number of separate windows or virtual display
screens. Each window may be regarded as a distinct display unit within
which separate concurrent processes may be run. Windows may be moved
around the screen and may overlap in a manner analogous with sheets
of paper on a desk top. This facility operates under the control
of a process known as a window manager. Gtree is normally run in
an environment designed for non-expert users which employs swim, a
Simple Window Manager developed by David Barnes (Barnes, 1985). Swim
provides the user with a straightforward means of manipulating windows,
whilst permitting the system designer to specify the extent of the
user's control over each window. Conventional window manager systems
impose a significant learning 'hump' on both experienced and inexperienced
computer users. The former, in particular, may find it difficult
to overcome their existing concepts of how a terminal device should
appear. Common problems include losing track of obscured windows,
forgetting to select the required window, and deleting windows before
processes have terminated. Swim was developed with the specific aim
of minimising such problems thus permitting the user to concentrate
on the application rather than on managing the hardware.
The mechanism of interaction used in gtree is closely linked to the
hardware capabilities of the Perq, but experience with other suitably
equipped machines has shown that portability is not a major problem.
The program is written in the C programming language and is highly
modular in design, and can thus be adapted to any machine which supports
this language and the appropriate hardware. Graphics production is
taken care of using the standard ÜNIX plot library. An early
version of the program was adapted for use on a DEC VAX-780 under
UNIX (BSD 4.2) using a BBC microcomputer as a graphics terminal.
In this case it was necessary to employ cursor keys in order to move
the cursor for making selections as no pointing device was available.
Despite considerable simplification of the display, this version
appeared significantly more difficult to use, largely as a result
of this perceptually less direct method of interaction.
The ability to point at and hence select any person displayed in a
diagram was to be used to retrieve any stored information concerning
that person, and a similar mechanism could be employed to ask questions
such as what is the relationship between these two people? In addition,
it would be necessary to provide a means of editing the diagrams produced
by the computer in order to produce results which would be suitable
for illustrative purposes.
The C programming language was used partly as a result of the writer's
personal preference amongst those available on the Perq, but also
because of the ease with which such data structures can be implemented.
The concept of structure (a group of variables which may be treated
together as a single variable) and of pointers which may be used to
access variables, provide appropriate means of representation of individuals
and relationships. Together, these allow a recursive definition of
genealogical data in terms of a network node corresponding to an individual
person (Fig.1).
Each individual or node is viewed as a set of basic information such
as name, sex, residence, occupation, etc., together with a number
of pointers to nodes containing the details of primary kin. The simple
model adequately describes the relationships between an individual
and parents, and may be easily extended to include both adoptive and
biological links, but problems arise with the representation of offspring,
siblings and spouses. At first it would appear necessary to provide
each node with a sufficient number of pointers to accommodate the
maximum possible number of each of these primary kin types. Clearly
such a solution would be enormously wasteful of storage space as most
nodes would have far fewer kin than these maxima.
The problem has been overcome by representing groups of each kin type
as a linear data structure in the form of a linked list. In this
way, a node with several offspring has a single pointer directed at
the first offspring. Each offspring has a single pointer to the next
sibling . Thus to access all of an individual's, the list is traversed
via the first offspring, visiting in turn each subsequent sibling.
The end of the list is signalled by a null or empty pointer. As
each node contains pointers to both parents, it is possible to check
which offspring belong to which spouse, and thus both illegitimate
offspring and those of multiple marriages may be accommodated.
Thus far the data structure is conventional and resembles that employed
by Wilson (Wilson, 1984) in programs that describe program and data
structures. marriages are treated somewhat differently. Each individual
may participate in several relationships, but these cannot be represented
by a simple linked list. In the case of the sibling relationship,
if A is related to B and B is related to C, then A is related to C.
Clearly this is not necessarily true of marriages. The early versions
of the program used four pointers in each node to allow for each individual
to have up to four spouses, the maximum found in the Zuwaya data.
The limitation was acceptable in the early stages of development
but was an unwelcome constraint which could stand in the way of wider
application of the program. A more general approach was needed.
The solution employed was to introduce a second data structure which
is used to hold details of a marriage (Fig.2). Again, pointers are
used to reference thenodes holding details of the individuals participating
in the marriage. A pointer to subsequent marriage is provided for
each participant allowing a linked list to be formed containing all
of an individual's marriages. Each marriage can therefore occur in
two such lists, one containing the male participant's marriages the
other, those of the female. Each node now needs only one marriage
pointer to indicate the head of such a list, through which all spouses
may be accessed.
When all data has been read and the network has been established,
control passes to the main part of the program. This begins by displaying
the population in lineage form, together with a menu of options, in
the main output window (Fig.4). The menu was prepared using med,
a general purpose menu editor with associated function library, developed
by John Bovey (Bovey, 1983). This allows the programmer to design
menus for the Perq display and provides a simple interface for applications
programs. Menu selections are made by pointing a cursor at an area
of the menu and pressing one of the puck buttons. The options provided
can be grouped under three headings: display, editing and miscellaneous
functions.
The puck may also be used to select an individual from the displayed
diagram. The selected individual is highlighted on the display and
stored information relating to the individual is displayed in a small
window at the bottom of the screen. Thus the diagram and pointing
device may be used as a means of accessing the database of information
stored on the population.
The display functions allow the user to select the manner in which
the data is to be viewed. DISPLAY ORIGINAL DIAGRAM may be used at
any time to return to the initial display and to cancel any modifications
which may have been made using the editing functions. This is achieved
by selecting the menu option. The remaining options all operate with
respect to a selected individual and produce a display of ascending
and/or descending kin, either in lineal or bilateral form. In each
case the menu option is selected first. The cursor, which changes
to indicate the selected function, is then used to select the individual
who will represent the ego in the resulting display.
The order of operations in which the menu option is selected before
the node on which the operation is to be performed appears to be counter-intuitive.
In general, the user can be expected to have selected an individual
and examined the stored information before deciding to request an
alternative display. It would therefore seem preferable to adopt
an approach in which there is always a 'currently selected individual'
upon which a menu function may operate as soon as it is selected 'individual'
upon which a menu function may operate as soon as it is selected.
How these different approaches may be viewed by the user remains
to be investigated, but it is clear that the chosen style should be
used consistently throughout the program. As with textual command
languages it would be a potential source of confusion to mix both
prefix and postfix notations in a graphical interface.
The diagrams are generated in two stages. First, the network is
traversed and the display position of each required node is calculated.
Each node which is to be included in a diagram is marked as displayed
and an index to displayed nodes is generated. This index is used
by the second part of the display process which generates the diagram
from information stored in the indexed nodes. The algorithm used
to traverse the network depends on the form of display required.
Descending lineage diagrams require a simple in-order traversal starting
from the root node and visiting nodes in the order offspring - root
- sibling. This is a conventional recursive process of visiting all
nodes of a binary tree. Lineal ascendants may be found by repeatedly
visiting the parent of the required sex until the head of the lineage,
or one who has no known parent, is found. Bilateral descent diagrams
are produced by traversing the network in the order offspring - root
- spouse - sibling. Bilateral ascendants are found by following the
route father - root - mother. In both cases, searching continues
until a user-defined search depth is reached or until a node which
has already been marked as displayed is found.
The second stage, which produces the display, is the same for all
types of diagram. Using the index generated in the first stage, symbols
are drawn at the calculated display position and ascent and descent
links are drawn. Marriage links are handled by a separate part of
the program which ensures that no such links are obscured by overlapping.
Marriage links are, of course, included automatically in all bilateral
diagrams, but are an option in the case of lineal diagrams.
Somewhat simpler would be to incorporate an algorithm which minimised
the number of crossing lines produced. This would probably be developed
from a planar graph drawing algorithm, although it would be important
to remember that a kinship diagram is not simply an arbitrary network
as it conventionally encodes relative temporal information in the
vertical location of nodes. Once such rules need to be accommodated
it becomes clear that this approach would, in effect, be a limited
implementation of the first solution, although the aim would be merely
to minimise the amount of editing which the user must perform in order
to arrive at an intelligible diagram.
By far the simplest approach to display generation would be to ignore
such problems and leave the user to sort out how best to display the
information starting from a logically correct, but aesthetically unpredictable,
diagram displayed by the program. The current version employs this
simple approach. In practice, it has proved relatively straightforward
to produce an intelligible diagram using the DELETE PERSON and MOVE
PERSON editing functions. However, it is sometimes necessary to
experiment with moving individuals to reduce crossing lines in order
to understand the diagram before a serious attempt to produce a final
diagram is made. With complex diagrams involving large numbers of
individuals this can be quite tedious. For this reason, a minimal
overlap algorithm should be added in a future version. The theory
this should provide an optimal balance between machine and human contribution
to the production of diagrams.
Deleting individuals is achieved by first selecting the menu option
and then moving the cursor to the required person. Pressing a puck
button results in the removal of the person from the display. Moving
individuals is similar except that selecting the person, the cursor
is moved to the required new position and the button pressed once
again. In addition to these functions for detailed editing, it is
also possible to remove all children and unmarried individuals from
a diagram. This provides a rapid means of simplification which is
frequently required in order to exclude those individuals who do not
participate in further relationships.
There are three remaining functions available from the menu. The
SAVE PICTURE option may be used to save a representation of the displayed
diagram as a UNIX plot(5) format file. This file may be viewed later
on the Perq screen using the plot (1) utility. It may also be displayed
on any other available graphics device such as a terminal or plotter.
An alternative method involves use of the splot utility developed
by Peter Simeon (Simeon, 1984), which allows part or all of the Perq
display to be reproduced on a Canon LBP10 laser printer. This method
was employed in producing the accompanying diagrams.
A
number of improvements to the program which will be incorporated into
future versions have already been mentioned in passing. One aspect
of the original design which has not yet been included is a means
of displaying the relationship between any two selected individuals.
This is in essence a form of the graph-theoretic problem of finding
the shortest route between two nodes of a graph for which a number
of solutions are available. It is intended to incorporate this function
in future versions of the program.
A major area for future development lies in increasing the general
applicability of the program. At present, the data input section
of the program is tailored specifically to the Zuwaya data with its'
inherent patrilineal form. The reminder of the program is, however,
free of such constraints and it is necessary to develop a means whereby
data may be accepted in other forms. Current thinking on this problem
revolves around a data dictionary approach by which a user may specify
the format and content of the data, leaving the program to extract
the information necessary for the construction of the internal model
of the population. Before this can be achieved, it will be necessary
to undertake a detailed analysis of the data collection methods of
the wide variety of potential users in order to define the variations
in data which must be accommodated.
The program as described is concerned solely with display and analysis
of networks of kin relationships. Although the facilities provided
can considerably reduce the tedium associated with producing intelligible
diagrams, it does not address a number of the major reasons behind
many anthropologists concern with kinship. Kinship in itself is not
central to many aspects of anthropological research, rather it is
seen as means of illustrating and explaining the interaction of many
other networks of social relations. A significant development of
the concepts employed in the present program would be to extend the
range of relationships and types of related objects so as to incorporate
these other networks. This might be achieved through the provision
of abstract data types and relations which could be defined according
to the user's requirements. In this way it would be possible, for
example, to examine household composition, land tenure and job specialisation
against the background of kinship relations. As many anthropological
observations include an historical dimension, the system should provide
a means of displaying the dynamic nature of these relationships as
well as more conventional static views.
At
present gtree represents a valuable tool which, given wider availability
of appropriate hardware, could reduce the burden of mechanical tasks
involved in analysing genealogical data. It is an example of a type
of software which goes beyond current perceptions of normal social
science computing, and illustrates several ways in which personal
workstations can significantly enhance the working environment of
social scientists, who, with the exception of geographers, have not
generally been seen as potential beneficiaries. Gtree provides an
example of the kind of material which might be used in an integrated
text and graphics production environment which can be provided on
such hardware. The ability to display different ideological and structural
views of data also suggests potential applications to teaching.
The extensions described above would result in a more powerful tool
which would find many applications not only in anthropology, but also
in related disciplines including social and economic history and sociology.
Until such hardware is more generally available to anthropologists
and other social scientists, it will however remain a useful testbed
for examining methods of interaction employing graphical representations
of data. The development of the proposed extensions would provide
significant contributions to the application of such methods and to
the study of human-computer interaction in general.
Barnes, D.J. 1985: A Controlled Window Management Environment for the Perq. Software Tools Centre, UKC..
Buchler, I. and H. Selby 1968: Kinship and Social Organisation: an introduction to theory and method.
Cliff, A.D., P. Haggett and J.K. Ord 1979: Graph Theory and Geography. In Applications of Graph Theory. (eds.) Wilson, R.J. and L.W. Beineke.
Hiagh, J.G.B. 1985: The Harris Matrix as a Partially Ordered Set. Computer Applications in Archaeology.
Roberts, F.S. 1979: Graph Theory and the Social Sciences. In Applications of Graph Theory. (eds.) Wilson, R.J. and L.W. Beineke
Ryan , N.S. 1985: Some Thoughts on an Archaeologist's Toolkit. Computer Applications in Archaeology.
Simeon, H.P. 1984: Splot - making plots of Perq windows on the Canon laser printer. Software Tools Centre, UKC.
Wilson, A.D. 1984: Programs to Process Trees, Representing Program Structures and Data Structures. Software - Practice and Experience 14, 807-816.