GTREE: A System for the Interactive Display and
Manipulation of Genealogical Data
Nick Ryan
Computing Laboratory,
University of Kent at Canterbury.
BICA Issue No. 3: July 1985
1. Introduction
The use of graphical representations of information, particularly in the form of graphs or networks, is widespread throughout the social sciences. Examples include the study of transportation networks and the structure of human settlement by geographers (Cliff, 1979), of many forms of social networks by sociologists and political scientists (Roberts, 1979), in operational research (Avondo-Bodino, 1979), the study of kinship by social anthropologists (Buchler, 1968), and of stratigraphic relationships by archaeologists (Haigh, 1985; Ryan, 1985). Although each of these methods can be reduced to one of representation and analysis of nodes and connecting links, the practitioners in the various application fields are invariably accustomed to an established form of diagrammatic representation which is peculiar to their discipline.
The particular application to be discussed here concerns anthropological kinship data 1. It arose in response to a need expressed by Professor John Davis for a system which would aid in the analysis of data concerning the Zuwaya people of Libya, a recorded population, including deceased ancestors, of over eleven thousand individuals. Such a data set was amenable to a limited amount of statistical analysis using a standard package, but no existing software was suitable for both browsing through the data and examining alternative views of the structure of relationships.
The use of computers in the analysis of genealogies is not new. Over the past 25 years a number of systems have been described which allow kinship data to be sorted and searched for relationships between individuals. Some of these have been capable of producing output in graphical form although most have only produced tables of relationships. As with the style of many computer applications in the social sciences over this period, these systems have relied on batch processing. Such a process may be repeated many times during the course of an analysis with changes to the data or program parameters between each run before the required results are obtained. In most cases the printed output of the program will not be in a suitable form or of adequate quality for publication.
The gtree program described here provides a highly interactive tool for both thinking about and analysing genealogies, and is capable of producing publication standard illustrations. The fundamental characteristic which distinguishes the program from its antecedents lies in the manner of interaction between user and data. The majority of interactive software for social science users works in response to commands typed on the keyboard. Commands may be selected from a displayed menu, or the user is expected to memorise them. In practice, most systems which employ the latter approach provide some form of help command to obviate the need for frequent recourse to a printed manual. In contrast, the user's interaction with gtree is largely graphical, use of the keyboard being minimised.
TheGtree program described here provides a highly interactive tool for both thinking about and analysing genealogies, and is capable of producing publication standard illustrations. The fundamental characteristic which distinguishes the program from its antecedents lies in the manner of interaction between user and data. The majority of interactive software for social science users works in response to commands typed on the keyboard. Commands may be selected from a display menu, or the user is expected to memorise them. In practice, most systems which employ the latter approach provide some form of help command to obviate the need for frequent recourse to a printed manual. In contrast, the user's interaction with gtree is largely graphical, use of the keyboard being minimised.
Gtree was developed on an ICL Perq computer under PNX 2.0, a version of the UNIX operating system. This machine has an A4 sized portrait format screen with a graphics resolution of 768 by 1024 pixels, together with a digitising tablet and pointing device (referred to as the puck). The puck and tablet may be used both for direct input of graphical information in the form of x-y coordinates, and to move a cursor around the display. Thus the puck may be used to point at and select from displayed items. A facility is provided which allows the screen to be subdivided into a number of separate windows or virtual display screens. Each window may be regarded as a distinct display unit within which separate concurrent processes may be run. Windows may be moved around the screen and may overlap in a manner analogous with sheets of paper on a desk top. This facility operates under the control of a process known as a window manager. Gtree is normally run in an environment designed for non-expert users which employs swim, a Simple Window Manager developed by David Barnes (Barnes, 1985). Swim provides the user with a straightforward means of manipulating windows, whilst permitting the system designer to specify the extent of the user's control over each window. Conventional window manager systems impose a significant learning 'hump' on both experienced and inexperienced computer users. The former, in particular, may find it difficult to overcome their existing concepts of how a terminal device should appear. Common problems include losing track of obscured windows, forgetting to select the required window, and deleting windows before processes have terminated. Swim was developed with the specific aim of minimising such problems thus permitting the user to concentrate on the application rather than on managing the hardware.
The mechanism of interaction used in gtree is closely linked to the hardware capabilities of the Perq, but experience with other suitably equipped machines has shown that portability is not a major problem. The program is written in the C programming language and is highly modular in design, and can thus be adapted to any machine which supports this language and the appropriate hardware. Graphics production is taken care of using the standard ÜNIX plot library. An early version of the program was adapted for use on a DEC VAX-780 under UNIX (BSD 4.2) using a BBC microcomputer as a graphics terminal. In this case it was necessary to employ cursor keys in order to move the cursor for making selections as no pointing device was available. Despite considerable simplification of the display, this version appeared significantly more difficult to use, largely as a result of this perceptually less direct method of interaction.
2. Development of the program
The initial specification of the program called for a tool which would accept data specifying details of a large number of people including their position within the patrilineal descent system and details of known marriages. Output was to be in the form of kinship diagrams of the type that social anthropologists often draw by hand. The crucial function of the computer was to be in re-arranging the data so that it could be displayed not only in a patrilineal form, but also in matrilineal or bilateral forms. It was, therefore, to be used to perform what amounts to ideological transformations on the original data, and thus to free the anthropologist from constraints imposed by their informants and data collection methods. These tasks are extremely tedious to perform by hand and are far too time consuming to attempt on populations of several thousands of people.
The ability to point at and hence select any person displayed in a diagram was to be used to retrieve any stored information concerning that person, and a similar mechanism could be employed to ask questions such as what is the relationship between these two people? In addition, it would be necessary to provide a means of editing the diagrams produced by the computer in order to produce results which would be suitable for illustrative purposes.
2.1. Genealogical Data Structure
Although gtree was developed to deal with data organised in patrilineal form, it was essential to the general applicability of the program that this should not constrain the overall design. It was therefore necessary to adopt data structures which were equally appropriate to alternative matrilineal and bilateral views of the data. Simple genealogical examples are frequently employed by writers of computing texts as examples of binary tree structures. In practice, however, genealogical structures are considerably more complex than some of these writers would have us believe, and must be modelled not as trees, but as networks of individuals linked by relationships. Even the apparently simple binary tree formed by the ancestors of an individual frequently contains a number of duplicated nodes (consider, for example, the effect of cousin marriage).
The C programming language was used partly as a result of the writer's personal preference amongst those available on the Perq, but also because of the ease with which such data structures can be implemented. The concept of structure (a group of variables which may be treated together as a single variable) and of pointers which may be used to access variables, provide appropriate means of representation of individuals and relationships. Together, these allow a recursive definition of genealogical data in terms of a network node corresponding to an individual person (Fig.1).
Each individual or node is viewed as a set of basic information such as name, sex, residence, occupation, etc., together with a number of pointers to nodes containing the details of primary kin. The simple model adequately describes the relationships between an individual and parents, and may be easily extended to include both adoptive and biological links, but problems arise with the representation of offspring, siblings and spouses. At first it would appear necessary to provide each node with a sufficient number of pointers to accommodate the maximum possible number of each of these primary kin types. Clearly such a solution would be enormously wasteful of storage space as most nodes would have far fewer kin than these maxima.
The problem has been overcome by representing groups of each kin type as a linear data structure in the form of a linked list. In this way, a node with several offspring has a single pointer directed at the first offspring. Each offspring has a single pointer to the next sibling . Thus to access all of an individual's, the list is traversed via the first offspring, visiting in turn each subsequent sibling. The end of the list is signalled by a null or empty pointer. As each node contains pointers to both parents, it is possible to check which offspring belong to which spouse, and thus both illegitimate offspring and those of multiple marriages may be accommodated.
Thus far the data structure is conventional and resembles that employed by Wilson (Wilson, 1984) in programs that describe program and data structures. marriages are treated somewhat differently. Each individual may participate in several relationships, but these cannot be represented by a simple linked list. In the case of the sibling relationship, if A is related to B and B is related to C, then A is related to C. Clearly this is not necessarily true of marriages. The early versions of the program used four pointers in each node to allow for each individual to have up to four spouses, the maximum found in the Zuwaya data. The limitation was acceptable in the early stages of development but was an unwelcome constraint which could stand in the way of wider application of the program. A more general approach was needed.
The solution employed was to introduce a second data structure which is used to hold details of a marriage (Fig.2). Again, pointers are used to reference thenodes holding details of the individuals participating in the marriage. A pointer to subsequent marriage is provided for each participant allowing a linked list to be formed containing all of an individual's marriages. Each marriage can therefore occur in two such lists, one containing the male participant's marriages the other, those of the female. Each node now needs only one marriage pointer to indicate the head of such a list, through which all spouses may be accessed.
2.2 Structure of the program
The basic structure of the program is shown in Fig.3. After initialisation, the main data is read and a node is created for each person encountered. As each person is added, pointers are set to link the new node to previously encountered primary kin. In the case of the patrilineal Zuwaya data, this is achieved by a simple recursive process in which nodes are inserted into a tree structure representing the lineage. When all nodes have been created, the marriage data is read and the marriage links generated. Again, simple tree searching techniques are employed.
When all data has been read and the network has been established, control passes to the main part of the program. This begins by displaying the population in lineage form, together with a menu of options, in the main output window (Fig.4). The menu was prepared using med, a general purpose menu editor with associated function library, developed by John Bovey (Bovey, 1983). This allows the programmer to design menus for the Perq display and provides a simple interface for applications programs. Menu selections are made by pointing a cursor at an area of the menu and pressing one of the puck buttons. The options provided can be grouped under three headings: display, editing and miscellaneous functions.
The puck may also be used to select an individual from the displayed diagram. The selected individual is highlighted on the display and stored information relating to the individual is displayed in a small window at the bottom of the screen. Thus the diagram and pointing device may be used as a means of accessing the database of information stored on the population.
The display functions allow the user to select the manner in which the data is to be viewed. DISPLAY ORIGINAL DIAGRAM may be used at any time to return to the initial display and to cancel any modifications which may have been made using the editing functions. This is achieved by selecting the menu option. The remaining options all operate with respect to a selected individual and produce a display of ascending and/or descending kin, either in lineal or bilateral form. In each case the menu option is selected first. The cursor, which changes to indicate the selected function, is then used to select the individual who will represent the ego in the resulting display.
The order of operations in which the menu option is selected before the node on which the operation is to be performed appears to be counter-intuitive. In general, the user can be expected to have selected an individual and examined the stored information before deciding to request an alternative display. It would therefore seem preferable to adopt an approach in which there is always a 'currently selected individual' upon which a menu function may operate as soon as it is selected 'individual' upon which a menu function may operate as soon as it is selected. How these different approaches may be viewed by the user remains to be investigated, but it is clear that the chosen style should be used consistently throughout the program. As with textual command languages it would be a potential source of confusion to mix both prefix and postfix notations in a graphical interface.
The diagrams are generated in two stages. First, the network is traversed and the display position of each required node is calculated. Each node which is to be included in a diagram is marked as displayed and an index to displayed nodes is generated. This index is used by the second part of the display process which generates the diagram from information stored in the indexed nodes. The algorithm used to traverse the network depends on the form of display required. Descending lineage diagrams require a simple in-order traversal starting from the root node and visiting nodes in the order offspring - root - sibling. This is a conventional recursive process of visiting all nodes of a binary tree. Lineal ascendants may be found by repeatedly visiting the parent of the required sex until the head of the lineage, or one who has no known parent, is found. Bilateral descent diagrams are produced by traversing the network in the order offspring - root - spouse - sibling. Bilateral ascendants are found by following the route father - root - mother. In both cases, searching continues until a user-defined search depth is reached or until a node which has already been marked as displayed is found.
The second stage, which produces the display, is the same for all types of diagram. Using the index generated in the first stage, symbols are drawn at the calculated display position and ascent and descent links are drawn. Marriage links are handled by a separate part of the program which ensures that no such links are obscured by overlapping. Marriage links are, of course, included automatically in all bilateral diagrams, but are an option in the case of lineal diagrams.
Facilities for editing the displayed kinship diagrams were required for two reasons. Firstly, when producing diagrams whether for personal data exploration, or in order to communicate information to others, in publications or as lecture slides for example, it will invariably be necessary to simplify to gain clarity. Secondly, the interactive thinking tool concept presupposes that the user will employ professional skills in using the program so that diagrams may be manipulated in order to clearly illustrate important information.
At an early stage of development of the display functions three possibilities presented themselves. At the most complex level we might attempt to incorporate expert knowledge into the program to enable it to decide the most appropriate way in which to display a diagram. This would require that the program has a model of what constitutes an acceptable kinship diagram. This approach was rejected on the grounds that no matter how sophisticated the program became, it remained unlikely that program and user would concur on the optimal solution in any particular context. Indeed, with such diagrams., there is no correct solution. The added complexity was simply not worthwhile and would certainly degrade the overall performance of the program.
Somewhat simpler would be to incorporate an algorithm which minimised the number of crossing lines produced. This would probably be developed from a planar graph drawing algorithm, although it would be important to remember that a kinship diagram is not simply an arbitrary network as it conventionally encodes relative temporal information in the vertical location of nodes. Once such rules need to be accommodated it becomes clear that this approach would, in effect, be a limited implementation of the first solution, although the aim would be merely to minimise the amount of editing which the user must perform in order to arrive at an intelligible diagram.
By far the simplest approach to display generation would be to ignore such problems and leave the user to sort out how best to display the information starting from a logically correct, but aesthetically unpredictable, diagram displayed by the program. The current version employs this simple approach. In practice, it has proved relatively straightforward to produce an intelligible diagram using the DELETE PERSON and MOVE PERSON editing functions. However, it is sometimes necessary to experiment with moving individuals to reduce crossing lines in order to understand the diagram before a serious attempt to produce a final diagram is made. With complex diagrams involving large numbers of individuals this can be quite tedious. For this reason, a minimal overlap algorithm should be added in a future version. The theory this should provide an optimal balance between machine and human contribution to the production of diagrams.
Deleting individuals is achieved by first selecting the menu option and then moving the cursor to the required person. Pressing a puck button results in the removal of the person from the display. Moving individuals is similar except that selecting the person, the cursor is moved to the required new position and the button pressed once again. In addition to these functions for detailed editing, it is also possible to remove all children and unmarried individuals from a diagram. This provides a rapid means of simplification which is frequently required in order to exclude those individuals who do not participate in further relationships.
2.2.3. Miscellaneous functions
There are three remaining functions available from the menu. The SAVE PICTURE option may be used to save a representation of the displayed diagram as a UNIX plot(5) format file. This file may be viewed later on the Perq screen using the plot (1) utility. It may also be displayed on any other available graphics device such as a terminal or plotter. An alternative method involves use of the splot utility developed by Peter Simeon (Simeon, 1984), which allows part or all of the Perq display to be reproduced on a Canon LBP10 laser printer. This method was employed in producing the accompanying diagrams.
A number of improvements to the program which will be incorporated into future versions have already been mentioned in passing. One aspect of the original design which has not yet been included is a means of displaying the relationship between any two selected individuals. This is in essence a form of the graph-theoretic problem of finding the shortest route between two nodes of a graph for which a number of solutions are available. It is intended to incorporate this function in future versions of the program.
A major area for future development lies in increasing the general applicability of the program. At present, the data input section of the program is tailored specifically to the Zuwaya data with its' inherent patrilineal form. The reminder of the program is, however, free of such constraints and it is necessary to develop a means whereby data may be accepted in other forms. Current thinking on this problem revolves around a data dictionary approach by which a user may specify the format and content of the data, leaving the program to extract the information necessary for the construction of the internal model of the population. Before this can be achieved, it will be necessary to undertake a detailed analysis of the data collection methods of the wide variety of potential users in order to define the variations in data which must be accommodated.
The program as described is concerned solely with display and analysis of networks of kin relationships. Although the facilities provided can considerably reduce the tedium associated with producing intelligible diagrams, it does not address a number of the major reasons behind many anthropologists concern with kinship. Kinship in itself is not central to many aspects of anthropological research, rather it is seen as means of illustrating and explaining the interaction of many other networks of social relations. A significant development of the concepts employed in the present program would be to extend the range of relationships and types of related objects so as to incorporate these other networks. This might be achieved through the provision of abstract data types and relations which could be defined according to the user's requirements. In this way it would be possible, for example, to examine household composition, land tenure and job specialisation against the background of kinship relations. As many anthropological observations include an historical dimension, the system should provide a means of displaying the dynamic nature of these relationships as well as more conventional static views.
At present gtree represents a valuable tool which, given wider availability of appropriate hardware, could reduce the burden of mechanical tasks involved in analysing genealogical data. It is an example of a type of software which goes beyond current perceptions of normal social science computing, and illustrates several ways in which personal workstations can significantly enhance the working environment of social scientists, who, with the exception of geographers, have not generally been seen as potential beneficiaries. Gtree provides an example of the kind of material which might be used in an integrated text and graphics production environment which can be provided on such hardware. The ability to display different ideological and structural views of data also suggests potential applications to teaching.
The extensions described above would result in a more powerful tool which would find many applications not only in anthropology, but also in related disciplines including social and economic history and sociology. Until such hardware is more generally available to anthropologists and other social scientists, it will however remain a useful testbed for examining methods of interaction employing graphical representations of data. The development of the proposed extensions would provide significant contributions to the application of such methods and to the study of human-computer interaction in general.
LIST OF REFERENCES
Avondo-Bodino, G. 1979: Graph Theory and Operations Research. In Applications of Graph Theory, (eds.) Wilson, R.J. and L.W. Beineke.
Barnes, D.J. 1985: A Controlled Window Management Environment for the Perq. Software Tools Centre, UKC..
Buchler, I. and H. Selby 1968: Kinship and Social Organisation: an introduction to theory and method.
Cliff, A.D., P. Haggett and J.K. Ord 1979: Graph Theory and Geography. In Applications of Graph Theory. (eds.) Wilson, R.J. and L.W. Beineke.
Hiagh, J.G.B. 1985: The Harris Matrix as a Partially Ordered Set. Computer Applications in Archaeology.
Roberts, F.S. 1979: Graph Theory and the Social Sciences. In Applications of Graph Theory. (eds.) Wilson, R.J. and L.W. Beineke
Ryan , N.S. 1985: Some Thoughts on an Archaeologist's Toolkit. Computer Applications in Archaeology.
Simeon, H.P. 1984: Splot - making plots of Perq windows on the Canon laser printer. Software Tools Centre, UKC.
Wilson, A.D. 1984: Programs to Process Trees, Representing Program Structures and Data Structures. Software - Practice and Experience 14, 807-816.