CSAC Ethnographics Gallery
CSAC FeatureMainResearchResourcesTeachingOrganisationsOther

Introduction to Database Systems for Anthropologists


Janet Bagg, UKC

BICA Issue No. 8: June 1992

This article is aimed at anthropologists with little computer experience who wish to know more about the types of things that are possible to do using such a machine. Conversations with anthropologists have indicated that many would like to know more about database systems. The form comes from teaching classes and individual students as well as from dissatisfaction with introductory books and user manuals filled with business or natural science examples. I have tried to concentrate on the principles of using the type of application rather than on any specific product. For which keys to press or menu items to choose, refer to the product manual. Each product is different in detail but shares common elements with others.

What is a database

A database is an organised collection of information held for some common purpose. It can be a collection of numeric data, categorised information on a topic or long pieces of narrative text. In this article I am using the term to refer to the type of information usually handled by a Database Management System (DBMS). A DBMS is a specialised program designed to handle information in a particular form. I will deal with the two most common types of DBMS in the academic world, the flat file and the relational.

This introduction describes simple examples of such information, how best to represent them and what can be done with them. These examples are deliberately restricted to a few pieces of information, whereas in practice databases usually contain many.

Three fundamental units of information in a database system are the field, the record and the table.

A field contains information of a specific type. In a collection of information about persons one item recorded might be age. When storing this information in a database age would be a field which would contain the age of each person. It is useful to think of fields as the columns in a printed table, each holding information about a specific aspect of the matter being represented.

A record is an instance of the entity being represented. In the above example, a record would contain information about a single person. In a printed table this might be a row. For every person the database might contain name, age and sex.

A table is the combination of rows and columns. Again the analogy of the printed table in a book or a report is useful. It contains information about a number of instances of the same entity type, for example persons, farms or villages.

There is a problem with database system terminology. Many aspects of a database systems can be referred to by a number of terms. Product suppliers have an unfortunate tendency to use terms which are inconsistent with, and occasionally in conflict with, those used in other products and those used in standard languages such as SQL. The manufacturer of one widely used micro database system uses the term database to refer to what I, following more common usage, have called table. This usage is not only confusing but also reflects operational inadequacies in the particular product. At the end of this article is a glossary of terms used to refer to aspects of a database system.

What is a database management system suitable for?

Whether or not to use a database system depends in part on the characteristics of the information and its purpose.

If the data is largely numerical and the purpose is some form of numerical analysis, a spreadsheet or statistical analysis package such as SPSSx may be more appropriate. Most DBMS have simple numerical functions built in which allow simple calculation, but are not really suitable for this type of analysis on their own.

Where data consists for the most part of large pieces of text and the purpose of examining the information relies on its textual nature, again a DBMS is probably inappropriate. Special products often called Textbases or Text Databases exist which are designed for working with this type of information. Where such software is not available a DBMS might prove useful for storing and accessing an index to textual information.

The most common type of information stored in a database system consists of categorisations of data and sometimes of numeric data as well. The purpose is usually to select or sort this information on some criteria internal to the data. Numeric data can be extracted or calculated and transferred to a spreadsheet or statistical package as a secondary operation.

A few database systems are able to store pictures. Those which are at present commercially available tend to be highly specialised towards retrieval and display, and are limited in more general facilities. DBMS able to store pictures as part of a full range of facilities exist at present only in experimental form but will become more commonplace during this decade.

It is in fact possible to make most DBMS perform all of the above tasks, but this requires great expertise, which is rarely available and is usually better achieved by using a more appropriate type of package.

A simple example - Household information


This section describes a simple database containing information about households. It consists of one table with n fields. The fields are:

HHid, location, members, type and income


and will contain information on the location of the residence (this could be address, village, quarter etc.), the number of household members, a classification of household type and the total household income.

The first field, HHid, is a unique identifier given to each household. In this case an integer number is used but an alphabetic or alphanumeric code could be employed instead. Identifiers of this type are important in distinguishing different occurrences of the same entity. This is because other possible identifiers, such as location, may not have unique values.

Before putting information into a database, the DBMS must be told what the data is like in each field. That is whether it consists of numbers or strings of characters (letters or digits), how many characters can be expected, whether the numbers are integers (eg. 67) or real (eg. 6.7) and the number of decimal places. Some systems have special types for dates, time, currency units and blocks of text. There are limitations with such types. Dates must usually be after a specified day and year, which can sometimes be quite recent (post 1900 or even within the last twenty or so years). They must also be specified in an exact manner, such as 8/5/88. Blocks of text can be included in some systems but can not be searched. Currency units usually have to be of a decimal type.

In the example the fields are defined as follows:


HHid integer number
location string - maximum 30 characters
members integer number
type string - maximum 5 characters
income currency to 2 decimal places

A DBMS will usually have a facility for entering data into a table, record by record and will allow for some data validation. Validation is where only a specified range of values is allowed. This would force an error to be reported if a letter was typed into the field for number of members as this must be an integer number. Many systems also allow ranges to be set, for instance known upper and lower income limits might prevent some typing errors.

When all records are in the database, information can be extracted on various criteria, records with certain values can be counted and the information presented sorted into a different order. Sorting can usually be done on any field and calculations on any numerical field. With large databases, it is possible to select particular fields to be displayed. To answer a question about the household types are found among the poor (defined as less than 100 currency units), households with less than 100 units of income could be selected and the number with each type counted.

Alternatively all of the records could be presented but sorted by number of members, grouping together households of the same size. Average income could be calculated for the whole set of records or for groups of records selected on certain criteria, such as residence in a particular area.

This has described some of the possibilities of using a DBMS to record a few pieces of information about an entity of a particular type. The data is in the form of a flat file database. Some DBMS can only use data in this form. The example is unrealistically simple but illustrates the relevant points.

A second example - Households and members

If it were wished to record more detailed information about households, such as the name, age and sex of each member this could be achieved by making a table with these fields plus the household fields in the previous example. An identifier for each person, Pid, has also been added.

HHid, name, age, sex, Pid, location, type, income


However, this would not be a good idea because the household data would have to be repeated for each member of each household. This would be a waste of space and of processing time, but above all of input time. For a household with ten members, exactly the same information would have to be typed in ten times for the four household fields. If a change needs to be made to any of the household fields it will have to be repeated for each occurrence.

It would be much better to have two different tables - one for people and one for households - each with the information recorded only once. The two tables would have to be related together in some way. One way of doing this would be to use the unique identifier HHid. This would be a field in the household table and also in the people table. Each person record would contain the number or code of their household. This common field could then be used to join the information in the household table to that in the people table. In database terminology a unique identifier field is known as a primary key . When such a field is placed in another table in order to create a link (HHid in the person table) it is known as a foreign key . These two tables are a very simple relational database.

HHid, location, type, income

Pid, name, age, sex, HHid

By joining together the two tables through the household identifier, questions such as "What is the average age of members in households of each type ?" could be answered. Information in one table is here used to select relevant cases from the other. Some DBMS require the field that will be used for joining tables (in this case the HHid) to be specified in advance. Others, theoretically, allow any fields to be used to join two tables. In practice this is limited by the nature of the fields (they must both be of the same type) and by the logic of the information (one would not try to join on name and location).

Because information on individuals within the household has been added, the members field is no longer needed. The number of people in each household can be calculated by counting the number of people with each household identifier. Type is a more difficult concept and may rely on complex rules of classification. Although it may be possible to write database queries that will work out types from information in the people table, it could be a complex and time consuming task and could require extra data not recorded in the database (for instance the kin relations of the members).

This simple database can be altered to allow other possibilities. If information is available on the income of each household member, this could be recorded in a field within the people table. Perhaps total household income could be calculated from this and it would also allow breakdowns of income by age and sex. If, however, there appeared to be a separate category of general household income, distinct from personal incomes, a field in the household table would be needed to record this data. Most DBMS allow the same name to be used for fields in different tables - though not in the same one - but for clarity it is usually better to have unique names.The two tables now look like this,

HHid, location, type, income

Pid, name, age, sex, pincome, HHid

This would allow comparison between proportions of total income contributed by individuals (perhaps examined by sex and age) and by the household as a unit.

This is a very simple relational database consisting of the two tables, household and people linked by the household identifier (HHid). The types within which the information is classified for this example have been used uncritically. To represent information in a more complex way requires more tables but the method of creating and linking them is similar. The objective in designing a relational database is to store each discrete piece of information once and once only (except of course for the keys used for linking tables eg. HHid). If a piece of information has to be changed at any time - if perhaps an erroneous location has been entered - this is done in one place and nowhere else.

Similar operations to those described for one table can be performed on each table alone, on items selected from one table on the basis of criteria from the other or on items selected on the basis of values in fields in both tables. The personal income of females over 25 years old in households of a given type, for instance, can be examined by this method. As well as calculating figures, the database can be used for retrieving details within selected criteria such as where households are (location) which contain persons over 60 years of age living alone.

Getting information out


This section uses a common query language to show how information can be extracted from a relational database. The language is SQL. There are many database products available and, unfortunately, little standardisation in ways of obtaining information from them. SQL stands for Standard Query Language, and it is intended to be such for relational DBMS. Most better quality systems use a variant of it although each still has idiosyncrasies.

Many DBMS have facilities for creating output screens, often called forms or layouts. These are very useful if you are going to be asking for exactly the same information on many occasions, as perhaps with an administrative database of students' details. With a research database, questions are perhaps more likely to be unique and varied. For this purpose a query language is usually better than a pre-set report. It is hoped that this section may provide a rough guide to general principles. It is not intended as a manual.

"What is the average age of members in households of each type ?" select h.type, avr = avg(p.age)
from household h, person p
where h.HHid = p.HHid
group by h.type

type    avr
1b      23
2a      29
2b      45
The fields required as output are indicated (select... ), and the tables in which they are found (from... ). Fields are referenced with a prefix showing which table they are in (h. and p. ). In SQL, because joins are not pre-defined, the two joining fields HHid must be equal in order to match the correct people and households (where... ). The group by statement performs the query on records with similar values in the specified field(s), in this case type. The function avg calculates the average of the field following in brackets. Without group by, the average would be of all records rather than by household type.


What is the personal income of females over 25 years old in households of each type ?

select h.type, p.income
from household h, person p
where h.HHid = p.HHid and p.sex ="f" and p.age > 25
order by h.type


This query shows how several selection criteria can be used together (where...and.. ) and how the information can be presented in a sorted order ,in this case by household type (order by... ).

"Where are households which contain persons over 60 years of age living alone ?"

select h.location
from household h, person p
where h.HHid = p.HHid and p.age > 60 and p.HHid in
{ select HHid from person
group by HHid
having count(HHid) = 1
}


This is a more complex query. It has a main clause finding households with members over 60 years of age and a subclause finding one person households. The subclause is similar in syntax to the main clause but is placed within curly brackets. Note that although a join is needed between the two tables for the first task, the subclause can be performed on person alone by using the HHid.

"Compare the proportion of total income contributed by individuals and by the household as a unit." select h.HHid,h.income,totp=sum(p.pincome),
hprop=((float4(h.income)/float4(h.income+sum(p.pincome)))*100),
pprop=((float4(sum(p.pincome))/float4(h.income+sum(p.pincome)))*100)
from hh h,per p
where h.HHid=p.HHid
group by h.HHid,h.income

		
HHid income totp hprop pprop 1 2300 5700 28.75 71.25 2 3050 1750 63.54 36.46 3 4000 3000 57.14 42.86

This query demonstrates how a more complex calculation can be built
up in the select statement. The results - pprop for personal income
and hprop for household income - are percentages of the total income
from household and members.

Summary

A database management system holds information collected together
for a common purpose. This data is held in tables. Each table holds
pieces of similar information about entities of the same type. Each
entity forms a record and each piece of information is grouped with
other items of the same type into fields.

The most simple type of database, the flat file type, consists of a single table. For those used to statistical packages, such as SPSSx, this is similar to the way data is stored in rows and columns for such a package.
A relational database consists of two or more tables, each linked to at least one other table by a field common to both. The fields must be of the same type in each table to be joined. Ideally each piece of information should occur only once in the database.
A DBMS is most suitable for data consisting of discrete pieces of information, for instance a census or survey. It is much less suitable for large pieces of text and for mainly numeric data where the sole purpose is to calculate numeric results.
This paper has introduced a few of the basic concepts associated with database systems. It is hoped that a further paper will explore how more complex modelling can be achieved with a relational database system and look at how to design a database.

Glossary

		

Entity

    A term sometimes used for the generic type represented by a table.
In the example household and person would be entities while information
about individual households and people would be records.

Field

    The minimal unit of information into which records are divided. A
field can be visualised as a column containing data about a similar
aspect of a record.

Field type
Fields can consist of character strings (letters or digits), integer numbers, real numbers and special types such as dates or blocks of text. The type of a field determines to some extent the operations that may be performed upon it. Numeric fields may be used in calculation, character fields can be counted but not added together.

Flat File

    A database consisting of one table made up of records arranged into
fields. This is the simplest type of database.

Foreign Key

    A key from another table appearing in the current table to allow
a join between the two tables. Instances of a foreign key can, and
usually do, appear more than once.

Form

    Pre-set organised display of information from a database. Usually
can be displayed on the screen and also printed out. Useful when the
same information is required frequently. Also known as layout, screen
or report.

Primary Key

    A unique identifier field given to each record in a table

Query


A question asked of the information in a database. Usually framed in the query language for the DBMS.

Query Language

    Language for asking questions about the information in the database.
Useful for ad hoc
		  questions. DBMS usually have their
own query languages. SQL is supposed to be the industry standard but
exists in several variants. It is fairly simple to learn and to use.
Some DBMS allow small programs written in a computer language (such
as Pascal, Fortran or C) to be added to their query language. A few
allow SQL to be embedded in external computer programs.

>Record

    Each instance of a type of entity recorded within a table. Values
can be recorded for each of the fields. If fields are columns, then
records are rows. In a database of people, each person is a record.

Relational

    Consisting of two or more sets of information related together. The
relation is achieved by having a field which is common to two or more
tables.

Table


A database unit consisting of rows of records each with columns of fields. Other terms used for this unit are Database or File. These terms are used in a number of common products. The term Relation is used for this unit in academic computer science literature but rarely in application oriented discussions or in commercial products.


Welcome to the Ethnographics Gallery

Current News, Events and Activities for CSAC and Kent Anthropology

Archiving a Cameroonian Photographic Studio

Visual Anthropology at Kent

Ethnobiology of Europe website

Seeing the ring: A nineteenth century photograph album

Other News about Kent Anthropology


UKC Anthropology
Studying Anthropology at Kent

Kent Student Notes

Kent Anthropologists

UKC Anthropology Society



CSAC's Resources for Anthropologists

A collection of resources by CSAC and others that may be of use to anthropologists

Summary list of CSAC online publications
CSAC Studies in Anthropology ISSN 1363 1098
CSAC Publications
BICA Online
Anthropology Intermedia Library
more...

Bibliography and Reading
Online Reading for Anthropologists

Experience Rich Anthropology

Anthropological Index Online

CSAC Anthropology Bibliography (Makhzan)

UK Anthropology Theses


Organisations
The Royal Anthropological Institute

RAI Anthropological Index Online

RAI Calendar of Events

Association of Social Anthropologists

ASA Monographs CD Ordering Info

Society for Anthropological Sciences

SASci Wikid


CSAC thanks the following organisations for their support:
Centre for Sociology, Anthropology and Politics

Economic and Social Research Council

Arts and Humanities Research Council

Engineering and Physical Sciences Research Council

Medical Research Council

Higher Education Funding Council for England


About the Ethnographics Gallery

The Ethnographics Gallery is a project of the Centre for Social Anthropology and Computing. It is the direct descendent of the oldest online resource for Anthropology, dating to 1986. While we are giving the Gallery a face lift, please remember there are 20 year old pages within these halls.

We have no funding stream for this site, and so little time to maintain older material so it well may have a bit of a museum effect. Newer material will be appropriately wizzy.


What is the Ethnographics Gallery?

The Ethnographics Gallery is a publication of the Centre for Social Anthropology and Computing. This site contains reports on CSAC research, Teaching materials, and Resources that can be used for planning and executing research, including bibliographic materials, databases of ethnographic material, fieldnotes, descriptors, and software for working with ethnographic data. Suggestions always welcome, but we have no funding stream for this website. It contains materials created since 1986, and many of them are rather unfashionable by today's standards. We do, however, want everything to work! mail suggestions to csac@kent.ac.uk

Return to top

History

Our first internet service was begun in November, 1986, followed by our first web site in May, 1993, one of the first 400 web sites. The Ethnographics Gallery was founded in Feburary 1994. Our mission at that time was to provide a forum for anthropologists on the internet, and we helped to launch a number of organisations into cyberspace. Today, we are mostly concerned with novel forms of online publishing, disseminating our research, promoting learning resources, and disseminating information about using computers in anthropological research.

Return to top

Updated Sun Jan 22 20:00:14 GMT+00:00 2006
RSS Feed - Return to CSAC's Ethnographics Gallery

CSAC Ethnographics Gallery

Return to CSAC's Ethnographics Gallery