Comparative classifications of ethnographic data: constructing the Outline of Cultural Materials

O. Kortendick
M. Fischer

Section 2

Findings

Preparation of the corpus

We treated the OCM as ordinary text. To analyse it by machine, we removed the headers and the index, which is itself quite large. We used the machine-readable file sold by HRAF, Inc. Unfortunately, it contains many typographical errors, and we had to remove categories 772-778 because they were duplicated. We left the typographical errors in place.

We first took some basic measurements. The corpus contains 45,567 tokens and 7,497 types, which gives a type-token ratio (TTR) of about .16. This in itself is quite remarkable. The OCM is quite noisy: it contains a lot of distortion or, in other words, an astonishing degree of redundancy. Roughly half of the types (51.58%) appear only once, and this group accounts for only 8.5% of the tokens.

In other words, 91.5% of the Tokens (the content) are represented by only 48.42% of the types.
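The measures above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used: the tokeniser (runs of letters, lower-cased) is an assumption, and the sample sentence is a hypothetical toy input rather than OCM text. Only the two corpus totals in the last line are taken from the figures reported above.

```python
import re
from collections import Counter


def lexical_profile(text):
    """Return token count, type count, TTR and hapax shares for a text."""
    # Assumption: a token is a run of letters, lower-cased. The tokeniser
    # actually used for the OCM corpus is not documented in this section.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Hapaxes are types that occur exactly once.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "ttr": n_types / n_tokens,
        "hapax_share_of_types": hapaxes / n_types,
        "hapax_share_of_tokens": hapaxes / n_tokens,
    }


# Hypothetical toy input, not a quotation from the OCM.
sample = "general statements on hunting; special statements on fishing, e.g., hunting of birds"
profile = lexical_profile(sample)
print(profile)

# The corpus totals reported above give the quoted ratio of about .16.
ocm_ttr = 7497 / 45567
print(round(ocm_ttr, 2))
```

Note that a tokeniser of this kind splits "e.g." into the two tokens "e" and "g".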

Table 1 shows the distribution of types and tokens:

Tokens   Tokens %   Condition (frequency)   Types   Types %
23,635   51.87      >27                     198     2.64
29,130   63.93      >12                     511     6.82
33,632   73.81      >6                      1,011   13.55
41,700   91.5       >1                      3,630   48.42
45,567   100        all                     7,497   100
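A table of this kind can be derived from a word-frequency list by counting, for each threshold, the types whose frequency exceeds it and the tokens those types cover. The sketch below shows the idea; the function name and the toy frequencies are hypothetical and chosen only for illustration.

```python
from collections import Counter


def coverage_by_threshold(counts, thresholds):
    """For each threshold t, report the types with frequency > t and the tokens they cover."""
    total_tokens = sum(counts.values())
    total_types = len(counts)
    rows = []
    for t in thresholds:
        kept = [c for c in counts.values() if c > t]
        rows.append((
            t,
            sum(kept),                        # tokens covered
            100 * sum(kept) / total_tokens,   # tokens %
            len(kept),                        # types kept
            100 * len(kept) / total_types,    # types %
        ))
    return rows


# Hypothetical toy frequencies, not OCM counts.
counts = Counter({"of": 6, "and": 4, "special": 2, "sex": 1, "labor": 1})
rows = coverage_by_threshold(counts, thresholds=[3, 1, 0])
for t, tok, tok_pct, typ, typ_pct in rows:
    print(f">{t}: {tok} tokens ({tok_pct:.2f}%), {typ} types ({typ_pct:.2f}%)")
```

A threshold of 0 keeps every type and so corresponds to the "all" row.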

There is a positive linear relationship between tokens and types:

Graphic A

Pearson’s r confirms this with a high value of r = .91 (p = .03*).
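The coefficient can be checked against Table 1. The sketch below assumes that the correlation was computed over the five rows of that table (token totals against type totals). The text does not say so explicitly, so treat this as one plausible reading rather than the documented procedure.

```python
from math import sqrt

# Rows of Table 1: tokens and types at each frequency condition.
# Assumption: these five pairs are the observations behind the reported r.
tokens = [23635, 29130, 33632, 41700, 45567]
types = [198, 511, 1011, 3630, 7497]


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(tokens, types)
df = len(tokens) - 2
# t statistic for the null hypothesis of no correlation.
t_stat = r * sqrt(df / (1 - r ** 2))
print(f"r = {r:.2f}, t = {t_stat:.1f}, df = {df}")
```

Under this assumption r rounds to .91. The t statistic (about 3.8 on 3 degrees of freedom) lies between the two-tailed critical values 3.182 (p = .05) and 5.841 (p = .01), which is compatible with the reported p = .03.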

To put these results in context: they would not be extraordinary for ordinary text. Oral interview material typically has an even lower TTR (below .05). The point is that the OCM is not ordinary text, and it should by no means be this redundant. It should have a higher TTR than literature, because it is supposed to be precise and to make each point clearly. It does not. The individual categories therefore seem to lack sufficient distance from one another, largely describing the same things with the same words, which is unfortunate if high rates of ICR are the goal.

To test this hypothesis, we recoded the OCM using its own words as categories. The approach builds on Kortendick 1995, which showed a high correlation between coding systems that use the most frequent words as categories and systems based on content-defined categories, when both are applied to the same text. In other words, the things people think and express are connected with the way they express them.

We made a list of the 198 most frequent words (condition: frequency > 27):

Sample 1 (the first 25)

3120 of
2728 and
1413 e
1400 g
687 see
657 the
644 etc
610 also
395 in
307 for
306 to
261 or
219 special
217 organization
207 with
191 social
180 a
166 manufacture
154 by
131 sex
129 about
128 general
128 use
127 activities
126 labor

It would be unfair to take this list as it stands, so we cleaned it up, keeping only nouns and adjectives. This left 163 words, which represent 9,274 tokens. The remaining 14,361 of the 23,635 tokens covered by the original list “got lost”; this is the degree of distortion in the text. Measured against the corpus as a whole, the filter removes 31.52% of the material.

Sample 2 (first 25 words)
219 special
217 organization
191 social
166 manufacture
131 sex
128 general
128 use
127 activities
126 labor
126 public
123 types
113 specialized
112 behavior
107 methods
107 political
106 techniques
105 production
102 categories
101 military
96 equipment
92 statements
92 status
89 ideas
86 government
84 religious

High redundancy is still evident:

Graphic

As you can see, the curve is much smoother than in graphic N; the frequencies of the different words are more even.

We used TEXTPACK/PC for coding. The dictionary was run over all segments, counting the occurrences of each code. An SPSS output file was created and converted to Anthropac’s conventions. Every segment was coded at least once. That is, with just 163 codes we were able to tag every single segment, which is a good indicator of how close the segments are to one another. As many as 51 codes appeared in a single segment (Segment 179):

Graphic
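The coding step amounts to a dictionary look-up per segment. The following sketch is a plain-Python stand-in, not TEXTPACK/PC itself: the function name, the tokeniser and the two segment texts are hypothetical, and the dictionary holds only five of the 163 codes, picked from the words mentioned in this section.

```python
import re


def code_segment(text, dictionary):
    """Count how often each dictionary word occurs in one segment."""
    # Assumption: tokens are lower-cased runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return {word: tokens.count(word) for word in dictionary if word in tokens}


# Five of the 163 codes, for illustration only.
dictionary = ["special", "organization", "social", "ideas", "patterns"]

# Hypothetical segment texts, not quotations from the OCM.
segments = {
    "A": "ideas about social organization; special patterns of social life",
    "B": "special equipment",
}

coded = {seg_id: code_segment(text, dictionary) for seg_id, text in segments.items()}
# Number of distinct codes per segment, and segments that received no code at all.
n_codes = {seg_id: len(codes) for seg_id, codes in coded.items()}
uncoded = [seg_id for seg_id, codes in coded.items() if not codes]
print(coded, n_codes, uncoded)
```

An empty `uncoded` list corresponds to the observation that every segment was coded at least once.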

The distribution of frequencies shows a perfect normal distribution curve. The frequencies are grouped in 10ths here for better graphical representation.

Graphic


There is a moderately strong relationship between the number of words in a segment and the number of its codes, but it is not as strong as expected:

Table (Pearson’s r on Number of Codes by Number of Words per segment)

     - -  Correlation Coefficients  - -

             NCODES     NWORDS

NCODES       1.0000      .5729
            (  717)    (  717)
            P= .       P= .000

NWORDS        .5729     1.0000
            (  717)    (  717)
            P= .000    P= .

(Coefficient / (Cases) / 2-tailed Significance)
" . " is printed if a coefficient cannot be computed

Within Anthropac, we dichotomised the dataset. It does not seem to make much sense to treat the code variables as ordinal, let alone interval-scaled, so we only tested whether a code appeared in a segment or not. We then calculated similarities on the basis of positive matches (Tanimoto procedure) and applied Johnson’s hierarchical clustering with single linkage to the similarities between segments.
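The three steps (dichotomising, Tanimoto similarity, single-linkage clustering) can be sketched as follows. This is a generic illustration, not Anthropac's implementation: the segment labels and code counts are hypothetical, and the Tanimoto coefficient is taken in its usual form for binary data, that is, positive matches divided by the number of codes present in either segment.

```python
from itertools import combinations


def tanimoto(a, b):
    """Positive matches divided by the codes present in either segment."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link(c1, c2, sim):
    """Single linkage: two clusters are as similar as their most similar pair."""
    return max(sim[frozenset((x, y))] for x in c1 for y in c2)


def single_linkage(items, sim):
    """Agglomerative clustering; returns the merges as (level, members) tuples."""
    clusters = [frozenset([i]) for i in items]
    history = []
    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda p: link(p[0], p[1], sim))
        level = link(best[0], best[1], sim)
        merged = best[0] | best[1]
        clusters = [c for c in clusters if c not in best] + [merged]
        history.append((level, sorted(merged)))
    return history


# Hypothetical code counts per segment (not OCM data).
counts = {
    "s1": {"ideas": 2, "notions": 1, "scientific": 1},
    "s2": {"ideas": 1, "notions": 1, "scientific": 2, "patterns": 1},
    "s3": {"techniques": 3, "equipment": 1},
    "s4": {"techniques": 1, "equipment": 2, "ideas": 1},
}

# Dichotomise: keep only whether a code occurs in a segment, not how often.
present = {seg: {code for code, n in c.items() if n > 0} for seg, c in counts.items()}
sim = {frozenset((a, b)): tanimoto(present[a], present[b])
       for a, b in combinations(present, 2)}
history = single_linkage(list(present), sim)
for level, members in history:
    print(f"{level:.2f}  {members}")
```

In this toy example s1 and s2 merge first (similarity .75), then s3 and s4, and the two pairs join last at the low level contributed by the single tag that s1, s2 and s4 share. Under single linkage, one close pair is enough to draw two clusters together.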

There are many clusters, each of which represents the closeness of its segments. Let’s have a look at some examples (we chose 7).


The first group consists of Categories

271 (Water and Thirst)
824 (Ethnobotany)
825 (Ethnozoology)
151 (Sensation and Perception)
822 (Ethnophysics)
823 (Ethnogeography)
821 (Ethnometeorology)
826 (Ethnoanatomy)
827 (Ethnophysiology)
828 (Ethnopsychology)

(Read this graphical representation from left to right; the position of 271 is not shown.)
Let’s concentrate for the moment on the 82* series, whose members lie closest together. What are the reasons for this? The OCM definitions of these categories are very similar and share many tags (throughout the 82* series, categories have between 10 and 30 tags). The closeness is created mainly by seven tags: 23 (ideas), 55 (development), 56 (cultural), 72 (associated), 118 (notions), 137 (scientific) and 161 (patterns). But what do these words mean? In other words, how do we fill vague expressions like “notion” or “idea” with any content at all? The definitions of these categories do not provide much help, and the categories therefore have many lateral connections to other categories (> 10). They are in fact defined by these connections, not by the content of their definitions.

Let’s look at another example. When we prepared the raw document for the analysis, we left in the introductory segments, that is, the short segments that precede each series of categories. These segments carry codes of the form **0. There are 79 of them in the corpus, and 63 (!) of them (= 79.75%) form a perfect cluster. These descriptions are very uniform; their information content approaches zero.
A look at them makes clear why. They start almost identically, with a phrase like “General statements dealing with several aspects...”. What is the purpose of this? In other words, part of the redundancy, of the noise within the OCM, comes from phrases repeated as if on a Tibetan prayer wheel, which do not support coding at all.

Graphic

The authors of the OCM (1983: xix) describe the history of the system in their “development perspectives”. A major change came in 1977-79, when four major categories were established: Towns, Cities, Districts, Provinces. These, too, were subdivided. After testing with the Okayama File, the new categories were discontinued: “The new categories proved to be too cumbersome and created more difficulties than they solved.” Our cluster analysis shows the reasons for this, and would have saved some time had HRAF conducted it in the late 1970s. Consider the following categories, which appear here again in perfect harmony:

633 (Cities)
632 (Towns)
634 (Districts)
635 (Provinces)
636 (Dependencies) (missing in the graphic because it joins the group later)

Yet they are better defined than the 82* series. They contain distinct examples and have far fewer lateral connections (0-4) than the first group. Nevertheless, coding remains difficult because these non-discrete categories would need much more example material. Furthermore, their differentiation remains unclear: why do these entities, oriented towards Western society, have to be integrated into the OCM?

Graphic Graphic


Another criticism of the OCM is its lack of unidimensionality. The following example makes the point clear:

We find categories:

348 (Industrial structures)
346 (Religious and Educational Structures)
343 (Outbuildings)
344 (Public Structures)
347 (Business Structures)
345 (Recreational Structures)
349 (Miscellaneous Structures)

First, there should be a distinction between public and non-public “structures”, as the authors call them, or between industrial and non-industrial ones. A church or a library (examples from category 346, Religious and Educational Structures) is public as well. And is a bathhouse (an example from category 343) public or private?
Again, the many lateral connections make the authors’ uncertainty transparent.

Lack of completeness is another weak point. Although many people think of the OCM as biased towards material culture, consider the “Food quest” cluster.
Graphic
It holds the categories:


223 (Fowling)
225 (Marine Hunting)
224 (Hunting and Trapping)
226 (Fishing)

Why is there a distinction between hunting birds, hunting, say, whales, and hunting all other animals? Just think of the consequences: when coding material with the OCM, you automatically get many entries under 224, which you then have to explore in detail if you are not interested in birds or whales. Moreover, coding with such an incomplete system gives the impression that all hunting is basically the same in all cultures, except for birds and sea mammals.

We don’t want to be unfair in this analysis. But in the authors’ own words, the purpose of the OCM is “a comprehensive subject classification system pertaining to all aspects of human behaviour and related phenomena”. Take this as the “aims of research” that the system has to meet. However, consider the following:

The group of

603 (Grandparents and Grandchildren)
604 (Avuncular and Nepotic Relatives)
606 (Parents-In-Law and Children-In-Law)
605 (Cousins)
607 (Siblings-In-Law)

Closeness between these categories arises from their related descriptions, for example:

AVUNCULAR AND NEPOTIC RELATIVES -- patterns of behavior between paternal and maternal uncles and aunts and fraternal and sororal nephews and nieces; respective rights, privileges, and powers of the relatives involved; special elaborations (e.g., avunculate, amitate); relationships with the spouses of avuncular and nepotic relatives; etc.

But then, what is meant by “patterns of behaviour” in a human context? Virtually everything could fit into that category. And why should one code only behaviour between the members of these categories? Why not between cousin and sibling-in-law? Imagine how many combinations arise from this; and if one of them is important enough to code, why not the others?

Let this be the last example of how different kinds of category join one big cluster because of their obvious closeness, and let’s concentrate on the 24*, 38* and 39* series within it.

These are categories dealing with agriculture, chemical industries and capital goods industries. The problem, however, is that they are described with basically the same words. Mutual exclusiveness, the feature of a good system, is once again compromised because these very diverse themes share a vocabulary. Look at a detail such as categories 274 (Beverage Industries) and 295 (Special Clothing Industries), which are combined at a high level. Why are these different themes so close together? The answer is that they are defined with nearly the same words:

Read this graphic as follows: each category is represented by a bar, and the values are dichotomised. The two categories have 11 tags in common.

Graphic

Conclusions

This study arose from our investigation of using the OCM in a major ethnographic project that requires an a priori system. Although we had been aware of problems with the OCM for some decades, the results of this examination were rather surprising.

But what are the practical consequences of our results? Does this mean that the effort of fifty years by Murdock and HRAF has been wasted? Fortunately not, though this has more to do with anthropological custom than the precision of the OCM.

What the OCM consists of is a large number of terms that anthropologists have found helpful for describing humans, their creations and their environments. It is a fairly comprehensive list of technical terms which are by and large well known to anthropologists, at least those inducted before 1985 or so. It works because these terms are already well known, and are likely to flag sections of interest and value to anthropological research, since the terms arose from anthropological research. What the OCM is not is a well ordered thesaurus of anthropological terminology. The categories that appear are for convenience, not classificatory efficacy - there is little that can be deduced because a term appears in a particular category.

The OCM is then, at best, a weak classification system, in the sense that although the individual terms themselves are useful for classifying material, and a great deal of anthropological experience and expertise is represented, little is gained by the overall structure within which these are organised. The OCM is a system (or outline) only in the sense that a heap or pile is a system.

