Some basic measurements were taken first. The corpus contains 45,567 tokens and 7,497 types, which gives a type-token ratio (TTR) of about .16. This in itself is quite remarkable. The OCM is quite noisy; it contains a lot of distortion, in other words, it has an astonishing redundancy. Roughly half of the types (51.58%) appear only once, and this group represents 8.5% of the tokens.
In other words, 91.5% of the tokens (the content) are represented by only 48.42% of the types.
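These basic measurements can be recomputed for any text. The following is a minimal sketch, assuming a simple lowercase word tokenizer (the tokenization rules used for the OCM corpus are not stated here); the function name `type_token_stats` and the sample sentence are purely illustrative.

```python
import re
from collections import Counter


def type_token_stats(text):
    """Return basic vocabulary statistics for a text.

    Assumption: tokens are lowercase runs of letters, digits or apostrophes;
    the original study may have tokenised differently.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    hapaxes = sum(1 for c in counts.values() if c == 1)  # types occurring once
    return {
        "tokens": n_tokens,
        "types": n_types,
        "ttr": n_types / n_tokens,
        "hapax_share_of_types": hapaxes / n_types,
        "hapax_share_of_tokens": hapaxes / n_tokens,
    }


stats = type_token_stats("the cat saw the dog and the dog saw a bird")
print(stats)

# Reported corpus figures: 7,497 types over 45,567 tokens.
reported_ttr = 7497 / 45567
print(round(reported_ttr, 2))
```

The last two lines only check that the reported type and token counts are consistent with the reported ratio.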
Table 1 shows the distribution of types and tokens:
| Tokens | Tokens % | Condition (frequency) | Types | Types % |
|---|---|---|---|---|
| 23,635 | 51.87 | >27 | 198 | 2.64 |
| 29,130 | 63.93 | >12 | 511 | 6.82 |
| 33,632 | 73.81 | >6 | 1,011 | 13.55 |
| 41,700 | 91.5 | >1 | 3,630 | 48.42 |
| 45,567 | 100 | all | 7,497 | 100 |
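Each row of the table counts the tokens and types covered by words whose frequency exceeds a threshold. A minimal sketch of that computation follows; the function name `coverage_above` and the toy frequency list are hypothetical, and only the final check uses figures from the table.

```python
from collections import Counter


def coverage_above(counts, threshold):
    """Tokens and types covered by words whose frequency exceeds `threshold`."""
    total_tokens = sum(counts.values())
    total_types = len(counts)
    kept = [c for c in counts.values() if c > threshold]
    return {
        "tokens": sum(kept),
        "token_pct": 100 * sum(kept) / total_tokens,
        "types": len(kept),
        "type_pct": 100 * len(kept) / total_types,
    }


# Toy frequency list (hypothetical, not the OCM data).
toy = Counter({"of": 6, "and": 4, "social": 2, "labor": 1, "sex": 1})
for t in (3, 1, 0):
    print(t, coverage_above(toy, t))

# The reported row for the condition ">27" is consistent with the totals:
print(round(100 * 23635 / 45567, 2), round(100 * 198 / 7497, 2))
```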
There is a positive linear relationship between tokens and types:
Graphic A
Pearson's r confirms this with a high score of r = .91 (p = .03*).
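The coefficient can be recomputed from the five rows of Table 1. This is a sketch under the assumption that r was calculated over exactly those five token/type pairs; the text does not say so explicitly, although the pairs do reproduce the reported value. The helper `pearson_r` is an illustrative plain implementation, not the routine of any particular statistics package.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Rows of Table 1 (assumed to be the data behind the reported r).
tokens = [23635, 29130, 33632, 41700, 45567]
types = [198, 511, 1011, 3630, 7497]

r = pearson_r(tokens, types)
# t statistic for testing r against zero, with n - 2 degrees of freedom.
t = r * math.sqrt(len(tokens) - 2) / math.sqrt(1 - r ** 2)
print(round(r, 2), round(t, 2))
```

With three degrees of freedom the two-tailed 5% critical value of t is about 3.18; the statistic computed here lies above it, which is in line with the reported significance.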
To make the point clearer: these results would not be extraordinary if you compared them with samples of ordinary text. Typically, oral interview material has even lower TTR scores (e.g. < .05). The point is that the OCM just isn't ordinary text. It should by no means be this redundant. It should have a higher TTR than literature, for it should be precise and make each point clear. Nevertheless, it doesn't. So it seems that the individual categories don't have enough distance between them, mainly describing the same things with the same words, which is unfortunate, of course, if you want high rates of ICR.
To test that hypothesis, we recoded the OCM using its own words as categories. The background of this approach is Kortendick 1995, where it was shown that there is a high correlation between coding systems that use the most frequent words as categories and systems that are based on content-defined categories, when applied to the same text. In other words, the things people think and express are connected with the way they express them.
We made a list of the most frequent 198 words (condition was >27):
Sample 1 (the first 25)
3120 of
2728 and
1413 e
1400 g
687 see
657 the
644 etc
610 also
395 in
307 for
306 to
261 or
219 special
217 organization
207 with
191 social
180 a
166 manufacture
154 by
131 sex
129 about
128 general
128 use
127 activities
126 labor
It would be unfair to take this list as it is, so we cleaned it up: only nouns and adjectives remained in the list. This left 163 words, which represent 9,274 tokens. 14,361 tokens were lost; that is the degree of distortion in this part of the text. Calculated on the corpus as a whole, 31.52% of the material is filtered out by this step.
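The figures in this step follow from simple arithmetic, sketched below. It assumes that the "lost" words are the difference between the tokens covered by the 198 most frequent words (23,635, the ">27" row of Table 1) and the tokens covered by the 163 remaining nouns and adjectives; the variable names are illustrative.

```python
corpus_tokens = 45567        # all tokens in the corpus
top_list_tokens = 23635      # tokens covered by the 198 most frequent words
kept_tokens = 9274           # tokens covered by the 163 nouns and adjectives

# Tokens removed when function words and other non-content words are dropped.
dropped_tokens = top_list_tokens - kept_tokens
# Share of the whole corpus that those dropped tokens represent, in percent.
dropped_share = 100 * dropped_tokens / corpus_tokens

print(dropped_tokens, round(dropped_share, 2))
```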
Sample 2 (first 25 words)
219 special
217 organization
191 social
166 manufacture
131 sex
128 general
128 use
127 activities
126 labor
126 public
123 types
113 specialized
112 behavior
107 methods
107 political
106 techniques
105 production
102 categories
101 military
96 equipment
92 statements
92 status
89 ideas
86 government
84 religious
High redundancy is still evident:
Graphic
As you can see, the curve is much smoother than in graphic N; the frequencies of the different words are evened out.
We used TEXTPACK/PC for coding. The dictionary was run over all segments, counting the overall occurrence of each code. An SPSS output file was created and converted to Anthropac's conventions. All segments were coded at least once. That means that with just 163 codes we were able to tag every single segment, which is a good indicator of their closeness. Up to 51 codes (in Segment 179) appeared in a single segment:
Graphic
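The general idea of dictionary-based coding can be sketched as follows. This is not TEXTPACK/PC's interface or file format, only a generic illustration: the three code numbers are tags mentioned in this text, while the segment texts and the names `DICTIONARY` and `code_segment` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical excerpt of a dictionary: word -> numeric code.
DICTIONARY = {"ideas": 23, "development": 55, "cultural": 56}


def code_segment(text, dictionary):
    """Count how often each dictionary code occurs in one segment."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(dictionary[w] for w in words if w in dictionary)


# Invented segment texts, for illustration only.
segments = {
    "821": "Ideas about weather; development of cultural notions and ideas.",
    "271": "Water supply and thirst.",
}
coded = {seg_id: code_segment(text, DICTIONARY) for seg_id, text in segments.items()}

for seg_id, counts in coded.items():
    print(seg_id, dict(counts), "distinct codes:", len(counts))

# Segments that received no code at all (none existed in the study itself).
uncoded = [seg_id for seg_id, counts in coded.items() if not counts]
print("segments without any code:", uncoded)
```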
The distribution of frequencies shows a perfect normal distribution curve. The frequencies were grouped in 10ths here for the sake of a better graphical representation.
Graphic
There is a moderately strong relationship between the number of words
in a segment and the number of its codes, but not as strong as expected:
Table (Pearson's r, number of codes by number of words per segment)
| | NCODES | NWORDS |
|---|---|---|
| NCODES | 1.0000 (N = 717) | .5729 (N = 717, p = .000) |
| NWORDS | .5729 (N = 717, p = .000) | 1.0000 (N = 717) |
Cells show the coefficient, the number of cases and the 2-tailed significance.
Within Anthropac, we dichotomised the dataset. It doesn't seem to make much sense to treat the code variables as ordinal or even interval scaled; we just tested whether a code appeared in a segment or not. After that we calculated similarities on the basis of positive matches (Tanimoto procedure). Then Johnson's hierarchical clustering was applied, with single linkage, based on the similarities between segments.
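The procedure can be illustrated with a small sketch. It is not Anthropac's implementation: it assumes the usual Tanimoto coefficient for binary data (codes shared by two segments divided by codes present in either), and it returns the clusters at a single similarity level rather than the full hierarchy. The segment data, the threshold and the function names are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Similarity of two code sets: shared codes / codes present in either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_linkage(items, threshold):
    """Clusters obtained by cutting a single-linkage hierarchy at `threshold`.

    Under single linkage, two items end up in the same cluster at a given
    level if a chain of pairs with similarity >= threshold connects them,
    so the clusters are the connected components of that graph.
    """
    parent = {name: name for name in items}

    def find(x):
        # Follow parent links to the cluster representative (with shortening).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if tanimoto(items[a], items[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in items:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(c) for c in clusters.values())


# Dichotomised toy data: segment -> set of codes present (hypothetical values).
segments = {
    "821": {23, 55, 56, 118},
    "822": {23, 55, 56, 137},
    "823": {23, 56, 118, 137},
    "271": {7, 9},
}
print(round(tanimoto(segments["821"], segments["822"]), 2))
print(single_linkage(segments, threshold=0.5))
```

Because only positive matches enter the coefficient, two segments that merely lack the same codes do not count as similar, which suits sparse presence/absence data of this kind.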
There are a lot of clusters which represent the closeness of segments. Let's have a look at some examples (we chose 7).
The first group consists of the categories:
271 (Water and Thirst)
824 (Ethnobotany)
825 (Ethnozoology)
151 (Sensation and Perception)
822 (Ethnophysics)
823 (Ethnogeography)
821 (Ethnometeorology)
826 (Ethnoanatomy)
827 (Ethnophysiology)
828 (Ethnopsychology)
(Read this graphical representation from left to right; the position
of 271 is not shown.)
Let's concentrate for the moment on the 82* series, for its members
are closest together. What are the reasons for this? If one looks
at the definitions in the OCM, it seems that they are very similar
and have lots of common tags (the whole 82* series has between 10 and
30 tags). The closeness is created mainly by seven tags: 23 (ideas),
55 (development), 56 (cultural), 72 (associated), 118 (notions), 137
(scientific) and 161 (patterns). But what is the meaning of these words?
In other words, how do we fill vague expressions like
"notion" or "idea" with any content at all? These
categories don't provide much help within their definitions,
and therefore have lots of lateral connections to other categories
(> 10). They are actually defined by these connections, not by
the content of their definitions.
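Finding the tags that hold such a cluster together amounts to intersecting the tag sets of its members. A minimal sketch follows; only the seven tag numbers named above come from the analysis, while the remaining tag numbers and the assignment to individual categories are hypothetical.

```python
from functools import reduce

# Hypothetical tag sets per category; only the seven tag numbers named in the
# text (23, 55, 56, 72, 118, 137, 161) are taken from the source.
cluster = {
    "821": {23, 55, 56, 72, 118, 137, 161, 4, 90},
    "822": {23, 55, 56, 72, 118, 137, 161, 12},
    "823": {23, 55, 56, 72, 118, 137, 161, 4, 33},
}

# Tags present in every category of the cluster.
shared = reduce(set.intersection, cluster.values())
print(sorted(shared))

# Tags that distinguish each category from the shared core.
for name, tags in cluster.items():
    print(name, sorted(tags - shared))
```

The smaller the second list is relative to the first, the less the definitions do to tell the categories apart.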
Let's look at another example. When we prepared the raw document
for the analysis, we left the introductory segments in. These are
the small segments that precede each series of categories. These segments
end on a **0 code. There are 79 of them in the corpus. 63 (!) of them
(= 79.75%) form a perfect cluster. These descriptions are very uniform;
their information content approaches zero.
Having a look at them makes clear why. They start almost identically
with a phrase like: "General statements dealing with several
aspects...". The question is, what is the purpose of this? In
other words, part of the redundancy, of the noise within the OCM, comes
from Tibetan-prayer-wheel-like phrases that don't support coding
at all.
Graphic
The authors of the OCM (1983: xix) describe the history of the system
in their development perspectives. A major change came
in 1977-79 when four major categories were established: Towns, Cities,
Districts, Provinces. These, too, were subdivided. After testing
with the Okayama File, these new categories were discontinued: "The
new categories proved to be too cumbersome and created more difficulties
than they solved." Our cluster analysis shows the reasons for
this, and would have saved some time if it had been conducted in the late 70s
by HRAF. Consider:
Here they are again in perfect harmony:
633 (Cities)
632 (Towns)
634 (Districts)
635 (Provinces)
636 (Dependencies) (missing in the graphic because it joins the group
later)
Yet they are better defined than the 82* series. They contain distinct examples and far fewer lateral connections (0-4) than the first group. Nevertheless, what makes coding difficult is the fact that these non-discrete categories should include much more example material. Furthermore, their differentiation remains unclear: why do these entities, oriented towards Western society, have to be integrated in the OCM?
Graphic Graphic
Another criticism of the OCM is its lack of unidimensionality. See the following example that makes the point clear:
We find categories:
348 (Industrial structures)
346 (Religious and Educational Structures)
343 (Outbuildings)
344 (Public Structures)
347 (Business Structures)
345 (Recreational Structures)
349 (Miscellaneous Structures)
First, there should be a difference between public and non-public structures,
as the authors put it, or between industrial and non-industrial ones. But a church
or a library (examples from category 346, Religious and Educational Structures) is
public as well. And a bathhouse (an example from category 343): is it public
or private?
Again, lots of lateral connections make the uncertainty of the authors
transparent.
Lack of completeness is another weak point. Although many people think
of the OCM as biased towards material culture, look at this cluster on the food
quest:
Graphic
It holds the categories:
223 (Fowling)
225 (Marine Hunting)
224 (Hunting and Trapping)
226 (Fishing)
Why is there a differentiation between hunting birds and, say, whales, and all the other animals? Just think of the consequences: when coding material with the OCM you automatically get many entries under 224, which you have to explore in detail if you are not interested in birds or whales. Moreover, coding with such an incomplete system gives the impression that all hunting is basically the same in all cultures, except for birds and sea mammals.
We don't want to be unfair with this analysis. But the authors of the OCM say it in their own words: The purpose is a comprehensive subject classification system pertaining to all aspects of human behaviour and related phenomena. Take this as the research aim that the system has to live up to. However, consider the following:
The group of
603 (Grandparents and Grandchildren)
604 (Avuncular and Nepotic Relatives)
606 (Parents-In-Law and Children-In-Law)
605 (Cousins)
607 (Siblings-In-Law)
Closeness between these categories arises from the related descriptions, e.g.:
AVUNCULAR AND NEPOTIC RELATIVES -- patterns of behavior between paternal and maternal uncles and aunts and fraternal and sororal nephews and nieces; respective rights, privileges, and powers of the relatives involved; special elaborations (e.g., avunculate, amitate); relationships with the spouses of avuncular and nepotic relatives; etc.
But then, what is meant by patterns of behaviour in a human context? Virtually everything could fit into that category. And why should one only code behaviour between the members of these categories? Why not between cousin and sibling-in-law? Imagine how many combinations arise from this, and if one of them is important enough to code, why not the others?
Let this be the last example of how different kinds of categories join one big cluster because of their obvious closeness, and let's concentrate on the 24*, 38* and 39* series within it.
These are categories dealing with agriculture, chemical industries and capital goods industries. The problem, however, is that they are basically described with the same words. The feature of a good system, mutual exclusiveness, is once again compromised because these most diverse themes are described with the same words. Look at a detail like categories 274 (Beverage Industries) and 295 (Special Clothing Industries), which are combined at a high level. Why are these different themes so close together? The answer is that they are defined with nearly the same words:
Read this graphic as follows: each category is represented by a bar, and the values are dichotomized. 11 tags are in common.
Graphic
But what are the practical consequences of our results? Does this mean that the effort of fifty years by Murdock and HRAF has been wasted? Fortunately not, though this has more to do with anthropological custom than the precision of the OCM.
What the OCM consists of is a large number of terms that anthropologists have found helpful for describing humans, their creations and their environments. It is a fairly comprehensive list of technical terms which are by and large well known to anthropologists, at least those inducted before 1985 or so. It works because these terms are already well known, and are likely to flag sections of interest and value to anthropological research, since the terms arose from anthropological research. What the OCM is not is a well ordered thesaurus of anthropological terminology. The categories that appear are for convenience, not classificatory efficacy - there is little that can be deduced because a term appears in a particular category.
The OCM is then, at best, a weak classification system, in the sense that although the individual terms themselves are useful for classifying material, and a great deal of anthropological experience and expertise is represented, little is gained by the overall structure within which these are organised. The OCM is a system (or outline) only in the sense that a heap or pile is a system.