Aim. To find phrases cited by a target, e.g. Bede, from a
source, e.g. Jerome.
Definitions. What constitutes a phrase? For immediate purposes,
we will search for two or more words, as long as one of the words is not a
demonstrative pronoun, conjunction, preposition, or interjection. These
words need not be next to each other, but for immediate purposes, they
will be within 10 words of one another.
- What about parenthetical phrases? If these interrupt two words
which are legitimate targets, then they have to be discounted from the
10-word limit. The question is then how to recognize such phrases.
- What about rare or unique terms that belong to one of the
above-mentioned categories? These might well indicate a source.
- What about variant spellings?
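The working definition above can be expressed as a small predicate. This is only an illustrative sketch in Python (the sourcer itself envisions C or Perl routines), and the closed-class word list here is a tiny sample, not a real Latin lexicon.

```python
# Sample closed-class words; a real list would be far longer.
CLOSED_CLASS = {
    "hic", "haec", "hoc",   # demonstrative pronouns (sample)
    "et", "sed", "aut",     # conjunctions (sample)
    "in", "ad", "de",       # prepositions (sample)
    "o", "heu",             # interjections (sample)
}

def is_candidate_phrase(words, positions, window=10):
    """True if the words at the given token positions satisfy the
    working definition: two or more words, within `window` tokens
    of one another, at least one of them open-class."""
    if len(words) < 2:
        return False
    if max(positions) - min(positions) > window:
        return False
    return any(w not in CLOSED_CLASS for w in words)
```

For example, "et ... homo" at positions 3 and 9 qualifies (homo is open-class), while "et ... in" does not.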
Method. The simplest way seems to be to treat each text, the
source and the target, as arrays of words. Rather than trying to create or
modify syntactic filters and thus to break texts into clauses and phrases,
a brute-force method will isolate a chunk of results which can later be
parsed. A series of refined searches will then operate on smaller and
smaller sets of results.
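The brute-force pass might look like the following sketch, written in Python for brevity (the document envisions C or Perl); the function name and the order-insensitive pairing are assumptions, not the sourcer's actual code.

```python
def find_shared_pairs(source_words, target_words, stopwords, window=10):
    """Brute-force sketch of the method above: collect every pair of
    words that co-occurs within `window` tokens in both texts,
    skipping pairs made up entirely of closed-class words."""
    def pairs(words):
        found = set()
        for i, w1 in enumerate(words):
            for j in range(i + 1, min(i + window + 1, len(words))):
                w2 = words[j]
                if w1 in stopwords and w2 in stopwords:
                    continue
                # Store order-insensitively, since the target may
                # rearrange the source's word order.
                found.add(tuple(sorted((w1, w2))))
        return found
    # Pairs present in both texts are the chunk of raw results that
    # the later, more refined passes would parse.
    return pairs(source_words) & pairs(target_words)
```

The output is deliberately noisy; winnowing it is the job of the subsequent refined searches.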
Design. The most important element so far has been the data
set. Variations in punctuation, spelling, and layout cause
innumerable problems that cannot be easily handled. This has been one
of the more difficult aspects of the sourcer so far. One solution is
to write scripts which seek out sentence and phrase boundaries and
mark them with SGML entities: for example, search for "ut" and
replace it with a phrase marker plus "ut". The problem, obviously,
arises at the other end of the phrase. Sentences can be marked rather
more easily, since modern editors use punctuation to indicate
periods. Rather than dealing with editorial conventions in a complex
series of subroutines, however, we have decided to deal with the text
as it would likely have been inscribed. Thus:
- remove all punctuation from source and target texts
- remove all double spaces
- lowercase everything
- configure source and target texts similarly
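The preparation steps just listed can be sketched in a single routine; this is a minimal Python illustration, not the sourcer's own code.

```python
import re

def normalize(text):
    """Prepare a text per the steps above: strip punctuation,
    collapse runs of whitespace, and lowercase, so that source and
    target are configured identically."""
    text = re.sub(r"[^\w\s]", "", text)  # remove all punctuation
    text = re.sub(r"\s+", " ", text)     # remove double spaces
    return text.strip().lower()          # lowercase everything
```

Running both texts through the same routine is what makes the later word-array comparisons meaningful.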
Given the speed of today's processors, we have decided not to try to
optimize the code. Thus, reading an entire text into an array is no
longer the concern it might once have been.
- What about sense-clusters which straddle sentences? For
example, "... homo .... Est" in the source may appear as "homo est" in the
target. This can be handled by ignoring punctuation: "homo ... est" will
then match "homo est."
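The claim above can be checked with a minimal sketch: once punctuation is stripped and case folded, a cluster that straddles a sentence boundary in the source lines up with the same words in the target. (The tokenizing helper here is illustrative; in practice the texts would be normalized once, up front.)

```python
import re

def tokens(text):
    """Strip punctuation, lowercase, and split into a word array,
    so that sentence boundaries no longer block a match."""
    return re.sub(r"[^\w\s]", " ", text).lower().split()

source = tokens("uidit homo. Est autem deus")
target = tokens("homo est deus")

# "homo ... Est" in the source now matches "homo est" in the target.
assert source[1:3] == target[0:2] == ["homo", "est"]
```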
- What about spelling variations? There seem to be two ways
to handle this easily. The first is to hard-code phonological variants
into the abstraction routine. Thus, any word with a front vowel may be
modified to reflect reduction or monophthongization, especially with i and
e and ie. Similar phonological considerations can be added. A second way
is to drop all vowels--to Hebraicize, as it were--and search for clusters
of consonants. Once the results are in, return the vowels in phonological
order (back vowels before front vowels, low vowels before high vowels, and
monophthongs before diphthongs). As the results are reduced, pick an
arbitrary point at which they can be displayed to the user for further
review.
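The second approach, dropping vowels to "Hebraicize" the words, might be sketched as follows; the vowel-restoration and phonological-ordering pass is left out, and the function names are illustrative only.

```python
VOWELS = set("aeiouy")

def skeleton(word):
    """'Hebraicize' a word: drop every vowel, keeping only the
    cluster of consonants."""
    return "".join(c for c in word.lower() if c not in VOWELS)

def skeleton_match(w1, w2):
    """Treat two words as potential spelling variants when their
    consonant skeletons agree, e.g. 'caelum' and 'celum' -> 'clm'."""
    return skeleton(w1) == skeleton(w2)
```

This deliberately over-matches (any two words with the same consonants collide); the step of returning the vowels in phonological order would then rank and filter the candidates.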
Elements. There are a number of essential components: obviously, the
two texts themselves; lists of closed-class words; and, finally,
linked scripts or a main script which calls C or Perl routines as
needed.