If you have any suggestions, please email me.

Aim. To find phrases cited by a target, e.g. Bede, from a source, e.g. Jerome.


Definitions. What constitutes a phrase? For immediate purposes, we will search for two or more words, provided that at least one of them is not a demonstrative pronoun, conjunction, preposition, or interjection. These words need not be next to each other, but for immediate purposes they will be within 10 words of one another. (A sketch of the word-class filter follows the list below.)

  • What about parenthetical phrases? If one of these interrupts two words which are legitimate targets, it should not count against the 10-word limit. The question is then how to recognize such phrases.
  • What about rare or unique terms that belong to one of the above-mentioned categories? These might well indicate a source.
  • What about variant spellings?
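A minimal sketch of the word-class filter, in perl; the closed-class list here is a hypothetical stub, standing in for the full lists called for under Elements below:

    use strict;
    use warnings;

    # Hypothetical, abbreviated closed-class list: demonstratives,
    # conjunctions, prepositions, interjections.
    my %closed = map { $_ => 1 }
        qw(hic haec hoc ille illa illud et ac atque sed ut ne
           in ad ab ex de cum per o);

    # A pair of words is worth pursuing only if at least one
    # member is open-class.
    sub is_candidate_pair {
        my ($w1, $w2) = @_;
        return !$closed{$w1} || !$closed{$w2};
    }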

Method. The simplest way seems to be to treat each text, source and target alike, as an array of words. Rather than trying to create or modify syntactic filters, and thus to break the texts into clauses and phrases, a brute-force method will isolate a chunk of results which can be parsed later. A series of refined searches will then operate on smaller and smaller data sets.
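A sketch of one such brute-force pass, assuming both texts have already been normalized (see Design below) and split into word arrays; the names and structure are illustrative only:

    use strict;
    use warnings;

    my $WINDOW = 10;    # from the definition above

    # Find every place in the target where $w1 and $w2 occur
    # within $WINDOW words of one another.
    sub proximity_hits {
        my ($w1, $w2, $target) = @_;    # $target: reference to the word array
        my @hits;
        for my $i (0 .. $#$target) {
            next unless $target->[$i] eq $w1;
            my $lo = $i - $WINDOW < 0         ? 0         : $i - $WINDOW;
            my $hi = $i + $WINDOW > $#$target ? $#$target : $i + $WINDOW;
            for my $j ($lo .. $hi) {
                push @hits, [$i, $j] if $j != $i && $target->[$j] eq $w2;
            }
        }
        return @hits;    # each hit: [position of $w1, position of $w2]
    }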


Design. The most important element so far has been the data set. Variations in punctuation, spelling, and layout cause innumerable problems that cannot be easily handled. Rather than dealing with editorial conventions in a complex series of subroutines, we have decided to deal with the text as it would likely have been inscribed. Thus:

  • remove all punctuation from source and target texts
  • remove all double spaces
  • lowercase everything
  • configure source and target texts identically (a normalization sketch follows this list)
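A minimal sketch of that normalization, as a single perl routine run over source and target alike:

    use strict;
    use warnings;

    # Configure a text roughly as it would have been inscribed:
    # no punctuation, no doubled spaces, no capitals.
    sub normalize {
        my ($text) = @_;
        $text = lc $text;               # lowercase everything
        $text =~ s/[[:punct:]]+/ /g;    # remove all punctuation
        $text =~ s/\s+/ /g;             # remove double spaces (and longer runs)
        $text =~ s/^\s+|\s+$//g;        # trim the ends
        return $text;
    }

    # E.g. normalize("... homo .... Est") returns "homo est".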
This preparation has been one of the more difficult aspects of the sourcer so far. One solution is to write scripts which seek out sentence and phrase boundaries and mark them with SGML entities. For example, search for "ut" and replace it with a phrase marker plus "ut", as sketched below. The problem, obviously, arises at the other end of the phrase. Sentences can be marked rather more easily, since modern editors use punctuation to signal where each sentence ends.
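A minimal sketch of the marking, to be run on the punctuated edition before the normalization above strips the punctuation it depends on; the entity names &phr; and &sent; are placeholders only:

    use strict;
    use warnings;

    # Mark phrase openings and sentence ends with SGML entities.
    # Only the opening of a phrase is marked; finding the other
    # end remains the open problem noted above.
    sub mark_boundaries {
        my ($text) = @_;
        $text =~ s/\but\b/&phr;ut/g;      # a phrase marker plus "ut"
        $text =~ s/([.!?])/$1&sent;/g;    # editorial punctuation marks sentence ends
        return $text;
    }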
  • What about sense-clusters which straddle sentences? For example, "... homo .... Est" in the source may appear as "homo est" in the target. This is handled by the normalization above, which ignores punctuation and case: "homo ... est" reduces to "homo est" and will match it.
Given the speed of today's processors, we have decided not to try to optimize the code. Thus, reading an entire text into an array is no longer the concern it might once have been.
  • What about spelling variations? There seem to be two ways to handle this easily. The first is to hard-code phonological variants into the abstraction routine. Thus, any word with a front vowel may be modified to reflect reduction or monophthongization, especially with i, e, and ie. Similar phonological considerations can be added. A second way is to drop all vowels--to Hebraicize, as it were--and search for clusters of consonants, as sketched below. Once the results are in, restore the vowels in phonological order (back vowels before front vowels, low vowels before high vowels, and monophthongs before diphthongs). As the results are reduced, pick an arbitrary point at which they can be displayed to the user for further editing.
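A minimal sketch of the Hebraicizing step; restoring the vowels in phonological order is left to a later pass:

    use strict;
    use warnings;

    # Drop the vowels and compare consonantal skeletons, so that
    # spelling variants collapse into a single searchable form.
    sub skeleton {
        my ($word) = @_;
        (my $s = $word) =~ s/[aeiouy]//g;
        return $s;
    }

    # "caelum", "celum", and "coelum" all reduce to "clm",
    # so any one spelling will hit the others.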

Elements. There are a number of essential components: obviously, the two texts; lists of closed-class words; and, finally, linked scripts or a main script which calls C or perl routines as needed.
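To make the shape concrete, a skeletal main script; it assumes the routines sketched above have been gathered into a hypothetical library file, sourcer_lib.pl:

    #!/usr/bin/perl
    use strict;
    use warnings;

    require "sourcer_lib.pl";    # normalize, is_candidate_pair, proximity_hits

    # Read a whole file at once; with today's processors and
    # memory this is unremarkable.
    sub slurp {
        my ($file) = @_;
        open my $fh, '<', $file or die "cannot open $file: $!";
        local $/;
        return scalar <$fh>;
    }

    my ($source_file, $target_file) = @ARGV;
    my @source = split ' ', normalize(slurp($source_file));
    my @target = split ' ', normalize(slurp($target_file));

    # First pass: every candidate pair within 10 words in the
    # source, checked for 10-word proximity in the target.
    for my $i (0 .. $#source) {
        for my $j ($i + 1 .. $i + 10) {
            last if $j > $#source;
            next unless is_candidate_pair($source[$i], $source[$j]);
            for my $hit (proximity_hits($source[$i], $source[$j], \@target)) {
                printf "%s + %s: source %d, target %d/%d\n",
                    $source[$i], $source[$j], $i, @$hit;
            }
        }
    }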