The Lojban Morphology algorithm - comments and workers wanted
- To: lojban-list
- Subject: The Lojban Morphology algorithm - comments and workers wanted
- From: lojbab (Bob LeChevalier)
- Date: Tue, 9 Jul 91 05:14 EDT
I started on this about a year ago, and have done minimal work on it.
Neither John Cowan nor I am quite pleased with the algorithm as defined,
though it does seem to properly define how to break a phoneme string
into words. Of course there seems to be no way to turn this into anything
like an LR(k) algorithm.
Thus I put this on the floor for our formalists and computer scientists to
tackle. Can anyone find a better, clearer, or more elegant way to handle
Lojban text algorithmically, or even to describe the morphology formally?
----
lojbab = Bob LeChevalier, President, The Logical Language Group, Inc.
2904 Beau Lane, Fairfax VA 22031-1303 USA
703-385-0273
lojbab@snark.thyrsus.com
Lojban Morphology Algorithm
Trial 2 - 8 July 1991
Assumption - Text string of transcribed phonemes and stress.
1. Because a pause is always a word break, process chunks of text that
end in a pause. Mark word breaks at each pause.
2. If an apostrophe occurs other than between two vowels, then flag an
error.
3. If any word contains an impermissible medial, then flag an error.
(Optionally, this step can be saved for last, which might allow some
amount of error correction.) For all consonant clusters, treat a
permissible initial as joined to the following vowel syllable. For all
other clusters, divide syllables between the consonants. Divide
syllables at a close-comma.
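Steps 1 and 2 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not part of the algorithm as posted: the representation of pauses is not fixed above, so the sketch assumes a "." marks each pause, that the input is a lower-case phoneme string, and that "y" counts as a vowel. The function names are hypothetical.

```python
VOWELS = set("aeiouy")  # assumption: "y" counts as a vowel for this check


def split_at_pauses(text, pause="."):
    """Step 1: return the non-empty chunks of text between pause marks."""
    return [chunk for chunk in text.split(pause) if chunk]


def apostrophe_errors(chunk):
    """Step 2: return the positions of apostrophes not flanked by two vowels."""
    errors = []
    for i, ch in enumerate(chunk):
        if ch != "'":
            continue
        before = chunk[i - 1] if i > 0 else ""
        after = chunk[i + 1] if i + 1 < len(chunk) else ""
        # An empty string (start or end of chunk) is never a vowel.
        if before not in VOWELS or after not in VOWELS:
            errors.append(i)
    return errors
```

For example, `split_at_pauses("coi.djan.")` yields the two chunks "coi" and "djan", and `apostrophe_errors("l'a")` reports the apostrophe at position 1.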
4. For each piece of pause-bounded text, case on the final letter before
the pause. If an error is found, terminate processing of the pause-
group. The group must either be within a "zoi" quote, or the text is in
error. If it is a quote, the entire group is part of the quote and there
is no need to attempt further lexing.
a. If the pause is immediately preceded by a consonant, a name has
been found (this should only occur at the very end of the pause-
group).
1) Seek backwards from the final consonant until finding
"lai", "la'V", "doi", or start of text.
2) If "la'V" is found for any V other than "i", flag a mal-
formed name and continue searching backwards from this point
per 1), as this may be a recoverable error.
3) If "lai", "la'i", "doi", or start of text are found, mark a
word break between them and the name. Identify the name. Also
place a word break before the marker and label the marker as a
cmavo. Recurse from 4. for any unprocessed text before the
marker, treating the inserted word break as a pause.
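The backward search in 4.a might look something like the Python sketch below. It is an illustration only and makes several assumptions not stated above: the chunk is a lower-case string with no stress marking, "V" in "la'V" means any of a, e, i, o, u, y, and scanning non-overlapping regular-expression matches from right to left is an acceptable stand-in for seeking backwards one phoneme at a time. The name `split_name` and the return format are hypothetical.

```python
import re

# Candidate markers from 4.a.1: "lai", "la'V", "doi".
MARKER = re.compile(r"lai|la'[aeiouy]|doi")


def split_name(chunk):
    """Split a consonant-final chunk into (text_before, marker, name, warnings).

    marker is "" when the search reaches the start of the text.
    """
    warnings = []
    for match in reversed(list(MARKER.finditer(chunk))):
        marker = match.group()
        if marker.startswith("la'") and marker != "la'i":
            # 4.a.2: flag the malformed name and keep searching backwards.
            warnings.append(f"malformed name: {marker!r} at {match.start()}")
            continue
        # 4.a.3: word breaks before the marker and between marker and name.
        return chunk[:match.start()], marker, chunk[match.end():], warnings
    return "", "", chunk, warnings  # start of text reached
```

The first element of the result is the unprocessed text that 4.a.3 says to recurse on.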
b. If the pause is immediately preceded by "y":
1) If the "y" is preceded by a vowel, mark an error.
2) If the "y" is alone, mark a ".y." cmavo.
3) If the "y" is preceded by an apostrophe, then there is a
vowel before the apostrophe. Place a word break before the
vowel. Mark the "V'y" as a lerfu.
4) If the "y" is preceded by a consonant, place a word break
before the consonant, and mark the "Cy" as a lerfu.
5) Recurse from 4. for any remaining unprocessed text before
inserted word breaks, treating the inserted word break as a
pause.
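As a rough illustration of case 4.b, the following Python sketch splits the final word off a chunk that ends in "y". It assumes a lower-case chunk with no stress marking and treats any character that is neither a vowel nor an apostrophe as a consonant; the function name and the (remaining, word, kind) return format are hypothetical.

```python
VOWELS = set("aeiouy")


def split_final_y(chunk):
    """Step 4.b: return (remaining_text, word, kind) for a chunk ending in 'y'.

    Raises ValueError where the outline says to mark an error.
    """
    if not chunk.endswith("y"):
        raise ValueError("chunk does not end in 'y'")
    if chunk == "y":
        return "", "y", "cmavo"              # 4.b.2: the ".y." cmavo
    prev = chunk[-2]
    if prev in VOWELS:
        raise ValueError("vowel directly before final 'y'")   # 4.b.1
    if prev == "'":
        # 4.b.3: step 2 should already guarantee a vowel before the apostrophe.
        if len(chunk) < 3 or chunk[-3] not in VOWELS:
            raise ValueError("apostrophe not preceded by a vowel")
        return chunk[:-3], chunk[-3:], "lerfu"
    return chunk[:-2], chunk[-2:], "lerfu"   # 4.b.4: "Cy" lerfu
```

The remaining text returned here is what 4.b.5 would pass back into step 4.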
c. If the pause is preceded by a vowel other than "y":
1) If no stressed syllable exists in the text, then:
a) If any consonant pair is found within the text, mark
an error.
b) Mark a word break before each consonant.
1] For each word broken off, if the ending vowel is
a "y", then mark an error if the phoneme before the
"y" is a vowel. Otherwise mark the word as a lerfu.
2] If the ending vowel is other than a "y", and is
preceded by another vowel, ensure a valid diphthong
is formed; mark an error if not. Mark a valid word
as a cmavo.
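The unstressed case 4.c.1) could be sketched in Python as below. This is a hedged illustration: the outline does not list the valid diphthongs, so the set used here (the fourteen diphthongs usually given for Lojban cmavo) is an assumption, as are the lower-case input and the hypothetical function name. The sketch follows 4.c.1)b)1] literally, so a lone "y" comes out as a lerfu here.

```python
VOWELS = set("aeiouy")
# Assumption: diphthongs taken as valid; not listed in the outline above.
DIPHTHONGS = set("ai au ei oi ia ie ii io iu ua ue ui uo uu".split())


def is_consonant(ch):
    return ch not in VOWELS and ch != "'"


def split_unstressed(text):
    """Step 4.c.1: split vowel-final, unstressed text into (word, kind) pairs."""
    if not text or text[-1] not in VOWELS:
        raise ValueError("text must end in a vowel")
    for a, b in zip(text, text[1:]):
        if is_consonant(a) and is_consonant(b):
            raise ValueError(f"consonant pair {a + b!r}")      # 4.c.1)a)
    words, current = [], ""
    for ch in text:                                            # 4.c.1)b)
        if is_consonant(ch) and current:
            words.append(current)
            current = ""
        current += ch
    words.append(current)
    result = []
    for word in words:
        prev = word[-2] if len(word) >= 2 else ""
        if word.endswith("y"):
            if prev in VOWELS:
                raise ValueError(f"vowel before 'y' in {word!r}")
            result.append((word, "lerfu"))
        else:
            if prev in VOWELS and word[-2:] not in DIPHTHONGS:
                raise ValueError(f"invalid diphthong in {word!r}")
            result.append((word, "cmavo"))
    return result
```

With these assumptions, "lenuby" divides into the cmavo "le" and "nu" followed by the lerfu "by".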
2) If at least one stressed syllable is found, take the first
such syllable as a starting point.
a) Examine the vowel of the following syllable, treating
a diphthong as a single vowel.
b) If there is no following syllable, then word break
before the stressed syllable and following syllables.
1] If the stressed syllable begins with a consonant
cluster, then mark an error.
2] Otherwise, the text is a string of cmavo.
Analyze and word divide per 4.c.1)b).
c) If the following syllable contains the FIRST half of a
"V'V", either the text to this point is a string of cmavo
or the stress is a secondary stress. Determine which by
searching for a consonant cluster or "CyC" string in the
text preceding the "V'V".
1] If neither is found, the text up to and including
the stressed syllable is a string of cmavo. Mark a
word break after the vowel of the stressed syllable
and analyze the preceding text per 4.c.1)b).
2] If a consonant cluster or "CyC" string is found, the stress is a
secondary stress. Change the text to unstressed, and
repeat from 4.c.2) for the next stressed syllable if
there is one. If there is none, mark an error.
d) If the following syllable vowel is not a "y", word
break after that vowel.
e) If the following syllable contains a "y", then check
the following syllable to see if it is the FIRST half of a
"V'V". If so, then process per c) for a cmavo string or
secondary stress. If not, then word break after that
following syllable.
f) For a candidate word containing a stressed syllable
and following syllables:
1] If it is less than 5 characters long, then:
a] If there is a consonant cluster, then mark
an error.
b] If there is no consonant cluster, then break
up per 4.c.1)b).
2] Ignoring apostrophes in the count, if there is no
consonant cluster or "CyC" in the first 5 characters,
then word break before the first non-initial
consonant. The preceding will be either a lerfu (if
the vowel is a "y") or a cmavo (otherwise). Recurse
on the remaining text starting at 4.c.2)f).
3] If the word is 5 letters long and of the form
CCVCV, with a permissible initial for the consonant
pair, or of the form CVCCV, it is a gismu.
Otherwise, mark a 5-letter word as an error.
4] If a greater than 5 letter word is found, perform
a "Tosmabru" test to see if an initial cmavo form
word can fall off. If so, mark the falling off word
as a cmavo and recurse on the remaining text starting
at 4.c.2)f).
5] Attempt to break up the word into rafsi by the
lujvo analysis algorithm. If it breaks up, it is a
lujvo. Otherwise it is a le'avla.
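The gismu shape test in 4.c.2)f)3] is one of the few steps simple enough to show in full, so here is a minimal Python sketch of it. Several things are assumptions rather than part of the outline: the list of 48 permissible initial pairs is the one usually given for Lojban, "V" is taken to mean a, e, i, o, u only, stress is ignored by lower-casing the word, and the medial pair in CVCCV is not checked here because step 3 is assumed to have done so. The name `is_gismu_shape` is hypothetical.

```python
import re

# Assumption: the standard list of permissible initial consonant pairs.
PERMISSIBLE_INITIALS = set(
    "bl br cf ck cl cm cn cp cr ct dj dr dz fl fr gl gr "
    "jb jd jg jm jv kl kr ml mr pl pr sf sk sl sm sn sp sr st "
    "tc tr ts vl vr xl xr zb zd zg zm zv".split()
)
CONS = "bcdfgjklmnprstvxz"
CCVCV = re.compile(rf"[{CONS}]{{2}}[aeiou][{CONS}][aeiou]")
CVCCV = re.compile(rf"[{CONS}][aeiou][{CONS}]{{2}}[aeiou]")


def is_gismu_shape(word):
    """Return True if a 5-letter word has the CCVCV or CVCCV gismu form."""
    word = word.lower()  # assumption: capitals only mark stress
    if len(word) != 5:
        return False
    if CVCCV.fullmatch(word):
        return True
    # CCVCV additionally needs a permissible initial pair.
    return bool(CCVCV.fullmatch(word)) and word[:2] in PERMISSIBLE_INITIALS
```

Under these assumptions "klama" and "gismu" pass, while "ktama" fails because "kt" is not a permissible initial.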