Code snippets of possible use to translators of Turkic languages
Maintaining no more than a toe in the Altaic pool - I've really been more interested in
this lately -
I've been reading Kazakh Grammar with Affix List (Karl A. Krippes, Dunwoody Press,
Kensington MD, 1996) and at bedtime too, my sleepiness unchanged, I swear. Don't think
I'll be going, even though the person who sold it to me through amazon.com enclosed a
handwritten note urging Kazakhstan's beauties on me. Shoot, all I wanted was to compare Kazakh to
Turkish. The verdict: "yes."
As in Yes, the next -stan I'm going to is Bulgaristan, whose language, by the
way, sure is a surprise to those who take Russian as the Slavic paragon. As for Kazakh, I
can guess this: a program to translate it would use subroutines similar to, but not identical
to, those I wrote for my Turkish tutorial.
Here, I sketch some of those subroutines. Borrow as you see fit. I ask only a citation
in your comments. Python comments are inline and start with #, though multiliners
can be managed here and there with """ on both ends. And if you aren't writing
in Python, well, I admit it's lousy with XML but dynamic typing is Steak For Men, I don't care
what anybody else says.
Dealing with Cyrillic:
Finding it troublesome to handle Cyrillic characters inside a Python file -
re.compile(something).search(something).start() gave funny numbers usually,
and nothing good at all if the characters were Unicoded first, and anyway, it isn't convenient
to type in Cyrillic - I decided to do for Kazakh what I did for Turkish, which is an
upfront conversion of the encrypt to Latin letters. Everything as received was dropped
into lower case; then (for example) undotted i became capital i, soft g became capital g...you
get the idea. For Cyrillic incoming, I propose what you see below: letters are turned to
two-character equivalents, with obvious one-character equivalents bearing a trailing
space to keep everything all even-like. Looks funny, and you are free to change 'em
to suit you, but I think you'd end up doing it this way because it solves a lot of problems
while creating few if any new ones. Cyrillic characters are two bytes apiece in UTF-8 and that's all there is
to it. And the user will never see how you write your own code.
#from www.machine-altaica.com/kaza.htm
frontUnroundedVowels='Ә','ә','Е','е','І','і'#use tuples 'cuz these won't ever change
# ae e i
frontUnroundedVowels_Latin='ae','ae','e ','e ','i ','i ',
frontRoundedVowels='Ө','ө','Ү','ү',
# oe ue
frontRoundedVowels_Latin='oe','oe','ue','ue',
backUnroundedVowels='А','а','Ы','ы',
# a ie
backUnroundedVowels_Latin='a ','a ','ie','ie',
backRoundedVowels='О','о','Ұ','ұ',
# o u
backRoundedVowels_Latin='o ','o ','u ','u ',
frontVowels=frontUnroundedVowels+frontRoundedVowels+('И','и',)
frontVowels_Latin=frontUnroundedVowels_Latin+frontRoundedVowels_Latin+('ii','ii',)
# i
backVowels=backUnroundedVowels+backRoundedVowels+('Ё','ё','Ю','ю','Я','я',)
# yo yu ya
backVowels_Latin=backUnroundedVowels_Latin+backRoundedVowels_Latin+('yo','yo','yu','yu','ya','ya',)
unclassified='У','у','Й','й',#'cuz they're really consonants in Kazakh?
# wu yi#no idea if these are apt representations
unclassified_Latin='wu','wu','yi','yi',
voicedConsonants='Б','б','В','в','Г','г','Ғ','ғ','Д','д','Ж','ж','З','з','Л','л','М','м','Н','н','Ң','ң','Р','р',
# b v g gh d zh z l m n ng r
voicedConsonants_Latin='b ','b ','v ','v ','g ','g ','gh','gh','d ','d ','zh','zh','z ','z ','l ','l ','m ','m ','n ','n ','ng','ng','r ','r ',
unvoicedConsonants='К','к','Қ','қ','ф','Ф','Т','т','Х','х','Ц','ц','Ч','ч','Ш','ш','Һ','һ','П','п','С','с',
# k q f t kh ts ch sh h p s
unvoicedConsonants_Latin='k ','k ','q ','q ','f ','f ','t ','t ','kh','kh','ts','ts','ch','ch','sh','sh','h ','h ','p ','p ','s ','s ',
def toLatin(cyrLtr):
    #frontVowels and backVowels already include the four rounded/unrounded tuples,
    #plus И, Ё, Ю, and Я, so looking in them converts those letters too
    kinds='frontVowels','backVowels','unclassified','voicedConsonants','unvoicedConsonants',
    for k in range(len(kinds)):
        thisTupleSrc=eval(kinds[k])
        thisTupleOut=eval(kinds[k]+"_Latin")
        for r in range(len(thisTupleSrc)):
            if cyrLtr==thisTupleSrc[r]:
                return thisTupleOut[r]
    return cyrLtr#meaning it isn't Cyrillic
def convertCyrillicString(theStr):
    #expects a UTF-8 byte string (a Python 2 str), in which each Cyrillic letter
    #is two bytes and the first byte is 208-211 (0xD0-0xD3)
    output=""
    x=0
    while x<len(theStr):
        #this MUST deal with ASCII characters...like spaces!
        twofer=theStr[x:x+2]
        if ord(twofer[0])>207 and ord(twofer[0])<212:#it IS Cyrillic?
            output=output+toLatin(twofer)
            x=x+2
        else:
            output=output+twofer[0]+" "
            x=x+1
    return output
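If you are on Python 3, where every string is Unicode and a Cyrillic letter counts as one character rather than two bytes, the same fixed-width idea can be sketched with a dictionary lookup. Take it as a sketch under assumptions, nothing more: the names CYR_TO_LATIN and convert_cyrillic are made up for illustration, the table holds just enough letters for the example word, and lower-casing up front is borrowed from the Turkish approach described above.

```python
# Hypothetical Python 3 variant: str is Unicode, so one Cyrillic letter is one character.
# Only a handful of letters are listed; the rest would come from the tuples in this section.
CYR_TO_LATIN = {
    'ә': 'ae', 'е': 'e ', 'і': 'i ',
    'а': 'a ', 'ы': 'ie', 'о': 'o ', 'ұ': 'u ',
    'г': 'g ', 'ғ': 'gh', 'з': 'z ', 'р': 'r ',
    'к': 'k ', 'қ': 'q ', 'л': 'l ', 'т': 't ',
}

def convert_cyrillic(text):
    """Lower-case the text, then map every character to a two-character cell."""
    cells = []
    for ch in text.lower():
        # Anything not in the table (ASCII, spaces, digits) is padded to width two.
        cells.append(CYR_TO_LATIN.get(ch, ch + ' '))
    return ''.join(cells)

print(convert_cyrillic('газқағар'))  # prints "g a z q a gha r " (note the trailing space)
```

Because every cell is exactly two characters wide, letter N of the original always sits at index 2*N of the result, which is what the slicing in the later functions counts on.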
Finding an infix with mutable letters:
Kazakh seems to be "floppier" than Turkish when it comes to these things. In Turkish, for example,
the only plural infixes are lar and ler, and in my translator I accept either because they'd never
be confused with anything else. (I gather that in Turkmen, there are "Iranicized" dialects which are
less assiduous about vowel harmony and there, too, it might be useful for a translating
program to accept either.) In Kazakh, and remember this one slim tome is my only source,
it seems the canonical лар/лер can mutate to дар/дер or тар/тер, depending
on the letter(s) immediately anterior.
#from www.machine-altaica.com/kaza.htm
#This fn. will be called when some other fn. believes it has found a noun/verb/adjective root.
#(I.e., by comparing the whole word to entries in its noun, verb, and/or adjective glossaries.)
#This fn.'s job is take the word being translated and, starting from the point the word deviates
#from the glossary entry, determine if the deviation is a plural infix, or begins with
#a plural infix.
def isGrammaticalPlural(theWord,matchOutTo):
    if len(theWord)<len(theWord[0:matchOutTo])+6:#the plural infix is invariably 6 technical characters long
        print("infix too short")
        return False
    thePutativeInfix=theWord[matchOutTo:matchOutTo+6]
    print(thePutativeInfix)
    if thePutativeInfix[4:6]!='r ':#the plural infix always ends in "r"
        print("infix doesn't end in r")
        return False
    rootsLastVowel=getLastVowel(theWord[0:matchOutTo+2])
    if rootsLastVowel=="":
        print("NO vowels? Hey, maybe in a SLAVIC language...but in a TURKIC language?")
        return False
    candidate2ndLetters='a ','e ',#the only letters allowed in the second position
    type1=""
    for a in range(2):
        if thePutativeInfix[2:4]==candidate2ndLetters[a]:
            type1=kindOfVowel(thePutativeInfix[2:4])
            break
    if type1=="":
        print("2nd letter not a or e")
        return False
    candidate1stLetters='l ','d ','t ',#the only letters allowed in the first position
    type0=""
    for l in range(3):
        if thePutativeInfix[0:2]==candidate1stLetters[l]:
            type0=kindOfConsonant(thePutativeInfix[0:2])
            break
    if type0=="":
        print("1st letter not l, d, or t")
        return False
    natureOfRootsLastVowel=kindOfVowel(rootsLastVowel).split("|")
    if type1.find(natureOfRootsLastVowel[0])!=0:#both need be front or back; roundedness is immaterial for this infix
        print("affix seems 'plural' but vowels are mismatched")
        return False
    rootsLastLetter=theWord[matchOutTo-2:matchOutTo]
    natureOfTheC=kindOfConsonant(rootsLastLetter)
    #a simplification: a vowel-final root takes l, an unvoiced consonant takes t, a voiced consonant takes d
    #(root-final r, which takes l, is not handled here)
    if (natureOfTheC=="" and thePutativeInfix[0:2]=='l ') or \
       (natureOfTheC=="unvoiced" and thePutativeInfix[0:2]=='t ') or \
       (natureOfTheC=="voiced" and thePutativeInfix[0:2]=='d '):
        return True
    else:
        return False
Testing alternate spellings of a word:
I don't know whether this ever comes up in any of the Central Asian languages written in
non-Russian Cyrillic (because I wonder just how much they are "written") but on some Turkish
websites I see contributions from people who must have been stuck with Western keyboards and just
did what they could. Every i with a dot, s's and c's all with shaven chins, etc. My translator
has the ability to take, say, kucuk and permute all the letters until it gets to küçük,
for which there is a match in the adjective lexicon. I am guessing here that some ardent Kazakh blogger,
wanting to type газқағар, "gas mask," might settle for газкагар, swearing at his
hardware's shortcomings but figuring his audience would understand what he meant. Which means
translation software must be ready for it too.
#from www.machine-altaica.com/kaza.htm
#this fn. receives the number of alternate spellings tried (zero to begin with, of course)
#and generates a list of 0's and 1's corresponding to the "wobbly" letters in the encrypt.
#If there are 3 wobblies in a word, that means it can be spelled pow(2,3)=8 different ways.
#Those 8 different ways can be represented as follows:
#0:000
#1:001
#2:010
#3:011
#4:100
#5:101
#6:110
#7:111...every combination
def binList(altsTried,availables):
    global asTypedOrAsMeant
    total=0
    slot=availables-1
    if altsTried==0:
        for a in range(availables):
            asTypedOrAsMeant.append(1)
    while slot>-1:
        if pow(2,slot)+total<=altsTried:
            total=pow(2,slot)+total
            asTypedOrAsMeant[slot]=0
        else:
            asTypedOrAsMeant[slot]=1
        slot=slot-1
    return asTypedOrAsMeant
#note asTypedOrAsMeant will be the binary in reverse, but that's OK
#In Cyrillic, imagine one would type 'n ','k ','g ', instead of 'ng', 'q ','gh',
#This fn. will be called at the bottom of the loop that checks the encrypt against lexicons of
#nouns, verbs, etc.; if you reach the bottom of that loop, the encrypt must still be untranslated
#and we now try alternate spellings, on the suspicion it IS a translatable word only it's been
#mistyped. With each new spelling, reloop. This fn. thus returns a new spelling of the word.
#Example: receive "g a z k a g a r", return "g a z q a gha r" and all the other possibilities
def lazyTyping(encrypt):
    global asTypedOrAsMeant#a list of 0's and 1's
    global whereTheWobbliesAre#where in the encrypt the wobbly letters are
    global whatTheWobblyIs#the indices, in the wobblies tuple, corresponding to the letter typed. E.g., 'k ' will be 1, 'cuz it's wobblies[1]
    wobblies='n ','k ','g ',#instead of 'ng', 'q ','gh',
    reallies='ng','q ','gh',
    global AltSpellingsAvail
    global AltSpellingsTried
    if AltSpellingsAvail==0:#1st visit here for this word
        for z in range(0,len(encrypt),2):
            for y in range(len(wobblies)):
                if encrypt[z:z+2]==wobblies[y]:
                    whereTheWobbliesAre.append(z)
                    whatTheWobblyIs.append(y)
                    break
        #so for "g a z k a g a r", whereTheWobbliesAre=[0,6,10]
        AltSpellingsAvail=pow(2,len(whereTheWobbliesAre))
        AltSpellingsTried=0
    if AltSpellingsAvail>1:#not 0...a word with none of the wobblies has NO alt spellings
        asTypedOrAsMeant=binList(AltSpellingsTried,len(whereTheWobbliesAre))
        for y in range(len(whereTheWobbliesAre)):#at slots 0, 6, and 10, trade wobblies for reallies, depending on what asTypedOrAsMeant says
            insert=wobblies[whatTheWobblyIs[y]]
            if asTypedOrAsMeant[y]==1:
                insert=reallies[whatTheWobblyIs[y]]
            encrypt=encrypt[0:whereTheWobbliesAre[y]]+insert+encrypt[whereTheWobbliesAre[y]+2:]
    AltSpellingsTried=AltSpellingsTried+1
    if AltSpellingsTried>AltSpellingsAvail:
        return "noluck"#a signal to the larger loop that none of the permuted spellings matches anything in the lexicons
        #Whatever decision is made regarding decryption (give up, ignore it, guess if it's a noun or verb, etc.),
        #will need to zero the globals at the top of this function
    return encrypt
#a test
#the globals that lazyTyping and binList share must exist before the first call
asTypedOrAsMeant=[]
whereTheWobbliesAre=[]
whatTheWobblyIs=[]
AltSpellingsAvail=0
AltSpellingsTried=0
trueSpelling="g a z q a gha r "
x=-1
gotIt=False
while x < AltSpellingsAvail:
    ans=lazyTyping("g a z k a g a r ")
    print("spelling being tried now is "+ans)
    if ans==trueSpelling:
        print("Bingo!")
        gotIt=True
    else:
        print("...nope...")
    #should output all the alternate spellings
    x=x+1
if not gotIt:
    print("Couldn't find the word")
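The same enumeration can be had without globals. A sketch, assuming Python's itertools.product and two names made up for illustration (WOBBLY_TO_REAL and all_spellings): each two-character cell offers either one choice (as typed) or two (as typed, as meant), and the product of those choices is every possible spelling.

```python
from itertools import product

# Hypothetical, stateless take on the permutation idea; not the code my translator uses.
WOBBLY_TO_REAL = {'n ': 'ng', 'k ': 'q ', 'g ': 'gh'}

def all_spellings(encrypt):
    """Yield every spelling obtained by keeping or swapping each wobbly cell."""
    # Cut the string into its two-character cells.
    cells = [encrypt[i:i + 2] for i in range(0, len(encrypt), 2)]
    # A wobbly cell has two choices (as typed, as meant); any other cell has one.
    choices = [(c, WOBBLY_TO_REAL[c]) if c in WOBBLY_TO_REAL else (c,) for c in cells]
    for combo in product(*choices):
        yield ''.join(combo)

spellings = list(all_spellings('g a z k a g a r '))
print(len(spellings))                   # prints 8, i.e. pow(2,3)
print('g a z q a gha r ' in spellings)  # prints True
```

A generator like this can be dropped into the lexicon loop with a plain for statement, and since it keeps no state between words there is nothing to zero afterward.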