Code snippets of possible use to translators of Turkic languages

Maintaining no more than a toe in the Altaic pool - though I've really been more interested in it lately - I've been reading Kazakh Grammar with Affix List (Karl A. Krippes, Dunwoody Press, Kensington MD, 1996), and at bedtime too, my sleepiness unchanged, I swear. Don't think I'll be going to Kazakhstan, even though the person who sold the book to me through amazon.com enclosed a handwritten note urging the country's beauties on me. Shoot, all I wanted was to compare Kazakh to Turkish. The verdict: "yes."

As in Yes, the next -stan I'm going to is Bulgaristan, whose language, by the way, sure is a surprise to those who take Russian as the Slavic paragon. As for Kazakh, I can guess this: a program to translate it would use subroutines similar to, but not identical to, those I wrote for my Turkish tutorial. Here, I sketch some of those subroutines. Borrow as you see fit. I ask only a citation in your comments. Python comments are inline and start with #, though multiliners can be managed here and there with """ on both ends. And if you aren't writing in Python, well, I admit it's lousy with XML but dynamic typing is Steak For Men, I don't care what anybody else says.

Dealing with Cyrillic: Finding it troublesome to handle Cyrillic characters inside a Python file - re.compile(something).search(something).start() usually gave funny numbers, and nothing good at all if the characters were converted to Unicode first, and anyway, it isn't convenient to type in Cyrillic - I decided to do for Kazakh what I did for Turkish, which is an upfront conversion of the encrypt to Latin letters. For Turkish, everything as received was dropped into lower case; then (for example) undotted i became capital i, soft g became capital g...you get the idea. For Cyrillic incoming, I propose what you see below: letters are turned into two-character equivalents, with obvious one-character equivalents bearing a trailing space to keep everything all even-like. Looks funny, and you are free to change 'em to suit you, but I think you'd end up doing it this way because it solves a lot of problems while creating few if any new ones. Cyrillic characters are two bytes apiece in UTF-8 and that's all there is to it. And the user will never see how you write your own code.
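
The "funny numbers" are most likely byte offsets rather than letter positions. Here is a minimal sketch of the effect, assuming the text arrives as UTF-8; it uses Python 3, where bytes and text are separate types, and the sample word is chosen only for illustration.

```python
import re

word = "қазақ"                 # five Cyrillic letters
raw = word.encode("utf-8")     # the same word as UTF-8 bytes

# Every letter in this word occupies two bytes in UTF-8.
char_count = len(word)
byte_count = len(raw)

# Searching the text string reports a character index...
char_pos = re.compile("з").search(word).start()
# ...while searching the raw bytes reports a byte offset, twice as large here.
byte_pos = re.compile("з".encode("utf-8")).search(raw).start()

print(char_count, byte_count, char_pos, byte_pos)
```

Because every letter here is two bytes wide, a two-character Latin cell per letter keeps positions in the converted string lined up with byte positions in the original.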

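#from www.machine-altaica.com/kaza.htm
# -*- coding: utf-8 -*-
#(Python 2 wants the coding line above on line 1 or 2 of the file before it will accept the Cyrillic literals below)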
frontUnroundedVowels='Ә','ә','Е','е','І','і'#use tuples 'cuz these won't ever change
#                         ae      e       i
frontUnroundedVowels_Latin='ae','ae','e ','e ','i ','i ',
frontRoundedVowels='Ө','ө','Ү','ү',
#                       oe      ue
frontRoundedVowels_Latin='oe','oe','ue','ue',
backUnroundedVowels='А','а','Ы','ы',
#                        a       ie
backUnroundedVowels_Latin='a ','a ','ie','ie',
backRoundedVowels='О','о','Ұ','ұ',
#                      o       u
backRoundedVowels_Latin='o ','o ','u ','u ',
frontVowels=frontUnroundedVowels+frontRoundedVowels+('И','и',)
frontVowels_Latin=frontUnroundedVowels_Latin+frontRoundedVowels_Latin+('ii','ii',)
#                                                         i
backVowels=backUnroundedVowels+backRoundedVowels+('Ё','ё','Ю','ю','Я','я',)
#                                                      yo      yu      ya
backVowels_Latin=backUnroundedVowels_Latin+backRoundedVowels_Latin+('yo','yo','yu','yu','ya','ya',)
unclassified='У','у','Й','й',#'cuz they're really consonants in Kazakh?
#                 wu      yi#no idea if these are apt representations
unclassified_Latin='wu','wu','yi','yi',
voicedConsonants='Б','б','В','в','Г','г','Ғ','ғ','Д','д','Ж','ж','З','з','Л','л','М','м','Н','н','Ң','ң','Р','р',
#                     b       v       g       gh      d       zh      z       l       m       n       ng      r
voicedConsonants_Latin='b ','b ','v ','v ','g ','g ','gh','gh','d ','d ','zh','zh','z ','z ','l ','l ','m ','m ','n ','n ','ng','ng','r ','r ',
unvoicedConsonants='К','к','Қ','қ','Ф','ф','Т','т','Х','х','Ц','ц','Ч','ч','Ш','ш','Һ','һ','П','п','С','с',
#                       k       q       f       t       kh      ts      ch      sh      h       p       s
unvoicedConsonants_Latin='k ','k ','q ','q ','f ','f ','t ','t ','kh','kh','ts','ts','ch','ch','sh','sh','h ','h ','p ','p ','s ','s ',

def toLatin(cyrLtr):
    #frontVowels and backVowels are used here (rather than their four component tuples) so that
    #И, Ё, Ю and Я get converted too
    kinds='frontVowels','backVowels','unclassified','voicedConsonants','unvoicedConsonants',
    for k in range(len(kinds)):
        thisTupleSrc=eval(kinds[k])
        thisTupleOut=eval(kinds[k]+"_Latin")
        for r in range(len(thisTupleSrc)):
            if  cyrLtr==thisTupleSrc[r]:
                return thisTupleOut[r]
    return cyrLtr#meaning it isn't a Cyrillic letter listed above

def convertCyrillicString(theStr):
    output=""
    x=0
    while x<len(theStr):
        #this MUST deal with ASCII characters...like spaces!
        twofer=theStr[x:x+2]
        if  ord(twofer[0])>207 and ord(twofer[0])<212:#it IS Cyrillic?
            output=output+toLatin(twofer)
            x=x+2
        else:
            output=output+twofer[0]+" "
            x=x+1
    return output
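
For comparison, here is how the same fixed-width idea might look on Unicode strings, where each Cyrillic letter is a single character and the byte-range test is unnecessary. This is only a sketch, not the method used on this page: the names CELLS and to_cells are made up for illustration, and just a handful of letters from the tables above are included.

```python
# Hypothetical lookup table: a few letters from the tuples above, lower case only.
CELLS = {
    "а": "a ", "з": "z ", "қ": "q ", "ғ": "gh", "р": "r ",
    "г": "g ", "к": "k ", "ә": "ae", "ң": "ng",
}

def to_cells(text):
    """Turn each character into a two-character cell."""
    out = []
    for ch in text.lower():
        if ch in CELLS:
            out.append(CELLS[ch])      # a Cyrillic letter we know about
        else:
            out.append(ch + " ")       # anything else (spaces, ASCII) is padded to two characters
    return "".join(out)

converted = to_cells("Газқағар")
print(converted)
```

The converted word comes out in the same cell format the later snippets expect, so a lexicon lookup can slice it two characters at a time.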

Finding an infix with mutable letters: Kazakh seems to be "floppier" than Turkish when it comes to these things. In Turkish, for example, the only plural infixes are lar and ler, and in my translator I accept either because they'd never be confused with anything else. (I gather that in Turkmen, there are "Iranicized" dialects which are less assiduous about vowel harmony and there, too, it might be useful for a translating program to accept either.) In Kazakh, and remember this one slim tome is my only source, it seems the canonical лар/лер can mutate to дар/дер or тар/тер, depending on the letter(s) immediately anterior.
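
To make the mutation concrete, here is a sketch that goes the other way and picks the plural ending for a given root. It assumes the rule as the standard grammars usually state it: л after vowels and after р, й, у; д after л, м, н, ң, ж, з; т after everything else (the unvoiced consonants plus б, в, г, д). The function name plural_suffix and the letter groupings are mine, loanword letters are ignored, and the results should be checked against Krippes or another reference before being relied on.

```python
BACK_VOWELS = "аоұы"
FRONT_VOWELS = "әеөүі"
L_TRIGGERS = "рйуиюя"       # these, plus any plain vowel, take л
D_TRIGGERS = "лмнңжз"       # these take д

def plural_suffix(root):
    """Pick лар/лер, дар/дер or тар/тер for a lower-case Cyrillic root."""
    last = root[-1]
    if last in BACK_VOWELS + FRONT_VOWELS + L_TRIGGERS:
        consonant = "л"
    elif last in D_TRIGGERS:
        consonant = "д"
    else:
        consonant = "т"     # unvoiced consonants, and also б в г д
    # The suffix vowel copies the frontness of the root's last unambiguous vowel.
    vowel = "а"             # assumption: default to back if none is found
    for ch in reversed(root):
        if ch in FRONT_VOWELS:
            vowel = "е"
            break
        if ch in BACK_VOWELS:
            break
    return consonant + vowel + "р"

examples = [r + plural_suffix(r) for r in ("бала", "қыз", "кітап", "күн")]
print(examples)
```

A recognizer has to run the same table backwards: given the ending it sees, confirm that the root's last letter and last vowel would have produced it.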

#from www.machine-altaica.com/kaza.htm
#This fn. will be called when some other fn. believes it has found a noun/verb/adjective root.
#(I.e., by comparing the whole word to entries in its noun, verb, and/or adjective glossaries.)
#This fn.'s job is take the word being translated and, starting from the point the word deviates
#from the glossary entry, determine if the deviation is a plural infix, or begins with
#a plural infix.
def isGrammaticalPlural(theWord,matchOutTo):
    if  len(theWord)<matchOutTo+6:#the plural infix is always 3 letters, i.e. 6 characters in the two-character scheme
        print "infix too short"
        return False
    thePutativeInfix=theWord[matchOutTo:matchOutTo+6]
    print thePutativeInfix
    if  thePutativeInfix[4:6]!='r ':#the plural infix always ends in "r"
        print "infix doesn't end in r"
        return False
    rootsLastVowel=getLastVowel(theWord[0:matchOutTo+2])
    if  rootsLastVowel=="":
        print ("NO vowels? Hey, maybe in a SLAVIC language...but in a TURKIC language?")
        return False
    candidate2ndLetters='a ','e ',#the only letters allowed in the second position
    type1=""
    for a in range(2):
        if  thePutativeInfix[2:4]==candidate2ndLetters[a]:
            type1=kindOfVowel(thePutativeInfix[2:4])
            break
    if  type1=="":
        print "2nd letter not a or e"
        return False
    candidate1stLetters='l ','d ','t ',#the only letters allowed in the first position
    type0=""
    for l in range(3):
        if  thePutativeInfix[0:2]==candidate1stLetters[l]:
            type0=kindOfConsonant(thePutativeInfix[0:2])
            break
    if  type0=="":
        print "1st letter not l, d, or t"
        return False
    natureOfRootsLastVowel=kindOfVowel(rootsLastVowel).split("|")
    if  type1.find(natureOfRootsLastVowel[0])!=0:#both need be front or back; roundedness is immaterial for this infix
        print "affix seems 'plural' but vowels are mismatched"
        return False
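    rootsLastLetter=theWord[matchOutTo-2:matchOutTo]
    natureOfTheC=kindOfConsonant(rootsLastLetter)
    #l follows a vowel, d follows a voiced consonant, t follows an unvoiced one: бала-лар, қыз-дар, кітап-тар
    #(still to be handled: roots ending in р, й or у take l, and roots ending in б, в, г or д take t)
    if  (natureOfTheC=="" and thePutativeInfix[0:2]=='l ') or \
        (natureOfTheC=="voiced" and thePutativeInfix[0:2]=='d ') or \
        (natureOfTheC=="unvoiced" and thePutativeInfix[0:2]=='t '):
        return True
    else:
        return False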
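
isGrammaticalPlural leans on three helpers that aren't shown on this page: getLastVowel, kindOfVowel and kindOfConsonant. What follows is a guess at their contracts, inferred only from how they are called above (kindOfVowel returns a string that is split on "|", kindOfConsonant returns "voiced", "unvoiced" or an empty string). The bodies are hypothetical stand-ins, and they classify only the letters that the Latin tables give an unambiguous class.

```python
# Hypothetical stand-ins for helpers that are called above but not shown on this page.
# Cells follow the two-character Latin scheme from the conversion tables.
VOWEL_KINDS = {
    "ae": "front|unrounded", "e ": "front|unrounded", "i ": "front|unrounded",
    "oe": "front|rounded",   "ue": "front|rounded",
    "a ": "back|unrounded",  "ie": "back|unrounded",
    "o ": "back|rounded",    "u ": "back|rounded",
}
VOICED = ("b ", "v ", "g ", "gh", "d ", "zh", "z ", "l ", "m ", "n ", "ng", "r ")
UNVOICED = ("k ", "q ", "f ", "t ", "kh", "ts", "ch", "sh", "h ", "p ", "s ")

def kindOfVowel(cell):
    """Return e.g. 'back|unrounded', or '' if the cell is not a known vowel."""
    return VOWEL_KINDS.get(cell, "")

def kindOfConsonant(cell):
    """Return 'voiced', 'unvoiced', or '' if the cell is not a known consonant."""
    if cell in VOICED:
        return "voiced"
    if cell in UNVOICED:
        return "unvoiced"
    return ""

def getLastVowel(cells):
    """Walk an even-length cell string backwards and return its last vowel cell."""
    for start in range(len(cells) - 2, -1, -2):
        cell = cells[start:start + 2]
        if cell in VOWEL_KINDS:
            return cell
    return ""

print(getLastVowel("q iez "), kindOfVowel("ie"), kindOfConsonant("z "))
```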

Testing alternate spellings of a word: I don't know whether this ever comes up in any of the Central Asian languages written in non-Russian Cyrillic (because I wonder just how much they are "written") but on some Turkish websites I see contributions from people who must have been stuck with Western keyboards and just did what they could. Every i with a dot, s's and c's all with shaven chins, etc. My translator has the ability to take, say, kucuk and permute all the letters until it gets to küçük, for which there is a match in the adjective lexicon. I am guessing here that some ardent Kazakh blogger, wanting to type газқағар, "gas mask," might settle for газкагар, swearing at his hardware's shortcomings but figuring his audience would understand what he meant. Which means translation software must be ready for it too.

#from www.machine-altaica.com/kaza.htm
#this fn. receives the number of alternate spellings tried (zero to begin with, of course)
#and generates a list of 0's and 1's corresponding to the "wobbly" letters in the encrypt.
#If there are 3 wobblies in a word, that means it can be spelled pow(2,3)=8 different ways.
#Those 8 different ways can be represented as follows:
#0:000
#1:001
#2:010
#3:011
#4:100
#5:101
#6:110
#7:111...every combination
def binList(altsTried,availables):
    global asTypedOrAsMeant
    total=0
    slot=availables-1
    if  altsTried==0:
        for a in range(availables):
            asTypedOrAsMeant.append(1)
    while slot>-1:
        if  pow(2,slot)+total<=altsTried:
            total=pow(2,slot)+total
            asTypedOrAsMeant[slot]=0
        else:
            asTypedOrAsMeant[slot]=1
        slot=slot-1
    return asTypedOrAsMeant
    #note asTypedOrAsMeant holds the binary digits of altsTried in reverse order and flipped (0 where the bit is set), but that's OK

#In Cyrillic, imagine one would type 'n ','k ','g ', instead of 'ng', 'q ','gh',
#This fn. will be called at the bottom of the loop that checks the encrypt against lexicons of
#nouns, verbs, etc.; if you reach the bottom of that loop, the encrypt must still be untranslated
#and we now try alternate spellings, on the suspicion it IS a translatable word only it's been
#mistyped. With each new spelling, reloop. This fn. thus returns a new spelling of the word.
#Example: receive "g a z k a g a r", return "g a z q a gha r" and all the other possibilities
def lazyTyping(encrypt):
    global asTypedOrAsMeant#a list of 0's and 1's
    global whereTheWobbliesAre#where in the encrypt the wobbly letters are
    global whatTheWobblyIs#the indices, in the wobblies tuple, corresponding to the letter typed. E.g., 'k ' will be 1, 'cuz it's wobblies[1]
    wobblies='n ','k ','g ',#instead of 'ng', 'q ','gh',
    reallies='ng','q ','gh',
    global AltSpellingsAvail
    global AltSpellingsTried
    if  AltSpellingsAvail==0:#1st visit here for this word
        for z in range(0,len(encrypt),2):
            for y in range(len(wobblies)):
                if  encrypt[z:z+2]==wobblies[y]:
                    whereTheWobbliesAre.append(z)
                    whatTheWobblyIs.append(y)
                    break
        #so for "g a z k a g a r", whereTheWobbliesAre=[0,6,10]
        AltSpellingsAvail=pow(2,len(whereTheWobbliesAre))
        AltSpellingsTried=0
    if  AltSpellingsAvail>1:#not 0...a word with none of the wobblies has NO alt spellings
        asTypedOrAsMeant=binList(AltSpellingsTried,len(whereTheWobbliesAre))
        for y in range(len(whereTheWobbliesAre)):#at slots 0, 6, and 10, trade wobblies for reallies, depending on what asTypedOrAsMeant says
            insert=wobblies[whatTheWobblyIs[y]]
            if  asTypedOrAsMeant[y]==1:
                insert=reallies[whatTheWobblyIs[y]]
            encrypt=encrypt[0:whereTheWobbliesAre[y]]+insert+encrypt[whereTheWobbliesAre[y]+2:]
        AltSpellingsTried=AltSpellingsTried+1
    if  AltSpellingsTried>AltSpellingsAvail:
        return "noluck"#a signal to the larger loop that none of the permuted spellings matches anything in the lexicons
        #Whatever decision is made regarding decryption (give up, ignore it, guess if it's a noun or verb, etc.),
        #will need to zero the globals at the top of this function
    return encrypt

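#a test
asTypedOrAsMeant=[]#the three lists must exist before lazyTyping is first called
whereTheWobbliesAre=[]
whatTheWobblyIs=[]
AltSpellingsAvail=0
AltSpellingsTried=0
trueSpelling="g a z q a gha r "
x=-1
gotIt=False
while x < AltSpellingsAvail:
    ans=lazyTyping("g a z k a g a r ")
    if  ans=="noluck":#every alternate spelling has been tried
        break
    print("spelling being tried now is "+ans)
    if  ans==trueSpelling:
        print("Bingo!")
        gotIt=True
    else:
        print("...nope...")
    #should output all the alternate spellings
    x=x+1
if  not gotIt:
    print("Couldn't find the word")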
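
If you would rather not carry globals between calls, the same enumeration can be sketched with itertools.product from the standard library. This is an alternative technique, not the method above: the function name alt_spellings is hypothetical, and the WOBBLIES and REALLIES tuples simply restate the ones inside lazyTyping.

```python
from itertools import product

WOBBLIES = ("n ", "k ", "g ")     # as typed
REALLIES = ("ng", "q ", "gh")     # as possibly meant

def alt_spellings(encrypt):
    """Yield every spelling obtained by keeping or swapping each wobbly cell."""
    cells = [encrypt[i:i + 2] for i in range(0, len(encrypt), 2)]
    choices = []
    for cell in cells:
        if cell in WOBBLIES:
            real = REALLIES[WOBBLIES.index(cell)]
            choices.append((cell, real))   # two options for this slot
        else:
            choices.append((cell,))        # only one option
    # product() walks every combination of one option per slot
    for combo in product(*choices):
        yield "".join(combo)

spellings = list(alt_spellings("g a z k a g a r "))
print(len(spellings))
print("g a z q a gha r " in spellings)
```

The caller loops over the generator and stops at the first spelling found in a lexicon; running off the end of the generator plays the role of "noluck".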