Code snippets of possible use to translators of Turkic languages
Maintaining no more than a toe in the Altaic pool - I've really been more interested in this lately - I've been reading Kazakh Grammar with Affix List (Karl A. Krippes, Dunwoody Press, Kensington MD, 1996) and at bedtime too, my sleepiness unchanged, I swear. Don't think I'll be going, even though the person who sold it to me through amazon.com enclosed a handwritten note urging Kazakhstan's beauties on me. Shoot, all I wanted was to compare Kazakh to Turkish. The verdict: "yes."
As in Yes, the next -stan I'm going to is Bulgaristan, whose language, by the way, sure is a surprise to those who take Russian as the Slavic paragon. As for Kazakh, I can guess this: a program to translate it would use subroutines similar to, but not identical to, those I wrote for my Turkish tutorial. Here, I sketch some of those subroutines. Borrow as you see fit. I ask only a citation in your comments. Python comments are inline and start with #, though multiliners can be managed here and there with """ on both ends. And if you aren't writing in Python, well, I admit it's lousy with XML but dynamic typing is Steak For Men, I don't care what anybody else says.
Dealing with Cyrillic:
Finding it troublesome to handle Cyrillic characters inside a Python file - re.compile(something).search(something).start() usually gave funny numbers, and nothing good at all if the characters were converted to Unicode first, and anyway, it isn't convenient to type in Cyrillic - I decided to do for Kazakh what I did for Turkish, which is an upfront conversion of the incoming text to Latin letters. For Turkish, everything as received was dropped into lower case; then (for example) undotted i became capital I, soft g became capital G...you get the idea. For Cyrillic incoming, I propose the following scheme: each letter is turned into a two-character equivalent, with the obvious one-character equivalents bearing a trailing space to keep everything all even-like. Looks funny, and you are free to change 'em to suit you, but I think you'd end up doing it this way because it solves a lot of problems while creating few if any new ones. Cyrillic characters are two-byte and that's all there is to it. And the user will never see how you write your own code.
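For what it's worth, here is a minimal sketch of both conversions in Python 3. The Turkish fold follows the rule described above (lower case first, then undotted i to I and soft g to G); carrying it on to ş, ç, ö and ü is an assumption. The Cyrillic table is a partial, made-up illustration, so swap in whatever codes suit you. All the names (TURKISH_FOLD, CYR_TO_LATIN, fold_turkish, to_latin_pairs, from_latin_pairs) are invented for this example.

```python
# Assumed extension of the Turkish rule: every non-ASCII lower-case
# letter becomes the capital of its plain cousin.
TURKISH_FOLD = str.maketrans(
    {"ı": "I", "ğ": "G", "ş": "S", "ç": "C", "ö": "O", "ü": "U"}
)


def fold_turkish(text):
    """Lower-case Turkish text, then fold special letters to ASCII capitals."""
    # Turkish casing: capital İ lowers to i, capital I lowers to ı.
    # Python's str.lower() does not know that, so handle both first.
    text = text.replace("İ", "i").replace("I", "ı").lower()
    return text.translate(TURKISH_FOLD)


# Hypothetical, partial table of two-character codes for Kazakh Cyrillic.
# One-character equivalents carry a trailing space so every code is 2 wide.
CYR_TO_LATIN = {
    "а": "a ", "ә": "ae", "б": "b ", "в": "v ", "г": "g ", "ғ": "gh",
    "д": "d ", "е": "e ", "ж": "zh", "з": "z ", "и": "i ", "й": "y ",
    "к": "k ", "қ": "q ", "л": "l ", "м": "m ", "н": "n ", "ң": "ng",
    "о": "o ", "ө": "oe", "п": "p ", "р": "r ", "с": "s ", "т": "t ",
    "у": "u ", "ұ": "uu", "ү": "ue", "ы": "I ", "і": "ii", "ш": "sh",
}
LATIN_TO_CYR = {code: letter for letter, code in CYR_TO_LATIN.items()}


def to_latin_pairs(text):
    """Encode text as fixed-width pairs; unknown characters get a pad space."""
    return "".join(CYR_TO_LATIN.get(ch, ch + " ") for ch in text.lower())


def from_latin_pairs(coded):
    """Decode fixed-width pairs back to Cyrillic, two characters at a time."""
    pairs = [coded[i:i + 2] for i in range(0, len(coded), 2)]
    # An unknown pair is a padded pass-through character: keep its first half.
    return "".join(LATIN_TO_CYR.get(pair, pair[0]) for pair in pairs)
```

With this table, to_latin_pairs("Қазақ") gives "q a z a q ", and an offset in the coded string divided by two is the position in the original word. One caveat of the sketch: a stray Latin letter in the input gets padded the same way as a mapped one, so decoding it is ambiguous.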
Finding an infix with mutable letters:
Kazakh seems to be "floppier" than Turkish when it comes to these things. In Turkish, for example,
the only plural infixes are lar and
ler, and in my translator I accept either because they'd never be confused with
anything else. (I gather that in Turkmen, there are "Iranicized" dialects which are
less assiduous about vowel harmony and there, too, it might be useful for a translating
program to accept either.) In Kazakh, and remember this one slim tome is my only source,
it seems the canonical лар/лер can mutate to дар/дер or тар/тер, depending on the letter(s) immediately preceding.
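A loose matcher for this might look like the sketch below. It assumes the six surface forms of the Kazakh plural are лар/лер, дар/дер and тар/тер, accepts any of them without checking harmony or the preceding letter (the same leniency I allow in Turkish), and leans on a stem lexicon to reject false hits. PLURAL, split_plural and the little lexicon are hypothetical names and data for illustration, and the sketch works on Cyrillic directly, which Python 3 handles without complaint.

```python
import re

# Any of the six assumed plural forms: consonant л/д/т, vowel а/е, then р.
PLURAL = re.compile(r"[лдт][ае]р")


def split_plural(word, lexicon):
    """Return (stem, plural, tail) for the first candidate plural whose
    stem is in the lexicon, or None if there is no such split."""
    # Start searching at index 1 because a stem needs at least one letter.
    for match in PLURAL.finditer(word, 1):
        stem = word[:match.start()]
        if stem in lexicon:
            return stem, match.group(), word[match.end():]
    return None


# Tiny illustrative lexicon: city, book, girl, school.
NOUNS = {"қала", "кітап", "қыз", "мектеп"}
```

Called as split_plural("қалаларда", NOUNS), this returns the stem қала, the plural лар and the trailing case ending да. A stricter version would also check that the consonant and vowel agree with the stem, at the cost of rejecting sloppy input.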
Testing alternate spellings of a word:
I don't know whether this ever comes up in any of the Central Asian languages written in non-Russian Cyrillic (because I wonder just how much they are "written") but on some Turkish websites I see contributions from people who must have been stuck with Western keyboards and just did what they could. Every i with a dot, s's and c's all with shaven chins, etc. My translator has the ability to take, say, kucuk and try each letter's accented alternative until it gets to küçük, for which there is a match in the adjective lexicon. I am guessing here that some ardent Kazakh blogger, wanting to type the Kazakh word for "gas mask," might settle for the nearest letters his keyboard offers, swearing at his hardware's shortcomings but figuring his audience would understand what he meant. Which means translation software must be ready for it too.
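The kucuk-to-küçük trick can be sketched roughly like this. It is not my translator's actual routine, just one plausible way to do it: every plain letter that has an accented Turkish cousin is allowed to be either one, and each resulting spelling is checked against a lexicon. ALTERNATES, spelling_variants and match_lexicon are names made up for the example, and it uses real Turkish letters instead of the capital-letter stand-ins described earlier.

```python
from itertools import product

# Each plain letter maps to the spellings it might stand for.
ALTERNATES = {
    "c": "cç", "g": "gğ", "i": "iı",
    "o": "oö", "s": "sş", "u": "uü",
}


def spelling_variants(word):
    """Yield every spelling obtained by optionally accenting each letter."""
    choices = [ALTERNATES.get(ch, ch) for ch in word]
    for combo in product(*choices):
        yield "".join(combo)


def match_lexicon(word, lexicon):
    """Return the variants of word that appear in the lexicon."""
    return [variant for variant in spelling_variants(word) if variant in lexicon]
```

For kucuk there are three ambiguous letters, so eight spellings get tried, and match_lexicon("kucuk", {"küçük", "büyük"}) returns ["küçük"]. The number of variants doubles with each ambiguous letter, so a long word may want a smarter search than brute force.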