Lingua::DetectCyrillic. Detection of 7 Cyrillic codings and 2 languages
Lingua::DetectCyrillic. The package detects 7 Cyrillic codings as well as the language - Russian or Ukrainian. Uses embedded frequency dictionaries; usually one word is enough for correct detection.
    use Lingua::DetectCyrillic;
    # or, if you need the translation functions:
    use Lingua::DetectCyrillic qw( &TranslateCyr &toLowerCyr &toUpperCyr );

    # Create a new Lingua::DetectCyrillic object. By default, no more than
    # 100 Cyrillic tokens (words) are analyzed and Ukrainian is not detected.
    $CyrDetector = Lingua::DetectCyrillic->new();

    # The same, but analyze up to 200 tokens and detect both Russian and
    # Ukrainian.
    $CyrDetector = Lingua::DetectCyrillic->new( MaxTokens => 200, DetectAllLang => 1 );

    # Detect coding and language
    my ($Coding, $Language, $CharsProcessed, $Algorithm) = $CyrDetector->Detect( @Data );

    # Write report
    $CyrDetector->LogWrite();              # write to STDOUT
    $CyrDetector->LogWrite('report.log');  # write to file

    # Convert to lowercase, assuming the source coding is windows-1251
    $s = toLowerCyr($String, 'win');
    # Convert to uppercase, assuming the source coding is windows-1251
    $s = toUpperCyr($String, 'win');
    # Convert from one coding to another.
    # Acceptable coding definitions are win, koi, koi8u, mac, iso, dos, utf
    $s = TranslateCyr('win', 'koi', $String);
See Additional information on usage of this package.
This package automatically detects all Cyrillic codings in current use - windows-1251, koi8-r, koi8-u, iso-8859-5, utf-8, cp866, x-mac-cyrillic - as well as the language, Russian or Ukrainian. It applies 3 detection algorithms: formal analysis of alphabet hits, frequency analysis of words and frequency analysis of 2-letter combinations.
It also provides routines for converting Cyrillic text between codings; these can be imported if necessary.
The package can detect the coding from only one or two words. Naturally, reliability is lower with a single word, especially if the test words are written entirely in lower- or uppercase, since capitalization is a very important attribute for coding detection. Nevertheless, the package correctly recognizes the coding of a message containing a single word, even all lowercase - 'privet' ('hello' in Russian), 'ivan', 'vodka', 'sputnik'. ;-)))
Ukrainian is reported only if the text contains letters specific to Ukrainian.
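The rule can be illustrated with a minimal sketch that checks an already decoded string for the letters used in Ukrainian but not in Russian. The helper name `guess_language` and the labels 'Ukr' and 'Rus' are assumptions made for this example; they are not taken from the package.

```perl
use strict;
use warnings;

# Letters used in Ukrainian but absent from the Russian alphabet:
# U+0404/U+0454 (ye), U+0406/U+0456 (i), U+0407/U+0457 (yi), U+0490/U+0491 (ghe).
my $UKRAINIAN_ONLY = qr/[\x{404}\x{406}\x{407}\x{490}\x{454}\x{456}\x{457}\x{491}]/;

# Hypothetical helper: expects a decoded (Unicode) string, not raw bytes.
# The returned labels are illustrative, not the package's return values.
sub guess_language {
    my ($text) = @_;
    return $text =~ $UKRAINIAN_ONLY ? 'Ukr' : 'Rus';
}
```

A text consisting only of letters shared by both alphabets is therefore always labelled Russian, which matches the behaviour described above.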
Performance is good because the analysis runs in two stages: the first performs only a fast formal analysis of capitalization and alphabet hits, and only if this data is insufficient is the input analyzed a second time, against the frequency dictionaries.
So far the package requires Unicode::String and Unicode::Map8, which can be downloaded from http://www.cpan.org. See Additional information on packages to be installed.
I plan to implement my own support for character decoding, so these packages will not be required in future releases.
Warning! This module requires preliminary compilation with a C++ compiler. Under Unix this procedure goes smoothly and needs no comment, but under Win32 with ActiveState Perl you must change the line
    ch = PerlIO_getc(f);
to
    ch = getc(f);
In short, you need to replace the Perl wrapper for the C function getc with the function itself. The compiler produces warnings, but the result is a 100% working DLL.
    $CyrDetector = Lingua::DetectCyrillic->new();
    $CyrDetector = Lingua::DetectCyrillic->new( MaxTokens => 100, DetectAllLang => 1 );
MaxTokens - the package stops analyzing the input once the given number of Cyrillic tokens is reached. There is no need to analyze all 100 or 200 thousand bytes of input if the coding and the language can easily be determined from the first 100 tokens. If not specified, this argument defaults to 100.
DetectAllLang - by default the package assumes the Russian language only. Setting this parameter to any non-zero value enables analysis for two languages, Russian and Ukrainian. This slows performance by nearly 10% and may in rare cases result in worse coding detection.
my ($Coding,$Language,$CharsProcessed,$Algorithm)= $CyrDetector -> Detect( @Data );
    $CyrDetector->LogWrite();              # write to STDOUT
    $CyrDetector->LogWrite('report.log');  # write to file
If the single argument is omitted or equals stdout (in upper- or lowercase), the report is written to STDOUT; otherwise it is written to the named file.
When I started programming, I proceeded from an obvious fact: a human reader can easily determine the coding and language at a glance, or at least tell that the text is displayed in the wrong coding. The point is that the alphabets, i.e. the letter positions, of most Cyrillic codings do not coincide, so if we display text in the wrong coding we inevitably see messy characters inside words that cannot be typed in the standard way with a Russian or Ukrainian keyboard layout - currency signs, punctuation marks, Serbian letters, sometimes binary characters, etc.
In fact there is only one hard case: the two most popular Cyrillic codings - windows-1251 and koi8-r - have their alphabets in the same range, from 192 to 255, but the uppercase letters of windows-1251 occupy the codes of the lowercase letters of koi8-r and vice versa, so 'Ivan Petrov' in one of these codings will look like 'iVAN pETROV' in the other, i.e. it will have completely wrong capitalization, which is also easily determined by formal analysis of the characters. And as you may guess, any more or less coherent Cyrillic text must have at least one word starting with a capital letter (I don't take into consideration some weird Internet inhabitants WRITING ALL WITH CAPITAL LETTERS ;-).
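The capitalization swap is easy to reproduce. The sketch below uses the core Encode module rather than this package: it stores a name in windows-1251, reads the same bytes back as koi8-r and compares the upper/lowercase patterns. The helper `case_pattern` is a name invented for the example.

```perl
use strict;
use warnings;
use utf8;                       # this example's source text is UTF-8
use Encode qw(encode decode);

# Map every letter to 'U' (uppercase) or 'l' (lowercase); keep other characters.
sub case_pattern {
    my ($text) = @_;
    return join '', map { /\p{Lu}/ ? 'U' : /\p{Ll}/ ? 'l' : $_ } split //, $text;
}

my $name    = 'Иван Петров';
my $bytes   = encode('cp1251', $name);    # the text as stored in windows-1251
my $misread = decode('koi8-r', $bytes);   # the same bytes read as koi8-r

my $right = case_pattern($name);      # 'Ulll Ulllll'
my $wrong = case_pattern($misread);   # 'lUUU lUUUUU'
```

The misread letters themselves differ from the originals, since the two codings also order the alphabet differently; only the inverted case pattern matters for the formal analysis.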
So, at the first stage of analysis the program assumes in turn that the given text was written in each of the 6 or 7 Cyrillic codings and, for each, calculates statistics on alphabet hits and capitalization. This formal analysis is very fast and suits 99.9% of real texts. Wrong codings are easily filtered out and we get only one 'absolute winner'. This method is also reliable: I can hardly imagine a normal person writing in reverse capitalization. But what if we have only a few words and all of them are in upper- or lowercase?
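A minimal sketch of such a first-stage analysis follows. It is not the package's code: decoding is done with the core Encode module, only five single-byte codings are tried, and the scoring weights are arbitrary values chosen for the illustration. The function names `formal_score` and `best_coding` are likewise assumptions.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Candidate codings, under the names the core Encode module uses.
my @CODINGS = qw(cp1251 koi8-r iso-8859-5 cp866 MacCyrillic);

my $LETTER = qr/[А-Яа-яЁё]/;    # Russian alphabet only

# Score one assumption: "these bytes are written in $coding".
# The weights are arbitrary illustration values, not the package's.
sub formal_score {
    my ($bytes, $coding) = @_;
    my $text = decode($coding, $bytes);

    # Alphabet hits: non-ASCII characters that are (or are not) Russian letters.
    my @high   = grep { ord($_) > 127 } split //, $text;
    my $hits   = grep { /$LETTER/ } @high;
    my $misses = @high - $hits;

    # Capitalization: "Xxxx" is plausible; a lowercase letter followed by
    # an uppercase one ("xXXX") is not.
    my @words = $text =~ /((?:$LETTER){2,})/g;
    my $good  = grep { /^\p{Lu}\p{Ll}+$/ } @words;
    my $bad   = grep { /\p{Ll}\p{Lu}/ } @words;

    return $hits - $misses + 2 * $good - 2 * $bad;
}

# Try every candidate coding and keep the one with the highest score.
sub best_coding {
    my ($bytes) = @_;
    my %score = map { $_ => formal_score($bytes, $_) } @CODINGS;
    my ($best) = sort { $score{$b} <=> $score{$a} } @CODINGS;
    return $best;
}
```

For properly capitalized input such as `encode('koi8-r', 'Иван Петров')`, the wrong codings lose points either for characters outside the alphabet or for inverted capitalization, so a single candidate stands out.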
In this case we apply frequency analysis of words and 2-letter combinations, also called hashes (not in the Perl sense, certainly ;-).
The package has dictionaries of the 300 most frequent Russian and Ukrainian words and of nearly 600 most frequent Russian and Ukrainian 2-letter combinations, which I built myself (the input texts may not have been very typical of Internet authors, but any linguist can assure you this is not very important: the first few hundred most popular words in any language are very stable, to say nothing of letter combinations).
So the text is analyzed a second time (this shouldn't take long, as we get into this situation only with a very short text); all the Cyrillic letters are analyzed, regardless of their capitalization. If at least one dictionary word is found, the coding is determined from it; otherwise it is determined by comparing the letter hashes.
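The second stage can be sketched in the same spirit. The two dictionaries below are tiny stand-ins invented for the example (the real ones are embedded in the package and are not reproduced here), decoding again relies on the core Encode module, and `guess_by_frequency` is a hypothetical name.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Tiny stand-in dictionaries; the package's real ones hold about 300 words
# and nearly 600 letter pairs.
my %FREQ_WORD = map { $_ => 1 } qw(и в не на что привет);
my %FREQ_PAIR = map { $_ => 1 } qw(ст но то на ен ни пр ов ко ра);

my @CODINGS = qw(cp1251 koi8-r iso-8859-5 cp866 MacCyrillic);

# Returns the most plausible coding and the evidence used ('word' or 'pair'),
# or an empty list if neither dictionary matched under any coding.
sub guess_by_frequency {
    my ($bytes) = @_;
    my (%words, %pairs);
    for my $coding (@CODINGS) {
        # Capitalization is ignored at this stage, so fold to lowercase.
        my @tokens = lc(decode($coding, $bytes)) =~ /([а-яё]+)/g;
        $words{$coding} = grep { $FREQ_WORD{$_} } @tokens;
        $pairs{$coding} = 0;
        for my $token (@tokens) {
            for my $i (0 .. length($token) - 2) {
                $pairs{$coding}++ if $FREQ_PAIR{ substr($token, $i, 2) };
            }
        }
    }
    # A dictionary word outweighs any number of letter pairs.
    my ($best) = sort { $words{$b} <=> $words{$a} || $pairs{$b} <=> $pairs{$a} } @CODINGS;
    return unless $words{$best} || $pairs{$best};
    return ($best, $words{$best} ? 'word' : 'pair');
}
```

With these stand-in dictionaries, an all-lowercase 'привет' stored in koi8-r is resolved by a word hit, while 'спутник' falls through to the letter pairs.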
In some very rare cases (usually in a very artificial situation when we have only one short word written all in lower- or uppercase) the statistics on several codings are equal. In this case we prefer windows-1251 to mac, koi8-r to koi8-u and - if nothing helps - windows-1251 to koi8-r.
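The stated preferences can be written down as a small helper. This is a sketch of the rule as described, not the package's implementation; the name `break_tie` is an assumption, and the alphabetical fallback for any other remaining tie is an arbitrary choice made here.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a set of equally scored codings to one,
# applying the preferences in the order the text gives them.
sub break_tie {
    my %tied = map { $_ => 1 } @_;
    delete $tied{'x-mac-cyrillic'} if $tied{'windows-1251'};   # windows-1251 over mac
    delete $tied{'koi8-u'}         if $tied{'koi8-r'};         # koi8-r over koi8-u
    delete $tied{'koi8-r'}         if $tied{'windows-1251'};   # last resort
    my ($winner) = sort keys %tied;   # arbitrary but deterministic fallback
    return $winner;
}
```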
To find out which algorithm was applied, examine the 4th value returned by the function Detect, $Algorithm. A more detailed explanation is given in the table Algorithm codes explanation.
The supported codings are: windows-1251, koi8-r, koi8-u, iso-8859-5, utf-8, cp866, x-mac-cyrillic.
Algorithm codes explanation

Code | Meaning |
---|---|
11 | Formal analysis of quantity/capitalization of Cyrillic characters; only one alternative found |
21 | Formal analysis of quantity/capitalization of Cyrillic characters; two alternatives found (koi8-r and koi8-u); koi8-r chosen |
22 | Formal analysis of quantity/capitalization of Cyrillic characters; two alternatives found (win1251 and mac); win1251 chosen |
31 | At least one word from the dictionary found and there is only one alternative |
32 | At least one hash from the hash dictionary found and there is only one alternative |
33 | Formally win1251 defined (most probably on analysis of hash) |
34 | Formally koi8-r defined (most probably on analysis of hash) |
40 | Most probable results were chosen, but reliability is very low |
100 | No single Cyrillic character detected |
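A caller might turn the code returned in $Algorithm into a readable message with a lookup like the one below. The descriptions are condensed from the table; the helper names `describe_algorithm` and `is_unambiguous` are assumptions for this sketch and are not part of the package.

```perl
use strict;
use warnings;

# Descriptions condensed from the algorithm codes table.
my %ALGORITHM = (
    11  => 'formal analysis, only one alternative',
    21  => 'formal analysis, koi8-r chosen over koi8-u',
    22  => 'formal analysis, win1251 chosen over mac',
    31  => 'dictionary word found, only one alternative',
    32  => 'dictionary hash found, only one alternative',
    33  => 'win1251 defined formally (probably by hash analysis)',
    34  => 'koi8-r defined formally (probably by hash analysis)',
    40  => 'most probable result chosen, reliability very low',
    100 => 'no Cyrillic characters detected',
);

# Hypothetical helper: readable text for a code returned in $Algorithm.
sub describe_algorithm {
    my ($code) = @_;
    return $ALGORITHM{$code} // "unknown algorithm code $code";
}

# Hypothetical helper: true for the codes the table marks as having
# only one alternative.
sub is_unambiguous {
    my ($code) = @_;
    return scalar grep { $code == $_ } (11, 31, 32);
}
```

For example, `describe_algorithm($Algorithm)` could be logged next to the detected coding so that results with code 40 are reviewed by hand.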
December 01, 2002 - Extensive Russian documentation added. Version changed to 0.02.
November 19, 2002 - version 0.01 released.
1. Own Unicode support.
2. Option to detect only necessary codings from a list.
What else? Need your feedback!!
The author: Alexei Rudenko, Russia, Moscow. My home phone is (095) 468-95-63
Web-site: http://www.bible.ru/DetectCyrillic/
CPAN address: http://search.cpan.org/author/RUDENKO/
Email: rudenko@bible.ru
Copyright (c) 2002 Alexei Rudenko. All rights reserved.