I am writing an application that takes in text from various sources and I was having a problem with foreign language texts creeping into my data.
I needed a quick (and dirty) way to detect if a block of text was written in English or some other language like Japanese, Hebrew, or what-have-you.
I came up with this PHP function:
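function is_english($str, $cutoff = .9) {
    // Only sample the first $MAXLEN bytes of the text.
    $MAXLEN = 512;
    if (strlen($str) > $MAXLEN)
        // mb_strcut() (mbstring extension) cuts by bytes without splitting a
        // multi-byte character; a plain substr() could leave invalid UTF-8 behind.
        $str = mb_strcut($str, 0, $MAXLEN, 'UTF-8');
    // With the u modifier "." matches one UTF-8 character (false if the text is not valid UTF-8).
    $ulen = preg_match_all('#.#us', $str, $m);
    // Without it "." matches one byte.
    $len = preg_match_all('#.#s', $str, $m);
    // Pure ASCII gives 1; every multi-byte character pulls the ratio down.
    $percent = (!$len || !$ulen) ? 0 : $ulen / $len;
    // A cutoff of -1 returns the raw ratio instead of a boolean.
    return $cutoff == -1 ? $percent : ($percent >= $cutoff);
}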
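This takes advantage of regular expressions to count the sample text twice: with the u modifier the pattern matches one UTF-8 character at a time, and without it the pattern matches one byte at a time. Plain ASCII text gives the same count both ways, so the ratio is 1, while text made up of multi-byte characters (Japanese, Hebrew, and so on) gives a much lower ratio. Note that this really detects mostly-ASCII text rather than English as such, so other languages written in the Latin alphabet, such as Spanish or German, will usually pass too.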
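To make the ratio concrete, here is a rough sketch that computes it for a few sample strings. The helper name char_byte_ratio and the sample strings are made up for illustration and are not part of the function above. It assumes the file is saved as UTF-8 and the input is valid UTF-8.

```php
<?php
// Hypothetical helper for illustration: ratio of UTF-8 characters to bytes.
// Assumes $str is valid UTF-8; returns 0 for empty or invalid input.
function char_byte_ratio($str) {
    $bytes = strlen($str);                      // raw byte count
    $chars = preg_match_all('#.#us', $str, $m); // UTF-8 character count
    return ($bytes == 0 || !$chars) ? 0 : $chars / $bytes;
}

// Made-up samples; this source file must be saved as UTF-8.
$samples = array(
    'english'  => 'Hello, world', // 12 characters, 12 bytes
    'japanese' => 'こんにちは',     // 5 characters, 3 bytes each
    'hebrew'   => 'שלום',          // 4 characters, 2 bytes each
);

foreach ($samples as $name => $text) {
    printf("%-8s %.2f\n", $name, char_byte_ratio($text));
}
// Prints:
// english  1.00
// japanese 0.33
// hebrew   0.50
```

With a cutoff of .9, only the first sample would count as English.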
This function is by no means complete. But, if you are in need of a quick hack, this is it. Comments/improvements are welcome!