Easily detecting if a block of text is written in English (non-Unicode languages)

January 14th, 2009

I am writing an application that takes in text from various sources, and foreign-language text kept creeping into my data.

I needed a quick (and dirty) way to detect if a block of text was written in English or some other language like Japanese, Hebrew, or what-have-you.

I came up with this PHP function.

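function is_english($str, $cutoff = 0.9) {
	// Only sample the first $MAXLEN bytes to keep the check cheap.
	$MAXLEN = 512;

	if (strlen($str) > $MAXLEN) {
		// mb_strcut() (mbstring extension) cuts on a character boundary,
		// so the sample stays valid UTF-8; substr() could split a character.
		$str = mb_strcut($str, 0, $MAXLEN, 'UTF-8');
	}

	// With the "u" modifier "." matches one UTF-8 character; without it, one byte.
	$ulen = preg_match_all('#.#us', $str, $m);
	$len  = preg_match_all('#.#s', $str, $m);

	// preg_match_all() returns false on invalid UTF-8; treat that as a ratio of 0.
	$percent = ($len == 0 || $ulen === false) ? 0 : $ulen / $len;

	// Pass $cutoff = -1 to get the raw ratio instead of a boolean.
	return $cutoff == -1 ? $percent : ($percent >= $cutoff);
}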

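This uses two regular expressions to compare the number of UTF-8 characters in the sample with the number of bytes. With the u modifier, "." matches a whole character, even a multi-byte one; without it, "." matches a single byte. Plain ASCII text gives a ratio of 1, while text in scripts such as Japanese or Hebrew needs several bytes per character and falls well below the 0.9 cutoff.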
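To make the ratio concrete, here is a minimal sketch that computes the same characters-to-bytes value for a few sample strings. The helper name utf8_char_byte_ratio and the sample strings are my own assumptions for illustration, not part of the function above, and the sketch assumes PHP 7 or later with the source file saved as UTF-8.

```php
<?php
// Hypothetical helper (not from the original post): returns the ratio of
// UTF-8 characters to bytes, the value is_english() compares to $cutoff.
function utf8_char_byte_ratio(string $str): float {
	$bytes = strlen($str);                   // strlen() counts bytes
	$chars = preg_match_all('#.#us', $str);  // counts characters; false on invalid UTF-8

	if ($bytes === 0 || $chars === false) {
		return 0.0;
	}
	return $chars / $bytes;
}

// Sample strings are assumptions; the source file must be saved as UTF-8.
foreach (['Hello, world', 'こんにちは', 'שלום', 'café au lait'] as $sample) {
	printf("%s => %.2f\n", $sample, utf8_char_byte_ratio($sample));
}
```

The English sample is 12 characters in 12 bytes, the Japanese one 5 characters in 15 bytes, and the Hebrew one 4 characters in 8 bytes. The last sample is 12 characters in 13 bytes, so a single accented letter barely moves the ratio and it stays above the default cutoff of 0.9.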

This function is by no means complete: it only separates mostly-ASCII text from text in multi-byte scripts, so other Latin-alphabet languages will still pass as English. But if you need a quick hack, this is it. Comments and improvements are welcome!