Easily detecting if a block of text is written in English (non Unicode languages)

January 14th, 2009 by Cory

I am writing an application that takes in text from various sources and I was having a problem with foreign language texts creeping into my data.

I needed a quick (and dirty) way to detect if a block of text was written in English or some other language like Japanese, Hebrew, or what-have-you.

I came up with this PHP function.

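function is_english($str, $cutoff = 0.9) {
	// Only sample the first $MAXLEN bytes to keep the regexes fast.
	$MAXLEN = 512;

	// strlen() and substr() count bytes, so the cut can split a multibyte character.
	if (strlen($str) > $MAXLEN)
		$str = substr($str, 0, $MAXLEN);

	// With the u modifier each match is one UTF-8 character; without it, one byte.
	$ulen = preg_match_all('#.#us', $str, $m);
	$len  = preg_match_all('#.#s', $str, $m);

	// Characters per byte: 1 for pure ASCII, lower as multibyte characters appear.
	$percent = $len == 0 ? 0 : $ulen / $len;
	return $cutoff == -1 ? $percent : ($percent >= $cutoff);
}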

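This uses regular expressions to count the sample text twice: once as UTF-8 characters (the u modifier) and once as raw bytes. ASCII text uses one byte per character, so the ratio of characters to bytes is 1 for plain English and drops as multibyte characters appear.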
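To make the two counts concrete, here is a minimal sketch. char_and_byte_counts is a hypothetical helper (not part of the function above) and the sample strings are my own. It assumes PHP 7 or later for the "\u{...}" escapes.

```php
<?php
// Hypothetical helper (not in the original function): returns the two counts
// that is_english() compares, as array(utf8_characters, bytes).
function char_and_byte_counts($str) {
    $chars = preg_match_all('#.#us', $str, $m); // u modifier: one match per UTF-8 character
    $bytes = preg_match_all('#.#s', $str, $m);  // no u modifier: one match per byte
    return array($chars, $bytes);
}

$samples = array(
    'abc',        // three ASCII letters
    "\u{00E9}",   // e with acute accent, two bytes in UTF-8
    "\u{0B85}",   // Tamil letter A, three bytes in UTF-8
);

foreach ($samples as $sample) {
    list($chars, $bytes) = char_and_byte_counts($sample);
    echo "$chars character(s), $bytes byte(s)\n";
}
// 3 character(s), 3 byte(s)
// 1 character(s), 2 byte(s)
// 1 character(s), 3 byte(s)
```

The further the first number falls below the second, the lower the score the text gets.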

This function is by no means complete.  But, if you are in need of a quick hack, this is it.  Comments/improvements  are welcome!

To Use

is_english($text [ , $cutoff ])
$text: The sampling of text
Optional $cutoff: Float between 0 and 1 (default 0.9).
   Text must score at least this ratio to be considered English.
   Set to -1 to return the ratio instead of true/false.

$foreign = 'அதே போல முப்பெரும் கம்பெனிகளின் மேலதிகாரிகளும் லவட்டிச்';
$str = 'Normally, the search starts from the beginning of the subject string.';
$mixed = 'Normally, the search starts from the beginning of the subject string.'
         .'அதே போல முப்பெரும் கம்பெனிகளின் மேலதிகாரிகளும் லவட்டிச் செல்லக் கூடாது';

var_dump(is_english($foreign));
var_dump(is_english($str));
var_dump(is_english($mixed));

echo "<BR>";

var_dump(is_english($foreign, -1));
var_dump(is_english($str, -1));
var_dump(is_english($mixed, -1));

Output:
bool(false)
bool(true)
bool(false)

float(0.35483870967742)
int(1)
float(0.52452830188679)

Drawbacks

To speed the function up, it trims your text to the first 512 bytes using strlen and substr, which are not multibyte aware. If the cut lands in the middle of a multibyte character, the sample is no longer valid UTF-8 and the u-mode match can fail. An improvement would be to create another function that takes a random sampling of the text instead of the first 512 bytes.
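A rough sketch of such a sampling function might look like this. sample_text is a hypothetical name, and sampling by characters rather than bytes is my own assumption. It splits on character boundaries with preg_split and the u modifier, so it never cuts a multibyte character in half.

```php
<?php
// Hypothetical helper (not part of is_english): returns a random run of up to
// $maxChars whole UTF-8 characters from $str.
function sample_text($str, $maxChars = 512) {
    // An empty pattern with the u modifier splits between UTF-8 characters.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // $str was not valid UTF-8
    }

    $total = count($chars);
    if ($total <= $maxChars) {
        return $str; // short enough already, use all of it
    }

    // Pick a random starting character so the sample is not always the beginning.
    $start = mt_rand(0, $total - $maxChars);
    return implode('', array_slice($chars, $start, $maxChars));
}
```

You would call it as is_english(sample_text($text)). Splitting the whole string into an array costs memory on very large inputs, so this trades some speed for a cleaner sample.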

This doesn’t work as well with languages written mostly (or entirely) in single-byte ASCII characters, such as other Latin-alphabet languages. They score close to 1 and pass as English.
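Here is a small sketch of that drawback. utf8_char_ratio is a hypothetical helper that mirrors the ratio is_english computes, minus the 512-byte cap, and the sample sentences are my own. It assumes PHP 7 or later for the "\u{...}" escapes.

```php
<?php
// Hypothetical helper: characters-per-byte ratio, like is_english($str, -1)
// but without the 512-byte cap.
function utf8_char_ratio($str) {
    $bytes = strlen($str);
    if ($bytes === 0) {
        return 0.0;
    }
    $chars = preg_match_all('#.#us', $str, $m);
    if ($chars === false) {
        return 0.0; // not valid UTF-8
    }
    return $chars / $bytes;
}

// French: "Ou est la bibliotheque ?" with u-grave and e-grave (two bytes each).
$french = "O\u{00F9} est la biblioth\u{00E8}que ?";
// German sentence that happens to be pure ASCII.
$german = 'Guten Morgen, wie geht es dir?';

var_dump(utf8_char_ratio($french) >= 0.9); // bool(true): 24 characters in 26 bytes
var_dump(utf8_char_ratio($german) >= 0.9); // bool(true): every character is one byte
```

Both sentences clear the default 0.9 cutoff, so the check would report them as English.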

I’m sure there are more.

5 comments

  1. Max says:

    This is a neat script, and what I’ve been looking for- I need to search a MySQL database, and depending upon the language used in the search, I use a different SQL query.

    One question. In:

    $ulen = preg_match_all("#.#us", $str, $m);

    What is #.#us ? Is this a character class? I’m only familiar with character classes like [:alpha:] or [:alnum:]

    Thanks!


  2. Cory says:

    The # signs are the regex delimiters.

    http://www.phpro.org/tutorials/Introduction-to-PHP-Regex.html is a great reference for regex in PHP.

    From that tutorial:

    u – Pattern is treated as UTF-8
    s – Dotall – . class includes newline

    By comparing the s and us match counts, we can estimate how much of the string is made up of multibyte UTF-8 characters.


  3. michel says:

    Hi!

    Thank you very much for that snippet. It was exactly what I was looking for!

    Regards,
    michel


  4. dan says:

    That’s cool –

    I wanted also to thank you for the great work on Facelift –

    endofyourtrip.com uses facelift extensively with a combination of hundreds of fonts.

    Cheers,

    p.s: do you have facelift in a repository (e.g. git) so that changes/fixes can be pushed/suggested?

