I am writing an application that takes in text from various sources and I was having a problem with foreign language texts creeping into my data.
I needed a quick (and dirty) way to detect if a block of text was written in English or some other language like Japanese, Hebrew, or what-have-you.
I came up with this PHP function.
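function is_english($str, $cutoff=.9) {
    // Only look at the first $MAXLEN bytes to keep the check fast.
    $MAXLEN = 512;
    if(strlen($str) > $MAXLEN)
        $str = substr($str, 0, $MAXLEN);
    // With the u flag, "." matches one UTF-8 character.
    $ulen = preg_match_all("#.#us", $str, $m);
    // Without it, "." matches one byte.
    $len = preg_match_all('#.#s', $str, $m);
    // Characters per byte: 1 for pure ASCII, lower for multibyte text.
    $percent = $len == 0 ? 0 : $ulen/$len;
    // A cutoff of -1 returns the raw ratio instead of true/false.
    return $cutoff == -1 ? $percent : ($percent >= $cutoff);
}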
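This uses two regular expressions to compare the number of UTF-8 characters in the sample text with the number of bytes. Plain ASCII text, which covers most English, uses one byte per character, so the ratio is 1. Text in scripts such as Tamil or Japanese uses several bytes per character, so the ratio drops.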
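To make the counting concrete: for valid UTF-8 input, the value returned with a cutoff of -1 is simply characters divided by bytes. Here is a minimal sketch of the same ratio using strlen and mb_strlen instead of the regexes. It assumes the mbstring extension is available, and char_byte_ratio is a hypothetical name used only for illustration.

```php
<?php
// Hypothetical helper: ratio of UTF-8 characters to bytes.
// Assumes $str is valid UTF-8 and the mbstring extension is loaded.
function char_byte_ratio($str) {
    $bytes = strlen($str);              // byte count, like the '#.#s' match
    $chars = mb_strlen($str, 'UTF-8');  // character count, like the '#.#us' match
    return $bytes == 0 ? 0 : $chars / $bytes;
}

var_dump(char_byte_ratio('hello'));  // 5 characters in 5 bytes
var_dump(char_byte_ratio('அதே'));    // 3 characters in 9 bytes
```

The closer the ratio is to 1, the more of the text is single-byte ASCII.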
This function is by no means complete. But, if you are in need of a quick hack, this is it. Comments/improvements are welcome!
To Use
is_english($text [ , $cutoff ])
$text: The sampling of text
Optional $cutoff: a float between 0 and 1 (default .9).
The ratio of characters to bytes must be at least this value for the text to be considered English.
Set to -1 to return the ratio instead of true/false.
$foreign = 'அதே போல முப்பெரும் கம்பெனிகளின் மேலதிகாரிகளும் லவட்டிச்';
$str = 'Normally, the search starts from the beginning of the subject string.';
$mixed = 'Normally, the search starts from the beginning of the subject string.'
.'அதே போல முப்பெரும் கம்பெனிகளின் மேலதிகாரிகளும் லவட்டிச் செல்லக் கூடாது';
var_dump(is_english($foreign));
var_dump(is_english($str));
var_dump(is_english($mixed));
echo "<BR>";
var_dump(is_english($foreign, -1));
var_dump(is_english($str, -1));
var_dump(is_english($mixed, -1));
Output:
bool(false)
bool(true)
bool(false)
float(0.35483870967742)
int(1)
float(0.52452830188679)
Drawbacks
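To speed the function up, it trims your text to the first 512 bytes using strlen and substr, which are not multibyte aware. The cut can land in the middle of a multibyte character, and the u match can then fail on the invalid UTF-8, so the text is reported as not English. An improvement would be to create another function that takes a random sampling of the text instead of the first 512 bytes.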
This doesn’t work as well with languages written mostly (or entirely) in single-byte ASCII characters. French or German text, for example, scores close to 1 and passes as English.
I’m sure there are more.
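As a rough sketch of the sampling idea, the helper below picks a random byte offset and uses mb_strcut, which snaps each cut back to a character boundary so the sample stays valid UTF-8. It assumes the mbstring extension is available and that the input is valid UTF-8. sample_text is a hypothetical name and is not part of the function above.

```php
<?php
// Hypothetical helper: return at most $maxlen bytes of $str, starting at a
// random offset, without splitting a multibyte character.
// Assumes $str is valid UTF-8 and the mbstring extension is loaded.
function sample_text($str, $maxlen = 512) {
    $bytes = strlen($str);
    if ($bytes <= $maxlen) {
        return $str;  // short enough, use all of it
    }
    $start = mt_rand(0, $bytes - $maxlen);  // random byte offset
    // mb_strcut() takes byte offsets but moves each cut back to the
    // nearest character boundary.
    return mb_strcut($str, $start, $maxlen, 'UTF-8');
}

$long = str_repeat('Plain English text. அதே போல ', 100);
$sample = sample_text($long);
var_dump(strlen($sample) <= 512);
var_dump(mb_check_encoding($sample, 'UTF-8'));
```

The sample could then be passed to is_english() in place of the raw text. A single random window is still only one slice, so averaging a few samples may give a steadier result.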
This is a neat script, and just what I’ve been looking for. I need to search a MySQL database and, depending on the language used in the search, run a different SQL query.
One question. In:
$ulen = preg_match_all("#.#us", $str, $m);
What is #.#us ? Is this a character class? I’m only familiar with character classes like [:alpha:] or [:alnum:]
Thanks!
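The # signs are the regex delimiters. The . between them is the pattern itself (match any single character), and the u and s after the closing # are modifiers.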
http://www.phpro.org/tutorials/Introduction-to-PHP-Regex.html is a great reference for regex in PHP.
From that tutorial:
u – Pattern is treated as UTF-8
s – Dotall – . class includes newline
Without u, the . matches one byte at a time; with u, it matches one whole UTF-8 character. By comparing the s and us match counts, we can tell how much of the string is made up of multibyte UTF-8 characters.
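A small sketch of what each modifier changes, using a two-byte character followed by a newline. The sample string is only an illustration, not something from the post.

```php
<?php
$s = "\xC3\xA9\n";  // "é" encoded as UTF-8 (2 bytes) plus a 1-byte newline

// No modifiers: "." matches one byte at a time and skips the newline.
var_dump(preg_match_all('#.#', $s, $m));    // int(2)

// s: "." also matches the newline, so every byte is counted.
var_dump(preg_match_all('#.#s', $s, $m));   // int(3)

// us: the subject is read as UTF-8, so "é" is counted once.
var_dump(preg_match_all('#.#us', $s, $m));  // int(2)
```

Here the s count is the number of bytes and the us count is the number of characters, which are the two numbers is_english() divides.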
Hi!
Thank you very much for that snippet. It was exactly what I was looking for!
Regards,
michel
That’s cool –
I wanted also to thank you for the great work on Facelift –
endofyourtrip.com uses facelift extensively with a combination of hundreds of fonts.
Cheers,
p.s: do you have facelift in a repository (e.g. git) so that changes/fixes can be pushed/suggested?