Archive for January, 2009

Easily detecting if a block of text is written in English (non Unicode languages)

January 14th, 2009

I am writing an application that takes in text from various sources and I was having a problem with foreign language texts creeping into my data.

I needed a quick (and dirty) way to detect if a block of text was written in English or some other language like Japanese, Hebrew, or what-have-you.

I came up with this PHP function.

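function is_english($str, $cutoff = 0.9) {
	// Only sample the first $MAXLEN bytes to keep the check cheap.
	$MAXLEN = 512;

	// mb_strcut (mbstring extension) cuts by bytes but never splits a
	// multi-byte character, so the sample stays valid UTF-8.
	if (strlen($str) > $MAXLEN)
		$str = mb_strcut($str, 0, $MAXLEN, 'UTF-8');

	// With the u modifier "." matches one UTF-8 character; without it, one byte.
	$ulen = preg_match_all('#.#us', $str, $m);
	$len  = preg_match_all('#.#s', $str, $m);

	// Invalid UTF-8 makes the u-mode match return false; score that as 0.
	$percent = ($len == 0 || $ulen === false) ? 0 : $ulen / $len;

	// Pass $cutoff = -1 to get the raw ratio instead of a boolean.
	return $cutoff == -1 ? $percent : ($percent >= $cutoff);
}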

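This uses regular expressions to count the sample text twice: once as UTF-8 characters (the u modifier) and once as raw bytes. Plain ASCII text, which covers most English, uses one byte per character, so the ratio of characters to bytes is close to 1. Text in Japanese, Hebrew, and other multi-byte scripts scores much lower.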
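To make the ratio concrete, here is a minimal sketch of the same idea on a few short strings. The helper name char_byte_ratio and the sample strings are my own illustrations, not part of the function above, and the sketch assumes the file is saved as UTF-8.

```php
<?php
// Hypothetical helper for illustration: ratio of UTF-8 characters to bytes.
// Returns 0.0 for an empty string or for input that is not valid UTF-8.
function char_byte_ratio($str) {
    $bytes = strlen($str);
    if ($bytes === 0) {
        return 0.0;
    }
    // With the u modifier, "." matches one whole UTF-8 character.
    $chars = preg_match_all('#.#us', $str, $m);
    return $chars === false ? 0.0 : $chars / $bytes;
}

$english  = "The quick brown fox";   // 19 characters, 19 bytes
$japanese = "日本語";                 // 3 characters, 9 bytes
$french   = "Bonjour tout le monde"; // pure ASCII, so it looks "English"

var_dump(char_byte_ratio($english) >= 0.9);  // bool(true)
var_dump(char_byte_ratio($japanese) >= 0.9); // bool(false)
var_dump(char_byte_ratio($french) >= 0.9);   // bool(true)
```

The last case shows the main limitation of this kind of check: it really separates ASCII-heavy text from multi-byte text, so other languages written in the Latin alphabet can pass as English.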

This function is by no means complete. But if you are in need of a quick hack, this is it. Comments/improvements are welcome!

Proof of my geekery…

January 11th, 2009

So… I’m trying to write a natural language parser and was doing some research when I stumbled upon this forum thread.

I enjoyed it so much I thought I’d share it:

Jack

Hi there, just checking to see if anyone knows how to perform this
basic grammar function? I searched CPAN and found nothing that was
able to do so in Perl.

Charlton

That’s because it’s not a trivial problem.

Consider: “Time flies like an arrow. Fruit flies like a banana.”

In the second sentence, what is the subject?

Jack

Thanks. So my goal would be to capture what is the subject of the
sentence, Time, and Fruit, respectively. However you can't always take
the first noun that shows up as the subject, for instance: “Despite
accusations, lawyers find a way to evade difficult situations”

Lawyers would be the subject of the sentence. Does that make sense?
Essentially, identify the TOPIC/subject is what I want to do, for a
given sentence.

Jürgen

You just proved Charlton’s point:
In the second sentence the subject is not “fruit” but “fruit flies”

Natural language parsing is a _very_ difficult subject.

Charlton

And you just proved my actual point: Whether the subject is “fruit” or
“fruit flies” is indeterminate, and that’s where the humor in the
statement comes from.