March 24, 2008 issue

Digital graffiti
By Brian Robinson, Special to Defense Systems
 Software seeks to read the writing on the wall — and elsewhere
Making timely sense of information contained in printed documents,
handwritten letters and even graffiti scrawled on a wall
can be of huge value to warfighters. Doing that with English
sources is hard enough, let alone with Arabic script.

The Defense Advanced Research Projects Agency is trying to
overcome those barriers with a new language technology program
called Multilingual Automatic Document Classification, Analysis
and Translation (MADCAT), whose goal is to develop ways to
automatically convert images of foreign-language text into English
transcripts.

Such a system would reduce the military's dependence on linguists
and analysts, who are now needed to help decide what information
is valuable and what is not. Often, the value of information
is drastically reduced by the time the experts arrive on the scene
and sort through it all.

But researchers face a number of significant technical
challenges, according to Prem Natarajan, the principal
MADCAT investigator at BBN Technologies,
which was recently awarded a $5.7 million DARPA
grant for work on the project.

"This is the first organized attempt to go after this
kind of hard-copy document processing," he said.
"It's similar to the problems associated with [optical
character recognition scanning], which works well for
English-language, well-structured documents but not
at all well for degraded, real-world documents."

BBN has recently shown that the kind of vocabulary
training that current OCR systems can be given
to recognize and translate English documents can also be
used with handwritten documents, and that it
can probably be applied to similar Arabic and
Chinese documents, he said.

However, a big problem is the variability of language and script
used by writers, he said.

"We're talking about handwritten messages here, of various orientations,
with certainly less-than-perfect lettering and spelling,"
said Howard Bender, chairman of Any Language Communications.
"The image software has to recognize individual letters so
they can be expressed in Unicode. If the image software can't do
it, no language analysis can be done."

DARPA is setting a goal of being able to accurately translate 90
to 95 percent of the content in 95 percent of the material scanned,
which is quite a high bar, Bender said.

On the other hand, Natarajan said, if the problems that DARPA
has set out are solved over the next four or five years, it will revolutionize
the field.

