Archive for December, 2006

Computational linguistics is an interdisciplinary field dealing with the statistical and logical modeling of natural language from a computational perspective. This modeling is not limited to any particular field of linguistics. In the past, computational linguistics was usually done by computer scientists who had specialized in applying computers to the processing of natural language. Recent research has shown that language is much more complex than previously thought, so computational linguistics teams are now often interdisciplinary and include trained linguists. Computational linguistics draws upon the involvement of linguists, computer scientists, experts in artificial intelligence, cognitive psychologists and logicians, amongst others.


Computational linguistics as a field predates artificial intelligence, a field under which it is often grouped. Computational linguistics originated with efforts in the United States in the 1950s to have computers automatically translate texts in foreign languages into English, particularly Russian scientific journals. Since computers had proven their ability to do arithmetic much faster and more accurately than humans, it was thought to be only a short matter of time before the technical details could be worked out that would give them the same remarkable capacity to process language.

When machine translation (also known as mechanical translation) failed to yield accurate translations right away, the problem was recognized as far more complex than had originally been assumed. Computational linguistics was born as the name of the new field of study devoted to developing algorithms and software for intelligently processing language data. When artificial intelligence came into existence in the 1960s, the field of computational linguistics became that sub-division of artificial intelligence dealing with human-level comprehension and production of natural languages.

In order to translate one language into another, it was observed that one had to understand the syntax of both languages, at least at the level of morphology (the syntax of words) and of whole sentences. In order to understand syntax, one also had to understand the semantics of the vocabulary, and even something of the pragmatics of how the language was being used. Thus, what started as an effort to translate between languages evolved into an entire discipline devoted to understanding how to represent and process individual natural languages using computers.


Computational linguistics can be divided into major areas depending upon the medium of the language being processed, whether spoken or textual; and upon the task being performed, whether analyzing language (parsing) or creating language (generation).

Speech recognition and speech synthesis deal with how spoken language can be understood or created using computers. Parsing and generation are sub-divisions of computational linguistics dealing respectively with taking language apart and putting it together. Machine translation remains the sub-division of computational linguistics dealing with having computers translate between languages.

Some of the areas of research that are studied by computational linguistics include:

  • Computer-aided corpus linguistics
  • Design of parsers for natural languages
  • Design of taggers like POS-taggers (part-of-speech taggers)
  • Definition of specialized logics like resource logics for NLP
  • Research in the relation between formal and natural languages in general
  • Machine Translation, e.g. by a translating computer
  • Computational complexity of natural language, largely modeled on automata theory, with the application of context-sensitive grammars and linearly bounded Turing machines.

The Association for Computational Linguistics defines computational linguistics as:

…the scientific study of language from a computational perspective. Computational linguists are interested in providing computational models of various kinds of linguistic phenomena.

From Wikipedia, the free encyclopedia



Translating Pens

I have found information about pens that help students learn a second language. These are called translating pens and they work this way: you scan a single word or a full line of printed text and the pen translates or defines single words. You can also see and hear the scanned word(s) read aloud. Moreover, they are multilingual, and alternative input tools are included for entering words that you can’t scan (e.g., street signs).

I think they are going to be very useful not only for students of a second language but also for travellers and maybe for translators. However, I imagine they are not yet well developed because, as we saw, machine translators are not either and present a lot of problems (most of them dealing with the context surrounding the words). Moreover, as the video I posted about Microsoft Speech “Recognition” showed, these Text-to-Speech technologies also present some problems. So I think these pens are going to be problematic in both areas.

They look like this:

Read Full Post »

This is a funny video that shows the problems speech recognition presents. This took place in August 2006.


Ontologies and NLP

Ontologies are formal, explicit specifications of how to represent the objects, concepts, and other entities in a particular system, as well as the relationships between them.

Natural-language processing (NLP) is an area of artificial intelligence research that attempts to reproduce the human interpretation of language. NLP methodologies and techniques assume that the patterns in grammar and the conceptual relationships between words in language can be articulated scientifically. The ultimate goal of NLP is to determine a system of symbols, relations, and conceptual information that can be used by computer logic to implement artificial language interpretation.

Natural-language processing has its roots in semiotics, the study of signs. Semiotics was developed by Charles Sanders Peirce (a logician and philosopher) and Ferdinand de Saussure (a linguist). Semiotics is broken up into three branches: syntax, semantics, and pragmatics.

A complete natural-language processor extracts meaning from language on at least seven levels. However, we’ll focus on the four main levels.

Morphological: A morpheme is the smallest part of a word that can carry a discrete meaning. Morphological analysis works with words at this level. Typically, a natural-language processor knows how to understand multiple forms of a word: its plural and singular, for example.

Syntactic: At this level, natural-language processors focus on structural information and relationships.

Semantic: Natural-language processors derive an absolute (dictionary definition) meaning from context.

Pragmatic: Natural-language processors derive knowledge from external commonsense information.
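As a rough illustration of the morphological level only, here is a small Python sketch that relates a plural word to its singular form. The function name split_plural and its suffix rules are assumptions made up for this example; a real morphological analyzer would rely on a lexicon and handle irregular forms.

```python
# Toy morphological analysis: split a word into a stem and a plural suffix.
# The suffix rules below are illustrative assumptions, not a real analyzer.

def split_plural(word):
    """Return (stem, suffix) for simple English plurals, else (word, "")."""
    lower = word.lower()
    # "families" -> "family" + "ies"
    if lower.endswith("ies") and len(lower) > 4:
        return lower[:-3] + "y", "ies"
    # "boxes" -> "box" + "es", "classes" -> "class" + "es"
    if lower.endswith(("ches", "shes", "sses", "xes")):
        return lower[:-2], "es"
    # "houses" -> "house" + "s", but leave words like "glass" alone
    if lower.endswith("s") and not lower.endswith("ss"):
        return lower[:-1], "s"
    return lower, ""


for w in ["houses", "families", "boxes", "glass"]:
    print(w, split_plural(w))
```

Irregular plurals such as "children" pass through unchanged, which is one reason real systems do not stop at suffix rules.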

Natural-language limitations

One of the major limitations of modern NLP is that most linguists approach NLP at the pragmatic level by gathering huge amounts of information into large knowledge bases that describe the world in its entirety. These academic knowledge repositories are defined in ontologies that take on a life of their own and never end up in practical, widespread use. There are various knowledge bases, some commercial and some academic. The largest and most ambitious is the Cyc Project. The Cyc Knowledge Server is a monstrous inference engine and knowledge base. Even natural-language modules that perform specific, limited, linguistic services aren’t financially feasible for use by the average developer.

In general, NLP faces the following challenges:

  • Physical limitations: The greatest challenge to NLP is representing a sentence or group of concepts with absolute precision. The realities of computer software and hardware limitation make this challenge nearly insurmountable. The realistic amount of data necessary to perform NLP at the human level requires a memory space and processing capacity that is beyond even the most powerful computer processors.
  • No unifying ontology: NLP suffers from the lack of a unifying ontology that addresses semantic as well as syntactic representation. The various competing ontologies serve only to slow the advancement of knowledge management.
  • No unifying semantic repository: NLP lacks an accessible and complete knowledge base that describes the world in the detail necessary for practical use. The most successful commercial knowledge bases are limited to licensed use and have little chance of wide adoption. Even those with the most academic intentions develop at an unacceptable pace.
  • Current information retrieval systems: The performance of most of the current information retrieval systems is affected by semantic overload. Web crawlers, limited by their method of indexing, more often than not return incorrect matches as a result of ambiguous interpretation.

Ontologies and solutions

The W3C’s Resource Description Framework (RDF) was developed to enable the automated processing of Web resources by providing a means of defining metadata about those resources. RDF addresses the physical limitation of memory space by allowing a natural-language processor to access resources in a distributed environment. A networked computer processor can access RDF models on various other processors in a standard way.

RDF provides a unifying ontological syntax for defining knowledge bases. RDF is commonly expressed in XML, a markup language designed to cleanly separate data formatting from data semantics. Because XML is extensible (the only restriction on authors is that documents be well-formed and valid), many categories of information can be expressed very clearly using XML and RDF.

RDF is by no means the perfect ontological syntax. For instance, there are five semantic principles: existence, coreference, relation, conjunction, and negation. RDF doesn’t inherently support conjunctions and negations. At its core, RDF allows users to define statements in a simple format about network resources.
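To make the idea of simple statements about resources more concrete, here is a minimal Python sketch that models each statement as a (subject, predicate, object) triple. It uses plain tuples instead of an RDF library, the example.org addresses and values are invented for illustration, and the match helper is a hypothetical name, not part of any RDF tool.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# The example.org URIs and the values are made up for illustration.
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
ALICE = "http://example.org/people/alice"  # hypothetical resource

triples = [
    ("http://example.org/post/1", DC_TITLE, "Ontologies and NLP"),
    ("http://example.org/post/1", DC_CREATOR, ALICE),
    ("http://example.org/post/2", DC_CREATOR, ALICE),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that match the given fields; None is a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Which resources were created by the hypothetical author?
authored = match(triples, predicate=DC_CREATOR, obj=ALICE)
print([s for s, _, _ in authored])
```

Note that this model can only assert that a statement holds. There is no built-in way to say that a statement is false, which matches the remark above about negation.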

This is taken from:


Read Full Post »

Merriam-Webster is America’s foremost publisher of language-related reference works. The company publishes a diverse array of print and electronic products, including Merriam-Webster’s Collegiate Dictionary, Eleventh Edition (America’s best-selling desk dictionary) and Webster’s Third New International Dictionary, Unabridged. These online dictionaries also give us the spoken pronunciation of each word.

I have used both the monolingual dictionary and the bilingual one (which translates a word or expression from Spanish into English), and I have found the monolingual one quite complete: it gives you synonyms and definitions, and you can also click on the words to listen to their pronunciation, which you cannot do in most online dictionaries. The bilingual dictionary, I think, is not really useful for us because it is clearly designed for learners of Spanish. Anyway, you enter a word in Spanish and the dictionary gives you synonyms of the word. However, I think it is not very accurate because it does not say in which context each one is used. Let’s see an example:

If we write the word casa, we get:


3 entries found for casa
Main Entry: casa
Function: feminine noun
Usage: Spanish word
1 : house, building
2 : HOGAR : home
3 : household, family
4 : company, firm
5 echar la casa por la ventana : to spare no expense.

All in all, for us, learners of English, the monolingual dictionary is a useful tool when writing, translating or even speaking English because we can listen to the pronunciation of the words and, what’s more, we can find some meanings that a bilingual dictionary forgets or even misinterprets.

Merriam-Webster webpage:


Read Full Post »

A device the size of a sugar cube will be able to record and store high resolution video footage of every second of a human life within two decades, experts said on Tuesday.

Researchers said governments and societies must urgently debate the implications of the huge increases in computing power and the growing mass of information being collected on individuals.

Some fear that the advent of “human black boxes” combined with the extension of medical, financial and other digital records will lead to loss of privacy and a dramatic expansion of the nanny state.

Others highlight positive advances in medicine, education, crime prevention and the way history will be recorded.

Leading computer scientists, psychologists and neuroscientists gathered to debate these issues at Memories for Life, a conference held at the British Library yesterday.

Prof Nigel Shadbolt, president of the British Computer Society and professor of artificial intelligence at the University of Southampton, said: “In 20 years’ time it will be possible to record high quality digital video of an entire lifetime of human memories. It’s not a question of whether it will happen; it’s already happening.”

A laptop available in the High Street can hold some 80 gigabytes (GB) of information. One hour of high resolution video footage requires 12GB.

Since the year 2000, computing processing power has been doubling approximately every 18 months – a phenomenon known as Moore’s Law.

Prof Shadbolt has calculated that it would take 5.5 petabytes (PB) to record every waking second of a person’s life in high resolution video.

One PB equals one million GB. Experts expect the increase in computing power to lead to advances in “ehealth” with doctors having access to information from devices that monitor physiological data such as heart rate and blood sugar levels.
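The storage figures quoted in the article can be checked with some quick arithmetic. The short Python sketch below uses only the numbers given above (12 GB per hour of video, 80 GB on a laptop, 5.5 petabytes for a lifetime, one million GB per petabyte). The 16 waking hours per day is an assumption of this sketch and is not stated in the article.

```python
# Figures taken from the article.
GB_PER_HOUR = 12          # one hour of high resolution video
LAPTOP_GB = 80            # capacity of a High Street laptop
LIFETIME_PB = 5.5         # Prof Shadbolt's estimate
GB_PER_PB = 1_000_000     # one petabyte in gigabytes, as the article defines it

# Assumption for this sketch only: a person is awake 16 hours a day.
WAKING_HOURS_PER_DAY = 16

laptop_hours = LAPTOP_GB / GB_PER_HOUR
lifetime_hours = LIFETIME_PB * GB_PER_PB / GB_PER_HOUR
lifetime_years = lifetime_hours / WAKING_HOURS_PER_DAY / 365.25

print(f"A laptop holds about {laptop_hours:.1f} hours of video")
print(f"5.5 petabytes hold about {lifetime_hours:,.0f} hours of video")
print(f"That is roughly {lifetime_years:.0f} years of waking life")
```

Under that assumption the estimate comes out at a little under 80 years, so the 5.5 petabyte figure is consistent with a full human lifetime.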

Retailers want to get more information on their customers’ habits than they already have from their loyalty cards. The technological advances will also have a dramatic impact on the writing of biographies and history, with authors and historians able to gain vastly more information on key figures.

Cliff Lynch, director of the US think tank Coalition for Networked Information, said the changes would allow the preservation of much more detailed memories, but could lead to a dramatic extension of state interference.

“We will be able to replicate and pass on so much more information. In future you are going to have a much more elaborate picture for more and more people.

“Biographers and other kinds of scholars who want to understand what someone was thinking are going to be faced with an embarrassment of riches.

“There is a certain tendency towards a technological nanny state. Imagine having a personal companion that whines at you three times a day, telling you that you are eating the wrong things and that you spent more than you earned today and you’ll never be able to retire.

“Imagine we could end up with a smart refrigerator that tells you ‘you’ve already had your beer for the day, you can’t have another one’.

“I don’t think people would want a world like that, but the scary thing is it might be foisted on them.”

Prof Wendy Hall, of the University of Southampton, said: “Technology can play a vital role in memory, for example by providing an artificial aid to help those with memory disorders or enabling communities to create and preserve their collective experiences.

“However, we must also consider the social, ethical and legal issues associated with technology development and how increased access to knowledge will affect our society in open, inter-disciplinary forums.”

 By Nic Fleming.


