Archive for the ‘Language Resources’ Category


Today, Joseba Abaitua told me he hasn’t seen many articles of mine on Planet Littera. The reason is that I wrote them on my weblog and filed some of them under the category Littera and others under Language Resources. I have now edited my weblog and put everything under Littera, which is why many of them seem to have been written just now. But if you go to my weblog and look at the date at the end of each article, you will see when it was actually written. I hope this is taken into account, because I have spent a lot of time on this subject. So, if there is any problem, just click on my name or follow the link I put on Mediawiki and check the dates after reading my articles. You can also add a comment here and I will read it.


Read Full Post »

When I wrote the article about “Variation in English Words and Phrases”, the one on corpora, I already knew what corpora were, but thanks to the examples that the Cambridge International Corpus offers, I could test one, and now I know how they really work. The examples work like this:

You are given a question like:

Is the word ‘right’ more common in spoken English or in written English?

Then you click on the answer you think is best, and you are given the solution according to the collection of texts that Cambridge takes into account. In this case:

Right is very much more common in spoken English than in written English. Here are a couple of examples of spoken dialogue taken from the CIC:

“That’s right. Cos they’ve never seen him.”

“Oh well. And it’s going all right is it?”

As we can see, this corpus is very useful not only for students of English but also for teachers. With this tool we can avoid mistakes caused by differences between languages, such as the use of prepositions, register, or dialectal variation.
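The kind of query the CIC answers can be sketched in a few lines of code. This is only a toy illustration: the two sample “corpora” below are invented, and a real corpus query runs over millions of words with proper tokenisation.

```python
# Toy sketch of a corpus frequency query, in the spirit of the CIC
# example above. The sample "corpora" are invented for illustration.

def per_million(word, sentences):
    """Return the frequency of `word` per million tokens."""
    tokens = [t.lower().strip(".,?!'\"") for s in sentences for t in s.split()]
    if not tokens:
        return 0.0
    return tokens.count(word.lower()) / len(tokens) * 1_000_000

spoken = [
    "That's right. Cos they've never seen him.",
    "Oh well. And it's going all right is it?",
    "Right, so what do we do now?",
]
written = [
    "The committee reached its decision after long deliberation.",
    "He turned right at the second junction.",
]

# With these toy samples, "right" is more frequent in the spoken set.
print(per_million("right", spoken) > per_million("right", written))  # True
```

A real corpus tool works on the same principle (count occurrences, normalise per million words), just at a much larger scale and split by genre.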

Cambridge International Corpus:


Read Full Post »

SYSTRAN is one of the oldest machine translation companies. It has done extensive work for the United States Department of Defense and the European Commission.

It is the most widely used machine translation server and supplies Altavista’s and Google’s online translation services with its technology. SYSTRAN has 50,000 basic words and 250,000 scientific words, and it processes translations at a rate of 500,000 an hour.

Machine translation is the process of using computer software to translate text from one natural language into another. Machine translation takes into account the grammatical structure of each language and applies rules to transform the grammatical structure of the text as it translates.

Translation is not an easy task, since it is not a mere substitution of words but a matter of knowing the rules and structures of the languages involved. Therefore, machine translation takes into consideration morphology, syntax and semantics.
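The point that translation is more than word substitution can be sketched with a toy rule-based translator. This is not SYSTRAN’s actual system; the tiny English–Spanish lexicon and the single reordering rule are invented for illustration.

```python
# Toy rule-based translation sketch (NOT SYSTRAN's actual system):
# word-for-word lookup plus one structural rule, illustrating why
# translation is more than substitution.

LEXICON = {"the": "la", "red": "roja", "house": "casa", "is": "es", "big": "grande"}

def translate(sentence):
    words = sentence.lower().split()
    out = [LEXICON.get(w, w) for w in words]
    # Structural rule: in Spanish, attributive adjectives usually
    # follow the noun, so swap "roja casa" -> "casa roja".
    for i in range(len(out) - 1):
        if out[i] == "roja" and out[i + 1] == "casa":
            out[i], out[i + 1] = out[i + 1], out[i]
    return " ".join(out)

print(translate("The red house is big"))  # -> la casa roja es grande
```

Without the reordering rule, pure substitution would give the ungrammatical “la roja casa es grande”, which is exactly the kind of error the grammatical rules in a real system exist to prevent.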

SYSTRAN’s technology was developed for Linux and has spread to the Unix and Microsoft Windows platforms. SYSTRAN uses the most modern natural language processing technologies. Its systems integrate finite-state technology in order to accelerate access to the linguistic knowledge bases. In fact, SYSTRAN has more multilingual knowledge bases than normal computers can hold. XML, Unicode and HTTP are basic for multilingual applications, for instance the Web, email, intranets and advertising. The development and evolution of SYSTRAN is possible thanks to constant research on developments in linguistics and NLP. The aim is to prepare the next generation of machine translation systems in order to improve the quality of the translations.

Using SYSTRAN is very easy. On SYSTRAN’s webpage you find the SYSTRAN 5.0 translator. There you enter a text of up to 150 words (or the address of the web page you want translated) and choose the language of the original text and of the translation (from Arabic into English, from English into Arabic, from English into Swedish, from Swedish into English, from English into Dutch, from Dutch into English…). In this way it translates between English and 12 other languages in both directions. The remaining language pairs are from French into Dutch, Spanish, German, Italian, English and Portuguese, and vice versa. Moreover, there is an option to register in order to look for more specific things.

SYSTRAN translator:


Read Full Post »

Computational linguistics is an interdisciplinary field dealing with the statistical and logical modeling of natural language from a computational perspective. This modeling is not limited to any particular field of linguistics. Computational linguistics was formerly usually done by computer scientists who had specialized in the application of computers to the processing of a natural language. Recent research has shown that language is much more complex than previously thought, so computational linguistics work teams are now sometimes interdisciplinary, including linguists (specifically trained in linguistics). Computational linguistics draws upon the involvement of linguists, computer scientists, experts in artificial intelligence, cognitive psychologists and logicians, amongst others.


Computational linguistics as a field predates artificial intelligence, a field under which it is often grouped. Computational linguistics originated with efforts in the United States in the 1950s to have computers automatically translate texts in foreign languages into English, particularly Russian scientific journals. Since computers had proven their ability to do arithmetic much faster and more accurately than humans, it was thought to be only a short matter of time before the technical details could be taken care of that would allow them the same remarkable capacity to process language.

When machine translation (also known as mechanical translation) failed immediately to yield accurate translations, the problem was recognized as far more complex than had originally been assumed. Computational linguistics was born as the name of the new field of study devoted to developing algorithms and software for intelligently processing language data. When artificial intelligence came into existence in the 1960s, the field of computational linguistics became that sub-division of artificial intelligence dealing with human-level comprehension and production of natural languages.

In order to translate one language into another, it was observed that one had to understand the syntax of both languages, and at least at the level of morphology (the syntax of words) and whole sentences. In order to understand syntax, one had to also understand the semantics of the vocabulary, and even to understand something of the pragmatics of how the language was being used. Thus, what started as an effort to translate between languages evolved into an entire discipline devoted to understanding how to represent and process individual natural languages using computers.


Computational linguistics can be divided into major areas depending upon the medium of the language being processed, whether spoken or textual; and upon the task being performed, whether analyzing language (parsing) or creating language (generation).

Speech recognition and speech synthesis deal with how spoken language can be understood or created using computers. Parsing and generation are sub-divisions of computational linguistics dealing respectively with taking language apart and putting it together. Machine translation remains the sub-division of computational linguistics dealing with having computers translate between languages.

Some of the areas of research that are studied by computational linguistics include:

  • Computer aided corpus linguistics
  • Design of parsers for natural languages
  • Design of taggers like POS-taggers (part-of-speech taggers)
  • Definition of specialized logics like resource logics for NLP
  • Research in the relation between formal and natural languages in general
  • Machine Translation, e.g. by a translating computer
  • Computational Complexity of Natural Language, largely modeled on Automata Theory, with the application of Context-sensitive grammar and Linearly-Bounded Turing Machines.
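One of the areas listed above, the design of POS-taggers, can be sketched very simply: a lookup table plus crude suffix fallbacks. Real taggers are statistical and far more accurate; the tiny lexicon and rules here are invented for illustration.

```python
# Minimal sketch of a part-of-speech tagger: dictionary lookup with
# a crude suffix-based fallback. Real POS-taggers use statistical
# models trained on annotated corpora.

LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN",
           "barks": "VERB", "runs": "VERB", "quickly": "ADV"}

def tag(word):
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    if w.endswith("ly"):   # crude fallback: -ly words are often adverbs
        return "ADV"
    if w.endswith("s"):    # crude fallback: -s words are often verbs
        return "VERB"
    return "NOUN"          # default guess: unknown words are often nouns

def tag_sentence(sentence):
    return [(w, tag(w)) for w in sentence.split()]

print(tag_sentence("the dog barks loudly"))
```

Even this toy shows the basic design question of tagging: what to do with words outside the lexicon, which is where real taggers bring in context and probabilities.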

The Association for Computational Linguistics defines computational linguistics as:

…the scientific study of language from a computational perspective. Computational linguists are interested in providing computational models of various kinds of linguistic phenomena.

From Wikipedia, the free encyclopedia

Read Full Post »

Translating Pens

I have found information about pens that allow students to learn a second language. These are called translating pens and they work this way: you scan a single word or a full line of printed text, and the pen translates or defines single words. You can also see and hear the scanned word(s) read aloud. Moreover, they are multilingual, and alternative input tools are also included for entering words that you can’t scan (e.g., street signs).

I think they are going to be very useful not only for students of a second language but also for travellers and maybe for translators. However, I imagine they are still not well developed, because, as we saw, machine translators aren’t, and they present a lot of problems (most of them dealing with the context surrounding the words). Moreover, as the video I posted about Microsoft Speech “Recognition” showed, these speech technologies also present some problems. So, I think these pens are going to be problematic in both areas.

They look like this:

Read Full Post »

This is a funny video that shows the problems speech recognition presents. This took place in August 2006.

Read Full Post »

Ontologies and NLP

Ontologies are formal, explicit specifications of how to represent the objects, concepts, and other entities in a particular system, as well as the relationships between them.

Natural-language processing (NLP) is an area of artificial intelligence research that attempts to reproduce the human interpretation of language. NLP methodologies and techniques assume that the patterns in grammar and the conceptual relationships between words in language can be articulated scientifically. The ultimate goal of NLP is to determine a system of symbols, relations, and conceptual information that can be used by computer logic to implement artificial language interpretation.

Natural-language processing has its roots in semiotics, the study of signs. Semiotics was developed by Charles Sanders Peirce (a logician and philosopher) and Ferdinand de Saussure (a linguist). Semiotics is broken up into three branches: syntax, semantics, and pragmatics.

A complete natural-language processor extracts meaning from language on at least seven levels. However, we’ll focus on the four main levels.

Morphological: A morpheme is the smallest part of a word that can carry a discrete meaning. Morphological analysis works with words at this level. Typically, a natural-language processor knows how to understand multiple forms of a word: its plural and singular, for example.

Syntactic: At this level, natural-language processors focus on structural information and relationships.

Semantic: Natural-language processors derive an absolute (dictionary definition) meaning from context.

Pragmatic: Natural-language processors derive knowledge from external commonsense information.
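The first of these four levels, the morphological one, can be sketched as mapping an inflected word form to a base form plus grammatical features. The rules below are deliberately simplistic and only cover regular English plurals; real morphological analysers handle far more.

```python
# Toy sketch of morphological analysis: reduce a word to a lemma
# plus a number feature. Only regular English plural rules are
# covered, for illustration.

def analyze(word):
    w = word.lower()
    if w.endswith("ies"):                    # cities -> city
        return {"lemma": w[:-3] + "y", "number": "plural"}
    if w.endswith("es") and w[-3] in "sxz":  # boxes -> box, buses -> bus
        return {"lemma": w[:-2], "number": "plural"}
    if w.endswith("s") and not w.endswith("ss"):  # dogs -> dog, but not "glass"
        return {"lemma": w[:-1], "number": "plural"}
    return {"lemma": w, "number": "singular"}

print(analyze("cities"))  # {'lemma': 'city', 'number': 'plural'}
print(analyze("boxes"))   # {'lemma': 'box', 'number': 'plural'}
print(analyze("dog"))     # {'lemma': 'dog', 'number': 'singular'}
```

Irregular forms (“mice”, “children”) already break these rules, which is one reason real processors store exception lists alongside rules.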

Natural-language limitations

One of the major limitations of modern NLP is that most linguists approach NLP at the pragmatic level by gathering huge amounts of information into large knowledge bases that describe the world in its entirety. These academic knowledge repositories are defined in ontologies that take on a life of their own and never end up in practical, widespread use. There are various knowledge bases, some commercial and some academic. The largest and most ambitious is the Cyc Project. The Cyc Knowledge Server is a monstrous inference engine and knowledge base. Even natural-language modules that perform specific, limited, linguistic services aren’t financially feasible for use by the average developer.

In general, NLP faces the following challenges:

  • Physical limitations: The greatest challenge to NLP is representing a sentence or group of concepts with absolute precision. The realities of computer software and hardware limitation make this challenge nearly insurmountable. The realistic amount of data necessary to perform NLP at the human level requires a memory space and processing capacity that is beyond even the most powerful computer processors.
  • No unifying ontology: NLP suffers from the lack of a unifying ontology that addresses semantic as well as syntactic representation. The various competing ontologies serve only to slow the advancement of knowledge management.
  • No unifying semantic repository: NLP lacks an accessible and complete knowledge base that describes the world in the detail necessary for practical use. The most successful commercial knowledge bases are limited to licensed use and have little chance of wide adoption. Even those with the most academic intentions develop at an unacceptable pace.
  • Current information retrieval systems: The performance of most of the current information retrieval systems is affected by semantic overload. Web crawlers, limited by their method of indexing, more often than not return incorrect matches as a result of ambiguous interpretation.

Ontologies and solutions

The W3C’s Resource Definition Framework (RDF) was developed to enable the automated processing of Web resources by providing a means of defining metadata about those resources. RDF addresses the physical limitation of memory space by allowing a natural-language processor to access resources in a distributed environment. A networked computer processor can access RDF models on various other processors in a standard way.

RDF provides a unifying ontological syntax for defining knowledge bases. RDF is expressed in XML, a markup language designed to cleanly separate data formatting from data semantics. As a result of the extensible nature of XML (authors have only the restriction of being well-formed and valid), a number of categories of information can be expressed very clearly using XML and RDF.

RDF is by no means the perfect ontological syntax. For instance, there are five semantic principles: existence, coreference, relation, conjunction, and negation. RDF doesn’t inherently support conjunctions and negations. At its core, RDF allows users to define statements in a simple format about network resources.
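RDF’s core data model, statements as subject–predicate–object triples, can be sketched with a toy in-memory store. This is for illustration only: real RDF is serialized (for instance in XML) and queried with dedicated tools, and the example resources below are invented.

```python
# Sketch of RDF's core data model: every statement is a
# (subject, predicate, object) triple. Toy in-memory store.

triples = set()

def add(subject, predicate, obj):
    triples.add((subject, predicate, obj))

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern (None = wildcard)."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

add("SYSTRAN", "isA", "MachineTranslationSystem")
add("SYSTRAN", "developedFor", "Linux")

print(query(subject="SYSTRAN", predicate="isA"))
```

Note that this model can only assert simple statements: there is no built-in way to say “not X” or “X and Y as one unit”, which is exactly the lack of negation and conjunction support mentioned above.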

This is taken from:


Read Full Post »

Older Posts »