26 December 2010

The N-Gram Viewer, or when Google plays linguist: popularisation vs. scientific rigour.

A few days ago we reported on the first results of searches in Google's megacorpus, which contains more than 5 million books (according to LanguageLog, 4% of all books ever published).

Searches can be run over several different corpora, each with its own characteristics, using a tool, Google's NGram Viewer, about which you can find more information here.

The impact of the story has been enormous. The authors maintain a site, culturomics.org, where they publish their findings, and the story was a great success in the media, as shown by the reports from National Public Radio in the United States and from the Guardian.


Summary of the impact: Criticisms

However, all that glitters is not gold: criticism of the project was not long in coming, and most critics seem to agree that Google's tool is more about entertainment (with the popularisation of quantitative language research that this entails) and publicity (for Google and for the Harvard and MIT team) than about scientific rigour:


Geoffrey Nunberg comments:
The authors of the paper claim that the quantitative data gathered from the corpus are the bones that can be assembled into "the skeleton of a new science." They call the new field "culturomics," defining it as "the application of high-throughput data collection and analysis to the study of human culture," which "extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities."

Whatever misgivings scholars may have about the larger enterprise, the data will be a lot of fun to play around with. And for some—especially students, I imagine—it will be a kind of gateway drug that leads to more-serious involvement in quantitative research.



Mark Liberman has doubts about the project being private:
The Science paper says that "Culturomics is the application of high-throughput data collection and analysis to the study of human culture". But as long as the historical text corpus itself remains behind a veil at Google Books, then "culturomics" will be restricted to a very small corner of that definition, unless and until the scholarly community can reproduce an open version of the underlying collection of historical texts.

Google is making an important contribution to the creation of the archives that will make new kinds of work possible. For that, the company deserves everyone's thanks.

But there is a potential problem. As it stands, outside scholarly access to this historical archive will be limited to tracking the frequency of words and word-strings (what they call in the trade "n-grams"). This is useful for addressing some questions, but most questions will require other kinds of processing, which are not possible without having the full underlying archive in your (digital) hands. For the material before 1922, there is no copyright issue. The only barrier is Google's competitive advantage.

This puts the rest of us in a difficult position. Given Google's large, well-run and successful effort to digitize these historical collections, for which the economic returns are fairly small on a per-book basis, it's unlikely that anyone else will duplicate their efforts in the visible future. So we're in the situation that would have existed if the Human Genome Project had been entirely private, rather than shared.

In this analogy, the access to "culturomic trajectories" to be made available at culturomics.org might correspond to information about the relative frequencies of nucleotide polymorphisms across individuals, without access to the underlying genomes.
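A side note for readers new to the term: an "n-gram" is simply a contiguous sequence of n words, and the viewer essentially plots how often each sequence occurs per year. Here is a minimal sketch in Python of that kind of counting; the toy corpus and the names are ours, purely for illustration, not Google's actual pipeline:

    from collections import Counter

    def ngrams(tokens, n):
        """Return every contiguous n-word sequence in a token list."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    # Toy "corpus": (year, text) pairs standing in for scanned pages.
    corpus = [
        (1900, "the skeleton of a new science"),
        (1900, "a new science of culture"),
        (2000, "a new science"),
    ]

    # Count each 2-gram per year; the real viewer plots counts like
    # these, normalised by the total number of n-grams in each year.
    counts = Counter()
    for year, text in corpus:
        for gram in ngrams(text.split(), 2):
            counts[(year, gram)] += 1

    print(counts[(1900, ("new", "science"))])  # -> 2

The real system works at a vastly larger scale, but the underlying object being counted is no more than this.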


Mark Davies, in a defence of his COHA, criticises some details of the project:
Google Books has received lots of attention because of its size. It has about 500 billion words, compared to 400 million words in COHA (and COHA is in turn about 100 times as large as any other historical corpus of English).

The question is -- do you really need 500 billion words?

Google Books can't use wildcards to search for parts of words. For example, try searching for freak* out (all forms: freak, freaked, freaking, etc) or even a simple search like teenager*

If Google Books doesn't know about part of speech tags or variant forms of a word, then how can it look at change in grammar?
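Davies's wildcard point is easy to picture: a corpus with a word index can expand a pattern like freak* into its matching forms and sum their counts, which, as he notes, the Google interface does not let you do. A small illustrative sketch; the word list and frequencies are invented:

    import fnmatch

    # Invented word-frequency index standing in for a corpus.
    word_freq = {"freak": 120, "freaks": 45, "freaked": 80,
                 "freaking": 60, "teenager": 300, "teenagers": 500}

    def wildcard_total(pattern):
        """Expand a shell-style wildcard against the index and
        sum the frequencies of the matching forms."""
        forms = fnmatch.filter(word_freq, pattern)
        return sorted(forms), sum(word_freq[w] for w in forms)

    print(wildcard_total("freak*"))     # all forms of "freak"
    print(wildcard_total("teenager*"))  # "teenager", "teenagers"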

Despite these criticisms,
We like the new Google Books interface. It's "cool" and "simple", which is what has made Google so popular over the years. We'll definitely have the students in our linguistics classes use it to look at specific aspects of historical American English and culture.


However, there are more serious criticisms, such as David Crystal's:

We mustn't exaggerate the significance of this project. It is no more than a collection of scanned books - an impressive collection, unprecedented in its size, and capable of displaying innumerable interesting trends, but far away from entire cultural reality. For a start, this is just a collection of books - no newspapers, magazines, advertisements, or other orthographic places where culture resides. No websites, blogs, social networking sites. No spoken language, of course, so over 90 percent of the daily linguistic usage of the world isn't here. Moreover, the books were selected from 'over 40 university libraries from around the world', supplemented by some books directly from publishers - so there will be limited coverage of the genres recognized in the categorization systems used in corpus linguistics. They were also, I imagine, books which presented no copyright difficulties. The final choice went through what must have been a huge filtering process. Evidently 15 million books were scanned, and 5 million selected partly on the basis of 'the quality of their OCR' [optical character recognition]. So this must mean that some types of text (those with a greater degree of orthographic regularity) will have been privileged over others.

The approach, in other words, shows trends but can't interpret or explain them. It can't handle ambiguity or idiomaticity. If your query is unique and unambiguous, you'll get some interpretable results


And The Binder Blog's:

Setting aside my concerns about the accuracy of ngrams, there are still serious philosophical questions about ngrams’ utility as a tool of cultural research. Does the Ngrams Viewer really provide insight into how people once thought? Does it make it easier to study the past, or does it simply add more data without context to the pile? Does a project like this decode the cultures of the past, making them more accessible than ever, or does it obscure the past, making it even more difficult to do justice to our forbears?

What does word use frequency over time mean? Level of interest? If so, whose? All literate people? People whose books made it into the library? People who wrote books that eventually became Google Books? There were no historians, anthropologists or other cultural researchers on the ngrams team, so I have to wonder what problem this tool was designed to solve.

It’s thin description: data without context. As if the academy doesn’t already have enough of that to analyze.


To which The Lousy Linguist adds the risk of over-interpreting the data:
Andrew Sullivan has predictably misunderstood the value of Google's Ngram Viewer. He spent all day yesterday posting trite and simplistic mis-interpretations of the data. For example,

* the concept of ideology is a relatively recent one because the word ideology has become more frequent recently (this is almost certainly false).
* Jesus "wins" (his word, not mine) against the Beatles because the word Jesus is more frequent.

Summary:

The N-Gram Viewer opens a window onto quantitative research into linguistic and cultural phenomena, with Google contributing to the popularisation of a field as it has done before with others. Without denying that value, however, the tool presents a series of problems for any user who intends to make rigorous scientific use of the samples.

1. Reliability problems, owing among many other things to the large amount of uncorrected OCR output and to design flaws in the tool, such as not detecting the difference between fixed phrases and simple co-occurrences (see the sketch after this list), or not distinguishing word classes. In any case, these are problems that can be corrected over time.

2. Serious design problems in its use as a linguistic corpus, since the samples are not representative of language, failing one of the criteria we set out in this post for considering a corpus rigorous:

- The selection criteria must be linguistic and serve a concrete purpose within the framework of linguistic studies. The sample must be representative, so as to reflect the variety of the language under study, within that framework.


For, as David Crystal explained, we find that there are:

no newspapers, magazines, advertisements, or other orthographic places where culture resides. No websites, blogs, social networking sites. No spoken language, of course, so over 90 percent of the daily linguistic usage of the world isn't here.


3. Philosophical problems in its use as a cultural corpus, since the fact that written traces survive does not imply that they were people's actual thoughts, as The Binder Blog explained in its post.
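Returning to the fixed-phrase problem mentioned in point 1: corpus tools typically separate genuine fixed phrases (collocations) from chance co-occurrences with an association measure such as pointwise mutual information, something raw n-gram counts alone do not give you. A minimal illustrative sketch, with invented counts:

    import math

    # Invented counts from a toy corpus of N word tokens.
    N = 1_000_000
    count = {"strong": 500, "tea": 300, "the": 50_000,
             ("strong", "tea"): 40, ("strong", "the"): 25}

    def pmi(w1, w2):
        """Pointwise mutual information: log2 of the observed bigram
        probability over the one expected if the words were independent."""
        p_pair = count[(w1, w2)] / N
        return math.log2(p_pair / ((count[w1] / N) * (count[w2] / N)))

    print(pmi("strong", "tea"))  # about 8.1: likely a fixed phrase
    print(pmi("strong", "the"))  # 0.0: pure chance co-occurrence

A bigram whose observed frequency far exceeds what its words' individual frequencies predict scores high and is a collocation candidate; one at chance level scores near zero.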

Conclusion:

The N-Gram Viewer is to corpus linguistics and humanities research something similar, if on a smaller scale, to what Wikipedia is to scientific knowledge in general: a tool that is interesting in its role as a populariser of the discipline, with great potential for bringing knowledge to a large number of people, but a tool of dubious value for rigorous scientific research, owing to problems of design such as those we set out in the previous section.
On the other hand, and although its use may help make the role of new technologies in linguistics better known, it should be borne in mind that this cuts both ways: it may feed greater arrogance in the already abundant (mistaken) judgements people make about all kinds of linguistic matters, by supplying non-rigorous data that "validate" wrong hypotheses put forward by non-professionals. We would thus have the unintended consequence of this tool feeding the phenomenon, already well known in the profession, of the general public thinking that "anyone can be a linguist".
