Archive for the ‘corpora’ Category

terminalBatchtextutil

Source: http://www.maketecheasier.com

Advertisements

I have received this through the CORPORA List:
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

At http://corpus.byu.edu/full-text/ you can now download full-text data for the two largest BYU corpora:

Corpus of Contemporary American English (COCA). 440 million words of downloadable text; the largest, most up-to-date, publicly-available corpus of English that is balanced for genre (spoken, fiction, magazine, newspaper, and academic).

The corpus of Global Web-Based English (GloWbE). 1.8 billion words of downloadable text; divided into groups from twenty different English-speaking countries (US, UK, Canada, Australia, India, etc). About 60% from blogs, for very informal language.

With this full-text data, you will have the actual corpora on your computer, and you can search the data in any way that you’d like. You can generate your own frequency data, collocates, n-grams, or concordance lines; you can search by word, lemma, and part of speech; and you can carry out complex syntactic and semantic searches offline. You can even modify the lexicon and sources tables to search the corpora in ways that are not possible via the standard web interfaces.

The data comes in three different formats (see samples): data for relational databases (info), word/lemma/PoS (vertical), and linear text (horizontal). When you purchase the data, you purchase the rights to any and all of these formats.