Archive for the ‘corpus linguistics’ Category




This free MOOC offers a practical introduction to the methodology of corpus linguistics for researchers in the social sciences and humanities. It is an 8-week course run by Lancaster University.

More information here.



An NPR feature that discusses W. Pennebaker's contribution to our understanding of how people use words.

This was also the subject of my own contribution to the 2010 World Congress of Behavioral and Cognitive Therapies within the session “Interdisciplinary research between Corpus Linguistics and Clinical Psychology” at Boston University, MA.


ReCALL Special Issue on Researching New Uses of Corpora for Language Teaching and Learning

Guest Edited by Alex Boulton and Pascual Pérez-Paredes.

ReCALL special issue: Researching uses of corpora for language teaching and learning

Editorial: Researching uses of corpora for language teaching and learning

University of Lorraine and CNRS, France

Universidad de Murcia, Spain

Boulton, A., & Pérez-Paredes, P. (2014). Editorial: Researching uses of corpora for language teaching and learning. ReCALL, 26(2), 121-127.

Pérez-Paredes, P. (2014). A review of Fluency in Native and Nonnative English Speech. Studies in Corpus Linguistics, 53. Amsterdam: John Benjamins, 2013. 238 pp. ISBN 978-9-027-203588. ICAME Journal.

Read the review. 

This research explores the search behaviour of EFL learners (n=24) by tracking their interaction with corpus-based materials during focus-on-form activities (Observe, Search the corpus, Rewriting). One set of learners made no use of web services other than the BNC during the central Search the corpus activity, while the other set resorted to other web services and/or consultation guidelines. The performance of the second group was higher. The learners’ formulation of corpus queries on the BNC was unsophisticated, and the students tended to use the BNC search interface in much the same way as they used Google or similar services. Our findings suggest that careful consideration should be given to the cognitive aspects concerning the initiation of corpus searches, the role of computer search interfaces, and the implementation of corpus-based language learning. Our study offers a taxonomy of learner searches that may be of interest in future research.

Pérez-Paredes, P., Sánchez-Tornel, M., & Alcaraz Calero, J. M. (2012). Learners’ search patterns during corpus-based focus-on-form activities. International Journal of Corpus Linguistics, 17(4), 483-516.

Full text here.

Valoriser et développer les outils autour des corpus dans une perspective didactique / Enhancing and extending corpora and corpora tools for learning and teaching

Mardi/Tuesday, mai/May 27th

Salle/Room 205 Site Rabelais, UJF Valence, France


9h30 – Speed-dating : Présentations/Presentations

10h – Présentation et discussion autour du livre/Presentation and discussion about the book « Des documents authentiques aux corpus. Démarches pour l’apprentissage des langues ». Boulton et Tyne (2014). Discussion autour de l’abondance de matières exploitables dans les corpus et la sous-exploitation dans l’enseignement des langues/Including the abundance of exploitable corpus materials and the general lack of their use in language teaching.

Conférencier/Speaker: Alex Boulton

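11h – Présentation de la Plate-forme Chamilo : comment l’utiliser pour les corpus ? Suivi d’une discussion en français/anglais. / Presentation of the Chamilo platform: how can it be used for corpora? Followed by a discussion in French/English.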

Jérémie Grépiloux et Hubert Borderiou (SIMSU)

13h30 – Pedagogical uses of corpora: theories and practices / Utilisations pédagogiques des corpus : théories et pratiques, 20-minute presentation followed by a group discussion

Conférencier/Speaker: Pascual Pérez-Paredes

14h30 – Speed-dating : Consultation en ligne des corpus/Consulting on-line corpora: Montrer et voir des corpus en salle informatique/Showing and viewing corpora in the computer room

16h – Bilan de la journée et projets/Summary of the day and projects

Cristelle Cavalla and Laura Hartwell

Inscriptions (Gratuit et obligatoire)/Mandatory free registration :

Logistics: Sylvain Perraud, (Compte rendu/minutes)




Chamilo :

Scientext : http ://

EmoBase/EmoProf :

I have received this through the CORPORA List:

At you can now download full-text data for the two largest BYU corpora:

Corpus of Contemporary American English (COCA). 440 million words of downloadable text; the largest, most up-to-date, publicly-available corpus of English that is balanced for genre (spoken, fiction, magazine, newspaper, and academic).

The corpus of Global Web-Based English (GloWbE). 1.8 billion words of downloadable text; divided into groups from twenty different English-speaking countries (US, UK, Canada, Australia, India, etc). About 60% from blogs, for very informal language.

With this full-text data, you will have the actual corpora on your computer, and you can search the data in any way that you’d like. You can generate your own frequency data, collocates, n-grams, or concordance lines; you can search by word, lemma, and part of speech; and you can carry out complex syntactic and semantic searches offline. You can even modify the lexicon and sources tables to search the corpora in ways that are not possible via the standard web interfaces.

The data comes in three different formats (see samples): data for relational databases (info), word/lemma/PoS (vertical), and linear text (horizontal). When you purchase the data, you purchase the rights to any and all of these formats.
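As a rough illustration of the kind of offline processing the announcement describes, the sketch below reads a small sample in a word/lemma/PoS vertical layout and derives a lemma frequency list and bigram counts. The tab-separated three-column layout, the sample tokens, the tag labels and the function names are all assumptions made for this example; the actual files may be organised differently, so the samples that accompany the data should be checked first.

```python
from collections import Counter

# Hypothetical sample in a vertical layout: one token per line,
# assumed here to be tab-separated as word, lemma, PoS tag.
SAMPLE = "\n".join([
    "The\tthe\tDET",
    "learners\tlearner\tNOUN",
    "searched\tsearch\tVERB",
    "the\tthe\tDET",
    "corpus\tcorpus\tNOUN",
    "and\tand\tCONJ",
    "the\tthe\tDET",
    "learners\tlearner\tNOUN",
    "read\tread\tVERB",
    "the\tthe\tDET",
    "concordances\tconcordance\tNOUN",
])


def parse_vertical(text):
    """Return a list of (word, lemma, pos) tuples, skipping malformed lines."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            tokens.append(tuple(parts))
    return tokens


def lemma_frequencies(tokens):
    """Count how often each lemma occurs."""
    return Counter(lemma for _, lemma, _ in tokens)


def ngrams(tokens, n=2):
    """Count n-grams over the lower-cased word forms."""
    words = [word.lower() for word, _, _ in tokens]
    return Counter(zip(*(words[i:] for i in range(n))))


tokens = parse_vertical(SAMPLE)
freq = lemma_frequencies(tokens)
bigrams = ngrams(tokens, 2)

print(freq.most_common(3))
print(bigrams.most_common(2))
```

Counting lemmas rather than word forms groups "The" and "the" (or "learner" and "learners") together, which is usually what a frequency list for teaching purposes needs; swapping in the word column instead is a one-line change.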

The methodological transfer from the CL research area to the applied arena of language learning and teaching underwent no adaptation, and thus learners were presented with the same tools, corpora and analytical tasks as well-trained and professional linguists.


Reading concordances is by no means a trivial task. Sinclair (1991) recommends a complex procedure which involves five distinct stages. Let us review very briefly what they entail. The first stage is that of initiation: learners look to the left and to the right of the nodes and determine the dominant pattern. Second, learners are prompted to interpret these words and hypothesize about what it is that they have in common. Third comes the consolidation stage, where students are to corroborate their hypotheses by looking more closely at variations. After this, the findings have to be reported and, finally, a new round of observations starts. Although typically reduced in language classrooms, this procedure is common in the possibilities scenario and certainly characterises the so-called bottom-up approach (Mishan, 2004: 223). A recent analysis (Kreyer, 2008) deconstructs the idea of corpus competence into different skills, namely, interpreting corpus data, knowledge about corpus design, knowledge about resources on the Internet, some linguistic background, knowledge about how to use concordances and, finally, some corpus linguistics background. This is a positive effort in the right direction, as the author admits the need to create the conditions for the use of corpora in the language classroom or, in other words, Kreyer recognizes that pedagogic mediation is necessary if we want to turn the corpus into a learning tool. Notwithstanding, the challenges are significant.

Pérez-Paredes, P. (2010). Corpus Linguistics and Language Education in Perspective: Appropriation and the Possibilities Scenario. In T. Harris & M. Moreno Jaén (Eds.), Corpus Linguistics in Language Teaching (pp. 53-73). Peter Lang.
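The initiation stage mentioned in the excerpt above, looking to the left and right of the node to find the dominant pattern, can be approximated in a few lines of code. The following is only a minimal sketch: the sample sentence, the span of three words and the function names are invented for illustration, and it is not a reconstruction of Sinclair's procedure, which relies on the reader's interpretation rather than on counts alone.

```python
from collections import Counter

# Hypothetical mini-text; any tokenised corpus sample could stand in for it.
TEXT = (
    "learners rely on the corpus when they search the corpus for patterns "
    "and teachers rely on the corpus when they plan tasks"
)


def concordance(words, node, span=3):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, word in enumerate(words):
        if word == node:
            left_ctx = words[max(0, i - span):i]
            right_ctx = words[i + 1:i + 1 + span]
            lines.append((left_ctx, word, right_ctx))
    return lines


def dominant_neighbours(lines):
    """Count the words immediately to the left and right of the node."""
    left = Counter(lft[-1] for lft, _, _ in lines if lft)
    right = Counter(rgt[0] for _, _, rgt in lines if rgt)
    return left, right


words = TEXT.split()
lines = concordance(words, "corpus")
left, right = dominant_neighbours(lines)

# Print the hits in a keyword-in-context layout, node aligned in the middle.
for lft, node, rgt in lines:
    print(f"{' '.join(lft):>20} [{node}] {' '.join(rgt)}")

print(left.most_common(1), right.most_common(1))
```

The counts only point to candidate patterns; the interpretation, consolidation and reporting stages still depend on the learner reading the lines themselves.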