Extracting n-word phrases in large texts

Posted: January 3, 2014 in corpus linguistics, software

This is a summary of resources posted on [Corpora-List] in early 2014.

CMU-Cambridge Statistical Language Modeling toolkit

http://mi.eng.cam.ac.uk/~prc14/toolkit.html

Sketch Engine

http://www.sketchengine.co.uk/documentation/wiki/SkE/NGrams

Laurence Anthony’s AntConc

http://www.antlab.sci.waseda.ac.jp/software.html

kfNgram

http://www.kwicfinder.com/kfNgram/kfNgramHelp.html

Colibri

Software for extracting n-grams as well as patterns that are not consecutive (skipgrams). It is written in C++ for speed and memory efficiency, and comes with a Python binding for use from Python scripts. It also has a standalone command-line tool that can extract n-word phrases directly.

https://github.com/proycon/colibri-core

http://proycon.github.io/colibri-core/doc/

Maarten van Gompel
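
For readers new to the terms, the difference between an n-gram and a skipgram can be sketched in a few lines of plain Python. This is only an illustration using the standard library; it is not Colibri's API, and it will not scale to large corpora the way the tools above do. The function names, the whitespace tokenisation, and the `{*}` gap marker are all assumptions made for this example.

```python
from collections import Counter

GAP = "{*}"  # hypothetical gap marker, chosen for this sketch only


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def skipgrams(tokens, n):
    """Return n-token windows with one inner token replaced by GAP.

    This is a simplified notion of a skipgram: exactly one gap,
    never at the first or last position of the window.
    """
    result = []
    for gram in ngrams(tokens, n):
        for i in range(1, n - 1):
            result.append(gram[:i] + (GAP,) + gram[i + 1:])
    return result


def count_ngrams(text, n):
    """Count n-grams in text, using naive lowercase whitespace tokenisation."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


if __name__ == "__main__":
    text = "to be or not to be"
    for gram, freq in count_ngrams(text, 2).most_common():
        print(" ".join(gram), freq)
    for gram in skipgrams(text.split(), 3):
        print(" ".join(gram))
```

Holding every n-gram in a `Counter` keeps the sketch short but is memory-hungry, which is the kind of cost a dedicated tool is meant to avoid on large texts.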

