We can ask for the topics covered by one or more documents, or for the documents included in one or more categories.

For convenience, the corpus methods accept a single fileid or a list of fileids.
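The two directions of lookup, and the convenience of passing either one name or a list of names, can be sketched in plain Python. This is a toy stand-in rather than the corpus reader itself: the dictionary, the fileids, the category names, and the helper functions `categories_of` and `fileids_in` are all invented for illustration. NLTK's categorized corpus readers expose comparable `categories()` and `fileids()` methods.

```python
# Toy stand-in for a categorized corpus: each fileid maps to its categories.
# All fileids and category names below are invented for illustration.
DOC_CATEGORIES = {
    "doc1.txt": ["grain", "trade"],
    "doc2.txt": ["trade"],
    "doc3.txt": ["grain", "shipping"],
}


def _as_list(value):
    """Accept a single string or a list of strings; always return a list."""
    return [value] if isinstance(value, str) else list(value)


def categories_of(fileids):
    """Return the sorted categories covered by one or more documents."""
    found = set()
    for fileid in _as_list(fileids):
        found.update(DOC_CATEGORIES[fileid])
    return sorted(found)


def fileids_in(categories):
    """Return the sorted documents that belong to one or more categories."""
    wanted = set(_as_list(categories))
    return sorted(
        fileid
        for fileid, cats in DOC_CATEGORIES.items()
        if wanted & set(cats)
    )


print(categories_of("doc1.txt"))                # a single fileid
print(categories_of(["doc2.txt", "doc3.txt"]))  # a list of fileids
print(fileids_in("grain"))                      # a single category
print(fileids_in(["trade", "shipping"]))        # a list of categories
```

Normalizing the argument once in `_as_list` lets both functions treat a single name and a list of names the same way.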

However, the corpus is actually a collection of 55 texts, one for each presidential address.

An interesting property of this collection is its time dimension.

Many text corpora contain linguistic annotations, representing POS tags, named entities, syntactic structures, semantic roles, and so forth.
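To make the time dimension of the presidential address collection concrete, here is a minimal sketch. It assumes each fileid begins with the four-digit year of the address (for example `1789-Washington.txt`). The three fileids listed and the helper `year_of` are illustrative only and are not read from the corpus.

```python
# Hypothetical fileids, assuming the "<year>-<president>.txt" naming pattern.
fileids = ["1789-Washington.txt", "1861-Lincoln.txt", "2009-Obama.txt"]


def year_of(fileid):
    """Extract the year from the first four characters of a fileid."""
    return int(fileid[:4])


# One year per text gives the collection its timeline.
years = [year_of(fileid) for fileid in fileids]
print(years)
```

With a year attached to every text, word counts or other statistics can be grouped by year to see how usage changes over time.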

Similarly, we can specify the words or sentences we want in terms of files or categories.

The first handful of words in each of these texts are the titles, which by convention are stored as upper case.

Often there is insufficient government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or re-use. Some languages have no established writing system, or are endangered.

This chapter continues to present programming concepts by example, in the context of a linguistic processing task. We will wait until later before exploring each Python construct systematically.

The previous example also showed how we can access the "raw" text of the book.

Although Project Gutenberg contains thousands of books, it represents established literature.

