Conceptually, your corpus file should be a collection of text about whatever your domain is.

For example, if your domain is American history, you could use an American history textbook.

Format the corpus so there is one line per paragraph, with no blank lines between paragraphs.
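If your source text has hard-wrapped paragraphs separated by blank lines, a short script can do the reformatting. This is only a sketch and not part of GnuTutor: the function name paragraphs_to_lines is made up for illustration, and it assumes a blank line is what marks a paragraph break in your source.

```python
import re


def paragraphs_to_lines(raw_text: str) -> str:
    """Return raw_text with each paragraph on one line and no blank lines."""
    # Assumption: one or more blank (whitespace-only) lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", raw_text.strip())
    # Join the wrapped lines of each paragraph and collapse runs of whitespace.
    lines = [" ".join(p.split()) for p in paragraphs]
    # Drop anything that ended up empty so no blank lines remain.
    return "\n".join(line for line in lines if line)
```

Read your source file into a string, pass it through this function, and write the result out as the corpus file.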

Make sure you have at least 1 MB of text. More is better.

Also, to LSA, "the" and "the." are two different words. If that matters to you, you might want to selectively strip punctuation. But beware of side effects: stripping every apostrophe turns "didn't" into "didn t".
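One way to strip punctuation selectively is with a pair of regular expressions. The sketch below is an illustration only, not something GnuTutor provides: strip_punctuation is a hypothetical name, and the rule it applies (keep an apostrophe only when it sits between two word characters) is an assumption that may not suit every domain.

```python
import re


def strip_punctuation(line: str) -> str:
    """Remove punctuation from one line but keep apostrophes inside words."""
    # Replace everything except word characters, whitespace, and apostrophes.
    # Note: \w also matches digits and the underscore.
    cleaned = re.sub(r"[^\w\s']", " ", line)
    # Remove apostrophes that are not between two word characters,
    # such as single quotes around a phrase.
    cleaned = re.sub(r"(?<!\w)'|'(?!\w)", " ", cleaned)
    # Collapse the extra spaces left behind by the substitutions.
    return " ".join(cleaned.split())
```

Apply it one line at a time, since it collapses all whitespace (including line breaks) into single spaces.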

To do transformations like this, I like to use TextPad, which can lowercase an entire document in one step and supports regular expressions for character substitution.

The reason GnuTutor doesn't do these transformations automatically is that you might have different requirements depending on your domain.

Also note that whatever transformation you do here needs to be matched at runtime. For example, if your corpus contained "Abraham Lincoln", you lowercased it to "abraham lincoln", and at runtime the student types "Abraham Lincoln", then you have to lowercase what the student typed if you expect to find it in the corpus/LSA space.
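A simple way to keep the two sides in sync is to put the transformation in one function and call it both when building the corpus and when handling student input. The sketch below assumes lowercasing is the only transformation you chose. The name normalize is hypothetical and is not a GnuTutor function.

```python
def normalize(text: str) -> str:
    """Apply the same transformation used when the corpus was prepared."""
    # Assumption: the corpus was lowercased and had its whitespace collapsed.
    return " ".join(text.lower().split())


# Corpus preparation time: normalize each line before building the LSA space.
corpus_line = normalize("Abraham Lincoln was the 16th president.")

# Runtime: normalize what the student typed before looking it up.
student_input = normalize("  Abraham   Lincoln ")

# Both sides now use the same form, so the lookup can succeed.
assert student_input in corpus_line
```

If you later add another step, such as punctuation stripping, add it inside this one function so the corpus and the runtime input cannot drift apart.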