3D grphique: Good training data for LDA generic classification?

Saturday, 11 April 2015

I'm using LDA to classify content into generic topics such as Music, Technology, Arts, and Science.

This is the process I'm using:

9 topics -> Music, Technology, Arts, Science, etc.

9 documents -> Music.txt, Technology.txt, Arts.txt, Science.txt, etc.

I've filled each document (.txt file) with about 10,000 lines of what I think is "pure" categorical content.

I then classify a test document to see how well the classifier has been trained.
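
To make the steps above concrete, here is a minimal sketch of one way they could be wired together. It is not necessarily how any particular LDA library does it: it is a small, pure-Python collapsed Gibbs sampler, and the names `tokenize`, `train_lda`, `infer_mixture`, and `classify` are hypothetical helpers made up for illustration. The two short inline strings stand in for files like Music.txt and Technology.txt. Since LDA is unsupervised, nothing forces topic k to line up with file k, so the sketch assumes the test document is labeled by comparing its topic mixture against the mixture of each category file (cosine similarity).

```python
import random
from collections import Counter

def tokenize(text):
    # Very rough tokenizer: lowercase, whitespace split, alphabetic words only.
    return [w for w in text.lower().split() if w.isalpha()]

def train_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]           # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]    # count of word w in topic k
    topic_total = [0] * n_topics                         # total tokens in topic k
    assign = []
    for d, doc in enumerate(docs):                       # random initial assignment
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]                         # remove the current token
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [(doc_topic[d][t] + alpha)
                           * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights)[0]
                assign[d][i] = k                         # add it back under the new topic
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed per-document topic mixtures for the training files.
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
             for doc, counts in zip(docs, doc_topic)]
    return {"vocab": vocab, "topic_word": topic_word, "topic_total": topic_total,
            "theta": theta, "alpha": alpha, "beta": beta, "K": n_topics}

def infer_mixture(model, doc, n_iter=50, seed=0):
    """Estimate the topic mixture of an unseen document, keeping topic-word counts fixed."""
    rng = random.Random(seed)
    K, alpha, beta = model["K"], model["alpha"], model["beta"]
    V = len(model["vocab"])
    doc = [w for w in doc if w in model["vocab"]]        # unseen words are dropped
    zs = [rng.randrange(K) for _ in doc]
    counts = [0] * K
    for k in zs:
        counts[k] += 1
    for _ in range(n_iter):
        for i, w in enumerate(doc):
            counts[zs[i]] -= 1
            weights = [(counts[t] + alpha)
                       * (model["topic_word"][t][w] + beta)
                       / (model["topic_total"][t] + V * beta)
                       for t in range(K)]
            zs[i] = rng.choices(range(K), weights)[0]
            counts[zs[i]] += 1
    return [(c + alpha) / (len(doc) + K * alpha) for c in counts]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) ** 0.5 * sum(y * y for y in b) ** 0.5)

def classify(model, labels, doc):
    """Return the label whose training file has the most similar topic mixture."""
    mixture = infer_mixture(model, doc)
    scores = [cosine(mixture, theta) for theta in model["theta"]]
    return labels[scores.index(max(scores))], mixture

# Tiny stand-ins for the category files (assumption: the real files are far larger).
corpus = {
    "Music": "guitar piano melody chord song guitar rhythm melody song piano",
    "Technology": "software computer network code server software code computer server network",
}
labels = list(corpus)
model = train_lda([tokenize(corpus[name]) for name in labels], n_topics=len(labels))
label, mixture = classify(model, labels, tokenize("a song with piano and guitar melody"))
print(label, mixture)
```

With only one file per category, the sampler sees very few documents, so the learned topics may or may not separate cleanly by category. Splitting each file into many smaller documents before training is one variation worth trying.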

My questions are:

a.) Is this an efficient way to classify text (using the above steps)?

b.) Where should I be looking for "pure" topical content to fill each of these files? Preferably sources that are not too large (by "too large" I mean more than 1 GB of text).
