Sunday, 25 January 2015

Number of support vectors in text classification






I am beginning to use scikit-learn's SVM to perform some news analytics. One of its tutorials runs a classification (using a linear SVM) on the 20 newsgroups dataset. I chose 4 categories and fed a 2257 x 35843 sparse matrix (after applying tf-idf) into the SVM. I was curious how many support vectors the fit had, and wanted to use them to glean more information about the dataset in general. It turns out the fit has ~165k features (in a sparse matrix) as support vectors! I couldn't wrap my head around that number. If my understanding is correct, the support vectors are those features that abut the hyperplane separating the classes, but it seems to be taking roughly 50% of the feature set as support. Is this normal in text classification?

My concern is that I will be feeding in thousands of news articles (relating to a single topic) to perform sentiment analysis, and I would like to know which words lean towards a particular sentiment (the support vectors?). But if the algorithm is going to give me this huge number, then even after I run a chi2 test for the "x" most important features, I still might not get the words that we humans intuitively identify as positive or negative.
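
A minimal sketch of how these quantities can be inspected, assuming scikit-learn's `SVC` with `kernel="linear"` and a tiny made-up corpus (the documents, labels, and variable names are illustrative, not the 20 newsgroups data). In scikit-learn, `support_vectors_` holds training documents (rows of the tf-idf matrix), so its row count is the number of support vectors, while the number of stored non-zero values in that sparse matrix can be much larger. Per-word weights, which may be closer to what the sentiment question is after, are exposed through `coef_` when the kernel is linear.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the news articles.
docs = [
    "great excellent wonderful results",
    "excellent growth and great profit",
    "wonderful gains great outlook",
    "terrible awful losses",
    "awful decline and terrible outlook",
    "terrible losses awful results",
]
labels = np.array([1, 1, 1, 0, 0, 0])  # 1 = positive, 0 = negative

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

clf = SVC(kernel="linear", C=1.0).fit(X, labels)

# Support vectors are rows (documents), not columns (terms).
sv = clf.support_vectors_
n_support_vectors = sv.shape[0]  # how many training documents are support vectors
n_stored_values = sv.nnz         # non-zero tf-idf entries stored inside those rows

# With a linear kernel, coef_ holds one weight per term.
# It is a sparse matrix when the model was fitted on sparse input.
coef = clf.coef_
weights = (coef.toarray() if hasattr(coef, "toarray") else np.asarray(coef)).ravel()

# Terms ordered by their column index in X.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Positive weights push a document towards class 1, negative towards class 0.
order = np.argsort(weights)
most_negative = [terms[i] for i in order[:3]]
most_positive = [terms[i] for i in order[-3:]]

print("support vectors:", n_support_vectors, "of", X.shape[0], "documents")
print("stored non-zero values in support_vectors_:", n_stored_values)
print("most negative terms:", most_negative)
print("most positive terms:", most_positive)
```

In this sketch `n_support_vectors` can never exceed the number of training documents, whereas `n_stored_values` counts individual tf-idf entries and grows with document length. Ranking terms by `weights` is one way to look for words that lean towards a class; it is a different quantity from the support vectors themselves.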


Apologies for the long question, but I wanted to be as detailed as possible. I hope it makes sense; any pointers will be deeply appreciated.


Regards, vikram









