Tuesday, February 7, 2017

How can I write a correct nltk regular expression tokenizer in python?

Vote count: 0

I want to implement a regular expression tokenizer with NLTK in Python, but I have the following problem. I used this page to write my regular expression.

import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

def preprocess(sentence):
    sentence = sentence.lower()
    pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*            # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?
      | \$?\d+%?
      | /\m+(?:[-'/]\w+)*
    '''
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(sentence)
    print tokens

str = 'i have one 98% 0.78 gener-alized 22 rule /m/0987hf /m/08876 i nees packages'
preprocess(str)

This gives me:

['i', 'have', 'one', '98', '0', '78', 'gener-alized', '22', 'rule', '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']

I want this result:

['i', 'have', 'one', '98%', '0.78', 'gener-alized', '22', 'rule', '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']
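One way to get that output (a sketch, not part of the original question): the alternatives in a `RegexpTokenizer` pattern are tried left to right, and `\w+` is listed before the numeric alternatives, so it consumes the `98` out of `98%` and splits `0.78` at the dot. Moving the numeric alternative to the front fixes this. The id alternative is also rewritten as a literal `/m(?:/\w+)*`, because `\m` is an invalid escape in modern Python's `re` (Python 2 silently treated it as a literal `m`). Plain `re.findall` is used below; passing the same pattern to `nltk.tokenize.RegexpTokenizer` applies the same regex.

```python
import re

# Numeric alternative first, so \w+ cannot eat the "98" out of "98%"
# or split "0.78" at the decimal point.
pattern = r'''(?x)              # verbose mode
    \$?\d+(?:\.\d+)?%?          # numbers with optional $ prefix / % suffix
  | \w+(?:-\w+)*                # words with optional internal hyphens
  | /m(?:/\w+)*                 # ids like /m/0987hf
'''

sentence = 'i have one 98% 0.78 gener-alized 22 rule /m/0987hf /m/08876 i nees packages'
tokens = re.findall(pattern, sentence.lower())
print(tokens)
# ['i', 'have', 'one', '98%', '0.78', 'gener-alized', '22', 'rule',
#  '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']
```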

Also, if I want to remove digits, what should I write in the regular expression?
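Simply deleting the digit alternatives from the pattern is not enough, because `\w+` also matches digits. One option (a sketch, assuming purely numeric tokens are what should be dropped) is to tokenize as before and then filter out tokens that are only a number, which keeps mixed tokens such as `/m/0987hf`:

```python
import re

# Anchored pattern: matches only tokens that are entirely a number,
# with an optional $ prefix or % suffix.
NUMERIC = re.compile(r'\$?\d+(?:\.\d+)?%?$')

tokens = ['i', 'have', 'one', '98%', '0.78', 'gener-alized', '22',
          'rule', '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']

# /m/0987hf survives: NUMERIC.match fails on its leading slash.
no_digits = [t for t in tokens if not NUMERIC.match(t)]
print(no_digits)
# ['i', 'have', 'one', 'gener-alized', 'rule', '/m/0987hf',
#  '/m/08876', 'i', 'nees', 'packages']
```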





