How To Remove Punctuation?
Solution 1:
Solution 1: Tokenize and strip punctuation off the tokens
>>> from nltk import word_tokenize
>>> import string
>>> punctuations = list(string.punctuation)
>>> punctuations
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> punctuations.append("''")
>>> sent = '''He said,"that's it."'''>>> word_tokenize(sent)
['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
>>> [i for i in word_tokenize(sent) if i notin punctuations]
['He', 'said', 'that', "'s", 'it']
>>> [i.strip("".join(punctuations)) for i in word_tokenize(sent) if i notin punctuations]
['He', 'said', 'that', 's', 'it']
Solution 2: remove punctuation then tokenize
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'>>> sent = '''He said,"that's it."'''>>> " ".join("".join([" "if ch in string.punctuation else ch for ch in sent]).split())
'He said that s it'>>> " ".join("".join([" "if ch in string.punctuation else ch for ch in sent]).split()).split()
['He', 'said', 'that', 's', 'it']
Solution 2:
If you want to tokenize your string all in one shot, I think your only choice will be to use nltk.tokenize.RegexpTokenizer
. The following approach will allow you to use punctuation as a marker to remove characters of the alphabet (as noted in your third requirement) before removing the punctuation altogether. In other words, this approach will remove *u*
before stripping all punctuation.
One way to go about this, then, is to tokenize on gaps like so:
>>> from nltk.tokenize import RegexpTokenizer
>>> s = '''He said,"that's it." *u* Hello, World.'''>>> toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
>>> toker.tokenize(s)
['He', 'said', 'that', 's', 'it', 'Hello', 'World'] # omits *u* per your third requirement
This should meet all three of the criteria you specified above. Note, however, that this tokenizer will not return tokens such as "A"
. Furthermore, I only tokenize on single letters that begin and end with punctuation. Otherwise, "Go." would not return a token. You may need to nuance the regex in other ways, depending on what your data looks like and what your expectations are.
Post a Comment for "How To Remove Punctuation?"