Python NLTK | nltk.tokenize.mwe()
With the help of the MWETokenizer class from the NLTK nltk.tokenize.mwe module, we can retokenize a stream of tokens so that each predefined multi-word expression is merged into a single token, with its parts joined by an underscore. Remember that matching is case sensitive.
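The case sensitivity can be checked with a small sketch. The expression ('New', 'York') and the sample sentences are illustrative assumptions and are not taken from the examples in this article.

```python
from nltk.tokenize import MWETokenizer

# Illustrative expression (an assumption for this sketch)
tk = MWETokenizer([('New', 'York')])

# Case matches the declared expression, so the tokens are merged
matched = tk.tokenize("I love New York".split())

# Case differs from the declared expression, so the tokens stay separate
unmatched = tk.tokenize("I love new york".split())

print(matched)
print(unmatched)
```

If case-insensitive matching is needed, one option is to lowercase both the declared expressions and the input tokens before calling tokenize().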
Syntax :
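MWETokenizer.tokenize(text)
Return : A list of tokens in which every sequence matching a previously declared multi-word expression is merged into one token. Here text is a list of tokens, not a raw string.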
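The underscore is only the default joiner. The MWETokenizer constructor also accepts a separator argument, and the sketch below shows its effect. The input sentence and the variable names tk_plus and joined are made up for illustration.

```python
from nltk.tokenize import MWETokenizer

# Use '+' instead of the default '_' to join the parts of an expression
tk_plus = MWETokenizer([('g', 'f', 'g')], separator='+')

# Illustrative input (an assumption for this sketch)
joined = tk_plus.tokenize("a g f g b".split())

print(joined)
```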
Example #1 :
In this example we use the MWETokenizer.tokenize() method to merge the token sequences that were declared when the tokenizer was created. More expressions can be added later with the tokenizer.add_mwe() method.
# import the MWETokenizer class from nltk
from nltk.tokenize import MWETokenizer

# Create an MWETokenizer instance with two multi-word expressions
tk = MWETokenizer([('g', 'f', 'g'), ('Beginner', 'for', 'Beginner')])

# Create a string input
gfg = "Beginner for Beginner g f g"

# Split the string into tokens, then merge the declared expressions
geek = tk.tokenize(gfg.split())

print(geek)
Output :
['Beginner_for_Beginner', 'g_f_g']
Example #2 :
# import the MWETokenizer class from nltk
from nltk.tokenize import MWETokenizer

# Create an MWETokenizer instance with two multi-word expressions
tk = MWETokenizer([('g', 'f', 'g'), ('Beginner', 'for', 'Beginner')])

# Add another multi-word expression after construction
tk.add_mwe(('who', 'are', 'you'))

# Create a string input
gfg = "who are you at Beginner for Beginner"

# Split the string into tokens, then merge the declared expressions
geek = tk.tokenize(gfg.split())

print(geek)
Output :
['who_are_you', 'at', 'Beginner_for_Beginner']