Sentence Autocomplete Using TensorFlow from Scratch ❤️

In this article, we will learn about sentence autocompletion using TensorFlow and walk through the main steps of an MLOps workflow. We will start by importing and cleaning the text, then create and fit the model, and then build a website using the Flask framework. Finally, we will deploy the website using Docker. The main goal of this article is to give a brief overview of how MLOps works and what its steps are.

Dataset for Sentence Autocompletion

We’ve used the Shakespeare Plays dataset, a CSV file that contains plays, characters, lines, and acts.

1. Creating the model for Sentence Autocompletion

Step 1: Importing necessary libraries

Let’s first import all the necessary libraries.

Python3

import re 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
  
import tensorflow as tf 
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences 
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout 
from tensorflow.keras.models import Sequential 
from tensorflow.keras.optimizers import Adam 
from tensorflow.keras.utils import to_categorical 
import pickle 
import warnings 
warnings.filterwarnings('ignore')

Step 2: Loading the dataset

Now, let’s load the dataset and take a look at its first few rows.

Python3

data = pd.read_csv('Shakespeare_data.csv') 
print(data.head())

Output:

   Dataline      Play  PlayerLinenumber ActSceneLine         Player  \
0         1  Henry IV               NaN          NaN            NaN   
1         2  Henry IV               NaN          NaN            NaN   
2         3  Henry IV               NaN          NaN            NaN   
3         4  Henry IV               1.0        1.1.1  KING HENRY IV   
4         5  Henry IV               1.0        1.1.2  KING HENRY IV   
                                          PlayerLine  
0                                              ACT I  
1                       SCENE I. London. The palace.  
2  Enter KING HENRY, LORD JOHN OF LANCASTER, the ...  
3             So shaken as we are, so wan with care,  
4         Find we a time for frighted peace to pant,

Step 3: Extracting text from the dataset

To autocomplete sentences, we need example sentences to train on. The output above shows that the “PlayerLine” column holds the text of each line, so let’s extract that column into a list called text.

Python3

# getting text from the data 
text = [] 
for i in data['PlayerLine']: 
    text.append(i) 
      
# lets see how the text is looking  
text[:5]

Output:

['ACT I',
 'SCENE I. London. The palace.',
 'Enter KING HENRY, LORD JOHN OF LANCASTER, the EARL of WESTMORELAND, SIR WALTER BLUNT, and others',
 'So shaken as we are, so wan with care,',
 'Find we a time for frighted peace to pant,']

Step 4: Cleaning Text

Let’s clean the text. This is an important step: afterwards, our sentences will not contain any digits or special characters, which reduces the number of unnecessary distinct words that the tokenizer generates in the next step.

Python3

# Text Cleaning 
def clean_text(text): 
    # removing special characters like @, #, $, etc 
    # (the range must be A-Z; A-z would also keep [, \, ], ^, _ and `) 
    pattern = re.compile(r'[^a-zA-Z0-9\s]') 
    text = re.sub(pattern,'',text) 
  
    # removing digits 
    pattern = re.compile(r'\d+') 
    text = re.sub(pattern,'',text) 
  
    # converting text to lower case 
    text = text.lower() 
    return text 
    
texts = [] 
for t in text: 
    new_text = clean_text(t) 
    texts.append(new_text) 
      
# cleaned text 
texts[:5]

Output:

['act i',
 'scene i london the palace',
 'enter king henry lord john of lancaster the earl of westmoreland sir walter blunt and others',
 'so shaken as we are so wan with care',
 'find we a time for frighted peace to pant']
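
The cleaning steps above can be tried on a single string without loading the dataset. The following is a minimal sketch, not the article's exact function: the names clean_line, NON_ALNUM and DIGITS and the sample string are made up for illustration. It writes the letter range as A-Z, because a range written as A-z would also match the characters between "Z" and "a" in ASCII, such as "[", "_" and "^".

```python
import re

# Keep only ASCII letters, digits and whitespace. The range is written
# A-Z on purpose: A-z would also keep [, \, ], ^, _ and `.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]")
DIGITS = re.compile(r"\d+")


def clean_line(line: str) -> str:
    """Strip special characters and digits, then lowercase."""
    line = NON_ALNUM.sub("", line)
    line = DIGITS.sub("", line)
    return line.lower()


# Hypothetical sample containing punctuation, an underscore and digits
sample = "SCENE I. London_42 [The palace]"
cleaned = clean_line(sample)
print(cleaned)
```

Compiling the patterns once at module level avoids recompiling them for every line, which may matter when cleaning the whole dataset.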

Step 5: Text vectorization and One hot encoding

1. Text vectorization: Let’s take the first 10,000 lines, create a Tokenizer from TensorFlow’s Keras API, and fit it on them. Then let’s generate text sequences, i.e. encode each word as an integer index. Next, we pad the sequences to the maximum sequence length. Here we use pre-padding, so zeros are added before the words of each sentence.

Python3

# let's take the first 10000 lines for the model training 
texts = texts[:10000] 
  
# using the tensorflow tokenizer and fitting it on the text 
tokenizer = Tokenizer() 
tokenizer.fit_on_texts(texts) 
  
# generating text sequences, i.e. encoding the text 
# (kept as a list of lists because the lines have different lengths) 
text_sequences = tokenizer.texts_to_sequences(texts) 
print('Text -->>',texts[0]) 
print('Embedding -->>',text_sequences[0]) 
  
# padding the sequences  
Max_Sequence_Len = max([len(x) for x in text_sequences]) 
text_sequences = pad_sequences(text_sequences,  
                               maxlen = Max_Sequence_Len, padding='pre') 
  
print('Maximum Sequence Length -->>',Max_Sequence_Len) 
print('Text Sequence -->>\n',text_sequences[0]) 
print('Text Sequence Shape -->>',text_sequences.shape)

Output:

Text -->> act i
Embedding -->> [455, 4]
Maximum Sequence Length -->> 54
Text Sequence -->>
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 455   4]
Text Sequence Shape -->> (10000, 54)
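
To make the two ideas in this step concrete, here is a small plain-Python sketch of what word indexing and pre-padding do. It only imitates the behaviour described above and is not the Keras implementation; the helper names build_word_index, encode and pre_pad and the toy lines are assumptions made for illustration.

```python
from collections import Counter


def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(lines, word_index):
    """Replace every known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in line.split() if w in word_index]
            for line in lines]


def pre_pad(sequences, maxlen):
    """Left-pad with zeros; longer sequences keep their last maxlen items."""
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in sequences]


toy_lines = ["so shaken as we are", "so wan with care", "act i"]
word_index = build_word_index(toy_lines)
sequences = encode(toy_lines, word_index)
max_len = max(len(s) for s in sequences)
padded = pre_pad(sequences, max_len)
for row in padded:
    print(row)
```

Index 0 is never assigned to a word, so it can safely be used as the padding value. This is also why the vocabulary size used later is the number of words plus one.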

2. Splitting the dataset and one-hot encoding: Let’s split the dataset into X (input) and y (output). This is a many-to-one text generation case. For a sequence of length n, we use the first n-1 values as the input and the n-th value as the target y.

Every text sequence is pre-padded to the maximum sequence length, which is 54. We therefore put the first 53 values of each sequence in X as the input and the last value, which is the last word of the line, in y as the target. The shape of X, (10000, 53), confirms that it has 53 columns.
Since the target is a single word out of the whole vocabulary, we one-hot encode y. This is done in the following code.

Python3

# getting X and y from the data 
X, y = text_sequences[:, :-1], text_sequences[:,-1] 
print('First Input :',X[0]) 
print('First Target :',y[0]) 
  
word_index = tokenizer.word_index 
  
# using one hot encoding on y  
total_words = len(word_index) + 1
print('Total Number of Words:',total_words) 
  
y = to_categorical(y, num_classes=total_words) 
  
# printing X and y shapes 
print('Input Shape :',X.shape) 
print('Target Shape :',y.shape)

Output:

First Input : [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 455]
First Target : 4
Total Number of Words: 7865
Input Shape : (10000, 53)
Target Shape : (10000, 7865)
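
The split and the one-hot encoding can be illustrated on a tiny example in plain Python. This is only a sketch of the idea; the function names split_inputs_targets and one_hot and the toy data are assumptions, and the article itself uses array slicing and to_categorical for the same purpose.

```python
def split_inputs_targets(padded):
    """All but the last column become X; the last column becomes y."""
    X = [row[:-1] for row in padded]
    y = [row[-1] for row in padded]
    return X, y


def one_hot(labels, num_classes):
    """Turn each integer label into a vector with a single 1."""
    return [[1 if i == label else 0 for i in range(num_classes)]
            for label in labels]


# Toy pre-padded sequences (hypothetical data)
padded = [[1, 2, 3, 4, 5], [0, 1, 6, 7, 8], [0, 0, 0, 9, 10]]
vocab_size = 10                 # number of distinct words in the toy data
total_words = vocab_size + 1    # +1 because index 0 is reserved for padding

X, y = split_inputs_targets(padded)
y_one_hot = one_hot(y, total_words)
print(X)
print(y)
print(y_one_hot[0])
```

Each one-hot row is as long as the vocabulary, so the target matrix grows quickly with the number of words. Keras also provides the sparse_categorical_crossentropy loss, which accepts integer targets directly and could be considered if memory becomes a concern.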

Step 6: Building the model

Now, let’s build out the LSTM model for the autocompletion of sentences.

Embedding Layer: We’ll add an embedding layer whose first argument is total_words, the vocabulary size. The second argument is the dimension of the vector space in which words are embedded; here it is set to Max_Sequence_Len - 1, i.e. 53. The input_length is also 53 (Max_Sequence_Len - 1, i.e., 54-1), because each input sentence has 53 values.
LSTM (Long Short Term Memory): Next, we will add an LSTM layer with 512 units. Here return_sequences is set to False, so the layer returns only the last hidden state.
Dense Layer: Finally, we’ll add a Dense layer with total_words units and the softmax activation, which outputs a probability for every word in the vocabulary.
We’ve also added a Dropout layer to reduce overfitting.
Let’s see the implementation in the following code.

Python3

model = Sequential(name="LSTM_Model") 
  
# adding embedding layer 
model.add(Embedding(total_words,  
                    Max_Sequence_Len-1,  
                    input_length=Max_Sequence_Len-1)) 
  
# adding a LSTM layer  
model.add(LSTM(512, return_sequences=False)) 
model.add(Dropout(0.5)) 
  
# adding the final Dense output layer with softmax activation 
model.add(Dense(total_words, activation='softmax')) 
  
# printing model summary 
print(model.summary())

Output:

Model: "LSTM_Model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 53, 53)            416845    
                                                                 
 lstm (LSTM)                 (None, 512)               1159168   
                                                                 
 dropout (Dropout)           (None, 512)               0         
                                                                 
 dense (Dense)               (None, 7865)              4034745   
                                                                 
=================================================================
Total params: 5,610,758
Trainable params: 5,610,758
Non-trainable params: 0
_________________________________________________________________
None
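
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the function names are made up for illustration, and the sizes are taken from the summary above.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per word in the vocabulary
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units


total_words, embed_dim, lstm_units = 7865, 53, 512

n_embedding = embedding_params(total_words, embed_dim)
n_lstm = lstm_params(embed_dim, lstm_units)
n_dense = dense_params(lstm_units, total_words)
n_total = n_embedding + n_lstm + n_dense

print(n_embedding, n_lstm, n_dense, n_total)
```

The Dropout layer has no weights, so it does not contribute to the total. Most of the parameters sit in the final Dense layer, because it has one output unit per word in the vocabulary.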

Step 7: Compiling and Training the Model

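For compiling the model, we will use categorical cross-entropy as the loss, since the labels are one-hot encoded. We’ll use the Adam optimizer and accuracy as the metric. Then we will train the model for 50 epochs. Note that no validation data is passed to fit, so the reported accuracy is measured on the training set.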

Python3

# Compiling the model 
model.compile( 
    loss="categorical_crossentropy", 
    optimizer='adam', 
    metrics=['accuracy'] 
) 
  
# Training the LSTM model 
history = model.fit(X, y, 
                       epochs=50, 
                       verbose=1)

Output:

Epoch 1/50
313/313 [==============================] - 18s 44ms/step - loss: 7.9333 - accuracy: 0.0114
Epoch 2/50
313/313 [==============================] - 5s 17ms/step - loss: 7.3617 - accuracy: 0.0172
Epoch 3/50
313/313 [==============================] - 5s 16ms/step - loss: 7.1117 - accuracy: 0.0244
Epoch 4/50
313/313 [==============================] - 4s 14ms/step - loss: 6.8334 - accuracy: 0.0297
Epoch 5/50
313/313 [==============================] - 4s 12ms/step - loss: 6.5347 - accuracy: 0.0356
Epoch 6/50
313/313 [==============================] - 4s 14ms/step - loss: 6.2033 - accuracy: 0.0424
Epoch 7/50
313/313 [==============================] - 4s 14ms/step - loss: 5.8106 - accuracy: 0.0550
Epoch 8/50
313/313 [==============================] - 4s 11ms/step - loss: 5.3727 - accuracy: 0.0734
Epoch 9/50
313/313 [==============================] - 4s 13ms/step - loss: 4.8842 - accuracy: 0.0998
Epoch 10/50
313/313 [==============================] - 4s 11ms/step - loss: 4.3616 - accuracy: 0.1479
Epoch 11/50
313/313 [==============================] - 4s 12ms/step - loss: 3.8136 - accuracy: 0.2175
Epoch 12/50
313/313 [==============================] - 4s 13ms/step - loss: 3.2721 - accuracy: 0.3033
Epoch 13/50
313/313 [==============================] - 4s 11ms/step - loss: 2.7658 - accuracy: 0.4034
Epoch 14/50
313/313 [==============================] - 4s 12ms/step - loss: 2.3215 - accuracy: 0.4981
Epoch 15/50
313/313 [==============================] - 4s 12ms/step - loss: 1.9347 - accuracy: 0.5872
Epoch 16/50
313/313 [==============================] - 4s 11ms/step - loss: 1.5971 - accuracy: 0.6602
Epoch 17/50
313/313 [==============================] - 4s 11ms/step - loss: 1.3542 - accuracy: 0.7134
Epoch 18/50
313/313 [==============================] - 4s 13ms/step - loss: 1.1341 - accuracy: 0.7649
Epoch 19/50
313/313 [==============================] - 4s 12ms/step - loss: 0.9632 - accuracy: 0.8013
Epoch 20/50
313/313 [==============================] - 4s 12ms/step - loss: 0.8233 - accuracy: 0.8348
Epoch 21/50
313/313 [==============================] - 4s 11ms/step - loss: 0.7023 - accuracy: 0.8637
Epoch 22/50
313/313 [==============================] - 4s 11ms/step - loss: 0.6235 - accuracy: 0.8786
Epoch 23/50
313/313 [==============================] - 4s 12ms/step - loss: 0.5537 - accuracy: 0.8945
Epoch 24/50
313/313 [==============================] - 4s 13ms/step - loss: 0.5086 - accuracy: 0.8969
Epoch 25/50
313/313 [==============================] - 4s 12ms/step - loss: 0.4629 - accuracy: 0.9075
Epoch 26/50
313/313 [==============================] - 4s 12ms/step - loss: 0.4205 - accuracy: 0.9201
Epoch 27/50
313/313 [==============================] - 4s 12ms/step - loss: 0.3947 - accuracy: 0.9210
Epoch 28/50
313/313 [==============================] - 4s 12ms/step - loss: 0.3685 - accuracy: 0.9276
Epoch 29/50
313/313 [==============================] - 4s 12ms/step - loss: 0.3534 - accuracy: 0.9275
Epoch 30/50
313/313 [==============================] - 4s 12ms/step - loss: 0.3257 - accuracy: 0.9323
Epoch 31/50
313/313 [==============================] - 4s 11ms/step - loss: 0.3175 - accuracy: 0.9317
Epoch 32/50
313/313 [==============================] - 4s 11ms/step - loss: 0.3131 - accuracy: 0.9329
Epoch 33/50
313/313 [==============================] - 4s 12ms/step - loss: 0.2985 - accuracy: 0.9387
Epoch 34/50
313/313 [==============================] - 4s 12ms/step - loss: 0.2844 - accuracy: 0.9385
Epoch 35/50
313/313 [==============================] - 4s 12ms/step - loss: 0.2781 - accuracy: 0.9413
Epoch 36/50
313/313 [==============================] - 4s 11ms/step - loss: 0.2662 - accuracy: 0.9437
Epoch 37/50
313/313 [==============================] - 4s 11ms/step - loss: 0.2526 - accuracy: 0.9441
Epoch 38/50
313/313 [==============================] - 4s 12ms/step - loss: 0.2492 - accuracy: 0.9464
Epoch 39/50
313/313 [==============================] - 4s 11ms/step - loss: 0.2449 - accuracy: 0.9458
Epoch 40/50
313/313 [==============================] - 4s 11ms/step - loss: 0.2426 - accuracy: 0.9467
Epoch 41/50
313/313 [==============================] - 4s 12ms/step - loss: 0.2334 - accuracy: 0.9474
Epoch 42/50
313/313 [==============================] - 4s 11ms/step - loss: 0.2340 - accuracy: 0.9471
Epoch 43/50
313/313 [==============================] - 4s 12ms/step - loss: 0.2344 - accuracy: 0.9472
Epoch 44/50
313/313 [==============================] - 4s 11ms/step - loss: 0.2302 - accuracy: 0.9468
Epoch 45/50
313/313 [==============================] - 4s 12ms/step - loss: 0.2090 - accuracy: 0.9541
Epoch 46/50
313/313 [==============================] - 4s 11ms/step - loss: 0.2077 - accuracy: 0.9525
Epoch 47/50
313/313 [==============================] - 4s 11ms/step - loss: 0.1958 - accuracy: 0.9553
Epoch 48/50
313/313 [==============================] - 4s 11ms/step - loss: 0.2063 - accuracy: 0.9518
Epoch 49/50
313/313 [==============================] - 4s 12ms/step - loss: 0.1970 - accuracy: 0.9547
Epoch 50/50
313/313 [==============================] - 4s 11ms/step - loss: 0.1896 - accuracy: 0.9551
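
Two numbers in the training log can be sanity-checked with a little arithmetic. The sketch below assumes the Keras default batch size of 32, since fit was called without a batch_size argument, and computes the cross-entropy of a model that assigns the same probability to every word, which gives a rough baseline for the first-epoch loss.

```python
import math

num_samples = 10000
batch_size = 32          # Keras default when fit() is called without batch_size
total_words = 7865

# Number of batches needed to see every sample once
steps_per_epoch = math.ceil(num_samples / batch_size)

# Cross-entropy of a uniform guess over the vocabulary: -log(1 / total_words)
uniform_loss = -math.log(1.0 / total_words)

print(steps_per_epoch)
print(round(uniform_loss, 2))
```

The step count matches the 313/313 shown for each epoch. The uniform-guess loss is roughly 8.97, so the first-epoch loss of about 7.93 shows that the model is already doing somewhat better than guessing at random.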

Step 8: Sentence Autocomplete

Now that our model is trained, let’s autocomplete a sentence with it. We first use the tokenizer to transform the incomplete sentence into a sequence of word indices. Then we pre-pad it to the maximum sequence length minus 1, i.e. 53 in this case. We pass this input to the model and get its prediction, which is a vector of probabilities over the vocabulary. We get the index of the most probable word using the np.argmax function and then map that index back to a word using tokenizer.word_index.

Python3

def autoCompletations(text, model): 
    # Tokenization and Text vectorization 
    # (texts_to_sequences expects a list of texts) 
    text_sequences = tokenizer.texts_to_sequences([text]) 
    # Pre-padding 
    testing = pad_sequences(text_sequences, maxlen = Max_Sequence_Len-1, padding='pre') 
    # Prediction 
    y_pred_test = np.argmax(model.predict(testing,verbose=0)) 
      
    predicted_word = '' 
    for word, index in tokenizer.word_index.items(): 
        if index == y_pred_test: 
            predicted_word = word 
            break
    text += " " + predicted_word + '.'
    return text  
    
complete_sentence = autoCompletations('I have seen this', model) 
complete_sentence

Output:

'I have seen this king.'
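
The last part of the function, going from a probability vector back to a word, can be sketched in isolation. The example below is illustrative only: the word_index contents and the probabilities are invented, and build_index_word and argmax are hypothetical helper names. It inverts the mapping once instead of looping over word_index for every prediction.

```python
def build_index_word(word_index):
    """Invert a word -> index mapping once, so lookups are O(1)."""
    return {index: word for word, index in word_index.items()}


def argmax(values):
    """Position of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


word_index = {"the": 1, "king": 2, "lord": 3}
index_word = build_index_word(word_index)

# Hypothetical softmax output; position 0 is the padding index.
probabilities = [0.05, 0.20, 0.60, 0.15]

best = argmax(probabilities)
predicted_word = index_word.get(best, "")
print(predicted_word)
```

The Keras Tokenizer also keeps such an inverted mapping in its index_word attribute, which could be used in place of the loop over word_index.items().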

We can also repeat the above process to add n words to our sentence. However, it may not give the desired output, because this is a many-to-one model trained to predict only the last word of a line. For this step, we simply call autoCompletations in a for loop; the [:-1] slice removes the trailing period before the next word is predicted.

Python3

def generate_text(text, new_words): 
    for _ in range(new_words): 
        text = autoCompletations(text, model)[:-1] 
    return text  
    
generated_text = generate_text('I have seen', 5) 
generated_text

Output:

'I have seen lord me death heart king'
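
The loop itself does not depend on the model, so its logic can be shown with a stand-in predictor. In the sketch below, NEXT_WORD and predict_next_word are hypothetical placeholders for the trained model, and generate is an illustrative variant of the loop that stops early when no word can be predicted.

```python
# Hypothetical stand-in for the trained model: maps the last word of the
# text to a "predicted" next word.
NEXT_WORD = {"seen": "the", "the": "king", "king": "of", "of": "england"}


def predict_next_word(text):
    """Return the next word for the given text, or '' if none is known."""
    words = text.split()
    if not words:
        return ""
    return NEXT_WORD.get(words[-1].lower(), "")


def generate(text, new_words):
    """Append up to new_words words, one greedy prediction at a time."""
    for _ in range(new_words):
        word = predict_next_word(text)
        if not word:       # stop early if nothing can be predicted
            break
        text = text + " " + word
    return text


print(generate("I have seen", 3))
```

Each predicted word is appended to the text before the next prediction is made, so an early mistake influences every word that follows. This is one reason why longer generated sequences tend to drift.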

Step 9: Saving the model and the tokenizer

Let’s save our model in the form of a .h5 file and also the tokenizer in pickle format. We’ll have to use both of them for the Flask website.

Python3

# saving the model 
model.save('sentence_completion.h5') 
  
# saving the tokenizer 
filename = 'tokenizer.pkl'
with open(filename, 'wb') as file: 
    pickle.dump(tokenizer, file)
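
Before relying on the saved file in the web app, it can be worth checking that an object survives a save and load round trip. The sketch below uses a plain dictionary as a stand-in for the fitted tokenizer and a temporary directory, both of which are assumptions made so that the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer: any picklable Python object works the same way.
tokenizer_state = {"word_index": {"the": 1, "king": 2}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.pkl")

    # Save: the with-block closes the file even if dump() raises
    with open(path, "wb") as file:
        pickle.dump(tokenizer_state, file)

    # Load it back, as the Flask app will do at startup
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == tokenizer_state)
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.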

2. Creating the Flask website

Step 1: Creating the HTML files

Let’s first create the HTML files that will work as the frontend of our application. We will create them in the templates folder.

Our first file is index.html. This file contains a form where the user will input the starting text and then the number of words to generate.

HTML

<!DOCTYPE html> 
<html> 
    <head> 
        <title>Sentence Autocompletion App</title> 
    </head> 
      
    <body> 
        <form name="form" action="/generate" method="POST" style="text-align: center;"> 
            <br> 
            Text: <input type="text" name="Text" placeholder="Enter Text" required/> 
            <br><br> 
            No Of Words: <input type="number" name="NoOfWords" placeholder="Enter the no of new words" required/> 
            <br><br> 
            <button value="Submit">Generate Text</button> 
        </form> 
    </body> 
  
</html>

Next is generate.html, which shows the same form together with the generated text.

HTML

<!DOCTYPE html> 
<html> 
    <head> 
        <title>Sentence Autocompletion App</title> 
    </head> 
      
    <body> 
        <form name="form" action="/generate" method="POST" style="text-align: center;"> 
            <br> 
            Text: <input type="text" name="Text" placeholder="Enter Text" required/> 
            <br><br> 
            No Of Words: <input type="number" name="NoOfWords" placeholder="Enter the no of new words" required/> 
            <br><br> 
            <button value="Submit">Generate Text</button> 
        </form> 
        <p style="text-align: center;">{{ output }}</p> 
    </body> 
      
</html>

Step 2: Creating app.py file

This is the main file, which we run to start the website. It takes the inputs from the form, uses the model and tokenizer to autocomplete the sentence, and shows the output on the website.

Python3

from flask import Flask, render_template, request 
import pickle 
import numpy as np 
from tensorflow.keras.preprocessing.sequence import pad_sequences 
import tensorflow as tf 
import re 
  
app = Flask(__name__, template_folder='templates') 
  
model =  tf.keras.models.load_model('sentence_completion.h5') 
with open("tokenizer.pkl", 'rb') as file: 
    tokenizer = pickle.load(file) 
  
@app.route('/') 
def home(): 
    return render_template('index.html') 
  
# Autocompletations function 
def autoCompletations(text, model): 
    # Tokenization and Text vectorization 
    # (texts_to_sequences expects a list of texts) 
    text_sequences = tokenizer.texts_to_sequences([text]) 
    # Pre-padding 
    testing = pad_sequences(text_sequences, maxlen = 53, padding='pre') 
    # Prediction 
    y_pred_test = np.argmax(model.predict(testing,verbose=0)) 
      
    predicted_word = '' 
    for word, index in tokenizer.word_index.items(): 
        if index == y_pred_test: 
            predicted_word = word 
            break
    text += " " + predicted_word + '.'
    return text  
# Generate text function 
def generate_text(text, new_words): 
    for _ in range(new_words): 
        text = autoCompletations(text, model)[:-1] 
    return text 
  
  
@app.route('/generate', methods=['GET', 'POST']) 
def generate(): 
  
    # If a form is submitted 
    if request.method == "POST": 
          
        # Get values through input bars 
        text = request.form.get("Text") 
        # form values arrive as strings, so convert the count to int 
        no_of_words = int(request.form.get("NoOfWords")) 
      
        # Get prediction from the generate_text function written above 
        generated_text = generate_text(text, no_of_words) 
          
    else: 
        generated_text = "" 
          
    return render_template("generate.html", output = generated_text) 
  
  
# Running the app 
if __name__ == "__main__": 
    app.run(debug=True)
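
Values submitted through an HTML form reach the server as strings and may be missing or malformed, so the word count is worth validating before it is used in a loop. The helper below is a hypothetical sketch: the name parse_word_count, the default of 1 and the upper limit of 50 are assumptions, not part of the original app.

```python
def parse_word_count(raw, default=1, maximum=50):
    """Convert a form value to an int between 1 and maximum.

    Falls back to default when the value is missing or not a number.
    """
    try:
        count = int(raw)
    except (TypeError, ValueError):
        return default
    # Clamp to a sensible range so a huge value cannot stall the server
    return max(1, min(count, maximum))


print(parse_word_count("5"))
print(parse_word_count(None))
print(parse_word_count("999"))
```

In the generate view, such a helper could be applied to the value returned by request.form.get("NoOfWords") before it is passed on to generate_text.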

Now, let’s run the project using the following command :

flask run

Step 3: Making a requirements.txt

Let’s create the requirements.txt file using the pip freeze command.

pip freeze > requirements.txt

Step 4: Publishing the code to GitHub

Now, we will have to publish this code to GitHub, for further deployment. For this, you can follow this GitHub article.

3. Step by Step Process for Deploying the project using Docker

For this step, docker must be installed on your machine. You can refer to this article on docker installation.

First, check whether Docker is installed on the system by running the following command in the terminal.

$ docker --version

Now, let’s start the deployment process.

Step 1: Cloning the project

Let’s clone the project onto the virtual machine from GitHub. Go to the GitHub repository, copy its link, and use it in the following command to clone the project.

$ git clone https://github.com/<project_name>.git

After this let’s go into the project directory and take a look at the files that are present in the project folder.

$ cd <project_name>
$ ls

Output:

app.py requirements.txt sentence_completion.h5 templates tokenizer.pkl

Step 2: Create the Dockerfile

The next step is to create the Dockerfile. For this, we will use the following command:

$ vi Dockerfile

This is what the Dockerfile looks like. For more information on Docker, refer to this article on Concepts of Dockerfile, where each step is explained briefly.

Step 3: Building an image by using the docker file

The next step is to build the image with the docker build command. Here ‘:1’ is the image tag, which marks this as the first build, and the trailing dot sets the current directory as the build context.

$ docker build -t username/gfg:1 .

Step 4: Running container

Now, let’s run a container from the image that we have built. For this, we need the image ID, which we can get from the docker image ls command. Find the ID of the latest build in its output and put it in the docker run command.

$ docker image ls
$ docker run -d -p 5000:5000 <image_id>

Step 5: Checking if the container is running or not

Let’s check if the container is running successfully with the following command.

$ docker ps

Step 6: Accessing the website by using the URL from the internet

Now, let’s access the website from its URL on the internet. This is what the website looks like.
