
Text Generation: Using Techniques Like Markov Models or LSTM Networks to Generate Realistic Text in a Specific Style or Genre

Hariprasad Boddepalli
School of Computer Science and Engineering
Lovely Professional University
Punjab, India
hariprasadboddepalli4@gmail.com

Keywords—machine learning, deep learning, algorithms, LSTM, Markov models, one-hot encoding, Tokenizer

I. INTRODUCTION

Text generation is a fast-emerging area of research in natural language processing and machine learning: the development of models that autonomously generate human-like text from input data. It constitutes one of the most interesting and impactful challenges in artificial intelligence. Text generation has applications in many domains, such as automated content creation, chatbots, machine translation, creative writing, and information retrieval. When machines can understand and produce coherent text pertinent to the context, industries such as communication, entertainment, and education stand to be transformed in practice.

At its root, text generation rests on the capacity to model the structure of natural language and to predict the probability of a word or phrase occurring in a particular context. This is a daunting task owing to the intrinsically complicated and variable nature of language. This paper discusses the methods and techniques applied in training text generation models, from data preprocessing and model architecture design to the evaluation of the generated text. It also discusses some of the challenges and limitations of these models, mainly the coherence and contextual validity of the generated text. Through a detailed exploration of the body of knowledge in the area, the paper reflects on the future outlook of text generation and its possible implications for industries and society at large.
II. METHODOLOGIES

A. DATASET SOURCE

The data were taken from the Kaggle Customer Churn Prediction 2020 dataset.[1] This dataset describes customer recharge plans, customer usage, and customer messages.

III. PREPROCESSING

1. TEXT LOADING AND READING

Text loading and reading are pre-processing steps in the preparation of raw textual data for machine learning tasks. Raw textual data usually lies outside the program in external files and therefore has to be accessed from those files and loaded into memory for processing. This can easily be achieved using the file-handling facilities of any programming language, such as Python's built-in file input/output functions. After opening a file, the data can be read at once as a single large string or as a list of strings, depending on the text structure. Specifying the encoding, such as UTF-8, ensures that special symbols and non-ASCII characters are read without errors. After reading the data, basic cleaning is applied: stripping whitespace, removing punctuation, and adjusting inconsistent formatting. The cleaned text can then be subjected to further processing such as tokenization and sequence formation. Hence, correct text loading ensures that the data is represented in the structured format used to train the machine learning models.
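As a minimal sketch of this loading and cleaning step (the file name "corpus.txt" and the exact cleaning choices are illustrative assumptions, not taken from the paper's code):

import re
import string

def load_text(path="corpus.txt"):  # hypothetical file name
    # Read the whole file as one string; specifying UTF-8 ensures that
    # non-ASCII characters are decoded without errors.
    with open(path, "r", encoding="utf-8") as f:
        raw = f.read()
    # Basic cleaning: lower-case, strip punctuation, collapse whitespace.
    text = raw.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()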



2. SEQUENCE CREATION

An important step of the pre-processing pipeline for training a text generation model is sequence creation. This step transforms the raw text into a form that can be fed to a machine learning algorithm. Through it, the model learns the patterns that govern word order, sentence structure, and syntactic dependencies in the data. The aim is to divide the text into segments that the model can use to predict subsequent words or phrases. Sequence creation begins by dividing the cleaned text into units such as words or tokens. Each sequence represents a fixed-size window of words that will be used to predict the next word. For instance, if the text contains the sentence "The quick brown fox jumps," one sequence might consist of the first three words, ["The", "quick", "brown"], with the model trying to guess the next word, "fox."

Once the sequences are created and encoded, they are divided into input-output pairs. The input is the sequence of words; the output is the next word in the sequence. These input-output pairs are used to train the model so that it learns the word relationships in the text and predicts the next word from the words that precede it.
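A short sketch of this windowing step, assuming word-level tokens and a window of three words (both illustrative choices, not values fixed by the paper):

def make_sequences(tokens, seq_length=3):
    # Slide a fixed-size window over the tokens and pair each window with
    # the word that follows it, e.g. (["the", "quick", "brown"], "fox").
    pairs = []
    for i in range(len(tokens) - seq_length):
        pairs.append((tokens[i:i + seq_length], tokens[i + seq_length]))
    return pairs

# make_sequences("the quick brown fox jumps".split())
# -> [(['the', 'quick', 'brown'], 'fox'), (['quick', 'brown', 'fox'], 'jumps')]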
3. TOKENIZATION

Tokenization can be considered one of the most elementary steps of NLP. It refers to breaking a text into smaller units called tokens: words, subwords, phrases, or even characters, depending on the application and context. Its importance lies in turning unstructured text data into a structured format, which is essential for analysis and for subsequent processing by the machine learning model. In deep learning, especially for applications such as text classification, sentiment analysis, and text generation, proper tokenization has a direct impact on the performance and accuracy of the model.

In this implementation, tokenization is performed with the Tokenizer class from TensorFlow's Keras library. It transforms the text body, here a poem, into a format the machine learning model can understand. The process is broken down into several steps and is important for preparing textual data before training an RNN model with LSTM layers. It begins with the creation of a Tokenizer object. The class provides a set of methods for tokenization, such as fitting the tokenizer to a corpus and converting text into sequences of integers. The initialization parameters of the Tokenizer class are customizable, for example the maximum number of words to keep and the filters for characters or symbols to ignore.

The tokenizer builds the vocabulary by reading through the text and identifying unique words. It maps a unique integer index to each word according to its frequency in the dataset. This indexing is particularly important because the text is now translated into numbers that the model can use for training and prediction. The highest indices correspond to the least frequently appearing words in the text, which also gives insight into the data and ensures that common patterns are learned well by the model.

Tokenization not only prepares the data for modeling but also plays a central role in the generation process itself. The sequences produced by tokenization are the core input used to train the LSTM model. Capturing the hidden patterns and structures in these sequences allows the model to generate coherent and contextually relevant output once a seed input is provided.

Tokenization thus establishes a systematic approach to handling text data, so that large volumes of text can be used effectively and efficiently to train deep learning models. When the model generates new text at each iteration, it predicts the next words of a sequence with respect to the vocabulary and numerical representations set up during tokenization. For this reason, an effective tokenization strategy is an important factor in the success of NLP tasks.
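A minimal sketch of this step with the Keras Tokenizer (the corpus variable stands in for the poem text used in the paper):

from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["the quick brown fox jumps over the lazy dog"]  # placeholder text

# Fit the tokenizer on the corpus: this builds the word-to-integer mapping,
# with the most frequent words receiving the lowest indices.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

# Convert the text into sequences of integer indices.
encoded = tokenizer.texts_to_sequences(corpus)
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding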

4. PADDING

Padding is one of the significant preprocessing techniques in natural language processing, and it addresses the problem of variable-length sequences. In many NLP applications, such as text classification, sequence labeling, and text generation, models require inputs of a consistent size; this is the case for recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Text data, however, are naturally of unequal length: sentences, paragraphs, and documents can have dramatically different numbers of words or tokens. Padding is used to standardize the length of the input sequences, making it easier for machine learning models to process the data.

In this code, padding is used together with tokenization to get the text data ready for training the LSTM model. LSTMs, a type of RNN, work particularly well with sequence data but require sequences of the same length. The padding implementation is provided by the pad_sequences function from the tensorflow.keras.preprocessing.sequence module. This function shapes the tokenized sequences so that they fit the input requirements of the LSTM model.

In the given code, padding transforms variable-length tokenized sequences into a uniform format suited for training the LSTM model, and the use of pad_sequences also improves batch-processing performance. While padding imposes structured input for deep learning models, its implications and potential drawbacks need to be considered; padding should therefore be applied carefully so that the model learns well while information loss is minimized and the training process is optimized.
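A short sketch of the padding step, assuming a fixed input length of three (an illustrative value):

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[4, 12], [7, 3, 9], [5]]   # toy integer-encoded sequences
seq_length = 3                          # assumed fixed input length

# Pad (or truncate) every sequence to seq_length; 'pre' padding keeps the
# most recent words at the end, which suits next-word prediction.
padded = pad_sequences(sequences, maxlen=seq_length, padding="pre")
# padded -> [[0 4 12], [7 3 9], [0 0 5]]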
"bigger than" or "of a greater importance than" the word
indexed by 2. One-hot encoding avoids the above accuracy, it implies that the model learns and makes better
misinterpretation because it considers each word as an predictions; if it does not improve or even worsens, then it
individual case. may be overfitting in nature, which means that the model
memorizes the training data rather than learning
generalizable patterns.
Simply put, the technique of one-hot encoding is the most
significant technique in the preprocessing pipeline of natural Other metrics of evaluation include loss and perplexity.
language processing, especially in preparing categorical data Perplexity, in particular, is another very common metric used
for machine learning models. Below is an example code in language models. It calculates the uncertainty over the
where one-hot encoding is applied to the target variable and model's predictions. It is closely related to the cross-entropy
transforms integer-encoded words into binary vectors that loss. The lower the value of perplexity, the better the
will make the neural network learn effectively. One-hot performance, as it means more confident predictions by the
encoding is necessary since it allows the representation of model.
categorical data in a manner that avoids ordinal
misunderstandings, further leaning with requirements of The development of the methods is interested in the control
model training. Although the said advantages are enormous, generated text; there is usually a style, tone, or even content.
it is crucial to beware of the dangers of increased For instance, applications could include generating text that
dimensionality and sparsity. Thus one-hot encoding would follows guidelines or objectives, like a marketing message for
prove to be a basic and flexible resource in the whole scenario a brand or even educational material within a certain subject
of text representation and machine learning. area.

The broader societal implications of text generation are deep
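A minimal sketch of one-hot encoding the target words with Keras utilities (the vocabulary size and indices are placeholders):

import numpy as np
from tensorflow.keras.utils import to_categorical

vocab_size = 10                  # placeholder vocabulary size
y_int = np.array([3, 7, 2])      # integer indices of the target words

# Each target index becomes a binary vector of length vocab_size with a
# single 1 at that index, so no ordering between words is implied.
y_onehot = to_categorical(y_int, num_classes=vocab_size)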


IV. TRAINING

Text generation is a relatively new frontier in NLP and machine learning, concerned with developing models that can autonomously generate human-like text given input data, and it remains one of the most interesting and influential tasks in artificial intelligence today. Major applications include automated content generation, chatbots, machine translation, creative writing, and information retrieval. Ultimately, such systems will shape how whole industries approach communication, entertainment, and education.

Training is carried out over several epochs, where an epoch is one complete pass through the whole dataset. In each epoch, backpropagation iteratively adjusts the model's parameters, namely its weights and biases, based on the gradient of the loss function at the current parameter values. With the calculated gradients, the weights are updated in the direction that minimizes the loss, improving the model's predictions.

All training is conducted in mini-batch mode, meaning that the data is divided into small parts referred to as batches. The batch size determines how many input-output pairs are processed in parallel before the model's weights are updated. For instance, with a batch size of 32 the model processes 32 sequences simultaneously. Mini-batch training reduces memory usage and helps the model generalize better, since the weights are updated more frequently than when training is performed on the full dataset at once.

During training, the model's performance is monitored to verify that it is learning properly. This is typically done by measuring the model's accuracy on a validation dataset after each epoch. The validation set is separate, independent data that is not used for training but for checking whether the model generalizes to unseen data. If validation accuracy increases, the model is learning and making better predictions; if it stops improving or worsens, the model may be overfitting, i.e., memorizing the training data rather than learning generalizable patterns.

Other evaluation metrics include loss and perplexity. Perplexity in particular is a very common metric for language models: it measures the uncertainty of the model's predictions and is closely related to the cross-entropy loss. The lower the perplexity, the better the performance, as it indicates more confident predictions.

There is also interest in methods that control the generated text, typically its style, tone, or content. For instance, applications could include generating text that follows guidelines or objectives, such as a marketing message for a brand or educational material on a specific subject.

The broader societal implications of text generation are profound. As these models become more advanced, they may eventually replace or augment human roles in many industries. At the same time, this raises important questions about the future of work, the ethics of using AI, and potential misuse such as generating fake news or misleading content. Ongoing discussion between scientists, policymakers, and industry is required so that these technologies are applied responsibly for the betterment of society.

This paper discusses several methods and techniques used for training text generation models, preprocessing data, developing model architectures, and evaluating the generated text. In the following sections we also discuss the limitations and weaknesses of these models, especially their ability to produce coherent text in the relevant context. Through an in-depth analysis of the current state of the field, we hope to shed light on future directions of text generation and their wider impact on industries and society at large.
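A hedged sketch of this training setup (layer sizes, epoch count, batch size, and the toy data are illustrative assumptions, not values reported in the paper); the last line shows how perplexity follows from the cross-entropy loss:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.utils import to_categorical

vocab_size, seq_length = 1000, 3   # assumed values

# Toy stand-ins for the real data: X holds padded integer sequences,
# y the one-hot encoded next words (built as in the preprocessing steps).
X = np.random.randint(1, vocab_size, size=(2000, seq_length))
y = to_categorical(np.random.randint(1, vocab_size, size=2000), num_classes=vocab_size)

# Next-word prediction model: embedding -> LSTM -> softmax over the vocabulary.
model = Sequential([
    Embedding(vocab_size, 64),
    LSTM(128),
    Dense(vocab_size, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Mini-batch training with a held-out validation split monitored each epoch.
history = model.fit(X, y, epochs=10, batch_size=32, validation_split=0.1)

# Perplexity is the exponential of the cross-entropy loss.
perplexity = np.exp(history.history["val_loss"][-1])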
V. LITERATURE REVIEW

One way to understand the vanishing gradient problem is that the network has to send information through long sequences, and with longer sequences the gradients diminish during backpropagation through time (BPTT). To rectify this problem, LSTMs were introduced. They can hold on to long-term information by propagating only the relevant information back. LSTMs are a kind of RNN in which, instead of a single activation function in the hidden layer, there are several gates:[1] the input gate, the forget gate, the output gate, and the cell state. Using these gates, the network determines at every time step which past information to retain and which to discard.

Much of the information from the previous state can be forgotten, and the output gate filters what information to pass to the next layer. This design lets LSTMs preserve history over long sentences, and they have been used extensively in NLP applications from question-answering systems to machine translation.[2] The input gate controls how much of the current input enters the cell, and the forget gate controls how much of the old information to keep.

The LSTM networks of Hochreiter and Schmidhuber were the innovation that tackled the vanishing gradient problem in traditional RNNs, allowing models to incorporate information over longer stretches of textual sequences. LSTMs can keep track of context over sentences or even paragraphs, so it is not surprising that they are widely used for tasks such as text generation that require semantic and stylistic coherence to be preserved over a greater span of text.[4]

LSTM networks have been commonly used in stylized text generation, from poetry and song lyrics to news articles and dialogue generation. Sutskever, Martens, and Hinton (2011) demonstrated that LSTM networks can learn complex syntactic structures and generate high-quality, grammatically correct text when trained on large datasets.[14] Other work has addressed text generation within particular genres and has shown that, with enough training data, LSTMs can learn to mimic the stylistic flavor of certain authors or genres, such as Shakespearean English, modern news, or scientific writing.

The oldest technique used for text generation is the Markov model, traditionally simple but computationally efficient. Markov models rely on the principle that a word in a sequence depends only on the immediately preceding word(s), effectively forming a chain of probabilistic predictions based on observed transitions between states (words or phrases). Studies such as Shannon's work on the entropy of English text established the Markov assumption for natural language at a very early stage and shed light on probabilistic generation using n-gram models, i.e., Markov chains conditioned on a fixed number of preceding words.[13]
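As an illustration of the n-gram idea described above, a first-order, word-level Markov chain might be sketched as follows (the training sentence and generation length are arbitrary):

import random
from collections import defaultdict

def build_chain(text):
    # Record, for each word, the words observed to follow it
    # (first-order Markov assumption).
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, seed, length=10):
    # Repeatedly sample a successor of the current word to extend the text.
    word, output = seed, [seed]
    for _ in range(length):
        if word not in chain:
            break
        word = random.choice(chain[word])
        output.append(word)
    return " ".join(output)

chain = build_chain("the quick brown fox jumps over the lazy dog and the fox runs")
print(generate(chain, "the"))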
The efficiency of Markov models in text generation generally stems from their ability to produce short and rather superficial sequences of text. Although n-gram models effectively capture local word dependencies, they fail to model very long contexts and tend to produce repetitive or inconsistent text over extended passages; increasing the size of the n-gram to capture more context brings greater complexity and higher memory requirements. Despite this, Markov models remain relevant where quick, local responses are needed rather than extended coherence, such as in bot responses and automated dialogue systems.[5]

LSTM networks are highly flexible and can generate text for various tasks when trained on large datasets specific to a style or genre. For example, Karpathy et al. (2015) trained an LSTM on several corpora of text as diverse as poetry and code and demonstrated that it could generate syntactically appropriate sequences in each domain, though coherence and depth vary.[14] These results account for the wide applicability of LSTMs in creative text generation tasks.[11]

Other techniques have also been developed, including conditional LSTMs, which condition a model on specific attributes such as genre, tone, or sentiment. The model then leans toward the attributes highlighted during generation, better matching the desired output style. Conditional generation has proved very useful for creative applications, such as generating text that maintains the tone of a specific author or adheres to a pre-defined theme.[15]

Recent innovations have taken this generative ability further, moving from LSTM networks to Transformer-based models such as GPT-2 and GPT-3. These models, based on self-attention mechanisms, capture long-term dependencies more effectively than LSTMs because the connections are not sequential, and they are therefore much better suited to large-scale text generation applications.[16]

However, LSTMs still have their place, especially in resource-limited training scenarios and in embedded systems where model size matters. Moreover, they provide the foundation architecture for many hybrid models that merge recurrent and attention-based techniques to strike a balance between coherence and computational efficiency.

Markov models and Long Short-Term Memory (LSTM) networks are popular techniques in natural language processing for generating text in a certain style or genre. Simpler text generation tasks have often used Markov models for their probabilistic structure, which creates sequences based on transitions between states or words, conditioned on the preceding words in a sequence. These models suit shorter contexts and can indeed generate text with basic stylistic patterns by adjusting the order of the model to include more or fewer preceding words.[11] However, they fail on longer contextual dependencies and cannot produce the nuance needed for highly realistic text generation.[12]

Building on these observations, LSTM networks handle longer dependencies much better and are therefore generally more useful for capturing and reproducing the complex language patterns characteristic of specific genres or styles. LSTMs use gated memory cells, which allow earlier parts of a sequence to be remembered when they are important and filtered out otherwise. As such, they can produce longer passages with cohesion, context, and distinctive stylistic features. LSTM-based models have also shown impressive performance in text generation tasks such as literary style imitation and conversational agents, where hyperparameters such as the sampling temperature balance creativity against fidelity to the target genre.[11]
vocabulary. To translate this predicted index back into a
combining LSTM networks with specific preprocessing human readable word, the function iterates through tokenizer
techniques—like tokenization, removing noise, and word index items() until the index of the word matches y
managing punctuation—enhances their ability to replicate pred, storing the matched word in predicted word.
complex text structures. Comparative analyses indicate that
LSTM networks generally outperform Markov models in The predicted word is added both to the input text for the
terms of contextual relevance and quality, especially for future prediction and the text list, consisting of the words of
longer and stylistically nuanced texts. Some studies also the current line. Adding to the input text each new word
explore hybrid approaches, leveraging Markov chains for enables the function to generate contextually relevant words
initial state generation and LSTM models for fine-tuning, that continue a coherent sequence. Ending the inner loop after
aiming to optimize both efficiency and coherence in text the iterations text length, joins together the text-a list of
production words-into a string, forming a coherent line, that is, appended
to general text.
To gain an in-depth understanding of these methods, refer to
the IEEE articles on features and performance of LSTM and Once all lines are generated, it outputs the general text in a
the IJRASET studies on text generation models and data structured form suitable for tasks in which generation
handling in text preprocessing for genre-specific tasks. For requires text lines to be multi-line, such as chatbot dialogues,
further reading, you can find resources on the IEEE Xplore storytelling, or style-specific text production. Iterative word-
and IJRASET websites[12] by-word prediction ensures that each produced line will be
coherent by using prior context in each prediction. Such a
structure is particularly well-suited to neural networks, as
VI. PROCEDURE

The generate_text function implements sequence generation: it outputs a specified number of lines of controlled text. It acts as an iterative predictive loop, in which each predicted word becomes part of the input for predicting the next word. The function works on a trained neural network language model, and previous outputs are used to generate coherent text in a consistent style.

The function takes an initial seed phrase as input, with the aim of building sentences from that starting point. It updates the input phrase as it goes and generates a specified number of lines of text by predicting the next word sequentially.

Setting text_length to 15 fixes the number of words per line; this limits the size of the generated output and ensures that each line uses a constant number of words, which is especially useful when generating paragraphs. In the function definition, general_text is initialized to hold the full output of generated lines. The loop for i in range(no_lines) executes once for every line to be produced. For each line, a temporary list, text, is populated with words, and a nested loop over range(text_length) builds the line one word at a time.

The first step of each iteration of the inner loop is to encode the input text using tokenizer.texts_to_sequences, which translates words into numerical values according to the trained model's vocabulary. The encoded sequence is padded with pad_sequences to fit the expected input length of the neural network, seq_length. This padding is essential for the model to process variable-length sequences consistently. The padded input is then fed into the model, and y_pred = np.argmax(model.predict(encoded), axis=-1) predicts the index of the next word, taking the highest-probability entry over the model's vocabulary. To translate this predicted index back into a human-readable word, the function iterates through tokenizer.word_index.items() until it finds the word whose index matches y_pred, storing the matched word in predicted_word.

The predicted word is appended both to the input text, for future predictions, and to the text list holding the words of the current line. Adding each new word to the input text enables the function to generate contextually relevant words that continue a coherent sequence. When the inner loop finishes its text_length iterations, the text list of words is joined into a string, forming a coherent line that is appended to general_text.

Once all lines are generated, the function returns general_text in a structured form suitable for tasks that require multi-line output, such as chatbot dialogues, storytelling, or style-specific text production. Iterative word-by-word prediction ensures that each produced line is coherent, since every prediction uses the prior context. Such a structure is particularly well suited to neural networks, which are designed to capture dependencies across words and so produce fluent and relevant text output.
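A sketch of the generate_text function reconstructed from this description (variable names and default values are assumptions based on the text above, not the paper's verbatim code):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_text(model, tokenizer, seed_text, seq_length, no_lines=4, text_length=15):
    general_text = []
    input_text = seed_text
    for _ in range(no_lines):
        text = []
        for _ in range(text_length):
            # Encode the current context and pad it to the model's input length.
            encoded = tokenizer.texts_to_sequences([input_text])[0]
            encoded = pad_sequences([encoded], maxlen=seq_length, padding="pre")
            # Predict the index of the most probable next word.
            y_pred = np.argmax(model.predict(encoded, verbose=0), axis=-1)[0]
            # Map the predicted index back to its word in the vocabulary.
            predicted_word = ""
            for word, index in tokenizer.word_index.items():
                if index == y_pred:
                    predicted_word = word
                    break
            # Extend both the running context and the current line.
            input_text += " " + predicted_word
            text.append(predicted_word)
        general_text.append(" ".join(text))
    return general_text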
VII. FUTURE WORK

LSTM networks have proved to be among the best existing models to date for prediction and classification over text-based data. LSTMs effectively overcome the vanishing gradient problem faced by standard recurrent neural networks. The model is efficient, but overall it is computationally expensive and requires high processing power, hence the use of GPUs to fit and train it. LSTMs are currently employed in numerous applications ranging from voice assistants, smart virtual keyboards, and automated chatbots to sentiment analysis. As future research, the accuracy of current LSTM models may be surpassed by adding more layers and nodes to the network and by applying transfer learning to the same problem domain.

VIII. CONCLUSION

Based on our research, we chose LSTMs, since the ubiquitous challenge of next-word prediction arises in contexts as diverse as emails and WhatsApp messages. The results turned out excellent when compared with previous studies, with an accuracy rate of 85.50%. LSTMs are found to be effective at producing coherent and contextually relevant sequences of text; they increase efficiency in composing paragraphs and significantly reduce the time invested.

REFERENCES

[1] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[2] Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

[3] Andrew M. Dai, Christopher Olah, and Quoc V. Le. Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998, 2015.

[4] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[5] A. Rashid, A. Do-Omri, M. A. Haidar, Q. Liu and M. Rezagholizadeh, "From Unsupervised Machine Translation to Adversarial Text Generation," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 8194-8198, doi:10.1109/ICASSP40776.2020.9053236.

[6] L. Xuyuan, T. Lihua and L. Chen, "TCTG: A Controllable Text Generation Method Using Text to Control Text Generation," 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 2021, pp. 1118-1122, doi:10.1109/ICSIP52628.2021.9688767.

[7] R. Ma, Y. Gao, X. Li and L. Yang, "Research on Automatic Generation of Social Short Text Based on Backtracking Pattern," 2023 Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), Dalian, China, 2023, pp. 336-347, doi:10.1109/IPEC57296.2023.00066.

[8] S. V. Hemanth, Saravanan Alagarsamy and T. Dhiliphan Rajkumar, "A novel deep learning model for diabetic retinopathy detection in retinal fundus images using pre-trained CNN and HWBLSTM," Journal of Biomolecular Structure and Dynamics, 2024, doi:10.1080/07391102.2024.2314269.

[9] Y. Wu, H. Yin, D. Liu and Q. Zhou, "Text Semantic Representation Based on Knowledge Graph Correction," 2022 International Conference on Computer Engineering and Artificial Intelligence (ICCEAI), Shijiazhuang, China, 2022, pp. 404-408, doi:10.1109/ICCEAI55464.2022.00090.

[10] Saraswat S, Srivastava G, Shukla S (2018) Classification of ECG signals using cross-recurrence quantification analysis and probabilistic neural network classifier for ventricular tachycardia patients. Int J Biomed Eng Technol 26(2):141–156.

[11] https://colab.research.google.com/drive/1RVns_kEechYmQHBPq5hBmk_KJgKHb2lH#scrollTo=8GlQJ4DX6uz9

[12] https://arxiv.org/pdf/2005.00048

[13] https://www.ijraset.com/author.php

[14] Sutskever I, Martens J, Hinton GE (2011) Generating text with recurrent neural networks. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11).

[15] Sundermeyer M et al (2012) LSTM neural networks for language modeling. In: INTERSPEECH.

[16] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

[17] Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., & Socher, R. (2019). CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv preprint arXiv:1909.05858.
