MLRESEARCHPAPERfinal
Keywords—machine learning, deep learning, algorithms, LSTM, Markov models, one-hot encoding, Tokenizer.

I. INTRODUCTION

Text generation, the development of models that autonomously produce human-like text from input data, is a fast-emerging research domain in natural language processing and machine learning, and it constitutes one of the most interesting and impactful challenges in artificial intelligence. Text generation has applications in many domains, such as automated content creation, chatbots, machine translation, creative writing, and information retrieval. When machines can understand and produce coherent text pertinent to the context, industries like communication, entertainment, and education stand to be revolutionized in practice.

At its root, text generation is based on the capacity to model the structure of natural language and to predict the probability that a given word or phrase occurs in a particular context.
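Formally, this is the language modeling problem: the probability of a word sequence factorizes by the chain rule, and the model learns the conditional distribution of each next word given its history (standard notation, added here for clarity rather than quoted from the paper):

P(w_1, ..., w_T) = P(w_1) P(w_2 | w_1) ... P(w_T | w_1, ..., w_(T-1))

A text generator approximates P(w_t | w_1, ..., w_(t-1)) and produces text by repeatedly choosing a likely next word.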
Of course, this is a daunting task owing to the intrinsically complicated and variable nature of language. This paper discusses the methods and techniques applied in training text generation models, from data preprocessing and the design of model architectures to the evaluation of the generated text. I also discuss some of the challenges and limitations of these models, mainly regarding the coherence and contextual validity of the generated text. Through a detailed exploration of the body of knowledge in the area, the paper also reflects on the future outlook of text generation and its possible implications for industries and society at large.

II. METHODOLOGIES

A. DATASET SOURCE

I took this dataset from Kaggle's Customer Churn Prediction 2020 competition [1]. The dataset covers customers' recharge plans, usage, and messages.

III. PREPROCESSING

1. TEXT LOADING AND READING

Text loading and reading are the pre-processing steps that prepare raw textual data for machine learning tasks. Raw textual data usually lies outside the program in external files and, as such, must be read from those files and loaded into memory for processing. This is easily achieved with the file-handling facilities of any programming language, such as Python's built-in file input/output functions. After opening a file, the data can be read at once as a single large string or as a list of strings, depending on the structure of the text. Specifying the file's encoding, such as UTF-8, ensures that special symbols and non-ASCII characters are read without error. After reading, the data undergoes basic cleaning: stripping whitespace, removing punctuation, and normalizing inconsistent formatting. The cleaned text can then be subjected to further processing such as tokenization and sequence formation. Correct text loading thus ensures that the data is represented in a structured format suitable for training the machine learning models.
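As a minimal sketch of this loading-and-cleaning step in Python (the file name "corpus.txt" and the exact cleaning choices are illustrative assumptions, not taken from the paper):

import re
import string

def load_text(path):
    # Open with an explicit encoding so special symbols and
    # non-ASCII characters are read without error.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

def clean_text(text):
    # Basic cleaning: lowercase, strip punctuation, and collapse
    # inconsistent whitespace into single spaces.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

raw_text = load_text("corpus.txt")   # hypothetical corpus file
cleaned_text = clean_text(raw_text)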
2. SEQUENCE CREATION

Sequence creation is another important step in the pre-processing pipeline for training text generation models. It transforms the raw text into a form that can be fed to a machine learning algorithm, allowing the model to learn the patterns that govern word order, sentence structure, and syntactic dependencies in the data. The aim is to divide the text into segments small enough for the model to use in predicting subsequent words or phrases. Sequence creation begins by splitting the cleaned text into units such as words or tokens. Each sequence then represents a fixed-size window of words used to predict the next word. For instance, if the text contains the sentence "The quick brown fox jumps," one sequence might consist of the first three words, ["The", "quick", "brown"], from which the model tries to predict the next word, "fox."

Once created, the sequences are encoded and divided into input-output pairs: the input is the window of words, and the output is the word that follows it. These input-output pairs are used to train the model so that it learns the word relationships in the text and can predict the next word from the words that precede it.
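The sketch below illustrates this step end to end, assuming the Keras Tokenizer and one-hot encoding named in the keywords; the toy sentence and the window size of 3 mirror the "quick brown fox" example above and are not the paper's actual data:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

text = "the quick brown fox jumps over the lazy dog"  # toy stand-in for the cleaned corpus
window = 3  # number of input words used to predict the next word

# Map each word to an integer index.
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
encoded = tokenizer.texts_to_sequences([text])[0]
vocab_size = len(tokenizer.word_index) + 1  # +1 because Keras indices start at 1

# Slide a fixed-size window over the text: `window` input words plus one target word.
sequences = np.array([encoded[i : i + window + 1]
                      for i in range(len(encoded) - window)])

# Split into input-output pairs and one-hot encode the target word.
X, y = sequences[:, :-1], sequences[:, -1]
y = to_categorical(y, num_classes=vocab_size)

print(X.shape, y.shape)  # (6, 3) and (6, 9) for the toy sentence

Here each row of X is one input window (the first row encodes ["the", "quick", "brown"]) and the matching row of y one-hot encodes the next word ("fox").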
REFERENCES
[11] https://colab.research.google.com/drive/1RVns_kEechYmQHBPq5hBmk_KJgKHb2lH#scrollTo=8GlQJ4DX6uz9
[12] https://arxiv.org/pdf/2005.00048
[13] https://www.ijraset.com/author.php