YouTube Transcript Synthesis
2023 5th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N) | 979-8-3503-3086-1/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICAC3N60023.2023.10541713
Due to advancements in real-time video technology and inexpensive storage media, digital video has become an integral part of education, entertainment, and commerce. Consequently, there is a great need for systems that can organize and search video data based on its content. These systems should not only have search capabilities, but also generate concise and user-friendly data representations that allow users to efficiently navigate the entire database or search results. These representations provide users with quick insights into the scrutinized video's content while maintaining the underlying message.

Designing effective representations for video browsing presents distinctive algorithmic and technical challenges. Video is a sequential, information-dense medium that incorporates audio and motion, conveying the long-term logical relationships between shots and sequences. Video data management is inherently more complex than image database management. For example, images can be represented as thumbnails, allowing users to rapidly evaluate their relevance.

However, this is a time-consuming operation for video sequences that contain over 100,000 frames per hour and are composed of numerous shots. Additionally, audio and dialogues, which frequently impart a significant portion of the information, must be included in the representation, such as a video of a person speaking. As dialogues are our primary focus, we developed the YouTube Video Transcript Summarizer system.

By utilizing Chrome extensions, the user interface can be made more functional. By adding a "summarize" icon, users [...]

Utilizing automated summarization techniques can save consumers time and provide a concise overview of the content on YouTube, where more than one billion hours of video are viewed daily. T5 is a pre-trained encoder-decoder model for unsupervised and supervised tasks that can be fine-tuned for summarization.

The purpose of this study is to summarize the original text by applying automated summarization to YouTube video transcriptions using the TF-IDF method to extract keywords and generate a concise text summary. Although automated summaries may not be as coherent or intelligent as those created by humans, readers are still able to comprehend the most important information presented. The structure of the paper includes a section on techniques overview, followed by the research methodology and experimental results. The study concludes with a summary of the most important aspects.

II. Related Works

ROUGE is a set of measures commonly used to evaluate the quality of machine-generated summaries by comparing them to summaries written by humans, according to [1]. The measures are based on the recall of overlapping units, such as n-grams, word sequences, and word pairings, between the generated summary and the reference summaries.

In [2], a technique for automatically summarizing Arabic texts based on Rhetorical Structure Theory (RST) is described. Using a tree-like structure, the method identifies the rhetorical relationships between various sections of the text. Based on the type of rhetorical relations present, the system then selects sentences for the final summary.
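
To make the recall-based overlap behind ROUGE [1] concrete, the following is a minimal sketch of ROUGE-N recall against a single reference summary. It is an illustration only and not the official ROUGE package: whitespace tokenization, lowercasing, and the function names `ngrams` and `rouge_n_recall` are assumptions introduced here, and features such as stemming or multiple references are left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams.

    Assumption: plain lowercase whitespace tokenization, one reference.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate summary.
    overlap = sum(min(count, cand_counts[gram])
                  for gram, count in ref_counts.items())
    return overlap / total


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 unigrams recalled
    print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 bigrams recalled
```

Because the score is normalized by the reference length, a longer candidate is never penalized by this measure alone, which is why recall is usually reported together with precision or an F-score.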
The article [4] introduces Kaldi, an open-source speech recognition research toolkit. Kaldi is constructed with finite-state transducers and the OpenFst library, and it includes comprehensive documentation and routines for constructing complete recognition systems. The C++-written core library of Kaldi supports arbitrary phonetic-context sizes and acoustic modeling with subspace Gaussian mixture models (SGMM) or standard Gaussian mixture models, as well as linear and affine transformations.

The paper [5] concentrates on Arabic document clustering, which is crucial for traditional Information Retrieval (IR) systems due to the growing number of online Arabic documents. The task entails clustering comparable documents using various similarity/distance metrics. However, document length and noise can affect the efficacy of clustering.

Automatic text summarization is introduced in [6] as a method for reducing the volume of text documents while preserving the most important information. The authors propose a Persian text summarizer system that, after stemming the words, employs a combination of graph-based and TF-IDF methods to assign sentence weights.

The purpose of [7] is to compare the quality of summaries generated by various automatic text summarization methods and those produced by humans. Two series of experiments were conducted: one with extractive summaries generated automatically using Fuzzy and Vector techniques, and the other with summaries produced manually by English instructors.

According to Ajmal and Haroon [8], the increased usability of documents has necessitated extensive research in the field of automated text summarization. A summary is a condensed version of one or more texts that includes only the most essential information from the original text(s) and is typically no longer than half the length of the original text(s), if not substantially shorter. The primary purpose of a summary is to convey concisely the central ideas of a text.

According to [3], there has been a recent explosion of text data from numerous sources. This volume of text contains invaluable knowledge and information that must be skillfully extracted in order to be useful. The review explains the primary methods for automatic text summarization, examines the various summarization procedures, and discusses their advantages and disadvantages.

The article [10] discusses the use of automated captioning services for research purposes, specifically audio transcription. It provides a proof-of-concept analysis by contrasting three instances of automated transcription with manual transcription techniques. The article also provides a literature review of automated captioning and voice recognition transcription tools. The authors describe the processes and tools utilized for producing automated captions and transcripts. Using software that checks for originality, the percentage of similarity between the automated and manual transcripts is determined.

In [11], the authors note that, due to the increased use of smartphones and the internet by individuals of all ages, online purchasing has increased steadily. However, it can be difficult to determine which products are authentic and which to choose from the plethora of identically priced options. Users rely on reviews to make well-informed choices. However, some users may find it difficult and time-consuming to read extensive reviews.

In [12], researchers developed an automatic text and video summarization system using natural language processing techniques. The system utilized the term frequency-inverse document frequency (TF-IDF) technique to extract significant keywords from the text and condense lengthy videos into a few lines of text. Students and researchers who lack the time to extract valuable information from lengthy videos will benefit from the proposed system.

The paper [13] proposed a Persian text summarizer system that employs a combination of graph-based and TF-IDF methods to evaluate sentences following word stemming. SA-GA-based sentence selection is used to generate a summary, where SA-GA is a composite algorithm that incorporates the Genetic Algorithm and Simulated Annealing.

In [14], the development of an image captioning model to aid the blind and visually impaired in outdoor navigation is discussed. The CNN and the attention layer serve as encoders, while the LSTM serves as a decoder in this model. The encoder uses ResNet101 and ResNet152 to derive image features. The attention layer uses the Bahdanau attention mechanism.

During the past decade, automated captioning systems have gained widespread use in technology, per [15]. Up until now, the focus of these services has been on the technical aspects, such as assisting students with special needs and teaching students of a second language. Their use for research, such as audio file transcription, has only been the subject of a few limited studies.

III. Proposed System

The objective of this project was to create a system that could obtain transcripts/subtitles for a given YouTube video ID using a Python API, perform text summarization using Hugging Face transformers, build a Flask backend REST API to expose the summarization service to the client, and create a Chrome extension that would use the backend API to display summarized text to the user.

Fig. 1. Architectural Design of System

As shown in Fig. 1, the user would first open a YouTube video and click the "summarize" button on the Chrome extension, which would create an HTTP request
to the backend. The request would include the YouTube video ID taken from the URL. The response would be a transcript of the video in JSON format. After obtaining the transcripts in text format, the system would perform transcript summarization using Hugging Face transformers. Finally, the summarized transcript would be displayed on the extension for the user to read.

Overall, the project was successful in creating a functional system that could obtain YouTube transcripts and perform summarization, with the final result displayed to the user through the Chrome extension.

A. Back End

Flask is a prominent Python web framework that enables developers to build web applications and APIs. RESTful APIs are an API type that adheres to a set of design principles and constraints that facilitate client-server communication.

Follow these steps to construct a Flask RESTful API with dependencies such as youtube_transcript_api and transformers:

a. Create a new Python virtual environment using venv or virtualenv to isolate this project's dependencies.
b. Activate the virtual environment with the source command.
c. Using the pip package manager, install Flask and all required dependencies.
d. Create a new file with the name app.py and import the required modules and packages, including Flask and any dependencies.
e. Define your API endpoints using the @app.route() decorator and associated functions.
f. Run the Flask application using flask run.

Once your Flask RESTful API is operational, you can use it to manage client requests and responses. For instance, the youtube_transcript_api could be used to retrieve YouTube video transcripts, while the transformers library could be used to perform natural language processing tasks on the text.

B. Attain Transcript

Several Python APIs are available for retrieving transcripts and translations for YouTube videos. The youtube_transcript_api library is one example of such an API.

Create a function in your app.py file that accepts a YouTube video ID as an input to use the youtube_transcript_api library in a Flask application. This function can then use the library to retrieve the video's transcript and extract the transcript text from the response.

C. Perform Text Summarizer

There are two primary methods for summarizing a text: extractive and abstractive.

In extractive summarization, the model extracts the most significant sentences or phrases from the source text and outputs them as the summary. This method entails selecting and rearranging existing sentences from the text without revising or paraphrasing the content. Extractive summarization can be performed using techniques such as clustering, graph-based methods, or machine learning algorithms.

The advantage of extractive summarization is that it preserves the original wording and structure of the text, which can be useful in fields such as legal and technical writing where precise terminology is essential. However, this can occasionally result in disjointed or difficult-to-read summaries.

D. API Rest Point

To define the API route, we use the Flask framework to generate a route with a URI and the GET HTTP method. Using query parameters, we derive the YouTube video ID from the URL, and then we generate the transcript by invoking the transcript generation function. The request URL has the form http://[hostname]/api/summarize?youtube_url=#{url}. The transcript is then passed to the transcript summarizer function to produce a summary. Then, we return the summarized transcript with an OK HTTP status code and handle any applicable HTTP exceptions. This endpoint can be accessed by sending a GET request with the YouTube video URL as the query parameter to the API. The API will then return a condensed version of the video's transcript.

E. Chrome Extension

Chrome extensions are software programs that can modify and improve the browsing experience by introducing new functionality or altering existing behavior. They are created using web technologies such as HTML, CSS, and JavaScript and are packaged into a file that can be installed on the Chrome web browser. You can create a Chrome extension by following these steps:

a. Create a new directory for your extension and include the required files, including an HTML file for the user interface, a JavaScript file for any functionality, and a CSS file for formatting.
b. Create a manifest.json file describing the extension's functionality. This file contains the extension's name, version, and icons, as well as the necessary permissions for the extension to function.
c. In Chrome, navigate to chrome://extensions and toggle the switch in the upper right corner to enable developer mode.
d. Click the "Load unpacked" button and navigate to the directory containing the files for your extension.
e. Your extension is now installed in Chrome and is accessible through the Chrome toolbar and context menus.
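
The extension only has something to display once the backend request handling from Section D is in place. As a hedged illustration of that flow (derive the video ID from the youtube_url query parameter, fetch the transcript, summarize it, and return the result with an OK status), here is a framework-agnostic sketch. The names `extract_video_id` and `handle_summarize`, the error status codes, and the assumed transcript shape (a list of segments with a `text` field) are hypothetical choices made for this example; the transcript fetcher and the summarizer are passed in as plain callables so that no particular library call is presumed.

```python
from urllib.parse import urlparse, parse_qs

# Hostnames treated as YouTube watch pages (an assumption for this sketch).
WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(youtube_url):
    """Return the video ID from a YouTube watch or short URL, or None."""
    parsed = urlparse(youtube_url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host in WATCH_HOSTS:
        # Watch links carry the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


def handle_summarize(youtube_url, fetch_transcript, summarize):
    """Body of a GET /api/summarize handler, independent of any framework.

    fetch_transcript(video_id) is assumed to return a list of
    {"text": ...} segments; summarize(text) is assumed to return a string.
    Returns a (JSON-serializable body, HTTP status) pair.
    """
    video_id = extract_video_id(youtube_url or "")
    if video_id is None:
        return {"error": "no video ID found in youtube_url"}, 400
    try:
        segments = fetch_transcript(video_id)
    except Exception as exc:  # e.g. no transcript exists for this video
        return {"error": f"transcript unavailable: {exc}"}, 404
    text = " ".join(segment["text"] for segment in segments)
    return {"video_id": video_id, "summary": summarize(text)}, 200


if __name__ == "__main__":
    # Stand-ins for the real transcript and summarization calls.
    fake_fetch = lambda video_id: [{"text": "hello"}, {"text": "world"}]
    fake_summarize = str.upper
    print(handle_summarize("https://www.youtube.com/watch?v=abc123",
                           fake_fetch, fake_summarize))
```

In a Flask application, a route function registered with @app.route() could read the parameter with request.args.get("youtube_url") and pass it to a handler like this one, together with functions that wrap youtube_transcript_api and a Hugging Face summarization pipeline. Keeping those dependencies behind callables also makes the logic easy to test without network access.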
Every time you modify your extension's files, you must refresh the extension in Chrome by selecting "reload" on the chrome://extensions page.

F. User Interface/Extension Popup

We include the popup.css file for styling and the popup.js file for user interaction in the popup.html file. We add a button element with the name "Summarise" that generates a click event when it is clicked, which is detected by an event listener. Additionally, we add a div element where the summarized text will be displayed when received from the REST API call on the backend. In the popup.css file, we provide HTML elements with appropriate CSS formatting to enhance the user experience.

G. Display Summarized Text

To enable the Chrome extension to communicate with the backend server, some missing connections must be added. In this phase, the code in popup.js, contentScript.js, and the manifest file will be modified. First, in popup.js, we'll attach an event listener to the Summarise button and use chrome.runtime.sendMessage to send an action message to contentScript.js. We will also add an event listener to monitor for the message results from contentScript.js and use JavaScript to programmatically display the summary in the div element. In contentScript.js, we will add an event listener that will listen for the incoming action message and extract the current tab's URL. Then, we'll send a GET HTTP request using the XMLHttpRequest Web API to the backend to receive the response containing the summarized text.

IV. Tools and technology

It is possible to summarize YouTube video transcripts using a variety of techniques and technologies. Here are some examples:

Libraries for Natural Language Processing (NLP): NLP libraries such as NLTK, spaCy, and TextBlob may be utilized to derive relevant information from the transcript.

ASR (automatic speech recognition) software: Using ASR technologies such as Google Cloud Speech-to-Text and Amazon Transcribe, the video's audio can be converted to text.

Text summarization tools: To summarize the transcript, one can use utilities such as Gensim, Sumy, and PyTeaser.

Using the YouTube API, you can retrieve the video's transcript and metadata.

Python is a popular programming language for text summarization, and it can be used to construct customized summarization techniques.

Machine-Learning Algorithms.

V. Methodology

Several components of the transcript summarization procedure utilize the spaCy library. spaCy is intended for natural language comprehension and data extraction. It is capable of separating text into words and punctuation, assigning word roots, serialization, and text classification. Any text can be converted into a Doc object and its properties can be inferred using spaCy.

Sentence Tokenization: This portion of the method utilizes the Punkt Sentence Tokenizer from the NLTK tokenize module to determine the start and end of sentences in a given text. The method also describes the various forms of available word tokenization, such as white space tokenization, dictionary-based tokenization, rule-based tokenization, regular expression tokenization, Penn Treebank tokenization, spaCy tokenization, Moses tokenization, and subword tokenization.

Word Tokenization: This portion of the method entails dividing a string sequence containing words, phrases, and symbols into numerous tokens. word_tokenize() is a wrapper function that calls tokenize() on an instance of NLTK's TreebankWordTokenizer class. The process of separating a large text sample into individual words is known as word tokenization.

Grammar and Spelling Check: In this step, the text is checked for grammar and spelling using LanguageTool, an open-source program that can be used as OpenOffice's spell checker. The program is accessible via a command-line interface (CLI) or a Python code fragment.

VI. Conclusion

The purpose of this paper was to investigate the viability of using readily accessible web-based tools for automated captioning to generate transcripts of audio and video recordings. Based on our proof-of-concept, we've determined that this is indeed possible and produces a satisfactory first transcript draft. Even with conservative
estimates, we can save a significant amount of time by obtaining two-thirds of the transcript without editing after just a few minutes of uploading, a few hours of waiting (which can be used for other tasks), and a minute of downloading. For high-quality audio in optimal conditions, such as one-on-one interviews, the auto-captioning accuracy can exceed 90%. It is essential to note, however, that even with such high accuracy, we are not suggesting that auto-captioning eliminates the need for manual transcription; rather, it can facilitate the transcription process.
REFERENCES
[...] evaluation of summaries. In Text summarization branches out (pp. 74-81).
[2] Maâloul, M.H., Keskes, I., Belguith, L.H. and Blache, P., 2010. Automatic Summarization of Arabic Texts based on RST Technique. In ICEIS (2) (pp. 434-437).
[3] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P. and Silovsky, J., 2011. The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding (No. CONF). IEEE Signal Processing Society.
[4] Zhang, J.J. and Fung, P., 2012. Active learning with semi-automatic annotation for extractive speech summarization. ACM Transactions on Speech and Language Processing (TSLP), 8(4), pp.1-25.
[...] INDIA 2019 (pp. 535-547). Singapore: Springer Singapore.
[14] Rahimi, S.R., Mozhdehi, A.T. and Abdolahi, M., 2017, December. An overview on extractive text summarization. In 2017 IEEE 4th international conference on knowledge-based engineering and innovation (KBEI) (pp. 0054-0062). IEEE.
[15] Andhale, N. and Bewoor, L.A., 2016, August. An overview of text summarization techniques. In 2016 international conference on computing communication control and automation (ICCUBEA) (pp. 1-7). IEEE.