Final Report
DIPLOMA
IN
COMPUTER ENGINEERING
Submitted by
MOHAMMED SOHAIL
20221-CM-037
2023
SANKETIKA POLYTECHNIC COLLEGE
(Affiliated to AP SBTET)
P.M. Palem, Visakhapatnam-530041
BONAFIDE CERTIFICATE
This is to certify that the project report entitled Web-Scraping And PDF Processing Using
Advanced Data Science Techniques, being submitted by MD. SOHAIL, bearing registered
number 20221-CM-037, in partial fulfilment for the award of the degree of “Diploma” in
Computer Engineering to the STATE BOARD OF TECHNICAL EDUCATION &
TRAINING, Govt. of Andhra Pradesh, is a record of bonafide work done by him under my
supervision during the academic year 2022-2023.
CERTIFICATE
Fluentgrid Limited
DECLARATION
MD.SOHAIL
(20221-CM-037)
ACKNOWLEDGEMENT
My parents have put me ahead of themselves. Because of their hard work
and dedication, I have had opportunities beyond my wildest dreams. My heartfelt
thanks to them for giving me all I ever needed to be a successful student.
Finally, I express my thanks to all my other professors, classmates, friends,
neighbours, and family members who helped me complete this project; without
their infinite love and patience, this would never have been possible.
- MD. Sohail
ABSTRACT
Web scraping is a powerful technique in the field of data science that involves
extracting data from websites. It plays a crucial role in gathering and analysing
data from diverse sources on the internet. This abstract provides an overview of
web scraping in the context of data science.
To perform web scraping, data scientists utilize libraries like Beautiful Soup,
Scrapy, or Selenium in combination with programming languages like Python.
These libraries provide functionalities to retrieve HTML content from websites,
parse and extract data from HTML elements, handle pagination and dynamic
content, and store the scraped data in a suitable format.
However, web scraping comes with ethical considerations. Data scientists must
respect a website's terms of service and prioritize the privacy of website owners
and users. It is essential to ensure that web scraping activities are legal and
conducted responsibly.
In conclusion, web scraping is a valuable tool for data scientists, enabling them
to acquire data from diverse online sources for analysis, modelling, and decision-
making. By harnessing the power of web scraping, data scientists can expand their
data collection capabilities and derive insights from the vast amount of
information available on the web.
TABLE OF CONTENTS
1. INTRODUCTION
2. SYSTEM ANALYSIS
3. METHODOLOGY
4. SYSTEM IMPLEMENTATION
5. RESULTS AND ANALYSIS
   Project A – Results
   Project B – Results
   Project C – Results
   Project D – Results
6. CONCLUSION
7. REFERENCES AND BIBLIOGRAPHY
1. INTRODUCTION
Web scraping is a technique used in data science to gather information from
websites automatically. It involves extracting data from web pages and saving it
for analysis. In simple words, web scraping allows us to collect data from the
internet in a structured format that can be used for various purposes.
Data scientists use web scraping to access data that is not readily available
through traditional sources. It helps them gather data from multiple websites
quickly and efficiently. By automating the process of data extraction, web
scraping saves time and effort compared to manual data collection.
Web scraping is particularly valuable in data science because it enables
researchers to analyse large volumes of data from different sources. It allows
them to gather data for research, build predictive models, perform sentiment
analysis, and much more.
To perform web scraping, data scientists use specialized tools and libraries that
can parse the HTML structure of web pages. These tools extract the desired data
by identifying specific elements on the page, such as text, images, tables, or
links. However, it is important to note that web scraping should be done
responsibly and ethically.
1.1 Software Process Flow Chart
CHAPTER-2
System Analysis
2. SYSTEM ANALYSIS
System analysis is the process of collecting and interpreting facts,
identifying problems, and decomposing a system into its components.
System analysis is carried out to study a system or its parts
in order to identify its objectives. It is a problem-solving technique that improves
the system and ensures that all of its components work efficiently
to accomplish their purpose.
Automation and Efficiency: Web scraping automates the data collection
process, significantly reducing the time and effort required compared to manual
data extraction. It enables the retrieval of data from multiple sources
simultaneously and can be scheduled to run at specific intervals, ensuring that
the collected data remains up to date. This automation improves efficiency and
productivity.
Personalization and Recommendation Systems: Web scraping can support
personalized experiences and recommendation systems. By collecting data on
user preferences, behaviour, or interactions from websites, platforms, or social
media, personalized recommendations can be generated.
CHAPTER-3
Methodology
3. METHODOLOGY
To design and build the system, we use the following software requirements:
Python and Python packages such as Beautiful Soup (bs4), Selenium, Pandas,
Tabula, and requests, along with chromedriver_win32.
Databases: MySQL and Excel
Python is widely used for web scraping due to its simplicity, extensive libraries,
and powerful tools.
Here are some key Python libraries commonly used for web scraping:
Beautiful Soup (Bs4): Beautiful Soup is a library that makes it easy to scrape
information from web pages. Beautiful Soup transforms the unstructured HTML
mark-up into a parse tree, allowing you to search, navigate, and manipulate the
HTML or XML data.
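As a minimal sketch, assuming the bs4 package is installed, Beautiful Soup can parse a small HTML snippet like this (the snippet, tag names, and class names are illustrative):

from bs4 import BeautifulSoup

# A small illustrative HTML snippet; a real page would be fetched with requests or Selenium.
html = """
<div class="product">
  <h2 class="name">Sample Phone</h2>
  <span class="price">Rs. 49,999</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")        # build the parse tree
name = soup.find("h2", class_="name").text       # search by tag and class
price = soup.find("span", class_="price").text
print(name, price)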
Selenium (Se): Selenium is a popular open-source framework for automating
web browsers. It provides a programming interface for interacting with web
pages, filling out forms, clicking buttons, and extracting data. Selenium is often
used for web scraping, automated testing, and web application development.
Selenium works with all major browsers and operating systems, and its scripts can be
written in various languages such as Python, Java, and C#; here we will be working
with Python. Selenium uses WebDriver, WebElement, and unit-testing features
for browser automation.
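A minimal Selenium sketch that opens a page in Chrome and reads one element (Selenium 4 syntax; the URL is a placeholder, and older releases require the chromedriver executable path to be set explicitly):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Recent Selenium releases download a matching ChromeDriver automatically;
# older releases need the chromedriver executable on the PATH.
driver = webdriver.Chrome()
driver.get("https://example.com")                # placeholder URL

heading = driver.find_element(By.TAG_NAME, "h1") # locate an element on the page
print(heading.text)

driver.quit()                                    # always close the browser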
Pandas: Pandas is built on top of the NumPy library and is widely used in data science,
machine learning, and data analysis tasks.
We can create and manipulate Series and DataFrame objects, read and write
data, perform data analysis, and explore various data manipulation techniques
using the rich set of Pandas functions and methods.
Overall, Pandas is a versatile and efficient library for data manipulation and
analysis, making it a go-to choice for many data-related tasks in Python.
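A short sketch of how scraped lists might be turned into a DataFrame, cleaned, and written to a CSV file (the column names and values are only examples):

import pandas as pd

# Lists like these are filled while scraping; the values here are examples only.
names = ["Product A", "Product B"]
prices = ["499", "799"]

df = pd.DataFrame({"Name": names, "Price": prices})  # build a tabular DataFrame
df["Price"] = df["Price"].astype(int)                # simple cleaning step
df.to_csv("products.csv", index=False)               # write the data to a CSV file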
Tabula: Tabula is a Python library used for extracting tables from PDF files. It
provides a convenient way to parse PDF documents and extract tabular data,
saving you from manually copying and pasting data from PDF tables. Tabula is
especially useful when dealing with large or complex PDF files that contain
structured data in table format.
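A minimal sketch of tabula-py usage, assuming Java is installed; the file names are placeholders:

import tabula

# Read every table in the PDF into a list of pandas DataFrames.
tables = tabula.read_pdf("report.pdf", pages="all")
print(len(tables), "tables found")

# Or convert the tables in the document straight into a CSV file.
tabula.convert_into("report.pdf", "report.csv", output_format="csv", pages="all")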
Requests: The requests library is a popular Python library for making HTTP
requests. It simplifies the process of sending HTTP requests, handling responses,
and working with APIs or web pages.
With requests, you can interact with web servers and retrieve data from URLs.
Using the requests library, we can fetch the content from a given URL, and the
Beautiful Soup library helps parse it and extract the details we want.
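A small sketch combining requests and Beautiful Soup; the URL is a placeholder:

import requests
from bs4 import BeautifulSoup

url = "https://example.com"                          # placeholder URL
response = requests.get(url, timeout=10)             # send the HTTP GET request
response.raise_for_status()                          # raise an error for bad status codes

soup = BeautifulSoup(response.text, "html.parser")   # parse the returned HTML
print(soup.title.text if soup.title else "No title found")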
To design and build the system, we use the following hardware requirements:
• Processor: Intel Pentium IV or above
• Ram: 512 MB or more
• Hard Disk: 40 GB or more
• Input Devices: Keyboard, Mouse
CHAPTER-4
System Implementation
4. SYSTEM IMPLEMENTATION
4.1.1 Project-A: Flipkart Web Scraping
Web scraping is the process of automatically extracting data from websites using
software or scripts. By utilizing web scraping techniques, you can gather large
amounts of data from Flipkart for various purposes such as market research,
price comparison, trend analysis, and inventory tracking.
Here are some common use cases for Flipkart web scraping:
Price Monitoring: Tracking the prices of specific products over time to identify
price fluctuations, price drops, or special offers. This information can be
valuable for price comparison websites or for making informed purchasing
decisions.
It's important to note that when scraping Flipkart or any other website, you
should be aware of the website's terms of service and data usage policies.
Respect the website's policies, avoid overloading the server with requests, and
make sure to handle the scraped data in compliance with legal and ethical
standards.
[Flowchart: Start → web-crawl HTML scripts from the Flipkart website → data successfully obtained? (No: crawl again) → process → insert data into database → End]
Goal and objective:
Fig 4.1
The goal is to scrape the iPhone data from the Flipkart website as shown
in Fig 4.1. The main objective is to extract the Product Name, Product Price,
Product Description, and Product Reviews.
Fig 4.2
As demonstrated in Fig 4.2, the Python libraries are imported and empty lists
are created to hold the scraped data. Next, the URL of the Flipkart website is
given, and the requests library sends an HTTP request to the web server.
With the help of Beautiful Soup and the lxml/HTML parser, the website data is
scraped: the common classes in the page's HTML are identified and the matching
values are appended to the empty lists.
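A hedged sketch of this step; the search URL, request headers, and class names are illustrative and would need to match the classes actually found in Flipkart's page source:

import requests
from bs4 import BeautifulSoup

product_names, product_prices = [], []               # empty lists to hold the scraped data

# Illustrative search URL and request headers.
url = "https://www.flipkart.com/search?q=iphone"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, "lxml")          # lxml parser, as used in the project

# Hypothetical class names standing in for the common classes identified on the page.
for tag in soup.find_all("div", class_="product-title"):
    product_names.append(tag.text.strip())
for tag in soup.find_all("div", class_="product-price"):
    product_prices.append(tag.text.strip())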
Fig 4.3
Once the scraped data has been appended to the lists, it is cleaned with pandas.
The unstructured data is cleaned and arranged under the appropriate headings,
and the result is written out as a CSV file.
When we want to store the extracted data in a database, the data is inserted
into the MySQL database with the help of mysql.connector, as demonstrated in
Fig 4.3.
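A sketch of the database step, assuming mysql.connector is installed; the connection credentials, database, table, and file names are placeholders:

import mysql.connector
import pandas as pd

df = pd.read_csv("flipkart_products.csv")            # the cleaned CSV produced earlier (placeholder name)

# Placeholder credentials and database name.
conn = mysql.connector.connect(
    host="localhost", user="root", password="password", database="scraping"
)
cursor = conn.cursor()

# Hypothetical table holding the product name and price columns.
cursor.execute(
    "CREATE TABLE IF NOT EXISTS products (name VARCHAR(255), price VARCHAR(50))"
)
for _, row in df.iterrows():
    cursor.execute(
        "INSERT INTO products (name, price) VALUES (%s, %s)",
        (row["Name"], row["Price"]),
    )

conn.commit()                                        # save the inserted rows
cursor.close()
conn.close()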
4.1.2 Project-B: Big Basket Web Scraping
Keep in mind that web scraping may have legal and ethical implications,
and it is important to ensure that you are scraping data responsibly and within
the bounds of the law. When scraping Big Basket or any other website, you should
be aware of the website's terms of service and data usage policies. Respect
those policies, avoid overloading the server with requests, and make sure to
handle the scraped data in compliance with legal and ethical standards.
Python packages used: Selenium, Pandas, requests, webbrowser,
mysql.connector
Process of implementation:
[Flowchart: Start → web-crawl HTML scripts from the Big Basket website → data successfully obtained? (No: crawl again) → process → insert data into database → End]
Fig 4.4
Here the goal is to scrape the Fruits and Vegetables data from the Big
Basket website as shown in Fig 4.4. The main objective is to extract the Product
Name, Product Price, Product Weight, and Product Description.
Fig 4.5
As demonstrated in Fig 4.5, the Python libraries are imported and empty lists
are created to hold the scraped data. Next, the URL of the Big Basket website is
given; ChromeDriver opens the web browser automatically and the HTTP request
for the page is sent.
With the help of the Selenium library and the lxml/HTML parser, the website data
is scraped automatically: the XPath of each element is copied, the data is
fetched, and it is appended to the empty lists.
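A hedged sketch of this Selenium/XPath approach; the category URL and XPath expressions are placeholders standing in for the ones copied from the browser's developer tools:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

product_names, product_prices = [], []

driver = webdriver.Chrome()                          # opens the Chrome browser automatically
driver.get("https://www.bigbasket.com/cl/fruits-vegetables/")  # illustrative category URL
time.sleep(5)                                        # crude wait for dynamic content to load

# Hypothetical XPath expressions; the real ones are copied from the elements in DevTools.
for element in driver.find_elements(By.XPATH, "//div[@class='product-name']"):
    product_names.append(element.text)
for element in driver.find_elements(By.XPATH, "//div[@class='product-price']"):
    product_prices.append(element.text)

driver.quit()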
Fig 4.6
Once the scraped data has been appended to the lists, it is cleaned with pandas.
The unstructured data is cleaned and arranged under the appropriate headings,
and the result is written out as a CSV file.
When we want to store the extracted data in a database, the data is inserted
into the MySQL database with the help of mysql.connector, as demonstrated in
Fig 4.6.
4.1.3 Project-C: TheBestChefs Awards Web Scraping
[Flowchart: Start → web crawling → process → insert data into database → End]
Goal and objective:
Fig 4.7
Here the goal is to scrape the top 100 Best Chef Awards data from the
TheBestChefs Awards website as shown in Fig 4.7. The main objective is to
extract the Chef Name and the Chef's Country.
Fig 4.8
As demonstrated in Fig 4.8, the Python libraries are imported and empty lists
are created to hold the scraped data. Next, the URL of the TheBestChefs Awards
website is given; ChromeDriver opens the web browser automatically and the HTTP
request for the page is sent.
With the help of the Selenium library and the lxml/HTML parser, the website data
is scraped automatically: the XPath of each element is copied, the data is
fetched, and it is appended to the empty lists.
Once the scraped data has been appended to the lists, it is cleaned with pandas.
The unstructured data is cleaned and arranged under the given headings, and the
result is written out as a CSV file; it can also be stored in a database.
4.1.4 Project-D: PDF Data Extraction Using Data Science
[Flowchart: Start → PDF document → convert PDF into CSV using tabula → End]
Goal and objective:
Fig 4.9
Here the goal is to extract the complete data from the table in the PDF and
convert it into a CSV file. The main objective is to extract the data according
to the column headings.
Fig 4.10
As demonstrated in Fig 4.10, the Python libraries tabula and pandas are
imported. Next, the tabula library converts the PDF into CSV.
After conversion, the data is arranged and cleaned according to the column
headings using the pandas library. Finally, once the data is cleaned, the
extracted data is written out as the final CSV file.
All the extracted data can also be stored in a MySQL database using
mysql.connector.
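A sketch of the PDF pipeline described above, assuming tabula-py and Java are installed; the file names and cleaning steps are illustrative:

import pandas as pd
import tabula

# Extract every table from the PDF; each table becomes a pandas DataFrame.
tables = tabula.read_pdf("input.pdf", pages="all")

df = pd.concat(tables, ignore_index=True)              # combine the tables into one frame
df.columns = [str(col).strip() for col in df.columns]  # tidy the column headings
df = df.dropna(how="all")                               # drop rows that are entirely empty

df.to_csv("final_output.csv", index=False)              # write the final CSV file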
CHAPTER-5
Results And Analysis
5. RESULTS AND ANALYSIS
Results and analysis of web scraping depend on the specific goals and data
extracted from a website.
Here are some key aspects to consider when analysing the results of web
scraping:
Remember, the analysis process may vary depending on the specific project or
research requirements. The key is to interpret the scraped data in a meaningful
way that aligns with objectives and provides actionable insights.
Project A – Results
Flipkart web scraping can provide valuable insights into product data, pricing
trends, customer reviews, and product descriptions.
Here are some aspects considered when analysing the results of Flipkart web
scraping:
Fig 5.1
All the data extracted from the Flipkart website is stored in a CSV file, as
demonstrated in Fig 5.1.
Fig 5.2
All the data extracted from the Flipkart website is imported and stored in the
MySQL database, as demonstrated in Fig 5.2.
Project B – Results
Big Basket, an online grocery platform, can be a valuable source
of data for various purposes, such as market research, price comparison,
inventory tracking, and more.
Here are some aspects considered when analysing the results of Big Basket
web scraping:
Setting up the scraping environment: Install the chosen scraping tool and
any required dependencies. We may also need to set up a programming
environment like Python and import the necessary libraries.
Inspecting the web page: Using the web browser's developer tools to
inspect the HTML structure of the web page we want to scrape. This
helps us identify the HTML elements containing the data we need and
their corresponding attributes.
Writing the scraping code: Using the selected scraping tool, writing
code to access the Big Basket website, send HTTP requests to the
desired pages, and retrieve the HTML content, then parsing the HTML
response to extract the relevant data using the identified elements and
attributes.
Storing the scraped data: Once we extract the desired data, we can
store it in a structured format such as a CSV or JSON file or directly
process it for further analysis or integration with other systems.
Fig 5.3
All the data extracted from the Big Basket website is stored in a CSV file, as
demonstrated in Fig 5.3.
Fig 5.4
All the data extracted from the Big Basket website is stored in a CSV file, as
demonstrated in Fig 5.4.
Project C – Results
TheBestChefs Awards web scraping involves extracting data related to top
chefs and their rankings, along with their country of residence. It can be a
valuable source of data for various purposes, such as market research and
chef comparison.
Here are some aspects considered when analysing the results of TheBestChefs
website data scraping:
Identifying the HTML elements that contain the chef names, rankings, and
their country.
Inspecting the web page: Using the web browser's developer tools (e.g.,
Chrome DevTools) to inspect the HTML structure of the web page
displaying the top chefs' rankings. This helps us identify the specific
HTML elements and attributes we need to target.
Writing the scraping code: Utilizing the Selenium scraping tool to access
the Best Chefs Awards website, sending HTTP requests to the
appropriate pages, and retrieving the HTML content, then parsing the HTML
response to extract the desired data based on the identified elements and
attributes.
Storing the scraped data: Once we extract the desired data, we can
choose to store it in a structured format such as a CSV or JSON file.
Alternatively, we can process the data directly for further analysis or
integration with other systems.
Fig 5.5
All the data extracted from the TheBestChefs Awards website is stored in a CSV
file, as demonstrated in Fig 5.5.
Project D – Results
Scraping data from PDFs using data science techniques involves extracting
information from PDF files programmatically.
Identifying the PDF structure: PDFs can have various structures, such as
text-based, image-based, or table-based. Here we will be extracting data
from a table-based PDF.
Choosing a library: Selecting a suitable Python library for PDF extraction,
such as pdfminer or Tabula. These libraries provide tools and functions
to parse and extract data from PDF files.
Extracting text from table-based PDFs: If the PDF contains text that can
be selected and copied, you can use the PDF extraction library to extract
the text content. The library will provide functions to open the PDF file,
extract text from each page, and concatenate the extracted text into a
usable format.
Cleaning and pre-processing the extracted data: Once we extract the text
data from the PDF, we may need to perform additional data cleaning and
pre-processing.
Analysing or storing the extracted data: After extracting and pre-
processing the data, we can analyse it directly using data science
techniques or store it in a structured format (e.g., CSV, JSON, or a
database) for further analysis or integration with other systems.
Fig 5.6(i)
Fig 5.6(ii)
All the data extracted from the PDF document is stored in a CSV file, as
demonstrated in Fig 5.6 (i) & (ii).
CHAPTER-6
Conclusion
6. CONCLUSION
In conclusion, web scraping is a powerful technique for extracting data
from websites. It allows us to automate the process of gathering information,
which can be used for various purposes such as data analysis, research, or
building applications.
CHAPTER-7
References & Bibliography
7. REFERENCES AND BIBLIOGRAPHY:
1. Beautiful Soup: Beautiful Soup is a Python library that allows you to extract data from HTML
and XML files. It provides a convenient way to navigate, search, and manipulate the parse tree.
You can find the official documentation and examples at
https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
2. Scrapy: Scrapy is a Python framework specifically designed for web scraping. It provides an
integrated way to handle requests, parse responses, and extract data from websites. The
official documentation and tutorials can be found at https://docs.scrapy.org/.
3. Selenium: Selenium is a popular web testing framework that can also be used for web
scraping. It allows you to automate browser actions and extract data from dynamically
generated web pages. You can refer to the official documentation at
https://www.selenium.dev/documentation/en/.
4. Requests: Requests is a Python library for making HTTP requests. It simplifies the process of
sending HTTP requests and handling responses. While it's not specifically designed for web
scraping, it's often used in conjunction with other libraries like Beautiful Soup or Scrapy. You
can find the documentation and examples at https://docs.python-requests.org/en/latest/.
5. Pandas: The Pandas official documentation is a comprehensive resource that covers all
aspects of pandas. It includes a user guide, API reference, tutorials, and examples. You can find
it at: https://pandas.pydata.org/docs/
6. Tabula: The Tabula documentation provides information on how to install Tabula, use its
command-line interface (CLI), and integrate it into your own projects. It also covers various
features and options available in Tabula. You can access the documentation at:
https://tabula.technology/docs/