
Web-Scraping And PDF Processing

Using Advanced Data Science Techniques

A FINAL YEAR PROJECT


Submitted in partial fulfilment of the requirements for the
Award of the degree of

DIPLOMA
IN
COMPUTER ENGINEERING

Submitted by

MOHAMMED SOHAIL
20221-CM-037

Under The Esteemed Guidance Of


Mrs.V.GNANA PRASUNA
Professor, Computer Engineering

SANKETIKA POLYTECHNIC COLLEGE


(Approved by A.I.C.T.E., Affiliated to AP SBTET)
Beside YSR Cricket Stadium, PM Palem,
Visakhapatnam-530041

2023
SANKETIKA POLYTECHNIC COLLEGE
(Affiliated to AP SBTET)
P.M. Palem, Visakhapatnam-530041

BONAFIDE CERTIFICATE

This is to certify that the project report entitled Web-Scraping And PDF Processing Using
Advanced Data Science Techniques, being submitted by MD. SOHAIL, bearing registered
number 20221-CM-037, in partial fulfilment for the award of the degree of “Diploma” in
Computer Engineering to the STATE BOARD OF TECHNICAL EDUCATION &
TRAINING, Govt. of Andhra Pradesh, is a record of bonafide work done by him under my
supervision during the academic year 2022-2023.

Project Guide Head of the Department


V.GNANA PRASUNA V.GNANA PRASUNA
M.Sc(CS),M.Tech(CST),B.Ed(M.E) M.Sc(CS),M.Tech(CST),B.Ed(M.E)

Professor, Head of the Department,


Dept. of Computer Engineering, Dept. of Computer Engineering,
Sanketika Polytechnic College Sanketika Polytechnic College
Fluentgrid Limited
Hill No-1, Plot No-2, Rushikonda,
Visakhapatnam – 530045

CERTIFICATE

This is to certify that MD. SOHAIL, third-year student of “Diploma” in Computer
Engineering from SANKETIKA POLYTECHNIC COLLEGE, has completed a
project titled Web-Scraping And PDF Processing Using Advanced Data Science
Techniques at FLUENTGRID LIMITED, Visakhapatnam, from 18th January 2023 to
18th July 2023, and the project done by him was found to be good.

Mr. E.Chaitanya Kumar Mr. N.Raj Kumar


Data Engineer Head of Analytics

Fluentgrid-Limited Fluentgrid-Limited
DECLARATION

I hereby declare that this project entitled “Web-Scraping And PDF
Processing Using Advanced Data Science Techniques”, submitted to the
Department of CME, Sanketika Polytechnic College, affiliated to AP SBTET,
Vijayawada, in partial fulfilment for the award of the diploma degree, is entirely
my original work and has not been submitted to any other organization.

MD.SOHAIL
(20221-CM-037)
ACKNOWLEDGEMENT

I feel immense pleasure in expressing my sincere thanks and profound sense
of gratitude to all those people who played a valuable role in the successful
completion of this project through their invaluable suggestions and advice.
I am very thankful to our principal, Dr. A. Ramakrishna, for permitting
and encouraging me to take up the six-month training.
I am deeply indebted to Prof. V. Gnana Prasuna, Head of the Computer
Engineering Department, whose motivation and constant encouragement led me
to pursue a project in the field of software development.
I owe a great amount of gratitude to E. Chaitanya Kumar sir (Data
Engineer) and Y. Anil Kumar sir (Data Engineer) for sparing time from their
busy schedules to provide me with their able guidance at the time of need and
for helping me achieve the ultimate goal of the study. I would also like to thank
Mr. N. Raj Kumar (Head of Analytics) for his valuable support in helping me
gain this opportunity of being with an organization of such esteem.

Our parents have put us ahead of themselves. Because of their hard work
and dedication, I had opportunities beyond my wildest dreams. My heartfelt
thanks to them for giving me all I ever needed to be a successful student.

Finally, I express my thanks to all my other professors, classmates, friends,
neighbours and family members who helped me in the completion of my
project; without their infinite love and patience this would never have been
possible.

- MD. Sohail
ABSTRACT

Web scraping is a powerful technique in the field of data science that involves
extracting data from websites. It plays a crucial role in gathering and analysing
data from diverse sources on the internet. This abstract provides an overview of
web scraping in the context of data science.

Web scraping involves the automated extraction of data from websites using
specialized software or libraries. Data scientists leverage web scraping to collect
structured or unstructured data, which can be further analysed, visualized, and
used for various data-driven tasks. It enables data scientists to access valuable
information that may not be readily available through traditional APIs or datasets.

In the context of data science, web scraping offers numerous possibilities. It
allows researchers to collect large datasets for training machine learning models,
conduct sentiment analysis on social media data, monitor online pricing trends,
gather customer reviews, track news articles, and extract data for market research
and competitor analysis.

To perform web scraping, data scientists utilize libraries like Beautiful Soup,
Scrapy, or Selenium in combination with programming languages like Python.
These libraries provide functionalities to retrieve HTML content from websites,
parse and extract data from HTML elements, handle pagination and dynamic
content, and store the scraped data in a suitable format.

However, web scraping comes with ethical considerations. Data scientists must
respect the website's terms of service, and prioritize the privacy of website owners
and users. It is essential to ensure that web scraping activities are legal and
conducted responsibly.

In conclusion, web scraping is a valuable tool for data scientists, enabling them
to acquire data from diverse online sources for analysis, modelling, and decision-
making. By harnessing the power of web scraping, data scientists can expand their
data collection capabilities and derive insights from the vast amount of
information available on the web.
TABLE OF CONTENTS

1. INTRODUCTION ................................................................................................................. 1

1.1 Software process Flow chart ............................................................................................ 2

2. SYSTEM ANALYSIS .......................................................................................................... 4

2.1 Existing System ............................................................................................................... 4

2.2 Proposed System .............................................................................................................. 4

3. METHODOLOGY ................................................................................................................ 7

3.1 Software & Hardware Requirements: .............................................................................. 7

4. SYSTEM IMPLEMENTATION ......................................................................................... 12

4.1 Projects Synopsis ........................................................................................................... 12

4.1.1 Project-A: Flipkart Web Scraping........................................................................... 12

4.1.2 Project-B: Big Basket Web Scraping ...................................................................... 16

4.1.3 Project-C: TheBestChefs Awards Web Scraping ................................................... 20

4.1.4 Project-D: PDF’s Data Extraction using Data Science ........................................... 23

5. RESULTS AND ANALYSIS ............................................................................................. 27

Project A – Results............................................................................................................... 28

Project B – Results ............................................................................................................... 29

Project C – Results ............................................................................................................... 31

Project D – Results............................................................................................................... 33

6. CONCLUSION ................................................................................................................... 36

7. REFERENCES AND BIBLIOGRAPHY:........................................................................... 38


CHAPTER-1
Introduction

1. INTRODUCTION
Web scraping is a technique used in data science to gather information from
websites automatically. It involves extracting data from web pages and saving it
for analysis. In simple words, web scraping allows us to collect data from the
internet in a structured format that can be used for various purposes.
Data scientists use web scraping to access data that is not readily available
through traditional sources. It helps them gather data from multiple websites
quickly and efficiently. By automating the process of data extraction, web
scraping saves time and effort compared to manual data collection.
Web scraping is particularly valuable in data science because it enables
researchers to analyse large volumes of data from different sources. It allows
them to gather data for research, build predictive models, perform sentiment
analysis, and much more.

To perform web scraping, data scientists use specialized tools and libraries that
can parse the HTML structure of web pages. These tools extract the desired data
by identifying specific elements on the page, such as text, images, tables, or
links. However, it is important to note that web scraping should be done
responsibly and ethically.

1.1 Software process Flow chart

CHAPTER-2
System Analysis

2. SYSTEM ANALYSIS
System analysis is the process of collecting and interpreting facts,
identifying problems, and decomposing a system into its components.
System analysis is carried out in order to study a system or its parts and
identify its objectives. It is a problem-solving technique that improves the
system and ensures that all the components of the system work efficiently
to accomplish their purpose.

2.1 Existing System


In the existing system, the manual web data extraction process has two major
problems.
• Firstly, costs are difficult to control and can escalate very quickly. Data
collection costs increase as more data is collected from each website, and
conducting a manual extraction requires businesses to hire a large number
of staff, which increases the cost of labour significantly.
• Secondly, manual extraction is known to be error-prone. Further, if a
business process is very complex, cleaning up the data can become
expensive and time-consuming.

2.2 Proposed System


Web Scraping (web harvesting or web data extraction) is a computer software
technique to extract information from websites.
Access to Large Amounts of Data: Web scraping allows for the extraction of vast
amounts of data from websites, providing access to a wealth of information that
may otherwise be time-consuming or impractical to collect manually. This data
can be valuable for various purposes, including research, analysis, decision-
making, and business intelligence.

Automation and Efficiency: Web scraping automates the data collection
process, significantly reducing the time and effort required compared to manual
data extraction. It enables the retrieval of data from multiple sources
simultaneously and can be scheduled to run at specific intervals, ensuring that
the collected data remains up to date. This automation improves efficiency and
productivity.
Personalization and Recommendation Systems: Web scraping can support
personalized experiences and recommendation systems. By collecting data on
user preferences, behaviour, or interactions from websites, platforms, or social
media, personalized recommendations can be generated.

CHAPTER-3
Methodology

3. METHODOLOGY

3.1 Software & Hardware Requirements:

To design and build the system, we use the following software requirements:
• Python and Python packages like Beautiful Soup (bs4), Selenium, Pandas,
Tabula, Requests, and chromedriver_win32.
• Databases: MySQL and Excel.
Python is widely used for web scraping due to its simplicity, extensive libraries,
and powerful tools.

Here are some key Python libraries commonly used for web scraping:

Beautiful Soup (Bs4): Beautiful Soup is a library that makes it easy to scrape
information from web pages. Beautiful Soup transforms the unstructured HTML
mark-up into a parse tree, allowing you to search, navigate, and manipulate the
HTML or XML data.
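
A minimal sketch of this workflow is shown below; the URL and tag names are only
illustrative placeholders, not part of the project code.

import requests
from bs4 import BeautifulSoup

# Fetch a page and build the parse tree (html.parser is Python's built-in parser)
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

# Search and navigate the tree
page_title = soup.title.text                          # text inside the <title> tag
links = [a.get("href") for a in soup.find_all("a")]   # every hyperlink target on the page
print(page_title, links[:5])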

Selenium (Se): Selenium is a popular open-source framework for automating
web browsers. It provides a programming interface for interacting with web
pages, filling out forms, clicking buttons, and extracting data. Selenium is often
used for web scraping, automated testing, and web application development.

Selenium works with all major browsers and operating systems, and its scripts
can be written in various languages, i.e., Python, Java, C#, etc. We will be
working with Python. Selenium uses WebDriver, WebElement, and unit testing
for browser automation.
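
A minimal sketch of driving a browser with Selenium (Selenium 4 syntax) is given
below; the URL and locator are placeholders, and ChromeDriver must be available on
the system.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                   # launch a Chrome session via ChromeDriver
driver.get("https://example.com")             # navigate to the page
heading = driver.find_element(By.TAG_NAME, "h1").text   # read the text of an element
print(heading)
driver.quit()                                 # close the browser when done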

Pandas: Pandas is a powerful and popular open-source data analysis and
manipulation library for Python. It provides easy-to-use data structures and data
analysis tools, making it a go-to library for working with structured data.

Pandas is built on top of the NumPy library and is widely used in data science,
machine learning, and data analysis tasks.

We can create and manipulate Series and DataFrame objects, read and write
data, perform data analysis, and explore various data manipulation techniques
using the rich set of Pandas functions and methods.

Overall, Pandas is a versatile and efficient library for data manipulation and
analysis, making it a go-to choice for many data-related tasks in Python.
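
The short sketch below illustrates the typical use of Pandas in these projects, turning
scraped lists into a DataFrame and saving it; the column names and values are only
examples.

import pandas as pd

names = ["Item A", "Item B"]
prices = ["499", "999"]

df = pd.DataFrame({"Product Name": names, "Product Price": prices})
df["Product Price"] = df["Product Price"].astype(int)   # simple cleaning step
df.to_csv("products.csv", index=False)                   # save the cleaned data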

Tabula: Tabula is a Python library used for extracting tables from PDF files. It
provides a convenient way to parse PDF documents and extract tabular data,
saving you from manually copying and pasting data from PDF tables. Tabula is
especially useful when dealing with large or complex PDF files that contain
structured data in table format.
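
A minimal sketch of using tabula-py is shown below; it assumes Java is installed
(tabula-py wraps the Java Tabula engine), and the file names are placeholders.

import tabula

# Read every table on every page into a list of pandas DataFrames
tables = tabula.read_pdf("report.pdf", pages="all")

# Or convert the tables straight into a CSV file
tabula.convert_into("report.pdf", "report.csv", output_format="csv", pages="all")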

Requests: The requests library is a popular Python library for making HTTP
requests. It simplifies the process of sending HTTP requests, handling responses,
and working with APIs or web pages.

With requests, you can interact with web servers and retrieve data from URLs.
Using the requests library, we can fetch the content from a given URL, and the
Beautiful Soup library helps to parse it and extract the details we want.
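
The sketch below shows requests and Beautiful Soup working together as described;
the URL is a placeholder.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)
response.raise_for_status()                         # stop early on HTTP errors
soup = BeautifulSoup(response.text, "html.parser")  # parse the fetched HTML
print(response.status_code, soup.title.text)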

Chrome Driver: ChromeDriver is an open-source tool for the automated testing of
web apps in the Chrome browser. It provides capabilities for navigating to web
pages, user input, JavaScript execution, and more. ChromeDriver is a standalone
server that implements the W3C WebDriver standard. It is used for the automation
of websites.

MySQL Database: MySQL is a popular relational database management system
(RDBMS) that allows you to store and retrieve structured data. When combined
with web scraping, MySQL can be used to store the data you extract from
websites, enabling you to organize and analyse it more effectively.
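
A minimal sketch of storing scraped rows with the mysql-connector-python package
is given below; the credentials, database, and table names are placeholders.

import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="scraping"
)
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS products (name VARCHAR(255), price INT)")
cur.execute("INSERT INTO products (name, price) VALUES (%s, %s)", ("Item A", 499))
conn.commit()     # persist the inserted row
cur.close()
conn.close()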

To design and build the system, we use the following hardware requirements-
• Processor: Intel Pentium IV or above
• Ram: 512 MB or more
• Hard Disk: 40 GB or more
• Input Devices: Keyboard, Mouse

CHAPTER-4
System Implementation

4. SYSTEM IMPLEMENTATION

4.1 Projects Synopsis

Project-A: Flipkart Web Scraping


Project-B: Big Basket Web Scraping
Project-C: TheBestChefs Awards Web Scraping
Project-D: PDF’s Extraction using Data Science
Project-E: IMDB Website Scraping

4.1.1 Project-A: Flipkart Web Scraping

Introduction to Flipkart Web Scraping


Flipkart is one of the largest e-commerce platforms in India, offering a
wide range of products including electronics, fashion, home appliances, books,
and more. Scraping Flipkart website allows us to extract product information,
reviews, pricing, and other relevant data from the website.

Web scraping is the process of automatically extracting data from websites using
software or scripts. By utilizing web scraping techniques, you can gather large
amounts of data from Flipkart for various purposes such as market research,
price comparison, trend analysis, and inventory tracking.

To perform web scraping on Flipkart, you will need to use programming
languages like Python and libraries such as Beautiful Soup or Selenium. These
tools provide the necessary functionality to fetch web pages, parse their HTML
content, and extract the desired data.

Here are some common use cases for Flipkart web scraping:

Product Data Extraction: Extracting detailed product information such as
product names, descriptions, prices, ratings, and reviews. The extracted data can
be used for competitor analysis, pricing strategies, or building a product catalog.

Price Monitoring: Tracking the prices of specific products over time to identify
price fluctuations, price drops, or special offers. This information can be
valuable for price comparison websites or for making informed purchasing
decisions.

Review Analysis: Scraping customer reviews and ratings to analyse sentiment,
identify product strengths and weaknesses, and gain insights into customer
preferences and satisfaction levels.

It's important to note that when scraping Flipkart or any other website, you
should be aware of the website's terms of service and data usage policies.
Respect the website's policies, avoid overloading the server with requests, and
make sure to handle the scraped data in compliance with legal and ethical
standards.

Python packages used: Beautiful Soup (bs4), Pandas, requests, webbrowser,
mysql.connector
Process of implementation:

[Flow chart: Start → web crawling process → HTML scripts from the Flipkart website →
data successfully obtained? → if No, repeat crawling; if Yes, retrieve data using
Beautiful Soup, Pandas, and requests → obtain data in Excel → insert data into the
database → End]

Goal and objective:

Fig 4.1

The goal is to scrape the iPhone data from the Flipkart website as shown
in Fig 4.1. The main objective is to extract the product name, product price,
product description, and product reviews.

Specific Code or Scripts:

Fig 4.2

As demonstrated in Fig 4.2, the Python libraries are imported and empty
lists are created so that the data can be stored in them. Next, the URL of the
Flipkart website is given and, using the requests library, an HTTP request is
sent to the web server.
With the help of Beautiful Soup, using the lxml/HTML parser, the website data
is scraped. A common class shared by the product elements in the website's
HTML is identified, and the extracted values are appended to the empty lists.

Fig 4.3

Once the scraped data has been appended to the lists, the data is cleaned using
Pandas. The unstructured data is cleaned and arranged under the headings, and
after the cleaning and arranging is done, all the extracted data is converted into
a CSV file.
When we want to store the extracted data in a database, as demonstrated in
Fig 4.3, the data is imported and stored in the MySQL database with the help of
mysql.connector.
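
The condensed sketch below follows the steps described above; it is not the exact
project script, and the search URL, CSS class names, and output file are assumptions,
since Flipkart's markup changes frequently.

import requests
import pandas as pd
from bs4 import BeautifulSoup

names, prices, descriptions = [], [], []

# Placeholder search URL; a User-Agent header makes the request look like a browser
url = "https://www.flipkart.com/search?q=iphone"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(html, "html.parser")

# Each product card shares a common container class (hypothetical class names here)
for card in soup.find_all("div", class_="product-card"):
    names.append(card.find("div", class_="product-name").text)
    prices.append(card.find("div", class_="product-price").text)
    descriptions.append(card.find("ul", class_="product-specs").text)

# Clean and arrange the lists under headings, then export to CSV
df = pd.DataFrame({"Name": names, "Price": prices, "Description": descriptions})
df.to_csv("flipkart_iphones.csv", index=False)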

4.1.2 Project-B: Big Basket Web Scraping

Introduction to Big Basket Web Scraping


Big Basket is an Indian online grocery delivery platform that allows users
to purchase a wide range of groceries and household items. While web scraping
is a technique used to extract data from websites, it's important to note that
scraping Big Basket's website may violate their terms of service. It's always
recommended to review a website's terms of service and obtain permission
before scraping any website.

Web scraping is the process of automatically extracting information from
websites. It involves writing code to simulate human browsing behaviour,
accessing web pages, and extracting data from them. The extracted data can be
saved in a structured format such as a spreadsheet or a database for further
analysis or use.

To perform web scraping, we would typically use a programming language like
Python along with libraries specifically designed for scraping, such as Pandas and
Selenium. These libraries provide functions and methods to navigate the HTML
structure of web pages, locate specific elements, and extract the desired data.

Keep in mind that web scraping may have legal and ethical implications, and
it's important to ensure that you're scraping data responsibly and within the
bounds of the law. It is important to note that when scraping Big Basket or
any other website, you should be aware of the website's terms of service and
data usage policies. Respect the website's policies, avoid overloading the
server with requests, and make sure to handle the scraped data in compliance
with legal and ethical standards.

Python packages used: Selenium, Pandas, requests, webbrowser,
mysql.connector
Process of implementation:

[Flow chart: Start → web crawling process → HTML scripts from the Big Basket website →
data successfully obtained? → if No, repeat crawling; if Yes, retrieve data using
Selenium, Pandas, and requests → obtain data in Excel → insert data into the
database → End]

Goal and objective:

Fig 4.4
Here the goal is to scrape the fruits and vegetables data from the Big
Basket website as shown in Fig 4.4. The main objective is to extract the product
name, product price, product weight, and product description.

Specific Code or Scripts:

Fig 4.5

As demonstrated in Fig 4.5, the Python libraries are imported and empty
lists are created so that the data can be stored in them. Next, the URL of the
Big Basket website is given; ChromeDriver opens the web browser automatically
and the HTTP requests for the website are sent.
With the help of the Selenium library, the website data is scraped automatically.
By copying the XPath of each element, the data is fetched and then appended to
the empty lists.

Fig 4.6

Once the scraped data has been appended to the lists, the data is cleaned using
Pandas. The unstructured data is cleaned and arranged under the headings, and
after the cleaning and arranging is done, all the extracted data is converted into
a CSV file.
When we want to store the extracted data in a database, as demonstrated in
Fig 4.6, the data is imported and stored in the MySQL database with the help of
mysql.connector.
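
A condensed sketch of this Selenium-based flow is given below; the category URL,
XPath expressions, and output file are assumptions, and ChromeDriver must be
available on the system.

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

names, prices, weights = [], [], []

driver = webdriver.Chrome()                                      # opens Chrome automatically
driver.get("https://www.bigbasket.com/cl/fruits-vegetables/")    # placeholder category URL

# XPaths copied from the browser's developer tools (illustrative only)
for product in driver.find_elements(By.XPATH, "//li[contains(@class, 'product')]"):
    names.append(product.find_element(By.XPATH, ".//h3").text)
    prices.append(product.find_element(By.XPATH, ".//span[contains(@class, 'price')]").text)
    weights.append(product.find_element(By.XPATH, ".//span[contains(@class, 'weight')]").text)

driver.quit()

df = pd.DataFrame({"Name": names, "Price": prices, "Weight": weights})
df.to_csv("bigbasket_fruits_vegetables.csv", index=False)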

4.1.3 Project-C: TheBestChefs Awards Web Scraping

Introduction to TheBestChefs Awards Web Scraping


Web scraping TheBestChefs Awards website involves extracting data from the
website to gather information about the awarded chefs. TheBestChefs Awards
data is scraped for the years 2017, 2018, 2019, 2020, 2021, 2022, and 2023.
The data is scraped and stored in a CSV file, which is later used for analysing
the Best Chef data.

Python packages used: Selenium, Pandas, requests, webbrowser,
mysql.connector
Process of implementation:

[Flow chart: Start → web crawling process → HTML scripts from TheBestChefs website →
data successfully obtained? → if No, repeat crawling; if Yes, retrieve data using
Selenium, Pandas, and requests → obtain data in Excel → insert data into the
database → End]

Goal and objective:

Fig 4.7

Here the goal is to scrape the top 100 Best Chef awards data from
TheBestChefs Awards website as shown in Fig 4.7. The main objective is to
extract the chef's name and the chef's country.

Specific Code or Scripts:

Fig 4.8

As demonstrated in Fig 4.8, the Python libraries are imported and empty
lists are created so that the data can be stored in them. Next, the URL of
TheBestChefs Awards website is given; ChromeDriver opens the web browser
automatically and the HTTP requests for the website are sent.
With the help of the Selenium library, the website data is scraped automatically.
By copying the XPath of each element, the data is fetched and then appended to
the empty lists.
Once the scraped data has been appended to the lists, the data is cleaned using
Pandas. The unstructured data is cleaned and arranged under the given headings,
and after the cleaning and arranging is done, all the extracted data is converted
into a CSV file. It can also be stored in the database.
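
The short sketch below applies the same Selenium pattern to the awards pages for each
year; the URL pattern and XPaths are assumptions, since the live site's markup may
differ.

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

rows = []
driver = webdriver.Chrome()

for year in range(2017, 2024):
    driver.get(f"https://thebestchefawards.com/top-100/{year}")   # placeholder URL pattern
    for entry in driver.find_elements(By.XPATH, "//div[contains(@class, 'chef-card')]"):
        name = entry.find_element(By.XPATH, ".//h2").text
        country = entry.find_element(By.XPATH, ".//span[contains(@class, 'country')]").text
        rows.append({"Year": year, "Chef Name": name, "Country": country})

driver.quit()
pd.DataFrame(rows).to_csv("best_chefs_2017_2023.csv", index=False)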

4.1.4 Project-D: PDF’s Data Extraction using Data Science

Introduction to PDF’s Data Extraction


PDF data extraction refers to the process of extracting information from
PDF documents and converting it into a structured format that can be easily
analyzed, stored, or used for further processing. PDF (Portable Document
Format) is a widely used file format for presenting and sharing documents, and
extracting data from PDFs can be valuable in various domains, such as research,
finance, or data analysis.
The process of extracting data from PDFs involves parsing the document's
content and identifying the relevant information. Depending on the complexity
and structure of the PDF, different techniques can be employed for extraction.
Python packages used: Tabula, Pandas.
Process of implementation:

[Flow chart: Start → PDF document → convert PDF into CSV using tabula → read the
CSV file → convert it into a Data Frame → save the CSV file → End]

Goal and objective:

Fig 4.9

Here the goal is to scrape the complete data from a PDF table and convert it
into a CSV file. The main objective is to extract the data according to the column
headings.

Specific Code or Scripts:

Fig 4.10

As demonstrated in Fig 4.10, the Python libraries tabula and pandas are
imported. Next, using the tabula library, the PDF is converted into CSV.
After conversion, the data is arranged and cleaned according to the column
headings using the pandas library. Finally, once the data is cleaned, the
extracted data is converted into the final CSV file.
All the extracted data can also be stored in a database using mysql.connector
for the MySQL database.
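
A condensed sketch of this tabula-and-pandas flow is shown below; the PDF file name
is a placeholder, and Java must be installed for tabula-py to work.

import tabula

# Extract every table from the PDF; tables is a list of pandas DataFrames
tables = tabula.read_pdf("input.pdf", pages="all")

# Clean and arrange the first table under its column headings
df = tables[0]
df.columns = [str(c).strip() for c in df.columns]   # tidy the headings
df = df.dropna(how="all")                           # drop fully empty rows

df.to_csv("extracted_table.csv", index=False)       # final CSV output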

CHAPTER-5
Results And Analysis

5. RESULTS AND ANALYSIS

Results and analysis of web scraping depend on the specific goals and data
extracted from a website.

Here are some key aspects to consider when analysing the results of web
scraping:

• Extracted Data: Evaluating the quality and completeness of the extracted
data, checking that all the desired fields and information have been
successfully scraped, assessing whether any data is missing, and considering
data cleaning and pre-processing techniques to address any issues.
• Data Volume: Assessing the volume of data obtained through web
scraping, calculating the number of records, the range of variables
collected, and the overall size of the dataset, and determining whether the
data collected is sufficient for the intended analysis and whether any
additional data collection is necessary.
• Data Insights: Analysing the extracted data to derive meaningful
insights. This may involve statistical analysis, data visualization, or other
analytical techniques depending on the nature of the data and the
research questions or objectives.
• Comparison: Comparing the scraped data with existing data to gain a
comparative perspective. This could involve comparing prices,
performance metrics, or other relevant factors with competitors or
industry standards.
• Visualization: Visualizing the scraped data using charts, graphs, or other
visual representations to better understand the patterns.

Remember, the analysis process may vary depending on the specific project or
research requirements. The key is to interpret the scraped data in a meaningful
way that aligns with objectives and provides actionable insights.

Project A – Results
Flipkart Web Scraping can provide valuable insights into product data, pricing
trends, customer reviews, and product description.

Here are some aspects considered when analyzing the results of Flipkart web
scraping:

• Product Information: Evaluating the extracted product data, such as
product names and descriptions, assessing the completeness and accuracy
of the product information collected, and gaining a deeper understanding
of the available product range.
• Pricing Analysis: Analysing the extracted pricing data to identify pricing
patterns, price fluctuations, and pricing strategies employed by sellers on
Flipkart, and comparing prices for similar products across different sellers.
• Customer Reviews and Ratings: Assessing the extracted customer
reviews and ratings to understand customer sentiments, satisfaction
levels, and product feedback.

Fig 5.1

All the data extracted from the Flipkart website is stored in a CSV file as
demonstrated in Fig 5.1.

Fig 5.2

All the data extracted from the Flipkart website is imported into the MySQL
database as demonstrated in Fig 5.2.

Project B – Results
Web scraping Big Basket, an online grocery platform, can be a valuable source
of data for various purposes, such as market research, price comparison,
inventory tracking, and more.

Here are some aspects considered when analysing the results of Big Basket
web scraping:

• Understanding the website structure: Analysing the layout and structure
of Big Basket's website to identify the specific data we want to extract.
This may include product names, prices, descriptions, customer reviews,
and other relevant details.
• Choosing a scraping tool: There are several scraping tools and libraries
available in various programming languages, such as Python's
BeautifulSoup, Scrapy, or Selenium. These tools help in fetching and
parsing the HTML content of web pages.
• Setting up the scraping environment: Installing the chosen scraping tool
and any required dependencies. We may also need to set up a programming
environment like Python and import the necessary libraries.
• Inspecting the web page: Using the web browser's developer tools to
inspect the HTML structure of the web page we want to scrape. This
helps us identify the HTML elements containing the data we need and
their corresponding attributes.
• Writing the scraping code: Using the selected scraping tool, writing code
to access the Big Basket website, send HTTP requests to the desired
pages, and retrieve the HTML content, then parsing the HTML response
to extract the relevant data using the identified elements and attributes.
• Storing the scraped data: Once we extract the desired data, we can store
it in a structured format such as a CSV or JSON file or directly process it
for further analysis or integration with other systems.

Fig 5.3

All the data extracted from the Big Basket website is stored in a CSV file as
demonstrated in Fig 5.3.

Fig 5.4

All the data extracted from the Big Basket website is stored in a CSV file as
demonstrated in Fig 5.4.

Project C – Results
Web scraping TheBestChefs Awards website involves extracting data related to
top chefs and their rankings along with their country of residence. It can be a
valuable source of data for various purposes, such as market research, best-chef
comparison, and more.

Here are some aspects considered when analysing the results of TheBestChefs
website data scraping:

• Choosing a scraping tool: Selecting a suitable scraping tool or library from
Python, which has popular options such as Beautiful Soup, Scrapy, or
Selenium.
• Setting up the scraping environment: Installing the chosen scraping tool
and any necessary dependencies, and creating a programming environment
if required.
• Understanding the website structure: Analysing the Best Chefs Awards
website to understand its layout and the data we want to extract, and
identifying the HTML elements that contain the chef names, rankings, and
their countries.
• Inspecting the web page: Using the web browser's developer tools (e.g.,
Chrome DevTools) to inspect the HTML structure of the web page
displaying the top chefs' rankings. This helps us identify the specific
HTML elements and attributes we need to target.
• Writing the scraping code: Utilizing the Selenium scraping tool to access
the Best Chefs Awards website, sending HTTP requests to the appropriate
pages, retrieving the HTML content, and parsing the HTML response to
extract the desired data based on the identified elements and attributes.
• Storing the scraped data: Once we extract the desired data, we can
choose to store it in a structured format such as a CSV or JSON file.
Alternatively, we can process the data directly for further analysis or
integration with other systems.

Fig 5.5

All the data extracted from TheBestChefs Awards website is stored in a CSV file
as demonstrated in Fig 5.5.

Project D – Results
Scraping data from PDFs using data science techniques involves extracting
information from PDF files programmatically.

• Identifying the PDF structure: PDFs can have various structures, such as
text-based, image-based, or table-based. Here we will be extracting a
table-based PDF.
• Choosing a library: Selecting a suitable Python library for PDF extraction,
such as pdfminer or Tabula. These libraries provide tools and functions
to parse and extract data from PDF files.
• Extracting text from table-based PDFs: If the PDF contains text that can
be selected and copied, the PDF extraction library can be used to extract
the text content. The library provides functions to open the PDF file,
extract text from each page, and concatenate the extracted text into a
usable format.
• Cleaning and pre-processing the extracted data: Once we extract the text
data from the PDF, we may need to perform additional data cleaning and
pre-processing.
• Analysing or storing the extracted data: After extracting and pre-
processing the data, we can analyse it directly using data science
techniques or store it in a structured format (e.g., CSV, JSON, or a
database) for further analysis or integration with other systems.

Fig 5.6(i)

Fig 5.6(ii)

All the data extracted from the PDF is stored in a CSV file as demonstrated in
Fig 5.6 (i) & (ii).

CHAPTER-6
Conclusion

6. CONCLUSION
In conclusion, web scraping is a powerful technique for extracting data
from websites. It allows us to automate the process of gathering information,
which can be used for various purposes such as data analysis, research, or
building applications.

During the web scraping project, we covered several important aspects:

1. Understanding the website structure: It is crucial to understand the structure of the
website that we want to scrape. This includes identifying the HTML elements and attributes
that contain the desired data.
2. Selecting the appropriate tools: We used libraries like BeautifulSoup, Selenium, pandas,
tabula, and requests to parse HTML, send HTTP requests, and extract data from web pages.
3. Navigating and extracting data: We used methods like find(), find_all(), and CSS selectors
to locate specific elements and extract their data. We also handled cases where the data
was nested or required additional processing.
4. Cleaning and pre-processing data: Sometimes the extracted data may contain unwanted
characters, formatting issues, or inconsistencies. We performed data cleaning and pre-
processing steps to ensure the data is in a usable format.
5. Storing the data: We used data structures like lists or data frames (using pandas) to store
the extracted data. Additionally, we explored options to save the data in various formats
such as CSV or a database.
6. Handling errors and exceptions: Web scraping can be prone to errors due to various
reasons like network issues, website changes, data inconsistencies, or web driver updates.
Remember, it’s important to scrape responsibly and ethically, ensuring that the scraping
activities do not negatively impact the website or violate any regulations.
Web scraping is a vast field, and there are many advanced techniques and scenarios that
can be explored.
Happy web scraping..!!

CHAPTER-7
References & Bibliography

7. REFERENCES AND BIBLIOGRAPHY:

1. Beautiful Soup: Beautiful Soup is a Python library that allows you to extract data from HTML
and XML files. It provides a convenient way to navigate, search, and manipulate the parse tree.
You can find the official documentation and examples at
https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

2. Scrapy: Scrapy is a Python framework specifically designed for web scraping. It provides an
integrated way to handle requests, parse responses, and extract data from websites. The
official documentation and tutorials can be found at https://docs.scrapy.org/.

3. Selenium: Selenium is a popular web testing framework that can also be used for web
scraping. It allows you to automate browser actions and extract data from dynamically
generated web pages. You can refer to the official documentation at
https://www.selenium.dev/documentation/en/.

4. Requests: Requests is a Python library for making HTTP requests. It simplifies the process of
sending HTTP requests and handling responses. While it's not specifically designed for web
scraping, it's often used in conjunction with other libraries like Beautiful Soup or Scrapy. You
can find the documentation and examples at https://docs.python-requests.org/en/latest/.

5. Pandas: The Pandas official documentation is a comprehensive resource that covers all
aspects of pandas. It includes a user guide, API reference, tutorials, and examples. You can find
it at: https://pandas.pydata.org/docs/

6. Tabula: The Tabula documentation provides information on how to install Tabula, use its
command-line interface (CLI), and integrate it into your own projects. It also covers various
features and options available in Tabula. You can access the documentation at:
https://tabula.technology/docs/
