
DWV Unit Ii

The document provides an overview of XML (Extensible Markup Language) as a text-based format for storing and transporting structured data, highlighting its key features such as self-descriptiveness, extensibility, and platform independence. It discusses various applications of XML, including data storage, configuration files, document representation, and communication between systems, as well as its role in web development and multimedia. Additionally, the document covers binary data formats supported by pandas, emphasizing their efficiency for handling large datasets and their applications in data science and machine learning.

Uploaded by

Aavula Ravi

Data Loading, Storage, and File Formats: XML and HTML

What is an XML file?
➢ An XML (Extensible Markup Language) file is a text-based file format used to store and transport structured data.
➢ It is widely used for data exchange between systems, configuration files, and document storage.
➢ XML is both human-readable and machine-readable.

Key Features of XML:
➢ Structured Format
  ▪ XML uses tags to define data in a hierarchical structure, similar to HTML but with user-defined tags.
  ▪ Example:

    <person>
      <name>Alice</name>
    </person>

➢ Self-Descriptive: XML tags describe the data they contain, making the file self-explanatory.
➢ Extensibility: Unlike HTML, XML has no predefined tags. You define your own tags as needed.
➢ Platform Independent: XML files can be read and written from different programming languages and platforms.
➢ Supports Hierarchical Data: XML is well suited to representing nested data, such as records within records.

Structure of an XML File
➢ XML Declaration (Optional)
  ▪ Specifies the XML version and character encoding.

    <?xml version="1.0" encoding="UTF-8"?>

➢ Root Element
  ▪ Every XML file must have a single root element that contains all other elements.

    <root> ... </root>

➢ Elements
  ▪ Data is stored within elements enclosed by opening and closing tags.

    <tag> value </tag>

➢ Attributes
  ▪ Elements can have attributes to store additional metadata. Example (the id attribute is illustrative):

    <person id="1">
      <name>Alice</name>
    </person>

➢ Comments
  ▪ Comments can be included for readability.

    <!-- This is a comment -->
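The parts of an XML file (declaration, root element, child elements, attributes, comments) can be seen together in a short sketch that uses Python's standard xml.etree.ElementTree module. The sample document, its tag names, and the id attribute are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag and attribute names are illustrative.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<people>
  <!-- one <person> element per record -->
  <person id="1">
    <name>Alice</name>
  </person>
  <person id="2">
    <name>Bob</name>
  </person>
</people>"""

# fromstring() parses the document and returns the root element.
# Comments are skipped by the default parser.
root = ET.fromstring(xml_bytes)

records = []
for person in root.findall("person"):
    records.append({
        "id": person.get("id"),           # attribute value (always a string)
        "name": person.findtext("name"),  # text of the <name> child element
    })

print(root.tag)
print(records)
```

Attribute values always come back as strings, so any numeric conversion has to be done explicitly.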

Applications of XML files
➢ Data Storage and Exchange
  ▪ XML is widely used to store and exchange data between applications, systems, or platforms.
  ▪ Examples:
    ▪ Web Services: XML is the foundation for SOAP-based web services.
    ▪ API Communication: Systems use XML to format data exchanged via APIs.
    ▪ Industry Standards: Feed formats such as RSS and Atom are XML-based.
➢ Configuration Files
  ▪ Many software applications use XML files to store configuration settings.
  ▪ Examples:
    ▪ Java Applications: web.xml in Java EE for web application configuration.
    ▪ Build Tools: pom.xml in Maven to define dependencies and the build lifecycle.
    ▪ Software Configuration: Settings in games or enterprise software.
➢ Document Representation
  ▪ XML is used as a markup language for document creation, formatting, and storage.
  ▪ Examples:
    ▪ Microsoft Office: Uses Office Open XML formats (e.g., .docx, .xlsx).
    ▪ DocBook: An XML standard for writing structured documents like technical manuals.
    ▪ eBooks: Formats like EPUB are XML-based.

➢ Web Development
  ▪ XML plays a role in web technologies for data representation and configuration.
  ▪ Examples:
    ▪ RSS and Atom Feeds: For delivering news updates or content syndication.
    ▪ Sitemaps: Websites use XML sitemaps to inform search engines about their page structure.
    ▪ XHTML: An XML-compliant version of HTML.

➢ Multimedia and Graphics
  ▪ XML is used to define and store multimedia and graphics-related data.
  ▪ Examples:
    ▪ SVG (Scalable Vector Graphics): XML-based format for vector graphics.
    ▪ X3D: For 3D graphics representation.
    ▪ MusicXML: XML-based format for storing musical notation (MIDI files themselves are binary, not XML).
➢ Industry-Specific Applications
  ▪ XML is tailored to specific industries through standardized schemas.
  ▪ Examples:
    ▪ Healthcare: HL7 (Health Level Seven) standards for sharing medical data.
    ▪ Finance: FpML (Financial products Markup Language) for derivatives and other financial instruments.
    ▪ Publishing: DITA (Darwin Information Typing Architecture) for creating and managing content.
➢ Communication Between Systems
  ▪ XML facilitates cross-platform and cross-language communication.
  ▪ Examples:
    ▪ Enterprise Applications: Integration of disparate systems using XML-based messages.
    ▪ Middleware: XML-based protocols like XMPP for instant messaging and presence information.
➢ Storage and Transfer of Metadata
  ▪ XML is used to store metadata about various resources.
  ▪ Examples:
    ▪ Digital Libraries: Metadata encoding in MODS (Metadata Object Description Schema).
    ▪ Media Files: Metadata for audio, video, and images (e.g., XMP; EXIF in photos is a binary format rather than XML).
➢ Validation and Standardization
  ▪ XML files can enforce structure and rules using a DTD (Document Type Definition) or XSD (XML Schema Definition).
  ▪ Examples:
    ▪ Input Validation: Ensuring data conforms to a predefined schema.
    ▪ Data Standards: Maintaining consistency in file structures across organizations.
➢ Communication in IoT (Internet of Things)
  ▪ XML is used to format data exchanged between IoT devices and applications.
  ▪ Examples:
    ▪ Device Configurations: Using XML to store and exchange IoT device settings.
    ▪ Sensor Data: Sending structured data from devices to servers.
➢ Gaming and Simulations
  ▪ XML is used in game engines and simulations for defining game objects and configurations.
  ▪ Examples:
    ▪ Unity: Uses XML in places such as UXML files for UI layout.
    ▪ Game Configurations: Storing level data, character stats, or scripts.
➢ Database Applications
  ▪ XML is used to store and query data in native XML databases or as an intermediate format.
  ▪ Examples:
    ▪ Oracle: Supports XMLType for storing and querying XML data.
    ▪ SQL Server: Provides an xml data type that can be queried with XQuery (including XPath expressions).
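As a rough illustration of the data storage and exchange use case, the sketch below has one side serialize a few records to XML and the other side parse them back. It uses only the standard xml.etree.ElementTree module; the record fields and tag names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical records to exchange between two systems
records = [{"id": "1", "name": "Alice"}, {"id": "2", "name": "Bob"}]

# Sender: build an element tree and serialize it to bytes
root = ET.Element("people")
for rec in records:
    person = ET.SubElement(root, "person", id=rec["id"])  # id stored as an attribute
    ET.SubElement(person, "name").text = rec["name"]      # name stored as a child element
payload = ET.tostring(root, encoding="utf-8")

# Receiver: parse the bytes and rebuild plain Python dictionaries
received = [
    {"id": p.get("id"), "name": p.findtext("name")}
    for p in ET.fromstring(payload).iter("person")
]

print(received == records)
```

Because the payload is plain text in a documented structure, the receiving side could just as well be written in another language.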

Parsing XML with lxml.objectify
➢ XML is a common structured data format supporting hierarchical, nested data with metadata.
➢ The pandas.read_html function uses either lxml or Beautiful Soup under the hood to parse data from HTML.
➢ XML and HTML are structurally similar, but XML is more general.
➢ Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot. Each <INDICATOR> element is then turned into a dictionary of tag names and values, skipping a few fields that are not needed.

    from lxml import objectify
    import pandas as pd

    path = "datasets/mta_perf/Performance_MNR.xml"

    with open(path) as f:
        parsed = objectify.parse(f)

    root = parsed.getroot()

    data = []
    skip_fields = ["PARENT_SEQ", "INDICATOR_SEQ",
                   "DESIRED_CHANGE", "DECIMAL_PLACES"]

    # root.INDICATOR yields each <INDICATOR> record element in turn
    for elt in root.INDICATOR:
        el_data = {}
        for child in elt.getchildren():
            if child.tag in skip_fields:
                continue
            # pyval is the element text converted to a Python value
            el_data[child.tag] = child.pyval
        data.append(el_data)

    perf = pd.DataFrame(data)
    perf.head()
➢ The pandas.read_xml function turns this process into a one-line expression:

    import pandas as pd

    path = "datasets/mta_perf/Performance_MNR.xml"

    perf2 = pd.read_xml(path)

    print(perf2.head())
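To try pandas.read_xml without the dataset file, the following sketch feeds it a small in-memory document. The XML content is invented for illustration, and parser="etree" is chosen here only so the example runs on the standard library parser instead of lxml (the default).

```python
from io import StringIO

import pandas as pd

# Hypothetical XML: each child of the root becomes one row
xml_text = """<records>
  <record>
    <name>Alice</name>
    <score>90.5</score>
  </record>
  <record>
    <name>Bob</name>
    <score>80.0</score>
  </record>
</records>"""

# Child elements of each <record> become columns; numeric text is converted
df = pd.read_xml(StringIO(xml_text), parser="etree")
print(df)
```

For documents where the records are nested more deeply, read_xml accepts an xpath argument to select the row elements.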
Reading an HTML file in pandas
➢ To read an HTML file in pandas, use the pandas.read_html() function.
➢ This function parses the HTML tables in the file and converts each one into a DataFrame.

Basic Syntax

    import pandas as pd

    df_list = pd.read_html("file.html")

➢ file.html: Path to the HTML file (can also be a URL).
➢ Returns: A list of DataFrames, where each DataFrame corresponds to a table found in the HTML.
Reading HTML from a URL
➢ We can read tables directly from a webpage URL.

Basic Syntax

    url = "https://example.com/tables"

    df_list = pd.read_html(url)

Writing a DataFrame to HTML

➢ We can write a DataFrame to an HTML file with the pandas.DataFrame.to_html() function.

Basic Syntax

    df.to_html("output.html", index=False)
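A minimal sketch of what to_html produces, using a made-up two-row DataFrame. When no file path is given, the method returns the markup as a string instead of writing a file, which makes the output easy to inspect.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90.5, 80.0]})

# With no path argument, to_html returns the HTML table as a string;
# index=False leaves out the row index column
html = df.to_html(index=False)
print(html)
```

The resulting string is an ordinary HTML table, so it can be embedded in a page or read back with pandas.read_html, provided one of its parser libraries is installed.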
Can we read text from a web page using pandas?
➢ pandas can read structured text (such as tables) from a webpage using the pandas.read_html() function, which extracts HTML tables directly from the page.
➢ Reading free-form, non-tabular text from a webpage requires additional libraries: requests to download the page and BeautifulSoup (from the bs4 package) to parse it. The extracted text can then be loaded into pandas.

Extracting Plain Text from an Entire Web Page

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    url = "https://example.com"  # placeholder; replace with the page to read

    # Download the page and parse its HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    text = soup.get_text()

    # Split the text into lines and remove empty ones
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    df = pd.DataFrame(lines, columns=["Text"])
    print(df)
Web scraping
➢ Web scraping with Python is a technique for extracting and storing large amounts of data from websites using Python programs.
➢ Web scraping is useful when there is no direct way to download data from a website, or when there is no API (application programming interface) available.
➢ The extracted data can be used for a variety of purposes, such as data analysis, market research, or competitive intelligence.
➢ Python is a popular choice for web scraping because of its ease of use, large collection of libraries, and readable syntax.
Tools and libraries that can be used for web scraping with Python include:

➢ Beautiful Soup: A library for parsing HTML and XML documents and extracting data from them
➢ Requests: A library for fetching HTML data from a web page over HTTP
➢ lxml: A third-party library for parsing and processing XML and HTML
➢ XPath: A query language that uses path expressions to navigate and extract data from HTML or XML documents
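To give a feel for path expressions, here is a small sketch using the standard xml.etree.ElementTree module, which supports only a limited subset of XPath (lxml offers full XPath 1.0 through its xpath method). The catalog document and its tag names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document for illustration
xml_bytes = b"""<catalog>
  <book category="data">
    <title>Title A</title>
  </book>
  <book category="web">
    <title>Title B</title>
  </book>
  <book category="data">
    <title>Title C</title>
  </book>
</catalog>"""

root = ET.fromstring(xml_bytes)

# .//book             -> every <book> element at any depth
# [@category='data']  -> keep only books whose category attribute is "data"
# /title              -> step down to the <title> child of each match
titles = [t.text for t in root.findall(".//book[@category='data']/title")]
print(titles)
```

The same style of expression works on HTML once it has been parsed into a tree, which is how scrapers typically pick out specific elements.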

Binary Data Formats
➢ pandas provides robust support for binary data formats, which are typically more compact to store and faster to read and write than text-based formats.
➢ These formats are particularly useful when working with large datasets:
  ▪ Parquet Format
  ▪ Feather Format
  ▪ Pickle Format
  ▪ ORC Format
  ▪ HDF5 Format
➢ One simple way to store (or serialize) data in binary format is using Python's built-in pickle module.
➢ pandas objects all have a to_pickle method that writes the data to disk in pickle format:

    import pandas as pd

    frame = pd.read_csv("Binary_Formats/ex1.txt", sep=" ")

    print(frame)

    frame.to_pickle("examples/frame_pickle")
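A self-contained sketch of the pickle round trip, writing to a temporary directory so it does not depend on the files above. The DataFrame contents are made up for illustration; pandas.read_pickle is the counterpart that loads the file back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data for illustration
frame = pd.DataFrame({
    "a": [1, 5, 9],
    "b": [2, 6, 10],
    "message": ["hello", "world", "foo"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")
    frame.to_pickle(path)            # serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back into memory

# The restored object has the same values, dtypes, and index
print(restored.equals(frame))
```

Loading a pickle can execute arbitrary code, so read_pickle should only be used on files from a trusted source.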
➢ pickle is recommended only as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library.
➢ pandas has built-in support for several other open source binary data formats, such as
  ▪ HDF5
  ▪ ORC
  ▪ Apache Parquet
➢ Parquet is a popular columnar storage format optimized for analytical queries.
➢ Feather is another columnar format designed for fast, lightweight binary serialization of data frames.
➢ Pickle is a Python-specific binary format for serializing and deserializing objects.
➢ Optimized Row Columnar (ORC) format is designed for high-performance reads and writes in big data frameworks like Hadoop. (pip install pyarrow)
➢ HDF5 (Hierarchical Data Format) is designed for storing and managing large amounts of data and is particularly useful for numerical and scientific data. (pip install tables)
  ▪ Supports hierarchical organization.
  ▪ Allows selective reading of data subsets.
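The sketch below shows a Parquet round trip and reads back only some of the columns, which is where a columnar layout helps. It assumes an optional Parquet engine (pyarrow or fastparquet) may or may not be installed, so the hypothetical helper parquet_roundtrip returns None when pandas cannot find one. The data is invented for illustration.

```python
import os
import tempfile

import pandas as pd


def parquet_roundtrip(df, columns=None):
    """Write df to a temporary Parquet file and read it back.

    Returns None when no Parquet engine (pyarrow or fastparquet) is installed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "frame.parquet")
        try:
            df.to_parquet(path)
        except ImportError:
            # pandas raises ImportError when it cannot find a Parquet engine
            return None
        # columns= loads only the named columns from the file
        return pd.read_parquet(path, columns=columns)


# Hypothetical data for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": ["x", "y", "z"]})

subset = parquet_roundtrip(df, columns=["a", "b"])
print(subset)
```

The other formats follow the same pattern of paired methods, for example to_feather and read_feather, or to_hdf and read_hdf, each with its own optional dependency.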

Comparison of Binary Formats

Format    Compression   Speed           Portability             Use Case
Parquet   Yes           Fast            High (cross-language)   Big data, analytics, cross-platform.
Feather   No            Very fast       High (Python, R)        Data exchange, high-speed operations.
Pickle    No            Moderate        Low (Python only)       Python-specific serialization.
ORC       Yes           Fast            High (cross-language)   Big data, Hadoop-related workflows.
HDF5      Optional      Fast (subset)   Moderate                Scientific computing, hierarchical data.
Applications of Binary Format
➢ Binary data formats in pandas have a wide range of applications in data science, machine learning, and software development due to their efficiency in
  ▪ storing
  ▪ transferring
  ▪ processing large datasets

Application areas:
➢ Data Storage and Retrieval
➢ Big Data Analytics
➢ Machine Learning Pipelines
➢ Cloud and Distributed Systems
➢ Real-Time and Stream Processing
➢ Scientific Research and High-Performance Computing
➢ Web and Mobile Applications
➢ Financial and Business Analytics
➢ Cross-Platform Data Exchange
➢ Image and Media Processing
