DWV Unit Ii

The document provides an overview of XML (Extensible Markup Language) as a text-based format for storing and transporting structured data, highlighting its key features such as self-descriptiveness, extensibility, and platform independence. It discusses various applications of XML, including data storage, configuration files, document representation, and communication between systems, as well as its role in web development and multimedia. Additionally, the document covers binary data formats supported by pandas, emphasizing their efficiency for handling large datasets and their applications in data science and machine learning.

Uploaded by

Aavula Ravi

Data Loading, Storage, and File Formats: XML and HTML


What is an XML file?
 An XML (Extensible Markup Language) file is a text-based file format used to store and transport structured data.
 It is widely used for data exchange between systems, configuration files, and document storage.
 XML is both human-readable and machine-readable.


Key Features of XML:
 Structured Format
     XML uses tags to define data in a hierarchical structure, similar to HTML but with user-defined tags.
     Example:
<person>
    <name>Alice</name>
    <age>30</age>
    <city>New York</city>
</person>
 Self-Descriptive: XML tags describe the data they contain, making the file self-explanatory.
 Extensibility: Unlike HTML, XML doesn't have predefined tags. You can define your own tags as needed.
 Platform Independent: XML files can be used across different programming languages and platforms.
 Supports Hierarchical Data: XML is ideal for representing nested data, such as records within records.
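
As a minimal sketch of extensibility and nesting, the person record from the example can be built with Python's standard xml.etree.ElementTree module. The address, street, and zip tags are made-up names added here for illustration; they are not part of the original example.

```python
import xml.etree.ElementTree as ET

# Build the <person> record from the example using our own tag names.
person = ET.Element("person")
ET.SubElement(person, "name").text = "Alice"
ET.SubElement(person, "age").text = "30"
ET.SubElement(person, "city").text = "New York"

# Nested data: an <address> element that holds its own child elements.
# (<address>, <street> and <zip> are hypothetical tags for illustration.)
address = ET.SubElement(person, "address")
ET.SubElement(address, "street").text = "5th Avenue"
ET.SubElement(address, "zip").text = "10001"

# Serialize the tree back to text.
xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)
```

Because no tag set is predefined, adding the nested address record needed no change to any schema or parser.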


Structure of an XML File
 XML Declaration (Optional)
     Specifies version and encoding.
<?xml version="1.0" encoding="UTF-8"?>
 Root Element
     Every XML file must have a single root element that contains all other elements.
<root> ... </root>
 Elements
     Data is stored within elements enclosed by opening and closing tags.
<tag> value </tag>
 Attributes
     Elements can have attributes to store additional metadata.
<person id="123">
    <name>John</name>
    <age>25</age>
</person>
 Comments
     Comments can be included for readability.
<!-- This is a comment -->
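
The pieces listed above can be seen together in one small document. The sketch below assumes the standard xml.etree.ElementTree module and combines the slide's fragments into a single made-up document; it reads the root element, an attribute, and element text.

```python
import xml.etree.ElementTree as ET

# Declaration, root element, comment, attribute and child elements together.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
    <!-- This is a comment -->
    <person id="123">
        <name>John</name>
        <age>25</age>
    </person>
</root>"""

root = ET.fromstring(xml_bytes)      # the single root element
person = root.find("person")         # first <person> child of the root
person_id = person.get("id")         # attribute values are always strings
name = person.find("name").text      # text stored inside an element
age = int(person.find("age").text)   # text is a string; convert explicitly

print(root.tag, person_id, name, age)
```

The default parser skips comments, so the root here has exactly one child element.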


Applications of XML files
 Data Storage and Exchange
     XML is widely used to store and exchange data between applications, systems, or platforms.
     Examples:
         Web Services: XML is the foundation for SOAP-based web services.
         API Communication: Systems use XML to format data exchanged via APIs.
         Industry Standards: Formats like RSS and Atom for feeds are XML-based.
 Configuration Files
     Many software applications use XML files to store configuration settings.
     Examples:
         Java Applications: web.xml in Java EE for web application configuration.
         Build Tools: pom.xml in Maven to define dependencies and build lifecycle.
         Software Configuration: Settings in games or enterprise software.
 Document Representation
     XML is used as a markup language for document creation, formatting, and storage.
     Examples:
         Microsoft Office: Uses Office Open XML formats (e.g., .docx, .xlsx).
         DocBook: An XML standard for writing structured documents like technical manuals.
         eBooks: Formats like EPUB are XML-based.


 Web Development
     XML plays a role in web technologies for data representation and configuration.
     Examples:
         RSS and Atom Feeds: For delivering news updates or content syndication.
         Sitemaps: Websites use XML sitemaps to inform search engines about their page structure.
         XHTML: An XML-compliant version of HTML.
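
To make the sitemap item concrete, here is a small sketch that reads page URLs out of a sitemap with the standard library. The URLs are made up for illustration; the namespace is the one used by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# A tiny sitemap with made-up URLs; real sitemaps use this same namespace.
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)

# Namespaced tags must be looked up through the prefix map.
locations = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(locations)
```

Searching for a plain "url" tag would find nothing here, because every element in the document belongs to the sitemap namespace.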


 Multimedia and Graphics
     XML is used to define and store multimedia and graphics-related data.
     Examples:
         SVG (Scalable Vector Graphics): XML-based format for vector graphics.
         X3D: For 3D graphics representation.
         MusicXML: An XML-based format for storing musical notation (MIDI files themselves are binary, not XML).
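
For the SVG item in the list above, the following sketch generates a minimal SVG image as XML text using the standard library. The circle's size and color are arbitrary values chosen for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize SVG as the default namespace

# The root <svg> element with one <circle> child, both in the SVG namespace.
svg = ET.Element(f"{{{SVG_NS}}}svg", {"width": "100", "height": "100"})
ET.SubElement(
    svg,
    f"{{{SVG_NS}}}circle",
    {"cx": "50", "cy": "50", "r": "40", "fill": "steelblue"},
)

svg_text = ET.tostring(svg, encoding="unicode")
print(svg_text)
```

Saving svg_text to a file with an .svg extension should give an image that a browser can display.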
 Industry-Specific Applications
     XML is tailored for specific industries with standardized schemas.
     Examples:
         Healthcare: HL7 (Health Level Seven) standards for sharing medical data.
         Finance: FpML (Financial products Markup Language) for derivatives and other financial instruments.
         Publishing: DITA (Darwin Information Typing Architecture) for creating and managing content.
 Communication Between Systems
     XML facilitates cross-platform and cross-language communication.
     Examples:
         Enterprise Applications: Integration of disparate systems using XML-based messages.
         Messaging: XML-based protocols like XMPP for instant messaging and presence information.
 Storage and Transfer of Metadata
     XML is used to store metadata about various resources.
     Examples:
         Digital Libraries: Metadata encoding in MODS (Metadata Object Description Schema).
         Media Files: Metadata for audio, video, and images (e.g., XMP, which embeds XML metadata in photos; EXIF itself is a binary format).
 Validation and Standardization
     XML documents can be validated against the structure and rules defined in a DTD (Document Type Definition) or an XSD (XML Schema Definition).
     Examples:
         Input Validation: Ensuring data conforms to a predefined schema.
         Data Standards: Maintaining consistency in file structures across organizations.
 Communication in IoT (Internet of Things)
     XML is used to format data exchanged between IoT devices and applications.
     Examples:
         Device Configurations: Using XML to store and exchange IoT device settings.
         Sensor Data: Sending structured data from devices to servers.
 Gaming and Simulations
     XML is used in gaming engines and simulations for defining game objects and configurations.
     Examples:
         Unity: Scripts can read and write XML for settings or save data (Unity's own asset files are stored as YAML).
         Game Configurations: Storing level data, character stats, or scripts.
 Database Applications
     XML is used to store and query data in native XML databases or as an intermediate format.
     Examples:
         Oracle: Supports XMLType for storing and querying XML data.
         SQL Server: Provides an xml data type that can be queried with XQuery and XPath expressions.


Parsing XML with lxml.objectify
 XML is a common structured data format supporting hierarchical, nested data with metadata.
 The pandas.read_html function uses either lxml or Beautiful Soup under the hood to parse data from HTML.
 XML and HTML are structurally similar, but XML is more general.
 Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot.

from lxml import objectify
import pandas as pd

path = "datasets/mta_perf/Performance_MNR.xml"

with open(path) as f:
    parsed = objectify.parse(f)

root = parsed.getroot()

data = []
skip_fields = ["PARENT_SEQ", "INDICATOR_SEQ",
               "DESIRED_CHANGE", "DECIMAL_PLACES"]

# Each <INDICATOR> element becomes one row: a dict of tag name -> value
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

perf = pd.DataFrame(data)
perf.head()
 The pandas.read_xml function turns this process into a one-line expression:

import pandas as pd

path = "datasets/mta_perf/Performance_MNR.xml"

perf2 = pd.read_xml(path)

print(perf2.head())
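
Since the MTA dataset file may not be at hand, here is a self-contained sketch of pandas.read_xml on a small in-memory document. The XML content is made up, and parser="etree" is chosen here only so that the example runs without lxml installed; the slide's own call relies on the default parser.

```python
from io import StringIO

import pandas as pd

# A made-up document: each child of the root becomes one row.
xml_text = """<data>
  <row><name>Alice</name><age>30</age></row>
  <row><name>John</name><age>25</age></row>
</data>"""

# parser="etree" uses the standard library instead of lxml.
df = pd.read_xml(StringIO(xml_text), parser="etree")
print(df)
```

Each child tag of a row (name, age) becomes a column, and numeric text is converted to a numeric dtype.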
Reading an HTML file in pandas
 To read an HTML file in pandas, we use the pandas.read_html() function.
 This function parses the HTML tables in a document and converts each one into a DataFrame.

Basic Syntax

import pandas as pd

df_list = pd.read_html("file.html")

 file.html: Path to the HTML file (can also be a URL).
 Returns: A list of DataFrames, where each DataFrame corresponds to a table found in the HTML.
Reading HTML from a URL
 We can read tables directly from a webpage URL.

Basic Syntax

url = "https://example.com/tables"

df_list = pd.read_html(url)

Writing a DataFrame to HTML

 We can write a DataFrame to an HTML file by using the pandas.DataFrame.to_html() function.

Basic Syntax

df.to_html("output.html", index=False)
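
As a small sketch of what to_html produces, the call below omits the file path so that the markup is returned as a string instead of being written to disk. The DataFrame contents are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "John"], "age": [30, 25]})

# With no file path, to_html returns the markup as a string.
html = df.to_html(index=False)
print(html)
```

Passing index=False leaves the row labels out of the generated table, which is usually what a web page needs.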
Can we read text from a web page using pandas?
 pandas can read and process structured text (like tables) from a webpage using the pandas.read_html() function, which extracts HTML tables directly from the page.
 However, reading free-form text that isn’t in table format (unstructured or non-tabular text from a webpage) requires additional libraries: requests to download the page and BeautifulSoup (from the bs4 package) to parse it, with pandas then used to hold the result.


Extracting Plain Text from Entire Web Page
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://example.com"  # placeholder; replace with the page to read

response = requests.get(url)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.content, "html.parser")
text = soup.get_text()

# Split the text into lines and remove empty ones
lines = [line.strip() for line in text.split("\n") if line.strip()]
df = pd.DataFrame(lines, columns=["Text"])
print(df)
Web scraping
 Web scraping with Python is a technique for extracting and storing large amounts of data from websites using Python programs.
 Web scraping is useful when there is no direct way to download data from a website, or when there is no API (application programming interface) available.
 The data extracted can be used for a variety of purposes, such as data analysis, market research, or competitive intelligence.
 Python is a popular choice for web scraping because of its ease of use, large library collection, and understandable syntax.
Tools and libraries that can be used for web scraping with Python include:
 Beautiful Soup: A library for parsing HTML and XML and pulling data out of the parsed tree, often used when building scrapers
 Requests: A library that can be used to get HTML data from a web page
 lxml: A third-party library that can be used to work with XML and HTML
 XPath: A query language that uses path expressions to navigate and extract data from HTML or XML documents.
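
To show what XPath-style path expressions look like, the sketch below uses the limited XPath subset supported by the standard xml.etree.ElementTree module (lxml supports the full language). The catalog document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up catalog used only to demonstrate path expressions.
catalog = ET.fromstring("""<catalog>
  <book id="b1"><title>XML Basics</title><price>20</price></book>
  <book id="b2"><title>Pandas Guide</title><price>35</price></book>
</catalog>""")

# ".//title" selects every <title> element at any depth.
titles = [t.text for t in catalog.findall(".//title")]

# "[@id='b2']" is a predicate that filters on an attribute value.
price_b2 = catalog.find("book[@id='b2']/price").text

print(titles, price_b2)
```

The same expressions carry over to scraped HTML once the page has been parsed into a tree.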


Binary Data Formats
 pandas provides robust support for binary data formats, which are typically more compact on disk and faster to read and write than text-based formats.
 These formats are particularly useful when working with large datasets.
     Parquet Format
     Feather Format
     Pickle Format
     ORC Format
     HDF5 Format
 One simple way to store (or serialize) data in binary format is using Python’s built-in pickle module.
 pandas objects all have a to_pickle method that writes the data to disk in pickle format:

import pandas as pd

frame = pd.read_csv("Binary_Formats/ex1.txt", sep=" ")

print(frame)

frame.to_pickle("examples/frame_pickle")
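
A pickled frame is read back with pandas.read_pickle. The sketch below is a self-contained round trip; it uses a made-up DataFrame and a temporary directory instead of the slide's example paths so that it runs anywhere.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"a": [1, 5, 9], "b": [2, 6, 10]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")
    frame.to_pickle(path)             # serialize to disk
    restored = pd.read_pickle(path)   # load it back

# True when values, dtypes and labels all match the original.
print(restored.equals(frame))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.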
 pickle is recommended only as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library.
 pandas has built-in support for several other open source binary data formats, such as
     HDF5
     ORC
     Apache Parquet
 Parquet is a popular columnar storage format optimized for analytical queries.
 Feather is another columnar format designed for fast, lightweight binary serialization of data frames.
 Pickle is a Python-specific binary format for serializing and deserializing objects.
 Optimized Row Columnar (ORC) format is designed for high-performance reads and writes in big data frameworks like Hadoop. (pip install pyarrow)
 HDF5 (Hierarchical Data Format) is designed for storing and managing large amounts of data, particularly useful for numerical and scientific data. (pip install tables)
     Supports hierarchical organization.
     Allows selective reading of data subsets.
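
Selective reading is not limited to HDF5. As a sketch for the Parquet format described above, the example below writes a made-up DataFrame and then loads only two of its columns. It assumes the optional pyarrow package and skips the round trip when pyarrow is not installed.

```python
import importlib.util
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2, 3], "score": [0.5, 0.7, 0.9], "label": ["a", "b", "c"]}
)

subset = None
if importlib.util.find_spec("pyarrow") is not None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.parquet")
        df.to_parquet(path, engine="pyarrow", index=False)
        # Columnar storage lets us load only the columns we need.
        subset = pd.read_parquet(path, engine="pyarrow", columns=["id", "score"])
    print(subset)
else:
    print("pyarrow is not installed; skipping the Parquet round trip")
```

Reading a subset of columns avoids loading the rest of the file, which matters most for wide tables.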


Comparison of Binary Formats

Format    Compression   Speed            Portability             Use Case
Parquet   Yes           Fast             High (cross-language)   Big data, analytics, cross-platform
Feather   Optional      Very fast        High (Python, R)        Data exchange, high-speed operations
Pickle    Optional      Moderate         Low (Python only)       Python-specific serialization
ORC       Yes           Fast             High (cross-language)   Big data, Hadoop-related workflows
HDF5      Optional      Fast (subsets)   Moderate                Scientific computing, hierarchical data
Applications of Binary Format
 Binary data formats in pandas have a wide range of applications in data science, machine learning, and software development due to their efficiency in
     storing
     transferring
     processing large datasets
 Typical application areas include:
     Data Storage and Retrieval
     Big Data Analytics
     Machine Learning Pipelines
     Cloud and Distributed Systems
     Real-Time and Stream Processing
     Scientific Research and High-Performance Computing
     Web and Mobile Applications
     Financial and Business Analytics
     Cross-Platform Data Exchange
     Image and Media Processing
