DWV Unit II
Example:
<person>
<name>Alice</name>
<age>30</age>
<city>New York</city>
</person>
Self-Descriptive: XML tags describe the data they contain.
<person id="123">
<name>John</name>
<age>25</age>
</person>
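Because the tags and attributes name the data they hold, a program can pull values out by name. The following is a minimal sketch using Python's standard xml.etree.ElementTree module (chosen here only for illustration; the slides use lxml later on). The variable names are assumptions, not part of the original example.

```python
import xml.etree.ElementTree as ET

xml_text = """
<person id="123">
    <name>John</name>
    <age>25</age>
</person>
"""

# Parse the document from a string; root is the <person> element
root = ET.fromstring(xml_text)

# Attributes are stored in a dict-like mapping on the element
person_id = root.attrib["id"]

# Child elements are looked up by tag name; their text is always a string
name = root.find("name").text
age = int(root.find("age").text)

print(root.tag, person_id, name, age)
```

Note that the attribute value and the element text both arrive as strings, so numeric fields such as age have to be converted explicitly.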
Comments
XML and HTML are structurally similar, but XML is more general.
Parsing XML with lxml.objectify
Using lxml.objectify, we parse the file and get a reference to
the root node of the XML file with getroot.
import pandas as pd
from lxml import objectify

path = "datasets/mta_perf/Performance_MNR.xml"
with open(path) as f:
    parsed = objectify.parse(f)
root = parsed.getroot()

data = []
# Fields to leave out of the resulting table
skip_fields = ["PARENT_SEQ", "INDICATOR_SEQ",
               "DESIRED_CHANGE", "DECIMAL_PLACES"]
# root.INDICATOR iterates over every <INDICATOR> element
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        # pyval is the element text converted to a Python value
        el_data[child.tag] = child.pyval
    data.append(el_data)

perf = pd.DataFrame(data)
perf.head()
Parsing XML with pandas.read_xml
pandas's pandas.read_xml function turns this process into a
one-line expression:
path = "datasets/mta_perf/Performance_MNR.xml"
perf2 = pd.read_xml(path)
print(perf2.head())
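To try pandas.read_xml without the MTA dataset, the sketch below parses a small in-memory document. The sample records and variable names are hypothetical. It passes parser="etree" so that the standard library parser is used instead of lxml, which is an assumption made to keep the example dependency-light.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: a root element holding repeated records
xml_text = """<people>
  <person>
    <name>Alice</name>
    <age>30</age>
    <city>New York</city>
  </person>
  <person>
    <name>John</name>
    <age>25</age>
    <city>Boston</city>
  </person>
</people>
"""

# Each child of the root becomes a row; each grandchild tag becomes a column.
# parser="etree" uses the standard library parser, so lxml is not required.
people = pd.read_xml(StringIO(xml_text), parser="etree")
print(people)
```

Wrapping the text in StringIO gives read_xml a file-like object, which newer pandas versions prefer over a raw XML string.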
Reading an HTML file in pandas
To read an HTML file in pandas, we use the pandas.read_html()
function.
This function parses the HTML tables in a document and returns
them as a list of DataFrames, one per table.
Basic Syntax
import pandas as pd
df_list = pd.read_html("file.html")
Basic Syntax (reading from a URL)
url = "https://example.com/tables"
df_list = pd.read_html(url)
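Basic Syntax (writing a DataFrame to an HTML file)
df = df_list[0]  # read_html returns a list; take the first table
df.to_html("output.html", index=False)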
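The two directions can be checked together with a round trip: write a DataFrame as an HTML table, then read it back. This is a sketch under stated assumptions: the sample data is hypothetical, and pandas.read_html needs an HTML parser installed (lxml, or BeautifulSoup with html5lib).

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"name": ["Alice", "John"], "age": [30, 25]})

# With no path argument, to_html returns the markup as a string
html = df.to_html(index=False)

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
first = tables[0]

print(len(tables))
print(first)
```

Because the result is always a list, even a page with a single table has to be indexed (tables[0]) before it can be used as a DataFrame.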
Can we read text from a web page using pandas?
pandas can read and process structured text (like tables) from a
web page; free-form text calls for HTML parsers and scrapers.
Parquet Format
Feather Format
Pickle Format
ORC Format
HDF5 Format
Binary Data Formats
One simple way to store (or serialize) data in binary format is
using Python’s built-in pickle module.
pandas objects all have a to_pickle method that writes the data
to disk in pickle format:
frame = pd.read_csv("Binary_Formats/ex1.txt", sep=" ")
print(frame)
frame.to_pickle("examples/frame_pickle")
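A file written with to_pickle can be loaded again with pandas.read_pickle. The sketch below shows the round trip. The sample DataFrame stands in for the contents of ex1.txt and is hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the contents of ex1.txt
frame = pd.DataFrame({
    "a": [1, 5, 9],
    "b": [2, 6, 10],
    "message": ["hello", "world", "foo"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "frame_pickle")
    # Serialize the DataFrame to disk in pickle format
    frame.to_pickle(pickle_path)
    # Load it back; values, dtypes, and index are restored
    restored = pd.read_pickle(pickle_path)

print(restored.equals(frame))
```

Pickle files can execute arbitrary code when loaded, so read_pickle should only be used on files from a trusted source.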
pickle is recommended only as a short-term storage format. The
problem is that it is hard to guarantee that the format will be
stable over time; an object pickled today may not unpickle with a
later version of a library.
Pandas has built-in support for several other open source binary
data formats, such as
HDF5
ORC
Apache Parquet
Parquet is a popular columnar storage format optimized for
analytical queries.
deserializing objects.
Optimized Row Columnar (ORC) format is designed for high-
Comparison of Binary Formats
Parquet   Yes        Fast            High (Cross-language)   Big data, analytics, cross-platform.
HDF5      Optional   Fast (Subset)   Moderate                Scientific computing, hierarchical data.
...       ...        ...             ...                     Data exchange, ... operations.
Applications of Binary Format
Binary data formats in pandas have a wide range of applications in
data science, machine learning, and software development due to
their efficiency in
storing
transferring