Pandas: Series and DataFrame Summary
Pandas Definitions and Key Concepts
1. Data Science:
- Field involving data collection, cleaning, standardization, analysis, visualization, and reporting.
2. Data Processing:
- Prepares data through cleaning, merging, and restructuring before analysis.
3. Python Modules and Libraries:
- Libraries contain modules with pre-defined functions.
- Common libraries: NumPy, Pandas, Matplotlib.
4. Pandas:
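- Open-source Python library for data analysis, started by Wes McKinney in 2008.
- Name derived from "panel data", an econometrics term for multidimensional datasets.
- Built on NumPy; Matplotlib is an optional dependency used for plotting.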
5. Key Features of Pandas:
- Handles missing data
- Efficient and flexible
- Tabular data representation
- Supports file formats, reshaping, sorting, and merging
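The missing-data feature listed above can be illustrated with a minimal sketch. The sample values are made up for illustration; isna(), fillna(), and dropna() are standard Series methods.

```python
import numpy as np
import pandas as pd

# A Series with one missing entry (NaN)
s = pd.Series([1.0, np.nan, 3.0])

mask = s.isna()        # True where a value is missing
filled = s.fillna(0)   # replace missing values with 0
dropped = s.dropna()   # drop missing values entirely
total = s.sum()        # aggregations skip NaN by default
```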
6. Pandas vs NumPy:
- Pandas: Labelled tabular data (Series/DataFrame), mixed column types, more memory overhead, slower element access.
- NumPy: Homogeneous numerical arrays, compact memory layout, fast positional indexing.
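As a rough illustration of the contrast above, the sketch below shows that a NumPy array holds a single dtype while a DataFrame keeps a dtype per column. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype; mixed numeric input is upcast
mixed = np.array([1, 2.5])   # stored as float64

# A DataFrame keeps a separate dtype for each column
df = pd.DataFrame({"id": [1, 2], "city": ["Pune", "Delhi"]})
id_is_int = pd.api.types.is_integer_dtype(df["id"])
city_is_numeric = pd.api.types.is_numeric_dtype(df["city"])

# A numeric column can be handed back to NumPy when needed
values = df["id"].to_numpy()
```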
7. Pandas Data Structures:
- Series: 1D labelled array (homogeneous data)
- DataFrame: 2D labelled structure (heterogeneous data)
- Panel: 3D data structure (deprecated and removed from current pandas releases; a DataFrame with a MultiIndex is the usual replacement)
8. Series:
- 1D labelled array, homogeneous data.
- Mutable values, immutable size.
- Created from list, dict, array, scalar.
- Supports indexing (positional and labelled) and slicing.
- Missing values shown as NaN.
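A minimal sketch of the Series creation routes and access styles listed above. All values and labels are illustrative only.

```python
import numpy as np
import pandas as pd

# From a list: default integer index 0..n-1
s_list = pd.Series([10, 20, 30])

# From a dict: keys become the index labels
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a NumPy array with explicit labels
s_arr = pd.Series(np.array([1.5, 2.5, 3.5]), index=["x", "y", "z"])

# From a scalar: the value is repeated for every label
s_scalar = pd.Series(5, index=["p", "q"])

# Labelled and positional access
by_label = s_dict["b"]          # label-based
by_position = s_dict.iloc[1]    # position-based

# Slicing: positional slices exclude the stop, label slices include it
pos_slice = s_dict.iloc[0:2]
label_slice = s_dict.loc["a":"b"]

# Reindexing to a label with no data yields NaN for that label
s_missing = s_dict.reindex(["a", "b", "d"])
```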
9. Series Operations:
- Supports vector and binary operations.
- Operations align on index labels; a label present in only one operand produces NaN.
- Use add(), sub() with fill_value to avoid NaN.
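The alignment behaviour above can be sketched as follows. The two Series and their labels are made up for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Plain operator: labels found in only one Series ("x", "w") give NaN
plain = a + b

# add()/sub() with fill_value treat the missing side as 0 instead
filled = a.add(b, fill_value=0)
diff = a.sub(b, fill_value=0)
```

Here fill_value=0 is only one reasonable choice; the appropriate fill depends on what a missing entry means in the data.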
10. Series Attributes & Methods:
- Inspect the first or last rows with head() and tail()
- Boolean indexing for conditional filtering
- Delete elements with drop() (returns a new Series) or the del statement (modifies in place)
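A short sketch of the inspection, filtering, and deletion points above. The names and marks are hypothetical sample data.

```python
import pandas as pd

marks = pd.Series([55, 72, 38, 91, 64],
                  index=["ann", "bob", "cy", "dee", "eli"])

first_two = marks.head(2)   # first two elements
last_two = marks.tail(2)    # last two elements

# Boolean indexing: keep only the elements where the condition is True
passed = marks[marks >= 60]

# drop() returns a new Series and leaves the original unchanged
without_cy = marks.drop("cy")

# del is a statement, not a method; it removes the label in place
del marks["ann"]
```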
11. DataFrame:
- 2D data structure (rows and columns)
- Three components: data, index (row labels), columns
- Value- and size-mutable, labelled axes, arithmetic along rows or columns
- Created from list, list of lists, dict of lists, dict of series, series, numpy arrays, or another DataFrame
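Several of the creation routes above, sketched with made-up data. Column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# From a list of lists: each inner list is one row
df_lol = pd.DataFrame([[1, "a"], [2, "b"]], columns=["num", "letter"])

# From a dict of lists: each key becomes a column
df_dict = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [21, 23]})

# From a dict of Series: rows are aligned on index labels, gaps become NaN
df_series = pd.DataFrame({
    "maths": pd.Series([90, 80], index=["r1", "r2"]),
    "physics": pd.Series([70], index=["r1"]),
})

# From a 2D NumPy array
df_np = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])

# From another DataFrame (copy=True requests an independent copy)
df_from_df = pd.DataFrame(df_dict, copy=True)

# Size-mutable: a new column can be added in place
df_dict["senior"] = df_dict["age"] > 22

# Column-wise arithmetic
col_sum = df_np["a"] + df_np["b"]
```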