Dataframe MVP

We already have several useful discussions open on different topics. To give the conversations some structure, I propose we start with an initial MVP (minimum viable product) and iterate on it.

This is a draft of the topics that we may want to discuss, and a possible order in which to discuss them:

Dataframe class name (#17)
Get number of rows and columns (#20)
Get and set column names (#21)
Selecting/accessing columns (df[col], df[col1, col2]), and calling methods on 1 vs N columns
Filter data
Indexing, row labels (#12, Implicit alignment in operations)
Sorting data
Missing data (#9)
Constructor and loading/dumping data
Map operations (abs, isin, clip, str.lower,...)
Map operations with Python operators (+, *,...)
Reductions (sum, mean,...) (#11)
Aggregating data and window functions
Joining dataframes
Reshaping data (pivot, stack, get dummies...)
Setting data (mutability, adding new columns) (#10, Mutability)
Displaying data, visualization, plotting (#15, API for viewing the frame)
Time series operations
Sparse data

The idea is to discuss and decide on each topic incrementally, while defining an API that can be used end to end (with very limited functionality at first). With the focus on being able to write code against the API, we should identify, for each topic, the questions that must be answered to construct the API, and then add the API definition to the RFC based on what is agreed.

Below is a very simple example of dataframe usage, followed by the questions that need to be answered to define a basic API for it.

>>> from whatever import dataframe

>>> data = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}

>>> df = dataframe.load(data, format='dict')
>>> df
a b c
-----
1 3 5
2 4 6

>>> len(df)
2
>>> len(df.columns)
3

>>> df.dtypes
[int, int, int]

>>> df.columns
['a', 'b', 'c']

>>> df.columns = 'x', 'y', 'z'
>>> df.columns
['x', 'y', 'z']

>>> df
x y z
-----
1 3 5
2 4 6

>>> df['q'] = [7, 8]
>>> df
x y z q
-------
1 3 5 7
2 4 6 8

>>> df['y']
y
-
3
4

>>> df['z', 'x']
z x
---
5 1
6 2

>>> df.dump(format='dict')
{'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'q': [7, 8]}

The simpler questions that need to be answered to define this MVP API are:

Name of the dataframe class. I can think of two main options (feel free to propose more):
- DataFrame or Dataframe, to be consistent with Python class capitalization
- dataframe, using Python type capitalization (as in int, bool, datetime.datetime...)
How to obtain the size of the dataframe?
- Properties (num_columns, num_rows)
- Using Python len: len(df), len(df.columns)
- shape (it allows for N dimensions, which for a dataframe is not needed, since it's always 2D)
How to obtain the dtypes (is a dtypes property enough?)
Setting and getting column names
- Is using a Python property enough?
- What should be the name? columns, column_names...

The next two questions are also needed, but they are more complex, so I'll be creating separate issues for them:

Loading and exporting data
- Should the dataframe class provide a constructor? If it does, should it support different formats (like pandas)?
- Should we have different syntax (as in pandas) for loading data from disk (pandas.read_csv...) and for loading data from memory (DataFrame.from_dict)? Or is a single standard way for all loading/exporting preferred?
How to access and set columns in a dataframe
- With __getitem__ directly (df[col] / df[col] = foo)
- With __getitem__ over a property (df.col[col] / df.col[col] = foo)
- With methods (df.get(col) / df.set(col=foo))
- Is more than one way needed/preferred?
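
The three access styles listed above can coexist on one object, which may help when judging whether more than one is worth having. The following is a rough sketch under assumptions: the DataFrame and ColumnAccessor classes, the dict-of-lists storage, and the decision that selecting columns returns a new dataframe are all hypothetical choices made for the example.

```python
class ColumnAccessor:
    """Hypothetical helper behind the df.col[...] style."""

    def __init__(self, frame):
        self._frame = frame

    def __getitem__(self, key):
        return self._frame[key]

    def __setitem__(self, name, values):
        self._frame[name] = values


class DataFrame:
    def __init__(self, data):
        self._data = {name: list(values) for name, values in data.items()}

    # Style: methods, df.get(col) / df.set(col=foo)
    def get(self, *names):
        # Assumption: selection returns a new dataframe, even for one column
        return DataFrame({name: self._data[name] for name in names})

    def set(self, **columns):
        for name, values in columns.items():
            values = list(values)
            if self._data and len(values) != len(next(iter(self._data.values()))):
                raise ValueError("new column must match the number of rows")
            self._data[name] = values

    # Style: __getitem__ directly, df[col] / df[col1, col2] / df[col] = foo
    def __getitem__(self, key):
        # df['z', 'x'] arrives here as the tuple ('z', 'x')
        names = key if isinstance(key, tuple) else (key,)
        return self.get(*names)

    def __setitem__(self, name, values):
        self.set(**{name: values})

    # Style: __getitem__ over a property, df.col[col] / df.col[col] = foo
    @property
    def col(self):
        return ColumnAccessor(self)

    def dump(self, format='dict'):
        if format != 'dict':
            raise ValueError("only format='dict' is sketched here")
        return {name: list(values) for name, values in self._data.items()}


df = DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})
df['q'] = [7, 8]        # __setitem__ directly
df.col['r'] = [9, 10]   # __setitem__ over a property
df.set(s=[11, 12])      # method with keyword arguments
```

One thing the sketch makes visible: the keyword form df.set(col=foo) can only be written literally for column names that are valid Python identifiers, while the two __getitem__ styles accept any hashable name.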
