We already have several useful discussions open on different topics. To give the conversations a bit of structure, I propose we start with an initial MVP (minimum viable product) and iterate on it.
This is a draft of the topics that we may want to discuss, and a possible order to discuss them:
- Dataframe class name #17
- Get number of rows and columns #20
- Get and set column names #21
- Selecting/accessing columns (`df[col]`, `df[col1, col2]`), and calling methods in 1 vs N columns
- Filter data
- Indexing, row labels, see Implicit alignment in operations #12
- Sorting data
- Missing data #9
- Constructor and loading/dumping data
- Map operations (abs, isin, clip, str.lower,...)
- Map operations with Python operators (+, *,...)
- Reductions (sum, mean,...) #11
- Aggregating data and window functions
- Joining dataframes
- Reshaping data (pivot, stack, get dummies...)
- Setting data (mutability, adding new columns), see Mutability #10
- Displaying data, visualization, plotting, see API for viewing the frame #15
- Time series operations
- Sparse data
The idea would be to discuss and decide on each topic incrementally, and keep defining an API that can be used end to end (with very limited functionality at the beginning). So, focusing on being able to write code with the API, we should identify, for each topic, the questions that need to be answered to construct the API, and then add the API definition to the RFC based on the agreements.
Below is a very simple example of dataframe usage, followed by the questions that need to be answered to define a basic API for it.
>>> from whatever import dataframe
>>> data = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}
>>> df = dataframe.load(data, format='dict')
>>> df
a b c
-----
1 3 5
2 4 6
>>> len(df)
2
>>> len(df.columns)
3
>>> df.dtypes
[int, int, int]
>>> df.columns
['a', 'b', 'c']
>>> df.columns = 'x', 'y', 'z'
>>> df.columns
['x', 'y', 'z']
>>> df
x y z
-----
1 3 5
2 4 6
>>> df['q'] = [7, 8]
>>> df
x y z q
-------
1 3 5 7
2 4 6 8
>>> df['y']
y
-
3
4
>>> df['z', 'x']
z x
---
5 1
6 2
>>> df.dump(format='dict')
{'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'q': [7, 8]}
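To make the session above concrete, here is a minimal sketch of a toy implementation that would support it. Everything in it is an assumption for discussion purposes: the class name `ToyFrame` is a placeholder (naming is one of the open questions), the storage is a plain dict of lists, and `load`/`dump` only understand `format='dict'`. It is not a proposal for the real implementation.

```python
class ToyFrame:
    """Toy column store: a dict of equal-length lists (hypothetical, for discussion only)."""

    def __init__(self, data):
        lengths = {len(values) for values in data.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self._data = {name: list(values) for name, values in data.items()}

    @property
    def columns(self):
        return list(self._data)

    @columns.setter
    def columns(self, names):
        names = list(names)
        if len(names) != len(self._data) or len(set(names)) != len(names):
            raise ValueError("expected one unique name per existing column")
        self._data = dict(zip(names, self._data.values()))

    @property
    def dtypes(self):
        # Type of the first value in each column; assumes non-empty, homogeneous columns.
        return [type(values[0]) for values in self._data.values()]

    def __len__(self):
        # Number of rows (0 for a frame without columns).
        return len(next(iter(self._data.values()), []))

    def __getitem__(self, key):
        # df['y'] passes a single name; df['z', 'x'] passes a tuple of names.
        names = key if isinstance(key, tuple) else (key,)
        return ToyFrame({name: self._data[name] for name in names})

    def __setitem__(self, name, values):
        values = list(values)
        if self._data and len(values) != len(self):
            raise ValueError("new column must have one value per row")
        self._data[name] = values

    def dump(self, format="dict"):
        if format != "dict":
            raise ValueError(f"unsupported format: {format!r}")
        return {name: list(values) for name, values in self._data.items()}


def load(data, format="dict"):
    """Stand-in for the `dataframe.load` call used in the example."""
    if format != "dict":
        raise ValueError(f"unsupported format: {format!r}")
    return ToyFrame(data)


df = load({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}, format='dict')
df.columns = 'x', 'y', 'z'
df['q'] = [7, 8]
print(df['z', 'x'].dump())  # {'z': [5, 6], 'x': [1, 2]}
```

The sketch omits the tabular display shown in the session, since how to display data is a separate topic in the list above.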
The simpler questions that need to be answered to define this MVP API are:
- Name of the dataframe class. I can think of two main options (feel free to propose more):
  - `DataFrame` or `Dataframe`, to be consistent with Python class capitalization
  - `dataframe`, using Python type capitalization (as in `int`, `bool`, `datetime.datetime`...)
- How to obtain the size of the dataframe?
  - Properties (`num_columns`, `num_rows`)
  - Using Python `len`: `len(df)`, `len(df.columns)`
  - `shape` (it allows for N dimensions, which for a dataframe is not needed, since it's always 2D)
- How to obtain the dtypes (is a `dtypes` property enough?)
- Setting and getting column names
  - Is using a Python property enough?
  - What should be the name? `columns`, `column_names`...
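To compare the size options side by side, the following sketch exposes all three on one hypothetical class. The name `SizedFrame` and the dict-of-lists storage are assumptions made only for illustration; a real API would presumably pick one spelling.

```python
class SizedFrame:
    """Hypothetical 2D container used only to compare the three spellings."""

    def __init__(self, data):
        self._data = {name: list(values) for name, values in data.items()}

    # Option 1: explicit properties.
    @property
    def num_columns(self):
        return len(self._data)

    @property
    def num_rows(self):
        return len(next(iter(self._data.values()), []))

    # Option 2: Python's len() on the frame (rows) and on its column names.
    def __len__(self):
        return self.num_rows

    @property
    def columns(self):
        return list(self._data)

    # Option 3: a shape tuple, always of length 2 for a dataframe.
    @property
    def shape(self):
        return (self.num_rows, self.num_columns)


df = SizedFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
print(df.num_rows, df.num_columns)  # 2 3
print(len(df), len(df.columns))     # 2 3
print(df.shape)                     # (2, 3)
```

One trade-off this makes visible: `len(df)` relies on the reader knowing that it counts rows (as it does in pandas) rather than columns, while the properties are explicit at the cost of two extra names.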
The next two questions are also needed, but they are more complex, so I'll be creating separate issues for them:
- Loading and exporting data
  - Should the dataframe class provide a constructor? If it does, should it support different formats (like pandas)?
  - Should we have different syntax (as in pandas) for loading data from disk (`pandas.read_csv`...) and for loading data from memory (`DataFrame.from_dict`)? Or is a standard way for all loading/exporting preferred?
- How to access and set columns in a dataframe
  - With `__getitem__` directly (`df[col]` / `df[col] = foo`)
  - With `__getitem__` over a property (`df.col[col]` / `df.col[col] = foo`)
  - With methods (`df.get(col)` / `df.set(col=foo)`)
  - Is more than one way needed/preferred?
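As a rough illustration of the three column-access options, the sketch below implements all of them on a single hypothetical class. `AccessFrame` and `_ColumnAccessor` are invented names, and columns are returned as plain lists only to keep the example short.

```python
class _ColumnAccessor:
    """Helper returned by the `col` property (hypothetical)."""

    def __init__(self, frame):
        self._frame = frame

    def __getitem__(self, name):
        return self._frame._data[name]

    def __setitem__(self, name, values):
        self._frame._data[name] = list(values)


class AccessFrame:
    """Hypothetical frame offering the three access styles under discussion."""

    def __init__(self, data):
        self._data = {name: list(values) for name, values in data.items()}

    # Option 1: indexing the frame directly (assignment goes through __setitem__).
    def __getitem__(self, name):
        return self._data[name]

    def __setitem__(self, name, values):
        self._data[name] = list(values)

    # Option 2: indexing a property of the frame.
    @property
    def col(self):
        return _ColumnAccessor(self)

    # Option 3: plain methods.
    def get(self, name):
        return self._data[name]

    def set(self, **columns):
        # Keyword arguments restrict column names to valid Python identifiers.
        for name, values in columns.items():
            self._data[name] = list(values)


df = AccessFrame({'x': [1, 2]})
df['y'] = [3, 4]        # option 1
df.col['z'] = [5, 6]    # option 2
df.set(q=[7, 8])        # option 3
print(df['q'], df.col['y'], df.get('z'))  # [7, 8] [3, 4] [5, 6]
```

Note that with the keyword form `df.set(col=foo)`, a column whose name is not a valid identifier (for example one containing a space) cannot be set without an additional mechanism, which may be relevant to whether more than one way is needed.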