8000 API: read_excel signature by chris-b1 · Pull Request #11198 · pandas-dev/pandas · GitHub
[go: up one dir, main page]

Skip to content

API: read_excel signature #11198

8000
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

chris-b1
Copy link
Contributor

Replace the **kwds in read_excel with the actual list of supported keyword args. This doesn't
change any functionality, just nicer for interactive use. Also a bit of clarification on the thousands
arg in the docstring.

Additionally, chunksize was a parameter in the ExcelFile.parse signature, but didn't do anything (xref #8011). I removed this and raise NotImplementedError if passed, which is potentially breaking.

@@ -156,11 +159,18 @@ def get_writer(engine_name):
Acceptable values are None or xlrd"""

@Appender(excel_doc_common % read_excel_kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still need the excel_doc_common?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - ExcelFile.parse and read_excel both use it.

Just thinking, maybe the api should be changed to match the to_excel semantics where an ExcelFile object could be passed to read_excel (and deprecate ExcelFile.parse).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, read_excel should also be able to take an ExcelFile object. Go ahead and add that on.
Yeah .parse was existing before read_excel. go ahead and deprecate that as well (but obviously don't change the use for now). Point people to read_excel. You'll need to change the docs as well.

@jreback jreback added API Design IO Excel read_excel, to_excel labels Sep 27, 2015
@jreback jreback added this to the 0.17.0 milestone Sep 27, 2015
@chris-b1
Copy link
Contributor Author

@jreback - alright, I deprecated ExcelFile.parse, modified a bunch of tests to no longer use that.

I also reordered the Excel docs to fit the new note; I think it reads more logically now.

@@ -989,6 +991,7 @@ Deprecations
- ``Series.is_time_series`` deprecated in favor of ``Series.index.is_all_dates`` (:issue:`11135`)
- Legacy offsets (like ``'A@JAN'``) listed in :ref:`here <timeseries.legacyaliases>` are deprecated (note that this has been alias since 0.8.0), (:issue:`10878`)
- ``WidePanel`` deprecated in favor of ``Panel``, ``LongPanel`` in favor of ``DataFrame`` (note these have been aliases since < 0.11.0), (:issue:`10892`)
- ``ExcelFile.parse`` has deprecated in favor ``read_excel(ExcelFile)`` (:issue:`11198`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is deprecated; in favor of

@chris-b1
Copy link
Contributor Author

@jreback - made the changes you noted.

@jreback
Copy link
Contributor
jreback commented Sep 27, 2015

looks good. ping when green.

@jorisvandenbossche
Copy link
Member

Is there a good reason to deprecate ExcelFile.parse?
The docs gives an explicit use for it compared to read_excel: http://pandas.pydata.org/pandas-docs/stable/io.html#reading-excel-files

@chris-b1
Copy link
Contributor Author

@jorisvandenbossche - there's no super-compelling reason; the main idea was to match up with api of to_excel, i.e. the "ExcelFileWrapper" (ExcelFile, ExcelWriter) doesn't have any pandas-specific functionality, instead you pass it into the io functions (read_excel, to_excel).

I did update the docs to cover that specific example. edit: although it may be hard to see in the diff - rendered below.

image

@jorisvandenbossche
Copy link
Member

OK, I see that you changed that explanation in the docs I was pointing to. But, what is the point of ExcelFile then? What advantage does it give above just providing the string name?

@jreback
Copy link
Contributor
jreback commented Sep 27, 2015

@jorisvandenbossche

I would say there isn't any usecase for exposing ExcelFile in the current impl. Before read_excel, sure it was the way you passed things around. But it is not necessary anymore, unless I am missing something.

So should deprecate this as well.

I could see a use as an object holding the iterator if we do support chunksizing (e.g. kind of like the TableIterator/TextReader classes). But these are actually internal and not exposed (except thru the iteration itself).

@jorisvandenbossche
Copy link
Member

What I mean is:

xls = pd.ExcelFile('path_to_file.xls')
data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None, na_values=['NA'])
data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=1)

In the above, the ExcelFile is rather superfluous, and does not seem to add any value (you can just pass the string to both, which yields the same but is shorter). So if we want to deprecate parse, I would rather deprecate ExcelFile itself.

One thing that is possible with Excelfile is to inspect the sheet names. So that is a reason to keep it I think. But if we keep it, I don't really see a reason to deprecate its parse method. It's just like you also have both read_hdf as HDFStore (but I know, HDFStore< 8000 /code> has more extra functionality).

@chris-b1
Copy link
Contributor Author

So right now ExcelFile does two useful things - first it exposes sheet_names (list of sheets in that file) which can be handy for interactive.

Second, because xlrd loads the whole worbook into memory (as far as I can tell) - there can be a performance benefit.

In [12]: df = pd.DataFrame({'a': np.arange(10000),
    ...:                    'b': np.arange(10000)})

In [13]: with pd.ExcelWriter('temp.xlsx') as f:
    ...:     df.to_excel(f, 'Sheet1')
    ...:     df.to_excel(f, 'Sheet2')

In [14]: %%time
    ...: with pd.ExcelFile('temp.xlsx') as f:
    ...:     pd.read_excel(f, 'Sheet1')
    ...:     pd.read_excel(f, 'Sheet2')
Wall time: 862 ms


In [16]: %%time
    ...: pd.read_excel('temp.xlsx', 'Sheet1')
    ...: pd.read_excel('temp.xlsx', 'Sheet2')
Wall time: 1.55 s

@jreback
Copy link
Contributor
jreback commented Sep 27, 2015

@chris-b1 oh, that is good 2 know (maybe add a note about that in the docs).

Ok, I do see some utility in ExcelFile. but .parse seems kind of 'internal' to me. Its superfluous. So we can deprecate to remove from the public API (though it doesn't hurt anything).

@chris-b1
Copy link
Contributor Author

@jreback - expanded the ExcelFile docs to try and better clarify the purpose and mention the performance consideration.

``read_excel`` can read more than one sheet, by setting ``sheetname`` to either
a list of sheet names, a list of sheet positions, or ``None`` to read all sheets.
# also can be used as a context manager
with pd.ExcelFile('path_to_file.xls')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing as xlsx:?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, good catch

@jorisvandenbossche
Copy link
Member

I am still not fully convinced that deprecating parse is needed. OK, it does not have much added value, but given this already exists such a long time (and is explicitly included in the api.rst), it will only annoy users by removing it. And it is also not very costly to keep, since the one just calls the other.

This is more subjective, but, when having such a ExcelFile object, it feels a bit more natural to do ExcelFile.parse than passing it to a function like pd.read_excel(ExcelFile) (although ExcelFile.read have been better).

And for clarity, big +1 on the signature and doc changes!

@jreback
Copy link
Contributor
jreback commented Sep 28, 2015

ok, @chris-b1 why don't you back out the deprecation of .parse. I don't think its a big deal.

@chris-b1
Copy link
Contributor Author

Sure, not a problem. Latest changes backs out the deprecation - I left the testing changes in (primarily using read_excel over ExcelFile) although did add back a couple calls to ExcelFile.parse.


``read_excel`` can read more than one sheet, by setting ``sheetname`` to either
a list of sheet names, a list of sheet positions, or ``None`` to read all sheets.
# ExcelFile.parse commamnd which is equivalent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would take this out as its duplicative of the above

@jreback
Copy link
Contributor
jreback commented Sep 28, 2015

some minor doc changes

@chris-b1 chris-b1 force-pushed the read-excel-sig branch 2 times, most recently from 5f6e934 to 9763cbd Compare September 28, 2015 22:56
@chris-b1
Copy link
Contributor Author

@jreback - made those doc changes. The travis failure seems to be unrelated? Has to do with HTML formatting

@jreback
Copy link
Contributor
jreback commented Sep 29, 2015

yeh that failure happens once in a while, no idea why

@jorisvandenbossche
Copy link
Member

@chris-b1 I was just thinking: to make it more clear in the docs that ExcelFile.parse is superfluous, we could eg also limit its docstring with a reference to read_excel (and then the template is not needed)

@chris-b1
Copy link
Contributor Author

@jorisvandenbossche - sure makes sense to me, see update.

@jreback
Copy link
Contributor
jreback commented Sep 30, 2015

merged via 0d39ca1

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0