ENH: ignoring comment lines and empty lines in CSV files by mdmueller · Pull Request #7470 · pandas-dev/pandas · GitHub

ENH: ignoring comment lines and empty lines in CSV files #7470

Closed
Squashed commit of the following:
commit 0e9d792fc9d5159179efd810a1092671dbbef3b1
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Sep 17 14:49:31 2014 -0400

    Added warnings about API changes

commit 06472c21000b489841cc8e486ceddf05fd87a1c5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 22:36:06 2014 -0400

    Changed parameter name to skip_blank_lines

commit afd3be30b4afcae0d9bc6278237aab6a4c9e7eb8
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 21:50:08 2014 -0400

    Minor doc changes

commit b47876e074f5f683a9a51768e480e24d9d3249ab
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 19:26:22 2014 -0400

    Extended blank line skipping to custom line terminated/whitespace delimited reading

commit 3f4a20a831b1bc0ca29779b315dc72d78ad2301e
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 11:35:17 2014 -0400

    Changed around io docs section

commit 223e17ecdcbe377cc69fd962221e03412f5e54d3
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Tue Sep 9 23:13:37 2014 -0400

    Turned empty line skipping into a keyword parameter feature

commit dcd31ca6bd0849eab87ea1c3c5441c8630ca3a35
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Sep 3 21:35:09 2014 -0400

    Squashed commit of the following:

    commit 9aea77954681c2f7d1336d94366221222d186c2b
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Tue Aug 26 22:43:21 2014 -0400

        Fixed header/skiprows combination issue

    commit 1975affea3bf0bd6f1769a79e4b0c7fde17962df
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 25 19:35:24 2014 -0400

        Added warning/notes about functionality change in docs, removed HTML changes

    commit 693c820092d9f17f9101074d29c2d7d53fa5a8ae
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 25 15:38:41 2014 -0400

        Fixed problem with HTML reading and infinite loop in PythonParser __init__

    commit 2a0a4babac7a5e53279eaa8281d0a51406caeb27
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Mon Jun 23 08:37:33 2014 -0400

        Updated docs with new read_csv functionality, removed unreachable code

    commit 19b5811e8d78c4e618e19ff5768aa2cfff041620
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 18 21:43:47 2014 -0400

        Fixed error in empty/whitespace removal function

    commit 3fd11a822cc0bee123d68240c62627da11ee88c2
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 18 18:48:08 2014 -0400

        Squashed commit of the following:

        commit 60a1cd1bc1042a9959ae75ff006052c433d98825
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 18:40:17 2014 -0400

            Fixed error with string/numerical types

        commit 7fe1bcf75466ea2b19d947aff0769c9f03bc23f5
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 17:47:56 2014 -0400

            release notes

        commit 835e490c8d3a3a96aeb6a6c3846217d36469656b
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 17:15:17 2014 -0400

            Release note

        commit 25cee3167b81b9c81e969629cd83968c6736a94f
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 16:56:44 2014 -0400

            Fixed whitespace issue, made C parser check for delimiters in whitespace lines

        commit 593495eb15162833de78d2da65f377fa977ad225
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 15:41:52 2014 -0400

            Added new functionality to Python reader

        commit 8a8325ed883034f176c929b41fe6fad16420e9b5
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Tue Jun 17 19:52:41 2014 -0400

            Adjusted tokenizer to ignore whitespace-only lines, fixed tests

        commit 3ea2eed22884a63a6e8dec1b795acdf29b030949
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Mon Jun 16 12:36:14 2014 -0400

            Moved tests to C parsing suite, corrected multi-index test

        commit d5540311ca44992148932ae27e16fc4d02a2a018
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Mon Jun 16 12:35:46 2014 -0400

            Changed empty file handling so that a ValueError is raised as expected

        commit 03a4c3d27c18052f04bd7cb862d289eabbc773ba
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Sun Jun 15 23:07:17 2014 -0400

            Wrote tests for empty lines and comment lines

        commit 01db817e97fc8ee0da85cc17603578b56d294b1b
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Sun Jun 15 23:02:04 2014 -0400

            Modified C tokenizer so that comments and empty lines are ignored
mdmueller committed Sep 19, 2014
commit e4bcb5c2e6d34caf7152c0392739b0413d0c9847
73 changes: 49 additions & 24 deletions doc/source/io.rst
@@ -100,8 +100,10 @@ They can take a number of arguments:
a list of integers that specify row locations for a multi-index on the columns
E.g. [0,1,3]. Intervening rows that are not specified will be
skipped (e.g. 2 in this example are skipped). Note that this parameter
ignores commented lines, so header=0 denotes the first line of
data rather than the first line of the file.
ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0
denotes the first line of data rather than the first line of the file.
- ``skip_blank_lines``: whether to skip over blank lines rather than interpreting
Review comment (Contributor): put the default here (True)

them as NaN values
- ``skiprows``: A collection of numbers for rows in the file to skip. Can
also be an integer to skip the first ``n`` rows
- ``index_col``: column number, column name, or list of column numbers/names,
@@ -149,7 +151,7 @@ They can take a number of arguments:
- ``escapechar`` : string, to specify how to escape quoted data
- ``comment``: Indicates remainder of line should not be parsed. If found at the
beginning of a line, the line will be ignored altogether. This parameter
must be a single character. Also, fully commented lines
must be a single character. Like empty lines, fully commented lines
are ignored by the parameter `header` but not by `skiprows`. For example,
if comment='#', parsing '#empty\n1,2,3\na,b,c' with `header=0` will
result in '1,2,3' being treated as the header.
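The `comment` behaviour described in this hunk can be checked with a short script. This is a minimal sketch, assuming a pandas release that includes this change (0.15 or later) and `io.StringIO` from the standard library; the sample data with a trailing comment is invented for illustration and is not part of the PR.

```python
from io import StringIO

import pandas as pd

# Line 0 is fully commented; the last line carries a trailing comment.
data = "#empty\n1,2,3\n4,5,6#trailing\n"

# header=0 counts rows after fully commented lines are dropped,
# so "1,2,3" becomes the header rather than "#empty".
df = pd.read_csv(StringIO(data), comment="#", header=0)

print(list(df.columns))     # column labels taken from the "1,2,3" line
print(df.iloc[0].tolist())  # values parsed from "4,5,6" (comment stripped)
```

The column labels are strings here because they come from a header line, while the data row is parsed as integers.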
@@ -261,27 +263,6 @@ after a delimiter:
print(data)
pd.read_csv(StringIO(data), skipinitialspace=True)

Moreover, ``read_csv`` ignores any completely commented lines:

.. ipython:: python

data = 'a,b,c\n# commented line\n1,2,3\n#another comment\n4,5,6'
print(data)
pd.read_csv(StringIO(data), comment='#')

.. note::

The presence of ignored lines might create ambiguities involving line numbers;
the parameter ``header`` uses row numbers (ignoring commented
lines), while ``skiprows`` uses line numbers (including commented lines):

.. ipython:: python

data = '#comment\na,b,c\nA,B,C\n1,2,3'
pd.read_csv(StringIO(data), comment='#', header=1)
data = 'A,B,C\n#comment\na,b,c\n1,2,3'
pd.read_csv(StringIO(data), comment='#', skiprows=2)

The parsers make every attempt to "do the right thing" and not be very
fragile. Type inference is a pretty big deal. So if a column can be coerced to
integer dtype without altering the contents, it will do so. Any non-numeric
@@ -358,6 +339,50 @@ file, either using the column names or position numbers:
pd.read_csv(StringIO(data), usecols=['b', 'd'])
pd.read_csv(StringIO(data), usecols=[0, 2, 3])

.. _io.skiplines:

Ignoring line comments and empty lines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the ``comment`` parameter is specified, then completely commented lines will
be ignored. By default, completely blank lines will be ignored as well. Both of
these are API changes introduced in version 0.15.

.. ipython:: python

data = '\na,b,c\n \n# commented line\n1,2,3\n\n4,5,6'
print(data)
pd.read_csv(StringIO(data), comment='#')

If ``skip_blank_lines=False``, then ``read_csv`` will not ignore blank lines:

.. ipython:: python

data = 'a,b,c\n\n1,2,3\n\n\n4,5,6'
pd.read_csv(StringIO(data), skip_blank_lines=False)
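To make the difference between the two settings concrete, the sketch below reads the same string twice and compares row counts. It assumes pandas 0.15 or later; the variable names `skipped` and `kept` are illustrative only.

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n\n1,2,3\n\n\n4,5,6"

# Default: blank lines are dropped before parsing.
skipped = pd.read_csv(StringIO(data))

# skip_blank_lines=False: each blank line becomes a row of NaN values.
kept = pd.read_csv(StringIO(data), skip_blank_lines=False)

print(len(skipped))                         # data rows with blanks skipped
print(len(kept))                            # data rows with blanks kept
print(int(kept.isna().all(axis=1).sum()))   # number of all-NaN rows
```

Counting the all-NaN rows is one way to see how many blank lines the input contained.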

.. warning::

The presence of ignored lines might create ambiguities involving line numbers;
the parameter ``header`` uses row numbers (ignoring commented/empty
lines), while ``skiprows`` uses line numbers (including commented/empty lines):

.. ipython:: python

data = '#comment\na,b,c\nA,B,C\n1,2,3'
pd.read_csv(StringIO(data), comment='#', header=1)
data = 'A,B,C\n#comment\na,b,c\n1,2,3'
pd.read_csv(StringIO(data), comment='#', skiprows=2)

If both ``header`` and ``skiprows`` are specified, ``header`` will be
relative to the end of ``skiprows``. For example:

.. ipython:: python

data = '# empty\n# second empty line\n# third empty' \
'line\nX,Y,Z\n1,2,3\nA,B,C\n1,2.,4.\n5.,NaN,10.0'
print(data)
pd.read_csv(StringIO(data), comment='#', skiprows=4, header=1)
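The numbering rules in this section can be verified with a short script. This is a sketch assuming pandas 0.15 or later; the result names (`by_header`, `by_skiprows`, `combined`) are hypothetical, and the data strings are the ones used in the documentation examples in this hunk.

```python
from io import StringIO

import pandas as pd

# header counts rows after commented lines are dropped:
# row 0 is "a,b,c" and row 1 is "A,B,C".
data = "#comment\na,b,c\nA,B,C\n1,2,3"
by_header = pd.read_csv(StringIO(data), comment="#", header=1)

# skiprows counts physical lines, so the comment line is counted too:
# lines 0 ("A,B,C") and 1 ("#comment") are skipped.
data = "A,B,C\n#comment\na,b,c\n1,2,3"
by_skiprows = pd.read_csv(StringIO(data), comment="#", skiprows=2)

# With both, skiprows is applied first (lines 0-3), then header=1
# selects the second remaining row, "A,B,C".
data = ("# empty\n# second empty line\n# third emptyline\n"
        "X,Y,Z\n1,2,3\nA,B,C\n1,2.,4.\n5.,NaN,10.0")
combined = pd.read_csv(StringIO(data), comment="#", skiprows=4, header=1)

print(list(by_header.columns))
print(list(by_skiprows.columns))
print(list(combined.columns), len(combined))
```

In the combined case the row "1,2,3" sits before the selected header and is therefore not part of the result.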

.. _io.unicode:

Dealing with Unicode Data
7 changes: 5 additions & 2 deletions doc/source/v0.15.0.txt
@@ -153,6 +153,11 @@ API changes

ewma(s, com=3., min_periods=2)

- Made both the C-based and Python engines for `read_csv` and `read_table` ignore empty lines in input as well as
whitespace-filled lines, as long as `sep` is not whitespace. This is an API change
that can be controlled by the keyword parameter `skip_blank_lines`.
(:issue:`4466`)
Review comment (Contributor): add a reference to the new section io.skiplines


- :func:`ewmstd`, :func:`ewmvol`, :func:`ewmvar`, :func:`ewmcov`, and :func:`ewmcorr`
now have an optional ``adjust`` argument, just like :func:`ewma` does,
affecting how the weights are calculated.
@@ -678,8 +683,6 @@ Enhancements





- ``tz_localize`` now accepts the ``ambiguous`` keyword which allows for passing an array of bools
indicating whether the date belongs in DST or not, 'NaT' for setting transition times to NaT,
'infer' for inferring DST/non-DST, and 'raise' (default) for an AmbiguousTimeError to be raised (:issue:`7943`).
55 changes: 42 additions & 13 deletions pandas/io/parsers.py
@@ -65,8 +65,8 @@ class ParserWarning(Warning):
a list of integers that specify row locations for a multi-index on the
columns E.g. [0,1,3]. Intervening rows that are not specified will be
skipped (e.g. 2 in this example are skipped). Note that this parameter
ignores commented lines, so header=0 denotes the first line of
data rather than the first line of the file.
ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0
denotes the first line of data rather than the first line of the file.
skiprows : list-like or integer
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file
@@ -110,10 +110,11 @@ class ParserWarning(Warning):
comment : str, default None
Indicates remainder of line should not be parsed. If found at the
beginning of a line, the line will be ignored altogether. This parameter
must be a single character. Also, fully commented lines
are ignored by the parameter `header` but not by `skiprows`. For example,
if comment='#', parsing '#empty\n1,2,3\na,b,c' with `header=0` will
result in '1,2,3' being treated as the header.
must be a single character. Like empty lines (as long as ``skip_blank_lines=True``),
fully commented lines are ignored by the parameter `header`
but not by `skiprows`. For example, if comment='#', parsing
'#empty\n1,2,3\na,b,c' with `header=0` will result in '1,2,3' being
treated as the header.
decimal : str, default '.'
Character to recognize as decimal point. E.g. use ',' for European data
nrows : int, default None
@@ -160,6 +161,8 @@ class ParserWarning(Warning):
infer_datetime_format : boolean, default False
If True and parse_dates is enabled for a column, attempt to infer
the datetime format to speed up the processing
skip_blank_lines : boolean, default True
If True, skip over blank lines rather than interpreting as NaN values

Returns
-------
@@ -288,6 +291,7 @@ def _read(filepath_or_buffer, kwds):
'mangle_dupe_cols': True,
'tupleize_cols': False,
'infer_datetime_format': False,
'skip_blank_lines': True
}


@@ -378,7 +382,8 @@ def parser_f(filepath_or_buffer,
squeeze=False,
mangle_dupe_cols=True,
tupleize_cols=False,
infer_datetime_format=False):
infer_datetime_format=False,
skip_blank_lines=True):

# Alias sep -> delimiter.
if delimiter is None:
@@ -449,7 +454,8 @@ def parser_f(filepath_or_buffer,
buffer_lines=buffer_lines,
mangle_dupe_cols=mangle_dupe_cols,
tupleize_cols=tupleize_cols,
infer_datetime_format=infer_datetime_format)
infer_datetime_format=infer_datetime_format,
skip_blank_lines=skip_blank_lines)

return _read(filepath_or_buffer, kwds)

@@ -1338,6 +1344,7 @@ def __init__(self, f, **kwds):
self.quoting = kwds['quoting']
self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True)
self.usecols = kwds['usecols']
self.skip_blank_lines = kwds['skip_blank_lines']

self.names_passed = kwds['names'] or None

@@ -1393,6 +1400,7 @@ def __init__(self, f, **kwds):

# needs to be cleaned/refactored
# multiple date column thing turning into a real spaghetti factory

if not self._has_complex_date_col:
(index_names,
self.orig_names, self.columns) = self._get_index_name(self.columns)
@@ -1590,6 +1598,7 @@ def _infer_columns(self):

while self.line_pos <= hr:
line = self._next_line()

unnamed_count = 0
this_columns = []
for i, c in enumerate(line):
@@ -1727,25 +1736,35 @@ def _next_line(self):
line = self._check_comments([self.data[self.pos]])[0]
self.pos += 1
# either uncommented or blank to begin with
if self._empty(self.data[self.pos - 1]) or line:
if not self.skip_blank_lines and (self._empty(self.data[
self.pos - 1]) or line):
break
elif self.skip_blank_lines:
ret = self._check_empty([line])
if ret:
line = ret[0]
break
except IndexError:
raise StopIteration
else:
while self.pos in self.skiprows:
next(self.data)
self.pos += 1
next(self.data)

while True:
orig_line = next(self.data)
line = self._check_comments([orig_line])[0]
self.pos += 1
if self._empty(orig_line) or line:
if not self.skip_blank_lines and (self._empty(orig_line) or line):
break
elif self.skip_blank_lines:
ret = self._check_empty([line])
if ret:
line = ret[0]
break

self.line_pos += 1
self.buf.append(line)

return line

def _check_comments(self, lines):
@@ -1766,6 +1785,15 @@ def _check_comments(self, lines):
ret.append(rl)
return ret

def _check_empty(self, lines):
ret = []
for l in lines:
# Remove empty lines and lines with only one whitespace value
if len(l) > 1 or len(l) == 1 and (not isinstance(l[0],
compat.string_types) or l[0].strip()):
ret.append(l)
return ret
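The filtering rule in `_check_empty` can be read more easily as a standalone function. The sketch below is not the PR's code: the name `drop_blank_rows` is hypothetical, and it uses Python 3's `str` where the PR uses `compat.string_types`. It is meant only to spell out the operator precedence of the original condition.

```python
def drop_blank_rows(rows):
    """Keep rows with more than one field, or with one non-blank field."""
    kept = []
    for row in rows:
        if len(row) > 1:
            # Several fields: always kept, even if every field is empty.
            kept.append(row)
        elif len(row) == 1:
            field = row[0]
            # A single field survives if it is not a string,
            # or if it is a string with non-whitespace content.
            if not isinstance(field, str) or field.strip():
                kept.append(row)
        # Rows with no fields at all are dropped.
    return kept


rows = [[], [""], ["  "], ["a"], ["", ""], [1]]
print(drop_blank_rows(rows))  # [['a'], ['', ''], [1]]
```

Under this rule, a line that contains only a delimiter splits into two empty fields and is therefore kept rather than treated as blank.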

def _check_thousands(self, lines):
if self.thousands is None:
return lines
@@ -1901,7 +1929,6 @@ def _get_lines(self, rows=None):

# already fetched some number
if rows is not None:

# we already have the lines in the buffer
if len(self.buf) >= rows:
new_rows, self.buf = self.buf[:rows], self.buf[rows:]
@@ -1966,6 +1993,8 @@ def _get_lines(self, rows=None):
lines = lines[:-self.skip_footer]

lines = self._check_comments(lines)
if self.skip_blank_lines:
lines = self._check_empty(lines)
return self._check_thousands(lines)

