PERF: add exact kw to to_datetime to enable faster regex format parsing by jreback · Pull Request #8904 · pandas-dev/pandas · GitHub

Conversation

@jreback
Contributor
@jreback jreback commented Nov 27, 2014

closes #8989
closes #8903

The default is exact=True for back-compat.
With exact=False, the match still starts at the beginning of the string (as it always has),
but the string is not required to match ONLY the format; IOW, there can be extra stuff after the match.

Avoids having to do a regex replace first.

In [21]: s = Series(['19MAY11','19MAY11:00:00:00']*100000)

In [22]: %timeit pd.to_datetime(s.str.replace(':\S+$',''),format='%d%b%y')
1 loops, best of 3: 828 ms per loop

In [23]: %timeit pd.to_datetime(s,format='%d%b%y',exact=False)
1 loops, best of 3: 603 ms per loop

@jreback jreback added API Design Strings String extension data type and string data Datetime Datetime data dtype Performance Memory or execution speed performance labels Nov 27, 2014
@jreback jreback added this to the 0.15.2 milestone Nov 27, 2014
@jorisvandenbossche
Member
  • is the proposed keyword exact or complete?
  • can the match be anywhere (not only at the start)? I suppose the most common use case is with / without a time part, subsecond part, or timezone part (endings)

@jreback
Contributor Author
jreback commented Nov 27, 2014

yeh, changing back to exact. Also going to change it to do a search rather than a match if exact=False, so it can find the format anywhere in the string.

Actually, there are other use-cases for this:

In [4]: s = Series(['19MAY11','foobar19MAY11','19MAY11:00:00:00','19MAY11 00:00:00Z'])

In [5]: s
Out[5]: 
0              19MAY11
1        foobar19MAY11
2     19MAY11:00:00:00
3    19MAY11 00:00:00Z
dtype: object

In [6]: pd.to_datetime(s,format='%d%b%y',exact=False)
Out[6]: 
0   2011-05-19
1   2011-05-19
2   2011-05-19
3   2011-05-19
dtype: datetime64[ns]

@jreback jreback force-pushed the timere branch 4 times, most recently from 8c7f505 to cc8a3de Compare November 30, 2014 17:24
@jreback
Contributor Author
jreback commented Dec 1, 2014

@jorisvandenbossche ?

@jreback
Contributor Author
jreback commented Dec 3, 2014

@jorisvandenbossche ?

Member

Can you add punctuation and capitals here? (When rendered to HTML, sphinx does not keep the newlines, so the visual formatting does not work there.)

@jorisvandenbossche
Member

Does this also work for an example that currently fails (reading in nanoseconds when specifying a format)?

In [11]: pd.to_datetime("2012-01-01 09:00:00.000000001", format="%Y-%m-%d %H:%M:%S.%f")
-> 
ValueError: unconverted data remains: 001

Member

Can you add the new exact keyword as a separate entry in the enhancements section?

@jorisvandenbossche
Member

For the keyword name: match is also a possibility, but I am not sure it is better.

Out of curiosity, where does the speed-up come from? The extra cython type declarations?

@jreback
Contributor Author
jreback commented Dec 3, 2014

@jorisvandenbossche the speedup comes from not having to do an .str.extract(.....) first and then parse the result as usual. Currently the regex parser uses .match (and allows nothing trailing). This allows it to use .search.
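To illustrate the .match versus .search distinction described above, here is a minimal stdlib-only sketch. It is not the pandas implementation: the hand-written regex, the `parse` helper, and the "two-digit year means 20xx" shortcut are all assumptions made for illustration of a `'%d%b%y'`-style format.

```python
import re
from datetime import datetime

# Month abbreviations, used both to build the regex and to look up the month number.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Hypothetical stand-in for the regex a parser might build from '%d%b%y'.
DATE_RE = re.compile(r"(\d{2})(" + "|".join(MONTHS) + r")(\d{2})")


def parse(value, exact=True):
    """Return a datetime, or None if the value does not fit the format.

    exact=True  -> the whole string must be the date (nothing leading or trailing).
    exact=False -> the date may appear anywhere in the string.
    """
    m = DATE_RE.fullmatch(value) if exact else DATE_RE.search(value)
    if m is None:
        return None
    day, mon, yy = m.groups()
    # Simplification (assumption): treat every two-digit year as 20xx.
    return datetime(2000 + int(yy), MONTHS.index(mon) + 1, int(day))


samples = ["19MAY11", "foobar19MAY11", "19MAY11:00:00:00", "19MAY11 00:00:00Z"]
strict = [parse(s) for s in samples]               # only the first one parses
relaxed = [parse(s, exact=False) for s in samples]  # all four parse
```

With the search variant there is no need for a separate pass that strips or extracts the date substring before parsing, which is the saving the comment above refers to.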

@jreback
Contributor Author
jreback commented Dec 3, 2014

ok with match rather than exact

@jreback
Contributor Author
jreback commented Dec 3, 2014

This works; is there an issue about this?

In [3]: pd.to_datetime("2012-01-01 09:00:00.000000001", format="%Y-%m-%d %H:%M:",exact=False)
Out[3]: Timestamp('2012-01-01 09:00:00')

@jorisvandenbossche
Member

No, no issue I think. Using this keyword avoids the error, but of course it does not solve the problem that the nanoseconds are not read in. I don't know if we can do something about that, as this is defined by Python's format strings, where %f means microseconds.
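The microsecond limit of %f can be seen with the standard library alone. The snippet below is only an illustration using `datetime.strptime`, not pandas: %f accepts at most six fractional digits, so a nine-digit fraction leaves three digits unparsed.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Six fractional digits (microseconds) parse without trouble.
ok = datetime.strptime("2012-01-01 09:00:00.000001", FMT)

# Nine fractional digits: %f consumes the first six and the remainder is left over,
# which strptime reports as a ValueError.
try:
    datetime.strptime("2012-01-01 09:00:00.000000001", FMT)
    error = None
except ValueError as exc:
    error = str(exc)
```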

@jreback
Contributor Author
jreback commented Dec 3, 2014

You don't actually need to use a formatting string. But we could easily change the 'f' to allow for nanoseconds, I think. Let me see.

In [4]: pd.to_datetime("2012-01-01 09:00:00.000000001")
Out[4]: Timestamp('2012-01-01 09:00:00.000000001')
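One way to picture what "allowing nanoseconds" with an explicit format could involve is to split the fractional part by hand. This is a rough sketch only, not what pandas does; `split_fraction` is a hypothetical helper, and it assumes the input has the fixed `YYYY-mm-dd HH:MM:SS[.fraction]` layout.

```python
from datetime import datetime


def split_fraction(text):
    """Parse 'YYYY-mm-dd HH:MM:SS[.fraction]' into (datetime, leftover_nanoseconds).

    The first six fractional digits go through %f as microseconds; digits
    seven to nine are returned separately as an integer nanosecond remainder.
    """
    head, _, frac = text.partition(".")
    frac = frac.ljust(9, "0")[:9]        # normalise to exactly nine digits
    micros, nanos = frac[:6], frac[6:]
    base = datetime.strptime(head + "." + micros, "%Y-%m-%d %H:%M:%S.%f")
    return base, int(nanos)


base, nanos = split_fraction("2012-01-01 09:00:00.000000001")
```

The standard-library datetime cannot hold the remainder itself, which is why it is returned as a separate value here.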

Member

maybe timeseries_with_format_exact is a better name? (the replace does not really matter for what it is about)

@jorisvandenbossche
Member

@jreback yes, I know, the format string is not needed in this example because I used a default format, but in general that does not always hold.

@jorisvandenbossche
Member

opened an issue for the nanoseconds #8989

@jorisvandenbossche
Member

And I copied only part of the line. I meant pd.to_datetime("2012-01-01 09:00:00.000000001", format="%Y-%m-%d %H:%M:%S.%f") above

@jreback jreback force-pushed the timere branch 5 times, most recently from 2e165e2 to 5d972d0 Compare December 4, 2014 11:19
@jreback
Contributor Author
jreback commented Dec 5, 2014

@jorisvandenbossche ok on the kw?

@jorisvandenbossche
Member

you mean exact vs match ? No strong opinion, exact seems a bit more self-explanatory, match is a bit more in line with the regex naming.

@jreback
Contributor Author
jreback commented Dec 5, 2014

ok, merging then.

jreback added a commit that referenced this pull request Dec 5, 2014
PERF: add exact kw to to_datetime to enable faster regex format parsing
@jreback jreback merged commit 6f7f5f8 into pandas-dev:master Dec 5, 2014

Development

Successfully merging this pull request may close these issues.

to_datetime with format string cannot read nanoseconds
ENH/PERF: new kw complete to to_datetime to allow partial matches on a datetime format
