Fix misestimation of n_distinct for a nearly-unique column with many … · s-monk/postgres@a449ad0 · GitHub
Commit a449ad0

Fix misestimation of n_distinct for a nearly-unique column with many nulls.
If ANALYZE found no repeated non-null entries in its sample, it set the column's stadistinct value to -1.0, intending to indicate that the entries are all distinct. But what this value actually means is that the number of distinct values is 100% of the table's rowcount, and thus it was overestimating the number of distinct values by however many nulls there are. This could lead to very poor selectivity estimates, as for example in a recent report from Andreas Joseph Krogh. We should discount the stadistinct value by whatever we've estimated the nulls fraction to be. (That is what will happen if we choose to use a negative stadistinct for a column that does have repeated entries, so this code path was just inconsistent.)

In addition to fixing the stadistinct entries stored by several different ANALYZE code paths, adjust the logic where get_variable_numdistinct() forces an "all distinct" estimate on the basis of finding a relevant unique index. Unique indexes don't reject nulls, so there's no reason to assume that the null fraction doesn't apply.

Back-patch to all supported branches. Back-patching is a bit of a judgment call, but this problem seems to affect only a few users (else we'd have identified it long ago), and it's bad enough when it does happen that destabilizing plan choices in a worse direction seems unlikely.

Patch by me, with documentation wording suggested by Dean Rasheed

Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena>
Discussion: <16143.1470350371@sss.pgh.pa.us>
1 parent 1f63b0e commit a449ad0
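To put numbers on the error being fixed, here is a small standalone C sketch (illustration only, not PostgreSQL code; the 1,000,000-row table and the 50% null fraction are made-up values) of how a negative stadistinct is expanded into a distinct-value estimate, before and after discounting by the null fraction:

#include <stdio.h>

/*
 * A negative stadistinct is the negative of a multiplier on the table's
 * row count; expand it into an absolute number of distinct values.
 */
static double
expand_stadistinct(double stadistinct, double ntuples)
{
    if (stadistinct < 0.0)
        return -stadistinct * ntuples;
    return stadistinct;
}

int
main(void)
{
    double      ntuples = 1000000.0;    /* hypothetical table size */
    double      stanullfrac = 0.5;      /* hypothetical: half the column is NULL */

    /* Old ANALYZE behavior: no repeated non-null values => stadistinct = -1.0 */
    double      old_estimate = expand_stadistinct(-1.0, ntuples);

    /* New behavior: discount the multiplier by the estimated null fraction */
    double      new_estimate = expand_stadistinct(-1.0 * (1.0 - stanullfrac), ntuples);

    printf("old estimate: %.0f distinct values\n", old_estimate);  /* 1000000 */
    printf("new estimate: %.0f distinct values\n", new_estimate);  /* 500000 */
    return 0;
}

With half the column NULL, the old coding credited the column with twice as many distinct values as it can actually contain, which is exactly the kind of overestimate that skews join and GROUP BY selectivity.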

File tree

5 files changed (+31 lines, -20 lines)

doc/src/sgml/catalogs.sgml

Lines changed: 3 additions & 3 deletions

@@ -5072,9 +5072,9 @@
        <entry>The number of distinct nonnull data values in the column.
        A value greater than zero is the actual number of distinct values.
        A value less than zero is the negative of a multiplier for the number
-       of rows in the table; for example, a column in which values appear about
-       twice on the average could be represented by
-       <structfield>stadistinct</> = -0.5.
+       of rows in the table; for example, a column in which about 80% of the
+       values are nonnull and each nonnull value appears about twice on
+       average could be represented by <structfield>stadistinct</> = -0.4.
        A zero value means the number of distinct values is unknown.
        </entry>
       </row>
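As a worked check of the new example (the 100,000-row table here is a made-up figure): 80% nonnull values in such a table means about 80,000 nonnull entries; if each nonnull value appears about twice, that is roughly 40,000 distinct values, i.e. 0.4 times the row count, hence stadistinct = -0.4. With no nulls the same reasoning gives 50,000 distinct values, which is the old example's stadistinct = -0.5.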

src/backend/commands/analyze.c

Lines changed: 11 additions & 5 deletions

@@ -2044,8 +2044,11 @@ compute_minimal_stats(VacAttrStatsP stats,
 
         if (nmultiple == 0)
         {
-            /* If we found no repeated values, assume it's a unique column */
-            stats->stadistinct = -1.0;
+            /*
+             * If we found no repeated non-null values, assume it's a unique
+             * column; but be sure to discount for any nulls we found.
+             */
+            stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
         }
         else if (track_cnt < track_max && toowide_cnt == 0 &&
                  nmultiple == track_cnt)
@@ -2390,8 +2393,11 @@ compute_scalar_stats(VacAttrStatsP stats,
 
         if (nmultiple == 0)
         {
-            /* If we found no repeated values, assume it's a unique column */
-            stats->stadistinct = -1.0;
+            /*
+             * If we found no repeated non-null values, assume it's a unique
+             * column; but be sure to discount for any nulls we found.
+             */
+            stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
         }
         else if (toowide_cnt == 0 && nmultiple == ndistinct)
         {
@@ -2695,7 +2701,7 @@ compute_scalar_stats(VacAttrStatsP stats,
         else
             stats->stawidth = stats->attrtype->typlen;
         /* Assume all too-wide values are distinct, so it's a unique column */
-        stats->stadistinct = -1.0;
+        stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
     }
     else if (null_cnt > 0)
     {
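For context, both compute_minimal_stats() and compute_scalar_stats() have already derived stats->stanullfrac from the sample a little earlier in these branches (roughly null_cnt / samplerows; that code is outside this diff, so treat the exact expression as a recollection rather than a quotation). With made-up numbers, a 30,000-row sample containing 9,000 NULLs gives stanullfrac = 0.3, so the new code stores -1.0 * (1.0 - 0.3) = -0.7 instead of -1.0: "every non-null value is distinct" is now expressed as 70% of the rowcount rather than 100% of it.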

src/backend/tsearch/ts_typanalyze.c

Lines changed: 1 addition & 1 deletion

@@ -296,7 +296,7 @@ compute_tsvector_stats(VacAttrStats *stats,
         stats->stawidth = total_width / (double) nonnull_cnt;
 
         /* Assume it's a unique column (see notes above) */
-        stats->stadistinct = -1.0;
+        stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 
         /*
          * Construct an array of the interesting hashtable items, that is,

src/backend/utils/adt/selfuncs.c

Lines changed: 8 additions & 4 deletions

@@ -4383,12 +4383,14 @@ double
 get_variable_numdistinct(VariableStatData *vardata)
 {
     double      stadistinct;
+    double      stanullfrac = 0.0;
     double      ntuples;
 
     /*
      * Determine the stadistinct value to use.  There are cases where we can
      * get an estimate even without a pg_statistic entry, or can get a better
-     * value than is in pg_statistic.
+     * value than is in pg_statistic.  Grab stanullfrac too if we can find it
+     * (otherwise, assume no nulls, for lack of any better idea).
      */
     if (HeapTupleIsValid(vardata->statsTuple))
     {
@@ -4397,6 +4399,7 @@ get_variable_numdistinct(VariableStatData *vardata)
 
         stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
         stadistinct = stats->stadistinct;
+        stanullfrac = stats->stanullfrac;
     }
     else if (vardata->vartype == BOOLOID)
     {
@@ -4420,7 +4423,7 @@ get_variable_numdistinct(VariableStatData *vardata)
         {
             case ObjectIdAttributeNumber:
             case SelfItemPointerAttributeNumber:
-                stadistinct = -1.0;     /* unique */
+                stadistinct = -1.0;     /* unique (and all non null) */
                 break;
             case TableOidAttributeNumber:
                 stadistinct = 1.0;      /* only 1 value */
@@ -4442,10 +4445,11 @@ get_variable_numdistinct(VariableStatData *vardata)
      * If there is a unique index for the variable, assume it is unique no
      * matter what pg_statistic says; the statistics could be out of date, or
      * we might have found a partial unique index that proves the var is
-     * unique for this query.
+     * unique for this query.  However, we'd better still believe
+     * the null-fraction statistic.
      */
     if (vardata->isunique)
-        stadistinct = -1.0;
+        stadistinct = -1.0 * (1.0 - stanullfrac);
 
     /*
      * If we had an absolute estimate, use that.
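The reason the unique-index shortcut needs the same discount is how the rest of get_variable_numdistinct() consumes the value. A simplified sketch of that consumption follows (not the actual function body: the clamping details are omitted and DEFAULT_NUM_DISTINCT is used here only as a stand-in for the planner's fallback constant):

#include <math.h>

#define DEFAULT_NUM_DISTINCT 200    /* stand-in for the planner's fallback */

/*
 * Simplified sketch: a positive stadistinct is an absolute count, a negative
 * one is a multiplier on the relation's row count, and zero means "unknown".
 */
static double
numdistinct_from_stats(double stadistinct, double ntuples)
{
    if (stadistinct > 0.0)
        return stadistinct;
    if (stadistinct < 0.0 && ntuples > 0.0)
        return floor(-stadistinct * ntuples + 0.5);
    return DEFAULT_NUM_DISTINCT;
}

Before the patch, vardata->isunique forced stadistinct to -1.0, so a unique (possibly partial) index over a mostly-NULL column made this computation return the full row count; with the change, the null fraction taken from pg_statistic still scales the estimate down.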

src/include/catalog/pg_statistic.h

Lines changed: 8 additions & 7 deletions

@@ -67,13 +67,14 @@ CATALOG(pg_statistic,2619) BKI_WITHOUT_OIDS
     * > 0      actual number of distinct values
     * < 0      negative of multiplier for number of rows
     * The special negative case allows us to cope with columns that are
-    * unique (stadistinct = -1) or nearly so (for example, a column in
-    * which values appear about twice on the average could be represented
-    * by stadistinct = -0.5).  Because the number-of-rows statistic in
-    * pg_class may be updated more frequently than pg_statistic is, it's
-    * important to be able to describe such situations as a multiple of
-    * the number of rows, rather than a fixed number of distinct values.
-    * But in other cases a fixed number is correct (eg, a boolean column).
+    * unique (stadistinct = -1) or nearly so (for example, a column in which
+    * non-null values appear about twice on the average could be represented
+    * by stadistinct = -0.5 if there are no nulls, or -0.4 if 20% of the
+    * column is nulls).  Because the number-of-rows statistic in pg_class may
+    * be updated more frequently than pg_statistic is, it's important to be
+    * able to describe such situations as a multiple of the number of rows,
+    * rather than a fixed number of distinct values.  But in other cases a
+    * fixed number is correct (eg, a boolean column).
     * ----------------
     */
    float4      stadistinct;
