8000 Fix misestimation of n_distinct for a nearly-unique column with many … · s-monk/postgres@8f180a6 · GitHub
[go: up one dir, main page]

Skip to content

Commit 8f180a6

Browse files
committed
Fix misestimation of n_distinct for a nearly-unique column with many nulls.
If ANALYZE found no repeated non-null entries in its sample, it set the column's stadistinct value to -1.0, intending to indicate that the entries are all distinct. But what this value actually means is that the number of distinct values is 100% of the table's rowcount, and thus it was overestimating the number of distinct values by however many nulls there are. This could lead to very poor selectivity estimates, as for example in a recent report from Andreas Joseph Krogh. We should discount the stadistinct value by whatever we've estimated the nulls fraction to be. (That is what will happen if we choose to use a negative stadistinct for a column that does have repeated entries, so this code path was just inconsistent.) In addition to fixing the stadistinct entries stored by several different ANALYZE code paths, adjust the logic where get_variable_numdistinct() forces an "all distinct" estimate on the basis of finding a relevant unique index. Unique indexes don't reject nulls, so there's no reason to assume that the null fraction doesn't apply. Back-patch to all supported branches. Back-patching is a bit of a judgment call, but this problem seems to affect only a few users (else we'd have identified it long ago), and it's bad enough when it does happen that destabilizing plan choices in a worse direction seems unlikely. Patch by me, with documentation wording suggested by Dean Rasheed Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena> Discussion: <16143.1470350371@sss.pgh.pa.us>
1 parent c63588f commit 8f180a6

File tree

6 files changed

+34
-21
lines changed

6 files changed

+34
-21
lines changed

doc/src/sgml/catalogs.sgml

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -5814,9 +5814,9 @@
58145814
<entry>The number of distinct nonnull data values in the column.
58155815
A value greater than zero is the actual number of distinct values.
58165816
A value less than zero is the negative of a multiplier for the number
5817-
of rows in the table; for example, a column in which values appear about
5818-
twice on the average could be represented by
5819-
<structfield>stadistinct</> = -0.5.
5817+
of rows in the table; for example, a column in which about 80% of the
5818+
values are nonnull and each nonnull value appears about twice on
5819+
average could be represented by <structfield>stadistinct</> = -0.4.
58205820
A zero value means the number of distinct values is unknown.
58215821
</entry>
58225822
</row>

src/backend/commands/analyze.c

Lines changed: 11 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -2102,8 +2102,11 @@ compute_minimal_stats(VacAttrStatsP stats,
21022102

21032103
if (nmultiple == 0)
21042104
{
2105-
/* If we found no repeated values, assume it's a unique column */
2106-
stats->stadistinct = -1.0;
2105+
/*
2106+
* If we found no repeated non-null values, assume it's a unique
2107+
* column; but be sure to discount for any nulls we found.
2108+
*/
2109+
stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
21072110
}
21082111
else if (track_cnt < track_max && toowide_cnt == 0 &&
21092112
nmultiple == track_cnt)
@@ -2450,8 +2453,11 @@ compute_scalar_stats(VacAttrStatsP stats,
24502453

24512454
if (nmultiple == 0)
24522455
{
2453-
/* If we found no repeated values, assume it's a unique column */
2454-
stats->stadistinct = -1.0;
2456+
/*
2457+
* If we found no repeated non-null values, assume it's a unique
2458+
* column; but be sure to discount for any nulls we found.
2459+
*/
2460+
stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
24552461
}
24562462
else if (toowide_cnt == 0 && nmultiple == ndistinct)
24572463
{
@@ -2755,7 +2761,7 @@ compute_scalar_stats(VacAttrStatsP stats,
27552761
else
27562762
stats->stawidth = stats->attrtype->typlen;
27572763
/* Assume all too-wide values are distinct, so it's a unique column */
2758-
stats->stadistinct = -1.0;
2764+
stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
27592765
}
27602766
else if (null_cnt > 0)
27612767
{

src/backend/tsearch/ts_typanalyze.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -295,7 +295,7 @@ compute_tsvector_stats(VacAttrStats *stats,
295295
stats->stawidth = total_width / (double) nonnull_cnt;
296296

297297
/* Assume it's a unique column (see notes above) */
298-
stats->stadistinct = -1.0;
298+
stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
299299

300300
/*
301301
* Construct an array of the interesting hashtable items, that is,

src/backend/utils/adt/rangetypes_typanalyze.c

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -203,7 +203,9 @@ compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
203203
/* Do the simple null-frac and width stats */
204204
stats->stanullfrac = (double) null_cnt / (double) samplerows;
205205
stats->stawidth = total_width / (double) non_null_cnt;
206-
stats->stadistinct = -1.0;
206+
207+
/* Estimate that non-null values are unique */
208+
stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
207209

208210
/* Must copy the target values into anl_context */
209211
old_cxt = MemoryContextSwitchTo(stats->anl_context);

src/backend/utils/adt/selfuncs.c

Lines changed: 8 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -4629,14 +4629,16 @@ double
46294629
get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
46304630
{
46314631
double stadistinct;
4632+
double stanullfrac = 0.0;
46324633
double ntuples;
46334634

46344635
*isdefault = false;
46354636

46364637
/*
46374638
* Determine the stadistinct value to use. There are cases where we can
46384639
* get an estimate even without a pg_statistic entry, or can get a better
4639-
* value than is in pg_statistic.
4640+
* value than is in pg_statistic. Grab stanullfrac too if we can find it
4641+
* (otherwise, assume no nulls, for lack of any better idea).
46404642
*/
46414643
if (HeapTupleIsValid(vardata->statsTuple))
46424644
{
@@ -4645,6 +4647,7 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
46454647

46464648
stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
46474649
stadistinct = stats->stadistinct;
4650+
stanullfrac = stats->stanullfrac;
46484651
}
46494652
else if (vardata->vartype == BOOLOID)
46504653
{
@@ -4668,7 +4671,7 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
46684671
{
46694672
case ObjectIdAttributeNumber:
46704673
case SelfItemPointerAttributeNumber:
4671-
stadistinct = -1.0; /* unique */
4674+
stadistinct = -1.0; /* unique (and all non null) */
46724675
break;
46734676
case TableOidAttributeNumber:
46744677
stadistinct = 1.0; /* only 1 value */
@@ -4690,10 +4693,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
46904693
* If there is a unique index or DISTINCT clause for the variable, assume
46914694
* it is unique no matter what pg_statistic says; the statistics could be
46924695
* out of date, or we might have found a partial unique index that proves
4693-
* the var is unique for this query.
4696+
* the var is unique for this query. However, we'd better still believe
4697+
* the null-fraction statistic.
46944698
*/
46954699
if (vardata->isunique)
4696-
stadistinct = -1.0;
4700+
stadistinct = -1.0 * (1.0 - stanullfrac);
46974701

46984702
/*
46994703
* If we had an absolute estimate, use that.

src/include/catalog/pg_statistic.h

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -57,13 +57,14 @@ CATALOG(pg_statistic,2619) BKI_WITHOUT_OIDS
5757
* > 0 actual number of distinct values
5858
* < 0 negative of multiplier for number of rows
5959
* The special negative case allows us to cope with columns that are
60-
* unique (stadistinct = -1) or nearly so (for example, a column in
61-
* which values appear about twice on the average could be represented
62-
* by stadistinct = -0.5). Because the number-of-rows statistic in
63-
* pg_class may be updated more frequently than pg_statistic is, it's
64-
* important to be able to describe such situations as a multiple of
65-
* the number of rows, rather than a fixed number of distinct values.
66-
* But in other cases a fixed number is correct (eg, a boolean column).
60+
* unique (stadistinct = -1) or nearly so (for example, a column in which
61+
* non-null values appear about twice on the average could be represented
62+
* by stadistinct = -0.5 if there are no nulls, or -0.4 if 20% of the
63+
* column is nulls). Because the number-of-rows statistic in pg_class may
64+
* be updated more frequently than pg_statistic is, it's important to be
65+
* able to describe such situations as a multiple of the number of rows,
66+
* rather than a fixed number of distinct values. But in other cases a
67+
* fixed number is correct (eg, a boolean column).
6768
* ----------------
6869
*/
6970
float4 stadistinct;

0 commit comments

Comments
 (0)
0