distinct xx和count(distinct xx)的变态递归优化方法

背景

今天要说的这个优化是从前面一篇讲解《performance tuning case :use cursor or trigger replace group by and order by》

的延展.

CASE

例如一个表中有一个字段是性别, 这个表不管有多少条记录, 性别这个字段一般来说也就2个值

select count(distinct sex) from table;

得到的结果当然是2. 但是如果数据量很大的情况下, 这种运算就非常耗时, 需要排序，去重。

那么有什么优化手段呢？

场景还原

PostgreSQL

测试表

digoal=> create table sex (sex char(1), otherinfo text);  CREATE TABLE

测试数据

digoal=> insert into sex select 'm', generate_series(1,10000000)||'this is test';  INSERT 0 10000000  digoal=> insert into sex select 'w', generate_series(1,10000000)||'this is test';  INSERT 0 10000000

测试SQL1

digoal=> \timing on  digoal=> select count(distinct sex) from sex;   count   -------       2  (1 row)  Time: 47254.221 ms

测试SQL2

digoal=> select sex from sex t group by sex;   sex   -----   w   m  (2 rows)  Time: 6534.433 ms

执行计划

digoal=> explain select count(distinct sex) from sex;                               QUERY PLAN                                ---------------------------------------------------------------------   Aggregate  (cost=377385.25..377385.26 rows=1 width=2)     ->  Seq Scan on sex  (cost=0.00..327386.00 rows=19999700 width=2)  digoal=> explain select sex from sex t group by sex;                                QUERY PLAN                                 -----------------------------------------------------------------------   HashAggregate  (cost=377385.25..377385.27 rows=2 width=2)     ->  Seq Scan on sex t  (cost=0.00..327386.00 rows=19999700 width=2)

创建索引

digoal=> create index idx_sex_1 on sex(sex);  CREATE INDEX  digoal=> set enable_seqscan=off;  SET

使用索引后的执行计划, PostgreSQL可以使用Index Only Scan.

digoal=> explain select count(distinct sex) from sex;                                           QUERY PLAN                                           --------------------------------------------------------------------------------------------   Aggregate  (cost=532235.01..532235.02 rows=1 width=2)     ->  Index Only Scan using idx_sex_1 on sex  (cost=0.00..482234.97 rows=20000016 width=2)  digoal=> explain select sex from sex t group by sex;                                            QUERY PLAN                                            ----------------------------------------------------------------------------------------------   Group  (cost=0.00..532235.01 rows=2 width=2)     ->  Index Only Scan using idx_sex_1 on sex t  (cost=0.00..482234.97 rows=20000016 width=2)

创建索引后SQL耗时

digoal=> select count(distinct sex) from sex;   count   -------       2  (1 row)  Time: 49589.947 ms  digoal=> select sex from sex t group by sex;   sex   -----   m   w  (2 rows)  Time: 6608.053 ms

Oracle

测试表

SQL> create table sex(sex char(1), otherinfo varchar2(64));  Table created.

测试数据

SQL> insert into sex select 'm', rownum||'this is test' from dual connect by level <=10000001;  10000001 rows created.  SQL> commit;  Commit complete.  SQL> insert into sex select 'w', rownum||'this is test' from dual connect by level <=10000001;  10000001 rows created.  SQL> commit;  Commit complete.

测试SQL1:

SQL> set autotrace on  SQL> set timing on  SQL> select count(distinct sex) from sex;  COUNT(DISTINCTSEX)  ------------------                   2  Elapsed: 00:00:03.62  Execution Plan  ----------------------------------------------------------  Plan hash value: 2096505595  ---------------------------------------------------------------------------  | Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |  ---------------------------------------------------------------------------  |   0 | SELECT STATEMENT   |      |     1 |     3 | 13106   (3)| 00:02:38 |  |   1 |  SORT GROUP BY     |      |     1 |     3 |            |          |  |   2 |   TABLE ACCESS FULL| SEX  |    24M|    69M| 13106   (3)| 00:02:38 |  ---------------------------------------------------------------------------  Note  -----     - dynamic sampling used for this statement  Statistics  ----------------------------------------------------------            0  recursive calls            0  db block gets        74074  consistent gets            0  physical reads            0  redo size          525  bytes sent via SQL*Net to client          487  bytes received via SQL*Net from client            2  SQL*Net roundtrips to/from client            1  sorts (memory)            0  sorts (disk)            1  rows processed

测试SQL2

SQL> select sex from sex t group by sex;  S  -  w  m  Elapsed: 00:00:03.23  Execution Plan  ----------------------------------------------------------  Plan hash value: 2807610159  ---------------------------------------------------------------------------  | Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |  ---------------------------------------------------------------------------  |   0 | SELECT STATEMENT   |      |    24M|    69M| 14908  (14)| 00:02:59 |  |   1 |  HASH GROUP BY     |      |    24M|    69M| 14908  (14)| 00:02:59 |  |   2 |   TABLE ACCESS FULL| SEX  |    24M|    69M| 13106   (3)| 00:02:38 |  ---------------------------------------------------------------------------  Note  -----     - dynamic sampling used for this statement  Statistics  ----------------------------------------------------------            0  recursive calls            0  db block gets        74074  consistent gets            0  physical reads            0  redo size          563  bytes sent via SQL*Net to client          487  bytes received via SQL*Net from client            2  SQL*Net roundtrips to/from client            0  sorts (memory)            0  sorts (disk)            2  rows processed

创建索引

SQL> create index idx_sex_1 on sex(sex);  Index created.  Elapsed: 00:00:33.40

创建索引后的测试, 执行时间没有明显变化.

SQL> select count(distinct sex) from sex;  COUNT(DISTINCTSEX)  ------------------                   2  Elapsed: 00:00:04.32  Execution Plan  ----------------------------------------------------------  Plan hash value: 1805173869  -----------------------------------------------------------------------------------  | Id  | Operation             | Name      | Rows  | Bytes | Cost (%CPU)| Time     |  -----------------------------------------------------------------------------------  |   0 | SELECT STATEMENT      |           |     1 |     3 |  6465   (3)| 00:01:18 |  |   1 |  SORT GROUP BY        |           |     1 |     3 |            |          |  |   2 |   INDEX FAST FULL SCAN| IDX_SEX_1 |    24M|    69M|  6465   (3)| 00:01:18 |  -----------------------------------------------------------------------------------  Note  -----     - dynamic sampling used for this statement  Statistics  ----------------------------------------------------------            5  recursive calls            0  db block gets        36421  consistent gets        36300  physical reads            0  redo size          525  bytes sent via SQL*Net to client          487  bytes received via SQL*Net from client            2  SQL*Net roundtrips to/from client            1  sorts (memory)            0  sorts (disk)            1  rows processed  SQL> select sex from sex t group by sex;  S  -  w  m  Elapsed: 00:00:03.21  Execution Plan  ----------------------------------------------------------  Plan hash value: 2807610159  ---------------------------------------------------------------------------  | Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |  ---------------------------------------------------------------------------  |   0 | SELECT STATEMENT   |      |    24M|    69M| 14908  (14)| 00:02:59 |  |   1 |  HASH GROUP BY     |      |    24M|    69M| 14908  (14)| 00:02:59 |  |   2 |   TABLE ACCESS FULL| SEX  |    24M|    69M| 13106   (3)| 00:02:38 |  ---------------------------------------------------------------------------  Note  -----     - dynamic sampling used for this statement  Statistics  ----------------------------------------------------------            5  recursive calls            0  db block gets        74170  consistent gets            0  physical reads            0  redo size          563  bytes sent via SQL*Net to client          487  bytes received via SQL*Net from client            2  SQL*Net roundtrips to/from client            0  sorts (memory)            0  sorts (disk)            2  rows processed

对比以上测试, Oracle的性能要明显优于PostgreSQL.

将count(distinct sex)修改如下后PostgreSQL的执行速度有明显改善, 但是性能还是低于O一截, 约一半.

digoal=> select count(*) from (select sex from sex t group by sex) t;   count   -------       2  (1 row)  Time: 6231.965 ms

开始优化

那么如何优化呢?

在PostgreSQL中的递归SQL在这里就派上大用场了, 结合btree索引扫描. 性能可以提升几万倍.

来看如下优化过程 :

创建测试表 :

create table user_download_log (user_id int not null, listid int not null, apkid int not null, get_time timestamp(0) not null, otherinfo text);

插入测试数据

insert into user_download_log select generate_series(0,10000000),generate_series(0,10000000),generate_series(0,10000000),generate_series(clock_timestamp(),clock_timestamp()+interval '10000000 min',interval '1 min'), 'this is test';

创建索引 :

create index i1 on user_download_log (user_id);  create index i2 on user_download_log (otherinfo);

查看数据分布 :

用来说明递归SQL适合哪种场景的优化.

select count(distinct user_id), count(distinct otherinfo) from user_download_log;    count   | count   ----------+-------   10000001 |     1

查看未优化时以下SQL的执行计划以及耗时.

digoal=> explain analyze select count(distinct otherinfo) from user_download_log;                                                                 QUERY PLAN                                                             ------------------------------------------------------------------------------------------------------------------------------------   Aggregate  (cost=208334.36..208334.37 rows=1 width=13) (actual time=6295.493..6295.494 rows=1 loops=1)     ->  Seq Scan on user_download_log  (cost=0.00..183334.29 rows=10000029 width=13) (actual time=0.014..1612.333 rows=10000001 loops=1)   Total runtime: 6295.550 ms

优化后的SQL :

digoal=> with recursive skip as (  digoal(>   (  digoal(>     select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null  digoal(>   )  digoal(>   union all  digoal(>   (  digoal(>     select (select min(t.otherinfo) from user_download_log t where t.otherinfo > s.otherinfo and t.otherinfo is not null)   digoal(>       from skip s where s.otherinfo is not null  digoal(>   )  -- 这里的where s.otherinfo is not null 一定要加,否则就死循环了.  digoal(> )   digoal-> select count(distinct otherinfo) from skip;   count   -------       1  (1 row)

优化后的SQL执行计划以及耗时, 性能提升了36390倍, 相比O也提升了上万倍.

digoal=> explain analyze with recursive skip as (    (      select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null    )    union all    (      select (select min(t.otherinfo) from user_download_log t where t.otherinfo > s.otherinfo and t.otherinfo is not null)         from skip s where s.otherinfo is not null    )  -- 这里的where s.otherinfo is not null 一定要加,否则就死循环了.  )   select count(distinct otherinfo) from skip;                                                                                   QUERY PLAN                                           ------------------------------------------------------------------------------------------------------------------------------------  -----------------------------------------   Aggregate  (cost=10.55..10.56 rows=1 width=32) (actual time=0.094..0.094 rows=1 loops=1)     CTE skip       ->  Recursive Union  (cost=0.03..8.28 rows=101 width=32) (actual time=0.044..0.073 rows=2 loops=1)             ->  Result  (cost=0.03..0.04 rows=1 width=0) (actual time=0.042..0.042 rows=1 loops=1)                   InitPlan 1 (returns $1)                     ->  Limit  (cost=0.00..0.03 rows=1 width=13) (actual time=0.038..0.039 rows=1 loops=1)                           ->  Index Only Scan using i2 on user_download_log t  (cost=0.00..296844.61 rows=10000029 width=13) (actual   time=0.037..0.037 rows=1 loops=1)                                 Index Cond: (otherinfo IS NOT NULL)                                 Heap Fetches: 1             ->  WorkTable Scan on skip s  (cost=0.00..0.62 rows=10 width=32) (actual time=0.013..0.013 rows=0 loops=2)                   Filter: (otherinfo IS NOT NULL)                   Rows Removed by Filter: 0                   SubPlan 3                     ->  Result  (cost=0.03..0.04 rows=1 width=0) (actual time=0.018..0.018 rows=1 loops=1)                           InitPlan 2 (returns $3)                             ->  Limit  (cost=0.00..0.03 rows=1 width=13) (actual time=0.017..0.017 rows=0 loops=1)                                   ->  Index Only Scan using i2 on user_download_log t  (cost=0.00..107284.96 rows=3333343 width=13) (  actual time=0.015..0.015 rows=0 loops=1)                                         Index Cond: ((otherinfo > s.otherinfo) AND (otherinfo IS NOT NULL))                                         Heap Fetches: 0     ->  CTE Scan on skip  (cost=0.00..2.02 rows=101 width=32) (actual time=0.047..0.077 rows=2 loops=1)   Total runtime: 0.173 ms  (21 rows)

换一个字段, 数据分布广泛的字段上使用以上优化方法, 看是否妥当, 以下是原始SQL的执行计划以及耗时 :

digoal=> explain analyze select count(distinct user_id) from user_download_log;                                                                QUERY PLAN                                                              ------------------------------------------------------------------------------------------------------------------------------------  ---   Aggregate  (cost=208334.36..208334.37 rows=1 width=4) (actual time=4008.858..4008.858 rows=1 loops=1)     ->  Seq Scan on user_download_log  (cost=0.00..183334.29 rows=10000029 width=4) (actual time=0.014..1606.607 rows=10000001 loops=  1)   Total runtime: 4008.916 ms

换一个字段, 数据分布广泛的字段上使用以上优化方法, 看是否妥当, 以下是采用递归SQL后的执行计划以及耗时 :

显然性能是下降的, 所以使用递归SQL不适合数据分布广泛的字段的group by或者count(distinct)操作.

digoal=> explain analyze with recursive skip as (    (      select min(t.user_id) as user_id from user_download_log t where t.user_id is not null    )    union all    (      select (select min(t.user_id) from user_download_log t where t.user_id > s.user_id and t.user_id is not null)         from skip s where s.user_id is not null    )  -- 这里的where s.user_id is not null 一定要加,否则就死循环了.  )   select count(distinct user_id) from skip;                                                                                      QUERY PLAN                                        ------------------------------------------------------------------------------------------------------------------------------------  -----------------------------------------------   Aggregate  (cost=10.44..10.45 rows=1 width=4) (actual time=186741.338..186741.339 rows=1 loops=1)     CTE skip       ->  Recursive Union  (cost=0.03..8.17 rows=101 width=4) (actual time=0.047..178296.238 rows=10000002 loops=1)             ->  Result  (cost=0.03..0.04 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)                   InitPlan 1 (returns $1)                     ->  Limit  (cost=0.00..0.03 rows=1 width=4) (actual time=0.042..0.042 rows=1 loops=1)                           ->  Index Only Scan using i1 on user_download_log t  (cost=0.00..285759.50 rows=10000029 width=4) (actual t  ime=0.040..0.040 rows=1 loops=1)                                 Index Cond: (user_id IS NOT NULL)                                 Heap Fetches: 1             ->  WorkTable Scan on skip s  (cost=0.00..0.61 rows=10 width=4) (actual time=0.017..0.017 rows=1 loops=10000002)                   Filter: (user_id IS NOT NULL)                   Rows Removed by Filter: 0                   SubPlan 3                     ->  Result  (cost=0.03..0.04 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=10000001)                           InitPlan 2 (returns $3)                             ->  Limit  (cost=0.00..0.03 rows=1 width=4) (actual time=0.015..0.015 rows=1 loops=10000001)                                   ->  Index Only Scan using i1 on user_download_log t  (cost=0.00..103588.85 rows=3333343 width=4) (a  ctual time=0.014..0.014 rows=1 loops=10000001)                                         Index Cond: ((user_id > s.user_id) AND (user_id IS NOT NULL))                                         Heap Fetches: 10000000     ->  CTE Scan on skip  (cost=0.00..2.02 rows=101 width=4) (actual time=0.050..183449.391 rows=10000002 loops=1)   Total runtime: 186909.323 ms  (21 rows)  Time: 186910.482 ms

以下是同样的数据结构以及测试数据在O下的测试.

SQL> create table test (id int, otherinfo varchar2(32)) nologging;  Table created.  SQL> insert into test select rownum,'this is test' from dual connect by level <=10000001;  10000001 rows created.  SQL> commit;  SQL> create index i1 on test(id);  SQL> create index i2 on test(otherinfo);  SQL> explain plan for select count(distinct id) from test;  Explained.  SQL> select * from table(dbms_xplan.display());  PLAN_TABLE_OUTPUT  --------------------------------------------------------------------------------------------------------------------------------------------  Plan hash value: 1403727100  ------------------------------------------------------------------------------  | Id  | Operation             | Name | Rows  | Bytes | Cost (%CPU)| Time     |  ------------------------------------------------------------------------------  |   0 | SELECT STATEMENT      |      |     1 |    13 |  4178   (3)| 00:00:51 |  |   1 |  SORT GROUP BY        |      |     1 |    13 |            |          |  |   2 |   INDEX FAST FULL SCAN| I1   |  9834K|   121M|  4178   (3)| 00:00:51 |  ------------------------------------------------------------------------------  Note  -----     - dynamic sampling used for this statement  13 rows selected.  SQL> explain plan for select count(distinct otherinfo) from test;  Explained.  SQL> select * from table(dbms_xplan.display());  PLAN_TABLE_OUTPUT  --------------------------------------------------------------------------------------------------------------------------------------------  Plan hash value: 2603667166  ---------------------------------------------------------------------------  | Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |  ---------------------------------------------------------------------------  |   0 | SELECT STATEMENT   |      |     1 |    18 |  5837   (3)| 00:01:11 |  |   1 |  SORT GROUP BY     |      |     1 |    18 |            |          |  |   2 |   TABLE ACCESS FULL| TEST |  9834K|   168M|  5837   (3)| 00:01:11 |  ---------------------------------------------------------------------------  Note  -----     - dynamic sampling used for this statement  13 rows selected.  SQL> set timing on  SQL> select count(distinct otherinfo) from test;  COUNT(DISTINCTOTHERINFO)  ------------------------                         1  Elapsed: 00:00:02.49  SQL> select count(distinct id) from test;  COUNT(DISTINCTID)  -----------------           10000001  Elapsed: 00:00:07.13

从执行耗时可以看出PostgreSQL在数据分布稀疏的字段上使用递归SQL优化后的性能相比Oracle有41213倍的性能提升.

补充

递归查询中不允许使用聚合函数 :

with recursive skip as (    (      select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null    )    union all    (      select min(t.otherinfo) from user_download_log t, skip s         where t.otherinfo > s.otherinfo         and t.otherinfo is not null        and s.otherinfo is not null    )  -- 这里的where s.otherinfo is not null 一定要加,否则就死循环了.  )   select * from skip;  ERROR:  aggregate functions not allowed in a recursive query's recursive term  LINE 7:     select min(t.otherinfo) from user_download_log t, skip s...                     ^  Time: 0.581 ms

修改如下即可 :

with recursive skip as (    (      select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null    )    union all    (      select (select min(t.otherinfo) from user_download_log t where t.otherinfo > s.otherinfo and t.otherinfo is not null)         from skip s where s.otherinfo is not null    )  -- 这里的where s.otherinfo is not null 一定要加,否则就死循环了.  )   select * from skip;

细心的朋友发现Oracle测试中未对表进行分析, 以下是分析后的结果, 执行计划无变化 :

SQL> analyze table sex estimate statistics for all columns sample 10 percent;  Table analyzed.  SQL> analyze index idx_sex_1 estimate statistics sample 10 percent;  Index analyzed.  SQL> select sex from sex t group by sex;  S  -  w  m  Elapsed: 00:00:03.17  Execution Plan  ----------------------------------------------------------  Plan hash value: 2807610159  ---------------------------------------------------------------------------  | Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |  ---------------------------------------------------------------------------  |   0 | SELECT STATEMENT   |      |     2 |     2 | 14519  (12)| 00:02:55 |  |   1 |  HASH GROUP BY     |      |     2 |     2 | 14519  (12)| 00:02:55 |  |   2 |   TABLE ACCESS FULL| SEX  |    20M|    19M| 13062   (2)| 00:02:37 |  ---------------------------------------------------------------------------  Statistics  ----------------------------------------------------------            1  recursive calls            0  db block gets        74074  consistent gets            0  physical reads            0  redo size          563  bytes sent via SQL*Net to client          487  bytes received via SQL*Net from client            2  SQL*Net roundtrips to/from client            0  sorts (memory)            0  sorts (disk)            2  rows processed  SQL> select count(distinct sex) from sex;  COUNT(DISTINCTSEX)  ------------------                   2  Elapsed: 00:00:03.85  Execution Plan  ----------------------------------------------------------  Plan hash value: 1805173869  -----------------------------------------------------------------------------------  | Id  | Operation             | Name      | Rows  | Bytes | Cost (%CPU)| Time     |  -----------------------------------------------------------------------------------  |   0 | SELECT STATEMENT      |           |     1 |     1 |  6454   (3)| 00:01:18 |  |   1 |  SORT GROUP BY        |           |     1 |     1 |            |          |  |   2 |   INDEX FAST FULL SCAN| IDX_SEX_1 |    20M|    19M|  6454   (3)| 00:01:18 |  -----------------------------------------------------------------------------------  Statistics  ----------------------------------------------------------            1  recursive calls            0  db block gets        36325  consistent gets            0  physical reads            0  redo size          525  bytes sent via SQL*Net to client          487  bytes received via SQL*Net from client            2  SQL*Net roundtrips to/from client            1  sorts (memory)            0  sorts (disk)            1  rows processed

Oracle在这类应用场景中还有一个选择，使用位图索引。

摘录一段O位图索引的介绍

位图索引 Bitmap index

场合：列的基数很少，可枚举，重复值很多，数据不会被经常更新

原理：一个键值对应很多行（rowid），格式：键值 start_rowid end_rowid 位图

优点：OLAP 例如报表类数据库重复率高的数据特定类型的查询例如count、or、and等逻辑操作因为只需要进行位运算即可得到我们需要的结果

缺点：不适合重复率低的字段，还有经常DML操作（insert，update，delete），因为位图索引的锁代价极高，修改一个位图索引段影响整个位图段，例如修改一个键值，会影响同键值的多行，所以对于OLTP 系统位图索引基本上是不适用的因bitmap在OLTP使用场景较少，PostgreSQL 没有实现这个类型的索引。

想了解更多PG索引的情况，请参考

标签

背景

场景还原

开始优化

补充

加速数据合并的场景