[go: up one dir, main page]

0% found this document useful (0 votes)
161 views24 pages

Db2mon Education Internal

Uploaded by

kanmanda1
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
161 views24 pages

Db2mon Education Internal

Uploaded by

kanmanda1
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 24

db2mon

Kostas Rakopoulos
kostasr@ca.ibm.com

May 2020
Why did we create another monitoring tool?

Table functions are the recommended approach to monitoring (v9.7+)

Snapshots are deprecated
– New metrics are only being added to table functions

Table functions mean we need to write SQL queries to view metrics
– Some built-in options like MONREPORT.DBSUMMARY – but not enough

Most metrics are cumulative (since database activation)
– No way reset?!
– Need a way to view metrics for a given period of time

DSM uses table functions. Why not use DSM?
– It’s not practical for command line (scripts) usage or debugging customer issues remotely

2
Before db2mon there was psmon

New pureScale metrics in v9.8 were only available through table functions
– Some in db2pd but it was messy

Created psmon to report table function metrics specific to debugging pS
performance problems

Sufficient overlap between debugging pS and non-pS OLTP performance
problems
– psmon started to gain traction for analyzing OLTP issues in general

Eventually more metrics were added for debugging performance issues beyond
just OLTP
– Renamed psmon to db2mon to reflect this

3
How does it work?

Set of reporting queries that run on delta tables

Delta tables?
– Remember: No way to reset table functions stats
– We take before & after samples and store the delta in tables

Online mode (runs on Db2 server being analyzed)
– Use DGTTs to store before, after and delta metrics
– Report queries are run against DGTTs

Offline mode (only EXPORT command is run on Db2 server being analyzed)
– Metrics exported to IXF files
– On another system: Run scripts to IMPORT, compute deltas and run report queries

4
Design

Goals
– Lightweight – cannot impact performance of system being monitored
– Clean – do not create objects or otherwise pollute the system being monitored
– Backward compatibility – work with all supported versions of Db2
– Flexible – easily modifiable (such as adding new reporting queries)

Perl script which generates SQL to collect data and create a report
– User specifies Db2 version and generated SQL is compatible for that version
– Also generates script for simple online mode execution (db2mon.sh)

Creates (and removes) buffer pool and temp tablespace for DGTTs to isolate db2mon SQL
from other activities

5
Use Cases

Understand a customer’s workload/environment
– Saves you from having to ask basic questions about

Application: read/write ratio, types of statements, number of connections

Instance: topology, db2set, dbm cfg, db cfg

System: OS type, CPUs, memory

Find system bottlenecks

Find opportunities to tune the system (e.g., during a health check)

As part of a customer’s monitoring infrastructure
– Identify “What changed?” with before & after db2mon data

6
List of Reports - Lots of data!
Point in Time (SQL, Lock waits, Utilities) Table stats pureScale stats
START#EXSQL: Currently executing SQL at start of capture (non-zero metrics only TBL#ROWMC: Various table level metrics CF#GBPIO: Group bufferpool IO statistics by tablespace
START#LOCKW: Current lock waits at start of capture TBL#DATSH: Data sharing metrics CF#GBPHR: Group bufferpool data and index hit ratios
START#EXUTL: Currently executing utilities at start of capture CF#GBPIV: Group bufferpool invalid page statistics
END#EXSQL: Currently executing SQL at end of capture (non-zero metrics only
END#LOCKW: Current lock waits at end of capture
DB#SIZE: Size of database
TSP#SIZE: Tablespace properties
Tbsp stats CF#GBPDP: Tablespace data page prefetching statistics for group bufferpool
CF#GBPIP: Tablespace index page prefetching statistics for group bufferpool
END#EXUTL: Currently executing utilities at end of capture TSP#USAGE: Tablespace usage over monitoring interval CF#GBPFL: Count of group bufferpool full conditions
BPL#STATS: Bufferpool statistics by tablespace PAG#RCM: Page reclaim metrics for index and data pages
DB#THRUP: Throughput metrics at database level TSP#PRFST: Tablespace prefetching statistics PAG#RCSMP: Page reclaim metrics for SMP pages
DB#CLACT: Client activity (active connections have at least 1 stmt/s) TSP#BPMAP: Tablespace to bufferpool mapping CF#RTTIM: Round-trip CF command execution counts and average response times
DB#TIMEB: Time breakdown at database level (wait + processing) CF#CMDCT: Aggregate CF command execution counts
DB#WAITT: Wait times at database level BPL#SIZES: Bufferpool sizes
DB#PROCT: Processing times at database level DB level BP stats
BPL#HITRA: Bufferpool data and index hit ratios
CF#CMDTM: CF-side command execution counts and average response times
CF#CMDTO: CF-side total command execution counts
DB#SORT: Sort metrics at database level BPL#READS: Bufferpool read statistics (overall) CF#SYSRE: CF system resource information
SQL#TOPEXECT: Top SQL statements by execution time SQL stats BPL#RDSYNC: Bufferpool read statistics (synchronous reads)
BPL#RDASYNC: Bufferpool read statistics (asynchronous reads)
CF#SIZE: CF structure size information
SQL#TOPEXECP: Top SQL statements by execution time, aggregated by PLANID BPL#WRITE: Bufferpool write statistics (overall) BLU#PAGDI: Partial early aggregation / distincts
PKG#EXECT: Time spent executing by package BPL#WRSYNC: Bufferpool write statistics (synchronous writes)
SQL#TOPWAITT: Wait time breakdown for top SQL statements by execution time
SQL#TOPWAITW: Top SQL statements by time spent waiting
BPL#WRASYNC: Bufferpool write statistics (asynchronous writes) BLU: PAE/Distincts
SQL#TOPIOSTA: IO statistics per stmt - top statements by execution time
SQL#TOPROWS: Row level statistics per stmt - top statements by execution time
CON#WAITT: Wait times at connection level
CON#STATS: Various metrics at connection level
Connection level
SQL#TOPSORT: Sort statistics per stmt - top statements by execution time CON#PAGRW: Physical and logical page reads and writes at connection level
INF#EXPLN: Statement & plan identifiers - top statements by execution time

DB#SYSRE: Database system resource usage information OS, CPUs,


WLB#SLIST: Workload balancing server list
Workload Balancing
CFG#REGVA: DB2 registry variable settings
DB#LOGWR: Database log write times Memory
DB#LOGRE: Database log read times Logging
CFG#DB: Database configuration settings
CFG#DBM: Database manager configuration settings Config (db2set, db, dbm)
DB#LOGST: Other database log statistics INS#INFO: Instance information
Tbsp I/O
TSP#DSKIO: Disk read and write I/O times
TSP#DSKIOSYNC: Disk read and write I/O times (synchronous)
DB#MEMST: Database memory set information @ end
DB#MEMPL: Memory pool information @ end
Memory
TSP#DSKIOASYNC: Disk read and write I/O times (asynchronous)
DB#SEQIN: Sequences information
DB#EXTBM: External table metrics External Table Sequences
Index stats coming soon
LTC#WAITT: Latch wait metrics Latches
DB#DLCKS: Deadlocks, lock timeouts and lock escalations
7
Deadlocks
Must Read Reports

There are lot of reports and we don’t usually read them all

But there are some reports that we should always look at

Must read reports
– Database Throughput
– Client Activity
– Database Time Breakdown
– Database Wait Time Breakdown
– Top SQL Statements by Execution Time
– SQL Statement Wait Time Breakdown

8
Database Throughput

Monitor interval Read/Write Ratio


in seconds Includes
temporary tables
SEL_P_S / (UID_P_S + SEL_P_S)

================================================
DB#THRUP: Throughput metrics at database level
================================================

TS_DELTA MEMBER ACT_PER_S CMT_PER_S RB_PER_S DDLCK_PER_S SEL_P_S UID_P_S ROWS_INS_P_S ROWS_UPD_P_S ROWS_RET_P_S ROWS_MOD_P_S PKG_CACHE_INS_P_S P_RD_PER_S
----------- ------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ----------------- ------------
33 0 148589.4 24633.7 24.3 0.0 72455.2 25277.4 15663.8 19337.9 123311.8 80166.6 1.0 10148.6

Unless ACT_PER_S / Compare with Unless RB_PER_S / Unless < 10


CMT_PER_S >= 5 application CMT_PER_S << 1
throughput claims
Too frequent commits Too many query compiles?
High rollback frequency
Auto-commit enabled?

9
Client Activity
Connection Ratio of request
considered active if it time vs. client wait
executes at least 1 for active clients
statement per second

=======================================================================
DB#CLACT: Client activity (active connections have at least 1 stmt/s)
=======================================================================

MEMBER TOTAL_CLIENTS TOTAL_CIWT TOTAL_RQST TOT_CIWT_RQ_RATIO ACTIVE_CLIENTS ACTIVE_RQ_PER_S ACTIVE_CIWT ACTIVE_RQST ACTIVE_CIWT_RQ_RATIO
------ ------------- -------------------- -------------------- ----------------- -------------- --------------- -------------------- -------------------- --------------------
0 268 1641303 7196998 0.22 267 45697.36 1608219 7196998 0.22

Prefer to look at
Ideally close to zero
active clients as
many connections
Investigate if > 10
may just be idle

10
Database Time Breakdown

Ideally 100% in Times reported as


section and/or percentages (PCT_*)
column so easier to interpret
processing

================================================================
DB#TIMEB: Time breakdown at database level (wait + processing)
================================================================

MEMBER TOTAL_RQST_TM PCT_COMPILE PCT_SECTION PCT_SORT PCT_COL PCT_COL_SYNOP PCT_COMMIT PCT_RBACK PCT_CONN PCT_RTN_USR_CODE PCT_BACKUP PCT_IDX_BLD PCT_RUNSTATS PCT_REORG PCT_LOAD
------ ------------- ----------- ----------- -------- ------- ------------- ---------- --------- -------- ---------------- ---------- ----------- ------------ --------- --------
0 7193923 0.00 80.78 0.41 0.00 0.00 17.18 0.00 0.00 1.10 0.00 0.00 0.00 0.00 0.00

Unless < 10 Unless < 30

Too many query compiles Investigate further

11
Database Wait Time Breakdown

Times reported as Very useful high level wait Remember these metrics are
Look at
percentages (PCT_*) percentages for latching, locking, aggregated at the database level
PCT_RQST_WAIT first
so easier to interpret buffer pool physical read, LOB meaning individual statements may
reads/write, etc. be less or more severely impacted

========================================
DB#WAITT: Wait times at database level
========================================

MEMBER TOTAL_RQST_TM TOTAL_WAIT_TM PCT_RQST_WAIT PCT_LOCK PCT_GLB_LOCK PCT_LTCH PCT_LG_DSK PCT_LG_BUF PCT_RCLM PCT_CF PCT_PFTCH PCT_DIAG PCT_POOL_R PCT_DIR_R PCT_DIR_W PCT_FCM PCT_TCPIP
------ ------------- ------------- ------------- -------- ------------ -------- ---------- ---------- -------- ------- --------- -------- ---------- --------- --------- ------- ---------
0 272833 265769 97.41 0.00 0.00 0.35 0.04 0.00 0.00 0.00 0.00 2.05 0.00 0.03 0.02 93.11 0.03
1 7193923 2107537 29.29 0.24 0.00 0.37 16.49 0.00 0.00 0.00 0.00 0.00 11.74 0.18 0.00 0.00 0.23

For member 1, logging and buffer pool


For wait time hierarchy see Db2 reads are contributing to nearly 30% wait High PCT_FCM is common in
docs section: combined environments with a lot of TQ operators
(e.g., DPF, BLU, intra parallel)
"Time-spent monitor element Based on this would want to check log write
hierarchy" times and statements with high pool read So while this 93% wait in FCM seems
times bad, it is normally only a symptom of
some other latency in processing and
But neither one seems to be a major not a problem itself 12
bottleneck at this time
Top SQL Statements by Execution Time
Why COORD exec time? Why not just plain exec time?
Sorted by total coord Other reports sort by wait time so you can easily
exec time find statements dominated by wait time Because of DPF/MPP environments where the work is done via a
coord agent. With plain exec time we don’t get the total query
execution time (as seen by the coord agent).

====================================================
SQL#TOPEXECT: Top SQL statements by execution time
====================================================

MEMBER NUM_EXEC COORD_STMT_EXEC_TIME AVG_COORD_EXEC_TIME PCT_COORD_STMT_EXEC_TIME TOTAL_CPU_TIME AVG_CPU_TIME PCT_WAIT_TIME AVG_SECT_TIME AVG_COL_TIME STMT_TEXT
------ ----------- -------------------- ------------------- ------------------------ -------------------- -------------------- ------------- ------------- ------------ -----------------------------------
0 40134 3535865 88.10 28.56 3149789 78 3.19 0.03 0.00 CALL BROKERVOLUME_F1(?, ?, ?, ?, ?)
0 40139 3529657 87.93 28.51 3526253901 87851 0.02 87.93 0.00 SELECT ARRAY_AGG(B_NAME), ARRAY_AGG(
0 147325 632104 4.29 5.10 69830522 473 0.58 0.01 0.00 CALL MARKETWATCH_F1(?, ?, ?, ?, ?, ?
0 88301 570747 6.46 4.61 575044790 6512 0.00 6.46 0.00 SET :HV00014 :HI00014 = ( SELECT 100
0 78118 538322 6.89 4.34 3502808 44 97.87 0.01 0.00 CALL TRADERESULT_F6(?, ?, ?, ?, ?
0 155401 428013 2.75 3.45 9905143 63 0.25 0.04 0.00 CALL TRADESTATUS_F1(?, ?, ?, ?, ?
0 155409 402792 2.59 3.25 391322528 2518 2.81 2.59 0.00 SELECT ARRAY_AGG(T_ID), ARRAY_AGG
0 19683 258882 13.15 2.09 1492058 75 0.11 0.03 0.00 CALL TRADELOOKUP_F2(?, ?, ?, ?, ?, ?
0 19429 207988 10.70 1.68 1489319 76 0.15 0.04 0.00 CALL TRADELOOKUP_F3(?, ?, ?, ?, ?, ?
0 19682 205605 10.44 1.66 14673734 745 94.66 10.44 0.00 SELECT ARRAY_AGG(T_ID), ARRAY_AGG
0 19422 169865 8.74 1.37 16100432 828 92.28 8.74 0.00 SELECT ARRAY_AGG(T_ID), ARRAY_AGG
0 5428 112541 20.73 0.90 794506 146 97.23 0.04 0.00 CALL TRADEUPDATE_F2(?, ?, ?, ?, ?
0 19671 105273 5.35 0.85 1328777 67 0.08 0.04 0.00 CALL TRADELOOKUP_F1(?, ?, ?, ?, ?

Percent of total
Average coord CPU time in
coord exec time See which statements have
exec time microseconds
(across all queries) high percent (>30) wait time

13
SQL Statement Wait Time Breakdown
EXECUTABLE_ID for each
Look at PCT_WAIT
statement is available in later
first
sections of the db2mon report. Useful
for collecting access plan or
retrieving full statement text

============================================================================
SQL#TOPWAITT: Wait time breakdown for top SQL statements by execution time
============================================================================

MEMBER PCT_WAIT PCT_LG_DSK PCT_LG_BUF PCT_LOCK PCT_GLB_LOCK PCT_LTCH PCT_RCLM PCT_CF PCT_PFTCH PCT_DIAG PCT_POOL_R PCT_DIR_R PCT_DIR_W PCT_FCM STMT_TEXT
------ -------- ---------- ---------- -------- ------------ -------- -------- ------- --------- -------- ---------- --------- --------- ------- --------------------------------------------------------------
0 3.19 0.00 0.00 0.00 0.00 3.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 CALL BROKERVOLUME_F1(?, ?, ?, ?, ?)
0 0.02 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 SELECT ARRAY_AGG(B_NAME), ARRAY_AGG(CAST(VOLUME AS DECI
0 0.58 0.00 0.00 0.00 0.00 0.58 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 CALL MARKETWATCH_F1(?, ?, ?, ?, ?, ?, ?)
0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 SET :HV00014 :HI00014 = ( SELECT 100 * ( ( CAST
0 97.87 96.14 0.00 0.00 0.00 1.73 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 CALL TRADERESULT_F6(?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
0 0.25 0.00 0.00 0.00 0.00 0.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 CALL TRADESTATUS_F1(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
0 2.81 0.00 0.00 0.00 0.00 0.33 0.00 0.00 0.00 0.00 2.47 0.00 0.00 0.00 SELECT ARRAY_AGG(T_ID), ARRAY_AGG(T_DTS), ARRAY_AGG(ST_NAME),
0 0.11 0.00 0.00 0.00 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 CALL TRADELOOKUP_F2(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
0 0.15 0.00 0.00 0.00 0.00 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 CALL TRADELOOKUP_F3(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
0 94.66 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 94.63 0.00 0.00 0.00 SELECT ARRAY_AGG(T_ID), ARRAY_AGG(T_BID_PRICE), ARRAY_AGG
0 92.28 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 92.25 0.00 0.00 0.00 SELECT ARRAY_AGG(T_ID), ARRAY_AGG(T_CA_ID), A
0 97.23 88.90 0.00 0.00 0.00 8.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 CALL TRADEUPDATE_F2(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
0 0.08 0.00 0.00 0.00 0.00 0.08 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 CALL TRADELOOKUP_F1(?, ?, ?, ?, ?, ?, ?, ?,

Log disk Pool read What Next?


PCT_WAIT > 30 dominates wait dominates wait
time for this time for this PCT_LG_DISK: Check log write times (in db2mon)
Typically worth investigating query query
PCT_POOL_R: Check pool read times, pool hit ratios
and access plan (in db2mon except for access plan) 14
Find Common Bottlenecks
Bottleneck DB, SQL and Connection Level Things to Check Possible Next Steps

PCT_LG_DISK Log write time in DB#LOGWR Investigate log disk file system and storage
Slow Log Writes subsystem
PCT_LTCH Hot latches in LTC#WAITT db2latchtrace, stacks, check access plan
Latch Contention Are particular statements affected? Bad plan?

Disk I/O PCT_POOL_R BP sizes in BPL#SIZES Increase BP size, inline LOBs, check access
(Buffer pool, PCT_DIR_R BP hit ratios in BPL#HITRA plan, check underlying storage
LOBs) PCT_DIR_W Are particular statements affected? Bad plan?

Expensive AVG_COORD_EXEC_TIME Rows read / rows returned ratio in Check access plan, CDE perf trace (if BLU),
Queries (High PCT_COORD_STMT_EXEC_TIME SQL#TOPROWS stacks, check for spilling
CPU, Sorting, AVG_CPU_TIME Sort stats in SQL#TOPSORT
Spilling) PCT_WAIT_TIME (could be low!) High average logical reads in SQL#TOPIOSTA

PCT_LOCK Point in time lock waits in START#LOCKW and Tune application (e.g., isolation level)
Lock Contention END#LOCKW Check access plan
PCT_RCLM Find what tables (or indexes) are causing reclaims Enable append mode, tune index, table
Page Reclaims High PCT_LTCH with non-zero from PAG#RCM and PAG#RCSMP partitioning, EHL
PCT_RCLM
PCT_CF CF round trip command times in CF#RTTIM Increase CF workers
CF side command times in CF#CMDTM Reduce CF traffic (EHL)
CF Wait Need RDMA if using TCPIP transport

15
Find Common Bottlenecks (Examples)
========================================
Wait times at database level - percents
======================================== What’s “high” for a log
MEMBER TOTAL_RQST_TIME_MS TOTAL_WAIT_TIME_MS RQST_WAIT LOCK GLB_LOCK LATCH LOG_DSK RECLAIM CF POOL_R DIR_R DIR_W write time?
FCM TCPIP DIAG
------ ------------------ ------------------ --------- ------- -------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
0 1171250 818956 69.92 1.23 0.85 10.24 36.63 7.30 2.98 0.33 2.35 6.66 0.63 1.37 0.00
1 1178600 1121127 95.12 1.09 0.36 16.66 46.83 5.72 9.16 0.34 3.17 8.75 Higher than 1
1.06 2.21 0.00

Slow log writes . . .

==========================
Database log write times
millisecond can be an
issue but it depends.
==========================
If you don’t see a high
MEMBER NUM_LOG_WRITE_IO LOG_WRITE_IO_PER_S LOG_WRITE_MB_PER_S LOG_WRITE_TIME percent wait for log disk
LOG_WRITE_TIME_PER_IO_MS NUM_LOG_BUFFER_FULL

0 64975 214.4389 0.8349 179470 5.7621


I/O then it might not be
------ -------------------- ------------------ ------------------ -------------------- ------------------------ --------------------
0
1 59409 196.0693 0.7656 191576 6.2246 your main bottleneck.
0

Latch contention
========================================
What next?
DB#WAITT: Wait times at database level
========================================
Based on latch name of top few
latches you might be able to
MEMBER TOTAL_RQST_TM TOTAL_WAIT_TM PCT_RQST_WAIT PCT_LOCK PCT_GLB_LOCK PCT_LTCH PCT_LG_DSK PCT_LG_BUF PCT_RCLM PCT_CF PCT_PFTCH PCT_DIAG PCT_POOL_R PCT_DIR_R PCT_DIR_W PCT_FCM PCT_TCPIP
------ ------------- ------------- ------------- -------- ------------ -------- ---------- ---------- -------- ------- --------- -------- ---------- --------- --------- ------- ---------
0 4850728 4598430 94.79 8.36 0.00 84.17 1.00 0.00 0.00 0.00 0.00 0.00 1.17 0.00 guess but if not, get stacks. Might
0.00 0.00 0.06

. . . also help to look at what


===============================
statements have high latch wait.
LTC#WAITT: Latch wait metrics
===============================
In this case (memlatch) we need
MEMBER
------
LATCH_NAME TOT_EXT_LATCH_WAIT_TIME_MS TOT_EXT_LATCH_WAITS TIME_PER_LATCH_WAIT_MS
------------------------------------------------------------ -------------------------- -------------------- ----------------------
to get stacks since it could be
0 SQLO_LT_SMemPool__MemLatchType__latch 3069940 519880 5.90 anything.
0 SQLO_LT_SQLB_BPD__bpdLatch_SX 65870 209 315.16
0 SQLO_LT_SMemSet__MemLatchType__latch 27250 25702 1.06
0 SQLO_LT_SQLB_DIRTY_LIST_SET__walkLatch 1860 424 4.38
0 SQLO_LT_SQLB_DIRTY_LIST_SET__appendLatch 367 63 5.82 16
0 SQLO_LT_stmmSort__statsLatch 336 112 3.00
0 SQLO_LT_SQLP_LHSH__hshlatch 118 3607 0.03
Find Common Bottlenecks (More examples)
Slow disk reads PCT_POOL_R >> 30 Why high PCT_LTCH?

It’s a symptom of the slow disk reads. We must


============================================================================
SQL#TOPWAITT: Wait time breakdown for top SQL statements by execution time
hold a BPD latch while reading from disk.
============================================================================

MEMBER PCT_WAIT PCT_LG_DSK PCT_LG_BUF PCT_LOCK PCT_GLB_LOCK PCT_LTCH PCT_RCLM PCT_CF PCT_PFTCH PCT_DIAG PCT_POOL_R PCT_DIR_R PCT_DIR_W PCT_FCM STMT_TEXT
------ -------- ---------- ---------- -------- ------------ -------- -------- ------- --------- -------- ---------- --------- --------- ------- ----------------------------------------
0 99.95 0.00 0.00 0.00 0.00 99.54 0.00 0.00 0.00 0.00 0.45 0.00 0.00 0.00 Insert into NEW_ORDER values (3132, 6, 1
0 100.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Insert into ORDER_LINE values (3116, 6,
0 100.00 0.00 0.00 0.00 0.00 93.75 0.00 0.00 0.00 0.00 6.19 0.00 0.00 0.00 Insert into ORDER_LINE values (3136, 3,
0 100.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Select D_NAME, D_STREET_1, D_STREET_2, D
0 99.66 0.00 0.00 0.00 0.00 74.91 0.00 0.00 0.00 0.00 25.08 0.00 0.00 0.00 Insert into ORDERS values (3136, 1007, 3
0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 Select C_LAST, C_CREDIT, C_DISCOUNT, C_B
0 99.22 0.00 0.00 0.00 0.00 77.51 0.00 0.00 0.00 0.00 21.70 0.00 0.00 0.00 Insert into ORDER_LINE values (3116, 6,
0 99.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.02 0.00 0.00 0.00 Select S_QUANTITY, S_DIST_01, S_DIST_02,
0 98.90 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.90 0.00 0.00 0.00 Update ORDER_LINE set OL_DELIVERY_D = '2
0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 Select S_QUANTITY, S_DIST_01, S_DIST_02,
0 98.64 0.00 0.00 0.00 0.00 98.64 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Insert into ORDER_LINE values (3122, 3,
0 98.55 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.55 0.00 0.00 0.00 Insert into ORDERS values (3132, 503, 6,

What’s “high” for a disk read time? What next?


==========================================
TSP#DSKIO: Disk read and write I/O times
==========================================
Under 5 ms would be ideal Check the buffer pool size.
MEMBER
------
TBSP_NAME NUM_READS AVG_READ_TIME DIRECT_READ_REQS 15 ms or higher is problematic
AVG_DRCT_READ_TIME NUM_WRITES AVG_WRITE_TIME DIRECT_WRITE_REQS AVG_DRCT_WRITE_TIME
-------------------- -------------------- ------------- -------------------- ------------------ -------------------- -------------- -------------------- -------------------
0 TBS_STKX 2612 23.92 0 - 0 - Check the file system and disk
0 -
0
0
TBS_STK
TEMPSPACE1
1878
0
23.88
-
0 -
But it depends – how large is the wait time
348 0.00
434
2
85.70
0.00
subsystem for issues. Check things
0
348
-
0.00
0
0
TBS_SML
TBS_NOX
439
204
11.79
25.32 relative to statement execution time?
0
0
-
-
29
151
69.17
50.55
like disk cache and queue depth.
0
0
-
-
0 TBS_OLX 274 29.20 0 - 66 55.69 0 -
0
0
TBS_OX
TBS_CSTX
220
265
26.15
36.94
0 -
In this case it’s pretty severe (> 99%)
0 -
93
0
114.00
- Determine what the disk
0
0
-
-
0
0
TBS_CST
TBS_OL
137
88
42.78
52.50
0
0
-
-
114
113
50.51
71.78
requirements for the workload are
0
0
-
-
and compare that to what the disk 17
subsystem is capable of.
Find Common Bottlenecks (More examples)
PCT_COORD_STMT_EXEC_TIME > 30

Expensive queries Top CPU consumer (TOTAL_CPU_TIME)

====================================================
AVG_COORD_EXEC_TIME / (AVG_CPU_TIME / 1000) close to 1
SQL#TOPEXECT: Top SQL statements by execution time which means this query is CPU bound
====================================================

MEMBER NUM_EXEC COORD_STMT_EXEC_TIME AVG_COORD_EXEC_TIME PCT_COORD_STMT_EXEC_TIME TOTAL_CPU_TIME AVG_CPU_TIME PCT_WAIT_TIME AVG_SECT_TIME AVG_COL_TIME STMT_TEXT
------ ----------- -------------------- ------------------- ------------------------ -------------------- -------------------- ------------- ------------- ------------ -----------------
0 10622 84442 7.94 42.71 82179091 7736 0.28 7.94 0.00 Select I_NAME,
0 10623 40384 3.80 20.43 1437751 135 96.57 3.80 0.00 Select S_QUANTITY
0 99 21787 220.07 11.02 1043864 10544 96.10 220.07 0.00 Select Count(Dist
0 693 16208 23.38 8.19 5568178 8034 65.36 23.38 0.00 Select C_ID,

===============================================================================
SQL#TOPROWS: Row level statistics per stmt - top statements by execution time
===============================================================================

MEMBER NUM_EXEC AVG_ROWS_MOD AVG_ROWS_READ AVG_ROWS_RET COL_SYNOPSIS_ROWS_INSERTED STMT_TEXT


------ ----------- -------------- -------------- -------------- -------------------------- ----------------------------------------------------------------------------------------------
0 10622 0.0 100000.0 1.0 0 Select I_NAME, I_PRICE, I_DATA from ITEM where I_ID = ?
0 10623 0.0 1.0 1.0 0 Select S_QUANTITY, S_DIST_01, S_DIST_02, S_DIST_03, S_DIST_04, S_DIST_05, S_DIST_06, S_DIST_07
0 99 0.0 201.2 1.0 0 Select Count(Distinct S_I_ID) from STOCK, ORDER_LINE where (S_W_ID = ? and S_QUANTITY < ? and
0 693 0.0 3000.0 3.4 0 Select C_ID, C_FIRST from CUSTOMER where (C_W_ID = ? and C_D_ID = ? and C_LAST = ?) ORDER BY

What Next?
AVG_ROWS_READ / AVG_ROWS_RET >> 1
AVG_D_LRD >> AVG_I_LRD
AVG_I_LRD is zero
=========================================================================
SQL#TOPIOSTA: IO statistics per stmt - top statements by execution time
=========================================================================
Probably no index! Check the plan
MEMBER NUM_EXEC AVG_D_LRD AVG_D_PRD AVG_I_LRD AVG_I_PRD AVG_TD_PRD AVG_TI_PRD AVG_COL_LRD AVG_COL_PRD AVG_DIR_R_RQS AVG_DIR_W_RQS STMT_TEXT
------ ----------- ------------------ ------------ ------------------ ------------ ------------ ------------ ------------------ ------------ ------------- ------------- ----------------
0 10622 2212.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Select I_NAME,
0 10623 1.8 0.8 4.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 Select S_QUANTITY 18
0 99 286.6 85.4 831.2 17.5 0.0 0.0 0.0 0.0 0.0 0.0 Select Count(Dist
0 693 3000.0 0.0 25.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Select C_ID,
A Word on Access Plans

Sometimes an application can be configured or do things such that the plans
the application gets do not match what you get via CLP
– Rare, but when it happens it can be very annoying and difficult to debug

The EXPLAIN_FROM_SECTION stored procedure is the most accurate way
to retrieve access plans (will be exactly what the application got)

Uses the EXECUTABLE_ID of the statement (from
MON_GET_PKG_CACHE_STMT)
– Need statement to still be in package cache

db2mon provides a report (INF#EXPLN) with the EXECUTABLE_IDs for the
top SQL statements

19
Looking for a specific metric?

There are a lot of metrics reported by db2mon
– Too many to list
– And growing as we slowly fill in gaps where the tool is lacking

grep db2mon.pl for the metric name

If you don’t find it open an issue or pull request to have it
included

20
Pain Points

Long running queries
– Most metrics are not rolled up until the statement completes
– Need to collect MON_GET_PKG_CACHE_STMT after query completes

Sometimes the query never completes!
– Would need to rely on “in-flight” metrics from active SQL monitors
– Normally end up having to dump stacks

No metrics for BLU evaluators

No sort or join spill metrics

No WLM metrics

21
Requirements

All versions of Db2 LUW are supported starting from v9.7

User temp tablespace is required for DGTTs
– Created by db2mon.sh

Database monitoring switches must be BASE or EXTENDED
– MON_REQ_METRICS *****************************
db2mon tries to warn
about missing prereqs
Checking db2mon prerequisites
at the top of the report
– MON_ACT_METRICS *****************************

– MON_OBJ_METRICS db2mon Prerequisite


-------------------------
Status
-----------------------
User temp tablespace? OK (1 defined)

Perl is required to run db2mon.pl MON_REQ_METRICS correct?
MON_ACT_METRICS correct?
OK (currently BASE)
OK (currently BASE)
MON_OBJ_METRICS correct? OK (currently EXTENDED)
– This can be done on any machine
– Copy the generated files to the system to be monitored

22
Where can I find it?

Stable version part of the Db2 source tree starting in v11.1

Can find it on customer systems on sqllib/samples/perf

Experimental version on git
– https://github.ibm.com/DB2-Performance/db2mon
– Everyone is welcome to open issues and or submit PRs!

Documentation is available
– Db2 Knowledge Center

“Collecting and reporting performance monitor data with db2mon”
– In the git repo above

23
Introducing Db2 Performance Bootcamp

Designed to help people gain experience debugging performance
problems
– Through exercises that simulate common performance problems

Work in progress (only 3 exercises so far)

Creates a VM, installs Db2, builds a database and deploys a workload

Great place to practice with db2mon

https://github.ibm.com/DB2-Performance/bootcamp
– Feel free to open issues or contribute through pull requests

24

You might also like