H2 NSD
Daniel Nashed
● Memory Management
● Advanced Methods
– Semaphore Debugging
– Memory Dumps
● Performance Troubleshooting
– NashCom Tools
● nshcrash
● nshmem
● C-API Toolkit
– Great source of information
What is Serviceability?
● RAS = Reliability Availability Serviceability
● It's not just about NSD & Memcheck but all parts of Domino
– Logging, Debug Options, etc
● Fault Recovery
– Generates NSD files and restarts servers automatically
● Semaphore Debugging
Fault Recovery
● Domino Server detects crash and restarts automatically
– Panic routine calls fault recovery code
– Cleans up resources
– Restarts Server
Transaction Logging
● Recommended for all Domino server types
– Changes the way databases are locked for concurrent access
● Lock-Manager optimizes performance
● Changes are written sequentially into translog
● Asynchronous Log Manager writes data into databases afterwards
– Process can continue to run meanwhile
– Configuration files
● Server Document (serverdoc_<server>_<date>_<time>.dxl)
● Configuration Document (configall_<server>_<date>_<time>.dxl)
● Server commands
– start consolelog / stop consolelog / sh server
– Reduces the need for remote control software and direct OS level access
– When a crash occurs, or NSD is run manually, a dynamically created instance of NSD
proxies the request to start the NSD Service
● nsdcmds.html
– NSD commands
● nsdini.html
– nsd.ini options
● nsdopts.html
– NSD options
● memcheck.html
– Memcheck documentation (not yet available in D8.0 Gold)
Why Server Freeze and Server Panic?
● Domino uses shared memory to allocate global resources that are
shared between tasks and the Domino core for different subsystems
– NIF, NSF, ... e.g. views are stored in memory ...
● Corrupt data
– Corrupt documents, etc …
● Insufficient Memory
– Often caused by "memory leaks"
First Steps Analyzing a Crash
● Find the crashing thread
– "Fatal" is the most common indication of the crashing task
– If you don't find "Fatal", look for "Panic" or "Access Violation", or for
"Segmentation Fault" and "Signal" messages on Unix/Linux
– Tip: The last lines of console.log are helpful in most cases
– But the SDK (C-API Toolkit) gives you a good idea what to look for
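The search order above can be sketched as a small script. This is only an illustration: the function name and the idea of scanning line by line are assumptions, and the indicator strings are the ones listed on this slide.

```python
import re

# Indicators named on the slide, in the order the slide suggests checking them.
CRASH_PATTERNS = [
    re.compile(r"fatal", re.IGNORECASE),
    re.compile(r"panic", re.IGNORECASE),
    re.compile(r"access[ _]violation", re.IGNORECASE),
    re.compile(r"segmentation fault", re.IGNORECASE),
    re.compile(r"signal", re.IGNORECASE),
]


def find_crash_indicators(lines):
    """Return (line_number, text) pairs for the first indicator that matches at all."""
    lines = list(lines)
    for pattern in CRASH_PATTERNS:
        hits = [
            (number, text.rstrip())
            for number, text in enumerate(lines, start=1)
            if pattern.search(text)
        ]
        if hits:
            return hits
    return []


# Lines taken from the NSD excerpt shown later in this deck.
sample = [
    "### FATAL THREAD 1/2 [nnshcrash: 18a0: 0594]",
    "### FP=0x0012fd8c, PC=0x0040159c, SP=0x0012fc6c",
    "Exception code: c0000005 (ACCESS_VIOLATION)",
]

for number, text in find_crash_indicators(sample):
    print(number, text)
```

To scan a real file, pass an open file object instead of the sample list. A hit only narrows down where to start reading; the call stack still has to be interpreted by hand.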
C-API Toolkit
● The C-API Toolkit contains a sub-set of the internal APIs
used by Notes Development to build Notes/Domino
– Can be used to build your own servertasks, extension-manager, DSAPI filters,
client applications
● Components
– Header files
● extract from the original Notes/Domino code
– Reference Database
● Documentation for all exposed calls
– User Guide Database
● Documentation on how to use the toolkit, plus information about the internal
Notes/Domino architecture etc.
– Sample Applications
Reproducible Call-Stack/Bug?
● Best case scenario: Reproducible call-stack on independent
machines which does not occur on boxes with other releases
● But we are not always that lucky ...
– If the call stack is similar at the end of the stack it could be a low-level API
problem
– If the call stack is similar at the higher level of the stack always in the same
Servertask it could be the Servertask
– If you see EM_BEFORE, EM_AFTER it might be an Extension-Manager
problem
– If it is always the same database it might be a data problem
How to find affected databases?
● Check the Physical Virtual Thread Mapping
– To find VTHREAD of crashing process/thread for open databases
############################################################
### FATAL THREAD 1/2 [nnshcrash: 18a0: 0594]
### FP=0x0012fd8c, PC=0x0040159c, SP=0x0012fc6c
### stkbase=00130000, total stksize=28672, used stksize=916
### EAX=0x00000000, EBX=0x7ffd6000, ECX=0x00000000, EDX=0x00000005
### ESI=0x003e757c, EDI=0x00000002, CS=0x0000001b, SS=0x00000023
### DS=0x00000023, ES=0x00000023, FS=0x0000003b, GS=0x00000000 Flags=0x00010206
Exception code: c0000005 (ACCESS_VIOLATION)
############################################################
@[ 1] 0x0040159c nnshcrash.AddInMain@12+540 (400000,2,3e757c)
@[ 2] 0x0040183f nnshcrash.NotesMain@8+47 (2,400000)
@[ 3] 0x004017b4 nnshcrash.notes_main+212 (0,0)
@[ 4] 0x004016a6 nnshcrash.main+22 (2,c20ea8)
@[ 5] 0x00401c50 nnshcrash.mainCRTStartup+368 (0,0)
[ 6] 0x7c817077 kernel32.RegisterWaitForInputIdle+73 (401ae0,0)
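When comparing call stacks across several NSDs, it can help to split each frame into its parts. The sketch below parses the frame lines of the excerpt above; the regular expression is derived only from this one sample, so other platforms or NSD versions may need a different pattern. The field names are made up for illustration, and the leading "@" is merely recorded, not interpreted.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    marked: bool   # True if the line starts with "@"
    index: int     # frame number in brackets
    address: int   # program counter of the frame
    module: str    # e.g. "nnshcrash" or "kernel32"
    symbol: str    # e.g. "AddInMain@12"
    offset: str    # kept as text; the sample does not say which base is used
    args: str      # raw argument list between the parentheses


# Pattern derived from the sample frames only (an assumption, not an NSD spec).
FRAME_RE = re.compile(
    r"^\s*(?P<mark>@?)\[\s*(?P<index>\d+)\]\s+"
    r"(?P<addr>0x[0-9a-fA-F]+)\s+"
    r"(?P<module>[^.\s]+)\.(?P<symbol>\S+?)\+(?P<offset>\w+)\s+"
    r"\((?P<args>[^)]*)\)\s*$"
)


def parse_frame(line: str) -> Optional[Frame]:
    """Parse one call-stack line, or return None if it is not a frame line."""
    match = FRAME_RE.match(line)
    if match is None:
        return None
    return Frame(
        marked=match["mark"] == "@",
        index=int(match["index"]),
        address=int(match["addr"], 16),
        module=match["module"],
        symbol=match["symbol"],
        offset=match["offset"],
        args=match["args"],
    )


stack = [
    "@[ 1] 0x0040159c nnshcrash.AddInMain@12+540 (400000,2,3e757c)",
    "@[ 2] 0x0040183f nnshcrash.NotesMain@8+47 (2,400000)",
    " [ 6] 0x7c817077 kernel32.RegisterWaitForInputIdle+73 (401ae0,0)",
]
frames = [frame for frame in map(parse_frame, stack) if frame is not None]
for frame in frames:
    print(frame.index, frame.module, frame.symbol)
```

Comparing the (module, symbol) pairs of two crashes is usually enough to tell whether the stacks match at the low-level end or only at the servertask end.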
<@@ ------ Notes Data -> TLS Mapping :: [nnshcrash: 18a0] (Time 11:04:08) ------ @@>
<@@ ------ Notes Data -> OS Data -> MM/OS Structure Information (Time 11:04:07) ------ @@>
– Notes Peek
<@@ ------ Notes Data -> Server Data -> Last Console Log Messages (Time 11:04:07) ------ @@>
● Support can look into the SPR database and find matching call stacks
– Support needs all information available in IBM_TECHNICAL_SUPPORT directory -
(please ZIP files!)
– Every new version of Domino provides more diagnostic information (NSD, ADC, …)
– For previous versions keep debugging code in your applications to get properly
annotated call stacks for 3rd party products
– Microsoft mapsym cannot be used to generate sym files for Notes/Domino
– No new Map2iSym available but you can use stripped PDB files
Lotus Notes Diagnostics (LND)
● Tool to annotate NSDs, semdebug files, memory dumps etc
– Current Version 2.9 downloadable from IBM
– http://www.ibm.com/support/docview.wss?rs=899&uid=swg24019151
<@@ ------ Notes Memory -> Usage Summary -> Top 10 Memory Block Usage -> Memhandles By Size :: (Shared) (Time 15:05:03) ------ @@>
– Block Codes
– Size
– Debug_Trapleaks=0x3A45
● For shared memory blocks, clear the shared memory bit 0x8000 from the block
code before using it as the Debug_Trapleaks value
– Debug_Trapleaks_ShowStack=1
– DEBUG_SHOWLEAKS=1
– DEBUG_DUMP_FULL_HANDLE_TABLE=1
– DEBUG_DUMP_BLOCKCODES=1
– For very large databases, the Backup Context can consume a lot of memory and
overflow shared memory
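Removing the shared memory bit from a block code is a single bit mask. The sketch below assumes a hypothetical shared block code of 0xBA45, chosen only because clearing bit 0x8000 gives the 0x3A45 used in the Debug_Trapleaks example above. The function name is made up for illustration.

```python
SHARED_MEMORY_BIT = 0x8000  # bit named on the slide


def trapleaks_value(block_code: int) -> str:
    """Clear the shared memory bit and format the result for notes.ini."""
    return f"0x{block_code & ~SHARED_MEMORY_BIT:04X}"


# 0xBA45 is a hypothetical block code with the shared bit set.
print("Debug_Trapleaks=" + trapleaks_value(0xBA45))
```

A block code that does not have the bit set passes through unchanged, so the helper can be applied to process-local and shared block codes alike.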
● Semaphore issues
– Deadlocks, low level looping
– DEBUG_CAPTURE_TIMEOUT=10
– DEBUG_THREADID=1
– Optional: DEBUG_SEM_TIMEOUT=X
(in milliseconds, default 30000)
– Use "show stat Sem.Timeouts" to check for semaphore timeouts
– Example:
ti="0025CA9C-C1257353" sq="00004CE8"
THREAD [28208:00241-169659312]
WAITING FOR SEM 0x0931 Task sync semaphore
(@0F7711A4) (OWNER=28208:158743472) FOR 5000 ms
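A semaphore timeout entry like the one above can be split into fields to count which semaphores time out most often. The sketch below assumes the entry may be wrapped over several lines, as it is on this slide, and joins the pieces first. The pattern is derived from this single example, and the field names are made up for illustration.

```python
import re

# Pattern derived from the one sample entry on the slide (an assumption).
SEM_WAIT_RE = re.compile(
    r"THREAD \[(?P<waiter>[^\]]+)\]\s+"
    r"WAITING FOR SEM (?P<sem_id>0x[0-9A-Fa-f]+)\s+(?P<name>.+?)\s+"
    r"\(@(?P<address>[0-9A-Fa-f]+)\)\s+"
    r"\(OWNER=(?P<owner>[^)]+)\)\s+"
    r"FOR (?P<wait_ms>\d+) ms"
)


def parse_sem_wait(entry_lines):
    """Join a wrapped entry and return its fields as a dict, or None if no match."""
    text = " ".join(line.strip() for line in entry_lines)
    match = SEM_WAIT_RE.search(text)
    if match is None:
        return None
    return {
        "waiter": match["waiter"],
        "sem_id": int(match["sem_id"], 16),
        "name": match["name"],
        "address": match["address"],
        "owner": match["owner"],
        "wait_ms": int(match["wait_ms"]),
    }


entry = [
    'ti="0025CA9C-C1257353" sq="00004CE8"',
    "THREAD [28208:00241-169659312]",
    "WAITING FOR SEM 0x0931 Task sync semaphore",
    "(@0F7711A4) (OWNER=28208:158743472) FOR 5000 ms",
]
record = parse_sem_wait(entry)
print(record)
```

In this example the waiter and the owner share the same leading number (28208), which suggests both threads belong to the same process.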
Annotate Semaphore Logs
● Example:
– ti="0025CA9C-C1257353" sq="00004CE8"
– THREAD [28208:00241-169659312]
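Part of annotating such an entry is turning the ti value into a readable timestamp. The sketch below assumes that ti is a Notes TIMEDATE printed as two 32-bit hex words: first the time in 1/100 seconds since midnight GMT, then a word whose low 24 bits hold the Julian day number. That layout is an assumption about the semdebug format, and the time zone bits in the top byte are ignored here.

```python
from datetime import date, timedelta

# Offset between a Julian day number and Python's proleptic Gregorian ordinal.
JULIAN_DAY_OFFSET = 1721425


def decode_ti(ti: str):
    """Split an assumed TIMEDATE string into a date and a time of day (GMT)."""
    time_hex, date_hex = ti.split("-")
    ticks = int(time_hex, 16)                      # 1/100 seconds since midnight
    julian_day = int(date_hex, 16) & 0x00FFFFFF    # drop the top byte (zone/DST bits)
    day = date.fromordinal(julian_day - JULIAN_DAY_OFFSET)
    time_of_day = timedelta(milliseconds=ticks * 10)
    return day, time_of_day


day, time_of_day = decode_ti("0025CA9C-C1257353")
print(day.isoformat(), time_of_day, "GMT")
```

If the decoded date does not fall near the time the log was written, the layout assumption does not hold for that file and the value should be left to LND.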
● SmtpSaveImportErrors=1
– Save if error occurs during message itemization
● SmtpSaveImportErrors=2
– Always save
● SmtpSaveImportErrors=3
– Save only temporarily before message conversion and delete after successful
conversion. Useful for rarely occurring server crashes
● Extra Tip!
– SMTPSaveFileFrom=string in combination with SmtpSaveImportErrors=3
● Keeps the saved files after conversion if the string partially matches the RFC822 "From" header
● Undocumented but very useful to trace issues with certain users or domains in
production!
More SMTP Debugging (notes.ini)
● SmtpSaveOutboundToFile=1
– Similar to inbound logging: all messages are saved to temporary files
● SMTPClientDebug=1
– Logs RFC821 conversation for outgoing messages
● SMTPDebugIO=1
– Logs transferred bytes
● SMTPDebugIO=2
– Not implemented
● SMTPDebugIO=3
– Logs all RFC822 headers
● SMTPDebugIO=4
– Use this very carefully! Also logs the RFC822 data / body!
Lotus Domino Statistics
● Valuable resource of information
– Combines Domino Statistics and Platform statistics
– Platform statistics depend on the OS platform but are largely unified across
platforms
– Check events4.nsf for a description of each platform stat per platform
– Configure statistic events for important stats with the right thresholds
– Database.Database.BufferPool.PercentReadsInBuffer
● Interpretation
– Bad < 90% < PercentReadsInBuffer < 98% < Perfect
– Database.DbCache.CurrentEntries
– Database.DbCache.MaxEntries
– Database.DbCache.OvercrowdingRejections
● Interpretation
– Good = HighWaterMark < MaxEntries
– Good = 0 OvercrowdingRejections
– Replica.Cluster.SecondsOnQueue
– Replica.Cluster.WorkQueueDepth
● Interpretation
– Perfect < 10 < SecondsOnQueue < 15 < Bad
● Tune:
– Add more cluster replicators
– Optimize cluster server usage (e.g. split active users between cluster mates)
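The interpretation rules on this slide can be written down as simple checks. The sketch below uses the thresholds from the slide. The labels for the middle ranges ("acceptable", "borderline", "check"), the function names and the sample values are made up for illustration, and Database.DbCache.HighWaterMark is assumed to be the statistic behind the HighWaterMark mentioned above.

```python
def rate_buffer_pool(percent_reads_in_buffer: float) -> str:
    """Below 90% is bad, above 98% is perfect."""
    if percent_reads_in_buffer < 90:
        return "bad"
    if percent_reads_in_buffer > 98:
        return "perfect"
    return "acceptable"


def rate_db_cache(high_water_mark: int, max_entries: int, rejections: int) -> str:
    """Good when the cache never filled up and nothing was rejected."""
    if high_water_mark < max_entries and rejections == 0:
        return "good"
    return "check"


def rate_cluster_queue(seconds_on_queue: float) -> str:
    """Below 10 seconds is perfect, above 15 seconds is bad."""
    if seconds_on_queue < 10:
        return "perfect"
    if seconds_on_queue > 15:
        return "bad"
    return "borderline"


def rate_server(stats: dict) -> dict:
    """Apply all three checks to a dict of statistic names and values."""
    return {
        "buffer pool": rate_buffer_pool(
            stats["Database.Database.BufferPool.PercentReadsInBuffer"]
        ),
        "db cache": rate_db_cache(
            stats["Database.DbCache.HighWaterMark"],
            stats["Database.DbCache.MaxEntries"],
            stats["Database.DbCache.OvercrowdingRejections"],
        ),
        "cluster queue": rate_cluster_queue(stats["Replica.Cluster.SecondsOnQueue"]),
    }


# Invented sample values, not taken from a real server.
sample_stats = {
    "Database.Database.BufferPool.PercentReadsInBuffer": 99.1,
    "Database.DbCache.HighWaterMark": 412,
    "Database.DbCache.MaxEntries": 1536,
    "Database.DbCache.OvercrowdingRejections": 0,
    "Replica.Cluster.SecondsOnQueue": 21,
}
print(rate_server(sample_stats))
```

The same thresholds can also be configured as statistic events on the server, which avoids polling the values from outside.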
Top Statistics – Transactions
● Use for Indication of Server Load
● Interpretation:
– Light < 10 < Trans.PerMinute (per user) < 30 < Heavy
– Server.ConcurrentTasks.Waiting
● Interpretation
– Waiting should be ZERO
● Tune:
– Server_Pool_Tasks=n (e.g. 80)
– Mem.Allocated.Process
– Mem.Allocated.Shared
● Interpretation
– Indicates memory leaks when the values keep increasing over days or weeks
● Tune
– By several parameters (bufferpool, cache, namelookup, …)
● Note
– Be careful interpreting this statistic… Not all memory might be included
Top Statistics – Platform CPU
● Used to check CPU Utilization on Server
– Platform.System.PctTotalPrivilegedCpuUtil
– Platform.System.PctTotalUserCpuUtil
● Interpretation:
– OK = CombinedCpuUtil < 90%, TOO HIGH = CombinedCpuUtil > 90%
● Tune
– Many Root Causes Possible
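The CPU check can be sketched as follows. It assumes that the combined utilization is the sum of the privileged and the user value listed above. The 90% limit comes from the slide, while the function names and the sample values are made up for illustration.

```python
PRIVILEGED = "Platform.System.PctTotalPrivilegedCpuUtil"
USER = "Platform.System.PctTotalUserCpuUtil"


def combined_cpu_util(stats: dict) -> float:
    """Assumption: combined utilization = privileged + user percentage."""
    return stats[PRIVILEGED] + stats[USER]


def rate_cpu(stats: dict, limit: float = 90.0) -> str:
    """Return "TOO HIGH" above the limit, otherwise "OK"."""
    return "TOO HIGH" if combined_cpu_util(stats) > limit else "OK"


# Invented sample values, not taken from a real server.
quiet = {PRIVILEGED: 12.5, USER: 60.0}
busy = {PRIVILEGED: 30.0, USER: 65.0}
print(rate_cpu(quiet), rate_cpu(busy))
```

A single sample above the limit says little; the check is more useful on values averaged over a longer interval.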
Top Statistics – Paging File
● Used to check Server Memory Swapping to Disk
● Interpretation
– OK = PctUtil.Avg close to 0%, BAD = PctUtil.Avg > 10%
● Tune
– OS Level tuning, Check Memory
Top Statistics – Platform Disk
● Used to check Disk Performance
– Platform.LogicalDisk.1.PctUtil
● Interpretation
– Good < 2% < AvgQueueLen < 5% < BAD
● Tune
– By several parameters (bufferpool, cache, namelookup, …)
● Note
– Platform.LogicalDisk.1.AssignedName=C shows which disk the statistic refers to
Analysis Tools
● Domino Admin Client contains analysis Tools
– Located in Server/Analysis Tab
– Cluster Analysis
– Log Analysis
– debug_console=1
● Enables a debug text window. Never close this window manually, it causes a crash
– debug_outfile=c:\debug_notes.log
● Writes a debug log file
Still an Issue: Broken Design Collection
● Domino has an internal design cache in each database to find
design notes
– Used by NIFFindDesignNoteExt (Transaction: FINDDESIGN_NOTES)
● Server_Clock=3, DEBUG_TRANSACTION_TIME=n
– Dumps only transactions taking longer than the specified time
– Can help to reduce the number of transactions dumped and lists only the “slower”
transactions
– For example: 5000 ms
– Take care: some transactions, such as opening view collections, might take longer
than 5 seconds without indicating a problem
Summary
● There are a lot of diagnostic features in Lotus Notes/Domino
– Some features are designed for crash and failure analysis
● Contact
– nsh@nashcom.de
– http://www.nashcom.de
– http://blog.nashcom.de