Database Sizing – How To
Off-late, I have been going through a couple of Database Sizing requests and hence thought of organizing my findings / thoughts.
Understand the Sizing Problem
Beware that Sizing for DW is more complicated than for OLTP
- OLTP: work units per transaction can be easily found
- OLAP/DW: work unit is generally variable, and more so as its data size independent
Accuracy of Sizing Estimates
Doesn’t matter how magical and mystical you are, remember that it will always be an estimate only. And real number will only be known when system is exposed to actual load.
- Estimate can never be 100% accurate, remember that
- Try defining an accuracy goal and margin of error
- Try Alternate route as well (if possible). i.e. Run custom benchmark using specific application and data.
Target for estimation should be a flexible system, which can be extended if possible. Get servers with available free slots for memory, network, CPU etc.
Target should be to hit the balance between CPU, storage and IO resource utilization.
Database Sizing Process
1. Understand the present system’s capabilities and limitations
2. Establish biz requirements and system expectations
3. Understand workload and resource consumption characteristics
4. Estimate size of system for optimal use of resources
What all could eat up Storage?
- Tables
- Indexes
- Logs
- Temporary Space
- Staging area or Raw Data
Estimating for Table Data
- row data size
- rows per storage page (including free space definitions)
- page size
- number of rows
Estimating for Index Data
- index size per row of table
- index rows per page (plus, free space definition)
- page size
- index page overhead
Estimating for Log Space
- frequency of data updates (e.g. in DW you might have some update patterns)
- volume of change per cycle (e.g. ref data rarely changes, try identifying rare & frequently updated tables and physically segregate them on different data files)
- data recovery requirements
- transactional characteristics of workload
Estimating for Temp Space
Generally the hardest part of DB estimation
- percent of data used in large sort or joins
- number of concurrent queries (and query segments per query)
- growth of concurrency levels across the DB
Estimating for Staging
- DW loading procedures,
- file storage and backup requirements (some system take a local backup and then ship them)
- DB/DW maintenance procedures (read space management, file archival)
- future data growth (not just core DB , but raw storage requirements change too)
- RAID overheads (there are systems which keep raw storage away on non-RAID systems)
Resources and Workload Categorization
Resources include:
- CPU,
- Memory,
- IO: Network and Disks (sequential and random IO loads)
Different SQL Queries / workload “can” be categorized for each resource types as follows:
- Small: when <20% data needed to produce results
- Medium: when 20-50% data needed to produce results
- Large: when >50% data needed to produce results
NOTE: 1. Above is just a suggestive guideline. Please feel free to amend as you prefer. And, 2. Majority of DB platforms generally mark auto stats collections after 20% data change.
Few Thumb Rules
- IBM used to recommend atleast 1GB memory per CPU (irrespective of clock speed) and has increased it to 4GB per CPU. I checked out around 17 systems so far, and found it to be case on majority of large systems already, but yet to testify it with a real world sizing exercise.
- Sizing for CPU is rather confusing. For simple queries, its rather linear. For large complex queries, with joins, its definitely not linear and should definitely be tested out. For a simple case, you can use example below:
QueryA (qA) is worked out with 150MB/Sec using one 1.7GHz machine. For same query to work with 800GB data, to give results in 360sec, we will need ~2276MB/Sec; i.e. 16 CPUs.