Archive

Posts Tagged ‘Capacity’

Database Sizing – How To

January 22, 2011 Leave a comment

Off-late, I have been going through a couple of Database Sizing requests and hence thought of organizing my findings / thoughts.

Understand the Sizing Problem
Beware that Sizing for DW is more complicated than for OLTP
- OLTP: work units per transaction can be easily found
- OLAP/DW: work unit is generally variable, and more so as its data size independent

Accuracy of Sizing Estimates
Doesn’t matter how magical and mystical you are, remember that it will always be an estimate only. And real number will only be known when system is exposed to actual load.
- Estimate can never be 100% accurate, remember that
- Try defining an accuracy goal and margin of error
- Try Alternate route as well (if possible). i.e. Run custom benchmark using specific application and data.

Target for estimation should be a flexible system, which can be extended if possible. Get servers with available free slots for memory, network, CPU etc. :-) Target should be to hit the balance between CPU, storage and IO resource utilization.

Database Sizing Process
1. Understand the present system’s capabilities and limitations
2. Establish biz requirements and system expectations
3. Understand workload and resource consumption characteristics
4. Estimate size of system for optimal use of resources

What all could eat up Storage?
- Tables
- Indexes
- Logs
- Temporary Space
- Staging area or Raw Data

Estimating for Table Data
- row data size
- rows per storage page (including free space definitions)
- page size
- number of rows

Estimating for Index Data
- index size per row of table
- index rows per page (plus, free space definition)
- page size
- index page overhead

Estimating for Log Space
- frequency of data updates (e.g. in DW you might have some update patterns)
- volume of change per cycle (e.g. ref data rarely changes, try identifying rare & frequently updated tables and physically segregate them on different data files)
- data recovery requirements
- transactional characteristics of workload

Estimating for Temp Space
Generally the hardest part of DB estimation
- percent of data used in large sort or joins
- number of concurrent queries (and query segments per query)
- growth of concurrency levels across the DB

Estimating for Staging
- DW loading procedures,
- file storage and backup requirements (some system take a local backup and then ship them)
- DB/DW maintenance procedures (read space management, file archival)
- future data growth (not just core DB , but raw storage requirements change too)
- RAID overheads (there are systems which keep raw storage away on non-RAID systems)

Resources and Workload Categorization
Resources include:
- CPU,
- Memory,
- IO: Network and Disks (sequential and random IO loads)

Different SQL Queries / workload “can” be categorized for each resource types as follows:
- Small: when <20% data needed to produce results
- Medium: when 20-50% data needed to produce results
- Large: when >50% data needed to produce results
NOTE: 1. Above is just a suggestive guideline. Please feel free to amend as you prefer. And, 2. Majority of DB platforms generally mark auto stats collections after 20% data change.

Few Thumb Rules
- IBM used to recommend atleast 1GB memory per CPU (irrespective of clock speed) and has increased it to 4GB per CPU. I checked out around 17 systems so far, and found it to be case on majority of large systems already, but yet to testify it with a real world sizing exercise.
- Sizing for CPU is rather confusing. For simple queries, its rather linear. For large complex queries, with joins, its definitely not linear and should definitely be tested out. For a simple case, you can use example below:
QueryA (qA) is worked out with 150MB/Sec using one 1.7GHz machine. For same query to work with 800GB data, to give results in 360sec, we will need ~2276MB/Sec; i.e. 16 CPUs.

Tags:

Less known (or remembered?) one’s from SQL Server 2008

Do you know that in SQL Server 2008, you can
- include upto 16 columns in a primary or foreign or index key,
- select upto 4096 columns (only) in a SELECT or INSERT or UPDATE query,
- have upto 32 nested subqueries or stored proc’s or trigger levels,
- publish upto 256 articles at a time in merge replication mode.

Its not that you need to remember the ceilings, but one needs to know the limits of his system. Imagine you trying something out of the box and it failing and you realizing it after a raising a call with Microsoft (it happened for one of my friend’s clients recently).

Anyways, there are more such definitions (or capacity ceilings) which are defined here, here (processors), and here (memory). And there are similar pages for SQL Server 2005 as well (well, obviously) which can be referenced from here (capacity ceilings), here (processor), and here (memory).

Tags:
Follow

Get every new post delivered to your Inbox.