OLAP Benchmark Study

Note: This page refers to the OLAP Council APB-1 OLAP Benchmark Release I. Release I has been superceded by the OLAP Council APB-1 OLAP Benchmark Release II. This page and its related documents remain for reference purposes only.


Introduction

The OLAP Council has sponsored the development of an analytical processing benchmark, the APB-1. The benchmark simulates a realistic On-Line Analytical Processing (OLAP) business situation that exercises server-based software. The goal of the APB-1 is to measure a server's overall OLAP performance rather than the performance of specific tasks. To ensure the relevance of the APB-1 to actual business environments, the operations performed on the database have been carefully chosen to reflect common business operations. These operations include the following:

Bulk loading of data from internal or external data sources

Successful OLAP applications must provide "just-in-time information". The key to effective decision-making is having the right information at the right time. Judging a server's ability to accomplish this goal is more than simply measuring an OLAP server's processing performance. Its ability to represent complex business relationships and respond to changing business requirements are equally important.

While the APB-1 does not try to measure a system's ability to respond to change, the amount and clarity of programming code are used as a qualitative measure of responsiveness. The audience of the APB-1 must be able to evaluate a given solution both in terms of its quantitative and qualitative appropriateness to the task. For this reason, publication of APB-1 benchmark results must include the database schema and the code required to execute the benchmark.

Analytical processing applications are information delivery systems. The primary concern of the information recipients is how long it takes to accomplish their work from the time the data is made available to the system. For purposes of the APB-1, the average time needed to process a query once data is available to the system is defined as analytical throughput. The performance metric is expressed as analytical query time (AQT). Broadly stated, AQT equals the total server processing time divided by the number of queries processed. Total server processing time equals the time elapsed from the receipt of the data through the execution of the last query. The calculation of the AQT metric includes the time required to:


It is generally agreed that OLAP applications require a multidimensional view of data. For this reason, the benchmark is specified in multidimensional terms. The terms dimension, dimension member, and hypercube used in this document do not, however, preclude the use of any commercially available database management system.

The APB-1 is a general OLAP application and does not reflect the entire range of OLAP requirements. The benchmark should not be used to test whether a given database management product contains a full OLAP feature set. Benchmark results will vary based on the execution platform and workloads. Comparisons not based on identical runs on identical equipment are not advised. Customers should not substitute the APB-1 for application benchmarking where specific product features or operational characteristics are required.

The software used to implement the APB-1 benchmark must be generally available. The use of software that is configured specifically for the benchmark is not acceptable.

Application Environment

The APB-1 contains a set of business operations that exercise basic functionality essential to OLAP applications. The benchmark database balances the need to approximate a real world business application with the need to provide a performance benchmark that can be executed without extreme effort. To this end, the benchmark reduces the diversity of operations while retaining fundamental functionality.

The OLAP application used in the benchmark is a sales and marketing analysis system. The benchmark is a synthesis of general business practices, not a model of a specific industry or market. The database contains the information required by a supplier to analyze product sales to customers through distribution channels over time. Units sold, dollar sales, costs, and margins are tracked by actuals, budget, and forecast.

The database design has no structural requirements. The varied nature of database technologies (multidimensional and relational) and the lack of generally accepted design criteria (denormalization is the rule) would make any structural requirements prejudicial in nature. Specifically, storing calculated values is neither strictly forbidden nor encouraged. Whether the calculated values are preprocessed and stored in the database or are calculated at query time, the time to calculate values is included in the computation of the AQT.

OLAP databases tend to be updated incrementally at a specified time interval (i.e., weekly or monthly). Therefore, although database setup and loading of historical actuals is required in order for the benchmark to perform queries, these operations are not included in the calculation of AQT. Loading current month actuals and loading current year budget are included in the calculation of AQT. Calculated values stored in the database are treated the same way, i.e., if a calculated value depends only on historical actuals, it is not included in the AQT. If, however, a calculated value depends on current month actuals, current year budget, or a mix of historical and current actuals/budget, then the calculation is included in the AQT.

Database Structure

The logical database structure is made up of six dimensions: time, scenario, measure, and three aggregation dimensions that define the database size (product, customer, and channel). The APB1GEN program (described later) uses an input parameter to determine the number of members in each of these dimensions. The minimum number of members required in each dimension and the relationships between the dimensions are described below.

Product

Of the three aggregation dimensions, the product dimension has the most members. The number of members in the product dimension is ten times the number of members in the customer dimension. The minimum number of members in the product dimension is 10,000. The product hierarchy is a steep hierarchy containing seven levels. Each member of the hierarchy contains at most one parent. Every member of the hierarchy, except the member at the top level, has a parent. The bottom level of the product hierarchy contains 90% of the members. The top level of the product hierarchy contains a single member (the grand total). The other members of the hierarchy are distributed among the five remaining levels.

The level names for the product dimension hierarchy are:

Customer

The number of members in the customer dimension is 100 times the number of members in the channel dimension. The minimum number of members in the customer dimension is 1,000. The customer hierarchy is a flat hierarchy containing three levels. The bottom level of the customer hierarchy contains 90% of the members. The top level of the customer hierarchy contains a single member (the grand total). The remaining members belong to the middle level of the hierarchy. Each member of the hierarchy contains at most one parent. Every member of the hierarchy, except the top level member, has a parent.

The level names for the customer dimension hierarchy are:

Channel

Of the three aggregation dimensions, the channel dimension has the fewest members. The number of members in the channel dimension is an input parameter to the APB1GEN program. The minimum number of members in the channel dimension is ten. The channel hierarchy contains two levels. The bottom level of the channel hierarchy contains all but one of the members. The top level of the channel hierarchy contains a single member (the grand total). Each member of the hierarchy contains at most one (1) parent. Every member of the hierarchy, except the top level member, has a parent.

The level names for the channel dimension hierarchy are:

Time

The time dimension is made up of two years (1995 and 1996) of monthly members. A Julian calendar (January to December) is used. The time hierarchy includes quarterly, yearly, and year-to-date aggregations. The current month used in the benchmark processing is June 1996. Periods before this are considered historical periods. Periods after this are considered future periods.

The most frequently used aggregation along the time dimension is summation. The exceptions to this are:

The ending balance calculation computes the parent value as the last period of its children. For example, the ending balance of Quarter 1 is March. An average balance calculation computes the parent value as the sum of its children's values divided by the number of children. For example, the average balance of Quarter 1 is the sum of January, February, and March divided by 3. The percentage measures cannot be aggregated along the time dimension, but rather must be computed from the aggregations of their components. For example, percent variance for Quarter 1 cannot be calculated from the percent variances of January, February, and March. The values must be calculated from the Quarter 1 values of actuals and budget.

The queries reference a set of time dimension calculations. The current periods (June 1996, Quarter 2 1996, Year 1996, and YTD 1996) are compared with the previous period and the same period last year. This comparison is stated as both a value difference and a percent change.

Scenario

There are three base values in the scenario dimension: two are input from data files and one is modeled from the other two. The input scenarios contain actuals and budget and the modeled scenario contains forecast. The scenario members have differing time frames. The actual scenario contains data from January 1995 to June 1996. The budget and forecast scenarios contain data from January 1996 to December 1996.

The forecast scenario is modeled as a function of both the actual and budget scenarios. The computation of the forecast scenario is:

Calculate total 1996 forecast values by the retailer level of the customer hierarchy as the annualized values of the second half actuals of 1995 increased by 15%. Allocate to the months in 1996 based on the 1996 budget. Allocate to the stores from the retailer based on the year-to-date actuals. Allocate to products based on year-to-date budget.

The queries also reference a set of scenario dimension calculations. Variance includes both value and percent differences.

Budget vs. Actuals
Forecast vs. Actuals

Measures

The measure dimension contains ten members, five input and five calculated members. They are:

Input

Calculated

Data Files

There are two sets of data files. The first set of data files is required to initialize the database. The second set of data files is required for the incremental processing. The APB1GEN program generates all files and presents them as fixed field length ASCII.

To allow for the greatest diversity in database designs, the files containing the product, customer, channel members, and hierarchy are presented in both tree and level formats. The tree format presents the dimension hierarchy as a record for each member of the hierarchy that has a parent. The record contains fields for the child member, the parent member, and the name of the level to which the child member belongs. The level format presents the dimension hierarchy as a record for each dimension member at the bottom level of the hierarchy. The record contains a field for each level in the hierarchy.

Hierarchy data files are provided for the product, customer, and channel dimensions. The scenario and measure dimensions do not have hierarchies and therefore do not have hierarchy data files.

The data files reference scenario members by the names ACTUAL, BUDGET, and FORECAST.

The data files do not explicitly reference members of the measure dimension. The naming of the measure dimension members is left to the benchmark implementors, with the restriction that the names must be at least twelve characters in length.

The data files reference members of the time dimension at the month level. The names of the members in the data files are dictated by the benchmark and are in the format of YYMMDD (950301 will be the first day of March in the year 1995). Benchmark implementors can choose to use the member names or they can translate the member names to another format during the data load. It is important to remember that any processing that occurs during the incremental load is included in the computation of the AQT performance metric.

The data files will not be in any sorted order. To do so would prejudice the benchmark and favor certain database designs. Sorting of the data before loading can be performed, but the sorting time will be included in the calculation of the AQT.

Initialization Data Files

Product Hierarchy - Level Format (PRODHIER.DAT)

Column Width Field
1 12 Code
13 12 Class
25 12 Group
37 12 Family
49 12 Line
61 12 Division

Product Hierarchy - Tree Format (PRODTREE.DAT)

Column Width Field
1 12 Member
13 12 Parent
25 12 Level

Customer Hierarchy - Level Format (CUSTHIER.DAT)

Column Width Field
1 12 Store
13 12 Retailer

Customer Hierarchy - Tree Format (CUSTTREE.DAT)

Column Width Field
1 12 Member
13 12 Parent
25 12 Level

Channel Hierarchy - Level Format (CHANHIER.DAT)

Column Width Field
1 12 Base

Channel Hierarchy - Tree Format (CHANTREE.DAT)

Column Width Field
1 12 Member
25 12 Level

Time Hierarchy - Level Format (TIMEHIER.DAT)

Column Width Field
1 12 Mon
13 12 Qtr
25 12 Yr

Time Hierarchy - Tree Format (TIMETREE.DAT)

Column Width Field
1 12 Member
13 12 Parent
25 12 Level

Standard Product Cost (PRODCOS.DAT)

Column Width Field
1 12 Product - Code Level
13 12 Scenario - Actual, Budget, Forecast
25 6 Time - January 1995 through December 1996
31 10 Value - 999999.999

Standard Shipping Cost (CUSTSHIP.DAT)

Column Width Field
1 12 Customer - Store Level
13 12 Scenario - Actual, Budget, Forecast
25 6 Time - January 1995 through December 1996
31 10 Value - 999999.999

Historical Sales (HISTSALE.DAT)

Column Width Field
1 12 Customer - Store Level
13 12 Product - Code Level
25 12 Channel - Base Level
37 6 Time - January 1995 through May 1996
43 10 Unit Sales - 9999999999
53 10 Dollar Sales - 9999999.99

Historical Inventory (HISTINV.DAT)

Column Width Field
1 12 Customer - Store Level
13 12 Product - Code Level
31 10 Inventory 199501
41 10 Inventory 199502
51 10 Inventory 199503
61 10 Inventory 199504
71 10 Inventory 199505
81 10 Inventory 199506
91 10 Inventory 199507
101 10 Inventory 199508
121 10 Inventory 199509
131 10 Inventory 199510
141 10 Inventory 199511
151 10 Inventory 199512
161 10 Inventory 199601
171 10 Inventory 199602
181 10 Inventory 199603
191 10 Inventory 199604
201 10 Inventory 199605

Incremental Load

Current Month Sales (CURRSALE.DAT)

Column Width Field
1 12 Customer - Store Level
13 12 Product - Code Level
25 12 Channel - Base Level
37 10 Unit Sales - 9999999999
47 10 Dollar Sales - 9999999.99

Current Month Inventory (CURRINV.DAT)

Column Width Field
1 12 Customer - Store Level
13 12 Product - Code Level
25 10 Inventory - 9999999999

Budget (BUDGET.DAT)

Column Width Field
1 12 Customer - Store Level
13 12 Product - Code Level
25 6 Time - January 1996 through December 1996
31 10 Unit Sales - 9999999999
41 10 Dollar Sales - 9999999.99

Query Execution

Column Width Field
1 24 Query number
25 24 Parameter 1
49 24 Parameter 2
73 24 Parameter 3
97 24 Parameter 4
121 24 Parameter 5
124 24 Parameter 6

Queries

OLAP queries are ad hoc and very dynamic. The types of information requested span the full scope of the available data. Queries must be able to take advantage of the business relationships represented in the database. The time periods, products, customers, and channels must be dynamically generated from their respective hierarchies. For example, a query requesting the sales for a predetermined list of stores is the exception rather than the rule in analytical processing. The more common type of query asks for the sales of all the stores of a given retailer.

Time is a unique dimension in any application and as such, has special characteristics. Hierarchical relationships (the months in a given quarter) are not the only way to request time periods. Relative positioning is just as important. The most often used of the relative positioning functions is the time period immediately preceding the requested one and the same time period a year ago.

The queries vary in how they treat null values. The query definition states for each dimension whether to suppress members where all values are null. This means that the server should not generate a requested dimension member if every value it returns is null.

To best simulate a realistic operating environment, each of the queries will have substitution parameters. These parameters will not only state the dimension members, but also the function to apply to them. Note that the computation of the members lists for the returned data set must be performed on the server and not on the client.

The benchmark queries are detailed below. The definition of each query will contain the following information:

Query 1 - Channel Sales Analysis

This query shows actual units sold, dollar sales, and average selling price for a given channel. The product, customer, channel, and time period vary with each execution of the query. Only the instances of product and customer that contain data should be returned.

parameter 1 = ?product
parameter 2 = ?customer
parameter 3 = ?channel
parameter 4 = ?time

get UNITS, DOLLARS, PRICE
by SCENARIO = "ACTUAL"
by PRODUCT = <children(?product)> option suppress null
by CUSTOMER = <children(?customer)> option suppress null
by CHANNEL = <?channel>
by TIME =<children(?time)>

Query 2 - Customer Margin Analysis

This query shows actual sales, cost and margin for a given customer for the sum of all channels for a requested period. The product, customer and time period vary with each execution of the query. Only the instances of product that contain data should be returned.

parameter 1 = ?product
parameter 2 = ?customer
parameter 3 = ?time

get UNITS, DOLLARS, COST, MARGIN, MARGINPCT,
PRODUCTCOST, CUSTOMER COST
by SCENARIO = "ACTUAL"
by PRODUCT = <children(?product)> option suppress null
by CUSTOMER = <?customer>
by CHANNEL = attribute("level", "TOP")
by TIME = <?time>

Query 3 - Product Inventory Analysis

This query shows actual sales, cost and inventory for a given product regardless of channel. The product and customer vary with each execution of the query. All months from January 1995 through June 1996 should be returned.

parameter 1 = ?product
parameter 2 = ?customer

get UNITS, DOLLARS, COST, INVENTORY
by SCENARIO = "ACTUAL"
by PRODUCT = <?product>
by CUSTOMER = <children(?customer)>
by CHANNEL = attribute("level", "TOP")
by TIME = "9501" thru "9606"

Query 4 - Time Series Analysis

This query shows actual sales and a 6 month moving average of smoothed sales for a given customer and a group of time periods over all channels. The product, customer, and time period vary with each execution of the query.

parameter 1 = ?product
parameter 2 = ?customer
parameter 3 = ?time

get DOLLARS, SMOOTHED
by SCENARIO = "ACTUAL"
by PRODUCT = <children(?product)>
by CUSTOMER = <?customer>
by CHANNEL
by TIME = <children(?time)>

Query 5 - Customer Budget

This query shows a customer's budget across all channels for all months in 1996. The customer varies with each execution of the query.

parameter 1 = ?customer

get UNITS, DOLLARS, PRICE, COST, MARGIN
by SCENARIO = "BUDGET"
by PRODUCT = attribute("level", "TOP")
by CUSTOMER = <?customer>
by TIME = "9601" thru "9612"

Query 6 - Product Budget

This query shows a product's budget for all quarters in 1996. The product varies with each execution of the query.

parameter 1 = ?product

get UNITS, DOLLARS, PRICE, COST, MARGIN
by SCENARIO = "BUDGET"
by PRODUCT = <?product>
by CUSTOMER = attribute("level", "TOP")
by TIME = children("1996")

Query 7 - Forecast Analysis

This query shows a channel's forecast for a given time period in 1996. The product, customer, channel, and time period vary with each execution of the query.

parameter 1 = ?product
parameter 2 = ?customer
parameter 4 = ?time

get UNITS, DOLLARS, PRICE, COST, MARGIN
by SCENARIO = "FORECAST"
by PRODUCT = <children(?product)> option suppress null
by CUSTOMER = <children(?customer)> option suppress null
by TIME = <?time>

Query 8 - Budget Performance

This query shows budget vs. actual and this year vs. last year's performance for the current month and current year-to-date. The product and customer vary with each execution of the query.

parameter 1 = ?product
parameter 2 = ?customer
parameter 3 = ?channel

get DOLLARS[SCENARIO:"ACTUAL"]
DOLLARS[SCENARIO:"BUDGET"]
val_diff(DOLLARS, SCENARIO, "ACTUAL", "BUDGET")
pct_diff(DOLLARS, SCENARIO , "ACTUAL", "BUDGET")
last_year_val_dif(DOLLARS[SCENARIO:"ACTUAL"])
last_year_pct_dif(DOLLARS[SCENARIO:"ACTUAL"])
by PRODUCT = <?product>
by CUSTOMER = <?customer>
by TIME = {"9606", YTD("9606")}

Query 9 - Forecast Performance

This query shows forecast vs. actual and this month's vs. last month's performance for the current month and current year-to-date. The product and customer vary with each execution of the query.

parameter 1 = ?product
parameter 2 = ?customer

get DOLLARS[SCENARIO:"ACTUAL"]
DOLLARS[SCENARIO:"FORECAST"]
val_diff(DOLLARS, SCENARIO, "ACTUAL", "FORECAST")
pct_diff(DOLLARS, SCENARIO , "ACTUAL", "FORECAST")
last_period_val_dif(DOLLARS[SCENARIO:"ACTUAL"])
last_period_pct_dif(DOLLARS[SCENARIO:"ACTUAL"])
by PRODUCT = <?product>
by CUSTOMER = <?customer>
by TIME = {"9606", YTD ("9606")}

Query 10 - Ad Hoc

This query answers ad hoc questions of the database. The measure, scenario, product, customer, channel and time period vary with each execution of the query.

parameter 1 = ?product
parameter 2 = ?customer
parameter 3 = ?channel
parameter 4 = ?time
parameter 5 = ?scenario
parameter 6 = ?measure

get <?measure>
by SCENARIO = <?scenario>
by PRODUCT = <?product>
by CUSTOMER = <?customer>
by CHANNEL = <?channel>
by TIME =<?time>

Query Distribution

The following chart shows the distribution of queries over the course of an APB-1 benchmark execution.

10% Query 1 - Channel Sales Analysis
10% Query 2 - Customer Margin Analysis
15% Query 3 - Product Inventory Analysis
3% Query 4 - Time Series Analysis
5% Query 5 - Customer Budget
5% Query 6 - Product Budget
15% Query 7 - Forecast Analysis
20% Query 8 - Budget Performance
15% Query 9 - Forecast Performance
2% Query 10 - Ad Hoc

Implementation Rules

The APB-1 benchmark has the following two database design restrictions: All the data must be stored on the server and all calculations must be performed on the server. Each benchmark implementor may decide how much and which derived data to pre-calculate and store. The benchmark is designed to test the performance of business analysis, not a specific technology or design.

The APB-1 is executed in six steps:

  1. Execution of the APB1GEN program to produce hierarchy files and historical data
  2. Database initialization and historical data load
  3. Execution of the APB1GEN program to produce incremental data files
  4. Incremental data load and pre-calculation
  5. Execution of the APB1GEN program to produce query data files
  6. Query execution

The APB1GEN program is executed between each stage of the benchmark process to prevent an implementor from "wiring" the benchmark. This processing sequence hides specific knowledge about the next benchmark step until it is needed. This is done to ensure that the database design is application specific not data specific. It is important that the calculation of the AQT metric apply to general application processing and not to a particular benchmark execution.

The purpose of the APB-1 is to measure the performance of database servers for OLAP applications. All processing must be done on the server. The client in this benchmark exists only to provide mechanisms for supplying parameter substitutions, to pass the query to the server, and to write the returned data to a disk file for audit. The computation of dimension member sets and derived data are specifically prohibited from being performed on the client.

The AQT is computed as the execution time to perform steps 4 and 6 above divided by the total number of queries performed. The execution time for step 4 begins when the incremental data files start being processed. If the benchmark implementor prefers the data files in a sorted order and performs that sort, the sort time is included in the execution time. The execution time for step 6 begins when the first user initiates the first query and ends when the last user completes the last query. The results of each query must be written to disk on the client platform to be available for auditing.

The APB1GEN Program

The APB1GEN program is used to generate data and query information for the APB-1 benchmark. The program should be run with each benchmark execution. This ensures that the implementation of the APB-1 is not designed for a specific data and/or query set.

The size of the domain of each of the three aggregation dimensions (product, customer, and channel) is controlled by the APB1GEN program. When the APB1GEN program is used to initialize the benchmark database, the number of members for the channel dimension is given as a parameter. The channel dimension must have at least 10 members. The customer dimension is generated with 100 times the number of members in the channel dimension. The product dimension is generated with 1,000 times the number of members in the channel dimension.

The number of records in the historical sales data file is controlled by the density parameter of the APB1GEN program. This parameter given as percentage determinate how many of the total possible of channel, customer, and product combinations at the lowest level in the hierarchy will have data for each of the 17 historical months. Remember that 10% of the members of each of these dimensions are aggregates.

records = (length(CHANNEL) * .9) *
(length(CUSTOMER) * .9) *
(length(PRODUCT) * .9) * 17 * density%

The number of records in the historical inventory data file will be approximately three times greater than the number of records in the sales file. This is because although not every product is sold in every month, to sell a product there must be inventory. Once all of a product is sold, there is no more inventory. The APB1GEN program manages these operational characteristics.

The current month sales and inventory data files will be 1/17th the size of the actual sales and inventory files because the two files contain data only for one month.

The final function of the APB1GEN program is to generate a file for each simultaneous user in the benchmark execution. The number of queries generated for each user is 250 times the number of members in the channel dimension times the number of users. The minimum number of users allowed is 10.

Full Disclosure and Auditing

Full disclosure is required by anyone publishing APB-1 benchmark results so that others can duplicate results using the same documentation and products. The disclosure includes the information an evaluator needs to judge a server's processing performance and its suitability to the task of analytical processing.

Disclosure items:

An independent audit of APB-1 benchmark results must be performed by an auditor certified by the OLAP Council. The auditor cannot have any financial gain from the benchmark other than fees directly related to the audit. The auditor cannot be involved, in any manner, with the execution of the benchmark, other than to ensure its proper execution.

The audit of the APB-1 benchmark will consist of the following five tasks:

  1. Verifying that the benchmark was run in the prescribed order as detailed in the implementation rules section.
  2. Examining the programming code on both the client and the server to verify that:
  3. Verifying that the results returned are correctly calculated. Thirty queries (three of each query type) must be randomly chosen and must be audited to:
  4. Ascertaining that all disclosure items are correctly stated in the disclosure document.
  5. Producing a report detailing the audit steps and results.

The OLAP Council sponsored development of the APB-1 benchmark, which was released in April 1996 for public comment. The benchmark was written by Mr. Dan Bulos.

Dan Bulos, 41, is president and founder of Symmetry Corporation, an On-Line Analytical Processing (OLAP) consulting firm in San Rafael, CA. He has 18 years of experience consulting to Fortune 500 companies that implement OLAP applications. A guest columnist for DBMS Magazine, Mr. Bulos has also been featured as a speaker at various OLAP conferences.

Prior to founding Symmetry in 1986, Mr. Bulos spent four years as an independent OLAP consultant and four years at Tymshare, where he pioneered the use of multidimensional databases for financial applications as the Western Region Financial Consulting Manager. Mr. Bulos earned a B.A. in business administration from Antioch College.


© 1997, 1998 OLAP Council, all rights reserved.