# An Overview of the Integrated Business Statistics Program Sampling Methodology

## Introduction

Statistics Canada’s Corporate Business Architecture (CBA) initiative implements measures designed to reduce operating costs, enhance quality assurance and improve responsiveness in the delivery of new statistical programs. The proposal to develop the Integrated Business Statistics Program (IBSP) represents a way for Statistics Canada’s business statistics programs to achieve these objectives. The IBSP aims to rebuild the existing Unified Enterprise Survey (UES) platform into a CBA-compliant generalized model for producing business statistics. This model will cover all survey steps, from frame and sampling to dissemination. The initial objective was to apply the IBSP model to all business programs with the exception of Prices and International Trade. There are six pillars underlying the IBSP proposal that will produce the expected efficiencies:

- full use of the Business Register (BR) as the frame for all business surveys;
- use of electronic data collection as the primary mode of collection;
- use of a common edit strategy for automated and manual editing;
- establishment of an earlier cut-off for active collection;
- use of the tax information for estimating financial information;
- improved governance across all areas involved in statistical data output and notably stronger change management.

The foundations of the IBSP are described in detail in a document written by the Enterprise Statistics Division (ESD, 2009). According to this document, the IBSP must be unified, harmonized and flexible enough to

- integrate new surveys;
- conduct industry-based surveys as well as activity-based surveys; and
- have the choice between enterprise, establishment and location as the sampling unit.

This document summarizes the sampling methodology adopted for the IBSP. Each of the main steps of the sampling process will be detailed and the methodology implemented will be discussed.

## Two-phase sampling

Some surveys that will integrate into the IBSP are facing situations where the information available on the BR is either not up-to-date or insufficient for the needs of the survey. For instance, the industrial classification on the BR may not be consistently up-to-date across all units, or the survey may require extra information, such as the commodities produced by the business, to make its sampling process more efficient.

In order to deal with this, the IBSP offers surveys the possibility of using a two-phase sampling strategy. Surveys choosing to do two-phase sampling are encouraged to join the main phase 1 survey, referred to as the Business Activity, Expenditure and Output (BAEO) survey. Some surveys for which a two-phase sampling plan is not appropriate will have the option of using a single-phase sampling strategy. The objectives of any phase 2 survey must be known in order to ensure that the phase 1 sample (on which the phase 2 surveys depend) will be large enough to satisfy them. The size of the phase 1 sample is also quite important for negative coordination considerations in phase 2.

## Frame

One of the basic assumptions of the IBSP in terms of frame information is that all the necessary information from the BR or survey data is available and up-to-date at the time of phase 1 sampling. No updates to the frame information are permitted within the sampling process.

### Use of additional information

In order to create the frame for a survey, it can also be useful to refer to information from the survey’s previous cycles or information from the survey’s first phase. For instance, this will be the case for the phase 2 surveys that use the BAEO survey as a first phase. When creating the frame for these phase 2 surveys, the sampling units to be included will be the ones selected as part of the BAEO sample.

Also, to control response burden, it may be necessary to access sample files from other surveys in order to coordinate samples between surveys. It may also be useful to access sample files from previous occasions of the same survey in order to manage sample rotation between occasions. Furthermore, all second-phase surveys will require information from their first-phase survey to define the frame.

## Unit hierarchy and size

Businesses on the BR are made up of four statistical levels (enterprise, company, establishment and location). Each survey must define two levels: the targeted operating entity level, at which the contributions (the proportions of the total size that come from the unit) are considered at sampling, and the sampling unit level, at which units are selected at sampling and estimation is done.

### Targeted operating entities and sampling units

The targeted operating entity and sampling unit levels are identified in the metadata and can be any of the statistical levels from the BR. The only restriction is that the sampling unit level may not be lower in the hierarchy than the targeted operating entity level.

### Sampling size measures for targeted operating entities

For targeted operating entities, the default size measure is revenue-based. Size concepts other than revenue can also be used and can come directly from the BR or be derived based on variables present on the BR. The user could also specify a revenue-based size function different from the default one. It is also possible to use a combination of different size concepts such as revenue and assets, revenue and capacity or heads of cattle and heads of pigs for agricultural surveys (as is seen in the “Use of Multiple Size Concepts” section of this document).

## Sampling Cells

For each survey, the basic level of stratification or cell is usually defined by the most refined domain for which estimates will be required. The purpose is to ensure that good quality estimates can be produced by taking this information into account at stratification.

The cell usually includes both industrial and geographical dimensions (where the geographical dimension typically represents the provinces). The level of industry varies between surveys. Extra dimensions can also be taken into account such as country of control or profit/non-profit status of the entity.

Targeted operating entities naturally fall into one cell based on the classification information available on the BR for that entity.

### Importance factors

Once the size measure has been derived for all targeted operating entities, importance factors for each sampling cell

One choice of an importance factor (

The importance factor

Here, the numerator is the total amount of revenue in province

For the IBSP, geography is not the only dimension of interest; industrial classification is also key. If we add industrial classification as a secondary dimension, then we have

The index

## Cell stratification

One of the key aspects of the IBSP sample design is the desire to be enterprise-centric. That is, within a particular survey of the IBSP, it is important to either sample all eligible targeted operating entities within an enterprise or to sample none at all. By ensuring that all targeted operating entities associated with an enterprise are pulled together into a single sampling unit, it is both easier to coordinate samples across surveys and draw enterprise-centric samples without the cumbersome network sampling approach of the UES, a process that greatly complicated estimation. For simple enterprises with a single establishment, there is a one-to-one relationship between the enterprise’s sampling unit and a sampling cell. However, this relationship for complex enterprises is one-to-many and therefore it is not straightforward to define to which sampling cell each sampling unit should be assigned.

In business surveys where the distribution of revenue among units in the population tends to be highly skewed (revenue being a quantity correlated to many key estimates), selecting larger units to the sample with greater probability increases both the quality of estimates and the efficiency of the sample. With this in mind, the choice of the sampling cell to which a complex enterprise’s sampling unit will be stratified is taken to be a function of the proportion of revenue within each sampling cell that can be attributed to the enterprise as well as the sampling cells’ importance factors.

For the IBSP, the size measure of sampling unit

where

$\omega_j$ represents the importance factor of sampling cell (domain) $j$, $x_{ji}$ is the total amount of size in sampling cell $j$ of all targeted operating entities in sampling unit $i$, and $t_{x_j}$ represents the total size of sampling cell $j$.

The term $x_{ji} / t_{x_j}$ is referred to as **the contribution of sampling unit $i$ to sampling cell $j$**. The sampling cell

The function applied to the contributions (4) ensures that the total size of sampling unit

Given that the scope and coverage of IBSP surveys change across sampling phases, it is necessary to re-evaluate the stratification of complex sampling units at each phase to ensure the larger units are being handled in the most advantageous way possible.
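The contribution calculation can be illustrated with a minimal sketch. It computes each sampling unit's contribution to each cell as the unit's size in the cell divided by the cell total, then picks a cell for the unit. The selection rule used here (the cell maximizing importance factor times contribution) is an assumption made for illustration, as are the function name `assign_cells` and the input layout; the text only states that the choice is a function of the contributions and the importance factors.

```python
from collections import defaultdict

def assign_cells(entities, importance):
    """entities: list of (sampling_unit, cell, size) for targeted operating entities.
    importance: {cell: importance factor}.
    Returns {sampling_unit: (chosen_cell, {cell: contribution})}."""
    cell_totals = defaultdict(float)      # total size of each cell
    unit_cell_size = defaultdict(float)   # size of each unit within each cell
    for unit, cell, size in entities:
        cell_totals[cell] += size
        unit_cell_size[(unit, cell)] += size

    contributions = defaultdict(dict)
    for (unit, cell), size in unit_cell_size.items():
        contributions[unit][cell] = size / cell_totals[cell]

    result = {}
    for unit, contrib in contributions.items():
        # Assumption: stratify the unit in the cell with the largest
        # importance-weighted contribution.
        best = max(contrib, key=lambda c: importance[c] * contrib[c])
        result[unit] = (best, contrib)
    return result

entities = [("E1", "A", 80), ("E1", "B", 10), ("E2", "A", 20), ("E3", "B", 90)]
cells = assign_cells(entities, {"A": 0.5, "B": 0.5})
```

In this toy population, the complex unit `E1` holds 80% of cell A and 10% of cell B, so with equal importance factors it is stratified in cell A.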

### Use of Multiple Size Concepts

Some survey programs require the use of more than one concept to define the size of units in their population. For example, the Annual Electric Power Generating Stations survey uses both revenue and electrical capacity while the Livestock survey measures size using heads of cattle, sheep and pigs. In order to successfully integrate these programs into the IBSP, it is necessary to allow the use of multiple size concepts to derive the overall size of a unit.

The current sampling system is not flexible enough to allow for multiple passes through the sampling processes. Furthermore, some difficulties exist in mapping a multivariate size measure into a single size measure for stratification and sampling purposes. To circumvent these issues, it was decided to consider each size concept as another level of the cell stratification. The overall process itself would not be disturbed as, by default, the subroutines described further along in this document (such as Take-None threshold definition and size stratification) are executed sampling cell by sampling cell.

As an example, consider a targeted operating entity in NAICS 111111 and Province P raising both cattle and pigs. This entity would be represented by one entity placed in the 111111×P×Cattle cell with a size based on head of cattle only and a second entity placed in the 111111×P×Pigs cell with a size based on number of pigs. The size measures are standardized in order to avoid differences in scale between different concepts. The contributions of each operating entity towards their cell memberships (as well as that of other targeted operating entities in the sampling unit, if applicable) are calculated to determine the sampling cell in which the sampling unit should be stratified. Two advantages of this classification are that sampling units are classified in line with their dominant industry/geography combination and, because every cell has a different size concept component associated with it, the sampling units get classified in line with their dominant size concept. The standardization and normalization of sizes allows the use of a single size measure during the sampling process.
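A minimal sketch of the cattle-and-pigs example follows. Each targeted operating entity is expanded into one record per size concept, with the concept appended to the cell definition. The standardization shown (dividing by the population total of the concept) is an assumption chosen for simplicity, since the exact standardization is not specified here; the function name `expand_by_concept` and the record layout are likewise hypothetical.

```python
def expand_by_concept(entities):
    """entities: list of dicts with 'id', 'cell' and 'sizes' ({concept: raw size}).
    Returns one record per (entity, concept) with a standardized size."""
    totals = {}
    for e in entities:
        for concept, value in e["sizes"].items():
            totals[concept] = totals.get(concept, 0.0) + value

    records = []
    for e in entities:
        for concept, value in e["sizes"].items():
            if value > 0:
                records.append({
                    "id": e["id"],
                    # The size concept becomes an extra dimension of the cell.
                    "cell": (e["cell"], concept),
                    # Assumed standardization: share of the concept's total.
                    "size": value / totals[concept],
                })
    return records

farms = [
    {"id": "F1", "cell": ("111111", "P"), "sizes": {"cattle": 300, "pigs": 50}},
    {"id": "F2", "cell": ("111111", "P"), "sizes": {"cattle": 100, "pigs": 150}},
]
records = expand_by_concept(farms)
```

After expansion, the records can flow through the cell-by-cell routines unchanged; the final sample file then needs to be collapsed back to one record per operating entity.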

All of the ensuing steps of the sampling process described hereafter continue as in the univariate case. Care must be taken though to restructure the final sample files in order to avoid duplicating the operating entities (once for each size measure). The additional size measures can appear on each record but if they do, they appear as different variables of size that were used in the process.

## Take-None

The Royce-Maranda algorithm as described in Royce and Maranda (1998) is the method used by the IBSP to reduce the response burden on small units by excluding them from the sample. This method is part of Statistics Canada’s new Generalized Sampling system (G-Sam). The user has to provide a list containing the exclusion thresholds, and G-Sam determines the appropriate exclusion threshold for each cell.

The first step of the process is to assess which targeted operating entities should be Take-None. In order to do that, in each cell, targeted operating entities are sorted in descending order based on their revenue. A threshold is selected in such a way as to retain at least 90% of the revenue in each cell unless the minimum threshold eliminates more than 10% of the cell’s revenue. The level of exclusion can be controlled but it is assumed to be 10% for all IBSP surveys.

Once the status of each targeted operating entity of the population has been assessed, the status of the sampling unit as a whole has to be evaluated. Since sampling units can consist of a mix of Take-None and eligible targeted operating entities, a rule is required to decide what to do with such cases. To keep the exclusion below the targeted 10% level, the rule is to exclude the sampling unit only if all targeted operating entities belonging to that sampling unit are Take-None based on the previous assessment.
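The two steps above can be sketched as follows. This is a simplified illustration, not the Royce-Maranda algorithm itself: it assumes that an entity is Take-None when its revenue is strictly below the cell threshold, and that the largest candidate threshold excluding no more than 10% of the cell's revenue is chosen, falling back to the minimum threshold otherwise. The function names are hypothetical.

```python
def choose_threshold(revenues, candidate_thresholds, max_excluded_share=0.10):
    """Pick the largest candidate threshold that excludes at most the given
    share of the cell's revenue; fall back to the minimum candidate."""
    total = sum(revenues)
    candidates = sorted(candidate_thresholds)
    chosen = candidates[0]  # the minimum threshold is applied in every case
    for t in candidates:
        excluded = sum(r for r in revenues if r < t)
        if excluded <= max_excluded_share * total:
            chosen = t
    return chosen

def take_none_units(entities, thresholds):
    """entities: list of (sampling_unit, cell, revenue); thresholds: {cell: value}.
    A sampling unit is Take-None only if all of its entities are Take-None."""
    all_below = {}
    for unit, cell, revenue in entities:
        below = revenue < thresholds[cell]
        all_below[unit] = all_below.get(unit, True) and below
    return {unit for unit, flag in all_below.items() if flag}

threshold = choose_threshold([1000, 500, 200, 50, 30, 20], [25, 60, 250])
excluded_units = take_none_units(
    [("U1", "A", 20), ("U1", "A", 500), ("U2", "A", 30),
     ("U3", "A", 50), ("U3", "A", 20)],
    {"A": threshold},
)
```

Here the threshold of 60 excludes 100 out of 1,800 in revenue (under 10%), while 250 would exclude 300. Unit `U1` stays eligible because one of its entities is above the threshold.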

The assessment of the Take-None is not redone when processing a second-phase survey. Targeted operating entities retain their Take-None status from the first phase and the sampling unit’s status is determined in the same way as described in the previous paragraph. If a sampling unit becomes Take-None at the second phase, it will undergo a special treatment. None of these units will receive a questionnaire and financial data will be obtained from tax while characteristic information will either be zeroed or imputed.

## Must-Take Units

Given the skewed nature of the distribution of revenue in economic surveys, it is essential that certain influential units that cannot be represented by other units in the population be sampled. This ensures that these influential units’ data are included in the final estimates. Revenue-based estimates aside, there are also some smaller units that are equally crucial for characteristic estimates. In both cases, a mechanism exists to allow particular units to be sampled with certainty. The specified units are placed in a single Must-Take stratum in which all units are selected into the sample.

There are other cases where it is advantageous to put units in the Must-Take stratum, such as taking a census in a cell or survey, needing to produce an estimate from a cell containing an extremely small number of units, or identifying a unit as a large outlier. In total, there are eight ways in which a targeted operating entity can become Must-Take throughout the IBSP sampling process. They are as follows.

1. Subject matter experts can specify the unit directly.
2. Must-Take criteria set up in the metadata identify the unit (the default criterion targets sampling units with contributions in at least six unique sampling cells).
3. The unit belongs to a sampling unit in which another unit was identified by one of the two methods above.
4. The unit is a member of a sampling cell containing fewer than ten sampling units above the exclusion threshold (see the section on small cells).
5. The unit is a member of a stratum of five or fewer sampling units.
6. The unit is in a sampling unit that is a large outlier within its sampling cell.
7. The unit is in a census survey.
8. The unit is in a frame cross-classification containing a total of one or two units and is therefore critical to estimation.

Note that, as per case 3, all targeted operating entities in a sampling unit in which one targeted operating entity is identified as an outlier will also be flagged as Must-Take units.

## Special units

Some classes of units may be in-scope for some IBSP surveys, but do not need to pass through every step of the sampling process. These “special units” are subject to special treatments to avoid having them interfere with the sampling process as it is applied to the remainder of the units.

## Size stratification

After the sets of take-none, must-take and special units within each cell have been defined, the remaining units form the set of eligible units from which the sample will be drawn. However, if the sample is drawn within each cell as-is, the skewed distributions of revenue are such that a relatively large number of units will need to be sampled in order to target a reasonable level of quality within each cell. In order to increase the efficiency of the sample, units in the take-some strata are first stratified by size in order to group similarly-sized units together. The sample selection within each cell will then be conducted one stratum at a time.

The classic method of stratification used for the UES was Lavallée-Hidiroglou stratification (Lavallée and Hidiroglou, 1988). An advantage of using this method for stratification is that sample allocation is also performed at the same time. However, this method requires *a priori* knowledge of the target levels of quality as an input; its outputs are a set of stratum bounds and sample sizes for each stratum that minimize the overall sample size. In the IBSP, target quality levels are not known as inputs. Rather, the targeted sample size is used as an input and the target quality becomes an output. As a result, alternative methods for size stratification and sample allocation are required.

The Geometric Method of stratification (Gunning and Horgan, 2004) is a very simple method that requires only a distribution of size values and a desired number of strata in order to function. The stratum bounds are derived with the intention of equalizing the coefficient of variation (CV) of size within each stratum. That is, if in stratum

The formula used by the Geometric Method to establish the stratum boundaries is

where

- minsize and maxsize are the minimum and maximum sampling unit sizes, respectively, in cell $j$;
- $K$ is the total number of bounds to be derived in cell $j$; and
- $k$ is the index of the bound ($1, \ldots, K$).
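As an illustration, the geometric progression of Gunning and Horgan (2004) can be sketched in a few lines. The sketch assumes that the $K$ bounds are interior bounds separating $K + 1$ strata, so that consecutive bounds share a common ratio of $(\text{maxsize}/\text{minsize})^{1/(K+1)}$; the indexing convention used in the IBSP implementation may differ, and the function name is hypothetical.

```python
def geometric_bounds(sizes, n_bounds):
    """Return n_bounds interior stratum boundaries in geometric progression
    between the minimum and maximum size of a cell."""
    lo, hi = min(sizes), max(sizes)
    if lo <= 0:
        raise ValueError("sizes must be strictly positive")
    # Assumption: n_bounds interior bounds define n_bounds + 1 strata.
    ratio = (hi / lo) ** (1.0 / (n_bounds + 1))
    return [lo * ratio ** k for k in range(1, n_bounds + 1)]

bounds = geometric_bounds([1, 10, 100, 1000], n_bounds=2)
```

With sizes ranging from 1 to 1,000 and two bounds, the boundaries fall near 10 and 100, giving three strata. Because the bounds depend only on the minimum and maximum, a single extreme value shifts every boundary.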

It was shown in Li (2012) that the Geometric method was sensitive to both small and large outliers. Extreme values at either end of the distribution could have drastic effects on the values of the stratum bounds. As a result, an outlier detection functionality was developed to precede the stratification procedure. Both large and small outliers were detected according to the Sigma Gap method. Hellec (2013) determined that the global Sigma Gap parameter λ should be set to 60. This parameter is applied indiscriminately to all sampling cells after removal of previously-defined Must-Take units. After the outlier detection process is run in each cell, both types of outliers are set aside and the remaining distribution is used to derive stratum bounds. Once the bounds are derived, the small outliers are replaced in the small unit strata in their respective cells while the large outliers are assigned to the Must-Take stratum.

In most cases, the value of

### Small cells

Statistics Canada is mandated to produce estimates in all of the cross-classifications of provinces and industrial classification variants defined as domains of estimation for the IBSP. However, some of the sampling cells used to represent these domains can contain very few units. In order to produce reasonable estimates within a cell, it is necessary to have a minimum number of responding units. Partitioning such cells into three size strata, sampling fewer than 100% of the units in these strata and then losing further units to non-response can lead to scenarios where estimates of reasonable quality are not possible. As a result, it was necessary to develop custom strategies to allocate samples in particularly small cells. In this section, the size of a cell refers to the number of sampling units that are eligible for sampling after removing Must-Take units.

Targeted operating entities that naturally stratify into sampling cells with 1 or 2 sampling units will all be made Must-Take. Being a member of such a small estimation domain has a negative impact on the performance of the sample allocation methodology and therefore it is best to simply exclude them from that process by making them Must-Take.

In cells containing fewer than 10 sampling units, all units will also be made Must-Take. For surveys where data can be tax replaced, data will be collected for important cells. A cell is deemed to be important if its importance factor is above the 50th percentile of all cells within the same province or industry. For cells that are less important, no collection will take place and tax data will be used for the financial portion of the questionnaire. The characteristic portion will either be zeroed or imputed in these cases. If the survey cannot perform tax replacement, all units will be collected.
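The small-cell rules of this section can be summarized in a short decision function. This is a sketch under stated assumptions: the importance test is passed in as a precomputed flag, cells with 1 or 2 units are assumed to follow the same collection rule as other cells with fewer than 10 units, and the function name and return layout are hypothetical.

```python
def small_cell_plan(n_eligible_units, is_important, tax_replaceable):
    """Return the treatment of a cell given its number of eligible sampling
    units (counted after removing Must-Take units)."""
    if n_eligible_units < 10:
        # All units become Must-Take; data are collected only for important
        # cells, or when the survey cannot use tax replacement.
        collect = is_important or not tax_replaceable
        return {
            "must_take": True,
            "data_source": "questionnaire" if collect else "tax",
            "size_strata": 0,
        }
    # Cells with 10 to 29 units get 2 size strata instead of the usual 3.
    n_strata = 2 if n_eligible_units <= 29 else 3
    return {"must_take": False, "data_source": "questionnaire",
            "size_strata": n_strata}

plan = small_cell_plan(7, is_important=False, tax_replaceable=True)
```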

In cells with 10 to 29 sampling units, the number of size strata will be reduced from 3 to 2 (Geometric Method applied with

### Sample allocation

For the IBSP, an enterprise may contribute to several industries and/or provinces. If we consider the amount that an enterprise contributes to these sampling cells as variable values, then a multivariate allocation scheme needs to be used to allocate the sample to the stratified population. A power allocation method will be used in order to obtain quality estimates at both the industry and provincial levels. The method is described in Bankier (1988) and Särndal, Swensson and Wretman (2003).
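To give a feel for how the power controls the trade-off, here is a minimal univariate sketch of Bankier's power allocation, in which the sample size of stratum $h$ is proportional to $S_h X_h^q / \bar{Y}_h$. It ignores finite population corrections and is not the multivariate scheme used by the IBSP; the function name and input layout are hypothetical.

```python
def power_allocation(n_total, strata, q):
    """strata: list of dicts with 'size' (importance measure X_h),
    'sd' (stratum standard deviation S_h) and 'mean' (stratum mean).
    Returns the (unrounded) sample size allocated to each stratum."""
    scores = [s["sd"] * s["size"] ** q / s["mean"] for s in strata]
    total = sum(scores)
    return [n_total * score / total for score in scores]

strata = [
    {"size": 1000.0, "sd": 5.0, "mean": 10.0},   # small domain
    {"size": 9000.0, "sd": 5.0, "mean": 10.0},   # large domain
]
neyman_like = power_allocation(100, strata, q=1.0)
equal_cv = power_allocation(100, strata, q=0.0)
```

With a power of 1 and the stratum total as the size measure, the allocation favors the large domain (10 versus 90 units here), which benefits the overall estimate. With a power of 0, both domains receive 50 units, which equalizes the domain CVs. Intermediate powers compromise between the two.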

Multivariate power allocation can be set up in two ways. One can seek to either:

- minimize a cost function under precision constraints for each domain and variable, or
- minimize a precision function for the domains under a fixed cost constraint.

As noted in the section on size stratification, the inputs for the survey will be quantities of available resources, rather than known quality targets. As a result, option 2 above is the approach that is taken. The function to be minimized is:

where

For phase 1 of the IBSP, the domains

The powers

Consider the power

In order to ensure that estimates for all cells are of reasonable quality, a constraint on the maximum allowable coefficient of variation (CV) can be set. It should be noted that the CVs computed internally within G-Sam that are compared against these quality constraints are conditional CVs and can be quite a bit higher than the effective CVs computed once the sample is actually selected.

Furthermore, in order to ensure the selection of a sufficient number of units in each stratum as well as to prevent the negative impact of units with very large weights changing stratum between sampling and estimation, constraints can be set at the sample allocation stage. The constraints that have been implemented translate a user-defined upper bound on weights into a minimum number of units to be allocated to each type of stratum.
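The translation from a weight bound to a minimum sample size can be sketched as follows, assuming the design weight in a stratum is the stratum population size divided by the number of units allocated. The function name is hypothetical.

```python
import math

def min_sample_size(stratum_size, max_weight):
    """Smallest allocation that keeps the design weight N_h / n_h at or
    below max_weight, capped at the stratum population size."""
    return min(stratum_size, math.ceil(stratum_size / max_weight))

n_min = min_sample_size(45, 10)
```

For a stratum of 45 units and a maximum weight of 10, at least 5 units must be allocated, giving a weight of 9.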

The implementation of either of these types of constraints moves the G-Sam solution away from optimality. Therefore, the sample allocation will first be run without the use of either CV or maximum weight constraints. The CVs and weights will then be evaluated and if issues are found the allocation will be rerun with constraints in place. The only exception is for the BAEO (phase 1) survey, in which maximum weight constraints are generally set at fixed levels for each of the possible size strata. These constraints are relaxed for some industry groups (corresponding to the second-phase surveys) where the population makeup prevents convergence of the allocation program for the default maximum weight values. Note that these constraints can be adjusted by users.

All units stratified in the Must-Take stratum are given sampling probabilities of 1.

### Sample selection and coordination

The default method for sample selection in the IBSP is Bernoulli sampling. This method has been chosen because it simplifies both variance computation and sample coordination, and because it allows samples to be combined. In some cases, however, the Bernoulli selection may yield an insufficient sample in a particular stratum. To mitigate this risk, a Simple Random Sample (SRS) is used in place of the Bernoulli sample when the number of selected units drops below the specified threshold of 2.
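A minimal sketch of Bernoulli selection with the SRS fallback is shown below. It assumes that the fallback SRS has a size equal to the threshold, which is not specified in the text; the function name and the use of Python's `random` module are illustrative only.

```python
import random

def select_stratum(units, prob, min_size=2, rng=None):
    """Bernoulli sampling within one stratum: each unit is selected
    independently with probability prob. If fewer than min_size units are
    selected, a simple random sample is drawn instead."""
    rng = rng or random.Random()
    sample = [u for u in units if rng.random() < prob]
    if len(sample) < min_size:
        # Assumption: the fallback SRS has size min_size (or the whole
        # stratum if it is smaller than that).
        sample = rng.sample(units, min(min_size, len(units)))
    return sample

stratum = list(range(10))
fallback = select_stratum(stratum, prob=0.0, rng=random.Random(1))
census = select_stratum(stratum, prob=1.0, rng=random.Random(1))
```

Because Bernoulli sample sizes are random, the realized size can be zero or one in small strata even when the expected size is larger, which is the situation the fallback guards against.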

Through the use of G-Sam, the IBSP will allow sample coordination between the occasions of a given survey as well as between IBSP surveys. As it is the only method available in G-Sam to do sample coordination at this point, sequential sampling will be used to manage sample coordination. This method uses Permanent Random Numbers (PRNs) to select the sample. By selecting different starting points, it allows for different types of sample coordination.
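The role of the starting point can be illustrated with a small sketch of sequential selection on PRNs. It assumes the common formulation in which the units with the first `n` PRNs at or after the starting point are selected, wrapping around the unit interval; the generalized system's actual implementation may differ, and the function name is hypothetical.

```python
def sequential_sample(prns, n, start=0.0):
    """prns: {unit: permanent random number in [0, 1)}.
    Select the n units whose PRNs come first at or after `start`, wrapping
    around at 1."""
    ordered = sorted(prns, key=lambda u: (prns[u] - start) % 1.0)
    return ordered[:n]

prns = {"a": 0.05, "b": 0.30, "c": 0.55, "d": 0.80, "e": 0.95}
survey_1 = sequential_sample(prns, 2, start=0.0)
survey_2 = sequential_sample(prns, 2, start=0.5)
next_occasion = sequential_sample(prns, 2, start=0.0)
```

Two surveys with well-separated starting points (`survey_1` and `survey_2`) select disjoint units, which spreads response burden (negative coordination). Reusing the same starting point on the next occasion reproduces the same sample (maximum overlap), and shifting the starting point slightly produces partial rotation.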

Eventually, it will be required to have some sample rotation implemented. The strategy that is currently envisioned to achieve that goal is to shift the start of the sampling window for the selection of the BAEO sample by the desired amount and to select the second phase surveys’ samples with maximum overlap with their previous occasion. This will yield the expected amount of rotation in the second phase as shown in Paulus (2014).

### Treatment of Dead Units

In the case of drawing a sample for cycle 2 or beyond of a repeated survey that controls overlap across cycles, it is possible that the death rate outside the sampled portion is understated as compared to the rate in the sampled portion (because some deaths are only known through survey feedback). In order to reduce the chance of this repeated sampling resulting in a biased estimate, it is necessary to manipulate the frame in order to stabilize the weights across survey occasions (equivalent to ensuring that the death rates are constant in both portions of the population across cycles). This could entail reintroducing known dead, previously-sampled units back onto the frame (though never doing so for units whose death was identified via an independent source) or selecting units in the unsampled portion of the population and treating them as deaths. Note that the term “death” here refers to any sort of change to a unit that would make it in-scope for one cycle but out-of-scope for the next (including NAICS changes, Business Status changes or changes to the population criteria). NAICS changes leaving a unit in-scope for a survey are not considered deaths.

Should the frame manipulation require the reintroduction of known dead units to the frame, these dead units may be sampled as any other unit and may therefore carry a weight, but they will not be sent a questionnaire under any circumstance.

## Bibliography

Baillargeon, S., Rivest, L.-P., and Ferland, M. (2007). Stratification en enquêtes entreprises : une revue et quelques avancées. Proceedings of the Survey Methods Section, Statistical Society of Canada.

Bankier, M. D. (1988). Power allocations: Determining Sample Sizes for Subnational Areas. *The American Statistician*, 42, pp. 174-177.

Demnati, A., and Turmelle, C. (2011). Proposed Sampling and Estimation Methodology. Internal Statistics Canada document.

ESD. (2009). "Integrated Business Statistics Program - Blueprint". Internal Statistics Canada document.

Gaudet, J. (2014). Contact Strategy for the RY2014 Business Activity, Expenditures and Output (BAEO) Survey. Internal Statistics Canada document.

Gunning, P., and Horgan, J. M. (2004). A New Algorithm for the Construction of Stratum Boundaries in Skewed Populations. *Survey Methodology*, 30, pp. 159-166.

Hellec, S. (2013). Diagnostic et contribution à la méthodologie d’échantillonnage du PISE. Internal Statistics Canada document.

Lavallée, P., and Hidiroglou, M. A. (1988). On the stratification of skewed populations. *Survey Methodology*, 14, pp. 33-43.

Li, Y. (2012). Detection and treatment of outliers in the creation of the IBSP sampling frame. Internal Statistics Canada document.

Paulus, P. (2014). Rotation et coordination des échantillons dans le cadre du Programme intégré de la statistique des entreprises. Internal Statistics Canada document.

Royce, D., and Maranda, F. (1998). Groupe de travail sur l’acquisition des données auprès des entreprises. Internal Statistics Canada document.

Särndal, C.-E., Swensson, B., and Wretman, J. (2003). *Model Assisted Survey Sampling*.